Human Language Record-a-thon - Introduction
I'm going to start out by trying to
provide some context for the event
and the idea behind the event
and give you some structure for
how we have planned out the day
and what we hope to accomplish by the end of it.
So we have several partners and sponsors.
We have the Internet Archive
which has provided this beautiful venue that we're here at today.
It's also a very fitting venue
because the recordings that you're going to be making today
we will be uploading to the Internet Archive
into the Rosetta Project collection
and I'll tell you a little about the Rosetta Project in a moment.
We're also joined by a company called Mightyverse
that's building an online phrasebook of the world's languages.
They're here; they're going to be doing some professional video recording
of participants in the downstairs area throughout the day.
I represent The Rosetta Project at The Long Now Foundation
a project that I direct.
It's a project to build an archive of the world's languages
and our resources, our materials, are all here at the Internet Archive.
My project is sponsored by The Long Now Foundation
and that's a non-profit here in the city of San Francisco
and The Long Now Foundation has provided support
for the whole development of this project
and a staff of dedicated interns who have been working
to develop this for several weeks now.
And also by Levenger - Levenger.com and the Levenger Foundation.
They've provided the participant notebooks
and many of the gifts and prizes that we'll be awarding at the end of the day today
and also our lunch.
The Internet Archive has kindly provided our breakfast for us
and will be providing our afternoon reception as well.
So I would like to thank everybody for their support
of this very new idea and for trying it out.
A little bit about me
so you know where I'm coming from
and why this idea may have come about.
I am the director of The Rosetta Project at The Long Now Foundation.
It is a project to develop an archive of the world's languages.
Many of you will have come to that site
when you have been doing some research about the Record-a-thon
and preparing your materials for today.
That is a place where we do a lot of education
about the project
but all of the resources in our collection actually reside here.
Well, not physically here
but virtually here in the Internet Archive.
My main areas of interest
come from my background, my doctoral work
that I did in linguistics.
I've studied linguistics for a couple of decades now.
And out of that area of research I focused on
the documentation of human language
and describing human language
and that involves very intense work
for long periods of time, where you are
doing essentially what you are doing today.
You are recording people, whether it's on audio, or whether it's on video
and it used to be that people would just transcribe using phonetic writing what people said.
And then would spend a great deal of time doing analysis
of those collected materials to build things like dictionaries and grammars
that describe the whole structure of a language
or produce sets of stories and collections of texts
that are good illustrations for how a language works.
My particular interest in the area of language documentation and description
is with regard to endangered languages.
Because many languages of the world today
are likely to go out of use - become extinct -
within the next 100 years.
So this is a critical area of linguistic research and focus
and one that I'm involved with.
I'm particularly involved in documenting the native languages of North America.
I've worked some with native groups here in California.
Most of my research in this area has come from the midwestern United States
particularly working with Algonquian tribes on their languages
and I've been doing that for a long time now.
And as I've worked with these groups
this was about the mid-1990s
when I was first learning about the web and building web pages
and some of the first web sites that I saw online
that had to do with human languages
were actually of the native languages that I was working with.
The particular group I was working with was very wired.
There were a couple of people who were very into technology
and started building web sites to support their language.
And they put up recordings, they put up wordlists
and for me that was really a revelation
because in the real world
these languages are not languages that have any power.
Most of them have very little visibility.
And they are often suppressed.
So in the real world it's difficult for these languages to thrive.
And I think that's a lot of the reason why we're seeing languages disappear today.
But if you look online, all of a sudden
here is a place where language could thrive.
So it occurred to me that
if there are basic enabling technologies
there should be no reason why languages of the world - all of them -
couldn't be enabled, should be able to be used online
in this new communication domain; in this new medium.
And I really believe, and I've felt this very strongly
that the online domain
is a crucial new domain for language use in the modern world.
And children, who are particularly sensitive and aware
of how their language is used and valued
and can be used in the world
are the ones who make decisions about
whether they are going to keep using the language
or whether they are going to abandon it.
And every single language is only one generation from becoming endangered
because once the children decide that they are not going to use it any more
that is the last generation that is the speaker group for that language.
And as that generation ages
the language becomes endangered, and then it becomes moribund
and then it becomes extinct.
And that's what we're seeing happen all over the world today.
So that's a little bit about me and where I'm coming from
and why I thought this would be a good event to try out.
Here's a bit about you.
So we have about 100 RSVPs for this event
these are people who are going to be showing up throughout the day.
Half of you are going to be here, physically, in the Internet Archive
and working on your recordings and mingling together and doing your uploads here.
The other half of you are going to try to do this from wherever you are in the world
using our broadcast through Cover it Live
and we have three people who are going to be staffing that
as volunteers throughout the day to communicate with people
and chat with you, and answer questions
and help with uploads.
So we're really running two events -
well, we're running the same event
but we're running it in two different channels, so to speak.
So there's the in-person one and there's the remote one.
I want both of your groups to think of yourselves as belonging to this larger set
even though you can't see each other.
And certainly the people who are sitting at home in their offices
or in their living rooms and making these recordings can't see all of you
but I want you to know that you're all there
and you're all working on this main effort together today.
And there may be others
probably not everybody who is showing up today has actually RSVPd
we may know by the end of the day more about who's here, and so forth.
We know from a few of you who told us where you're joining us from.
Several of us, of course, are coming from the United States
but also the United Kingdom
Turkey, Japan and Australia.
So the goal we have today
is to collect recordings
of at least 50 different languages.
And what those different languages are
I'm not going to make any assumptions about that
I'm going to take your word if you say this is a unique language
then I believe you, I think that's a good metric.
So we're going to do 50 different languages
recorded here today at this event
or online as part of this event
uploaded to the Internet Archive.
So just making the recordings doesn't count as part of the 50
at least, that's not the rules that we're setting out.
They must be uploaded to the Internet Archive
so that we can look through the list and see how many we've gotten.
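The counting rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Recording` type, its field names, and the case-insensitive matching of language names are hypothetical and were not part of the event's actual tooling. The sketch counts a language once, takes the speaker's own label at face value, and ignores recordings that have not been uploaded.

```python
from dataclasses import dataclass

GOAL = 50  # target number of distinct languages stated in the talk


@dataclass
class Recording:
    language: str   # self-reported language name, taken at the speaker's word
    uploaded: bool  # True only once the file is in the Internet Archive


def languages_counted(recordings):
    """Return the set of distinct language labels among uploaded recordings.

    Labels are compared case-insensitively (an assumption made for this sketch).
    """
    return {
        r.language.strip().casefold()
        for r in recordings
        if r.uploaded and r.language.strip()
    }


def goal_reached(recordings, goal=GOAL):
    """True when at least `goal` distinct languages have been uploaded."""
    return len(languages_counted(recordings)) >= goal


sample = [
    Recording("Breton", True),
    Recording("breton", True),     # same label again, counted once
    Recording("Kumeyaay", False),  # recorded but not uploaded: does not count
    Recording("Esperanto", True),
]
print(sorted(languages_counted(sample)))  # ['breton', 'esperanto']
print(goal_reached(sample))               # False
```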
Why 50?
Well, looking around at some information that I was able to glean from the U.S. Census
the U.S. Census actually has a longer form.
Most people here in the United States get the short form.
You fill it out quickly, you send it back in, you're done.
If you're one of those people, one of those 17-odd households
you get the big, long form.
How many of you have actually gotten that long form before?
Just me? Okay, so this is probably pretty representative.
The long form is actually about 20 pages long.
There's lots and lots of questions and they ask you all sorts of information about
your household demographics.
You know, how much money you make, and how long you have to commute to work
and if you're retired
and they also ask you questions, and have for several decades now
about language use.
So what is the language that is used in your home?
How many people in your home speak these languages?
And so there's just a few questions, but based on these questions
we're actually able to glean some information
about the different languages used in the United States.
And when you correlate these with other kinds of data you can see things like
how many languages are spoken in different regions of the country.
And so we have
a list of all the languages
spoken in the top metropolitan areas in the United States
and the number one in terms of languages is Los Angeles.
Now this is the whole larger Los Angeles area
with about 137 languages
then New York City, then Seattle
then Chicago, and then San Francisco
that is the San Francisco Bay Area,
and we have about 112 of them.
So we have over 100 languages, different languages
spoken here by substantial populations in the San Francisco Bay Area.
And if you're walking around San Francisco, or if you are riding a bus
or if you're on BART, chances are you're going to hear some of these languages.
But just so you know there's actually even more of them
more than the ones that you probably hear around here
like Cantonese, or Mandarin, or Japanese
or Korean, or Russian, or French, and so forth.
So we decided as a goal to pick half of that many
approximately half of that many, 50, as our goal today.
So if there's 100 languages spoken in the San Francisco Bay Area
can we get half that many here today?
Well, when you signed up these are the languages that you say that you speak.
And this is the first page, so we have lots of interesting ones here.
So ones that you may have expected like
German and Chinese and Farsi and French
these are some of the major world languages that millions of people around the world speak.
But we have other ones too. We have Baoulé and we have Esperanto
and I believe there's an Esperanto speaker with us here today, that's a constructed language.
Let's see what else we have here -- we have Breton
hmm what are some of the other unique ones
well here's another page, we have lots more languages
we have Kumeyaay, somebody signed up online to do Kumeyaay
this is a native language of California.
Luxembourgish, Mixteco
we have some dialects, so Rashty, a Persian dialect.
So sometimes - it's actually pretty hard in some cases to have a fine dividing line
between a dialect and a language, it's actually kind of a continuum.
And so a lot of people think of dialects as being a kind of a language
but sometimes dialects can diverge so very much that we call them different languages
and this is the case, a lot of people call different varieties of Chinese
like Cantonese and Mandarin, different dialects.
Those varieties are actually mutually unintelligible
so in the linguistic definition they are actually different languages.
So as I said we're just going to take your word for it
if you say this is a different language, then we'll count that as a different language.
So, 55. So if everybody who says that they're going to participate today participates
and provides the languages that they said they're going to record
we should be just fine, so let's see how we do.
So a bit of background to the idea behind this whole event.
Most people know of the main languages of the world
like Mandarin, Spanish, English
so one out of every six people on Planet Earth speaks Mandarin.
English and Spanish are about the same
and about one out of every 18 people on Planet Earth
speaks English or Spanish.
And so these are very robust languages, they're spoken worldwide
they're languages of major economic vitality and power.
And if you speak a smaller language chances are
you also know one of these larger languages
just so you can have access to that wider socio-economic world
and can work within it.
But there's actually 7,000 languages spoken on Planet Earth
and most of these don't have a billion speakers
or a million speakers even.
The average size language is actually about 2,500 people
and so if you look at certain parts of the world like Papua New Guinea
a lot of those languages only have a few hundred speakers
and for most of human history a language that had 2,500 speakers or so
was a vital human language.
Even down to the size of a language that might have had one or two hundred speakers.
And so most of our human history we've been operating in groups
that accommodated a language of about that size.
But, what we've seen is the rise, with globalization
with new communication technology
with new transportation technology
that have been developed over the past hundred, several hundred years
we're starting to see the rise of these "superlanguages"
and these are languages used for commerce, or for science
or for mass communication
and they're used globally.
And so what's happening are all these other languages,
what I call the "long tail of languages"
this is kind of a long-tail distribution.
So all of those languages in the red are actually in the "danger zone".
Because what's happening there is people are shifting to these major world languages
but they're not retaining those languages that are used in their home and community and have been for generations.
They're leaving them behind.
And they're leaving them behind not just to participate in this other world
but for other reasons too
so many times there are social pressures and stigmas
and sometimes outright oppression of groups that are speaking these smaller languages
or their environment, the place where they live
is being commercialized or changed in some way.
Or diasporas, people moving around the world
that all affects small language groups to the point that they are really embattled today
and they can't actually make it without some serious attention and some serious help.
The languages in the middle there in the yellow zone
that is about 4% of the languages.
So if you look at the green and the yellow
that's 5% of all of the number of languages
but 95% of all humans
natively speak one of those roughly 300 languages.
And 5% of the world's population speaks 95% of the languages.
So that gives you an idea of the magnitude, of the scale of the problem we're actually dealing with.
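To make the scale concrete, here is a small worked calculation using only the figures quoted in the talk: about 7,000 languages, of which about 5% are in the green and yellow zones. The function name is hypothetical and the percentages are the speaker's approximations, so the result is a ballpark. It comes out at 350, which the talk rounds to "roughly 300".

```python
TOTAL_LANGUAGES = 7000  # approximate figure given in the talk
MAJOR_PERCENT = 5       # green plus yellow zones, as a share of all languages


def language_split(total, major_percent):
    """Split a language count into (major, small) using integer arithmetic."""
    major = total * major_percent // 100
    return major, total - major


major, small = language_split(TOTAL_LANGUAGES, MAJOR_PERCENT)
print(major)  # 350 languages spoken natively by about 95% of people
print(small)  # 6650 languages shared among the remaining 5% or so
```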
Linguists have been aware that this has been happening for quite a while.
But it wasn't until about 15 years ago or so
that a seminal article was written in the journal Language
so this is the journal of the Linguistic Society of America
that really raised the call and the alarm about endangered languages
and the field of linguistics had actually been very heavily focused on theory for a long time
and one had the sense that one was kind of fiddling while Rome burned, right?
So here are all of these languages disappearing at the rate of maybe one every two weeks
and yet they're not being documented.
So there's been a real shift in the field of linguistics.
Of course theory is still very necessary.
But a lot of people who are working in theory and a lot of young scholars are now focusing on language documentation
and they're out all over the world
working with small communities on projects to document language
and in many cases to bring that language back into a more vital use within the community.
So why does it matter? Why should we care about all of these languages?
Well, if you stop and think that pretty much everybody in this room operates in a very literate society
so most of the major world languages, probably all of them have associated scripts
they've been written for a very long period of time and that's a kind of technology.
And it enables people to dissociate information
from an in-person face-to-face communication event.
So once you write something down in a book
somebody could access it and never know the person who wrote it
or never speak to that person but they can glean that knowledge.
And there's also something that's potentially missing in that
when you pick up a book, is that there's not necessarily a social relationship between you
and the person who's giving you this information.
So that's a dynamic that radically alters human interactions in societies
when you can gain access in that way
and that's a very powerful thing to be able to access knowledge in that way.
But at the same time when you adopt literacy you tend to
forget about some of the wonderful aspects of orality
so an oral culture, because a language doesn't need a writing system to be a healthy, thriving language.
So think of all of the wonderful stories that you were told as a child
and maybe they didn't come from a book but they were told to you from your parents
and they were told those stories by their parents
or the rhymes and the poems that you learned
and in oral cultures there's a very rich tradition of storytelling and narrative
and poetry, the poetic use of language as a way to capture information
and pass it on to somebody else just through language.
And the language that is used in that way is so wonderfully complex and structured
pretty much all of the things that you can do in a modern movie with cinematic techniques
to tell a narrative from one perspective or another you can do this just with language
and that's a wonderfully powerful tool that we have.
So languages are great works of art.
They're also wonderful libraries.
They are where we store and encode
information about technologies that we develop
how we structure our view of the world
how things relate to each other
and so languages - you can kind of think of them as an encyclopedia
or a grand library, a storehouse of knowledge.
I also think of them as "how to" guides for living on Planet Earth.
So if you think of all these small languages all over the world
those are very small niche environments that people have been living in
for very long periods of time
and they've developed a very close relationship to those environments
and those close relationships are encoded in language.
We have words for things in our environment
and our plants and our medicines
and technologies and resources that we have.
So when you lose a language you are potentially losing that "how to" guide.
And when you lose one you might think 'oh well, that's not such a big deal'
but if you stand to lose 90% of them
that's like taking an encyclopedia and giving it to the next generation
with all of the pages ripped out except for the section starting with 'h'.
You know, that's what we're giving our children.
So I think this impoverishes all of us.
And also languages provide a window into the structure of our minds.
Languages are an expression of human culture, and the human genius.
And linguists are just beginning to understand
how languages encode our thoughts.
And how they are cognitively stored.
This is a very young science.
And yet while we're trying to discover and learn about it
it's disappearing right out from under our feet.
So really, this is a crisis
and it's something that needs attention
right away - we need to be working on this right away.
So this is the background to the ideas that formulated my thinking for the Record-a-thon.
So we lose a language once every two weeks
we have linguists out all over the world
working with communities to document human language
but with all of the resources that we can bring to bear
all of the funding that we have, all of the people that we can muster
to help with this enterprise that's really not enough.
We don't have enough people to do this, we don't have enough money to do this
we don't have enough time to do this.
So it's becoming very clear to me that
we cannot just keep the process of language documentation -
certainly some part of language documentation must be done by a linguist
so linguists write dictionaries, they write grammars, they analyze human language.
But that's not all of language documentation. A lot of language documentation
is sitting down with somebody
with some kind of recording device and talking together.
And it occurs to me that pretty much all of us are now walking around with these recording devices.
We have them in our purse, and we have them in our backpack
we talk on them for phones, and then we take pictures with them
and we carry them with us all the time.
Cameras are everywhere. Cameras are on our laptops.
So we have all of these recording devices, everybody speaks a language
at least one language, right?
So why not take that part of the language documentation process,
the recording part, the part where we can just sit down and talk with each other
have conversations, and tell stories
and let everybody do that, let the world document these languages.
Everybody can document their own languages.
So everyone can provide those recordings.
And when we do this, when we record each other
and they don't have to be long recordings
in fact today what we're asking people to do is record something of about 5-10 minutes in length
and don't edit it, just upload it into the Internet Archive
and what I'm testing out today, what I really want to know
I believe that the minimum amount of language documentation
useful language documentation that somebody who isn't a linguist can reasonably produce
is video, because that is a very rich source of linguistic information
you see somebody, you see their hand motions, you see their body stance, you see their facial expressions
you see the context and the environment, and the people they're talking to
so you take video which is the richest form of language documentation
you take a small chunk of it, and if you give that to me and say
'this is the language that I say that this is in,' or 'I think it is in'
and maybe you're not even sure exactly how you spell that language
but you can provide enough information to give me an idea of what language you think this is in
maybe tell me where you're from
and then, if you get a lot of people doing this all over the world
what you're starting to build are corpora
and 'corpora' is a fancy word
it is the plural of the Latin word 'corpus', which means 'body'
so what you're doing is you're assembling a body or a collection of material.
If you do this for every language then you're starting to collectively build a resource.
And what can you do with a collection of language videos?
If you start out small, so let's say you just have a few hours of video
you can start to use that for language learning and teaching.
If you're telling stories, I think there are no better resources that you can use to teach somebody a language.
People love stories.
So you can start immediately building teaching and learning materials.
If you get a medium sized corpus, and your corpus continues to grow and you get tens of hours
then linguists can move in and we can start to do things
like build dictionaries and grammars and reference and resource tools for that language
and that's how we start to better understand where that language fits
within the whole scope of human languages in its structure and its meaning.
And then, if you get a very large corpus - say you have hundreds of hours
then the language technologists, speech technologists can come in
and if that corpus is open, and available
then they can start to do natural language processing on it
they can start doing things like speech recognition
they can build tools for search, for browse
all of the things that are needed to enable a language to be used in the digital domain
can start to be built, once you have a corpus.
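The three stages just described can be summarized as a simple lookup from corpus size to what becomes possible. The sketch below is illustrative only: the talk speaks of "a few hours", "tens of hours", and "hundreds of hours", so the cutoffs of 10 and 100 hours, the function name, and the `is_open` flag are assumptions chosen for the example.

```python
def corpus_uses(hours, is_open=True):
    """Return the uses a video corpus of the given size could support.

    Thresholds (10 and 100 hours) are illustrative stand-ins for the
    "tens of hours" and "hundreds of hours" mentioned in the talk.
    """
    uses = []
    if hours > 0:
        uses.append("language teaching and learning materials")
    if hours >= 10:
        uses.append("dictionaries, grammars, and reference tools")
    # The talk ties language technology to a corpus that is open and available.
    if hours >= 100 and is_open:
        uses.append("speech recognition, search, and browse tools")
    return uses


print(corpus_uses(3))
print(corpus_uses(40))
print(corpus_uses(250))
print(corpus_uses(250, is_open=False))
```

A closed corpus of any size stops at the second stage in this sketch, which mirrors the condition in the talk that the corpus be open and available.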
So you may think that this is kind of a strange and foreign thing
but every time you type a search into Google
every time you type an email in Google or Yahoo
every time you upload a video to YouTube and make it publicly accessible
you are building a corpus.
So most of us in this room do this; probably at least one of our languages is English
so when you type your search terms or your search expressions into Google
you're helping build the Google corpus of English.
And it keeps getting better, right? It gets better and better because you are adding more data
and they're refining their natural language processing techniques, or their statistical techniques
so when you get a very large corpus like that you get these great tools
and it becomes very easy to use your language online
the online domain works really well for major world languages.
It doesn't work so well for smaller languages, because they don't have a corpus.
They haven't built up that corpus yet.
In fact I've been talking to my friends who are computational linguists
and they say that maybe only 30 of the world's languages
have a sizeable enough corpus to make them actually really usable online.
And that's really shocking, that's really surprising because we have basic technologies that can enable any language
we have Unicode, we have an international standard
for identifying all of the languages of the world.
So at this point, we have the technological capacity to be able to represent any language online
and we're moving into other kinds of communication, like video, online
where it's not so text-dominant
so we should be able to do this, right?
We should be able to do this.
Open is good.
So what we're asking people to do today
is to dedicate the recordings that you're making to the public domain.
So that when we upload them into the Internet Archive
they are open and publicly accessible.
Why? Because then your video is not just a stand-alone video, it becomes part of an open corpus.
And that enables all of those cool things we saw, just a moment ago.
Others can make use of your video so maybe you create something
maybe somebody who also speaks your language on the other side of the world can find that
and use that, and maybe they're teaching a class in their language
and maybe they found something in your video that they can use, or a story
or a word that they've never heard before.
Video data can be enriched by others.
So if you create a video and you say 'I think this is in Swahili.'
Then somebody else can come in and say, 'oh yeah, I know that'
'That's definitely in Swahili, and I know exactly where that language is from.'
And then somebody else can come in and say 'well, I can transcribe that.'
And they'll transcribe the whole video.
And then somebody else can come in and say
'Well, hey, I can translate that' and they'll translate that into another language.
And a linguist might come in and say 'look at all those words you've just given me'
'I'm going to make a lexicon, or a dictionary'
or if you have enough of them maybe 'I'm going to make a grammar.'
So video data can be enriched by others
so that's what openness in language data enables.
And that's why all of the materials in the Rosetta Project collection are open
and why I'm asking for the Record-a-thon videos to be placed into the public domain.
So today what we're doing is we're testing out an idea.
I think this is the first time this has been tested out quite like this.
So we're trying to create a model
and see if we can produce a set of language videos
that are short - that is, about 5-10 minutes in length
They're unedited, so we don't spend time going back and trying to clean them up.
There's minimal annotation or description
that is, I'm not asking you to provide lots of information about these videos.
You can provide more if you like, but it's not necessary.
They're produced on common video devices that you carry around with you all of the time.
Like my little Flip camera here. Maybe your cell phone.
And you don't have to be an expert to do it.
And this way we can help document the languages of the world
and help preserve them, help make them more vital in the modern world.
So we have two main kinds of recording activities.
A lot of you have come prepared with your recording devices
and these are free-form recordings that you can do in lots of places
nooks and crannies, this is a big wonderful building
and there's lots of places where you can go and get separated from other people
and get a pretty quiet environment to do your recording.
And then we also have structured recordings which are taking place downstairs at the lower level.
And those are provided by Mightyverse, I think I see Paul there [waves]
Paul is from Mightyverse, and his Mightyverse team, can you guys raise your hands there?
So that's the Mightyverse team behind you staffing the recording booths.
And some of you have signed up for those today, and maybe others of you are jumping in and that's fine too.
So let's try out this idea, let's have fun, and then let's share it with the world.