Human Language Record-a-thon - Introduction

I'm going to start out by trying to provide some context for the event and the idea behind the event, and give you some structure for how we have planned out the day and what we hope to accomplish by the end of it. So we have several partners and sponsors. We have the Internet Archive, which has provided this beautiful venue that we're here at today. It's also a very fitting venue, because the recordings that you're going to be making today we will be uploading to the Internet Archive, into the Rosetta Project collection, and I'll tell you a little about the Rosetta Project in a moment. We're also joined by a company called Mightyverse that's building an online phrasebook of the world's languages. They're here; they're going to be doing some professional video recording of participants in the downstairs area throughout the day. I represent The Rosetta Project at The Long Now Foundation, a project that I direct. It's a project to build an archive of the world's languages, and our resources, our materials, are all here at the Internet Archive. My project is sponsored by The Long Now Foundation, and that's a non-profit here in the city of San Francisco, and The Long Now Foundation has provided support for the whole development of this project and a staff of dedicated interns who have been working to develop this for several weeks now. And we're also sponsored by Levenger and the Levenger Foundation. They've provided the participant notebooks and many of the gifts and prizes that we'll be awarding at the end of the day today, and also our lunch. The Internet Archive has kindly provided our breakfast for us and will be providing our afternoon reception as well. So I would like to thank everybody for their support of this very new idea and for trying it out. A little bit about me, so you know where I'm coming from and why this idea may have come about. I am the director of The Rosetta Project at The Long Now Foundation. It is a project to develop an archive of the world's languages.
Many of you will have come to that site when you have been doing some research about the Record-a-thon and preparing your materials for today. That is a place where we do a lot of education about the project, but all of the resources in our collection actually reside here. Well, not physically here, but virtually here in the Internet Archive. My main areas of interest come from my background, my doctoral work that I did in linguistics. I've studied linguistics for a couple of decades now. And out of that area of research I focused on the documentation of human language and describing human language, and that involves very intense work, for long periods of time, where you are doing essentially what you are doing today. You are recording people, whether it's on audio or whether it's on video, and it used to be that people would just transcribe, using phonetic writing, what people said, and then would spend a great deal of time doing analysis of those collected materials to build things like dictionaries and grammars that describe the whole structure of a language, or produce sets of stories and collections of texts that are good illustrations for how a language works. My particular interest in the area of language documentation and description is with regard to endangered languages, because many languages of the world today are likely to go out of use - become extinct - within the next 100 years. So this is a critical area of linguistic research and focus, and one that I'm involved with. I'm particularly involved in documenting the native languages of North America. I've worked some with native groups here in California. Most of my research in this area has come from the midwestern United States, particularly working with Algonquian tribes on their languages, and I've been doing that for a long time now.
And as I've worked with these groups - this was about the mid-1990s, when I was first learning about the web and building web pages - some of the first web sites that I saw online that had to do with human languages were actually of the native languages that I was working with. The particular group I was working with was very wired. There were a couple of people who were very into technology and started building web sites to support their language. And they put up recordings, they put up wordlists, and for me that was really a revelation, because in the real world these languages are not languages that have any power. Most of them have very little visibility, and they are often suppressed. So in the real world it's difficult for these languages to thrive. And I think that's a lot of the reason why we're seeing languages disappear today. But if you look online, all of a sudden here is a place where language could thrive. So it occurred to me that if there are basic enabling technologies, there should be no reason why languages of the world - all of them - couldn't be enabled, couldn't be used online in this new communication domain, in this new medium. And I really believe, and I've felt this very strongly, that the online domain is a crucial new domain for language use in the modern world. And children, who are particularly sensitive to and aware of how their language is used and valued and can be used in the world, are the ones who make decisions about whether they are going to keep using the language or whether they are going to abandon it. And every single language is only one generation from becoming endangered, because once the children decide that they are not going to use it any more, that is the last generation that is the speaker group for that language. And as that generation ages the language becomes endangered, and then it becomes moribund, and then it becomes extinct. And that's what we're seeing happen all over the world today.
So that's a little bit about me and where I'm coming from and why I thought this would be a good event to try out. Here's a bit about you. So we have about 100 RSVPs for this event; these are people who are going to be showing up throughout the day. Half of you are going to be here, physically, in the Internet Archive, working on your recordings and mingling together and doing your uploads here. The other half of you are going to try to do this from wherever you are in the world, using our broadcast through Cover it Live, and we have three people who are going to be staffing that as volunteers throughout the day to communicate with people and chat with you, and answer questions and help with uploads. So we're really running two events - well, we're running the same event, but we're running it in two different channels, so to speak. So there's the in-person one and there's the remote one. I want both of your groups to think of yourselves as belonging to this larger set even though you can't see each other. And certainly the people who are sitting at home, in their offices or in their living rooms, making these recordings can't see all of you, but I want you to know that you're all there and you're all working on this main effort together today. And there may be others; probably not everybody who is showing up today has actually RSVPd. We may know by the end of the day more about who's here, and so forth. We know from a few of you who told us where you're joining us from. Several of us, of course, are coming from the United States, but also the United Kingdom, Turkey, Japan and Australia. So the goal we have today is to collect recordings of at least 50 different languages. And what those different languages are, I'm not going to make any assumptions about that. I'm going to take your word: if you say this is a unique language, then I believe you. I think that's a good metric.
So we're going to do 50 different languages, recorded here today at this event or online as part of this event, and uploaded to the Internet Archive. So just making the recordings doesn't count as part of the 50; at least, those aren't the rules that we're setting out. They must be uploaded to the Internet Archive so that we can look through the list and see how many we've gotten. Why 50? Well, I was looking around at some information that I was able to glean from the U.S. Census. The U.S. Census actually has a longer form. Most people here in the United States get the short form. You fill it out quickly, you send it back in, you're done. If you're one of those people, one of those 17-odd households, you get the big, long form. How many of you have actually gotten that long form before? Just me? Okay, so this is probably pretty representative. The long form is actually about 20 pages long. There are lots and lots of questions, and they ask you all sorts of information about your household demographics. You know, how much money you make, and how long you have to commute to work, and if you're retired, and they also ask you questions, and have for several decades now, about language use. So what is the language that is used in your home? How many people in your home speak these languages? And so there are just a few questions, but based on these questions we're actually able to glean some information about the different languages used in the United States. And when you correlate these with other kinds of data you can see things like how many languages are spoken in different regions of the country. And so we have a list of all the languages spoken in the top metropolitan areas in the United States, and the number one in terms of languages is Los Angeles. Now this is the whole larger Los Angeles area, with about 137 languages, then New York City, then Seattle, then Chicago, and then San Francisco, that is, the San Francisco Bay Area, and we have about 112 of them.
So we have over 100 languages, different languages, spoken here by substantial populations in the San Francisco Bay Area. And if you're walking around San Francisco, or if you are riding a bus, or if you're on BART, chances are you're going to hear some of these languages. But just so you know, there are actually even more of them, more than the ones that you probably hear around here, like Cantonese, or Mandarin, or Japanese, or Korean, or Russian, or French, and so forth. So we decided as a goal to pick half of that many, approximately half of that many, 50, as our goal today. So if there are 100 languages spoken in the San Francisco Bay Area, can we get half that many here today? Well, when you signed up, these are the languages that you said that you speak. And this is the first page, so we have lots of interesting ones here. So ones that you may have expected, like German and Chinese and Farsi and French; these are some of the major world languages that millions of people around the world speak. But we have other ones too. We have Baoulé and we have Esperanto, and I believe there's an Esperanto speaker with us here today; that's a constructed language. Let's see what else we have here - we have Breton, hmm, what are some of the other unique ones? Well, here's another page, we have lots more languages. We have Kumeyaay; somebody signed up online to do Kumeyaay, this is a native language of California. Luxembourgish, Mixteco; we have some dialects, so Rashty, a Persian dialect. So sometimes - it's actually pretty hard in some cases to have a fine dividing line between a dialect and a language; it's actually kind of a continuum. And so a lot of people think of dialects as being a kind of a language, but sometimes dialects can diverge so very much that we call them different languages, and this is the case: a lot of people call different varieties of Chinese, like Cantonese and Mandarin, different dialects.
Those are actually mutually unintelligible varieties, so in the linguistic definition they are actually different languages. So as I said, we're just going to take your word for it: if you say this is a different language, then we'll count that as a different language. So, 55. So if everybody who says that they're going to participate today participates and provides the languages that they said they're going to record, we should be just fine, so let's see how we do. So a bit of background to the idea behind this whole event. Most people know of the main languages of the world, like Mandarin, Spanish, English. So one out of every six people on Planet Earth speaks Mandarin. English and Spanish are about the same, and about one out of every 18 people on Planet Earth speaks English or Spanish. And so these are very robust languages; they're spoken worldwide, they're languages of major economic vitality and power. And if you speak a smaller language, chances are you also know one of these larger languages, just so you can have access to that wider socio-economic world and can work within it. But there are actually 7,000 languages spoken on Planet Earth, and most of these don't have a billion speakers, or a million speakers even. The average size language is actually about 2,500 people, and so if you look at certain parts of the world, like Papua New Guinea, a lot of those languages only have a few hundred speakers, and for most of human history a language that had 2,500 speakers or so was a vital human language, even down to the size of a language that might have had one or two hundred speakers. And so for most of our human history we've been operating in groups that accommodated a language of about that size.
But with globalization, with new communication technology, with new transportation technology developed over the past hundred, several hundred years, we're starting to see the rise of these "superlanguages", and these are languages used for commerce, or for science, or for mass communication, and they're used globally. And so what's happening with all these other languages, what I call the "long tail of languages" - this is kind of a long-tail distribution. So all of those languages in the red are actually in the "danger zone". Because what's happening there is people are shifting to these major world languages, but they're not retaining those languages that are used in their home and community and have been for generations. They're leaving them behind. And they're leaving them behind not just to participate in this other world but for other reasons too. So many times there are social pressures and stigmas, and sometimes outright oppression of groups that are speaking these smaller languages, or their environment, the place where they live, is being commercialized or changed in some way. Or diasporas, people moving around the world. That all affects small language groups, to the point that they are really embattled today, and they can't actually make it without some serious attention and some serious help. The languages in the middle there, in the yellow zone, that is about 4% of the languages. So if you look at the green and the yellow, that's 5% of all of the languages, but 95% of all humans natively speak one of those roughly 300 languages. And 5% of the world's population speaks 95% of the languages. So that gives you an idea of the magnitude, of the scale of the problem we're actually dealing with. And so about 15 years ago - linguists have been aware that this has been happening for quite a while.
But it wasn't until about 15 years ago or so that a seminal article was written in the journal Language - this is the journal of the Linguistic Society of America - that really raised the call and the alarm about endangered languages. And the field of linguistics had actually been very heavily focused on theory for a long time, and one had the sense that one was kind of fiddling while Rome burned, right? So here are all of these languages disappearing at the rate of maybe one every two weeks, and yet they're not being documented. So there's been a real shift in the field of linguistics. Of course theory is still very necessary. But a lot of people who are working in theory, and a lot of young scholars, are now focusing on language documentation, and they're out all over the world working with small communities on projects to document language and, in many cases, to bring that language back into a more vital use within the community. So why does it matter? Why should we care about all of these languages? Well, if you stop and think, pretty much everybody in this room operates in a very literate society. So most of the major world languages, probably all of them, have associated scripts; they've been written for a very long period of time, and that's a kind of technology. And it enables people to dissociate information from an in-person, face-to-face communication event. So once you write something down in a book, somebody could access it and never know the person who wrote it, or never speak to that person, but they can glean that knowledge. And there's also something that's potentially missing when you pick up a book, which is that there's not necessarily a social relationship between you and the person who's giving you this information. So that's a dynamic that radically alters human interactions in societies, when you can gain access in that way, and that's a very powerful thing, to be able to access knowledge in that way.
But at the same time, when you adopt literacy you tend to forget about some of the wonderful aspects of orality, of an oral culture, because a language doesn't need a writing system to be a healthy, thriving language. So think of all of the wonderful stories that you were told as a child; maybe they didn't come from a book, but they were told to you by your parents, and they were told those stories by their parents. Or the rhymes and the poems that you learned. And in oral cultures there's a very rich tradition of storytelling and narrative and poetry, the poetic use of language as a way to capture information and pass it on to somebody else just through language. And the language that is used in that way is so wonderfully complex and structured. Pretty much all of the things that you can do in a modern movie with cinematic techniques, to tell a narrative from one perspective or another, you can do just with language, and that's a wonderfully powerful tool that we have. So languages are great works of art. They're also wonderful libraries. They are where we store and encode information about technologies that we develop, how we structure our view of the world, how things relate to each other. And so languages - you can kind of think of them as an encyclopedia or a grand library, a storehouse of knowledge. I also think of them as "how to" guides for living on Planet Earth. So if you think of all these small languages all over the world, those are very small niche environments that people have been living in for very long periods of time, and they've developed a very close relationship to those environments, and those close relationships are encoded in language. We have words for things in our environment, and our plants and our medicines, and technologies and resources that we have. So when you lose a language you are potentially losing that "how to" guide.
And when you lose one you might think 'oh well, that's not such a big deal', but if you stand to lose 90% of them, that's like taking an encyclopedia and giving it to the next generation with all of the pages ripped out except for the section starting with 'h'. You know, that's what we're giving our children. So I think this impoverishes all of us. And also languages provide a window into the structure of our minds. Languages are an expression of human culture, and the human genius. And linguists are just beginning to understand how languages encode our thoughts, and how they are cognitively stored. This is a very young science. And yet while we're trying to discover and learn about it, it's disappearing right out from under our feet. So really, this is a crisis and it's something that needs attention right away - we need to be working on this right away. So this is the background to the ideas that formulated my thinking for the Record-a-thon. So we lose a language once every two weeks. We have linguists out all over the world working with communities to document human language, but with all of the resources that we can bring to bear, all of the funding that we have, all of the people that we can muster to help with this enterprise, that's really not enough. We don't have enough people to do this, we don't have enough money to do this, we don't have enough time to do this. So it's becoming very clear to me that we cannot just keep the process of language documentation - certainly some part of language documentation must be done by a linguist; linguists write dictionaries, they write grammars, they analyze human language. But that's not all of language documentation. A lot of language documentation is sitting down with somebody with some kind of recording device and talking together. And it occurs to me that pretty much all of us are now walking around with these recording devices.
We have them in our purse, and we have them in our backpack; we talk on them as phones, and then we take pictures with them, and we carry them with us all the time. Cameras are everywhere. Cameras are on our laptops. So we have all of these recording devices, and everybody speaks a language, at least one language, right? So why not take that part of the language documentation process, the recording part, the part where we can just sit down and talk with each other, have conversations, and tell stories, and let everybody do that, let the world document these languages? Everybody can document their own languages. So everyone can provide those recordings. And when we do this, when we record each other - and they don't have to be long recordings; in fact, today what we're asking people to do is record something of about 5-10 minutes in length, and don't edit it, just upload it into the Internet Archive. And what I'm testing out today, what I really want to know: I believe that the minimum amount of language documentation, useful language documentation, that somebody who isn't a linguist can reasonably produce is video, because that is a very rich source of linguistic information. You see somebody, you see their hand motions, you see their body stance, you see their facial expressions, you see the context and the environment, and the people they're talking to. So you take video, which is the richest form of language documentation, you take a small chunk of it, and if you give that to me and say 'this is the language that I say that this is in,' or 'I think it is in', and maybe you're not even sure exactly how you spell that language, but you can provide enough information to give me an idea of what language you think this is in, maybe tell me where you're from, and then, if you get a lot of people doing this all over the world, what you're starting to build are corpora. And corpora is a fancy word; it comes from the Latin word 'corpus', which means 'body', so what you're doing is you're assembling a body or a collection of material.
If you do this for every language then you're starting to collectively build a resource. And what can you do with a collection of language videos? If you start out small, so let's say you just have a few hours of video, you can start to use that for language learning and teaching. If you're telling stories, I think there are no better resources that you can use to teach somebody a language. People love stories. So you can start immediately building teaching and learning materials. If you get a medium-sized corpus, and your corpus continues to grow and you get tens of hours, then linguists can move in and we can start to do things like build dictionaries and grammars and reference and resource tools for that language, and that's how we start to better understand where that language fits within the whole scope of human languages, in its structure and its meaning. And then, if you get a very large corpus - say you have hundreds of hours - then the language technologists, speech technologists can come in, and if that corpus is open and available then they can start to do natural language processing on it; they can start doing things like speech recognition, they can build tools for search, for browse. All of the things that are needed to enable a language to be used in the digital domain can start to be built once you have a corpus. So you may think that this is kind of a strange and foreign thing, but every time you type a search into Google, every time you type an email in Google or Yahoo, every time you upload a video to YouTube and make it publicly accessible, you are building a corpus. So most of us in this room do this; probably at least one of our languages is English, so when you type your search terms or your search expressions into Google you're helping build the Google corpus of English. And it keeps getting better, right?
It gets better and better because you are adding more data and they're refining their natural language processing techniques, or their statistical techniques. So when you get a very large corpus like that you get these great tools, and it becomes very easy to use your language online. The online domain works really well for major world languages. It doesn't work so well for smaller languages, because they don't have a corpus. They haven't built up that corpus yet. In fact, I've been talking to my friends who are computational linguists, and they say that maybe only 30 of the world's languages have a sizeable enough corpus to make them actually really usable online. And that's really shocking, that's really surprising, because we have basic technologies that can enable any language: we have Unicode, we have an international standard for identifying all of the languages of the world. So at this point we have the technological capacity to be able to represent any language online, and we're moving into other kinds of communication online, like video, where it's not so text-dominant, so we should be able to do this, right? We should be able to do this. Open is good. So what we're asking people to do today is to dedicate the recordings that you're making to the public domain, so that when we upload them into the Internet Archive they are open and publicly accessible. Why? Because then your video is not just a stand-alone video, it becomes part of an open corpus. And that enables all of those cool things we saw just a moment ago. Others can make use of your video. So maybe you create something; maybe somebody who also speaks your language on the other side of the world can find that and use that, and maybe they're teaching a class in their language, and maybe they found something in your video that they can use, or a story or a word that they've never heard before. Video data can be enriched by others. So if you create a video and you say 'I think this is in Swahili.'
Then somebody else can come in and say, 'oh yeah, I know that. That's definitely in Swahili, and I know exactly where that language is from.' And then somebody else can come in and say 'well, I can transcribe that.' And they'll transcribe the whole video. And then somebody else can come in and say 'Well, hey, I can translate that' and they'll translate that into another language. And a linguist might come in and say 'look at all those words you've just given me. I'm going to make a lexicon, or a dictionary', or if you have enough of them maybe 'I'm going to make a grammar.' So video data can be enriched by others; that's what openness in language data enables. And that's why all of the materials in the Rosetta Project collection are open, and why I'm asking for the Record-a-thon videos to be placed into the public domain. So today what we're doing is we're testing out an idea. I think this is the first time this has been tested out quite like this. So we're trying to create a model and see if we can produce a set of language videos that are short - that is, about 5-10 minutes in length. They're unedited, so we don't spend time going back and trying to clean them up. There's minimal annotation or description; that is, I'm not asking you to provide lots of information about these videos. You can provide more if you like, but it's not necessary. They're produced on common video devices that you carry around with you all of the time. Like my little Flip camera here. Maybe your cell phone. And you don't have to be an expert to do it. And this way we can help document the languages of the world and help preserve them, help make them more vital in the modern world. So we have two main kinds of recording activities.
A lot of you have come prepared with your recording devices, and these are free-form recordings that you can do in lots of places, nooks and crannies. This is a big wonderful building and there are lots of places where you can go and get separated from other people and get a pretty quiet environment to do your recording. And then we also have structured recordings, which are taking place downstairs at the lower level. And those are provided by Mightyverse. I think I see Paul there [waves]. Paul is from Mightyverse, and his Mightyverse team, can you guys raise your hands there? So that's the Mightyverse team behind you, staffing the recording booths. And some of you have signed up for those today, and maybe others of you are jumping in, and that's fine too. So let's try out this idea, let's have fun, and then let's share it with the world.

Introduction to the idea and goals of the Human Language Record-a-thon, a project to engage the world in helping document the nearly 7,000 languages spoken on Planet Earth. The Human Language Record-a-thon was held as an all-day workshop at the Internet Archive on July 30, 02011, and was simulcast to remote participants around the world. Presented by Laura Welcher, Director of the Rosetta Project at The Long Now Foundation.

