The Rosetta Project: A Distributed Archive Model

Hi, I'm Laura Welcher, and I direct The Rosetta Project at The Long Now Foundation. We're here at the Linguistic Society of America annual meeting, in a poster session that's about metadata, and language archives, and how we organize all of the materials in our collection. I prepared a poster for a poster session here this morning. The poster talks about a new feature of the Rosetta Project. And if you've been following The Rosetta Project, you know that this is something we actually implemented about a year ago, and I'm here to tell my linguistic colleagues and the linguistics community in the United States about it.

The title of the poster is "A Distributed Archive Model," and this is a way that The Rosetta Project is really pretty innovative and different from other language archives around the world. As many of you know, what we did this past year is we moved all of the content of The Rosetta Project collection out of a content management system that we built... purpose built for the collection and served on our own site, and we moved it out into these distributed sites, third party cloud services, to take advantage of both their free hosting and free archival management services, as well as a certain economy of human scale in contribution, and also to make them more accessible and discoverable.

So the main aspect of that is moving all of our content into the Internet Archive. The Internet Archive now hosts all of the resources in the Rosetta Project collection: our text resources, and we're starting to migrate also the audio we have digitized, and we are starting to develop a video collection as well. So that's all in the Internet Archive, and that's really great because it offers us free hosting and free serving of the collection. And we have on the back end of the Internet Archive, something most people don't see, we have this beautiful archival management system using really great practices in digital archiving. It's all standard metadata, as well as we can customize the metadata for our own collection. So it offers us this really robust way to manage the Rosetta collection, and it makes it discoverable. Over the past year we've had about one hundred thousand accesses to the resources in our collection just through search on the Internet Archive site.

So the Internet Archive is not specially built for any particular collection, so the means you have of browsing and navigating the system in the Internet Archive is just a long page of basically keywords. So it's not actually super easy to find things in this way, and probably very few people actually do find our resources this way. You'd have to search either by ISO code for an individual language, or know what the categories are for our collection, like "Detailed Description" and what that means, or the name of the language. It's not super user friendly in that respect, but it's not intended to be. The Internet Archive intends for collections to develop specialized interfaces outside of the Internet Archive and then to call the resources into those interfaces.

So in order to make our resources more discoverable, we took advantage of another free service called Freebase. Freebase is an open contribution database that's actually comprised of many, many users. The idea behind Freebase is building the Semantic Web from the ground up by user contribution. So we... The Rosetta Project has a set of information about human language in Freebase. It's basically names of languages, the international standard identifier for languages, and relationships... taxonomic relationships between languages by their historical descent and differentiation. So that's all in Freebase, and we take advantage of the fact that there's tons and tons of other information in Freebase. And one of the things that Freebase did is they crawled all of Wikipedia, so as part of this project we actually rectified all of the pages on human language in Wikipedia to international standard identifiers for language, so that's actually linked to our dataset. And so using that dataset, what we were able to do is very quickly populate an entire wiki with one page for every human language.

And so what you see here... this is our prototype. It's actually a fully built out wiki. It's not dynamic - it's static, so it represents a one-time "push" of data. But you can see here on this page that this section here, this classification taxonomy of languages, is coming from Freebase. This is the Rosetta dataset in Freebase. The overview here is other information in Freebase, not specifically in the Rosetta collection in Freebase, but this is other information that we can take advantage of by being in Freebase. This section here represents a Wikipedia page on this language, based on the linking we did of Wikipedia pages and data in Freebase. This map is provided courtesy of the LL-Map project, which I also documented as part of this session. This is courtesy of the LINGUIST List, and it's a project that's not even within the Rosetta universe, so we were able to link to an external data provider. And then down here on the bottom of the page we are able to pull in documents from the Internet Archive.

So what this interface represents is a portal that aggregates information on the world's languages and has a page of information for each of these languages, but it isn't the collection itself - the resources are being pulled from distributed services that are out there on the net that we manage independently. So this is what we're proposing for a pretty good model for linguistic archiving, and a fairly robust one over the long term. It's fairly recession resistant because these services are all free, and a lot of the language archives that are out there are actually built on private funding, and if the funding goes away the question is, what happens to the management of the archive? There's a lot of questions around that. But with a system like this the services... the service continues and we don't have to pay for it, so that is actually one of the benefits that adds to the longevity of the collection.

And then we can take advantage of the fact that these sites like The Internet Archive is very discoverable. People go to that to find resources. And in building out a wiki, for example, if you were able to integrate this with Wikipedia pages, that would make these pages even more discoverable. And that's one of the goals that we have, is to make our materials more available to people, because really the purpose is - what we're trying to do is build an open collection of information about the world's languages.
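
Editor's sketch (not part of the talk): the portal model described above, one page per language assembled from independently managed services and pushed once as static pages, might be outlined roughly as follows. Every function name, section name, and data value here is a hypothetical placeholder; the stand-in sources return canned data and do not call Freebase, Wikipedia, LL-Map, or the Internet Archive.

```python
from typing import Callable, Dict, Iterable

# Hypothetical stand-ins for the distributed services named in the talk.
# Each takes a language identifier and returns canned placeholder data;
# none of these are real APIs. "qaa" is a code from the ISO 639-3 range
# reserved for local use, chosen so no real language data is implied.

def taxonomy_source(iso_code: str) -> list:
    records = {"qaa": ["Example Family", "Example Branch"]}
    return records.get(iso_code, [])

def overview_source(iso_code: str) -> str:
    records = {"qaa": "Placeholder overview text for an example language."}
    return records.get(iso_code, "")

def document_source(iso_code: str) -> list:
    records = {"qaa": ["example-item-1", "example-item-2"]}
    return records.get(iso_code, [])

# The portal holds no content of its own: it only maps page sections
# to the external source that supplies each one.
SOURCES: Dict[str, Callable[[str], object]] = {
    "classification": taxonomy_source,
    "overview": overview_source,
    "documents": document_source,
}

def build_language_page(iso_code: str, sources=SOURCES) -> dict:
    """Assemble one language page by asking each source for its section."""
    page = {"iso_code": iso_code}
    for section, fetch in sources.items():
        page[section] = fetch(iso_code)
    return page

def build_static_site(iso_codes: Iterable[str], sources=SOURCES) -> dict:
    """One-time 'push': build every page once, keyed by language identifier."""
    return {code: build_language_page(code, sources) for code in iso_codes}

if __name__ == "__main__":
    site = build_static_site(["qaa"])
    print(site["qaa"])
```

The shared language identifier is what lets sections from unrelated providers line up on a single page; adding another provider in this sketch would only mean adding one more entry to `SOURCES`.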

Video Details

Duration: 7 minutes and 44 seconds
Country: United States
Language: English
Producer: Laura Welcher
Director: Laura Welcher
Views: 930
Posted by: laura welcher on Jan 12, 2011

Laura Welcher, director of The Rosetta Project at The Long Now Foundation presents a poster on a distributed archive model in a session titled "Metadata in Language Documentation and Description" at the Linguistic Society of America annual meeting, Pittsburgh, January 9, 02011.
