BITC / Biodiversity Diagnoses - Basic Patterns 3

So the question is, when we work from monographs, what are the limitations? Well, certainly one limitation is that the monograph only applies up to the date that it was published. So for the Belgian Congo, for example, it was 1932. And a lot has been added to the knowledge base since 1932. Another big limitation is how much information is provided in that monograph. You saw in that XXX monograph that I showed you very nice data that you can turn into what went where. But with the Belgian Congo monograph, those data weren't quite so explicit. So it all depends on how detailed the monograph was.
And then another major problem that you'll get into is that many times species concepts have changed, maybe since 1932. And so you have to go back and essentially interpret that information forward, so that you understand which species was being referred to by that old name. And it gets very tricky, because if we have a species with name A, and some taxonomist comes along and says, oh no, species A is really two species, one of those retains the name A and the other will take a new name, B. And so you get into these problems where the name A now has two different concepts [okay]. One is the old, inclusive species, and the other is the new, more restricted species. And so it means that even though species A is just a Latin binomial, it has two different meanings. And so we have to be very careful to understand which meaning of the name we're referring to [okay]. And you will get into those problems not just with the monographs but also with the specimen-based data and the electronic data. The nomenclature in one museum may be as of 2014, and the nomenclature of another museum may be as of 1914. So it's all complicated.
Also, with the monographs you may have another task, which is getting the data out of those books.
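To make the "name A has two meanings" problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical and not from the talk: the names "Species A" and "Species B", the split year 1950, and the helpers `concept_for` and `current_candidates`. It only illustrates that a record needs both the name and the date of the nomenclature before it can be interpreted forward.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaxonConcept:
    name: str   # Latin binomial as written in the source
    sense: str  # "broad", "restricted", or "unchanged"

# Hypothetical split table: name -> (year of split, names after the split).
SPLITS = {"Species A": (1950, ["Species A", "Species B"])}

def concept_for(name: str, as_of_year: int) -> TaxonConcept:
    """Decide which meaning of `name` a source of a given year used."""
    if name not in SPLITS:
        return TaxonConcept(name, "unchanged")
    split_year, _ = SPLITS[name]
    sense = "broad" if as_of_year < split_year else "restricted"
    return TaxonConcept(name, sense)

def current_candidates(name: str, as_of_year: int) -> list[str]:
    """Current names that a record under `name` could refer to."""
    if concept_for(name, as_of_year).sense == "broad":
        return list(SPLITS[name][1])
    return [name]

# A 1932 monograph record and a 2014 museum record, both labelled "Species A".
print(current_candidates("Species A", 1932))
print(current_candidates("Species A", 2014))
```

Under these assumptions, the 1932 record stays ambiguous between the two current species until someone checks the specimen or the locality, while the 2014 record can be taken at face value.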
And again, you can do that by putting together an army of undergraduate students that you pay to spend 3 hours a day capturing those data, and then you will have to deal with typographical errors and things like that. Or there's some effort being dedicated to scanning and OCR (optical character recognition), and then you have to deal with getting programs trained to recognize the text. But you know, they will make mistakes.
In fact, do you all know about CAPTCHAs? This is really a fun concept. Do you know when you're filling in a form online, and they want you to look at an image and type the number? [Participant] GBIF makes you do that. That's right, everybody did that yesterday with GBIF. If you fill out those forms, sometimes you'll see that there are two images, and so you'll have to write this number or word and then this number or word. So, one of those is essentially a safety mechanism that's protecting against non-humans signing up for a million accounts on GBIF. That's usually called a denial-of-service attack. But when there are two of them, do you know what the second one is?
Google has a project called Google Books, which is scanning and capturing all books (everything ever published). It became a big problem because a lot of those books are under copyright. And Google said, that's okay! We're just scanning and capturing them, and when they come out of copyright protection, we've got the books. But a lot of people got very upset. But Google is turning those scans into a full-text version of each book. And when you scan entire libraries, there are some little problems, maybe like the paper shifted a little bit as a particular part was being scanned. And those are things where the OCR routine says, I don't know. And so they're actually using us to edit their books' OCR. And so those are all the little problems.
And indeed I may interpret it wrong, but if a thousand people try to interpret the same little problem in the scan of some book, you can take the majority vote, and that's a good guess of what it was. So that's crowdsourcing at a huge level. But watch the next time that you do a CAPTCHA, when it has two. And look at them: one of them will be the usual fuzzy picture of a number, and then the other one will be text from a book, where the text was a bit messy and the OCR routine had a problem with it [okay]. And unfortunately, we don't have access to that huge crowdsourcing opportunity.
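The majority-vote idea can be sketched in a few lines of Python. This is only an illustration of the principle described above, not how Google actually does it; the function name `majority_reading`, the `min_share` threshold, and the sample readings are all assumptions made for the example.

```python
from collections import Counter
from typing import Optional

def majority_reading(readings: list[str], min_share: float = 0.5) -> Optional[str]:
    """Return the reading typed by more than `min_share` of people, else None."""
    # Normalise case and whitespace so "Congo " and "congo" count as one reading.
    cleaned = [r.strip().lower() for r in readings if r.strip()]
    if not cleaned:
        return None
    word, count = Counter(cleaned).most_common(1)[0]
    # Accept the top reading only if it clearly dominates the votes.
    return word if count / len(cleaned) > min_share else None

# Hypothetical transcriptions of one messy word by four different people.
print(majority_reading(["Congo", "congo", "Conga", "Congo "]))
```

Returning None when nobody wins is a deliberate choice in this sketch: a word with no clear majority would go back for more votes rather than being guessed.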

Video Details

Duration: 6 minutes and 46 seconds
Language: English
License: Dotsub - Standard License
Posted by: townpeterson on Jul 26, 2016

This talk was presented in the course on National Biodiversity Diagnoses, an advanced course focused on developing summaries of the state of knowledge of particular taxa for countries and regions. The workshop was held in Entebbe, Uganda, during 12-17 January 2015. It was organized by the Biodiversity Informatics Training Curriculum, with funding from the JRS Biodiversity Foundation.
