BITC / Biodiversity Diagnoses - Taxonomy Checks 1

Good morning, everyone. We're now starting the second session, and we will go very quickly over a few basic checks that we think should be done on basically every kind of data set you might get hold of or already have. Those checks rely heavily on self-consistency, which means that often you have only two ways to detect an error, or something that is wrong or doesn't really fit. Either you go record by record, which is something you can do easily when you have just a handful of records, or, if you have a reasonably large number of records, say a few thousand, you probably need to rely on numbers, which means that you look for a pattern. You have an expectation of how the data set should look, and then you look for things that deviate from that pattern. That normally means that strange records will stand out, as Town has shown us in his example.

We might start by looking at how taxonomical records should look, and here we find a problem, because we don't know how taxonomical records should look. So we don't have a prior expectation here. We'll start with taxonomical checks. Then, Lindsay, will you go on with geographical checks?

What's a primary biodiversity data record? You all know what it is; it's quite a simple question. Basically: what was there, where was it, and when was it there, seen, caught or whatever? If we can answer this three-dimensional problem, we get our primary biodiversity record. Now, since we're going to talk about taxa, we'll concern ourselves with the "what". Of course, what, where and when are not the only three dimensions in our PDR, but those are the most basic ones. There is more to it. What else? Lots of things: as background, who is providing the data, whether it has been corrected or looked at, and so on. So there are a number of additional dimensions, but the three we talked about are by far the most important. So what's in a name? We're dealing with taxon names.
A name is something that we use to anchor concepts, because what we're concerned with are not names but taxonomical concepts. We want to explain the distribution of a certain plant or a certain animal that we know by name, and the name is just the anchor point to the data, so we need to get the name right. If we don't, then we might not get the data associated with that name right either. So the name provides the identification of the object of interest, which is the concept, or at least lets us separate it from another similar concept: we need to separate one plant from another similar plant, and we use names for that. Ideally, this name should be a univocal identifier, I mean it should be associated with one entity. It should be resolvable, so the name should get us to the actual object. It should ideally be fixed, it should be checkable, and it should if possible be tamper-proof. Almost nothing like this happens in the biodiversity world.

For example, you all have passports, and I have my ID card; this is my ID card for Spain. It meets all those requirements. It has a unique ID, which is right here: this is my ID number, and it's unique. It's univocally tied to me and it cannot be reused. It's also apparently tamper-proof in a sense, because it has a T here, which is a hash of the rest of the numbers. So if I type a number wrong in some application or whatever, it won't match the T, so we know that something was wrong. Well, the problem is that we don't have that in the biodiversity world.

So this is me, Arturo Arino, although my standard scientific name is Arturo H. Arino; H is for my second name, which is Hugo. But there are also other Arturo Arinos around the world. This one is a biker who is actually from my home town. I'm a biker too, but he's in competition and I'm not. And this is a much more famous Arturo Arino, who is a graphic designer and an Argentinian, but he's not me.
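The check letter on the ID card described above can be sketched in a few lines. This is a minimal illustration, assuming the standard Spanish DNI rule (the letter is picked from a fixed 23-letter table using the number modulo 23); the function names are hypothetical, and the sample number is a commonly used placeholder, not a real person's ID.

```python
# Letter table assumed for the Spanish DNI: index = number mod 23.
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def dni_check_letter(number: int) -> str:
    """Return the check letter for an 8-digit DNI number."""
    return DNI_LETTERS[number % 23]


def is_valid_dni(dni: str) -> bool:
    """True if the trailing letter matches the check letter of the digits."""
    dni = dni.strip().upper()
    if len(dni) != 9 or not dni.isascii() or not dni[:8].isdigit():
        return False
    return dni[8] == dni_check_letter(int(dni[:8]))


# A single mistyped digit changes the expected letter, so the typo is caught.
print(is_valid_dni("12345678Z"))  # True
print(is_valid_dni("12345679Z"))  # False
```

Taxon names carry no such built-in check, which is why a mistyped name cannot be detected from the name alone.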
So all of us are different, but we all share the same name. We have the same problem in biodiversity: there are shared names which belong to different concepts. So let's look at this animal; I am going to use examples of animals because most of you are botanists. Okay, what do you call this animal in your native language? We all know it's a Humpback Whale. So what do you call it in your languages? Does anybody have a local name for it? [Participants] Most of us come from landlocked countries. Anyway, it doesn't matter. This is a whale; everybody knows a whale. But Spaniards or the French have different names for it. In Spain, some fishermen thought this was a fish. It has a lot of names that all belong to the same concept. It's the same kind of problem we had with Arturo Arino's name, only reversed: that one name belongs to several different entities, whereas here all those names belong to the same animal or species. In fact, if you go to the World Database for Oceanographers, you will find 84 vernacular names for this whale in 54 different languages.

So who solved this? We know who solved this, in theory. It was this man [Carolus Linnaeus], a Swede who decided to put the names of organisms in order. He went on to prepare this monumental work, which in its first edition was only about four pages. But all the names of the animals and plants were there. And he invented what we call the binomial system of genus and species. By the time his publication reached the tenth edition, it was a large book with thousands of names in it. But the very first editions were quite short. This is the one for the animals: they could be either something with four legs, or with two legs and two wings, or things that live both on land and in water, or fish, or insects, and everything else was called Vermes, or worms. So over time, most of us have been using this Linnaean system to classify and name organisms, to great success: about 1.8 million species have been named, with at least 10 million different names, and that's our problem.
So let's go back to our whale. If we're scientists we don't call it a whale; we call it, more properly, Megaptera novaeangliae. And we collect ancillary data: for example, we know that this one is a female, that it was alive at the time we took the picture, and that it was seen off Nothoro, which is the location. The location is properly georeferenced, or, as Town will show you later, not properly georeferenced, because it lacks the uncertainty radius. But I can tell you it was something like what is in that table. What is important is that when I recorded it, I recorded it as Megaptera novaeangliae. It went into our museum's collection records, and it went directly to GBIF. So GBIF records all the Megaptera novaeangliae that people have seen and reported.

That is not the only way to report the name of a Humpback Whale. Just as there are many vernacular names, throughout history the Humpback Whale has been reported under various scientific names too. In fact, it has been given 46 different scientific names, or synonyms, over time, when it should have gone by one single name. Then there is the zoological nomenclature code, which takes care of cleaning up or fixing names, or muddying them up even more, or whatever. But the fact is that we have 46 different names for this animal. So how do we solve the problem that names should be unique? When we deal with taxonomical records we need to know the diversity of the place: the species richness, or number of species. And therefore, if we have 46 possible names in a data set for the same concept, we have a record inflation, or taxon name inflation, of 46. That's a problem; we have to reduce this spectrum. Obviously, we need to use unique identifiers for the same concept. But we have to deal with this; this is what we'll have unless the data sets are ours and we have corrected them and ensured there are no mistakes. Unless we do that, we have to deal with this. That's a fact of life, right?
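The name inflation described above can be made concrete with a small sketch: count distinct names before and after mapping every variant to one accepted name. The synonym table here is a tiny illustrative stand-in, not the full list of 46 names; in practice it would come from an authoritative checklist. The function and variable names are hypothetical.

```python
# Illustrative synonym table (assumption): name variant -> accepted name.
ACCEPTED = {
    "Megaptera novaeangliae": "Megaptera novaeangliae",
    "Balaena novaeangliae": "Megaptera novaeangliae",
    "Megaptera nodosa": "Megaptera novaeangliae",
}


def richness(names, synonyms=None):
    """Number of distinct names, optionally after resolving synonyms."""
    synonyms = synonyms or {}
    # Names missing from the table are kept unchanged.
    return len({synonyms.get(name, name) for name in names})


records = [
    "Megaptera novaeangliae",
    "Balaena novaeangliae",
    "Megaptera nodosa",
    "Megaptera novaeangliae",
]

raw = richness(records)                 # distinct names as recorded
resolved = richness(records, ACCEPTED)  # distinct concepts after resolution
print(raw, resolved, raw / resolved)    # 3 1 3.0
```

With all 46 names present in a data set, the same calculation would report 46 "species" where there is only one.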
So, there are a number of solutions, and one of them is going by hand, but there are more. One solution to the problem of multiple names could be issuing a Globally Unique Identifier (GUID) for each species in the world, or a Life Science Identifier (LSID). This has been proposed for years but has never come to a conclusion satisfactory to everyone. These are the most contentious discussions that I have witnessed in GBIF and other databases. Nobody can agree: some people propose human-readable global identifiers, while others prefer exclusively machine-readable identifiers, or whatever. Over time, taxon codes have often been used, so you convert a name into a code. Or, more recently, identifying a species by genetic markers, or putting everything into species registers so that a name is not reused, or nomenclators, or whatever.

This is an example of a taxon code. This was set up in my laboratory 30 years ago, when I started doing these kinds of things. We decided to use numerical codes for each taxonomical level. So basically we could have any number of names that resolved to the same taxonomical code, and by using the same taxonomical code we could bring all those names together. The problem is that we did that, and other laboratories did the same thing under the same conditions with different names. So names or codes are not exchangeable. I might have this thing with over 40,000 species in my lab, but it only serves us and other labs using our system. Which is to say that it is not useful.

This is a system in which you assign a unique identifier to each occurrence of the species and an identifier to each name as well. That is very useful from my database point of view, but it can go wrong like anything else, because there are many things that can go wrong. Everything that can go wrong will go wrong. There is no possibility of something that can go wrong not going wrong at some time or other.
So there can be legitimate name changes: a species can be renamed because there were some problems with the original name, or because it has been discovered that it was a synonym of another species, or because its name coincides with a name that somebody else published earlier elsewhere (a homonym). Or it can be a new combination, or the Latinization of the name can be wrong. Or we might have many variants because of misspellings from different data sources. There are so many things that can go wrong. And you might even have your own errors in this: for instance, you can type the name of the species with an extra space at the end and you'll never notice it.

For instance, here is one single sample: this is an OCR of an actual species list. When you do the OCR from the page on which the original list was collected, everything in red is actually wrong. So basically we're down to a few right renderings, and more than half are wrong. This is very typical of OCR when trying to get information from poor sources. There are so many other opportunities for error: you can misread labels, you can misread legends, you can misread cards, or whatever. You can even have misreadings generated in the printout.

What else can go wrong? Misidentifications! You might misidentify something as belonging to the wrong species. So this is our Megaptera novaeangliae, but somebody in the field might have confused other whales with it and recorded the name as Megaptera novaeangliae. So many things can go under the same name, and they do. It's not that the name is wrong; it's what is behind the name, the actual animal or concept, that is wrong. And then you have to add the misidentifications to all that we saw before (synonyms and misspellings). These things multiply each other, so the possible spectrum of problems increases almost exponentially. So from one single concept, you might end up with many different actual meanings.
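Two of the problems mentioned above, stray spaces and misspellings, lend themselves to a simple automated check. The following is a rough sketch using Python's standard `difflib`; the reference list, the `suggest` helper and the 0.85 similarity cutoff are assumptions chosen for illustration, and any suggestion should still be reviewed by a person. Misidentifications cannot be caught this way, because the name itself is well formed.

```python
import difflib


def normalize(name: str) -> str:
    """Collapse repeated whitespace and strip leading/trailing spaces."""
    return " ".join(name.split())


def suggest(name, reference, cutoff=0.85):
    """Return the reference name that best matches `name`, or None."""
    cleaned = normalize(name)
    if cleaned in reference:
        return cleaned
    matches = difflib.get_close_matches(cleaned, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical reference list of accepted names.
reference = ["Megaptera novaeangliae", "Balaenoptera physalus"]

print(suggest("Megaptera novaeangliae ", reference))  # trailing space removed
print(suggest("Megaptera novaeanglia", reference))    # misspelling matched
print(suggest("Orcinus orca", reference))             # no close match: None
```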
In our example of Megaptera novaeangliae, if we go to the GBIF data, we can find 25,771 records of this species across the world. Most of them go by this name [Megaptera novaeangliae]. Everything else is alternative ways of naming the species in GBIF, and there are many ways across sources. The funny thing is that this is not the proper name. Which of you can point to the proper name of the species? It's not a matter of numbers; the main one is not really it. The right one is actually this one. This is the best name: the name that is completely and properly typed, as required by the Zoological Code of Nomenclature. It has an author, and the author is in parentheses because the species wasn't named under this genus in the original description. So basically everything else is wrong, although for functional purposes we might accept this name here without the author. That is something I'm ashamed to say before botanists, but we zoologists tend to do away with the names of the authors. I know we're always wrong, but we do it nonetheless.

And what kind of errors do we see here? This one is a synonym, but it also has a wrong letter that shouldn't be used. This includes errors with synonyms, misspellings and rules. So there are several types of errors you can get. In fact, if you look at the variants, the number of records is like this, so most of the problems we have are with the rules. Over time, the type of errors changed too. Most records that are wrong are wrong because of wrong Latinization, and this happened recently because somebody put wrong data into the database.
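Looking at name variants and how many records each one has, as described above, can be sketched as a simple tally. This is only an illustration under stated assumptions: the regular expression handles plain "Genus species" binomials followed by an optional authorship string, the sample records are invented, and the authorship "(Borowski, 1781)" is not given in the talk but supplied here as the published author of the species.

```python
import re
from collections import Counter

# Assumed pattern: capitalized genus, lowercase epithet, optional authorship.
NAME_RE = re.compile(r"^(?P<canonical>[A-Z][a-z]+ [a-z]+)\s*(?P<authorship>.*)$")


def split_name(name):
    """Split a name into (canonical binomial, authorship), or None."""
    match = NAME_RE.match(" ".join(name.split()))
    if match is None:
        return None
    return match.group("canonical"), match.group("authorship")


# Invented sample records for illustration.
records = [
    "Megaptera novaeangliae",
    "Megaptera novaeangliae (Borowski, 1781)",
    "Megaptera novaeangliae",
    "Megaptera novaengliae",
]

# Tally records per canonical name, ignoring authorship differences.
tally = Counter(split_name(r)[0] for r in records if split_name(r))
for name, count in tally.most_common():
    print(count, name)
```

The dominant variant stands out by count, and the rare ones are the candidates to inspect for misspellings, synonyms or rule violations.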

Video Details

Duration: 21 minutes and 59 seconds
Language: English
License: Dotsub - Standard License
Posted by: townpeterson on Jul 26, 2016

This talk was presented in the course on National Biodiversity Diagnoses, an advanced course focused on developing summaries of state of knowledge of particular taxa for countries and regions. The workshop was held in Entebbe, Uganda, during 12-17 January 2015. Workshop organized by the Biodiversity Informatics Training Curriculum, with funding from the JRS Biodiversity Foundation.
