
BITC / Biodiversity Diagnoses - Data Cleaning 2

This is an example of internal consistency. This is a data record associated with a specimen. It's from the California Academy of Sciences. It's a bird. Its catalogue number is 96773. It's documented by a preserved specimen. It was collected in 2008. The country was the U.S. It's in California. (That field isn't showing.) It was from Marin County, California. It's from San Rafael Hill in San Rafael. And, it has these coordinates. Okay? We can easily plot those coordinates and see if they are in the U.S. Right? If they're in Zimbabwe, we have a problem. We can see if they fall in California. And we can see if they fall in Marin County. That's taking one part of the data record (the coordinates) and checking it against another part of the data record (the textual description of the locality). That's a powerful thing. Oftentimes we can do it in an automated fashion for thousands of specimens. That's very important because it looks for inconsistencies within the data record. An inconsistency should tell us that something is wrong. There are situations where borders move. Or, names change. I tried to do something with districts in Kenya a couple of years ago. That was a nightmarish situation because the districting has changed in the last few decades. There are many more resources for checking external consistency. This is something that Arturo will talk with you about. But, this is just an illustration of external sources. Here are two different authorities: the International Ornithological Congress World Bird Names v. 3.05, versus Clements 6th Edition. Each of these is the beginning of the bird taxonomy list. The first thing you see here is "ostrich." Clements considers this to be one species. But, the IOC considers it to be two species. So, if you say, "I just saw a <i>Struthio camelus</i>", under this (Clements) concept, that looks this big, geographically. The bird is even bigger. But, the range is much of Africa.
And, under the IOC concept, the range is smaller because this northeast African ostrich is considered a separate species. So, we have these resources available to us. Here's another page from late in the dataset. This is a taxonomic dataset, not an occurrence dataset. I wanted to give you this one example of Reichard's Seedeater. The IOC lists this species under (genus) Crithagra. In the Clements taxonomy, it's under (genus) Serinus. These are little yellow birds. Under the IOC, it's one inclusive species: <i>Crithagra reichardi</i>. But, under Clements, it's two species: <i>Serinus reichardi</i> and <i>S. gularis</i>. Imagine, now, that I'm doing a query on African serins. And, look at this. I've got <i>Serinus reichardi</i>. And, I've got <i>Serinus gularis reichardi</i>. Many times we just ignore the subspecies. If I do that, I now have two species. But, you guys know that those are two versions of the same taxon: one recognized as a species, and one recognized as a subspecies. I'm going to see this only if I'm referring to that external data source. For birds, at least, there are lots of online resources. For plants and insects, you get into more and more trouble very easily. We can look at consistency with other records of the same species. For example, here's a swallow: <i>Hirundo angolensis</i>. These are gridded records of the species. Notice they're located in a middle belt across Africa. There are some VertNet records of <i>Hirundo angolensis</i>. These fit reasonably closely to the previous dataset. Here's a GBIF query. Look at these. What do those southern African records mean? I don't know. But, that query produces results that disagree with the independent (external) data sources. Maybe they're right. Swallows can fly very well. Maybe they migrate. I don't know. But, if I were looking for problem records, I would focus on those records. This is an oriole in Mexico. It's not the African family Oriolidae.
It's a family that looked just like the Old World Oriolids to the Europeans arriving in the New World. So, they called them orioles; but, they're in a completely different family. You can see some typical features of biodiversity data. The density of records is much higher in the U.S. than in Mexico. This is because the density of birdwatchers is much higher. And, where do U.S. birdwatchers go in Mexico? Beach resorts. So, there are lots of records from Mexican beach resorts. But, I wanted to show you that we can also look at these data with respect to environment. This was with respect to annual mean temperature. These are the primary bulk of records of that species. Then, there's a record that's an outlier in this direction. And, this record that's an outlier in that direction. That doesn't mean these are wrong. But you should look at those records. Those are outliers. They don't fit the pattern for that species. Maybe they're right. Maybe those are the tolerance limits of the species. But, it's worth looking at them. This is one of my favorites. For the last quarter century, I've been involved in a project called the <i>Atlas of Mexican Bird Distributions</i>. It's a database that summarizes the contents of more than 80 natural history museums. The data were all compiled by hand by two colleagues and me. It's been a quarter century. We still haven't published the project. The dataset includes almost every bird specimen ever collected in Mexico. We quit when I traveled all the way to Moscow to work in the collections at Moscow State University, and they had only 6 specimens. And, we said, "Let's stop." But, this is an interesting thing we could do once we had all those data together in one place. We took several of the most productive collectors in the history of Mexican ornithology, and we compiled all of their specimens ordered by date. Kate was collecting here one day. Then, moved camp two miles down the road and was collecting here the next day.
A week later, she was 100 km away. Etc. Okay? We measured the distance on the surface of the earth using that formula, which is not much better than a Euclidean distance. Here's an example for an interesting personality in Mexican ornithology. This is from 2-10 October 1955. On 2 October, he was up north. This is the west coast of Mexico. By the 5th, he had moved down to here. You can see him collecting lots of specimens on the 5th-10th. He was clearly spending time in the state of Nayarit. But then, uh oh. On the 8th, it looks like he collected a bird here, moved very quickly 4 states away, and then returned by the 9th. Or, he made a mistake in his data. Or, in the case of this collector, it may be that there were some modifications of data. He had a little problem with stealing specimens from other institutions and relabeling them. [chuckling] You can use this technique only if everybody else has made their data freely available to see. If we only had the Phillips specimens at the Univ. of Kansas where I work, we would only see a small piece of his overall itinerary. We could see this because over the last 25 years we had assembled all of the Mexican bird records of Phillips and all the other collectors. Or, now, we can see it because we are sharing our data. If and when we get to the point where all of the data for the big herbaria around the world (and, the local and national herbaria) are shared, we'll be able to put together all the specimens that Kate collected over her lengthy career as a botanist. [Note: Kate is not a botanist.] But, you can only do this with those external data. Okay? Those data that are beyond a single data record. This data record, on its own, is completely consistent. It's a bird that should be in <i>Guerrero</i>. <i>Guerrero</i> is the state. It's in the right place. It's in the right environments. The date is correct. But, when I see that it was collected by the same collector on the same day as a specimen from 4 states away, I know that something's wrong.
I don't really know what. Maybe all of these specimens are wrong, and that one's right. Or, maybe the date is wrong. Maybe he collected it on 24 October, not 8 October. Or, the geographic coordinates are wrong. We did this for the top 10 collectors in Mexican ornithological history. Plus, Adolfo Navarro (my colleague of 25 years) and me. Using this technique, we found errors for all of the historical collectors and both of the modern collectors. That is, both authors detected errors in our own data. So, this is one solid argument for sharing data even when there are still errors in the data. Here's Adolfo's data. We found about 2 errors in his data. Here's Allen Phillips, the collector I gave you the example from. You can see different error rates. But, again, no dataset is ever error-free. Even the data collected by the authors of the paper had problems. In the next couple of hours, we're going to talk about data cleaning. But, we should talk about what comes next. This is something that Jean brought up during our introductions. What do we do when we want to use, publish, or share our data? This is not necessarily crucial for the practicum of this course, but rather for the broader lesson. We need to think about field-level descriptors of quality. Providing a radius of uncertainty is very common for spatial data. This is the approach used presently. Some of you have met John Wieczorek. John essentially implemented this system for biodiversity informatics. The idea is that you give the coordinates of the locality. And, you also give a maximum uncertainty radius. This basically says, "I don't know exactly where the point is, but I know it's within 10 km of this place." That is a quality designator. If the uncertainty radius is 1 m, it's a pretty high-quality record. And, if the uncertainty radius is 1000 km, it's a poor-quality record. We've been doing this sort of thing for a long time. Here's a specimen in the collection that I curate.
Remember Allen Phillips, the guy that was in 2 states at once on 8 October 1955? Here he is. He looked at this specimen in 1973. He looked at this specimen again in 1976. Basically, he's saying, "If this is some subspecies that begins with 'W', it's a very bright or brilliant variant." Later, he comes back and says, "This is a large adult <i>bruesterii</i>, but the back resembles <i>minor</i>." Which is to say, this man was even arguing with himself. But, he was expressing uncertainty about the taxonomy. He's pretty sure that it's <i>Loxia curvirostra</i> (and it is); but, he wasn't sure which subspecies. Here, you can see the original label. And, you can see a new label with a new determination.
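
The collector-itinerary check described in the talk (order one collector's specimens by date, measure the distance between consecutive localities on the surface of the earth, and look for jumps that are too fast to be plausible) can be sketched roughly as follows. This is a minimal illustration, not the method used for the Atlas: the haversine formula stands in for the unnamed distance formula, the `Specimen` record, the `flag_itinerary_jumps` function, the 200 km/day threshold, and the coordinates are all hypothetical and chosen only to mimic the Nayarit/Guerrero example.

```python
from dataclasses import dataclass
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass
class Specimen:  # hypothetical record layout, for illustration only
    collector: str
    collected_on: date
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def flag_itinerary_jumps(specimens, max_km_per_day=200.0):
    """Return (previous, current, km) for consecutive records of one
    collector that imply travel faster than max_km_per_day (an assumed,
    adjustable threshold)."""
    by_collector = {}
    for s in specimens:
        by_collector.setdefault(s.collector, []).append(s)

    flagged = []
    for records in by_collector.values():
        # Stable sort: records from the same day keep their input order.
        records.sort(key=lambda s: s.collected_on)
        for prev, curr in zip(records, records[1:]):
            # Treat same-day records as one day apart to avoid dividing by zero.
            days = max((curr.collected_on - prev.collected_on).days, 1)
            km = haversine_km(prev.lat, prev.lon, curr.lat, curr.lon)
            if km / days > max_km_per_day:
                flagged.append((prev, curr, km))
    return flagged


# Made-up coordinates: several days in Nayarit, one record in Guerrero.
records = [
    Specimen("Phillips", date(1955, 10, 5), 21.51, -104.89),
    Specimen("Phillips", date(1955, 10, 7), 21.54, -105.29),
    Specimen("Phillips", date(1955, 10, 8), 21.54, -105.29),
    Specimen("Phillips", date(1955, 10, 8), 17.55, -99.50),
    Specimen("Phillips", date(1955, 10, 9), 21.51, -104.89),
]
flagged = flag_itinerary_jumps(records)
for prev, curr, km in flagged:
    print(prev.collected_on, "->", curr.collected_on, f"{km:.0f} km")
```

A flagged pair only says that the two records disagree with each other. As the talk notes, it does not say whether the date, the coordinates, or the surrounding records are the ones in error.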

Video Details

Duration: 17 minutes and 55 seconds
Language: English
License: Dotsub - Standard License
Posted by: townpeterson on Jul 26, 2016

This talk was presented in the course on National Biodiversity Diagnoses, an advanced course focused on developing summaries of the state of knowledge of particular taxa for countries and regions. The workshop was held in Entebbe, Uganda, during 12-17 January 2015. It was organized by the Biodiversity Informatics Training Curriculum, with funding from the JRS Biodiversity Foundation.
