
BITC / Biodiversity Diagnoses - Data Cleaning 3

And, you can see question marks. Okay? You can see expressions of how much the person who determined the record knew and was sure of versus didn't know and was not sure of. We're not going to talk about DarwinCore this week as it is well documented in the data capture course. DarwinCore provides substantial opportunity for documentation of data quality. This is for individual elements in a record. Who told you that was the location? Coordinate uncertainty in meters. Coordinate precision. Georeferencing protocol. Georeferencing sources. Identified by. Identification verification status. Those are all fields that allow you to qualify how sure you are of a particular field in a particular record.
A next step is an overall assessment. In some senses, this may not be a good idea. You may have a bad quality record that's perfectly useful for something. Maybe the geographic reference is wrong, but if the temporal reference is OK, then seeing a lion outside the hotel here might be an interesting record if you're absolutely sure it's new. Right? Even if there's a 100km coordinate uncertainty, it's still an interesting record.
In the biodiversity informatics world, we talk a lot about data quality and fitness-for-use. This headline was really funny. GBIF to target data quality and fitness-for-use in 2014. Uh oh. You didn't in the preceding 10 or 15 years? Answer: no. I can get myself into a lot of trouble with these videos. [indicating Arturo] He's on the science committee. So is Jean. [chuckling] We talk a lot about these things. But, what does it mean? Fitness-for-use depends on the use. You're going to put these data to use in documenting national... Just because the data are fit for use for that doesn't mean they're fit for use for understanding distribution of the species on a local landscape. Right? Describing a geographic range is very different from understanding individual movements.
So, you can't really say you're going to "take care of data quality and fitness-for-use this year." No. It's an iterative, long-term, use-driven process. That needs to be remembered. You can't ever say, "this dataset is clean. I can jump right into it."
A last comment is about dataset diagnostics. I've given you examples of this. I keep going back to speciesLink because I love how they've done it. Here is the next version of their dataset diagnostics. You see the accumulation. You see the same problems in there. You see information about the collection. You see those diagnostics. This is the end result of some cleaning. But, you can see it's the end result of not enough cleaning. There are certain basic problems that we should fix generally. Then, if you want to use this dataset for a particular analysis, you still have to go back in and look hard at it. We've talked about a lot of this. A first pass evaluation of fitness-for-use for the whole dataset is to check if these are large or small numbers.
Just to wrap things up. Data quality matters. You will get into trouble if you don't take care of these things. If you read the reprint that we provided, there's a figure illustrating the effect of cleaning the taxonomy on the completeness index. It makes about a 5% difference in how complete your inventories look if you have cleaned up the taxonomy or not. I'll show you why later. The data always hold error. The question is whether the error is so substantive that it negatively affects the uses that you want to put the data to. You need to flag the likely errors, and fix the errors that can be fixed. The data quality needs to be documented. But, fitness-for-use is not something that can be spoken of without explicitly stating what the use is. Any questions? This is the first talk that really had some substance. Jean.
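The point that fitness-for-use depends on the use can be sketched in a few lines of Python. This is a minimal illustration and was not shown in the talk: the `Record` class, the two use names, and the 1 km threshold are all assumptions made for the example. Only the idea of a coordinate uncertainty in meters comes from the DarwinCore fields listed above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    """A hypothetical, stripped-down occurrence record."""
    species: str
    country_code: str
    # Mirrors the DarwinCore idea of coordinate uncertainty in meters;
    # None means the uncertainty was never documented.
    coordinate_uncertainty_m: Optional[float]


# Assumed thresholds, chosen only for illustration.
# None means "any uncertainty is acceptable for this use".
MAX_UNCERTAINTY_M: Dict[str, Optional[float]] = {
    "national_checklist": None,   # only the country needs to be right
    "local_landscape": 1_000.0,   # fine-scale work needs precise points
}


def fit_for_use(record: Record, use: str) -> bool:
    """Decide whether one record is usable for one explicitly named use."""
    limit = MAX_UNCERTAINTY_M[use]
    if limit is None:
        # A country-level use only needs a stated country.
        return bool(record.country_code)
    if record.coordinate_uncertainty_m is None:
        # Undocumented uncertainty cannot be trusted at fine scales.
        return False
    return record.coordinate_uncertainty_m <= limit


# The same record (100 km uncertainty; the country code is an assumption)
# passes for one use and fails for another.
lion = Record("Panthera leo", "UG", 100_000.0)
print(fit_for_use(lion, "national_checklist"))
print(fit_for_use(lion, "local_landscape"))
```

The function takes the use as an argument on purpose: there is no single "clean" answer for a record, only an answer for a stated use.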
[Jean] I'm a bit frightened when you state that fitness-for-use depends on the use of the data and there's not a process that can treat everything at once. GBIF houses a lot of data, so we have many things to do.
[Town] That's very true. There are things that you can do as a useful precursor to publishing data. Consider those dataset level diagnostics. It may be very useful to go through and flag all those scientific names that appear to be non-standard. Those are generic problems. But, if somebody comes to you and says, "I have funding and I have official approval for a new national park within Benin." "Can you please tell me everything you know about the distribution of this endangered plant?" "I want to make sure this new national park includes all of the known populations." You're going to go look at those 100 records much more carefully, and perhaps for different things than you did for the whole dataset. It's frightening. It's intimidating. But, at the same time, it's a process. If we understand that it's a never-ending process, then we don't use the fact that there's error in the data as an excuse to not publish the data. Okay? Other questions? Alex.
[Alex] It's more of a comment than a question. Am I correct in thinking that data leaks are therefore related to a particular use of the data?
[Town] Are data leaks related to a particular use of the data? I would say that certain data leaks are more important for certain uses of the data. The leaks will still be there; but, how much you lose may vary. If I'm interested in compiling a list of the plants of Ghana, as long as I'm pretty sure that the plant was collected in Ghana, I don't really care about the geographic coordinate. So, that georeferencing may not be very important. Or, precise georeferencing to 1km instead of 100km may not matter as long as it's in Ghana. So, I would say the importance of those leaks in reducing the overall amount of data depends on the use. Okay? Other questions? Moses.
[Moses] My question is about Phillips "collecting" across 4 states. Is it possible that he sent a team to collect specimens, and that the team did not have authority to assign collection numbers, resulting in those specimens being attributed to Phillips?
[Town] It could be. And that's why you flag the record. You don't just erase it and burn the specimen. Rather you flag it to denote that something is wrong. We always have large remarks fields that note things like, "On 8 Oct, Phillips was also collecting in Nayarit." Something is wrong. He collected a lot on the 8th, and on preceding and following days in Nayarit. So, we're pretty sure he was there. Then, there's this single specimen from Guerrero. So, you can lay out the whole case. But, you're right. He could have been gifted the specimen. He could have sent one of his collectors. Although, his collectors usually put things under their own catalogs. But, no, I don't know that that time and place are not correct. They may be correct. So, we flag that record. And, now we switch to the use. If I'm developing a niche model for that species, and if that record is an outlier, I'd better be careful because I know that record might have a problem. Okay? You could even do an analysis with all of the data. Then you could run the same analysis omitting the suspicious records. You can include and exclude them to see how much of a difference it makes.
[Moses] Okay. Now, to prevent future use... For example, my people will split into two teams. But, not all of my field assistants have collection records. So, I may be collecting in southern Korup National Park and I send a team up to northern Korup. When they bring the specimens, I have to put the same thing.
[Town] So, it looks like Moses did this.
[Moses] Yes. In such a case, what should I do so that someone else who may be analyzing the data knows this?
[Town] I think that depends on how you keep data for your particular world.
How bird people collect data is different from how plant people collect data. But, I would seek an opportunity to divide a collecting event in northern and southern Korup. One event for each. There's an entire set of procedures on whether each specimen defines a collecting event or whether you go to a place and that defines a collecting event. And then you accumulate a bunch of specimens there. Again, I don't want to speak individually to the botanical side. But, you should look within your data system and look for a way of defining two separate sites. An easy way is to get your field assistant-
[Moses] -recognized as a collector.
[Town] Yes. That way, a person and a place define an event. This can help avoid future confusion. Yes, Emily.
[Emily] I just wanted to add that, when we collect plants, there are usually your own collecting numbers. Perhaps as the PI. The field assistants can use your numbers; but, usually there's a field for additional collectors. So, at the end of the day, there are specimens that are attributed to you specifically, and others to you and your field assistants.
[Town] If that additional collectors field is not in the dataset that you build, that might be a useful way of solving that problem. Other questions?
[Jean] I have some concerns with the data from Brazil, particularly the coordinate errors causing some localities to fall in the ocean or even over Africa. How can we correct these errors? This problem, where coordinates for terrestrial species are in the ocean or on the wrong continent, is common when you visualize GBIF (and other) data.
[Town] Think about the types of tools we just talked about. If you have a record of a plant that says, "Country - Brazil" and it's showing up in Somalia, or in the Indian Ocean, that should be flagged as having a problem. That's a problem that is obvious enough that it can often be fixed. You're looking for a positive longitude for a country that should have a negative longitude.
The same goes for northern and southern hemispheres. That problem in data is clean enough and clear enough that it can often be fixed. Don't just flag it and ignore it; fix it. But, the whole purpose of the speciesLink tools is to help the curators of the data to work through those problems. That's the really nice thing about those curves of suspected latitude-longitude coordinates. If a curator sees a high curve, they can go in and work until that curve goes down. When all of those curves are low, and when all of those numbers are low, then, at least at a first level, you begin to have more confidence in the dataset. How are we doing on time?
[Lindsay] It's 11:15.
[Town] Uh-oh. We've already gone over.
[Rodrigue] I'd like to know if this data cleaning and error fixing is necessary in modeling the distribution of a species.
[Town] Yes. It's very necessary. What may not be necessary is all of the archival dimension. As I said about documentation, for these projects, we're not going to create a permanent data set that documents everything that was done.
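The sign check described in the answer about Brazil (a positive longitude for a country that should have a negative one, and likewise for hemispheres) can be sketched as follows. This is a minimal illustration and not the speciesLink implementation: the `EXPECTED_SIGNS` table is a hand-made assumption covering two countries, and the function names are hypothetical. A real workflow would use a proper gazetteer and leave the final correction to a curator.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative, hand-made table (an assumption, not a real gazetteer):
# expected sign of (latitude, longitude) per two-letter country code.
# None means the country spans both hemispheres, so the sign is uninformative.
EXPECTED_SIGNS: Dict[str, Tuple[Optional[int], Optional[int]]] = {
    "BR": (None, -1),  # Brazil: crosses the equator, entirely west of Greenwich
    "ZA": (-1, 1),     # South Africa: southern and eastern hemispheres
}


def sign_flags(country_code: str, lat: float, lon: float) -> List[str]:
    """Return flags for coordinates whose sign contradicts the stated country."""
    lat_sign, lon_sign = EXPECTED_SIGNS.get(country_code, (None, None))
    flags: List[str] = []
    if lat_sign is not None and lat * lat_sign < 0:
        flags.append("latitude sign disagrees with country")
    if lon_sign is not None and lon * lon_sign < 0:
        flags.append("longitude sign disagrees with country")
    return flags


def suggested_fix(country_code: str, lat: float, lon: float) -> Tuple[float, float]:
    """Propose coordinates with the offending signs flipped.

    This is only a proposal; a curator should confirm it before it is applied.
    """
    lat_sign, lon_sign = EXPECTED_SIGNS.get(country_code, (None, None))
    if lat_sign is not None and lat * lat_sign < 0:
        lat = -lat
    if lon_sign is not None and lon * lon_sign < 0:
        lon = -lon
    return lat, lon


# A record labelled Brazil but plotted east of Greenwich.
print(sign_flags("BR", -15.8, 47.9))
print(suggested_fix("BR", -15.8, 47.9))
```

Keeping the flag and the proposed fix as separate functions follows the advice in the discussion: the record is flagged first, and a correction is made only when the error is clear enough to fix.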

Video Details

Duration: 20 minutes and 25 seconds
Language: English
License: Dotsub - Standard License
Genre: None
Views: 2
Posted by: townpeterson on Jul 26, 2016

This talk was presented in the course on National Biodiversity Diagnoses, an advanced course focused on developing summaries of state of knowledge of particular taxa for countries and regions. The workshop was held in Entebbe, Uganda, during 12-17 January 2015. Workshop organized by the Biodiversity Informatics Training Curriculum, with funding from the JRS Biodiversity Foundation.
