BITC / Biodiversity Diagnoses - Data Cleaning 1

The idea of data cleaning used to be less of a problem. When I was in graduate school, you would go to the museum and write down the data element-by-element. You would fill in each field; and, if something looked strange, you would check it right then because you had (in my case) the dead bird in your hand. Or, in your cases, the herbarium sheet. But, you were writing down, or typing into a dataset, element-by-element. And, you were really paying attention to each of those fields. Obviously, there were people who weren't careful. But, if you were thinking and caring about your data, you automatically did a lot of this process that we're going to talk about today. You'd note that there's no record of species X from a specific location. Or, you might note that you've never seen species Y with a red head. Or, that this specimen looks like the western form not the eastern form. You would be doing this along the way.

Now, the world has changed. Each of you can do a query in a second and get a dataset in 15 minutes that is the plants of Zimbabwe or Uganda, and so forth. You can do that in no time. And, all of a sudden, you have 10,000 records downloaded to your computer. And, it's very tempting to say, "Okay, I've got my data. Let's start." "Let's start playing with the data. Let's do science." But, by doing that, you are trusting people you've never met. You're trusting that those people have taken care with your data. And, you don't know whether you should trust those people or not. You don't know if they did it really quickly and without much care. Or, maybe they cleaned that data with a lot of care. But you don't know.

The situation gets worse. I did an analysis with one of my graduate students a few weeks ago. It was a niche model of a single species. I went to GBIF to download the data; and, there were 1.5 million records of one species. I don't have the time or energy to review 1.5 million records. Right?

Another example.
Right now, I'm working on a project on the digital knowledge of the birds of the world. The 2014 version of that dataset is 210 million records. Okay? My computer can't load the dataset. So, how do you do it? Do you just trust people? Do you trust 500 data managers to be doing their jobs? And, to be doing their jobs in the same way as the other 499? Bad idea.

And, if you don't do this, you're going to get into trouble. You're going to get into a situation where errors, problems, and inconsistencies appear in your dataset. Even though we would love to jump in and start working, we really need to do some careful thinking about, and exploration and assessment of, our data. This should be done in advance, before we do the fun stuff.

Almost all of you have seen me take your dataset and send you back a report that says that you need to fix this, and this, and this, and this ... That's something that I do with myself. I do it with my students. I do this with everybody. And, you need to learn to do this as well.

We have two general tools. Internal consistency. And, external consistency. I'll give you quick examples of both. Then, after Arturo leads a discussion on taxonomy, I'll come back with two examples and talk about geography.

Let's go over some generalities. Error is everywhere. Okay. You know the principle of entropy? Error will enter, and mess up any order, just by the nature of things. Data cleaning improves the readiness and utility of a dataset. But (and this is one of my favorite complaints about colleagues around the world) the existence of error is not an excuse to not share, integrate, or make available your data. Rather, the existence of error is one of the best reasons that you should share your data. When your data are being used, people find problems. And, they tell you about them. Okay? This is a very common excuse. "Oh no, we can't share our data yet because we haven't finished cleaning it." Guess what? You will never finish.
There will always be error in your dataset. Somewhere. Ten years after you think you finished, you'll still be finding errors. But, we can minimize errors. Especially if we have a particular use of the data. We can use a set of procedures that help reduce the frequency of errors. They can also signal the possible existence of errors. If somebody says that they have a clean dataset, they don't.

What are the general strategies? Be consistent. That's the critical thing. If you tell me that you saw a lion here on the hotel grounds, that is a biodiversity data point. If the date attached to that point is 2015, it's probably wrong. But if the date attached to it is 1615, it might be correct. So, we're going to look for consistency inside and outside the data record. This is very important when we have multiple sources of information. We're going to make use of all of those sources and explore their level of agreement. That's the idea of consistency. External consistency gives us a more powerful view.

Yes. When we detect and correct or flag errors, we're going to throw out some good data. This is by accident. But, that's unavoidable. So, this is an iterative process of exploring the data, visualizing the data, correcting errors where you are sure what the problem is (I'll show you some examples), or flagging data records as probably having errors that you can't correct. Flagging ensures that those records don't get used in analyses that will depend on information that might be wrong. Okay?

The last step is documentation of what you did. For this course, that documentation will be the methods section of a published paper. I typically have a Word document that itemizes everything I did. But, when we're doing truly archival improvement of datasets, the metadata about how the records were improved must be in the dataset. So, you don't just add a latitude-longitude coordinate to a data record. You add latitude and longitude. You add some uncertainty.
And, you add how you got those data. What was the source? And, what was the methodology? Without that metadata, that latitude-longitude coordinate is close to worthless. So, you have to document what's being done. That's crucial.

We need to make a distinction between flagging and fixing. It's pretty easy to figure out a set of records that probably have problems. It can be a lot more difficult to estimate, figure out, or understand what the problem is. It may simply be that the preparator 100 years ago was falling asleep and forgot to write something down. And so, the information needed to fix the record may not exist. It may not be there.

For certain applications, we can simply throw out records. I was talking with Alex yesterday about the University of Ghana dataset. Because of missing data, the dataset went from 50,000 records to 25,000 records. Half the data get left out of the analysis. That's for a single application. And, we're going to describe that. But, those 25,000 specimens in the University of Ghana herbarium are still specimens. They're still useful. And, they still exist.

When we do more curatorial applications (like when Alex is wearing his hat as curator rather than scientist), we approach those data without considering throwing them out. Instead, we'll signal that a date is missing for this record. Or, the geographic coordinates fall outside of state X; but, the data record says it's in state X. There's a conflict. I'm not picking on the University of Ghana. But, this is the example that I had worked out. We can do this with any of the datasets around the table. So, when we wear our curator's hat (when we do archival work with data), we're not throwing out data records. We're just qualifying them. We're saying, 'be careful of the geographic reference here.' Or, 'be careful of the date here. It's got a problem that I can't fix.' Those records may be useful for another application. They may be fixable at some other point in time.
Or, it may take more work than you are able to put into it at the moment. That's the difference between flagging errors (which is pretty easy) and fixing errors (which can be pretty difficult). In my talk later this morning, I'll give you some interesting examples of that.

I want to talk briefly about internal versus external consistency in biodiversity data.
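
[Editor's illustration, not part of the talk.] The flagging-versus-fixing distinction described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the `Record` fields, the flag names, and the bounding box for "State X" are all hypothetical and are not taken from the University of Ghana dataset or from any particular data standard. The point it illustrates is that flagged records stay in the dataset, and a single analysis simply skips them.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical bounding boxes (min_lat, max_lat, min_lon, max_lon).
# The values are invented for illustration only.
REGION_BOUNDS = {"State X": (0.0, 2.0, 30.0, 33.0)}


@dataclass
class Record:
    record_id: str
    region: str
    year: Optional[int]
    lat: Optional[float]
    lon: Optional[float]
    flags: List[str] = field(default_factory=list)


def flag_record(rec: Record) -> Record:
    """Attach warning flags to a record; never delete or alter its data."""
    flags = []
    if rec.year is None:
        flags.append("date_missing")
    if rec.lat is None or rec.lon is None:
        flags.append("coordinates_missing")
    else:
        bounds = REGION_BOUNDS.get(rec.region)
        if bounds is not None:
            min_lat, max_lat, min_lon, max_lon = bounds
            inside = (min_lat <= rec.lat <= max_lat
                      and min_lon <= rec.lon <= max_lon)
            if not inside:
                # Internal conflict: coordinates disagree with the stated region.
                flags.append("coordinates_outside_region")
    rec.flags = flags
    return rec


def usable_for_analysis(records: List[Record]) -> List[Record]:
    """Select unflagged records for one analysis; the full set is kept."""
    return [r for r in records if not r.flags]


records = [
    Record("r1", "State X", 1998, 1.0, 31.5),   # consistent
    Record("r2", "State X", None, 1.2, 32.0),   # date missing
    Record("r3", "State X", 2003, -5.0, 31.0),  # falls outside State X
]
flagged = [flag_record(r) for r in records]
clean = usable_for_analysis(flagged)
```

Here `flagged` still holds all three records, each carrying its own list of warnings (the curator's view), while `clean` holds only the records safe for one particular analysis (the scientist's view). A real bounding-box test is a coarse stand-in for checking coordinates against an actual administrative boundary.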

Video Details

Duration: 13 minutes and 44 seconds
Language: English
License: Dotsub - Standard License
Posted by: townpeterson on Jul 26, 2016

This talk was presented in the course on National Biodiversity Diagnoses, an advanced course focused on developing summaries of state of knowledge of particular taxa for countries and regions. The workshop was held in Entebbe, Uganda, during 12-17 January 2015. Workshop organized by the Biodiversity Informatics Training Curriculum, with funding from the JRS Biodiversity Foundation.
