BITC / Biodiversity Diagnoses - Geography Checks

So for right now, what we want to do is talk about how you go about detecting and either flagging or fixing geographic errors in your data. And when we talk about big data sets, this is how big they can be [Okay]. That is 526,129,648 records, and that's a heatmap of their distribution across the surface of the earth. And what you're seeing when you look at the white areas are, for example, places that have been extremely good at organizing their biodiversity data (Australia, South Africa, Mexico) and places that have lots of biodiversity scientists (Europe, North America). So there are all sorts of things that go into making this map light up where it's lit up [okay]. But each little point of light you see is one or ten or a thousand or even a million biodiversity records that coincide with that latitude and longitude. So, there are obvious errors. Like, if I have a latitude of 95, you can tell me there is something wrong with that data point. Right? But we need to go deeper than that. So, we've talked about detecting errors. Ideally, we correct them, but at the very least we want to put a signal on them: [Okay] this may have a problem. And wherever possible, we may be able to provide a measure of confidence. [Okay] we've already talked about all of that. So let's go through two examples. And I think these two examples will give us a good basis for learning. So, I took this turaco, and I had to do a little bit of cleaning with taxonomy problems. But that's Arturo's realm, not mine. So let's go straight to the geographic [okay]. Now if you sit in the back of this room looking at the map, [Moses is looking at this map] where is this species? Easy answer, Sub-Saharan Africa [right]. Now we've got to deal with a couple of problems like that, and that. But really this is Sub-Saharan African! Now maybe the species is very mobile, or it gets lost every so often and shows up in Chicago, New Mexico or Alexandria. Or maybe not.
So that's why we do these exercises. Data points that look strange may be wrong, or they may just be peripheral or extreme. So first let's look internally, and remember this exercise is asking whether the latitude/longitude coordinates fall within the country that the data record specifies. So focusing on this point up here, I go to the data record and look at the country [South Africa]. [Okay], can anybody guess what happened? [Participant], negative and positive in latitude. So indeed, if you reflect this over the equator, which is right here, you come to that. And that's an error where we have a couple of good reasons to correct it. [okay] It's a very common error where people miss that negative sign, and the data record is telling us South Africa, not Egypt. So that's an easy one. Then, let's look at this one, a very interesting one, a point that falls in the Democratic Republic of Congo. But in the data record it says Sudan. So it's certainly not an issue of wrong hemisphere, rather we're just off by a little bit. Now, sometimes you get a data point that's really in Sudan, but it appears to be in the wrong country because of the complexity or simplicity of the boundary. [Okay] that does happen. But that's not the case here either, we've got some kilometers here. So I went back to the data record, and here it is, two Field Museum specimens. And it's very interesting if you look at the latitude and longitude: there are a couple of decimal places of precision on the longitude and no precision on the latitude. So probably either this is a typographic error or there was some rounding. But something happened, and actually the next example I'll show you is from the same institution with the same error. So here is a data point that, from the data record, should be in Burundi but shows up in Rwanda when we plotted it on the map. And again, you go back to the data record and see that same thing, zero precision/zero decimal places in latitude.
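The internal check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bounding boxes are hand-rounded approximations (a real check would use country polygons in a GIS), the longitude of 31 in the usage note is made up, and the function names are hypothetical, not part of any tool used in the talk.

```python
# Hypothetical, hand-rounded bounding boxes (min_lat, max_lat, min_lon, max_lon).
# They only illustrate the idea; real checks should use country polygons.
COUNTRY_BOXES = {
    "South Africa": (-35.0, -22.0, 16.0, 33.0),
    "Burundi": (-4.5, -2.3, 29.0, 30.9),
}


def in_box(lat, lon, box):
    """True when the point lies inside the (approximate) bounding box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def check_country(lat, lon, country):
    """Compare coordinates with the stated country.

    Returns 'ok', 'latitude sign?', 'longitude sign?' or 'mismatch'.
    A '...sign?' result means the point would fit if the sign were flipped.
    """
    box = COUNTRY_BOXES[country]
    if in_box(lat, lon, box):
        return "ok"
    if in_box(-lat, lon, box):
        return "latitude sign?"
    if in_box(lat, -lon, box):
        return "longitude sign?"
    return "mismatch"


def decimal_places(text):
    """Count digits after the decimal point in a coordinate as it was typed."""
    return len(text.split(".")[1]) if "." in text else 0
```

For instance, `check_country(30.0, 31.0, "South Africa")` reports a possible missing negative sign on the latitude, while a latitude typed as `"-3"` next to a longitude typed as `"29.95"` gives `decimal_places` values of 0 and 2, the same lopsided precision seen in the two museum records.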
So this is something particular to a single institution. I don't mean to be picking on the Field Museum, I did my PhD there and I used to work there as a Curator. But there is something in its data set that is leading to zero-precision data records in their latitude and longitude. We probably can't fix that without going back to find out where in Burundi. So if we can't fix it, we're fine, we put a red flag on it. So now let's go extrinsic. Here's some more information on this turaco, and we have a good description of its range. But notice it's Natal to western Zululand, Swaziland, Cape Province, Transvaal and Swaziland. I thought we said Sub-Saharan Africa, didn't we? So right away, even the points that we considered easy look to be outside its range. But we also have to deal with the fact that this taxon actually constitutes a complex. Those are just the three closest relatives, but all of the concepts of this taxon name apply to a broader complex across all of Sub-Saharan Africa. And so we need to deal with all of these in terms of what they refer to, certainly in terms of corresponding either to this particular name or to the broader complex of species, what used to be called a superspecies. All of these in red represent potential problems [okay]. That's a quarter of the data set. And then, even after we've dealt with the coarse-grained range in southern Africa, we can look at the land cover type. So these are external data sets [okay]. We look at the points for the species within the species' known range. And what we see is that this species tends to be in closed deciduous forest. But then we have all these other things. Some of them make perfect sense, deciduous woodland instead of closed deciduous forest. But we have a few in there that make me worry: montane forest and croplands.
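The land cover comparison can be roughed out as a simple tally. This is a minimal sketch, assuming each point has already been given a land cover class by overlaying it on an external map; the 5% threshold, the function name, and the counts in the usage note are invented for illustration.

```python
from collections import Counter


def rare_classes(classes, min_share=0.05):
    """Return land cover classes that hold less than min_share of the points.

    Rare classes are candidates for a closer look, not proven errors.
    """
    counts = Counter(classes)
    total = sum(counts.values())
    return sorted(name for name, n in counts.items() if n / total < min_share)
```

With 30 points in closed deciduous forest, 8 in deciduous woodland, 1 in montane forest and 1 in cropland (made-up numbers), `rare_classes` would return the montane forest and cropland classes as the ones worth checking.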
And those may represent either wrong georeferences, or they could also represent old specimens that were indeed collected in deciduous forest, but in the time between when they were collected and when this map was made, the land use has changed. This is very, very common. Unfortunately, in just the last half century or century, huge amounts of natural habitat have been transformed into croplands or grazing areas. So that's not necessarily a problem, but those are certainly records that we might want to look at. And then we can look at them in terms of environment as well. Those are maps of temperature and precipitation across southern Africa. And then, just for illustration, I graphed annual mean temperature against annual precipitation. And I looked at the extreme high values of precipitation, the extreme low values of precipitation and the extreme high values of temperature. And again, these are not necessarily wrong. Even if our data were perfect, there has to be some observation that has the highest value of temperature and the highest value of precipitation. But these are values that we might want to look at, because these are data records that place the species under a curious or different environmental circumstance. And we want to figure out whether the niche of the species with respect to these dimensions looks like this, or whether it looks like this. And so we can add some confidence to that difference by checking these points. And if there is any amount of uncertainty in their georeferencing, maybe we should think about whether we want to include a point that is surprising environmentally and not very certain in terms of those problems. Okay, so that was a single-species example. And you notice that some of the problem data points we would correct, like the Egyptian occurrence, and some of the problem ones we would signal as definitely having a problem, like those imprecise records from the Field Museum. And then some we just take a look at.
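Picking out the environmental extremes for review can be sketched as follows. It assumes each record already carries temperature and precipitation values extracted from the climate maps; the field names (`id`, `temp`, `precip`) and the function name are hypothetical.

```python
def extreme_ids(records, variable, n=1):
    """Return the ids of the n lowest and n highest records for one variable.

    The extremes are records to inspect, not records to delete: even a
    perfect data set has a highest and a lowest value.
    """
    ranked = sorted(records, key=lambda r: r[variable])
    return {
        "lowest": [r["id"] for r in ranked[:n]],
        "highest": [r["id"] for r in ranked[-n:]],
    }
```

Calling it once per variable, for example `extreme_ids(records, "precip")` and `extreme_ids(records, "temp")`, gives a short list of points whose georeferencing is worth checking before deciding which shape the niche really has.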
But now to give you a second example that is probably closer to what we are doing in this course, I'm going to go back to an example I developed for the Ghana course, looking at the data from the University of Ghana herbarium. I want to start by saying there is nothing unique about this data set, in the sense that every single one of the data sets represented around the table has the same sort of problems. So I'm just using this as an example, but it's a useful example. And this procedure really requires a lot of playing with your data and exploring your data. And I think this example shows that well. So I started with 65,000 records, and 33,000 had latitude/longitudes in the correct hemispheres. But for everything from the southern hemisphere I put a negative sign in front, and for everything from the western hemisphere I put a negative sign in front. And so this one is essentially playing, and this is in Excel. You can probably do this sort of stuff more efficiently in Refine. Some of our previous courses have given instructions on Refine and that's great. But Excel does a lot of the job. So notice that for northern vs southern hemisphere, we have values of 0, 1, 2, B, E, A, K, W and north and south. And for eastern vs western hemisphere we have east and west and north and south. Now, if you're setting up to capture data (this is a matter for another course), ideally what you're going to do is set up your data capture such that for northern vs southern hemisphere there are only two possible values. That's now much easier to do with some of the new capture platforms. We can give fields a fully controlled vocabulary so that they can only take on these values. So when I put those data into a GIS, this is what I got [okay]. And so pretty clearly what we've got are some values that are off the map. That might just be, for example, 12 degrees north with some minutes and seconds, but without the decimal point in there. And so it looks like 12,000 degrees north.
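A controlled vocabulary for the hemisphere fields, plus a range check for values that fall off the map, might look like the sketch below. The accepted spellings and the function name are assumptions made for illustration; the talk itself did this by hand in Excel.

```python
# Accepted hemisphere spellings (an assumed vocabulary, lowercased on input).
LAT_HEMISPHERES = {"n": 1, "north": 1, "s": -1, "south": -1}
LON_HEMISPHERES = {"e": 1, "east": 1, "w": -1, "west": -1}


def signed_coordinate(value, hemisphere, axis):
    """Return (signed_value, flag); flag is None when nothing looks wrong.

    axis is 'lat' or 'lon'. South and west become negative. Values such as
    'K' or '2', or 'east' in a latitude field, are flagged, never guessed at.
    """
    table = LAT_HEMISPHERES if axis == "lat" else LON_HEMISPHERES
    limit = 90 if axis == "lat" else 180
    sign = table.get(hemisphere.strip().lower())
    if sign is None:
        return None, "unrecognized hemisphere"
    if abs(value) > limit:
        return None, "out of range"
    return sign * abs(value), None
```

A latitude of 12000 with hemisphere "N" comes back flagged as out of range, which is the "12,000 degrees north" case, and a hemisphere of "K" is flagged for review instead of being silently treated as north or south.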
Again, this happens every time I look at a data set, and for every single data set we're going to talk about this afternoon, I went through this same procedure. And now we go into Ghana, and we get this. This is just coming in towards Ghana. I notice I've got some data in Ethiopia, and I ask, could that be in the wrong hemisphere? Nope, those are specimens in the Ghana collection from Ethiopia. I assume somebody traded or somebody visited or something. But then I got this pattern, you see those little scratches of data, and that was making me worry a little bit. That took me some thinking. If anybody who wasn't in Ghana can figure this out, you'll win a prize. I'll give you a clue: we usually use a decimal point to indicate the break between ones and tenths. But if you use it to indicate the break between degrees and minutes, remember that minutes can only go up to 59. And so I'm guessing that this is 0 minutes up to 60 minutes, and then I get nothing, because these aren't decimals but they look like decimals. And so I figured out what was going on. Here's what the data look like. It says degrees and minutes, like here, so instead of reading this as 5.39 I have to read it as 5 degrees and 39 minutes. And that's what it looks like afterwards, and that looks a lot better. We are not all the way there, but pretty much there. So once I interpreted the degrees and minutes correctly, instead of as degrees and decimals, I still had this problem, but now I could think about it better. I came into Africa, and this is looking more like what I wanted to see, because we've got some specimens from across West Africa and most of them are on land. I think I asked Alex if he had specimens from Ethiopia in his collection, and he said yes. And I still didn't know what to do with this, and I don't think I ever resolved whether those are marine plants or errors. So the next thing I did was simply put a different symbol on each point according to the country it was associated with in the textual description.
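The degrees-and-minutes reading can be written out as a small conversion. This sketch assumes the value is kept as text, so that a trailing zero in the minutes is not lost, and that minutes are always written with two digits; both are assumptions about the data, and the function name is hypothetical.

```python
def degmin_to_decimal(text):
    """Read 'D.MM' as D degrees and MM minutes; return decimal degrees.

    Raises ValueError when the minutes are not two digits or exceed 59,
    since such a value cannot be degrees-and-minutes notation.
    """
    text = text.strip()
    sign = -1 if text.startswith("-") else 1
    degrees_part, _, minutes_part = text.lstrip("+-").partition(".")
    if len(minutes_part) not in (0, 2):
        raise ValueError(f"ambiguous minutes in {text!r}")
    minutes = int(minutes_part) if minutes_part else 0
    if minutes >= 60:
        raise ValueError(f"minutes out of range in {text!r}")
    return sign * (int(degrees_part) + minutes / 60)
```

So "5.39" becomes 5 + 39/60, roughly 5.65 decimal degrees. Because no value between .60 and .99 can occur in this notation, plotting the raw numbers as decimals leaves empty bands, which would explain the scratches on the map.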
And you know, here is Cameroon, and look, those yellow stars are basically in Cameroon, in the Gulf of Guinea or down here in Nuwaka. But almost all of them are right there, and almost all of the Nigerian points are in Nigeria. And then there were some problems, so notice here is a Cameroon point that's definitely in the wrong country. So anywhere you see multiple symbols within the confines of one country, that's a problem. That's an internal test for consistency: the coordinates say one thing and the country name says another. It may be that the country name is right and the coordinates are wrong, or it may be that the coordinates are right and the country is wrong. Without a little bit more research that's hard to tell. But I can certainly take this point and this point, or these three crosses, or whatever that is in yellow, or these Nigerian points, and with a fair amount of confidence I can say there is a problem [okay]. And then we can do the same thing at a lower geographic level. Now I took states within Ghana and I gave each one a particular symbol, and you can see there are some that are almost fine, with just that one suspected error in there, and then there are some with bigger problems, like this one, where I see its points way up there. This was an example that I developed one night in Ghana. I think it was an all-night task, but it was fun and I got results I enjoyed. So you can zoom in and see how this state is mostly these crosses, and then we've got a few that have leaped in from the state to the north, or this state. All of these mismatches I think we want to look at. I still don't know whether they're right or wrong, but we want to look at them. And for example, using that process, I sent a bunch of challenges to Jean Ganglo, and he came back and said, you know, we changed the names of lots of our states, so it's correct. It's fine, because you've looked and made sure it's consistent, and so you move on.
So we can go back to our previous example, and we can think about what we did. [Okay] we know that a bunch of these are problems, and again, as I mentioned, we could easily correct some of them. When we can do that, we do, plus the final crucial step that we all talked about: you document how you changed the data and you always keep the original [okay]. So if it's a data set that you're developing for your own research, you might put original country as one field [South Africa] and then corrected country. In this case, it doesn't change. And you might have original decimal longitude and latitude and corrected decimal longitude and latitude, and this one will change from positive 30 to negative 30. So we always keep the original information and we always document what we did. So coming back to Ghana, there was one thing that I kept looking at. What is happening right back here? What global phenomenon happens right here? Prime meridian, right? So that is longitude zero. And so remember those problems of east, west, north, south. I was looking at this set of points here that should be over here, and I was thinking, why are they headed uniformly out to there but no farther? So I thought about eastern vs western hemisphere. So that's what those Volta localities look like, and I started looking at them and I noticed that they came out further on this side than here. And so guess what I did? With those that are in the western hemisphere, I tried changing their longitude to the eastern hemisphere. And guess what? I almost solved them, they look a lot better. Now was I right in doing that? I don't know. But if I keep the original and I tell what I did, I can't be wrong. I made assumptions and I wrote them down. I am not wrong, but was I right? I don't know. But that's an example of how, by using your head and playing with the data, you can sometimes rescue huge amounts of data or very valuable data.
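Keeping the original next to the correction, with a note on what was done, can be sketched like this. The field names are loosely modeled on the original/corrected pairs described above and are otherwise hypothetical, as is the longitude of 31 in the usage note.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Occurrence:
    original_country: str
    original_latitude: float
    original_longitude: float
    corrected_country: Optional[str] = None
    corrected_latitude: Optional[float] = None
    corrected_longitude: Optional[float] = None
    remarks: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Corrected fields start as copies; the originals are never edited.
        if self.corrected_country is None:
            self.corrected_country = self.original_country
        if self.corrected_latitude is None:
            self.corrected_latitude = self.original_latitude
        if self.corrected_longitude is None:
            self.corrected_longitude = self.original_longitude


def flip_latitude(record, reason):
    """Change the sign of the corrected latitude and log why it was done."""
    record.corrected_latitude = -record.corrected_latitude
    record.remarks.append(f"latitude sign flipped: {reason}")
    return record
```

For the South African record, `flip_latitude(Occurrence("South Africa", 30.0, 31.0), "country is South Africa; minus sign missing")` leaves the original latitude at 30.0, sets the corrected latitude to -30.0, and stores the stated assumption in `remarks`. The same pattern would cover the Volta longitude change, with the assumption written down in exactly the same way.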
So as we've talked about all day today, there are lots of biodiversity data and they have lots of problems. Every problem that can exist either already does or will soon: taxonomy, georeferences, inconsistent ideas about what a locality is, etc. You really have to get into the game of playing with your data. Just recently, for an issue in the community where I live, some people needed maps of grocery stores in Lawrence, Kansas. And so I downloaded the census data and I georeferenced all the grocery stores. And I actually had a lot of fun, because this is playing with your data. But you have to build in the data cleaning [okay]. So that's a picture of how to deal with geographic data. What's the technique? Playing! What do you do? Visualize! And how do you fix it? Think! So it's not very high tech. For geographic data it helps to be able to use a GIS program. But if you're really desperate, you can do a two-dimensional plot in Excel of latitude vs longitude, and you have a very primitive GIS. So again, this is all about playing. Any questions about geographic data?

Video Details

Duration: 26 minutes and 56 seconds
Language: English
License: Dotsub - Standard License
Genre: None
Views: 4
Posted by: townpeterson on Aug 30, 2016

This talk was presented in the course on National Biodiversity Diagnoses, an advanced course focused on developing summaries of the state of knowledge of particular taxa for countries and regions. The workshop was held in Entebbe, Uganda, during 12-17 January 2015. It was organized by the Biodiversity Informatics Training Curriculum, with funding from the JRS Biodiversity Foundation.