
BITC / Biodiversity Diagnoses - Time Checks

We want to check time. Time, as you may expect, is much easier than playing with geography. As Town explained, you have to play with it, and you have to play making reasonable assumptions about why and when a date was recorded. One big advantage of time checks is that they have a much more limited spectrum of error. The possibilities for getting a date wrong are much fewer, at least now that we're all using the same Gregorian calendar, although you might find dates which are 18 days away. That might come up later. So we're basically into the third dimension of what, where and when. I am going back to whales. These are the whale records that we saw in the morning, plotted where they were caught. Luckily enough, there are almost no whales inland except for this one. Basically, all of them are where they should be. I am talking, if you remember, about the humpback whale. All of the records that are available in GBIF do have a name. Whether this name was homogeneous or not, you saw in the morning. And 97% of the records are georeferenced, which means that 3% of the records have not been georeferenced because they don't have coordinates. But only 91% of the records have a date telling us when those whales were seen. For the other 9% of the records, which is quite a bunch of records, we have no idea whether they're recent records or records from outside a given time period. So the first thing that might happen to data records is that they might lack dates. Even if they do have dates, those dates might not be correctly recorded. They might all be wrong. For a start, think about the biology of this species. The humpback whales are divided into different populations. There is a southern population, a northern population, and there is probably a resident population in the Indian Ocean. Those that do migrate move at specific moments in time. By analyzing dates, you might see the patterns of movement, because they should show up in the dates.
For instance, if we plot the number of records by month, we see that summer is when we see more whales. That should be partly correct, because whales migrate in summer and they are easily seen when migrating. There are almost no sightings of whales in December. But there is something strange here. What is strange here? You see a continuous increase of sightings towards August and a continuous decrease of sightings towards December, but remember that time is cyclic. So what comes next here? January. Why is this December bar small and the January one high? Is it that they celebrate New Year's Eve and migrate on New Year's Day? Probably not. Another dimension of time is the continuous dimension, the non-cyclic dimension. So we might have records which are old and records which are new. Like here: all twentieth-century records and a little bit of the twenty-first century. More sightings were recorded in the nineties and few sightings were recorded in the previous years, except for this 1978 peak. Why don't we have sightings here? People were quite busy at this time, so they didn't have any time for sighting whales. But then 1978: what happened in 1978 with whales? In this year, hunting whales was banned. You start to see more whales because they are no longer being hunted. Does it mean that there weren't any whales there before? Or is it just that we didn't have data? Or was this the data that was used to ban whale hunting? So, back to our problem with January. To a whale, a month is meaningless. To a whale, the 2nd of February is the same as the 3rd of February; it makes no difference to when they want to migrate. So why are whales so easy to see on January 1st, after New Year's Eve? To check whether this happens on the first day of each month or not, let's plot it in two dimensions. Let's plot it by day and by month. What do we see? We see 107 sightings on the 1st day of January, whereas most sightings should have occurred over summer. Now, what's wrong with this plot?
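The day-by-month tabulation just described can be sketched in a few lines of Python. This is only an illustration under assumptions, not the speaker's method (the talk mentions Excel pivot tables): the sightings list is made up, and the function name day_month_counts is hypothetical.

```python
from collections import Counter
from datetime import date


def day_month_counts(dates):
    """Count records per (month, day) cell, ignoring the year."""
    return Counter((d.month, d.day) for d in dates)


# Hypothetical sightings: a few summer records plus a pile-up on 1 January.
sightings = [
    date(1995, 8, 14), date(1996, 8, 14), date(1998, 7, 3),
    date(1921, 1, 1), date(1934, 1, 1), date(1950, 1, 1),
]

counts = day_month_counts(sightings)
# The fullest cell is the first candidate for a placeholder date.
top_cell, top_count = counts.most_common(1)[0]
print(top_cell, top_count)
```

A cell that stands far above its neighbours, as 1 January does in this made-up list, is the kind of outstanding frequency worth checking against the verbatim records.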
A record might have been digitized from a voucher with an incomplete date: you know a whale was seen in that particular area, but you don't know which day or month the whale was seen. What should you do in such a case? Nullify the month and day. But some databases won't accept nullified dates like 0-0-1920 because the date doesn't exist. So they automatically default to January 1st. You always need to look for dates that don't make sense or dates that look suspicious, such as this one. We need to check our dates for values that might not be exactly wrong but just technically convenient, because these dates may well distort any calculations we want to do, or the migration patterns. Let's go back to the plot that we saw before with the year of the records, and ask how many instances there are for January 1st in each year. This is the number of records falling on January 1st. In the recent years things are done right, as in the plot above. But the older records, from the 1960s backward, have an enormous amount of wrong dates, which means those older records don't have exact dates, just the year. So if we wanted to study the migration pattern of whales, we would probably be restricted to recent data and have to do away with the older records. So by looking at your data you'll discover that the data might not be useful, or that only part of the data can be useful. However, in the previous session, Town showed you how to correct dates by simply playing with your data. Playing with dates basically means looking at the dates from different perspectives. Remember that a date has two components: a cyclic component and a continuous, one-directional component. So how should we plot the cyclic component? Any time you represent a cyclic component, a much better representation is this kind of temporal diagram in which time is cyclic. This is the 1st day of the year, the 2nd day of the year, and this is basically the summer solstice, and back.
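The check on January 1st records per year can be sketched as follows. This is a minimal illustration grouped by decade rather than by year to keep the example short; the records and the function name january_first_share_by_decade are assumptions, not part of the talk.

```python
from collections import defaultdict
from datetime import date


def january_first_share_by_decade(dates):
    """Return {decade: share of that decade's records dated 1 January}."""
    totals = defaultdict(int)
    jan_firsts = defaultdict(int)
    for d in dates:
        decade = d.year // 10 * 10
        totals[decade] += 1
        if (d.month, d.day) == (1, 1):
            jan_firsts[decade] += 1
    return {dec: jan_firsts[dec] / totals[dec] for dec in sorted(totals)}


# Hypothetical records: old ones mostly default to 1 January.
records = [
    date(1921, 1, 1), date(1925, 1, 1), date(1928, 6, 10),
    date(1994, 8, 2), date(1995, 1, 1), date(1996, 7, 15), date(1997, 9, 1),
]

share = january_first_share_by_decade(records)
print(share)
```

A decade in which most records fall on 1 January probably holds year-only dates, so its day and month should not be used for seasonal analysis.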
So you see the patterns involved more easily. In this case, the maximum of the migration happens in late summer, and you see this spike here, which is the first day of the year. This is what we're looking for: we're looking for strange patterns. We see both summer and winter migrations. Sometimes we must take into account that time has components that are natural, such as a cyclic or a linear component, and sometimes a completely artificial component. For instance, to a whale a Saturday is exactly the same as a Tuesday, so we shouldn't see any pattern. The next example I'm going to show you is not from whales but from records of animals and plants in Spain. Here the data is arranged in a different way: this is the month and this is the day of the week, and the shades represent the amount of data available. Well, as expected, summer is when more people go out looking for birds or plants, because in the winter it is difficult to go out in the field. So you have a lot more data in the summer than in the winter, which is okay. But look at this: we have an enormous amount of data collected on Tuesday, Wednesday and Thursday. And in the previous case, more data were collected on the weekends. Why is that? It's because those were scientists who are busy during the week and free only on weekends, whereas for administration officers and so on, collecting is part of their job, so they don't work at all on weekends but do on Tuesday, Wednesday and Thursday. Now, this has nothing to do with biology, but it does have something to do with the data you have available. Now, so what's in a date? A date can be represented in so many different ways that it sometimes makes things difficult. We're back to the whale dataset, and within these 24,000 records we find this many ways to represent the dates, and all of them are different. They are not homogeneous. We tend to think that a date is something simple. If you're in the US, you will say February 3rd.
But people in Europe will say 3rd February, which means the same. And neither is the most convenient. If you want to order the data set by date, either the database recognizes the field as a date or, if it recognizes it as a string, it will not be ordered correctly unless you put the year first, then the month and the day, which is the natural ordering of the string. You write the string like that, year, month and day, and the records will naturally come out ordered by time. That format is called the ISO format. That's always the preferred format for dates (year, month, day). So we have all these systems: this is day, month, year, and this is year, month, day. A date might have a year or not; sometimes you don't have the year, the month or the day, or any of them. So you might have all these combinations in the whale data set alone. And there is apparently a most common form, like 6-11-904, which you might guess is day, month, year, or not. Because remember, most sightings of whales were in the summer. So this is probably month, day, year. Well, you have to look at the verbatim data and make sense of it somehow, and making sense of the data means trying to find out which date system was used. In Spain, as I said before, we use day, month, year, while in America they reverse it and use month, day, year. So you should try to see whether in your system there is something wrong, such as a month bigger than 12. If one of those components is bigger than 12, you know that that component is always the day. If both components are over 12, you're in trouble. Okay, this is the system year, month, day. When you analyze it, you will discover that these dates fit, and that a date like this one should be tagged as belonging to a different system. The problem is that sometimes you won't be able to distinguish. If you have a number which is above 12, that's fine.
But for any date on or before the 12th day of the month, you can never know whether it's correct or not unless you've explicitly been told. So, for humpbacks again, how many records had issues? Of the existing records, only 0.2% had issues (which were strange things), but 10% of the records had geographical issues, which is a lot. About 2% of the records had date issues. In this diagram, I tried to analyze what kinds of problems you might have: things missing or things wrong, and they tend to become a large trunk. This is another creative way to look at the data, although I'll go into more detail on this on Thursday. This diagram, which we created several years ago, represents the cyclic component around the circle and the one-directional component along the radius. So what we see here is all the records on January 1st, February, March, and so on to August and back. And those are the oldest records and the newest records; I suspect that most records occurred over the past 50 years. The color here represents the amount of records available. Different data sets might have different patterns. In this pattern you see these spokes here, and at first we don't know what they mean. What do they mean? Records that had a month but didn't have a day, so they were assigned to the 1st day of the month. If you see this, you're in trouble. This data set has a problem. Another thing you can see here in this colonogram is this black belt, which represents a lack of data. This belt here is the Second World War. You might also separate the data by components: in this plot, each ring belongs to a different data set. Different data sets have different patterns. When you combine them, some of the common features might match up but some might not. So sometimes it pays to separate the data set by components, as Town did with the Cameroon data. So in summary, what's the main issue in checking time? We still haven't gotten to the main issue. We're now going to the main issue when checking time.
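The circular diagram and its spokes, as described above, can be approximated with two small helpers. This is a sketch under assumptions, not the diagram's actual construction: it assumes the angle is the fraction of the year elapsed and the radius is the year, and the function names polar_position and first_of_month_share are hypothetical.

```python
import math
from datetime import date


def polar_position(d):
    """Map a date to (angle in radians, radius).

    Assumption: angle = fraction of the year elapsed, radius = year.
    """
    days_in_year = (date(d.year + 1, 1, 1) - date(d.year, 1, 1)).days
    day_index = d.timetuple().tm_yday - 1  # 0 for 1 January
    theta = 2 * math.pi * day_index / days_in_year
    return theta, d.year


def first_of_month_share(dates):
    """Share of records dated on the 1st of any month.

    With evenly spread dates this is about 12 in 365; a much higher
    value hints at month-only dates defaulted to day 1 (the spokes).
    """
    return sum(d.day == 1 for d in dates) / len(dates)


# Hypothetical records, half of them on the first day of a month.
sample = [date(1950, 3, 1), date(1950, 3, 1), date(1950, 6, 17), date(1950, 8, 9)]
print(polar_position(date(1950, 1, 1)))
print(first_of_month_share(sample))
```

The pairs returned by polar_position could be passed to any polar plotting routine, with the count of records per cell shown as color.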
We do have a very important issue with time. Can anybody guess what this issue is? You are all suffering from this issue. I am suffering more, and even Town is. Time is unidirectional. So whenever you don't have a record in time, you'll never have it. You cannot fill the gaps in old records: you cannot go and get sightings of whales in 1940 if they weren't recorded. You can go back to a place and resample (that is, to some extent you can fill a void in geographical space), but it's not possible to fill a void in time. For example, you might sample now and have data for this winter, but if you don't have a baseline on how things were 50 years ago, there is no way you can know that. So there is no time machine for dates, unfortunately. So here's the workflow for dealing with dates. First, try to separate the components (years, months, days), because they need to be treated separately. Check for outstanding date frequencies; this is often done by pivoting over the provider or over the date components. For instance, most of the examples you saw before were done with a simple pivot table in Excel. Check for impossible dates, such as months beyond the 12th month (and look out, some people represent the 1st month of the year as zero, which is another problem), and years in the future: by definition, we can't have biodiversity data which has not already been collected. But as you saw in Town's example, there were records with extraterrestrial coordinates which also had future dates in the dataset; often they represent unknown dates. Check for potential voids: look at first-day and first-month frequencies, and look at day-of-year frequencies if you can. And try to get everything homogenized in ISO format (year, month, day), in an additional field, never in the original. If you don't know a particular date, don't ever make it up; leave it blank. Comment [Town]: We all fall into the temptation of using the date formats of the program we use, and that is a recipe for absolute disaster.
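The workflow above can be sketched as a per-record check. This is one possible illustration under assumptions, not a definitive implementation: the function name check_record, the issue labels, and the reference date passed as today are all made up for the example. The ISO string it returns is meant for an additional field, leaving the original components untouched, and an unknown date yields a blank rather than a guess.

```python
from datetime import date


def check_record(year, month, day, today):
    """Return (issues, iso) for separately stored date components.

    Components may be None when unknown. iso is a 'YYYY-MM-DD' string,
    or None when no trustworthy date can be built.
    """
    issues = []
    if month is not None and not 1 <= month <= 12:
        issues.append("impossible month")
    if year is not None and year > today.year:
        issues.append("year in the future")
    if None in (year, month, day):
        issues.append("incomplete date")
    if issues:
        return issues, None  # never make a date up: leave it blank
    try:
        parsed = date(year, month, day)
    except ValueError:  # e.g. 30 February
        return ["impossible date"], None
    if parsed > today:
        return ["date in the future"], None
    if (month, day) == (1, 1):
        issues.append("1 January: possible placeholder")
    return issues, parsed.isoformat()


today = date(2015, 1, 17)  # hypothetical reference date
print(check_record(1995, 8, 14, today))
print(check_record(1920, 0, 0, today))
print(check_record(2099, 5, 3, today))

# Zero-padded ISO strings sort chronologically even as plain text.
print(sorted(["1995-08-14", "1920-01-01", "1978-11-30"]))
```

Flagging 1 January instead of rejecting it keeps genuine New Year's Day records while marking them for a look at the verbatim data.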
I can vouch for that because I made that mistake once, about 8 years ago, I think. When I invented the colonogram, the first results I got absolutely surprised me, because I saw 12 spokes and I thought I had the right answer. But it was month and day reversed, which means that the first half of the month always gets twice as many records. And I showed it to some colleagues, and only later did I discover that this problem did not exist to the extent I had seen it. What happened was that the database system I was using automatically converted the dates to American format. So it was an issue with the database that went completely unnoticed. I made the mistake of believing that I was looking at correctly plotted data, whereas one-third of my data came from a different source which had been wrongly interpreted by the database system. This might happen. And once it has happened in the past, it might happen in the future. So you're right, Town. [Town] I think I'd say it still exists if you're using Excel. Excel has the peculiar problem that it cannot deal with dates older than 1900 unless you add a special plugin that allows you to deal with all the dates. But if you convert everything to numerical values in ISO order, you don't need to use the date functions at all. You basically play with numbers. And that's always safer.
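The rule from earlier in the talk (a component above 12 must be the day) and the closing advice to store dates as plain numbers in ISO order can be combined in a short sketch. It is an illustration under assumptions: the function name to_iso_number and the sample values are hypothetical, and the yyyymmdd integer is just one way to write a date as a number.

```python
from datetime import date


def to_iso_number(a, b, year):
    """Turn two day/month components of unknown order into yyyymmdd.

    Returns None when the order cannot be determined or the date is
    impossible, so that nothing is made up.
    """
    if a > 12 and b > 12:
        return None  # neither component can be a month
    if a > 12:
        day, month = a, b
    elif b > 12:
        day, month = b, a
    elif a == b:
        day = month = a  # reads the same in either order
    else:
        return None  # both are 12 or less: ambiguous without more context
    try:
        date(year, month, day)  # rejects impossible dates such as 31 February
    except ValueError:
        return None
    return year * 10000 + month * 100 + day


print(to_iso_number(25, 8, 1995))   # day first
print(to_iso_number(8, 25, 1995))   # month first, same date
print(to_iso_number(6, 11, 1950))   # ambiguous, stays undetermined
```

Integers built this way sort chronologically and need no date functions, so they sidestep automatic reinterpretation by the software in use.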

Video Details

Duration: 27 minutes and 46 seconds
Language: English
License: Dotsub - Standard License
Genre: None
Views: 2
Posted by: townpeterson on Jul 26, 2016

This talk was presented in the course on National Biodiversity Diagnoses, an advanced course focused on developing summaries of state of knowledge of particular taxa for countries and regions. The workshop was held in Entebbe, Uganda, during 12-17 January 2015. Workshop organized by the Biodiversity Informatics Training Curriculum, with funding from the JRS Biodiversity Foundation.
