BITC / Data Capture - What is Biodiversity Data - Q&A - part 2

[Town] Other questions? [participant beginning question] [Town] Hold on one moment. I need to come stand next to you. Okay. [Town and Thilina checking video status] [Participant] How do you verify the validity of data? [Town] That's a very good question. When we have a data record, how are we going to come to trust that data record? It may have been collected a century ago. It may have been collected by someone that we have no knowledge of. It's a very difficult thing because many times in science we prefer to collect our own data. Right? I go out into the field and I collect my own data. But in this biodiversity informatics world, we have to trust somebody. We have to trust the original person who captured that data record. For one whole day of this course, we'll be talking about error flagging, data cleaning, and data consistency. What we look for are some very clear signals of problems. There are things where I can just look at a map and I can point to whole classes of problems. Then, on a finer scale, we can look for consistency within a data record. For example, we might describe a locality in terms of geographic coordinates (latitude and longitude) and also give a description of the country, state, and municipality where the record was collected. So, we can ask whether the geographic coordinates are consistent with the textual description of that record. That's looking for internal consistency in our data record. We can go one step farther, and we can look for external consistency. We can ask whether that record gives us the same sort of signal as other records of biodiversity or of that species. Can we find every error? Can we detect and correct every problem in the data? Definitely not. But we can at least seek a set of records that show consistency and don't conflict with one another. That's a process we will talk about a lot in the course of this week and a half. Okay? Moses? [Moses] Yes.
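The internal consistency check described above (do the coordinates agree with the stated country?) can be sketched in a few lines of Python. This is only a minimal illustration, not the method taught in the course: the `Record` fields, the function names, and the bounding boxes are all assumptions made for this example. The boxes are rough, padded approximations; a real check would use a gazetteer or country polygons.

```python
from dataclasses import dataclass

# Rough, padded bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# Illustrative approximations only, not authoritative boundaries.
COUNTRY_BOXES = {
    "Ghana": (4.0, 12.0, -4.0, 2.0),
    "Tanzania": (-12.5, 0.0, 29.0, 41.0),
}


@dataclass
class Record:
    """Hypothetical occurrence record; field names are assumptions."""
    species: str
    country: str
    latitude: float
    longitude: float


def _in_box(lat, lon, box):
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def flag_internal_inconsistency(record):
    """Return a list of flags; an empty list means no problem was detected."""
    if not (-90 <= record.latitude <= 90 and -180 <= record.longitude <= 180):
        return ["coordinates out of range"]

    box = COUNTRY_BOXES.get(record.country)
    if box is None:
        return ["country not in lookup; cannot check"]

    flags = []
    if not _in_box(record.latitude, record.longitude, box):
        flags.append("coordinates fall outside stated country")
        # A dropped minus sign is a common transcription error, so test
        # whether flipping one coordinate would make the record consistent.
        if _in_box(-record.latitude, record.longitude, box):
            flags.append("possible sign error in latitude")
        if _in_box(record.latitude, -record.longitude, box):
            flags.append("possible sign error in longitude")
    return flags
```

An empty result does not prove that a record is correct. It only means that this particular check found no conflict, which matches the point that not every error can be detected.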
[Moses] Would it be a better practice to transcribe your primary data into publications? Because if you transcribe them into a publication, you can then disseminate them to different websites. This will improve access to them. I think this will minimize problems with politicians in your country. Would this practice be effective? [Town] This is a very interesting question. Other experts, if you have opinions, please throw them out. My personal opinion is that when we transcribe our primary data into publications, in some sense that is moving away from the primary record. For example, I might do a publication on the plant community. Of course, I wouldn't; I'm not a botanist. But I might do a publication on the plant community of a particular site. I may end up losing information because maybe when I say a particular site, I go from GPS records that put the plant there, there, there, and there to saying this site has that species. But I have finer-resolution data that go into that judgment of whether a species is part of the community or not. In a publication, you rarely have the possibility of publishing all of the information in your data record. My personal opinion is that it is better to share the unitary primary original data in its fundamental form. A publication is good either for calling attention to the data (announcing to the broader community that the real data exist) or for synthesizing and interpreting. I personally would argue against the idea of what are being called 'data papers'. That's my opinion; it's not the only opinion. Okay, come on. I thought I could get a rise out of you. [Christiane] I have an alternative idea. I'm not saying that I disagree with what Town says for a lot of the projects that are out there, or for a lot of purposes. I think there are certain cases where there might be a benefit of going through publications. This is something that will come up in some of my lectures as well, especially when we talk about invertebrates, the insects.
There's so much unknown diversity out there that whatever we're databasing rarely gets databased to the species level. If it does, there are high error rates because the collections haven't been looked at by the experts. There are people out there who have argued that the really trustworthy biodiversity data for insect specimens (I'm not saying that's true for the vertebrate world, and it might be less true for plants as well, but for insects and other invertebrates) are those data coming out of taxonomic revisions. Those are big projects in which experts on a particular taxon, or group of organisms, gather all the specimens, not from only one collection or one country, but for the entire revision they're working on. And they look at every single specimen. They make sure identifications are correct. They describe new species if there are new species. Then, the specimen data that are published in these revisions are the real prime data. Obviously, you may argue that people do revisions, such as in one example I will be showing you, where the two researchers looked at 1500 specimens, and you say, 'that's nothing'. But there are other revisions that look at maybe 12,000 specimens. That's still a small number of specimens. But there's a certain benefit. There are other projects that wouldn't encourage you to go directly through a publication, but I would still be very concerned about identification. I think the project we're working on, for example, shows some of the avenues of dealing with that. So, it depends really. The last thing I wanted to say on that is that there are a lot of journals now, like ZooKeys, that make it very straightforward for you to harvest the biodiversity data out of these publications. It's not just a printed PDF with what we in taxonomy would call 'material and methods' or 'material examined', which is not tagged in any way so you can't easily access it.
Journals like ZooKeys actually have tagged everything so that it is very easy to pull that information out. Prime data. And easy to extract. [Town] This is stuff that interests me quite a bit. Come this way so you're in the picture. [laughter] I'm totally in agreement with linking taxonomic revisions to the data that underlie them. That's very clear to me. But what about these straight-out data papers, where I as curator of birds at the University of Kansas announce that we're going to put our data from our egg collection online? [Christiane] I would never do that either because I think at the stage we're at, there are enough projects out there that show that it has become very straightforward and easy over the last 5-10 years to just take data and then put them out, no matter what database you use. So, no. I would never just inventory all the specimens in our collection and publish it. It's just not very useful. It's not the best way of making data available. [Participant] I visited the Tervuren Museum in Belgium and tried to digitize specimens in the collection. I noticed that a large amount of the data were erroneous and many specimens were not accurately identified. In some countries, such as mine, some taxa are ignored because there are no experts studying these taxa. So, there are large amounts of material that are not identified. What we are trying to do is get these specimens and transfer the materials to experts outside of the country for proper identification so that our collections will have the correct data and identifications. In a similar scenario, where you have material that nobody is able to identify but you want to capture these data, how do you go about this type of data capture initiative where you have no expertise, no taxonomists who can provide identification? Maybe they can take the specimens to another country for identification; then, they return the material so that we can harmonize the data.
[Town] That's a very good set of questions. Sometimes material is not worked because of lack of funds or time. Sometimes for lack of expertise. Sometimes even for lack of interest. To me, that is an initiative of the individual taxonomic community, and different communities will work in different ways. For example, in some communities, it's very easy to do detailed images, and the images can be shared very conveniently. In other communities, institutions may get funding to bring in the specialists. If a person is a specialist on this genus, maybe the institution invites the individual to come visit for two weeks to do research and, in return, help correctly identify all the specimens of that genus. But you need to separate the identifications from capturing the information. You can capture the information properly and then have a qualifier as to how confident we are of these identifications. But the documentary information is out there. Sometimes an expert can look at the data record from far away and say, 'you know, the true name of that is this.' And sometimes the expert needs hands on the specimens. But, again, I think that is something the institution needs to solve. Or something that the community (the group of people who work on a particular taxon) needs to solve. In some sense, we can separate that from getting the information into existence with a proper description of how sure we are of those identifications. To some degree, we can separate those two dimensions of the question. What is the information? And what is the best possible, most accurate identification? That's a very good question. And a very big challenge in the field. Let's get a different opinion. [John] Not really a different opinion. What I see here is an opportunity, and one very relevant to this course. You'll have even more difficulty getting experts engaged if they don't know that the material is there.
But there are experts out there, and they're dying for material because that's their lifeblood. If you digitize as much as you can, including images, and publish those, you've basically created a shopping mall for interesting information. This will make it easier to invite those experts because they know that you have those data and, now, the material is accessible. You now have a way to access those resources because they know those interesting data exist. [Town] You work with scorpions, correct? Okay... Maybe there's a collection of thousands of East African scorpions in some small university in, I don't know, western Spain. As John said, if you don't know of the existence of that, you can't do anything about it. So, capturing that basic information, even if it just says 'unidentified scorpion', is very, very useful because then an expert can say, 'oh, wow. An unidentified scorpion from Tanzania. I need to go to that collection.' Okay? So, it's the shopping mall. We don't always have to go out into the field to do new taxonomic and biodiversity work. Sometimes, somebody already did the work. Okay? [Participant] Sometimes it involves even going back to the specific point to see whether you can collect similar material and do the correct data capture.
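The idea raised in this discussion, capturing the documentary information now and recording separately how confident we are in the identification, can be sketched as a small data structure. This is a hypothetical illustration only: the class names, field names, and confidence labels below are assumptions made for this example and are not taken from any database or standard mentioned in the course.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Identification:
    """How the specimen is currently named, and how much we trust that name."""
    name: str                       # e.g. "unidentified scorpion" or a species name
    confidence: str                 # assumed labels: "unverified", "probable", "expert-verified"
    identified_by: Optional[str] = None


@dataclass
class SpecimenRecord:
    """Documentary information that can be captured without a taxonomist."""
    catalog_number: str
    country: str
    locality: str
    identification: Identification


def needs_expert(records: List[SpecimenRecord],
                 country: Optional[str] = None) -> List[SpecimenRecord]:
    """List records whose identification has not been verified by an expert,
    optionally restricted to one country, so a specialist can find them."""
    return [
        r for r in records
        if r.identification.confidence != "expert-verified"
        and (country is None or r.country == country)
    ]


# Hypothetical example records.
records = [
    SpecimenRecord("A-001", "Tanzania", "locality as written on label",
                   Identification("unidentified scorpion", "unverified")),
    SpecimenRecord("A-002", "Ghana", "locality as written on label",
                   Identification("Hypothetical species name", "expert-verified",
                                  identified_by="visiting specialist")),
]

for r in needs_expert(records, country="Tanzania"):
    print(r.catalog_number, r.identification.name)
```

Keeping the identification in its own structure means the name and its confidence can be updated later, when an expert examines the specimen or its image, without touching the captured label data.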

In English. Portion of course that covers biodiversity data capture, held 13-22 January 2014, in Accra, Ghana. Experts included Melissa Tulig, Kim Watson, Christiane Weirauch, John Wieczorek, and Town Peterson.

