# BITC / Biodiversity Diagnoses - Gaps 4


For multiple inventories, what we get is a multinomial distribution, over several inventories that we try to overlap.
It looks at the probabilities of records being exclusive to one inventory. I think I went over this quickly yesterday,
so let me go over it again.
But it assumes that inventories are actually independent. If inventories are not independent, it is biased.
It's violating basic principles. But anyhow,
We were on Chao's estimate of the expected number of species, is that right?
It's probably the simplest one. It relies on the species-abundance distribution following some kind of log-linear pattern.
Are you familiar with the Whittaker plot?
We put here the number of individuals, and here the species, ordered from most numerous to least numerous.
We use a log scale, and we get a plot basically like this.
Which means that the relationship between species ranks and frequencies, or abundances, is basically log-linear. Right?
Well, then we can take advantage of that and derive the insight that what is important is basically the tail,
and especially, within the tail, those species appearing only once or twice.
So, we have few abundant species and many non-abundant or rare species.
So, the probability of getting a new species with additional inventories, as we increase sampling,
has to decrease, unless something changes, like having to learn bird song or whatever.
And normally it will be increasingly difficult to get new species in the inventory.
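As a concrete sketch of how the rare tail drives the estimate: the classic Chao1 estimator uses only the counts of species seen exactly once (F1) and exactly twice (F2). This is a minimal pure-Python illustration with made-up abundances, not the exact formula on the slide.

```python
def chao1(abundances):
    """Chao1 lower-bound estimate of true species richness.

    abundances: per-species individual counts from one inventory.
    Uses only the rare tail: F1 = singletons, F2 = doubletons.
    """
    counts = [a for a in abundances if a > 0]
    s_obs = len(counts)
    f1 = sum(1 for a in counts if a == 1)
    f2 = sum(1 for a in counts if a == 2)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # bias-corrected fallback when there are no doubletons
    return s_obs + f1 * (f1 - 1) / 2

# Hypothetical inventory: 10 observed species, 4 singletons, 2 doubletons
sample = [12, 8, 5, 3, 2, 2, 1, 1, 1, 1]
print(chao1(sample))  # 10 + 16/4 = 14.0
```

Note that the observed count enters only through S_obs; everything extra comes from the once-or-twice species, which is exactly the point about the tail.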
What about using more than one inventory from the same place?
Independent inventories are taken in different places and times for instance.
Say we have inventory A and inventory B, with a total of 11 species, and we may look at the species that are exclusive to one inventory or the other.
And the bird ringers had a problem. They were marking animals, and they knew that the marks, the tags, kept getting lost.
So they figured out that if they put two different tags on the same animal, they could work out how many animals had lost a tag,
because those animals would turn up with one single tag.
They put one ring on the left leg and one ring on the right leg. When they found birds with only one ring, they knew that the other one was lost. This is very basic.
And as we said yesterday, the number of ringed animals includes those that have lost no tag,
Those that have lost one
Those that have lost the other.
And those that have lost both, which we don't know; we can't recognize them. But we can derive a formula, which comes straight from probability theory, so it's quite easy,
That gives us how many animals have lost both tags.
And since it comes straight from the probability theory, we can use exactly the same theory to derive how many species are missing here
Because they have lost both inventories.
We can say that species 11 has missed inventory A and species 7 has missed inventory B.
But they were both in the place. Alright? As proven by having appeared in a different inventory.
So, it's exactly the same mathematics. Nothing new here.
So, we can use that formula to derive, from two inventories, how many species we should have caught.
However, this formula has a large variance. And the confidence interval tends to be large if inventories are small.
The larger the inventories, and the higher the overlap, the narrower the confidence interval.
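The double-tagging argument can be sketched directly. Assuming independent inventories, the species exclusive to each inventory estimate the probabilities of missing the other, so the species missed by both come out as a simple product. All counts below are invented for illustration.

```python
def missing_both(only_a, only_b, both):
    """Estimate species present at the site but missed by both inventories.

    By the double-tagging logic: if inventories are independent,
    the exclusives estimate P(miss A) and P(miss B), giving
    missed ~= only_a * only_b / both.
    """
    if both == 0:
        raise ValueError("need at least one shared species")
    return only_a * only_b / both

# Hypothetical example: 11 species seen; 3 only in A, 2 only in B, 6 shared
seen = 3 + 2 + 6
est_missed = missing_both(3, 2, 6)
print(seen + est_missed)  # estimated richness: 11 + 1.0 = 12.0
```

As the transcript notes, this estimate has a large variance when inventories are small, and it is biased when the independence assumption fails.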
One way to improve the confidence interval is to try to use more than two inventories.
I analyzed this model some time ago; these things are basically the same idea, applying the multinomial distribution to any number of columns.
This was published, and this is the general distribution; forget about this formula, it's just the concept. The operational formula is this one.
Basically, it is the product of the probabilities of missing each inventory.
So you have to take all the possible outcomes. If you have three inventories, there are 8 possibilities: missing one, two, or three,
missing the several combinations of two, and missing all three. And at the end you can calculate K as a simple product of all the probabilities of missing inventories.
More or less. It's a little bit more than that, but basically, that's it.
I was running some simulations the other day, trying to see how the confidence interval narrows, using a random set of data
and something that is basically bootstrapping, and you can see that the more inventories there are, the narrower the confidence interval.
You can even get numbers with very very small confidence intervals given the right conditions.
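That narrowing is easy to reproduce with a toy Monte Carlo simulation (not the actual bootstrap from the talk): each species is detected independently with a fixed probability per inventory, and the spread of observed richness shrinks as inventories are added. Every number here is an assumption for illustration.

```python
import random
import statistics

def simulate_richness(true_s=100, p_detect=0.3, k=2, reps=500, seed=1):
    """Monte Carlo spread of observed richness for k inventories.

    Each of true_s species is detected by each inventory with
    probability p_detect, independently (the very assumption the
    transcript warns about). Returns mean and standard deviation of
    the number of species seen in at least one inventory.
    """
    rng = random.Random(seed)
    obs = []
    for _ in range(reps):
        seen = sum(
            1 for _ in range(true_s)
            if any(rng.random() < p_detect for _ in range(k))
        )
        obs.append(seen)
    return statistics.mean(obs), statistics.stdev(obs)

for k in (1, 2, 4, 8):
    m, sd = simulate_richness(k=k)
    print(f"{k} inventories: mean seen {m:.1f}, sd {sd:.1f}")
```

With more inventories the mean climbs toward the true richness and the spread shrinks, matching the "right conditions" remark above.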
So, what is the workflow with taxon gaps?
We need to estimate the completeness of the data, and we need to resolve names to higher taxonomies.
That's the first thing to do.
Homogenize taxonomies
We want to do a treemap or a leaf plot which is a different plot
And compare observed taxa distribution with expected taxa distribution.
And we have to be careful with rare species.
And we may compute basic diversity metrics. We may try to plot this Whittaker plot and see whether it has a high slope or a low slope:
a high slope means it will be very difficult to find new species there, and a low slope means a lot of species will still appear.
A low slope is similar to what Town showed us before in the effort-species curve, looking like this.
And if it's possible, calculate Sexp.
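The slope check on the Whittaker plot can be done with an ordinary least-squares fit of log abundance against rank. A minimal sketch, with two invented example communities (a steep, dominance-heavy one and a shallow, even one):

```python
import math

def whittaker_slope(abundances):
    """Slope of log10(abundance) against rank (Whittaker plot fit).

    Steeper (more negative) slopes suggest strong dominance and a
    short rare tail; shallow slopes suggest many rare species remain
    to be found. Plain least squares, pure Python.
    """
    ranked = sorted((a for a in abundances if a > 0), reverse=True)
    xs = list(range(1, len(ranked) + 1))
    ys = [math.log10(a) for a in ranked]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

steep = [1000, 100, 10, 1]      # strong dominance: slope near -1 per rank
shallow = [50, 40, 35, 30, 28]  # even community: slope near 0
print(whittaker_slope(steep), whittaker_slope(shallow))
```

Note the sign convention: both slopes are negative, and "high slope" in the talk corresponds to the steeper (more negative) fit.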
Now, time gaps!
Remember what we said yesterday. Time flows in one direction only.
Data that were not captured in the past will never be captured now.
Let's distinguish gaps from natural trends. That's very important.
Because, by nature, time data are not linear and not homogeneous. So, we may see features which are basically natural trends.
And some other features may not be natural at all.
Remember, I showed this plot yesterday, which is the number of records of the humpback whale over time.
So, we see this plot here. We see obvious gaps: there are no data in these years.
But there are also very scattered data in the first two-thirds of the century.
And then a couple of spikes, three spikes in fact.
So, what may we be looking at here? There are two possibilities.
Either the population of the humpback whale is like this or our data collection was like this
from a constant population.
Which one do you guess?
Data collection?
Humpback whales were actively hunted until here.
So, it may be that the population was really low too.
But it may be a problem of the data collection system, because we know that data collection has increased over time.
Let me show you this other plot here.
I'm not telling you what it is.
Data collection or natural cycles?
What is your guess here?
These are actual cycles.
This is data for a pest. It's a butterfly
that has outbreaks in certain years
and becomes a pest.
It's been sampled continuously, with exactly the same method, for almost one century.
So, we can be sure that this data here are actual numbers.
Or what about this?
This is also a natural population.
This is mink or something like that from Canada and it has natural cycles.
So, we cannot equate this to a gap!
We cannot equate this to a gap either. This is natural. So, it makes life difficult for us,
because how can we know there is a gap, when what looks like an obvious set of gaps here may instead be a natural cycle?
That's tricky, and there is no solution for it, other than systematic sampling perhaps.
So, which questions should we ask when we analyze time gaps?
First and foremost: are there natural trends in this population, or in this set of species?
Because if there are natural trends we have to factor them out in order to find the gaps.
We need to look for baselines, previous knowledge, the expected natural curves.
Are there natural cycles? Is this species seasonal? Does it migrate?
So, we have to look for unsampled periods of the time.
We have to look at the distribution of our sampling periods and see whether some specific periods are missing.
Is there a sampling effect? We may relativize data by sampling effort; that may help us
to convert a time line that has bumps into one in which those bumps are independent of the sampling effort.
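Relativizing by effort is just a per-period division; a minimal sketch with hypothetical counts and effort values, showing a bump that vanishes once effort is factored out:

```python
def relativize(counts, efforts):
    """Records per unit effort for each time period.

    counts: raw record counts per period; efforts: sampling effort
    (e.g. collector-days) in the same periods. Bumps that survive
    the division are candidates for real trends rather than
    sampling artifacts. Periods with zero effort yield None.
    """
    return [c / e if e > 0 else None for c, e in zip(counts, efforts)]

# A bump in raw counts that is entirely explained by effort:
raw = [10, 20, 40, 20]
effort = [1, 2, 4, 2]
print(relativize(raw, effort))  # [10.0, 10.0, 10.0, 10.0]
```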
But quite often, time gaps are combined with other gaps.
So, we may have a combination of time gaps and space gaps and taxon gaps.
And this is extremely common.
So, we need to cross-examine time gaps with other sources of information or other dimensions of the problem such as geography or taxon.
This plot here comes straight from GBIF, and it is the number of data records that have made it into the GBIF index from 2008 to the present.
And as you see, it has increased over time. So, if we want to look for a gap for a species which is included in this index,
one of the very first things we need to do is to factor out the increase in data availability for that species.
Which guarantees us nothing, because that particular species may have started to appear here, or because this particular increase
is itself due to the actual records of that species, for instance, birds.
This is ####(calls) for animals
And this #### for plants.
So, plants have been increasing and it doesn't mean there are more plants in the world. Simply means there are more plants data available
A synthetic way to do this is with the circular plots which I described briefly yesterday.
It's a plot in which we put the cyclic component in the angle,
and we use the radial axis for the linear component of time.
So, in a plot like this, each single point is one single day; in this case, from the 18th century until now.
Each point is one single calendar day, and the colors describe the amount of available information.
This actually is Spanish data, what we know from Spain.
Nothing really in the 18th and 19th centuries, more in the 20th century.
A big gap here coincided with our civil war
and a lot of data in the spring and summer, which is when biologists go to the field, because we are happy sampling there,
and this gap here, what is it? We know what this one is here: this is January 1, where more data appear.
What is it? It's easy!
- March?
One day before.
[Town]: The least frequent day sampled in a year.
- You would be four times younger if you were born that day.
February 29.
If you were born Feb 29, you remain very young!
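The February 29 dip is purely a calendar artifact: the day exists in only about one year in four. A small sketch, binning hypothetical daily records by calendar day the way the circular plot does:

```python
from collections import Counter
from datetime import date, timedelta

def records_per_calendar_day(dates):
    """Count records per (month, day) cell, as in the circular plot.

    dates: iterable of datetime.date. February 29 exists only in
    leap years, so with uniform sampling it shows roughly a quarter
    of the data of its neighbouring days.
    """
    return Counter((d.month, d.day) for d in dates)

# Hypothetical uniform sampling: one record per day over 2008-2015
# (two leap years: 2008 and 2012).
start, end = date(2008, 1, 1), date(2015, 12, 31)
days = [start + timedelta(n) for n in range((end - start).days + 1)]
counts = records_per_calendar_day(days)
print(counts[(2, 29)], counts[(3, 1)])  # 2 8
```

So a low count on February 29 is expected even with perfectly uniform sampling, and must not be read as a gap.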
OK, that was just one dataset. Now, this is the whale data
that we saw yesterday, only from the 20th century, and we saw that the whale data tend to accumulate in August and September.
Remember, there were two migration periods. Town said it was restricted to a very narrow set of three years.
What is this? This is natural and this is a sampling campaign.
So, what do we have here? A gap in campaigns here, a gap in campaigns afterward.
It is easy to mix that up with... I don't remember what this one is.
Yes, I do.
I went up for dinner and there was no one, and I hate eating dinner alone,
so I went back to my room and had some spare time,
which I should have used to prepare this talk, so it would be more articulate, sorry.
But I was lazy, so I decided to download wolf data and plot the wolf data there and this is the wolf data for Spain.
And again, the wolf appears only in summer, and only recently?
We know wolves tend to appear in winter; they scare people. They appeared in winter formerly, but nowadays they only appear in summer.
Wolf sighting expeditions in summer time.
Another example, from a recent work
This is the border between Spain and France, and this is a range of mountains which we call the Pyrenees, 2,000 to 3,000 m high.
and there are a lot of data from that region.
If you go to GBIF, you can get 400,000 data points above the 600 m line,
belonging to 13,000 species from more than 75,000 different localities.
So, it's a quite compact and interesting dataset.
But this dataset has been compiled over time.
Over a long time in fact.
So, in the 19th century, there were already some data available; more data became available around 2000 and 2010, and this is basically where we are now.
The color of each dot represents the number of data records there,
and you see an enormous number of different locations. So, it's a quite complex and complete dataset,
with a few interesting points here.
a few points heavily sampled,
and some areas with the high concentration of points
and finally, a region where all the dots are evenly spaced.
It's a mountainous region. It's quite difficult to go to all those places.
So, why do we have this regular pattern here?
That has appeared over time.
We'll come to it later. Alright?
Be with me...
Again, this is Spain, and these are the data from preserved specimens. I lived here, and this is my region.
So, yes, we do have a bias: we have oversampled this region, which is close to the laboratory.
But before the 1960s, there were only a few data here and here; you can't see them well.
And later on, the extent was much bigger.
But there are numbers of records, even in my own region and some other regions, and this big dot here, which are un-dated records.
That's an enormous gap. We know the data exist, they turned up at some point, but we have no idea when they were collected.
Finally, I go to the geographical gaps.
which have been covered in depth by Town, and I will only add a few brush-strokes.
Geo-gaps permeate every dataset. They always appear in any kind of dataset you may collect, except perhaps for,
say, point-time data; I mean data that come from an experimental station that is sampled every year or whatever.
Let's throw in a few additional concepts on top of what we saw in the morning,
like false gaps, boundaries, combined gaps, etc. And if there is one I want to comment about