# BITC / Biodiversity Diagnoses - Inventory Completeness 2


These are species-area relationships, and again, as you get more inclusive, you see these relationships continuing to go up. Here are species accumulation curves from pit-trapping, and this is again nice because
you dig a trap and you walk away from it for a night, and at the end of the night you come back and see what's in your trap, which is in some sense sampling objectively.
Much better than, you know, "what did you see today and how many were there?" And yet, you can see some of these inventories look to be leveling off. Actually no, those don't seem to be leveling off.
You can see one here that, after a lot of sampling, has started to level off, and you're probably not gonna get much above this number as far as the final species number.
In this case, who knows? Could keep going, could level off. OK? So, when you're dealing with inventories like these, it's awfully hard to guess at the final plateau when your curves are so incomplete.
I'll give you an example in a bit with some made-up data. Anyhow, these are just examples for you.
Those inferences are not easy. Essentially, you are asking, from data like this, how many species are there that never got detected, OK? That's the hard part. So, there are a couple of different approaches.
They've kind of settled down on a few now, but a really early contribution was by Jorge Soberon and his colleague Jorge Llorente in Mexico. Jorge has been very fundamental in the development of this program, and Alex and John and Arthudor know him quite well.
But this early contribution was the Jorges basically saying we can go beyond just saying done or not done. And instead, what they started to do was to take inventory data; this is from inventories of butterflies in Peru.
So, these are the actual data, and they started fitting regression curves, OK? They used three different forms of regression equation, and you can see this form, A, is very disposed to concluding that there is a very big community.
Whereas version C tends to level off very quickly.
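To make the curve-fitting idea concrete, here is a minimal sketch that fits one such form, the Clench equation S(t) = a·t / (1 + b·t), whose plateau is a/b. The cumulative counts are made-up illustration data, not the butterfly data on the slide, and the fit uses a simple linearization (regressing 1/S on 1/t) as a convenience; that is an assumption of this sketch, not necessarily how the original paper fitted its curves.

```python
def fit_clench(cumulative):
    """Fit S(t) = a*t / (1 + b*t) via the linearized form
    1/S = (1/a) * (1/t) + b/a, using ordinary least squares.
    Returns (a, b, asymptote), where asymptote = a / b."""
    xs = [1.0 / t for t in range(1, len(cumulative) + 1)]
    ys = [1.0 / s for s in cumulative]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                    # estimates 1/a
    intercept = mean_y - slope * mean_x  # estimates b/a
    a = 1.0 / slope
    b = intercept * a
    return a, b, a / b

# Hypothetical cumulative species counts after each of 7 sampling days.
cumulative_species = [4, 7, 10, 12, 14, 15, 16]
a, b, asymptote = fit_clench(cumulative_species)
print(f"a = {a:.2f}, b = {b:.3f}, estimated plateau = {asymptote:.1f}")
```

With so few points the estimated plateau is very sensitive to the form of the equation and to how it is fitted, which is the difficulty being described here.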
And in that paper, here is another example that they gave. In that paper, they explored what might be the form of the regression equation that best fits how we sample things in biology.
And unfortunately, notice that there is a fair amount of variance among these estimators. And then, particularly unfortunately, notice that this is the same kind of matrix.
But all they care about is this left edge. The first day or the first sample, they got 4 species. The second day, they got 3, the third they got 3, and then 2, then 2, then 1 and then 1.
And so my curve is going like this because each day I'm finding fewer and fewer new species. But notice that's only using the left edge of the matrix. Now, that matrix might look like this!
Once I see a species, I tend to see it a lot. Maybe I learned the song, OK? All I'm saying is that this is a matrix that's relatively full. An alternative matrix might look like that.
And my guess about how many species are in this community should be a lot lower. Hold on, I've got both of them here. In this situation, where there are lots of zeros, I have lots of rare or hard-to-detect species,
and so I'm guessing that I'm gonna continue discovering species for a long time. But here, most of the species are seen most of the days. So, in this case, I'm gonna guess that my curve is gonna level off pretty soon.
So all I'm saying is that the species accumulation curve approach is only using this bit of the matrix, and these two matrices would give an identical result if you're just using the number of species discovered per day.
Does that make sense?
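The point about the left edge can be checked with a small sketch. It builds two hypothetical species-by-day matrices from the same new-species-per-day counts (4, 3, 3, 2, 2, 1, 1): a "full" one where a species is seen every day after it is first found, and a "sparse" one where each species is seen only on its first day. Both fill patterns are assumptions made for illustration, not the matrices on the slide.

```python
new_per_day = [4, 3, 3, 2, 2, 1, 1]  # left-edge counts from the example

def build_matrix(new_per_day, persistent):
    """Return a species-by-day 0/1 matrix (one row per species).
    persistent=True: seen every day from first detection onward.
    persistent=False: seen only on the day of first detection."""
    n_days = len(new_per_day)
    rows = []
    for first_day, count in enumerate(new_per_day):
        for _ in range(count):
            if persistent:
                row = [1 if day >= first_day else 0 for day in range(n_days)]
            else:
                row = [1 if day == first_day else 0 for day in range(n_days)]
            rows.append(row)
    return rows

def accumulation_curve(matrix):
    """Cumulative number of species detected by the end of each day."""
    seen = set()
    curve = []
    for day in range(len(matrix[0])):
        for species, row in enumerate(matrix):
            if row[day]:
                seen.add(species)
        curve.append(len(seen))
    return curve

def singletons(matrix):
    """Number of species detected on exactly one day."""
    return sum(1 for row in matrix if sum(row) == 1)

full = build_matrix(new_per_day, persistent=True)
sparse = build_matrix(new_per_day, persistent=False)

print(accumulation_curve(full) == accumulation_curve(sparse))  # same curve
print(singletons(full), singletons(sparse))                    # very different matrices
```

The two accumulation curves are identical, yet the full matrix has a single species seen only once while in the sparse matrix every species is seen only once.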
So, a really key set of papers was this work by Anne Chao, and essentially what she was doing
was looking across the whole matrix and essentially asking what qualities and what properties of this matrix can tell me about...
Let's go back a slide.
You know, this species was detected 4 times, and this species here (Species P) was only detected once.
But what about this matrix can tell me how many species were never detected?
OK? There's actually a really fun paper published in Science back in the 70s, where a couple of statisticians typed all of Shakespeare's sonnets into a very early computer
and they did some probability statistics to calculate the expected number of new words that would occur in a newly discovered, previously unknown sonnet by Shakespeare.
The nice thing about sonnets is that they have a set number of lines and a set number of syllables.
And so the idea was, let's say, if we were to see another hundred words in a sonnet from Shakespeare,
on average how many new words would there be that we haven't seen so far in all of his previous sonnets?
So they had done this as an exercise for fun. And then in the 70s, there was a sonnet discovered that some scholars thought was by Shakespeare.
And so they pulled their dataset out of storage and applied it to the sonnet, and I don't remember the exact numbers, but let's say it had 4 new words in it.
And that was within the confidence interval of what you would expect given Shakespeare's use of words in past sonnets. So that was, you know,
samples would be by sonnet and species would be combinations of letters, or words.
So, all I'm after is that there are certainly properties of these matrices that should give us more information than just the left edge.
So Chao went in and sought those properties, and essentially what she found was that if you look at the number of species that were detected five times,
and the number of species detected 4 times, and the number of species detected 3 and 2 and 1 time,
then from those numbers you can estimate the number of species that were detected zero times
but are present in the community. And she is a very, very competent mathematician. She was able to show that the contributions
to that number are greatest from the number of species detected once and the number of species detected twice,
while the number of species detected 3 times, 4 times, 5 times, and more made a diminishing contribution.
And so Chao produced some very simple equations, those equations that I put up yesterday, that basically are estimators of that number of species detected zero times.
So we go back to my example, here we have one species detected 5 times, and another species detected 4 times and zero detected 3, and zero detected 2 and zero detected 1
In Chao's world, that's going to be an inventory that's done. But over here, all of the species have been detected only once.
And that's gonna knock the probability of a species being detected zero times way up.
So, these are complicated versions of Chao's equations, but if you look at this here, that is essentially the equation that I put up yesterday.
And you can get lower bounds and upper bounds and an estimate of variance, OK?
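As a rough sketch of the idea, the code below computes one commonly used bias-corrected form of the incidence-based Chao estimator (Chao2): S_obs + Q1(Q1 − 1) / (2(Q2 + 1)), where Q1 and Q2 are the numbers of species detected in exactly one and exactly two samples. The equations shown on the slides are not reproduced in this transcript, so this particular form is an assumption; it is written without the small-sample factor that some implementations add. The two small matrices are hypothetical stand-ins for the "done" and "all singletons" examples above.

```python
def chao2(matrix):
    """Bias-corrected Chao2 richness estimate from a species-by-sample
    0/1 matrix (one row per species, one column per sample)."""
    counts = [sum(row) for row in matrix]
    s_obs = sum(1 for c in counts if c > 0)   # species actually detected
    q1 = sum(1 for c in counts if c == 1)     # detected in exactly one sample
    q2 = sum(1 for c in counts if c == 2)     # detected in exactly two samples
    return s_obs + q1 * (q1 - 1) / (2 * (q2 + 1))

# Hypothetical "done" inventory: one species seen 5 times, another 4 times.
done = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
]

# Hypothetical "not done" inventory: five species, each seen only once.
all_rare = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]

print(chao2(done))      # no singletons, so the estimate equals the 2 observed
print(chao2(all_rare))  # all singletons push the estimate well above the 5 observed
```

Only the singleton and doubleton counts enter the correction term, which matches the point that species detected three or more times contribute little to the estimate of undetected species.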
More recent work, and I don't know these indices as well, has developed these incidence-based coverage estimators. Maybe Artudor understands the differences better.
But there are alternative estimators, and I will admit my laziness, but all of these estimators are now implemented in a package called "EstimateS".
And so it's very convenient for developing kind of all of these estimators simultaneously.
And all it takes is a matrix like that.
Then I go back in time just for fun, I'll show this, but I'll show you a couple of comparative studies of different estimators. This is by Walther.
And then I show you this one mainly because it was a fun study
I was working with a mathematician-mammalogist and I was asking him, what's the best estimator? And mammalogists kind of know about these things because they do a lot of long-term trapping at single sites.
And you know, notice that we're referring to some of these earlier methods from Soberon and Llorente's work, but the real problem was
you never know what the truth is. So, I decided, let's have some fun. When I was growing up,
when we would take long road trips, our game was to count license plates of different states.
So, in the US we have all of these different license plates. And the nice thing about them is they have some properties that are really attractive
If you live in Kansas where I do, which is right in the middle of the country, then Kansas and Missouri are the most common species in your fauna
And Alaska and Hawaii or Maine are pretty rare.
It's actually surprising when a car is from Alaska. That species has increased in abundance in Kansas in recent years! Anyhow, what I did was, for several weeks in the course of my day,
instead of going bird watching like I should, I would go license plate watching.
And I did that in Lawrence, Kansas and also I did that in Mexico City. So, in Mexico City you have a fauna size of 31, in Lawrence, Kansas you have a fauna size of 50.
But the very nice thing is you know the true fauna size.
And then what we did was to explore these different estimators, and so these are the three different regression equation approaches that Soberon and Llorente used.
This is the truth. This is Mexico, this is the US. You can see the truth is at 30-some here and 50-some there.
And then this is the estimator, and so what you can see is that this estimator stays very low. It underestimates the true number of species, all the way out to the end
This estimator overestimates the number of species until relatively late. This estimator initially underestimates slightly and then jumps right up to the truth and stays there.
So this is the Clench equation, and that essentially won in both of those cases, and then we started exploring Chao's estimator and several others.
And again we got a very good result from Chao, and from some of these others not as good. Here is another one that underestimates.
This, unfortunately, was early enough that they didn't include all of the ICE estimators that are in "EstimateS". We did this paper a year or two too soon.
And then we were able to graph how wrong you are against how certain you are. So, the perfect method is down low in both cases.
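The "how wrong against how certain" comparison can be sketched with simulated inventories where the true fauna size is known, in the spirit of the license plate exercise. Everything here is hypothetical: the detection probabilities (a few common "species", many rare ones), the number of samples, and the number of replicate runs are assumptions chosen for illustration, and the estimator is the bias-corrected Chao2 form rather than any specific equation from the study.

```python
import random
import statistics

def observed_richness(counts):
    """Naive estimate: just the number of species detected at least once."""
    return sum(1 for c in counts if c > 0)

def chao2_from_counts(counts):
    """Bias-corrected Chao2 from per-species detection counts."""
    s_obs = sum(1 for c in counts if c > 0)
    q1 = sum(1 for c in counts if c == 1)
    q2 = sum(1 for c in counts if c == 2)
    return s_obs + q1 * (q1 - 1) / (2 * (q2 + 1))

def simulate_counts(true_size, n_samples, rng):
    """Number of samples in which each species is detected.
    Assumed detection probability 0.9 / rank: few common, many rare."""
    probs = [0.9 / (rank + 1) for rank in range(true_size)]
    return [sum(1 for _ in range(n_samples) if rng.random() < p) for p in probs]

def evaluate(estimator, true_size=50, n_samples=20, n_runs=200, seed=1):
    """Return (mean error, standard deviation of error) over replicate inventories."""
    rng = random.Random(seed)
    errors = [estimator(simulate_counts(true_size, n_samples, rng)) - true_size
              for _ in range(n_runs)]
    return statistics.mean(errors), statistics.stdev(errors)

naive_bias, naive_spread = evaluate(observed_richness)
chao_bias, chao_spread = evaluate(chao2_from_counts)
print(f"observed richness: bias {naive_bias:+.1f}, spread {naive_spread:.1f}")
print(f"Chao2:             bias {chao_bias:+.1f}, spread {chao_spread:.1f}")
```

Plotting bias on one axis and spread on the other for each estimator gives the kind of graph described, with the ideal method sitting near zero on both.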
But that really led, at least in my thinking, to a really interesting insight. And that's what I'm going to talk to you about in terms of "result-based sampling".
And this is just kind of a proposal or a thinking mechanism, and a lot of the time it's simply impractical.
When we went out to Mongolia, we basically were working in maybe four sites on the trip that I was on
But really we had no choice about when we moved.
We had supplies for a week when we were up in the mountains, supplies for 3 days when we were at this desert site, and gasoline for only two days at this other site.
And that was set by logistical considerations.
But there is another way to approach these things