# BITC: Ecological Niche Model Evaluation, part 3

Hello, this is Town Peterson again. This is the third installment of the module on model evaluation.

We've talked about some general concepts and we've talked about the practicalities. Essentially we've used the confusion matrix to look at some metrics of performance. Now we are going to talk about significance tests. I'm not going to give you the full gamut of tests that have been applied to testing model predictions in niche modelling and distribution modelling. Rather, I'm going to give you three methodologies that are quite simple but very commonly used in the literature, and my idea is that with these three example methodologies you'll be able to comprehend the different methodologies that you see out there in the literature.

So, let's imagine a map. Ok, again: latitude, longitude, just like in our last module, and on our map we have a prediction that covers about 10% of the area. Ok? Essentially what I want to do is to show you some very simple methodologies applied to a situation like this. So this is a binary map. Ok, this is a map in which everything is predicted either present or absent. And we'll talk about, in another module, the idea of thresholding maps, thresholding model predictions to yield binary maps. For the moment let's just imagine that we have a binary map, and so we are going to call this set of approaches threshold-dependent, because they depend on setting some critical threshold for what is presence versus what is absence when you have an otherwise continuous output.

Now let's imagine that we take our independent evaluation data and we overlay them on this prediction and, in a very good world, maybe we get that. So you see we have 10 testing points and of our 10 testing points, 9 of them fall inside the prediction, and so essentially what we are asking is the cumulative binomial probability of getting 9 or more successes out of 10 trials when the underlying probability of a success is 1 in 10. And I did this calculation and the probability is quite low; in fact, it's much less than 1 in a thousand. Ok? So essentially what we are saying is that this particular model seems to have good predictive ability when it comes to anticipating the distribution of our independent testing data.

Now let's imagine another situation with a different set of testing data. Again we have 10 points, but here we are saying: what's the probability of getting at least 1 success when we have 10 trials and a probability of 10%? And essentially this is going to be a P around 0.65, which is to say this is a non-significant prediction. And so, in the case of our blue occurrence points, our model really had no significant explanatory power about where those points would fall. In the case of the red occurrence points, indeed, our model had some predictive ability. Now, this is a very simple case, a case where we have a binary model. Most model outputs are not binary.
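The two binomial calculations above can be sketched in a few lines of Python. This is a hedged sketch: the point counts and the 10% predicted area are the illustrative numbers from this example, not real data, and the function name is my own.

```python
# Cumulative binomial test for a thresholded (binary) prediction:
# P(X >= successes) when each test point has probability p of landing
# inside the prediction by chance alone (here, the predicted area fraction).
from math import comb

def binomial_p(successes, trials, p):
    """Upper-tail cumulative binomial probability P(X >= successes)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# Red points: 9 of 10 fall inside a prediction covering 10% of the area.
p_red = binomial_p(9, 10, 0.1)   # far below 0.001 -> significant

# Blue points: only 1 of 10 falls inside.
p_blue = binomial_p(1, 10, 0.1)  # about 0.65 -> non-significant
```

In practice you would take the predicted area fraction from your own thresholded map rather than assuming 10%.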

Most of the algorithms that you use for predicting distributions and characterizing niches instead give you a continuous output. So let's try another map, and maybe we have an area of high predicted suitability and then some surface of lower and lower predicted suitability, down to some lowest level. Ok? That's more common. Essentially all of the algorithms that are in vogue right now produce such a continuous surface. Now here we can't use just a simple binomial approach. Why not? Because: where does it stop being suitable as we go from high to low? We'll give you another module in this curriculum that treats the question of thresholding. For the moment, let's leave that question aside and ask how we can evaluate models without making assumptions about thresholds. And so we can call these "threshold-independent approaches".

Now let's, again, overlay some independent data. Ok? Now what you can see is that most of our points are in the areas of higher suitability. Another dataset, just like in the last example, might be all over the map, but now we don't have that simple equation of the binomial probability distribution. So one thing that you'll see very commonly in the literature is what's called a Receiver Operating Characteristic (ROC) curve, and it's a graph of what's called "sensitivity" which, by the way, is 1 minus the omission rate that we just discussed.

So it's graphing sensitivity against 1 minus specificity. And guess what? 1 minus specificity is equal to the commission error rate, that we just discussed also. So these are familiar quantities to you, both of them go from 0 to 1, and that's actually a very important point. The line of random expectations, essentially the performance of a random classifier if we were just to use unrelated predictions and test data, would be a diagonal line like this. And so, because each of these scales from 0 to 1, the area under this line, we'll call it the area under the curve (AUC), equals 0.5; that's for a random classifier.

Now, our blue data are pretty close to random, and so the blue, the bad model predictions, are going to look like that. In the case of the red data, notice this: within a very small area we avoid omission, which is to say most of the points are clustered in a very small area of high suitability. So our model avoids omission over a very small predicted area, and then the curve eventually turns and comes up to 0 omission. So this curve is going to have an AUC that's pretty high, maybe, I don't know, 0.9. The blue curve is going to have an AUC that's near 0.5, which is to say near random expectations.
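The ROC construction just described can be sketched directly: sweep a threshold from high to low, and at each step plot sensitivity against 1 minus specificity. This is a minimal sketch under assumed inputs; the suitability scores below are made-up stand-ins for values extracted at testing points and at background points, and the AUC is computed with the trapezoid rule.

```python
# ROC curve: sensitivity vs (1 - specificity) across all thresholds,
# with the area under the curve (AUC) estimated by the trapezoid rule.
def roc_auc(presence_scores, background_scores):
    thresholds = sorted(set(presence_scores + background_scores), reverse=True)
    curve = [(0.0, 0.0)]  # threshold above the maximum: nothing predicted present
    for t in thresholds:
        sens = sum(s >= t for s in presence_scores) / len(presence_scores)
        fpr = sum(s >= t for s in background_scores) / len(background_scores)
        curve.append((fpr, sens))
    curve.append((1.0, 1.0))  # threshold below the minimum: everything present
    auc = sum((x2 - x1) * (y1 + y2) / 2.0
              for (x1, y1), (x2, y2) in zip(curve, curve[1:]))
    return curve, auc

# Hypothetical "red" testing points all sit at higher suitability than background:
red = [0.90, 0.85, 0.80, 0.95, 0.70, 0.88, 0.92, 0.60, 0.75, 0.82]
bg = [0.10, 0.30, 0.20, 0.50, 0.40, 0.15, 0.35, 0.25, 0.45, 0.05]
_, auc_red = roc_auc(red, bg)        # perfectly separated toy scores: AUC = 1.0

# "Blue" points indistinguishable from background fall on the diagonal:
_, auc_blue = roc_auc(list(bg), bg)  # AUC = 0.5
```

With real model output, the background scores would come from the continuous suitability surface itself, and the curve would sit somewhere between these two extremes, like the 0.9 mentioned for the red data.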
