ENM Choosing Algorithms 2
So, just to summarize: when you come to this question of what algorithm you are going to use,
you need to think about XXXXXXXXXXX
Do I have reliable presence-absence data?
Can I really think that my absence records are reliable in this context?
We talked a lot yesterday about this issue of the reliability of absence records.
Do I really only have presence data? Or should I be using one of these approaches that takes background into account,
either from pseudo-absences or through background XXXXXXXX?
One reason you might use a presence-only approach over a presence-background approach would be
if you don't really XXXXXXXXX context of the last talk that Town gave,
if you don't have a good estimate of what your study region should be,
what your dispersal capacity or what element should be.
In that case you might really choose an approach that is truly presence-only.
We are going to get into more detail now; I am just trying to introduce these different phases or categories of models
and how they use it.
A second kind of general consideration that we need to be thinking about when selecting and calibrating models:
a neat way of looking at this problem is kind of conceptual to start with.
It's a very cool way that XXXXX and XXXXX are calling about
as it's presented in the past.
Let's suppose that we have some relationship between two things.
It's going to be extremely generic, but in our context this is probably going to be the probability of the species occurring
against some environmental variable, say temperature.
But again let's XXXXXXXX, this is a relationship between two variables,
and this is the truth. That is actually, in nature, what that relationship looks like.
I should mention that this might not be a good example; this is a somewhat complex relationship that might not really
be the case in nature,
but that's what the truth is, OK?
So what we need to XXXX out when we take what we refer to as a training XXXX calibration sample:
these are those points we get out in the field, we XXXXX of species, and they tell us something about the relationship between
the probability of finding the species and some environmental indication, such as temperature.
So these are our training XXXXs, OK?
Then we are going to build the model.
We are going to build some simple model that estimates what that relationship is
between these two factors,
between X and Y, conceptually.
So this might be some very simple linear model that kind of makes sense looking at it,
fitted through those points.
OK, then what we'll do is go back into the field
and take some test points.
OK, we go back out and we take some more records
from nature.
In practice, from what we've been covering this week, we might divide our original
say 100 localities or 10 localities or 20 XXXX, into some points that we are going to use to build the models
and fit the models, and some that we are going to use to test the models.
But what we have is these test points and these training points,
all of which were taken from the XXXXXX.
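As a rough sketch of that split, assuming a simple random partition (the function name, the 25% test fraction, the seed and the coordinates are all made up for illustration, not taken from the talk):

```python
import random

def split_localities(records, test_fraction=0.25, seed=0):
    """Randomly partition occurrence records into (training, test) lists."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical localities as (longitude, latitude) pairs.
localities = [(18.0 + 0.1 * i, -34.0 + 0.05 * i) for i in range(20)]
train, test = split_localities(localities)
print(len(train), len(test))  # 15 5
```

The model is then fitted on `train` only, and `test` is held back for evaluation.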
Now what we are going to do is calibrate or take some sort of XXXX
we can say what will XXXX and what will fail
For the points that we used to build the model,
now just represented by these blue lines here,
that is of course the difference between what the training point was
and what the model says there, as you will see now.
This is some sort of coming XXXXXXXX.
Now we can go and do the same for the test points,
and we can ask what our error on the test points is.
XXX X XXXXX evaluation, our key statistic is how well the model predicts these test points, which were not used to fit the model.
Remember, our model, the simple line here,
was not built using these red test points; it was built using the blue training points.
Our really valuable measure of predictive performance is these little orange bars here:
the error between the model and the test points, which weren't used to build
or train the model and which stand in for the actual truth in nature.
OK, now that's one example, that's point XXX XXX.
Let's take the same truth, let's take exactly the same training sample,
instead of points in nature, this is a XXXX XXX XX,
and let's apply a much more complex model.
So let's take some sort of model that can fit more complex responses.
It looks nice, straight in some aspects, but we don't really know the truth here.
All we know is our occurrence records, to which we fit a model, and a nice XXX XXXXX XXX.
Again we go out and take the same test points,
exactly the same test points.
We can work out our training error; look, our training error is extremely small
because we fit this really complex, really XXXXX effects
straight through our points. Looks great.
But then the error on the test points, which this XXXXXXXX model
didn't see, is much, much higher than with the simpler model, which didn't fit the points so well
but gave a more realistic impression of what nature was really like.
and this is getting at this point of XXXXXXXXXX
this is a real risk and a real serious issue
and particularly at this moment,
because as XXXX algorithms are getting more impressive, getting more complex,
they are able to fit these much more complex response curves, as we refer to them,
these responses between environmental variables
and the probability of the species occurring.
We have this real issue of the potential to fit to the training points
extremely closely, but then you lose predictive ability because
you have overfit to the training XXXXXXXX.
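To make the simple-versus-complex comparison concrete, here is a minimal sketch with invented numbers. The linear "truth", the fixed alternating perturbation standing in for sampling noise, and the choice of polynomial degrees are all assumptions for illustration; nothing here comes from the slides.

```python
import numpy as np
from numpy.polynomial import Polynomial

def truth(x):
    """Hypothetical 'true' relationship between X and Y (assumed linear here)."""
    return 0.2 + 0.08 * x

# Eight training points: the truth plus a fixed alternating perturbation
# (+0.1, -0.1, ...), used instead of random noise so the result is repeatable.
x_train = np.arange(8, dtype=float)
y_train = truth(x_train) + 0.1 * (-1.0) ** np.arange(8)

# Seven test points halfway between the training points, with a smaller perturbation.
x_test = x_train[:-1] + 0.5
y_test = truth(x_test) + 0.05 * (-1.0) ** np.arange(7)

def mean_abs_error(model, x, y):
    """Average absolute difference between model predictions and observations."""
    return float(np.mean(np.abs(model(x) - y)))

simple = Polynomial.fit(x_train, y_train, deg=1)        # straight line
complex_model = Polynomial.fit(x_train, y_train, deg=7)  # goes through every training point

train_err_simple = mean_abs_error(simple, x_train, y_train)
train_err_complex = mean_abs_error(complex_model, x_train, y_train)
test_err_simple = mean_abs_error(simple, x_test, y_test)
test_err_complex = mean_abs_error(complex_model, x_test, y_test)

print("training error:", train_err_simple, train_err_complex)
print("test error:    ", test_err_simple, test_err_complex)
```

The degree-7 curve has essentially zero training error, but it swings well away from the truth between the training points, so its test error comes out larger than that of the straight line.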
Here's another, slightly less conceptual way of looking at it.
Just look at this part "A" here. This is our environmental
variable; it might be temperature, precipitation,
water-logging capacity of the soil,
salinity of the ocean,
whatever, and this is our probability that the species occurs.
Now the blue line here is conceptually a kind of nice representation
of what might be the case in nature:
the highest probability XXXXXXX some intermediate value,
and then that drops off nicely at each end.
This red line here might XXXXXX XXXX model, which kind of overfits to this response curve.
So if I would XXXXX to fit on this red line,
we might be able to fit a perfect XXXXX through the training points,
but it's overfit to this model.
Now in contrast we might have a lot of XXXX effects.
This would be the case with a very simple BIOCLIM model,
and if we XXXXXXX it this will make a little bit more sense running a XXXX XXXXXX.
With BIOCLIM, essentially all you are going to do in this case is,
suppose we just take the lowest value associated with an occurrence record,
we'll say 10 degrees, and the highest value, say 30 degrees or 20 degrees.
Now we'll just say that the probability of the species occurring within that range is, you know, extremely high,
and outside that range it is extremely low.
So it's kind of XXXXXX to it's a box model.
So we are not fitting our response curve within the XXXXXX where we observed the species;
we are going to say that the species got a curve,
so inside it's a high probability and outside it's a low probability,
and we are going to say that it's underfitting to this true response curve.
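The box idea can be sketched in a few lines. This is only the bare minimum/maximum envelope as described here, not a full BIOCLIM implementation, and the function names and all the temperature and precipitation values are hypothetical.

```python
import numpy as np

def fit_envelope(env_at_presences):
    """Return per-variable (lower, upper) bounds seen at the presence records."""
    env = np.asarray(env_at_presences, dtype=float)
    return env.min(axis=0), env.max(axis=0)

def predict_envelope(bounds, env_at_sites):
    """True where every variable falls inside the box, False otherwise."""
    lower, upper = bounds
    env = np.asarray(env_at_sites, dtype=float)
    return np.all((env >= lower) & (env <= upper), axis=1)

# Hypothetical presence records: (temperature in degrees, precipitation in mm).
presences = [[10, 300], [18, 650], [24, 500], [30, 900]]
bounds = fit_envelope(presences)   # temperature 10..30, precipitation 300..900

# Hypothetical sites to predict.
sites = [[20, 500], [35, 500], [10, 300], [20, 1000]]
predicted = predict_envelope(bounds, sites)
print(predicted.tolist())  # [True, False, True, False]
```

Every site inside the box gets the same answer, which is exactly why there is no response curve within the observed range.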
so that's the more conceptual way of looking at it
let's XXXX XXX exactly right information in this phrase so you can look in 2 dimensions
This might be temperature, this might be precipitation,
for example, or any other number of dimensions of the niche.
These black records are true occurrence records
that we use to build the model.
Actually I have XXXXXXXXX with the process, these are just our known occurrence records.
The blue here is the kind of abiotically suitable area; this is what we are saying is, in reality,
the niche of the species.
Now a model that is overfit in niche space would look like this:
because of the complexity of the response curve it's going to fit very neatly and closely around this XXXXXX.
You might have this one point here that is fitting the XXXXXXX.
By contrast, this kind of envelope or box model is simply going to say XXX within these bounds.
We are going to say, like, 10 to 20 degrees, or from X mm of precipitation to Y mm of precipitation;
we are going to kind of draw a box around XXXX and say that
it could be present anywhere within that area.
That might be kind of underfitting, predicting a broader area
within niche space than the true niche represented by this blue area.
Then let's go back into geographical space, and what we might visualize is again the true distribution,
and kind of XXXX on diagrams I showed you yesterday.
Conceptually, suppose we have an abiotically suitable area here and a couple of patches out here.
An overfit model is going to pick out just these very few areas,
XXXXX areas that fit neatly around the training points.
If we took some test records from these areas, which might also be
inhabited because they are abiotically suitable, this overfit model is not going to predict those.
By contrast, this kind of simple box model might predict a much broader area.
So more overfit models, the kind of complex models that fit more
complex niches, are going to predict some smaller area, and as the model becomes less complex and less overfit,
you are going to predict a broader area.
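A toy illustration of how the predicted area shrinks as the fit gets tighter. The "tight" model here is only a crude stand-in for an overfit algorithm (presence within a small distance of a training record); it is an assumption for illustration, as are the training records, the grid and the radii.

```python
import numpy as np

# Hypothetical training records in a 2-D niche space, both axes rescaled to 0..1
# (think temperature and precipitation).
train = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.5],
                  [0.3, 0.8], [0.7, 0.7], [0.4, 0.4]])

# Regular grid of environmental conditions to predict onto.
axis = np.linspace(0.0, 1.0, 101)
grid = np.array([(x, y) for x in axis for y in axis])

def box_prediction(train, grid):
    """Envelope (box) model: present anywhere inside the min/max bounds."""
    inside = (grid >= train.min(axis=0)) & (grid <= train.max(axis=0))
    return np.all(inside, axis=1)

def tight_prediction(train, grid, radius):
    """Stand-in for an overfit model: present only within `radius` of a
    training record. A smaller radius hugs the training points more closely."""
    dist = np.linalg.norm(grid[:, None, :] - train[None, :, :], axis=2)
    return dist.min(axis=1) <= radius

# Fraction of the grid predicted as suitable by each model.
frac_box = box_prediction(train, grid).mean()
frac_loose = tight_prediction(train, grid, radius=0.10).mean()
frac_tight = tight_prediction(train, grid, radius=0.05).mean()
print(frac_tight, frac_loose, frac_box)
```

The tighter the fit around the training records, the smaller the fraction of niche space predicted as suitable, with the box model predicting the broadest area of the three.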
And what we are faced with comes back to the kind of challenges of model evaluation
we'll talk about tomorrow: trying to get this balance right between over- and under-prediction,
so that we can get a realistic estimate of what the true nature or true distribution of the species is.
We are going to talk a lot about that when we come back to how we calibrate the models and how we evaluate
the models to get this right balance
between over- and under-prediction,
overfitting to occurrence data and underfitting to occurrence data.
XXX XXXXX because it's a really, really crucial one, particularly at this time, as I say, when
we have algorithms that, essentially, if you apply them blindly and don't approach them carefully,
make it extremely easy to overfit, because they are so powerful that they fit such complex response curves like this.
And the good thing is, in approaches like XXX the XXXX of so many methods like XXXXXXXXXX,
you can actually look at these plots and you can see how your model is fitting against environmental variables,
so you can see how fit the model is, whether there is kind of XXXXX it might be overfitting.
That was just a couple of general considerations about model selection.
The final thing I want to touch on is:
is it really important which algorithm I choose?
OK, we've emphasized XXXX trying to do the same thing, but does it really matter if I choose
a BIOCLIM model or XXXX model or XXXXXX or XXXXXX or whatever?
I'm going to give you an example that kind of illustrates that yes, it can make a huge difference.
It's a bit of a worst-case scenario; I picked a couple of species that really XXXX this,
so thought it's a XXX of this,
but the point is that the models can be a bit different in their predictions.
This was work we did. It's published as a couple of papers that cover similar background,
and the reference is again in the presentation.
We took a few species of Proteaceae in South Africa;
this is a plant group that is endemic to South Africa.
We took a thoroughly standardized data set,
so we took exactly the same presence records, exactly the same absence records and XXXX exactly the same environmental variables.