ENM Choosing Algorithms 2

So, just to summarize: when you are coming to this question of what algorithm you are going to use, you need to think about XXXXXXXXXXX. Do I have reliable presence-absence data? Can I really think that my absence records are reliable in this context, when we talked a lot yesterday about this issue of reliability of absence records? Do I really only have presence data? Or should I be using one of these approaches that takes the background into account, either from pseudo-absences or through background XXXXXXXX? And one reason you might use a presence-only approach over a presence-background approach would be if you don't really XXXXXXXXX, in the context of the last talk that Town gave, if you don't have a good estimate of what your study region should be, what your dispersal capacity is, or what element should be. In that case you might really choose an approach that is truly presence-only. We are going to get into more detail now; I'm just trying to introduce these different phases or categories of models and how they use it.
A second kind of general consideration that we need to be thinking about when selecting and calibrating models: a neat way of looking at this problem is kind of conceptual to start with. It's a very cool way that XXXXX and XXXXX are calling about, as it's presented in the past. Let's suppose that we have some relationship between two things. It's going to be extremely generic, but in our context this is probably going to be the probability of the species occurring against some environmental variable, say temperature. But again, let's XXXXXXXX: this is a relationship between two variables, and this is the truth. That is actually, in nature, what that relationship looks like. To mention, that might not be a good example; this might be a little bit complex a relationship that might not really be the case in nature, but that's how the truth is, OK?
So what we need to XXXX out when we take what we refer to as a training XXXX or calibration sample: these are those points we get out of the field, we XXXXX of species, and they tell us something about the relationship between the probability of finding the species and some environmental temperature indication. So these are our training XXXXs, OK?
Then we are going to build the model. We are going to build some simple model that estimates what that relationship is between these two factors, between X and Y, conceptually. So this might be some very simple linear model that kind of makes sense, looking at it, that fits in through those points.
OK, then what we'll do is go back into the field and take some test points. OK, we go back out and we take some more records from nature. In practice, from what we've been covering this week, we might divide our original, say, 100 localities, or 10 localities, or 20 XXXX, into some points that we are going to use to build and fit the models and some that we are going to use to test the models. But all we have is these test points and these training points, all of which were taken from the XXXXXX.
Now what we are going to do is calibrate, or take some sort of XXXX; we can say what will XXXX and what will fail. The points that we used to build the model are now just represented by these blue lines here, which are of course the difference between what the training point was and what the model says there. You will see now this is some sort of coming XXXXXXXX. Now you can go out and do the same with the test points, and we can say what's our error on the test points XXX X XXXXX evaluation. That's our key statistic: how well the model predicts these test points, which were not used to fit it. Remember, our model, the simple line here, was not built using these red test points; it was built using the blue training points. Our really valuable measure of predictive performance is these little orange bars here: the error between the test points that weren't used to build the model or train the model and the
actual truth in nature, OK? Now that's one example, that's point XXX XXX. Let's take the same truth, let's take exactly the same training sample; instead of points in nature, this is a XXXX XXX XX. And let's apply a much more complex model, so let's take some sort of model that can fit more complex responses. Looks nice, straight in some aspects, but we don't really know the truth here. All we know is our occurrence records, which fit a model and a nice XXX XXXXX XXX. Again we go out and take the same test points, exactly the same test points. We can take out our training error: look, our training error is extremely small, because we fit this really complex, really XXXXX effects straight through our points. Looked great. But then the error on the test points that this XXXXXXXX model didn't see is much, much higher than with the more simple model, which didn't fit the points so well but gave a more realistic impression of what nature was really like.
And this is getting at this point of overfitting. This is a real risk and a real serious issue, particularly at this moment, because as XXXX algorithms are kind of getting more impressive, getting more complex, they are able to fit these much more complex response curves, as we refer to them: these responses between environmental variables and the probability of the species occurring. We have this real issue of the potential to fit to the training points extremely closely, but then you lose predictive ability because we have overfit to the training XXXXXXXX.
Here's another, slightly less conceptual way of looking at it. Just look at this part "A" here.
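
The comparison described above, a simple model against a much more flexible one judged on training error and on held-out test error, can be sketched roughly as follows. This is only an illustration and not the example shown on the slides: the `truth` function, the four training points, the three test points, and their fixed errors are all hand-picked toy assumptions, and polynomials of degree 1 and 3 stand in for the "simple" and "complex" models.

```python
import numpy as np


def truth(x):
    """Hypothetical 'true' response curve, assumed purely for illustration."""
    return 0.2 + 0.3 * x - 0.05 * x**2


def rmse(coeffs, x, y):
    """Root-mean-square error of a polynomial model on the points (x, y)."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2)))


# Hand-picked toy data: each observation is the truth plus a fixed error.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = truth(x_train) + np.array([0.1, -0.1, 0.1, -0.1])
x_test = np.array([0.5, 1.5, 2.5])
y_test = truth(x_test) + np.array([0.05, -0.05, 0.05])

# Simple model: a straight line. Flexible model: a cubic, which has enough
# freedom to pass exactly through all four training points.
simple = np.polyfit(x_train, y_train, deg=1)
flexible = np.polyfit(x_train, y_train, deg=3)

errors = {
    name: {"train": rmse(c, x_train, y_train), "test": rmse(c, x_test, y_test)}
    for name, c in {"simple": simple, "flexible": flexible}.items()
}

for name, e in errors.items():
    print(f"{name:8s} train RMSE = {e['train']:.3f}   test RMSE = {e['test']:.3f}")
```

With these toy numbers the cubic reproduces the training points essentially exactly, yet its error on the held-out points is larger than that of the straight line, which is the pattern the talk calls overfitting. Different data could of course behave differently.
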
This is our environmental variable: it might be temperature, precipitation, water-logging capacity of soil, salinity of the ocean, whatever. And this is our probability that the species occurs. Now the blue line here is, conceptually, kind of a nice representation of what might be the case in nature: the highest probability XXXXXXX some intermediate value, and then that drops off nicely at each end. This red line here might XXXXXX XXXX model that kind of overfits to this response curve. So if I would XXXXX to fit on this red line, we might be able to fit a perfect XXXXX through the training points, but it's overfit to this model.
Now in contrast, we might have a lot of XXXX effects. So this would be the case with a very simple BIOCLIM model, and if we XXXXXXX it, this will make a little bit more sense, running a XXXX XXXXXX. BIOCLIM is essentially, all you are going to do in this case is: suppose we just say the lowest value associated with an occurrence record is, we'll say, 10 degrees, and the highest value is, say, 30 degrees, or 20 degrees. Now we'll just say that the probability of the species occurring within that range is, you know, extremely high, and outside that range is extremely low. So it's kind of XXXXXX to it; it's a box model. So we are not fitting our response curve within the XXXXXX where we observed the species: we are going to say that the species can occur, so it's a high probability, and outside it's a low probability. And we are going to say that it's underfitting to this true response curve.
So that's the more conceptual way of looking at it. Let's XXXX XXX exactly the same information in niche space, so you can look in two dimensions: this might be temperature, this is precipitation, for example, or any other number of dimensions of the niche. These black records are true occurrence records that we use to build the model. Actually I have XXXXXXXXX with the process; these are just our known occurrence records. The blue here is actually the kind of abiotically suitable area; this is actually what we are saying is, in
reality, the niche of the species. Now a model that is overfit in niche space would look like this: because of the complexity of the response curve, it's going to fit very neatly and closely around this XXXXXX. You might have this one point here that is fitting the XXXXXXX. By contrast, this kind of envelope or box model is simply going to say XXX within these bounds. We are going to say, like, 10 to 20 degrees, or from X mm of precipitation to Y mm of precipitation. We are going to kind of draw a box around XXXX and say that it could be present anywhere within that area. That might be kind of underfitting, predicting a broader area within niche space than the true niche represented by this blue area.
Then let's go back into geographical space, and what we might visualize is, again, the true distribution, and kind of XXXX on diagrams I showed you yesterday. Conceptually, suppose we have an abiotically suitable area here and a couple of patches out here. An overfit model is going to pick out just these very few areas, XXXXX areas that fit neatly around the training points. If we took some test records from these areas, which might also be inhabited because they are abiotically suitable, this overfit model is not going to predict those. By contrast, this kind of simple box model might predict a much broader area. So more overfit models, the kind of complex models that fit more complex niches, are going to predict some smaller area, and as the model becomes less complex and less overfit, you are going to predict a broader area.
And what we are faced with comes back to the kind of challenges of model evaluation we'll talk about tomorrow: trying to get this balance right between over- and under-prediction, so that we can get a realistic estimate of what the true nature or true distribution of the species is. We are going to talk a lot about that when we come back to how we calibrate the models and how we evaluate the models to get this right balance between
over- and under-prediction, overfitting to occurrence data and underfitting to occurrence data. XXX XXXXX, because it's a really, really crucial one, particularly at this time, as I say, when we have algorithms that essentially, if you apply them blindly and don't approach them carefully, make it extremely easy to overfit, because they are so powerful that they fit such complex response curves like this. And the good thing is, in approaches like XXX, the XXXX of so many methods like XXXXXXXXXX, you can actually look at these plots and you can see how your model is fitting against environmental variables, so you can see how fit the model is, whether there is kind of XXXXX it might be overfitting.
That was just a couple of general considerations about model selection. The final thing I want to touch on is: is it really important which algorithm I choose? OK, we've emphasized XXXX trying to do the same thing, but does it really matter if I choose a BIOCLIM model, or XXXX model, or XXXXXX, or XXXXXX, or whatever? I'm going to give you an example that kind of illustrates that yes, it can make a huge difference. It's a bit of a worst-case scenario: I picked a couple of species that really XXXX this, so thought it's a XXX of this, but the point is that the models can be a bit different in their predictions.
This was the work we did. It's published as a couple of papers that cover similar background, and the reference is again in the presentation. We took a few species of Proteaceae in South Africa, so these are a plant group that is endemic to South Africa. We took a truly standardized data set, so we took exactly the same presence records, exactly the same absence records, and XXXX exactly the same environmental variables
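
The box-style envelope model described earlier in the talk (take the lowest and highest value of each environmental variable at the occurrence records, and call everything inside that box suitable) can be sketched as below. This is a minimal illustration of the idea only, not the BIOCLIM software itself: the function names `fit_envelope` and `predict_envelope`, the occurrence values, and the candidate sites are all hypothetical.

```python
import numpy as np


def fit_envelope(occurrences):
    """Return the per-variable (minimum, maximum) over the occurrence records."""
    occ = np.asarray(occurrences, dtype=float)
    return occ.min(axis=0), occ.max(axis=0)


def predict_envelope(envelope, sites):
    """True where a site lies inside the box on every variable, else False."""
    lower, upper = envelope
    sites = np.asarray(sites, dtype=float)
    return np.all((sites >= lower) & (sites <= upper), axis=1)


# Hypothetical occurrence records: (temperature in degrees, precipitation in mm).
occurrences = [[10.0, 400.0], [14.0, 650.0], [18.0, 500.0], [20.0, 800.0]]
envelope = fit_envelope(occurrences)

# Hypothetical sites to classify.
candidate_sites = [
    [15.0, 600.0],  # in the middle of the box
    [10.0, 800.0],  # a corner of the box, far from any occurrence record
    [25.0, 600.0],  # too warm
    [15.0, 300.0],  # too dry
]
suitable = predict_envelope(envelope, candidate_sites)
print(suitable.tolist())
```

The second site shows the behaviour the talk describes as underfitting: it combines the coldest and the wettest conditions seen at any record, a combination no record actually has, yet the box still treats it as suitable.
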

Video Details

Duration: 13 minutes and 52 seconds
Country: United States
Language: English
Views: 27
Posted by: townpeterson on Jul 12, 2013

