
Machine Learning Webinar Series

This will basically be an introduction to our framework, and it will motivate why neural networks are interesting. Roughly, we're going to spend a little bit of time looking at what the industry looks like for neural networks: what applications are out there, why they're interesting, why people should care about them. Then a simple introduction: what are neural networks? We're going to assume that people here have not necessarily heard about neural networks. And then we're going to show a full example of training a handwriting classifier, a digit classifier, and that will show off all the different pieces that you've learned. This will be the foundation of all the different pieces that you need to understand properly in order to do your own work in neural networks. The principles are very simple, so we're going to flesh this out and take it a bit slowly. Okay, so there's a prelude to this whole thing. Deep learning and neural networks, at least neural networks, are an idea that goes back to about the 1960s. Lots of the main ideas were developed in the 60s and 70s, things like backpropagation and training the networks, but they fell out of fashion for a long time. It was really in 2012 that they came back with a vengeance. So what exactly happened? Here's a problem that is very simple for humans but very hard for programmers. The problem is the following: you have a set of images, things like different breeds of cats, different animals, kettles and pots and all kinds of things. You have 1.5 million training images in about a thousand categories. Given the training images, you create some classifier, and the problem is to take unseen images and test the classifier on them.
The question is: how good a classifier can you make for that? It turns out to be an incredibly hard problem. There's a very famous dataset called ImageNet which is exactly this: there are 1000 categories and 1.5 million training images. The ImageNet organizers have a set of unseen test images that competitors in this competition, which happens once a year, never get to see. They just submit their algorithm and it gets automatically graded. This competition ran for a number of years, from about 2010, and people used all kinds of approaches, some of which we'll look at now. But this is the sort of error rate people were getting: even the best methods were always above 25%. Even in 2011, the best method was roughly 25%. Then in 2012 there was a group led by Geoffrey Hinton, with Alex Krizhevsky, that was basically the first to use neural networks in this competition. They had an eight-layer network, and they used GPUs to train it. It's a massive dataset, and previously neural networks were just too slow to train on something this big, especially ones with lots of layers. So they were able to train this massive net, and this was their result: something like a 15% error rate, while the entire rest of the field are these little dots over here. You can see how clustered the top results are here. The previous methods were variations on a theme, but this method was completely different, and it destroyed the rest of the competition. What's happened since then is that every year these neural networks have gotten better and better. Basically all competitors started using neural networks in these later years, and by 2017 the error rate was less than 5% on this dataset.
Okay, so this was basically the catalyst. As soon as you have a massive competition like this, where you can't really cheat because the organizers hold the unseen data, the entire research community has to pay massive attention, and that's what they did. The non-deep-learning approaches relied on careful feature engineering: you try to find some feature of the image that you can feed to things like support vector machines, and these approaches basically maxed out at around a 25% error rate. The deep learning approach uses the pixels directly, with zero feature engineering. That frees researchers to look for better networks that are generally good, instead of spending all their time trying to find better features, which are very complicated things that require massive engineering effort. It also lets the entire system learn end to end. With hand-crafted features, there are other principles involved in deciding what a good feature might be: they're not learned together with the actual problem you're trying to solve, they're just general descriptors that you try to construct. The neural network was revolutionary in that respect as well: it freed people from having to do this sort of engineering, and every piece of the problem gets optimized in the optimization procedure, which is one of the reasons you get amazing performance. Many of these models trained on ImageNet can actually be seen. We have an unreleased neural network repository, which I would encourage people to look at; it will be released quite soon. You can get access to some of these models. For example, this was one of the winning models, from 2016 I think.
This thing can be easily evaluated, and it correctly predicts that there's a peacock. In the Wolfram Language we have things like ImageIdentify that are built on the same technology. And in Mathematica you can click and see exactly what's inside this network, and then delve deeper into each of these layers. Okay, so this was the main catalyst, the event that caused a Cambrian explosion of different networks and the different tasks these networks were suddenly put to use for. Most people in the field think this was the catalytic event, and you can see it, for example, in Google Trends: before 2012 the term "deep learning" was basically a flat line, and after 2012 it went up dramatically, and it keeps going up. The other thing that happened was that people realized these neural nets weren't just good at classifying dogs and cats and whatnot. They were good at all kinds of learning tasks, from language translation to speech recognition, and so on. So I'm just going to look at a few of these quickly, because it's interesting to know roughly what these things can do, just to have an idea of where the field currently is. I don't want people to leave with the idea that all these networks can do is classify images. That's absolutely not the case; they're used in a wide variety of things. For example, Google Translate switched to using neural nets in 2016. They had a different kind of system before that, and the switch brought major performance improvements. From a New York Times article: "As dawn broke over Tokyo, Google Translate was the No. 1 trend on Japanese Twitter, just above some cult anime series and the long-awaited new single from a girl-idol supergroup.
Everybody wondered: How had Google Translate become so uncannily artful?" And you can actually use this in Mathematica as well, with TextTranslation, which calls the Google neural translation API. There we go. Okay, other things. Autonomous vehicles: one of the hottest things at the moment is self-driving cars, and there's a massive rush among companies like Uber and Waymo to build the first commercially available autonomous vehicle. One of the key components of these vehicles is neural networks, which do all kinds of image processing. We actually have one of these networks in the neural network repository, and you can click on this link. I didn't put the code here because it requires a bit of extra preprocessing code that is a bit ugly, but you can follow the link to see the full thing and run it in Mathematica. The idea is that we can take a scene, and the neural network can tell you "these pixels here correspond to people", so it sees that there are people walking around, exactly where they are, where their legs are, and so on. It can also say that these other pieces here are the road, the sky is up here, there are some buildings, etc. So neural networks are used to give the car an idea of the world around it. Other things: speech recognition. Basically all modern speech recognition systems, the ones you might find on your phone or anywhere else, use deep learning now. We have an example of that in the repository too, which again you can click on: a very good speech recognition system, trained by Baidu on about 8000 hours of English. Then there are things like question answering: we have a new function in 11.3 called FindTextualAnswer, and this was trained with the Wolfram neural network framework.
There's a nice blog post from Jerome, from our group, that you can look at about exactly how that was done. Here's a simple example: you give a question and some big paragraph of text, and it correctly extracts from this text that the size of the dodo is one meter. For this sort of task, basically every method is a neural network method. Okay, rapidly on to other things, like reinforcement learning. AlphaGo: people might have heard of this. Go is a game that was generally considered almost unsolvable, at least a decade away from any reasonable solution. DeepMind shocked the community by releasing a Go player that beat one of the previous world champions, Lee Sedol. Quickly after that came a new version, AlphaGo Zero, which learns completely from scratch and is a lot better than even AlphaGo. One of its key components was a neural net. Then there are more unusual things, things you might not even think of as learning tasks, like restyling images. I might say: here's the content image, and I want to restyle it into the style of this other image. I can run this, and here's the resulting image; you can do this with all kinds of different images. This is the foundation of ImageRestyle, which is a newish function in Mathematica. One of the trends I wanted you to see is that for internal functionality we are using our own framework for all of these things, from ImageRestyle and question answering to speech recognition and translation. This framework has to be powerful enough for us to actually train these things ourselves, so users get a very good framework.
Okay, another random example is colorization. In this case you say: here's a black-and-white image, please colorize it, reconstructing a plausible color scheme. We can try this, and there's the new image. The network obviously has to recognize that this is grass, that this is probably a dog, and that dogs come in certain colors when they look like this, and so on. I could go on and on: there are a lot of things these networks can do, from reconstructing 3D faces from 2D photos, to geolocating an image from the image alone given a massive training set, to things like medical imaging. I think just yesterday the FDA, the medical regulator in the US, approved a diabetic-retinopathy detector based on deep learning for medical use. So medical imaging is a massive, massive field. There's all kinds of NLP, natural language processing, from sentiment analysis to many other things. You should really go to the site, where you'll find a bunch of other networks that are currently runnable in Mathematica. Okay, so hopefully that motivates why people should be interested in this sort of thing: these networks can be applied to all these different problem domains that previously didn't have really good solutions. So what are neural networks? Let's spend a bit of time on this question. The modern term for neural networks is actually differentiable programming, and we'll see why that is quite soon. Also, just so people know, there's a very nice introduction-to-neural-nets tutorial in the Mathematica documentation; there's a link for it there, and some of this is based on it. Okay, so the most basic building block of a neural network is a layer. Other frameworks might call this an operator. A layer, or an operator, is some basic piece of computation that acts on numeric tensors.
Let's look at a very simple example. There's something called ElementwiseLayer, which gives you layers for all sorts of elementwise functions: Tanh, Clip, and other things of that sort. One property of these layers is that they can only act on numeric tensors. So here's an example: we apply an ElementwiseLayer to some input, and it's the same as applying Tanh at top level; they both give the same result. Here's where the differentiable programming idea comes in. You want layers that are differentiable, because that's what makes it possible to optimize, to learn the parameters of these networks efficiently. If every piece is differentiable, then when you stick them all together the whole thing remains differentiable, and it's very efficient to optimize. There's a way of differentiating these layers with NetPortGradient: you give the layer some data and ask for the gradient with respect to the input port. And what is a port? If you click on a layer you can expand it, and it shows a number of ports: an input port and an output port. The reason for this is that you need some way of specifying unambiguously how to connect things up, say the output of this layer to something else. In this case it's trivial, just one input and one output, but many layers can have multiple inputs or multiple outputs. So you can specify explicitly: I want the gradient for the input port. We can do that, and it's exactly equivalent to differentiating symbolically and then replacing the symbolic variable with some inputs. Okay, so these layers have some properties. One difference between these layers and other Mathematica functions is that they can operate transparently on GPUs.
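To make the elementwise layer and its gradient concrete outside the Wolfram framework, here is a minimal NumPy sketch of the same idea: the function is applied to every entry of a tensor, and the gradient with respect to the input (what NetPortGradient returns for the input port) has a closed form.

```python
import numpy as np

# Sketch (plain NumPy, not the Wolfram framework) of an elementwise Tanh layer:
# apply the function entry by entry, and expose the input gradient.

def tanh_forward(x):
    return np.tanh(x)

def tanh_input_gradient(x):
    # d/dx tanh(x) = 1 - tanh(x)^2, again applied elementwise
    return 1.0 - np.tanh(x) ** 2

x = np.array([0.0, 0.5, -1.0])
print(tanh_forward(x))
print(tanh_input_gradient(x))   # gradient is 1.0 at x = 0
```

This is exactly the "differentiate symbolically, then substitute the input" equivalence mentioned above, written out for one layer.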
It's become imperative to train on GPUs; CPUs are just too slow. So this target device can be either CPU or GPU, for example, and if you have multiple GPUs you can specify which GPU you want to run on; if I had four GPUs I could do something like that. Everything works on GPU, so any combination of these things will also work on GPU. Then some other things. One is the idea of shape inference: if you tell a layer what the input size is going to be, say a matrix of four rows and thirty-two columns, it will automatically infer what the output shape is going to be. We'll see a bit later why this is important and useful. Okay, so one of the most important differences between normal Mathematica functions and layers is that layers can have learnable parameters. If applying a fixed function were all a layer did, like another version of Tanh, there's no way it could learn anything, because it's just a function you're constructing explicitly. The magic happens when you have learnable parameters. Layers with learnable parameters will often look like this: they show up as uninitialized. If you click on one, there are the ports, the input and the output. If I tell it what the input is, and I tell it this thing has three neurons, say, then it knows exactly what the output is going to be, but it also has these parameters here: weights and biases. These are the things you want to learn; they adapt based on the data. In the beginning they don't have any values, they're simply not defined, and if you try to extract the values it will just say Automatic.
There's a function called NetInitialize which initializes these weights and biases to random values. That's the general philosophy with neural networks: before training, you need to randomly initialize the weights to something that's not zero; if everything is zero, the network has huge difficulty learning. There's actually a fairly complicated procedure for deciding on the scale of the random values that you initialize with, and our framework completely automates all of that, so there are no decisions you need to make. Okay, now I can use NetExtract on this layer: I've initialized it, so if I look at the weights there actually are some values now, and I could do the same with the biases. And obviously I can now evaluate this thing as well. If I evaluate dot2, which takes a vector of size two, it will evaluate; but if I try to evaluate dot1, it will say it cannot evaluate the net because I have not specified the values for the weights. Okay, so all of the examples so far have had one input port. One classic example of something that has multiple ports is a loss function. A loss function is something very simple: it takes an input and a target, and it outputs a single number, one that is very low when the target and the input are very similar and very high when they're very far apart. In this case it's basically the Euclidean distance between the two vectors being put in. So here's a very simple example: I have an input and a target that are quite different. This is four and this is one; if I make this 99, this number goes up.
But if I make it one, this number is smaller. This number gives the network an idea of how close it is to the truth, and it is the key quantity used for training the network. You can also list what layers are available to you in the system; each of them has a lot of documentation, and they all have different behavior. Some are very similar to what Mathematica would do: AppendLayer does basically the same thing as Append, and DotLayer does the same thing as Mathematica's Dot. But there's a whole bunch of layers that have no analogs in top-level Mathematica, things like InstanceNormalizationLayer, which does a certain normalization of the data, or SpatialTransformationLayer, and so on. Becoming proficient with neural networks involves understanding what all these different layers do and when it's appropriate to use them, and that's a larger topic that will be carried on in the next sessions, I think. Okay, let's cover one of them at least: what does the linear layer do? This is the simplest learnable layer, and also one of the oldest layer types. I should mention that the set of layers keeps growing: people invent new layers all the time, and as they're invented we want to implement them and have them in our framework. So, the simplest learnable layer is the linear layer, and the Mathematica equivalent is this: you're dotting the weight matrix with the data and then adding a bias term. The data is a vector and the weight is a matrix. We can verify that the layer does the same thing by extracting the actual weights. Oh dear, what's going on there? Oh, sorry, I didn't initialize that. There we go.
Okay, so all that the linear layer is doing, and it's often called the fully connected layer, is this: say you have an input vector of length three, so one, two, three. That's what the input might look like, and the weight matrix can be interpreted as the connections between these input neurons and the neurons of the output, with one weight per connection; the weight matrix represents exactly that. All it is, is a linear transformation of the data. Linear transformations can rotate the vector, rescale it, shift it a bit, and so on. So whatever a linear transformation can do, this layer can learn to do, if that's good for producing the predicted result. This is a big topic, and things like what convolution layers do, what pooling layers do, recurrent layers, and so on will be explored in depth in the next talks, I think, so I'm not going to go into that too much. Okay, so I think we've understood layers now: they act on numeric tensors, and they have a whole bunch of properties, like being differentiable and working on different devices. But layers by themselves are not very useful; a single linear layer can't really learn much. In fact, one linear layer is the same as logistic regression, if you're doing a classification problem, or linear regression, if you're predicting a numerical value, and those are very weak learning systems. One of the main ideas of neural networks, or deep learning, is that you want a long, deep set of transformations, which allows much more complicated behavior to be learned than single layers can manage.
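The fully connected computation described above can be written down in a few lines. This is a NumPy sketch of the math, not the Wolfram LinearLayer itself; the weight shapes follow the "one weight per input-output connection" picture.

```python
import numpy as np

def linear_layer(x, W, b):
    # "Fully connected": every input neuron connects to every output neuron,
    # with one learnable weight per connection plus one bias per output.
    return W @ x + b

x = np.array([1.0, 2.0, 3.0])     # input vector of length 3
W = np.array([[1.0, 0.0, 0.0],    # 2x3 weight matrix: 3 inputs -> 2 outputs
              [0.0, 1.0, 1.0]])
b = np.array([0.5, -0.5])         # one bias per output neuron

print(linear_layer(x, W, b))      # -> [1.5 4.5]
```

Because this is just a matrix product plus a shift, the layer can only ever express linear transformations, which is why stacking many layers with nonlinearities in between matters.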
So, the first container type is a chain. This is very simple: a chain is just a linear sequence of operations. It has one input, and we can try it out. If we look at this thing here, it has one input and one output, and it first applies Tanh and then LogisticSigmoid to the data. This is exactly the same as this composition at top level, for example, and you can verify that they are the same. This is a bit of a silly example, because in this case ElementwiseLayer alone would also be equivalent: it's a very powerful layer, and you can compose the different primitives it supports into a pure function, and that's a perfectly valid elementwise layer. That would be the better way of writing this; the chain version is just to show what a container does. A lot of networks, traditionally almost all of them, have been just chains. The traditional image classification networks, even the one that won ImageNet, can be represented by this chain construct; that's all they need. However, after the 2012 ImageNet results, people realized that there's a lot of power in more complicated structures, and one of them is the graph. A very simple graph would be something like this: you have two different inputs; the first input goes into Tanh, the second goes into LogisticSigmoid, and then they get added together and returned. The idea with NetGraph is that you first list the layers, and this can be a list or an association; then the second argument is a list that gives the connectivity structure.
So it works just like Graph does. In this case I'm saying that the first input goes to the first layer, Tanh, and the second input goes to the second element here, this one. Then both of these go to a TotalLayer, which adds them up. You can do the same thing with an association, using names rather than indices: Tanh goes to that, and then these references here change to the names. Some people find that much easier; I personally prefer that method as well, and then this must obviously be an association, and these references must be changed too. Okay, so the question is: what is this equivalent to? In this case it's just function composition, like this. You can build very complicated structures with NetGraph, and you can also differentiate it: because all of the layers are differentiable, the NetGraph is differentiable too. This is an important point: containers behave exactly like normal layers. They're differentiable, they run on GPUs, and they can also be nested. NetChain expects layers, but a NetChain is itself a layer, so this is perfectly valid syntax, and if you dig into it you can find the different layers of the inner NetChain. And obviously the models in the repository are almost all some form of container. An example would be the restyling net: it has two different inputs, the style image and the content image, it does some processing, and you can dig deeper, click on these things, and see what it looks like. So containers are important for building more complicated structures. But containers and layers still only ever operate on numerical tensors. What you might have noticed right at the beginning, though, was a network acting on actual images: we would take a model from the repository and apply it directly to the image.
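The chain and graph above are both just compositions of functions, which a short NumPy sketch makes explicit (again, this is the math, not NetChain/NetGraph themselves):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The chain Tanh -> LogisticSigmoid is plain function composition:
def chain(x):
    return sigmoid(np.tanh(x))

# The two-input graph: Tanh of the first input plus LogisticSigmoid
# of the second, merged by a total (sum) node.
def graph(x1, x2):
    return np.tanh(x1) + sigmoid(x2)

print(chain(np.array([0.0])))                   # sigmoid(tanh(0)) = 0.5
print(graph(np.array([0.0]), np.array([0.0])))  # 0 + 0.5 = 0.5
```

The container abstraction buys you more than composition syntax, though: because each node is differentiable, gradients flow through the whole chain or graph automatically.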
So how does that happen? The basic idea is that we have another construct, and this construct is quite different from anything we've seen in other frameworks: the net encoder. Net encoders are special; they're almost like a layer, but they can't go inside a NetGraph or NetChain. They're not real layers: they're not differentiable, they don't work on all the devices, and so on. But they can be attached to the inputs of other layers, which we'll see now. What they do is take some type, like an image, an audio object, text, or any of the other types you might want to operate on with neural networks, and produce the appropriate tensor from it. You can see from this one: it says it must be a grayscale image, it has one color channel, there are no mean or variance images (which you could also set), and it has an output shape of this. If you give it an image, it produces this tensor. So it guarantees that the tensor it outputs is a 1x12x12 tensor: if the image were much bigger it would resize it, and if it were a different color space it would conform it. It can basically guarantee to the network: this is the kind of tensor you will be getting. Another nice thing about an encoder is that it can operate directly on files. If I have a file for some image, the encoder can run directly on it and give you a tensor, and this should be 1x12x12. Great. The reason this is powerful is that it lets you do out-of-core learning for images and audio: you can use the file itself as a data point to feed to training, so you don't have to load all the files into memory. You can keep them on disk, give just the file names, and NetTrain can act directly on them.
All the networks can act directly on just the file paths; file paths are completely valid inputs to these encoders. I was going to touch on this more later but decided to simplify things: if you're interested in training on large datasets, there's a tutorial on doing this, which I recommend you look at. And there's a whole bunch of other encoders, from audio, to characters, to tokens, and you can write your own custom encoder as well; it depends on what data type you're dealing with. As I mentioned, the main way you use these things is through the input port: instead of specifying a size for the input port, you can attach an encoder to it. If I do that, instead of reporting a size, it will tell you that the input is an image in this case. What this means is that the layer, in this case a pooling layer, can act directly on an image. If you try to do that without the encoder, it will just complain, because an image is not a type it accepts: if we give the pooling layer some image, we get "Data supplied to port Input was not a tensor of rank greater than 1." Okay, one more thing. Remember the networks we showed right at the beginning, for ImageNet: these had the magical property that, here's the NetChain, and it could be applied directly to the image; you just give the image to the network. The reason is that it has this image encoder on top. You can always extract the image encoder: you can use NetExtract on the input, and see what this image encoder is doing. It's resizing images to 299 by 299, it has a mean image, and it uses the RGB color space. That's why this network just works out of the box.
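To show the kind of guarantee an image encoder makes, here is a deliberately crude NumPy sketch: whatever the input image's size or color space, the output is a fixed-shape numeric tensor. The nearest-neighbor resize and channel averaging here are simplifying assumptions for illustration; a real encoder resamples and conforms color spaces properly.

```python
import numpy as np

def encode_grayscale(image, size=12):
    # Hypothetical toy encoder: any image array in -> (1, size, size) tensor out.
    img = np.asarray(image, dtype=float)
    if img.ndim == 3:                      # RGB -> grayscale by channel average
        img = img.mean(axis=-1)
    # Crude nearest-neighbor resize to size x size
    rows = np.linspace(0, img.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, size).astype(int)
    img = img[np.ix_(rows, cols)]
    return img[np.newaxis, :, :] / 255.0   # one channel, values scaled to [0, 1]

fake_image = np.random.randint(0, 256, size=(48, 64, 3))  # any size, any colors
print(encode_grayscale(fake_image).shape)                 # (1, 12, 12)
```

The point is the contract: downstream layers can rely on the shape (1, 12, 12) no matter what was fed in, which is what lets a network accept raw images or file paths directly.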
In other frameworks you often need extra preprocessing code to conform the image into the size the network requires. Okay, the analog of the net encoder, which is for the input, is the net decoder, for the output, and the reason for it is very simple. For a lot of networks, for example classification tasks, the network is always going to give you back a numeric tensor, but that's not really what you want: you want some sort of class. You want it to say whether this is a dog or a cat, for example. So you create this decoder, say it's a class decoder, and there are two classes: dog and cat. If I give it a probability vector, it will take the maximum probability, which is 0.9, and say this was a cat, because cat is the class that corresponds to 0.9. You can use this exactly like a classifier function: you can ask for different properties, like the probabilities, and it will give you an association with the probability for each class. In exactly the same way, you can attach a decoder to the output port of a network. And you can see now why these ports have names: you need ways of referring to them. If something has multiple outputs, how else would I specify which output to attach a decoder to? Okay, in this case there's a SoftmaxLayer, which takes an arbitrary vector and produces a normalized vector that sums to one. If we apply this, the softmax output with the decoder attached directly produces something like this, which is definitely not just a numeric array. The same thing happens with these networks.
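The softmax-then-decode step just described is small enough to sketch in NumPy (as with the other sketches, this illustrates the math, not the Wolfram NetDecoder itself):

```python
import numpy as np

def softmax(v):
    # Normalize an arbitrary real vector into probabilities summing to 1.
    e = np.exp(v - np.max(v))   # subtract the max for numerical stability
    return e / e.sum()

def decode_class(probabilities, classes):
    # What a class decoder does: pick the class with the highest probability.
    return classes[int(np.argmax(probabilities))]

probs = np.array([0.1, 0.9])
print(decode_class(probs, ["dog", "cat"]))   # -> cat
print(softmax(np.array([1.0, 2.0, 3.0])))    # entries sum to 1
```

Asking the decoder for "Probabilities" instead of the top class corresponds to returning the whole normalized vector, labeled by class, rather than its argmax.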
So things like this ImageNet network have a whole bunch of classes, and if you click on it you'll see that the output has a class decoder attached, which we can extract. Running the network then means that instead of giving you just a big splurge of probabilities, a vector of size 1001 in this case, you actually get the real classes out. And obviously, if you want the probabilities, you can ask for them here as well, though this might take some time. You can get the output decoder in this case, and the class labels and all that sort of thing, if you really want. So I hope people can see what the motivation for this is: it makes the networks very clean. This network is ready to go; it doesn't require lots of post-processing scripts and whatnot.

Okay, so now that we've discussed the main concepts of decoders, encoders, containers, and layers, we actually want to train something. These networks are not very interesting unless you can actually do some learning. So how do we do this? Well, there's a whole bunch of automation introduced by the framework, but I'm going to show you some examples where I don't make use of that automation and instead do things very explicitly, so that you can see exactly what's going on.

Here's a very simple example. We have some data which is pretty much a straight line: one goes to two, two goes to four, three goes to six, four goes to eight. And here's a very simple net to solve this problem. In this case we're predicting one number, a vector of size one. And here's where the magic happens: an input comes in and goes through the linear layer, which has parameters, and we want to find the parameters so that the prediction of this linear layer is the same as the target, so that the mean squared loss becomes very small.
So if we run it right now, with the network just randomly initialized, the loss is pretty big. That's what we'd expect, because these parameters haven't seen any of the data; they're just randomly picked. So currently the predictions of the linear layer don't match the target at all.

Okay, so now we can actually train it, and there's a very simple construct called NetTrain. NetTrain takes a network and some training data, and it will train. That happened very quickly; we can do it again. This plot shows what the loss looks like as a function of training iterations. It does many iterations, and you hope the curve goes down, because you want the loss to get smaller, which means your predictions and your targets match. And now we can check exactly the same thing: previously the loss was 15.31, and if we run this with the trained network we get something much smaller. So this network has now learned to predict the target.

We can also extract the parameters. We had a net with some weights, and we have a new, trained network, and we can compare the weights. For the trained net the weights are completely different, and that's the whole point of training: it changes the parameters.

Okay, just a quick rough idea of how this works. It uses something called gradient descent, and this is where differentiability comes in and why it's so important. Suppose this J here is the loss. It's a function of the weights, the parameters, of the network. For different parameter values this loss takes different values, as we saw, and somewhere there's going to be a minimum. In fact there are multiple minima.
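The whole training loop on the straight-line data above can be written out by hand in a few lines. This is only a sketch of the gradient-descent idea, not what NetTrain actually does internally; the learning rate and iteration count are arbitrary choices.

```python
# Gradient descent on the straight-line data from the example:
# model y = w * x, mean-squared loss J(w) = mean((w*x - y)^2).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.1                                   # "random" initialization
for _ in range(200):                      # training iterations
    # dJ/dw = mean(2 * (w*x - y) * x); step opposite the gradient
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= 0.05 * grad                      # learning rate 0.05

loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
# w ends up close to 2 and the loss close to 0
```

Just as in the talk's example, the initial loss is large because w is random, and after training the learned weight matches the slope of the data.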
So there's a whole landscape of minima. But the basic idea is this: suppose you're a blob sitting over here in parameter space. If your parameters put you at this point, then the gradient gives you the direction you must move to get closer to the minimum. That's what NetTrain basically does: it uses the differentiability of the net and moves you in the direction you need to go. This is very similar, as people might realize, to things like NMinimize or FindMinimum, except that it's vastly more efficient, for various reasons. So that's the basic idea of the training, and you don't have to worry too much; it's good to know what it's doing, but NetTrain automates a lot of that for you.

So now that we have this background, let's look at a real example. This is often considered a toy example, but it's maybe the least trivial toy example that actually does something nice. We get the data and look at some examples, and most of you have probably seen this sort of thing. One of the reasons for using this example as a starting point is that a lot of people have seen it, and the problem involved is familiar. The problem is very simple: you have some handwritten digits and you have the actual labels, and you want to learn to predict the correct label for unseen images.

So, what have we learned? Well, the first thing, and this is something I think the next speakers will emphasize a lot, is that it's very seldom that you will actually be putting together very complicated nets from scratch, with lots and lots of layers.
The most common situation is that the neural network community has spent a lot of money and time looking for good architectures for certain problems. So your first instinct should always be: let's go to the neural network repository and find something that does something similar to what we're doing. If it's image classification, there's a whole bunch of incredible networks that have won on ImageNet and will be a very good fit for what you're doing, and it's unlikely you'll ever find a better architecture yourself. So this should almost always be your first instinct: find an existing network for the task.

One other reason it's very nice to use the repository is that you're guaranteed the implementation is correct. It's very easy to read a paper and then make mistakes with the numbers. For a very small net you're probably not going to make mistakes, but for networks that have 500 or 1000 layers it's very easy to make small mistakes, and then the whole network doesn't work properly. So that should be your first instinct.

And in fact, for this particular task there's a very famous network called LeNet, which at some point supposedly read about 10% of America's checks. This was in the early 90s: Yann LeCun worked at, I think, Bell Labs, and they made one of the first industrial-grade neural network recognition machines, for check digit recognition. So it has a pristine history. This network does exactly what we want. It's a very simple network with convolution layers and pooling, and in the image talk I'm sure they will explain carefully what these layers do, why they're special, and why they're used for images. You could run this model immediately and you'd see that it works, but let's do it from scratch, using all the different things we've learned.
I also want to say: if you want to see the fastest way of doing this sort of task, there's a tutorial on MNIST digit classification, and it's very short, a very simple few lines of code that do this the simplest way. But the simplest way is not the most general way. The simplest way works for problems where there's basically one input and one prediction, but it doesn't work for anything else. So we're going to show the general approach that will work for any sort of problem, even though for this very, very simple problem it makes things look a lot harder than they are.

Okay, so the first thing we want is a decoder, because we want this thing to predict the classes, and the classes are going to be the digits between zero and nine. We'll also want an encoder which takes an image and makes sure it's 28 by 28 in size and gray-scale.

The next thing is to actually define the convolutional network. I'm not going to explain carefully the motivation for each piece. Maybe the only thing I'll mention is that convolutions are usually used for structures like sounds, like spectrograms, or images, which are large and where you don't want a blow-up of your parameters: a fully connected layer, a linear layer, gets incredibly large for images. If you have a 500 by 500 image and you need a matrix to multiply that, you just have a blow-up of parameters. There's a whole bunch of other reasons too, like certain symmetries or invariances that you hard-code into the network. For a classification problem, for example, you don't care where the object is in the image.
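The parameter "blow-up" argument above is easy to make concrete with a back-of-the-envelope count. The layer sizes here (100 hidden units, a 5-by-5 kernel with 20 feature maps) are illustrative choices, not the sizes of the LeNet in the talk.

```python
# Fully connected: a 500x500 gray-scale image flattened to 250,000 inputs,
# mapped to just 100 hidden units, already needs a 250,000 x 100 matrix.
h = w = 500
hidden = 100
dense_params = (h * w) * hidden + hidden     # weights + biases = 25,000,100

# Convolution: a single 5x5 kernel per feature map is shared across the
# whole image, so the count is tiny and independent of the image size.
kernel, maps = 5, 20
conv_params = kernel * kernel * maps + maps  # weights + biases = 520
```

That five-orders-of-magnitude gap, plus the built-in translation structure, is why convolutions are the default for images and spectrograms.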
And that invariance to shifting the object in the image is one of the reasons you have a pooling layer, for example: you can argue that a pooling layer makes the network approximately invariant to translations of an object in the image. This is a general principle in neural networks: if you know there's some structure in your input, you want to use layers appropriate to exploiting that structure, and it makes learning a lot more efficient. But this is a longer topic that maybe we can do some other time, or in the questions.

Okay, but basically there's a whole bunch of convolution layers and a nonlinearity called a Ramp layer, and this is a very important piece. If you plotted it from, say, minus 3 to 3, it's flat and then it goes upward like this. This is also called a ReLU, or rectified linear unit. You'll see them between things like linear layers or convolutions, and they break the linearity: each of those operations is a linear operation, so you insert a bit of nonlinear behavior, and that helps the network learn much better.

Okay, so we've now created this thing. Now, this is where things become more complicated than they need to be for this very simple example, but I'm going to show how to do it for the general case. Here I'm actually constructing the training net. For these simple examples you can feed the net straight into NetTrain, and it will construct the whole training graph itself and then give you back the original network you had. But that doesn't work for all networks, especially when there are multiple inputs or outputs. So what you can do is, in this case, here's LeNet; we're just taking the initialized net, so the input goes to that.
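The Ramp/ReLU nonlinearity described above is one line of code, and it is worth seeing why it matters: without it, stacking linear layers would still be a single linear map. A quick sketch:

```python
def relu(x):
    """Rectified linear unit: flat at 0 for negative inputs, identity above.
    This is the "flat, then goes upward" shape plotted in the talk."""
    return max(0.0, x)

# Flat on the negative side, linear on the positive side:
values = [relu(x) for x in (-3, -1, 0, 2, 3)]

# Two linear maps composed (2x then 3x) collapse to one linear map (6x);
# inserting relu between them is what breaks that collapse.
linear_only = 3 * (2 * -1)          # -6: still linear overall
with_relu   = 3 * relu(2 * -1)      # 0: the nonlinearity changed the result
```

Placing one of these between every pair of linear or convolution layers is what gives the stack its expressive power.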
Then the output goes to a loss, which we explicitly add here, and this is a CrossEntropyLossLayer. Cross-entropy loss is usually what you use for classification. This is one of the things you have to learn as well: which losses are appropriate for different tasks. And it's a very good loss when you're comparing two probability distributions.

We can evaluate this on some training examples and you'll see that the loss is quite large. So we have this training net, and we want to find the parameters in the LeNet so that the prediction it makes is similar to the target, and the loss gives you a number that will be small when that happens.

There's one more thing, which is that this net now has two ports: a target port and an input port. The most general way of representing data when there are multiple ports is the association form. Basically, you have one port with a bunch of examples, and another port with a bunch of examples. For training, this is the most general form: some networks might have five of these ports, and this way you can easily represent that data.

Okay, now let's train this thing. This is the more complicated version of the syntax I used previously, but let's start it. The first thing to note is that I'm going to use a GPU. Not everyone can use a GPU; you can only use it if you have an Nvidia graphics card, unfortunately. While it's running, let's explain what all the different pieces are. (And sorry, this display looks a bit strange for presentations; there's some bug that makes it look small.) Okay, so it's busy training. Let's look at all the different pieces while it's doing that. There's the training net, the training data association, and a validation set. Now, this is very important.
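Both ideas in this passage, the cross-entropy loss and the association form of the training data, can be sketched in a few lines. The port names and the tiny fake data set are illustrative, not the actual MNIST setup.

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy for one example: -log(probability assigned to the
    true class). Small when the prediction matches the target."""
    return -math.log(probs[target_index])

# Confident, correct prediction -> small loss; confident, wrong -> large.
good = cross_entropy([0.05, 0.90, 0.05], target_index=1)
bad  = cross_entropy([0.90, 0.05, 0.05], target_index=1)

# The "association form" of training data: one key per port, each with a
# list of examples (here: four flattened 28x28 dummy images and labels).
train_data = {
    "Input":  [[0.0] * 784 for _ in range(4)],
    "Target": [3, 1, 4, 1],
}
```

The per-port dictionary shape is what generalizes: a net with five ports just gets five keys, each holding one list of examples.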
So when you use things like Classify, a lot of things are automated for you to prevent problems like over-fitting. Over-fitting means you do very well on your training data but terribly on any other data. And there's a very good tutorial on over-fitting, if you're really interested in that, called Training Neural Networks with Regularization.

So basically, with this validation set, the trainer is not allowed to use this data for training, but it checks against it, as you can see here: in this loss curve, the blue is the validation performance. The network is getting better and better on the training set, but it's not getting better on the validation set. What this will do is take the network that did best on the validation set and return that to the user. For example, this graph might show the validation loss going down, down, down and then suddenly going up again; that's the classic signature of over-fitting. In this case it doesn't look like it's over-fitted too much; it looks quite fine. But if you provide a validation set, this will protect you quite well from over-fitting, because it will just take the point where it did best on the validation set, where it wasn't over-fitting yet. And that's probably the most recommended thing for people to do to save themselves from that problem, because NetTrain doesn't automatically deal with over-fitting the way Classify and Predict might.

Okay, some other things: the number of training rounds means how many times it goes through the entire data set before it stops. And you can stop the training at any point; we can actually run this one more time. If I wanted to stop it now, I could just say stop and it would return the training result for me. Okay, that was maybe not the greatest idea, because now I have to retrain it for a little bit at least.
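Returning "the net that did best on the validation set" is just early stopping by checkpoint selection, which is simple to sketch. The loss numbers below are made up purely to show the classic U-shaped validation curve described above.

```python
# One validation-loss reading per training round; the U shape means the
# model starts over-fitting after round 3.
validation_loss = [0.90, 0.55, 0.40, 0.35, 0.38, 0.45, 0.60]

# Keep a checkpoint per round, return the one with the lowest validation
# loss -- everything after it was over-fitting, not learning.
best_round = min(range(len(validation_loss)),
                 key=validation_loss.__getitem__)
```

With this made-up curve, `best_round` is 3: the checkpoint just before the validation loss turns upward is the one worth keeping.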
And the last argument here, All, just means it returns a NetTrain results object. Normally, if you don't give this argument, it returns the actual trained network. In this case the trained network is inside the object, but you get back all the other training information as well, things like what the validation loss was, and we'll see that now. So, this is almost done. This will take a lot longer on a CPU, unfortunately. Okay, let's stop it now.

Okay, so we have this network that's been trained, with the results object. From the results object there's a whole bunch of properties you can get. You can get a nice plot of the loss for the training set, which is the orange curve, and for the validation set, which is the blue. And we can see that it hasn't over-fitted too much, because the validation loss is still going down. We can also get the network out: results["TrainedNet"] will give you back the NetGraph.

What we really want, though, is the prediction network. We want to get this NetGraph out, but inside it there's a node, LeNet, which is the prediction graph; we can discard the rest of the network because we're not doing training anymore. There's a little bit of annoyance here, which is that the encoders disappear: if you extract this node, the output decoder isn't there and the input encoder isn't there, so you have to put them back, which is pretty simple with NetReplacePart. And we can now use this trained net to actually make some predictions. Let's say this is a two, and we can get some probabilities out. It's pretty confident that this is a three, but it's a bit unsure.
As you can see, the digit has been cut off a bit, so it's good that the network is not 100% confident. And you can do things like obtain the accuracy, which comes to about 98.6%. For a classification task it's very simple to get metrics out, because you can just use ClassifierMeasurements, and the same goes for Predict via PredictorMeasurements if you're doing regression.

Okay, and, oh, this was a duplicate, oops. You can do the same thing with Classify as well. You can try to run Classify, and you can see how much simpler Classify is than this. Classify is fully automated: it does all the preprocessing, it does everything for you, basically. All of this rigmarole that we went through, you can achieve something similar with Classify. Except that Classify has a lot of limitations: it's not as flexible, you can't do out-of-core learning, and so on. But obviously, if your only problem is simple image classification, then something like Classify might be appropriate. Okay, let's just stop it, because we don't have that much time. We didn't let it train very long, but it got that far.

Okay, so to summarize quickly: I've given a very basic introduction to the framework and its different pieces, and I've given lots of links to more advanced things, like out-of-core training. There are all kinds of other ways of dealing with over-fitting, things like Dropout or other kinds of regularization; those are things you'll probably have to look at yourself.

Okay, and the last thing I want to leave you with, before I hand over to Talie, is a quick rundown of this framework. First of all, it's an industrial-strength, scalable framework. We use all of the latest Nvidia libraries, we use MXNet as a backend, and we contribute to MXNet as well.
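The accuracy number above is the simplest classification metric, and it helps to see that it is nothing more than the fraction of matching predictions; this is a sketch of the idea, not the ClassifierMeasurements implementation.

```python
def accuracy(predictions, targets):
    """Fraction of examples where the predicted class equals the target,
    i.e. the kind of number reported as "Accuracy" above."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Four of these five (made-up) digit predictions match their labels:
score = accuracy([3, 1, 4, 1, 5], [3, 1, 4, 1, 9])   # -> 0.8
```

A 98.6% accuracy on the digit task just means roughly 14 mistakes per 1000 test images under this definition.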
MXNet is backed by a whole bunch of companies, from Amazon, to Apple, to others, and users automatically get all the benefits of the improvements made to MXNet. We also use this framework extensively at Wolfram Research: we're developing a whole bunch of functionality based on neural networks, from speech recognition, to things you've already seen like question answering, to ImageIdentify, and image contents is coming. There's a list of something like 70 different functions that we want to build with this framework. So it's not something that we ourselves don't use; we use it very extensively.

We also plan to have the best repository of networks, which is really a lot better than any other framework's repository. We're putting a lot of effort into collecting nets from other frameworks, so we convert them for you: we convert the preprocessing into Mathematica code if there's preprocessing involved, and we take on all the hassle of converting, and converting is a tricky business, unfortunately a lot harder than it needs to be right now. Also, the networks that we train ourselves are going to go into this repository, so the nets behind those hopefully-70 functions on our list will be available to users.

One thing I think there have been comments about as well: there's no lock-in. We thought it was a very important principle that users of the framework are not forced to have their stuff only in Mathematica, because you might want to deploy on TensorFlow Serving or who knows what; there are a lot of options you might have, and we definitely don't want to lock you in. So we already support import and export to MXNet. I should have put in the link for that, but you can look in the documentation.
We also have plans to support what's sort of the main candidate for a completely cross-platform format, ONNX, which was started by the PyTorch, Caffe2, and Microsoft CNTK framework people but has now been adopted by MXNet, Chainer, and a bunch of others. We have people working to support both import and export for that as well.

And we've tried to make ourselves easier to use than other frameworks. This is the selling point, I guess, that we want to have: we want to be as simple to use as possible. For certain things, which you might see in the next talks, like true variable-length sequence support, these are things that are very hard to do in other frameworks like Keras, where you have to do masking and other ugly things. And things like the encoders just make it very simple, because everything just runs; there are no extra Python packages you need in order to process your audio or images. Everything is just part of the language. And it works out of the box on all the different platforms, from the Raspberry Pi to all the other major platforms.

One thing I didn't mention is that we support quite simple cloud deployment: there's APIFunction and CloudDeploy, which allow you to deploy these networks onto the Wolfram Cloud and make APIs with them. Okay, cool, I think that's basically the end of my talk.

Video Details

Duration: 1 hour, 4 minutes and 5 seconds
Language: English
License: Dotsub - Standard License
Genre: None
Views: 9
Posted by: wolfram on Apr 2, 2019

Machine Learning Webinar Series
