
Machine Learning Webinar Series

This'll basically be an introduction to our framework. It'll motivate why neural networks are interesting. So, roughly, we're going to spend a little bit of time just looking at what the industry looks like for neural networks. What applications are out there? Why are they interesting? Why should people care about them? And then have a simple introduction to what neural networks are. So, we're going to assume that people here have not necessarily heard about neural networks, and then we're going to show a full example of training a handwriting classifier, a digit classifier. And that will show off all the different pieces that you've learnt. And this really will be the foundation of all the different pieces that you need to understand properly in order to do your own work in neural networks. And the principles are very simple. So, we're going to flesh this out a bit and take this a bit slowly. OK. So, there's a prelude to this whole thing. So, deep learning and neural networks— at least neural networks— are an idea that goes back to about the 60s. Lots of the main ideas were developed in the 60s and 70s, things like back-propagation for training the networks, but they fell out of fashion for a long time. And it was really in 2012 when they came back with a vengeance. So, what exactly happened? So, here's a very simple problem for humans, but for programmers, it's a very hard problem. So, the problem is the following: you have a set of images— things like different breeds of cats, other animals, kettles and pots and all kinds of things— and you have 1.5 million training images. There are about a thousand categories of these things. And the problem is: given the training images, you create some classifier, and then you get to test it out on some unseen images. 
And the question is: how good a classifier can you make to do that? So, it turns out to be an incredibly hard problem. There's a very famous dataset called ImageNet which is exactly this. So, there are 1000 categories and 1.5 million training images. And then, the ImageNet organizers have some unseen test images that competitors in this competition, which happens once a year, never get to see. They just get to submit their algorithm and then it gets automatically graded. So, this competition ran for a number of years, from about 2010, and there were all kinds of approaches that people used— we'll look at some of them now— but basically, this is the sort of error rate that people were getting. So, even the best methods were always getting above 25%. Even in 2011, the best method was roughly at about 25%. And in 2012, there was a group led by Geoffrey Hinton and Alex Krizhevsky that was basically the first group to use neural networks on this competition. And they had something like an eight-layer network, and this thing, they used GPUs to train. So, it's a massive dataset and previously, neural networks were just too slow to train on something as big as this, especially ones with lots of layers. So, they were able to train this massive net, and this was their result. They had something like a 15% error rate. And the entire rest of the field are these little dots over here. So, they suddenly had something completely different— you can see how clustered the top results are here. So, the previous methods were variations on a theme, but really, this method was completely different, and it destroyed the rest of the competition. And what's happened since then is that every year, these neural networks have gotten better and better, and basically almost all competitors started using neural networks in these years over here. So yeah, by 2017, the error rate was less than 5% on this dataset. 
OK, so, this was basically the catalyst. As soon as you have a massive competition like this, where you can't really cheat because the organizers have the unseen data, the entire research community has to pay massive attention to it, and that's what they did. So, the non-deep-learning approaches were things like careful feature engineering. So, you basically want to find some sort of feature of the image that you can feed to things like support vector machines. And these are things that basically maxed out at about a 25% error rate. The deep learning approach, by contrast, uses the pixels directly. This is zero feature engineering. It basically frees the researchers to look for better networks that are generally good, so they don't have to spend all their time trying to find better features, which are very complicated things that require massive engineering effort. But also, it lets the entire system learn end-to-end. In the hand-engineered case, the features that you're getting are chosen by other principles about what a good feature might be. They're not learnt together with the actual problem that you're trying to solve. They're just general descriptors that you try to construct. So, the neural network approach was revolutionary in that as well, in that it freed people from having to do this sort of engineering, and every piece of the problem gets to be optimized in the optimization procedure, which is one of the reasons why you get amazing performance. So, many of these models trained on ImageNet can actually be seen. We have an unreleased neural network repository which I would encourage people to look at. It will be released quite soon. But you can basically get access to some of these models. For example, this was one of the, I think, winning models from 2016, maybe? 
And this thing can be easily evaluated, and it correctly predicts that there's a peacock. And in Wolfram Language, we have things like ImageIdentify that are basically built on the same technology, and in Mathematica you can also click and see exactly what's inside this network and then delve deeper into each of these layers. OK, so, this was the main catalyst event that caused a Cambrian explosion of different networks and different tasks that these networks were suddenly put to use for. Most people in the field think that this was the catalytic event. And as you can see, for example, from Google Trends: basically, in 2012 it was a flat line. There was not much happening if you look at the term Deep Learning, for example. And then after 2012 it went up dramatically and it keeps going up. So, the other thing that happened was that people realized that these neural nets weren't just good at classifying dogs and cats and whatnot. They were good at all kinds of learning tasks, from language translation to speech recognition, etc. So, I'm just going to have a look at a few of these quickly because it's interesting to know roughly what these things can do, just to have an idea of where this field is currently at. And I don't want people to leave with the idea that all these networks can do is classify images. That's absolutely not the case. They are used in a wide variety of things. So, for example, Google Translate switched to using neural nets in 2016. They had a different kind of system before that. And it had major performance improvements. So, from a <i>New York Times</i> article: "As dawn broke over Tokyo, Google Translate was the No. 1 trend on Japanese Twitter, just above some cult anime series and the long-awaited new single from a girl-idol supergroup. Everyone wondered: how had Google Translate become so uncannily artful?" 
And you can actually use this in Mathematica as well by just using TextTranslation, and this calls the Google neural translation API. Oh, OK. There we go. OK, so, other things. Autonomous vehicles. So, one of the hottest things at the moment is self-driving cars and the massive rush for all the different companies, like Uber and Waymo and whatnot, to construct the first commercially available autonomous vehicle. And key components of these vehicles are neural networks. So, the neural networks do all kinds of image processing. So, we actually have one of them in the neural network repository, and you can click on this link. I didn't actually put the code in here because it requires a bit of extra preprocessing code that is a bit ugly. But you can look at this link to see the full thing and run it in Mathematica. But basically, the idea is that, for example, we can take a scene and the neural network can tell you, "OK, these pixels here correspond to people," so, it sees that there are people walking around, exactly where they are, where their legs are, etc., and it can obviously say, "There's other pieces here that are the road, the sky is up here, there's some buildings," etc. So basically, neural networks are used to get an idea of the world around the car. Other things: speech recognition. So basically, all the modern speech recognition systems that you might find on your phone, or anywhere else, will use deep learning now. So, we actually have an example of that in the repository, which, again, you can click on, which is a very good speech recognition system, trained by Baidu on about 8000 hours of English. On to other things, like question answering. So, we have a new function in 11.3 called FindTextualAnswer. And this was trained with the Wolfram Neural Network Framework. 
And there's a nice blog post that you guys can look at from Jerome, from our group, about exactly how that was done. So yeah, I mean, here's a simple example. You can give it a question, and there's some big paragraph of stuff. So, it correctly extracted that it's one meter— the size of the dodo— from this text. And basically, for this sort of task, every method is a neural network method. So, it's basically the only game in town for this sort of thing. OK, so, rapidly on to other things, like reinforcement learning. So, AlphaGo, people might have heard of this. So, Go is a game that was generally considered an almost unsolvable game and at least a decade away from having any sort of reasonable solution. And DeepMind shocked the community by releasing a Go player that beat one of the previous world champions, Lee Sedol. Quickly after that, there's a new version of this, called AlphaGo Zero, which completely learns from scratch, and it's a lot better than even AlphaGo. So, this—yeah, one of the key components was a neural net. So, on to other random things. So, things that you might not even think are really learning tasks. So, things like restyling images. So, I might say I want to— here's a content image, I want to restyle that into the style of this thing, for example. So, I can run this. And here's the resulting image. And you can do this with, you know, all kinds of different images. And this is the foundation of ImageRestyle, which is a newish function in Mathematica. So, one of the trends that I wanted you to see is that for internal functionality, we are using our own framework to do all of these things, like ImageRestyle and the question answering, to all kinds of other things that are coming up, from speech recognition to translation. So, this framework has to be powerful enough for us to actually train these things ourselves, and so users basically get a very good framework. OK, another random example is, let's say, colorization. 
So, in this case, you want to say, "Here's a black and white image. Please colorize that, or reconstruct what a plausible color scheme looks like for that." So, we can try this, and there's the new image. So, the network obviously has to recognize, "This is grass, this is probably a dog, and dogs come in certain colors when they look like this," etc. OK, so, it could go on and on. There's a lot of things that these things can do, from reconstructing 3D faces from 2D faces, to geolocating an image based on just the image and a massive training set that it was trained on, to all kinds of other things, like medical imaging, even. I think it was today or yesterday that the FDA— I think it's the FDA, the medical regulator in the US— approved a diabetic retinopathy detector that's based on deep learning for medical use. So, medical imaging is a massive, massive field. There's all kinds of NLP stuff, natural language processing, from sentiment analysis to all kinds of other things. So, you should really go to the site, and you can find a bunch of networks that are currently runnable in Mathematica. OK, so, we've now seen, and hopefully motivated somewhat, why people should be interested in this sort of thing: these networks can be applied to all these different problem domains that previously didn't have really good solutions. So, what are neural networks? Let's spend a bit of time thinking about this question. So, the modern term for neural network is actually 'differentiable programming', and we'll see why that is quite soon. And also, just so that people know, there's a very nice Introduction to Neural Nets tutorial in the Mathematica documentation. There's a link for that there. And some of this is based on that. OK. So, the most basic building block of a neural network is a layer, and other frameworks might call this an operator. So, a layer, or an operator, is some basic piece of computation that acts on numeric tensors. 
So, let's have a very simple example here. So, there's something called ElementwiseLayer which allows you to have layers for all sorts of elementwise functions, like Tanh, and Clip, and all kinds of things of that sort. One property of these things is that they can only act on numeric tensors. So, here's an example. So, with ElementwiseLayer, we can apply this to some input, and this is the same as the equivalent expression in top level. And they both give the same result. So, here's where the differentiable programming idea comes in. So, the idea is that you want to have layers that are differentiable, and the reason for that is to be able to optimize and learn what the parameters are in these networks efficiently. So, the idea is that if every piece is differentiable, then if you stick them all together, the whole remains differentiable and it will be very efficient to optimize. So, there's a way of differentiating these layers with NetPortGradient. So, you can give it some data, and you say, "I want to get the gradient of the port input." And what is a port? Well, these layers— if you click on one, you can expand it— have a number of ports. So, there's an input port and an output port. And the reason for this is that you want some way of being able to specify unambiguously how to connect up, let's say, the output of this thing to something else. In this case, it's trivial— there's just one input and output— but many layers can have multiple inputs or multiple outputs. So, you can specify, in this case, explicitly, "I want the gradient for the input port." So, we can do that, and that's exactly equivalent to differentiating symbolically and then replacing that symbolic variable with some inputs. OK, so, there are some other properties that these layers have. 
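The ElementwiseLayer and NetPortGradient behavior described above can be sketched like this (a minimal sketch; the variable names are mine, not from the talk):

```wolfram
(* An elementwise Tanh layer: same result as applying Tanh in top level *)
layer = ElementwiseLayer[Tanh];
layer[{0.2, 0.5}]   (* same as Tanh /@ {0.2, 0.5} *)

(* Gradient of the output with respect to the "Input" port *)
layer[{0.2, 0.5}, NetPortGradient["Input"]]

(* Equivalent to evaluating the symbolic derivative at the same point *)
D[Tanh[x], x] /. x -> 0.2
```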
So, one difference between these layers and other Mathematica functions is that they can operate transparently on GPUs. It's become imperative to train on GPUs— CPUs are just too slow. So, this can be either CPU or GPU, for example. And if you have multiple GPUs, you can specify which GPU you want to run this thing on. If I had four GPUs, then I could do something like that— or three GPUs. OK, so, everything works on GPU, so any combination of these things will also work on GPU. Then some other things. One is this idea of shape inference. So, if you tell a layer what the input size is going to be— I'm telling it that, in this case, it's a matrix of 4 rows and 32 columns— then it will automatically infer what the output shape is going to be. And we'll see a bit later why this is important and useful. OK, so, one of the most important differences, though, between normal functions in Mathematica and layers, is that layers can have learnable parameters. If that was all a layer was doing— just another version of Tanh or whatever— there's no way that it could learn anything, because it's just a function that you're constructing explicitly. But the magic happens when you have learnable parameters. So, layers that have learnable parameters will often look like this: there will be something uninitialized. If you click on it, there are the ports, the input and the output, and again— if I told it what the input was and I tell it that this thing has, let's say, three neurons— then it knows exactly what the output is going to be, but it also has these parameters here: weights and biases. So, these are things that you want to learn, and they can adapt based on the data. 
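The shape inference and the uninitialized learnable layer just described might look like this (a sketch; the sizes are illustrative):

```wolfram
(* Shape inference: telling the layer its input is a 4x32 matrix
   lets it infer that the output is also 4x32 *)
ElementwiseLayer[Tanh, "Input" -> {4, 32}]

(* A learnable layer: a LinearLayer with 3 neurons and input size 2.
   Until initialized or trained, its weights have no values. *)
uninit = LinearLayer[3, "Input" -> 2];
NetExtract[uninit, "Weights"]   (* Automatic: not yet set *)
```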
So, in the beginning, they don't have any values— they're just not defined, and if you try to extract the values then it will just say Automatic. But there's a function called NetInitialize which allows you to initialize these weights and biases to some random values. And that's the general philosophy with neural networks: before training, you need to randomly initialize things like the weights to something that's not zero. If everything is zero then it has huge difficulties learning. And there's actually a very complicated procedure for deciding what the scale of the random data that you're initializing these things to should be. And all of this sort of thing, our framework completely automates, so there are no decisions you need to make. OK, and now, you can say NetExtract, and I want to have this dot layer here. I've initialized it, so if I look at the weights, there actually are some weights now, and I could do the same with the biases. OK, and obviously it goes without saying that I could now evaluate this thing as well. So, if I was to apply dot2 to, let's say, a vector of size 2— which is what this thing takes— it will now evaluate, but dot1, if I try to evaluate it on that, will say it cannot evaluate the net because I haven't specified the value for the weights. OK, so, all of these examples so far are things with one input port. One classic example of something that has multiple input ports is a loss function. A loss function is something very simple. It takes an input and a target, and it always outputs a single number, which is very low if the target and the input are very similar, and very high if they're very far apart. In this case, it's basically the Euclidean distance between two vectors, for example, if these were vectors that we were putting in. So, here's a very simple example. 
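Concretely, such a two-port loss layer can be sketched like this (a minimal sketch using MeanSquaredLossLayer, which is one such loss):

```wolfram
(* A loss layer has two input ports, "Input" and "Target", and outputs
   one number: small when they agree, large when they differ *)
loss = MeanSquaredLossLayer[];
loss[<|"Input" -> {1., 2., 3.}, "Target" -> {1., 2., 3.}|>]    (* 0. *)
loss[<|"Input" -> {1., 2., 3.}, "Target" -> {1., 2., 99.}|>]   (* much larger *)
```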
So, I have an input and a target, and they're quite different. If I make the target very different from the input— say this is 4 and this is 1, and I make this 99— this number goes up. But if I made this 1, this number is now smaller. So, this number gives the network an idea of how close it is to the truth, and this is the key part that is used for training the network. There's also a way you can list what layers are actually available to you in the system, and each of these obviously has a lot of documentation, and they all have different behavior. So, some of them are very similar to what Mathematica would do. Things like AppendLayer: it does basically the same thing as Append. Or DotLayer: it does the same thing as Mathematica's Dot. But then there's a whole bunch of layers that have no analogs in top-level Mathematica. Things like InstanceNormalizationLayer, which does some sort of normalization of the data, or SpatialTransformationLayer and whatnot. OK, and becoming proficient in using neural networks involves understanding what all these different layers do, and when they're appropriate to use. And that's going to be a larger topic that will be carried on in the next sessions, I think. OK, so, let's look at at least one of them: what does the linear layer do? This is the simplest learnable layer, and it's also one of the oldest layer types. And I must mention that these layers keep growing in number. One of the reasons is that people invent new layers all the time, and as they invent them, we want to implement them and have them in our framework. OK, so, the simplest learnable layer is just the linear layer. So, the Mathematica equivalent of that will be this: you're just dot-producting the weight and the data and then adding some bias term. So, the data is some sort of vector, and the weight is a matrix. 
So basically, we can verify that this layer is actually doing the same thing by extracting the actual weights. Oh dear, what's going on there? Oh, sorry— I didn't initialize that. There we go. OK, so, the linear layer is often called the fully-connected layer, and here's why: say there's an input vector of length 3— so, 1, 2, 3— which is what this input here might look like. The weight matrix that you've got can be interpreted as connections between these input neurons and the neurons of the output. And there's a weight per connection, so the weight matrix basically represents that. And all that it is is a linear transformation of the data. And linear transformations can do things like rotate the vector, rescale it, shift it a bit, etc. So, whatever a linear transformation can do, this layer can learn to do, if it's good for producing the predicted result. OK, but this is a big topic, and things like what convolution layers do, what pooling layers do, or recurrent layers— these are going to be topics that will be very deeply explored, I think, in the next talks, so I'm not going to go into that too much. OK, so, layers I think we've understood now, right? Layers act on numeric tensors and they have a whole bunch of properties: they're differentiable, they work on all different devices, etc. But layers by themselves are not very useful. Just that linear layer can't really learn much, and in fact, a single linear layer is the same as logistic regression if you use it to solve a classification problem, or linear regression if you're predicting a numerical value with it. And those are very weak learning systems. 
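The equivalence just described can be checked directly (a sketch; the names are mine, and Normal is used because extracted weights may come back as numeric arrays):

```wolfram
(* Check that a LinearLayer computes w.x + b, per the description above *)
linear = NetInitialize[LinearLayer[3, "Input" -> 2]];
w = Normal@NetExtract[linear, "Weights"];
b = Normal@NetExtract[linear, "Biases"];
Chop[linear[{1., 2.}] - (w . {1., 2.} + b)]   (* ~ {0, 0, 0} *)
```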
So, one of the main ideas of neural networks, or deep learning, is that you want to have a long set of transformations, a deep set of transformations, that allow much more complicated behavior to be learnt than single layers allow. So, the first container type is a chain. This is very simple: a chain is simply a linear sequence of operations. So, it has one input, and we can try this out. So, if we look at this thing here— this thing has one input, it has one output, and it's doing first Tanh and then LogisticSigmoid on the data. And this is exactly the same as this, for example, in top level. And you can also verify that they are the same. OK, this is a bit of a silly example because, in this case, there would be an equivalent single layer— ElementwiseLayer is a very powerful layer. It can have, for example, something like this: LogisticSigmoid composed with Tanh, and that's a perfectly valid layer. You can compose all the different primitives that an ElementwiseLayer has into a pure function, and that's a perfectly valid ElementwiseLayer. So, that would be the better way of writing this thing, but this is just to show what a chain actually does. OK, so, a lot of networks, and traditionally almost all networks, have been just chains. The traditional image classification networks, even the one that won ImageNet, can all be represented by this chain construct. That's all they need. However, what happened after the 2012 ImageNet result was that people realized that there's a lot of power in having more complicated structures, and one of them is the graph. So, a very simple graph would be something like this. You have two different inputs: the first input goes into Tanh, the second one goes into LogisticSigmoid, and then they get added up together and returned. 
So, with this graph, the idea is that you can first list the layers here— and this can be a list or an association. If it's a list, it means you basically have all the layers that you want, and then the second argument of NetGraph is a list that gives the connectivity structure, exactly like how Graph works. So, in this case, I'm saying that the first port of the input is going to the first layer, Tanh, and the second port of the input goes to the second element in here, this one here. And then both of these go to the TotalLayer, which totals them up. And you can do the same thing with an association and with names instead. Some people find that much easier. So, Tanh goes to that, and then you'd have this stuff here, for example, changing to the name for Tanh. I personally prefer that method as well— and this must obviously have been an association, and these ones must be changed too. OK, so, the question then is: what is this equivalent to? And in this case, it's just function composition, like this. OK, so, you can build very complicated structures with NetGraph. You can also differentiate it: because all of the layers are differentiable, so will the NetGraph be, and this is an important point— containers behave exactly like normal layers. So, they're differentiable, they run on GPUs, but they can also be nested. So, NetChain expects a layer over here, but a NetChain is a layer as well. So, this is perfectly valid syntax, and if you dig deep into this thing, you can find the different layers of the NetChain. OK, and obviously the models in the repository are almost all some form of container. So, an example of this would be this restyling net. This thing has two different inputs, the style image and the content image, and it does some processing, and you can dig deeper and click on all these things and see what it looks like. 
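The two container types discussed above can be sketched as follows (a minimal sketch; the layer names in the association are mine):

```wolfram
(* A chain: one input, layers applied in sequence *)
chain = NetChain[{ElementwiseLayer[Tanh], ElementwiseLayer[LogisticSigmoid]}];
chain[{0.5}]   (* same as LogisticSigmoid[Tanh[{0.5}]] *)

(* A graph with two inputs, using an association of named layers
   plus a list describing the connectivity *)
graph = NetGraph[
   <|"tanh" -> ElementwiseLayer[Tanh],
     "sigmoid" -> ElementwiseLayer[LogisticSigmoid],
     "sum" -> TotalLayer[]|>,
   {NetPort["Input1"] -> "tanh",
    NetPort["Input2"] -> "sigmoid",
    {"tanh", "sigmoid"} -> "sum"}];
graph[<|"Input1" -> {0.1}, "Input2" -> {0.2}|>]
```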
OK, so containers are important for building more complicated structures. But containers and layers still only ever operate on numerical tensors. But what you might have noticed right at the beginning was that we were acting on actual images: we would take a net model, this thing, and it would apply directly to the image. So, how does that happen? The basic idea is that we have another construct, and this construct is quite different from anything that we've seen in other frameworks. So, the construct is the idea of a NetEncoder. NetEncoders are special— almost like a layer, but they can't go inside a NetGraph or NetChain. They're not real layers: they're not differentiable, they don't work on all the devices, etc. But they can be attached to the inputs of other layers, which we'll see now. So basically, what they do is they take some type— an image or an audio object, or text, or any of the other types that you might want to operate on with neural networks— and they produce the appropriate tensor from that thing. And you can see from this. So, this says: OK, it must be a grayscale image, it has one color channel, there are no mean or variance images (which you could set as well), and it's got an output shape of this. So, if you give it an image, it produces this tensor. So, it makes a guarantee that the tensor that this thing outputs is a 1×12×12 tensor. And if this image was much bigger, it would just resize it; if it was a different color space, it would conform it. So, it can basically guarantee to the network, "This is the kind of tensor that you'll be getting." And another cool thing that you can do with an encoder is that it can operate directly on files as well. So, if I have a file for some image, then it can directly run this and give you a tensor, and this should be 1×12×12. 1×12×12, great. 
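The encoder behavior described above might be sketched like this (a sketch; RandomImage stands in for a real photo):

```wolfram
(* An image encoder: guarantees a 1x12x12 tensor whatever the input image *)
enc = NetEncoder[{"Image", {12, 12}, ColorSpace -> "Grayscale"}];
Dimensions@enc[RandomImage[1, {100, 100}]]   (* {1, 12, 12} *)

(* Attached to a layer's input port, the layer then accepts images directly *)
pool = PoolingLayer[2, "Input" -> enc];
pool[RandomImage[1, {100, 100}]]
```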
And the reason why this is powerful is that you can do out-of-core learning for images and audio files with this, because you can use the file as an actual data point to feed to the training, for example. So, you don't actually have to load all the files into memory. You can just have them on disc and then just give the file names, and then NetTrain can act directly on them. All the networks can act directly on just the file paths. So, file paths are completely valid inputs to these encoders. And if you're more interested in this— I was going to touch more on this later, but I decided to simplify things— if you are interested in training on large datasets, then there's a tutorial on doing this, which I recommend you look at. And there's a whole bunch of other encoders, from audio, to characters, to tokens, and so on. You can write your own custom encoder as well. So, it depends on what data type you are dealing with. OK, and as I mentioned, the main property of these things is that— remember the input port? You can specify what the size is for the input port, but instead of doing that, you can attach an encoder to it. And if I do that, instead of saying what the size is, it will tell you that the input is an image, in this case. And what this means is that you can act directly— in this case, it's a pooling layer— you can act directly on an image. If you try to do that without this, it will just complain, because it's not a type that it accepts. So, if we did this— PoolingLayer— and we said, "Let's give it some image": "Data supplied to port "input" was not a tensor of rank ≥ 1." OK, and one more thing. If you remember the networks that we showed right in the beginning, for ImageNet, these things had the magical property that— here's the NetChain— they could be directly applied to the image. Just give the image to the network. And the reason for that is that it has this image encoder on top. 
And you can always extract the image encoder. You can say NetExtract of the input, and you can see what this image encoder is doing. It's resizing images to 299×299, it has some mean image, and it's RGB color space. So that's why this network just works out of the box. For other frameworks, you often have extra preprocessing code that has to conform the image to the size that the network requires. OK. The analog of the NetEncoder, which handles the input, is the NetDecoder, for the output, and the reason for this is very simple. For a lot of networks— for example, classification tasks— the network was always going to give you back a numeric tensor, but that's not really what you want. You want some sort of class. You want to say, "Is this a dog or a cat?" for example. So, what this does is: if you create this decoder, you say it's a class decoder and there are two classes, dog and cat. If I give it a probability vector, it will take the maximum probability, which is 0.9, and say this was a cat, because that's the class that corresponds to 0.9. And you can use this exactly like a classifier function, in that you can ask for different properties here, like, "What are the probabilities?", and it will give you an association with the probabilities for each of the classes. And exactly the same analog applies if you attach this to the output port of a network. And you can see now why these ports have certain names: you need ways of referring to them. If there were multiple outputs for something, how would I specify which output to attach some decoder to, for example? OK. In this case, there's a Softmax layer, which takes an arbitrary vector and produces a normalized vector, so it sums to 1. 
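The class decoder just described can be sketched like this (a minimal sketch; the two-class labels are the ones from the talk):

```wolfram
(* A class decoder: turns a probability vector into a class label *)
dec = NetDecoder[{"Class", {"dog", "cat"}}];
dec[{0.1, 0.9}]                    (* "cat" *)
dec[{0.1, 0.9}, "Probabilities"]   (* association of per-class probabilities *)

(* Attached to a net's output port, e.g. after a softmax *)
net = NetChain[{SoftmaxLayer[]}, "Input" -> 2,
   "Output" -> NetDecoder[{"Class", {"dog", "cat"}}]];
net[{-1., 3.}]   (* "cat" *)
```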
If we apply this now, Softmax directly produces something like this, which is definitely not just a numeric array. OK. So, the same thing happens with these networks, things like this ImageNet network. It has a whole bunch of classes, and if you click on this thing, you'll see that the output has a class decoder attached to it, which we could extract. And if we run this network, instead of giving you just a big splurge of probabilities—a vector of size 1001, in this case—you actually get the real class out. And obviously, if you want the probabilities, you can ask for them here as well, but this might take some time. OK, and you can get the output decoder in this case, and you can get the class labels and all that sort of thing out if you really want. OK, so, I hope that people can see what the motivation for this is: it makes the networks very clean. This network is ready to go. It doesn't require lots of post-processing scripts and whatnot. OK. So, now that we've discussed the main concepts—decoders, encoders, containers, and layers—we actually want to train something. These networks are not very interesting unless you can actually do some learning. So, how do we do this? Well, there's a whole bunch of automation that's introduced by the framework, but I'm going to show you some examples where I don't make use of this automation and do everything very explicitly, so that you can see exactly what's going on. So, here's a very simple example. We have some data which is pretty much a straight line: 1 → 2, 2 → 4, 3 → 6, 4 → 8. And here's a very simple net that we want to use to solve this problem. So, in this case, we're predicting one number, a vector of size 1. And here's where the magic happens. 
So, we have an input that comes in, it goes through the linear layer, which has parameters, and we want to find the parameters so that the prediction of this linear layer is the same as the target, so that the mean-squared loss becomes very small. So, if we evaluate the loss right now, with the network freshly randomly initialized, it's pretty big. And that's what we expect, because these parameters haven't seen any of the data; they're just randomly picked. So currently, the predictions of this linear layer don't match the targets at all. OK, so now we can actually train it, and there's a very simple construct called NetTrain. NetTrain takes a network and some training data, and it will train. So, that happened very quickly. We can do it again. Basically, this shows you what the loss looks like as a function of training iterations. It does many iterations, and you hope that the loss goes down, which means that your predictions and your targets match. And now we can check exactly the same thing: previously we had 15.31; if we run this now with the trained network, we get something like that. So, this network has now learnt to predict the targets. OK, and we can also extract the parameters. We had a net with some weights, and we have a new, trained network, and we can see what the weights look like. For the trained net, the weights are completely different. And that's the whole point of training: it changes the parameters. OK, and just a quick rough idea of how this works. It uses something called gradient descent, and this is where differentiability comes in and why it's so important. So, the basic idea is: suppose this J here is the loss. It's a function of the weights—the parameters—of the network. 
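As a concrete sketch, here is this whole toy example done by hand in plain Python (an illustration of what NetTrain automates, not its actual implementation): the loss J as a function of the two parameters, minimized by following the gradient downhill.

```python
data = [(1, 2), (2, 4), (3, 6), (4, 8)]  # the straight-line data from above

def loss(w, b):
    """Mean-squared loss of the prediction w*x + b against the targets."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

w, b = 0.0, 0.0                # untrained parameters: loss(0, 0) is 30, large
for _ in range(2000):
    # gradients of the loss with respect to w and b
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    gb = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    w -= 0.01 * gw             # small step in the downhill direction
    b -= 0.01 * gb
# w is now close to 2 and b close to 0, so the loss is tiny
```

The learning rate 0.01 and the 2000 steps are illustrative choices; the point is just that repeated gradient steps move the parameters toward a minimum of the loss.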
So, we want to find the minimum of this loss. For different parameter values, this loss is going to take different values, as we saw. And somewhere there's going to be a minimum—in fact, there are multiple minima, a whole landscape of minima. But the basic idea is that if you're a blob sitting over here in parameter space—if your parameters put you over here—then the gradient gives you the direction you must move to get closer to the minimum. And that's what NetTrain basically does: it uses the differentiability of the net and moves you in the direction you need to go. OK, so, people might realize this is very similar to things like NMinimize or FindMinimum, except that it's vastly more efficient, for various reasons. OK, so, that's the basic idea of the training, and you don't have to worry too much—I mean, it's good to know what it's doing, but NetTrain automates a lot of that stuff for you. So, now that we have this background, let us look at a real example. This is often considered a toy example, but it's maybe the least trivial toy example that actually does something nice. So, we get the data, and we can look at some examples; probably most of you have seen this sort of thing. And one of the reasons for using this example as a starting point is that a lot of people have seen it, so the problem is familiar. The problem is very simple. You have some handwritten digits, and you have the actual labels. And you want to learn how to predict the correct label for unseen images. So, what have we learnt? 
Well, the first thing—and this is something that I think the next speakers will emphasize a lot—is that it's very seldom that you will actually be putting together very complicated nets from scratch, with lots and lots of layers. The neural network community has spent a lot of money and time looking for good architectures for certain problems. So your first instinct should always be to say, "Let's go to the neural network repository and find something that does something similar to what we're doing." If it's image classification, there's a whole bunch of incredible networks that have won on ImageNet that will be a very good fit for what you are doing, and it's unlikely that you'll ever find a better architecture. So, this should be your first instinct, almost always: find an existing network for the task. And one other reason why it's very nice to use the repository is that you are guaranteed that the implementation is correct. It's very easy to read a paper and then make mistakes with these numbers. This is a very small net—you're probably not going to make mistakes—but for networks that have 500 or 1000 layers, it's very easy to make small mistakes, and then the whole network doesn't work properly. So this should be your first instinct. And in fact, for this particular task, there's a very famous network that we have, called LeNet, which, at some point, supposedly read about 10% of America's checks. This was in the early 90s. Yann LeCun worked for, I think, Bell Labs, and they made one of the first industrial-grade neural network recognition machines, for check digit recognition. So, it has a very prestigious history. And this network does exactly what we want. 
It's a very simple network; it has convolution layers and pooling, and I'm pretty sure the image talk will explain carefully what these layers do, why they're special, and why they're used for images. So, you could run this model immediately, and you'll see that it works. But let's do this from scratch, using all the different things we've learnt. And I also want to say: if you want to see the fastest way of doing this sort of task, there's a tutorial for MNIST digit classification, and it's very, very short—a very simple few lines of code that does this the simplest way. But the simplest way is not the most general way. The simplest way works for problems where there's basically one input and one prediction, but it doesn't work for anything else. So, we're going to show the general approach that will work for any sort of problem, even though, for this very simple problem, it makes things look a lot harder than they are. OK, so, the first thing we want is a decoder, because we want this thing to predict the classes—and the classes are going to be between 0 and 9, because those are the digits. And we also want an encoder, which takes an image and makes sure that it's 28×28 in size and grayscale. OK. So, the next thing is to actually define this convolutional network. I'm not going to explain the motivation for each piece all that carefully. Maybe the only thing I'll mention is that convolutions are usually used for structured inputs that are large—like sounds, spectrograms, or images—where you don't want a blowup of your parameters. The fully-connected layer, the linear layer, gets incredibly large for images: if you have a 500×500 image and you need a matrix that multiplies that, you just have a blowup of parameters. 
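The blowup is easy to check with a little arithmetic (plain Python; the 5×5 kernel with 20 output channels is an illustrative choice, not LeNet's exact configuration):

```python
pixels = 500 * 500                  # a 500x500 grayscale image, flattened
dense_weights = pixels * pixels     # a linear layer mapping image -> image-sized output
conv_weights = 5 * 5 * 1 * 20       # a 5x5 convolution, 1 input channel, 20 output channels

print(dense_weights)  # 62500000000 -- 62.5 billion weights
print(conv_weights)   # 500 weights (plus biases), independent of the image size
```

The convolution's weight count depends only on the kernel size and channel counts, which is exactly why it scales to large images where a dense layer cannot.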
There's a whole bunch of other reasons. There are certain symmetries, or invariances, that you hard-code into this. So, you say things like, "For a classification problem, I don't care where the object is in the image." And that invariance to shifting the object in the image is one of the reasons why you have a pooling layer, for example. You can show that a pooling layer makes the network approximately invariant to translations of an object in the image. And this is a general principle in neural networks: if you know that there's some structure in your input, you want to use layers appropriate for exploiting that structure, and that makes learning a lot more efficient. But this is a longer topic that maybe we can deal with some other time—in the next version, or maybe in the questions. OK, but basically, there's a whole bunch of convolution layers, and there's a nonlinearity, like a Ramp layer, and this is a very important thing for you to know as well. If you plotted this thing from, let's say, -3 to 3, it is flat at zero for negative inputs, and then it goes upwards linearly. This is also sometimes called a ReLU, or rectified linear unit. You'll see them between things like linear layers or convolutions, and they break the linearity—each of those operations is a linear operation—so you get a bit of nonlinear behavior, and that helps the network learn much better. OK. So, we've now created this thing. Now, this is where things become more complicated than they need to be for this very simple example, but I'm going to show how to do this for the general case. So, here, I'm actually constructing the training net. 
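As a quick aside, the Ramp (ReLU) shape just described is literally a one-liner; here it is in plain Python:

```python
def ramp(x):
    """Ramp / ReLU: zero for negative inputs, the identity for positive ones."""
    return max(0.0, x)

[ramp(x) for x in (-3.0, -1.0, 0.0, 1.0, 2.0)]  # -> [0.0, 0.0, 0.0, 1.0, 2.0]
```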
So, for these simple examples, you can actually feed this straight into NetTrain, and it will construct the whole training graph itself and then give you back the original network that you had. But that doesn't work for all networks, especially when there are multiple inputs or outputs. So, what you can do is—in this case, here's LeNet. We have the uninitialized net, and the input goes to that. And then the output goes to "Loss", which we've explicitly added here, and this is a CrossEntropyLossLayer. CrossEntropyLoss is usually what you use for classification. So, this is one of the things that you have to learn as well—what losses are appropriate for different tasks. And this is a very good loss when you're comparing two probability distributions. OK, and we can evaluate this on some training examples, and you'll see that it's quite large. So, we have this training net, and we want to find the parameters in the LeNet so that the prediction it makes is similar to the target, and the loss gives you a number that will be small when that happens. There's one more thing, which is that this net now has two ports: a target port and an input port. The most general way of representing data when there are multiple ports is the association form. So basically, you have one port with a bunch of examples, and then another port with a bunch of examples. For training, this is the most general form. And some networks might have five of these ports, and you can easily represent that data this way. OK, now, let's train this thing. This is a more complicated version of the syntax I used previously, but let's start it. The first thing to note is that I'm going to use the GPU. Now, not everyone can use the GPU. You can only use it if you have an NVIDIA graphics card, unfortunately. 
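The cross-entropy loss used above boils down to a very simple idea: the loss is the negative log of the probability the network assigned to the true class. A plain-Python sketch of the math (not the Wolfram Language layer itself):

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])

cross_entropy([0.7, 0.2, 0.1], 0)  # confident and correct: small loss (~0.36)
cross_entropy([0.7, 0.2, 0.1], 2)  # confidently wrong: large loss (~2.3)
```

So an untrained net, which spreads probability almost uniformly over the classes, gives a large loss, and the loss shrinks as the net puts more probability on the right answers.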
So, whilst it's running, we can explain what all the different things are. And sorry—this display looks a bit strange in presentations. There's some bug that makes it look small. OK, so, it's busy training. Let's look at all the different pieces whilst it's doing that. There's the training net, there's the training data association, and there's a validation set. Now, this is very important. When you use things like Classify, a lot of things are automated for you to prevent things like overfitting. Overfitting means that you do very well on your training data but terribly on any other data. And there's a very good tutorial on this if you're interested, called "Training Neural Networks with Regularization." So basically, the training is not allowed to use the validation set's data for learning, but it checks against it, as you can see here. This last curve, the blue one, is the validation performance. And as you can see, the network is getting better and better on the training set, but it's not getting better on the validation set. So, what NetTrain will do is say, "OK, we'll take the network that did the best on the validation set, and return that to the user." For example, this graph might look like the validation loss goes down, down, down, and then suddenly goes up again—that's the classic signature of overfitting. In this case, it doesn't look like it's overfitted too much; it looks quite fine. But if you provide a validation set, it will protect you quite well from overfitting, because it will just take the point where it did the best on the validation set—where it wasn't overfitting yet. And that's probably the most recommended thing for people to do to save themselves from that problem, because NetTrain doesn't automatically deal with overfitting the way Classify and Predict might do. OK. 
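That validation-based selection boils down to something like this (a plain-Python sketch of the idea, not NetTrain's internals): record the validation loss at each checkpoint, then return the network saved at the checkpoint where it was lowest.

```python
def best_round(validation_losses):
    """Index of the checkpoint with the lowest validation loss."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__)

# validation loss falls, then rises again: the classic overfitting signature
losses = [0.9, 0.5, 0.3, 0.25, 0.31, 0.4]
best_round(losses)  # -> 3: keep the network saved at that round
```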
A few other things: the number of training rounds means how many times it goes through the entire dataset before it stops. And—we can actually run this one more time—you can stop the training at any point. So, if I wanted to stop it now, I could just say, "Stop," and it will return the training result for me. OK, that was maybe not the greatest idea, because now I have to retrain it for a little bit at least. And yes, in this last [indistinct], the point was that there's an argument All here. It just means that it returns this NetTrainResultsObject. Normally, if you don't give this argument, it will return the actual trained network. But in this case, it gives back the trained network inside this object, along with all the other training information as well—things like what the validation loss was, etc.—and we'll see that now. So, this is almost done—and this will take a lot longer on your CPU, unfortunately. OK, let's stop it now. So, we have this network that's been trained; we have the results object. From the results object, there's a whole bunch of properties that you can get. Things like a nice plot of the loss for the training set, which is the orange curve, and the validation set loss, which is the blue one. And we can see that it hasn't overfitted too much, because the validation loss is still going down. OK, we can also get the network out—results["TrainedNet"] will give you back the NetGraph. But what we really want is the prediction network. So, we want to get out this NetGraph, but inside that NetGraph there's a node, LeNet, which is the prediction graph. We can discard the rest of the network because we're not doing training anymore. 
There's a little bit of annoyance here, which is that the encoders disappear when you extract this: the output decoder's not there, and the input encoder's not there, so you have to put them back, which is pretty simple with NetReplacePart. And we can now use this trained net to actually make some predictions. So, this is a 2, and we can get some probabilities out. It's pretty confident that this is a 3, but it's a bit unsure—and as you can see, the digit's been cut off a bit, so it's actually good that it's not 100% confident. And you can do things like obtain the accuracy of this thing, which is about 98.6%. For a classification task, it's very simple to get metrics out, because you can just use ClassifierMeasurements; and the same sort of thing is available with Predict if you're doing regression. OK, and—oh, this is a duplicate. Oopsy. You can do the same thing with Classify as well. So, you can try running Classify, and you can see how much simpler Classify is than this. Classify is fully automated: it does all the preprocessing; it does everything for you, basically. And all of this rigmarole that we went through—you can achieve something similar with Classify. Except that Classify has a lot of limitations: it's not as flexible, you can't do out-of-core learning, etc. And obviously, if the only problem were simple image classification, then something like Classify might be appropriate. OK, let's just stop it; we don't have that much time. Well, we didn't let it train very long, but it got that. OK, so, just to summarize quickly: I've given a very basic introduction to the framework and its different pieces, and I've given lots of links to things that are more complicated, like out-of-core training. 
There are all kinds of other ways of dealing with overfitting, things like using dropout or other kinds of regularization. Those are things that you'll probably have to look at yourself. OK, and the last thing that I want to leave you with before I hand over to Talie is a quick rundown of this framework. So, first of all, it's an industrial-strength, scalable framework. We use all the latest NVIDIA libraries, we use MXNet as a backend, and we contribute to MXNet as well. MXNet is backed by a whole bunch of companies, from Amazon to Apple, so users automatically get the benefit of all the improvements that are made to MXNet. We also use this extensively at Wolfram Research: we're developing a whole bunch of functionality based on neural networks, from speech recognition, to things you've already seen like question answering, to ImageIdentify, and ImageContents is coming. We have a list of something like 70 different functions that we want to build with this framework. So, it's not something that we don't use ourselves; we use it very extensively. And we also plan to have the best repository of networks, which is already, I think, a lot better than any other framework's repository. We're putting a lot of effort into collecting nets from other frameworks. We convert them for you—we convert the preprocessing into Mathematica code if there's preprocessing involved—and converting is a tricky business; it's unfortunately a lot harder than it needs to be right now. And the networks that we train ourselves are also going to go into this repository. So, those hopefully-70 functions that are on our list—those nets will be available to users. One more thing, and there have been comments about this as well: there's no lock-in. 
We thought it was a very important principle to follow that users of the framework are not forced to keep their work only in Mathematica, because you might want to deploy on TensorFlow Serving, or who knows what—there are a lot of options you might have, and we definitely don't want to lock you in. So, we already support import and export to MXNet. I should have put the link for that, but you can look in the documentation. And we also plan to support the main candidate for a completely cross-platform format, which was started by the PyTorch, Caffe2, and Microsoft CNTK framework people, and has now been adopted by MXNet, Chainer, and a bunch of others; we hope to support both import and export for that as well. And we've tried to make ourselves easier to use than other frameworks. That's the selling point, I guess, that we want to have: we want to be as simple to use as possible. And for certain things, as you might see in the next talks—things like true variable-length sequence support—these are very hard to do in other frameworks, like Keras, where you have to do masking and other ugly things. And things like the encoders make it very simple, because everything just runs. There are no extra Python packages that you need to run to process your audio or images; everything is just part of the language. And in the single [indistinct], it works out of the box on all the different platforms, from the Raspberry Pi [indistinct] to all the other major platforms. And one thing I didn't mention is that we support quite simple cloud deployment: there's APIFunction and CloudDeploy, which allow you to deploy these networks onto the Wolfram Cloud to make APIs with. OK, cool—I think that's basically the end of my talk.

Video Details

Duration: 1 hour, 4 minutes and 5 seconds
Language: English
License: Dotsub - Standard License
Genre: None
Views: 14
Posted by: wolfram on Apr 2, 2019
