
04_WolframDataSciencePlatform

Hi, my name is Dillon Tracy. If you saw either the keynote by Stephen Wolfram or the one by Tom Wickham-Jones this morning, you've had some exposure to the Cloud, and you may even have played with it a little yourself already using a free account or something else. The purpose of this talk is to give you a little more in-depth exposure to what the Data Science Platform is and is going to be, and how the workflow is going to work. So I'll talk very briefly about how the Data Science Platform fits in with the other Cloud offerings that we have and will have; why you might want to do data science in the Cloud in the first place, why that's a good thing; and then I'll go through a few concrete examples to show you what the workflow looks like.

Earlier this summer we launched the Programming Cloud, which Stephen Wolfram mentioned this morning. The emphasis in that platform is on deployments: on prototyping, team collaboration, managing your deployments, keeping track of things as they roll in, and so on. Then late in the summer and early in the fall we launched Mathematica Online; the idea behind that one is to extend the Mathematica desktop experience into the Cloud. The Data Science Platform sits in this portfolio, and the emphasis here is going to be on the bread-and-butter data science workflow: import something, analyze it, publish your results. I'll show you concretely how that works. On the upstream end, you're going to have data that could be structured or unstructured; it could be sitting in a database or in a file; it could be updated in real time or be completely static. You're going to bring it into the Wolfram system, where you'll be able to leverage some semantic interpretation of the data, do your Wolfram Language customizations, work up your visualizations, and so on. Then on the downstream end, you go through a CDF templating step where you can produce a customized document, often with interactive elements, that you can get in front of a decision maker.

So why would you want to do data science in the Cloud? I think the answer has something to do with accessibility and reach. Let me reason by analogy a little. If you have not yet played with the idea of an API function yourself, you may have seen Stephen Wolfram do it this morning. Functions are now fluid things: you can not only write and execute them in your desktop environment, you can Cloud deploy them, and once you do that they execute in the Cloud. For example, I deployed an API function yesterday that returns a map of a country; here it is returning a map of Italy. The request is coming from my browser, the computation is being done in the Cloud, and the response is served back to my browser. My browser doesn't know what technology stack is running up in the Cloud, and it doesn't care; it just makes an HTTP request and gets a result. That has the effect of democratizing computation, right? Twenty years ago, or even a few months ago, if I gave you a piece of Wolfram Language code and told you to execute it for me, it would have been an acceptable excuse to say, "I don't have Mathematica." That excuse just doesn't exist anymore, because I can publish my stuff as an API function and you have a browser, right?
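A minimal sketch of the kind of country-map API function described above; the parameter name, the use of Polygon for the country outline, and the Permissions setting are assumptions rather than the exact code from the demo:

    (* deploy a function that maps any country; the resulting CloudObject
       can be called from any browser, e.g. ...?country=Italy *)
    api = CloudDeploy[
       APIFunction[{"country" -> "Country"},
        GeoGraphics[Polygon[#country]] &,
        "PNG"],
       Permissions -> "Public"]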
Something similar is happening to computable documents. These desktop notebooks that we know and love, with all their wonderful interactive elements, are now usable in a non-augmented, zero-configuration web browser, so there's an analogous phenomenon happening where access to results is being democratized. And this solves a very serious problem in data science workflows. At gatherings like this one I've had many conversations with people like you who have done marvelous analyses, worked up notebooks that do all sorts of wonderful things, and I ask, "What happened? How did this play out in your organization? What kind of change took place?" And the person I'm talking to says, "Well, my boss doesn't have Mathematica." And I say, "Couldn't he install the CDF browser plugin?" And they say, "Well, IT wouldn't let us do that." And then I say, "Oh," and the conversation trails off. That's happened enough times that I'm excited about solving this problem. I think it's one of the killer features of doing data science in the Cloud.

Let's play with some real data. Here is some real data: a spreadsheet of utilities data that I keep on my house. It's just monthly data that goes back a few years, a simple rectangle of data with half a dozen columns, kept in a Cloud spreadsheet, and here's the URL for it. I'm going to import this data into my desktop session. I not only imported it, I semantically imported it. SemanticImport is a new function in Version 10; you may have gotten a glimpse of it this morning. It not only takes care of the mechanics of tearing apart the file and getting it into the Wolfram system for me, it also attempts to recognize entities in the file. So this just took in a raw CSV, which was just text, but it still managed to recognize the dates as dates (that's what these yellow/orange boxes on the left are), and the fields that contained dollar amounts are marked as having units. This takes a lot of the workaday sting out of doing data science. These are the kinds of things you used to have to do by hand, but you don't have to anymore. The thing that's being returned is a Dataset with a capital D. There are all kinds of interesting queries you can do on it; that's really its own conversation, its own topic. Just as a very basic example, this command will extract two columns and hand them to a visualization function, and this will show you my electricity usage in kilowatt-hours going back a few years.

That's on the desktop; I'm going to switch to the Cloud now. This is what the front page of the Data Science Platform looks like. These big red buttons across the top are quick starts for very common workflows: getting data in from a file that you have in the Cloud, getting data in from a file that you have on your desktop, getting data in from a URL. We have data at a URL, so let's go through that one and duplicate this desktop workflow in the Cloud. I'm clicking "Import Data from URL". I'm going to paste this URL here. And when I click "Continue", the platform makes a notebook for me that is seeded with some code to help me start my analysis. This is a normal Mathematica notebook, a normal Mathematica notebook in the Cloud, and all the things you're used to doing in a notebook can be done. It is also a template notebook, which is a special augmented kind of notebook; these buttons across the top allow me to place and control the behavior of template variables.
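The import-and-plot step looks roughly like this, whether on the desktop or in the Cloud; the file name and the column names "Date" and "kWh" are placeholders for whatever the spreadsheet actually contains:

    (* semantically import the utilities file; a URL could be given instead *)
    utilities = SemanticImport["utilities.csv"];

    (* extract two columns from the Dataset and hand them to a visualization function *)
    DateListPlot[Normal[utilities[All, {#Date, #kWh} &]],
     PlotLabel -> "Electricity usage (kWh)"]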
These template variables can be filled with whatever I want to fill them with when I generate a customized document from this template, and that's useful for customized reporting at the downstream end. At any rate, the same semantic import that you saw on the desktop can now happen in the Cloud, and the kinds of visualizations that we do on the desktop can also happen in the Cloud. What you're seeing here are the Cloud operations corresponding to what we just did on the desktop, and here's the corresponding visualization.

This is not yet a very compelling report, but let's pretend that it is and deploy it. I'm going to press this big red "Deploy" button and click "Automated Report", and this GUI pops up with all sorts of levers and knobs that I can twiddle, like the output format, the people I want to receive copies of this report, and the schedule I want to run it on. Let's say I get my utilities bill once a month, so let's run this late in the month, and I click "Create". Now this template, along with some other things, is bundled up into an entity on the server, and it appears in my reports listing in the sidebar here. It's this one, because I didn't give it a name. If I want it to run now, I don't have to wait until the end of the month; I click this little gear menu and choose "Run Report Now". When I do that, a server somewhere up in the Cloud grinds into action: a kernel is launched, the template is applied, a log file is generated, the resulting document is produced and archived, and it's sent to all the people who were listed as recipients. Here it is. If I click this link in my notification email, it should open the resulting generated document in the Cloud. There's my report.

As I say, that isn't so compelling yet, but imagine if I worked on that problem for a couple of hours and did some more sophisticated visualizations. I could be working on the desktop and uploading the template, or working right in the Cloud interface. Let's say after a couple of hours I came up with this. It's working on the same data, and you can see my semantic import. Things have gotten a little more sophisticated: I'm doing some row deduplication where the utilities company made a mistake, you can glimpse some visualization functions in here (some DateListPlots, some histograms and paired histograms), and here is some forecasting using more sophisticated functions like TimeSeriesModelFit and TimeSeriesForecast. I can evaluate this just as I would a normal notebook, but it gets compelling to other people when I deploy it as an automated report. So, again, I can go to this GUI, push "Create", get a report in my sidebar, and ask it to run now if I want to. While we wait for that to chug along, I think I can show you the results of a previous run. These orange peaks are my gas usage in the winter and these blue peaks are my electricity usage in the summer. Here we have some distributions of my kilowatt-hour and therm usage, and here they are broken down by season. And here's some forecasting: my combined gas and electricity usage, projected out twelve months using a seasonally adjusted SARIMA model, and here's what I'm on the hook for this fall and winter. There are all kinds of directions you can take this in; you may want to compare how the prediction from last month matched up against the actuality for this month, and so on.
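A rough sketch of the forecasting step just described, assuming the usage has been assembled into a monthly time series and reusing the placeholder column names from before:

    (* build a time series of monthly usage from the imported Dataset *)
    usage = TimeSeries[
       Normal[utilities[All, {#Date, QuantityMagnitude[#kWh]} &]]];

    (* fit a seasonally adjusted model and project twelve months ahead *)
    model = TimeSeriesModelFit[usage, "SARIMA"];
    forecast = TimeSeriesForecast[model, {12}];

    DateListPlot[{usage, forecast}, PlotLegends -> {"actual", "forecast"}]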
We've already been able to cover the workflow from one end to the other, so this is a document that could be consumed, that could be put in front of other people. We went from one end, the raw data, to this pretty quickly, and the hope is that these deployments, these automated reports, will be pretty lightweight, that you can do lots of them very quickly, and that the turnaround for them will be short.

I'm going to flip back to the desktop now. I want to show a little more about what you can do with semantically imported data. This is a cute dataset: a tree inventory done by the city of Champaign. Here are the columns that you get: a tree species and a common name, some stats about the size and shape of the tree, and a location as a longitude and latitude pair. I'm going to semantically import this dataset. This time, I'm going to give the interpreter a hint that the first column contains a plant, and the interpreter will attempt to assign a plant species to that column. Once that semantic import finishes, I'll give you a glimpse of what the raw results look like. Once again, we're on the desktop here; there is no Cloud involved yet. If you look in the left-hand column, you see that in cases where there actually was a tree, and not a vacant site or something, you get an entity back, and that entity has curated metadata attached to it. This is an idea brought in from Wolfram|Alpha, and this metadata is available to you in your analysis and can facilitate these so-called knowledge-based analytics. Also notice that the position has been wrapped: the latitude/longitude pair has been recognized as such and wrapped in GeoPosition, which will make mapping applications easier, as we'll see. To give you an idea of the kinds of things knowledge-based analytics gives you access to: if I ask for the number of properties associated with the red maple, I get about a hundred and fifty. So, for example, I can ask for the foliage color of the red maple, and this may or may not come back as green.

Let's do something similar in the Cloud. Here's a template that I worked up in the Cloud to look at the same trees dataset. Here you can see me doing the semantic import of the trees dataset. I'm going to do a little bit of munging on it and then some data science-y stuff, where I look at the five most common trees, and then you see this GeoGraphics here. This is a super fun feature, new in Mathematica 10; it's going to do some mapping of the tree species. I'm going to zero in on the ash trees, isolating one species and looking at it. When I'm happy with this, I'll do the same thing I showed you before (deploy, automated report, create), get this trees template report, and ask it to run now. While we're waiting for that, I'll show you the results of a previous run. Here are the five most common trees in the dataset. These thumbnails of the tree species came for free with the metadata attached to the entities; I didn't have to go look anything up or do any work at all. It was just a property attached to the entities. Here is a pushpin map of a sampling of tree locations, color-coded by species. Where are we? I guess we're at that intersection. There are a lot of distinct tree species, so this is a little busy. If I isolate the ash trees, things get a little more manageable.
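A hedged sketch of the import-with-a-hint and the entity lookups; the file name, the column layout, and the "FoliageColor" property name are assumptions, with only the "Plant" hint and the rough property count taken from the talk:

    (* hint that the first column is a plant species; the other column types are guesses *)
    trees = SemanticImport["champaign-trees.csv",
       {"Plant", "String", Automatic, Automatic, "GeoCoordinates"}];

    (* curated metadata comes along with a recognized entity *)
    redMaple = Interpreter["Plant"]["red maple"];
    Length[redMaple["Properties"]]   (* around 150 curated properties, per the talk *)
    redMaple["FoliageColor"]         (* property name is a guess; Missing if absent *)

    (* pushpin-style map of a sampling of tree locations *)
    locations = Normal[trees[All, "Location"]];   (* column name is a placeholder *)
    GeoListPlot[RandomSample[locations, Min[100, Length[locations]]]]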
All of the ash trees in Illinois, and in many parts of the United States, are being eaten by these: emerald ash borers, which are thought to have come over from East Asia in packing material. This dataset did not contain data on sick trees, but I found myself wishing that it did, because that would have been neat. So I simulated some. Here are some simulated sick trees based on a two-dimensional distribution, and I colored the sick trees red so you can see this outbreak, this ash borer infestation, west of I-57 in the map down here. If you worked for the city or something, it would be nice to have this in your email every Monday, right? In fact, for the data science course that I taught yesterday morning, I worked up an example where we used the machine learning functions (the Classify functions) to identify candidate sick trees in the rest of the city based on the characteristics of the ones that were actually sick. Provided that approach actually makes sense, that you can predict vulnerability to ash borers from the kinds of things arborists measure in these kinds of datasets, that would be a real labor saver, because the city only has so many trucks and so many arborists to send around. It could be an interesting application.

Everything that you've seen so far in the Cloud has been fairly GUI-centric, but we think very hard about language design, and it's generally the case that everything you can do with the web GUI has a corresponding language equivalent. These automated reports, these bundles of stuff that execute up in the Cloud, are known in the language as document generators, and I'll show you briefly how they work at the language level. Here I am back on the desktop. Let me show you a very simple template: a super simple document template that has one slot in it, for an author. I'm going to deploy an automated report using that template from here, from the desktop. That's a one-liner; it uses the CloudDeploy idiom that we use to Cloud deploy everything. I specify a document generator with that template, and minimally that's all you need; you don't even necessarily need a schedule. So I'm going to deploy that now, and that thing is being bundled up and sent up to the Cloud as we speak. Once it's up there, I might want to trigger it to run without waiting for its schedule; in this case it doesn't have one, so I have to trigger it if I expect it to run at all. There are two ways I can do that: in a blocking way, where I wait for the result, or in a non-blocking way, where I let it go do its thing as if it had been triggered by its natural calendar trigger, and then wait for an email notification or go check it later. Here I'll evaluate it in a blocking way. Evaluation of that thing just took place in the Cloud; I got back a Cloud object, and if I click on it, this should be the document that was generated by that generator. There it is. To go back to the utilities example, if you wanted to deploy that from the desktop without going through the web GUI, that's just this, another one-liner; this one has a schedule of "monthly". The intention here is that these deployments will be very lightweight.
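A minimal sketch of that language-level deployment, assuming a simple template notebook with a single "author" slot; the file name, deployment name, and slot value are placeholders:

    (* blocking: generate a filled-in document right now from the template *)
    GenerateDocument["simple-template.nb", <|"author" -> "Dillon Tracy"|>]

    (* deploy to the Cloud as a document generator; minimally the template is all
       you need, and a driver and schedule can be supplied as further arguments *)
    gen = CloudDeploy[DocumentGenerator["simple-template.nb"], "SimpleAuthorReport"]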
Here at Wolfram, for example, we have an enterprise data warehouse, and there's a team that manages it and works with it. Requesting a report to be set up in it is not a heavyweight process, but there is some process there, and the hope is that this will be much more maneuverable, that I could deploy ten of these in a morning if I wanted to, with slight variations for different people. Here I'll tell this one to run asynchronously, just for the heck of it.

I'll give you a glimpse of a more complicated example. Here's a template for a stock report; this will give you a bit more of a sense of what you can do. This stock report is fed with a bunch of tickers (you give it a list of tickers that you want a stock report on), and then it uses the financial data paclet to look at some properties of those instruments and gives you back little interactive widgets that let you explore the time series of those properties over the past N days. It makes use of a template feature called the repeating block, which allows the template to handle a variable number of elements, so I could give this template one instrument, or five, or fifty. Let's say I'm working at a financial company and somebody says to me, "I want a daily report on the performance over the past 30 days of the five stocks in the trucking industry that had the highest trading volumes the previous day. Do it." By nature, you don't know what you're going to be reporting on until that day, right? This is the case where the document generator itself needs some intelligence to decide what to fill the template with at the time the document is being generated. We call that "driver logic". Driver logic can live in inline code, in its own script, or even in its own notebook. Here I've put it in its own function, so this driver is going to use financial data to look up the members of the trucking industry sector, look up yesterday's volumes, sort them, take the top N, and then go about preparing the other slots for filling the template. The actual deployment of the document generator itself is still pretty simple: we're just specifying the template, a schedule, and also the driver logic. So I'm going to deploy that and run it. This one hits the financial data paclet pretty hard, so it takes a couple of minutes to run. In the meantime, I think I can show you the results of a previous run. I'll wait a bit for that to come through; maybe it'll arrive before the end of the session.
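To make the driver-logic idea concrete, here is a rough sketch under stated assumptions: the group specification given to FinancialData, the slot name "instruments", the template file name, and the daily schedule spec are all guesses rather than the code from the demo:

    (* hypothetical driver: pick yesterday's five highest-volume trucking stocks *)
    truckingDriver[] := Module[{tickers, volumes},
       tickers = FinancialData["Sector:Trucking", "Members"];  (* group spec is an assumption *)
       volumes = {#, FinancialData[#, "Volume"]} & /@ tickers;
       <|"instruments" ->
          Take[Reverse[SortBy[volumes, Last]], Min[5, Length[volumes]]][[All, 1]]|>];

    (* the deployment itself stays simple: a template, the driver logic, and a schedule *)
    CloudDeploy[
      DocumentGenerator["stock-report-template.nb", truckingDriver,
       Quantity[1, "Days"]],
      "TruckingVolumeReport"]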
That's a glimpse of the Data Science Platform workflow and how it fits into the import, analyze, publish paradigm. You've seen data come in and you've seen these documents come out. I've shown you how semantic interpretation helps streamline some aspects of the analysis and brings in some things you might not even have thought of doing otherwise. Very importantly, doing data science in the Cloud solves the dissemination problem, which in my opinion and in my experience has been the single biggest obstacle to getting the Wolfram stack into organizations for doing data science. I also want to leave you with the point that the Cloud can be used in a standalone way, but it also hybridizes pretty naturally with the desktop, so these Cloud deployments that you see are easily managed.

If you do data science even now, without the Cloud, you have probably learned to maintain a mental model of which things happen in memory and which things happen on disk, and you say, "well, this is a high-latency operation, this is a low-latency operation." Dealing with the Cloud adds another layer to that (this is happening in the Cloud, this is happening locally), but it's the same idea, and the platform works naturally in both modes. I understand this has been very quick, and I'd be happy to demonstrate in more detail any aspects of this that you'd like to see more of. Please grab me wherever you see me around the conference; any of the other team members you meet here will also be happy to speak with you about it. Thank you.
