Watch videos with subtitles in your language, upload your videos, create your own subtitles! Click here to learn more on "how to Dotsub"

Big Data. Small Data. All Data.

0 (0 Likes / 0 Dislikes)
  • Embed Video

  • Embed normal player Copy to Clipboard
  • Embed a smaller player Copy to Clipboard
  • Advanced Embedding Options
  • Embed Video With Transcription

  • Embed with transcription beside video Copy to Clipboard
  • Embed with transcription below video Copy to Clipboard
  • Embed transcript

  • Embed transcript in:
    Copy to Clipboard
  • Invite a user to Dotsub
[TechEd 2013: Shawn Bice, Director, Data Systems Group] All right, let's go ahead and get started. So by way of introduction, my name is Shawn Bice. I'm from the Database Systems Group. I'm here to represent work done by all of the engineering teams at Microsoft that are building technology for our data platform. So I appreciate your patience—we'll start at 11:10, we'll keep things on time—and over this course of the hour, basically we're going to talk about technology areas that connected together can drive results. And then we'll try to spend more than half of our session time in demonstration. We think that's very useful because you can see things get put together firsthand. But over the pages—to sort of give you a reference so you know the flow of this hour, we'll talk a little bit about data warehousing, parallel data warehouse. The things to really—just by raise of hands, how many people here are familiar with PDW? So a few of us. For those of you that are—a couple of concepts that I would sort of ask you to pay particular attention to because they're new—external tables is one thing. So just keep an eye out for that. But really what we're going to focus on for PDW is looking at how you can query across structured and unstructured data, and you'll see that in the demonstration. Then we're going to talk about HDInsight. How many people here are familiar with HDInsight or the work we're doing with Hadoop? Okay, so just a few of us. How many people here feel like—just by a raise of hands— you need to know more about Hadoop but you're sort of in that beginning stage of learning? So a lot of us, okay. So I'll try to spend a little bit of extra time there. The things that I would call out now is what we're going to show you with Hadoop—when you think about Hadoop— sort of conceptually break it down into a couple of parts— a whole bunch of unstructured data and some storage mechanism, and then a particular type of processing or compute processing over that data. And what we're going to do is process a bunch of logged files and reduce it into a smaller set. Then we'll talk about analytics particularly, we'll get into Excel and we'll show—we're going to switch personas and we're going to look at how we can model and mash up data and get some insights, and we're going to end with SQL Server 2014, particularly looking at the In-Memory OLTP investments that we're making. So in any keynote or foundation session, I think it's great to start with a story about data. This story actually goes back to the mid-1800's. And the thing that's—I think—really interesting about this story is the way this physician John Snow went about doing something. He had a theory about an outbreak of this disease called cholera, and cholera is basically a small intestine infection from bacteria. Now back in the day, germs and disease was not well understood, and just to explain this in short, he and a lot of people believed that disease was passed through bad air. So you would breathe air and somehow you would get an ailment. There wasn't a lot of belief that it was passed by touching something, and John Snow was this aspiring physician who thought that, "hey, there's such a thing as—" his theory was that this cholera disease was actually waterborne. And what he did is he took this map and he plotted basically where outbreak was, basically where people either died or were sick from this particular disease. Now when he really took the time to look— if you look at sort of the pin dot or the darkened colors inside of this big orange or red circle there, what he really— his theory was that this disease or—was possibly because of this Broad Street water pump. So the water pump was nearest all of these deaths, if you will, in this particular area. So his theory was that it was the Broad Street water pump. Now he took this theory to the city council, and the city council basically said, "well it can't be the Broad Street water pump because there's all these anomalies over the page." Specifically they noted the brewery—where you see that big green star was a brewery close to the water pump—but there's no deaths in the brewery. And it wasn't—and they also pointed out these other blue dots and they said, "Well, if it's that water pump, then how do you explain these ailments or deaths that are well outside of that circle?" And it wasn't until John Snow enlisted an external data source as one from Reverend Henry Whitehead and then also by talking to folks in the community. And he discovered a few things by bringing in these external sources. One, Reverend Henry Whitehead had explained that the brewery had its own water source, so that the people that lived and worked at that brewery were not actually going to that Broad Street pump. And it also turns out that the people at the brewery were allocated some allotment of beer that they drank each and every day, not going to that pump. Interesting fact. Then they also found as an external data source—of talking to families that lived where you see those blue dots— it turns out when they brought their kids to school, those kids drank from that pump and they brought that water home, and when they brought that water home, it explained the sickness. So at this point, John Snow went back to the city council with his original theory and also including some external data sources to basically say, "Hey, the action we need to take is to close that Broad Street pump," and which they did, and it curtailed that ailment in that particular area. Now if you put that in today's terms, what John Snow did is he first started off by finding and combining data by mapping out data that he had of ailments, or deaths. He formed his theory, took it to a council, and then he refined that theory with some external data sources from Reverend Whitehead and then talking to folks in the community. And ultimately he took action, and the action was removing that pump handle. It's interesting to point out that in—nearly 175 years ago when this happened, that one particular event actually led to how people thought about sanitization and water, and how to fix that up in cities so that they could curtail ailments like that— so something that was done nearly 200 years ago, just in today's terms. So if we think about big data, I'm sure everyone in this room has heard the term "big data." And it can sort of mean—it can mean whatever you want it to be, honest. The first time I think I heard the term big data, I used to hear these particular phrases that went with it. But I—the way I think about big data— and I try to explain it in just really simple terms— there is an explosion of data in this world. The world is at a tipping point today. What do I mean by that? Well, sensors, GPS devices, wristbands—like the one I have on that tracks my activity each and every day—telephones, cameras, machines are creating more data than human beings—that's a fact. And it's not unreasonable to believe that within the next year or two that almost every vital object on Earth is emitting data— it's probably not a crazy statement. And if you think of medicine—you think—I think of the medicine that my kids are going to experience will be very different than what I experienced, just based on data. If you think of an Alzheimer's patient wearing a device recording each and every thing that that person does every day, so that in the evening they can try to remember and retain those memories from the day—that's not a crazy thought. So if you're in the data business and you're a data professional, I would say this is—these are the glory days that are right in front of us. I've heard people comment that the era of computing is just beginning right now. And if you think about how much data that we have at our hands and what we can do with that data, it really is going to push on a set of these things. One, it's going to save tremendous amount of information that you're going to have to deal with. You know, the day of the terabyte, [blows raspberry] it's nothing, right? You think of petabytes and you think of data bigger than— getting insights out of it very quickly. That is a reality that we're walking into. When you think of things like hardware and storage economics, well, people aren't going to throw data away anymore because the economics around storage allow for that. The old day of, "Hey, I've got some business questions that I want to ask. Let's think of those questions, form a schema, and then go get the data and put it in there—" I think those days are probably numbered. The era of keeping all the information you have and admitting that, "I don't know the questions I want to ask in my business. "I don't even know all the data sources. "But when I discover—or I come up with a question and I discover a new source, I need to be able to plug it right in to my system quickly." This notion of multiple data types— I don't believe there is one data model that solves all problems—I just don't. There are going to be multiple data models—what do I mean by that? Well, there's a relational model I think most of us here understand. That relational model has a set of operators that manipulate that data. But that's probably not going to be the only model we're dealing with. There's probably going to be data borne in adjacent objects. Why—because of mobile phones and Windows applications that can benefit from a data model like that, a schemeless data model that's read-optimized and scales very well. There's going to be data borne in flat files—what do I mean by that? I do huge log files—huge log files. There's—you just need to be able to reduce this down into a small set. So when you think of multiple data types, just think of multiple data models, and really what that's going to put a lot of pressure on is it's going to say, "Yep, there's a native model with a native interface, but it's going to put pressure on this capability to query across those models." That's what something like—that's why PolyBase is important. And at the same time, there's user expectations that say, "You know what? "With all this data—it's fantastic, but I am not waiting around. "My business cannot afford to wait around. "I can't be a John Snow and walk around society trying to get more information. I should be able to sit in Excel, find data sources, attach to them, shape information, and get to insights very quickly." And then the last thing I want to point out here is multiple data sources— Quentin touched on this in the keynote this morning. I personally believe this is going to be a major, major thing for all of us to pay close attention to, right? You think of how hard it was to find a data source in the enterprise today, imagine how hard that's going to be tomorrow. This is why when you hear things like Data Explorer, that tool is—this is why that tool is important. You should be able to open Excel and type in some type of information you're looking for, and have the sources come back to you in relevance, opposed to you—remember—how many of us have walked around and said, "Hey, what's the connection string to get to that data source?" Those days are probably over—it just won't scale. It simply won't scale. Okay, so with that setup, let's use a fictitious airline. Just by raise of hands here, how many people are frequent fliers? How many people were delayed flying in—a lot of us. Okay, so the scenario we're going to go with is this— our theory is that we can create a better frequent flier program. We want to be able to work with a variety of data sets, but at the end of the day we want to basically see if we can improve our frequent flier program based on some information that is available to us. And the very first thing that we want to do setting up our first demo here, is we want to—we want to basically find and combine relational and non-relational data. Now the way we're going to do that is by—we're going to start off in PDW, and in PDW, we're going to create some external tables. And if you think of data sources as nouns, the nouns that we're going to deal with in this scenario is— we have relational data at our data warehouse. This is from our frequent flier app, so if any of you have registered as a frequent flier—all of that information in this scenario— think of it basically in a relational model. But we're going to pull in some weather information from NOAA. We're going to pull in some flight information from the ATC. And then we're going to pull in some sentiment information, and then we're going to mash that up and join all that data together. Ultimately, we want to be able to query that as a single query using T-SQL, a familiar skill that we have. And the second thing we're going to look at is Hadoop. So if you're not familiar with Hadoop, really what you're going to see here is we're going to process a bunch of log files and see if we can reduce that into a set of information that we can join into this bigger set that we're trying to put together. So with all of that, let's go ahead and have Mike Flasko come up and show us this demo. >> All right, thanks Shawn. [inaudible; voices lowered] Give me one second just to switch over—perfect, all right. So as Shawn mentioned, what we want to do first with our fictitious airline is start to combine a few data sources. I've got a couple of sources—one is flight data— just typical relational data we're probably all used to. It records when a flight took off, from where, was it delayed. If I'm an airline, I probably have that in my data warehouse somewhere already, right? What I want to be able to do is combine that with user sentiment. The best way to get that these days is oftentimes social media. It's where people get to be them; they get to say exactly what they think. And so ideally, we can query people's sentiment as they say it out on Twitter with what was going on with our flight records and understand how happy or unhappy somebody was with our service. What you're looking at right now is actually a Twitter feed. It's one that I captured some number of days back, but the key thing here is it does not look like relational data. It's jagged—there's exception messages throughout it. There's a lot of semi-structured information. What our goal is going to be is capture this Twitter data, join it with our structured data, and understand the user sentiment of what's going on. All right, so what I did beforehand is I captured some Twitter data. Don't worry about the specifics of this next one. We'll go through it in one of our coming demos. I used a project called "Pig"—it's in the Hadoop ecosystem. It allows me to easily work with data like the Twitter feed, to reshape it and give me something that looks like this, which is effectively—on a particular day, from what airport, what was the user sentiment? So I literally combed through a bunch of unstructured information to come up with a sentiment score, and the way I did that was I was just looking for words like, "I'm in this airport and my experience sucked," or #fail or #delay, or something like that. We'll go into the specifics of that in a second. But the key thing here is that once I've done this, now I can start to work with this data and— using technologies and approaches that you're probably already familiar with. What I've gone ahead and done is created a table— I'm not going to zone in too much on this. It's just a standard SQL table—it's something we've probably all done a million times. And so I've got a data table that's just recording flights, and it records delay coming out of those flights. Here's the more interesting one. I've got another table called sentiment. This table is actually just a lens that I'm imposing over top of that Twitter data that we looked at a second ago. This keyword external basically says I'm creating a table over some data that's not in my data warehouse—it's somewhere else. In this case, it's sitting in a Hadoop cluster. And I'm saying location equals—you can see there's an IP address. There's a text file behind this. You know, normally you wouldn't go with one text file, you'd go with many and go big data style, but the key thing here is that what we're doing is we're imposing a table view on top of a bunch of text files that we've processed to understand user sentiment. And so from this point on, I get to treat this data just like SQL tables, which is pretty cool. And so at this point, let's see how this plays out. So let's start SQL Management Studio— probably a tool that many of us are familiar with. I'm going to connect when we come up, databases, start a new query, and in parallel data warehouse, it has this capability built in through PolyBase—it allows you to do this. And so let me go over to my database—this is just a mirror of what we've got— and I'm going to run a particular query. Let's go grab it and bring it in. And I'll slow down for just a second. So what we're going to do here is we're selecting the first five rows. And notice here what we're joining is that sentiment table and that flight table. Okay? That sentiment table, again, is not a native table. It's just a view on top of some external data that's sitting in Hadoop. And then what we're asking for is the results ordered by sentiment. So what I'm looking for here is the flights that people had the most to say about on Twitter. And I'm ordering it by sentiment score to understand when people were most dissatisfied with my particular service. And so sentiment score here is just a numeric value that was incremented every time I saw somebody saying something negative about a flight on that day out of that airport. So if we run this, now we're running just standard SQL using approaches that we're fairly familiar with. Let me start up one other tool here so that we get a better view of this. Let me get this one, great, okay. Let's see if we can the results—it's coming. There we go, and let's zoom in. Sorry about that—I thought I had it started. All right, and so the first column we can see is the sentiment score column. You can see high numbers—590, 540, followed by some lower numbers. And so what we're seeing is that in these first 2 rows, we've got some high sentiment scores coming back. So people were fairly dissatisfied on 11/22 in Seattle and Chicago. And this other column here is our delay, which you can see people were averaging about a 98-minute delay and a 103-minute delay on 11/22 in these particular days. And so we've started to bring together now some user sentiment with flight records to understand people were fairly unhappy, there was points of high delay, and so— that started for us to look at, okay what was exactly going on in Chicago and Seattle that was causing this with our fliers? And so we step back and kind of dug in a little bit more, and I created another external table on top of Hadoop data called "Words." Words was just a view into what were people actually saying, and it did a count on the words that people were using the most often. So now we can say things like "select top 5," and what we're going to do is we're going to grab the most-used words, in this case, the top 5 or so most-used words. So again, we're just standard SQL, so we get to type in what we're just typically used to, we're going to order this by count and descend it, and here we should get the top 5 words people were saying. If I know my SQL, then I'll get it right, right? And now let's take a look at what people were saying. So we can see that people were talking about airport—no surprise there. They were talking about delay and weather—okay, I don't really need a ton of tools to understand that people— when they're delayed—will be talking about the delay. You can see the word Thanksgiving. Okay, we're looking at data from last November, so that kind of makes sense. And then you also see this word called "app" come up, so people are starting to talk about the app as well or some app experience they're having. It's not something I would expect to see inside of sentiment data that where people are talking about my service, so it's probably something I want to drill a little bit more on. >> Yeah, that seems unusual—why would we even see something like app? Is this referring to the airline app or registration app or something else? >> Right, and so typically—I don't know about you guys, but every time I'm on a flight and something's happening, the first thing I do is I pull out my phone and I want to know what's going on, am I delayed, can I get on a different flight, can I go on standby, that kind of stuff. And so our airline has an app— it has a phone app for iPhone, Android, Windows phone— we've got a web site, et cetera—a typical airline experience, right? And so the first thing that happens when you experience a delay is you yank out your phone, you load up your airline app, you check your flights, and you might try an upgrade or change or look to be rerouted somewhere else, right? And so that's what we expect a lot of users to do in a situation like this. And so you can kind of imagine, yeah, it makes sense for people to be talking about the app in these cases, but in this case, why were they talking about it? And so this is where we start to bring in more data sources. And so with all of those apps, what we were doing is we were collecting the telemetry off of all of them and aggregating it to the cloud. This is something that Hadoop does really, really, really well. And so what I've done is we've aggregated all these log files up into the cloud. Hadoop does a great job of processing log files that look kind of like this. This looks something like a standard IIS log, like a web log or a log you generate off of a mobile application. It's not important to look at the specifics, but the idea is that it's semi-structured, it's jagged, and oftentimes if you think about collecting this throughout the course of the year, we're not talking about a megabyte of files to look through. We're oftentimes talking about terabytes of files. And so that's where Hadoop and some of our big data offerings come in. It really let's you deal with data en masse that has semi-structure and query it fairly easily. And so back to our scenario of having all of these aggregated up in the cloud, what I have done is I've logged in already to my Windows Azure account. I have a Hadoop cluster that I've already provisioned— it takes about 5, 10 minutes to provision one— and I can come down and I can click "manage cluster." When I click "manage cluster," that takes me into kind of the dashboard of my cluster. I saw just a few hands in familiarity with Hadoop so I'll slow down in just a second, but it's enough to know right here that this is the dashboard for your cluster— once you've got one, this is your starting point. The first thing you often want to do is start to play with your big data. And so the way we do that is pretty simple. You take your log files, you upload them to Windows Azure Blob store. Think of Windows Azure Blob store—if you're not familiar—as a big disk in the sky. You can upload just any file you want just like you're copying a file to a file system. Hadoop, in this case, is great at attaching to those files and helping you process them. And so what I'm doing here is I'm going to jump into my interactive console— which is just a little lightweight web command line that lets me work with that data. So what I've done is I've taken this log file, uploaded it to my Blob store, and I've created a table view on top of it. Now this goes back to that same concept of imposing a schema onto data when you read it. And so you say, "I'd like to see this arbitrary data source over here like this." And so that's what we're doing here by saying there's an external table called "mobile apps," and it's backed by this location, which is just Windows Azure Blob store. What I've done is I've uploaded a bunch of text files to that raw directory in my Blob store, and I've told Hadoop, "I would like to treat it like this table." So if you can do a match that gets me this kind of a rectangle out of that data, that's what I want. This is something called Hive, for those that are familiar, in the Hadoop space. And so I've already created this table, and so now when I want to work with that data, I get to do things that are pretty familiar to me. I get to write select statements and use very typical syntax that we're used to. So I'm going to say "select *" from here. Because we're talking about potentially big data, I don't want to return a terabyte of data—we'd be here for quite a while— so instead I'm just going to say, maybe, "limit 5," just to show you what it looks like to start to use Hadoop to poke at some of this data, just in kind of an interactive way. The reason it's taking a few seconds is Hadoop's used to processing terabytes of data at a time— maybe 100 gigabytes or so—so it does all of this in parallel. It's able to take that terabyte, partition it up, and then run this query in parallel across a bunch of nodes to really speed it up. So you can see kind of what we've got right here. And so the reason we're doing all of this is because we want to be able to process that log file information and wiggle it down to that particular date where we were seeing poor user sentiment, because we want to know what was going on with our application during that period in time. And so what I had done is I'd run a query something like this that says, go ahead and give me a result set table called "logs for the month," and trim it off only by something that was in this particular month. Now you're probably asking like, "what's with the substring thing, Mike? "Why aren't you treating that like a "date time"—what's going on? I—this is just a big hacky business or what?" Well, it turns out that oftentimes when you're combining data sources, you're not always in control of the original schema of that data. You could choose to put it through a reshaping ETL process, but you're never going to know kind of what the— how you want to use that data all the time. One thing that Hadoop's great about is, however the format of that data is that comes in, you can oftentimes work with it and shape it in it's natural form to what you need without having to go through kind of a movement and shaping process up front— just kind of a preparation step, which is kind of cool. So what I've done now is I've got those log files off the month of November. We've got flight delay information, we've got sentiment information. The last thing I want to get is weather information. And so for weather, as Shawn was mentioning, these data sets for weather are available in a lot of places online—NOAA's one source of them— and what I've done is I've downloaded the weather set for the entire year of 2012, created a table over top of it the same way, and asked Hadoop to process it after the fact the same way I did my log files. So I'm not going to go through it again, but the idea is that the same approach I took for log files worked for my text-based weather data so that I could parse it up and get just the month of November and just in and around the date of 11/22. So I didn't have to worry about ETL'ing it, shrinking it, et cetera, I could just use it in its form. So at this point, we've got flight data, we've got sentiment data, we've got our app log data, and we've also got weather data kind of at our fingertips, but we still don't know why were people so upset at that particular point in time. We need to combine it and kind of look further. >> Yep, all right, let's go back here to number 1. >> All right, don't leave—okay. So that gives you a very simple example as to how we took all these different data sets and combined them into something that we can now operate against. So now we want to switch personas and think of Mike now— think of both Mike and I being at this Blue Yonder airline—and we have a theory. Our theory says that well, we know that there's going to be weather delays. we know there's going to be mechanical delays. But something else is causing "dissat" that we just can't quite put our finger on yet, and this is what we want to go analyze in the system. So with that, we know a few things. Mike, as a business analyst, is probably well aware that we have a data warehouse, that we have a mobile application that's running on multiple devices, that there are a number of data sources that exist in the world. But the thing we know for sure, Mike, as an analyst, he doesn't want to deal with all that complexity. Instead, he wants to interact with that through Excel. The reason that he wants to interact with that through Excel is so that he can quickly mash things up in a familiar tooling space. And for Mike, it's all about visualization— smart enough that we can ask questions together, but when we start to visualize a particular problem, our goal is to rapidly get to this insight that we can take action on. So with that, let's go ahead and switch personas and get into the next demo. Okay great, so what I'm sitting in right now is Excel 2013. What I've already done is pulled in a couple of data sets. I pulled in the flight information out of my data warehouse. I figure everybody's seen us at least once take data out of a SQL system and put it into Excel, so I skipped that step. I pulled in the weather stuff, but what I'm going to do next is I'm going to pull in that application telemetry that we were processing with Hadoop in the cloud. So I'm going to see—you're going to see how you can connect Excel to the cloud to pull in data out of Hadoop. And so for that, we're going to use something called Data Explore. We're going to say "from other sources," Windows Azure HDInsight. I'm going to point it at our cluster in the cloud. It's going to give me a listing of everything that's available to me. Mobile logs—looks like what I want. I want it from the month, so I'll select this guy. And now I'm starting to see, "okay, this data kind of looks reasonable." I want that actually in table form. I know it was just a comma file, so that's pretty simple. I don't worry about some of the extra columns. If they can get a rectangle out of this thing, that sounds good. Okay, now I can see there's headings at the top but they're not being treated right, so use that as my headers. So now I'm quickly into something that looks pretty natural to me as an Excel user—that's cool. They have dates and times but I don't want to look at— that granular, so just treat this as a date instead—get rid of that time information. Verb—I don't even know what that is, so get rid of that. URI—probably not going to use that. Okay cool, so now I can see that I've got some information by date. I can see how long a request took on average for that particular date. I can see it breaks up by customer type and device—that looks pretty good. So let's rename this as our "app log," great. And let's say we're done with that. And what it's doing right now is it's going—connecting to the cloud. All of this that I've been doing so far has been working off a preview of the data. So if there was a significant amount back there, this shaping wasn't occurring against it at all, I was doing it off of a preview, and once I say done, now it's going to do the download. So it's pulled in that set for me. I'm going to say "load to data model"—what this load to data model does is it brings the data into a PowerPivot model. So it uses our in-memory BI engine and just wires it all up for you so that PowerPivot, Power View, GeoFlow, all just work for you now on top of this data. And so now what I want to do is start to visualize this to kind of understand what was going on. So let's start to play with the data a little bit, and for that we're going to use something called GeoFlow— it's a preview that's available on Excel—let's start a new tour. Let's make ourselves a little bit of space. So now we've got this canvas—weather, flight, all this stuff oftentimes looks best on a map, so let's start with that. And let's start to lay down weather information. So we've got "lat" and "long" information—let's map that. These are all the weather points that we have. Weather often looks best as a heat map, so let's change it to a heat map, and let's make the values on that map the wind speed. And we want to view all of this by date, so let's say—let's actually view it by date. And we'd like it to be kind of changed by day, so let's change this to day and get ourselves going. So now we've got a heat map that shows us the weather on a particular day, and as I change the day, it's going to re-draw the heat map. So now I can see wind speed—when there was high wind, where it was—on a particular map. So for—after this now what we can do is we can start to layer on some of that additional data, like our delay information and some of the other sources. So let me start by that and come out and add another layer. And this time, we're going to start to put down some of that flight information. And so for this, we're going to select city and state. This is where our airports are. And what I'd like to do first here is put on the delay information. So let's add delay, and again, let's see this—all this data by date. Let's do it by date and go from there. So now what we can do is—let's change the perspective here a little bit so we can start to really see what's going on—okay. Let's bring ourselves this way a little bit, zoom in, and come out here—great. So now we can start to see across the U.S. as weather changed, how did it affect delays? So we're getting a pretty good view of what's going on at this point. >> So now that you've—we're looking at weather, we're looking at delays— these are things that we're pretty familiar with. But what about this app thing—this app thing is probably the bigger concern because it's not something we'd typically look at. >> Right, and so what we can do there is— we've got app data, we've got 2 kinds of data that we've pulled in— we've got sentiment data as well as our application data. So what I'll try first is bringing in sentiment data and see how that affects it, and then we'll bring in the app data. And so let me select sentiment here and bring that data in, and again, we've got it factored by day. And so let's kind of—change our date a little bit— and let's go back to that key day that we saw on 11/22. So we can start to see that there's a few spikes happening on 11/22, where the blue bar is our sentiment and the green bar is our delay. And so originally when we queried this data, we saw these 2 spikes and we thought, "Huh, there's gotta be something specific to Chicago and Seattle that was causing people to be so upset." But now that I see this data in a more macro lens, I can kind of walk around it and I can see that, wow, there's actually a lot of green bars on here, which are showing significant delay almost all over the Northeast and the Northwest. And so if I look here I've got 296-minute delays, I've got 103-minute delays, I've got 193-minute delays. So looking at the data in here made me think it was a pinpointed problem, but it's not until I get to see the data out in this kind of a visualization that I can understand these pinpoints are probably where people got stuck at like hubs of airports or something like that, so they had time to kind of voice their frustration. I don't know about you guys, when I flew here I had one of these situations as well where I was—I got very familiar with the Houston airport for quite a while, as all the other airports around me couldn't get planes into Houston to pick us up and bring us over to the conference. So it was a situation kind of like this, where there was a much broader situation going on where a visualization like this really lets you understand the data. >> And clearly, the whole region is affected or impacted by weather—>> Right. >> —but we don't have these spikes, which is just sort of an anomaly on that page. >> And so if we look at these a little more, we say, yeah, what's going on? If we go back and remember what people were exactly saying in those tweets, they were saying airport, delay, weather—okay, great. We now understand this wasn't a localized situation, it was fairly broad. But they were also complaining about the app. I don't yet understand what that meant. But we know it was happening on 11/22. So given that it's not a localized situation, let's kind of step back and look at the app data in the more aggregate. And so for that, I'm going to open up another sheet, and what it's doing is it's loading something called Power View, which another way to visualize data in Excel. The way you get data in is using the same techniques that I just showed with Data Explore, and so I've brought the app data into this sheet as well. And what I'm visualizing up top here is that log file information that we had processed with Hadoop. And in the bars up here what we're seeing is request delay mapped by date. And so what we can see is when somebody was using our apps, they were all experiencing roughly the same delay, like when I go to do an operation, how long it takes to succeed. All about the same delay until about 11/22 here, when the performance of the app or the responsiveness of our set of apps on aggregate really started to degrade, and people were experiencing more significant delay. And so if we kind of pan down and look, now we can see here that the same delay is being visualized by device type. If you remember in those logs, it recognized what device people were using to access the app. And you can see that the experience is pretty much the same across device type. And so that leads us to think there's probably nothing wrong with the app itself on the device. There's got to be something going on the back end that's causing this kind of delay. And so we saw significant delay going on. We saw high "dissat," and we also see that it was on 11/22, which is kind of Thanksgiving season. So there's probably high volume—a lot of people on there going what the heck's going on—we've got a ton of travelers. So there's probably some contention going on. The next thing, as a benefits coordinator, is that I look down here, Shawn, and I can see that the vast amount of people using my app on this day are my preferred fliers. So I can get a sense of who was experiencing this poor experience, and it was the people that I want to keep the happiest. And so unfortunately, it wasn't a great experience for them, but there's a clear opportunity for us to do something with this app. >> Absolutely, hey, thanks for your time. >> Thanks. [applause] All right, okay. So we had this—we had this theory that basically said we're familiar with whether we know we're going to have mechanical delays. Sentiment is a new stream for us that we can incorporate, and ultimately this is allowing us to take action. And much like the John Snow story where we started off with data— we brought in external sources— now it's really about taking action on some of the new insights we have. Ultimately we want to build a "worry-free Elite" program, which, for example, could auto-generate alternate itineraries, maybe booking cars and hotels, restaurant reservations. That would be fantastic for our preferred fliers. But there's a second thing that we need to take action on. My hunch is if we have everybody pulling out a device in a region trying to change their flights, there is some kind of contention with our back end. And in this case, our back end is SQL Server, it's a—it sits behind this app for many, many years. And let's switch gears and talk a little bit—let's do a little bit of diagnostics on that back end and see if we can discover that problem and address it. So now we're going to switch gears to the DBA persona. The next demo that we're going to show is the DBA—actually trouble-shooting— let's understand what's going on with our back-end database. Really what I want to call out to everybody in the room right now is when you think of SQL server 2014, one of the biggest investments we've made is in In-Memory OLTP. The way to think about that is pretty simple. Really what we've done is we've addressed this hardware trend that says there's a lot of memory on machines. And at the end of the day, if you feel like you've done everything you can to increase through-put but you've just literally run out of options, this is where memory comes into play. So think of SQL server 2014 In-Memory OLTP as a feature that's just built in. As Quentin said this morning, you don't have to go buy another SKU or another appliance to go with it, it's just built in to the product. It's memory optimized, which you'll see some of the through put here in a second, and it really doesn't require any application re-write. So with that, let's keep pushing our theory forward. And we know that there's something with that back end, and let's have Sunil come up and show us In-Memory OLTP. >> Thanks, Shawn. So there's a DBA for this application. I heard from Shawn and Mike that my application is having performance problems with the peak workload induced by airport delays. So what you see here is a custom driver written for my application. Let me start this driver here. And what this driver is doing is simulating a workload that is introduced by different appliances and devices. People are quitting the database system, and what you see here is I'm getting approximately 2,400 transactions per second, which is pretty good. But it is clearly not meeting the business requirement at the peak workload that Mike and Shawn just talked about. So to—getting a more deeper understanding of what is going on, let me look at the diagnostics. So I click on this button "diagnostics" here, and you notice it is exposing 2 key metrics for me. One is a CPU utilization—you see it is running around 50 to 60%. It is not using the full CPU capacity of the machine, but that is one. And second thing is I'm seeing tons of latching. And I strongly believe that these 2 factors are causing the performance delays that my customers are experiencing. >> I see, so in that region that Mike was showing us, delays happen for weather reasons, everyone pulls out their phone to start making connections. Now we have just about everybody around connecting to this database. And now we see this latching, which could be this concern. So what's the story behind latching in this context? >> Right. >> These base tables, they use pages to store the data roles, and to guarantee the physical consistency of the data, the pages need to be latched. In a highly concurrent environment that we are seeing with this application right now, many consumers are accessing their reservation and they are hitting the same page. When they hit the same page, it leads to latching bottleneck. Now page latching is a common issue with this disk-based relational database systems, so this is pretty much what we have. We have an application that is running into a latching problem. Now as—with SQL 2014 we have In-Memory OLTP, so it's a DBA. I'm interested to know how it can solve my problem that I'm seeing here, all right? So the SQL for '14 is shipped with a tool, and I'm going to bring the tool here. What it does is it analyzes my workload, and identifies the hot spots. So in this tool, the way you see it, it has already profiled my application, and it has completed the analysis of 2 key components in my application— One is the tables, and second is through procedure. So let me first click on the tables and see what it tells me. So I click on the tool and I see the table, and here it shows a chart to me. On the y-axis it shows which tables I can move to In-Memory OLTP for maximum promised gain— higher on y-axis, the maximum gain I can get. This tool has analyzed that. And on the x-axis, it shows what is the cost of migrating— how easy it is for me to migrate. The higher the x-value, the easier it is to migrate. So if you look at this quadrant—this is my sweet spot— and what it is telling me is a ticket reservation detail table is the hotspot in my application. So with that information, I want to migrate that table to n-memory technology. So by using a table from one mode like this based in-memory is a very simple process, you can write a very simple T-SQL script to create the table and the in-memory, and migrate the data. I'm sure all of us have done that many, many times. So for this demo, what I've done is I have wired in that script in my tool, which is here. So let me—let me just migrate the table that my tool identified into In-Memory OLTP. So now the migration is complete and let me start it. Right off the back, you'll see not 10%, not 20%—you're seeing 7X gain. I mean, this is insane, and I'm pretty positive that this has solved my business problem at hand, and not only that, I was able to do that without changing my application, without upgrading my hardware. Isn't that phenomenal? [applause] >> Yeah, it is. [applause] You know it's interesting, a 7X increase is just— it's crazy in terms of that much more through put. But why, why do I see that kind of increase now that I have this table now running in-memory? >> Right. So in-memory storage is a different mechanism. It does not use pages to store the data. When the pages are gone, the latching is gone, number 1. Number 2, because data is in physical memory, we can optimize the way we access the data. Those 2 factors combined are giving me 7X gain. Now the most interesting part about this is there is more—we're not done yet. We can give you even better performance. So let me stop here and go back to the tool. Remember the tool, it has analyze tables and the stored procedures. So let me go back and look at what it says about the stored procedures—that's my business logic. So I click on the stored procedure analysis, and what you see is the tool has identified the high CPU consumers in my application. These are the procedures that are taking the most CPU cycles in my application. So what I need to do is, I need to migrate these stored procedures into memory optimized engine, okay? So I'm going to go back to my driver where I've hard-wired the script to migrate these stored procedures into In-Memory OLTP. So I go here, and now it's migrating those stored procedures into In-Memory OLTP. Now that migration is complete, and let me start it. And what you see is the performance gain has gone to 27X. This is a game changer for us, not only to solve my business problem. It is solved with tons of headroom for future growth. >> Wow, thank you. [applause] Hand me that, yep I've got it. Just a second—[inaudible; voice lowered] There you go—perfect. Thank you—okay, so it's great to see these demos. I always pause right here because when you're sitting there watching it, if done right, it tells a really good story. But in reality, I just want to thank—there's a whole team of people in this room right now that've been putting—they've put a ton of time into putting these demos together, so— hey, thank you to you all and I appreciate it. [applause] Okay, so the final stretch here and I'll get us out of here a couple of minutes early. So for SQL Server 2014, yes, our big investment is In-Memory OLTP. What you really need to know is it's built in, you use your familiar skill set. You can run it on existing hardware. You can migrate existing applications. Does that mean you can migrate every app in your environment and it will just work? No, right? There's—we've done a lot of homework to get an idea as to the objects that are most commonly used and that's where we focus, but you could have an object in an application that isn't supported, and I would tell you that now so you pay attention to those kind of details. There's other things that we've done in SQL Server 2014. Quentin touched on the enhanced in-memory ColumnStore—or clustered ColumnStore index. It's updatable—it brings compression—it's fantastic. You'll have great, great data warehousing query speeds. We bring—we've enhanced AlwaysOn—so AlwaysOn is our HA technology. We introduced more readable secondaries. The team has done a lot of work to increase our manageability in terms of diagnostics and management views. And then we've done a lot to bring Azure closer to your on-premise resources. And specifically what I mean by that— there's some bread-and-butter canonical scenarios like backup. So with Smart Backup, we make it out of the box, a piece of cake to— and a very safe and reliable way, back up your on-premise database into Azure store. So it's really straightforward. And then I would say this to you now—and there's a scenario in here— it's one of my favorites—I'm going off script here for a second. But one of my favorite hybrid scenarios is enrolling an Azure VM with your on-premise availability group, which is phenomenal. But a lot of people have asked about that. They've said, "Hey, I want to do reporting. "I don't want to stand up a bunch of stuff on my premise, "but I'd love to be able to enroll in Azure VM as a secondary—that would be remarkable. "And then I can wire up Azure reporting services to that VM and have an instant reporting solution." And keep an eye out for that—you'll see it in 2014. So that quintessential journey that we saw in the beginning with John Snow remains true today. It's the same process, but what's different is that with our tools it's easy. When you have a theory, with our tools, it's powerful—which we all saw. And then of course, when you're ready to take action, it's complete. Now as far as our connecting the technologies, once again— that we looked at—and then I'll give you just the real brief call to action here. PDW—if you don't understand it when you go to learn it— what I would encourage you is just sort of learn this one concept called an external table. There are tons—there is tons of value in PDW, but that external table is an interesting concept, especially as you saw in the demonstration from Mike. Because that's the thing that's going to allow you to bring data sets— unstructured data sets—into PDW so you can query it through a single fashion, through familiar tools. Then of course, we have Azure HDInsight, Excel, and then we looked at 2014. Call to action is—for PDW, the best thing to do is reach out to your Microsoft account rep, or you can go to And there are—there's all kinds of information about PDW. Turns out, PDW ships in a big appliance. That's why you don't see those things sitting all over the hallways here. Second thing is for HDInsight—do not be afraid of Hadoop. Do not be intimidated by Hadoop. You can actually learn this—the concepts—really fast. If you're at work right now and you feel like everyone's breathing down your throat to learn about Hadoop, don't be intimidated. Be anxious—it's okay, I was when I first started to learn about it— but I'll tell you, as you get into it, it's really not that complex. From a computer science standpoint, it speaks to a lot of distributed concepts that most of us are familiar with. It's just done through different open-source projects. The biggest thing that I want everybody to know on Open Source, is it is not business as usual at Microsoft with Open Source. We are very, very serious about being a participant in that community. We have our own committers. Committers are folks that can actually have the authority to check in code into these open source projects. If you're not familiar with Open Source— a few things to learn there is just know what a committer is. Every person in this room can participate and help a given project. What's a project—HDFS or the Hadoop Distributed File System. What's another project—Hive—they have all these zoo names. But it's relatively straightforward. The ones that I would tell you—that I would encourage you to pay the— I'd probably go learn first—HDFS, Pig, and Hive. Go learn those—go learn those quickly. Hive is one of those that stands out quite a bit. Why—because it's a data warehouse query-like language. But learn about committers, know what it is, and from a Microsoft perspective, what I want everybody to know is we are absolutely committed to participating in that landscape and fostering that ecosystem. If you've never stood up a Hadoop cluster before, just go sign up for the HDInsight public preview, and you'll have a Hadoop cluster set up over lunchtime—it's a piece of cake. It's really not that hard. The things that I would share with you if you're in the early learning stage is I'd be careful about building out your own cluster on premise—it's incredibly hard. The first time I ever set one of these up myself on a Linux-based system, which I— man I am all about understanding what we have and what other choices that exist out there—I'm not one of those guys that ignores that— it took me about 5 or 6 days, and I would consider myself to be pretty familiar with distributed systems. The thing that I like about the cloud is I don't have time for that, and I can file a new cluster and have it up and running and try out— the thing that Mike showed you has all these tutorials that are great for you to learn. And then in Excel, we showed a couple of really good tools in there for codename Data Explorer and codename GeoFlow. You can download those previews today, work with those. And then finally, sign up for the SQL Server 2014 CTP 1. And thank you all very much for your time. [applause] [Edgenet] Edgenet provides software services for the retail industry. Our solutions are really focused on selling product, so it helps retailers sell more product to their customers, and helps the customers find products easier. When a customer is shopping online, they have to have a great experience or they're just going to go to another website to buy the product they're looking for. So that experience needs to be very concise and give them the information they want in a quick manner. Looking at the progression of in-memory systems from Microsoft with Power View to a billion rows and PowerPivot, we knew that In-Memory OLTP would be a game changer for us. One of the things that we've had a problem with was being able to have real-time product updates, as well as having that system online for consumers to use. With In-Memory OLTP, you don't use a disk. Everything is cached in-memory, so that speed constraint of disk is gone. You can support a lot higher load on a smaller machine, and it's very efficient at the memory that it uses. You don't need large servers for it. The traditional system was updated once a day and the process took a few hours. The new system can be done 7 times faster in about 20 minutes, and the system isn't down while it's being updated. That information is streamed directly into that database, and it's always real time. One of the great things about SQL Server In-Memory OLTP is it uses the same tool set that the developers and DBA's are used to. To them it looks basically like standard SQL Server. Using this technology, we can provide real-time product updates and real-time information that other people can't provide. In retail, speed is everything. With our systems and In-Memory OLTP, our customers can accelerate their business over their competitors. [Microsoft]

Video Details

Duration: 56 minutes and 2 seconds
Country: United States
Language: English
Genre: None
Views: 7
Posted by: asoboleva99 on Jul 9, 2013
Languages for CC from MT engine:
Fra, Ger, Rus, Bra, Ita, Kor, Jpn, Spa, CHT

Caption and Translate

    Sign In/Register for Dotsub to translate this video.