Big Data Scenario 3: Extract, Transform, and Load (ETL)

[Big Data Scenario] [Extract, Transform, and Load (ETL)]

[male speaker] Hi, and welcome to this presentation on a big data scenario for extract, transform, and load operations, or ETL. My name is Graeme Malcolm, and in this presentation I'm going to show you an example of how we can integrate an ETL solution into a custom application and then use HDInsight to perform processing against some big data that's been captured by the application.

[Racecar Analysis]

The scenario for this example is that of a race car. We have a race car team who wants to analyze the telemetry captured from the car as it goes around the track and completes a lap. So we have a custom application that captures real-time telemetry from the car from various sensors on things like the brakes, the engine, and so forth. Those readings then need to be combined and loaded into a database for analysis. To do that, we're going to use some of the techniques for processing data in HDInsight to get the data into the shape we need, and then load it into a Windows Azure SQL Database table so that we can perform some analysis. So let's take a look at how we would go about implementing that solution.

Let's start by looking at the Microsoft Azure infrastructure that we have in place. We have an Azure SQL Database instance with a database named RaceCar, and if we have a look at that database, you can see that it contains a single table called LapData. If we go and look at what's in that table at the moment, using this query, we can see that the table is currently empty. It has columns for a time, a latitude and longitude to tell us whereabouts on the track a particular event was logged, the speed the car was going at, the revs that were being recorded by the engine at that point, the oil temperature, and the brake temperature. So we've got an empty table that we're going to populate with data captured from the car as it goes around the lap.

Let's have a look at what else we have. We have an HDInsight cluster running, which we're going to use to process the data that we capture, in order to shape and filter it and get it ready to be analyzed and loaded into the SQL Database table. That HDInsight cluster uses an Azure storage account as well, so we have a storage account that we will use for the files that are going to be processed by our HDInsight cluster.

So that's what our Azure environment looks like. Let's now take a look at the custom application that we're using to capture the data. Here's the source project for the application. It's written in C#; it's a simple console application. I've added a couple of packages to this project in order to use some of the SDK components for HDInsight. If we have a look at what's been added in the NuGet Package Manager, you can see that I've added the Microsoft .NET API for Hadoop WebClient package and the Microsoft .NET Library for Avro, which we'll talk a little bit about in a short while. There are some other packages that are dependencies of these two, and they've been added automatically as well.

So we've got those packages in place. Let's have a look at what the application actually does. It's tracking data that's being read from sensors in a car, and we're doing that by implementing each sensor reading as an object in the application itself.
So we have some classes to define these objects, and there are some annotations on these classes to help us when we serialize the data, giving us an idea of what these fields actually mean. We have a class called EngineReading, which has the time that the reading was taken, the revs that were recorded by the engine at that point, and the oil temperature. Then we have another class called BrakeReading, which obviously holds information about the brakes; again we get the time the reading was taken and the brake temperature at that time. And finally we have a GpsReading, which has obviously been read from a global positioning sensor on the car, so we get the time the reading was taken, the position, which is of type Location, and the speed the car was going at that point. This Location type is just another struct that we've created, which has the lat and lon (latitude and longitude) values that tell us whereabouts geographically the car was when that reading was taken. So there's a slightly more complex type here, because this class has a field that is a custom type in itself.

Now in reality, we'd have a program that captures this data via some sort of wireless link to the car, possibly using something like StreamInsight to process the data in real time. Because this is just a simulation, we're simulating the readings by reading them from a text file, and we're recording them in these lists. We have a list of the GPS readings, a list of the engine readings, and a list of the brake readings. So as the car goes around and does its lap and we read these sensor values, we're just adding them to a list until the lap is completed.

Here's the main program itself. As I said, you can see we're simply reading these readings from a file; in reality, we'd be getting them from a real-time stream from the sensors. So while the lap is running, we're gathering the data and adding it to those lists, and when we finish the lap, we're ready to take that data, bundle it up, and load it up to our database for analysis.

The first thing we need to do is take those lists of objects that we've recorded and serialize them into some sort of file that can be processed. To do that, we're using the Avro capability in the HDInsight SDK. Avro is a serialization format in which I can serialize the schema as well as the data, so the file contains information about the schema of the data as well as the actual data. I can compress that data, and I can use it with any of the HDInsight technologies; they can read Avro files and understand the schemas from them.

So let's take a look at the code that we use to serialize those readings, and we'll just go to the definition of that function. We basically do the same thing three times, once for each list: our GPS readings, our engine readings, and our brake readings. Here's the code for the GPS readings. Really, all we're doing is creating ourselves a file stream, and then creating a writer using the AvroContainer.CreateWriter method. When we do that, we're simply going to write a stream of these GpsReading objects into the writer, and I've specified that I want to use a deflate codec as well, so I'm compressing the data as I write it. That's then written out to a file, so we end up with a file containing the schema and the data for those GPS readings, serialized and compressed. There's similar code for the list of engine readings and the list of brake readings, so all of that data is serialized out to local files on the client computer, ready to be processed.
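To make the serialization step concrete, here is a minimal sketch of what a reading class and the Avro writer code described above might look like. It assumes the Microsoft.Hadoop.Avro package (the Microsoft .NET Library for Avro mentioned earlier); the class names, field names, and the use of a numeric lap time are illustrative rather than taken from the actual demo project.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;
using Microsoft.Hadoop.Avro;            // reflection-based Avro serializer
using Microsoft.Hadoop.Avro.Container;  // AvroContainer, SequentialWriter, Codec

// Illustrative versions of the sensor reading types described in the demo.
[DataContract]
public struct Location
{
    [DataMember] public double lat;
    [DataMember] public double lon;
}

[DataContract]
public class GpsReading
{
    [DataMember] public double time;        // seconds into the lap (illustrative representation)
    [DataMember] public Location position;  // a custom type nested inside the reading
    [DataMember] public double speed;
}

public static class AvroSerialization
{
    // Serialize a list of readings to a local Avro container file,
    // embedding the schema and compressing the data with the deflate codec.
    public static void SaveGpsReadings(IEnumerable<GpsReading> readings, string path)
    {
        using (var stream = File.Create(path))
        using (var avroWriter = AvroContainer.CreateWriter<GpsReading>(stream, Codec.Deflate))
        using (var writer = new SequentialWriter<GpsReading>(avroWriter, 24))
        {
            foreach (var reading in readings)
            {
                writer.Write(reading);
            }
        }
    }
}
```

The engine and brake readings would follow the same pattern, each with its own class and writer call.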
Once we've done that, we're ready to upload those files to our Azure storage account so they can be processed by HDInsight. I've got an asynchronous method here that I'm going to call to upload those files. If we go and take a look at its definition, I'm gathering some information from the application settings, things like the account and the key to use when I upload the data, and the name of the container that I'm uploading the data to. All of that information is in the app.config file; you can see there's a bunch of key values here with the various settings that I'm going to need in my code. So I get the required credentials and the information about where I actually want to upload the data, and then I create this WebHDFSClient object, which allows me to create a Blob storage adapter to upload data to that Blob store. You can see that I'm simply creating files in storage based on the files that I've got locally. So I'm uploading that GPS file that I serialized to my local computer, and I'm creating a file called gps.avro in my storage account so that I can pick it up later with my HDInsight cluster to process the data. I do that for all three of the files and get those uploaded.
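The upload step described here might look something like the following sketch. It assumes the WebHDFSClient and BlobStorageAdapter types from the Microsoft .NET API for Hadoop WebClient package; the app.config key names and the file paths are illustrative, and the exact constructor parameters should be checked against the version of the package you are using.

```csharp
using System.Configuration;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;

public static class Uploader
{
    // Upload a locally serialized Avro file to the Azure storage account
    // used by the HDInsight cluster, via the WebHDFS Blob storage adapter.
    public static async Task UploadReadingsAsync(string localPath, string remotePath)
    {
        // Credentials and container name kept in app.config (illustrative key names).
        string account   = ConfigurationManager.AppSettings["StorageAccountName"];
        string key       = ConfigurationManager.AppSettings["StorageAccountKey"];
        string container = ConfigurationManager.AppSettings["ContainerName"];
        string user      = ConfigurationManager.AppSettings["HadoopUserName"];

        // The adapter maps WebHDFS calls onto the cluster's Blob storage container.
        var adapter = new BlobStorageAdapter(account, key, container, false);
        var hdfsClient = new WebHDFSClient(user, adapter);

        // e.g. localPath = @"C:\temp\gps.avro", remotePath = "/racecar/source/gps.avro"
        await hdfsClient.CreateFile(localPath, remotePath);
    }
}
```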
Once I've uploaded those files, I'm ready to start using some HDInsight jobs to process that data. To do that, rather than calling all the jobs individually from the client, I'm going to use a workflow of various operations, some Hive operations, some Pig operations, and a Sqoop operation as well, and I'm going to encapsulate all of those in a workflow that I'll use Oozie to execute. So the first thing I need to do is upload the workflow files. If we have a look at the code for that, it's pretty similar to what we just saw: getting some details from app.config about where I want to upload the files to, creating my WebHDFSClient and my Blob storage adapter, and uploading all of the files that are in a local folder called OozieWorkflow. If we have a look at that folder in Solution Explorer, it contains a number of files. The main one that we're concerned with is the workflow.xml file, which defines the Oozie workflow that HDInsight is going to perform for the ETL process.

So let's have a look at what's in that workflow file. An Oozie workflow file is really just an XML file that defines a series of actions to be performed by HDInsight. The first action is called DropTables; that's where we start. I've passed in a couple of parameters here: the nameNode, which tells Oozie the cluster on which to execute this action, and a jobTracker that's used to track the status of the job, so I'm passing that along. I'm telling it that this is a Hive action (you can see the type of action here), and the Hive job that's going to run will use some Hive settings defined in this file here, hive-default.xml. That just gives Oozie the environment settings for Hive. The really important part is that we're telling it to execute the Hive code in this file, droptables.hql. And here's what droptables.hql looks like: very simple, it just drops the Hive tables that we'll be using to shape the data, in case they were left over from the last time around.

Then, assuming that action succeeds, we move on to the next node, which is CleanseData. CleanseData is not actually an action; it's a thing called a fork, and what it does is split my workflow into three parallel streams. So in this case I'm going to start three jobs in parallel with each other: I'm going to filter my GPS data, my engine data, and my brake data, and for efficiency I'm going to split that across three separate paths. Over here is the code for the FilterGPS task. Again, a couple of parameters are passed in, and again I'm telling it to run a file. This time it's a Pig action, so the file is a Pig Latin script called GPS.pig. If I have a look at GPS.pig, here's the code. There's some Pig Latin here, and Oozie is going to use Pig to run it. We're going to load the Avro file that we uploaded from the client application, and note that I'm using AvroStorage. What that allows Pig to do is read the file and get the schema from the file itself; it will understand what the data in the file looks like, so I don't need to tell it what columns there are or anything like that. Then I'm going to filter it where the position is not null and the time is not null. It's quite possible that I might have some bad data in there, received from the sensors where the reading is incomplete, so if the position is null or the time is null, we'll just get rid of those rows. Then we'll generate the time, we'll flatten out the position (remember, the position is the complex object that has the lon and lat, so we'll flatten it into its two distinct values), and the speed, and we'll store those results into the RaceCar GPS location. There's similar code for processing the brake data and the engine data. We go through each of those and effectively flatten out the data, filter it to get rid of any null values, and then store the results in a tabular format in these locations: RaceCar Brake, RaceCar Engine, and RaceCar GPS.

So when the workflow has run all of those Pig jobs and they've all completed, we come to this join here, where we reunite the different streams of the workflow, and we go on to the next action, which is CreateTables. This is a Hive action, and it runs the CreateTables.hql script. Again, let's just go and have a look at that. What that script does is create a table on top of each of the files that we generated using Pig. So we used Pig to flatten out the data and filter it to make sure we've got what we want, and we're then effectively projecting a Hive table onto each of those Pig outputs so that we can use HiveQL to query the results. We're also creating a fourth table called Lap, which has fields for all of the various values, because obviously we want to combine these three different types of sensor readings into one table. So we've created this table called Lap, which of course is currently empty.
Once we've created those tables, the next action in our workflow is to load the Lap table. Again, it's a Hive action, and it runs this LoadLapTable.hql file, so we'll go ahead and look at that. Effectively, all we're doing is taking the three distinct sensor tables and joining them based on the time, so for the same timestamp we've got the different readings that were recorded at that point, and we're loading all of that data into the Lap table. So we've got all the readings from the GPS table, plus the revs, the oil temp, and the brake temp for each point in time during the lap. We end up with a Hive table that has a complete record of all of the sensor readings all the way around the lap the car completed.

Once we've loaded that table, our final task is to transfer the data from the HDInsight cluster, or rather from where it's stored in our Azure storage account, across into our Azure SQL database. To do that we're going to use a Sqoop action; Sqoop is the technology we use to transfer data into and out of HDInsight, in this case out. There are a whole bunch of arguments that we're passing across to Sqoop, including things like a connection string for the database server that we're connecting to, the table that we want to load the data into, the folder in Azure storage where we're getting the source data from, what the terminators are, and so on. So we're going to get the data from there and run that Sqoop job, and Sqoop will then transfer the data across into our table in SQL Database so that we can do our analysis. At the end of that we've finished the workflow, so we just go to the end node. And if anything goes wrong at any point, we go to this fail node here and get some sort of error message about what happened. So you can see that we've encapsulated all of the steps in our ETL process in this Oozie workflow.

What we do once we've serialized the data files and uploaded everything is go ahead and create and execute the Oozie job. We gather all the settings that we need in order to populate the various workflow parameters, things like the connection string for our Azure SQL database, and so on. We start the workflow by creating this OozieHttpClient class, pointing at the appropriate cluster with the appropriate credentials. We create a new Oozie job with the properties it needs, add the various parameters required by some of those actions, and then submit the job. What we get back from that is a unique identifier for the job, and we display that in the console window, because it allows us to use a browser to view the current status of the job as it runs.
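As a rough illustration of that submission step, a sketch along these lines uses the Oozie client types from the Microsoft .NET API for Hadoop WebClient package. The property names, storage paths, and exact method signatures here are assumptions based on that package's published samples rather than the demo's actual code; Json.NET is assumed to be available for parsing the response.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebClient.OozieClient;
using Newtonsoft.Json.Linq;

public static class WorkflowRunner
{
    // Submit and start the Oozie workflow that was uploaded to Azure storage.
    public static async Task<string> RunWorkflowAsync(
        Uri clusterUri, string userName, string password,
        string storageAccount, string container, string sqlConnectionString)
    {
        var client = new OozieHttpClient(clusterUri, userName, password);

        // Illustrative storage root for the cluster's default container.
        string rootPath = string.Format("wasb://{0}@{1}.blob.core.windows.net",
                                        container, storageAccount);

        var properties = new Dictionary<string, string>
        {
            { "nameNode", rootPath },
            { "jobTracker", "jobtrackerhost:9010" },        // value assumed from HDInsight samples
            { "oozie.wf.application.path", rootPath + "/workflow" },
            { "oozie.use.system.libpath", "true" },
            { "user.name", userName },
            { "sqlConnectionString", sqlConnectionString }  // hypothetical parameter for the Sqoop action
        };

        // Submit the job, read its id from the JSON response, then start it.
        var response = await client.SubmitJob(properties);
        string json = await response.Content.ReadAsStringAsync();
        string jobId = JObject.Parse(json)["id"].ToString();
        await client.StartJob(jobId);

        Console.WriteLine("Oozie job started: " + jobId);
        return jobId;
    }
}
```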
So that's our program; that's everything we need it to do. Let's go ahead and run it and see what happens when we upload the data. We'll start the application, and you can see that it immediately starts capturing the different sensor readings as the car goes around the lap. It will carry on doing that until the lap is complete and it's ready to start uploading the data to be processed by HDInsight. When the lap is complete, the data is saved, then it's uploaded to Azure storage, then the workflow files themselves are uploaded, and finally the application initiates the Oozie workflow. There we get the link for the Oozie job, which we can use to view its status. I'm just going to copy that to the clipboard, and we'll use Notepad to clean up that link. There we go. Now I can use my browser to go and have a look at that link. It asks me to log in using my cluster credentials, and then I can actually track the Oozie job. I can see the status of the job as it's being performed, and I can see the actions that are being started. We're on the CleanseData node at this point, where we've forked out, and we've actually started the FilterGPS action. If I refresh, I can see more information coming in as the actions themselves get started, and I can keep refreshing the page until I see that the workflow is complete. When the entire workflow is done, I can see an End action with a status of "Done." So we've finished everything we need to do.

If I now go and have a look at my table in the Azure SQL database, I find that the data has indeed been processed by my ETL workflow and transferred across. I now have a table in my Azure SQL database that I can use for analysis. So, for example, my racing team analysts can use something like Excel, and from there they're going to connect to the Azure SQL database that we've just loaded the data into, connecting to it as if it were a SQL Server. Now, the name of the SQL server we're connecting to is part of the connection string we'll require here, so we'll just go and verify what that is; here it is. The 1433 is the default port, so we don't need that, and I can type in the server name. I'm going to use the credentials that I created when I first provisioned that Azure SQL database, and I can connect; the database is RaceCar, and I should see the LapData table there. So now I've got my data available to me, and I can go ahead and download it into my workbook.
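If you prefer to check the loaded table from code rather than from Excel or the portal, a quick query against the database might look like the sketch below. This is not part of the demo application; the server name and credentials are placeholders, and it simply assumes the LapData table that Sqoop has just populated.

```csharp
using System;
using System.Data.SqlClient;

class LapDataCheck
{
    static void Main()
    {
        // Placeholder connection string; 1433 is the default port for Azure SQL Database.
        const string connectionString =
            "Server=tcp:<yourserver>.database.windows.net,1433;" +
            "Database=RaceCar;User ID=<user>;Password=<password>;Encrypt=True;";

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Count the rows that the Sqoop export loaded into the LapData table.
            using (var command = new SqlCommand("SELECT COUNT(*) FROM LapData", connection))
            {
                int rows = (int)command.ExecuteScalar();
                Console.WriteLine("LapData rows loaded: " + rows);
            }
        }
    }
}
```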
Having downloaded that data, I can now go ahead and start visualizing it. For example, I could insert a Power View report, and in Power View all of those fields have automatically been added. Actually, all I really want here is the brake temp, the lap time, and the lon and lat, so we'll get rid of everything else, and I don't need this filters pane either. What I might do with that, if I make it a little bit bigger, is turn it into a map. It's got things in slightly the wrong place, so I'm going to tell it that lat is the latitude and lon is the longitude, that I want the size to be indicated by the brake temperature, and the locations to be based on the lap time, the time each reading was recorded. If I do that, I can see a plot of what happened as the car went around the lap, and I can see where the brake temperature was at its highest as the car goes around the corners, where the brakes obviously get more use. I could then add a different set of data over here, so I could have the lap time with the oil temperature, perhaps, and display that as a line chart. I can see that the oil temp increased pretty much gradually as the lap progressed. I'll just resize that slightly; there we go. And I could add another table in here that shows me the lap time and the speed, and this time I might display it as some sort of column chart, so I can start to see the varying speed as the car goes around the lap.

I could also click on a specific point on the lap here and see that the oil temperature at that point was this value here, 86.81, and if I scroll along, I can see what speed the car was doing at that particular point in the lap as well. So I start to have the ability to go back and retrospectively look at the readings that were recorded as the car went around the lap, and when I'm ready to clear that selection, I just clear it and get all my data back.

So hopefully, in this demonstration you've seen how we can extend a custom application by using the SDKs for HDInsight to build an ETL process that uses HDInsight to process the data, in this case to filter and shape it, before loading it into somewhere like an Azure SQL Database table that we can then use as a basis for analysis, with tools like Excel to analyze that data. Thanks very much for joining us, and I hope you enjoyed the demonstration.

[Microsoft]

Video Details

Duration: 25 minutes and 58 seconds
Country: United States
Language: English
License: All rights reserved
Genre: None
Views: 80
Posted by: duncanma on Jul 2, 2014
