
VideoBigDataScenario2: Data Warehouse On Demand

[Big Data Scenario] [Data Warehouse On Demand] [male narrator] Hi, and welcome to this demonstration of a big data scenario: Data Warehouse on Demand. My name is Graham Malcom, and I'm going to talk you through how we can use HDInsight to provide a data warehouse for big data analysis. [A. Datum Meteorological Research] So for the demonstration, we're going to look at a fictional company called A. Datum who perform meteorological research. What they do is they collect data about tornadoes, including the severity and the costs incurred by damage, and so forth, and this is analyzed on a seasonal basis. So the data is collected and stored as and when the tornadoes arise, but periodically we want to perform analysis of that data when new seasonal data is available. So let's have a look and see how A. Datum can use HDInsight and the Microsoft data technology stack to perform this analysis. Let's start by looking at the Microsoft Azure environment that A. Datum has got. You can see there is an Azure account here, and we have an Azure storage account set up as well. This is where we can store the data that we collect about tornadoes as time goes by, and we've got a blob container here, which is where we're actually going to store the data. We're just starting the process now, so there's no existing data in there, but as time goes by, we'll continue to upload the data and use that as a commodity store. Now when we need to analyze the data that we've got, we will spin up an HDInsight cluster, and you can see I've already got one up and running here now. The point is that I can retain the data in the storage account even when I tear the cluster down. And I can bring the cluster back up again when I need it, so I'm only paying for the cluster when I actually need to use it for analysis, but I'm using the storage all the time as a commodity store for the data that I'm gathering. So we've got our HDInsight cluster up and running. What I'm probably going to want to do is to create some Hive tables that I can use to do some analysis of my data with. So here's the script code I've got to create that data warehouse and its tables. First of all, I'm creating a database. Now you don't have to do this. There's a default database that's created for you when you instantiate an HDInsight cluster, but it's a good idea to create a separate database; it acts as a separate schema, or a separate namespace, for the tables in your data warehouse, and that keeps them distinct from any other tables that you might happen to have. So I'm creating this database, and I'm specifying a location in HDFS (which of course is going to be hosted in my Azure blob storage) where the files for this database should be stored. I'm then creating some tables. I've got 3 tables, and there's a fairly common pattern to data warehouses: we use what's generally called a dimensional model. So I've got a couple of dimension tables here and a fact table that contains the things I want to aggregate, the actual data I'm interested in. For dimensions, I've created a table of states for the US states and a table of date values, which is quite a common thing to do in a data warehouse. It makes it easier to aggregate data by month or year or different time periods. The fact table itself, with the tornadoes information in it, is slightly larger; there's a whole bunch of things that I might want to aggregate in here. 
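
The table-creation script itself (the CreateDWTables.hql file referred to below) isn't reproduced in this transcript, but a minimal sketch of the kind of HiveQL being described might look like the following, written out to a local .hql file from PowerShell to mirror the demo's folder layout. The database name tornadodw, the column names and types, and the /dw/... storage paths are illustrative assumptions rather than the demo's actual definitions.

    # Hypothetical sketch of a table-creation script like the one described above.
    # All names, types, and paths are assumptions; adjust them to your own schema.
    $createDwHql = "
    CREATE DATABASE IF NOT EXISTS tornadodw
    LOCATION '/dw/tornadodw';

    -- Dimension table: US states
    CREATE EXTERNAL TABLE IF NOT EXISTS tornadodw.states
      (statecode STRING, statename STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS SEQUENCEFILE
    LOCATION '/dw/tornadodw/states';

    -- Dimension table: dates
    CREATE EXTERNAL TABLE IF NOT EXISTS tornadodw.dates
      (datecode STRING, caldate TIMESTAMP, dayofmonth INT, monthofyear INT,
       calyear INT, dayofweek INT, dayname STRING, monthname STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS SEQUENCEFILE
    LOCATION '/dw/tornadodw/dates';

    -- Fact table: tornado events, partitioned by year and stored as sequence files
    CREATE EXTERNAL TABLE IF NOT EXISTS tornadodw.tornadoes
      (datecode STRING, statecode STRING, eventtime STRING, category INT,
       propertyloss FLOAT, croploss FLOAT, startlatitude FLOAT, startlongitude FLOAT)
    PARTITIONED BY (year INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS SEQUENCEFILE
    LOCATION '/dw/tornadodw/tornadoes';
    "

    # Save it next to the PowerShell scripts, as the demo's folder layout suggests.
    Set-Content -Path ".\CreateDWTables.hql" -Value $createDwHql
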
I can relate this to the date table based on the date code and to the state table based on the state code. The other interesting thing that's worth noticing about this is that I've partitioned this table. I've got a partition by year, which, under the covers, is going to create a separate folder for each year that I've got data for in the location where this table is stored, and the data files themselves will be stored in the appropriate folders. And as I gather more data over time, that partitioning will improve performance. I'm storing these in the sequence file format. It's a file format that's supported by HDInsight and Hive, it allows things like compression, and it's a splittable format, which makes it a good format to use for performance. Notice that all the tables are external, so that when I drop the tables and I get rid of the cluster, it doesn't affect the data. The data files stay exactly where they are in the Azure storage; it's only the table metadata that's dropped, and then I can recreate that on top of the existing data when I need it again, so the data is retained between the times I'm using the cluster. So that's the script I'm using. Let's take a look at how we go ahead and run that script. I'm using PowerShell to do this. Here's my PowerShell environment, and I've already set this up so that it's linked to my Azure subscription. If I go ahead and run Get-AzureSubscription, you can see that I've imported the certificates, so there's already a connection between my local PowerShell environment and my Azure subscription. And what I'm going to do is I'm just going to bring this down so we can see a script. And I'll clear that out of the way. First of all, all I'm going to do is connect to our cluster and our storage account, and find the folder where this PowerShell script is actually stored, because that's where the CreateDWTables.hql file is, the one that creates the data warehouse and the tables. I'm then going to upload that file, using the Set-AzureStorageBlobContent command to upload it to the Azure store. And then I'm going to run an Azure HDInsight job, or should I say a Hive job, based on that file. There's a New-AzureHDInsightHiveJobDefinition that points to that script file I'm uploading, and we're going to run that job. Under the covers, that will run as a MapReduce job; it will run those HiveQL commands, and that will create the database and the tables. So let's go ahead and run this PowerShell script. The first thing it does is it uploads the script file, and then it's going to go ahead and start an HDInsight job to run that script file and create the database. So we'll just wait for the job to complete. When the job's done, I get the report back that tells me how long each of the statements took that created the database and then created the different tables in the database. So now we've got that database ready to use. The first time that I'm going to use this, I'm going to upload the data. In the future, the data will already be there, and I'll add more data as I get it, so I'll be periodically adding more data files as I collect them. But initially, I'm going to upload some data there. And let's just take a look at what we've got. We've got a number of data files here. The dates table, or rather the dates data file, is basically that code value for the date and then the date in actual date format. 
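
The PowerShell itself isn't reproduced in the transcript, but a rough sketch of the upload-and-run step just described, using the classic (pre-ARM) Azure PowerShell cmdlets named above, might look like this. The cluster name, storage account, container, and blob paths are placeholder assumptions, and parameter details may vary with the module version.

    # Placeholder names; substitute your own cluster, storage account, and container.
    $clusterName    = "adatum-hdinsight"
    $storageAccount = "adatumstorage"
    $container      = "adatum-data"

    # Storage context for the account that backs the cluster
    $storageKey = (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary
    $ctx = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $storageKey

    # Upload the local Hive script to blob storage
    Set-AzureStorageBlobContent -File ".\CreateDWTables.hql" -Container $container `
        -Blob "DW/scripts/CreateDWTables.hql" -Context $ctx -Force

    # Define a Hive job that runs the uploaded script, start it, and wait for it
    $scriptUri = "wasb://$container@$($storageAccount).blob.core.windows.net/DW/scripts/CreateDWTables.hql"
    $jobDef = New-AzureHDInsightHiveJobDefinition -File $scriptUri
    $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
    Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600

    # Show the job's log output (the per-statement timings mentioned in the demo)
    Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardError

The same pattern (define a Hive job, start it, wait, read the output) repeats for the staging and load steps later in the demo.
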
And then I've got the day number of the month, the month number of the year, the year, and the day of the week. And then I've got the strings that represent the day name of the week and the month name. So there's some stuff that's going to get uploaded to that date table. Typically, you're going to load the date table initially with enough values to cover the data, the tornadoes data in this case that we've got, and then we might add more dates to the table as we go forward in the future. The states table is very simple. It's just the state code and the state name. Unless something fairly dramatic happens, I don't think we'll be adding to this; this is all of the states in the US, but I guess what might happen is we might extend our analysis to other parts of the world and upload some additional states into there from other countries. And then the tornadoes data itself: we've got a whole bunch of information about each event. You can see it's a tab-delimited text file that has the date code, then the state code, and then the time that the event started, and a whole bunch of other information about the tornadoes that have occurred over that period. So I'm going to have to upload this data to the location where the tables are created. And to do that, what I want to do is stage the data first. I'll create a staging table that I can upload the data into initially, and then I can read it from that staging table and do any type conversion or any additional functions that I need to do on the data before I load it into my production tables. So to do that, I've got some staging scripts, and the first scripts are the create table scripts. So for example, there's one here to create a staging table for my dates. And it just follows the format of that file that I've got, that source file. It says that it's a tab-delimited text file that I'm loading, and it tells it this is where I'm expecting the data files to be. So there's a staging folder that's going to have a dates folder, and that's where I'll upload this file to. And the same is true of the tornadoes, for example. So we'll have a look at that; again, it just follows the structure of the file. It's a tab-delimited text file, and it's located in this location. Now a couple of things to notice: one is this is obviously a text file, whereas our production tables use the sequence file format, so there's a difference in the format. And also this tornadoes table, the staging table, doesn't have any partitioning; it's just a flat table that contains the tornadoes data. And obviously we're going to have to generate the partitioning information when we move it from staging into the production table. So to do that, we have scripts to do the loading of the staged data into the production tables. Here's the one for the dates, and it's a straightforward statement that takes data from the staged table and inserts it into the dates table, and it just selects the data to insert. There's a similar one for states. For tornadoes, it's slightly different. We have to set this Hive option here: we have to set the partition mode to nonstrict, because we're going to dynamically generate the partition value, and that requires that setting. So once we've set that, we can go ahead and insert into the tornadoes table, partitioned by year. 
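
The staging and load scripts themselves aren't listed in the transcript, but a sketch for the tornadoes table, carrying over the same assumed schema and /dw/... paths as the earlier sketch, might look like this: an ordinary internal table over tab-delimited text for staging, and a load script that enables nonstrict dynamic partitioning before inserting into the partitioned production table.

    # Hypothetical staging-and-load HiveQL for the tornadoes data; table names,
    # columns, and paths are assumptions carried over from the earlier sketch.
    $createStagingTornadoes = "
    DROP TABLE IF EXISTS stagingtornadoes;
    CREATE TABLE stagingtornadoes
      (datecode STRING, statecode STRING, eventtime STRING, category INT,
       propertyloss FLOAT, croploss FLOAT, startlatitude FLOAT, startlongitude FLOAT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/dw/staging/tornadoes';
    "

    $loadTornadoes = "
    -- dynamic partitioning is usually enabled by default; set explicitly for safety
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT INTO TABLE tornadodw.tornadoes PARTITION (year)
    SELECT datecode, statecode, eventtime, category, propertyloss, croploss,
           startlatitude, startlongitude,
           CAST(SUBSTR(datecode, 1, 4) AS INT) AS year
    FROM stagingtornadoes;
    "

    # Save them locally so they can be uploaded with the other staging scripts.
    Set-Content -Path ".\CreateStagingTornadoes.hql" -Value $createStagingTornadoes
    Set-Content -Path ".\LoadTornadoes.hql" -Value $loadTornadoes
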
And again, we're taking data from the staging table, but for that last column here, we're actually generating the year by taking the first 4 characters of the date code, which is the year. So we're loading from a flat table into a partitioned table, and we're using this function to generate the partitioning value. So the first thing we need to do is upload those staging scripts. We're going to use these each time we need to load data into those tables. We're going to stage it and then load it, so it's a 2-phase process each time we want to upload data. So again, I've got a PowerShell script to do this. We'll just clear the pane at the bottom; it's fairly straightforward. Again, we're just finding the location of this script file, and then we're uploading all of the staging script files to this location in here, DW/staging/scripts. And it's just going to loop through that local folder I've got here with the script files in them and upload them all to the Azure store. So let's go ahead and run that. And it just starts looping through those local staging scripts and uploads each of them to that location in the Azure store. When it's finished, all of those files have been uploaded. Okay, well we're now ready to actually do some loading of our data, and you've seen the data source files and you've seen the staging scripts we're going to use. And I've got a separate PowerShell script for each of those types of data: one for dates, one for states, one for tornadoes. We could, of course, do it all in one file, but separating them out makes it easier to manage incremental loads of tornadoes when I don't need to add any states or add any dates or anything like that. I can do them all independently. So this script will, again, run the existing Hive scripts that I've just uploaded, the staging scripts. It runs the Hive script to create the staging table that I'm going to load the data into. Then it goes ahead and uploads the local data file, in this case the dates.txt file, and it puts it into the location where I've defined that staging table. So that, in effect, populates the staging table. And then I can run the other Hive script that I loaded among the staging scripts to load the staged data into the production table. To do that, it selects against the staging table and loads that into the production table. So let's go ahead and run this one. And again, it's going to kick off an Azure HDInsight Hive job to go ahead and create the staging table. When the Hive job to create the staging table is complete, we're going to upload the data file, and then we're going to run another Hive job, this time to load the staged data into the production table. And when the job is finished, again, I get the output to tell me how long each of the different parts of the jobs took. And the data has now been uploaded to the staging table and then loaded into the production table. So let's go ahead and do exactly the same with the states data, again just running the appropriate scripts to create the staging table and then to load the staged data into the production table. And again, that job completes and loads the staged data into the production table. And finally we'll do the same with our tornadoes data. So when the job to load the tornadoes data is complete, we can actually see the results of the different partitions that were generated from the data, so we've got a partition for each year that's been loaded. 
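
The per-table PowerShell load scripts aren't shown either; a sketch of the three-step workflow just described, for the tornadoes data and assuming the staging scripts have already been uploaded, might look like this. The names and paths are the same placeholder assumptions used in the earlier sketches.

    # Placeholders carried over from the earlier sketches.
    $clusterName    = "adatum-hdinsight"
    $storageAccount = "adatumstorage"
    $container      = "adatum-data"
    $ctx = New-AzureStorageContext -StorageAccountName $storageAccount `
        -StorageAccountKey (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary
    $scriptBase = "wasb://$container@$($storageAccount).blob.core.windows.net/DW/staging/scripts"

    # 1. Run the Hive script that (re)creates the staging table.
    $jobDef = New-AzureHDInsightHiveJobDefinition -File "$scriptBase/CreateStagingTornadoes.hql"
    $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
    Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600

    # 2. Upload the local data file into the staging table's folder, which in
    #    effect populates the staging table.
    Set-AzureStorageBlobContent -File ".\tornadoes.txt" -Container $container `
        -Blob "dw/staging/tornadoes/tornadoes.txt" -Context $ctx -Force

    # 3. Run the Hive script that inserts the staged rows into the production table.
    $jobDef = New-AzureHDInsightHiveJobDefinition -File "$scriptBase/LoadTornadoes.hql"
    $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
    Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
    Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardError
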
And what we've now got is a repeatable process using these scripts where we can take new data and load it into the staging table. The staging tables are not defined as external. They're internal tables, so when we drop them, we delete any previously staged data, upload the new data, and then we insert that into the production table from the staging table. And the production table is defined as an external table, so when it's dropped or the cluster is deleted, the data that's already there stays where it is. So we're able to continually add to that data and just bring up the cluster as and when we need it. So we've got a repeatable process for creating a data warehouse on demand, if you like, and incrementally adding data over time. And just to see how we might use that data warehouse, well, one of the things obviously is that we might want to just run queries against the tables and get some information back. So I've got here a PowerShell script that's just going to go and select the states, the month number of the year, and the total property loss and crop loss costs from those tables and group it, although in this case it's only for the year 2010. So actually it's going to group it by month and display for each state and each month what the total property cost and crop cost was. So let's go ahead and run that script. And when the results are returned, for each different state and each month in 2010, I can see the total property cost and crop cost associated with that. So it's working. I've got a data warehouse that has tables, and those tables have data, and I can query them. How I would actually use that in a production environment is I'd probably have some sort of modeling and client visualization tool. So we're gonna use Excel to see how we might use this. We'll just start up Excel and create a new workbook. And one of the things I can do with Excel is I can use Power Pivot to create a data model based on a data source such as a data warehouse. Now in this case I'm gonna connect to our Hive data warehouse. And to use that I'm gonna use ODBC. So actually in my Control Panel, if I have a look at my Administrative Tools, I have an ODBC data source name (DSN) because I've installed the Microsoft Hive driver for ODBC. So if we actually have a look at this DSN, you can see that it's connecting to my HDInsight cluster. And I just need to supply it with the username and password, and off it will go and make a connection to the Hive tables there. So this ODBC driver has been installed. I've got both the 64-bit and the 32-bit versions installed in here. And I can use these from any ODBC-compliant client such as Excel. So I'm gonna use that data source name in my Power Pivot data model. So we'll go ahead and create a new Power Pivot data model. And we're gonna grab some data from this data source. It's an OLE DB/ODBC data source. And we'll just call this HiveOnHDInsight. And I'm gonna go and build a connection string. The provider I want to use is the OLE DB provider for ODBC. And the data source name is the one we've just looked at, that Hive ODBC data source name. I need to specify the credentials for my HDInsight cluster. So I'll just put those in there. And we'll save those in the workbook. I don't need to specify the catalog to connect to. It's just gonna connect to the default Hive database catalog. So we'll click OK there, and then we're ready to carry on with the wizard. 
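
The summary query inside the PowerShell script mentioned above isn't shown line by line; a sketch of an equivalent query over the assumed schema from the earlier sketches, submitted as a Hive job, might look like this. Filtering on the year partition column keeps the query to the 2010 partition; the real demo's column names may differ.

    # Placeholder cluster name carried over from the earlier sketches.
    $clusterName = "adatum-hdinsight"

    # Total property and crop loss per state and month for 2010 (assumed schema).
    $summaryHql = "
    SELECT s.statename, d.monthofyear,
           SUM(t.propertyloss) AS totalpropertyloss,
           SUM(t.croploss) AS totalcroploss
    FROM tornadodw.tornadoes t
    JOIN tornadodw.dates d ON t.datecode = d.datecode
    JOIN tornadodw.states s ON t.statecode = s.statecode
    WHERE t.year = 2010
    GROUP BY s.statename, d.monthofyear
    ORDER BY s.statename, d.monthofyear;
    "

    $jobDef = New-AzureHDInsightHiveJobDefinition -Query $summaryHql
    $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
    Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600

    # The query's result rows come back on the job's standard output.
    Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput

The Power Pivot connection that follows needs no extra code: it simply uses the Hive ODBC DSN through the OLE DB provider for ODBC, as described.
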
So I get the choice to either select individual tables or write a HiveQL query that brings in the data that I want. In this case I'm just gonna import the 3 tables from the data warehouse. So we'll go and select those. So it lists the tables that are available, and we're gonna have our dates, states, and tornadoes tables. And I could do further filtering in here. I could filter out rows I'm not interested in, and I could give them a slightly more friendly name. So I'll capitalize these names just to make them a bit nicer to read. And then I can go ahead and click Finish at this point. And Power Pivot is gonna go off and connect to my HDInsight server and start importing the data from those tables. So we'll just wait for the data to be imported. And when the data has been imported, I can see the different tables that have now been added to my data model. And I might want to make a few changes to the structure of the data model. So some simple things, like I might want to just sort the data in the tables. I might want to do things like change data types, so change this to a date and change the format so that it's just literally the date that I care about. And I can do things like say, for the weekday, if I'm gonna sort data by weekday, then actually what that means is don't sort it alphabetically by the name of the day of the week; sort it by the day of the week, which is this column here. And similarly with the month: don't sort it alphabetically by month if I choose to sort it that way; sort it by the month number of the year. So little bits of detail, changes that I might want to make to the data model. Nothing really I want to change with the states. With the tornadoes, again I might just want to sort that into the right order, just to try and make sure the performance is okay. And I might want to change the event time again to a date. But this time I'll leave the time in there, because that's something I care about. And I might want to take my property loss and crop loss columns and format those as currency so that we're actually getting them as values. So I've made some changes to the data model based on the data that I've brought in. I might also have a look at the diagram view of this data model and see these 3 tables that I've got. And I might just want to create some relationships between them. So for example I've got my date code that I can join from tornadoes to dates with. And the same with the state code to the states table. So I can create those relationships and define the structure of my data model. And I might want to go into my dates table here, just take a closer look at that. And I might want to define some sort of hierarchy here. So we'll define a hierarchy called Date. And we'll add the year and then the month and then the day of the month to that hierarchy. And that lets me drill up and down when I'm analyzing the data and aggregate it at different levels. So I'm able to use Power Pivot to refine the data model that I've got from my Hive tables in my Hive data warehouse. And what I can then do with that, if I want to analyze the data here, I can go ahead and create a Pivot table, for example. So we'll create a Pivot table. We'll just put it in the existing worksheet. And here are the fields from those tables that I've got in my data model. I might want to show my date hierarchy here. And I might want to show the state name across the columns like so. And then from the tornadoes table I might be interested in the cost of the property loss. 
So I might bring that in like so. And in that comes, and I can see the different costs associated with property across the states and down by the years. And if I expand one of the years, because I defined that hierarchy, I can actually then drill into the month and then into the individual day of the month and look at the costs that were incurred based on tornadoes that occurred then. So I'm able to do some fairly detailed analysis of my data, having got it from my Hive tables and refined it in a data model in Power Pivot. One of the other things I might want to do is visualize this data. And because it's largely geographical (it's based on events that have occurred in different states in the US across different time periods), one of the tools I might want to use to analyze this is Power Map. So if I have a look at Power Map, this is a component that comes with the Power BI service for Office 365. So I can create what's called a tour using Power Map. We'll give the tour a name. We'll just call it... Tornadoes over Time. And I can first of all specify a geographical field that I want to use to overlay data onto this globe here. I've got a globe of the world here, and I want to overlay some data onto there. So let's choose the state name initially as our geographical field. And it recognizes the state name. It's mapped that to the state/province level. There's a bunch of different levels I can map data to, but in this case it's the state. It's identified the different states in the US. So we've got a geographical field. And on the next page I can choose the data I'm actually interested in analyzing here, the analytical data that I want to display. And for that I want to have a look at the tornadoes. I want to look at that property cost and crop cost. So there's the property loss, and I can add that. You can see that it's added; if I drill in here, you can see it's added a column for each of those. And I might also want to add the crop loss in there as well. And again it's added that. In cases where there is crop loss, if we drill in a little bit, it's actually stacked in the column. So there's comparatively little crop loss compared to the property loss there. So we're not really seeing that very much. But I might, instead of displaying it as a single column, create a clustered column. So we're seeing a separate column for each of the different types of loss, and I can now see the orange crop loss compared to the blue property loss. Now it's created a layer called Layer 1. This is not a particularly useful name for me. So I'm gonna change that to Costs. And I can resize this legend here that's gonna be used. And I can go ahead and I can change those colors as well. So I've got blue and orange here. I might want to just have a look at the settings for there and say for property loss we'll leave that as blue, and for crop loss let's change that to purple, just so we've got 2 different colors there. And it updates that immediately in here. Okay, so far so good. I might also want to see how this has changed over time. So I'm gonna come down to the time box here, and I'm gonna, in my tornadoes table, select the event time. And that then fills in that time box. And so this player has appeared here. So I could play the events over time and see how the costs have grown over time. And I can choose to increment in days, months, quarters, or years. I'm gonna increment by month so that each month we see the next value. And I'm gonna configure it so that the data accumulates over time. 
So we'll see the cumulative costs as time has elapsed in these different areas. And again, if I just zoom out, we can see that there. And I can actually drag the slider, and you can see the values change as we move over time. You can see that it's added this little legend here to tell me the current time we're looking at. I'm gonna edit that and change the format a little bit so that instead of showing the date and time, I'm only really interested in the month and year. So we'll go ahead and change that. And we'll put that out of the way down at the bottom here, and move this legend up at the top. So we've now got the property loss and the crop loss shown as a clustered column chart in each state, and we're able to view that over time, which is pretty good. But I want to overlay on top of that the severity, the categories of the different tornadoes we've seen. So I'm gonna add a layer. And again the geographical field that I want to display needs to be chosen. So for the tornadoes, I'm gonna choose the start latitude and longitude. And it automatically mapped those to latitude and longitude. So it's identified where those actually happened. And then for each of those I'm gonna display the severity, the category of the tornado, up here. So I'll add that category in there. It's again gone for this idea of displaying the sum as a column chart. I don't really want it as a column chart. I want it actually as a heat map, so I can see the different severities of the tornadoes based on a gradient scale from cold to hot, like a heat map. And again I want to see that over time. So I'm gonna choose the event time once again as my time field. Again I'm gonna increment by month. So we'll just modify that to increment by month. And this time, instead of the data accumulating over time, I actually want it to stay in place until it's replaced, so it shows me the most recent category that's there as time elapses. And I can see how that affects the US as we view the data that's displayed. So I've now got 2 layers on my map. This Layer 2 I'm just gonna rename to Category. And I can see the tornado category displayed there. Both of the layers are gonna be incrementing over time. I'm just gonna change a few settings to my overall scene here. I'm gonna change the duration so that the whole scene lasts 20 seconds. And I am gonna apply an effect. So we're gonna use the Push In effect so that we animate the globe as we view the data and push in and see more detail. And I could change the effect speed so that we push in fairly quickly. And we perhaps increment rather slowly. And I'm now ready to view my tour to see this data. So let's go ahead and play the tour. And Power Map animates the data over time. So I can see the most recent severity of tornado that we've had in the different areas, the longitude and latitude locations. And I can see the crop and property loss costs in each of the states. And we zoom in there as time goes by and we work our way through all the different data that we've got. And eventually it will bring me up to date. So we get closer and closer. And when we finish zooming in, I can scroll around the map. I can use the mouse to move around, or if this was a touchscreen I could just drag with my finger. And I can hover over the different values there, and I can actually see the property loss, and the crop loss if I hover over that as well, for each of the different states. And I can use, again, my mouse or my finger to scroll around and see all the different values that are in there. 
So what you've seen in this demonstration is the ability to use HDInsight to create a data warehouse on top of data that I'm storing in Azure Storage as a commodity store. I'm using a staging technique so that I can incrementally add more data to that storage, but I only have to create the data warehouse when I need it. I can tear it down when I don't without deleting the data. So the data stays where it is. I then created the data model in Excel, and of course I can refresh the tables in here. If I have a look at my tables in Power Pivot, I can refresh these tables. So 3 months from now, if I've added more data to the data store and I've recreated the HDInsight cluster, I could refresh the data. It will update the data model and update all the analysis I've been doing. So that's how we use HDInsight as a commodity data store and a data warehouse on demand in a big data scenario. Hope you found that useful. [Microsoft]

Video Details

Duration: 33 minutes and 31 seconds
Country: United States
Language: English
License: All rights reserved
Genre: None
Views: 88
Posted by: duncanma on Jul 2, 2014

