
VideoBigDataScenario2: Data Warehouse On Demand

[Big Data Scenario, Data Warehouse On Demand] [male speaker] Hi and welcome to this demonstration of a big data scenario for Data Warehouse On Demand. My name is Graeme Malcolm, and I'm going to talk you through how we can use HDInsight to provide a data warehouse for big data analysis. [A. Datum Meteorological Research, Seasonal Analysis of tornadoes, Data collected and stored, Periodic analysis when new seasonal data is available] So for the demonstration, we're going to look at a fictional company called A. Datum, who perform meteorological research. What they do is they collect data about tornadoes, including the severity and the costs incurred by damage and so forth, and this is analyzed on a seasonal basis. So the data is collected and stored as and when the tornadoes occur, but periodically we want to perform analysis of that data when new seasonal data is available. So let's have a look and see how A. Datum can use HDInsight and the Microsoft data technology stack to perform this analysis. Let's start by looking at the Microsoft Azure environment that A. Datum have got. You can see there's an Azure account here, and we have an Azure storage account set up as well. This is where we can store the data that we collect about tornadoes as time goes by, and we've got a blob container here which is where we're actually going to store the data. We're just starting the process now, so there's no existing data in there, but as time goes by we'll continue to upload data and use that as a commodity store. Now when we need to analyze the data that we've got, we will spin up an HDInsight cluster. And you can see I've already got one up and running here. The point is that I can retain the data in the storage account even when I tear the cluster down, and I can bring the cluster back up again when I need it. So I'm only paying for the cluster when I actually need to use it for analysis, but I'm using the storage all the time as a commodity store for the data that I'm gathering. So we've got our HDInsight cluster up and running. What I'm probably going to want to do is create some Hive tables that I can use to do some analysis of my data. So here's the script code I've got to create that data warehouse and its tables. First of all, I'm creating a database. Now you don't have to do this. There's a default database that's created for you when you provision an HDInsight cluster, but it's a good idea to create a separate database. It acts as a separate schema, a separate namespace for the tables in your data warehouse, and that keeps them distinct from any other tables you might happen to have. So I'm creating this database, and I'm specifying a location in HDFS, which of course is going to be hosted in my Azure blob store, where the files for this database should be stored. I'm then creating some tables. I've got three tables, and there's a fairly common pattern to data warehouses: we use what's generally called a dimensional model. So I've got a couple of dimension tables here and a fact table that contains the things I want to aggregate, the actual data I'm interested in. For the dimensions, I've created a table of states for the US states and a table of date values, which is quite a common thing to do in a data warehouse because it makes it easier to aggregate data by month or year or other time periods. The fact table itself, with the tornadoes information in it, is slightly larger.
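As a rough illustration of the part of the script just described (this is a hedged sketch rather than the demo's actual createdwtables.hql; the database name, folder paths, and column names are all assumptions), the database and the two dimension tables might be declared in HiveQL something like this, held in a PowerShell here-string since the demo drives all of its Hive work from PowerShell:

```powershell
# Hypothetical sketch of the database and dimension tables described above.
# Database, path, and column names are illustrative only.
$createDimensionsHql = @'
CREATE DATABASE IF NOT EXISTS tornadodw
LOCATION '/dw/tornadodw';

-- Dimension table of US states: just a code and a name.
CREATE EXTERNAL TABLE tornadodw.states
  (statecode STRING, statename STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/dw/tornadodw/states';

-- Date dimension: one row per calendar date, which makes it easy to
-- aggregate the facts by day, month, or year.
CREATE EXTERNAL TABLE tornadodw.dates
  (datecode STRING, calendardate STRING,
   dayofmonth INT, monthoftheyear INT, yearnumber INT,
   dayofweek INT, dayname STRING, monthname STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/dw/tornadodw/dates';
'@
```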
There's a whole bunch of things in the fact table that I might want to aggregate, and I can relate it to the date table based on the date code and to the state table based on the state code. The other interesting thing that's worth noticing is that I've partitioned this table. I've got a partition by year, which, under the covers where this table is stored, is going to create a separate folder for each year that I've got data for, and the data files themselves will be stored in the appropriate folders. As I gather more data over time, that will improve performance as we query the table. I'm storing these in the sequence file format. It's a format that's supported by HDInsight and Hive, it allows things like compression, and it's a splittable format, which makes it a good format to use for performance. Notice that all the tables are external, so that when I drop the tables and get rid of the cluster, it doesn't affect the data. The data files stay exactly where they are in the Azure storage. It's only the table metadata that's dropped, and I can recreate that on top of the existing data when I need it again. So the data is retained between the times I'm using the cluster. So that's the script I'm using. Let's take a look at how we go ahead and run that script, and I'm using PowerShell to do this. Here's my PowerShell environment. I've already set this up so that it's linked to my Azure subscription. If I go ahead and run Get-AzureSubscription, you can see that I've imported a certificate, so there's already a connection between my local PowerShell environment and my Azure subscription. And what I'm going to do is just bring this down so we can see the script, and I'll clear that out of the way. First of all, all I'm going to do is connect to our cluster and our storage account and find the folder where this PowerShell script is actually stored, because that's also where the createdwtables.hql file that creates the data warehouse and the tables is stored. I'm then going to upload that file, using the Set-AzureStorageBlobContent command to upload it to the Azure store. And then I'm going to run an Azure HDInsight job based on that file. This is a Hive job. There's a New-AzureHDInsightHiveJobDefinition that points to that script file that I'm uploading, and we're going to run that job. Under the covers, that will perform a MapReduce job. It will run those HiveQL commands, and that will create the database and the tables. So let's go ahead and run this PowerShell script. The first thing it does is upload the script file, and then it's going to go ahead and start an HDInsight job to run that script file and create the database. So we'll just wait for the job to complete. When the job is done, I get the report back that tells me how long each of the statements took, so it created the database and then created the different tables in the database. And so now I've got that database ready to use. The first time that I'm going to use this, I'm going to upload the data. In the future the data will already be there, and I'll add more data as I get it. So I'll be periodically adding more data files as I collect them. But initially I'm going to upload some data. Let's just take a look at what we've got. We've got a number of data files here. The dates data file is basically just that date code value for the date, then the actual date in date format.
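Continuing that sketch, the partitioned fact table and the job submission might look roughly like the following, using the classic Azure PowerShell HDInsight cmdlets the demo relies on (the cluster name and column names are placeholders; the demo itself uploads the .hql file with Set-AzureStorageBlobContent and points the job definition at it with -File, whereas this sketch passes the HiveQL text directly with -Query):

```powershell
# Hedged sketch: partitioned external fact table, run as an HDInsight Hive job.
$createFactHql = @'
CREATE EXTERNAL TABLE tornadodw.tornadoes
  (datecode STRING, statecode STRING, eventtime STRING,
   category INT, injuries INT, fatalities INT,
   propertyloss DOUBLE, croploss DOUBLE)
PARTITIONED BY (year INT)            -- one folder per year under the table location
STORED AS SEQUENCEFILE               -- compressible, splittable storage format
LOCATION '/dw/tornadodw/tornadoes';  -- EXTERNAL: dropping the table leaves the files in blob storage
'@

$clusterName = "adatum-hdi"   # placeholder HDInsight cluster name

$jobDef = New-AzureHDInsightHiveJobDefinition -Query $createFactHql
$job    = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600 | Out-Null

# Hive writes its progress and statement timings to standard error.
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardError
```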
And then I've got the day number of the month, the month number of the year, the year, and the day of the week, and then the strings that represent the day name of the week and the month name. So that's the data that's going to get uploaded to that dates table. Typically you're going to load the date table initially with enough values to cover the data—the tornadoes data in this case—that we've got, and then we might add more dates to the table as we go forward in the future. The states table is very simple. It's just the state code and the state name, and unless something really dramatic happens, I don't think we'll be adding to this. This is all of the states in the US, but I guess we might extend our analysis to other parts of the world and upload some additional states from other countries. And in the tornadoes data itself, we've got a whole bunch of information. You can see it's a tab-delimited text file. It has the date code and the state code and then the time that the event started and a whole bunch of other information about the tornadoes that have occurred over that period. So I'm going to have to upload this data to the location where the tables are created. And to do that, what I want to do is stage the data first. I'll create a staging table that I can upload the data into initially, and then I can read it from that staging table and do any type conversion or any additional functions that I need to do on the data before I load it into my production tables. So to do that I've got some staging scripts. The first scripts are the create table scripts. For example, there's one here to create a staging table for my dates, and it just follows the format of that source file that I've got. It says that it's a tab-delimited text file that I'm loading, and it tells Hive where I'm expecting the data files to be. So there's a staging folder that's going to have a dates folder, and that's where I'll upload this file to. And the same is true of the tornadoes, for example. So let's have a look at that. Again, it follows the structure of the file. It's a tab-delimited text file, and it's located in this location. Now there are a couple of things to notice. One is that this is obviously a text file, whereas our production tables use the sequence file format, so there's a difference in the format. And also this tornadoes staging table doesn't have any partitioning. It's just a flat table that contains the tornadoes data, so obviously we're going to have to generate the partitioning information when we move it from staging into the production table. To do that we have scripts to do the loading of the staged data into the production tables. Here's the one for the dates. It's a straightforward statement that takes data from the staging table and inserts it into the dates table, and it just selects the data to insert. There's a similar one for states. For tornadoes it's slightly different. We have to set a Hive option here: we have to set the partition mode to non-strict, because we're going to dynamically generate the partition value, and that requires that setting.
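As a hedged sketch of what one of those staging scripts might contain (again with assumed names and paths rather than the demo's actual files; the non-strict partition setting itself shows up in the load sketch after the next paragraph), the tornadoes staging table could look like this:

```powershell
# Sketch of a staging table: managed (internal, so dropping it clears previously
# staged data), plain tab-delimited text, and no partitioning.
$createStagingTornadoesHql = @'
CREATE TABLE tornadodw.stagingtornadoes
  (datecode STRING, statecode STRING, eventtime STRING,
   category INT, injuries INT, fatalities INT,
   propertyloss DOUBLE, croploss DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'   -- matches the tab-delimited source file
STORED AS TEXTFILE
LOCATION '/dw/staging/tornadoes';                -- the raw data file gets uploaded here
'@
```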
So with that option set, we can go ahead and insert into the tornadoes table, partitioned by year. Again we're taking data from the staging table, but for that last column we're actually generating the year by taking the first four characters of the date code, which is the year. So we're loading from a flat table into a partitioned table, and we're using this function to generate the partitioning value. The first thing we need to do is upload those staging scripts. We're going to use these each time we need to load data into those tables. We're going to stage it and then load it, so it's a two-phase process each time we want to upload data. So again I've got a PowerShell script to do this. We'll just clear the pane at the bottom. It's fairly straightforward. Again we're just finding the location of this script file, and then we're uploading all of the staging script files to this location here, dw/staging/scripts. It's just going to loop through that local folder that I've got here with those script files in it and upload them all to the Azure store. So let's go ahead and run that. It just starts looping through those local staging scripts and uploads each of them to that location in the Azure store. And when it's finished, all of those files have been uploaded. Okay, we're now ready to actually do some loading of our data. You've seen the data source files and you've seen the staging scripts we're going to use, and I've got a separate PowerShell script for each of those types of data—one for dates, one for states, one for tornadoes. We could, of course, do that all in one file, but separating them out makes it easier to manage incremental loads of tornadoes when I don't need to add any states or dates or anything like that. I can do them all independently. So this script will run the Hive scripts I've just uploaded, those staging scripts. It runs a Hive script to create the staging table that I'm going to load the data into. Then it goes ahead and uploads the local data file, in this case the dates.txt file, and puts it into the location where I've defined that staging table. So that in effect populates the staging table. And then I can run the other Hive script that I uploaded with the staging scripts to load the staged data into the production table. To do that, it selects against the staging table and loads that into the production table. So let's go ahead and run this one. Again, it's going to kick off an Azure HDInsight Hive job to go ahead and create the staging table. When the Hive job to create the staging table is complete, we're going to upload the data file, and then we're going to run another Hive job, this time to load the staged data into the production table. When the job's finished, again I get the output to tell me how long each of the different jobs and different parts of the job took. The data has now been uploaded to the staging table and then loaded into the production table. So let's go ahead and do the same with the states data. Again we're just running the appropriate scripts to create the staging table and then to load the staged data into the production table. And again, that job completes and loads the staged data into the production table. And finally we'll do the same with our tornadoes data. When the job to load the tornadoes data is complete, we can actually see in the results the different partitions that were generated from the data.
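Putting those pieces together, one staged load for the tornadoes data might look roughly like this; the storage account, container, key, and file names are placeholders for illustration, and the container is assumed to be the cluster's default one:

```powershell
# Hedged sketch of a staged load: upload the raw file into the staging table's
# folder, then run a dynamic-partition insert into the production table.
$storageKey = "<storage account key>"   # placeholder
$ctx = New-AzureStorageContext -StorageAccountName "adatumstore" -StorageAccountKey $storageKey

# Upload the tab-delimited source file into the staging table's location.
Set-AzureStorageBlobContent -File ".\tornadoes.txt" -Container "data" `
    -Blob "dw/staging/tornadoes/tornadoes.txt" -Context $ctx

$loadTornadoesHql = @'
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Move staged rows into the partitioned production table, deriving the
-- partition value (year) from the first four characters of the date code.
INSERT INTO TABLE tornadodw.tornadoes PARTITION (year)
SELECT datecode, statecode, eventtime, category, injuries, fatalities,
       propertyloss, croploss,
       CAST(SUBSTR(datecode, 1, 4) AS INT) AS year
FROM tornadodw.stagingtornadoes;
'@

$jobDef = New-AzureHDInsightHiveJobDefinition -Query $loadTornadoesHql
$job    = Start-AzureHDInsightJob -Cluster "adatum-hdi" -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600 | Out-Null
```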
So we've got a partition for each year that's been loaded. What we've now got is a repeatable process using these scripts where we can take new data and load it into the staging table. The staging tables are not defined as external; they're internal tables, so when we drop them, we delete any previously staged data, upload the new data, and then insert that into the production table from the staging table. The production table is defined as an external table, so when it's dropped or the cluster is deleted, the data that's already there stays where it is. We're able to continually add to that data and just bring up the cluster as and when we need it. So we've got a repeatable process for creating a data warehouse on demand, if you like, and incrementally adding data over time. And just to see how we might use that data warehouse, well, one of the things obviously is that we might just want to run queries against the tables and get some information back. So I've got a PowerShell script here that's just going to go and select the state, the month number of the year, and the total property loss and crop loss costs from those tables. I'm filtering this so it's just for the year 2010, so it's actually going to group by month and display, for each state and each month, what the total property cost and crop cost was. So let's go ahead and run that script. And when the results are returned, I can see, for each different state and each month in 2010, the total property costs and crop costs associated with that. So it's working. I've got a data warehouse that has tables, and those tables have data and I can query them. How I would actually use that in a production environment is I'd probably have some sort of modeling and client visualization tool, so we're going to use Excel to see how we might use this. We'll just start up Excel and create a new workbook. One of the things I can do with Excel is use PowerPivot to create a data model based on a data source such as a data warehouse. In this case I'm going to connect to our Hive data warehouse, and to do that I'm going to use ODBC. So in my control panel, if I have a look at my administrative tools, I have an ODBC data source name, because I've installed the Hive driver—the Microsoft Hive driver—for ODBC. So if we have a look at this DSN, you can see that it's connecting to my HDInsight cluster. I just need to supply it with the user name and password, and off it will go and make a connection to the Hive tables there. So this ODBC driver has been installed. I've got both the 64-bit and the 32-bit versions installed here, and I can use these from any ODBC-compliant client such as Excel. So I'm going to use that data source name in my PowerPivot data model. We'll go ahead and create a new PowerPivot data model, and we're going to grab some data from this data source. It's an OLE DB/ODBC data source, and we'll just call this HiveOnHDInsights. And I'm going to build up the connection string. The provider I want to use is the OLE DB provider for ODBC, and the data source name is the one we've just looked at, that Hive ODBC data source name. I need to specify the credentials for my HDInsight cluster, so I'll just put those in there, and we'll save those in the workbook. I don't need to specify the catalog to connect to; it's just going to connect to the default Hive database catalog.
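For context, the seasonal query that the PowerShell script above runs, and which could equally be written into the query option of the PowerPivot import wizard described next, might look roughly like this in HiveQL (table and column names follow the hypothetical schema sketched earlier, so this is an illustration rather than the demo's actual script):

```powershell
# Hedged sketch of the seasonal aggregation query, run as a Hive job.
$seasonalQueryHql = @'
SELECT s.statename, d.monthoftheyear,
       SUM(t.propertyloss) AS totalpropertyloss,
       SUM(t.croploss)     AS totalcroploss
FROM tornadodw.tornadoes t
JOIN tornadodw.dates  d ON t.datecode  = d.datecode
JOIN tornadodw.states s ON t.statecode = s.statecode
WHERE t.year = 2010
GROUP BY s.statename, d.monthoftheyear;
'@

$jobDef = New-AzureHDInsightHiveJobDefinition -Query $seasonalQueryHql
$job    = Start-AzureHDInsightJob -Cluster "adatum-hdi" -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600 | Out-Null
Get-AzureHDInsightJobOutput -Cluster "adatum-hdi" -JobId $job.JobId -StandardOutput
```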
So we'll click OK there, and then we're ready to carry on with the wizard. I get the choice to either select individual tables or write a HiveQL query that brings back the data that I want. In this case I'm just going to import the three tables from the data warehouse, so we'll go ahead and select those. It lists the tables that are available, and we're going to have our dates, states, and tornadoes tables. I could do further filtering in here. I could filter out rows I'm not interested in, and I could give the tables slightly more friendly names, so I'm going to capitalize these names just to make them a bit nicer to read. Then I can go ahead and click finish at this point, and PowerPivot is going to go off and connect to my HDInsight cluster and start importing the data from those tables. So we'll just wait for the data to be imported. When the data has been imported, I can see the different tables that have now been added to my data model. I might want to make a few changes to the structure of the data model, some simple things, like sorting the data in the tables. I might want to change data types, so I'll change this column to a date and change the format so that it's just literally the date I care about. And I can do things like say for the weekday—if I'm going to sort data by weekday, what that actually means is don't sort it alphabetically by the name of the day of the week; sort it by the day-of-week number, which is this column here. And similarly with the month: don't sort it alphabetically by month name if I choose to sort it that way; sort it by the month number of the year. So there are little changes I might want to make to the data model. There's nothing really I want to change with the states. With the tornadoes, again I might want to just sort that into the right order to try and make sure the performance is okay. I might want to change the event time to a date as well, but this time I'll leave the time in there because that's something I care about. And I might want to take my property loss and crop loss columns and format those as currency so that we're actually getting them as monetary values. So I've made some changes to the data model based on the data that I brought in. I might also have a look at the diagram view of this data model and see these tables that I've got, and I might want to create some relationships between them. So for example, I've got my date code that I can join from tornadoes to dates with, and the same with the state code to the states table. So I can create those relationships and define the structure of my data model. Then I might want to go into my dates table here and take a closer look at that. I might want to define some sort of hierarchy here, so we'll define a hierarchy called Date, and what we'll do is add the year and then the month and then the day of the month to that hierarchy. That lets me drill up and down when I'm analyzing the data and aggregate it at different levels. So I'm able to use PowerPivot to refine the data model that I've got from my Hive tables and my Hive data warehouse. And what I can then do with that, if I want to analyze the data, is go ahead and create a PivotTable, for example. So we'll create a PivotTable and just put it in the existing worksheet. And here are the fields from those tables that I've got in my data model. What I might want to do is show my date hierarchy here.
And I might want to show the state name across the columns, like so. And then from the tornadoes table, I might be interested in the cost of the property loss, so I might bring that in like so. And in it comes, and I can see the different costs associated with property across the states and down by the years. If I expand one of the years, because I defined that hierarchy I can actually drill into the month and then into the individual day of the month and look at the costs that were incurred based on tornadoes that occurred then. So I'm able to do some fairly detailed analysis of my data, having got it from my Hive tables and refined it in a data model in PowerPivot. One of the other things I might want to do is visualize this data. And because it's largely geographical, based on events that have occurred in different states in the US across different time periods, one of the tools that I might want to use to analyze this is PowerMap. So if I have a look at PowerMap, this is a component that comes with the Power BI service for Office 365. I can create what's called a tour using PowerMap. So we'll give the tour a name; we'll just call it Tornadoes Over Time. And what I can do is first of all specify a geographical field that I want to use to overlay data onto this globe here. I've got a globe of the world here, and I want to overlay some data onto it. So let's choose the state name initially as our geographical field. And it recognizes the state name; it's mapped that to the state/province level. There's a bunch of different levels that I can map data to, and in this case it's the state. It's identified the different states in the US, so we've got a geographical field. On the next page I can choose, what's the data I'm actually interested in analyzing here? What's the analytical data that I want to display? And for that, what I really want to do is have a look at the tornadoes. I want to look at that property cost and crop cost. So there's the property loss, and I can add that. And you can see that it's been added if I drill in here. You can see it's added a column for each of those. I might also want to add the crop loss in there as well, and again it's added that. In cases where there is crop loss, if we drill in a little bit, it's actually stacked in the columns. There's comparatively little crop loss compared to the property loss there, so we're not really seeing that very much. But what I might do, instead of displaying it as a single column, is create a clustered column so we're actually seeing a separate column for each of the different types of loss. And I can now see the orange crop loss compared to the blue property loss. It's created a layer called Layer 1. That's not a particularly useful name for me, so I'm going to change that to Costs. I can resize this legend here that's going to be used, and I can go ahead and change those colors as well. So I've got blue and orange here. What I might want to do is have a look at the settings there and say, well, for property loss we'll leave that as blue, and for crop loss, let's change that to purple just so that we've got two different colors there. And it goes ahead and updates that immediately in here. Okay, so far so good. What I might also want to see is how this has changed over time. So what I'm going to do is come down to the time box here, and in my tornadoes table I'm going to select the event time. And that then fills in that time box.
And so what I get is this player that has appeared here, so I can play the events over time and see how the costs have grown over time. I can choose to increment in days, months, quarters, or years. I'm going to increment by month, so each month we see the next value, and I'm going to configure it so that the data accumulates over time. So we'll see the cumulative costs in these different areas as time elapses. And again, if I just zoom out we can see that there. I can actually drag the slider, and you can see the values change as we move over time. You can see that it's added this little legend here to tell me what the current time is that we're looking at. I'm going to edit that and just change the format a little bit: instead of showing the day and time, I'm only really interested in the month and year. So we'll go ahead and change that, put it out of the way down at the bottom here, and move this legend up to the top. So we've now got the property loss and the crop loss shown as a clustered column chart in each state, and we're able to view that over time, which is pretty good. But what I want to do is overlay on top of that the severity, the categories, of the different tornadoes that we've seen. So I'm going to add a layer. And again, the geographical field that I want to display needs to be chosen, so for the tornadoes I'm going to choose the start latitude and longitude. And it automatically mapped those to latitude and longitude; it's identified where those actually happened. And then for each of those I'm going to display the severity, the category, of the tornado that occurred. So I'll add that category in there. It's again going for this idea of displaying the sum as a column chart. I don't really want it as a column chart; I want it as a heat map, so I can see the different severities of the tornadoes based on a gradient scale from cold to hot. And again, I want to see that over time, so I'm going to choose the event time once again as my time field. Again I'm going to increment by month, so we'll just modify that to increment by month. And this time, instead of the data accumulating over time, I actually want it to stay in place until it's replaced, so it shows me the most recent category that's there as time elapses, and I can see how that affects the US as we view the data that's displayed. So I've now got two layers on my map. This layer 2 I'm just going to rename to Category, and I can see the tornado category displayed there. Both of the layers are going to be incrementing over time. I'm just going to change a few settings for my overall scene here. I'm going to change the duration so that the whole scene lasts 20 seconds, and I'm going to apply an effect. We're going to use the push-in effect so that we animate the globe as we view the data and push in and see more detail. I can change the effect speed so that we push in fairly quickly and perhaps increment rather slowly. And I'm now ready to go and view my tour to see this data. So let's go ahead and play the tour. PowerMap animates the data over time, so I can see the most recent severity of tornado that we've had in the different areas, the different longitude and latitude locations, and I can see the crop and property loss costs in each of the states. And we zoom in as time goes by and we go through all the different data that we've got.
And eventually it will bring me up to date as we get closer and closer. When we finish zooming in, I can scroll around the map. I can use the mouse to move around, or if this were a touch screen I could just drag with my finger. I can hover over the different values there, and I can actually see the property loss and the crop loss for each of the different states if I hover over those as well. And I can again use my mouse or my finger to scroll around and see all the different values that are in there. So what you've seen in this demonstration is the ability to use HDInsight to create a data warehouse on top of data that I'm storing in Azure storage as a commodity store. I'm using a staging technique so that I can incrementally add more data to that storage, but I only have to create the data warehouse when I need it. I can tear it down when I don't, without deleting the data, so the data stays where it is. I then created a data model in Excel, and of course I can refresh the tables in here. If I have a look at my tables in PowerPivot, I can refresh them, so if a few months from now I've added more data to the data store and I've recreated the HDInsight cluster, I can refresh the data. It will update the data model and update all of the analysis that I've been doing. So that's how we use HDInsight as a commodity data store and a data warehouse on demand in a big data scenario. I hope you found that useful. [Microsoft]

Video Details

Duration: 33 minutes and 31 seconds
Country: United States
Language: English
License: All rights reserved
Genre: None
Views: 31
Posted by: asoboleva99 on Feb 9, 2016
