VideoBigDataScenario2: Data Warehouse On Demand
[Big Data Scenario, Data Warehouse On Demand]
[male speaker] Hi and welcome
to this demonstration of a big data scenario for Data Warehouse On Demand.
My name is Graeme Malcolm, and I'm going to talk you through how we can use HDInsight
to provide a data warehouse for big data analysis.
[A. Datum Meteorological Research, Seasonal Analysis of tornadoes, Data collected and stored, Periodic analysis when new seasonal data is available]
So for the demonstration, we're going to look at a fictional company
called A. Datum, who perform meteorological research.
What they do is they collect data about tornadoes, including the severity and the costs
incurred by damage and so forth, and this is analyzed on a seasonal basis.
So the data is collected and stored as and when the tornadoes occur,
but periodically we want to perform analysis of that data
when new seasonal data is available.
So let's have a look and see how A. Datum can use HDInsight
and the Microsoft Data Technology Stack to perform this analysis.
Let's start by looking at the Microsoft Azure environment
that A. Datum have got. And you can see there's an Azure account here,
and we have an Azure storage account set up as well.
And this is where we can store our data that we collect
about tornadoes as time goes by, and we've got a blob container here
which is where we're actually going to store the data.
So we're just starting the process now, so there's no existing data in there
but as time goes by we'll continue to upload the data and use that as a commodity store.
Now when we need to analyze the data that we've got,
we will spin up an HDInsight cluster. And you can see I've already got one up and running here now.
The point is that I can retain the data in the storage account
even when I tear the cluster down, and I can bring the cluster back up again
when I need it. So I'm only paying for the cluster when I actually need to use it
for analysis, but I'm using the storage all the time as a commodity store
for the data that I'm gathering. So we've got our HDInsight cluster up and running.
What I'm probably going to want to do is to create some Hive tables
that I can use to do some analysis of my data with.
So here's the script code I've got to create that data warehouse and its tables.
First of all, I'm creating a database. Now you don't have to do this.
There's a default database that's created for you when you provision an HDInsight cluster.
But it's a good idea just to create a separate database. It acts as a separate schema,
a separate name space for your tables that are in your data warehouse
and that keeps them distinct from any other tables that you might happen to have.
So I'm creating this database, and I'm specifying a location in the HDFS
which, of course, is going to be hosted in my Azure blob store,
as to where the files for this database should be stored.
I'm then creating some tables. I've got three tables.
And there's a fairly common pattern to data warehouses.
We use what's generally called a dimensional model.
So I've got a couple of dimension tables here and a fact table that contains
the things I want to aggregate, the actual data I'm interested in.
So for dimensions, I've created a table of states for the US states
and a table of date values which is quite a common thing to do in a data warehouse
and it makes it easier to aggregate data by month or year
or different time periods. The fact table itself,
with the tornadoes information in it, is slightly larger. There's a whole bunch of things
that I might want to aggregate in here. And I can relate this to the date table
based on the date code and to the state table based on the state code.
The other interesting thing that's worth noticing about this
is that I've partitioned this table. I've got a partition by year
which—under the covers, where this table is stored—
is going to create a separate folder for each year that I've got data for,
and the data files themselves will be stored in the appropriate folders.
As I get more and more data—as I gather more data over time—
that will improve performance as we query the table.
I'm storing these in the SequenceFile format.
It's a format that's supported by HDInsight and Hive.
And it allows things like compression, and it's a splittable format,
which makes it a good format to use for performance.
Notice that all the tables are external so that when I drop the tables
and I get rid of the cluster, it doesn't affect the data. The data files stay exactly where they are
in the Azure storage. It's only the table metadata that's dropped.
And then I can recreate that on top of the existing data
when I need it again. So the data is retained in between the times the cluster exists.
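To give a sense of what that script contains, a simplified HiveQL sketch
might look like the following (the column names are illustrative
rather than the exact ones used in the demo):

```sql
-- Create a separate database (namespace) for the warehouse, with its files
-- kept in a specific folder in the Azure blob store
CREATE DATABASE IF NOT EXISTS dw LOCATION '/dw';

-- External fact table, partitioned by year and stored as SequenceFile;
-- dropping the table (or the cluster) leaves the data files in place
CREATE EXTERNAL TABLE IF NOT EXISTS dw.tornadoes
  (date_code     STRING,
   state_code    STRING,
   event_time    STRING,
   category      INT,
   property_loss DOUBLE,
   crop_loss     DOUBLE)
PARTITIONED BY (year STRING)
STORED AS SEQUENCEFILE
LOCATION '/dw/tornadoes';
```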
So that's the script I'm using. Let's take a look at how we go ahead and run that script.
And I'm using PowerShell to do this. Here's my PowerShell environment.
I've already set this up so that it's linked to my Azure subscription.
If I go ahead and run Get-AzureSubscription, you can see
that I've imported a certificate so there's already a connection
between my local PowerShell environment and my Azure subscription.
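That link is typically set up once with the classic Azure PowerShell cmdlets,
along these lines (the file and subscription names here are placeholders):

```powershell
# One-time setup: import the management certificate and select the subscription
Import-AzurePublishSettingsFile ".\MySubscription.publishsettings"
Select-AzureSubscription -SubscriptionName "A. Datum Subscription"

# Confirm that this PowerShell session is linked to the Azure subscription
Get-AzureSubscription
```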
And what I'm going to do is I'm just going to bring this down so we can see a script,
and I'll clear that out of the way. And first of all, all I'm going to do
is connect to our cluster and our storage account, and find the folder
where this PowerShell script is actually stored, because that's where the createdwtables.hql file is—
the one that creates the data warehouse and the tables.
I'm then going to upload that file, using the Set-AzureStorageBlobContent cmdlet
to upload it to the Azure store. And then I'm going to run an Azure HDInsight job
based on that file. This is a Hive job. There's a New-AzureHDInsightHiveJobDefinition
that points to that script file that I'm uploading, and we're going to run that job
under the covers. That will perform a MapReduce job.
It will run those HiveQL commands, and that will create the database
and the tables. So let's go ahead and run this PowerShell script.
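In outline, that PowerShell script looks something like the sketch below,
assuming the classic Azure PowerShell module of the time
(the cluster, account, and container names are placeholders):

```powershell
$clusterName    = "adatum-hdinsight"   # placeholder cluster name
$storageAccount = "adatumstorage"      # placeholder storage account
$container      = "adatum-data"        # placeholder blob container

# Build a storage context for the commodity store
$key     = (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary
$context = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $key

# Upload the HiveQL script that creates the data warehouse and its tables
Set-AzureStorageBlobContent -File ".\createdwtables.hql" -Container $container `
    -Blob "dw/scripts/createdwtables.hql" -Context $context -Force

# Define and run a Hive job based on the uploaded script
$scriptUri = "wasb://$($container)@$($storageAccount).blob.core.windows.net/dw/scripts/createdwtables.hql"
$jobDef    = New-AzureHDInsightHiveJobDefinition -File $scriptUri
$job       = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput
```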
And the first thing it does is it uploads the script file.
And then it's going to go ahead and start an HDInsight job to run that script file
and create the database.
So we'll just wait for the job to complete.
So when the job is done, I get the report back that tells me how long each of the statements took,
so it created the database and then created the different tables in the database.
And so now I've got that database ready to use.
So the first time that I'm going to use this, I'm going to upload the data.
In the future the data will already be there and I'll add more data as I get it.
So I'll be periodically adding more data files as I collect them.
But initially I'm going to upload some data to there.
And let's just take a look at what we've got. We've got a number of data files here.
The dates data file is basically just the coded value for the date,
then the actual date in date format.
And then I've got the day number of the month,
the month number of the year, the year, and the day of the week.
And then I've got the strings that represent the day name of the week and the month name.
So there's some stuff that's going to get uploaded to that date table.
Typically you're going to load the date table initially with enough values
to cover the data—the tornadoes data in this case—that we've got.
And then we might add more dates to the table as we go forward in the future.
The states table is very simple. It's just the state code and the state name.
And unless something really dramatic happens, I don't think we'll be adding to this.
This is all of the states in the US, but I guess what might happen
is we might extend our analysis to other parts of the world
and upload some additional states into there from other countries.
And in the tornadoes data itself, we've got a whole bunch of information
about the data. You can see it's a tab-delimited text file
that we've got. It has the date code and the state code
and then the time that the event started and a whole bunch of other information
about the tornadoes that have occurred over that period.
So I'm going to have to upload this data to the location where the tables are created.
And to do that what I want to do is stage the data first. I'll create a staging table
that I can upload the data into initially, and then I can read it from that staging table
and do any type conversion or any additional functions that I need to do on the data
before I load it into my production tables.
So to do that I've got some staging scripts.
The first scripts are the create table scripts. So for example, there's one here
to create a staging table for my dates, and it just follows the format
of that file that I've got—that source file. It says that it's a tab-delimited text file
that I'm loading, and it tells Hive that this is where I'm expecting the data files to be.
So there's a staging folder that's going to have a dates folder,
and that's where I'll upload this file to. And the same is true of the tornadoes, for example.
So if we have a look at that, again it follows the structure of the file.
It's a tab-delimited text file, and it's located in this location.
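As a rough sketch, that tornadoes staging table might be defined
something like this (again, the column names are illustrative):

```sql
-- Managed (internal) staging table for the tab-delimited tornadoes file:
-- plain text, no partitioning, pointing at the staging upload folder
CREATE TABLE IF NOT EXISTS dw.tornadoes_staging
  (date_code     STRING,
   state_code    STRING,
   event_time    STRING,
   category      INT,
   property_loss DOUBLE,
   crop_loss     DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/dw/staging/tornadoes';
```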
Now a couple of things to notice. One is this is obviously a text file,
whereas our production files are sequence file format.
So there's a difference in the format. And also this tornadoes table,
the staging table, doesn't have any partitioning. It's just a flat table
that contains the tornadoes data. And obviously we're going to have to generate
the partitioning information when we move it from staging
into the production table. So to do that we have scripts to do the loading
of the stage data into the production tables. So here's the one for the dates.
And it's a straightforward statement that takes data from the staging table
and inserts into the dates table, and it just selects the data to insert.
There's a similar one for states. For tornadoes it's slightly different.
We have to set this Hive option here. We have to set the partition mode
to non-strict because we're going to dynamically generate the partition value.
And that requires that setting. So once we've set that on,
we can go ahead and insert into the tornadoes table partitioned by year,
and again we're taking data from the staging table, but then that last column here
that we're getting, we're actually generating the year by taking the first four characters
of the date code, which is actually the year. So we're loading from a flat table
into a partitioned table, and we're using this function to generate the partitioning value.
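Put together, the tornadoes load script might look roughly like this sketch
(table and column names are illustrative):

```sql
-- Dynamic partitioning must be enabled and set to nonstrict, because the
-- partition value (year) is generated from the data itself
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Load from the flat staging table into the partitioned production table,
-- deriving the year partition from the first four characters of the date code
INSERT INTO TABLE dw.tornadoes PARTITION (year)
SELECT date_code,
       state_code,
       event_time,
       category,
       property_loss,
       crop_loss,
       SUBSTR(date_code, 1, 4) AS year
FROM dw.tornadoes_staging;
```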
So the first thing we need to do is we need to upload those staging scripts.
We're going to use these each time we need to load data into those tables.
We're going to stage it and then load it, so it's a two-phase process each time
we want to upload data. So again I've got a PowerShell script to do this.
We'll just clear the pane at the bottom.
Fairly straightforward. Again we're just finding the location of this script file.
And then we're uploading all of the staging script files
to this location here, dw/staging/scripts.
And it's just going to loop through that local folder that I've got here
with those script files in it and upload them all to the Azure store.
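A minimal sketch of that upload loop, reusing the placeholder names from earlier,
might be:

```powershell
# Upload every local staging script to dw/staging/scripts in the blob container
$localFolder = ".\stagingscripts"   # placeholder local path
Get-ChildItem -Path $localFolder -Filter "*.hql" | ForEach-Object {
    Set-AzureStorageBlobContent -File $_.FullName -Container $container `
        -Blob ("dw/staging/scripts/" + $_.Name) -Context $context -Force
}
```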
So let's go ahead and run that.
And it just starts looping through those local staging scripts
and uploads each of them to that location in the Azure store.
And when it's finished, all of those files have been uploaded.
Okay, well we're now ready to actually do some loading of our data.
And you've seen the data source files and you've seen the staging scripts we're going to use.
And I've got a separate PowerShell script for each of those types of data—
one for dates, one for states, one for tornadoes.
We could, of course, do that all in the one file, but by separating them out
it makes it easier to manage incremental loads of tornadoes
when I don't need to add any states or add any dates or anything like that.
I can do them all independently. So this script will again run the existing Hive script
I've just uploaded, those staging scripts. It runs a Hive script to create the staging table
that I'm going to load the data into.
Then it goes ahead and uploads the local data file, in this case the dates.txt file.
And it puts it into the location where I've defined that staging table.
So that in effect populates the staging table.
And then I can run the other Hive script that I uploaded with the staging scripts
to load the staged data into the production table.
So to do that, it selects against the staging table and loads the results into the production table.
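For the dates data, that two-phase stage-and-load script might look broadly
like this, reusing the placeholder names from earlier
(the .hql file names here are hypothetical):

```powershell
$scriptBase = "wasb://$($container)@$($storageAccount).blob.core.windows.net/dw/staging/scripts"

# 1. Run the Hive script that (re)creates the staging table for dates
$createStaging = New-AzureHDInsightHiveJobDefinition -File "$scriptBase/createstagingdates.hql"
$job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $createStaging
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600

# 2. Upload the local data file into the staging table's folder
Set-AzureStorageBlobContent -File ".\dates.txt" -Container $container `
    -Blob "dw/staging/dates/dates.txt" -Context $context -Force

# 3. Run the Hive script that loads the staged data into the production table
$loadProd = New-AzureHDInsightHiveJobDefinition -File "$scriptBase/loadstagingdates.hql"
$job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $loadProd
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
```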
So let's go ahead and run this one.
And again, it's going to kick off an Azure HDInsight Hive job
to go ahead and create the staging table.
And when the Hive job to create the staging table is complete,
we're going to upload the data file.
And then we're going to run another Hive job, this time to load the staged data
into the production table.
And when the job's finished, again I get the output to tell me
how long each of the different jobs and different parts of the job took.
And the data has now been uploaded to the staging table, and then loaded into the production table.
So let's go ahead and do the same with the states data.
Again just running the appropriate scripts to create the staging table
and then to load the staged data into the production table.
And again, that job completes and loads the staged data into the production table.
And finally we'll do the same with our tornadoes data.
So when the job to load the tornadoes data is complete,
we can actually see in the results the different partitions that were generated
from the data. So we've got a partition for each year that's been loaded.
And what we've now got is a repeatable process
using these scripts where we can take new data. We can load that to the staging table.
The staging tables are not defined as external. They're internal tables.
So when we drop them, we delete any previously staged data,
upload the new data, and then we insert that into the production table
from the staging table. And the production table is defined as an external table.
So when it's dropped or the cluster is deleted,
the data that's already there stays where it is.
We're able to continually add to that data and just bring up the cluster as and when we need it.
So we've got a repeatable process for creating a data warehouse
on demand, if you like, and incrementally adding data over time.
And just to see how we might use that data warehouse,
well one of the things obviously is that we might just want to run queries
against the tables and get some information back.
So I've got here a PowerShell script that's just going to go and select
the state, the month number of the year, and the total property loss
and crop loss costs from those tables.
I'm filtering this so it's just for the year 2010.
So actually it's just going to group it by month and display for each state
and each month what the total property costs and crop cost was.
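The query behind that script is essentially an aggregation that joins
the fact table to its dimensions; a HiveQL sketch, using the same
illustrative column names as before, might be:

```sql
-- Total property and crop loss per state and month for 2010
SELECT s.state_name,
       d.month_of_year,
       SUM(t.property_loss) AS total_property_loss,
       SUM(t.crop_loss)     AS total_crop_loss
FROM dw.tornadoes t
JOIN dw.dates d  ON t.date_code  = d.date_code
JOIN dw.states s ON t.state_code = s.state_code
WHERE t.year = '2010'
GROUP BY s.state_name, d.month_of_year;
```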
So let's go ahead and run that script.
And when the results are returned, I can actually see for each different state
and each month in 2010, I can see the total property costs and crop costs
associated with that. So it's working. I've got a data warehouse
that has tables, and those tables have data and I can query them.
How I would actually use that in a production environment
is I'd probably have some sort of modeling and client visualization tool.
So we're going to use Excel to see how we might use this.
We'll just start up Excel and create a new workbook.
And one of the things I can do with Excel is I can use PowerPivot
to create a data model based on a data source such as a data warehouse.
Now in this case I'm going to connect to our Hive data warehouse.
And to do that I'm going to use ODBC.
So actually in my control panel,
if I have a look at my administrative tools,
I have an ODBC data source name
because I've installed the Hive driver—the Microsoft Hive driver—
for ODBC. So let's have a look at this DSN.
You can see that it's connecting to my HDInsight cluster.
And I just need to supply it with the user name and password.
And off it will go and connect and make a connection to the Hive tables there.
So this ODBC driver has been installed. I've got both the 64- and the 32-bit versions
installed in here. And I can use these from any ODBC compliant client
such as Excel. So I'm going to use that data source name
in my PowerPivot data model. So we'll go ahead and create a new PowerPivot data model.
And we're going to grab some data from this data source.
It's an OLE DB ODBC data source.
And we'll just call this HiveOnHDInsights.
And I'm going to build up the connection string.
The provider I want to use is that OLE DB provider for ODBC.
And the data source name is the one we've just looked up,
that Hive ODBC data source name.
I need to specify the credentials for my HDInsight cluster,
so I'll just put those in there. And we'll save those in the workbook.
I don't need to specify the catalog to connect to. It's just going to connect to the default
Hive database catalog. So we'll click okay there, and then we're ready
to carry on with the wizard.
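The resulting connection string is essentially the OLE DB provider for ODBC
plus the DSN and the cluster credentials, along these lines
(the DSN name and credentials here are placeholders):

```
Provider=MSDASQL.1;Persist Security Info=True;DSN=Hive ODBC;Uid=admin;Pwd=********
```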
So I get the choice to either select individual tables
or I could write a HiveQL query that brings back the data that I want.
In this case I'm just going to import the three tables from the data warehouse.
So we'll go ahead and select those.
So it lists the tables that are available, and we're going to have our dates,
states, and tornadoes tables.
I could do further filtering in here. I could filter out rows I'm not interested in,
and I could give them slightly more friendly names.
So I'm going to capitalize these names just to make them a bit nicer to read.
And then I can go ahead and click finish at this point.
And PowerPivot is going to go off and connect to my HDInsight server
and start importing the data from those tables.
So we'll just wait for the data to be imported.
And when the data has been imported, I can see the different tables
that have now been added to my data model.
And I might want to go and make a few changes to the structure of the data model,
so some simple things, like I might want to just sort the data
in the tables. I might want to do things like change the data type,
so I'll change this to a date and change the format
so that it's just literally the date I care about.
And I can do things like say the weekday—if I'm going to sort data by weekday,
then actually what that means is don't sort it alphabetically
by the name of the day of the week. Sort it by the day-of-week number,
which is this column here. And similarly with the month,
don't sort it alphabetically by month if I choose to sort it that way.
Sort it by the month number of the year.
So little bits of changes I might want to make to the data model.
Nothing really I want to change with the states.
With the tornadoes, again I might want to just sort that
into the right order just to try and make sure the performance is okay.
And I might want to change the event time again to a date
but this time I'll leave the time in there because that's something I care about.
And I might want to take my property loss and crop loss columns
and format those as currency so that we're actually getting them
as values. So I've made some changes to the data model
based on the data that I brought in. I might also go and have a look
at the diagram view of this data model and see these tables that I've got.
And I might just want to create some relationships
between them. So for example, I've got my date code that I can join
from tornadoes to date with. And the same with the state code
to the states table. So I can create those relationships
and define the structure of my data model. Then I might want to go
into my dates table here and just take a closer look at that.
And I might want to define some sort of hierarchy here
so we'll define a hierarchy called date.
And what we'll do is we'll add the year and then the month
and then the day of the month to that hierarchy.
And that lets me drill up and down when I'm analyzing the data
and aggregate it at different levels.
So I'm able to use PowerPivot to refine the data model that I've got
from my Hive tables and my Hive data warehouse.
And what I can then do with that, if I want to analyze the data here
I can go ahead and create a Pivot table, for example.
So we'll create a Pivot table. We'll just put it in the existing worksheet.
And here are the fields from those tables that I've got in my data model.
What I might want to do is I might want to show my date hierarchy here.
And I might want to show the state name across on the columns
like so. And then from the tornadoes table, I might be interested
in the cost of the property loss. I might bring that in like so.
And in that comes. And I can see the different costs associated with property
across the states and down by the years. If I expand one of the years,
because I defined that hierarchy I can actually then drill into the month
and then into the individual day of the month
and look at the costs that were incurred based on tornadoes that occurred then.
So I'm able to do some fairly detailed analysis of my data
having got it from my Hive tables and refined it in a data model
in PowerPivot. One of the other things I might want to do
is I might want to visualize this data. And because it's largely geographical,
it's based on events that have occurred in different states in the US
across different time periods, one of the tools that I might want to use
to analyze this is PowerMap. So if I have a look at PowerMap,
this is a component that comes with the Power BI service
for Office 365. So I can create what's called a tour
using PowerMap. So we'll give the tour a name.
We'll just call it Tornadoes Over Time.
And what I can do is I can first of all specify
a geographical field that I want to use to overlay data
onto this globe here. I've got a globe of the world here,
and I want to overlay some data onto there. So let's choose the state name
initially as our geographical field. And it recognizes the state name.
It's mapped that to the state/province level. There's a bunch of different levels
that I can map data to. In this case it's the state.
It's identified the different states in the US, so we've got a geographical field.
And on the next page I can choose, what's the data I'm actually interested
in analyzing here? What's the analytical data that I want to display?
And for that what I want to really do is have a look at the tornadoes.
I want to look at that property cost and crop cost.
So there's the property loss, and I can add that.
And you can see that it's added if I drill in here. You can see it's added a column
for each of those. I might also want to add the crop loss in there as well.
And again it's added that. In cases where there is crop loss,
if we drill in a little bit it's actually stacked in the columns.
There's comparatively little crop loss compared to the property loss there.
So we're not really seeing that very much. But what I might do
is I might, instead of displaying it as a single column,
create a clustered column so we're actually seeing a separate column
for each of the different types of loss.
And I can now see the orange crop loss compared to the blue property loss.
It's created a layer called layer 1. That's not a particularly useful name for me
so I'm going to change that to costs.
And I can resize this legend here that's going to be used.
And I can go ahead and I can change those colors as well.
So I've got blue and orange here. What I might want to do
is just have a look at the settings for there and say,
well, for property loss we'll leave that as blue.
And for crop loss, let's change that to purple
just so that we've got two different colors there. And it goes ahead
and updates that immediately in here.
Okay, well so far so good. What I might also want to see
is how this has changed over time. So what I'm going to do
is I'm going to come down to the time box here
and I'm going to select the event time from my tornadoes table.
And that then fills in that time box.
And so what I get is this player has appeared here.
So I can play the events over time and see how the costs have grown over time.
And I can choose to increment in days, months, quarters, years.
I'm going to increment by month so each month we see the next value.
And I'm going to configure it so that the data accumulates over time.
So we'll see the cumulative costs as time has elapsed
in these different areas. And again, if I just zoom out we can see that there.
And I can actually drag the slider, and you can see the values change
as we move over time.
You can see that it's added this little legend here to tell me
what's the current time we're looking at.
I'm going to edit that and just change the format a little bit
so that instead of showing the day and time, I'm only really interested in the month and year.
So we'll go ahead and change that.
And we'll just put that out of the way down at the bottom
here and move this legend up to the top.
So we've now got the property loss and the crop loss shown as a cluster column chart
in each state, and we're able to view that over time
which is pretty good. But what I want to do is I want to overlay
on top of that the severity, the categories, of the different tornadoes
that we've seen. So I'm going to add a layer.
And again, the geographical field that I want to display
needs to be chosen, so for the tornadoes I'm going to choose
the start latitude and longitude. And it automatically mapped those
to latitude and longitude. It's identified where those actually happened.
And I'm then going to—for each of those—I'm going to display
the severity, the category, of the tornado that occurred.
So I'll add that category in there. It's again going for this idea
of displaying the sum as a column chart. I don't really want it
as a column chart. I want it actually as a heat map
so I can see the different severities of the tornadoes based on a gradient scale
from cold to hot, like a heat map.
And again, I want to see that over time.
So what I'm going to do is I'm going to choose the event time once again
as my time field. Again I'm going to increment by month.
So we'll just modify that to increment by month.
And this time instead of the data accumulating over time,
I actually want it to stay in place until it's replaced
so it shows me the most recent category that's there as time elapses.
And I can see how that affects the US as we view the data that's displayed.
So I've now got two layers on my map. This second layer I'm just going to rename
to Category. And I can see the tornado category displayed there.
Both of the layers are going to be incrementing over time.
I'm just going to change a few settings to my overall scene here.
What I'm going to do is change the duration so that the whole scene lasts 20 seconds.
And I am going to apply an effect. So we're going to use the push-in effect
so that we animate the globe as we view the data and push in
and see more detail. What I can do is I can change the effect speed
so that we push in fairly quickly and we perhaps increment rather slowly.
And I'm now ready to go and view my tour
to see this data. So let's go ahead and play the tour.
And PowerMap animates the data over time, so I can see the most recent severity
of tornado that we've had in the different areas, the different longitude and latitude locations.
And I can see the crop and property loss costs
in each of the states. And we zoom in there as time goes by
and we go through all the different data that we've got.
And eventually it will bring me up to date
as we get closer and closer.
And when we finish zooming in, I can scroll around the map.
I can use the mouse to move around, or if this was a touch screen
I could just drag with my finger.
And I can hover over the different values there and I can actually see
the property loss and the crop loss if I hover over that as well
for each of the different states.
And I can use again my mouse or my finger to scroll around
and see all the different values that are in there.
So what you've seen in this demonstration
is the ability to use HDInsight to create a data warehouse
on top of data that I'm storing in Azure storage as a commodity store.
I'm using a staging technique so that I can incrementally add more data
to that storage, but I only have to create the data warehouse
when I need it. I can tear it down when I don't, without deleting the data.
So the data stays where it is. I then created a data model
in Excel, and of course, I can refresh the tables in here.
If I have a look at my tables in PowerPivot, I can refresh these tables
so if a few months from now I've added more data to the data store,
and I've recreated the HDInsight cluster, I can refresh the data.
It will update the data model and update all of the analysis that I've been doing.
So that's how we can use Azure storage as a commodity data store
and HDInsight as a data warehouse on demand in a big data scenario.
I hope you found that useful.
[Microsoft]