VideoBigDataScenario2: Data Warehouse On Demand
[Big Data Scenario]
[Data Warehouse On Demand]
[male narrator] Hi and welcome
to this demonstration
of Big Data Scenario
for Data Warehouse on Demand.
My name is Graham Malcom,
and I'm going to talk you through
how we can use HDInsight
to provide a data warehouse
for big data analysis.
[A. Datum
Meteorological Research]
So for the demonstration,
we're going to look
at a fictional company
called A. Datum
who perform
meteorological research.
What they do is they collect
data about tornadoes,
including the severity
and the costs incurred by damage,
and so forth, and this is analyzed
on a seasonal basis.
So the data is collected
and stored as and when
the tornadoes arise,
but periodically
we want to perform analysis
of that data
when new seasonal
data is available.
So let's have a look
and see how A. Datum
can use HDInsight
and the Microsoft
data technology stack
to perform this analysis.
Let's start by looking
at the Microsoft Azure environment
that A. Datum has got.
And you can see there
is an Azure account here,
and we have an Azure
storage account set up as well.
And this is where we can
store our data that we collect
about tornadoes
as time goes by,
and we've got a blob container
here, which is where we're
actually going to store the data.
So we're just
starting the process
now so there's
no existing data in there,
but as time goes by, we'll
continue to just upload the data
and use
that as a commodity store.
Now when we need to analyze
the data that we've got,
we will spin up
an HDInsight cluster,
and you can see I've already
got 1 up and running here now.
The point is that I
can retain the data
in the storage account even
when I tear the cluster down.
And I can bring
the cluster back up again
when I need it, so I'm only
paying for the cluster
when I actually need
to use it for analysis,
but I'm using
the storage all the time
as a commodity store
for the data that I'm gathering.
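Just to sketch what that on-demand pattern looks like with the classic Azure PowerShell cmdlets used later in this demo (the storage account, container, and cluster names here are placeholders, not A. Datum's actual ones):

```powershell
# Sketch only: names are placeholders, using the classic Azure PowerShell module.
$storageAccount = "adatumstorage"      # existing commodity store (assumed name)
$container      = "tornadodata"        # existing blob container (assumed name)
$clusterName    = "adatumhdi"          # cluster we spin up on demand (assumed name)

# Reuse the existing storage account so the data outlives the cluster.
$storageKey = (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary
$creds = Get-Credential                # admin credentials for the new cluster

New-AzureHDInsightCluster -Name $clusterName `
    -Location "North Europe" `
    -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
    -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName $container `
    -ClusterSizeInNodes 4 `
    -Credential $creds

# ...run the analysis, then tear the cluster down;
# the blobs in the storage account are untouched.
Remove-AzureHDInsightCluster -Name $clusterName
```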
So we've got our HDInsight
cluster up and running.
What I'm probably
going to want to do
is to create some Hive tables
that I can use
to do some analysis
of my data with.
So here's the script code
I've got to create that
data warehouse and its tables.
First of all,
I'm creating a database.
Now, you don't have to do this.
There's a default database
that's created for you
when you instantiate
an HDInsight cluster
but it's a good idea just to create
a separate database.
It acts as a separate schema
or a separate namespace
for the tables
in your data warehouse,
and that keeps them distinct
from any other tables
that you might happen to have.
So I'm creating this database,
and I'm specifying a location
in the HDFS, which of course
is going to be hosted
in my Azure blob storage
as to where the files for this
database should be stored.
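As a minimal HiveQL sketch of that step (the database name and path are assumptions, not the demo's actual script):

```sql
-- Illustrative only: the database name and location are assumptions.
CREATE DATABASE IF NOT EXISTS tornadodw
LOCATION '/dw';   -- resolves to a folder in the cluster's default Azure blob container
```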
I'm then creating some tables.
I've got 3 tables, and
there's a fairly common pattern
to data warehouses.
We use what's generally
called a dimensional model.
So I've got a couple of
dimension tables here
and a fact table that contains
the things I want to aggregate—
the actual data I'm interested in.
So for dimensions,
I've created a table of
states for the US states
and a table of date values,
which is quite a common
thing to do in a data warehouse.
It makes it easier to aggregate
data by month or year
or different time periods.
The fact table itself with the
tornadoes information in it
is slightly larger--
there's a whole bunch of things that
I might want to aggregate in here.
I can relate this to the date
table based on the date code
and to the state table
based on the state code.
The other interesting thing that's
worth noticing about this
is that I've partitioned this table.
I've got a partition by year,
which means that, under the covers,
a separate folder is created
for each year
that I've got data for, and
the data files themselves
will be stored in the
appropriate folders.
And as I gather
more data over time,
that partitioning
will improve performance.
Also, when creating the tables,
I'm storing them
in the sequence file format.
It's a file format
that's supported by
HDInsight and Hive
and allows things
like compression,
and it's a splittable format,
which makes it a good format
to use for performance.
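For example, the partitioned fact table might be declared along these lines; this is a sketch with simplified column names guessed from the narration, not the actual CreateDWTables.hql:

```sql
-- Sketch of a partitioned, external fact table; columns are simplified guesses.
CREATE EXTERNAL TABLE IF NOT EXISTS tornadodw.tornadoes
  (datecode      STRING,
   statecode     STRING,
   eventtime     STRING,
   category      INT,
   propertyloss  DOUBLE,
   croploss      DOUBLE)
PARTITIONED BY (year INT)        -- one folder per year under the table location
STORED AS SEQUENCEFILE           -- compressible, splittable storage format
LOCATION '/dw/tornadoes';        -- data stays in blob storage when the table is dropped
```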
Notice that all the tables
are external,
so that when I drop the tables
and I get rid of the cluster,
it doesn't affect the data.
The data files stay exactly
where they are
in the Azure storage; it's only
the table metadata
that's dropped, and then I can
recreate that on top
of the existing
data when I need it again,
so that the data
is retained between
the times I'm using the cluster.
So that's the script I'm using.
Let's take a look at how we
go ahead and run that script.
I'm using PowerShell to do this.
Here's my
PowerShell environment,
and I've already set this up
so that it's linked
to my Azure subscription.
If I go ahead and run
Get-AzureSubscription,
you can see
that I've imported
the certificates
so there's already a connection
between my local
PowerShell environment
and my Azure subscription.
And what I'm going
to do is I'm just going
to bring this down
so we can see a script.
And I'll clear
that out of the way.
And first of all,
all I'm going to do
is connect to our cluster
and our storage account.
I find the folder
where this PowerShell script
is actually stored,
because that's where
that CreateDWTables.hql file is,
the one that creates
the data warehouse and its tables.
I'm then going
to use the
Set-AzureStorageBlobContent command
to upload that file
to the Azure storage account.
And then I'm going to run
an Azure HDInsight job
based on that file,
or should I say Hive job.
Using New-AzureHDInsightHiveJobDefinition,
I create a Hive job definition
that points to the
script file that I'm uploading,
and we're going to run that job.
Under the covers,
that will perform a MapReduce job,
it will run those
HiveQL commands,
and that will create
the database and the tables.
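Put together, that workflow looks roughly like this (a sketch only; the names, paths, and the one-hour timeout are assumptions):

```powershell
# Sketch of the workflow just described; names, paths, and the timeout are assumptions.
$clusterName    = "adatumhdi"
$storageAccount = "adatumstorage"
$container      = "tornadodata"
$context = New-AzureStorageContext -StorageAccountName $storageAccount `
    -StorageAccountKey (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary

# Upload the HiveQL script to blob storage.
Set-AzureStorageBlobContent -File ".\CreateDWTables.hql" -Container $container `
    -Blob "DW/scripts/CreateDWTables.hql" -Context $context -Force

# Define a Hive job that runs the uploaded script, start it, and wait for it to finish.
$jobDef = New-AzureHDInsightHiveJobDefinition `
    -File "wasb://$container@$storageAccount.blob.core.windows.net/DW/scripts/CreateDWTables.hql"
$job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardError
```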
So let's go ahead
and run this PowerShell script.
And the first thing it does
is it uploads the script file
and then it's going
to go ahead and start
an HDInsight job to run
that script file
and create the database.
So we'll just wait
for the job to complete.
So when the job's done,
I get the report back
that tells me how long
each of the statements took
that created the database
and then created the different
tables in the database.
So now we've got
that database ready to use.
So the first time
that I'm going to use this
I'm going to upload the data.
In the future, the data will already
be there, and I'll add more data
as I get it,
so I'll be periodically
adding more data files
as I collect them.
But initially, I'm going
to upload some data to there.
And let's just take
a look at what we've got.
We've got a number
of data files here.
The dates table,
or the dates data file,
is basically
that code value for the date
and then the date
in actual date format.
And then I've got
the day number of the month,
the month number
of the year, the year,
and the day of the week.
And then I've got
the strings that represent
the day name of the week
and the month name.
So there's some stuff that's going
to get uploaded to that date table.
Typically, you're going
to load the date table
initially with enough
values to cover the data,
the tornadoes data
in this case that we've got,
and then we might add
more dates to the table
as we go forward in the future.
The states table is very simple.
It's just the state code
and the state name.
Unless something
fairly dramatic happens,
I don't think we'll be adding
to this; this is all
of the states in the US,
but I guess what might happen is
we might extend
our analysis to other parts
of the world and upload
some additional states
into there from other countries.
And then the tornadoes
data itself,
we've got a whole bunch
of information in there.
You can see it's
a tab-delimited text file
that has the date code,
then the state code,
and then the time
that the event started,
and a whole bunch
of other information
about the tornadoes that have
occurred over that period.
So I'm going to have
to upload this data
to the location
where the tables are created.
And to do that, what I want
to do is stage the data first.
I'll create a staging table
that I can upload
the data into initially,
and then I can read it
from that staging table
and do any type conversion
or any additional functions
that I need to do
on the data before I load it
into my production tables.
So to do that,
I've got some staging scripts
and the first scripts
are the create table scripts.
So for example,
there's one here to create
a staging table for my dates.
And it just follows
the format of that file
that I've got, that source file.
It says that it's a tab-delimited
text file that I'm loading,
and it tells it this is where
I'm expecting the data files to be.
So there's a staging folder
that's going to have a dates folder
and that's where
I'll upload this file to.
And the same is true
of the tornadoes, for example.
So we'll have a look
at that; again, it just
follows the structure
of the file.
It's a tab-delimited
text file,
and it's located
in this location.
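A sketch of what such a staging table definition might look like (illustrative names and simplified columns, not the demo's actual script):

```sql
-- Illustrative staging table: managed (internal), tab-delimited text, no partitioning,
-- so dropping it also clears out any previously staged files.
CREATE TABLE IF NOT EXISTS tornadodw.tornadoes_staging
  (datecode      STRING,
   statecode     STRING,
   eventtime     STRING,
   category      INT,
   propertyloss  DOUBLE,
   croploss      DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/dw/staging/tornadoes';   -- the folder the raw .txt file is uploaded into
```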
Now a couple of things
to notice: one is this is
obviously a text file,
whereas our production tables
use the sequence file format,
so there's a difference
in the format.
And also this tornadoes table,
the staging table,
doesn't have any partitioning,
it's just a flat table
that contains the tornadoes data.
And obviously we're going
to have to generate
the partitioning information
when we move it
from staging
into the production table.
So to do that,
we have scripts
to do the loading
of the staged data
into the production tables,
so here's
the one for the dates,
and it's a straightforward
statement that takes data
from the staging table
and inserts it into the dates table,
and it just selects
the data to insert.
There's a similar one
for states.
For tornadoes,
it's slightly different.
We have to set
this Hive option here.
We have to set the dynamic partition
mode to nonstrict
because we're going to dynamically
generate the partition value
and that requires that setting.
So once we've set that on,
we can go ahead and insert
into the tornadoes table,
partitioned by year.
And again, we're taking data
from the staging table,
but for that last column
that we're getting,
we're actually
generating the year
by taking the first
4 characters of the date code,
which is the year.
So we're loading from a flat
table into a partitioned table
and we're using this function
to generate
the partitioning value.
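A sketch of that load statement, using the same illustrative table and column names as the earlier sketches:

```sql
-- Dynamic partition values require nonstrict mode, since no static partition is given.
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE tornadodw.tornadoes PARTITION (year)
SELECT datecode,
       statecode,
       eventtime,
       category,
       propertyloss,
       croploss,
       CAST(SUBSTR(datecode, 1, 4) AS INT) AS year   -- derive the partition from the date code
FROM tornadodw.tornadoes_staging;
```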
So the first thing
we need to do
is we need to upload
those staging scripts.
We're going
to use these each time
we need to load data
into those tables.
We're going to stage it and then
load it, so it's a 2-phase
process each time
we want to upload data.
So again, I've got a
PowerShell script to do this.
We'll just clear
the pane at the bottom.
It's fairly straightforward.
Again, we're just finding
the location of this script file
and then we're uploading
all of the script—
the staging script files—
to this location in here,
DW/staging/scripts.
And it's just going to loop through
that local folder I've got here
with the script files in it
and upload them all
to the Azure store.
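The upload loop might look something like this (a sketch; the local folder and variable names are assumptions):

```powershell
# Sketch: upload every local staging .hql script to DW/staging/scripts in the container.
# The storage account, container, and local folder names are placeholders.
$storageAccount = "adatumstorage"
$container      = "tornadodata"
$context = New-AzureStorageContext -StorageAccountName $storageAccount `
    -StorageAccountKey (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary

Get-ChildItem -Path ".\stagingscripts" -Filter *.hql | ForEach-Object {
    Set-AzureStorageBlobContent -File $_.FullName -Container $container `
        -Blob ("DW/staging/scripts/" + $_.Name) -Context $context -Force
}
```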
So let's go ahead
and run that.
And it just starts looping through
those local staging scripts
and uploads each
of them to that location
in the Azure store.
When it's finished, all of those
files have been uploaded.
Okay, well we're
now ready to actually
do some loading of our data,
and you've seen
the data source files
and you've seen the staging
scripts we're going to use.
And I've got a separate
PowerShell script
for each of those types of data,
one for dates, one for states,
one for tornadoes.
We could—of course—
do it all in the one file,
but by separating them out,
it makes it easier to manage
incremental loads of tornadoes
when I don't need to add
any states or add any dates
or anything like that. I can do
them all independently.
So this script will—again—
run the existing
Hive scripts that I've just
uploaded, the staging scripts.
It runs the Hive script
to create the staging table
that I'm going to load
the data into.
Then it goes ahead
and uploads the local data file,
in this case the dates.txt file,
and it puts it into the location
where I've defined
that staging table.
So that, in effect,
populates the staging table.
And then I can run the other
Hive script that I loaded
among the staging scripts
to load
the staged data
into the production table.
That script selects
against the staging table
and loads the results
into the production table.
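As a sketch, that two-phase load for the dates data might look like this (the script names, blob paths, and timeout are placeholders, not the demo's actual files):

```powershell
# Sketch of the two-phase load for the dates data; script names, blob paths,
# and the timeout are placeholders.
$clusterName    = "adatumhdi"
$storageAccount = "adatumstorage"
$container      = "tornadodata"
$context = New-AzureStorageContext -StorageAccountName $storageAccount `
    -StorageAccountKey (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary
$scriptRoot = "wasb://$container@$storageAccount.blob.core.windows.net/DW/staging/scripts"

# 1. Recreate the staging table.
$createJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition `
    (New-AzureHDInsightHiveJobDefinition -File "$scriptRoot/CreateStageDates.hql")
Wait-AzureHDInsightJob -Job $createJob -WaitTimeoutInSeconds 3600

# 2. Upload the local data file into the staging table's folder.
Set-AzureStorageBlobContent -File ".\dates.txt" -Container $container `
    -Blob "DW/staging/dates/dates.txt" -Context $context -Force

# 3. Run the load script to insert from the staging table into the production table.
$loadJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition `
    (New-AzureHDInsightHiveJobDefinition -File "$scriptRoot/LoadDates.hql")
Wait-AzureHDInsightJob -Job $loadJob -WaitTimeoutInSeconds 3600
```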
So let's go ahead
and run this one.
And again,
it's going to kick off
an Azure HDInsight Hive job
to go ahead and create
the staging table.
When the Hive job
to create the staging table
is complete,
we're going to upload
the data file,
and then we're going
to run another Hive job
this time to load
the stage data
into the production table.
And when the job is finished,
again, I get the output
to tell me how long
each of the different
parts of the job took.
And the data has
now been uploaded
to the staging table and then
loaded into the production table.
So let's go ahead
and do exactly the same
with the states data, again,
just running
the appropriate scripts
to create the staging table
and then to load
the staged data
into the production table.
And again, that job
completes and loads
the staged data
into the production table.
And finally we'll do the same
with our tornadoes data.
So when the job
to load the tornadoes data
is complete, we can
actually see the results
of the different partitions
that were generated
from the data, so we've got
a partition for each year
that's been loaded.
And what we've now got
is a repeatable process
using these scripts
where we can take new data,
we can load
that to the staging table.
The staging tables
are not defined as external.
They're internal tables,
so when we drop them,
we delete
any previously staged data,
upload the new data,
and then we insert
that into the production table
from the staging table.
And the production table
is defined as an external table.
So when it's dropped
or the cluster is deleted,
the data that's already there
stays where it is.
So we're able to continually
add to that data
and just bring up the cluster
as and when we need it.
So we've got
a repeatable process
for creating
a data warehouse on demand,
if you like, and incrementally
adding data over time.
And just to see
how we might
use that data warehouse,
well one of the things obviously
is that we might want to just
run queries against the tables
and get some information back.
So I've got here
a PowerShell script
that's just going to go
and select the states,
the month number of the year,
and the total property loss
and crop loss
costs from those tables
and group it by year.
This data is only
for the year 2010, though,
so it's actually going
to group it by month
and display for each state
and each month
what the total property cost
and crop cost was.
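The HiveQL behind that query would be something along these lines (a sketch using the illustrative table and column names from the earlier sketches, not the demo's actual script):

```sql
-- Illustrative version of that query; table and column names are assumptions.
SELECT s.statename,
       d.monthnumber,
       SUM(t.propertyloss) AS totalpropertyloss,
       SUM(t.croploss)     AS totalcroploss
FROM tornadodw.tornadoes t
JOIN tornadodw.dates  d ON t.datecode  = d.datecode
JOIN tornadodw.states s ON t.statecode = s.statecode
WHERE t.year = 2010
GROUP BY s.statename, d.monthnumber
ORDER BY s.statename, d.monthnumber;
```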
So let's go ahead
and run that script.
And when
the results are returned
I can see,
for each different state
and each month in 2010,
the total property cost
and crop cost associated with that.
So it's working.
I've got a data warehouse
that has tables, and those tables
have data, and I can query them.
How I would actually use
that in a production environment
is I'd probably have some sort
of modeling and client visualization tool.
So we're gonna use Excel
to see how we might use this.
We'll just start up Excel
and create a new workbook.
And one of the things
I can do with Excel
is I can use Power Pivot
to create a data model
based on a data source
such as a data warehouse.
Now in this case I'm gonna connect
to our Hive data warehouse.
And to use that
I'm gonna use ODBC.
So actually in my Control Panel,
if I have a look
at my Administrative Tools,
I have an ODBC data source name
because I've installed
the Microsoft Hive driver for ODBC.
So if we actually have a look
at this DSN,
you can see that it's connecting
to my HDInsight cluster.
And I just need to supply it
with the username and password,
and off it will go
and make a connection
to the Hive tables there.
So this ODBC driver has been installed.
I've got both the 64-
and the 32-bit versions installed in here.
And I can use these
from any ODBC-compliant client
such as Excel.
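As an optional check that isn't part of the demo, you could verify the DSN from PowerShell before pointing Excel at it; this is only a sketch, and the DSN name and credentials are placeholders:

```powershell
# Optional sanity check, not part of the demo: query the Hive ODBC DSN from PowerShell.
# The DSN name and credentials are placeholders.
$conn = New-Object System.Data.Odbc.OdbcConnection
$conn.ConnectionString = "DSN=HiveOnHDInsight;uid=admin;pwd=YourPasswordHere"
$conn.Open()

$cmd = $conn.CreateCommand()
$cmd.CommandText = "SHOW TABLES"
$reader = $cmd.ExecuteReader()
while ($reader.Read()) { $reader.GetString(0) }   # emit each Hive table name
$conn.Close()
```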
So I'm gonna use
that data source name
in my Power Pivot data model.
So we'll go ahead and create
a new Power Pivot data model.
And we're gonna grab some data
from this data source.
It's an OLE DB ODBC data source.
And we'll just call this
HiveOnHDInsight.
And I'm gonna go and build
a connection string.
The provider I want to use
is that OLE DB provider for ODBC.
And the data source name
is the one we've just looked at,
that Hive ODBC data source name.
I need to specify the credentials
for my HDInsight cluster.
So I'll just put those in there.
And we'll save those
in the workbook.
I don't need to specify
the catalog to connect to.
It's just gonna connect
to the default Hive database catalog.
So we'll click OK there,
and then we're ready
to carry on with the wizard.
So I get the choice
to either select individual tables
or I could write a HiveQL query
that brings the data that I want.
In this case I'm just gonna import
the 3 tables from the data warehouse.
So we'll go and select those.
So it lists the tables
that are available,
and we're gonna have our dates,
states, and tornadoes tables.
And I could do
further filtering in here.
I could filter out
rows I'm not interested in,
and I could give them
a slightly more friendly name.
So I'll capitalize these names
just to make them
a bit nicer to read.
And then I can go ahead
and click finish at this point.
And Power Pivot is gonna go off
and connect
to my HDInsight server
and start importing
the data from those tables.
So we'll just wait
for the data to be imported.
And when the data
has been imported,
I can see the different tables
that have now been added to my data model.
And I might want to make a few changes
to the structure of the data models.
So some simple things, like I might want
to just sort the data in the tables.
I might want to do things
like change the data type,
so change this to a date
and change the format
so that it's just literally the date
that I care about.
And I can do things
like say the weekday,
if I'm gonna sort data
by weekday,
then actually what that means is
don't sort it alphabetically
by the name
of the day of the week.
Sort it by the day number of the week,
which is this column here.
And similarly with the month,
don't sort it alphabetically by month
if I choose to sort it that way.
Sort it by the month number of the year.
So little bits of detail,
changes that I might want
to make to the data model.
Nothing really I want to change
with the states.
With the tornadoes, again I might just want
to sort that into the right order,
just to try and make sure
the performance is okay.
And I might want to change
the event time again to a date.
But this time I'll leave the time in there,
because that's something I care about.
And I might want to take
my property loss and crop loss columns
and format those as currency
so that we're actually getting them
as values.
So I've made some changes
to the data model
based on the data
that I've brought in.
I might also have a look
at the diagram view of this data model
and see these 3 tables that I've got.
And I might just want to create
some relationships between them.
So for example
I've got my date code
that I can join
from tornadoes to date with.
And the same with the state code
to the states table.
So I can create those relationships
and define the structure of my data model.
And I might want
to go into my dates table here,
just take a closer look at that.
And I might want to define
some sort of hierarchy here.
So we'll define
a hierarchy called Date.
And we'll add the year
and then the month
and then the day of the month
to that hierarchy.
And that lets me drill up and down
when I'm analyzing the data
and aggregate it at different levels.
So I'm able to use Power Pivot
to refine the data model
that I've got from my Hive tables
in my Hive data warehouse.
And what I can then do with that,
if I want to analyze the data here,
I can go ahead and create
a Pivot table for example.
So we'll create a Pivot table.
We'll just put it in the existing worksheet.
And here's the fields from those tables
that I've got in my data model.
I might want to show
my date hierarchy here.
And I might want to show
the state name across on the columns
like so.
And then from the tornadoes table
I might be interested
in the cost of the property loss.
So I might bring that in like so.
And in that comes, and I can see
the different costs associated with property
across the states
and down by the years.
And if I expand one of the years,
because I defined that hierarchy
I can actually then drill into the month
and then into
the individual day of the month
and look at the costs that were incurred
based on tornadoes that occurred then.
So I'm able to do
some fairly detailed analysis of my data,
having got it from my Hive tables
and refined it
in a data model in Power Pivot.
One of the other things
I might want to do is visualize this data.
And because
it's largely geographical—
it's based on events that have occurred
in different states in the US
across different time periods—
one of the tools
I might want to use to analyze this
is Power Map.
So if I have a look at Power Map,
this is a component
that comes with the Power BI service
for Office 365.
So I can create what's called a tour
using Power Map.
We'll give the tour a name.
We'll just call it...
Tornadoes over Time.
And I can first of all specify
a geographical field that I want to use
to overlay data
onto this globe here.
I've got a globe of the world here,
and I want to overlay
some data onto there.
So let's choose
the state name initially
as our geographical field.
And it recognizes the state name.
It's mapped that
to the state/province level.
There's a bunch of different levels
I can map data to,
but in this case it's the state.
It's identified
the different states in the US.
So we've got a geographical field.
And on the next page I can choose
the date I'm actually interested
in analyzing here,
the analytical data
that I want to display.
And for that I want
to have a look at the tornadoes.
I want to look at that property cost
and crop cost.
So there's the property loss,
and I can add that.
You can see that it's added
if I drill in here.
You can see it's added
a column for each of those.
And I might also want to add
the crop loss in there as well.
And again it's added that.
In cases where there is crop loss,
if we drill in a little bit,
it's actually stacked in the column.
So there's comparatively little crop loss
compared to the property loss there.
So we're not really seeing that very much.
But I might,
instead of displaying it as a single column,
create a clustered column.
So we're seeing a separate column
for each of the different types of loss,
and I can now see
the orange crop loss
compared to the blue property loss.
Now it's created a layer called Layer 1.
This is not
a particularly useful name for me.
So I'm gonna change that to Costs.
And I can resize this legend here
that's gonna be used.
And I can go ahead
and I can change those colors as well.
So I've got blue and orange here.
I might want to just have a look
at the settings for there
and say for property loss
we'll leave that as blue,
and for crop loss
let's change that to purple
just so we've got
2 different colors there.
And it updates that immediately in here.
Okay, so far so good.
I might also want to see
how this has changed over time.
So I'm gonna come down
to the time box here,
and I'm gonna,
in my tornadoes table,
select the event time.
And that then fills in that time box.
And so this player has appeared here.
So I could play the events over time
and see how the costs
have grown over time.
And I can choose to increment
in days, months, quarters, years.
I'm gonna increment by month
so that each month we see the next value.
And I'm gonna configure it
so that the data accumulates over time.
So we'll see the cumulative costs
as time has elapsed
in these different areas.
And again, if I just zoom out
we can see that there.
And I can actually drag the slider,
and you can see the values change
as we move over time.
You can see that it's added
this little legend here
to tell me the current time
we're looking at.
I'm gonna edit that
and change the format a little bit
so that instead of showing
the date and time,
I'm only really interested
in the month and year.
So we'll go ahead and change that.
And we'll put that out of the way
down at the bottom here,
and move this legend
up at the top.
So we've now got
the property loss and the crop loss
shown as a clustered column chart
in each state,
and we're able to view that over time,
which is pretty good.
But I want to overlay
on top of that
the severity, the categories
of the different tornadoes we've seen.
So I'm gonna add a layer.
And again the geographical field
that I want to display needs to be chosen.
So for the tornadoes,
I'm gonna choose
the start latitude and longitude.
And it automatically mapped those
to latitude and longitude.
So it's identified
where those actually happened.
And then for each of those
I'm gonna display
the severity, the category
of the tornado up here.
So I'll add that category in there.
It's again gone for this idea
of displaying the sum as a column chart.
I don't really want it as a column chart.
I want it actually as a heat map.
So I can see
the different severities of the tornadoes
based on a gradient scale
from cold to hot,
like a heat map.
And again
I want to see that over time.
So I'm gonna choose
the event time once again
as my time field.
Again I'm gonna increment by month.
So we'll just modify that
to increment by month.
And this time instead of the data
accumulating over time,
I actually want it to stay in place
until it's replaced,
so it shows me
the most recent category that's there
as time elapses.
And I can see
how that affects the US
as we view the data that's displayed.
So I've now got 2 layers on my map.
This Layer 2
I'm just gonna rename
to Category.
And I can see the tornado category
displayed there.
Both of the layers
are gonna be incrementing over time.
I'm just gonna change a few settings
to my overall scene here.
I'm gonna change the duration
so that the whole scene lasts 20 seconds.
And I am gonna apply an effect.
So we're gonna use the Push In effect
so that we animate the globe
as we view the data
and push in and see more detail.
And I could change the effect speed
so that we push in fairly quickly.
And we perhaps increment rather slowly.
And I'm now ready to view my tour
to see this data.
So let's go ahead and play the tour.
And Power Map
animates the data over time.
So I can see the most recent severity
of tornado that we've had
in the different areas,
the longitude and latitude locations.
And I can see
the crop and property loss costs
in each of the states.
And we zoom in there
as time goes by
and we work our way through
all the different data that we've got.
And eventually
it will bring me up to date.
So we get closer and closer.
And when we finish zooming in,
I can scroll around the map.
I can use the mouse to move around,
or if this was a touchscreen
I could just drag with my finger.
And I can hover
over the different values there,
and I can actually see
the property loss,
and the crop loss
if I hover over that as well,
for each of the different states.
And I can use,
again, my mouse or my finger
to scroll around and see
all the different values that are in there.
So what you've seen
in this demonstration
is the ability to use HDInsight
to create a data warehouse
on top of data that I'm storing
in Azure Storage as a commodity store.
I'm using a staging technique
so that I can incrementally add
more data to that storage,
but I only have to create
the data warehouse when I need it.
I can tear it down when I don't
without deleting the data.
So the data stays where it is.
I then created
the data model in Excel,
and of course I can refresh
the tables in here.
If I have a look at my tables
in Power Pivot,
I can refresh these tables.
So 3 months from now,
if I've added more data to the data store
and I've recreated the HDInsight cluster,
I could refresh the data.
It will update the data model
and update all the analysis I've been doing.
So that's how we use Azure storage
as a commodity data store
and HDInsight as a data warehouse on demand
in a big data scenario.
Hope you found that useful.
[Microsoft]