Big Data Scenario 4: BI Integration
[Big Data Scenario]
[BI Integration]
[Graham Malcolm] Hi, my name's Graham Malcolm.
Welcome to this demonstration
of a big data scenario for BI integration.
What we're gonna look at in this scenario
is a hypothetical company called Adventure Works Cycles,
an online retailer that sells bikes and cycling accessories
and has an existing enterprise BI solution.
And that BI solution consists of a data warehouse,
we have some analysis services data models
built on top of that data warehouse,
and we're using those to power some reports and dashboards
in SQL Server Reporting Services.
So there's a fairly standard existing enterprise BI solution
that's managed by the IT department.
And what we want to do is we want to take advantage
of our ability to process big data using HDInsight
to see if we can find some potentially useful data
in log files from the web servers
that are used to host the online retail solution.
So let's take a look and see how we can incorporate
that big data source from those log files
into our existing BI solution.
Before we start integrating big data into an enterprise BI solution,
it's probably a good idea to take a look at the existing BI infrastructure
and understand what's going on there.
So here we are at Adventure Works.
And the BI structure for the enterprise is based on a data warehouse
which is implemented in SQL Server.
The data warehouse has a number of tables
and each of the tables represents either a dimension of the business
by which we want to aggregate data
or the facts that we actually want to aggregate.
And there are dimension and fact tables in here.
So for example there's a dimension table
that lets me slice and dice my data by product.
And if I take a look at that product data,
we can see that there's a record in here for each product,
and also a record for facts that I want to measure
where no product is involved at all.
So every fact can be related either to an explicit "no product" member,
or to the actual product that we should use to slice and dice the data.
The products are identified using a product key
which is unique in this table,
and an alternate key
which is probably the key that was used in the original system.
And the reason that we do that
is the product key is actually what we refer to as a surrogate key.
It allows us to have slowly changing dimensions
where a product may change over time, as its price might change.
So I could have 2 entries for the same product
representing it at different points in time.
The facts in the data warehouse
are the measures, usually numeric values, that I want to aggregate.
So for example, I have facts about internet sales.
And if I look at that, I can see that each internet sale
is related to a product and an order date
and a customer, and a whole bunch of other different things
that I might want to slice and dice by.
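To make the surrogate key and fact table ideas concrete, here's a rough sketch of how such tables might be declared. The column names are illustrative assumptions based on the description above, not the exact Adventure Works schema.

```sql
-- Sketch only: a product dimension with a surrogate key and the original business key,
-- plus a fact table that references dimensions by their surrogate keys
CREATE TABLE dbo.DimProduct
(
    ProductKey          INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key
    ProductAlternateKey NVARCHAR(25),                  -- key from the source system
    EnglishProductName  NVARCHAR(50),
    ListPrice           MONEY,
    StartDate           DATETIME,                      -- versioning columns that support
    EndDate             DATETIME                       -- slowly changing dimensions
);

CREATE TABLE dbo.FactInternetSales
(
    ProductKey    INT REFERENCES dbo.DimProduct (ProductKey),
    OrderDateKey  INT,
    CustomerKey   INT,
    OrderQuantity SMALLINT,
    SalesAmount   MONEY
);
```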
So we refer to this kind of structure as a dimensional model.
We've got the dimensional model for the data warehouse,
and on top of that we have built an Analysis Services model.
So we have an analysis services tabular data model here
which contains a number of tables,
each of which is based on the underlying tables in the data warehouse.
And this model pre-aggregates the data, structures it,
adds things like key performance indicators,
and makes it easier for my analytical tools and reports
to process that data and display it properly.
So that's the underlying infrastructure.
One of the things that we might want to do with that
is we might want to use the data model as a basis for a dashboard.
So here's a very simple dashboard that an executive might come to
and they can see the overall revenue performance,
and they can take a look and break down by each quarter of this year,
they can take a look and see how close we were to our growth target.
So you can see that we were slightly behind for the 1st quarter,
but we've made up some ground in the 2nd.
As well as things like dashboards and reports,
I might want to empower some of my business analysts
to do some of their own analysis.
So they can do so using Excel.
And I can go ahead and I can connect
to our analysis services model.
Just go ahead and log in using our Windows credentials.
We can choose which of the perspectives or cubes
we want to view in there.
I'm just gonna go with the main model cube.
And I'll go ahead and we will add this
as a pivot table report.
And what I now get is a nice pivot table view
of those different tables that I've got in my tabular model,
my analysis services model.
So for example I could take a look
at the sales revenue,
internet total sales,
and I can see that figure appears in my Excel workbook.
And I can then slice and dice that by various different dimensions.
So I could find our date dimension here,
and we could go ahead and bring that in, viewing by calendar year.
And I could also go and look at my products that I've got in there,
bring that in by category.
And I can then start drilling up and down into the data
to do some analysis.
So I've got a fairly comprehensive enterprise BI solution already built
based on a SQL Server data warehouse
with a SQL Server Analysis Services data model on top
and then using reporting services to provide dashboards for executives
and I'm empowering my analysts to use Excel
to create their own analysis
against the data model in Analysis Services.
So how could we extend this to start bringing big data into here?
In this case, the specific big data
that the analysts at Adventure Works are interested in
is the log data from the web servers
that service the company's ecommerce site.
So we've been gathering these.
You can see I've got daily log files
for 6 months of data here that we've collected from our web servers.
And if we have a look into each of these log files,
you can see that there are some header rows,
and then we've got the actual data
which is the date and the time,
the IP address of the client system,
the username which in most cases is anonymous,
and the server IP address and port, and so on.
There's a whole bunch of fields in here.
There's a couple of interesting ones here.
There's the stem, which is really just the path within the website
where we've gone to find the page,
and then there's the query string.
And in some cases on the default page there's none,
but when we've gone to view a product,
the query string includes some additional parameters
that were passed to view the web page and see the actual product.
So there's a whole bunch of data in these log files,
and you can see that they're fairly verbose.
There's a lot of data in each of the files.
So we've got a huge quantity of log data
that we want to somehow process and incorporate into our analysis
and see if there's any relationship between activity on the website
and the sales of the products.
So to do that, we're gonna use
some processing in HDInsight to process the data,
and we're gonna use Hive primarily to do this.
So one of the things we're gonna do
is we're gonna create a staging table
that represents the schema of the log,
and we'll upload that log file to there.
The space delimiter here is represented by 32,
which is the ASCII code for the space character
that's used as the delimiter in the files.
And we're gonna store that data in a folder called "data"
in Azure Storage.
And then we're gonna create another log table here
where we're partitioning by year,
because as we gather the data over time
we're gonna have a huge volume of data,
and we might want to have it partitioned out by year.
And we're gonna store that in sequence file format.
So we've got a staging table to which we'll upload the log,
and then we've got the actual IIS log table
which is where we're gonna put the data in order to query it
and do some processing of that using HDInsight.
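The two table definitions aren't shown in full on screen, but HiveQL along these lines would do what's being described. The column list is abbreviated and the names and paths are assumptions.

```sql
-- Staging table over the raw, space-delimited log files
-- ('32' is the ASCII code for the space delimiter)
CREATE TABLE log_staging
  (log_date STRING, log_time STRING, c_ip STRING, cs_username STRING,
   s_ip STRING, s_port STRING, cs_uri_stem STRING, cs_uri_query STRING,
   sc_status STRING, sc_bytes BIGINT, cs_bytes BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
STORED AS TEXTFILE LOCATION '/data';

-- Cleaned table, partitioned by year and month and stored as sequence files
CREATE TABLE iis_log
  (log_date STRING, log_time STRING, c_ip STRING, cs_username STRING,
   s_ip STRING, s_port STRING, cs_uri_stem STRING, cs_uri_query STRING,
   sc_status STRING, sc_bytes BIGINT, cs_bytes BIGINT)
PARTITIONED BY (year INT, month INT)
STORED AS SEQUENCEFILE;
```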
To actually load the data from the staging table into that table, here's the statement.
We set the partition mode to nonstrict,
because we're dynamically generating the partitions.
And we load from the log staging table into the IIS log table,
partitioning by year and month.
And we're using these formulae here
to figure out the year and the month
based on the data that we've got in the table.
So there's a certain amount of Hive activity going on.
You'll see we're also filtering to take out the header rows
which start with # as well.
So we're cleaning up the data as we put it into the Hive table
ready to be processed.
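Put together, the load statement being described looks roughly like this. The exact column list and date-parsing expressions are assumptions.

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Move the staged rows into the partitioned table, deriving the partition
-- values from the log date and discarding the '#' header rows
INSERT INTO TABLE iis_log PARTITION (year, month)
SELECT log_date, log_time, c_ip, cs_username, s_ip, s_port,
       cs_uri_stem, cs_uri_query, sc_status, sc_bytes, cs_bytes,
       YEAR(log_date)  AS year,
       MONTH(log_date) AS month
FROM log_staging
WHERE SUBSTR(log_date, 1, 1) <> '#';
```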
So to actually go ahead and load that, we have a PowerShell script here.
So we'll just go ahead and open this up
in the PowerShell interactive scripting environment.
And I'm connecting to my cluster,
connecting to my storage account.
I'm finding out the folder where this file itself is stored,
this PowerShell script,
because that tells me where to go
to find the script files that I need to upload,
and that's those two HiveQL scripts we just saw.
So we go ahead and upload those to the storage account,
and we're gonna store them in a folder called "scripts."
And then we are going to go and have a look
at all of those scripts that are on there
and run them to create the Hive tables,
and then we're gonna go and have a look
at the IIS log files and upload those.
So there'll be a little bit of time taken as we upload those files.
Once they're uploaded,
we're then gonna run the 2nd of the two scripts that we uploaded
which is gonna actually load the staged data
into the partitioned table that we created.
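The script itself isn't shown line by line, but with the classic Azure PowerShell HDInsight cmdlets of the time it would look something like this. The cluster, storage account, container, and file names are assumptions.

```powershell
$clusterName    = "mycluster"           # assumed names
$storageAccount = "mystorageaccount"
$container      = "mycontainer"

$key     = (Get-AzureStorageKey $storageAccount).Primary
$context = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $key

# Upload the two HiveQL scripts to the "scripts" folder
Set-AzureStorageBlobContent -File "CreateTables.hql" -Container $container `
    -Blob "scripts/CreateTables.hql" -Context $context -Force
Set-AzureStorageBlobContent -File "LoadData.hql" -Container $container `
    -Blob "scripts/LoadData.hql" -Context $context -Force

# Create the Hive tables
Use-AzureHDInsightCluster -Name $clusterName
Invoke-Hive -File "wasb:///scripts/CreateTables.hql"

# Upload the daily IIS log files to the "data" folder
Get-ChildItem "iislogs\*.log" | ForEach-Object {
    Set-AzureStorageBlobContent -File $_.FullName -Container $container `
        -Blob "data/$($_.Name)" -Context $context -Force
}

# Load the staged data into the partitioned table
Invoke-Hive -File "wasb:///scripts/LoadData.hql"
```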
So that's our PowerShell script.
We'll just go ahead and run that.
So when the script's finished,
we've uploaded the data files to our Azure Storage,
we've uploaded the two HiveQL script files,
and we've used them to create a Hive table
for the staged files that we've uploaded,
and then another Hive table that has the cleaned-up data
that we're ready to process.
So now we're ready to go and incorporate this data
that we've uploaded to our Azure Storage
into our BI environment that we have in the enterprise.
Perhaps the easiest way
to start integrating this big data into our BI solution
is to integrate at the report level,
so effectively bring data through the data warehouse
and the data model into Excel,
and then also bring the big data directly from the Hive table into Excel
and mash it up together.
So to do that, I've got an Excel workbook here
that already has a Power Pivot data model that I've created.
So we'll just take a look at that.
And the existing data model contains some of the tables
from the underlying data warehouse.
So we've got the dates table here,
the products,
the product categories and subcategories,
and the internet sales facts that we're interested in.
And you can see that I've got a column here
that shows me the profit—
that's calculated as the sales amount minus the cost.
So I'm able to do some analysis of my internet sales.
What I want to do is bring in another table into this data model
from our Hive table.
So to do that, I'm gonna use a connection to our Hive table
using ODBC.
And ahead of time, if I have a look in Control Panel,
You can actually see that I've got
an ODBC data source name that I've created
called Hive in this case,
which connects to my HDInsight server.
So I can use this data source from my client computer
to connect to and query Hive tables.
So let's go ahead and use that in our Power Pivot model here.
And what I'm gonna do
is I'm gonna say I wanna get data from another source.
And the other source is an ODBC source.
And we'll call this "HiveTable."
For the connection string, I'm just gonna build the connection string
and simply go ahead and choose the provider to be the ODBC provider
and then select our Hive data source name
that we saw just a minute ago.
So that's nice and easy.
I'll put in the credentials
that I need to connect to the HDInsight cluster,
and I'm gonna save that password so that I can go ahead and connect.
So now that we've got that connection string,
we can go ahead and connect to the cluster
and start getting some data from there.
And what I could do is just simply select the Hive tables that we've got,
but actually what I want to do is to write a HiveQL query
and let my HDInsight cluster do all the heavy lifting
to aggregate the data just to get me what I need.
So we'll call this "Page Hits."
And we'll enter a statement there—a HiveQL statement,
I know it says SQL statement,
it's a HiveQL statement—
to bring back the log date and the number of hits
in 2008 before July,
and I wanted to group that by the log date.
So I'm really bringing back a subset of the data as pre-aggregated,
and I'm using the HDInsight cluster to do the heavy lifting
to process all of that data and bring back just the results I need.
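The query typed into the import wizard would be something along these lines; the table and column names are assumptions.

```sql
-- Let HDInsight aggregate the hits per day for the first half of 2008
SELECT log_date, COUNT(*) AS page_hits
FROM iis_log
WHERE year = 2008 AND month < 7
GROUP BY log_date;
```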
So we can go ahead and finish on there.
And Power Pivot starts to import the data,
the results from that query that I specified.
And once the data has been imported,
I can go ahead and close the wizard,
and I actually see that data added to my data model here.
So we've now got a table called "Page Hits,"
which has the total number of hits
for each day within that 6-month period.
So I can incorporate this into my data model a little bit more.
If I have a look at the diagram view of the data model,
we can start to integrate that table with the other tables.
So I've put my Page Hits here, which I'll just bring across.
And I've got my Dates table here.
And I know that my log date here is actually a date,
so I'll go ahead and change that data type.
And I'm gonna create a relationship between the log date and the date.
So now I've integrated the table of data
that I've retrieved from my Hive table
into my data model.
And that's me ready now to do some analysis with that.
So we can go ahead and save our data model.
And we're back in Excel.
It's actually left the Power Pivot window open behind.
So I'll just close that down.
I don't need that window anymore.
And what I might do is actually take advantage
of the fact that I've got that data in my data model
and insert a Power View report.
So let's go ahead and create some Power View.
And so I've got my report here.
I can go ahead and add a title into here,
so "Web Sales" or something like that.
I don't actually need to apply any filters,
so we can go ahead and close that filter.
And what I might do is I might want to bring across
the dates that we're looking at.
In this case we're looking in 2008,
because we filtered the data that way.
So I'm gonna bring across the calendar month, if we can find it.
Month. Here.
And I might be interested in the internet sales
that I've got in that month.
And I might really just be interested
in the number of orders, the order quantity
that comes across there.
And I might also be interested
in the page hits that I've got for each month.
And because I've got that relationship between the tables—
the page hits and the date table—
it's able to aggregate those page hits by month across there,
so I can see how that works.
And what I might want to do is I might want to change this
to look a little bit different,
perhaps just make it a line chart.
And I'll just resize that slightly
so that we can have a slightly clearer view.
And you can see that obviously the hits are increasing over time.
The order quantity seems to be relatively static.
And I'm just gonna go and make a couple of changes to this
so that we get the data labels
so we're a little bit clearer about what those values are.
So there's a slight increase in the order quantity as we go forward,
but there's a marked increase in the number of page hits
that we're needing to get those orders.
So there's some useful information there.
I might want to add something else to this report.
I might be interested in, for example, the day of the week.
So the day, and also the page hits that I'm getting on each day.
So I could go ahead and look at that.
And I might want to just average those page hits out.
So I'll have a look at the average number of hits per day.
And we could go ahead and make that into a bar chart.
And just resize that
so that we can see what that kind of profile looks like.
Over an average week, what's the number of hits we get?
And obviously we're more busy on Saturdays and Sundays
than we are elsewhere.
And equally I could come in here and I could go
and have a look at perhaps the quarter.
Yeah, let's look at the calendar quarter.
And that's actually been aggregated into the existing chart.
So we don't want it to do that.
We'll just go and add some additional data into there.
So we'll have the internet sales again.
We'll take the order quantity and the page hits.
And what we'll do is we'll display that as a column chart,
with the calendar quarter being the axis.
And that makes it a little bit clearer for us to see what's going on.
So let's just resize that.
So I can see in quarter 1 and quarter 2
the comparative order quantity vs the page hits.
And I can interact with this.
So I could click "Sunday" and it filters all of the data
to show me what happened on Sundays across those periods.
So by integrating the results from my big data processing in HDInsight
into my data model in Excel,
I'm able to mash up data from my enterprise corporate BI world
with this external big data that I've processed in HDInsight.
And that's a pretty useful way for me to do my analysis.
Now, integrating our big data in an Excel workbook
is great for individual users or for small groups of analysts
who want to look at that data alongside data
from the corporate data model.
But if I want to make this big data processing results
part of our BI ecosystem
and include it in dashboards and reports,
it might make sense to look further up the chain
and incorporate that data into our analytical data models
in SQL Server Analysis Services.
So to do that, here's the project for our analysis services data model.
And I'm gonna go ahead and import some data from a data source.
And again we're gonna use our ODBC driver to get this.
And we'll build a connection.
This time though I'm not gonna use the data source name.
I'm gonna use the ODBC driver,
but I'm actually gonna enter a connection string.
And that connection string—just paste that in there.
The reason I'm doing this—
this is the same connection string as is used by the DSN.
But the reason I want to explicitly put the connection string
is because the DSN might not exist on the analysis server
where I'm gonna deploy the data model.
So I will explicitly put the connection string in there,
we'll add the credentials that we need to connect to the Hive table,
save that password in there,
and then we're ready to go.
So when we're ready to go ahead and connect to there,
I need to specify the credentials
that analysis services should use to do this.
I'm actually gonna just use the server's account,
because the credentials that are actually used
are the ones that I've included in the connection string.
And again, rather than select from a list of tables or views,
I'm gonna write a query to actually bring back the data I want
and make HDInsight do the heavy lifting.
So we'll name the query "Page Hits" and we'll put in our statement.
And this time I'm bringing back the date,
the sum of the bytes that were sent by the server,
the sum of the bytes that were received,
and the number of records—
in other words, the number of page hits that we got.
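Again, a sketch of the kind of HiveQL involved, with assumed names:

```sql
-- Daily totals of bytes and page hits, aggregated by the HDInsight cluster
SELECT log_date,
       SUM(sc_bytes) AS sc_bytes,
       SUM(cs_bytes) AS cs_bytes,
       COUNT(*)      AS page_hits
FROM iis_log
GROUP BY log_date;
```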
So let's go ahead and finish that wizard.
And the data is imported into a new table in the data model.
And when the data has been transferred, I can close the wizard
and have a look at the table that's been imported.
Now, unlike with Power Pivot, there are a few additional things
that I need to do with my data here.
So one of the things that Power Pivot does for me
that I need to do for myself when working with analysis services
is I need to create my own measures to actually aggregate the data.
So what I can do is I can go ahead and I can create a new measure here
that is just gonna be the sum of the sc_bytes.
And it creates a formula for me there.
I'm just gonna change the name that it's allocated there
to something more useful.
So we'll call that "Bytes Sent."
And if I widen that column,
we can actually see that it's calculated the bytes sent.
And I'll just do the same
for the bytes that are received.
Again, give that a sensible name.
And finally I'll do the same for the page hits as well.
I'll just call that "Hits."
So it's gone ahead and calculated out those measures for me,
those aggregated measures.
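In the measure grid, those three measures would be simple sums along these lines (table and column names are assumptions):

```dax
Bytes Sent := SUM('Page Hits'[sc_bytes])
Bytes Received := SUM('Page Hits'[cs_bytes])
Hits := SUM('Page Hits'[page_hits])
```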
And what I'm gonna do is I'm gonna go and I'm gonna change
the format of these bits of data.
You can see they're just plain numbers at the moment.
So we'll just select all 3 of those,
and we'll change the format
so that it's a whole number for each of these.
And we'll show the thousands separator as well.
And now when I look at the format of that data,
if I just widen this column,
you can see that it's slightly easier to read and to handle there.
So we did a little bit of working with our model to format the data.
I'm also gonna go and have a look at the diagram view of this model.
And we'll just see if we can find our table that we've just added.
Over here to the right.
I'm just gonna drag this over so that it's next to my dates table.
I can then start to do some formatting of my model.
So we'll just increase this font slightly so we can see what we're doing.
So here's my page hits table.
I've got my log date column here.
And again I can just have a look at the format of that
and instead of being a text, I'll choose this to be a date.
And I can go ahead
and I can, once it's updated that, create a relationship
between this table and the date table.
So that will let me aggregate the page hits by the date.
And now that I've done that,
we'll just go and have another closer look at this table.
We'll just maximize this table up so we can see it in detail.
And obviously there are the 3 calculated columns that I created
which I'm interested in.
I don't necessarily need to see any of these other fields.
So I'm just gonna hide the detail columns there,
just to avoid confusing me by having those in the user interface
when I start to work with the data.
I'm only really interested in the aggregated totals.
So we'll just hide those from the client tools.
So now that we've got the page hits table
incorporated into our data model,
we can make some use of it.
For example, here's the internet sales table.
And you can see that there are a number
of calculated columns that I've created, or calculated measures.
And I've added a few at the end here.
I've got the actual units, which is the sum of the order quantity.
So that's the number of units that have actually been sold.
And I've got target units,
which is the hits value that I defined in my page hits table,
multiplied by 0.065.
So I'm creating a relationship
between the number of page hits that I get
and the number of units that I expect to sell
as a conversion rate that I'm trying to achieve.
And what I've done is I've created a KPI here
that measures the actual units against the target units
and will give me an indication as to whether I'm on target or not.
So if I have achieved 100% of the units that I'm trying to sell,
we get a green icon.
If we're between 80% and 100%, we get this middle yellow icon.
And if we're below 80%, we get a red icon.
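As a rough sketch, the two measures behind the KPI might be defined like this, with the 80% and 100% thresholds set in the KPI dialog rather than in DAX (names are assumptions):

```dax
Actual Units := SUM('Internet Sales'[OrderQuantity])
Target Units := [Hits] * 0.065
```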
So I can go ahead and deploy that data model,
and now I can start using that throughout my BI ecosystem.
In this case, we can use the big data
that we've incorporated into our analysis services model
in our dashboard report.
So you can see that I've got the report that we saw earlier on.
And I've modified it slightly.
I've added a data set to here called "Hits and Sales."
And if we take a look at that data set,
you can see that it's connecting to the sales cube
that we've got in our analysis services model,
and we've got this rather verbose MDX query here.
If I have a look at the query designer,
you can see a bit more clearly what that's doing.
We filtered for the first semester of the latest year of data that we have,
and we're bringing back the month, the actual units goal—
in other words, that KPI target value—
the actual value that we achieved,
and then the status indicator,
which comes back as a number between -1 and 1.
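The full MDX is verbose, but a much-simplified sketch of what it does might look like this; the KPI, cube, hierarchy, and member names are assumptions.

```mdx
SELECT { KPIValue("Actual Units"),
         KPIGoal("Actual Units"),
         KPIStatus("Actual Units") } ON COLUMNS,
       [Date].[Month Name].MEMBERS ON ROWS
FROM [Sales]
WHERE ( [Date].[Calendar Year].&[2008] )
```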
So we've got that data set in our report now.
We've added a table over here where we're displaying the month name
and then an indicator based on that status value that we get back.
And we can expand down to see the actual units and actual target.
So I've deployed this to the server.
So if I actually go and have a look in my browser,
I can see that I've got the dashboard here.
It now includes the page hits vs units sold
that I'm getting by mashing up my big data results from HDInsight
with the sales data in my data warehouse.
And I can expand through each of these months
and see the actual units sold and the target.
And I'm pretty much on target for most of the year.
In June, we're a little bit behind.
So we've seen how to integrate our big data processing results
in the report level using Excel
and at the data model level in SQL Server Analysis Services.
What of course you might want to do
is to integrate at the data warehouse level.
You might want to bring the results of your big data processing
into your SQL Server-based data warehouse,
and then propagate that data
throughout the entire BI stack you have in the organization.
So there's a number of ways to do that.
In this scenario we've chosen to use a technique
that uses SQL Server Integration Services to copy the data.
And what we've done is we've created a new fact table
in the data warehouse
for the IIS log data that we're analyzing.
And that obviously references some of the existing dimension tables.
And of course you might remember from earlier
that the dimension tables use a surrogate key
to identify each individual row,
which doesn't necessarily relate to the original business key
that might be in the source data, such as the product key.
So we'll have to do a little bit of work to look up
the appropriate surrogate key to make that relationship work.
To do that, we've decided that we're gonna stage the data
that we get from our Hive tables.
So we've got a staging schema just to keep the staging tables separate.
And we've got an IIS log table in there
for the data that we're gonna bring across from the Hive table.
The other thing that you commonly do in a BI environment,
when you're moving data around for your ETL workflow,
is you only want to extract the data that you need.
You don't want to extract the entire Hive table each time,
in this instance.
So what we've done is we've created a high water mark table
that stores the maximum date that we've previously extracted.
So the next time we do the extraction,
next month after we load next month's log files,
we don't want to extract everything,
we just extract the data that's relevant for us.
So we've got this high water mark table here.
And because we haven't used it at all,
we'll initialize it with a value of a whole bunch of 0s.
So basically we're gonna bring out all of the data that's in there,
and then we'll update that high water value
so that next time we filter the extraction based on what we need.
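The new warehouse objects described here would look roughly like this in Transact-SQL; the names, data types, and initial value are assumptions.

```sql
-- A new fact table for the IIS log data, conformed to the existing dimensions
CREATE TABLE dbo.FactIISLog
(
    DateKey       INT,
    ProductKey    INT,
    PageHits      INT,
    BytesSent     BIGINT,
    BytesReceived BIGINT
);
GO

-- Staging objects, kept in their own schema
CREATE SCHEMA staging;
GO

CREATE TABLE staging.IISLog
(
    LogDate       NVARCHAR(50),
    ProductAltKey NVARCHAR(50),
    PageHits      INT,
    BytesSent     BIGINT,
    BytesReceived BIGINT
);

-- High water mark table, initialized so the first run extracts everything
CREATE TABLE staging.HighWaterMark (LastLogDate NVARCHAR(50));
INSERT INTO staging.HighWaterMark (LastLogDate) VALUES ('0000-00-00');
```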
So those tables have been created.
And I've created an SSIS control flow here
in a package to start the extraction.
So the first thing we do is we get that high water mark.
We find out what's the maximum date
that we've already extracted to the data warehouse.
And you can see that there's a query here.
We'll just bring that up so we can see it a bit more clearly.
So we're selecting the maximum value from the high water mark table,
so that we know what that last extracted value was.
And we're storing the results of that
in a variable called "High Water Mark."
It's a user variable within my SSIS package.
So we're bringing out the date that'll be the boundary value, if you like,
for the data that we need to extract,
and storing that in a variable.
Now if I have a look at the variables that are in this package,
there's that high water mark variable that we've just seen.
There's another variable called "Hive Query,"
which is what we're gonna use to extract the data from Hive.
And if we have a look at that, we can actually see—
it's quite difficult to see, but it's a select expression.
It's a Hive query that incorporates the high water mark variable in there.
So we've got an expression that creates a HiveQL query
that extracts all the data that we need
after the high water mark date.
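The first task runs a simple lookup, and the variable expression then splices that value into the HiveQL; both are sketched here with assumed names.

```sql
-- "Get high water mark" task
SELECT MAX(LastLogDate) AS HighWaterMark
FROM staging.HighWaterMark;

-- The HiveQuery variable expression then builds a statement along these lines:
-- SELECT ... FROM iis_log WHERE log_date > '<HighWaterMark value>'
```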
So once we've done that, we'll empty the staging table
of any data that's left over
from the last time we extracted data.
Then we're gonna go ahead and stage the log data,
and finally we'll reset the high water mark
to be whatever the value is that we've now extracted,
so that we're always keeping that high water mark up to date
and each subsequent extraction only takes the data that we need.
The "Stage IIS log data" task is a data flow.
So if we have a look at that,
we can see that there's a number of steps involved in the data flow.
And the first of these is that we extract from a Hive table.
To connect to that Hive table,
we've got a connection manager over here.
And if we have a look at the definition of that connection manager,
there's a connection string that uses the Hive ODBC driver
to connect to my HDInsight server.
And we'll put in the password that's required there.
We've then got this Hive data source,
and if we have a quick look at that
you can see that the data source uses a SQL command,
and it's that variable, it's that HiveQL query
that includes the high water mark in there as well.
So we're dynamically generating the code that we need
to extract the data from the Hive table.
When that data comes out, we then do some data conversion.
Basically all we're doing is taking the log date and product ID
which will come out as fairly long strings.
They'll be 4000 characters
because of the way I've set up the connection.
And what we'll do is we'll just shorten them.
We're just taking them down to 50 so that they fit
into the table that we're gonna load them into.
So there's a little bit of data type changing going on there.
We could of course have a whole bunch
of different transformations happening to the data as it's extracted.
And then finally it's loaded into the staging table in the data warehouse.
And if we have a look at the mappings there
you can see that we're loading the short values into those strings
and then the various other values that would be extracted.
So all that should be ready to go.
Let's go ahead and run our package.
So if we just resize, we can see that each of the stages is running.
The data flow is now running,
so we're extracting data from the Hive table.
And as the data's extracted and pushed through the data flow,
the data conversion occurs
and eventually it's loaded into the staging table
and we update the high water mark
to reflect the data that we've just extracted.
So this package has gone and extracted the data
and loaded it into the staging table.
We then have another package which is a very simple 1-step process
where we are going to insert the stage data
into the appropriate table in the data warehouse.
And to do that we're gonna use a join
to look up the appropriate surrogate key
based on the business keys that we've got in the data.
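That insert-with-lookup is the kind of join shown below; the table and column names, and the key of the "no product" member, are assumptions.

```sql
-- Load the staged rows, translating business keys into surrogate keys
INSERT INTO dbo.FactIISLog (DateKey, ProductKey, PageHits, BytesSent, BytesReceived)
SELECT d.DateKey,
       ISNULL(p.ProductKey, 0),        -- 0 stands in for the "no product" member
       s.PageHits,
       s.BytesSent,
       s.BytesReceived
FROM staging.IISLog AS s
JOIN dbo.DimDate AS d
  ON d.FullDateAlternateKey = CAST(s.LogDate AS date)
LEFT JOIN dbo.DimProduct AS p
  ON p.ProductAlternateKey = s.ProductAltKey;
```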
So let's go ahead and run that.
So now we've extracted the data from the Hive table
using a HiveQL query.
And that HiveQL query used a regular expression
to figure out the product key that we had
in those various query strings that we saw earlier on.
So if there was a product key we'll have that business key.
We then use this SSIS package here to load that staged data
and look up the corresponding surrogate key
for the business keys for the product.
So we should now have a table in our data warehouse
that's conformed to our dimensional model
and that has a record of all the page hits
including which product was referenced
if a product was referenced in the query string.
So let's now see how we can use that
in our BI environment.
So to make use of our data warehouse
which now contains some data
that we've transferred across from HDInsight,
in Reporting Services,
I've created a data source that connects to our data warehouse.
And what we're gonna do
is we're gonna create a new report using Report Builder.
So we'll open up Report Builder.
And we'll create a new table or matrix report.
So I'm gonna go ahead and create a data set.
And that data set will be based
on the AdventureWorksDW data source that we've just seen.
And I actually already have
a Transact-SQL query that I want to use for this.
So we'll just go and get that.
And it's in the Lab 4 lab files folder.
So here's the query I want to use.
What it does is select the top 10 selling products.
So I could actually preview the data by running that.
And I can see it brings back the product,
the picture of the product,
the amount of revenue, and the page views.
In other words, the big data that I've brought across
and integrated into my data warehouse
is being returned by this query as well
from that fact IIS log table that we created.
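The query is roughly of this shape; the real lab query will differ in its exact names and columns.

```sql
-- Top 10 products by revenue, with page views from the new fact table
SELECT TOP 10
       p.EnglishProductName,
       p.LargePhoto,
       s.Revenue,
       ISNULL(h.PageViews, 0) AS PageViews
FROM (SELECT ProductKey, SUM(SalesAmount) AS Revenue
      FROM dbo.FactInternetSales
      GROUP BY ProductKey) AS s
JOIN dbo.DimProduct AS p
  ON p.ProductKey = s.ProductKey
LEFT JOIN (SELECT ProductKey, SUM(PageHits) AS PageViews
           FROM dbo.FactIISLog
           GROUP BY ProductKey) AS h
  ON h.ProductKey = p.ProductKey
ORDER BY s.Revenue DESC;
```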
So if I have a look and see what we've got,
we're just gonna have all of these appear in the values section.
And we can pretty much go through the rest of the wizard and finish it.
And you can see that I get my report.
I'll just add a title to that report.
And if we take a look at what we've got,
the large photo field that we've got there,
I just want to get rid of that
and add an image into there.
So we'll just go and click Insert.
Image.
And that goes into that field.
And that is gonna be from the data source.
So we need to specify that comes from the database.
And the field we want to use is the large photo,
and we'll save it as a JPEG.
So that photo comes in.
We'll just resize the height of the row to accommodate the picture.
And we should be able to preview our report,
just to make sure that everything's working as we expect.
And yep, I can see that I've got the revenue and the page views
for each of the products that we've sold.
We'll just widen the column here
and we'll format the revenue as currency.
And we'll format the number of views.
We'll just get that to be left-aligned.
And there's still a little bit of tidying up I could do.
Again, we'll just go ahead and make this move across to the left.
So now we're starting to see some interesting information.
And I'm ready to go ahead and save that report.
So let's save the report.
And we'll put it in our Adventure Works Reports folder
up on the server.
And we'll call it "Top 10 Sellers."
So I can now close Report Builder.
And if I refresh the page, there's my Top 10 Sellers report.
And I can go and browse that
and view information about the top 10 selling products
that includes the number of page views
that we've had for each of those products.
So I've been able to integrate data
from my big data log files.
My source files have gone into the Hive tables in HDInsight.
I've then used SQL Server Integration Services to extract some data.
I got the HiveQL to do all the hard work
of processing and aggregating the data,
bring the results out,
store those results in my data warehouse,
conforming them to my dimensional data model,
and now I'm able to use them throughout my entire BI ecosystem
just like any other data that I've got in my data warehouse.
So that's how we can use HDInsight
to integrate big data into our managed corporate BI environment.
Next let's take a look and see
how we can extend our self service BI to incorporate big data
and enable users to share their insights
and make big data results discoverable across the enterprise.
Now let's suppose that in Adventure Works
we have a number of business analysts
who want to perform their own analysis.
It's great that we've got the enterprise BI model that has the data in it
and that is used to feed our dashboards and our reports
and empower our users across the enterprise.
But I might have some business analysts
who want to perform their own self service analysis.
So first of all they might want to analyze the big data that we've created
using something like Pig.
So I've got analysts who are connecting to our HDInsight server
and they're using Pig scripts like this one
to load the data from those IIS log files that we've uploaded
and then applying some Pig processing
to get whatever information they need out of those files.
So in this case we're filtering the logs to remove the header rows,
we're grouping them by the log date,
and we're counting the number of page hits,
and we're getting the sum of the bytes that were sent
and the bytes that were received,
and then flattening all of that out, ordering it by the log date,
and storing the results into this web traffic folder.
So it's a fairly simple Pig Latin script
that a business analyst might have the Pig programming knowledge
to be able to create.
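The script on screen isn't reproduced verbatim, but a Pig Latin script of the kind described would look something like this; the field list, paths, and relation names are assumptions.

```pig
-- Load the raw, space-delimited IIS logs (schema abbreviated)
logs = LOAD '/data' USING PigStorage(' ')
       AS (log_date:chararray, log_time:chararray, c_ip:chararray,
           cs_uri_stem:chararray, cs_uri_query:chararray,
           sc_bytes:long, cs_bytes:long);

-- Remove the header rows, which start with '#'
cleaned = FILTER logs BY NOT (log_date MATCHES '#.*');

-- Group by date and aggregate the hits and bytes
by_date = GROUP cleaned BY log_date;
traffic = FOREACH by_date GENERATE
            FLATTEN(group) AS log_date,
            COUNT(cleaned) AS page_hits,
            SUM(cleaned.sc_bytes) AS bytes_sent,
            SUM(cleaned.cs_bytes) AS bytes_received;

-- Order by date and store the results for later retrieval
ordered = ORDER traffic BY log_date;
STORE ordered INTO '/webtraffic';
```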
And so they've created that script file
and now they want to use that to process the big data.
So to do that, we're gonna use PowerShell to run the Pig script.
Here's our PowerShell code.
And the first thing we do is we upload
that Pig script that we've just seen
up into our Azure storage.
We're gonna put that in our scripts folder.
And then we're gonna run that Pig Latin script,
and of course it will run against the log data
that's already been uploaded to our Azure store.
And when it's finished it will produce the results
and store them in that folder
where we said they were gonna be,
and the script and the processing will be complete.
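Using the same classic Azure HDInsight cmdlets as before, the script would be along these lines; the names and paths are assumptions.

```powershell
$clusterName    = "mycluster"           # assumed names
$storageAccount = "mystorageaccount"
$container      = "mycontainer"

$key     = (Get-AzureStorageKey $storageAccount).Primary
$context = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $key

# Upload the Pig Latin script to the "scripts" folder
Set-AzureStorageBlobContent -File "SummarizeLogs.pig" -Container $container `
    -Blob "scripts/SummarizeLogs.pig" -Context $context -Force

# Define and start the Pig job, then wait for it to finish
$pigJob = New-AzureHDInsightPigJobDefinition -File "wasb:///scripts/SummarizeLogs.pig"
$job    = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $pigJob
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
```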
So let's go ahead and run that script.
And the first thing that happens is we upload the script file.
And then we kick off a job to actually do the Pig processing for us.
So we'll just wait for that job to complete.
So when the job's complete,
the Pig script has been used to process the data
and generate the results that we need.
One of the things we're gonna need
is the storage key for our storage account
in order to retrieve the results.
So we're just gonna go and display that
information into the console window here.
And the variable that we retrieved from that was
the storage account key.
So that's the key that we're gonna need.
We'll go ahead and just copy that.
You can also get this by going to the Windows Azure console
and copying it from there.
We'll take that and we can now use that in Excel
when we go to retrieve the results.
So let's have a look at Excel.
And in my blank workbook here, I'm gonna use Power Query.
And what I'm gonna do is I'm gonna connect to our data source.
So from another source,
and in this case it's from Windows Azure HDInsight.
So we'll go ahead and choose that option.
And it's asking for the name
of the Azure Storage account associated with the cluster.
So we'll just enter that
and go ahead and browse to the location.
Now we need the key that we've just copied
in order to connect to that data source.
So we'll just paste that in.
And in the Navigator pane,
you can see that it's found that storage account
and it's found the various containers that are stored within that.
So what we're gonna do is we're gonna go and edit
this query that we're generating here.
And up comes the query editor.
And I'm just gonna refresh this
so that we make sure we've got an up-to-date view
of all the blobs that were in that container.
And what I'm looking for is a folder called "Web Traffic,"
which is where we put the results of our Pig query earlier on.
So let's go and filter this
so that we are looking for something that contains
"web traffic."
And we find the paths of blobs that contain "web traffic."
And out of those two,
there's the _SUCCESS file, which is generated by the Pig job,
and part-r-00000,
which actually contains the results that were generated by the Pig job.
So that's what we want.
We'll go ahead and select that.
And all the time you can see that Power Query
is intelligently interpreting that data
and making various changes to it.
So it's imported that data as CSV.
It recognized that it was delimited text.
It's changed the type of the first column from text to date,
which makes perfect sense.
So we're just going to rename that column
so that we know what's happening with that.
And the 2nd column here is actually the number of page hits.
So we'll make that change as well.
And the 3rd column is the bytes that were sent.
And the 4th column is the bytes that were received.
So I've managed to get the results of my Pig query,
I've browsed using Power Query from Excel into the Azure Storage
to find those results, and I'm able to bring them in.
And I can continue to shape and filter this data if necessary.
But actually I've done most of the work I need to do with Pig.
So I can go ahead and I can save that down.
So if I go ahead and just apply and close that query,
you can see that I've got this query now defined in my workbook.
I didn't choose the name of it,
so it's just taken the default name which was "hdfs-container."
And it's brought down that data and added it to my worksheet,
which is great.
But what I really want to do
is to make this data discoverable for other people in the organization,
people that perhaps don't have
the necessary knowledge or skills to use Power Query
to go in and filter that data the way that I've done,
and maybe they don't know about where the data would be
or maybe they don't know about the schema.
But I want to share this data and make it discoverable for people.
So what I can do is I can go ahead and share the query.
I'm signed into my Office 365 account, so I can share that query.
And I'll give it a meaningful name this time.
We'll call it "Adventure Works Web Traffic."
And I could add a description in here.
So it's "Daily web activity,"
or something like that that's meaningful.
And I can choose who I want to share this with.
I could share it with just myself, so I can access it from any device,
or I could share it with everyone in the enterprise,
or to specific groups of people.
I'll go ahead and share it with everyone.
I could add a URL to some documentation.
Perhaps I want to write a Word document or an HTML page
that explains what this data's all about.
But I'll go ahead and I'll share that as it is just now.
So let's share a copy.
And once that's shared, I should be able to look at my shared query.
So if we actually go and have a look on the Power Query tab here
and look at Shared,
sure enough it shows up as a query that I've shared.
So here I am as another user using Excel.
And I've got my workbook open that I've been doing some analysis in.
And again I can use Power Query.
But this time instead of connecting to the HDInsight cluster
and having to know the key for the Azure Storage container and so on,
I can just simply go and do an online search
and search for "Web Hits" in this case.
And in the results,
I can see some results from various public data sources.
But right at the top of the list,
I've got results from that query that I shared previously.
So now other users in the organization can find that query
and they could go ahead and use that as the basis
of the data that they want to import into their own analysis.
So I can go ahead and I can open up that query.
And I could just simply go ahead and import the data as it is,
or I could now go and apply my own changes to this.
So for example, I could filter the date.
And I might be only interested in dates
that are between the 1st of January, 2008,
and the 30th of June, 2008.
So that's gonna filter my data down
and result in a data set that contains
just the data that I'm interested in looking at.
So I could go ahead and change the name of this query.
It doesn't change the shared query.
It's just the query that I'm creating in this workbook.
And I'll go ahead and change this to "Web Traffic."
And instead of loading it into the workbook,
I'll load it directly into the workbook data model.
So let's go ahead and do that.
So I can see that the data is being loaded.
It's being loaded up into the data model.
There's been 182 rows loaded.
And if I go now into Power Pivot
and actually take a look at my data model,
I can see that web traffic table has been added to my data model here.
And if I have a look at the diagram view,
I can go ahead and I can incorporate that data into my data model.
So I'm just simply gonna take the date
and join it to the date field in that table there.
So now I've got that data in my data model.
That's pretty useful.
Let's go ahead and close the data model window
and return to Excel.
And now I might want to actually use that.
So I could go ahead and I could insert
perhaps a Power View report here.
And again, I'm not gonna use any filters.
I'll just close this workbook queries pane.
We don't need that anymore.
And we'll add a title to our report.
And what I might be interested in doing
is taking a look at the different months that I have.
So I've got a month field here.
And take a look at the web traffic that we received for that month.
So we'll start off just by looking at the page hits.
We'll bring that in.
And if we pull that across the top half of the report,
I could go ahead and I could change that to be perhaps a line chart
showing me those page hits across that period of time.
And then what I might want to do is
underneath there, I might want to again just take a look
at some data, perhaps again by month.
And we'll take a look at the number of bytes received and bytes sent.
And I might just go ahead and create that as a clustered column chart.
So I start to get some pretty good visualizations of that web data
across a time period.
So all this is great, but what I want to do
is I want to share these insights that I've found
with some other people in the organization.
So to do that, I'm just gonna go ahead and save this workbook here.
And if I have a look at the Office 365 site,
because I've got the Power BI service installed
I've got this Power BI site within my team site.
So I'm gonna go ahead and I'm gonna add a document to this.
And I'll just upload that file.
And the file I want to upload
is in Lab 4.
There it is.
Adventure Works Analysis.
So that's the file to which I've just added that report.
So we'll go ahead and upload that,
and it'll take a few minutes just to upload to the Power BI site.
And I can see that it's gradually being uploaded here.
And I get a little preview of it appearing here.
Obviously I could upload as many of these as I need to do.
What I might want to do
is just to make it obvious for users to find this,
I might add this as a featured report.
And the featured reports appear in a band across the top of the page.
So here's my featured report.
So a user could come to this website
or they could use the Windows 8 application,
the Windows Store application on a tablet,
and they could come to this location,
and they can go ahead and view this report.
So I open up that report.
Off it goes,
and within the browser it gets rendered and visualized,
and I can see that data.
And it will work, it's interactive.
So I could go ahead and I could filter by clicking on a month.
And I can change the data as I need to do.
So pretty good stuff.
I've got this interactive report, and I can now view that in a browser.
I've shared that with other users in my Office 365 organization.
And the final thing I might want to do
is because I've got that data in my data model now—
if I just return to my Power BI site—
I've got the data in the data model,
and that data has come from my internal data warehouse
and also from the Hive tables that I queried—
well, actually in this case from a Pig query—
from the results from Pig that I've brought in from my Azure Storage.
So that data model has a whole bunch of information in it
that I might want to be able to query.
So what I'm gonna do is I'm gonna go ahead and add this report
to the Q&A feature that I have in Power BI.
And what that will let me do is go to the Power BI Q&A interface
and actually ask some questions in just natural language
to go and find some information from this workbook.
So for example, I could ask a question like
"Show average
"page hits
"and average sales amount
"by month
as bar."
And you can see that it's automatically updating that data as we go.
I'm just gonna clear away
some of this additional information that's here.
But you can see that it's automatically dynamically generating
visualizations of the data based on the questions I ask.
So I've been able to take big data from log files,
load them into my HDInsight cluster,
I've incorporated that into my enterprise BI model,
I've managed that,
but I've also empowered my business analysts
to go ahead and do self-service analysis.
So they were able to run a Pig query,
generate the data that they need from that source data,
incorporate that into an Excel workbook using Power Query,
add it to the data model
and finesse the data model using Power Pivot,
and then they were able to create a report using Power View,
upload that workbook as a report into Power BI,
and then within Power BI I'm able to view the report
and use natural language to query it.
So a very powerful way to take big data,
incorporate it into your enterprise BI environment,
and empower your users to find and share insights.
[Microsoft]