
Big Data Scenario 4: BI Integration

[Big Data Scenario] [BI Integration] [Graham Malcolm] Hi, my name's Graham Malcolm. Welcome to this demonstration of a big data scenario for BI integration. What we're gonna look at in this scenario is a hypothetical company called Adventure Works Cycles, an online retailer that sells bikes and cycling accessories and has an existing enterprise BI solution. That BI solution consists of a data warehouse, we have some Analysis Services data models built on top of that data warehouse, and we're using those to power some reports and dashboards in SQL Server Reporting Services. So there's a fairly standard existing enterprise BI solution that's managed by the IT department. And what we want to do is we want to take advantage of our ability to process big data using HDInsight to see if we can find some potentially useful data in log files from the web servers that are used to power the online retail solution. So let's take a look and see how we can incorporate that big data source from those log files into our existing BI solution. Before we start integrating big data into an enterprise BI solution, it's probably a good idea to take a look at the existing BI infrastructure and understand what's going on there. So here we are at Adventure Works. And the BI infrastructure for the enterprise is based on a data warehouse which is implemented in SQL Server. The data warehouse has a number of tables, and each of the tables represents either a dimension of the business by which we want to aggregate data or the facts that we actually want to aggregate. And there are dimension and fact tables in here. So for example there's a dimension table that lets me slice and dice my data by product. And if I take a look at that product data, we can see that there's a record in here for each product, and also a record for facts that I want to measure where there is no product involved at all. So I can always relate a fact either explicitly to "there was no product for this thing that I'm aggregating," or to the product that we should use to slice and dice the data. The products are identified using a product key which is unique in this table, and an alternate key which is probably the key that was used in the original system. And the reason that we do that is the product key is actually what we refer to as a surrogate key. It allows us to have slowly changing dimensions where a product may change over time, as its price might change. So I could have 2 entries for the same product representing it at different points in time. The facts in the data warehouse are the measures, usually numerical values, that I want to aggregate. So for example, I have facts about internet sales. And if I look at that, I can see that each internet sale is related to a product and an order date and a customer, and a whole bunch of other different things that I might want to slice and dice by. So in this data warehouse, we refer to this kind of structure as a dimensional structure. We've got the dimensional model for the data warehouse, and on top of that we have built an Analysis Services model. So we have an Analysis Services tabular data model here which contains a number of tables, each of which is based on the underlying tables in the data warehouse. And this model pre-aggregates the data, structures it, adds things like key performance indicators, and makes it easier for my analytical tools and reports to process that data and display it properly. So that's the underlying infrastructure.
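To make the surrogate key idea concrete, here is a minimal T-SQL sketch of the dimension and fact table pattern being described; the table and column names are simplified illustrations rather than the exact AdventureWorksDW schema.

    -- Illustrative only: a simplified dimension table with a surrogate key and a business (alternate) key.
    CREATE TABLE dbo.DimProduct
    (
        ProductKey          INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key used by the fact tables
        ProductAlternateKey NVARCHAR(25) NOT NULL,         -- business key from the source system
        ProductName         NVARCHAR(50) NOT NULL,
        ListPrice           MONEY NULL,
        StartDate           DATETIME NOT NULL,             -- supports slowly changing dimensions:
        EndDate             DATETIME NULL                  -- the same product can appear as several rows over time
    );

    -- Illustrative fact table: numeric measures plus keys to the dimensions used to slice and dice them.
    CREATE TABLE dbo.FactInternetSales
    (
        ProductKey       INT NOT NULL REFERENCES dbo.DimProduct (ProductKey),
        OrderDateKey     INT NOT NULL,
        CustomerKey      INT NOT NULL,
        OrderQuantity    SMALLINT NOT NULL,
        SalesAmount      MONEY NOT NULL,
        TotalProductCost MONEY NOT NULL
    );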
One of the things that we might want to do with that is we might want to use the data model as a basis for a dashboard. So here's a very simple dashboard that an executive might come to. They can see the overall revenue performance, they can break it down by each quarter of this year, and they can see how close we were to our growth target. So you can see that we were slightly behind for the 1st quarter, but we've made up some ground in the 2nd. As well as things like dashboards and reports, I might want to empower some of my business analysts to do some of their own analysis. So they can do so using Excel. And I can go ahead and I can connect to our Analysis Services model. Just go ahead and log in using our Windows credentials. We can choose which of the perspectives or cubes we want to view in there. I'm just gonna go with the main model cube. And I'll go ahead and we will add this as a pivot table report. And what I now get is a nice pivot table view of those different tables that I've got in my tabular model, my Analysis Services model. So for example I could take a look at the sales revenue, internet total sales, and I can see that figure appears in my Excel workbook. And I can then slice and dice that by various different dimensions. So I could find our date dimension here, and we could go ahead and bring that in and view by calendar year. And I could also go and look at my products that I've got in there, bring that in by category. And I can then start drilling up and down into the data to do some analysis. So I've got a fairly comprehensive enterprise BI solution already built based on a SQL Server data warehouse with a SQL Server Analysis Services data model on top and then using Reporting Services to provide dashboards for executives, and I'm empowering my analysts to use Excel to create their own analysis against the data model in Analysis Services. So how could we extend this to start bringing big data into here? In this case, the specific big data that the analysts at Adventure Works are interested in is the log data from the web servers that service the company's ecommerce site. So we've been gathering these. You can see I've got daily log files for 6 months of data here that we've collected from our web servers. And if we have a look into each of these log files, you can see that there are some header rows, and then we've got the actual data, which is the date and the time, the IP address of the client system, the username which in most cases is anonymous, and the server IP address and port, and so on. There's a whole bunch of fields in here. There are a couple of interesting ones here. There's the stem, which is really just the path within the website where we've gone to find the page, and then there's the query string. In some cases, such as on the default page, there's none, but when we've gone to view a product, the query string includes some additional parameters that were passed to view the web page and see the actual product. So there's a whole bunch of data in these log files, and you can see that they're fairly verbose. There's a lot of data in each of the files. So we've got a huge quantity of log data that we want to somehow process and incorporate into our analysis and see if there's any relationship between activity on the website and the sales of the products. So to do that, we're gonna use some processing in HDInsight to process the data, and we're gonna use Hive primarily to do this.
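For context, a W3C-format IIS log of the kind described above looks roughly like the fragment below; the exact field list and the values shown are invented for illustration.

    #Software: Microsoft Internet Information Services
    #Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes
    2008-06-01 02:13:07 10.0.0.45 - 10.0.0.1 80 GET /default.aspx - 200 3421 510
    2008-06-01 02:14:11 10.0.0.45 - 10.0.0.1 80 GET /product.aspx productid=BK-M18S-42 200 5873 620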
So one of the things we're gonna do is we're gonna create a staging table that represents the schema of the logs, and we'll upload the log files there. The field delimiter here is specified as 32, which is the ASCII code for the space character that's used as the delimiter in the files. And we're gonna store that data just in a folder called "data" in Azure Storage. And then we're gonna create another log table here where we're partitioning by year, because as we gather the data over time we're gonna have a huge volume of data, and we might want to have it partitioned out by year. And we're gonna store that in sequence file format. So we've got a staging table to which we'll upload the log, and then we've got the actual IIS log table which is where we're gonna put the data in order to query it and do some processing of that using HDInsight. To actually do the loading from the staging table, here's the statement. We set the partition mode to non-strict, because we're dynamically generating the partitions. And we load from the log staging table into the IIS log table, partitioning by year and month. And we're using these formulae here to figure out the year and the month based on the data that we've got in the table. So there's a certain amount of Hive activity going on. You'll see we're also filtering to take out the header rows which start with # as well. So we're cleaning up the data as we put it into the Hive table ready to be processed. So to actually go ahead and load that, we have a PowerShell script here. So we'll just go ahead and open this up in the PowerShell Integrated Scripting Environment. And I'm connecting to my cluster, connecting to my storage account. I'm finding out the folder where this file itself is stored, this PowerShell script, because that tells me where to go to find the script files that I need to upload, and that's those two HiveQL scripts we just saw. So we go ahead and upload those to the storage account, and we're gonna store them in a folder called "scripts." And then we are going to go and have a look at all of those scripts that are on there and run them to create the Hive tables, and then we're gonna go and have a look at the IIS log files and upload those. So there'll be a little bit of time taken as we upload those files. Once they're uploaded, we're then gonna run the second of the two scripts that we uploaded, which is gonna actually load the staged data into the actual table that we created. So that's our PowerShell script. We'll just go ahead and run that. So when the script's finished, we've uploaded the data files to our Azure Storage, we've uploaded the two HiveQL script files, and we've used them to create a Hive table for the staged files that we've uploaded, and then another Hive table that has the cleaned-up data that we're ready to process. So now we're ready to go and incorporate this data that we've uploaded to our Azure Storage into our BI environment that we have in the enterprise. Perhaps the easiest way to start integrating this big data into our BI solution is to integrate at the report level, so effectively bring data through the data warehouse and the data model into Excel, and then also bring the big data directly from the Hive table into Excel and mash it up together. So to do that, I've got an Excel workbook here that already has a Power Pivot data model that I've created. So we'll just take a look at that. And the existing data model contains some of the tables from the underlying data warehouse.
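As a rough sketch of what the two HiveQL scripts described above might contain (the table, column, and folder names here are assumptions based on the narration, not the actual lab files):

    -- Staging table over the raw log files uploaded to the "data" folder.
    CREATE EXTERNAL TABLE log_staging
      (log_date STRING, log_time STRING, c_ip STRING, cs_username STRING, s_ip STRING, s_port STRING,
       cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING, sc_status STRING, sc_bytes INT, cs_bytes INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'   -- Hive reads '32' as ASCII code 32, i.e. a space
    STORED AS TEXTFILE LOCATION '/data/';

    -- Partitioned table that holds the cleaned data in sequence file format.
    CREATE TABLE iis_log
      (log_date STRING, log_time STRING, c_ip STRING, cs_username STRING, s_ip STRING, s_port STRING,
       cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING, sc_status STRING, sc_bytes INT, cs_bytes INT)
    PARTITIONED BY (year INT, month INT)
    STORED AS SEQUENCEFILE;

    -- Load from staging, generating the partitions dynamically and dropping the # header rows.
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    FROM log_staging s
    INSERT INTO TABLE iis_log PARTITION (year, month)
    SELECT s.log_date, s.log_time, s.c_ip, s.cs_username, s.s_ip, s.s_port,
           s.cs_method, s.cs_uri_stem, s.cs_uri_query, s.sc_status, s.sc_bytes, s.cs_bytes,
           CAST(SUBSTR(s.log_date, 1, 4) AS INT) AS year,
           CAST(SUBSTR(s.log_date, 6, 2) AS INT) AS month
    WHERE SUBSTR(s.log_date, 1, 1) <> '#';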
So we've got the dates table here, the products, the product categories and subcategories, and the internet sales facts that we're interested in. And you can see that I've got a column here that shows me the profit— that's calculated as the sales amount minus the cost. So I'm able to do some analysis of my internet sales. What I want to do is bring in another table into this data model from our Hive table. So to do that, I'm gonna use a connection to our Hive table using ODBC. And ahead of time, I have a look in Control Panel. You can actually see that I've got an ODBC data source name that I've created called Hive in this case, which connects to my HDInsight server. So I can use this data source from my client computer to connect and create Hive tables. So let's go ahead and use that in our Power Pivot model here. And what I'm gonna do is I'm gonna say I wanna get data from another source. And the other source is an ODBC source. And we'll call this "HiveTable." For the connection string, I'm just gonna build the connection string and simply go ahead and choose the provider to be the ODBC provider and then select our Hive data source name that we saw just a minute ago. So that's nice and easy. I'll put in the credentials that I need to connect to the HDInsight cluster, and I'm gonna save that password so that I can go ahead and connect. So now that we've got that connection string, we can go ahead and connect to the cluster and start getting some data from there. And what I could do is just simply select the Hive tables that we've got, but actually what I want to do is to write a HiveQL query and let my HDInsight cluster do all the heavy lifting to aggregate the data just to get me what I need. So we'll call this "Page Hits." And we'll enter a statement there—a HiveQL statement, I know it says SQL statement, it's a HiveQL statement— to bring back the log date and the number of hits in 2008 before July, and I wanted to group that by the log date. So I'm really bringing back a subset of the data as pre-aggregated, and I'm using the HDInsight cluster to do the heavy lifting to process all of that data and bring back just the results I need. So we can go ahead and finish on there. And Power Pivot starts to import the data, the results from that query that I specified. And once the data has been imported, I can go ahead and close the wizard, and I actually see that data added to my data model here. So we've now got a table called "Page Hits," which has the total number of hits for each day within that 6-month period. So I can incorporate this into my data model a little bit more. If I have a look at the diagram view of the data model, we can start to integrate that table with the other tables. So I've put my Page Hits here, which I'll just bring across. And I've got my Dates table here. And I know that my log date here is actually a date, so I'll go ahead and change that data type. And I'm gonna create a relationship between the log date and the date. So now I've integrated the table of data that I've retrieved from my Hive table into my data model. And that's me ready now to do some analysis with that. So we can go ahead and save our data model. And we're back in Excel. It's actually left the Power Pivot window open behind. So I'll just close that down. I don't need that window anymore. And what I might do is actually take advantage of the fact that I've got that data in my data model and insert a Power View report. So let's go ahead and create some Power View. And so I've got my report here. 
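The HiveQL statement behind that "Page Hits" query is, in spirit, something like the following sketch, assuming the iis_log table and column names used earlier; the point is that the aggregation runs on the cluster so only the daily summary comes back over ODBC.

    -- Daily page hits for January through June 2008, aggregated on the HDInsight cluster.
    SELECT log_date, COUNT(*) AS page_hits
    FROM iis_log
    WHERE year = 2008 AND month < 7
    GROUP BY log_date;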
I can go ahead and add a title into here, so "Web Sales" or something like that. I don't actually need to apply any filters, so we can go ahead and close that filter. And what I might do is I might want to bring across the dates that we're looking at. In this case we're looking in 2008, because we filtered the data that way. So I'm gonna bring across the calendar month, if we can find it. Month. Here. And I might be interested in the internet sales that I've got in that month. And I might really just be interested in the number of orders, the order quantity that comes across there. And I might also be interested in the page hits that I've got for each month. And because I've got that relationship between the tables— the page hits and the date table— it's able to aggregate those page hits by month across there, so I can see how that works. And what I might want to do is I might want to change this to look a little bit different, perhaps just make it a line chart. And I'll just resize that slightly so that we can have a slightly clearer view. And you can see that obviously the hits are increasing over time. The order quantity seems to be relatively static. And I'm just gonna go and make a couple of changes to this so that we get the data labels so we're a little bit clearer about what those values are. So there's a slight increase in the order quantity as we go forward, but there's a marked increase in the number of page hits that we're needing to get those orders. So there's some useful information there. I might want to add something else to this report. I might be interested in, for example, the day of the week. So the day, and also the page hits that I'm getting on each day. So I could go ahead and look at that. And I might want to just average those page hits out. So I'll have a look at the average number of hits per day. And we could go ahead and make that into a bar chart. And just resize that so that we can see what that kind of profile looks like. Over an average week, what's the number of hits we get? And obviously we're more busy on Saturdays and Sundays than we are elsewhere. And equally I could come in here and I could go and have a look at perhaps the quarter. Yeah, let's look at the calendar quarter. And that's actually aggregated to that. So we don't want it to do that. We'll just go and add some additional data into there. So we'll have the internet sales again. We'll take the order quantity and the page hits. And what we'll do is we'll display that as a column chart, with the calendar quarter being the axis. And that makes it a little bit clearer for us to see what's going on. So let's just resize that. So I can see in quarter 1 and quarter 2 the comparative order quantity vs the page hits. And I can interact with this. So I could click "Sunday" and it filters all of the data to show me what happened on Sundays across those periods. So by integrating the results from my big data processing in HDInsight into my data model in Excel, I'm able to mash up data from my enterprise corporate BI world with this external big data that I've processed in HDInsight. And that's a pretty useful way for me to do my analysis. Now, integrating our big data in an Excel workbook is great for individual users or for small groups of analysts who want to look at that data alongside data from the corporate data model. 
But if I want to make this big data processing results part of our BI ecosystem and include it in dashboards and reports, it might make sense to look further up the chain and incorporate that data into our analytical data models in SQL Server analysis services. So to do that, here's the project for our analysis services data model. And I'm gonna go ahead and import some data from a data source. And again we're gonna use our ODBC driver to get this. And we'll build a connection. This time though I'm not gonna use the data source name. I'm gonna use the ODBC driver, but I'm actually gonna enter a connection string. And that connection string—just paste that in there. The reason I'm doing this— this is the same connection string as is used by the DSN. But the reason I want to explicitly put the connection string is because the DSN might not exist on the analysis server where I'm gonna deploy the data model. So I will explicitly put the connection string in there, we'll add the credentials that we need to connect to the Hive table, save that password in there, and then we're ready to go. So when we're ready to go ahead and connect to there, I need to specify the credentials that analysis services should use to do this. I'm actually gonna just use the server's account, because the credentials that are reused are the ones that I've included in the connection string. And again, rather than select from a list of tables or views, I'm gonna write a query to actually bring back the data I want and make HDInsight do the heavy lifting. So we'll name the query "Page Hits" and we'll put in our statement. And this time I'm bringing back the date, the sum of the bytes that were sent to the server, the sum of the bytes that were returned, and the number of records— in other words, the number of page hits that we got. So let's go ahead and finish that wizard. And the data is imported into a new table in the data model. And when the data has been transferred, I can close the wizard and have a look at the table that's been imported. Now, unlike with Power Pivot, there are a few additional things that I need to do with my data here. So one of the things that Power Pivot does for me that I need to do for myself when working with analysis services is I need to create my own measures to actually aggregate the data. So what I can do is I can go ahead and I can create a new measure here that is just gonna be the sum of the sc_bytes. And it creates a formula for me there. I'm just gonna change the name that it's allocated there to something more useful. So we'll call that "Bytes Sent." And if I widen that column, we can actually see that it's calculated the bytes sent. And I'll just do the same for the bytes that are received. Again, give that a sensible name. And finally I'll do the same for the page hits as well. I'll just call that "Hits." So it's gone ahead and calculated out those measures for me, those aggregated measures. And what I'm gonna do is I'm gonna go and I'm gonna change the format of these bits of data. You can see they're just plain numbers at the moment. So we'll just select all 3 of those, and we'll change the format so that it's a whole number for each of these. And we'll show the thousands separator as well. And now when I look at the format of that data, if I just widen this column, you can see that it's slightly easier to read and to handle there. So we did a little bit of working with our model to format the data. I'm also gonna go and have a look at the diagram view of this model. 
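The measures described above would look something like the following DAX sketch; the table name Page Hits matches the narration, but the source column names (sc_bytes, cs_bytes, hits) are assumptions.

    Bytes Sent := SUM('Page Hits'[sc_bytes])
    Bytes Received := SUM('Page Hits'[cs_bytes])
    Hits := SUM('Page Hits'[hits])

Each measure is then formatted as a whole number with a thousands separator, as described in the walkthrough.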
And we'll just see if we can find our table that we've just added. Over here to the right. I'm just gonna drag this over this way so that it's next to my dates table. I can then start to do some formatting of my model. So we'll just increase this font slightly so we can see what we're doing. So here's my page hits table. I've got my log date column here. And again I can just have a look at the format of that, and instead of being text, I'll choose this to be a date. And I can go ahead and I can, once it's updated that, create a relationship between this table and the date table. So that will let me aggregate the page hits by the date. And now that I've done that, we'll just go and have another closer look at this table. We'll just maximize this table up so we can see it in detail. And obviously there are the 3 calculated columns that I created which I'm interested in. I don't necessarily need to see any of these other fields. So I'm just gonna hide the detail columns there, just to avoid confusing me by having those in the user interface when I start to work with the data. I'm only really interested in the aggregated totals. So we'll just hide those from the client tools. So now that we've got the page hits table incorporated into our data model, we can make some use of it. For example, here's the internet sales table. And you can see that there are a number of calculated columns that I've created, or calculated measures. And I've added a few at the end here. I've got the actual units, which is the sum of the order quantity. So that's the number of units that have actually been sold. And I've got target units, which is the hits value that I defined in my page hits table, multiplied by 0.065. So I'm creating a relationship between the number of page hits that I get and the number of units that I expect to sell as a conversion rate that I'm trying to achieve. And what I've done is I've created a KPI here that measures the actual units against the target units and will give me an indication as to whether I'm on target or not. So if I have achieved 100% of the units that I'm trying to sell, we get a green icon. If we fall below that, down to 80%, we get this middle yellow icon. And if we're below 80%, we get a red icon. So I can go ahead and deploy that data model, and now I can start using that throughout my BI ecosystem. In this case, we can use the big data that we've incorporated into our Analysis Services model in our dashboard report. So you can see that I've got the report that we saw earlier on. And I've modified it slightly. I've added a data set to here called "Hits and Sales." And if we take a look at that data set, you can see that it's connecting to the sales cube that we've got in our Analysis Services model, and we've got this rather verbose MDX query here. If I have a look at the query designer, you can see a bit more clearly what that's doing. We've filtered for the first semester of the latest year of data that we have, and we're bringing back the month, the actual units goal (in other words, that KPI target value), the actual value that we achieved, and then the status indicator, which comes back as a number between -1 and 1. So we've got that data set in our report now. We've added a table over here where we're displaying the month name and then an indicator based on that status value that we get back. And we can expand down to see the actual units and actual target. So I've deployed this to the server.
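A hedged DAX sketch of the measures behind the KPI described above (the exact measure and column names are assumptions; the 0.065 factor and the 80%/100% thresholds are as narrated):

    Actual Units := SUM('Internet Sales'[OrderQuantity])
    Target Units := [Hits] * 0.065    -- target conversion: 6.5% of page hits should become unit sales
    -- KPI: Actual Units measured against Target Units
    --   at or above 100% of target -> green
    --   between 80% and 100%       -> yellow
    --   below 80%                  -> red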
So if I actually go and have a look in my browser, I can see that I've got the dashboard here. It now includes the page hits vs units sold that I'm getting by mashing up my big data results from HDInsight with the sales data in my data warehouse. And I can expand through each of these months and see the actual units sold and the target. And I'm pretty much on target for most of the year. In June, we're a little bit behind. So we've seen how to integrate our big data processing results at the report level using Excel and at the data model level in SQL Server Analysis Services. What you might of course also want to do is integrate at the data warehouse level. You might want to bring the results of your big data processing into your SQL Server-based data warehouse, and then propagate that data throughout the entire BI stack you have in the organization. So there's a number of ways to do that. In this scenario we've chosen to use a technique that uses SQL Server Integration Services to copy the data. And what we've done is we've created a new fact table in the data warehouse for the IIS log data that we're analyzing. And that obviously references some of the existing dimension tables. And of course you might remember from earlier that the dimension tables use a surrogate key to identify each individual row that doesn't necessarily relate to the original business key that might be in the source data, such as the product key. So we'll have to do a little bit of work to look up the appropriate surrogate key to make that relationship work. To do that, we've decided that we're gonna stage the data that we get from our Hive tables. So we've got a staging schema just to keep the staging tables separate. And we've got an IIS log table in there for the data that we're gonna bring across from the Hive table. The other thing that you commonly do in a BI environment, when you're moving data around for your ETL workflow, is you only want to extract the data that you need. You don't want to extract the entire Hive table each time, in this instance. So what we've done is we've created a high water mark table that stores the maximum date that we've previously extracted. So the next time we do the extraction, say next month after we load next month's log files, we don't want to extract everything; we just extract the data that's relevant for us. So we've got this high water mark table here. And because we haven't used it at all, we'll initialize that with a value of a whole bunch of 0s. So basically we're gonna bring out all of the data that's in there, and then we'll update that high water mark value so that next time we filter the extraction based on what we need. So those tables have been created. And I've created an SSIS control flow here in a package to start the extraction. So the first thing we do is we get that high water mark. We find out the maximum date that we've already extracted to the data warehouse. And you can see that there's a query here. We'll just bring that up so we can see it a bit more clearly. So we're selecting the maximum value from the high water mark table, so that we know what that last extracted value was. And we're storing the results of that in a variable called "High Water Mark." It's a user variable within my SSIS package. So we're bringing out the date that'll be the border, if you like, or the boundary value, for the data that we need to extract, and storing that in a variable.
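In T-SQL terms, the staging objects and the first control-flow query might look like this sketch; the schema, table, and column names are assumptions.

    CREATE SCHEMA staging;
    GO

    -- Staging table for the rows extracted from the Hive table.
    CREATE TABLE staging.IISLog
    (
        LogDate       NVARCHAR(50) NOT NULL,
        ProductAltKey NVARCHAR(50) NULL,   -- product business key parsed from the query string (if any)
        PageHits      INT NOT NULL
    );

    -- Single-row table recording how far previous extractions have already gone.
    CREATE TABLE staging.ExtractHighWaterMark
    (
        LastLogDate NVARCHAR(50) NOT NULL
    );
    INSERT INTO staging.ExtractHighWaterMark (LastLogDate) VALUES ('0000-00-00');  -- "a whole bunch of 0s"

    -- The first Execute SQL task in the control flow runs something like this
    -- and stores the result in the package's high water mark variable:
    SELECT MAX(LastLogDate) AS HighWaterMark FROM staging.ExtractHighWaterMark;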
Now if I have a look at the variables that are in this package, there's that high water mark variable that we've just seen. There's another variable called "Hive Query," which is what we're gonna use to extract the data from Hive. And if we have a look at that, we can actually see, although it's quite difficult to see, that it's a SELECT expression. It's a Hive query that incorporates the high water mark variable in there. So we've got an expression that creates a HiveQL query that extracts all the data that we need after the high water mark date. So once we've done that we'll empty the staging table of any data that's left over from the last time we extracted data. Then we're gonna go ahead and stage the log data, and finally we'll reset the high water mark to be whatever the value is that we've now extracted, so that we're always keeping that high water mark up to date and each subsequent extraction only takes the data that we need. The Stage IIS Log Data task is a data flow. So if we have a look at that, we can see that there's a number of steps involved in the data flow. The first of these is where we extract from a Hive table. To connect to that Hive table, we've got a connection manager over here. And if we have a look at the definition of that connection manager, there's a connection string that uses the Hive ODBC driver to connect to my HDInsight server. And we'll put in the password that's required there. We've then got this Hive data source, and if we have a quick look at that you can see that the data source uses a SQL command, and it's that variable, it's that HiveQL query that includes the high water mark in there as well. So we're dynamically generating the code that we need to extract the data from the Hive table. When that data comes out, we then do some data conversion. Basically all we're doing is taking the log date and product ID, which will come out as fairly long strings. They'll be 4000 characters because of the way I've set up the connection. And what we'll do is we'll just shorten them. We're just taking them down to 50 so that they fit into the table that we're gonna load them into. So there's a little bit of data type changing going on there. We could of course have a whole bunch of different transformations happening to the data as it's extracted. And then finally it's loaded into the staging table in the data warehouse. And if we have a look at the mappings there you can see that we're loading the shortened values into those string columns and then the various other values that were extracted. So all that should be ready to go. Let's go ahead and run our package. If we just resize, we can see each of the stages as it runs. The data flow is now running, so we're extracting data from the Hive table. And as the data's extracted and pushed through the data flow, the data conversion occurs and eventually it's loaded into the staging table and we update the high water mark to reflect the data that we've just extracted. So this package has gone and extracted the data and loaded it into the staging table. We then have another package which is a very simple 1-step process where we are going to insert the staged data into the appropriate table in the data warehouse. And to do that we're gonna use a join to look up the appropriate surrogate key based on the business keys that we've got in the data. So let's go ahead and run that. So now we've extracted the data from the Hive table using a HiveQL query.
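As a sketch, the expression behind that "Hive Query" variable could be along these lines, concatenating the high water mark into a HiveQL statement; the variable, table, and column names, and the productid query string parameter, are assumptions rather than the actual lab code.

    "SELECT log_date, regexp_extract(cs_uri_query, 'productid=([^&]+)', 1) AS product_id, COUNT(*) AS page_hits "
    + "FROM iis_log "
    + "WHERE log_date > '" + @[User::HighWaterMark] + "' "
    + "GROUP BY log_date, regexp_extract(cs_uri_query, 'productid=([^&]+)', 1)"

If the high water mark variable were held as a date type rather than a string, it would need a cast to DT_WSTR before being concatenated like this.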
And that HiveQL query used a regular expression to figure out the product key that we had in those various query strings that we saw earlier on. So if there was a product key we'll have that business key. We then use this SSIS package here to load that stage data and look up the corresponding surrogate key for the business keys for the product. So we should now have a table in our data warehouse that's conformed to our dimensional model and that has a record of all the page hits including which product was referenced if a product was referenced in the query string. So let's now see how we can use that in our BI environment. So to make use of our data warehouse which now contains some data that we've transferred across from HDInsight, in Reporting Services, I've created a data source that connects to our data warehouse. And what we're gonna do is we're gonna create a new report using Report Builder. So we'll open up Report Builder. And we'll create a new table or matrix report. So I'm gonna go ahead and create a data set. And that data set will be based on the AdventureWorksDW data source that we've just seen. And I actually already have a Transact-SQL query that I want to use for this. So we'll just go and get that. And it's in the Lab 4 lab files folder. So here's the query I want to use. What it does is select the top 10 selling products. So I could actually preview the data by running that. And I can see it brings back the product, the picture of the product, the amount of revenue, and the page views. In other words, the big data that I've brought across and integrated into my data warehouse is being returned by this query as well from that fact IIS log table that we created. So if I have a look and see what we've got, we're just gonna have all of these appear in the values section. And we can pretty much go through the rest of the wizard and finish it. And you can see that I get my report. I'll just add a title to that report. And if we take a look at what we've got, the large photo field that we've got there, I just want to get rid of that and add an image into there. So we'll just go and click Insert. Image. And that goes into that field. And that is gonna be from the data source. So we need to specify that comes from the database. And the field we want to use is the large photo, and we'll save it as a JPEG. So that photo comes in. We'll just resize the height of the row to accommodate the picture. And we should be able to preview our report, just to make sure that everything's working as we expect. And yep, I can see that I've got the revenue and the page views for each of the products that we've sold. We'll just widen the column here and we'll format the revenue as currency. And we'll format the number of views. We'll just get that to be left-aligned. And there's still a little bit of tidying up I could do. Again, we'll just go ahead and make this move across to the left. So now we're starting to see some interesting information. And I'm ready to go ahead and save that report. So let's save the report. And we'll put it in our Adventure Works Reports folder up on the server. And we'll call it "Top 10 Sellers." So I can now close Report Builder. And if I refresh the page, there's my Top 10 Sellers report. And I can go and browse that and view information about the top 10 selling products that includes the number of page views that we've had for each of those products. So I've been able to integrate data from my big data and log files. 
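A sketch of the kind of lookup-and-insert the second package described above performs; the fact and dimension column names are assumptions, with the dimension's alternate key acting as the business key carried through from the query string.

    INSERT INTO dbo.FactIISLog (DateKey, ProductKey, PageHits)
    SELECT d.DateKey,
           ISNULL(p.ProductKey, 0) AS ProductKey,   -- fall back to the "no product" member when nothing was referenced
           s.PageHits
    FROM staging.IISLog AS s
    JOIN dbo.DimDate AS d
      ON d.FullDateAlternateKey = CAST(s.LogDate AS date)
    LEFT JOIN dbo.DimProduct AS p
      ON p.ProductAlternateKey = s.ProductAltKey;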
My source file has gone into the Hive tables and HDInsight. I've then used SQL Server integration services to extract some data. I got the HiveQL to do all the hard work of processing data and aggregating it, bring the results out, store those results in my data warehouse, conforming them to my dimensional data model, and now I'm able to use them throughout my entire BI ecosystem just like any other data that I've got in my data warehouse. So that's how we can use HDInsight to integrate big data into our managed corporate BI environment. Next let's take a look and see how we can extend our self service BI to incorporate big data and enable users to share their insights and make big data results discoverable across the enterprise. Now let's suppose that in Adventure Works we have a number of business analysts who want to perform their own analysis. It's great that we've got the enterprise BI model that has the data in it and that is used to feed our dashboards and our reports and empower our users across the enterprise. But I might have some business analysts who want to perform their own self service analysis. So first of all they might want to analyze the big data that we've created using something like Pig. So I've got analysts who are connecting to our HDInsight server and they're using Pig scripts like this one to load the data from those IIS log files that we've uploaded and then applying some Pig processing to get whatever information they need out of those files. So in this case we're filtering the logs to remove the header rows, we're grouping them by the log date, and we're counting the number of page hits, and we're getting the sum of the bytes that were sent and the bytes that were received, and then flattening all of that out, ordering it by the log date, and storing the results into this web traffic folder. So it's a fairly simple Pig Latin script that a business analyst might have the Pig programming knowledge to be able to create. And so they've created that script file and now they want to use that to process the big data. So to do that, we're gonna use PowerShell to run the Pig script. Here's our PowerShell code. And the first thing we do is we upload that Pig script that we've just seen up into our Azure storage. We're gonna put that in our scripts folder. And then we're gonna run that Pig Latin script, and of course it will run against the log data that's already been uploaded to our Azure store. And when it's finished it will produce the results and store them in that folder where we said they were gonna be, and the script and the processing will be complete. So let's go ahead and run that script. And the first thing that happens is we upload the script file. And then we kick off a job to actually do the Pig processing for us. So we'll just wait for that job to complete. So when the job's complete, the Pig script has been used to process the data and generate the results that we need. One of the things we're gonna need is the storage key for our storage account in order to retrieve the results. So we're just gonna go and display that information into the console window here. And the variable that we retrieved from that was the storage account key. So that's the key that we're gonna need. We'll go ahead and just copy that. You can also get this by going to the Windows Azure console and copying it from there. We'll take that and we can now use that in Excel when we go to retrieve the results. So let's have a look at Excel. 
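The Pig Latin script described above would look something like this sketch; the load path, output folder, and field names are assumptions consistent with the log structure shown earlier.

    -- Load the raw IIS log files; fields are space-delimited.
    logs = LOAD '/data/' USING PigStorage(' ') AS
           (log_date:chararray, log_time:chararray, c_ip:chararray, cs_username:chararray,
            s_ip:chararray, s_port:chararray, cs_method:chararray, cs_uri_stem:chararray,
            cs_uri_query:chararray, sc_status:chararray, sc_bytes:long, cs_bytes:long);

    -- Remove the # header rows.
    cleaned = FILTER logs BY SUBSTRING(log_date, 0, 1) != '#';

    -- Group by day, then count the hits and total the bytes sent and received.
    by_date = GROUP cleaned BY log_date;
    traffic = FOREACH by_date GENERATE FLATTEN(group) AS log_date,
                                       COUNT(cleaned) AS page_hits,
                                       SUM(cleaned.sc_bytes) AS bytes_sent,
                                       SUM(cleaned.cs_bytes) AS bytes_received;

    -- Order by date and store the results for later retrieval.
    ordered = ORDER traffic BY log_date;
    STORE ordered INTO '/webtraffic';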
And in my blank workbook here, I'm gonna use Power Query. And what I'm gonna do is I'm gonna connect to our data source. So from another source, and in this case it's from Windows Azure HDInsight. So we'll go ahead and choose that option. And it's asking for the name of the Azure Storage account associated with the cluster. So we'll just enter that and go ahead and browse to the location. Now we need the key that we've just copied in order to connect to that data source. So we'll just paste that in. And in the Navigator pane, you can see that it's found that storage account and it's found the various containers that are stored within that. So what we're gonna do is we're gonna go and edit this query that we're generating here. And up comes the query editor. And I'm just gonna refresh this so that we make sure we've got an up-to-date view of all the blobs that were in that container. And what I'm looking for is a folder called "Web Traffic," which is where we put the results of our Pig query earlier on. So let's go and filter this so that we are looking for something that contains "web traffic." And we find the paths of blobs that contain "web traffic." And out of those two, there's the _SUCCESS file, which is generated by the Pig job, and part-r-00000, which is actually the results that were generated by the Pig job. So that's what we want. We'll go ahead and select that. And all the time you can see that Power Query is intelligently interpreting that data and making some various changes to it. So it's imported that data as CSV. It recognized that it was delimited text. It's changed the type of the first column from text to date, which makes perfect sense. So we're just going to rename that column so that we know what's happening with that. And the 2nd column here is actually the number of page hits. So we'll make that change as well. And the 3rd column is the bytes that were sent. And the 4th column is the bytes that were received. So I've managed to get the results of my Pig query, I've browsed using Power Query from Excel into the Azure Storage to find those results, and I'm able to bring them in. And I can continue to shape and filter this data if necessary. But actually I've done most of the work I need to do with Pig. So I can go ahead and I can save that down. So if I go ahead and just apply and close that query, you can see that I've got this query now defined in my workbook. I didn't choose the name of it, so it's just taken the default name which was "hdfs-container." And it's brought down that data and added it to my worksheet, which is great. But what I really want to do is to make this data discoverable for other people in the organization, people that perhaps don't have the necessary knowledge or skills to use Power Query to go in and filter that data the way that I've done, and maybe they don't know where the data is, or maybe they don't know about the schema. But I want to share this data and make it discoverable for people. So what I can do is I can go ahead and share the query. I'm signed into my Office 365 account, so I can share that query. And I'll give it a meaningful name this time. We'll call it "Adventure Works Web Traffic." And I could add a description in here. So it's "Daily web activity," or something like that that's meaningful. And I can choose who I want to share this with. I could share it with just myself, so I can access it from any device, or I could share it with everyone in the enterprise, or with specific groups of people.
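In Power Query's underlying M language, the finished query described above is roughly along these lines. This is only an approximation: the storage account name, container name, and connector call are assumptions, and the query actually generated through the "From Windows Azure HDInsight" option may differ in detail.

    let
        // Connect to the cluster's storage account and pick the container (names are placeholders).
        Source = AzureStorage.Blobs("mystorageaccount"),
        Container = Source{[Name = "hdfs-container"]}[Data],
        // Keep only the Pig output file, then parse it as tab-delimited text.
        Results = Table.SelectRows(Container, each Text.Contains([Name], "webtraffic/part-r-00000")),
        Csv = Csv.Document(Results{0}[Content], [Delimiter = "#(tab)"]),
        // Type and rename the columns to match the Pig output.
        Typed = Table.TransformColumnTypes(Csv, {{"Column1", type date}}),
        Renamed = Table.RenameColumns(Typed, {{"Column1", "Date"}, {"Column2", "Page Hits"},
                                              {"Column3", "Bytes Sent"}, {"Column4", "Bytes Received"}})
    in
        Renamed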
I'll go ahead and share it with everyone. I could add a URL to some documentation. Perhaps I want to write a Word document or an HTML page that explains what this data's all about. But I'll go ahead and I'll share that as it is just now. So let's share a copy. And once that's shared, I should be able to look at my shared query. So if we actually go and have a look on the Power Query tab here and look at Shared, sure enough it shows up as a query that I've shared. So here I am as another user using Excel. And I've got my workbook open that I've been doing some analysis in. And again I can use Power Query. But this time instead of connecting to the HDInsight cluster and having to know the key for the Azure Storage container and so on, I can just simply go and do an online search and search for "Web Hits" in this case. And in the results, I can see some results from various public data sources. But right at the top of the list, I've got results from that query that I shared previously. So now other users in the organization can find that query and they could go ahead and use that as the basis of the data that they want to import into their own analysis. So I can go ahead and I can open up that query. And I could just simply go ahead and import the data as it is, or I could now go and apply my own changes to this. So for example, I could filter the date. And I might be only interested in dates that are between the 1st of January, 2008, and the 30th of June, 2008. So that's gonna filter my data down and result in a data set that contains just the data that I'm interested in looking at. So I could go ahead and change the name of this query. It doesn't change the shared query. It's just the query that I'm creating in this workbook. And I'll go ahead and change this to "Web Traffic." And instead of loading it into the workbook, I'll load it directly into the workbook data model. So let's go ahead and do that. So I can see that the data is being loaded. It's being loaded up into the data model. There's been 182 rows loaded. And if I go now into Power Pivot and actually take a look at my data model, I can see that web traffic table has been added to my data model here. And if I have a look at the diagram view, I can go ahead and I can incorporate that data into my data model. So I'm just simply gonna take the date and join it to the date field in that table there. So now I've got that data in my data model. That's pretty useful. Let's go ahead and close the data model window and return to Excel. And now I might want to actually use that. So I could go ahead and I could insert perhaps a Power View report here. And again, I'm not gonna use any filters. I'll just close this workbook queries pane. We don't need that anymore. And we'll add a title to our report. And what I might be interested in doing is taking a look at the different months that I have. So I've got a month field here. And take a look at the web traffic that we received for that month. So we'll start off just by looking at the page hits. We'll bring that in. And if we pull that across the top half of the report, I could go ahead and I could change that to be perhaps a line chart showing me those page hits across that period of time. And then what I might want to do is underneath there, I might want to again just take a look at some data, perhaps again by month. And we'll take a look at the number of bytes received and bytes sent. And I might just go ahead and create that as a clustered column chart. 
So I start to get some pretty good visualizations of that web data across a time period. So all this is great, but what I want to do is I want to share these insights that I've found with some other people in the organization. So to do that, I'm just gonna go ahead and save this workbook here. And if I have a look at the Office 365 site, because I've got the Power BI service installed, I've got this Power BI site within my team site. So I'm gonna go ahead and I'm gonna add a document to this. And I'll just upload that file. And the file I want to upload is in Lab 4. There it is. Adventure Works Analysis. So that's the file to which I've just added that report. So we'll go ahead and upload that, and it'll take a few minutes just to upload to the Power BI site. And I can see that it's gradually being uploaded here. And I get a little preview of it appearing here. Obviously I could upload as many of these as I need to. To make it obvious for users to find this, I might add it as a featured report. And the featured reports appear in a band across the top of the page. So here's my featured report. So a user could come to this website or they could use the Windows 8 application, the Windows Store application on a tablet, and they could come to this location, and they can go ahead and view this report. So I open up that report. Off it goes, and within the browser it gets rendered and visualized, and I can see that data. And it works; it's interactive. So I could go ahead and I could filter by clicking on a month. And I can change the data as I need to. So pretty good stuff. I've got this interactive report, and I can now view that in a browser. I've shared that with other users in my Office 365 organization. And the final thing I might want to do relates to the data that's in my data model now. If I just return to my Power BI site: I've got the data in the data model, and that data has come from my internal data warehouse and also from the Hive tables that I queried, or actually in this case from the results of the Pig query that I've brought in from my Azure Storage. So that data model has a whole bunch of information in it that I might want to be able to query. So what I'm gonna do is I'm gonna go ahead and add this report to the Q&A feature that I have in Power BI. And what that will let me do is go to the Power BI Q&A interface and actually ask some questions in just natural language to go and find some information from this workbook. So for example, I could ask a question like "Show average page hits and average sales amount by month as bar." And you can see that it's automatically updating that data as we go. I'm just gonna clear away some of this additional information that's here. But you can see that it's automatically dynamically generating visualizations of the data based on the questions I ask. So I've been able to take big data from log files, load them into my HDInsight cluster, I've incorporated that into my enterprise BI model, I've managed that, but I've also empowered my business analysts to go ahead and do self-service analysis.
So they were able to run a Pig query, generate the data that they need from that source data, incorporate that into an Excel workbook using Power Query, add it to the data model and finesse the data model using Power Pivot, and then they were able to create a report using Power View, upload that workbook as a report into Power BI, and then within Power BI we're able to view the report and use natural language to query it. So it's a very powerful way to take big data, incorporate it into your enterprise BI environment, and empower your users to find and share insights. [Microsoft]

