Watch videos with subtitles in your language, upload your videos, create your own subtitles! Click here to learn more on "how to Dotsub"

Big Data-Scenario 1: Iterative Data Exploration

0 (0 Likes / 0 Dislikes)
[Big Data Scenario: Iterative Data Exploration] [Graeme] Hi, my name is Graeme Malcolm. Welcome to this demonstration of a big data scenario for iterative data exploration. So, in this scenario, we're going to take a look at a fictional company called Blue Yonder Airlines. And Blue Yonder are interested in what people are saying about them on Twitter. So... To help them understand that, they've collected some tweets over a period of weeks using the Twitter API. And now that they've got all of that data, they're ready to explore the contents of those tweets using HDInsight. So let's take a look and see how they went about doing that. Here we are in Windows, and the Twitter data that we've collected I've got open here in Excel. It's just a tab-delimited text file, and I've got a whole bunch of tweets that have been recorded that have the address for our Twitter handle, the BlueYonderAirlines. We've got the date, the URL of the tweet, the name of the Twitter user, and the actual text of the tweet as well. So there's a whole bunch of different tweets here that might contain some useful information to tell me what people are thinking about Blue Yonder Airlines and the services that we provide. So we'll just close this down. We don't need to save that. We've got that stored as a text file in this data folder here that we're ready to go ahead and use. Let's see how we can do some analysis against that data. To do that we're going to use HDInsight. To work with HDInsight, rather than create a remote desktop connection and use the console on the remote desktop, we're going to use PowerShell to do the whole thing remotely from our client application here. As you can see, I've got the PowerShell environment all set up. I've installed the Azure module so you can see the various different commandlets that are available for working with Azure. I've also set up the environment with a certificate for my Azure subscription. So if I say Get-Azure Subscription, if I can type the right command, we can see the information about my subscription. The environment is all set up and ready for me to use and work with Azure through PowerShell. The first thing I'm going to do, I've got a script here that I've put together, and all I'm really going to do here is upload my source data to Azure storage so that I can access it from HDInsight. So I need my storage account name and the name of the blob container where I want to upload the data to. The rest of the script basically just finds out where this PowerShell script itself is stored because that's where I've stored the source data. I'm going to upload the data to a folder in Azure called tweets. The first thing I need to do is I need to get my storage account key. There is a command to do that, Get-AzureStorageKey. Pass that the name of my account. Because I've already set up the security, I've already got a certificate installed on the local computer, it can go ahead and get that access key. Then I just simply create a new storage context. I'm going to iterate through all of the files that are in the folder called data, which is in the folder where the script is actually stored. So it will pick up the files that are in there. There is actually only one in this case, and it's going to go ahead and it's going to use the set as your storage blob content to upload the files that it finds up to Azure. So far, so good. Let's go ahead and run this script. I'll just bring this up so we can see a little bit of the output as it runs. The script starts to run. Immediately it goes and finds the files in that folder and starts to upload them. And when all the files are uploaded the script finishes and I've got the files now uploaded, ready to work with them in HDInsight. Okay, that's good. Let's think about how we're actually going to explore this data. One of the first things I might want to do is I might want to create a Hive table, put the data into the table, and then query it just like a SQL Server table using Transact-SQL, except this time all I'm doing is creating the metadata for a table on top of the file that's already there, and then I'm going to go and query that and get some information from it. To do that I've got a Hive script file here which we'll have a look at. I'm going to drop the table if it already exists and then create the table called Tweets. I've defined the schema of the data. It matches the tab-delimited format that I've got in that text file. I can see that it's terminated by a tab. I'm going to define this as a table on top of a folder at this location here, twitterdata/tweets. Obviously I uploaded the file to a folder just called tweets, and so then I'm going to load the data from that folder into the table. So there's a little bit of extra code there to actually move the file that I uploaded into the folder that the table is created on top of. To do that I'm again going to use PowerShell. Let's have a look at the PowerShell code here. There's a bit more of this one. This time I need the cluster name as well as the storage account and the container. Again, I just find the local folder where this script is stored. The reason I'm doing that is because I'm going to use very similar code to what I did last time, but this time I'm going to upload that CreateTable HQL file that contains my HiveQL code to create the table. So I'll go ahead and upload that file. I'm going to stick that in a folder called twitterdata/scripts. Once that's been uploaded, I'm ready to go ahead and run that as a Hive job. I use this New-AzureHDInsight HiveJobDefinition commandlet, quite a long, complicated name. But what that does is create a job in HDInsight that's going to run some Hive code, and I just simply pass it the name of the file that I uploaded. That's the script file that creates the table. I'll then go ahead and start that job on my cluster. While I'm waiting for it to run, it's going to run asynchronously so I'm going to wait for it to complete and I'm going to get the job output. So in other words, just the information. If I was running this interactively in the console, Hive would be displaying status information as it runs. So I'm going to get that and I'm going to display that as the output in the client. Then just to verify that the table has in fact been created, I'm going to do some querying against that table. I've got a HiveQL query here where I'm selecting the date and the number of tweets from that Tweets table and I'm going to group that and sort it by the date. So I should end up with a list of how many tweets were made on each date in this week's worth of data that I've uploaded. To use that script I'm going to use this Use-AzureHDInsightCluster to specify the name of the cluster, and then a very simple Invoke-Hive commandlet lets me specify a query, a Hive query that I'm going to run against that cluster, and it will bring back and display the results in the console window. So again, let's go ahead and run this PowerShell script. We'll just clear the screen at the bottom first before we go. Off it goes, and the first thing it does is upload the script file. Once that script file is uploaded, it's going to start an Azure job to run the HiveQL script that I've uploaded. So the job starts and it says Now Running. And then a second job is started to run the Hive query that I've included in my PowerShell script. This job performs some MapReduce code in the background. The Hive gets translated to MapReduce. And when the job is complete, I can see the results that have been returned. And in this case I can see that as the time has gone on over the course of the week the number of tweets is gradually increasing. So we at least learned one thing here— that people are mentioning us more often on Twitter as time goes by. But that's perhaps not particularly useful for understanding what they're saying. They could all be saying that they think we're wonderful, or they could be equally saying that they think they've had a terrible experience with us. So we really need to start exploring the contents of the tweets themselves to get some information. To do that we're going to use some explicit MapReduce code. You saw here that the Hive code we used was translated into MapReduce. What we could actually do is just use some MapReduce code that's written in Java instead of using Hive. We're going to use this WordCount class here. This is actually a sample that's provided with HDInsight, and it's a fairly standard Hadoop sample. There's a class here called WordCount, and it's got a function called map and it's got a function called reduce in this static class here as well. So we should see that a bit further down, I believe. Yeah, there it is, reduce. So what's actually going to go on here is we're going to pass in the text, the text from the tweets. So each of these tweets that's in this text file, we just threw the text into here. The map function is basically going to go through all the different words that it finds, splitting it out using a space as the delimiter, so find all the different words, and it's going to then create a key value pair set for each of those words that it finds. So for every word it will have here's the word and it just nominally puts a count of 1. We're then going to go into this reduce function, and then for each of those key values, the words that have been found, we're going to add up the number of times that they appear. So each one comes in, we're going to add up those nominal 1 values that we're given, and that will give us the total number of times that that word appeared in the text. That then gives us some frequency information about the use of the words in the text, and that might give us an indication as to what's important to people when they're tweeting about us. I don't really need to do anything more with this Java file because, as I said, it's the standard example. It's already in place in the cluster for me. So if I have a look at this PowerShell script here, we can see how we're going to use this. Again, we're going to connect to our cluster. We are going to use this New- AzureHDInsightMapReduce JobDefinition commandlet this time to create a MapReduce job. We point that at the compiled Java JAR file for that sample code that we've just seen. That's actually stored up on the cluster in this path here up in Azure storage in example/jars/ hadoop-map-reduce-examples.jar. Within that there's a class called wordcount, and we're going to pass in a couple of arguments. The first argument is the location of the data that we want to actually process. That's my source data, which you remember we moved into that twitterdata/tweets folder where the table is defined. We're going to bring out the results of that to this path here. So go ahead and do the MapReduce job, generate the results, and store the results file in twitterdata/words. We'll go ahead and start that job. Once it's complete—it will take a little while to run but once it's complete we'll wait for it and we will go ahead and download the output file. The standard naming convention for files when MapReduce jobs work is part-r and then a bunch of numbers— in this case 5 zeroes. So we're going to assume that that's what gets generated, we're going to find that text file that has the count of the number of words in it, and we're going to go ahead and download that locally and then just simply display it in the console window using this cat command. So we're running a MapReduce job that generates the count of the words, downloading the results, and displaying it down here. Let's go and run that and see what we find. The MapReduce job starts. That's it running now. And when the job is complete, the output is displayed and we download the results file and we display the contents of that results file. There's quite a large file there. If we scroll up, you can see here some of the common words that have appeared in there and how frequently they've appeared. These aren't sorted into any particular order, so I'm going to have to dig through this and try and figure out which are the most common words. There are some other problems that I might have. It's counted things like the URLs as a word because it's just a string delimited by text, by space, and of course each one has appeared once because each tweet is unique. So there's a whole bunch of things in here that aren't necessarily very useful to me, but there is some data here that I might find useful as I start to explore what I've got in there. I can see some words in there that might be useful to me. However, I'll probably want to refine this approach. And rather than show you how terrible my Java programming skills are, it's probably easier to use an abstraction like Pig and write a Pig script to go ahead and do this. So let's take a look at how we might do that. I've got in here a WordCount.pig file. Let's have a look at what's in here. This is a script that I've written in Pig Latin. What I'm going to do with this is kind of similar to what we did with that MapReduce code previously, but we're going to do a little bit more with this one. We're going to be a little bit more precise about what data we want from that. We're going to load into this tweets relation here, we're going to load the source file. We impose that schema onto the data. We're only actually interested in what's in the tweet side of things. We don't really care about the date, the ID, and the author or anything like that. So we'll split that tweet value into the individual words. We'll take this tweets relation that we've loaded, and for each one that we've got we'll use this tokenize idea to flatten out the tweet into the different words within that tweet, similar to what the MapReduce code did. We'll take that and then we'll do a little bit of cleaning to that. We're going to get rid of things that have additional punctuation or brackets or HTTP type things or tags or all that kind of thing, things that aren't actually words. We're going to go ahead and get rid of them. We use this regular expression here to specify this is what we think of as a word. It's something that's got an A to Z text value in it. So we take out those words. There might be some that are actually just empty strings, and we're going to go ahead and get rid of those as well. So we'll filter those out. We're being a little bit more precise about what we want to count as a word. The code to do that in Pig Latin is obviously much simpler than it would be to do in Java. Then we're going to group those filtered words by word. So we'll find all the words that are the same and we'll group them together and then we'll count them. We're just simply going to count the number of instances of each word that we've found. We'll sort them into order by count so that the most common words are at the top of the list. And then we're really only interested in the top 100 words. There could be thousands of words in your list. Just get the top 100, bring them out, and store them in a file in this twitterdata/wordcounts output folder. That's where we're going to put the results. So it's a fairly straightforward Pig Latin script that's going to process that data. It does a similar job to the WordCount Java code that we saw earlier, but it's a little bit more precise in what it does, more specific, but what we're counting is words and it's a little bit easier to understand than writing the whole thing in Java. So to run that Pig Latin script then, again we'll use some PowerShell. I'll just clear the screen here so that we can see what's going on a little bit. In this script we're going to connect to our cluster. We're going to upload that Pig Latin script using the same technique we've seen before. So we just go and upload the Pig Latin file. And then there's a New-Azure HDInsightPigJobDefinition commandlet this time. So very similar to when we ran a Hive job and when we ran a MapReduce job, there's a commandlet for running Pig jobs, and we just simply point at the Pig Latin file that we've uploaded. We could actually use— instead of file we could say -query and actually hard code the Pig Latin script here, but it's easier just to upload the file and point to the file. So we'll start the job and we'll wait for that to complete. We'll display the status that gets returned by Pig when it runs. And then when the job is finished, we know that it stores its results in this twitterdata/wordcounts folder. Again, the default name for the file will be part-r with 5 zeroes. There's only going to be the top 100 records, so the chances are it's only going to be one file. So we can go ahead and connect and download that file and again display the results in the console window. This time we should see the 100 most common words that appeared in the tweets. Let's just resize this to let us see the results a bit more clearly. We'll go ahead and run that PowerShell script. The script file is uploaded and we start the Pig job. And again, this is going to be under the covers and it's going to be translated into a MapReduce job by HDInsight. When the job is complete, the results are downloaded, and what we've got now is the top 100 words in the order of how frequently they're used. So if we take a look at these results, I can see, yep, it's definitely got some information in there, but I can kind of see that some of these words, words like to and in and the and a, yeah, they're very common, but they're not particularly useful for helping me understand what people are concerned about, what the main sort of focus of the tweets is about. So I probably want to modify my code to get rid of some of these words that don't really add any analytical value. Let's take a look and see how we might do that. A common technique for doing this is to use a stop list or a list of noise words that you can filter out. I've actually got in my lookups folder here some noise words. It's just a text file that contains a number of very common words that aren't really particularly useful to me. So we've got them in that file. What we're going to do is modify our Pig code so that we're actually going to filter out those noise words. Here's a modified version of that WordCount Pig script that we just used. Again, we're loading the same Twitter data and again, we're finding each of the words, but where it gets more interesting is we then load data from that noise words file that we've just seen. So we're going to upload that and we're going to load the data from there into this relation here called noise words, which is just basically one word with some characters. Each of those noise words we're going to join to the filtered words relation. So use a join, joining on the idea of find me records in my words that I've been counting that match the noise words that are in my noise words file. And then we'll filter out where we've got the noise word being null. In other words, we've managed to get all of the words that we've counted, but there isn't a corresponding noise word. It's a left outer join, so that will include nulls where there isn't a matching word. So we get rid of all of the rows where we've got a matching noise word, and what we're left with is all of the words that didn't have a noise word that matched them, so the words that are interesting to me. And then the rest of the code is much the same as it was before. We group the words and count them and sort them and then we go and limit them to 100 results and bring out those results into twitterdata/noisewordcounts this time. Similar code to the last time with the exception that we're using a join in order to filter out the words that match our noise word list, or stop list, so that we can get rid of those. The PowerShell to do that, again, as you would imagine, is very similar to what we did before. Really the only difference is that in addition to uploading the Pig Latin script, we upload the noise words stop list file. So we're uploading that stop list file first of all, then we're uploading our revised Pig Latin script, and then we're going to run that Pig job just as we did before, only this time the results won't include any of the words that are in our stop list. So let's go ahead and run that code. We'll just make this window a bit bigger. First of all, the noise list is uploaded and then secondly, the script file is uploaded, and then we start our Pig Latin job. And when the job is complete, I can see the results have been downloaded. If I just take a look at those, I can see that it hasn't got words like the and a and in. Those have been filtered out so I can actually see the words that are useful. Okay, so I'm beginning to get some useful information, but there's a couple of things in here. You can see delayed and delay are both listed there. And I can see if I scroll further down I've got bags here and bag further down as well. There are some words that effectively mean the same thing, but because they're slightly different they've been counted twice. So I probably want to do a little bit more processing of the data to just say delayed means the same thing as delay as far as I'm concerned, so just add those 2 things together. To do that I'm going to actually create a custom function that's going to figure out if 2 words are the same as one another. I'm going to use this WordDistance java code file here. If I open that up and we take a look at that, effectively I'm using a technique called Jaro scoring— J-A-R-O. It's a fairly common statistical technique for seeing if 2 words are similar. So what we're doing is we're passing in the words and we're getting out a score that tells me how similar these 2 words are. And that score is effectively between 0 and 1. One means they are identical, and then 0 means they are completely different, and anything in between indicates how similar we think those words are. So there's a reasonable amount of Java code here to do that for me. This is something that would be pretty difficult to do in something like Pig, so I've had to write some Java code. However, rather than do the whole operation as a custom MapReduce job, I want to continue using Pig and using the same technique that I'm using, but I just want to include that Java code in the Pig processing. And the way that I do that is I create a user-defined function. I've compiled that Java code into this JAR file here, WordDistanceUDF.jar, and what I'm going to do is upload that JAR and register it and then use that Java function from my Pig code. If we have a look at the Pig code here for MatchWords, you can see that I'm registering that JAR that I've uploaded and then I'm loading my Twitter data, I'm finding all the different words in this that are similar to what I did before, I'm loading in the noise words and I'm filtering out all the different noise words, same as I did before, I'm grouping by word. But then what I'm doing is I'm going to create duplicates of the word. So for every word that I find I'll create a duplicate set so I've got 2 of each word, and I'm going to cross-join them so I've got every combination of every word joined to one another. Conceptually at this point you've got— imagine a table with 2 columns. Each column has all of the possible words. So we've got every possible combination of the words that we've found. And then what I'm going to do is I'm going to use this Jaro function of mine. You can see in here I've got the WordDistanceUDF. WordDistance and I'm passing in the different words that I've got in there. I'll get back a double value that tells me the distance between each pair of words. I'm going to filter out the ones that are less than 0.9. I'm going to say if the Jaro distance is 0.9 or higher, they are 90% similar, so they're probably meaning the same thing. And that way I can get a set of words that are synonyms for one another, if you like. And then I'll get those results and I'll store them in this MatchedWords folder. So I should end up with a list of words that are considered to be similar with one another. Let's go ahead and run that one. Very similar code in my PowerShell script— we'll just clear the screen again—to what we've seen before. You can see first of all I'm uploading the WordDistanceUDF.jar same as I've done previously with some scripts. The JAR gets uploaded, the Pig Latin script gets uploaded as well, and then I'm ready to go ahead and run that Pig Latin script and get back the results, which at this time will be a list of words that are similar to one another. So let's go ahead and run that code. First of all, the JAR file gets uploaded, and then the script itself gets uploaded and then we start running the Pig script as a job on the HDInsight server. So when the job is complete, we download our results. And if we take a look at these we can see where there's a Jaro score of 1 the words are obviously identical. But we can see things like seats is considered to be pretty similar to seat and seated. It actually isn't matched to seats. There's a little bit of confusion, I guess, in terms of how things are mapped. It's not quite as precise as we would like. So one of the things that we could do is we could mess around with that Jaro score. We could say we'll bring back the results that are 85% accurate. The problem then is you start to get into issues where Seattle is matched to seated, and of course those don't mean the same thing. The Jaro scoring technique, the custom UDF that we've created, has produced some useful results. I can see things like film and films is considered the same thing and delays and delay and delayed and delay. So it has picked up some things that are pretty useful to match, but there are some cases where it hasn't quite given us the accuracy we need. What I can do with these results, rather than rely on using that technique in my analysis, I can use this as a starting point to create myself a list of synonyms that I then want to use to apply when I'm aggregating my data. So I could take these results and I could go ahead and create a file of recognized synonyms. I could edit this and add additional synonyms as well. So you can see that I've added things like bag and luggage. I'm considering that to be the same thing. So this gives me a bit more freedom rather than just relying on Jaro distance to actually specify these are the synonyms that I want to recognize in my data. What I can then do is I can use those synonyms in another Pig Latin script. So if I have a look at this script here what I'm doing this time is again I'm loading the tweet data, I'm doing exactly the same as I did before. I'm cleansing them out, getting just the words that I'm interested in, filtering them, adding in the noise words using that join and getting rid of the words that are considered stop words that I don't need. And then I'm loading the synonyms file that we've just looked at, and I've got what we call the leading value, which is what I'm going to change any synonyms to match. And then what I'm doing is just simply joining the words that I've found left after joining them to the synonyms so that we include things that don't have synonyms. And then where I have got synonyms I'm generating the actual word. The leading value is what I'm bringing out of that. Then at the end I'm going to union the words that had synonyms with the words that didn't have synonyms so we end up with a complete list of all the words that I'm actually interested in. And where they had a synonym I'm going to merge them to use the same value. Then I'm going to group all of those words just as I did before, count them, sort them, and store them out. So what I should end up with from this Pig script is all of the words not counting any noise words that I don't care about. And where I've got words that mean the same thing I'm going to use one leading value for all of those words and count all of those as one word. And then I'm going to store those top 100 words into this folder here, twitterdata/synonymcounts. To go ahead and run that Pig Latin script, again, we'll just take a look at the PoweShell code here. Similar to before, I'm uploading my noise words, I'm uploading my synonyms, and then I'm uploading my Pig script that I want to run. I'll go ahead and run that using the New-AzureHDInsight PigJobDefinition, run that job, and then go and grab the output once the job is finished and download that output and display it on the screen. Let's go ahead and run this PowerShell script. It uploads the noise words file first of all, and then it's going to upload my synonyms file and finally it's going to upload my Pig script file. And once those files have been uploaded, it's then going to start a Pig job to run that Pig script and actually go and process the data. So the Pig job starts. And once again, we'll just wait for it to complete. When the job is complete, we download the results again. We can see some useful information here about the words that have been found. It's merged together delayed and delays and delay into one single count here for words that pretty much mean that people were concerned about being delayed. So now that I've got the data downloaded, I can actually do some analysis with that. I could, for example, fire up Excel here and then I'll just simply go ahead and open up the file that's been downloaded by my script. That would have been in this folder here, synonymcounts, and it's a text file that's been downloaded, except of course it doesn't have an extension. Here it is here. We'll go ahead and open that. It's delimited and it's tab-delimited. So here's the results of my Pig script. I could go ahead and I could add some conditional formatting here to help me see what's actually going on. And I can start to analyze that data and see what are the most common words that are used in tweets about the airline and what are the things that concern our customers. Obviously we could continue iterating, we could make our code more and more complex and do even deeper analysis of the data that we've found, but I guess the point of the scenario is to talk about the iterative approach of exploring data. So one of the scenarios for big data is very much we don't really have a clear idea of what it is we're trying to find; we're just really interested in what are people saying about us on Twitter and is there anything useful that we can find in the Twitter data. And what we've done is used a series of iterations, of exploring the data. Sometimes you might go down a blind alley, sometimes you might find that you get some results and that leads you to a totally different avenue of exploration, but we're able to explore the data just by making our scripts more complex, by mixing and matching with Pig, with Hive, with MapReduce code written in Java in order to just examine the data and find out anything interesting that's in it. That's a really common approach to big data analysis, and one of the benefits of using HDInsight for that type of thing is of course it's a service in Azure. You're only paying for it when you need it, so you spin the cluster up, you use it to explore your data, and then when you're done you can tear the cluster down and you're no longer paying for it. So there's no capital outlay to get that cluster in place. Hopefully you've found this scenario useful, a chance to see the kind of things we can do when we're exploring data in a kind of ad hoc, iterative, exploratory fashion. And hopefully you'll find HDInsight a useful tool for doing that with your own data. [Microsoft] [Microsoft] [Microsoft]

Video Details

Duration: 37 minutes and 21 seconds
Country: United States
Language: English
License: All rights reserved
Genre: None
Views: 57
Posted by: asoboleva99 on Jan 29, 2015

Demo do not tranlslate ----- (Please provide translations for these languages: English (eng), Russian (rus).)

Caption and Translate

    Sign In/Register for Dotsub to translate this video.