
Develop and Run Hadoop Streaming MapReduce jobs on Windows Azure HDInsight Service

[Develop and Run a Hadoop Streaming MapReduce Job on Windows Azure HDInsight Service]

MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. Most MapReduce jobs are written in Java, but Hadoop provides a streaming API that lets me write the Map and Reduce functions in languages other than Java. This video shows how to run the C# streaming sample from the HDInsight sample gallery and how to use C# programs with the Hadoop streaming interface.

Provisioning an HDInsight cluster usually involves the following steps. First, I create a Windows Azure account and enable the HDInsight preview. An HDInsight cluster uses Windows Azure Blob storage as its default file system, so I must create a storage account before I can provision the cluster. I won't demonstrate the provisioning process in this video; for more information, see Getting Started with Windows Azure HDInsight Service.

To open the HDInsight sample gallery, sign in to the management portal and click HDInsight. The cluster I'm going to use is called hdi0501. Click Manage Cluster and enter my credentials. This screen is called the Cluster Dashboard. From here, I can open the interactive console, open a Remote Desktop session to the virtual machine, monitor the cluster, and check the job history. Notice that the number on the Job History tile is zero, meaning no jobs have been executed on this cluster.

Click Samples to open the sample gallery. There are currently five samples deployed; the C# streaming sample is the one I will demonstrate in this video. From the HDInsight sample gallery, click C# Streaming. The page has a short description, a Details section, and a Downloads section. Hadoop C# Streaming contains the Visual Studio project for the mapper and reducer. Hadoop-streaming.jar is the Hadoop streaming JAR file. Cat.exe is the compiled mapper, written in C#, and wc.exe is the compiled reducer, written in C#.
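For context, the downloads listed above are the same pieces a Hadoop streaming job is assembled from on the command line. The following is an illustrative sketch only; the paths and folder names are assumptions for this example, not values taken from the video.

```shell
# Sketch of a Hadoop streaming invocation using the downloaded files.
# Paths below are hypothetical; the sample's actual locations may differ.
hadoop jar hadoop-streaming.jar \
    -files "cat.exe,wc.exe" \
    -mapper cat.exe \
    -reducer wc.exe \
    -input  /example/data/davinci.txt \
    -output /example/data/StreamingOutput
```

The Create Job page in the portal fills in the equivalent of these parameters for you when you deploy the sample.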
And davinci.txt is the MapReduce job's input file.

Both the mapper and the reducer read characters line by line from the standard input stream and write to the standard output stream. The mapper code in the cat.cs file uses a StreamReader object to read each line of the incoming stream and writes it back to the standard output stream with the static Console.WriteLine method. The reducer code in the wc.cs file reads the characters from the standard input stream that were output by the cat.exe mapper. As it reads lines with the Console.ReadLine method, it counts the words by counting the space and end-of-line characters that terminate each word, and then it writes the total to the standard output stream with the Console.WriteLine method.

To deploy the sample, click Deploy to Your Cluster. This opens the Create Job interface with the fields already populated. If you are writing your own mapper and reducer programs, you can upload the executables to the cluster using the interactive JavaScript console and then use Create Job from the Cluster Dashboard. On the Create Job page, I can customize the parameters if I want to, for example, to use a different input file instead of the default one. The first parameter shows the path to the two executables, the second parameter shows the input file name and the output file name (the output file goes in the StreamingOutput folder), and the last parameter specifies the mapper and the reducer.

Click Execute Job to run the job. The time it takes depends on the number of nodes in the cluster and the size of the input file. When the Completed Successfully message appears, my job is done. Now that the MapReduce job has completed, I'm going to use the interactive JavaScript console to check the result. Go back to the Cluster Dashboard; notice that the Job History tile now shows 1, indicating that one job has been executed. Click Interactive Console.
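The mapper and reducer described above follow a simple stdin/stdout contract. The following is a minimal C# sketch of that pattern, reconstructed from the description in the video rather than copied from the sample's cat.cs and wc.cs sources; in the real sample, each method would be the Main of its own executable (cat.exe and wc.exe).

```csharp
using System;
using System.IO;

public static class StreamingSketch
{
    // Sketch of the cat.exe mapper: echo each stdin line to stdout.
    public static void RunMapper(TextReader input, TextWriter output)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            output.WriteLine(line);  // pass each line through unchanged
        }
    }

    // Sketch of the wc.exe reducer: count the words in everything the
    // mapper emitted, then write a single total to stdout.
    public static void RunReducer(TextReader input, TextWriter output)
    {
        int count = 0;
        string line;
        while ((line = input.ReadLine()) != null)
        {
            // Count whitespace-delimited words; empty entries are skipped.
            count += line.Split(new[] { ' ', '\t' },
                                StringSplitOptions.RemoveEmptyEntries).Length;
        }
        output.WriteLine(count);
    }

    public static void Main()
    {
        // When deployed, Hadoop streaming wires stdin/stdout between the
        // input split, the mapper process, the reducer process, and the
        // output file; here the reducer simply reads the console.
        RunReducer(Console.In, Console.Out);
    }
}
```

Because Hadoop streaming only sees byte streams between processes, any language that can read stdin and write stdout can play either role, which is exactly what the sample demonstrates with C#.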
To check the result, I look at the file in the StreamingOutput output folder. The default final output file name is part-00000, and I use the cat command to display its contents. There are 232,536 words in the input file. Please visit and search for HDInsight to find more related articles and videos. [Microsoft]
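In the interactive JavaScript console, that check looks roughly like the session below. This is a hedged sketch: the prompt and the output path are illustrative assumptions, with #ls and #cat as the console's file-system commands and part-00000 as the default reducer output file, which here would contain the single word-count total written by wc.exe.

```
js> #ls /example/data/StreamingOutput
js> #cat /example/data/StreamingOutput/part-00000
232536
```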

Video Details

Duration: 5 minutes and 27 seconds
Country: United States
Language: English
Genre: None
Views: 6
Posted by: asoboleva99 on Aug 14, 2013

