Hadoop Tutorial: Hello World - An Overview of Hadoop with HCatalog, Hive and Pig
Hortonworks Sandbox Tutorial 1: Hello World - An Overview of Hadoop with Hive and Pig
hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/
12/27/13
The Apache Hadoop projects provide a series of tools designed to solve big data problems. A Hadoop cluster implements parallel computing on inexpensive commodity hardware. The data is partitioned across many servers, giving near-linear scalability. The design philosophy of the cluster is to bring the computing to the data: each datanode holds part of the overall data and is able to process the data that it holds. The overall framework for the processing software is called MapReduce. Here's a short video introduction to MapReduce: Introduction to MapReduce
Apache Hadoop can be useful across a range of use cases spanning virtually every vertical industry. It is becoming popular anywhere that you need to store, process, and analyze large volumes of data. Examples include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive modeling for new drugs, retail in-store behavior analysis, and mobile device location-based marketing.
Apache Hive
The Apache Hive project provides a data warehouse view of the data in HDFS. Using a SQL-like language, Hive lets you create summarizations of your data, perform ad-hoc queries, and analyze large datasets in the Hadoop cluster. The overall approach with Hive is to project a table structure onto the dataset and then manipulate it with HiveQL. Since the data lives in HDFS, your operations scale across all the datanodes and you can manipulate huge datasets.
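To make this concrete, here is a minimal sketch of what "projecting a table structure and querying it with HiveQL" looks like. The table and column names below are illustrative only; they are not part of this tutorial's dataset:

```sql
-- Project a table structure onto tab-delimited data in HDFS
-- (table and column names here are hypothetical).
CREATE TABLE page_views (user_id STRING, url STRING, view_time BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- A familiar SQL-style summarization; Hive compiles this
-- into MapReduce jobs that run across the datanodes.
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id;
```

Anyone comfortable with SQL can read and write queries like this without knowing anything about MapReduce.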
Apache HCatalog
The function of HCatalog is to hold location and metadata about the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from the data's location and from metadata such as the schema. Additionally, since HCatalog supports many tools, like Hive and Pig, the location and metadata can be shared between them. Through HCatalog's open APIs, other tools such as Teradata Aster can also use this location and metadata. In the tutorials we will see how we can reference data simply by name and inherit the location and metadata.
Apache Pig
Pig is a language for expressing data analysis and infrastructure processes. Pig scripts are translated into a series of MapReduce jobs that are run by the Hadoop cluster, and the language is extensible through user-defined functions that can be written in Java and other languages. Pig scripts provide a high-level language for creating the MapReduce jobs needed to process data in a Hadoop cluster. That's all for now; let's get started with some examples of using these tools together to solve real problems!
Using HDP
Here we go! We're going to walk you through a series of step-by-step tutorials to get you up and running with the Hortonworks Data Platform (HDP).
The file is about 11 megabytes and may take a few minutes to download. Fortunately, to learn 'Big Data' you don't have to use a massive dataset; you only need tools that scale to massive datasets. Click the link and save the file to your computer.
The File Browser interface should be familiar to you as it is similar to the file manager on a Windows PC or Mac. We begin in our home directory. This is where we'll store the results of our work. File Browser also lets us upload files.
Uploading a File
To upload the example data you just downloaded,
1. Select the 'Upload' button.
2. Select 'Files' and a pop-up window will appear.
3. Click the button which says 'Upload a file'.
4. Locate the example data file you downloaded and select it.
A progress meter will appear. The upload may take a few moments. When it is complete you'll see this:
Now click the file name "NYSE-2000-2001.tsv.gz". You'll see it displayed in tabular form:
You can use File Browser just like your own computer's file manager. Next, register the dataset with HCatalog.
Select "Create a new table from file" from the Actions menu on the left.
Fill in the Table Name field with 'nyse_stocks', then click the 'Choose a file' button and select the file we just uploaded, 'NYSE-2000-2001.tsv.gz'.
You will now see the options for importing your file into a table. The File options should be fine. In the Table preview, set all text fields to Column Type 'string' and all decimal fields (e.g. 12.55) to 'float'. The one exception is the 'stock_volume' field, which should be set to 'bigint'. When everything is complete, click the "Create Table" button at the bottom.
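For reference, the wizard is doing roughly what a hand-written HiveQL statement would do. Here is a hedged sketch of the equivalent DDL; the column names and types are assumptions based on this step's preview, so check the wizard's Table preview for the exact list:

```sql
-- Approximate hand-written equivalent of the import wizard.
-- Column names and types are assumptions, not taken from the tutorial.
CREATE TABLE nyse_stocks (
  exchange              STRING,
  stock_symbol          STRING,
  date                  STRING,
  stock_price_open      FLOAT,
  stock_price_high      FLOAT,
  stock_price_low       FLOAT,
  stock_price_close     FLOAT,
  stock_volume          BIGINT,
  stock_price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```

The point of the wizard is that you never have to write this by hand: HCatalog records the schema and location once, and every tool can reuse them.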
On the right-hand side there is a query window and an Execute button. We will type our queries in the query window and click Execute when a query is ready. Note: the composition window is limited to one query; you cannot type multiple queries separated by semicolons. Since we created our table in HCatalog, Hive automatically knows about it. We can see the tables that Hive knows about by clicking on the Tables tab.
In the list of tables you will see our table, nyse_stocks. Hive inherits the schema and location information from HCatalog. This separates meta-information like schema and location from the queries. If we did not have HCatalog, we would have to build the table by providing location and schema information. We can see the records by typing "select * from nyse_stocks" in the Query window. Our results would be:
We can count the records with the query "select count(*) from nyse_stocks". You can click on the Beeswax icon to get back to the query screen. Evaluate the expression by typing it in the query window and hitting Execute.
This job takes longer and you can watch the job running in the log. When the job is complete you will see the results posted in the Results tab.
You can select specific records by using a query like: select * from nyse_stocks where stock_symbol = "IBM".
So we have seen how we can use Apache Hive to easily query our data in HDFS using the Hive query language. We took full advantage of HCatalog, so we did not have to specify the schema or location of the data. Apache Hive allows people who are knowledgeable in query languages like SQL to become immediately productive with Apache Hadoop. Once they know the schema of the data, they can quickly and easily formulate queries.
Let's get started. To get to the Pig interface, click on the Pig icon in the icon bar at the top. This will bring up the Pig user interface. On the left is a list of your scripts, and on the right is a composition box for your scripts. A special feature of the interface is the Pig helper at the bottom, which provides templates for statements, functions, I/O statements, HCatLoader() and Python user-defined functions. At the very bottom are status areas that will show the results of your script and the log files.
So now we have our table loaded into Pig, stored in the variable "a".
Now we have extracted all the records with IBM as the stock_symbol.
So the variable "d" will contain the average volume of IBM stock when this line is executed.
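Putting the steps together, the whole script is only a handful of lines. The screenshots that showed it are not reproduced here, so treat this as a sketch that follows the "a" through "d" aliases used above:

```pig
-- Load the table registered in HCatalog; the schema comes with it.
a = LOAD 'nyse_stocks' USING org.apache.hcatalog.pig.HCatLoader();
-- Keep only the IBM records.
b = FILTER a BY stock_symbol == 'IBM';
-- Collapse everything that is left into a single group.
c = GROUP b ALL;
-- Compute the average volume over that group.
d = FOREACH c GENERATE AVG(b.stock_volume);
-- Print the result.
DUMP d;
```

Because HCatLoader() supplies the location and schema, the script never mentions file paths or column types.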
We can save our completed script using the Save button at the bottom and then Execute it. This will create one or more MapReduce jobs, and after they run we will get our results. Click Save, then click Execute to run the script. Below the Execute button is a progress bar that shows you how things are running. When the job completes, you will see the results, the average of stock_volume, in the green box. Click on the Logs link to see what happened when your script ran; this is where you will see any error messages. The log may scroll below the edge of your window, so you may have to scroll down.
Summary
Now we have a complete script that computes the average volume of IBM stock. You can download the results by clicking on the green download icon above the green box.
If you look at what our script has done:
1. It pulled in the data from our table using HCatalog. We took advantage of the fact that HCatalog provided the location and schema information, so if either changes in the future we would not have to rewrite our script.
2. Pig then went through all the rows in the table and discarded the ones where the stock_symbol field is not "IBM".
3. An index was built for the remaining records.
4. The average of stock_volume was calculated over those records.
We did it with 5 lines of Pig script code!
Feedback
We are eager to hear your feedback on this tutorial. Please let us know what you think: click here to take the survey.
If you're having difficulties configuring the Sandbox for these tutorials, you might find the answer in the Sandbox Forums.