Apache Avro is a serialization framework that produces data in a compact binary format that
doesn't require proxy objects or code generation. Get to know Avro, and learn how to use it with
Apache Hadoop.
Apache Avro is a framework that allows you to serialize data in a format that has a schema built
in. The serialized data is in a compact binary format that doesn't require proxy objects or code
generation. Instead of using generated proxy libraries and strong typing, Avro relies heavily on the
schemas that are sent along with the serialized data. Including schemas with the Avro messages
allows any application to deserialize the data.
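Because every Avro message or file carries its schema, an application can deserialize records generically, without any generated classes. The following sketch (the class name, schema, and field values are illustrative, not from this article's sample code) parses a schema at run time, serializes a record to an in-memory buffer, and reads it back using only the schema:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class GenericRoundTrip {

    // A schema defined entirely at run time -- no code generation involved
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Automobile\",\"fields\":["
            + "{\"name\":\"modelName\",\"type\":\"string\"},"
            + "{\"name\":\"modelYear\",\"type\":\"int\"}]}";

    public static GenericRecord roundTrip() throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord auto = new GenericData.Record(schema);
        auto.put("modelName", "Model T");
        auto.put("modelYear", 1909);

        // Serialize to a compact binary buffer
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(auto, encoder);
        encoder.flush();

        // Deserialize using only the schema -- no proxy classes required
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    }

    public static void main(String[] args) throws Exception {
        GenericRecord copy = roundTrip();
        System.out.println(copy.get("modelName") + " / " + copy.get("modelYear"));
    }
}
```

Note that strings deserialize as Avro's Utf8 type, so call toString() on them before comparing with java.lang.String values.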
This article shows how to use Avro to define a schema, create and serialize some data, and send
it from one application to another. I use the Eclipse integrated development environment (IDE) to
create a sample application demonstrating the use of Avro. Download the sample code for this
application.
Installation
To install Avro, download the libraries from the Apache Avro site (a link is in Resources) and
reference them in your code. Alternatively, use Apache Maven or Apache Ivy to download the Java
archive (JAR) files and all of their dependencies automatically. An example of the Maven entry is
provided in Listing 1.
<dependencies>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-tools</artifactId>
<version>1.7.5</version>
</dependency>
</dependencies>
See Resources to download libraries for other implementations, such as Python, Ruby, PHP, or
C#.
Create a project
This article uses the Eclipse IDE with the m2e (Maven 2) plug-in installed to build the examples for
using Avro. You will need a new project in which to create your sample schemas, compile them,
and run samples. To create a new project in Eclipse, click File > New > Maven Project. Select the
check box next to Create a simple project to quickly create a project that supports Maven, which
this article uses to handle the Avro dependencies.
As noted in Download the libraries, one easy way to get Avro and all of its dependencies without
doing it manually is to use Maven or IVY. The sample code included here uses a Maven pom.xml
file to configure the Avro dependency and download it automatically from inside Eclipse. For more
information about how Maven works, see Resources.
To create your own pom.xml file, create a new XML file by clicking File > New > Other. Select
XML File from the list, then name the file pom.xml. When you have created the file, place the entire
contents of Listing 2 inside the file and save.
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>avroSample</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>avroSample</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.6</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-tools</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>2.0-beta9</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.0-beta9</version>
</dependency>
</dependencies>
</project>
After you have created the pom.xml file, resolve the dependencies. Doing so downloads the Avro
JAR files along with all of the other libraries that Avro requires. To resolve the dependencies in
your Eclipse IDE, save the pom.xml file. When Eclipse builds the project, the Maven Builder will
automatically download the new dependencies.
Avro schemas
Avro schemas describe the format of the message and are defined using JavaScript Object
Notation (JSON). The JSON schema content is put into a file that the tools will reference later.
Defining a schema
Before creating a schema in your Eclipse project, create a source folder in which to put the
schema files. To create the new source folder, click File > New > Source Folder. Enter src/main/
avro and click Finish. The source folder follows Maven conventions, so if you want to use the
Maven plug-in to generate sources, you can.
Now that you have a source folder that will contain the schema files, create a new file inside the
source folder by clicking File > New > File. Give your new file a name such as automobile.avsc;
the .avsc extension is the one the Avro Maven plug-in looks for by default.
Now, you can add JSON to define the schema. Listing 3 demonstrates a simple schema that
defines an automobile.
{
"namespace" : "com.example.avroSample.model",
"type" : "record",
"name" : "Automobile",
"fields" : [
{ "name" : "modelName", "type" : "string" },
{ "name" : "make", "type" : "string" },
{ "name" : "modelYear", "type" : "int" },
{ "name" : "passengerCapacity", "type" : "int" }
]
}
The Automobile type is a record, which is a complex type that contains a list of fields. The record
complex type requires the attributes shown below.
name: The name of the object. (This becomes the class name, Automobile in this example, when
the code is generated.)
Each field in the list of fields contains data about its name and type. Table 2 contains a list of the
attributes of a field.
name: The name of the field. (This becomes the property name when the code is generated.)
type: The type of the field, such as string or int.
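Beyond the required name and type, a field can also carry a doc attribute, and its type can be a union that permits null, combined with a default value. A sketch of such a field follows; the trimLevel field is an illustration, not part of the sample schema:

```json
{
"name" : "trimLevel",
"type" : [ "null", "string" ],
"default" : null,
"doc" : "Optional trim level; null when unknown"
}
```

With a union type like this, older data that lacks the field can still be read, because the default value fills the gap.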
One convenient way to compile the schema into Java sources is to add the avro-maven-plugin to
the pom.xml file and bind its schema goal to the generate-sources phase, as shown below.
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<!-- snipped -->
<build>
<plugins>
<plugin>
<groupId>org.apache.avro</groupId>
<artifactId>avro-maven-plugin</artifactId>
<version>1.7.5</version>
<executions>
<execution>
<phase>generate-sources</phase>
<goals>
<goal>schema</goal>
</goals>
<configuration>
<sourceDirectory>
${project.basedir}/src/main/avro/
</sourceDirectory>
<outputDirectory>
${project.basedir}/src/main/java/
</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.6</source>
<target>1.6</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<!-- snipped -->
</dependencies>
</project>
Now that you have the plug-in in the Maven pom.xml file, Maven will generate the sources for you
automatically during the generate-sources phase, which is executed before the compile phase. For
more information about the Maven life cycle, see Resources.
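Because the plug-in is bound to the lifecycle, a plain build regenerates the classes; you can also invoke the phase directly. Both commands below assume they are run from the project root:

```shell
# Run only the code-generation step
mvn generate-sources

# Or run a full build; generate-sources executes before compile
mvn compile
```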
The generated Java class file will be located in the src/main/java folder. Listing 6 shows some of
the example Automobile class, which has been generated with the com.example.avroSample.model
package specified in the schema namespace.
/**
* Default constructor. Note that this does not initialize fields
* to their default values from the schema. If that is desired then
* one should use newBuilder().
*/
public Automobile() {}
/**
* All-args constructor.
*/
public Automobile(java.lang.CharSequence modelName,
java.lang.CharSequence make, java.lang.Integer modelYear,
java.lang.Integer passengerCapacity) {
this.modelName = modelName;
this.make = make;
this.modelYear = modelYear;
this.passengerCapacity = passengerCapacity;
}
/* snipped... */
}
Now that you have generated the source files, you can use them in a Java application to see how
it works.
Listing 7 contains an example of Java code that uses a Builder to return an instance of an object.
// Build an instance; the values here are example values
Automobile auto = Automobile.newBuilder()
.setModelName("Model T")
.setMake("Ford")
.setModelYear(1909)
.setPassengerCapacity(2)
.build();
// Write the object to a file in the target folder
File outputFile = new File("target/autos.avro");
DatumWriter<Automobile> datumWriter =
new SpecificDatumWriter<Automobile>(Automobile.class);
DataFileWriter<Automobile> fileWriter =
new DataFileWriter<Automobile>(datumWriter);
try {
fileWriter.create(auto.getSchema(), outputFile);
fileWriter.append(auto);
fileWriter.close();
} catch (IOException e) {
LOGGER.error("Error while trying to write the object to file <"
+ outputFile.getAbsolutePath() + ">.", e);
}
Values are assigned to the properties on the object before the build() method is called to return
the instance. A DatumWriter is used to write the data from the object through the DataFileWriter,
which writes the data contained in the object to a file while conforming to the supplied schema.
After you create a DataFileWriter implementation and call create() with the schema and file, you
can add objects to the file by using the append() method, as shown in Listing 7.
To execute the sample code, either put the code in the main method of a Java class or add it to a
unit test. In the sample code available with this article, the code is added to a unit test that you can
execute in Eclipse or Maven.
When you execute this code, the serializer creates a file called autos.avro in the target folder of
the project. Because the file is in a compact binary format, you won't be able to usefully inspect
its contents in a normal text editor.
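If you do want to peek inside the file, the avro-tools JAR that the pom.xml already references provides a tojson command that prints each record as JSON. The JAR path shown here is an assumption; point it at the copy in your local Maven repository:

```shell
# Print each serialized record in autos.avro as a line of JSON
java -jar avro-tools-1.7.5.jar tojson target/autos.avro
```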
Listing 8 shows Java code that can be used to deserialize an Automobile object from the same file
that was created in the previous step.
DatumReader<Automobile> datumReader =
new SpecificDatumReader<Automobile>(Automobile.class);
Automobile auto = null;
try {
DataFileReader<Automobile> fileReader =
new DataFileReader<Automobile>(outputFile, datumReader);
if (fileReader.hasNext()) {
auto = fileReader.next(auto);
}
fileReader.close();
} catch (IOException e) {
LOGGER.error("Error while trying to read the object from file <"
+ outputFile.getAbsolutePath() + ">.", e);
}
This time, the code uses a DatumReader, which reads the content through the DataFileReader
implementation.
The DataFileReader is an iterator (that is, it implements the Iterator interface), so it reads through
the file and returns objects through the next() method. This behavior will be familiar to anyone
who has worked with ResultSet objects in JDBC.
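Because the reader iterates, a file holding many records is consumed with an ordinary loop. The following self-contained sketch uses Avro's generic API rather than the article's generated Automobile class (the schema and values are illustrative); it writes a few records and then counts them back:

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ReadAll {

    private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Automobile\",\"fields\":["
            + "{\"name\":\"modelName\",\"type\":\"string\"}]}");

    public static int countRecords(File file, int howMany) throws Exception {
        // Write a few records first so there is something to iterate over
        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(SCHEMA));
        writer.create(SCHEMA, file);
        for (int i = 0; i < howMany; i++) {
            GenericRecord auto = new GenericData.Record(SCHEMA);
            auto.put("modelName", "Model-" + i);
            writer.append(auto);
        }
        writer.close();

        // Iterate with hasNext()/next(), like stepping through a ResultSet
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                file, new GenericDatumReader<GenericRecord>(SCHEMA));
        int count = 0;
        GenericRecord reuse = null;
        while (reader.hasNext()) {
            reuse = reader.next(reuse); // reusing the record avoids allocation
            count++;
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("autos", ".avro");
        System.out.println(countRecords(f, 3));
        f.delete();
    }
}
```

Passing the previous record back into next() lets Avro reuse the object, which matters when scanning millions of records.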
The code shown in Listing 9 ensures that the values on the properties are the same values
originally assigned to the object before it was written to the file. When executed as a unit test,
these assertions pass.
// The expected* variables hold the same values that were assigned
// to the object before it was written to the file
assertEquals(expectedModelName, auto.getModelName().toString());
assertEquals(expectedMake, auto.getMake().toString());
assertEquals(expectedModelYear, auto.getModelYear());
assertEquals(expectedPassengerCapacity, auto.getPassengerCapacity());
Now that you have seen Avro in action both writing and reading from a file, you can see how it
works with Apache Hadoop.
Avro allows you to use complex data structures within Hadoop MapReduce jobs. To try out
MapReduce functionality, you need to add dependencies to the pom.xml file to pull in the Hadoop
library and Avro MapReduce library. To add the dependencies, modify your pom.xml file to look like
Listing 10.
Listing 10. Adding the dependencies for Avro Hadoop libraries to the pom.xml file
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<build>
<!-- snipped -->
</build>
<dependencies>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-mapred</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.1.0</version>
</dependency>
</dependencies>
</project>
Add the dependencies to the pom.xml file and save it. The m2e plug-in automatically downloads
the newly added JAR files and their dependencies.
After resolving the dependencies, create an example using Avro and MapReduce that shows
counts by model name in a list of automobiles, building off the prior examples. The model count
example is based on the color count example on the Avro site (see Resources).
The example includes three classes: one that extends AvroMapper, one that extends AvroReducer,
and a class with code to initiate the MapReduce job and write the results.
Extending AvroMapper
A class that extends AvroMapper inherits the plumbing it needs to collect, or map, data within a
Hadoop job. To demonstrate the capabilities of Avro in MapReduce functions, create a small class,
as shown below.
import java.io.IOException;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.Reporter;
import com.example.avroSample.model.Automobile;
/**
* This class counts the number of models of automobiles found.
*/
public final class ModelCountMapper extends
AvroMapper<Automobile, Pair<CharSequence, Integer>> {

@Override
public void map(Automobile datum,
AvroCollector<Pair<CharSequence, Integer>> collector,
Reporter reporter) throws IOException {
collector.collect(
new Pair<CharSequence, Integer>(datum.getModelName(), 1));
}
}
The map method simply retrieves the model name from the passed-in Automobile object and emits
a count of 1 for each occurrence of that name. No arithmetic happens in the map method; the
math that summarizes the counts belongs to the class that extends AvroReducer.
Extending AvroReducer
Listing 12 shows the class that extends AvroReducer. This class accepts the values that the
ModelCountMapper object has collected and summarizes them by looping through the values.
import java.io.IOException;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.Reporter;
/**
* This class "reduces" the counts collected by the mapper into a sum
* for each model name.
*/
public final class ModelCountReducer extends
AvroReducer<CharSequence, Integer, Pair<CharSequence, Integer>> {

@Override
public void reduce(CharSequence modelName, Iterable<Integer> values,
AvroCollector<Pair<CharSequence, Integer>> collector, Reporter reporter)
throws IOException {
int sum = 0;
for (Integer value : values) {
sum += value;
}
collector.collect(new Pair<CharSequence, Integer>(modelName, sum));
}
}
The reduce method in the ModelCountReducer class "reduces" the values the mapper collects into
a derived value, which in this case is a simple sum of the values.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Type;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.example.avroSample.model.Automobile;
public class ModelNameCountApp extends Configured implements Tool {

private static final Logger LOGGER =
LogManager.getLogger(ModelNameCountApp.class);

@Override
public int run(String[] args) throws Exception {
JobConf job = new JobConf(getConf(), ModelNameCountApp.class);
job.setJobName("modelNameCounts");

FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

AvroJob.setMapperClass(job, ModelCountMapper.class);
AvroJob.setReducerClass(job, ModelCountReducer.class);
AvroJob.setInputSchema(job, Automobile.getClassSchema());
AvroJob.setOutputSchema(
job,
Pair.getPairSchema(Schema.create(Type.STRING),
Schema.create(Type.INT)));

JobClient.runJob(job);
return 0;
}
/**
* Creates an instance of this class and executes it to provide a callable
* entry point.
*/
public static void main(String[] args) {
new File(args[0]).mkdir();
new File(args[1]).mkdir();
int result = 0;
try {
result = new ModelNameCountApp().run(args);
} catch (Exception e) {
result = -1;
LOGGER.error("An error occurred while trying to run the example", e);
}
if (result == 0) {
LOGGER.info("SUCCESS");
} else {
LOGGER.fatal("FAILED");
}
}
}
This example creates an instance of a JobConf object, then uses the AvroJob class to configure the
job before executing it.
To run the example from inside Eclipse, click Run > Run Configurations to open the Run
Configurations wizard. Select Java Application from the list, then click New. Enter the name of
the project and the main class name (com.example.avroSample.mapReduce.ModelNameCountApp) on the
Main tab. In the Program Arguments box, enter two directory names: the directory that holds the
output from the tests you ran earlier (the input to the job), followed by a suitable output directory,
such as target/avro target/mapreduce.
When you are finished adding in the directory names, click Run to run the example.
When you run the example, the job uses the map() method of the ModelCountMapper object to
collect the model names, emitting a count of 1 for each occurrence, and then applies the reduce()
method of the ModelCountReducer object to the collected keys and values. In this example, the
processing simply adds the counts to summarize them.
Conclusion
Apache Avro is a serialization framework that allows you to write object data to a stream. Avro
includes the schema, in JSON format, with the serialized data, which allows applications to work
with different versions of objects. Although code generation and proxy objects are not required,
you can use the Avro tools to generate Java classes that make working with the objects easier.
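As a concrete illustration of that versioning, the sketch below (the schemas are illustrative, not the article's sample) writes a record with an old schema and reads it with a newer one that adds a field with a default value; Avro's schema resolution fills in the default:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolution {

    static final Schema V1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Automobile\",\"fields\":["
            + "{\"name\":\"modelName\",\"type\":\"string\"}]}");

    // V2 adds modelYear with a default, so old data still deserializes
    static final Schema V2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Automobile\",\"fields\":["
            + "{\"name\":\"modelName\",\"type\":\"string\"},"
            + "{\"name\":\"modelYear\",\"type\":\"int\",\"default\":0}]}");

    public static GenericRecord readOldDataWithNewSchema() throws Exception {
        GenericRecord old = new GenericData.Record(V1);
        old.put("modelName", "Model T");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(V1).write(old, enc);
        enc.flush();

        // The reader is told both the writer's schema (V1) and its own (V2)
        GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<GenericRecord>(V1, V2);
        return reader.read(null,
                DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    }

    public static void main(String[] args) throws Exception {
        GenericRecord r = readOldDataWithNewSchema();
        System.out.println(r.get("modelName") + " / " + r.get("modelYear"));
    }
}
```

The same mechanism works in reverse: a field removed in the reader schema is simply skipped when old data is read.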
Avro also works well with Hadoop MapReduce. MapReduce allows you to do large-scale
processing on many objects to do calculations across many processors simultaneously. You can
use Avro and MapReduce together to process many items serialized with Avro's small binary
format.
Downloads
Sample code: avro-sample.zip (11KB)
Resources
Learn
Review the ColorCount example, which demonstrates how to use Avro with MapReduce.
Learn more about Apache Maven and the Maven build life cycle.
Read more about Apache Avro 1.7.5 and the Avro JSON schema.
Read more about Apache Ivy and Hadoop.
Take this free course from Big Data University on Hadoop Reporting and Analysis (log-in
required). Learn how to build your own Hadoop/big data reports over relevant Hadoop
technologies such as HBase, Hive, etc., and get guidance on how to choose between various
reporting techniques: Direct Batch Reports, Live Exploration, and Indirect Batch Analysis.
Learn the basics of Hadoop with this free Hadoop Fundamentals course from Big Data
University (log-in required). Learn about the Hadoop architecture, HDFS, MapReduce, Pig,
Hive, JAQL, Flume, and many other related Hadoop technologies. Practice with hands-on
labs on a Hadoop cluster using any of these methods: on the Cloud, with the supplied
VMWare image, or install locally.
Explore free courses from Big Data University on topics ranging from Hadoop Fundamentals
and text analytics essentials to SQL access for Hadoop and real-time stream computing.
Create your own Hadoop cluster with this free course from Big Data University (log-in
required).
Learn more about big data in the developerWorks big data content area. Find technical
documentation, how-to articles, education, downloads, product information, and more.
Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based
offering that extends the value of open source Hadoop with features like Big SQL, text
analytics, and BigSheets.
Follow these self-paced tutorials (PDF) to learn how to manage your big data environment,
import data for analysis, analyze data with BigSheets, develop your first big data application,
develop Big SQL queries to analyze big data, and create an extractor to derive insights from
text documents with InfoSphere BigInsights.
Find resources to help you get started with InfoSphere Streams, IBM's high-performance
computing platform that enables user-developed applications to rapidly ingest, analyze, and
correlate information as it arrives from thousands of real-time sources.
Stay current with developerWorks technical events and webcasts.
Follow developerWorks on Twitter.
Nathan A. Good lives in the Twin Cities area of Minnesota. Professionally, he does
software development, software architecture, and systems administration. When
he's not writing software, he enjoys building PCs and servers, reading about and
working with new technologies, and trying to get his friends to make the move to
open source software. He's written and co-written many books and articles, including
Professional Red Hat Enterprise Linux 3, Regular Expression Recipes: A Problem-
Solution Approach, and Foundations of PEAR: Rapid PHP Development.