Apache Avro is a serialization framework that produces data in a compact binary format that
doesn't require proxy objects or code generation. Get to know Avro, and learn how to use it with
Apache Hadoop.
Apache Avro is a framework that allows you to serialize data in a format that has a schema built
in. The serialized data is in a compact binary format that doesn't require proxy objects or code
generation. Instead of using generated proxy libraries and strong typing, Avro relies heavily on the
schemas that are sent along with the serialized data. Including schemas with the Avro messages
allows any application to deserialize the data.
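Because every Avro message or file carries its schema, an application can deserialize records generically, without any generated classes. The following sketch (the class name, schema, and field values are illustrative, not from this article's sample code) parses a schema at run time, serializes a record to an in-memory buffer, and reads it back using only the schema:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class GenericRoundTrip {

    // A schema defined entirely at run time -- no code generation involved
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Automobile\",\"fields\":["
            + "{\"name\":\"modelName\",\"type\":\"string\"},"
            + "{\"name\":\"modelYear\",\"type\":\"int\"}]}";

    public static GenericRecord roundTrip() throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord auto = new GenericData.Record(schema);
        auto.put("modelName", "Model T");
        auto.put("modelYear", 1909);

        // Serialize to a compact binary buffer
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(auto, encoder);
        encoder.flush();

        // Deserialize using only the schema -- no proxy classes required
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    }

    public static void main(String[] args) throws Exception {
        GenericRecord copy = roundTrip();
        System.out.println(copy.get("modelName") + " / " + copy.get("modelYear"));
    }
}
```

Note that strings deserialize as Avro's Utf8 type, so call toString() on them before comparing with java.lang.String values.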
This article shows how to use Avro to define a schema, create and serialize some data, and send
it from one application to another. I use the Eclipse integrated development environment (IDE) to
create a sample application demonstrating the use of Avro. Download the sample code for this
application.
Installation
To install Avro, download the libraries from the Apache Avro site (a link is in Resources) and
reference them in your code. Alternatively, use Apache Maven or Apache Ivy to download the Java
archive (JAR) files and all of their dependencies automatically. An example of the Maven entry is
provided in Listing 1.
<dependencies>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-tools</artifactId>
<version>1.7.5</version>
</dependency>
</dependencies>
See Resources to download libraries for other implementations, such as Python, Ruby, PHP, or
C#.
Create a project
This article uses the Eclipse IDE with the m2e (Maven 2) plug-in installed to build the examples for
using Avro. You will need a new project in which to create your sample schemas, compile them,
and run samples. To create a new project in Eclipse, click File > New > Maven Project. Select the
check box next to Create a simple project to quickly create a project that supports Maven, which
this article uses to handle the Avro dependencies.
As noted in Download the libraries, one easy way to get Avro and all of its dependencies without
doing it manually is to use Maven or IVY. The sample code included here uses a Maven pom.xml
file to configure the Avro dependency and download it automatically from inside Eclipse. For more
information about how Maven works, see Resources.
To create your own pom.xml file, create a new XML file by clicking File > New > Other. Select
XML File from the list, then name the file pom.xml. When you have created the file, place the entire
contents of Listing 2 inside the file and save.
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>avroSample</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>avroSample</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.6</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-tools</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>2.0-beta9</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.0-beta9</version>
</dependency>
</dependencies>
</project>
After you have created the pom.xml file, resolve the dependencies. Doing so downloads the Avro
JAR files along with all of the other libraries that Avro requires. To resolve the dependencies in
your Eclipse IDE, save the pom.xml file. When Eclipse builds the project, the Maven Builder will
automatically download the new dependencies.
Avro schemas
Avro schemas describe the format of the message and are defined using JavaScript Object
Notation (JSON). The JSON schema content is put into a file that the tools will reference later.
Defining a schema
Before creating a schema in your Eclipse project, create a source folder in which to put the
schema files. To create the new source folder, click File > New > Source Folder. Enter src/main/
avro and click Finish. The source folder follows Maven conventions, so if you want to use the
Maven plug-in to generate sources, you can.
Now that you have a source folder that will contain the schema files, create a new file inside the
source folder by clicking File > New > File. Give your new file a name such as automobile.avsc;
the .avsc extension is the one the Avro Maven plug-in looks for by default.
Now, you can add JSON to define the schema. Listing 3 demonstrates a simple schema that
defines an automobile.
{
"namespace" : "com.example.avroSample.model",
"type" : "record",
"name" : "Automobile",
"fields" : [
{ "name" : "modelName", "type" : "string" },
{ "name" : "make", "type" : "string" },
{ "name" : "modelYear", "type" : "int" },
{ "name" : "passengerCapacity", "type" : "int" }
]
}
The Automobile type is a record, which is a complex type that contains a list of fields. The record
complex type requires the attributes shown below.
name: The name of the object. (This becomes the class name, Automobile in this example, when
the code is generated.)
Each field in the list of fields contains data about its name and type. Table 2 contains a list of the
attributes of a field.
name: The name of the field. (This becomes the property name when the code is generated.)
type: The type of the field, such as string or int.
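Beyond the required name and type, a field can also carry a doc attribute, and its type can be a union that permits null, combined with a default value. A sketch of such a field follows; the trimLevel field is an illustration, not part of the sample schema:

```json
{
"name" : "trimLevel",
"type" : [ "null", "string" ],
"default" : null,
"doc" : "Optional trim level; null when unknown"
}
```

With a union type like this, older data that lacks the field can still be read, because the default value fills the gap.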
One convenient way to compile the schema into Java sources is to add the avro-maven-plugin to
the pom.xml file and bind its schema goal to the generate-sources phase, as shown below.
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<!-- snipped -->
<build>
<plugins>
<plugin>
<groupId>org.apache.avro</groupId>
<artifactId>avro-maven-plugin</artifactId>
<version>1.7.5</version>
<executions>
<execution>
<phase>generate-sources</phase>
<goals>
<goal>schema</goal>
</goals>
<configuration>
<sourceDirectory>
${project.basedir}/src/main/avro/
</sourceDirectory>
<outputDirectory>
${project.basedir}/src/main/java/
</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.6</source>
<target>1.6</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<!-- snipped -->
</dependencies>
</project>
Now that you have the plug-in in the Maven pom.xml file, Maven will generate the sources for you
automatically during the generate-sources phase, which is executed before the compile phase. For
more information about the Maven life cycle, see Resources.
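Because the plug-in is bound to the lifecycle, a plain build regenerates the classes; you can also invoke the phase directly. Both commands below assume they are run from the project root:

```shell
# Run only the code-generation step
mvn generate-sources

# Or run a full build; generate-sources executes before compile
mvn compile
```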
The generated Java class file will be located in the src/main/java folder. Listing 6 shows some of
the example Automobile class, which has been generated with the com.example.avroSample.model
package specified in the schema namespace.
/**
* Default constructor. Note that this does not initialize fields
* to their default values from the schema. If that is desired then
* one should use newBuilder().
*/
public Automobile() {}
/**
* All-args constructor.
*/
public Automobile(java.lang.CharSequence modelName,
java.lang.CharSequence make, java.lang.Integer modelYear,
java.lang.Integer passengerCapacity) {
this.modelName = modelName;
this.make = make;
this.modelYear = modelYear;
this.passengerCapacity = passengerCapacity;
}
/* snipped... */
}
Now that you have generated the source files, you can use them in a Java application to see how
it works.
Listing 7 contains an example of Java code that uses a Builder to return an instance of an object.
// Build an instance; the values here are example values
Automobile auto = Automobile.newBuilder()
.setModelName("Model T")
.setMake("Ford")
.setModelYear(1909)
.setPassengerCapacity(2)
.build();
// Write the object to a file in the target folder
File outputFile = new File("target/autos.avro");
DatumWriter<Automobile> datumWriter =
new SpecificDatumWriter<Automobile>(Automobile.class);
DataFileWriter<Automobile> fileWriter =
new DataFileWriter<Automobile>(datumWriter);
try {
fileWriter.create(auto.getSchema(), outputFile);
fileWriter.append(auto);
fileWriter.close();
} catch (IOException e) {
LOGGER.error("Error while trying to write the object to file <"
+ outputFile.getAbsolutePath() + ">.", e);
}
Values are assigned to the properties on the object before the build() method is called to return
the instance. A DatumWriter is used to write the data from the object through the DataFileWriter,
which writes the data contained in the object to a file while conforming to the supplied schema.
After you create a DataFileWriter implementation and call create() with the schema and file, you
can add objects to the file by using the append() method, as shown in Listing 7.
To execute the sample code, either put the code in the main method of a Java class or add it to a
unit test. In the sample code available with this article, the code is added to a unit test that you can
execute in Eclipse or Maven.
When you execute this code, the serializer creates a file called autos.avro in the target folder of
the project. Because the file is in a compact binary format, you won't be able to usefully inspect
its contents in a normal text editor.
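If you do want to peek inside the file, the avro-tools JAR that the pom.xml already references provides a tojson command that prints each record as JSON. The JAR path shown here is an assumption; point it at the copy in your local Maven repository:

```shell
# Print each serialized record in autos.avro as a line of JSON
java -jar avro-tools-1.7.5.jar tojson target/autos.avro
```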
Listing 8 shows Java code that can be used to deserialize an Automobile object from the same file
that was created in the previous step.
DatumReader<Automobile> datumReader =
new SpecificDatumReader<Automobile>(Automobile.class);
Automobile auto = null;
try {
DataFileReader<Automobile> fileReader =
new DataFileReader<Automobile>(outputFile, datumReader);
if (fileReader.hasNext()) {
auto = fileReader.next(auto);
}
fileReader.close();
} catch (IOException e) {
LOGGER.error("Error while trying to read the object from file <"
+ outputFile.getAbsolutePath() + ">.", e);
}
This time, the code uses a DatumReader, which reads the content through the DataFileReader
implementation.
The DataFileReader is an iterator (that is, it implements the Iterator interface), so it reads through
the file and returns objects through the next() method. This behavior will be familiar to anyone
who has worked with ResultSet objects in JDBC.
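Because the reader iterates, a file holding many records is consumed with an ordinary loop. The following self-contained sketch uses Avro's generic API rather than the article's generated Automobile class (the schema and values are illustrative); it writes a few records and then counts them back:

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ReadAll {

    private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Automobile\",\"fields\":["
            + "{\"name\":\"modelName\",\"type\":\"string\"}]}");

    public static int countRecords(File file, int howMany) throws Exception {
        // Write a few records first so there is something to iterate over
        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(SCHEMA));
        writer.create(SCHEMA, file);
        for (int i = 0; i < howMany; i++) {
            GenericRecord auto = new GenericData.Record(SCHEMA);
            auto.put("modelName", "Model-" + i);
            writer.append(auto);
        }
        writer.close();

        // Iterate with hasNext()/next(), like stepping through a ResultSet
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                file, new GenericDatumReader<GenericRecord>(SCHEMA));
        int count = 0;
        GenericRecord reuse = null;
        while (reader.hasNext()) {
            reuse = reader.next(reuse); // reusing the record avoids allocation
            count++;
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("autos", ".avro");
        System.out.println(countRecords(f, 3));
        f.delete();
    }
}
```

Passing the previous record back into next() lets Avro reuse the object, which matters when scanning millions of records.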
The code shown in Listing 9 ensures that the values on the properties are the same values
originally assigned to the object before it was written to the file. When executed as a unit test,
these assertions pass.
// The expected* variables hold the same values that were assigned
// to the object before it was written to the file
assertEquals(expectedModelName, auto.getModelName().toString());
assertEquals(expectedMake, auto.getMake().toString());
assertEquals(expectedModelYear, auto.getModelYear());
assertEquals(expectedPassengerCapacity, auto.getPassengerCapacity());
Now that you have seen Avro in action both writing and reading from a file, you can see how it
works with Apache Hadoop.
Avro allows you to use complex data structures within Hadoop MapReduce jobs. To try out
MapReduce functionality, you need to add dependencies to the pom.xml file to pull in the Hadoop
library and Avro MapReduce library. To add the dependencies, modify your pom.xml file to look like
Listing 10.
Listing 10. Adding the dependencies for Avro Hadoop libraries to the pom.xml file
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<build>
<!-- snipped -->
</build>
<dependencies>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-mapred</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.1.0</version>
</dependency>
</dependencies>
</project>
Add the dependencies to the pom.xml file and save it. The m2e plug-in automatically downloads
the newly added JAR files and their dependencies.
After resolving the dependencies, create an example using Avro and MapReduce that shows
counts by model name in a list of automobiles, building off the prior examples. The model count
example is based on the color count example on the Avro site (see Resources).
The example includes three classes: one that extends AvroMapper, one that extends AvroReducer,
and a class with code to initiate the MapReduce job and write the results.
Extending AvroMapper
A class that extends AvroMapper inherits the plumbing it needs to collect, or map, data within a
Hadoop job. To demonstrate the capabilities of Avro in MapReduce functions, create a small class,
as shown below.
import java.io.IOException;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.Reporter;
import com.example.avroSample.model.Automobile;
/**
* This class counts the number of models of automobiles found.
*/
public final class ModelCountMapper extends
AvroMapper<Automobile, Pair<CharSequence, Integer>> {

@Override
public void map(Automobile datum,
AvroCollector<Pair<CharSequence, Integer>> collector,
Reporter reporter) throws IOException {
collector.collect(
new Pair<CharSequence, Integer>(datum.getModelName(), 1));
}
}
The map method simply retrieves the model name from the passed-in Automobile object and emits
a count of 1 for each occurrence of that name. No arithmetic happens in the map method; the
math that summarizes the counts belongs to the class that extends AvroReducer.
Extending AvroReducer
Listing 12 shows the class that extends AvroReducer. This class accepts the values that the
ModelCountMapper object has collected and summarizes them by looping through the values.
import java.io.IOException;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.Reporter;
/**
* This class "reduces" the counts collected by the mapper into a sum
* for each model name.
*/
public final class ModelCountReducer extends
AvroReducer<CharSequence, Integer, Pair<CharSequence, Integer>> {

@Override
public void reduce(CharSequence modelName, Iterable<Integer> values,
AvroCollector<Pair<CharSequence, Integer>> collector, Reporter reporter)
throws IOException {
int sum = 0;
for (Integer value : values) {
sum += value;
}
collector.collect(new Pair<CharSequence, Integer>(modelName, sum));
}
}
The reduce method in the ModelCountReducer class "reduces" the values the mapper collects into
a derived value, which in this case is a simple sum of the values.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Type;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.example.avroSample.model.Automobile;
public class ModelNameCountApp extends Configured implements Tool {

private static final Logger LOGGER =
LogManager.getLogger(ModelNameCountApp.class);

@Override
public int run(String[] args) throws Exception {
JobConf job = new JobConf(getConf(), ModelNameCountApp.class);
job.setJobName("modelNameCounts");

FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

AvroJob.setMapperClass(job, ModelCountMapper.class);
AvroJob.setReducerClass(job, ModelCountReducer.class);
AvroJob.setInputSchema(job, Automobile.getClassSchema());
AvroJob.setOutputSchema(
job,
Pair.getPairSchema(Schema.create(Type.STRING),
Schema.create(Type.INT)));

JobClient.runJob(job);
return 0;
}
/**
* Creates an instance of this class and executes it to provide a callable
* entry point.
*/
public static void main(String[] args) {
new File(args[0]).mkdir();
new File(args[1]).mkdir();
int result = 0;
try {
result = new ModelNameCountApp().run(args);
} catch (Exception e) {
result = -1;
LOGGER.error("An error occurred while trying to run the example", e);
}
if (result == 0) {
LOGGER.info("SUCCESS");
} else {
LOGGER.fatal("FAILED");
}
}
}
This example creates an instance of a JobConf object, then uses the AvroJob class to configure the
job before executing it.
To run the example from inside Eclipse, click Run > Run Configurations to open the Run
Configurations wizard. Select Java Application from the list, then click New. Enter the name of
the project and the main class name (com.example.avroSample.mapReduce.ModelNameCountApp) on the
Main tab. In the Program Arguments box, enter two directory names: the directory that holds the
output from the tests you ran earlier (the input to the job), followed by a suitable output directory,
such as target/avro target/mapreduce.
When you are finished adding in the directory names, click Run to run the example.
When you run the example, the job uses the map() method of the ModelCountMapper object to
collect the model names, emitting a count of 1 for each occurrence, and then applies the reduce()
method of the ModelCountReducer object to the collected keys and values. In this example, the
processing simply adds the counts to summarize them.
Conclusion
Apache Avro is a serialization framework that allows you to write object data to a stream. Avro
includes the schema, in JSON format, with the serialized data, which allows applications to work
with different versions of objects. Although code generation and proxy objects are not required,
you can use the Avro tools to generate Java classes that make working with the objects easier.
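As a concrete illustration of that versioning, the sketch below (the schemas are illustrative, not the article's sample) writes a record with an old schema and reads it with a newer one that adds a field with a default value; Avro's schema resolution fills in the default:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolution {

    static final Schema V1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Automobile\",\"fields\":["
            + "{\"name\":\"modelName\",\"type\":\"string\"}]}");

    // V2 adds modelYear with a default, so old data still deserializes
    static final Schema V2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Automobile\",\"fields\":["
            + "{\"name\":\"modelName\",\"type\":\"string\"},"
            + "{\"name\":\"modelYear\",\"type\":\"int\",\"default\":0}]}");

    public static GenericRecord readOldDataWithNewSchema() throws Exception {
        GenericRecord old = new GenericData.Record(V1);
        old.put("modelName", "Model T");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(V1).write(old, enc);
        enc.flush();

        // The reader is told both the writer's schema (V1) and its own (V2)
        GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<GenericRecord>(V1, V2);
        return reader.read(null,
                DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    }

    public static void main(String[] args) throws Exception {
        GenericRecord r = readOldDataWithNewSchema();
        System.out.println(r.get("modelName") + " / " + r.get("modelYear"));
    }
}
```

The same mechanism works in reverse: a field removed in the reader schema is simply skipped when old data is read.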
Avro also works well with Hadoop MapReduce. MapReduce allows you to do large-scale
processing on many objects to do calculations across many processors simultaneously. You can
use Avro and MapReduce together to process many items serialized with Avro's small binary
format.
Downloads
Sample code: avro-sample.zip (11KB)
Resources
Learn
Review the ColorCount example, which demonstrates how to use Avro with MapReduce.
Learn more about Apache Maven and the Maven build life cycle.
Read more about Apache Avro 1.7.5 and the Avro JSON schema.
Read more about Apache Ivy and Hadoop.
Take this free course from Big Data University on Hadoop Reporting and Analysis (log-in
required). Learn how to build your own Hadoop/big data reports over relevant Hadoop
technologies such as HBase, Hive, etc., and get guidance on how to choose between various
reporting techniques: Direct Batch Reports, Live Exploration, and Indirect Batch Analysis.
Learn the basics of Hadoop with this free Hadoop Fundamentals course from Big Data
University (log-in required). Learn about the Hadoop architecture, HDFS, MapReduce, Pig,
Hive, JAQL, Flume, and many other related Hadoop technologies. Practice with hands-on
labs on a Hadoop cluster using any of these methods: on the Cloud, with the supplied
VMWare image, or install locally.
Explore free courses from Big Data University on topics ranging from Hadoop Fundamentals
and text analytics essentials to SQL access for Hadoop and real-time stream computing.
Create your own Hadoop cluster with this free course from Big Data University (log-in
required).
Learn more about big data in the developerWorks big data content area. Find technical
documentation, how-to articles, education, downloads, product information, and more.
Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based
offering that extends the value of open source Hadoop with features like Big SQL, text
analytics, and BigSheets.
Follow these self-paced tutorials (PDF) to learn how to manage your big data environment,
import data for analysis, analyze data with BigSheets, develop your first big data application,
develop Big SQL queries to analyze big data, and create an extractor to derive insights from
text documents with InfoSphere BigInsights.
Find resources to help you get started with InfoSphere Streams, IBM's high-performance
computing platform that enables user-developed applications to rapidly ingest, analyze, and
correlate information as it arrives from thousands of real-time sources.
Stay current with developerWorks technical events and webcasts.
Follow developerWorks on Twitter.
Nathan A. Good lives in the Twin Cities area of Minnesota. Professionally, he does
software development, software architecture, and systems administration. When
he's not writing software, he enjoys building PCs and servers, reading about and
working with new technologies, and trying to get his friends to make the move to
open source software. He's written and co-written many books and articles, including
Professional Red Hat Enterprise Linux 3, Regular Expression Recipes: A Problem-
Solution Approach, and Foundations of PEAR: Rapid PHP Development.