
Configurations read their properties from resources: XML files with a simple
structure for defining name-value pairs. See Example 5-1.


Example 5-1. A simple configuration file, configuration-1.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>color</name>
    <value>yellow</value>
    <description>Color</description>
  </property>
  <property>
    <name>size</name>
    <value>10</value>
    <description>Size</description>
  </property>
  <property>
    <name>weight</name>
    <value>heavy</value>
    <final>true</final>
    <description>Weight</description>
  </property>
  <property>
    <name>size-weight</name>
    <value>${size},${weight}</value>
    <description>Size and weight</description>
  </property>
</configuration>
Assuming this configuration is saved in a file called configuration-1.xml, we can
access its properties using a piece of code like this:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));

Example 5-2. A second configuration file, configuration-2.xml


<?xml version="1.0"?>
<configuration>
  <property>
    <name>size</name>
    <value>12</value>
  </property>
  <property>
    <name>weight</name>
    <value>light</value>
  </property>
</configuration>
Resources are added to a Configuration in order:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml");
Properties defined in resources that are added later override the earlier
definitions. However, properties that are marked as final cannot be overridden
in later definitions.
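Given the conf built above, a short sketch of assertions illustrating both rules:

// size was redefined in configuration-2.xml, so the later value wins.
assertThat(conf.getInt("size", 0), is(12));
// weight was marked final in configuration-1.xml, so the attempted
// override to "light" in configuration-2.xml is ignored.
assertThat(conf.get("weight"), is("heavy"));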

Example 5-3. A Maven POM for building and testing a MapReduce application
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.hadoopbook</groupId>
  <artifactId>hadoop-book-mr-dev</artifactId>
  <version>3.0</version>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
  </properties>
  <dependencies>
    <!-- Hadoop main artifact -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.0.0</version>
    </dependency>
    <!-- Unit test artifacts -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.10</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.hamcrest</groupId>
      <artifactId>hamcrest-all</artifactId>
      <version>1.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.mrunit</groupId>
      <artifactId>mrunit</artifactId>
      <version>0.8.0-incubating</version>
      <scope>test</scope>
    </dependency>
    <!-- Hadoop test artifacts for running mini clusters -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-test</artifactId>
      <version>1.0.0</version>
      <scope>test</scope>
    </dependency>
    <!-- Missing dependency for running mini clusters -->
    <dependency>
      <groupId>com.sun.jersey</groupId>
      <artifactId>jersey-core</artifactId>
      <version>1.8</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <finalName>hadoop-examples</finalName>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <outputDirectory>${basedir}</outputDirectory>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
Setting Up the Development Environment

The hadoop-local.xml file contains the default Hadoop configuration for the
default filesystem and the jobtracker:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>

The settings in hadoop-localhost.xml point to a namenode and a jobtracker, both
running on localhost:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
Finally, hadoop-cluster.xml contains details of the cluster's namenode and
jobtracker addresses. In practice, you would name the file after the name of the
cluster, rather than "cluster" as we have here:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode/</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker:8021</value>
  </property>
</configuration>
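
With this setup you can switch configurations at runtime: programs launched via
ToolRunner (whose GenericOptionsParser handles the generic options) accept a
-conf option that loads one of these files into the job's Configuration. As a
minimal sketch, the hypothetical ConfigurationPrinter below dumps every property
it was launched with, which makes it easy to verify which file took effect:

import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration populated by ToolRunner,
    // including anything passed with -conf on the command line.
    Configuration conf = getConf();
    for (Entry<String, String> entry : conf) {
      System.out.printf("%s=%s%n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options (-conf, -D, and so on)
    // before handing control to run().
    int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
    System.exit(exitCode);
  }
}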

Mapper
The test for the mapper is shown in Example 5-5.
Example 5-5. Unit test for MaxTemperatureMapper
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.*;

public class MaxTemperatureMapperTest {

  @Test
  public void processesValidRecord() throws IOException, InterruptedException {
    Text value = new Text("0043011990999991950051518004+68750+023550FM12+0382" +
                                  // Year ^^^^
        "99999V0203201N00261220001CN9999999N9-00111+99999999999");
                              // Temperature ^^^^^
    new MapDriver<LongWritable, Text, Text, IntWritable>()
      .withMapper(new MaxTemperatureMapper())
      .withInputValue(value)
      .withOutput(new Text("1950"), new IntWritable(-11))
      .runTest();
  }
}

Example 5-6. First version of a Mapper that passes MaxTemperatureMapperTest


public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature = Integer.parseInt(line.substring(87, 92));
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}

The mapper in Example 5-6 assumes every record carries a valid temperature, but
the NCDC data uses 9999 to encode a missing reading. The following test checks
that the mapper emits no output for such a record:

@Test
public void ignoresMissingTemperatureRecord() throws IOException,
    InterruptedException {
  Text value = new Text("0043011990999991950051518004+68750+023550FM12+0382" +
                                // Year ^^^^
      "99999V0203201N00261220001CN9999999N9+99991+99999999999");
                            // Temperature ^^^^^
  new MapDriver<LongWritable, Text, Text, IntWritable>()
    .withMapper(new MaxTemperatureMapper())
    .withInputValue(value)
    .runTest();
}

This test fails against the mapper of Example 5-6, which tries to parse the
+9999 field as a temperature. The fix is to check for the missing value before
parsing, with a new map() method and a small helper:

@Override
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String line = value.toString();
  String year = line.substring(15, 19);
  String temp = line.substring(87, 92);
  if (!missing(temp)) {
    int airTemperature = Integer.parseInt(temp);
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}

private boolean missing(String temp) {
  return temp.equals("+9999");
}

Reducer
The reducer has to find the maximum value for a given key. Here's a simple test
for this feature, which uses a ReduceDriver:
@Test
public void returnsMaximumIntegerInValues() throws IOException,
    InterruptedException {
  new ReduceDriver<Text, IntWritable, Text, IntWritable>()
    .withReducer(new MaxTemperatureReducer())
    .withInputKey(new Text("1950"))
    .withInputValues(Arrays.asList(new IntWritable(10), new IntWritable(5)))
    .withOutput(new Text("1950"), new IntWritable(10))
    .runTest();
}
We construct a list of some IntWritable values and then verify that
MaxTemperatureReducer picks the largest. The code in Example 5-7 is an
implementation of MaxTemperatureReducer that passes the test.
Example 5-7. Reducer for the maximum temperature example
public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
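
With the mapper and reducer each passing their tests, they can also be exercised
together. MRUnit's MapReduceDriver pushes input through both stages in one run;
here is a minimal sketch reusing the valid record from Example 5-5 (the test
name is our own, not from the examples above):

// Requires org.apache.hadoop.mrunit.mapreduce.MapReduceDriver.
@Test
public void returnsMaximumTemperatureForYear() throws IOException,
    InterruptedException {
  Text line = new Text("0043011990999991950051518004+68750+023550FM12+0382" +
      "99999V0203201N00261220001CN9999999N9-00111+99999999999");
  new MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable>()
    .withMapper(new MaxTemperatureMapper())
    .withReducer(new MaxTemperatureReducer())
    .withInput(new LongWritable(0), line)
    // The mapper emits (1950, -11); the reducer then picks the maximum,
    // which for this single record is -11 itself.
    .withOutput(new Text("1950"), new IntWritable(-11))
    .runTest();
}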
