Sunday, May 20, 2012

KeyValueTextInputFormat

After a bit of a layoff, I found some data to try in an MR job.  My training used TextInputFormat, so I wanted to try key/value input for this dataset.  I copied my existing driver, map, and reduce programs, changed the driver and map code, then compiled and ran.  BOOM!  First, I hadn't imported the new input format class, so I added that:

import org.apache.hadoop.mapred.KeyValueTextInputFormat;

The default separator is a tab, but I'm using a pipe ("|"), so I have this code in the driver:

conf.setInputFormat(KeyValueTextInputFormat.class);

conf.set("key.value.separator.in.input.line", "|");
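
To make the separator behavior concrete, here is a minimal plain-Java sketch (no Hadoop on the classpath) of how I understand the split to work: the line is cut at the first separator, and a line with no separator becomes all key with an empty value. The class and method names (KeyValueSplitSketch, split) are made up for illustration and are not part of Hadoop.

```java
// Plain-Java sketch of splitting a line at the FIRST separator.
// Assumption: this mirrors what the key/value record reader does;
// the names here are hypothetical, not Hadoop API.
class KeyValueSplitSketch {

    // Returns {key, value}. If the separator is absent, the whole
    // line becomes the key and the value is the empty string.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] lines = {
            "1|this is the text",
            "no separator here",
            "3|a|b"   // only the first pipe splits; the rest stays in the value
        };
        for (String line : lines) {
            String[] kv = split(line, '|');
            System.out.println("key=[" + kv[0] + "] value=[" + kv[1] + "]");
        }
    }
}
```

Both halves come out as strings, which lines up with the mapper receiving a Text key and a Text value.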


Next, I fought through various type mismatch errors in the mapper.  Eventually, I came up with this code in the driver:
 
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

And this code in the mapper:
 
MyMapper extends MapReduceBase implements Mapper<Text, Text, Text, IntWritable>
public void map(Text key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)

My generic mapper used a LongWritable for the key, because TextInputFormat hands the mapper each line's byte offset as the key.  With KeyValueTextInputFormat the key is a Text (everything before the separator), which threw me.  The errors were a bit cryptic the first time, but after a few iterations, I caught on to the issue.  I loaded a small test file looking like:

1|this is the text
2|this is more text

The output looked like I wanted, words and counts, cool.
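
For a quick sanity check without a cluster, the same flow can be simulated in plain Java. This is only a sketch under assumptions: it supposes the mapper ignores the key, splits the value on whitespace, and emits (word, 1), and that the reducer sums the ones. WordCountSketch and countWords are hypothetical names, not my actual job code.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java simulation of the word count over key|value lines.
// Assumption: map emits (word, 1) per token of the value and
// reduce sums them; no Hadoop classes are involved.
class WordCountSketch {

    static Map<String, Integer> countWords(String[] lines, char separator) {
        // TreeMap keeps words sorted, similar to sorted reducer output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            int pos = line.indexOf(separator);
            // Everything after the first separator is the value;
            // a line without a separator has an empty value.
            String value = pos < 0 ? "" : line.substring(pos + 1);
            StringTokenizer tokens = new StringTokenizer(value);
            while (tokens.hasMoreTokens()) {
                // "map" emits (word, 1); merge plays the "reduce" sum.
                counts.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = { "1|this is the text", "2|this is more text" };
        for (Map.Entry<String, Integer> e : countWords(lines, '|').entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

On the two test lines this gives is 2, more 1, text 2, the 1, this 2; the keys 1 and 2 never show up in the output because only the value is tokenized.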

Env: Hadoop 0.20.2 pseudo-distributed mode, CentOS on VM
