Inflatable IT: KeyValueTextInputFormat

Sunday, May 20, 2012

KeyValueTextInputFormat

After a bit of a layoff, I found some data to try in an MR job. My training used TextInputFormat, so I wanted to try key/value input for this dataset. I copied my existing driver, map, and reduce programs, made changes to the driver and map code, then compiled and ran. BOOM! First, I hadn't imported the new input format class, so I added that:

import org.apache.hadoop.mapred.KeyValueTextInputFormat;

I'm using a pipe ("|") for a separator, so I have this code:

conf.setInputFormat(KeyValueTextInputFormat.class);

conf.set("key.value.separator.in.input.line", "|");
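For reference, here is a rough stand-alone sketch of what the input format does with each line, as far as I understand it: everything before the first separator becomes the key, everything after it becomes the value, and a line with no separator becomes a key with an empty value. The class and method names (KeyValueSplitSketch, split) are made up for illustration; this is plain Java with no Hadoop dependency, not the library's own code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical illustration only: mimics the key/value split on one line of input.
class KeyValueSplitSketch {

    // Split at the FIRST occurrence of the separator.
    // No separator found: the whole line is the key and the value is empty.
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = split("1|this is the text", '|');
        System.out.println("key=" + kv.getKey() + " value=" + kv.getValue());
    }
}
```

If the separator property is not set at all, the default separator is a tab, so a pipe-delimited file would end up with whole lines as keys and empty values.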

Next, I fought through various type mismatch errors in the mapper. Eventually, I came up with this code in the driver:

conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

And this code in the mapper:

MyMapper extends MapReduceBase implements Mapper<Text, Text, Text, IntWritable>
public void map(Text key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)

My generic mapper used a LongWritable for the key, because with TextInputFormat the key is the byte offset of the line. With KeyValueTextInputFormat both the key and the value are Text, which threw me. The errors were a bit cryptic the first time, but after a few iterations I caught on to the issue. I loaded a small test file that looked like this:

1|this is the text
2|this is more text

The output looked the way I wanted: words and counts. Cool.
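To make the "words and counts" concrete, here is a small stand-alone sketch of the computation on those two lines. It assumes the mapper ignores the key, tokenizes only the value on whitespace, and emits a count of 1 per word, with the reducer summing the counts; the actual mapper body isn't shown above, so treat that as an assumption. WordCountSketch and count are made-up names, and this is plain Java rather than a real Hadoop job.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical illustration only: simulates map + reduce for a word count
// over "key|value" lines without using Hadoop.
class WordCountSketch {

    static Map<String, Integer> count(List<String> lines, char separator) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by word, like reducer output
        for (String line : lines) {
            // Value is everything after the first separator (empty if none is found).
            int pos = line.indexOf(separator);
            String value = (pos < 0) ? "" : line.substring(pos + 1);

            // "Map" step: one (word, 1) per token; "reduce" step: sum per word.
            StringTokenizer tokens = new StringTokenizer(value);
            while (tokens.hasMoreTokens()) {
                counts.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1|this is the text", "2|this is more text");
        for (Map.Entry<String, Integer> e : count(lines, '|').entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Under those assumptions the keys "1" and "2" never show up in the counts, since only the value side of each line is tokenized.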

Env: Hadoop 0.20.2, pseudo-distributed mode, CentOS on a VM
