After a bit of a layoff, I found some data to try in an MR job. My training used TextInputFormat, so I wanted to try a key/value input format for this dataset. I copied my existing driver, map, and reduce programs, made changes to the driver and map code, then compiled and ran. BOOM! First, I hadn't imported the new input format class, so I added that:
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
I'm using a pipe ("|") for a separator, so I have this code:
conf.setInputFormat(KeyValueTextInputFormat.class);
conf.set("key.value.separator.in.input.line", "|");
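To make the effect of that separator setting concrete, here is a small plain-Java sketch of the split that KeyValueTextInputFormat performs on each line: everything before the first separator becomes the key, everything after it becomes the value, and a line with no separator becomes a key with an empty value. The class and method names (KeyValueSplitSketch, split) are made up for illustration, and the code does not use Hadoop at all; it only mimics the behavior with Strings. If the property is not set, the separator defaults to a tab.

```java
// Plain-Java illustration only; KeyValueSplitSketch and split() are
// hypothetical names, not part of Hadoop.
class KeyValueSplitSketch {

    // Returns a two-element array {key, value}.
    // Only the FIRST occurrence of the separator splits the line,
    // so any later separators stay inside the value.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator found: the whole line is the key, the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("1|this is the text", '|');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

The point for the mapper is that both halves arrive as text, which is why the input key type has to be Text rather than LongWritable.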
Next, I fought through various type mismatch errors in the mapper. Eventually, I came up with this code in the driver:
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
And this code in the mapper:
public class MyMapper extends MapReduceBase implements Mapper<Text, Text, Text, IntWritable>
public void map(Text key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
My generic mapper used a LongWritable for the key, because TextInputFormat supplies each line's byte offset as the key. With KeyValueTextInputFormat the key is a Text (the part of the line before the separator), which threw me. The errors were a bit cryptic the first time, but after a few iterations I caught on to the issue. I loaded a small test file that looked like this:
1|this is the text
2|this is more text
The output looked like I wanted: words and counts. Cool.
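For anyone who wants to check the expected counts without a cluster, the following plain-Java sketch simulates the job on the two-line test file. It assumes the mapper ignores the key, splits the value on whitespace, and emits each word with a count of 1, and that the reducer sums the counts; the post does not show the mapper body, so that is an assumption. WordCountSketch and count are hypothetical names, and nothing here touches the Hadoop API.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java simulation only; WordCountSketch and count() are
// hypothetical names, not part of Hadoop.
class WordCountSketch {

    // Simulates map (tokenize the value, emit word -> 1) and
    // reduce (sum the 1s per word) in a single pass.
    static Map<String, Integer> count(String[] lines, char separator) {
        // TreeMap keeps words sorted, similar to sorted reducer output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            int pos = line.indexOf(separator);
            // Text after the first separator is the value; the key is ignored.
            String value = pos < 0 ? "" : line.substring(pos + 1);
            StringTokenizer tokens = new StringTokenizer(value);
            while (tokens.hasMoreTokens()) {
                counts.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = { "1|this is the text", "2|this is more text" };
        for (Map.Entry<String, Integer> e : count(lines, '|').entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Under those assumptions the two sample lines give a count of 2 for "this", "is", and "text", and 1 for "the" and "more".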
Env: Hadoop 0.20.2, pseudo-distributed mode, CentOS on a VM
Sunday, May 20, 2012