


Question

this is in python

Develop a MapReduce-based solution constructing an inverted character index of a list of words. The index should map every character appearing in at least one of the words to a list of the words containing that character. Your work consists of designing the mapper function getChars() and the reducer function getCharIndex().

mp = SeqMapReduce(getChars, getCharIndex)
mp.process(['ant', 'bee', 'cat', 'dog', 'eel'])

Results: [('a', ['ant', 'cat']), ('c', ['cat']), ('b', ['bee']), ('e', ['eel', 'bee']), ('d', ['dog']), ('g', ['dog']), ('l', ['eel']), ('o', ['dog']), ('n', ['ant']), ('t', ['ant', 'cat'])]
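One way to answer the question directly is the sketch below. The SeqMapReduce class is not given in the question, so the version here is an assumption: a minimal sequential stand-in that runs the map phase, groups the emitted pairs by key (the "shuffle"), and then runs the reduce phase. The ordering of pairs, and of words within each list, may differ from the sample output, since a MapReduce result imposes no particular order.

```python
class SeqMapReduce:
    """Minimal sequential MapReduce skeleton (assumed, not from the question)."""

    def __init__(self, mapper, reducer):
        self.mapper = mapper
        self.reducer = reducer

    def process(self, data):
        # Map phase: collect every (key, value) pair emitted by the mapper.
        pairs = []
        for item in data:
            pairs.extend(self.mapper(item))
        # Shuffle phase: group values by key.
        groups = {}
        for key, value in pairs:
            groups.setdefault(key, []).append(value)
        # Reduce phase: apply the reducer to each (key, [values]) group.
        return [self.reducer(key, values) for key, values in groups.items()]


def getChars(word):
    # Mapper: emit one (character, word) pair for each distinct
    # character in the word, so a word is listed at most once per character.
    return [(ch, word) for ch in set(word)]


def getCharIndex(char, words):
    # Reducer: the grouped words already form the index entry
    # for this character, so just pair them up.
    return (char, words)


mp = SeqMapReduce(getChars, getCharIndex)
result = mp.process(['ant', 'bee', 'cat', 'dog', 'eel'])
```

Because the mapper emits each distinct character of a word exactly once, 'bee' contributes a single ('e', 'bee') pair rather than two, which is what keeps the final lists free of duplicates.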

Explanation / Answer

We can write a very similar program in Hadoop MapReduce; it is included in the Hadoop distribution in src/examples/org/apache/hadoop/examples/WordCount.java. It is partially reproduced below:

Listing 4.2: Hadoop MapReduce Word Count Source

There are some minor differences between this actual Java implementation and the pseudo-code shown above. First, Java has no native emit keyword; instead, the OutputCollector object passed to the mapper receives the values to emit to the next stage of execution. Second, the default input format used by Hadoop presents each line of an input file as a separate input to the mapper function, rather than the entire file at once. It also uses a StringTokenizer object to break each line into words. This performs no normalization of the input, so "cat", "Cat" and "cat," are all regarded as different strings. Note that the class variable word is reused each time the mapper emits another (word, 1) pair; this saves time by not allocating a new object for each output. The output.collect() method copies the values it receives, so you are free to overwrite the variables you use.
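The tokenization point can be seen without Hadoop at all. The small Python sketch below (not the Hadoop Java code) mimics the mapper's behavior: like Java's StringTokenizer with default delimiters, Python's str.split() only breaks on whitespace and performs no case folding or punctuation stripping, so "cat", "Cat" and "cat," come out as three different keys.

```python
def word_count_map(line):
    # Emit a (word, 1) pair for each whitespace-separated token,
    # with no normalization, mirroring the default Hadoop mapper.
    return [(token, 1) for token in line.split()]


pairs = word_count_map('cat Cat cat, cat')
# Only the two bare lowercase 'cat' tokens share a key;
# 'Cat' and 'cat,' each become separate keys.
```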

THE DRIVER METHOD

There is one final component of a Hadoop MapReduce program, called the Driver. The driver initializes the job and instructs the Hadoop platform to execute your code on a set of input files, and controls where the output files are placed. A cleaned-up version of the driver from the example Java implementation that comes with Hadoop is presented below:
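By analogy, a driver for the sequential Python setting can be sketched as below. This is a hypothetical illustration, not the Hadoop Java driver: run_job is an invented name, and where Hadoop's driver configures input and output file paths, this version simply takes an input sequence and a writable output object.

```python
import io
import sys


def run_job(mapper, reducer, inputs, output=sys.stdout):
    # Hypothetical driver: wire up the mapper and reducer, run the job
    # over the inputs, and control where the output lines are written
    # (the real Hadoop driver controls output file placement instead).
    pairs = []
    for item in inputs:
        pairs.extend(mapper(item))
    # Shuffle phase: group values by key.
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    # Reduce phase: write each reduced result to the chosen output.
    for key, values in groups.items():
        output.write(repr(reducer(key, values)) + '\n')


# Example: the character-index job from the question, with the output
# captured in an in-memory buffer instead of a file.
buf = io.StringIO()
run_job(lambda w: [(c, w) for c in set(w)],
        lambda c, ws: (c, ws),
        ['ant', 'bee'],
        output=buf)
```

Separating the driver from the mapper and reducer keeps the job logic reusable: the same two functions can be run against different inputs and output destinations without modification.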