Question
Consider a hypothetical Hadoop-MapReduce configuration and the characteristics of the text data (corpus) given below.

a. Assume the word count algorithm.
b. The relevant vocabulary of the input corpus is {a, b, c, d, ..., x}. Note that y and z are not included.
c. Assume that the input corpus has a total of G meaningful/relevant words. Other (dummy) words are not emitted by the map function.
d. The input corpus is split into S splits of D documents each, with each split handled by one mapper.
e. There are R reducers.
f. Both partitioners and combiners are used in the data flow.

Answer these questions for the configuration listed above, providing expressions in terms of the variables discussed:

1. What is the size of the input keyspace processed by the mappers?
2. How many mappers are there?
3. Derive an expression for the number of times a map function will be invoked in a given mapper.
4. Assuming a uniform distribution (no skew, with all the words in the vocabulary equally likely in the input), derive an expression for the number of copies that the shuffle executes for moving data from map-side machines to reduce-side machines (assume combiners and partitioners).
5. How many distinct keys would each reducer handle or be in charge of "reducing"?
6. How many parts are there in the output after the MapReduce process is completed?
7. Draw a figure that shows the data-flow concept and label it with the expressions you have derived above.

Explanation / Answer
1. The input keyspace could, in principle, be as large as the set of all words formable from the vocabulary {a, b, ..., x}. However, only G relevant words actually occur in the corpus and are emitted as keys, so the size of the input keyspace is G.
2. The number of mappers equals the number of splits in the input corpus, so there are S mappers.
3. The map function is invoked D times in a given mapper, since each split contains D documents and map is called once per document (see the job sketch after this list).
4. Let W be the average number of words per document, so there are S*D*W relevant words in total. Under the uniform-distribution assumption, each of the G distinct words occurs S*D*W/G times overall, i.e. D*W/G times within each split. The combiner aggregates the repeated occurrences within each mapper's output, so every mapper forwards G pairs of the form <word, D*W/G>. The shuffle therefore copies S*G key-value pairs from the map-side machines to the reduce-side machines (the arithmetic check after this list runs these expressions on sample numbers).
5. Assuming a uniform distribution of keys across partitions, each reducer would handle G/R distinct keys.
6. The output has R parts: each reducer writes one part file (e.g. part-r-00000), so R reducers yield R output parts.
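To make the data flow concrete, here is a minimal sketch of a word-count job matching the configuration above, written against Hadoop's org.apache.hadoop.mapreduce Java API. The class names (VocabWordCount, VocabMapper, SumReducer) and the reducer count of 8 are illustrative assumptions, not part of the original question. One caveat: with the default TextInputFormat the framework calls map() once per input line, so the "D invocations per mapper" model above implicitly assumes a document-granular input format.

```java
import java.io.IOException;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class VocabWordCount {

  public static class VocabMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    // Hypothetical relevant vocabulary {a, ..., x}; elided for brevity.
    private static final Set<String> VOCAB = Set.of("a", "b", "c", /* ... */ "x");
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit only relevant words; dummy words (e.g. y, z) are dropped,
      // matching assumption (c) in the question.
      for (String token : value.toString().split("\\s+")) {
        if (VOCAB.contains(token)) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Serves as both the combiner (local, map-side aggregation) and the reducer.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "vocab word count");
    job.setJarByClass(VocabWordCount.class);
    job.setMapperClass(VocabMapper.class);
    job.setCombinerClass(SumReducer.class); // assumption (f): combiners in use
    job.setReducerClass(SumReducer.class);
    job.setNumReduceTasks(8);               // R reducers -> R output part files
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note how the same SumReducer class serves as both combiner and reducer: summation is associative and commutative, which is exactly the property that lets the combiner pre-aggregate each mapper's G keys before the shuffle.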
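As a quick sanity check on the derived expressions, the following standalone sketch plugs in made-up values for S, D, W, G, and R (all hypothetical) and prints each quantity:

```java
// Back-of-the-envelope check of the expressions derived above.
// All values are made up for illustration.
public class FlowArithmetic {
  public static void main(String[] args) {
    long S = 10, D = 100, W = 480; // splits, docs per split, avg words per doc
    long G = 24, R = 8;            // distinct relevant words, reducers

    long totalWords = S * D * W;        // S*D*W = 480,000 relevant occurrences
    long perWordTotal = totalWords / G; // uniform: each word occurs 20,000 times
    long perWordPerSplit = D * W / G;   // 2,000 occurrences per word per split
    long mapCallsPerMapper = D;         // one map() call per document
    long shufflePairs = S * G;          // 240 pairs cross the network
    long keysPerReducer = G / R;        // 3 distinct keys per reducer

    System.out.printf("mappers = %d, map() calls per mapper = %d%n",
        S, mapCallsPerMapper);
    System.out.printf("total words = %d, per-word total = %d, per split = %d%n",
        totalWords, perWordTotal, perWordPerSplit);
    System.out.printf("shuffled pairs = %d, keys/reducer = %d, output parts = %d%n",
        shufflePairs, keysPerReducer, R);
  }
}
```

With these numbers the shuffle moves 10*24 = 240 pairs instead of the 480,000 raw occurrences, which is the whole point of running combiners before the shuffle.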