Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

Assignment – 1 1- The DISTINCT(X) operator is used to return only distinct (uniq

ID: 3780987 • Letter: A

Question

Assignment – 1

1- The DISTINCT(X) operator is used to return only distinct (unique) values for datatype (or column) X in the entire dataset .

As an example, for the following table A:

A.ID

A.ZIPCODE

A.AGE

1

12345

30

2

12345

40

3

78910

10

4

78910

10

5

78910

20

DISTINCT(A.ID) = (1, 2, 3, 4, 5)

DISTINCT(A.ZIPCODE) = (12345, 78910)

DISTINCT(A.AGE) = (30, 40, 10, 20)

Implement the DISTINCT(X) operator using Map-Reduce. Provide the algo-

rithm pseudocode. You should use only one Map-Reduce stage, i.e. the algorithm should

make only one pass over the data.

2-The SHUFFLE operator takes a dataset as input and randomly re-orders it.

Hint: Assume that we have a function rand(m) that is capable of outputting a random integer between [1, m].

Implement the SHUFFLE operator using Map-Reduce. Provide the algorithm pseudocode.

3-What is the communication cost (in terms of total data flow on the network between mappers and reducers) for following query using Map-Reduce:

Get DISTINCT(A.ID from A WHERE A.AGE > 30 )

The dataset A has 1000M rows, and 400M of these rows have A.AGE <= 30. DISTINCT(A.ID) has 1M elements. A tuple emitted from any mapper is 1 KB in size.

4-Consider the checkout counter at a large supermarket chain. For each item sold, it generates a record of the form [ProductId, Supplier, Price]. Here, ProductId is the unique identifier of a product, Supplier is the supplier name of the product and Price is the sales price for the item. Assume that the supermarket chain has accumulated many terabytes of data over a period of several months.

The CEO wants a list of suppliers, listing for each supplier the average sales price of items provided by the supplier. How would you organize the computation using the Map-Reduce computation model?

For the following questions give short explanations of your answers.

5-True or False: Each mapper/reducer must generate the same number of output key/value pairs as it receives on the input.

6-True or False: The output type of keys/values of mappers/reducers must be of the same type as their input.

7-True or False: The input to reducers is grouped by key.

8-True or False: It is possible to start reducers while some mappers are still running.

A.ID

A.ZIPCODE

A.AGE

1

12345

30

2

12345

40

3

78910

10

4

78910

10

5

78910

20

Explanation / Answer

Solutions:

1.The solution exploits MapReduce’s ability to group keys together to remove duplicates. Use a mapper to emit X from each record as the key.

The reducer simply emits the keys.

Pseudo code:

map(key, record):

output (value, null) from column X of each input row

reduce(key, records):

output key

2.All the mapper does is output the record as the value along with a random key. In other words, each record is sent to a random reducer. The reducer emits the values.

Pseudo code:

map(key, record):

rand(m) = pick a random integer in [1, m]

output (rand(n), record  

reduce(key, records):

for record in records:

output record

3.Let, p = 1000M, r = 400M, q = 1M

There will be 2 jobs and the output of WHERE is chained to DISTINCT :

WHERE emits (p - r) tuples from the mapper

DISTINCT emits (p - r) tuples from the mapper

Total = 2(p - r) = 2(600M) = 1200M * 1KB = 1.12 TB (approx)

Total = (p-r) = 600M * 1KB, if the values are filtered in the mapper.

4.SELECT AVG(Price) FROM DATA GROUP BY Supplier

Pseudo code :

map(key, record):

output [ record(SUPPLIER), record(PRICE) ]

reduce(SUPPLIER, list of PRICE):

emit [ SUPPLIER, AVG(PRICE) ]

Hire Me For All Your Tutoring Needs
Integrity-first tutoring: clear explanations, guidance, and feedback.
Drop an Email at
drjack9650@gmail.com
Chat Now And Get Quote