Question
The method/solution should be different for each part, since the dataset is becoming larger and larger.
You are given a dataset of phone calls between pairs of people, listing the caller, called, time of phone call and duration of the phone call (in seconds). A snapshot is given below:

Here the first line represents a phone call lasting slightly over two minutes, the second just a quick 1 second call, etc. Your task is to compute for each pair of phone numbers the total amount of time the parties spent on the phone with each other (regardless of who called whom).

1. Suppose your dataset is the call log of a small town of 100,000 people each of whom calls 50 people on average. Please describe how you would compute the statistics.
2. Suppose your dataset is a call log of a large city of 10,000,000 people, each of whom calls 100 people on average. Please describe how you would compute the statistics.
3. Suppose the dataset is a call log of a nation of 300,000,000 people, each of whom calls 200 people on average. Please describe how you would compute the statistics.

In writing your descriptions above, you don't need to provide actual working code, but please provide enough detail that someone can easily implement your approach. What differences are there between the three different approaches? Would you use an in-memory or streaming approach? A single machine or multiple machines?

Explanation / Answer
1.
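The data for a small town is not very big (at most 100,000 × 50 = 5,000,000 caller-called pairs), so we can compute the statistics in memory on a single machine. Read the log once and keep a hash map keyed by the pair of phone numbers, ordered so that a call from A to B and a call from B to A update the same entry, and add each call's duration to that entry's running total:
(person1, person2) -> (total time spent)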
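The mapping above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `total_talk_time`, the record layout (caller, called, start time, duration in seconds) and the sample phone numbers are all assumptions made up for the example, since the original snapshot is not shown.

```python
from collections import defaultdict


def total_talk_time(calls):
    """Sum call durations per unordered pair of phone numbers.

    `calls` is any iterable of (caller, called, start_time, duration_seconds)
    tuples; this record layout is an assumption for illustration.
    """
    totals = defaultdict(int)
    for caller, called, _start_time, duration in calls:
        # Order the two numbers so that A->B and B->A share one key.
        pair = (caller, called) if caller <= called else (called, caller)
        totals[pair] += duration
    return dict(totals)


# Hypothetical sample records, invented for this example.
sample_calls = [
    ("555-0101", "555-0102", "2024-01-01 09:00", 125),
    ("555-0102", "555-0101", "2024-01-01 09:30", 1),
    ("555-0103", "555-0101", "2024-01-01 10:00", 60),
]
totals = total_talk_time(sample_calls)
```

Because the function only iterates over `calls` once, the log itself can be streamed from a file; only the table of per-pair totals has to fit in memory.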
2.
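For a large city (up to 10,000,000 × 100 = 1,000,000,000 caller-called pairs), we can still calculate the statistics with the previous approach on a single machine, provided it is a high-end server with much more RAM, more disk, and a multicore processor, so that the table of per-pair totals still fits.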
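Whether a single machine is enough depends mainly on how many distinct pairs must be held at once. The sketch below works through that arithmetic for the three parts of the question. The people and contact counts come from the question; `BYTES_PER_ENTRY` is a rough assumption (two phone numbers, a counter, and hash-table overhead), not a measured value, and the helper name `estimate_pairs` is hypothetical.

```python
# Assumed average memory cost of one (pair -> total) entry, in bytes.
# This is a guess for illustration; measure it for a real implementation.
BYTES_PER_ENTRY = 100


def estimate_pairs(people, contacts_per_person):
    """Upper bound on distinct pairs: each person calls this many others.

    The number of unordered pairs is at most this value, because A->B and
    B->A collapse into a single entry.
    """
    return people * contacts_per_person


scenarios = [
    ("small town", 100_000, 50),
    ("large city", 10_000_000, 100),
    ("nation", 300_000_000, 200),
]

for name, people, contacts in scenarios:
    pairs = estimate_pairs(people, contacts)
    gib = pairs * BYTES_PER_ENTRY / 2**30
    print(f"{name}: up to {pairs:,} pairs, roughly {gib:,.1f} GiB")
```

Under this assumption the small town needs well under a gigabyte, the city needs on the order of a hundred gigabytes, and the nation needs several terabytes, which is why the third part moves to multiple machines.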
3.
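For the dataset of a whole nation, a single machine is no longer enough, so we need multiple machines and parallel processing, for example with a framework like MapReduce. The approach stays the same as in part 1: the map step turns each call record into the key (person1, person2) with the call duration as its value, and the reduce step sums the durations for each key.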
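To make the map and reduce steps concrete, here is a single-process Python simulation of the idea. It is only a sketch under stated assumptions: it does not use any real MapReduce framework API, and the names `map_call`, `partition`, `run_job`, the record layout, and the sample data are all hypothetical. In a real cluster each input split and each reducer bucket would live on a different machine.

```python
import zlib
from collections import defaultdict


def map_call(record):
    """Map step: emit (normalized pair, duration) for one call record."""
    caller, called, _start_time, duration = record
    pair = (caller, called) if caller <= called else (called, caller)
    return pair, duration


def partition(pair, num_reducers):
    """Pick a reducer for a key; the same pair always goes to the same one."""
    key_bytes = f"{pair[0]}|{pair[1]}".encode("utf-8")
    return zlib.crc32(key_bytes) % num_reducers


def run_job(input_splits, num_reducers=3):
    """Simulate map, shuffle, and reduce in one process."""
    # Shuffle: group mapped values by key inside each reducer's bucket.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in input_splits:
        for record in split:
            pair, duration = map_call(record)
            buckets[partition(pair, num_reducers)][pair].append(duration)

    # Reduce: each reducer sums the durations for the keys it owns.
    results = {}
    for bucket in buckets:
        for pair, durations in bucket.items():
            results[pair] = sum(durations)
    return results


# Hypothetical input, split as if stored on two different machines.
splits = [
    [("555-0101", "555-0102", "t1", 125), ("555-0103", "555-0101", "t2", 60)],
    [("555-0102", "555-0101", "t3", 1)],
]
job_totals = run_job(splits)
```

Because addition is associative and commutative, each mapper could also pre-sum durations per pair locally before the shuffle, which reduces the amount of data sent across the network.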