I need help on this Scala / Spark homework: **Please build an RDD using sc.textF
Question
I need help on this Scala / Spark homework:
Please build an RDD using sc.textFile(…) to read the words in. Given a text file “basketball_words_only.txt”, complete the following tasks:
1. Write a MapReduce program in Scala to find 1) the words that account for at least 3% of the document “basketball_words_only.txt”, 2) the 4 most frequent words in the document.
Remember that Apache Spark evaluates RDDs lazily. This has many advantages, but one disadvantage is that the same RDD may be recomputed each time an action uses it. Please avoid this kind of recomputation in your program.
The correct output is shown below:
Words that account for at least 3% are "the","is","basketball","and",
the appears 10 times
basketball appears 8 times
is appears 6 times
and appears 6 times
2. Still using “basketball_words_only.txt” as input, write a MapReduce program in Scala to find, for each word in the file, the word that follows it most often.
For example, in “basketball_words_only.txt”, the word “basketball” is followed by
- “is” five times
- “has” two times
- “court” once
So “is” is the word that follows “basketball” the most.
Another example, the word “the” is followed by
- “ball” three times
- “court” twice
- “most” once
- “basket” once
- “end” once
- “game” once
- “team” once
If several followers tie for the highest count, pick any one of them arbitrarily.
Finally, display the “most frequent” follower for “basketball”, “the”, and “competitive” as follows:
"basketball" is followed by "is" 5 times.
"the" is followed by "ball" 3 times.
"competitive" is followed by "basketball" 2 times.
The text file (Basketball_words_only.txt):
basketball is a team competitive sport in which two teams of five active players each try to score points against one another by throwing a ball through a 10 foot high hoop under organized rules basketball is one of the most popular and widely viewed sports in the court points are scored by passing the ball through the basket from above the team with more points at the end of the game wins the ball can be advanced on the court by bouncing it dribbling or passing it between teammates disruptive physical contact fouls is not permitted and there are restrictions on how the ball result be handled violations through time basketball has developed to involve common techniques of shooting passing and dribbling as well as players positions and offensive and defensive structures while competitive basketball is carefully regulated numerous variations of basketball has developed for casual play in some countries basketball is also a popular spectator sport while competitive basketball is primarily an indoor sport played on a basketball court less regulated variations have become exceedingly popular as an outdoor sport among both inner city and rural groups
Explanation / Answer
To solve the problem, we first need to count how often each word occurs. A script for this can be written as
val text = sc.textFile("basketball_words_only.txt")
// Split each line on whitespace, drop empty tokens, and sum a 1 for every occurrence of a word.
val counts = text.flatMap(line => line.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)
Save the script as WordCountScala.scala and run it with
spark-shell -i WordCountScala.scala
When the script runs, it prints each word together with its count.
According to the question, the expected result is:
Words that account for at least 3% are "the","is","basketball","and",
the appears 10 times
basketball appears 8 times
is appears 6 times
and appears 6 times
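Getting from raw counts to that output takes two more steps: comparing each count with 3% of the total number of words, and picking the 4 largest counts. By a manual count the text in the question has 189 whitespace-separated words, so 3% is about 5.67 occurrences and a word qualifies when it appears at least 6 times. The following is a minimal sketch of one way to do this, not a definitive solution. The object name FrequentWords and the helper names are hypothetical, and the local[*] master is an assumption for running outside spark-shell (inside spark-shell the existing sc can be used instead). The counts RDD is cached because several actions read it, which is the recomputation the question asks to avoid.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object; names are illustrative only.
object FrequentWords {
  // Count occurrences of each whitespace-separated word.
  def wordCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1)).reduceByKey(_ + _)

  // Words whose count is at least `fraction` of the total number of words.
  // Runs two actions on `counts` (fold and collect), so `counts` should be cached.
  def atLeastFraction(counts: RDD[(String, Int)], fraction: Double): Array[(String, Int)] = {
    val total = counts.map(_._2.toLong).fold(0L)(_ + _)
    counts.filter { case (_, n) => n.toDouble / total >= fraction }.collect()
  }

  // The n pairs with the largest counts, in descending order of count.
  def topN(counts: RDD[(String, Int)], n: Int): Array[(String, Int)] =
    counts.top(n)(Ordering.by[(String, Int), Int](_._2))

  def main(args: Array[String]): Unit = {
    // Assumption: a local master; adjust for a real cluster.
    val sc = new SparkContext(new SparkConf().setAppName("FrequentWords").setMaster("local[*]"))
    try {
      // cache() keeps the counts in memory so later actions do not recompute them.
      val counts = wordCounts(sc.textFile("basketball_words_only.txt")).cache()
      val frequent = atLeastFraction(counts, 0.03)
      println("Words that account for at least 3% are " +
        frequent.map { case (w, _) => "\"" + w + "\"" }.mkString(","))
      topN(counts, 4).foreach { case (w, n) => println(s"$w appears $n times") }
    } finally sc.stop()
  }
}
```

The order of words with equal counts (here "is" and "and") is not fixed, so the lines may appear in a slightly different order than in the expected output.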
Each result of the counting step is a key-value pair in which the word is the key and its count is the value.
For example,
basketball appears 8 times. Here the key is "basketball" and the value is 8. To look at the pair for a single word, filter the counts by key:
val text = sc.textFile("basketball_words_only.txt")
val counts = text.flatMap(line => line.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1)).reduceByKey(_ + _)
counts.filter { case (word, _) => word == "basketball" }.collect().foreach(println)
When the script runs, it prints (basketball,8).
Similarly, the counts for the other words can be obtained in the same way.
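The second task in the question (the most frequent follower of each word) can be approached with the same key-value idea, using pairs of adjacent words as keys. Below is a minimal sketch under stated assumptions: the object name Followers and its helpers are hypothetical, the local[*] master is an assumption for running outside spark-shell, and pairs are only formed within a line, which is enough if the file is a single line of words as it appears in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object; names are illustrative only.
object Followers {
  // Count how often each (word, next word) pair occurs within a line.
  def followerPairs(lines: RDD[String]): RDD[((String, String), Int)] =
    lines.flatMap { line =>
      val words = line.split("\\s+").filter(_.nonEmpty)
      // zip with the array shifted by one; safe for empty and one-word lines
      words.zip(words.drop(1)).toSeq
    }.map(pair => (pair, 1)).reduceByKey(_ + _)

  // For each word, keep the follower with the highest count (ties broken arbitrarily).
  def mostFrequentFollower(pairCounts: RDD[((String, String), Int)]): RDD[(String, (String, Int))] =
    pairCounts
      .map { case ((word, next), n) => (word, (next, n)) }
      .reduceByKey((a, b) => if (a._2 >= b._2) a else b)

  private def quote(s: String): String = "\"" + s + "\""

  def main(args: Array[String]): Unit = {
    // Assumption: a local master; adjust for a real cluster.
    val sc = new SparkContext(new SparkConf().setAppName("Followers").setMaster("local[*]"))
    try {
      val targets = Seq("basketball", "the", "competitive")
      // A single action (collectAsMap) fetches all three answers, so nothing is recomputed.
      val best = mostFrequentFollower(followerPairs(sc.textFile("basketball_words_only.txt")))
        .filter { case (word, _) => targets.contains(word) }
        .collectAsMap()
      for (w <- targets; (next, n) <- best.get(w))
        println(s"${quote(w)} is followed by ${quote(next)} $n times.")
    } finally sc.stop()
  }
}
```

Because only three words are displayed, filtering before collecting keeps the data sent to the driver small. If the result were reused by several actions, calling cache() on it would serve the same purpose as in the first task.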