Question
I have three data files: movies.dat, users.dat and ratings.dat. I have to write Python mapper and reducer code to identify the movie with the highest value of average male rating minus average female rating, and another movie with the highest value of average female rating minus average male rating. Simply put: one movie that is rated most highly by men but not by women, and one movie that is rated most highly by women but not by men. In the mapper I am trying to extract movieID, rating and gender so that the reducer can compute the final results. But the mapper gives me incomplete data. I am joining users.dat and ratings.dat. Any solution? Please provide code.
ratings.dat = UserID::MovieID::Rating::Timestamp
movies.dat = MovieID::Title::Genres
users.dat = UserID::Gender::Age::Occupation::Zip-Code
My mapper code:
#!/usr/bin/env python
import sys
userID=""
movieID=""
gender=""
rating=""
#--- get all lines from stdin ---
for line in sys.stdin:
    line=line.strip()
    lst = line.split("::")
    if len(lst) == 4:
        userID=lst[0]
        movieID=lst[1]
        rating=lst[2]
    elif len(lst)==5:
        userID=lst[0]
        gender=lst[1]
    else:
        movieID=lst[0]
    print '%s, %s, %s'%(movieID,rating,gender)
The command I use to run the mapper is: cat users.dat ratings.dat | ./mapper.py
Explanation / Answer
First of all, Python code needs proper indentation to run, so check your code for indentation errors.
Now, to the problem itself. The output of the mapper is fed to the reducer, so the reducer needs to aggregate the ratings of each movieID by gender. Group the mapper output by movieID and compute two values per movie: the average rating given by men (M) and the average rating given by women (F). Then look for the movie with the highest M minus F and the movie with the highest F minus M.
This is really a ranking by two criteria: one is average male rating minus average female rating, and the other is average female rating minus average male rating. Once you have built the list of per-movie averages, you can use the sorted function in Python (or max with a key) to pick the top movie for each criterion.
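The aggregation described above can be sketched as a small reducer. This is only a minimal sketch under stated assumptions, not a definitive implementation: it assumes every input line has the form "movieID, rating, gender" (the format the mapper prints), that a single reducer sees all records, and that only movies rated by both genders should be compared. The names gender_gaps and main are illustrative and do not come from the question.

```python
#!/usr/bin/env python
"""Reducer sketch: reads 'movieID, rating, gender' lines and reports two movies."""
import sys
from collections import defaultdict


def gender_gaps(lines):
    """Return {movieID: average male rating - average female rating}."""
    # totals[movieID][gender] = [sum of ratings, number of ratings]
    totals = defaultdict(lambda: {"M": [0.0, 0], "F": [0.0, 0]})
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or parts[2] not in ("M", "F"):
            continue  # skip incomplete or malformed records
        movie_id, rating, gender = parts
        try:
            value = float(rating)
        except ValueError:
            continue  # skip records whose rating is missing or not numeric
        totals[movie_id][gender][0] += value
        totals[movie_id][gender][1] += 1

    gaps = {}
    for movie_id, by_gender in totals.items():
        (m_sum, m_n), (f_sum, f_n) = by_gender["M"], by_gender["F"]
        if m_n and f_n:  # assumption: compare only movies rated by both genders
            gaps[movie_id] = m_sum / m_n - f_sum / f_n
    return gaps


def main():
    """Read mapper output from stdin and print the two extreme movies."""
    gaps = gender_gaps(sys.stdin)
    if not gaps:
        return
    men_pick = max(gaps, key=gaps.get)    # largest (male avg - female avg)
    women_pick = min(gaps, key=gaps.get)  # largest (female avg - male avg)
    print("men_minus_women\t%s\t%.3f" % (men_pick, gaps[men_pick]))
    print("women_minus_men\t%s\t%.3f" % (women_pick, -gaps[women_pick]))

# When saved as a script (for example a hypothetical reducer.py),
# call main() at the bottom of the file so it reads from stdin.
```

Because the sketch keeps one running sum and count per movie and gender, it does not depend on the input being sorted. With several reducers, each one would only see part of the data, so the per-reducer winners would still have to be compared in a final step.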
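The following is the mapper code with proper indentation:
import sys
# start with empty values so the first print does not fail
movieID = rating = gender = ""
for line in sys.stdin:
    line=line.strip()
    lst = line.split("::")
    if len(lst) == 4:
        # ratings.dat record
        userID=lst[0]
        movieID=lst[1]
        rating=lst[2]
    elif len(lst)==5:
        # users.dat record
        userID=lst[0]
        gender=lst[1]
    else:
        movieID=lst[0]
    print('%s, %s, %s'%(movieID,rating,gender))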
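Fixing the indentation alone may not give complete records. As written, the mapper prints one line for every input line and reuses whatever values were left over from earlier lines, so the gender printed next to a rating need not belong to the user who gave it. One possible fix, shown here as a hedged sketch, is to remember each user's gender in a dictionary and print only when a rating record arrives. It assumes users.dat comes before ratings.dat in the stream (as in cat users.dat ratings.dat) and that all users fit in memory. The names map_lines and gender_by_user are illustrative.

```python
#!/usr/bin/env python
"""Mapper sketch: joins users.dat and ratings.dat records read from one stream."""
import sys


def map_lines(lines):
    """Yield 'movieID, rating, gender' for each rating whose user is known."""
    gender_by_user = {}  # userID -> 'M' or 'F', filled from users.dat records
    for line in lines:
        fields = line.strip().split("::")
        if len(fields) == 5:
            # users.dat record: UserID::Gender::Age::Occupation::Zip-Code
            gender_by_user[fields[0]] = fields[1]
        elif len(fields) == 4:
            # ratings.dat record: UserID::MovieID::Rating::Timestamp
            user_id, movie_id, rating = fields[0], fields[1], fields[2]
            gender = gender_by_user.get(user_id)
            if gender is not None:  # skip ratings from users not seen so far
                yield "%s, %s, %s" % (movie_id, rating, gender)
        # any other record (for example a movies.dat line) is ignored


def main():
    """Read the concatenated files from stdin and print the joined records."""
    for record in map_lines(sys.stdin):
        print(record)

# When saved as mapper.py, call main() at the bottom of the file
# so the script reads from stdin.
```

Run this way, every output line carries a movieID, a rating, and the gender of the user who gave that rating. On a real Hadoop cluster the input is split across mappers, so this in-memory lookup would need users.dat to be available to every mapper, or the join would have to move to the reducer.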