Question
I need help writing the Pig command for this question:
Question: Find the closest city for each tweet. For the dataset, assume you have two files: full_text_clean.txt: (userid, lat, lon, tweet, modified_lat, modified_lon) and cities_clean.txt: (city_name, lat, lon, modified_lat, modified_lon) [D2L -> Assignment 3 – Pig -> cities_clean.txt].
Hint: Both files include modified lat and lon columns (the last two columns of each file). Use these columns to map each geo-tagged tweet to multiple nearby cities. Then, for each geo-tagged tweet, calculate the distance to each candidate city using the actual lat/lon values and pick the closest city.
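The first half of the hint (mapping each tweet to several candidate cities) can be sketched as below. This is only an illustration under assumptions: the relative file paths, the alias names (tweets, cities, candidates, per_tweet, counts), the float column types, and the default tab delimiter are not given in the assignment.

```pig
-- Sketch only: paths, aliases, types and the tab delimiter are assumptions.
tweets = LOAD 'full_text_clean.txt'
         AS (userid:chararray, lat:float, lon:float, tweet:chararray,
             modified_lat:float, modified_lon:float);
cities = LOAD 'cities_clean.txt'
         AS (city_name:chararray, lat:float, lon:float,
             modified_lat:float, modified_lon:float);

-- An inner join on the modified columns pairs each tweet with every city
-- that shares the same (modified_lat, modified_lon) values.
candidates = JOIN tweets BY (modified_lat, modified_lon),
                  cities BY (modified_lat, modified_lon);

-- Count the candidate cities per tweet to confirm that one tweet can map
-- to several cities before any distance is computed.
per_tweet = GROUP candidates BY (tweets::userid, tweets::tweet);
counts = FOREACH per_tweet GENERATE FLATTEN(group) AS (userid, tweet),
                                    COUNT(candidates) AS n_candidates;
sample_counts = LIMIT counts 5;
DUMP sample_counts;
```

Because the join is an inner join, a tweet whose modified lat/lon matches no city yields no candidates and drops out at this step.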
Calculating Euclidean distance (Pig example): SQRT((lat_1 - lat_2) * (lat_1 - lat_2) + (lon_1 - lon_2) * (lon_1 - lon_2))
lat_1/lon_1 refer to lat/lon in full_text_clean.txt; lat_2/lon_2 refer to lat/lon in cities_clean.txt.
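As a quick check of the formula, here is a worked example with made-up coordinates (hypothetical values, not taken from either file): a tweet at (lat_1, lon_1) = (40.0, -75.0) and a city at (lat_2, lon_2) = (40.3, -75.4).

```latex
\sqrt{(40.0 - 40.3)^2 + \bigl(-75.0 - (-75.4)\bigr)^2}
  = \sqrt{(-0.3)^2 + (0.4)^2}
  = \sqrt{0.09 + 0.16}
  = \sqrt{0.25}
  = 0.5
```

The result is in degrees, not kilometres, which is fine here because it is only used to rank candidate cities for the same tweet.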
Submit only the command.
Explanation / Answer
######### 1 #########
grunt> a = load '/user/root/pig/full_text_clean.txt';
grunt> b = sample a 0.1;
grunt> store b into '/user/root/pig/full_text_small.txt';

######### 2 #########
grunt> a = load '/user/root/pig/full_text_clean.txt' AS (id:chararray, lat:float, lon:float, tweet:chararray, modified_lat:float, modified_lon:float);
grunt> b = foreach a generate flatten(TOKENIZE(tweet)) as token;
grunt> c = group b by token;
grunt> d = foreach c generate group as token, COUNT(b) as cnt;
grunt> e = order d by cnt desc;
grunt> f = limit e 4;
grunt> dump f;
(I,109447)
(RT,78153)
(the,75595)

######### 3 #########
grunt> a = load '/user/root/pig/full_text_clean.txt' AS (id:chararray, lat:float, lon:float, tweet:chararray, modified_lat:float, modified_lon:float);
grunt> b = group a all;
grunt> c = foreach b generate COUNT_STAR(a);
grunt> dump c;
(377616)

######### 4 #########
grunt> a = load '/user/root/pig/full_text_clean.txt' AS (id:chararray, lat:float, lon:float, tweet:chararray, modified_lat:float, modified_lon:float);
grunt> b = load '/user/root/pig/cities_clean.txt' AS (city_name:chararray, lat:float, lon:float, modified_lat:float, modified_lon:float);
grunt> -- pair each tweet with every city that shares its modified lat/lon
grunt> c = join a by (modified_lat, modified_lon), b by (modified_lat, modified_lon);
grunt> d = foreach c generate a::id as id, a::tweet as tweet, b::city_name as city, SQRT((a::lat - b::lat) * (a::lat - b::lat) + (a::lon - b::lon) * (a::lon - b::lon)) as distance;
grunt> -- group by tweet only, so each group holds all candidate cities for one tweet
grunt> e = group d by (id, tweet);
grunt> f = foreach e {
>> sortd = order d by distance asc;
>> closest = limit sortd 1;
>> generate flatten(closest);
>> };
grunt> g = limit f 3;
grunt> dump g;