Question
1) Use Spark's CSV reader to read the CSV files with the column names set correctly, all "?" values replaced with null, and the correct data type detected for each column.
Q1. What was the default data type of each column without using the parsing option?
Q2. What command would you use to check that the parsing done in 1) was correct?
Q3. What were the column names without applying parsing?
2) Write a Spark SQL query to find the count of rows in the data that are true and false, and arrange the counts in descending order.
Q1. How would you achieve the same count using the DataFrame API?
Q2. Which method would you prefer and why?
3) Create "summary" statistics (count, mean, stddev, min, max) for matches and misses in the dataset. Display the output summaries on the screen.
4) Explain why the reshaping or pivoting step was done on our dataset and what it achieved.
5) Use the crosstab function with different values of the threshold and justify which value gives you the best TP vs. FP tradeoff.
Explanation / Answer
Q1. Without schema inference, Spark's CSV reader gives every column the default data type, string (StringType).
Q2. Use printSchema() (together with show()) to confirm that the parsing done in 1) produced the correct column names, data types, and null values.
Q3. Without parsing the header row, Spark assigns the default column names _c0, _c1, _c2, and so on.
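A minimal PySpark sketch of the read described in 1), assuming the data lives in a hypothetical file data.csv; header, inferSchema, and nullValue are standard options of the CSV reader.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-read-example").getOrCreate()

# Read the CSV with column names taken from the header row,
# "?" treated as null, and a data type inferred for each column.
df = spark.read.csv(
    "data.csv",          # hypothetical path to the dataset
    header=True,         # use the first row as column names
    inferSchema=True,    # detect the correct data type per column
    nullValue="?",       # replace every "?" with null
)

# Verify the parsing: printSchema() shows the inferred types,
# show() lets you spot-check the replaced null values.
df.printSchema()
df.show(5)

# Without header/inferSchema, every column is read as a string
# and named _c0, _c1, ... by default.
raw = spark.read.csv("data.csv")
raw.printSchema()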
2)
Q1. The same count can be achieved with the DataFrame API by grouping on the boolean column, calling count(), and ordering by the count in descending order (see the sketch below).
Q2. Both approaches return the same result and are optimized by the same Catalyst optimizer, so the choice is largely stylistic; the DataFrame API avoids registering a temporary view, while the SQL version is more readable for anyone coming from SQL.
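A sketch of both approaches for 2), continuing from the df read above. The column name "label" and the view name "records" are placeholders for whatever the dataset actually uses.

from pyspark.sql import functions as F

# Spark SQL version: register a temporary view, then group and order.
df.createOrReplaceTempView("records")          # "records" is a placeholder name
spark.sql("""
    SELECT label, COUNT(*) AS cnt
    FROM records
    GROUP BY label
    ORDER BY cnt DESC
""").show()

# Equivalent DataFrame API version.
(df.groupBy("label")                           # "label" is a placeholder column
   .count()
   .orderBy(F.desc("count"))
   .show())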