Question
Phase 4
1. Preparing Data and Uploading to HDFS. You can use a variety of ways to prepare your data set, including:
- Use the APIs provided by each website, such as the Facebook API, Twitter API, and Flickr API
- Use benchmarking data sets, such as
o UCI data set: http://archive.ics.uci.edu/ml/datasets.html
o Wikipedia database: https://en.wikipedia.org/wiki/Database_testing
- Government database
o US Census data: http://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml
o NOAA weather data: https://www.ncdc.noaa.gov/cdo-web/
- Implement a data collection program using web queries
- Use a synthesized data set
- Search the web (e.g., with Google)
2. Your data set MUST have at least 100,000 instances (or rows)
3. Upload your data set into HDFS (VM)
4. Implement MapReduce and Spark programs on your Developer VM
- You can use Java, or Hadoop Streaming with another programming language such as Python.
o 1 MapReduce program
o 1 Spark program
o 1 Hive or Pig program
5. Submit your source code to Canvas along with a download link for your data set
- All source files should be compressed with TAR (e.g., tar cvf XXX.tar) on the VM (JAR, TAR, or ZIP file format ONLY)
Explanation / Answer
This is how you move your dataset from the local file system to the Hadoop Distributed File System (HDFS):
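A minimal sketch of that command, assuming a standard Hadoop installation (local_file and HDFS_location are placeholders, not fixed names):

hdfs dfs -put local_file HDFS_location

(The older form hadoop fs -put local_file HDFS_location works as well.)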
Here, local_file is the data set's file name together with the path where it is saved on your local system, and HDFS_location is the destination path in HDFS where the file should be stored.
A VM (virtual machine) lets you run a second operating system alongside the host system on the same machine.
MapReduce is the programming model Hadoop uses to process data and store the results in HDFS. Spark is a newer framework that performs the same kind of jobs and is often much faster; for in-memory workloads it is advertised as up to 100 times faster than classic MapReduce.
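To make the MapReduce and Spark parts concrete, here are two minimal word-count sketches in Python; the file names, HDFS paths, and the word-count task itself are assumptions for illustration only, not requirements of the assignment.

mapper.py (Hadoop Streaming mapper):

#!/usr/bin/env python3
# Reads raw lines from stdin and emits "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

reducer.py (Hadoop Streaming reducer):

#!/usr/bin/env python3
# Sums the counts per word; Hadoop delivers the mapper output sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

A typical submission command (the streaming jar location varies by Hadoop version) looks like:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /user/hadoop/input -output /user/hadoop/output

The same word count in PySpark (the HDFS paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read the uploaded data set from HDFS as plain text lines (placeholder path).
lines = spark.sparkContext.textFile("hdfs:///user/hadoop/input/dataset.csv")

# Split lines into words, map each word to (word, 1), and sum the counts per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Write the results back to HDFS (placeholder path).
counts.saveAsTextFile("hdfs:///user/hadoop/output/wordcount")

spark.stop()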
Pig and Hive are higher-level ETL and query tools that run on top of Hadoop.
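For example, once an external Hive table has been defined over the HDFS directory holding the data set, a query such as SELECT some_column, COUNT(*) FROM my_table GROUP BY some_column is compiled into distributed jobs automatically (MapReduce, Tez, or Spark, depending on the execution engine); my_table and some_column are placeholder names here.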