Question
Phase 4
1. Preparing Data and Uploading to HDFS. You can use a variety of ways to prepare your data set, including:
- Use the APIs provided by each website, such as the Facebook API, Twitter API, and Flickr API
- Use benchmarking data sets, such as
o UCI data set: http://archive.ics.uci.edu/ml/datasets.html
o Wikipedia database: https://en.wikipedia.org/wiki/Database_testing
- Government database
o US Census data: http://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml
o NOAA weather data: https://www.ncdc.noaa.gov/cdo-web/
- Implement a data collection program using web queries
- Use a synthesized data set
- Search the web (e.g., with Google)
2. Your data set MUST have at least 100,000 instances (or rows)
3. Upload your data set into HDFS (VM)
4. Implement MapReduce and Spark programs on your Developer VM
- You can use Java, or Hadoop Streaming with another programming language such as Python.
o 1 MapReduce program
o 1 Spark program
o 1 Hive or Pig program
5. Submit your source code to Canvas along with a download link for your data set
- All source files should be compressed with TAR (e.g., tar cvf XXX.tar) on the VM (JAR, TAR, or ZIP file format ONLY)
Explanation / Answer
This is how you move your dataset from the local file system to the Hadoop Distributed File System (HDFS):
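A minimal sketch of that command, assuming a standard Hadoop installation (local_file and HDFS_location are placeholders, not fixed names):

hdfs dfs -put local_file HDFS_location

(The older form hadoop fs -put local_file HDFS_location works as well.)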
Here, local_file is the data set's file name together with the path where it is saved on your local system, and HDFS_location is the destination path in HDFS where the file should be stored.
A VM (virtual machine) lets you run a second operating system alongside the host system on the same machine.
MapReduce is the programming model Hadoop uses to process data and store the results in HDFS. Spark is a newer framework that performs the same kind of jobs and is often much faster; for in-memory workloads it is advertised as up to 100 times faster than classic MapReduce.
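To make the MapReduce and Spark parts concrete, here are two minimal word-count sketches in Python; the file names, HDFS paths, and the word-count task itself are assumptions for illustration only, not requirements of the assignment.

mapper.py (Hadoop Streaming mapper):

#!/usr/bin/env python3
# Reads raw lines from stdin and emits "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

reducer.py (Hadoop Streaming reducer):

#!/usr/bin/env python3
# Sums the counts per word; Hadoop delivers the mapper output sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

A typical submission command (the streaming jar location varies by Hadoop version) looks like:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /user/hadoop/input -output /user/hadoop/output

The same word count in PySpark (the HDFS paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read the uploaded data set from HDFS as plain text lines (placeholder path).
lines = spark.sparkContext.textFile("hdfs:///user/hadoop/input/dataset.csv")

# Split lines into words, map each word to (word, 1), and sum the counts per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Write the results back to HDFS (placeholder path).
counts.saveAsTextFile("hdfs:///user/hadoop/output/wordcount")

spark.stop()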
Pig and Hive are higher-level ETL and query tools that run on top of Hadoop.
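For example, once an external Hive table has been defined over the HDFS directory holding the data set, a query such as SELECT some_column, COUNT(*) FROM my_table GROUP BY some_column is compiled into distributed jobs automatically (MapReduce, Tez, or Spark, depending on the execution engine); my_table and some_column are placeholder names here.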