MapReduce: Simplified Data Processing on Large Clusters
Steps in MapReduce
Data Flow in MapReduce
Input and Output types of a MapReduce job − (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
MapReduce | Input | Output |
---|---|---|
Map | <k1, v1> | list(<k2, v2>) |
Reduce | <k2, list(v2)> | list(<k3, v3>) |
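For example, in a word count job, Map takes <byte offset, line of text> and emits list(<word, 1>), while Reduce takes <word, list(1, 1, …)> and emits list(<word, total count>).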
- PayLoad − Applications implement the Map and the Reduce functions; these form the core of the job.
- Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
- NameNode − Node that manages the Hadoop Distributed File System (HDFS).
- DataNode − Node where the data resides before any processing takes place.
- MasterNode − Node where the JobTracker runs and which accepts job requests from clients.
- SlaveNode − Node where the Map and Reduce programs run.
- JobTracker − Schedules jobs and tracks the jobs assigned to the TaskTracker.
- TaskTracker − Tracks the tasks and reports status to the JobTracker.
- Job − An execution of a Mapper and Reducer across a dataset.
- Task − An execution of a Mapper or a Reducer on a slice of data.
- Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.
- sample.txt
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
ProcessUnits.java
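The source is not reproduced in these notes; below is a minimal sketch of what ProcessUnits.java could look like, using the old org.apache.hadoop.mapred API that matches hadoop-core-1.2.1. It assumes each row of sample.txt holds a year followed by monthly readings, with the yearly average in the last column (inferred from the expected output further down): the mapper emits (year, average) and the reducer keeps the years whose average exceeds 30. Only the package name t5750.hadoop is taken from the run command below; the class and member names are illustrative.

```java
package t5750.hadoop;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ProcessUnits {

    // Mapper: emits (year, yearly average), taking the average from the
    // last column of each input line.
    public static class EUnitMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String[] tokens = value.toString().trim().split("\\s+");
            String year = tokens[0];
            int avg = Integer.parseInt(tokens[tokens.length - 1]);
            output.collect(new Text(year), new IntWritable(avg));
        }
    }

    // Reducer: keeps only the years whose average exceeds the threshold.
    public static class EUnitReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        private static final int MAX_AVG = 30; // inferred from the sample output

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            while (values.hasNext()) {
                int val = values.next().get();
                if (val > MAX_AVG) {
                    output.collect(key, new IntWritable(val));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ProcessUnits.class);
        conf.setJobName("process-units");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(EUnitMapper.class);
        conf.setReducerClass(EUnitReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```

Compile and package it as follows: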
mkdir units
- /home/hadoop/hadoop-core-1.2.1.jar − the Hadoop core library used to compile and run the program
javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
jar -cvf units.jar -C units/ .
hadoop fs -mkdir /input_dir
hadoop fs -put /home/hadoop/sample.txt /input_dir
hadoop fs -ls /input_dir/
hadoop jar units.jar t5750.hadoop.ProcessUnits /input_dir /output_dir
hadoop fs -ls /output_dir/
hadoop fs -cat /output_dir/part-00000
1981 34
1984 40
1985 45
hadoop fs -get /output_dir /home/hadoop
Usage − hadoop [--config confdir] COMMAND
Option | Description |
---|---|
namenode -format | Formats the DFS filesystem. |
secondarynamenode | Runs the DFS secondary namenode. |
namenode | Runs the DFS namenode. |
datanode | Runs a DFS datanode. |
dfsadmin | Runs a DFS admin client. |
mradmin | Runs a Map-Reduce admin client. |
fsck | Runs a DFS filesystem checking utility. |
fs | Runs a generic filesystem user client. |
balancer | Runs a cluster balancing utility. |
oiv | Applies the offline fsimage viewer to an fsimage. |
fetchdt | Fetches a delegation token from the NameNode. |
jobtracker | Runs the MapReduce job Tracker node. |
pipes | Runs a Pipes job. |
tasktracker | Runs a MapReduce task Tracker node. |
historyserver | Runs job history servers as a standalone daemon. |
job | Manipulates the MapReduce jobs. |
queue | Gets information regarding JobQueues. |
version | Prints the version. |
jar <jar> | Runs a jar file. |
distcp <srcurl> <desturl> | Copies file or directories recursively. |
distcp2 <srcurl> <desturl> | DistCp version 2. |
archive -archiveName NAME -p <parent path> <src>* <dest> | Creates a hadoop archive. |
classpath | Prints the class path needed to get the Hadoop jar and the required libraries. |
daemonlog | Gets/sets the log level for each daemon. |
Usage − hadoop job [GENERIC_OPTIONS]
GENERIC_OPTION | Description |
---|---|
-submit <job-file> | Submits the job. |
-status <job-id> | Prints the map and reduce completion percentage and all job counters. |
-counter <job-id> <group-name> <countername> | Prints the counter value. |
-kill <job-id> | Kills the job. |
-events <job-id> <fromevent-#> <#-of-events> | Prints the events' details received by the jobtracker for the given range. |
-history [all] <jobOutputDir> | Prints job details, and failed and killed tip details. More details about the job, such as successful tasks and the task attempts made for each task, can be viewed by specifying the [all] option. |
-list [all] | Displays all jobs. -list displays only jobs which are yet to complete. |
-kill-task <task-id> | Kills the task. Killed tasks are NOT counted against failed attempts. |
-fail-task <task-id> | Fails the task. Failed tasks are counted against failed attempts. |
-set-priority <job-id> <priority> | Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW. |
hadoop job -status <JOB-ID>
e.g.
hadoop job -status job_1557970197535_0001
hadoop job -history <DIR-NAME | JOB-ID>
e.g.
hadoop job -history job_1557970197535_0001
hadoop job -kill <JOB-ID>
e.g.
hadoop job -kill job_1557970197535_0001
mkdir word-count
vi ~/word-count/word-count-data.txt
HDFS is a storage unit of Hadoop
MapReduce is a processing tool of Hadoop
start-dfs.sh
start-yarn.sh
hdfs dfs -mkdir /word-count
hdfs dfs -put ~/word-count/word-count-data.txt /word-count
WordCountMapper, WordCountReducer, WordCountRunner − sketched below
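The three classes live in the hadoop-demos project; a minimal sketch of what they might contain follows, collapsed into one listing for brevity and written against the old org.apache.hadoop.mapred API implied by the package name in the run command. The actual project layout and API may differ.

```java
package t5750.hadoop.mapred;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountRunner {

    // WordCountMapper: emits (word, 1) for every whitespace-separated token.
    public static class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // WordCountReducer: sums the counts emitted for each word.
    public static class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountRunner.class);
        conf.setJobName("word-count");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```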
# IDE → Gradle → Tasks → build → jar
hadoop jar ~/word-count/hadoop-demos.jar t5750.hadoop.mapred.WordCountRunner /word-count /word-count-output
# Browsing HDFS: http://192.168.100.210:50070/explorer.html#/
hdfs dfs -cat /word-count-output/part-00000
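Assuming a standard case-sensitive, whitespace-tokenized word count over the two input lines above, the output should look like this (keys in byte order, counts tab-separated):

```
HDFS	1
Hadoop	2
MapReduce	1
a	2
is	2
of	2
processing	1
storage	1
tool	1
unit	1
```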
mkdir char-count
vi ~/char-count/char-count-info.txt
hdfs dfs -mkdir /char-count
hdfs dfs -put ~/char-count/char-count-info.txt /char-count
CharCountMapper, CharCountReducer, CharCountRunner − only the mapper differs from word count; see the sketch below
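The job mirrors the word count example, so only the mapper changes. Here is a sketch of what CharCountMapper might look like; the class and package names mirror the run command below, and the rest is an assumption. CharCountReducer and CharCountRunner would be shaped like their word count counterparts.

```java
package t5750.hadoop.mapred;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// CharCountMapper: emits (character, 1) for every character in the line.
public class CharCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text character = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        for (int i = 0; i < line.length(); i++) {
            character.set(String.valueOf(line.charAt(i)));
            output.collect(character, ONE);
        }
    }
}
```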
# IDE → Gradle → Tasks → build → jar
hadoop jar ~/char-count/hadoop-demos.jar t5750.hadoop.mapred.CharCountRunner /char-count /char-count-output
hdfs dfs -cat /char-count-output/part-00000