[Hadoop] Apache Hadoop - HDFS & Map Reduce
2022, Apr 17
Apache Hadoop
- Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion
- HDFS : Hadoop Distributed File System - for data storage
- MapReduce : parallel & distributed data processing
- Master/Slave Architecture : used by both HDFS and YARN
HDFS
- NameNode : Master
- coordinates the DataNodes and tracks where each data block is stored
- receives heartbeats and block reports from all DataNodes
- records metadata (information about data blocks, e.g. location of files stored, size of files, permissions, hierarchy)
- DataNode : Slave
- stores the actual data
- serves read & write requests from clients
- Secondary NameNode
- receives a copy of the fsimage and editlog from the NameNode
- performs Checkpointing : merges the editlog into the fsimage, creating a newly updated fsimage, and sends it back to the NameNode
- editlog : contains records of modifications to the data
- editlog (new) : retains all new changes until the next checkpoint happens, when it is merged in turn
- fsimage : contains the complete snapshot of the file system metadata
- happens periodically (once per hour by default)
- makes the NameNode more available
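The checkpoint step above can be sketched as a toy Python model. The dict/list representations of `fsimage` and `editlog` here are illustrative stand-ins, not Hadoop's actual on-disk formats:

```python
# Toy model of Secondary NameNode checkpointing: replay the editlog
# on top of the last fsimage to produce an updated fsimage.
def checkpoint(fsimage, editlog):
    """Return a new fsimage with all editlog records applied."""
    new_image = dict(fsimage)               # copy of the last snapshot
    for op, path, meta in editlog:          # replay modifications in order
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
    return new_image                        # shipped back to the NameNode

fsimage = {"/data/a.txt": {"size": 128}}
editlog = [("create", "/data/b.txt", {"size": 64}),
           ("delete", "/data/a.txt", None)]
print(checkpoint(fsimage, editlog))         # {'/data/b.txt': {'size': 64}}
```

This is why checkpointing helps availability: the NameNode never has to replay a long editlog itself at startup, since the Secondary NameNode keeps the fsimage close to current.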
HDFS Data Blocks
- How is the data actually stored in DataNodes?
- Each file is stored as several HDFS blocks (default size is 128MB, but the last block only holds the remaining bytes)
- These data blocks are distributed across the DataNodes in your Hadoop cluster
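The block split described above is simple arithmetic; a small sketch (using the default 128MB block size):

```python
BLOCK_SIZE_MB = 128  # HDFS default block size

def split_into_blocks(file_size_mb):
    """Sizes of the HDFS blocks for a file; the last block holds the remainder."""
    full, rest = divmod(file_size_mb, BLOCK_SIZE_MB)
    return [BLOCK_SIZE_MB] * full + ([rest] if rest else [])

# A 300 MB file occupies 3 blocks: two full 128 MB blocks plus a 44 MB one.
print(split_into_blocks(300))   # [128, 128, 44]
```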
Advantages of distributing a data file
- This kind of distributed file system is highly scalable (flexible) and efficient, without wasting resources
- Saves time by processing data in a parallel manner
- Fault Tolerance : How Does Hadoop Address DataNode Failure?
- suppose one of the DataNodes containing the data blocks crashes
- to prevent data loss from a defective DataNode, we should make multiple copies of the data
- Solution : Replication Factor - how many replicas are created per data block
- The copies of each data block are placed on different DataNodes (three copies by default)
- Even if two of the copies become defective, we still have one left intact
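The fault-tolerance argument above can be made concrete with a toy check (node names like `DN1` are illustrative):

```python
REPLICATION_FACTOR = 3  # HDFS default: each block is stored thrice

def surviving_replicas(replica_nodes, failed_nodes):
    """DataNodes that still hold a copy of the block after failures."""
    return [n for n in replica_nodes if n not in failed_nodes]

replicas = ["DN1", "DN4", "DN6"]        # the 3 copies of one block
# Even if two replica-holding nodes crash, one intact copy remains:
print(surviving_replicas(replicas, {"DN1", "DN4"}))   # ['DN6']
```

With a replication factor of 3, the cluster tolerates up to two simultaneous DataNode failures per block before any data is lost.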
HDFS Writing Mechanism
- Step 1. Pipeline Setup
- ClientNode sends a write request for Block A to the NameNode
- NameNode returns the IP addresses of DN1, DN4, and DN6, where Block A will be copied
- First, the client asks DN1 to be ready to copy Block A; DN1 then asks the same of DN4, and DN4 of DN6
- This is the Pipeline!
- Step 2 : Data Streaming (Actual Writing)
- Once the pipeline has been created, the client pushes the data into it
- The client copies the block (A) to DataNode 1 only
- The remaining replication is always done by the DataNodes sequentially
- DN1 connects to DN4, which copies the block; DN4 then connects to DN6, which copies it as well
- Step 3 : Acknowledgement in Reverse
- each DN confirms that its step of the replication succeeded
- DN6 -> DN4 -> DN1 -> Client -> NameNode -> Update Metadata
- Summary for Multiple Blocks
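The three steps above can be sketched as a toy simulation: the client writes to the first DataNode only, the copy is forwarded down the pipeline, and acknowledgements come back in reverse order (node names are illustrative):

```python
# Toy model of the HDFS write pipeline: the client streams a block to the
# first DataNode only; each DataNode forwards it downstream, and
# acknowledgements travel back in reverse order.
def write_block(block, pipeline, storage):
    """Copy `block` to every DataNode in `pipeline`; return the ack order."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)           # this DN stores the block
    downstream_acks = write_block(block, rest, storage)  # forward to the next DN
    return downstream_acks + [head]                      # ack only after downstream acks

storage = {}
acks = write_block("Block A", ["DN1", "DN4", "DN6"], storage)
print(acks)   # ['DN6', 'DN4', 'DN1'] -> then Client -> NameNode
```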
HDFS Reading Mechanism
- Client asks the NameNode for the block locations of the file
- NameNode returns the list of DataNodes holding each block
- Client reads the blocks directly from the nearest DataNodes, in parallel
Map Reduce : Parallel & Distributed Processing
- A single file is split into multiple parts, each processed by one DataNode simultaneously
- This allows much faster processing
- Map Reduce can be divided into 2 Distinct Steps
- Map Tasks
- Reduce Tasks
- Example : Overall MapReduce Word Counting Process
- 3 Major Parts of MapReduce Program
- Mapper Code : how map tasks process the data to produce key-value pairs
- Reducer Code : combines the intermediate key-value pairs generated by the Mapper and gives the final aggregated output
- Driver Code : specifies all the job configurations (job name, input path, output path, etc.)
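The word-counting example and the three parts above can be sketched as a plain-Python simulation (no Hadoop involved; `mapper`, `reducer`, and `driver` here just mirror the roles of the three code parts):

```python
from collections import defaultdict

# Toy word count mirroring the three parts of a MapReduce program.
def mapper(line):
    """Map task: emit one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """Reduce task: aggregate all counts for one key."""
    return word, sum(counts)

def driver(lines):
    """'Job configuration': wire mapper -> shuffle -> reducer."""
    shuffled = defaultdict(list)
    for line in lines:                    # map phase, one line per map task
        for word, one in mapper(line):
            shuffled[word].append(one)    # shuffle: group values by key
    return dict(reducer(w, c) for w, c in shuffled.items())

print(driver(["deer bear river", "car car river", "deer car bear"]))
# {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In real Hadoop each map task runs on the DataNode holding its input split, and the shuffle moves intermediate pairs across the network to the reducers; only the data flow is simulated here.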
YARN (Yet Another Resource Negotiator) : for MapReduce Process
- Cluster management component of Hadoop
- It includes Resource Manager, Node Manager, Containers, and Application Master
- Resource Manager (runs on the Master, like the NameNode)
- major component that manages applications and job scheduling for the batch processes
- allocates the first container for the AppMaster
- Slave Nodes (DataNode)
- Node Manager :
- controls task execution on each DataNode in the cluster
- reports node status to the RM : monitors each container's resource usage (memory, CPU, ...) and reports it to the RM
- AppMaster :
- coordinates and manages an individual application
- runs only for the duration of the application; terminated as soon as the MapReduce job is completed
- resource request : negotiates resources with the RM
- works with the NM to execute and monitor the tasks
- Container :
- a set of resources (RAM, CPU, ...) allocated on a single DataNode
- reports MapReduce status to the AppMaster
- scheduled by the RM, monitored by the NM
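The resource negotiation above can be sketched as a toy model. This is a simplification under assumed names (`ResourceManager`, `allocate`); real YARN scheduling (capacity/fair schedulers, locality, vcores) is far richer:

```python
# Toy model of YARN resource negotiation: the AppMaster asks the
# ResourceManager for containers; the RM grants them from free node capacity.
class ResourceManager:
    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)     # free memory per NodeManager

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] -= memory_mb
                return {"node": node, "memory_mb": memory_mb}
        return None                          # request must wait for capacity

rm = ResourceManager({"DN1": 2048, "DN2": 1024})
print(rm.allocate(1024))   # {'node': 'DN1', 'memory_mb': 1024}
print(rm.allocate(2048))   # None: no single node has 2048 MB free now
```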
- YARN Architecture
- <img src="https://user-images.githubusercontent.com/92680829/159842914-d0031950-45ff-488c-904c-df9fc425d11e.png" width="600">
Hadoop Architecture : HDFS (Storage) & YARN (MapReduce)
Hadoop Cluster
Hadoop Cluster Modes
- Multi-Node Cluster Mode
- one cluster consists of multiple nodes
- Pseudo-Distributed Mode
- all Hadoop daemons run on a single machine, each in its own JVM
- Standalone Mode
- everything runs in a single JVM using the local file system instead of HDFS (the default mode, mainly for debugging)