[Hadoop] Apache Hadoop - HDFS & Map Reduce
2022, Apr 17
Apache Hadoop
- Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion
- HDFS : Hadoop Distributed File System - for data storage
- MapReduce : parallel & distributed data processing
- Master/Slave Architecture : used by both HDFS and YARN
HDFS
- NameNode : Master
- coordinates the DataNodes and tracks where each data block is stored
- receives heartbeats and block reports from all DataNodes
- records metadata (information about data blocks, e.g. location of files stored, size of files, permissions, hierarchy)
- DataNode : Slave
- stores the actual data
- serves read & write requests from clients
- Secondary NameNode
- receives a copy of the fsimage and editlog from the NameNode
- performs Checkpointing : merges the editlog into the fsimage, creating a newly updated fsimage, and sends it back to the NameNode
- editlog : contains records of modifications to the data
- editlog (new) : retains all new changes until the next checkpoint happens, when it is merged in turn
- fsimage : contains the complete snapshot of the file system metadata
- happens periodically (once per hour by default)
- makes the NameNode more available
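The checkpoint step above can be sketched as a toy Python model. The dict/list representations of `fsimage` and `editlog` here are illustrative stand-ins, not Hadoop's actual on-disk formats:

```python
# Toy model of Secondary NameNode checkpointing: replay the editlog
# on top of the last fsimage to produce an updated fsimage.
def checkpoint(fsimage, editlog):
    """Return a new fsimage with all editlog records applied."""
    new_image = dict(fsimage)               # copy of the last snapshot
    for op, path, meta in editlog:          # replay modifications in order
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
    return new_image                        # shipped back to the NameNode

fsimage = {"/data/a.txt": {"size": 128}}
editlog = [("create", "/data/b.txt", {"size": 64}),
           ("delete", "/data/a.txt", None)]
print(checkpoint(fsimage, editlog))         # {'/data/b.txt': {'size': 64}}
```

This is why checkpointing helps availability: the NameNode never has to replay a long editlog itself at startup, since the Secondary NameNode keeps the fsimage close to current.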
HDFS Data Blocks
- How is the data actually stored in DataNodes?
- Each file is stored as several HDFS blocks (default size is 128MB, but the last block only holds the remaining bytes)
- These data blocks are distributed across the DataNodes in your Hadoop cluster
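The block split described above is simple arithmetic; a small sketch (using the default 128MB block size):

```python
BLOCK_SIZE_MB = 128  # HDFS default block size

def split_into_blocks(file_size_mb):
    """Sizes of the HDFS blocks for a file; the last block holds the remainder."""
    full, rest = divmod(file_size_mb, BLOCK_SIZE_MB)
    return [BLOCK_SIZE_MB] * full + ([rest] if rest else [])

# A 300 MB file occupies 3 blocks: two full 128 MB blocks plus a 44 MB one.
print(split_into_blocks(300))   # [128, 128, 44]
```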
Advantages of distributing a data file
- This kind of distributed file system is highly scalable (flexible) and efficient, without wasting resources
- Saves time by processing data in a parallel manner
- Fault Tolerance : How Does Hadoop Address DataNode Failure?
- suppose one of the DataNodes containing the data blocks crashes
- to prevent data loss from a defective DataNode, we should make multiple copies of the data
- Solution : Replication Factor - how many replicas are created per data block
- The copies of each data block are placed on different DataNodes (three copies by default)
- Even if two of the copies become defective, we still have one left intact
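The fault-tolerance argument above can be made concrete with a toy check (node names like `DN1` are illustrative):

```python
REPLICATION_FACTOR = 3  # HDFS default: each block is stored thrice

def surviving_replicas(replica_nodes, failed_nodes):
    """DataNodes that still hold a copy of the block after failures."""
    return [n for n in replica_nodes if n not in failed_nodes]

replicas = ["DN1", "DN4", "DN6"]        # the 3 copies of one block
# Even if two replica-holding nodes crash, one intact copy remains:
print(surviving_replicas(replicas, {"DN1", "DN4"}))   # ['DN6']
```

With a replication factor of 3, the cluster tolerates up to two simultaneous DataNode failures per block before any data is lost.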
HDFS Writing Mechanism
- Step 1. Pipeline Setup
- ClientNode sends a write request for Block A to the NameNode
- NameNode returns the IP addresses of DN1, DN4, and DN6, where Block A will be copied
- First, the client asks DN1 to be ready to copy Block A; DN1 then asks the same of DN4, and DN4 of DN6
- This is the Pipeline!
- Step 2 : Data Streaming (Actual Writing)
- Once the pipeline has been created, the client pushes the data into it
- The client copies the block (A) to DataNode 1 only
- The remaining replication is always done by the DataNodes sequentially
- DN1 connects to DN4, which copies the block; DN4 then connects to DN6, which copies it as well
- Step 3 : Acknowledgement in Reverse
- each DN confirms that its step of the replication succeeded
- DN6 -> DN4 -> DN1 -> Client -> NameNode -> Update Metadata
- Summary for Multiple Blocks
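The three steps above can be sketched as a toy simulation: the client writes to the first DataNode only, the copy is forwarded down the pipeline, and acknowledgements come back in reverse order (node names are illustrative):

```python
# Toy model of the HDFS write pipeline: the client streams a block to the
# first DataNode only; each DataNode forwards it downstream, and
# acknowledgements travel back in reverse order.
def write_block(block, pipeline, storage):
    """Copy `block` to every DataNode in `pipeline`; return the ack order."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)           # this DN stores the block
    downstream_acks = write_block(block, rest, storage)  # forward to the next DN
    return downstream_acks + [head]                      # ack only after downstream acks

storage = {}
acks = write_block("Block A", ["DN1", "DN4", "DN6"], storage)
print(acks)   # ['DN6', 'DN4', 'DN1'] -> then Client -> NameNode
```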
HDFS Reading Mechanism
- Client asks the NameNode for the block locations of the file
- NameNode returns the list of DataNodes holding each block
- Client reads the blocks directly from the nearest DataNodes, in parallel
Map Reduce : Parallel & Distributed Processing
- A single file is split into multiple parts, each processed by one DataNode simultaneously
- This allows much faster processing
- Map Reduce can be divided into 2 Distinct Steps
- Map Tasks
- Reduce Tasks
- Example : Overall MapReduce Word Counting Process
- 3 Major Parts of MapReduce Program
- Mapper Code : how map tasks process the data to produce key-value pairs
- Reducer Code : combines the intermediate key-value pairs generated by the Mapper and gives the final aggregated output
- Driver Code : specifies all the job configurations (job name, input path, output path, etc.)
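The word-counting example and the three parts above can be sketched as a plain-Python simulation (no Hadoop involved; `mapper`, `reducer`, and `driver` here just mirror the roles of the three code parts):

```python
from collections import defaultdict

# Toy word count mirroring the three parts of a MapReduce program.
def mapper(line):
    """Map task: emit one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """Reduce task: aggregate all counts for one key."""
    return word, sum(counts)

def driver(lines):
    """'Job configuration': wire mapper -> shuffle -> reducer."""
    shuffled = defaultdict(list)
    for line in lines:                    # map phase, one line per map task
        for word, one in mapper(line):
            shuffled[word].append(one)    # shuffle: group values by key
    return dict(reducer(w, c) for w, c in shuffled.items())

print(driver(["deer bear river", "car car river", "deer car bear"]))
# {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In real Hadoop each map task runs on the DataNode holding its input split, and the shuffle moves intermediate pairs across the network to the reducers; only the data flow is simulated here.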
YARN (Yet Another Resource Negotiator) : for MapReduce Process
- Cluster management component of Hadoop
- It includes Resource Manager, Node Manager, Containers, and Application Master
- Resource Manager (runs on the Master, like the NameNode)
- major component that manages applications and job scheduling for the batch processes
- allocates the first container for the AppMaster
- Slave Nodes (DataNode)
- Node Manager :
- controls task execution on each DataNode in the cluster
- reports node status to the RM : monitors each container's resource usage (memory, CPU, ...) and reports it to the RM
- AppMaster :
- coordinates and manages an individual application
- runs only for the duration of the application; terminated as soon as the MapReduce job is completed
- resource request : negotiates resources with the RM
- works with the NM to execute and monitor the tasks
- Container :
- a set of resources (RAM, CPU, ...) allocated on a single DataNode
- reports MapReduce status to the AppMaster
- scheduled by the RM, monitored by the NM
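The resource negotiation above can be sketched as a toy model. This is a simplification under assumed names (`ResourceManager`, `allocate`); real YARN scheduling (capacity/fair schedulers, locality, vcores) is far richer:

```python
# Toy model of YARN resource negotiation: the AppMaster asks the
# ResourceManager for containers; the RM grants them from free node capacity.
class ResourceManager:
    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)     # free memory per NodeManager

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] -= memory_mb
                return {"node": node, "memory_mb": memory_mb}
        return None                          # request must wait for capacity

rm = ResourceManager({"DN1": 2048, "DN2": 1024})
print(rm.allocate(1024))   # {'node': 'DN1', 'memory_mb': 1024}
print(rm.allocate(2048))   # None: no single node has 2048 MB free now
```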
- YARN Architecture
- <img src="https://user-images.githubusercontent.com/92680829/159842914-d0031950-45ff-488c-904c-df9fc425d11e.png" width="600">
Hadoop Architecture : HDFS (Storage) & YARN (MapReduce)
Hadoop Cluster
Hadoop Cluster Modes
- Multi-Node Cluster Mode
- one cluster consists of multiple nodes
- Pseudo-Distributed Mode
- all Hadoop daemons run on a single machine, each in its own JVM
- Standalone Mode
- everything runs in a single JVM using the local file system instead of HDFS (the default mode, mainly for debugging)