[Hadoop] Apache Hadoop - HDFS & MapReduce

2022, Apr 17    


Apache Hadoop


  • Hadoop is a framework that lets us store and process large data sets in a parallel, distributed fashion
  • HDFS : distributed file system - for data storage
  • MapReduce : parallel & distributed data processing
  • Master/Slave architecture : used by both HDFS and YARN


HDFS


  1. NameNode : Master
    • controls the cluster and coordinates the DataNodes
    • receives heartbeats and block reports from the DataNodes
    • records metadata (information about data blocks, e.g. location of stored files, file sizes, permissions, hierarchy)
  2. DataNode : Slave
    • stores the actual data
    • serves read & write requests from clients
  3. Secondary NameNode
    • receives the first copy of the fsimage
    • performs Checkpointing : merges the editlog into the fsimage, creates a newly updated fsimage, and sends it back to the NameNode
      • editlog : contains records of the modifications made to the data
      • editlog (new) : collects all changes made after a checkpoint starts, until the next checkpoint merges them in turn
      • fsimage : contains the complete snapshot of the filesystem metadata
      • happens periodically (once per hour by default; see the sketch below)
    • keeps the NameNode more available by offloading the merge work
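
Concretely, the checkpoint schedule is just configuration. A minimal sketch in Java, reading the relevant properties from a client-side Configuration (the property names are the standard HDFS ones; the defaults shown match Hadoop's shipped defaults):

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
  public static void main(String[] args) {
    // Loads core-site.xml / hdfs-site.xml from the classpath
    Configuration conf = new Configuration();

    // dfs.namenode.checkpoint.period: seconds between checkpoints (default 3600 = 1h)
    long period = conf.getLong("dfs.namenode.checkpoint.period", 3600);

    // dfs.namenode.checkpoint.txns: a checkpoint is also forced after this many
    // un-checkpointed editlog transactions (default 1,000,000)
    long txns = conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000);

    System.out.println("checkpoint period (s): " + period);
    System.out.println("checkpoint txn limit : " + txns);
  }
}
```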


HDFS Data Blocks


  • How is the data actually stored in the DataNodes?
    • Each file is split into HDFS blocks (128 MB each by default; the last block only holds the remaining bytes)
    • These data blocks are distributed across the DataNodes in your Hadoop cluster, as the sketch below illustrates
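
For intuition, the block arithmetic for a hypothetical 500 MB file (the file size is just an illustration; 128 MB is the default block size):

```java
public class BlockMath {
  static final long BLOCK_SIZE = 128L * 1024 * 1024; // HDFS default: 128 MB

  public static void main(String[] args) {
    long fileSize = 500L * 1024 * 1024; // hypothetical 500 MB file

    long fullBlocks = fileSize / BLOCK_SIZE; // number of full 128 MB blocks
    long lastBlock  = fileSize % BLOCK_SIZE; // last block holds only the remainder

    System.out.println("full blocks: " + fullBlocks);                        // 3
    System.out.println("last block : " + lastBlock / (1024 * 1024) + " MB"); // 116 MB
  }
}
```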


Advantages of Distributing Data Files


  • This kind of distributed file system is highly scalable (flexible) and efficient, without wasting resources
  • It saves time by processing the data in a parallel manner
  • Fault Tolerance : how does Hadoop address DataNode failure?
    • suppose one of the DataNodes containing the data blocks crashes
    • to prevent data loss from a defective DataNode, Hadoop keeps multiple copies of the data
      • Solution : Replication Factor - the number of replicas created per data block (3 by default)
        • the copies of each data block are placed on different DataNodes
        • even if two of the copies become defective, we still have one left intact (see the sketch below)
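
A minimal sketch of controlling the replication factor from the Java client API, assuming a reachable cluster; the file path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // dfs.replication: copies created per data block (default 3)
    conf.setInt("dfs.replication", 3);

    FileSystem fs = FileSystem.get(conf);

    // Raise the replication factor of an existing (hypothetical) file to 5,
    // e.g. for a hot dataset that must survive more simultaneous failures
    fs.setReplication(new Path("/user/demo/hot-data.csv"), (short) 5);
  }
}
```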


HDFS Writing Mechanism


  • Step 1 : Pipeline Setup
    • the ClientNode sends a write request for Block A to the NameNode
    • the NameNode replies with the IP addresses of DataNodes 1, 4 and 6, where Block A will be copied
    • first, the ClientNode asks DN1 to get ready to copy Block A; then DN1 asks the same of DN4, and DN4 of DN6
    • this chain of confirmations is the pipeline!


  • Step 2 : Data Streaming (Actual Writing)
    • once the pipeline has been created, the client pushes the data into it
    • the client copies Block A to DataNode 1 only
    • the remaining replication is always done sequentially by the DataNodes themselves
      • DN1 connects to DN4, which copies the block; DN4 then connects to DN6, which copies it as well


  • Step 3 : Acknowledgement (in reverse order)
    • each DN confirms that its step of the replication succeeded
    • DN6 -> DN4 -> DN1 -> Client -> NameNode -> update metadata
  • Summary for Multiple Blocks : the same three steps (pipeline setup, streaming, acknowledgement) are repeated for every block of the file; the sketch below shows how little of this the client actually sees
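
From the client's point of view, the whole pipeline is hidden behind the filesystem API. A minimal write sketch, assuming a reachable cluster (the path is hypothetical); create() triggers the pipeline setup, and close() completes the acknowledgement chain:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // create() asks the NameNode for target DataNodes and sets up the
    // replication pipeline; the client only ever streams to the first DataNode
    try (FSDataOutputStream out = fs.create(new Path("/user/demo/blockA.txt"))) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    } // close() waits for the acknowledgements; the NameNode then updates its metadata
  }
}
```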


HDFS Reading Mechanism
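
The read path mirrors the write path: the client asks the NameNode for the DataNode locations of each block, then streams the blocks directly from the nearest replica. A minimal read sketch (same hypothetical path as above):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // open() fetches the block locations from the NameNode; the stream then
    // reads each block from the closest DataNode that holds a replica
    try (FSDataInputStream in = fs.open(new Path("/user/demo/blockA.txt"));
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```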




MapReduce : Parallel & Distributed Processing


  • A single file is split into multiple parts, and each part is processed by one DataNode simultaneously
  • This allows much faster processing
  • MapReduce can be divided into 2 distinct steps
    1. Map Tasks
    2. Reduce Tasks
      • Example : the overall MapReduce word-counting process
  • 3 Major Parts of a MapReduce Program (see the sketch below)
    • Mapper Code : how the map tasks will process the data to produce key-value pairs
    • Reducer Code : combines the intermediate key-value pairs generated by the Mapper and gives the final aggregated output
    • Driver Code : specifies all the job configurations (job name, input path, output path, etc.)
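
These three parts map directly onto the classic WordCount example that ships with Hadoop; a condensed version of it is sketched below (input and output paths are passed as command-line arguments):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper Code: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer Code: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver Code: job name, classes, and input/output paths
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

It would be submitted with something like `hadoop jar wordcount.jar WordCount /input /output` (the jar name and paths here are hypothetical).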


YARN (Yet Another Resource Negotiator) : for MapReduce Process


  • Cluster-management component of Hadoop
  • It includes the Resource Manager, Node Managers, Containers, and Application Masters
    • Resource Manager (runs on the master node, like the NameNode)
      • major component that manages applications and job scheduling for the batch processes
      • allocates the first container for the AppMaster
    • Slave Nodes (run alongside the DataNodes)
      • Node Manager :
        • controls task execution on its DataNode in the cluster
        • reports node status to the RM : monitors each container's resource usage (memory, CPU, ...) and reports it to the RM
      • AppMaster :
        • coordinates and manages an individual application
        • runs only for the lifetime of the application; it is terminated as soon as the MapReduce job is completed
        • resource requests : negotiates resources with the RM
        • works with the NMs to execute and monitor the tasks
      • Container :
        • a set of resources (RAM, CPU, ...) allocated on a single DataNode
        • reports MapReduce task status to the AppMaster
        • scheduled by the RM, monitored by the NM


  • YARN Architecture


- <img src="https://user-images.githubusercontent.com/92680829/159842914-d0031950-45ff-488c-904c-df9fc425d11e.png"  width="600">
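
For completeness, a small client-side sketch using the YarnClient API to list the applications the RM currently knows about (assuming the cluster configuration is available on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Connects to the Resource Manager named in the cluster configuration
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Each report covers one application and its AppMaster's current state
    for (ApplicationReport report : yarnClient.getApplications()) {
      System.out.println(report.getApplicationId() + " : " + report.getYarnApplicationState());
    }

    yarnClient.stop();
  }
}
```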


Hadoop Architecture : HDFS (Storage) & YARN (MapReduce)




Hadoop Cluster



Hadoop Cluster Modes


  • Multi-Node Cluster Mode
    • one cluster consists of multiple nodes; the daemons (NameNode, ResourceManager, DataNodes, NodeManagers) run on separate machines
  • Pseudo-Distributed Mode
    • all daemons run on a single machine, each in its own JVM process
  • Standalone (Local) Mode
    • everything runs in a single JVM against the local filesystem, with no Hadoop daemons at all (see the sketch below)
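
The modes differ mainly in a couple of configuration properties. A rough sketch of the conventional values (the host name and port are illustrative, not prescriptive):

```java
import org.apache.hadoop.conf.Configuration;

public class ClusterModeConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false); // start empty, for illustration

    // Standalone (local) mode: one JVM, local filesystem, no daemons
    conf.set("fs.defaultFS", "file:///");
    conf.set("mapreduce.framework.name", "local");

    // Pseudo-distributed mode: all daemons on one machine
    // conf.set("fs.defaultFS", "hdfs://localhost:9000");
    // conf.set("mapreduce.framework.name", "yarn");

    // Multi-node cluster mode: masters on dedicated machines (hypothetical host)
    // conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    // conf.set("mapreduce.framework.name", "yarn");

    System.out.println(conf.get("fs.defaultFS"));
  }
}
```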


Hadoop Ecosystem