[Hadoop] Run Hadoop Cluster on Docker # 2 - Create Hadoop Cluster and Set Up Hadoop Daemons on Docker Containers


2022, May 05    


1. Create Hadoop Cluster : NameNode and 3 DataNodes


- Create Nodes and Connect them

  • Previously, we built a hadoop-base container, installing Hadoop and setting up its configuration on a CentOS image
  • This time, we will create 4 nodes (a single namenode and 3 datanodes) using that previously built centos:hadoop image

  • First, let’s create the NameNode
  • (mac term)
    $ docker run -it -h namenode --name namenode -p 50070:50070 centos:hadoop
    


  • Port Forwarding
    • maps port 50070 of your local PC (localhost) to port 50070 of the Docker container
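    • To double-check the mapping, you can list the container’s port bindings from your local PC (the output shown is illustrative):
      $ (mac term) docker port namenode
      50070/tcp -> 0.0.0.0:50070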


  • Now let’s create 3 DataNodes linked to the NameNode container
    $ docker run -it -h dn1 --name dn1 --link namenode:nn centos:hadoop
    $ docker run -it -h dn2 --name dn2 --link namenode:nn centos:hadoop
    $ docker run -it -h dn3 --name dn3 --link namenode:nn centos:hadoop
    


  • Link the three datanodes to the namenode with the --link option
    • --link [container_name]:[alias]
  • This makes the /etc/hosts file of each slave container contain the IP address of the master container
  • Any change to the linked container’s IP address is updated automatically
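  • As a quick sanity check, you can look at a datanode’s hosts file; Docker should have added an entry similar to the one below (the IP address and container ID are just examples):
    $ (mac term) docker exec dn1 cat /etc/hosts
    172.17.0.2    nn 0a1b2c3d4e5f namenode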


- Store DataNode Information in the NameNode


  • Get the IP addresses of all three datanodes with the command below
  • docker inspect [target_container] | grep IPAddress
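  • For example (the address shown is just a placeholder; yours will be whatever Docker assigned):
    $ (mac term) docker inspect dn1 | grep IPAddress
                "IPAddress": "172.17.0.3",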


  • Now, add the IP addresses of the datanodes to the /etc/hosts file of the NameNode
  • exec into the NameNode container and edit /etc/hosts with vim as below
    $ (mac term) docker exec -it namenode /bin/bash
    $ (nn container) vim /etc/hosts 
    


    • Note that all changes to the hosts file are reset when you stop and restart the container, so make sure to re-edit the file after every restart (preparing a shell script for this is convenient; see the sketch below)
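      • A minimal sketch of such a script, run inside the namenode container after each restart (the IP addresses below are placeholders; replace them with the values from docker inspect):
        #!/bin/bash
        # add-datanodes.sh (hypothetical helper): re-append the datanode entries
        # to /etc/hosts after the container has been restarted
        echo "172.17.0.3 dn1" >> /etc/hosts
        echo "172.17.0.4 dn2" >> /etc/hosts
        echo "172.17.0.5 dn3" >> /etc/hosts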


  • Now let’s add the hostname of each datanode to the slaves file (example content after this list)

    • The $HADOOP_INSTALL/hadoop/conf directory contains some configuration files for Hadoop
    • The slaves file is one of them
      • This file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. By default this contains the single entry localhost
    • Other documentation is here
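    • For this cluster, the slaves file would simply list the three datanode hostnames, one per line (the exact path may differ depending on your Hadoop version and the setup from part 1):
      dn1
      dn2
      dn3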


2. Launch Hadoop Daemons on Docker Containers


  • You’ve just successfully created all the required nodes and connected them to each other
  • Now let’s actually run the Hadoop daemons by executing the start-all.sh script
    $ (nn container) start-all.sh
    


  • If you run the command above, you’ll see multiple warnings and questions like “Are you sure you want to continue connecting (yes/no)?”
  • Ignore the warnings and answer yes to the questions
  • When you finish all the steps, you can finally see the lines below


  • Startup Scripts
    • The $HADOOP_INSTALL/hadoop/bin directory contains scripts used to launch the Hadoop DFS and Hadoop Map/Reduce daemons
    • start-all.sh is one of them
      • It starts all Hadoop daemons: the namenode, datanodes, the jobtracker, and tasktrackers.
      • Now deprecated; use start-dfs.sh and then start-mapred.sh instead
    • Other documentation is here
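    • Since this cluster runs YARN (ResourceManager/NodeManager, as shown in the next section), the non-deprecated equivalent here would presumably be:
      $ (nn container) start-dfs.sh    # NameNode, SecondaryNameNode, DataNodes
      $ (nn container) start-yarn.sh   # ResourceManager, NodeManagers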


- Check Current Process Status


  • (nn container) After executing the script, type ps -ef to check the current process status


  • You’ll see that a java process is running, but no details are shown there
  • So you can alternatively use the jps command, which lists the processes currently running on the JVM
    • namenode :
    • datanode (dn2) :


  • You can see that all Hadoop daemons for the HDFS and YARN systems are running normally on each container
    • Secondary NameNode and Resource Manager are running on the namenode container
    • while DataNode and NodeManager are activated on the datanode containers
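  • Roughly, the jps output on each side should look like the sketch below (process IDs omitted; the exact list may vary with your configuration):
    $ (nn container) jps
    NameNode
    SecondaryNameNode
    ResourceManager
    Jps

    $ (dn1 container) jps
    DataNode
    NodeManager
    Jps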


- Start Hadoop on your Nodes


  • Now you can use Hadoop commands in your container terminal (see the example session after this list)

    • 1) # hdfs dfsadmin -report : checks the current status of the Hadoop cluster
      • You’ll find a ‘Live datanodes (3)’ line in the output, which shows that the three datanodes (dn1, 2, 3) are currently running
      • You can also drill down into the detailed status of each datanode


    • 2) # hdfs dfs -ls / or # hadoop fs -ls / : shows the current file system
      • hadoop fs [args]
        • FS refers to a generic file system that can point to any file system, such as the local FS, HDFS, etc. So it can be used when you are dealing with different file systems such as the local FS, (S)FTP, S3, and others
      • hdfs dfs [args] ( hadoop dfs [args] has been deprecated)
        • Specific to HDFS; it only works for operations on HDFS.


    • 3) hdfs dfs -mkdir /[dirname] : literally makes a directory
      • check that the directory was created by using the previous command # hdfs dfs -ls /
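    • Putting these together, a short example session on the namenode container could look like this (the /test directory name is just an example):
      $ (nn container) hdfs dfsadmin -report | grep "Live datanodes"
      $ (nn container) hdfs dfs -mkdir /test
      $ (nn container) hdfs dfs -ls /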


- Access the Admin Web Page


  • You can access the admin web page and see the current status of the Hadoop cluster from your local PC
  • Open a web browser and go to the URL ‘localhost:[port number]’
    • use the port number that you set for port mapping when creating the master node (namenode)
    • example : localhost:50070




  • The main page shows a summary of HDFS memory usage
  • You can also browse the directories of the HDFS file system through the ‘Utilities’ tab