[Hadoop] Run Hadoop Cluster on Docker # 1 - Set up Hadoop on CentOS Container

1. Download CentOS Image

(mac term) In your Mac terminal, run the command below to create a new container from the CentOS image (version 7 here). Docker pulls the image first if it is not already on your machine.

$ docker run --restart always --name [container_name] -dt centos:7

The centos:7 image now appears in your Docker image list (Docker Dashboard).

A new CentOS container is created with the name you passed to the option --name [container_name] (here, my_centos_container).
(mac term) open a shell in the CentOS container that you’ve just created
```
$ docker exec -it my_centos_container /bin/bash
```

You can list the running containers with the command docker ps.

(mac term) in general, open a shell in any running container with

$ docker exec -it [container_name] /bin/bash

After this command runs, your shell prompt changes from your local prompt to root@[container_id], which means you are now working inside the container.
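
Before attaching, it can help to confirm that the container is actually running. The snippet below is a minimal sketch of one way to do that: `is_running` is a hypothetical helper name (not a Docker command), and the container name `my_centos_container` is the one assumed in the example above.

```shell
# is_running NAME: succeeds if NAME appears as a whole line on stdin.
# Hypothetical helper; feed it the output of `docker ps --format '{{.Names}}'`.
is_running() {
    grep -qxF -- "$1"
}

# Only query Docker if the CLI is installed on this machine.
if command -v docker >/dev/null 2>&1; then
    if docker ps --format '{{.Names}}' 2>/dev/null | is_running my_centos_container; then
        echo "running; attach with: docker exec -it my_centos_container /bin/bash"
    else
        echo "my_centos_container is not running"
    fi
fi
```

Matching the whole line (`-x`) as a fixed string (`-F`) avoids false positives from containers whose names merely contain the one you are looking for.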

2. Setting Hadoop Base on CentOS Image

(mac term) create a new container named ‘hadoop_base’ that will serve as your Hadoop base
```
$ docker run -dit --name hadoop_base centos:7
```

(mac term) open a shell in hadoop_base

$ docker exec -it hadoop_base /bin/bash

(container) update the yum packages and install all required packages

## CentOS container
$ yum update -y
$ yum install wget -y
$ yum install vim -y
$ yum install openssh-server openssh-clients openssh-askpass -y
$ yum install java-1.8.0-openjdk-devel -y

wget : command-line tool for downloading files over HTTP, HTTPS and FTP
vim : text editor for editing files in the terminal
openssh-server openssh-clients openssh-askpass : SSH server and client tools for remote login over the SSH protocol
java-1.8.0-openjdk-devel : OpenJDK 8 development kit; choose the Java version that your Hadoop release supports
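
To confirm that the installs above succeeded, a quick check like the following may help. It is only a sketch: `missing_tools` is a hypothetical helper name, and the list of commands is my assumption about what the packages above provide.

```shell
# missing_tools CMD...: print each command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands assumed to come from the packages installed above.
missing=$(missing_tools wget vim ssh ssh-keygen javac)
if [ -n "$missing" ]; then
    echo "not installed yet:" $missing
else
    echo "all required tools are installed"
fi
```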

(container) run the commands below to allow password-free SSH between containers (the nodes of the Hadoop cluster)

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

(container) generate the host keys that sshd needs in order to start

$ ssh-keygen -f /etc/ssh/ssh_host_rsa_key -t rsa -N ""
$ ssh-keygen -f /etc/ssh/ssh_host_ecdsa_key -t ecdsa -N ""
$ ssh-keygen -f /etc/ssh/ssh_host_ed25519_key -t ed25519 -N "" 
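
Note that appending with `>>` adds the same public key again every time the command is re-run. The sketch below shows one way to make that step repeatable; `append_key_once` is a hypothetical helper (not part of OpenSSH), and the demonstration uses throwaway files with a dummy key line rather than your real ~/.ssh directory.

```shell
# append_key_once PUB_FILE AUTH_FILE: add the key only if it is not already listed.
# Hypothetical helper, not part of OpenSSH.
append_key_once() {
    key=$(cat "$1") || return 1
    touch "$2"
    grep -qxF -- "$key" "$2" || printf '%s\n' "$key" >> "$2"
}

# Demonstration on throwaway files with a dummy (non-functional) key line.
demo_dir=$(mktemp -d)
echo "ssh-rsa AAAAexample root@demo" > "$demo_dir/key.pub"
append_key_once "$demo_dir/key.pub" "$demo_dir/authorized_keys"
append_key_once "$demo_dir/key.pub" "$demo_dir/authorized_keys"   # second call is a no-op
wc -l < "$demo_dir/authorized_keys"
```

In the container, the two arguments would be the .pub file generated above and ~/.ssh/authorized_keys.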

(container) set JAVA_HOME and add it to PATH

$ readlink -f /usr/bin/javac     ## check where your JDK is installed
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64/bin/javac
$ vim ~/.bashrc      ## edit ~/.bashrc in the terminal with vim

(vim) type ‘i’ to enter insert mode and add your Java directory (note: leave off the trailing ‘/bin/javac’ part)

.
.
## use the path that readlink printed on your machine (version and architecture may differ)
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64
export PATH=$PATH:$JAVA_HOME/bin
.
.

(vim) to leave insert mode, press Esc
(vim) to save the edit and quit vim, type :wq (or :w to save, then :q to quit)
(container) reload the file so that the changes take effect in your current shell
```
$ source ~/.bashrc
```
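
Instead of copying the JDK path by hand, the value can be derived from the javac location. This is a sketch under assumptions: `java_home_from` is a hypothetical helper name, and the sample path is the readlink output shown above, so yours may differ.

```shell
# java_home_from JAVAC_PATH: strip the trailing /bin/javac from a resolved path.
# Hypothetical helper name.
java_home_from() {
    dirname "$(dirname "$1")"
}

# Path taken from the readlink output shown above; yours may differ.
java_home_from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64/bin/javac

# Inside the container, ~/.bashrc could compute the value instead of hard-coding it.
if command -v javac >/dev/null 2>&1; then
    JAVA_HOME=$(java_home_from "$(readlink -f "$(command -v javac)")")
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
fi
```

Computing the path this way keeps ~/.bashrc valid when a yum update changes the versioned directory name.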

3. Install Hadoop and Set Hadoop Configurations on CentOS Image

(container)

$ mkdir /hadoop_home       
$ cd /hadoop_home
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
## choose the hadoop version you want (here, hadoop-2.7.7)
$ tar -xvzf hadoop-2.7.7.tar.gz         ## extract the archive
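
The version number appears in several places in the commands above, so keeping it in one variable makes it easier to switch releases. A minimal sketch follows; `HADOOP_VERSION` and `hadoop_url` are hypothetical names, and the URL layout is simply the one used in the wget command above.

```shell
# Version used in this post; change it to the release you want.
HADOOP_VERSION=2.7.7

# hadoop_url VERSION: build the archive.apache.org download URL for a release.
# Hypothetical helper; it follows the URL layout of the wget command above.
hadoop_url() {
    echo "https://archive.apache.org/dist/hadoop/common/hadoop-$1/hadoop-$1.tar.gz"
}

hadoop_url "$HADOOP_VERSION"

# Inside the container, the download and extraction would then be:
#   wget "$(hadoop_url "$HADOOP_VERSION")"
#   tar -xvzf "hadoop-$HADOOP_VERSION.tar.gz"
```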

(container) add HADOOP_HOME directory to your PATH
```
$ vim ~/.bashrc
```

(vim)

.
.
export HADOOP_HOME=/hadoop_home/hadoop-2.7.7
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
## run sshd 
/usr/sbin/sshd          
.
.

(container) $ source ~/.bashrc
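
One side effect of `export PATH=$PATH:...` in ~/.bashrc is that every `source ~/.bashrc` appends the same directories to PATH again. The sketch below shows one possible guard; `path_append` is a hypothetical helper, and the fallback value of HADOOP_HOME is the path used in this post.

```shell
# path_append DIR: add DIR to PATH unless it is already there.
# Hypothetical helper, intended for ~/.bashrc.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, do nothing
        *) PATH="$PATH:$1" ;;
    esac
}

# Assumes HADOOP_HOME as set above; falls back to the path from this post.
HADOOP_HOME=${HADOOP_HOME:-/hadoop_home/hadoop-2.7.7}
path_append "$HADOOP_HOME/bin"
path_append "$HADOOP_HOME/sbin"
path_append "$HADOOP_HOME/bin"       # repeated call leaves PATH unchanged
export PATH
```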

(container) create the directories (tmp, namenode, datanode) under /hadoop_home

$ mkdir /hadoop_home/tmp
$ mkdir /hadoop_home/namenode
$ mkdir /hadoop_home/datanode
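
Plain `mkdir` fails if a directory already exists, which matters when you repeat these steps. A small sketch of a repeatable variant is shown here; `make_hadoop_dirs` is a hypothetical helper, and the demonstration runs in a temporary directory rather than /hadoop_home.

```shell
# make_hadoop_dirs BASE: create tmp, namenode and datanode under BASE.
# Hypothetical helper; -p makes the call safe to repeat.
make_hadoop_dirs() {
    mkdir -p "$1/tmp" "$1/namenode" "$1/datanode"
}

# Demonstration in a throwaway directory; in the container BASE would be /hadoop_home.
base=$(mktemp -d)
make_hadoop_dirs "$base"
make_hadoop_dirs "$base"    # second run succeeds because of -p
ls "$base"
```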

Now, edit hadoop configurations with vim

(container)

$ cd $HADOOP_CONFIG_HOME
## create mapred-site.xml in the $HADOOP_CONFIG_HOME directory from the bundled template
$ cp mapred-site.xml.template mapred-site.xml 

1) core-site.xml

(container) open the file core-site.xml

$ vim $HADOOP_CONFIG_HOME/core-site.xml

(vim)

<!-- core-site.xml -->
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/hadoop_home/tmp</value>
    </property>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://nn:9000</value>      <!-- nn : hostname of the namenode, name it as you want -->
        <final>true</final>
    </property>
</configuration>
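
If you would rather script this step than edit the file in vim, the same content can be written with a here-document. This is only a sketch: `write_core_site` is a hypothetical helper, the port 9000 and /hadoop_home/tmp are the values from this post, and the demonstration writes to a temporary file instead of $HADOOP_CONFIG_HOME/core-site.xml.

```shell
# write_core_site FILE NAMENODE_HOST: write a core-site.xml like the one above.
# Hypothetical helper; port 9000 and /hadoop_home/tmp are the values used in this post.
write_core_site() {
    cat > "$1" <<EOF
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/hadoop_home/tmp</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://$2:9000</value>
        <final>true</final>
    </property>
</configuration>
EOF
}

# Demonstration: write to a temporary file and show the namenode address line.
out=$(mktemp)
write_core_site "$out" nn
grep '<value>hdfs' "$out"
```

Passing the hostname as an argument keeps the namenode name in one place if you decide to call it something other than nn.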

2) hdfs-site.xml

<!-- hdfs-site.xml -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/hadoop_home/namenode</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/hadoop_home/datanode</value>
        <final>true</final>
    </property>
</configuration>
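
With dfs.replication set to 2, each block is stored on two datanodes, so the cluster needs at least two datanodes to satisfy it. To double-check what a config file actually contains, a rough lookup like the one below can be used. It is a sketch, not a real XML parser: `get_prop` is a hypothetical helper that assumes each name and value sits on its own line, as in the file above.

```shell
# get_prop FILE NAME: print the <value> that follows <name>NAME</name>.
# Hypothetical, line-based helper; not a general XML parser.
get_prop() {
    awk -v want="$2" '
        /<name>/ {
            name = $0
            sub(/.*<name>/, "", name)
            sub(/<\/name>.*/, "", name)
        }
        /<value>/ && name == want {
            value = $0
            sub(/.*<value>/, "", value)
            sub(/<\/value>.*/, "", value)
            print value
            exit
        }
    ' "$1"
}

# Demonstration on a temporary copy of two properties from the file above.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/hadoop_home/namenode</value>
        <final>true</final>
    </property>
</configuration>
EOF
get_prop "$conf" dfs.replication
```

Once Hadoop is on your PATH, `hdfs getconf -confKey dfs.replication` reports the value that Hadoop itself resolves, which is the more reliable check.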

3) mapred-site.xml

<!-- mapred-site.xml -->
<configuration>

    <property>
        <name>mapred.job.tracker</name>
        <value>nn:9001</value>
    </property>

</configuration>

Finally, format the namenode and commit the container as the image centos:hadoop

(container)

  $ hdfs namenode -format
  $ exit

(mac term)

  $ docker commit -m "hadoop in centos" hadoop_base centos:hadoop

Syntax: docker commit -m [message] [container_name] [image_name]

In the next post, we will create a namenode and multiple datanodes from the Hadoop base image created above (centos:hadoop).