Gangmax Blog

Setup A Hadoop Cluster

This post records the steps to set up a Hadoop cluster (version 2.6.1). Refer here.

Assume we have the following servers on the network:

192.168.149.86 HadoopMaster
192.168.149.87 HadoopSlave1
192.168.149.88 HadoopSlave2
192.168.149.89 HadoopSlave3
192.168.149.90 HadoopSlave4

Preparation

  1. Create the “hadoop” group and user on each server (master and slave nodes).

sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop

Note that all the commands below should be executed as the “hadoop” user, and all the directories involved should be owned by the “hadoop” user as well.
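For example, the “/export” directories used later in this post (“/export/App”, “/export/Data”, “/export/Logs”) can be prepared up front. This is a minimal sketch (not part of the original steps), assuming the same layout on every node:

# Create the directories referenced later in this post and hand them to the "hadoop" user.
sudo mkdir -p /export/App /export/Data /export/Logs
sudo chown -R hadoop:hadoop /export/App /export/Data /export/Logs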

  2. Make the nodes able to find each other by hostname.

Add the following content to the “/etc/hosts” file on each server (master and slave nodes):

192.168.149.86 HadoopMaster
192.168.149.87 HadoopSlave1
192.168.149.88 HadoopSlave2
192.168.149.89 HadoopSlave3
192.168.149.90 HadoopSlave4
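A quick sanity check (not part of the original steps) is to verify from the master node that each hostname resolves to the IP address listed above, for example:

# Both commands should report the address configured in /etc/hosts for the slave.
getent hosts HadoopSlave1
ping -c 1 HadoopSlave1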
  3. Install the JDK (master and slave nodes).

In this case JDK version “jdk1.8.0_45” is used, and the installation directory is “/export/jdk1.8.0_45”. Using the same directory on each node (master and slaves) makes it easier to set “JAVA_HOME” in the “~/.bashrc” file of each node.
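As a sketch of that installation (the archive name “jdk-8u45-linux-x64.tar.gz” below is an assumption; use whatever JDK 8u45 archive you actually downloaded):

# Extract the JDK archive into /export on each node; this creates /export/jdk1.8.0_45.
tar -xzf jdk-8u45-linux-x64.tar.gz -C /export
# Verify the installation.
/export/jdk1.8.0_45/bin/java -version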

  4. Configure the “/home/hadoop/.bashrc” file (master and slave nodes).
/home/hadoop/.bashrc
export JAVA_HOME=/export/jdk1.8.0_45

export HADOOP_PREFIX=/export/App/hadoop-2.6.1
export HADOOP_HOME=${HADOOP_PREFIX}
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HDFS_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_LOG_DIR=/export/Logs/hadoop
export HADOOP_PID_DIR=/export/Data/hadoop-2.6.1
export HADOOP_MAPRED_LOG_DIR=/export/Logs/hadoop
export HADOOP_MAPRED_PID_DIR=/export/Data/hadoop-2.6.1

export YARN_HOME=${HADOOP_HOME}
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export YARN_LOG_DIR=/export/Logs/hadoop
export YARN_PID_DIR=/export/Data/hadoop-2.6.1

export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH
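After editing the file, the new environment can be applied and checked (a quick verification, not from the original post; “hadoop version” only works once the Hadoop package from the later steps is in place):

# Reload the environment in the current shell.
source ~/.bashrc
# Confirm the variables and PATH are picked up.
echo $HADOOP_HOME
hadoop version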
  5. Configure SSH (master node).

From the “HadoopMaster” node, run the following commands so that the master node can log in to the slave nodes without a password.

ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@HadoopSlave1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@HadoopSlave2
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@HadoopSlave3
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@HadoopSlave4
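These commands assume the “hadoop” user on the master already has an RSA key pair. If not, one can be generated first and the setup verified afterwards (a sketch, assuming a passphrase-less key is acceptable):

# Generate a key pair without a passphrase (only if $HOME/.ssh/id_rsa does not exist yet).
ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa
# After running ssh-copy-id, this should print the slave hostname without asking for a password.
ssh hadoop@HadoopSlave1 hostname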
  6. Download the “Hadoop” package file and copy it to the master node.

In this case “hadoop-2.6.1” is used.

# 1. Download "Hadoop" package file.
wget -t 99 -c http://apache.01link.hk/hadoop/common/hadoop-2.6.1/hadoop-2.6.1.tar.gz
# 2. Send the file to each node.
scp hadoop-2.6.1.tar.gz root@192.168.149.86:/export/App
# 3. Log in to the "HadoopMaster (192.168.149.86)" server and extract the file;
# the "/export/App/hadoop-2.6.1" directory will be created.
cd /export/App
tar -xvf hadoop-2.6.1.tar.gz
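Since the archive was copied and extracted as “root” in this example, remember the earlier note about ownership; a follow-up step (not spelled out in the original post) is to hand the extracted tree to the “hadoop” user:

# Run as root (or with sudo): make the Hadoop installation owned by the "hadoop" user.
chown -R hadoop:hadoop /export/App/hadoop-2.6.1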

Update the “hadoop” configuration files on the “master” node

All the following files are on the “HadoopMaster (192.168.149.86)” node.

  1. Update “/export/App/hadoop-2.6.1/etc/hadoop/core-site.xml”
/export/App/hadoop-2.6.1/etc/hadoop/core-site.xml
<!-- Paste these lines into the <configuration> tag, or just update the existing entry by replacing "localhost" with the master hostname -->
<property>
<name>fs.default.name</name>
<value>hdfs://HadoopMaster</value>
</property>
  2. Update “/export/App/hadoop-2.6.1/etc/hadoop/hdfs-site.xml”
/export/App/hadoop-2.6.1/etc/hadoop/hdfs-site.xml
<!-- Paste/Update these lines into <configuration> tag -->
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/export/Data/hadoop-2.6.1/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/export/Data/hadoop-2.6.1/hdfs/datanode/</value>
</property>
  3. Update “/export/App/hadoop-2.6.1/etc/hadoop/yarn-site.xml”
/export/App/hadoop-2.6.1/etc/hadoop/yarn-site.xml
<!-- Paste/Update these lines into <configuration> tag -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>HadoopMaster:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>HadoopMaster:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>HadoopMaster:8050</value>
</property>
  4. Update “/export/App/hadoop-2.6.1/etc/hadoop/mapred-site.xml”
/export/App/hadoop-2.6.1/etc/hadoop/mapred-site.xml
<!-- Paste/Update these lines into <configuration> tag -->
<property>
<name>mapreduce.job.tracker</name>
<value>HadoopMaster:5431</value>
</property>
<property>
<name>mapred.framework.name</name>
<value>yarn</value>
</property>
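Note that the stock Hadoop 2.6.1 distribution usually ships only “mapred-site.xml.template”; if “mapred-site.xml” does not exist yet, it can be created from that template first (an extra step not spelled out in the original post):

# Create mapred-site.xml from the bundled template before editing it.
cd /export/App/hadoop-2.6.1/etc/hadoop
cp mapred-site.xml.template mapred-site.xml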
  5. Create “/export/App/hadoop-2.6.1/etc/hadoop/masters” file
/export/App/hadoop-2.6.1/etc/hadoop/masters
# Add the name of the master node
HadoopMaster
  6. Update “/export/App/hadoop-2.6.1/etc/hadoop/slaves” file
/export/App/hadoop-2.6.1/etc/hadoop/slaves
# Add the names of the slave nodes
HadoopSlave1
HadoopSlave2
HadoopSlave3
HadoopSlave4

Distribute the “hadoop” files from the master node to the slave nodes

We do this so that the configuration done in the previous section does not have to be repeated on every slave node.

# 1. Log in to the "master" node and package the "hadoop" directory on it.
cd /export/App
tar -czvf hadoop-2.6.1-config.tar.gz hadoop-2.6.1/
# 2. Distribute the package file to the slave nodes.
scp hadoop-2.6.1-config.tar.gz root@192.168.149.87:/export/App
scp hadoop-2.6.1-config.tar.gz root@192.168.149.88:/export/App
scp hadoop-2.6.1-config.tar.gz root@192.168.149.89:/export/App
scp hadoop-2.6.1-config.tar.gz root@192.168.149.90:/export/App
# 3. Log in to each slave node and run the following commands to extract it.
cd /export/App
tar -xvf hadoop-2.6.1-config.tar.gz
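The four “scp” commands above can also be written as a loop over the slave IP addresses (just a convenience sketch, equivalent to the commands shown):

# Copy the packaged configuration to every slave node.
for ip in 192.168.149.87 192.168.149.88 192.168.149.89 192.168.149.90; do
  scp hadoop-2.6.1-config.tar.gz root@$ip:/export/App
done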

Create directories for “namenode/datanode” on the “master/slave” nodes

# 1. Run the following command on the "master" node to create the "namenode" directory.
mkdir -p /export/Data/hadoop-2.6.1/hdfs/namenode/
# If the "hadoop" user does not have permission to create this directory, use "root"
# to do it and run the following command to set the owner of this directory to "hadoop".
chown hadoop:hadoop -R /export/Data/hadoop-2.6.1/
# 2. Run the following command on the "slave" nodes to create the "datanode" directory.
mkdir -p /export/Data/hadoop-2.6.1/hdfs/datanode/
# If the "hadoop" user does not have permission to create this directory, use "root"
# to do it and run the following command to set the owner of this directory to "hadoop".
chown hadoop:hadoop -R /export/Data/hadoop-2.6.1/

Format “namenode” on the “master” node

# Run the following command:
hdfs namenode -format

Start Hadoop Daemon on the “master” node

# 1. Start the "dfs" daemon:
start-dfs.sh

# 2. Start the "yarn" daemon:
start-yarn.sh
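To confirm the daemons came up, the running Java processes can be listed with “jps” on each node (a quick check, not from the original post; the exact process list depends on the configuration):

# On the master node this typically lists NameNode, SecondaryNameNode and ResourceManager;
# on each slave node it typically lists DataNode and NodeManager.
jps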

If everything works, you can access here and here in your browser, where you can find information about the HDFS cluster and the YARN cluster.
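The original links are not reproduced here. Assuming the Hadoop 2.x default web UI ports (an assumption; a customized setup may use different ports), the NameNode UI is normally served on port 50070 and the ResourceManager UI on port 8088:

# Quick reachability check for the two web UIs from any host that can resolve "HadoopMaster".
curl -sI http://HadoopMaster:50070 | head -n 1
curl -sI http://HadoopMaster:8088 | head -n 1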
