Hadoop (Single Node) 2.7.1 with Docker from scratch.

Introduction.

The idea of this post is to create a Hadoop 2.7.1 Single Node Docker image from scratch.

In projects related to Hadoop, developers spend a lot of time installing and maintaining Hadoop on their local machines. With Docker, you can have your development environment ready in seconds.

This guide will not only give you a Hadoop image ready to use for development purposes, but will also show you how to create a Docker image from scratch.

This is the first post of a series of two.

  1. In this first one, I am going to explain how to create a Hadoop Single Node image, because it is easier to install than a Hadoop Cluster Node image.
  2. In the second one, I will explain how to create a Hadoop Cluster Node image.

Of course, all the generated images are published in my public Docker Hub and GitHub accounts, so you are free to use them without creating new ones.

Target.

The target is to be able to start up a Hadoop container ready to use in seconds, just by executing a command similar to this:

$ docker run --name hadoop-2.7.1 -it -P angelcervera/docker-hadoop:2.7.1-single

Or something more sophisticated, similar to this:

docker run --name my-new-hadoop-2.7.1 \
  -v /home/username/hadoop/logs:/opt/hadoop/logs \
  -v /home/username/hadoop/shared:/root/shared \
  -p 50070:50070 \
  -p 50075:50075 \
  -p 50060:50060 \
  -p 50030:50030 \
  -p 19888:19888 \
  -p 10033:10033 \
  -p 8032:8032 \
  -p 8030:8030 \
  -p 8088:8088 \
  -p 8033:8033 \
  -p 8042:8042 \
  -p 8188:8188 \
  -p 8047:8047 \
  -p 8788:8788 \
  -it angelcervera/docker-hadoop:2.7.1-single

Prerequisites.

I am an Ubuntu 14.04.3 LTS user, so this tutorial is oriented to that OS. Anyway, you can follow it using any OS compatible with Docker.

List of things:

  • Internet Connection: We need to download a lot of stuff, so it is better if you have a good internet connection.
  • Docker: I am going to use version 1.8.
    If you don't have Docker installed, you can do it following the instructions in the Docker site: https://docs.docker.com/installation
  • DockerHub account. To create a new one: https://hub.docker.com/

Resources.

You can fork the whole project from GitHub: https://github.com/angelcervera/docker-hadoop

Also, the image is published in DockerHub: https://hub.docker.com/r/angelcervera/docker-hadoop/

Please leave comments and suggestions here: https://github.com/angelcervera/docker-hadoop/issues

Of course, if you are happy with it, don't hesitate to mark it as a favorite and share it.

Let's do it!

Checking requirements.

The first step is to make sure that we have the latest version of Docker installed. The output should be similar to this:

$ docker version
Client:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:19:00 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:19:00 UTC 2015
 OS/Arch:      linux/amd64
  

Now that we are sure that we have docker installed, we can start working on the new image.

There are two ways to do it: updating a container by hand and committing it, or creating a Dockerfile. I am going to use the second one, so I can keep the file in my GitHub repository and in the Docker Hub autobuilder, and in the future it will be easy to modify the image without starting from zero.

Layout.

The layout for this project is really simple: the Dockerfile in the root, and a config folder that I created to store all the configuration files that I am going to copy into the container.

docker-hadoop/
├── config
│   ├── core-site.xml
│   ├── docker_entrypoint.sh
│   ├── hadoop-env.sh
│   ├── hdfs-site.xml
│   ├── mapred-site.xml
│   ├── ssh_config
│   └── yarn-site.xml
├── Dockerfile
└── README.md

Dockerfile.

And now is when the fun starts.

A Dockerfile describes the steps used to generate a Docker image, using a simple DSL. I am going to describe, line by line, what the content of the Dockerfile means.

Anyway, if you check the source of this file on GitHub, you can read a lot of useful comments.

The main idea is that all instructions in this file are executed in order when you build the image with the command:

$ docker build -t angelcervera/hadoop .

After executing the build command, you will have an image ready to use in your local environment.

$ docker images
REPOSITORY                   TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
angelcervera/docker-hadoop   2.7.1-single        1a7acd653fe5        15 hours ago        1.121 GB

Instructions.

FROM

The FROM instruction indicates which image your new image is based on.

FROM ubuntu:14.04

MAINTAINER

You want everybody to know that you are the author, right? This is your opportunity. ;)

MAINTAINER Angel Cervera Claudio <angelcervera@gmail.com>

USER

Every step in the build process, and the running container itself, will use the user indicated by this instruction.

USER root

WORKDIR

This will be the starting folder for all commands in the build process and when the container is started.

WORKDIR /root

ENV

With this instruction we can set environment variables that we can use inside the Dockerfile or when logged into the container. Be careful: these variables are not present if you start a shell by other means (for example, via ssh).

ENV HADOOP_VERSION 2.7.1
ENV HADOOP_PREFIX /opt/hadoop
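Docker expands these variables in later RUN and COPY instructions the same way a shell would. A quick local sketch of the expansion used further down (variable values taken from the ENV lines above; the paths are just illustrative):

```shell
# Simulate the variable expansion Docker performs in later RUN instructions.
HADOOP_VERSION=2.7.1
HADOOP_PREFIX=/opt/hadoop

# This is how ${HADOOP_VERSION} is expanded in the wget and tar lines below.
tarball="hadoop-${HADOOP_VERSION}.tar.gz"
echo "/tmp/${tarball}"              # → /tmp/hadoop-2.7.1.tar.gz
echo "${HADOOP_PREFIX}/etc/hadoop"  # → /opt/hadoop/etc/hadoop
```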

RUN

This instruction is used to execute commands in the container, and it is the way to modify the content of the original image. Every execution is persisted as a new layer.

Now, we are going to execute a series of RUN instructions to:

  1. Install dependencies with apt-get.
  2. Download the Hadoop distribution. Pay attention to the use of the ${HADOOP_VERSION} environment variable set before.
  3. Extract the Hadoop distribution file, create a symlink to avoid the use of version numbers, and create a folder to use as the data folder.
    I am going to install Hadoop under /opt because of http://unix.stackexchange.com/questions/11544/what-is-the-difference-between-opt-and-usr-local
  4. Create the SSH key, as the Hadoop installation instructions say.
# Install all dependencies
RUN apt-get update && apt-get install -y wget ssh rsync openjdk-7-jdk

# Download hadoop.
RUN wget -O /tmp/hadoop-${HADOOP_VERSION}.tar.gz http://mirrors.whoishostingthis.com/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz \
    && wget -O /tmp/hadoop-${HADOOP_VERSION}.tar.gz.mds  http://mirrors.whoishostingthis.com/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz.mds

# Install hadoop
RUN tar -C /opt -xf /tmp/hadoop-${HADOOP_VERSION}.tar.gz \
    && ln -s /opt/hadoop-${HADOOP_VERSION} ${HADOOP_PREFIX} \
    && mkdir /var/lib/hadoop

# Install ssh key
RUN ssh-keygen -q -t dsa -P '' -f /root/.ssh/id_dsa \
    && cat /root/.ssh/id_dsa.pub >> /root/.ssh/authorized_keys

COPY / ADD

This instruction is used to copy resources into the container. There is a more powerful instruction, ADD (it can decompress files, etc.), but the documentation recommends COPY when possible.

In the next step, I am going to copy all resources under the config folder into the container.

Every file must be in the right location:

  1. Copy the ssh configuration file. It avoids the prompt asking to add unknown hosts to the list of trusted hosts. More information: http://shallowsky.com/blog/linux/ssh-tips.html
  2. Copy the Hadoop configuration files to set up a Single Node instance.
  3. Format the namenode. Yes!!! We are executing Hadoop inside the container for the first time.
  4. Copy the script that we are going to use to initialize the container, and make it executable.
  5. Create a folder under /root that we can use to share files with the host. This is not related to Hadoop, but I am used to doing it. :)
  6. Clean caches, unused files, etc. to keep the image as small as possible.
# Config ssh to accept all connections from unknown hosts.
COPY config/ssh_config /root/.ssh/config

# Copy Hadoop config files
COPY config/hadoop-env.sh ${HADOOP_PREFIX}/etc/hadoop/
COPY config/core-site.xml ${HADOOP_PREFIX}/etc/hadoop/
COPY config/hdfs-site.xml ${HADOOP_PREFIX}/etc/hadoop/
COPY config/mapred-site.xml ${HADOOP_PREFIX}/etc/hadoop/
COPY config/yarn-site.xml ${HADOOP_PREFIX}/etc/hadoop/

# Format hdfs
RUN ${HADOOP_PREFIX}/bin/hdfs namenode -format

# Copy the entry point shell
COPY config/docker_entrypoint.sh /root/
RUN chmod a+x /root/docker_entrypoint.sh

# Folder to share files
RUN mkdir /root/shared && \
    chmod a+rwX /root/shared

# Clean
RUN rm -r /var/cache/apt /var/lib/apt/lists /tmp/hadoop-${HADOOP_VERSION}.tar*
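The config/ssh_config file copied above is not reproduced in the post. A minimal version that suppresses the "add to known hosts" prompt (a sketch based on the linked ssh tips article, not necessarily the exact file in the repository) could look like this:

```
Host *
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
```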

EXPOSE

This instruction is used to expose ports outside of the container. By default, you can use any port inside the container without the risk of conflicts with other containers or the host.

But if you want to access a port from outside the container, you need to expose it.

Below is a fragment of the exposed ports. To get the full list (there are a lot of them), please check the Dockerfile source.

################### Expose ports

### Core

# Zookeeper
EXPOSE 2181

# NameNode metadata service ( fs.defaultFS )
EXPOSE 9000

# FTP Filesystem impl. (fs.ftp.host.port)
EXPOSE 21

### Hdfs ports (Reference: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml)

# NameNode Web UI: Web UI to look at current status of HDFS, explore file system .......
EXPOSE 50070 50470

# DataNode : DataNode WebUI to access the status, logs etc. (dfs.datanode.http.address / dfs.datanode.https.address)
EXPOSE 50075 50475

# DataNode  (dfs.datanode.address / dfs.datanode.ipc.address)
EXPOSE 50010 50020

# Secondary NameNode (dfs.namenode.secondary.http-address / dfs.namenode.secondary.https-address)
EXPOSE 50090 50091

# Backup node (dfs.namenode.backup.address / dfs.namenode.backup.http-address)
EXPOSE 50100 50105

# Journal node (dfs.journalnode.rpc-address / dfs.journalnode.http-address / dfs.journalnode.https-address )
EXPOSE 8485 8480 8481

To map any of these ports to a local port, use the -p parameter when you start the container. The format is: -p host_port:exposed_port

With the -P parameter, Docker selects a random available port for you. In the case of this container it is not a good idea, because we are going to expose 37 ports, so the result would be a mess.

The easy way to get the current list of mapped ports is with the command:

$ docker ps

VOLUME

By default, all folders inside the container are inaccessible from the host and from other containers. This isolation is really good, but sometimes we want to use a folder from the host or from another data volume container.

To do this, you can use the VOLUME instruction. All folders listed in a VOLUME instruction can be mounted outside the container with the -v parameter. This parameter has the format -v host_folder:container_folder.

There are other ways to mount these folders, sharing folders between containers for example, but we are going to talk only about the first one.

The easy way to get the current list of shared folders is with the command:

$ docker inspect CONTAINER_ID

In our Dockerfile:

VOLUME ["/opt/hadoop", "/root/shared"]

ENTRYPOINT

With this instruction it is possible to set the command that is executed when you run the container.

The Docker philosophy is one service per container. This case is special because Hadoop uses five services, so I created a script that starts all of them. I copied this script under /root.

ENTRYPOINT [ "/root/docker_entrypoint.sh" ]
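The docker_entrypoint.sh script itself lives in the config folder of the repository. A minimal sketch of what such a script needs to do (start sshd, bring up HDFS and YARN, then drop into a shell; the exact file in the repo may differ) could be:

```shell
#!/bin/bash
# Sketch of config/docker_entrypoint.sh; the real script is in the GitHub repo.

# Hadoop's start scripts use ssh even on a single node, so sshd must be up first.
service ssh start

# Start HDFS (namenode, datanode, secondary namenode) and the YARN daemons.
/opt/hadoop/sbin/start-dfs.sh
/opt/hadoop/sbin/start-yarn.sh

# Keep the container alive with an interactive shell (docker run -it).
exec bash
```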

Building and executing.

At this point, I have the Dockerfile ready to build the image:

$ docker build -t yourusername/hadoop .
Sending build context to Docker daemon 151.6 kB
Step 0 : FROM ubuntu:14.04
 ---> 91e54dfb1179
Step 1 : MAINTAINER Angel Cervera Claudio ..........
..........
..........
..........
A lot of Steps
..........
..........
..........
Removing intermediate container e72251cbec5c
Successfully built f6262ea86133
$

Listing the images:

$ docker images
REPOSITORY                   TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
yourusername/hadoop          latest              f6262ea86133        3 minutes ago       1.121 GB
postgres                     latest              506c40f60539        3 weeks ago         265.3 MB
ubuntu                       14.04               91e54dfb1179        6 weeks ago         188.4 MB
mongo                        latest              c6f67f622b2a        9 weeks ago         261 MB
$

And after the build, execute:

$ docker run --name my-new-hadoop-2.7.1 \
  -v /tmp/hadoop_image/logs:/opt/hadoop/logs \
  -v /tmp/hadoop_image/shared:/root/shared \
  -p 50070:50070 \
  -p 50075:50075 \
  -p 50060:50060 \
  -p 50030:50030 \
  -p 19888:19888 \
  -p 10033:10033 \
  -p 8032:8032 \
  -p 8030:8030 \
  -p 8088:8088 \
  -p 8033:8033 \
  -p 8042:8042 \
  -p 8188:8188 \
  -p 8047:8047 \
  -p 8788:8788 \
  -it yourusername/hadoop

 * Starting OpenBSD Secure Shell server sshd                                                                                                                 [ OK ] 
Starting namenodes on [localhost]
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting namenode, logging to /opt/hadoop/logs/hadoop-root-namenode-80009ec37c2b.out
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-80009ec37c2b.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-root-secondarynamenode-80009ec37c2b.out
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/logs/yarn--resourcemanager-80009ec37c2b.out
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting nodemanager, logging to /opt/hadoop/logs/yarn-root-nodemanager-80009ec37c2b.out
root@80009ec37c2b:~# 

Now, it is possible to execute a Hadoop example:

root@80009ec37c2b:~# /opt/hadoop/bin/hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 16 10000                                 
Number of Maps  = 16
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
15/10/06 14:13:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/10/06 14:13:00 INFO input.FileInputFormat: Total input paths to process : 16
15/10/06 14:13:01 INFO mapreduce.JobSubmitter: number of splits:16
15/10/06 14:13:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1444140733870_0001
15/10/06 14:13:01 INFO impl.YarnClientImpl: Submitted application application_1444140733870_0001
15/10/06 14:13:01 INFO mapreduce.Job: The url to track the job: http://efeb26eaf4ce:8088/proxy/application_1444140733870_0001/
15/10/06 14:13:01 INFO mapreduce.Job: Running job: job_1444140733870_0001
15/10/06 14:13:07 INFO mapreduce.Job: Job job_1444140733870_0001 running in uber mode : false
15/10/06 14:13:07 INFO mapreduce.Job:  map 0% reduce 0%
15/10/06 14:13:14 INFO mapreduce.Job:  map 13% reduce 0%
15/10/06 14:13:15 INFO mapreduce.Job:  map 38% reduce 0%
15/10/06 14:13:18 INFO mapreduce.Job:  map 44% reduce 0%
15/10/06 14:13:19 INFO mapreduce.Job:  map 56% reduce 0%
15/10/06 14:13:20 INFO mapreduce.Job:  map 63% reduce 0%
15/10/06 14:13:21 INFO mapreduce.Job:  map 69% reduce 0%
15/10/06 14:13:22 INFO mapreduce.Job:  map 75% reduce 0%
15/10/06 14:13:24 INFO mapreduce.Job:  map 88% reduce 0%
15/10/06 14:13:25 INFO mapreduce.Job:  map 94% reduce 0%
15/10/06 14:13:26 INFO mapreduce.Job:  map 100% reduce 100%
15/10/06 14:13:26 INFO mapreduce.Job: Job job_1444140733870_0001 completed successfully
15/10/06 14:13:26 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=358
		FILE: Number of bytes written=1967912
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=4214
		HDFS: Number of bytes written=215
		HDFS: Number of read operations=67
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters 
		Launched map tasks=16
		Launched reduce tasks=1
		Data-local map tasks=16
		Total time spent by all maps in occupied slots (ms)=70731
		Total time spent by all reduces in occupied slots (ms)=9374
		Total time spent by all map tasks (ms)=70731
		Total time spent by all reduce tasks (ms)=9374
		Total vcore-seconds taken by all map tasks=70731
		Total vcore-seconds taken by all reduce tasks=9374
		Total megabyte-seconds taken by all map tasks=72428544
		Total megabyte-seconds taken by all reduce tasks=9598976
	Map-Reduce Framework
		Map input records=16
		Map output records=32
		Map output bytes=288
		Map output materialized bytes=448
		Input split bytes=2326
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=448
		Reduce input records=32
		Reduce output records=0
		Spilled Records=64
		Shuffled Maps =16
		Failed Shuffles=0
		Merged Map outputs=16
		GC time elapsed (ms)=981
		CPU time spent (ms)=7920
		Physical memory (bytes) snapshot=5351145472
		Virtual memory (bytes) snapshot=15219265536
		Total committed heap usage (bytes)=3422552064
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=1888
	File Output Format Counters 
		Bytes Written=97
Job Finished in 26.148 seconds
Estimated value of Pi is 3.14127500000000000000
root@80009ec37c2b:~#

And from another terminal, you can check the state of the container:

$ docker ps
CONTAINER ID        IMAGE                 COMMAND                  CREATED             STATUS              PORTS                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NAMES
80009ec37c2b        yourusername/hadoop   "/root/docker_entrypo"   6 minutes ago       Up 6 minutes        21-22/tcp, 2181/tcp, 0.0.0.0:8030->8030/tcp, 8031/tcp, 0.0.0.0:8032-8033->8032-8033/tcp, 8040/tcp, 0.0.0.0:8042->8042/tcp, 0.0.0.0:8047->8047/tcp, 8045-8046/tcp, 0.0.0.0:8088->8088/tcp, 8090/tcp, 8190/tcp, 8480-8481/tcp, 0.0.0.0:8188->8188/tcp, 8485/tcp, 9000/tcp, 0.0.0.0:8788->8788/tcp, 10020/tcp, 0.0.0.0:10033->10033/tcp, 10200/tcp, 50010/tcp, 0.0.0.0:19888->19888/tcp, 0.0.0.0:50030->50030/tcp, 0.0.0.0:50060->50060/tcp, 0.0.0.0:50070->50070/tcp, 50020/tcp, 50090/tcp, 50100/tcp, 50105/tcp, 50470/tcp, 0.0.0.0:50075->50075/tcp, 50475/tcp   my-new-hadoop-2.7.1
$

Notes.

RUN

Every RUN creates a new container based on the previous one, executes the command in a new shell and commits the changes (like docker commit). So, the next example generates two layers and two files: one under /root and the other under /.

FROM ubuntu:14.04
RUN cd root && touch inRoot.txt
RUN touch outRoot.txt
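The layer behaviour can be simulated locally: each RUN starts a fresh shell at WORKDIR, so a cd in one RUN does not affect the next. A rough shell simulation (using subshells in a temporary directory instead of real layers):

```shell
# Simulate two RUN instructions: each one is a fresh shell starting at WORKDIR.
workdir=$(mktemp -d)     # stands in for the image's WORKDIR (/)
mkdir "$workdir/root"    # the base image already has /root

( cd "$workdir" && cd root && touch inRoot.txt )   # RUN cd root && touch inRoot.txt
( cd "$workdir" && touch outRoot.txt )             # RUN touch outRoot.txt

# inRoot.txt ends up under root/, outRoot.txt under the work dir itself.
ls "$workdir/root/inRoot.txt" "$workdir/outRoot.txt"
```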

List of other useful commands.

$ docker ps
$ docker images
$ docker port CONTAINER_ID
$ docker logs CONTAINER_ID
$ docker top CONTAINER_ID
$ docker rm CONTAINER_ID
$ docker rmi IMAGE_ID
$ docker attach CONTAINER_ID
$ docker exec -it CONTAINER_ID bash

Further reading.