Hadoop (Single Node) 2.7.1 with Docker from scratch
Introduction.
The idea of this post is to create a Hadoop 2.7.1 Single Node Docker image from scratch.
In all projects related to Hadoop, developers spend a lot of time on the installation and maintenance of Hadoop on their local machines. With Docker, you can have your development environment ready in seconds.
This will be a guide, not only to get a Hadoop image ready to use for development purposes, but also to understand how to create a Docker image from scratch.
This is the first post of a series of two.
In this first one, I am going to explain how to create a Hadoop Single Node image, because it is easier to install than a Hadoop Cluster Node image.
In the second one, I will explain how to create a Hadoop Cluster Node image.
Of course, all generated images are published in my public Docker Hub and GitHub accounts, so you are free to use them without creating a new one.
Target.
The target is to be able to start up a Hadoop container ready to use in seconds, just by executing a command similar to this:
$ docker run --name hadoop-2.7.1 -it -P angelcervera/docker-hadoop:2.7.1-single
Of course, if you are happy with it, don't hesitate to mark it as a favorite and share it.
Let's do it!
Checking requirements.
The first step is to make sure that we have the latest version of Docker installed. The result should look like this:
$ docker version
Client:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:19:00 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:19:00 UTC 2015
 OS/Arch:      linux/amd64
Now that we are sure that we have Docker installed, we can start working on the new image.
There are two ways to do it: updating a container and committing it, or creating a Dockerfile. I am going to use the second one, so I will have this file in my GitHub repository and in the Docker Hub Autobuilder, and in the future it will be easy to modify the image without starting from zero.
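Just for reference, the first approach would mean starting a plain container, installing everything by hand inside it, and then persisting the result with something like this (container_id and the repository name are placeholders):

$ docker commit container_id yourusername/hadoop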
Layout.
The layout for this project is really simple: the Dockerfile in the root and a config folder that stores all the configuration files that I am going to copy into the container.
The Dockerfile is a file that describes the steps to generate a Docker image, using a simple DSL. I am going to describe, line by line, what the content of the Dockerfile means.
Anyway, if you check the source of this file on GitHub, you can read a lot of useful comments.
The main idea is that all instructions in this file are going to be executed in order when you build the image with the command:
$ docker build -t angelcervera/hadoop .
After executing the build command, you will have an image ready to use in your local environment.
$ docker images
REPOSITORY                   TAG            IMAGE ID       CREATED        VIRTUAL SIZE
angelcervera/docker-hadoop   2.7.1-single   1a7acd653fe5   15 hours ago   1.121 GB
Instructions.
FROM
The FROM instruction is used to indicate which image your new image is based on.
FROM ubuntu:14.04
MAINTAINER
You want everybody to know that you are the author, right? This is your opportunity ;)
MAINTAINER Angel Cervera Claudio
USER
Every step in the build process, and the running container itself, is going to use the user set with this instruction.
USER root
WORKDIR
This will be the starting folder for all commands in the build process and when the container is started.
WORKDIR /root
ENV
With this instruction we can set environment variables that we can use inside the Dockerfile or when logged into the container. Be careful, because those variables will not be present if you start a new shell from another shell.
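The ENV lines themselves are not shown in this fragment; as a sketch, the two variables used later in this post could be declared like this (check the Dockerfile source for the real declarations):

# Assumed declarations; the real ones are in the Dockerfile source
ENV HADOOP_VERSION 2.7.1
ENV HADOOP_PREFIX /opt/hadoop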
RUN
This instruction is used to execute commands in the container. This is the way to modify the content of the original image. Every execution is going to be persisted.
Now, we are going to execute a series of RUN instructions to (a rough sketch follows this list):
Install dependencies with apt-get.
Download the Hadoop distribution. Pay attention to the use of the ${HADOOP_VERSION} environment variable set before.
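As a rough sketch of those two steps, assuming OpenJDK 7 and OpenSSH as the dependencies and the Apache archive as the download source (the real commands are in the Dockerfile source):

# Sketch only: packages, mirror and install path are assumptions; see the Dockerfile source
RUN apt-get update && \
    apt-get install -y openjdk-7-jdk openssh-server curl

RUN curl -sL https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz \
      -o /tmp/hadoop-${HADOOP_VERSION}.tar.gz && \
    tar -xzf /tmp/hadoop-${HADOOP_VERSION}.tar.gz -C /opt && \
    ln -s /opt/hadoop-${HADOOP_VERSION} ${HADOOP_PREFIX}

Note that the tarball is downloaded under /tmp, which is consistent with the clean-up step shown further below.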
COPY
This instruction is used to copy resources into the container. There is another, more powerful, instruction, ADD (it can decompress files, etc.), but the documentation recommends COPY whenever possible.
In the next step, I am going to copy all resources under the config folder into the container.
Every file must be in the right location:
Copy the ssh configuration file. This file contains the configuration that avoids being asked to add unknown hosts to the list of trusted hosts. More information: http://shallowsky.com/blog/linux/ssh-tips.html
Copy the Hadoop configuration files to set up a Single Node instance (a sketch of these two copy steps follows the list).
Format the namenode. Yes!!! We are executing Hadoop inside the container for the first time.
Copy the script that we are going to use to initialize the container, and change the file attributes.
Create a folder under /root that we can use to share files with the host. This is not related to Hadoop, but I am used to doing it. :)
Clean caches, unused files, etc. to keep the image as small as possible.
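For the two copy steps, a minimal sketch could look like this (the exact file names under the config folder are assumptions; check the repository):

# Sketch only: the real file names live under the config folder in the repository
COPY config/ssh_config /root/.ssh/config
COPY config/core-site.xml config/hdfs-site.xml config/yarn-site.xml config/mapred-site.xml ${HADOOP_PREFIX}/etc/hadoop/

The remaining steps are the following fragments of the Dockerfile: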
# Format hdfs
RUN ${HADOOP_PREFIX}/bin/hdfs namenode -format

# Copy the entry point shell
COPY config/docker_entrypoint.sh /root/
RUN chmod a+x /root/docker_entrypoint.sh

# Folder to share files
RUN mkdir /root/shared && \
    chmod a+rwX /root/shared

# Clean
RUN rm -r /var/cache/apt /var/lib/apt/lists /tmp/hadoop-${HADOOP_VERSION}.tar*
EXPOSE
This instruction is used to expose ports outside of the container. By default, you can use any port inside the container without the risk of conflicts with other containers or the host.
But if you want to access a port from outside the container, you need to expose it.
Below is a fragment of the exposed ports. To get the full list (there are a lot of them), please check the Dockerfile source.
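For illustration, a few of the standard Hadoop 2.7 ports (treat this as a sketch; the real, much longer list is in the Dockerfile):

# Fragment only: standard Hadoop 2.7 ports; the Dockerfile exposes many more
EXPOSE 9000 50070 50075 50090 8088 8042 19888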
To map any of those ports to a local port, it is necessary to use the -p parameter when you start the container. The format is "-p host_port:exposed_port".
With the parameter -P, Docker is going to select a random available port for you. In the case of this container, that is not a good idea because we are going to expose 37 ports!!! So the result will be a mess.
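So, for example, a reasonable way to start the container is to map explicitly only the ports you need, such as the NameNode and ResourceManager web UIs (assuming 50070 and 8088 are among the exposed ports):

$ docker run --name hadoop-2.7.1 -it -p 50070:50070 -p 8088:8088 angelcervera/docker-hadoop:2.7.1-single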
The easiest way to get the current list of mapped ports is with this command:
$ docker ps
VOLUME
By default, all folders inside the container are inaccessible from the host or from other containers. This isolation is really good, but sometimes we want to use a folder from the host or from another data volume container.
To do this, you can use the VOLUME instruction. All folders listed in a VOLUME instruction can be mounted from outside the container with the parameter "-v". This parameter has the format "-v host_folder:container_folder".
There are other ways to mount these folders, sharing volumes between containers, for example, but we are going to talk only about the first one.
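For example, to mount the shared folder from the host (the host path used here is just an illustration):

$ docker run --name hadoop-2.7.1 -it -P -v /home/youruser/shared:/root/shared angelcervera/docker-hadoop:2.7.1-single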
The easiest way to get the current list of shared folders is with this command:
$ docker inspect container_id
VOLUME ["/opt/hadoop", "/root/shared"]
ENTRYPOINT
With this instruction it is possible to set the command that is going to be executed when you run the container.
The idea of Docker is one service per container. This case is special because Hadoop uses five services, so I created a script to start all of them. I copied this script under /root earlier.
ENTRYPOINT [ "/root/docker_entrypoint.sh" ]
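The real script is config/docker_entrypoint.sh in the repository; judging by the startup output shown later (sshd, the HDFS daemons, the YARN daemons and finally a prompt), a sketch of it could look roughly like this:

#!/bin/bash
# Sketch only: the real script is config/docker_entrypoint.sh in the repository

# Start the ssh daemon, used by the Hadoop start scripts to reach localhost
service ssh start

# Start HDFS (namenode, datanode, secondary namenode) and YARN (resourcemanager, nodemanager)
$HADOOP_PREFIX/sbin/start-dfs.sh
$HADOOP_PREFIX/sbin/start-yarn.sh

# Keep the container alive with an interactive shell
bash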
Building and executing.
At this point, I have the Dockerfile ready to build the image:
$ docker build -t yourusername/hadoop .
Sending build context to Docker daemon 151.6 kB
Step 0 : FROM ubuntu:14.04
 ---> 91e54dfb1179
Step 1 : MAINTAINER Angel Cervera Claudio
..........
..........
..........
..........
A lot of Steps
..........
..........
..........
Removing intermediate container e72251cbec5c
Successfully built f6262ea86133
$
Listing the images:
$ docker images
REPOSITORY            TAG      IMAGE ID       CREATED         VIRTUAL SIZE
yourusername/hadoop   latest   f6262ea86133   3 minutes ago   1.121 GB
postgres              latest   506c40f60539   3 weeks ago     265.3 MB
ubuntu                14.04    91e54dfb1179   6 weeks ago     188.4 MB
mongo                 latest   c6f67f622b2a   9 weeks ago     261 MB
$
Running the container (with the command shown in the Target section), all the Hadoop services start and we get a shell inside the container:
* Starting OpenBSD Secure Shell server sshd   [ OK ]
Starting namenodes on [localhost]
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting namenode, logging to /opt/hadoop/logs/hadoop-root-namenode-80009ec37c2b.out
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-80009ec37c2b.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-root-secondarynamenode-80009ec37c2b.out
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/logs/yarn--resourcemanager-80009ec37c2b.out
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting nodemanager, logging to /opt/hadoop/logs/yarn-root-nodemanager-80009ec37c2b.out
root@80009ec37c2b:~#
Now, inside the container, let's run one of the MapReduce examples bundled with Hadoop to check that everything works:
root@80009ec37c2b:~# /opt/hadoop/bin/hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 16 10000
Number of Maps = 16
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
15/10/06 14:13:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/10/06 14:13:00 INFO input.FileInputFormat: Total input paths to process : 16
15/10/06 14:13:01 INFO mapreduce.JobSubmitter: number of splits:16
15/10/06 14:13:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1444140733870_0001
15/10/06 14:13:01 INFO impl.YarnClientImpl: Submitted application application_1444140733870_0001
15/10/06 14:13:01 INFO mapreduce.Job: The url to track the job: http://efeb26eaf4ce:8088/proxy/application_1444140733870_0001/
15/10/06 14:13:01 INFO mapreduce.Job: Running job: job_1444140733870_0001
15/10/06 14:13:07 INFO mapreduce.Job: Job job_1444140733870_0001 running in uber mode : false
15/10/06 14:13:07 INFO mapreduce.Job: map 0% reduce 0%
15/10/06 14:13:14 INFO mapreduce.Job: map 13% reduce 0%
15/10/06 14:13:15 INFO mapreduce.Job: map 38% reduce 0%
15/10/06 14:13:18 INFO mapreduce.Job: map 44% reduce 0%
15/10/06 14:13:19 INFO mapreduce.Job: map 56% reduce 0%
15/10/06 14:13:20 INFO mapreduce.Job: map 63% reduce 0%
15/10/06 14:13:21 INFO mapreduce.Job: map 69% reduce 0%
15/10/06 14:13:22 INFO mapreduce.Job: map 75% reduce 0%
15/10/06 14:13:24 INFO mapreduce.Job: map 88% reduce 0%
15/10/06 14:13:25 INFO mapreduce.Job: map 94% reduce 0%
15/10/06 14:13:26 INFO mapreduce.Job: map 100% reduce 100%
15/10/06 14:13:26 INFO mapreduce.Job: Job job_1444140733870_0001 completed successfully
15/10/06 14:13:26 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=358
                FILE: Number of bytes written=1967912
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=4214
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=67
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=16
                Launched reduce tasks=1
                Data-local map tasks=16
                Total time spent by all maps in occupied slots (ms)=70731
                Total time spent by all reduces in occupied slots (ms)=9374
                Total time spent by all map tasks (ms)=70731
                Total time spent by all reduce tasks (ms)=9374
                Total vcore-seconds taken by all map tasks=70731
                Total vcore-seconds taken by all reduce tasks=9374
                Total megabyte-seconds taken by all map tasks=72428544
                Total megabyte-seconds taken by all reduce tasks=9598976
        Map-Reduce Framework
                Map input records=16
                Map output records=32
                Map output bytes=288
                Map output materialized bytes=448
                Input split bytes=2326
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=448
                Reduce input records=32
                Reduce output records=0
                Spilled Records=64
                Shuffled Maps =16
                Failed Shuffles=0
                Merged Map outputs=16
                GC time elapsed (ms)=981
                CPU time spent (ms)=7920
                Physical memory (bytes) snapshot=5351145472
                Virtual memory (bytes) snapshot=15219265536
                Total committed heap usage (bytes)=3422552064
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1888
        File Output Format Counters
                Bytes Written=97
Job Finished in 26.148 seconds
Estimated value of Pi is 3.14127500000000000000
root@80009ec37c2b:~#
And from another terminal, you can check the state of the container with docker ps.
Every RUN creates a new container based on the previous one, executes the command in a new shell, and commits the changes (like docker commit). So the following example generates two containers and two files: inRoot.txt under "/root" and outRoot.txt under "/", because the cd does not persist between RUN instructions.
FROM ubuntu:14.04
RUN cd root && touch inRoot.txt
RUN touch outRoot.txt