Hadoop (Single Node) 2.7.1 with Docker from scratch
The idea of this post is to create a Hadoop 2.7.1 Single Node Docker image from scratch.
In every Hadoop-related project, developers spend a lot of time installing and maintaining Hadoop on their local machines. With Docker, you can have your development environment ready in seconds.
This is a guide not only to having a Hadoop image ready to use for development purposes, but also to understanding how to create a Docker image from scratch.
This is the first post of a series of two.
In this first one, I am going to explain how to create a Hadoop Single Node image, because it is easier to install than a Hadoop Cluster Node image.
In the second one, I will explain how to create a Hadoop Cluster Node image.
Of course, all the generated images are published in my public Docker and GitHub accounts, so you are free to use them without creating a new one.
The goal is to be able to start a Hadoop container, ready to use in seconds, by executing a single command similar to this:
$ docker run --name hadoop-2.7.1 -it -P angelcervera/docker-hadoop:2.7.1-single
Of course, if you are happy with it, don't hesitate to mark it as a favorite and share it.
Let's do it!
The first step is to make sure that we have the latest version of Docker installed. The output should look similar to this:
$ docker version
Client:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:19:00 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:19:00 UTC 2015
 OS/Arch:      linux/amd64
Now that we are sure that we have Docker installed, we can start working on the new image.
There are two ways to do it: update a container or create a Dockerfile. I am going to use the second one, so I will have this file in my GitHub repository and in Docker Hub Autobuilder and, in the future, it will be easy to modify the image without starting from zero.
The layout for this project is really simple: the Dockerfile in the root, and a config folder that I created to store all the configuration files that I am going to copy to the container.
The Dockerfile is a file that describes the steps to generate a Docker image, using a simple DSL. I am going to describe, line by line, what the content of the Dockerfile means.
Anyway, if you check the source of this file on GitHub, you can read a lot of useful comments.
The main idea is that all instructions in this file are going to be executed in order when you build the image with the command:
$ docker build -t angelcervera/hadoop .
After executing the build command, you will have an image ready to use in your local environment.
$ docker images
REPOSITORY                   TAG            IMAGE ID       CREATED        VIRTUAL SIZE
angelcervera/docker-hadoop   2.7.1-single   1a7acd653fe5   15 hours ago   1.121 GB
The FROM instruction indicates which image your new image is based on.
You want everybody to know that you are the author, right? This is your opportunity ;)
MAINTAINER Angel Cervera Claudio
Every following step in the build process, and the running container, is going to use the user indicated with the USER instruction.
The WORKDIR instruction sets the starting folder for all commands in the build process and when the container is started.
With the ENV instruction we can set environment variables that we can use inside the Dockerfile or when logged in to the container. Be careful, because those variables are not necessarily set in a shell that is started by other means, such as an SSH login.
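The caveat about environment variables can be reproduced on any machine, without Docker. The following is only an analogy in plain shell: a child process that inherits the environment sees the variable, while a shell started with an empty environment does not. HADOOP_HOME is a variable name chosen for illustration; /opt/hadoop is the install path that appears in the logs of this post.

```shell
# HADOOP_HOME is an illustrative name, not necessarily the one set in the image.
export HADOOP_HOME=/opt/hadoop

# A child shell that inherits the environment sees the variable...
inherited=$(sh -c 'echo "${HADOOP_HOME:-unset}"')

# ...but a shell started with an empty environment (env -i) does not.
clean=$(env -i /bin/sh -c 'echo "${HADOOP_HOME:-unset}"')

echo "inherited: $inherited"
echo "clean:     $clean"
```

If a variable is needed in every kind of session, it probably has to be written to a profile file inside the image as well.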
The COPY instruction is used to copy resources into the container. There is another, more powerful instruction, ADD (it can decompress files, etc.), but the documentation recommends COPY whenever possible.
In the next step, I am going to copy all resources under the config folder into the container.
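A minimal sketch of that step, written as a shell script that prepares a scratch project and a two-line Dockerfile. The destination /root/config/ and the file name core-site.xml are assumptions made for the example; the real paths are in the Dockerfile of the repository.

```shell
# Scratch directory standing in for the project root (Dockerfile + config folder).
project=$(mktemp -d)
mkdir -p "$project/config"

# core-site.xml is used only as an illustrative configuration file name.
touch "$project/config/core-site.xml"

# Two-line Dockerfile fragment. /root/config/ is a hypothetical destination.
# COPY with a directory source copies the contents of that directory.
cat > "$project/Dockerfile" <<'EOF'
FROM ubuntu:14.04
COPY config/ /root/config/
EOF

# List the files that "docker build" would receive as build context.
find "$project" -type f | sort
```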
To map any of the exposed ports to a local port, it is necessary to use the -p parameter when you start the container. The format is: "-p host_port:exposed_port"
With the -P parameter, Docker selects a random available host port for each exposed port. For this container that is not a good idea, because we are going to expose 37 ports!!! So the result will be a mess.
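As a rough illustration of the "-p host_port:exposed_port" format, the sketch below builds one flag per port and prints the resulting command instead of running it. The two ports (50070 for the NameNode web UI and 8088 for the ResourceManager web UI, the Hadoop 2.x defaults) are only an assumed selection; check the Dockerfile to see which ports the image really exposes.

```shell
# Assumed selection of ports to publish; the image exposes many more.
PORTS="50070 8088"

# Build one "-p host_port:exposed_port" flag per port,
# using the same number on the host and in the container.
PUBLISH_FLAGS=""
for port in $PORTS; do
    PUBLISH_FLAGS="$PUBLISH_FLAGS -p $port:$port"
done

# Print the command so it can be reviewed before running it by hand.
echo "docker run --name hadoop-2.7.1 -it$PUBLISH_FLAGS angelcervera/docker-hadoop:2.7.1-single"
```

Mapping each port to the same number on the host keeps the Hadoop web interfaces at their usual addresses.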
The easiest way to get the current list of mapped ports is with the command:
$ docker ps
By default, all folders inside the container are inaccessible from the host and from other containers. This isolation is really good, but sometimes we want to use a folder from the host or from another data volume container.
To do this, you can use the VOLUME instruction. The folders listed in a VOLUME instruction are created as volumes, and you can mount a host folder on any of them with the "-v" parameter. This parameter has the format "-v host_folder:container_folder".
There are other ways to mount these folders (sharing folders between containers, for example), but we are going to talk only about the first one.
The easiest way to get the current list of shared folders is with the command:
$ docker inspect container_id
VOLUME ["/opt/hadoop", "/root/shared"]
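To make the "-v host_folder:container_folder" format concrete, here is a small sketch that composes the flag and prints the commands instead of executing them. The host folder /tmp/hadoop-shared is an assumption; /root/shared is one of the folders declared in the VOLUME instruction. The --format template relies on the .Mounts field, which as far as I know is available from Docker 1.8 onwards.

```shell
# Host folder to share; /tmp/hadoop-shared is only an example location.
HOST_SHARED="/tmp/hadoop-shared"
# One of the folders declared in the VOLUME instruction.
CONTAINER_SHARED="/root/shared"

VOLUME_FLAG="-v $HOST_SHARED:$CONTAINER_SHARED"

# Command that starts the container with the host folder mounted.
echo "docker run --name hadoop-2.7.1 -it $VOLUME_FLAG angelcervera/docker-hadoop:2.7.1-single"

# Command that prints only the mount information of that container.
echo "docker inspect --format '{{ json .Mounts }}' hadoop-2.7.1"
```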
With this instruction it is possible to set the command that is going to be executed when you run the container.
The idea of Docker is one service per container. This case is special because Hadoop uses 5 services, so I created a script to start all of them and copied it to /root.
ENTRYPOINT [ "/root/docker_entrypoint.sh" ]
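For readers who want an idea of what such a script might contain, here is a hypothetical sketch. It is not the script shipped in the image. It is inferred from the startup log of the container, and it assumes that Hadoop is installed in /opt/hadoop with its standard start-dfs.sh and start-yarn.sh scripts. The snippet writes the sketch to a temporary file and only checks its syntax, so it can be tried without Hadoop.

```shell
# Write a hypothetical entrypoint to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/sh
# Hypothetical sketch, not the script from the repository.
service ssh start                 # the Hadoop start scripts reach the daemons over SSH
/opt/hadoop/sbin/start-dfs.sh     # NameNode, DataNode and SecondaryNameNode
/opt/hadoop/sbin/start-yarn.sh    # ResourceManager and NodeManager
exec /bin/bash                    # keep the container alive with an interactive shell
EOF
chmod +x "$script"

# Parse the script without executing it.
sh -n "$script" && echo "syntax ok: $script"
```

The final exec replaces the script with an interactive shell, which would explain the root prompt that appears after the services start.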
Building and executing.
At this point, I have the Dockerfile ready to build the image:
$ docker build -t yourusername/hadoop .
Sending build context to Docker daemon 151.6 kB
Step 0 : FROM ubuntu:14.04
 ---> 91e54dfb1179
Step 1 : MAINTAINER Angel Cervera Claudio
..........
..........
..........
.......... A lot of Steps ..........
..........
..........
Removing intermediate container e72251cbec5c
Successfully built f6262ea86133
$
Listing the images:
$ docker images
REPOSITORY            TAG      IMAGE ID       CREATED         VIRTUAL SIZE
yourusername/hadoop   latest   f6262ea86133   3 minutes ago   1.121 GB
postgres              latest   506c40f60539   3 weeks ago     265.3 MB
ubuntu                14.04    91e54dfb1179   6 weeks ago     188.4 MB
mongo                 latest   c6f67f622b2a   9 weeks ago     261 MB
$
 * Starting OpenBSD Secure Shell server sshd [ OK ]
Starting namenodes on [localhost]
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting namenode, logging to /opt/hadoop/logs/hadoop-root-namenode-80009ec37c2b.out
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-80009ec37c2b.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-root-secondarynamenode-80009ec37c2b.out
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/logs/yarn--resourcemanager-80009ec37c2b.out
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting nodemanager, logging to /opt/hadoop/logs/yarn-root-nodemanager-80009ec37c2b.out
root@80009ec37c2b:~#
root@80009ec37c2b:~# /opt/hadoop/bin/hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 16 10000
Number of Maps = 16
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
15/10/06 14:13:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/10/06 14:13:00 INFO input.FileInputFormat: Total input paths to process : 16
15/10/06 14:13:01 INFO mapreduce.JobSubmitter: number of splits:16
15/10/06 14:13:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1444140733870_0001
15/10/06 14:13:01 INFO impl.YarnClientImpl: Submitted application application_1444140733870_0001
15/10/06 14:13:01 INFO mapreduce.Job: The url to track the job: http://efeb26eaf4ce:8088/proxy/application_1444140733870_0001/
15/10/06 14:13:01 INFO mapreduce.Job: Running job: job_1444140733870_0001
15/10/06 14:13:07 INFO mapreduce.Job: Job job_1444140733870_0001 running in uber mode : false
15/10/06 14:13:07 INFO mapreduce.Job: map 0% reduce 0%
15/10/06 14:13:14 INFO mapreduce.Job: map 13% reduce 0%
15/10/06 14:13:15 INFO mapreduce.Job: map 38% reduce 0%
15/10/06 14:13:18 INFO mapreduce.Job: map 44% reduce 0%
15/10/06 14:13:19 INFO mapreduce.Job: map 56% reduce 0%
15/10/06 14:13:20 INFO mapreduce.Job: map 63% reduce 0%
15/10/06 14:13:21 INFO mapreduce.Job: map 69% reduce 0%
15/10/06 14:13:22 INFO mapreduce.Job: map 75% reduce 0%
15/10/06 14:13:24 INFO mapreduce.Job: map 88% reduce 0%
15/10/06 14:13:25 INFO mapreduce.Job: map 94% reduce 0%
15/10/06 14:13:26 INFO mapreduce.Job: map 100% reduce 100%
15/10/06 14:13:26 INFO mapreduce.Job: Job job_1444140733870_0001 completed successfully
15/10/06 14:13:26 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=358
        FILE: Number of bytes written=1967912
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=4214
        HDFS: Number of bytes written=215
        HDFS: Number of read operations=67
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
    Job Counters
        Launched map tasks=16
        Launched reduce tasks=1
        Data-local map tasks=16
        Total time spent by all maps in occupied slots (ms)=70731
        Total time spent by all reduces in occupied slots (ms)=9374
        Total time spent by all map tasks (ms)=70731
        Total time spent by all reduce tasks (ms)=9374
        Total vcore-seconds taken by all map tasks=70731
        Total vcore-seconds taken by all reduce tasks=9374
        Total megabyte-seconds taken by all map tasks=72428544
        Total megabyte-seconds taken by all reduce tasks=9598976
    Map-Reduce Framework
        Map input records=16
        Map output records=32
        Map output bytes=288
        Map output materialized bytes=448
        Input split bytes=2326
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=448
        Reduce input records=32
        Reduce output records=0
        Spilled Records=64
        Shuffled Maps =16
        Failed Shuffles=0
        Merged Map outputs=16
        GC time elapsed (ms)=981
        CPU time spent (ms)=7920
        Physical memory (bytes) snapshot=5351145472
        Virtual memory (bytes) snapshot=15219265536
        Total committed heap usage (bytes)=3422552064
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1888
    File Output Format Counters
        Bytes Written=97
Job Finished in 26.148 seconds
Estimated value of Pi is 3.14127500000000000000
root@80009ec37c2b:~#
And from another terminal, you can check the state of the container.
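One possible way to do that check is sketched below. The container name hadoop-2.7.1 is taken from the run command at the beginning of the post, and the script assumes that the name filter of docker ps is available in your Docker version. The guard around the call is only there so that the snippet does not fail on a machine without Docker.

```shell
# Name given to the container with --name in the run command.
CONTAINER="hadoop-2.7.1"
STATUS_CMD="docker ps --filter name=$CONTAINER"

if command -v docker >/dev/null 2>&1; then
    # Show only the row of our container, if it is running.
    $STATUS_CMD || echo "could not reach the Docker daemon"
else
    echo "docker is not installed; the command would be: $STATUS_CMD"
fi
```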
Every RUN creates a new container based on the result of the previous step, executes the command in a new shell and commits the changes (like docker commit). So the next example generates two containers and two files, one under "/" and the other under "/root":
FROM ubuntu:14.04
RUN cd root && touch inRoot.txt
RUN touch outRoot.txt
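The same effect can be imitated in plain shell, without Docker. In this sketch each RUN line is replaced by a separate "sh -c" call, so the "cd" of the first command is forgotten by the second one. It only models the "new shell per RUN" part, not the commit of a new layer, and the scratch directory stands in for the "/" folder of the container.

```shell
# Scratch directory standing in for the "/" folder of the container.
fakeroot=$(mktemp -d)
mkdir "$fakeroot/root"
cd "$fakeroot"

# Stand-in for: RUN cd root && touch inRoot.txt
sh -c 'cd root && touch inRoot.txt'

# Stand-in for: RUN touch outRoot.txt
# This is a fresh shell, so it starts again in the scratch directory.
sh -c 'touch outRoot.txt'

# List the two files and where they ended up.
find . -type f | sort
```

To make both files land in the same folder in a real Dockerfile, the usual options are a WORKDIR instruction or chaining both commands in a single RUN.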