Configuring Hadoop with Ansible: How to write a play-book

Prithviraj Singh
Jan 21, 2021

Brand Logos for Ansible and Hadoop

Ansible :- a revolutionary yet simple IT automation engine that automates cloud provisioning, configuration management, application deployment, etc. Designed for multi-tier deployments, Ansible models your IT infrastructure by describing how all of your systems inter-relate, rather than just managing one system at a time.

While…

Hadoop :- an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Today our task is to

🔰 Configure Hadoop and start cluster services using an Ansible Playbook

And while doing so let’s learn how to actually write an Ansible Playbook … So let’s get started. To learn more about Ansible and writing playbooks check out my last blog:- https://prithvirajsingh1604.medium.com/configuring-httpd-server-on-docker-ansible-playbook-548463a4b60c.

Before we write an actual Playbook we need to sort out some prerequisites. All of the following steps are done on the Ansible Control node, which here is a RHEL 8 system (any other Linux OS works as well) with yum configured (only required for software installation) and able to ping all the other systems we are going to use today…

1. Getting the software: you’ll need to download two packages for Hadoop :-

hadoop-1.2.1-1.x86_64.rpm

jdk-8u171-linux-x64.rpm

Once you have both packages, place them on your Ansible Control node. Next we need Ansible itself, which we can get from EPEL (use the command yum install ansible on a system configured with EPEL) or through Python (use the command pip3 install ansible on a system that has pip3).

2. Write an inventory file, which tells our Control node which IPs it is going to configure Hadoop on. I’ll be making a local inventory for this task. First create a workspace folder/directory; all our files from now onwards go inside it. Then I create a file named inventory (here the name of the file does not matter) and write the following:

Inventory Screen-Shot

Here I created three groups, [data], [client] & [name], which contain essential information: the host-name of each managed node (which you can define using /etc/hosts), the user-name, the password, and the connection protocol over which Ansible connects to the managed nodes. I also created a parent group [main] whose children are the [name] & [data] groups, so the main group covers two nodes, the namenode as well as the datanode. For more info on Ansible inventory, see the official Ansible documentation.
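As a rough sketch, the same layout can also be written in Ansible’s YAML inventory format. The host-names namenode, datanode and clientnode match the ones used later in this post; the user-name and password values are placeholders I made up, and a YAML inventory normally needs a .yml or .yaml extension (for example inventory.yml, a hypothetical name), so the inventory path in ansible.cfg would have to point at that file.

```yaml
# Sketch of the inventory in YAML form (assumed values are marked).
all:
  vars:
    ansible_user: root              # assumption: replace with your user-name
    ansible_password: "CHANGE_ME"   # assumption: placeholder password
    ansible_connection: ssh         # connection protocol
  children:
    name:
      hosts:
        namenode:
    data:
      hosts:
        datanode:
    client:
      hosts:
        clientnode:
    main:                           # parent group of name and data
      children:
        name:
        data:
```

Adding more DataNodes or Clients is then just a matter of listing more hosts under data or client.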

3. Now to be able to use this inventory we’ll have to write the path of inventory in an ansible.cfg file. For this again I’ll be creating a local configuration file named ansible.cfg (here the name of the file is of great importance) and write the content something like the following:

ansible.cfg screen-shot

Here [defaults], inventory & host_key_checking are Ansible keywords. [defaults] is the section that holds Ansible’s general settings, which apply unless they are overridden elsewhere. inventory takes the path of the inventory file, which in my case is just inventory (a relative path). host_key_checking takes a Boolean value; when it is set to false, Ansible skips the SSH host key check. To learn more about the Ansible configuration file, see the official Ansible documentation.

To use an inventory that refers to nodes by host-name instead of IP address, you’ll also need to configure the /etc/hosts file like the following (this step is not necessary if you write IP addresses in the inventory):

hosts file’s screen-shot

So basically here we associate a local DNS name and a host-name with the IPs in our network (to check the IP of each OS, run ip addr in its own terminal). In my case the local host is the datanode, but it can be a separate system as well.

4. To check that everything we’ve done so far (other than step 1) works, run the command ansible all -m ping from inside the workspace we created. We expect a reply that looks something like the following:

Here all three of my nodes replied with a "pong", which means we’re good to go. If you get an error from any of the nodes, go through the earlier steps again.

Now on to writing our own Playbook… First create a file with a .yml extension. In this file we’re going to write the plays and tasks that configure our Hadoop cluster with a NameNode, a DataNode, and a Client. We can also scale up the number of DataNodes and Clients by adding more IPs to our inventory under the suitable group name.

To write a Playbook we need a plan to work from, so that we are able to configure the cluster (or anything else) without missing any crucial step. This plan holds the general idea of what we are about to do, step by step. My plan for our task at hand looked something like…

Let’s write our first play. In our first play according to my plan we’re going to fetch the IP of my namenode and assign a variable to hold it.

Play 1 Screen-shot

We want to fetch the IP of the NameNode, which is part of the group name, so we use the group name as the host for this play; that is what the first line does. Next come the tasks that Ansible has to perform on that host, so we write the keyword tasks. Since a Playbook is a YAML file, we have to follow YAML syntax, which is built on key-value pairs. This play has a single task. Its label is set with the key name; Ansible itself does not need it, but it is certainly helpful for tracking down errors when the Playbook runs. In this task we use the set_fact module, which creates a variable that stays available to later plays in this Playbook for the host it was set on. The variable’s name is ours to choose; I chose nameip, and its value uses Ansible facts to fetch the IP of the system the task is running on. Writing cacheable: yes additionally stores the variable in Ansible’s fact cache, so it can also survive into later runs when fact caching is enabled. To find other modules that suit your needs, search the official Ansible documentation.
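A minimal sketch of what such a play can look like is given here. Which fact supplies the IP is my assumption: ansible_default_ipv4.address is the address of the default network interface, and a different interface may be the right one on your systems.

```yaml
# Play 1 (sketch): remember the NameNode's IP in a variable.
- hosts: name
  tasks:
    - name: Store the NameNode IP for later plays
      set_fact:
        # assumption: the default interface carries the cluster IP
        nameip: "{{ ansible_default_ipv4.address }}"
        cacheable: yes
```

Fact gathering is left at its default (enabled) in this play, because the value of nameip comes from the gathered facts.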

Step 1 Done✅

Next let’s move on to play2 where we are going to perform step2 that is where we are going to copy and install all the required software on the managed nodes.

Here our host is all, which means this play runs on every node. We also don’t want Ansible to collect facts from the managed nodes in this play, which makes it run faster. To copy our software I use the copy module, which needs a source src to copy the files from and a destination dest to copy them to. The next two tasks install the software. Since these packages can’t be installed through yum here, we don’t use a package module; instead we give Ansible raw commands through the command module. The command module does not take advantage of RAL, so it cannot check whether something has already been done, and we have to manage that ourselves. If we install a package with rpm that is already installed, rpm returns a non-zero value, which Ansible treats as an error, so this error must be handled. I handled it by first checking whether the software is already installed and installing it only if it is not; to implement this I used the block and rescue exception handlers.
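The play described above could be sketched roughly as follows. The destination directory /root, the check commands (java -version and hadoop version) and the --force flag are my assumptions for illustration, not necessarily what the original Playbook uses.

```yaml
# Play 2 (sketch): copy both packages and install them only if missing.
- hosts: all
  gather_facts: no
  tasks:
    - name: Copy the JDK package
      copy:
        src: jdk-8u171-linux-x64.rpm
        dest: /root/jdk-8u171-linux-x64.rpm       # assumption: target path
    - name: Copy the Hadoop package
      copy:
        src: hadoop-1.2.1-1.x86_64.rpm
        dest: /root/hadoop-1.2.1-1.x86_64.rpm     # assumption: target path
    - name: Install the JDK unless it is already present
      block:
        - name: Check for an existing JDK (fails if java is missing)
          command: java -version
          changed_when: false
      rescue:
        - name: Install the JDK
          command: rpm -ivh /root/jdk-8u171-linux-x64.rpm
    - name: Install Hadoop unless it is already present
      block:
        - name: Check for an existing Hadoop (fails if hadoop is missing)
          command: hadoop version
          changed_when: false
      rescue:
        - name: Install Hadoop
          # assumption: --force works around file conflicts; drop it if the
          # package installs cleanly on your system
          command: rpm -ivh --force /root/hadoop-1.2.1-1.x86_64.rpm
```

The check task sits in block and the install task in rescue, so the install only runs when the check fails.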

Step2 Done✅

Next let’s move on to play 3, where we copy some Hadoop configuration files from our control node. Check out this blog to learn more about Hadoop configuration.

Our configuration files look like this:-

core-site.xml & hdfs-site.xml respectively

If you know how to configure a Hadoop cluster, these files will look familiar, with the exception of {{ hostvars['namenode']['nameip'] }} and {{ x }}. The first is the same nameip variable that we made in play 1. It was set for the host group name, which contains the namenode, so the other hosts (clientnode and datanode) cannot read it under its bare name. The plays that use these files therefore target groups covering several nodes, for example the parent group main, which combines the groups name and data and hence contains the namenode and the datanode. To pass a variable from one node to another we use hostvars, then the node that provides the variable (in our case namenode), then the name of the variable we need (in our case nameip). The lookup is written in Python’s dictionary style, as Ansible is built on Python. The {{ }} markers are there because Ansible uses Jinja to substitute values into text written in some other file (in our case the configuration files). Both of these files are located inside the workspace that we created. x is also a variable, which is described later.
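To see the hostvars lookup in isolation, a small debug play like the following sketch can help. It is not part of the original Playbook, and it assumes that play 1 has already set nameip on the host named namenode earlier in the same Playbook.

```yaml
# Sketch: read a variable that belongs to another host.
- hosts: client
  gather_facts: no
  tasks:
    - name: Show the NameNode IP from the client's point of view
      debug:
        msg: "NameNode IP is {{ hostvars['namenode']['nameip'] }}"
```

The same expression is what the configuration files use to fill in the NameNode address.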

play3 screen-shot

In this play we use main as our host, so Ansible works on the namenode and the datanode. We perform two tasks here. The first copies the hdfs-site file to both nodes at /etc/hadoop/hdfs-site.xml with the template module. We use template rather than copy because template writes out the assigned value of each variable wherever it is referenced, instead of the variable’s name. The second task uses the file module to create a folder for the NameNode and the DataNode directly under the / drive, named after the value of x. This is the same x as in our previous discussion: it holds the host-name of the node currently being worked on (namenode or datanode), sliced from index 0 to 4 (4 excluded), which gives name and data respectively.
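A sketch of this play, under the assumption that x is defined as a play variable sliced from inventory_hostname (the original may derive it from a different host-name variable):

```yaml
# Play 3 (sketch): render hdfs-site.xml and create the storage folder.
- hosts: main
  vars:
    # assumption: "namenode"[0:4] gives "name", "datanode"[0:4] gives "data"
    x: "{{ inventory_hostname[0:4] }}"
  tasks:
    - name: Render hdfs-site.xml with the variables filled in
      template:
        src: hdfs-site.xml
        dest: /etc/hadoop/hdfs-site.xml
    - name: Create the storage folder (/name or /data)
      file:
        path: "/{{ x }}"
        state: directory
```

Because x is a play variable, the hdfs-site.xml template can refer to the same {{ x }} when it names the storage folder.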

Step3 and step5-Done✅ also half of step4 too…

Let’s move on to play4, where we complete step4 by copying the remaining file (core-site.xml)…

play4 screen-shot

Again we use the template module, this time to copy the file to all of the nodes at once, to the location /etc/hadoop/core-site.xml.
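In sketch form, assuming core-site.xml sits in the workspace next to the Playbook:

```yaml
# Play 4 (sketch): render core-site.xml on every node.
- hosts: all
  tasks:
    - name: Render core-site.xml with the NameNode IP filled in
      template:
        src: core-site.xml
        dest: /etc/hadoop/core-site.xml
```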

Step4-step6 Done✅

Now let’s move on to play-5…

Play5 Screen-shot

In this play we perform several steps: steps 7, 8 and 9, i.e. formatting the namenode, starting its service, and opening port number 9001 for tcp in firewalld. We also keep this port open, i.e. permanent. Since most of these tasks are only required on the namenode, we use only it as our host.

For the first task we use the command module. Since the command module is not idempotent, and re-formatting a NameNode means losing all prior data of the cluster, we need to make this task idempotent ourselves. To do this we use a block containing a task that checks whether our NameNode has already been formatted: it looks for a file named VERSION inside /name/current. If this file exists, the NameNode was formatted earlier. If not, the cat command throws an error, which is caught by our rescue, where we write the command for formatting the NameNode.

The second task starts the NameNode service with the command hadoop-daemon.sh start namenode, for which we again need the command module. However, if the namenode service is already running, this command throws an error, which would stop Ansible from performing further tasks and plays. To overcome this we tell Ansible to ignore that error with ignore_errors: yes.

Lastly, we want this play to open the firewall at the port number discussed earlier. To do this we use the firewalld module; its keys and values are easy to write by going through the module’s documentation and the examples in it.
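Put together, the three tasks of this play might look like the sketch below. The stdin answer to the format command and the immediate: yes setting are my assumptions; on newer Ansible releases the firewalld module lives in the ansible.posix collection.

```yaml
# Play 5 (sketch): format once, start the NameNode, open port 9001.
- hosts: name
  tasks:
    - name: Format the NameNode only if it was never formatted
      block:
        - name: Check for existing NameNode metadata
          command: cat /name/current/VERSION
          changed_when: false
      rescue:
        - name: Format the NameNode
          command: hadoop namenode -format
          args:
            stdin: "Y"   # assumption: answers a possible re-format prompt
    - name: Start the NameNode service
      command: hadoop-daemon.sh start namenode
      ignore_errors: yes   # the command fails if the service is already up
    - name: Open port 9001/tcp permanently
      firewalld:
        port: 9001/tcp
        permanent: yes
        immediate: yes     # assumption: also apply to the running firewall
        state: enabled
```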

Now the only task remaining is starting the DataNode service, which is exactly like starting the NameNode service, but on the appropriate host.

Play6:-

This is the last play as well as the last task for us. As I said earlier, it is the same as starting the NameNode service, but on the appropriate host (target), i.e. data. We want to run a command, so we use the command module, and the command is hadoop-daemon.sh start datanode. That’s it.
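As a sketch, the final play is only a few lines:

```yaml
# Play 6 (sketch): start the DataNode service.
- hosts: data
  tasks:
    - name: Start the DataNode service
      command: hadoop-daemon.sh start datanode
```

If the Playbook is likely to be re-run while the DataNode is already up, ignore_errors: yes could be added here in the same way as for the NameNode.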

We’ve successfully written our Play-book, now only thing left is to run it…

But before we do that let’s check how our nodes look prior to running our Playbook:

ClientNode before

Here we do not have any software installed other than JDK, nor does it have any configuration files.

NameNode before

Same as ClientNode…

DataNode before 1
DataNode before 2

Here we do have the software as we want to copy them from here, but they aren’t installed, also it does have the files.

Now let’s run the Playbook and see what happens…

Playbook Run1
Playbook Run 2

Here any text in green🟩 means: No changes required.

Any text in yellow🟨 means: Changes were required and were made successfully.

Any text in red🟥 means: Changes were required but Ansible was unable to make them.

Also there is a purple text, which is just some warning about changes in some module in the current version, but for now can be ignored.

Now let’s check if the changes were made and if so, is the cluster running as intended…

NameNode after

So here on the NameNode we can see that the files were copied and the software was installed. We can reasonably expect the same on the other nodes as well.

ClientNode after

As suggested the required configurations were made in the ClientNode as well.

Now let’s check if the Cluster is running:

ClientNode after 2

As we can see, the Client is able to fetch the cluster report using the command hadoop dfsadmin -report, so the NameNode is running properly, as is the ClientNode. The report also lists the DataNode at 192.168.29.5, so it is running properly too…

So that’s it…

Task completed Successfully…🥳🥳\( ̄︶ ̄*\))

We’ve created our own Playbook\(@^0^@)/

Also, https://github.com/Prithviraj-Singh/Ansible-Hadoop-configuration here is the github for all the files we’ve used, only exception being the hosts file (for which I am confident enough that you’ll be able to make it in no time😁).

That’s all kind readers… Thanks for reading( ゚д゚)つ Bye
