With standard tools, setting up a Hadoop cluster on your own machines still involves a lot of manual labor. This is annoying the first time you have to do it, but it's even worse when a cluster (such as a test system) needs to be set up repeatedly or when machines enter and leave the cluster often.
The good news is that Foreman, Puppet and Ambari let you automate this process to a very large extent. Here we give a quick explanation of how this is done and how you can set up the infrastructure to automatically provision a Hadoop cluster on bare metal.
This is the second article in a series on the automation of Hadoop cluster provisioning and configuration management. In the first we described how to deploy your own virtual Hadoop cluster. You might want to start with the first article, in particular if you are not familiar with Ambari.
Before you start, you need the following:
- A Puppet-enabled Foreman server that is ready to use in your infrastructure, including running DNS, TFTP and DHCP servers. (We currently use Foreman version 1.4.2.) Foreman helps you manage servers through their lifecycle, from provisioning and configuration to orchestration and monitoring. With Puppet you can easily automate repetitive tasks and quickly deploy applications. Further information about such a server, and instructions for installing one, can be found in the Foreman project documentation.
- A set of machines that have been discovered by the Foreman server. These machines do not need an operating system or any other software installed. Any number of machines is possible, and you can add more machines at any time. Caution: the state of these machines will be lost, including the content of their hard drives.
Now you can start to configure the provisioning, that is, tell Foreman what kind of operating system and services you want on the machines. All steps except the first two take place in the Foreman user interface.
- Needed Files: First you need to download the files that describe the installation of the Ambari server and agent (Ambari is the Hadoop management and monitoring tool). To do this, log in to your Foreman server and run the following commands.
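sudo mkdir -p /etc/puppet/environments/ambari_dev
# Unpack inside the environment directory (assumes the archive contains a top-level modules/ directory)
cd /etc/puppet/environments/ambari_dev
sudo curl "http://vzach.de/data/ambari-provisioning.zip" -o "ambari-provisioning.zip"
sudo unzip ambari-provisioning.zip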
- Puppet Config: Add the following lines to the end of your Puppet configuration file (/etc/puppet/puppet.conf) so that Foreman can find the downloaded files.
modulepath = /etc/puppet/environments/ambari_dev/modules
- Operating System: Go to Hosts/Operating systems, click on “New Operating system” and choose the configuration as shown below. This enables the provisioning of CentOS.
OS – CentOS 6.5, Red Hat, x86_64
Partition Table – Kickstart default
Installation Media – CentOS mirror
- Provisioning Templates: Go to Hosts/Provisioning templates and do the following for “Kickstart default” and “Kickstart default PXELinux”. These templates automate the CentOS installation, including the installation of Puppet.
- Click on the entry, go to the tab “Association” and check the box for “CentOS 6.5”.
Association – CentOS 6.5
- Go back to the operating system you configured and select both templates there.
Templates – Kickstart default & Kickstart default PXELinux
- Puppet Environment: Go to Configure/Environments and click on “Import from …” (your Puppet master should show up there). Now check the entry “ambari_dev” and click on “Update”. This imports the Puppet files that you downloaded in step one.
Environment – ambari_dev
- Host Group – Ambari Server: Go to Configure/Host groups, click on “New Host Group” and choose the configuration as shown below. (Some entries depend on your own infrastructure: Puppet CA, Puppet Master, Domain, Subnet, (new) Root Password)
A host group bundles the settings for a particular group of machines. Here we define that every machine in the Ambari server group runs an Ambari server and an Ambari agent. Usually you will only need one machine in this group.
Host Group – ambari_server, ambari_dev, …
Included Classes – interfering_services, ntp, ambari_server, ambari_agent
OS – x86_64, CentOS 6.5, …
- Host Group – Ambari Agent: Create the “ambari_agent” host group as in the previous step (the OS and network configuration stay the same). Every machine other than the one with the Ambari server will be in this group. Which services of the Hadoop stack run on which machine is defined later in Ambari itself. Therefore every machine other than the Ambari server is provisioned in the same way.
Host Group – ambari_agent, ambari_dev, …
Included Classes – interfering_services, ntp, ambari_agent
- Default Values: Go to Configure/Puppet classes, click the entry “ambari_server”, choose the tab “Smart Class Parameter” and click on the entry “ownhostname”. Now enter “ambariserver.” followed by the domain of your ambari_server host group (in our case “ambariserver.local.cloud”) and submit the update. Do the same for the class “ambari_agent” with the parameter “serverhostname” and the same value (“ambariserver…”).
By default, the single Ambari server you need is expected at this name, so you do not have to configure it every time you add a new machine to the cluster. (It is still possible to override this value.)
Default Value – “ambariserver.” + …
- Smart Values: For the class “ambari_agent” and the parameter “ownhostname”, enter the value <%= @host.fqdn %>. This trick works only if you disable “safemode_render” under Administer/Settings/Provisioning (set it to false). It lets you automatically parametrize the Puppet files with the hostname of each new machine.
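A note on the Puppet Config step above: on a Puppet 3.x master that uses config-file environments, a per-environment module path normally sits in a section of puppet.conf that is named after the environment. The following is a minimal sketch under that assumption. The “[ambari_dev]” section header and the helper name “add_puppet_env” are our illustration and not part of the downloaded files, so compare the result with your own puppet.conf before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: append an environment section to a puppet.conf file.
# Assumes config-file environments (Puppet 3.x), where a section named after
# the environment carries its own modulepath.
add_puppet_env() {
    conf_file="$1"   # e.g. /etc/puppet/puppet.conf
    env_name="$2"    # e.g. ambari_dev

    # Do nothing if the section header already exists, so re-running is safe.
    if grep -qxF "[${env_name}]" "$conf_file" 2>/dev/null; then
        return 0
    fi

    # Append a blank line, the section header and the modulepath setting.
    printf '\n[%s]\nmodulepath = /etc/puppet/environments/%s/modules\n' \
        "$env_name" "$env_name" >> "$conf_file"
}

# Example (run as root on the Foreman server):
# add_puppet_env /etc/puppet/puppet.conf ambari_dev
```

The helper only ever appends, and it skips the write when the section is already present, so an existing configuration is left untouched.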
Setting up the machines
Starting the machines is now quite easy:
- Choose one of your discovered hosts and click on “Provision”. Now enter the name ambariserver and choose the host group “ambari_server”. Everything else is filled in automatically. Continue by submitting and make sure the chosen machine is now restarting.
You have just started to provision a machine with an Ambari server and an additional Ambari agent on the same node. Once this process is done, you could already start to configure a cluster of one machine with Ambari. Generally, however, you will want to add more machines.
Provisioning – ambariserver, ambari_server, …
- Additional machines can be provisioned with the “ambari_agent” host group and names of your choice. You can also repeat this step whenever you want to add new machines to an existing cluster.
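To check that a freshly provisioned agent machine points at the right server, you can look at the agent's configuration file. The sketch below assumes that the agent keeps its settings in /etc/ambari-agent/conf/ambari-agent.ini with a “hostname” key in the “[server]” section, which is the layout of a stock Ambari agent. Whether the Puppet module used here writes exactly this file is an assumption, and the function name “agent_server_hostname” is made up for this example.

```shell
#!/bin/sh
# Print the server hostname an Ambari agent is configured to report to.
# Reads the "hostname" key from the [server] section of an ini file.
agent_server_hostname() {
    # Assumed default location of the agent configuration.
    ini_file="${1:-/etc/ambari-agent/conf/ambari-agent.ini}"

    awk -F= '
        # Remember whether the current section is [server].
        /^\[/ { in_server = ($0 == "[server]") }

        # Inside [server], print the value of the hostname key and stop.
        in_server && $1 ~ /^[ \t]*hostname[ \t]*$/ {
            gsub(/^[ \t]+|[ \t]+$/, "", $2)
            print $2
            exit
        }
    ' "$ini_file"
}

# Example (run on an agent machine):
# agent_server_hostname
```

On a correctly provisioned agent, the printed name should match the value you entered for “serverhostname” in the Default Values step.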
Configure your Hadoop Cluster
As described in our virtual provisioning blog post, you can now go to the Ambari server user interface (port 8080) and continue to install your Hadoop cluster through its graphical interface. Keep in mind that the hostnames now depend on your own configuration. The “manual registration on the hosts” shouldn’t bother you here either, because we have already configured this for you.
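Before you open the installation wizard, you may want to confirm that all agents have registered with the server. Ambari exposes a REST API on the same port as the user interface, and its hosts endpoint lists the registered machines. The sketch below is one way to query it. The function names are hypothetical, and admin/admin is Ambari's default login, which is an assumption about your setup.

```shell
#!/bin/sh
# Build the URL of the Ambari REST endpoint that lists registered hosts.
ambari_hosts_url() {
    server="$1"          # FQDN of the Ambari server
    port="${2:-8080}"    # port of the Ambari UI and API, 8080 unless changed
    printf 'http://%s:%s/api/v1/hosts\n' "$server" "$port"
}

# Query the endpoint and print the JSON response.
# AMBARI_USER and AMBARI_PASS are optional overrides for the assumed
# default credentials admin/admin.
list_ambari_hosts() {
    curl -s -u "${AMBARI_USER:-admin}:${AMBARI_PASS:-admin}" \
        "$(ambari_hosts_url "$@")"
}

# Example:
# list_ambari_hosts ambariserver.local.cloud
```

Every machine you provisioned with the “ambari_server” or “ambari_agent” host group should appear in the response once its agent has started.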
With the tools shown here, you can automate the provisioning of machines for your Hadoop cluster. This saves effort in operations and lets you be more daring when trying out new configurations (after all, you can set up the entire infrastructure again in a few hours with very little manual effort). The configuration shown here is even portable to virtual machines and can therefore be used to create minimal clusters on developer machines or for automated testing.
The approaches shown here go a long way towards realizing the Infrastructure as Code vision for Hadoop, i.e. describing the entire Hadoop cluster in configuration files that can be maintained and managed together with the rest of the source code. Such configuration files enable everyone, be they dev or ops, to quickly set up an identical infrastructure automatically.
The only missing component is the configuration of the actual Hadoop services using Ambari, but even this can be automated (we will look at this another time).
Valentin Zacharias and Malte Nottmeyer