Hadoop provisioning automation following the “Infrastructure as Code” paradigm
What is the quickest and best way to get a virtual Hadoop cluster running on your development machine?
One option is to use golden images, like those prepared by Hortonworks or Cloudera. These are virtual machines that come completely configured, with tutorials and everything. However, each is just a single virtual machine, and they are geared towards initial learning, without much room for configuration.
The other option is to use Ambari (the graphical monitoring and management environment for Hadoop) to configure Hadoop on your virtual machines. Ambari is getting better almost daily and has reached a level of maturity that makes an Ambari-managed mini-cluster easily superior to the above-mentioned golden images for everything except, perhaps, running your very first tutorial.
Recently Hortonworks has described the steps to use Ambari to configure virtual machines here, but that guide still needs something like 20 commands just to set up Ambari, and the result is still just one node (not much of a cluster…).
Adding Puppet to the mix, we can do better: here we'll give you the tools and show you how to set up a Hadoop cluster of three (virtual) nodes, managed by Ambari, with just three commands.
You need to have a few things to get started.
- A decent machine – we will run three virtual machines with 2GB RAM each – we tried this only on machines with 16GB RAM and you should have 8GB at the very least.
- Vagrant – a tool that helps to manage virtual development environments. You need to have this installed, together with a provider for virtual machines such as VirtualBox. Please make sure that your versions are current (Vagrant does not(!) automatically alert you to the availability of new versions).
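A quick way to check both prerequisites from a terminal is sketched below. It assumes VirtualBox is your provider and that its command-line tool VBoxManage is on your PATH; the helper name check_tool is made up for this example.

```shell
# Print the version of a tool if it is on the PATH, otherwise report it missing.
# check_tool is a hypothetical helper name, not part of Vagrant or VirtualBox.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: %s\n' "$1" "$("$1" --version 2>&1 | head -n 1)"
    else
        printf '%s: not found\n' "$1"
        return 1
    fi
}

# VBoxManage is VirtualBox's command-line front end.
for tool in vagrant VBoxManage; do
    check_tool "$tool" || echo "please install $tool first" >&2
done
```

Compare the printed versions with the current releases on the Vagrant and VirtualBox download pages.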
Set Up Virtual Machines and Install Ambari
Open a terminal window, then download and unzip the Vagrant and Puppet files that we created:
curl "http://vzach.de/data/ambari-provisioning.zip" -o "ambari-provisioning.zip"
unzip ambari-provisioning.zip
These files contain the Puppet code that sets up the virtual machines to run Ambari (which in turn will set up Hadoop on your virtual cluster). Puppet is a tool for automating configuration management, and Vagrant supports it as a provisioner. For convenience these files also include the Puppet standard library stdlib.
Now change into the newly created ambari-provisioning directory and start everything by typing vagrant up.
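Spelled out as a shell session, this step might look like the following sketch. It assumes the archive unpacks into a directory named ambari-provisioning with a Vagrantfile at its top level; the function name start_cluster is made up for this example. vagrant up creates and provisions every machine defined in the Vagrantfile, and vagrant status lists their state afterwards.

```shell
# Bring up all machines defined in the Vagrantfile of the given directory.
# start_cluster is a hypothetical helper; the directory layout is an assumption.
start_cluster() {
    dir="${1:-ambari-provisioning}"
    if [ ! -f "$dir/Vagrantfile" ]; then
        echo "no Vagrantfile found in $dir" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory stays unchanged.
    (cd "$dir" && vagrant up && vagrant status)
}
```

After unzipping, run start_cluster from the directory that contains ambari-provisioning.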
Then grab a coffee and find something nice to read – it will take a while (expect something like 15 minutes, but it very much depends on your machine and your internet connection).
What happens is this: first a virtual machine image for CentOS is downloaded, then three virtual machines (named one, two and three) are created from this image, and finally the virtual machines are configured to run Ambari. Firewall services are stopped, ntp is installed and started, the /etc/hosts files are changed to enable communication between the virtual machines, the Ambari server and agents are installed and started, and the agents are told where to find the server machine. Machine "one" will run the Ambari server; all three machines will run Ambari agents.
The files only change the configuration of the virtual machines (which are not accessible from the public internet); nothing is installed directly on your machine. You can see all of this by looking at the Puppet modules in the downloaded folder (all in all it's just around 250 LOC, not counting the Puppet standard library stdlib that we included for convenience). You can find an explanation of the structure and content of such files in this (German) introduction to Vagrant and Puppet.
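To make these steps concrete, here is a rough shell equivalent of what happens on each node. This is only a sketch and not the Puppet code from the download. The addresses 192.168.0.102 and 192.168.0.103 are assumptions (only 192.168.0.101 for machine "one" is given in this post), the function names are made up, and the sketch assumes a CentOS 6 style guest with an Ambari yum repository already configured. Installing the Ambari server on machine "one" is left out.

```shell
# Print /etc/hosts entries for the three nodes.
# Only 192.168.0.101 (machine "one") is given in the post; .102 and .103 are assumed.
hosts_entries() {
    i=1
    for name in one two three; do
        printf '192.168.0.10%d %s\n' "$i" "$name"
        i=$((i + 1))
    done
}

# Configure one node as an Ambari agent. Must run as root inside the guest.
# Defined here only for illustration; nothing below calls it.
configure_node() {
    server_host="$1"
    service iptables stop && chkconfig iptables off   # stop the firewall, keep it off
    yum install -y ntp && service ntpd start          # time synchronisation
    hosts_entries >> /etc/hosts                       # let the nodes resolve each other
    yum install -y ambari-agent                       # assumes an Ambari repo is set up
    # Tell the agent where to find the Ambari server.
    sed -i "s/^hostname=.*/hostname=${server_host}/" /etc/ambari-agent/conf/ambari-agent.ini
    ambari-agent start
}
```

On a node you would call configure_node one as root. Expressing the same steps in Puppet has the advantage that they are declarative and can safely be applied more than once.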
Configure your Hadoop Cluster
Now you can use the graphical interface of Ambari to set up and configure your cluster. Just open 192.168.0.101:8080, log in with the default Ambari user and password (admin, admin), name your cluster and choose a service stack such as the default HDP 2.0. Then enter the hostnames and choose manual configuration as shown below (the system will warn you twice that you need to manually install Ambari agents on all machines, but don't worry, we already did this for you):
Choose the services that you want to run and the machines they should run on.
Fill in missing configuration info.
And deploy. This again will take quite a while (30 minutes or more), but it runs completely unattended, and on a decent developer machine you should be able to continue working in the meantime.
That’s it – you should now have a complete Hadoop cluster with all the services you configured.
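If you prefer to confirm the result from a terminal, Ambari also exposes a REST API under /api/v1 on the same port as the web interface. The sketch below uses the address and the default credentials mentioned above; the helper names ambari_url and ambari_get are made up for this example.

```shell
# Base URL of the Ambari REST API (address and port as used in this post).
AMBARI_BASE="http://192.168.0.101:8080/api/v1"

# Build the URL for a resource such as "hosts" or "clusters".
ambari_url() {
    printf '%s/%s\n' "$AMBARI_BASE" "$1"
}

# Fetch a resource as JSON.
# -s: no progress output, -f: fail on HTTP errors, -u: default Ambari credentials.
ambari_get() {
    curl -sf -u admin:admin "$(ambari_url "$1")"
}
```

For example, ambari_get hosts should list the three registered machines and ambari_get clusters the cluster you just named.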
This is just a fun technology demonstration, but there is some seriousness in our motivation for doing it. The techniques used here can be used to manage standardised test and development environments together with the code (the "Infrastructure as Code" vision), ensuring that all developers have immediate and easy access to such environments and that these environments can be versioned together with the rest of the codebase. And you can go even further: the code we created can be used to provision real (not virtual) machines (see here), and even the manual configuration with Ambari can be automated, but we will show this in a later blog post.
We've now also made the Puppet module available on Puppet Forge here. Note, however, that this is not everything we used in this blog post: the Vagrant file is not included, and neither is the etchosts module (which ensures that the virtual nodes can find each other), as it is not generally needed.
Valentin Zacharias and Malte Nottmeyer.