Installing a Hadoop Cluster with three Commands

Hadoop provisioning automation following the “Infrastructure as Code” paradigm

What is the quickest and best way to get a virtual Hadoop cluster running on your development machine?

One option is to use golden images, like those prepared by Hortonworks or Cloudera. These are virtual machines that come completely configured, with tutorials and everything. However, each is just a single virtual machine, and they are geared towards initial learning, with little room for configuration.

The other option is to use Ambari (the graphical monitoring and management environment for Hadoop) to configure Hadoop on your virtual machines. Ambari is getting better almost daily and has reached a level of maturity that makes an Ambari-managed mini-cluster easily superior to the above-mentioned golden images for everything except, perhaps, running your very first tutorial.

Recently Hortonworks has described the steps to use Ambari to configure virtual machines here, but this manual still takes something like 20 commands just to set up Ambari, and it’s still just one node (not much of a cluster…).

Adding Puppet to the mix, we can do better: here we’ll give you the tools and show you how to set up a three-node (virtual) Hadoop cluster managed by Ambari with just three commands.

Prerequisites

You need to have a few things to get started.

  • A decent machine – we will run three virtual machines with 2GB RAM each – we tried this only on machines with 16GB RAM and you should have 8GB at the very least.
  • Vagrant – A tool that helps to manage virtual development environments. You need to have this installed, together with a provider for virtual machines such as VirtualBox. Please make sure that your versions are current (Vagrant does not(!) automatically alert you to the availability of new versions).
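
Since Vagrant does not report outdated or broken releases on its own, a quick version check before starting can save a failed run. The following is only a minimal sketch: the function names are made up for illustration, and the only release it flags is 1.6.0, which the comment thread below reports as broken for this setup.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given version is the one release
# (1.6.0) that the comment thread reports as broken for this setup.
is_affected_vagrant() {
    [ "$1" = "1.6.0" ]
}

# Hypothetical helper: extract the version number from the output of
# `vagrant --version`, which has the form "Vagrant 1.6.1".
vagrant_version_from() {
    printf '%s\n' "$1" | awk '{ print $2 }'
}

# Report on the locally installed Vagrant, if there is one.
if command -v vagrant >/dev/null 2>&1; then
    installed=$(vagrant_version_from "$(vagrant --version)")
    if is_affected_vagrant "$installed"; then
        echo "Vagrant $installed is reported to fail here; use another release."
    else
        echo "Vagrant $installed found."
    fi
else
    echo "Vagrant is not installed."
fi
```

The check compares against a single version string on purpose; it does not try to judge whether a release is "current".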

Set Up Virtual Machines and Install Ambari

Open a terminal window, download and unzip the vagrant and puppet files that we created:

curl "http://vzach.de/data/ambari-provisioning.zip" \
   -o "ambari-provisioning.zip"
unzip ambari-provisioning.zip

These files contain the code for Puppet (a configuration management automation tool that is supported by Vagrant), which sets up the virtual machines to run Ambari (which in turn will set up Hadoop on your virtual cluster). For convenience, these files also include the Puppet standard library stdlib.
Now change into the newly created ambari-provisioning directory and start everything by typing:

vagrant up

Then grab a coffee and find something nice to read – it will take a while (expect something like 15 minutes, but it very much depends on your machine and your internet connection).

What happens is this: first a virtual machine image for CentOS is downloaded, then three virtual machines (named one, two and three) are created from this image and configured to run Ambari. Firewall services are stopped, ntp is installed and started, the /etc/hosts files are changed to enable communication between the virtual machines, the Ambari server and agents are installed and started, and finally the Ambari agents are told where to find the server machine. Machine “one” will run the Ambari server; all three machines will run Ambari agents.

The files only change the configuration of the virtual machines (which are not accessible from the global internet); nothing is installed directly on your machine. You can see all of this by looking at the Puppet modules in the downloaded folder (all in all it’s just around 250 LOC, not counting the Puppet standard library stdlib that we included for convenience). You can find an explanation of the structure and content of such files in this (German) introduction to Vagrant and Puppet.
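
To make the hosts-file step more concrete, here is a small sketch of the kind of entries each machine needs so that the nodes can resolve each other by name. It is an illustration, not the content of the downloaded Puppet module: only the address 192.168.0.101 for machine “one” appears in this post, and the addresses for “two” and “three”, as well as the function name, are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print /etc/hosts-style entries for the three nodes.
# Assumption: "two" and "three" use the two addresses following
# 192.168.0.101; check the Vagrantfile for the real values.
print_hosts_entries() {
    i=1
    for name in one two three; do
        printf '192.168.0.%d %s\n' "$((100 + i))" "$name"
        i=$((i + 1))
    done
}

print_hosts_entries
```

On a running cluster you can compare this with the actual file by running vagrant ssh one and then cat /etc/hosts.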

Configure your Hadoop Cluster

Now you can use the graphical interface of Ambari to set up and configure your cluster: open 192.168.0.101:8080 in your browser, log in with the default Ambari user and password (admin, admin), name your cluster, and choose a service stack such as the default HDP2.0. Then enter the hostnames and choose manual configuration as shown below (the system will warn you twice that you need to manually install Ambari agents on all machines, but don’t worry, we already did this for you):

Ambari Installation Options

Choose the services that you want to run and the machines they should run on.

Ambari Hadoop service selection

Assign service masters to nodes

Fill in missing configuration info.

Customize services

And deploy. This again takes quite a while (30 minutes or more), but it runs completely unattended and, on a decent developer machine, you should be able to continue working in the meantime.

That’s it – you should now have a complete Hadoop cluster with all the services you configured.

Ambari monitoring dashboard
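
Besides the dashboard, the result can also be checked from a terminal through Ambari’s REST API, which lists the defined clusters under /api/v1/clusters. The following is a minimal sketch that assumes the server address and the default credentials used in this post; the two function names are made up for illustration.

```shell
#!/bin/sh
# Address of the Ambari server (machine "one") as used in this post.
AMBARI_HOST="192.168.0.101"
AMBARI_PORT="8080"

# Hypothetical helper: build a full URL for a path on the Ambari server.
ambari_url() {
    printf 'http://%s:%s%s\n' "$AMBARI_HOST" "$AMBARI_PORT" "$1"
}

# Hypothetical helper: ask the Ambari REST API for the list of clusters,
# using the default admin/admin credentials. Gives up after 10 seconds.
list_clusters() {
    curl -s --max-time 10 -u admin:admin "$(ambari_url /api/v1/clusters)"
}

# Show the URL that would be queried; call list_clusters to send the request.
ambari_url /api/v1/clusters
```

Calling list_clusters from the host should return a JSON document that contains the cluster name you chose in the wizard, provided the virtual machines are reachable from your machine.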

Conclusion

This is just a fun technological demonstration, but there is a serious motivation behind it: the techniques used here can be used to manage standardised test and development environments together with the code (the “Infrastructure as Code” vision). This ensures that all developers have immediate and easy access to such environments, and that these environments can be versioned together with the rest of the codebase. And you can go even further: the code we created can be used to provision real (not virtual) machines (see here), and even the manual configuration with Ambari can be automated, but we will show this in a later blog post.

Update

We’ve now also made the Puppet module available on Puppet Forge here. However, note that this is not everything we used in this blog post: the Vagrantfile is not included, and the etchosts module (which ensures that the virtual nodes can find each other) is also not included, as it is not generally needed.

Authors

Valentin Zacharias and Malte Nottmeyer.

Comments

  • Kosmaj

    7 May 2014 by Kosmaj

    Hi,
    Thanks for a nice article!
    I’m trying to run this on a Mac (running OS X 10.8.5 with 16GB RAM) but it fails.
    First, the “files” shared folder was missing and the Puppet folders were referred to by wrong names in “Vagrantfile”. After creating the “files” folder and renaming Puppet folders to puppetManifests and puppetModules the script kept on running and the machine “one” was “booted and ready” but then the script failed on the following step:

    ==> one: Configuring and enabling network interfaces…
    The following SSH command responded with a non-zero exit status.
    /sbin/ifdown
    Stderr from the command:
    usage: ifdown

    Any ideas how to overcome the problem?
    Thanks!

    • Valentin Zacharias

      Thanks for the feedback – I corrected the file names and uploaded a new version. However, I cannot replicate the SSH error message from vagrant – any chance this is due to older versions of vagrant/virtualbox?

      • Valentin Zacharias

        In the end it turned out to be a bug in the newest version of vagrant (see here).

        For everyone: please do not use Vagrant 1.6.0 – use an older version or wait for 1.6.1.

  • Adam

    First I want to thank you for the post and Vagrant setup.

    Unfortunately, when running vagrant up, I get an error during provisioning when the ambari-server repo is accessed:

    Error: Execution of ‘/usr/bin/yum -d 0 -e 0 -y install ambari-server’ returned 1: Error: Cannot retrieve repository metadata (repomd.xml) for repository: Updates-ambari-1.x. Please verify its path and try again

    Is this on Hortonworks’ end? It seems that the path in the repo that is added doesn’t work 🙁

    • Sean Kruzel

      11 July 2014 by Sean Kruzel

      It looks like the repo reference created by the “http://public-repo-1.hortonworks.com/ambari/centos6/1.x/GA/ambari.repo” line in files /modules/ambari_agent/manifests/init.pp and /modules/ambari_server/manifests/init.pp is slightly incorrect.

      To fix this (albeit with a bit of a hack), I replaced part of the code at the top with these lines:

      # Ambari repo: download the repo file once
      exec { 'get-ambari-server-repo':
        command => 'wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/GA/ambari.repo',
        cwd     => '/etc/yum.repos.d/',
        creates => '/etc/yum.repos.d/ambari.repo',
        # path added so that wget is found without a fully qualified command
        path    => ['/bin', '/usr/bin'],
        user    => 'root',
      }
      # Fix an error in that repo reference: point the updates URL at 1.6.0
      exec { 'fix-ambari-server-repo':
        command => "sed -i 's/centos6\\/1.x\\/updates\\w*$/centos6\\/1.x\\/updates\\/1.6.0/g' ambari.repo",
        cwd     => '/etc/yum.repos.d/',
        path    => ['/bin', '/usr/bin'],
        user    => 'root',
        require => Exec['get-ambari-server-repo'],
      }
      # Ambari server: install only after the repo file has been fixed
      package { 'ambari-server':
        ensure  => present,
        require => Exec['fix-ambari-server-repo'],
      }

  • Stas

    Great tutorial! To run it smoothly, I just changed the repository address to “http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.2.3.7/ambari.repo” in both init.pp files.

  • Siarhei

    24 May 2015 by Siarhei

    How do you get from the host machine to Ambari, which runs on a guest machine in a private network, using 192.168.0.101:8080? 192.168.0.101 is unreachable. I described the problem at http://stackoverflow.com/questions/30385200/guest-ip-is-unreachable-under-vagrant-using-private-network

    Port forwarding does not work with the private network either.
