Installing a Hadoop Cluster with three Commands

22.4.2014 | 5 minutes of reading time

Hadoop provisioning automation following the “Infrastructure as Code” paradigm

What is the quickest and best way to get a virtual Hadoop cluster running on your development machine?

One option is to use golden images, like those prepared by Hortonworks or Cloudera. These are virtual machines that come completely configured, with tutorials and everything. However, each is just a single virtual machine, and they are geared towards initial learning, with little room for configuration.

The other option is to use Ambari (the graphical monitoring and management environment for Hadoop) to configure Hadoop on your virtual machines. Ambari is getting better almost daily and has reached a level of maturity that makes an Ambari-managed mini-cluster easily superior to the above-mentioned golden images for everything except maybe running your very first tutorial.

Hortonworks recently described the steps for using Ambari to configure virtual machines, but that guide still needs something like 20 commands just to set up Ambari, and the result is still just one node (not much of a cluster…).

Adding Puppet to the mix, we can do better. Here we give you the tools and show you how to set up a three-node (virtual) Hadoop cluster managed by Ambari with just three commands.

Prerequisites

You need to have a few things to get started.

A decent machine: we will run three virtual machines with 2GB RAM each. We tried this only on machines with 16GB RAM; you should have 8GB at the very least.
Vagrant: a tool that helps to manage virtual development environments. You need to have it installed, together with a provider for virtual machines such as VirtualBox. Please make sure that your versions are current (Vagrant does not(!) automatically alert you to the availability of new versions).
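Before starting, it may help to confirm that the required command-line tools are actually installed. The following is a minimal sketch, not part of the files provided with this post: it assumes that Vagrant and VirtualBox expose their usual executables "vagrant" and "VBoxManage" on your PATH, and the function name missing_tools is made up for illustration.

```shell
#!/bin/sh
# Sketch only: report which of the given command-line tools are not on PATH.
# "vagrant" and "VBoxManage" are assumed to be the executables installed by
# Vagrant and VirtualBox; missing_tools is a hypothetical helper name.

missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can find the tool
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

missing=$(missing_tools vagrant VBoxManage)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so several names end up on one line
    echo "Not found on PATH:" $missing
else
    echo "Vagrant and VirtualBox command-line tools found."
fi
```

If both tools are found, running vagrant --version shows which Vagrant release you have, so you can compare it with the current one by hand.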

Set Up Virtual Machines and Install Ambari

Open a terminal window, then download and unzip the Vagrant and Puppet files that we created:

curl "http://vzach.de/data/ambari-provisioning.zip" \
   -o "ambari-provisioning.zip"
unzip ambari-provisioning.zip

These files contain the Puppet code that sets up the virtual machines to run Ambari, which in turn will set up Hadoop on your virtual cluster. (Puppet is a tool for automating configuration management and is supported by Vagrant.) For convenience, the files also include the Puppet standard library stdlib.
Now change into the newly created ambari-provisioning directory and start everything by typing:

vagrant up

Then grab a coffee and find something nice to read – it will take a while (expect something like 15 minutes, but it very much depends on your machine and your internet connection).
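Because the run takes a while, it can be handy to let a script tell you when the Ambari web interface is reachable. The following is a minimal sketch, not part of the downloaded files. It assumes that curl is installed, uses the address of machine “one” given in the next section, and assumes that the interface answers plain HTTP requests on its start page without a login. The function names, retry count and delay are made up for illustration.

```shell
#!/bin/sh
# Sketch only: poll a URL until it answers or we give up.
# The address is the one used for machine "one" in this post; url_is_up,
# wait_for, the retry count and the delay are illustrative assumptions.

AMBARI_URL="http://192.168.0.101:8080"

# Succeeds (exit status 0) if the given URL answers an HTTP request.
url_is_up() {
    curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# wait_for URL TRIES DELAY [PROBE]
# Runs PROBE (default: url_is_up) against URL up to TRIES times,
# sleeping DELAY seconds between failed attempts.
wait_for() {
    url=$1
    tries=$2
    delay=$3
    probe=${4:-url_is_up}
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        if "$probe" "$url"; then
            echo "up after $attempt attempt(s): $url"
            return 0
        fi
        if [ "$attempt" -lt "$tries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    echo "giving up on $url after $tries attempt(s)" >&2
    return 1
}

# Example call, commented out so that loading the sketch has no side effects:
# wait_for "$AMBARI_URL" 30 10
```

The probe is passed in as a parameter so that the retry loop can be tried out without a running cluster, for example with the shell built-ins true and false.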

What happens is this: first a virtual machine image for CentOS is downloaded, then three virtual machines (named one, two and three) are created from this image and configured to run Ambari. Firewall services are stopped, ntp is installed and started, the /etc/hosts files are changed to enable communication between the virtual machines, the Ambari agents (clients) are installed and started, and finally the agents are told where to find the server machine. Machine “one” will run the Ambari server; all three machines will run Ambari agents.

The files only change the configuration of the virtual machines (which are not accessible from the public internet); nothing is installed directly on your machine. You can see all of this by looking at the Puppet modules in the downloaded folder (all in all it’s just around 250 LOC, not counting the Puppet standard library stdlib that we included for convenience). You can find an explanation of the structure and content of such files in this (German) introduction to Vagrant and Puppet.
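To make the hosts-file step concrete, here is a minimal shell sketch of the kind of entries each node needs in order to resolve the others. It is not the Puppet code from the download. Only the address 192.168.0.101 for machine “one” appears in this post; the addresses for “two” and “three”, the “.cluster” domain suffix and the function name render_hosts are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: print /etc/hosts-style lines for a three-node cluster.
# Assumptions (not taken from the provisioning files): nodes "two" and "three"
# use the next two addresses after 192.168.0.101, and hostnames end in ".cluster".

render_hosts() {
    base="192.168.0"   # network prefix; .101 for node "one" is from the post
    last=101           # last octet of the first node
    for name in one two three; do
        # address, fully qualified name, short alias
        printf '%s.%s %s.cluster %s\n' "$base" "$last" "$name" "$name"
        last=$((last + 1))
    done
}

render_hosts
```

The first line printed is "192.168.0.101 one.cluster one". On a real node such lines would have to be appended to /etc/hosts with root privileges; in the setup described here, Puppet takes care of that step.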

Configure your Hadoop Cluster

Now you can use the graphical interface of Ambari to set up and configure your cluster. Just open 192.168.0.101:8080, log in with the default Ambari user and password (admin, admin), name your cluster and choose a service stack such as the default HDP 2.0. Then enter the hostnames and choose manual configuration as shown below (the system will warn you twice that you need to manually install Ambari agents on all machines, but don’t worry, we already did this for you):

Ambari Installation Options

Choose some services that you want to run and the machines they should run on.

Ambari Hadoop service selection

Assign service masters to nodes

Fill in missing configuration info.

Customize services

And deploy. This will again take quite a while (30 minutes or more), but it runs completely unattended and, on a decent developer machine, you should be able to continue working in the meantime.

That’s it – you should now have a complete Hadoop cluster with all the services you configured.

Ambari monitoring dashboard

Conclusion

This is just a fun technological demonstration. However, there is a serious motivation behind it: the techniques used here can be used to manage standardised test and development environments together with the code (the “Infrastructure as Code” vision), ensuring that all developers immediately and easily have access to such environments and that these environments can be versioned together with the rest of the codebase. And you can go even further: the code we created can be used to provision real (not virtual) machines, and even the manual configuration with Ambari can be automated, but we will show this in a later blog post.

Update

We’ve now also made the Puppet module used here available on Puppet Forge. However, note that this is not everything we used in this blog post: the Vagrantfile is not included, and the etchosts module (which ensures that the virtual nodes can find each other) is also not included, as it is not generally needed.

Authors

Valentin Zacharias and Malte Nottmeyer.
