Automatic Provisioning of a Hadoop Cluster on Bare Metal with The Foreman and Puppet

29.4.2014 | 7 minutes of reading time

With standard tools, setting up a Hadoop cluster on your own machines still involves a lot of manual labor. This is annoying the first time you have to do it, but it's even worse when a cluster (such as a test system) needs to be set up repeatedly or when machines enter and leave the cluster often.

The good news is that Foreman, Puppet and Ambari allow you to automate this process to a very large extent. Here we give a quick explanation of how this is done and how you can set up the infrastructure to automatically provision a Hadoop cluster on bare metal.

This is the second article in a series on the automation of Hadoop cluster provisioning and configuration management. In the first we described how to deploy your own virtual Hadoop cluster. You might want to start with the first article, in particular if you are not familiar with Ambari.

Prerequisites

A Puppet-enabled Foreman server that is ready to use in your desired infrastructure, including a running DNS, TFTP and DHCP server. (We are currently using Foreman version 1.4.2.) Foreman helps manage servers through their lifecycle, from provisioning and configuration to orchestration and monitoring. With Puppet you can easily automate repetitive tasks and quickly deploy applications. You can find further information about such a Foreman server here and how to install such a server here.
A set of machines discovered by the Foreman server. These machines do not need an operating system or any other software installed. Any number of machines is possible, and you can add machines at any time. Caution: The state of these machines will be lost, including the content of their hard drives.

Configuring Foreman

Now you can start configuring the provisioning to tell Foreman what kind of operating system and services you want on the machines. All steps except the first two take place in the Foreman user interface.

Needed Files: First you need to download the files that describe the installation of Ambari server and agent (the Hadoop management and monitoring tool). To do this, log into your Foreman server and do the following.

sudo mkdir /etc/puppet/environments/ambari_dev
cd /etc/puppet/environments/ambari_dev
sudo curl "http://vzach.de/data/ambari-provisioning.zip" -o "ambari-provisioning.zip"
sudo unzip ambari-provisioning.zip
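
Before moving on, it can be worth checking that the archive unpacked into the layout the next step relies on. The following is only a minimal sketch: it assumes the archive contains a modules directory with one subdirectory per Puppet class used later in this article, and that the directory names match the class names. The function name check_modules is made up for this example.

```shell
# Report which of the expected module directories are missing
# below a Puppet environment directory.
# Assumption: module directory names equal the class names used in the host groups.
check_modules() {
    env_dir="$1"
    missing=0
    for module in interfering_services ntp ambari_server ambari_agent; do
        if [ ! -d "$env_dir/modules/$module" ]; then
            echo "missing module: $module"
            missing=1
        fi
    done
    # Returns 0 when every directory exists, 1 otherwise.
    return "$missing"
}

# Usage: check_modules /etc/puppet/environments/ambari_dev
```

If the function prints nothing and returns 0, the modulepath configured in the next step should point at a populated directory.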

Puppet Config: Add the following lines to the end of your Puppet configuration file (/etc/puppet/puppet.conf) for Foreman to find the downloaded files.

[ambari_dev]
    modulepath     = /etc/puppet/environments/ambari_dev/modules
    config_version =
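
If you set up Foreman servers more than once, you may prefer to script this change instead of editing the file by hand. The sketch below appends the section only if it is not already present, so it can be run repeatedly. The function name add_ambari_env is hypothetical, and the function has to be run with sufficient privileges to write to the Puppet configuration file.

```shell
# Append the [ambari_dev] section to a Puppet configuration file,
# unless a section with that name already exists.
add_ambari_env() {
    conf="$1"
    if grep -q '^\[ambari_dev\]' "$conf" 2>/dev/null; then
        echo "ambari_dev already configured in $conf"
        return 0
    fi
    # The quoted EOF keeps the shell from expanding anything in the section.
    cat >> "$conf" <<'EOF'

[ambari_dev]
    modulepath     = /etc/puppet/environments/ambari_dev/modules
    config_version =
EOF
    echo "added ambari_dev to $conf"
}

# Usage (as root): add_ambari_env /etc/puppet/puppet.conf
```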

Operating System: Go to Hosts/Operating systems, click on “New Operating system” and choose the configuration as shown below. This enables the provisioning of CentOS.

OS – CentOS 6.5, Red Hat, x86_64

Partition Table – Kickstart default

Installation Media – CentOS mirror
Provisioning Templates: Go to Hosts/Provisioning templates and do the following for “Kickstart default” and “Kickstart default PXELinux”. These templates automate the CentOS installation, including the installation of Puppet.
1. Click on the entry, go to the tab “Association” and check the box at “CentOS 6.5”.
  
  Association – CentOS 6.5
2. Go to the configured operating system again and choose each template.
  
  Templates – Kickstart default & Kickstart default PXELinux
Puppet Environment: Go to Configure/Environments and click on “Import from …” (your Puppet master should show up there). Now check the entry “ambari_dev” and click on “Update”. This imports the Puppet files that you downloaded in step one.

Environment – ambari_dev
Host Group – Ambari Server: Go to Configure/Host groups, click on “New Host Group” and choose the configuration as shown below. (Some entries depend on your own infrastructure: Puppet CA, Puppet Master, Domain, Subnet, (new) Root Password)
This combines the settings for a specific group of machines. Here we define that every machine in the Ambari server group runs both an Ambari server and an Ambari agent. Usually you will only need one server in this group.

Host Group – ambari_server, ambari_dev, …

Included Classes – interfering_services, ntp, ambari_server, ambari_agent

Network

OS – x86_64, CentOS 6.5, …
Host Group – Ambari Agent: Create the “ambari_agent” host group as in the previous step (the OS and network configuration stay the same). Every machine other than the one with the Ambari server will be in this group. The location of the services from the Hadoop stack will be defined in Ambari itself. Therefore every machine other than the Ambari server is provisioned in the same way.

Host Group – ambari_agent, ambari_dev, …

Included Classes – interfering_services, ntp, ambari_agent
Default Values: Go to Configure/Puppet classes, click the entry “ambari_server”, choose the tab “Smart Class Parameter” and click on the entry “ownhostname”. Now enter “ambariserver.” followed by the domain of your ambari_server host group (in our case “ambariserver.local.cloud”) and submit the update. Do the same for the class “ambari_agent” with the parameter “serverhostname” and the same value (“ambariserver…”).
By default, the single Ambari server you need will be found under this name, so you don’t need to configure it every time you add a new machine to the cluster. (It is possible to override this value, though.)

Default Value – “ambariserver.” + …
Smart Values: For the class “ambari_agent” and the parameter “ownhostname”, enter the value <%= @host.fqdn %>. This trick only works if you disable “safemode_render” (set it to false) under Administer/Settings/Provisioning. It allows you to automatically parametrize the Puppet files with the hostname of a new machine.
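
To make the effect of the last two steps concrete, the following sketch prints the parameter values a single machine would end up with. It is only an illustration of the naming rule described above, not something Foreman runs: the function name ambari_params and the host name node1 are made up, and the last line merely stands in for what <%= @host.fqdn %> renders to.

```shell
# Print the smart class parameter values that result for one host.
# Arguments: DOMAIN (domain of the host group), HOST (short name of the new machine).
ambari_params() {
    domain="$1"
    host="$2"
    server_fqdn="ambariserver.$domain"
    echo "ambari_server::ownhostname=$server_fqdn"
    echo "ambari_agent::serverhostname=$server_fqdn"
    # Stands in for the value that <%= @host.fqdn %> renders to for this host.
    echo "ambari_agent::ownhostname=$host.$domain"
}

# node1 is a hypothetical machine name.
ambari_params local.cloud node1
```

The first two values are identical for every machine in the cluster, which is why they can be set once as defaults. Only the third one changes per host.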

Setting up the machines

Starting the machines is now quite easy:

Choose one of your discovered hosts and click on “Provision”. Now enter the name ambariserver and choose the host group “ambari_server”. Everything else is filled in automatically. Continue by submitting and make sure the chosen machine is now restarting.
You have just started provisioning a machine with an Ambari server and an additional Ambari agent on the same node. Once this process is done, you could already start configuring your one-machine cluster with Ambari. Generally, however, you will want to add more machines.

Provisioning – ambariserver, ambari_server, …
Additional machines can be provisioned with the “ambari_agent” host group and names of your choice. You can also repeat this step whenever you want to add new machines to an existing cluster.

Provisioned Hosts
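
The rule for choosing a host group is simple enough to write down as a small sketch, which can help when planning a larger batch of machines. The function name host_group_for and the names node1 and node2 are hypothetical. Only the name ambariserver and the two host group names come from the steps above.

```shell
# Map a host name to the host group used in this article:
# the machine named ambariserver gets the server group, all others the agent group.
host_group_for() {
    case "$1" in
        ambariserver) echo "ambari_server" ;;
        *)            echo "ambari_agent" ;;
    esac
}

# node1 and node2 are hypothetical names for additional machines.
for name in ambariserver node1 node2; do
    echo "$name -> $(host_group_for "$name")"
done
```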

Configure your Hadoop Cluster

As described in our virtual provisioning blog post, you can now go to the Ambari server user interface (port 8080) and continue installing your Hadoop cluster through its graphical interface. Keep in mind that the hostnames now depend on your own configuration. The “manual registration on the hosts” shouldn’t bother you here either, because we have already configured this for you.
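
Before starting the install wizard, you may want to confirm that the agents on your new machines have registered with the server. The sketch below queries the host listing of Ambari's REST API and prints the registered host names. It assumes that the default admin login is still in place and that each host appears in the response as a "host_name" field. The function names are made up for this example.

```shell
# Build the URL of Ambari's host listing for a given server name.
ambari_hosts_url() {
    echo "http://$1:8080/api/v1/hosts"
}

# Pull the host names out of a JSON response read from stdin.
# Assumption: each host appears as "host_name" : "<fqdn>" in the response.
extract_host_names() {
    grep -o '"host_name" *: *"[^"]*"' | sed 's/.*: *"\([^"]*\)"/\1/'
}

# List the hosts whose agents have registered with the server.
# admin:admin is Ambari's default login; adjust it if you changed the password.
list_registered_hosts() {
    curl -s -u admin:admin "$(ambari_hosts_url "$1")" | extract_host_names
}

# Usage: list_registered_hosts ambariserver.local.cloud
```

Every machine you provisioned should show up in this list. A missing host usually means its agent has not started yet or cannot resolve the server name.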

Conclusion

With the tools shown here, you can automate the provisioning of machines for your Hadoop cluster. This saves effort in operations and lets you be more daring when trying out new configurations (after all, you can set up the entire infrastructure again in a few hours with very little manual effort). The configuration shown here is even portable to virtual machines and can therefore be used to create minimal clusters on developer machines or for automated testing.

The approaches shown here go a long way towards realizing the Infrastructure as Code vision for Hadoop, i.e. the description of the entire Hadoop cluster in configuration files that can be maintained and managed together with the rest of the source code. These configuration files enable everyone, be they dev or ops, to quickly set up an identical infrastructure automatically.

The only missing component is the configuration of the actual Hadoop services using Ambari, but even this can be automated (we will look at this another time).

Authors

Valentin Zacharias and Malte Nottmeyer
