Calculating Pi with Apache Spark

16.4.2016 | 9 minutes of reading time

Apache Spark is a system for cluster computing and part of the increasingly popular SMACK stack. The aim of this blog post is to provide a beginner's introduction to setting up a mini Spark cluster of virtual machines (VMs) using Vagrant and to running a small example application on it that approximates \(\pi\).

The cluster

To set up the Vagrant cluster on your local machine you need to first install Oracle VirtualBox on your system. After this it suffices to clone the Git repository from here to a working directory of your choice.

Once in the working directory, we can spin up the cluster using the console command vagrant up. The cluster is deployed in standalone mode and will consist of a designated master node named sparkmaster and a configurable number of worker nodes. The nodes are assigned consecutive static IP addresses and the workers are accessible via password-less SSH from the master node.

The following table summarizes the hostnames and IP addresses of the nodes for later reference. It also includes the URLs of the web UIs provided by Spark on the nodes once the cluster is running:

Nodename	IP address	Web UI
sparkmaster	192.168.33.100	http://192.168.33.100:8080
sparkworker-01	192.168.33.101	http://192.168.33.101:8081
sparkworker-02	192.168.33.102	http://192.168.33.102:8082
etc.	etc.	etc.

After the cluster is up, we can use the command vagrant ssh nodename to connect to the node with the name nodename. For example, let us connect to the master node via vagrant ssh sparkmaster and have a look at its Spark installation directory:

vagrant@sparkmaster:~$ ls -F $SPARK_HOME
CHANGES.txt  NOTICE  README.md  bin/   data/  examples/  licenses/  python/
LICENSE      R/      RELEASE    conf/  ec2/   lib/       logs/      sbin/

Spark comes with a couple of important directories containing executables and configuration files:

First of all, the directory SPARK_HOME/bin contains the spark-shell script for running Spark’s REPL (read-evaluate-print-loop), which allows interactive data exploration. But our main character here is the spark-submit script: it can be used to submit Spark applications as a JAR to the cluster.
Next, SPARK_HOME/conf contains the configuration files slaves and spark-env.sh. The first lists the hostnames of all VMs to be used as slaves while the second lists options used by Spark.
Finally, the directory SPARK_HOME/sbin will be important as it contains the shell scripts for starting and stopping the master as well as worker instances on the designated machines, either individually or in one go via the start-all.sh and stop-all.sh scripts.

We will start the master on the VM named sparkmaster while all the other VMs will be used as slaves. This can be achieved by running the start-all.sh script on sparkmaster:

vagrant@sparkmaster:~$ $SPARK_HOME/sbin/start-all.sh

We can check that everything went smoothly by inspecting the log files in the SPARK_HOME/logs directory on each individual machine. As mentioned, the master and slave instances can be stopped by running the stop-all.sh script on sparkmaster.

Inspecting the web UI

More information is available from the Spark master's web UI:

Here we find the following information:

A list of all workers in the cluster under the section heading Workers.
Information on Running Applications and Completed Applications.

The UI is reachable as long as we do not deliberately stop the master by invoking one of the scripts for stopping it.

Submitting an application to the cluster

To actually submit an application to our cluster, we make use of the SPARK_HOME/bin/spark-submit script. To test this, and to check that our cluster is set up properly, we will use the example application for computing an approximation to \(\pi\) via Monte Carlo that ships with the Spark installation (Code: GitHub).

For convenience the shared vagrant folder contains a shell script for submitting the example application to the cluster:

spark-submit \
--class de.codecentric.SparkPi \
--master spark://192.168.33.100:7077  \
--conf spark.eventLog.enabled=true \
/vagrant/jars/spark-pi-example-1.0.jar 10

Besides a reference to the main class in the JAR and the path to the latter, we pass the IP address and port of the Spark master instance and enable event logging. Event logging will allow us to look at specific information in the web UI even after the application has finished. The argument 10 determines the size of the random sample used and also the degree of parallelism; see below.

If we invoke this script we get the result of the computation printed to the console. Also note the corresponding finished application after switching to the master Web UI in our browser:

vagrant@sparkmaster:~$ /vagrant/scripts/submit-script-pi.sh
Pi is roughly 3.13918

How is \(\pi\) approximated here?

This computation is based on the following heuristic: \(\pi\) is the area \(A_{\mathrm{Circle}}\) of a circle with radius \(r=1\) (generally, \(\pi\cdot r^2\) is the area of a circle of radius \(r\)).

One then circumscribes this unit circle with a square whose area equals \(A_{\mathrm{Square}}=4\). The ratio of these two areas thus equals \(\frac{A_{\mathrm{Circle}}}{A_{\mathrm{Square}}}=\frac{\pi}{4}\) and gives the geometric probability that a point picked uniformly at random inside the square lies inside the circle.
Now let us assume that we pick a huge number \(n\) of points randomly inside the circumscribed square, for example, by throwing darts or dropping rain drops onto it. A certain number \(n_{\mathrm{in}}\) of these points will end up inside the area described by the circle while the remaining number \(n_{\mathrm{out}}\) of these points will lie outside of it (but inside the square). Thus \(n_{\mathrm{in}}+n_{\mathrm{out}}=n\) and the observed fraction of points lying inside the circle is \(\frac{n_{\mathrm{in}}}{n}\), which estimates the probability above.
Heuristically, one has \(\frac{A_{\mathrm{Circle}}}{A_{\mathrm{Square}}}\approx\frac{n_{\mathrm{in}}}{n}\) and hence \(\pi\approx\frac{4\cdot n_{\mathrm{in}}}{n}\).
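
The heuristic can be tried out without any cluster. The following is a minimal single-machine sketch in plain Scala, not the Spark application itself; the object name LocalPi, the method estimate and the fixed seed are made up for illustration.

```scala
import scala.util.Random

object LocalPi {
  // Estimates pi from n random points drawn uniformly from [-1, 1] x [-1, 1].
  def estimate(n: Int, rng: Random): Double = {
    require(n > 0, "need at least one sample point")
    // Count the points that land strictly inside the unit circle.
    val inside = (1 to n).count { _ =>
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y < 1
    }
    // The fraction inside approximates pi / 4.
    4.0 * inside / n
  }

  def main(args: Array[String]): Unit = {
    // A fixed seed (an arbitrary choice) makes this particular run repeatable.
    val rng = new Random(42L)
    println("Pi is roughly " + estimate(1000000, rng))
  }
}
```

Passing the random number generator in explicitly keeps the estimator easy to test; without a fixed seed the printed value varies from run to run.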

It goes without saying that this algorithm is non-deterministic and results will likely change with each run.
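
How much the result fluctuates between runs can be estimated with a short calculation. Each point lands inside the circle independently with probability \(p=\frac{\pi}{4}\), so \(n_{\mathrm{in}}\) follows a binomial distribution with parameters \(n\) and \(p\). For the estimator \(\frac{4\cdot n_{\mathrm{in}}}{n}\) this gives

\[
\mathrm{Var}\left(\frac{4\cdot n_{\mathrm{in}}}{n}\right)=\frac{16\cdot p\,(1-p)}{n}=\frac{\pi\,(4-\pi)}{n},
\qquad
\sigma=\sqrt{\frac{\pi\,(4-\pi)}{n}}\approx\frac{1.64}{\sqrt{n}}.
\]

As an assumption, suppose the run above used \(n=10^6\) points (100000 per slice with 10 slices, which is how the code later in this post sizes its sample; the submitted JAR is assumed to do the same). Then \(\sigma\approx 0.0016\), and the printed value 3.13918 deviates from \(\pi\) by about 0.0024, roughly 1.5 standard deviations, which is unremarkable.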

To wrap things up: The beauty of this is that it paves a way to approximate \(\pi\) by simply counting the fraction of points that end up inside the circle out of a total population of points randomly thrown at the circumscribed square. This is something that can be distributed in a trivial fashion, and it is exactly what the mentioned Spark application does! An interactive visualization of the above may be found here: Link.

In the following, we drill down on some of the basic concepts of Spark by looking into the code of SparkPi. This includes the concept of an RDD, Spark’s abstract data type for handling data distributed on a cluster.

Resilient Distributed Datasets (RDDs)

Within the Spark world the core abstraction is that of a Resilient Distributed Dataset. The rationale is that we want to create, distribute and process data within a cluster that is created from various input data, e.g. text files or plain Java/Scala collections. These input data are structured by Spark into RDDs, which one can basically think of as Java/Scala collections that are distributed over the cluster in partitions. Spark provides a functional-programming style API for Java/Scala that allows us to either

create new RDDs from various input sources, like files residing in HDFS, etc.
create new RDDs from already existing ones by so-called transformations, or
create final Java/Scala values from existing RDDs by so-called actions.

To make these distributed data sets resilient or fault-tolerant, Spark keeps track of the dependencies between the input data and the intermediate RDDs created from it through an RDD dependency graph. In case of failure this graph allows Spark to replay the parts of the computation that are necessary to recreate the RDD at hand. It is important to note that RDDs are computed in a lazy fashion: only creating a final Java/Scala value via an action triggers the actual execution of a computation. Since the dependency graph in Spark is an example of a directed acyclic graph (DAG), this name is frequently used to refer to it, for example in the web UI.
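
The lazy behaviour can be mimicked on a single machine. The following is only an analogy that uses views from the Scala standard library, not Spark's API: map on a view plays the role of a transformation and sum plays the role of an action. The object name LazyViewDemo and the evaluation counter are made up for illustration.

```scala
object LazyViewDemo {
  // Returns (evaluations before forcing, result, evaluations after forcing).
  def run(): (Int, Int, Int) = {
    var evaluations = 0
    // "Transformation": only describes the computation, nothing runs yet.
    val squares = (1 to 5).view.map { i =>
      evaluations += 1
      i * i
    }
    val before = evaluations
    // "Action": forces the view and yields a plain Scala value.
    val total = squares.sum
    (before, total, evaluations)
  }

  def main(args: Array[String]): Unit = {
    val (before, total, after) = run()
    println(s"evaluations before the action: $before")
    println(s"sum of squares: $total")
    println(s"evaluations after the action: $after")
  }
}
```

Unlike a view, an RDD is additionally split into partitions across the cluster, and its dependency graph is what allows lost partitions to be recomputed.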

Writing a simple Spark application

To illustrate the ideas outlined in the previous section, let us rewrite the application SparkPi step by step. We will follow the original source but allow ourselves to divert a little from it in order to stress where and how RDDs are created and transformed. To begin with, the basic skeleton for the main application looks as follows:

import scala.math.random
import org.apache.spark._

object SparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Pi")
    val sc = new SparkContext(conf)

    // Application code goes here...

    sc.stop()
  }
}

The main entry point of every Spark application is a SparkContext object. It provides a connection to the Spark cluster, holds context information about the cluster as well as the application itself, and is used to create RDDs from input data. For example, we are able to set the name of the application, which will also appear in the Spark web UIs, to "Spark Pi". Further parameters can be passed to the Spark context at runtime, as has already happened in the above usage of the submit script; there the IP address of the master node is passed to the Spark context.

The main step in the application code is to create a huge number n of random sample points by using the parallelize method provided by the Spark context sc. It allows us to create an initial RDD from any Scala collection. In our case this collection, xs, consists of the first n consecutive numbers. The resulting RDD is divided into as many partitions as given by slices. Next, this RDD is transformed via map into the RDD sample that contains n random points \((x,y)\) inside the square \([-1,1]\times [-1,1]\). Finally, we use filter to keep only those points of the sample that lie in the interior of the unit disc and count them in order to obtain an approximate value for \(\pi\). Here counting represents the final action that triggers the evaluation of all previous RDDs along the dependency graph.

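// Number of partitions: first command-line argument, defaulting to 2
val slices = if (args.length > 0) args(0).toInt else 2
// Sample size: 100000 points per partition, capped to avoid Int overflow
val n = math.min(100000L * slices, Int.MaxValue).toInt
// The numbers 1 to n; each one gives rise to one random point
val xs = 1 to n
val rdd = sc.parallelize(xs, slices)
            .setName("'Initial rdd'")
// Transformation: map every number to a random point in [-1, 1] x [-1, 1]
val sample = rdd.map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  (x, y)
}.setName("'Random points sample'")

// Transformation: keep only the points inside the unit circle
val inside = sample.filter { case (x, y) => (x * x + y * y < 1) }.setName("'Random points inside circle'")

// Action: triggers the evaluation of the whole dependency graph
val count = inside.count()

println("Pi is roughly " + 4.0 * count / n)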
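
To make the role of slices more tangible, here is a plain-Scala sketch of how a collection can be cut into contiguous chunks of nearly equal size. It is only an illustration under that assumption; Spark's own partitioning logic may differ in detail, and the names SliceSketch and slice are hypothetical.

```scala
object SliceSketch {
  // Splits xs into numSlices contiguous chunks of near-equal size.
  def slice[T](xs: IndexedSeq[T], numSlices: Int): Seq[IndexedSeq[T]] = {
    require(numSlices >= 1, "need at least one slice")
    (0 until numSlices).map { i =>
      // Long arithmetic avoids overflow for large collections.
      val start = ((i * xs.length.toLong) / numSlices).toInt
      val end = (((i + 1) * xs.length.toLong) / numSlices).toInt
      xs.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    val parts = slice(1 to 10, 3)
    parts.zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString(", ")}")
    }
  }
}
```

With 10 elements and 3 slices, the chunks hold 3, 3 and 4 elements. Each chunk could then be processed independently, which is what makes the sampling step easy to parallelize.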

After running the application, we can find a visual representation of the dependency graph of the final RDD inside by clicking on either the corresponding application ID or name (here “SparkPi”) in the master web UI under the section “Completed Applications”. There one finds a link labeled “Application Detail UI”, which leads to more detailed information about the jobs and stages involved in the application. Our application includes only one job consisting solely of one stage, and by clicking on the corresponding link in the “Application Detail UI”, we finally find a representation of the dependency graph:

Notice that we were able to set names for debugging/monitoring purposes in the application code by using the setName method provided by the RDD class, and that these names also appear in the visual representation of the dependency graph. This is helpful, for example, when it comes to identifying performance bottlenecks in larger applications that involve more intricate ways of creating and transforming RDDs.

That’s all! If you want, you can stop the cluster using vagrant halt or can completely get rid of it with vagrant destroy -f after exiting from the master’s shell.

Summary

In conclusion, we described how to set up a small Spark cluster using Vagrant, and how to write and submit a simple application to the cluster. Finally, we saw how to make basic use of the web UI for monitoring purposes.


Blog author

Daniel Pape
