Map/Reduce with Hadoop and Pig

25.10.2012 | 7 minutes of reading time

Big data. It has been one of the buzzwords of the software industry in the last decade. We have all heard about it, but I am not sure we actually comprehend it as we should and as it deserves. It reminds me of the Universe: mankind knows that it is big, huge, vast, but no one can really grasp its size. The same can be said for the amount of data being collected and processed every day somewhere in the clouds of IT. As Google’s CEO, Eric Schmidt, once said: “There were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days.”

Mankind is clearly capable of storing and persisting this hardly imaginable bulk of data, that’s for sure. What impresses me more is that we are able to process and analyze it in a reasonable time.

For those who don’t know what Map/Reduce is: it is a programming model, or framework if you prefer, for processing large (seriously large) data sets in a distributed manner, using a large number of computers, i.e. nodes.
The algorithm consists of two steps, map and reduce. During the map phase, the master node takes the input, splits it into smaller sub-problems and distributes those to the computers that actually perform the processing, the worker nodes. Once the data has been processed, the results are sent back to the master node. That is when the reduce step begins: the master node aggregates all the responses and combines them into the answer to the original problem. This is the simplified picture; in Hadoop, the reduce tasks themselves also run on worker nodes.
Apache Hadoop is a very popular free implementation of this framework, and a very, very powerful one. Several tools are built on top of it and thus provide several ways to approach the problem of processing big data. One of those is Apache Pig, a platform for analyzing large data sets. It consists of a high-level programming language (Pig Latin) for expressing data analysis programs, and a compiler that turns those programs into Map/Reduce jobs ready to be executed by Apache Hadoop.
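
To make the two steps more tangible, here is a minimal plain-Java sketch that runs entirely in memory and does not use Hadoop at all. The class name MiniMapReduce, its methods and the sample usernames are made up for this illustration: each "chunk" stands in for a sub-problem handed to a worker node, and the merge of the partial results stands in for the reduce step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the map and reduce steps; no Hadoop involved.
// All names here are hypothetical and exist only for this sketch.
final class MiniMapReduce {

    // Map step: a "worker" turns its chunk of usernames into partial counts.
    static Map<String, Integer> map(List<String> chunk) {
        Map<String, Integer> partial = new HashMap<>();
        for (String user : chunk) {
            partial.merge(user, 1, Integer::sum);
        }
        return partial;
    }

    // Reduce step: the partial results are merged into the final answer.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((user, count) -> total.merge(user, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two chunks stand in for the sub-problems sent to two worker nodes.
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("jane.doe", "john.smith", "jane.doe"),
                Arrays.asList("john.smith", "jane.doe"));

        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> chunk : chunks) {
            partials.add(map(chunk));
        }
        // jane.doe is counted three times in total, john.smith twice.
        System.out.println(reduce(partials));
    }
}
```

A real cluster adds data distribution, scheduling and fault tolerance on top of this idea, which is exactly the part Hadoop takes care of.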

I had some experience with Apache Pig and it was good. Pig Latin is not difficult to learn and the whole platform is a good tool for the job. But I wanted to see how it would compare to “native” Map/Reduce jobs written in Java using the Apache Hadoop APIs.
For that reason I imagined a use case barely familiar to any of you #sarcasm: a social network site, with myself in the role of a member. Some of my friends are members too and we are connected. Being an embarrassingly popular person, I have many, many friends and connections. Naturally, I don’t want to talk to all of them, nor to see what each and every one of them is doing. I just want to see those who are important to me. For that reason, the system will calculate the weight of my relationships and present me only my heaviest friends.

Interactions between two people can take various forms:
– viewing profile details – a sneak-peek feature on mouse hover over a friend’s name, for example
– viewing full profile
– commenting on friend’s status, comment, photo or whatever
– liking friend’s status, comment, photo or whatever
– sending a message to a friend, etc.

Each of those actions would have a certain weight expressed as a number, giving us the result: the friendship weight, calculated as the sum of the weights of all interactions.
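
As a small worked example of that sum, here is a sketch in plain Java. The enum constants and every weight value are assumptions chosen for illustration; the article does not define a full weight table.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical interaction types and weights, for illustration only.
enum InteractionType {
    VIEW_PROFILE_DETAILS(5),
    VIEW_PROFILE(10),
    LIKE(15),
    COMMENT(20),
    MESSAGE(30);

    final int weight;

    InteractionType(int weight) {
        this.weight = weight;
    }
}

final class FriendshipWeight {

    // Friendship weight = sum of the weights of all interactions with one friend.
    static int of(List<InteractionType> interactions) {
        int sum = 0;
        for (InteractionType interaction : interactions) {
            sum += interaction.weight;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<InteractionType> withJane = Arrays.asList(
                InteractionType.VIEW_PROFILE,
                InteractionType.LIKE,
                InteractionType.MESSAGE);
        System.out.println(FriendshipWeight.of(withJane)); // prints 55 (10 + 15 + 30)
    }
}
```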

For my own purposes, I decided that the raw input data would be a CSV file containing only basic information: the timestamp of the interaction between two users, the username of the source user (he or she caused the interaction), the username of the target user, the interaction type and the interaction weight. Thus, a single interaction record looks like this:


1341147920675,jason.bourne,jane.doe,VIEW_PROFILE,10
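
A record in this format could be parsed roughly as follows. This is only a sketch: the class InteractionRecord and its field names are assumptions made for this example and are not part of the job shown later.

```java
// Plain value holder for one line of the CSV input.
// The class and field names are assumptions made for this sketch.
final class InteractionRecord {
    final long timestamp;
    final String sourceUser;
    final String targetUser;
    final String type;
    final int weight;

    private InteractionRecord(long timestamp, String sourceUser, String targetUser,
                              String type, int weight) {
        this.timestamp = timestamp;
        this.sourceUser = sourceUser;
        this.targetUser = targetUser;
        this.type = type;
        this.weight = weight;
    }

    // Expects exactly five comma-separated fields:
    // timestamp,sourceUser,targetUser,interactionType,weight
    static InteractionRecord parse(String line) {
        String[] tokens = line.split(",");
        if (tokens.length != 5) {
            throw new IllegalArgumentException(
                    "Expected 5 fields but got " + tokens.length + ": " + line);
        }
        return new InteractionRecord(
                Long.parseLong(tokens[0].trim()),
                tokens[1].trim(),
                tokens[2].trim(),
                tokens[3].trim(),
                Integer.parseInt(tokens[4].trim()));
    }

    public static void main(String[] args) {
        InteractionRecord record =
                parse("1341147920675,jason.bourne,jane.doe,VIEW_PROFILE,10");
        // prints "jason.bourne -> jane.doe: 10"
        System.out.println(record.sourceUser + " -> " + record.targetUser + ": " + record.weight);
    }
}
```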

With my input data placed in the proper location in the Hadoop file system, the next step is to run the job that returns, for each user in the input file, a list of users sorted by friendship weight in descending order.

One illustration of a simple Map/Reduce job that solves this problem is implemented in Java. A small map function could look like this:

@Override
protected void map(LongWritable offset, Text text, Context context) throws IOException, InterruptedException {
   // Record layout: timestamp,sourceUser,targetUser,interactionType,weight
   String[] tokens = text.toString().split(",");
   String sourceUser = tokens[1];
   String targetUser = tokens[2];
   int points = Integer.parseInt(tokens[4]);
   // Key the output by source user so that all of that user's interactions reach the same reduce call.
   context.write(new Text(sourceUser), new InteractionWritable(targetUser, points));
}

It tokenizes each input record and extracts the users involved in the interaction and the interaction weight. Those pieces of information become the output of the map function and the input for the reduce function, which could look something like this:

@Override
protected void reduce(Text token, Iterable<InteractionWritable> counts, Context context) throws IOException, InterruptedException {
   try {
      // Sum the interaction weights per target user for this source user (the key).
      Map<Text, IntWritable> interactionGroup = new HashMap<Text, IntWritable>();
      for (InteractionWritable interaction : counts) {
         // Copy the value: Hadoop may reuse the same Writable instance while iterating.
         Text targetUser = new Text(interaction.getTargetUser().toString());
         int weight = interaction.getPoints().get();

         IntWritable weightWritable = interactionGroup.get(targetUser);
         if (weightWritable != null) {
            weight += weightWritable.get();
         }
         interactionGroup.put(targetUser, new IntWritable(weight));
      }

      // InteractionCollector (not shown here) takes care of ordering the entries.
      InteractionCollector interactionCollector = new InteractionCollector();
      for (Entry<Text, IntWritable> entry : interactionGroup.entrySet()) {
         interactionCollector.addEntry(entry);
      }
      List<Entry<Text, IntWritable>> orderedInteractions = interactionCollector.getInteractions();
      for (Entry<Text, IntWritable> entry : orderedInteractions) {
         context.write(token, new Text(entry.getKey().toString() + " " + entry.getValue().get()));
      }
   } catch (Exception e) {
      // Of course, do something more sensible.
      e.printStackTrace();
   }
}

What it does is sum up the interaction weights (for each source and target user pair), take care of the ordering and write out the result. Not too complicated.
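
The ordering step is hidden inside InteractionCollector, whose code is not shown in this article. As a hedged illustration only, one way to order the summed weights in plain Java might look like the following; the class WeightOrdering is a hypothetical stand-in that works on plain strings and integers instead of Hadoop types, and the tie-break by username is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the ordering step; this is not the article's InteractionCollector.
final class WeightOrdering {

    // Returns the entries sorted by weight, heaviest first.
    // Ties are broken by username so the result is deterministic (an assumption of this sketch).
    static List<Map.Entry<String, Integer>> heaviestFirst(Map<String, Integer> weightByUser) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(weightByUser.entrySet());
        entries.sort((a, b) -> {
            int byWeight = Integer.compare(b.getValue(), a.getValue()); // descending
            return byWeight != 0 ? byWeight : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new HashMap<>();
        weights.put("jane.doe", 40);
        weights.put("john.smith", 15);
        weights.put("mary.major", 75);

        // Prints, one per line: "mary.major 75", "jane.doe 40", "john.smith 15"
        for (Map.Entry<String, Integer> entry : heaviestFirst(weights)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```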
On the other hand, a Pig script doing the same job is even simpler:

interactionRecords = LOAD '/blog/user_interaction_big.txt' USING PigStorage(',') AS (
   timestamp: long,
   sourceUser: chararray,
   targetUser: chararray,
   eventType: chararray,
   eventWeight: int
);

interactionData = FOREACH interactionRecords GENERATE
   sourceUser,
   targetUser,
   eventWeight;

groupedByInteraction = GROUP interactionData BY (sourceUser, targetUser);
summarizedInteraction = FOREACH groupedByInteraction GENERATE
   group.sourceUser AS sourceUser,
   group.targetUser AS targetUser,
   SUM(interactionData.eventWeight) AS eventWeight;

result = ORDER summarizedInteraction BY sourceUser, eventWeight DESC;

DUMP result;

It performs the same steps as the Java implementation: it loads the input data, extracts only the needed parts, groups them, sums the interaction weights and prints out the result.
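
To see what those steps compute on a handful of records, here is an in-memory sketch in plain Java. It uses neither Pig nor Hadoop; the class PigPipelineSketch is hypothetical, and apart from the first line every sample record (timestamps, users, types and weights) is made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory walk through the same steps as the script; names are hypothetical.
final class PigPipelineSketch {

    static List<String> run(List<String> csvLines) {
        // Load, project, group and sum: total weight per (sourceUser, targetUser) pair.
        // The outer TreeMap keeps the source users in ascending order.
        Map<String, Map<String, Integer>> totals = new TreeMap<>();
        for (String line : csvLines) {
            String[] tokens = line.split(",");
            totals.computeIfAbsent(tokens[1], user -> new HashMap<>())
                  .merge(tokens[2], Integer.parseInt(tokens[4]), Integer::sum);
        }

        // Order by source user, then by summed weight in descending order.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> source : totals.entrySet()) {
            List<Map.Entry<String, Integer>> targets =
                    new ArrayList<>(source.getValue().entrySet());
            targets.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
            for (Map.Entry<String, Integer> target : targets) {
                result.add("(" + source.getKey() + "," + target.getKey()
                        + "," + target.getValue() + ")");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1341147920675,jason.bourne,jane.doe,VIEW_PROFILE,10",
                "1341147920700,jason.bourne,jane.doe,LIKE,20",
                "1341147920800,jason.bourne,john.smith,MESSAGE,50",
                "1341147920900,jane.doe,jason.bourne,COMMENT,15");
        // Prints, one per line:
        // (jane.doe,jason.bourne,15)
        // (jason.bourne,john.smith,50)
        // (jason.bourne,jane.doe,30)
        for (String row : run(lines)) {
            System.out.println(row);
        }
    }
}
```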

There are some obvious pros and cons to each approach. As expected, the Java implementation is more verbose and demands more coding than the Pig script. On the other hand, the example given in this article is very, very simple and cannot be used as a proper measurement. If the use case were much more complicated, we could easily get into a situation where we would really need to think about how to design and organize our code. The Pig platform allows calling scripts from other scripts and passing parameters from one script to another, and it has other useful features that could help in that endeavor, but I don’t think it handles complicated use cases particularly well. After all, Pig Latin is a scripting language and, at the moment, there is no IDE or text editor that can help with maintaining and refactoring Pig code as well as might be needed. There are some Eclipse plugins, for instance, but they are far from the refactoring features Eclipse offers for Java code.
Another very interesting thing to point out is performance. Again, I have to say that the results I am presenting here are strictly informational and not to be taken very seriously. I ran the tests on a single-data-node Hadoop cluster installed in a virtual machine, which is not really a production environment. For one thousand records, the Pig script needed more than a minute and a half to do the job, while the Java Map/Reduce class did its part in about ten seconds. When run against a much bigger data set, five million records, the script finished in roughly two minutes, compared to the native Map/Reduce time of around forty seconds. The difference between the small and the large run was almost the same in both approaches: around thirty seconds. Obviously, there is a lot of overhead in loading the Pig platform and preparing it to preprocess and run the script.
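
A quick back-of-the-envelope calculation makes that overhead visible. The sketch below plugs in rounded versions of the timings above (90 and 120 seconds for Pig, 10 and 40 seconds for Java); these are approximations of already approximate measurements, and the class OverheadEstimate is made up for this illustration.

```java
// Rough estimate of the per-record cost from two runs of different size.
// The input figures are rounded approximations of the timings quoted in the text.
final class OverheadEstimate {

    // Marginal cost per additional record, in microseconds, between a small and a large run.
    static double microsPerRecord(double smallRunSeconds, long smallRecords,
                                  double largeRunSeconds, long largeRecords) {
        return (largeRunSeconds - smallRunSeconds) * 1_000_000.0
                / (largeRecords - smallRecords);
    }

    public static void main(String[] args) {
        double pig = microsPerRecord(90, 1_000, 120, 5_000_000);
        double java = microsPerRecord(10, 1_000, 40, 5_000_000);
        System.out.println("Pig, microseconds per extra record:  " + pig);
        System.out.println("Java, microseconds per extra record: " + java);
    }
}
```

With these rounded inputs both approaches come out at roughly six microseconds per additional record, so in this test the gap of about eighty seconds between them looks like fixed startup cost rather than slower per-record processing.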

The intention of this simple example was to compare these two solutions, mainly out of the author’s plain curiosity. Besides that, this use case shows how much “our” data and our behavior can reveal about us. I know I wouldn’t be able to say who my best friend is or with whom I interact the most.

Blog author

Dusan Zamurovic
