MapReduce testing with MRUnit

1.6.2014 | 5 minutes of reading time

In one of the previous posts on our blog, my colleague gave a nice example of how to test a map/reduce job. His starting point was an implementation written in Apache Pig. In this post I would like to extend his example with a little twist: the map/reduce job I am going to test is the same one he used, but implemented in Java.
A multi-threaded environment can be a hostile place to dwell in, and debugging and testing it is not easy. With map/reduce, things get even more complex. These jobs run in a distributed fashion, across many JVMs in a cluster of machines. That is why it is important to use all the power of unit testing and run them in as much isolation as possible.
My colleague used PigUnit to test his Pig script. I am going to use MRUnit, a Java library written to help with unit testing map/reduce jobs.

The logic of the example is the same as in the mentioned post. There are two input paths. One contains user information: user id, first name, last name, country, city and company. The other holds each user's awesomeness rating as a pair: user id, rating value.

# user information
1,Ozren,Gulan,Serbia,Novi Sad,codecentric
4,Linda,Jefferson,USA,New York,ae.com
# rating information

*Disclaimer: Test data is highly reliable and taken from real life, so if it turns out that Ozren has the highest rating, he tweaked it :).

Our MR job reads the inputs line by line and joins the information about users with their awesomeness ratings. It filters out all users with a rating lower than 150, leaving only awesome people in the results.
I decided not to show the full Java code in the post because it is not important. It is enough to know what goes in and what we expect as a result of the job. Those interested in the implementation details can find them here. These are just the signatures of the mapper and reducer classes; they determine the types of the input and output data:

public class AwesomenessRatingMapper
    extends Mapper<LongWritable, Text, LongWritable, AwesomenessRatingWritable> {
    // ...
}

public class AwesomenessRatingReducer
    extends Reducer<LongWritable, AwesomenessRatingWritable, LongWritable, Text> {
    // ...
}
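To make the expected behaviour concrete, here is a minimal plain-Java sketch of the join-and-filter logic described above. It is not the actual mapper or reducer: the class name AwesomenessJoinSketch, the output format and the rating values are assumptions made up for illustration, and no Hadoop types are involved. Only the field layout and the threshold of 150 come from the description of the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the job's logic; not the real implementation.
class AwesomenessJoinSketch {

    // Threshold taken from the job description: ratings below 150 are dropped.
    static final int MIN_RATING = 150;

    // Joins "id,first,last,country,city,company" lines with "id,rating" lines
    // and keeps only the users whose rating is at least MIN_RATING.
    static List<String> join(List<String> userLines, List<String> ratingLines) {
        Map<Long, Integer> ratings = new HashMap<>();
        for (String line : ratingLines) {
            String[] fields = line.split(",");
            ratings.put(Long.parseLong(fields[0].trim()), Integer.parseInt(fields[1].trim()));
        }

        List<String> result = new ArrayList<>();
        for (String line : userLines) {
            String[] fields = line.split(",");
            long id = Long.parseLong(fields[0].trim());
            Integer rating = ratings.get(id);
            if (rating != null && rating >= MIN_RATING) {
                // Assumed output format: id, full name, rating.
                result.add(id + "," + fields[1] + " " + fields[2] + "," + rating);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList(
                "1,Ozren,Gulan,Serbia,Novi Sad,codecentric",
                "4,Linda,Jefferson,USA,New York,ae.com");
        // Rating values are invented for this sketch.
        List<String> ratings = Arrays.asList("1,200", "4,100");
        System.out.println(join(users, ratings)); // prints [1,Ozren Gulan,200]
    }
}
```

In the real job this work is split in two: the mapper emits both record types keyed by user id, and the reducer does the join and the filtering for each key.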

There are three main MRUnit classes that drive our tests: MapDriver, ReduceDriver and MapReduceDriver. They are generic classes whose type parameters depend on the input and output types of the mapper, the reducer and the whole map/reduce job, respectively. This is how we instantiate them:

AwesomenessRatingMapper mapper = new AwesomenessRatingMapper();
MapDriver<LongWritable, Text, LongWritable, AwesomenessRatingWritable> mapDriver = MapDriver.newMapDriver(mapper);

AwesomenessRatingReducer reducer = new AwesomenessRatingReducer();
ReduceDriver<LongWritable, AwesomenessRatingWritable, LongWritable, Text> reduceDriver = ReduceDriver.newReduceDriver(reducer);

MapReduceDriver<LongWritable, Text, LongWritable, AwesomenessRatingWritable, LongWritable, Text> mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer);

MRUnit gives us tools to write tests in different ways. The first approach is the more traditional one: we specify the input, run the job (or a part of it) and check whether the output looks as we expect. In other words, we do the assertions by hand.

@Test
public void testMapperWithManualAssertions() throws Exception {
    // Feed one user record and one rating record to the mapper.
    mapDriver.withInput(new LongWritable(0L), TestDataProvider.USER_INFO);
    mapDriver.withInput(new LongWritable(1L), TestDataProvider.RATING_INFO);

    Pair<LongWritable, AwesomenessRatingWritable> userInfoTuple = new Pair<LongWritable, AwesomenessRatingWritable>(
                    TestDataProvider.USER_ID, TestDataProvider.USER_INFO_VALUE);
    Pair<LongWritable, AwesomenessRatingWritable> ratingInfoTuple = new Pair<LongWritable, AwesomenessRatingWritable>(
                    TestDataProvider.USER_ID, TestDataProvider.RATING_INFO_VALUE);

    // run() only executes the mapper; the checks are ours to write.
    List<Pair<LongWritable, AwesomenessRatingWritable>> result = mapDriver.run();

    Assertions.assertThat(result).isNotNull().hasSize(2).contains(userInfoTuple, ratingInfoTuple);
}

// ...

@Test
public void testReducerWithManualAssertions() throws Exception {
    ImmutableList<AwesomenessRatingWritable> values = ImmutableList.of(TestDataProvider.USER_INFO_VALUE,
                    TestDataProvider.RATING_INFO_VALUE);
    ImmutableList<AwesomenessRatingWritable> valuesFilteredOut = ImmutableList.of(
                    TestDataProvider.USER_INFO_VALUE_FILTERED_OUT, TestDataProvider.RATING_INFO_VALUE_FILTERED_OUT);

    reduceDriver.withInput(TestDataProvider.USER_ID, values);
    reduceDriver.withInput(TestDataProvider.USER_ID_FILTERED_OUT, valuesFilteredOut);

    Pair<LongWritable, Text> expectedTupple = new Pair<LongWritable, Text>(TestDataProvider.USER_ID,
                    TestDataProvider.RESULT_TUPPLE_TEXT);

    List<Pair<LongWritable, Text>> result = reduceDriver.run();

    // Only the user with a high enough rating should survive the reduce step.
    Assertions.assertThat(result).isNotNull().hasSize(1).containsExactly(expectedTupple);
}

// ...

@Test
public void testMapReduceWithManualAssertions() throws Exception {
    mapReduceDriver.withInput(new LongWritable(0L), TestDataProvider.USER_INFO);
    mapReduceDriver.withInput(new LongWritable(1L), TestDataProvider.RATING_INFO);
    mapReduceDriver.withInput(new LongWritable(3L), TestDataProvider.USER_INFO_FILTERED_OUT);
    mapReduceDriver.withInput(new LongWritable(4L), TestDataProvider.RATING_INFO_FILTERED_OUT);

    Pair<LongWritable, Text> expectedTupple = new Pair<LongWritable, Text>(TestDataProvider.USER_ID,
                    TestDataProvider.RESULT_TUPPLE_TEXT);

    List<Pair<LongWritable, Text>> result = mapReduceDriver.run();

    Assertions.assertThat(result).isNotNull().hasSize(1).containsExactly(expectedTupple);
}

The other approach is to specify both the input and the expected output. In this case, we do not have to write the assertions ourselves; we can let the framework do it.

@Test
public void testMapperWithAutoAssertions() throws Exception {
    mapDriver.withInput(new LongWritable(0L), TestDataProvider.USER_INFO);
    mapDriver.withInput(new LongWritable(1L), TestDataProvider.RATING_INFO);

    // Register the expected output instead of asserting by hand.
    mapDriver.withOutput(TestDataProvider.USER_ID, TestDataProvider.USER_INFO_VALUE);
    mapDriver.withOutput(TestDataProvider.USER_ID, TestDataProvider.RATING_INFO_VALUE);

    // runTest() executes the mapper and validates its output.
    mapDriver.runTest();
}

// ...

@Test
public void testReducerWithAutoAssertions() throws Exception {
    ImmutableList<AwesomenessRatingWritable> values = ImmutableList.of(TestDataProvider.USER_INFO_VALUE,
                    TestDataProvider.RATING_INFO_VALUE);
    ImmutableList<AwesomenessRatingWritable> valuesFilteredOut = ImmutableList.of(
                    TestDataProvider.USER_INFO_VALUE_FILTERED_OUT, TestDataProvider.RATING_INFO_VALUE_FILTERED_OUT);

    reduceDriver.withInput(TestDataProvider.USER_ID, values);
    reduceDriver.withInput(TestDataProvider.USER_ID_FILTERED_OUT, valuesFilteredOut);

    reduceDriver.withOutput(new Pair<LongWritable, Text>(TestDataProvider.USER_ID,
                    TestDataProvider.RESULT_TUPPLE_TEXT));

    reduceDriver.runTest();
}

// ...

@Test
public void testMapReduceWithAutoAssertions() throws Exception {
    mapReduceDriver.withInput(new LongWritable(0L), TestDataProvider.USER_INFO);
    mapReduceDriver.withInput(new LongWritable(1L), TestDataProvider.RATING_INFO);
    mapReduceDriver.withInput(new LongWritable(3L), TestDataProvider.USER_INFO_FILTERED_OUT);
    mapReduceDriver.withInput(new LongWritable(4L), TestDataProvider.RATING_INFO_FILTERED_OUT);

    Pair<LongWritable, Text> expectedTupple = new Pair<LongWritable, Text>(TestDataProvider.USER_ID,
                    TestDataProvider.RESULT_TUPPLE_TEXT);
    mapReduceDriver.withOutput(expectedTupple);

    mapReduceDriver.runTest();
}

The main difference is whether we call the driver's run() or runTest() method. The first one just runs the job and returns the results without validating them. The second one also adds validation of the results against the expected output to the execution flow.
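The split between the two methods can be illustrated with a tiny plain-Java sketch. MiniDriver below is a hypothetical class written only for this illustration. It is not MRUnit code and says nothing about how MRUnit is implemented internally; it merely mimics the idea that run() hands the results back to the caller, while runTest() compares them with the registered expectations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical miniature driver; it only mimics the run()/runTest() split.
class MiniDriver<I, O> {
    private final Function<I, O> unitUnderTest;
    private final List<I> inputs = new ArrayList<>();
    private final List<O> expected = new ArrayList<>();

    MiniDriver(Function<I, O> unitUnderTest) {
        this.unitUnderTest = unitUnderTest;
    }

    MiniDriver<I, O> withInput(I input) {
        inputs.add(input);
        return this;
    }

    MiniDriver<I, O> withOutput(O output) {
        expected.add(output);
        return this;
    }

    // Executes the unit and returns the raw results; any checks are up to the caller.
    List<O> run() {
        List<O> results = new ArrayList<>();
        for (I input : inputs) {
            results.add(unitUnderTest.apply(input));
        }
        return results;
    }

    // Executes the unit and fails if the results differ from the expected output.
    void runTest() {
        List<O> actual = run();
        if (!actual.equals(expected)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        MiniDriver<String, Integer> driver =
                new MiniDriver<String, Integer>(String::length).withInput("abc").withInput("de");
        System.out.println(driver.run());          // manual style: inspect the list yourself
        driver.withOutput(3).withOutput(2).runTest(); // automatic style: the driver checks
    }
}
```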

There are some nice things in MRUnit I wanted to point out (some of them are shown in this post in more detail). For example...
The method List<Pair<K2, V2>> MapDriver#run() returns a list of pairs, which is useful for testing situations in which the mapper produces key/value pairs for a given input. This is what we used in the approach where we checked the results of the mapper run by hand.

Then, both MapDriver and ReduceDriver have a getContext() method. It returns the Context for further mocking; the online documentation has some short but clear examples of how to do it.

And why not mention counters? Counters are the easiest way to measure and track the number of operations that happen in map/reduce programs. There are some built-in counters like “Spilled Records”, “Map output records”, “Reduce input records” or “Reduce shuffle bytes”. MRUnit supports inspecting them through the getCounters() method of each of the drivers.
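For readers new to the idea, the following plain-Java sketch shows what a counter boils down to: a named tally that the job increments while it processes records and that a test can inspect afterwards. The class CounterSketch and the counter names are hypothetical and chosen for this example only; they are neither Hadoop's built-in counters nor part of the MRUnit API. The classification by field count is likewise an assumption based on the two input formats described earlier.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical illustration of the counter concept; no Hadoop classes are used.
class CounterSketch {

    // Made-up counter names for this sketch.
    enum RecordCounter { USER_RECORDS, RATING_RECORDS, MALFORMED_RECORDS }

    // Classifies each line by its number of comma-separated fields
    // and increments the matching counter.
    static Map<RecordCounter, Long> count(Iterable<String> lines) {
        Map<RecordCounter, Long> counters = new EnumMap<>(RecordCounter.class);
        for (RecordCounter counter : RecordCounter.values()) {
            counters.put(counter, 0L);
        }
        for (String line : lines) {
            int fields = line.split(",").length;
            RecordCounter counter;
            if (fields == 6) {
                counter = RecordCounter.USER_RECORDS;      // id, first, last, country, city, company
            } else if (fields == 2) {
                counter = RecordCounter.RATING_RECORDS;    // id, rating
            } else {
                counter = RecordCounter.MALFORMED_RECORDS; // anything else
            }
            counters.merge(counter, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        // The rating value and the broken line are invented for this sketch.
        Map<RecordCounter, Long> counters = count(Arrays.asList(
                "1,Ozren,Gulan,Serbia,Novi Sad,codecentric", "1,200", "not a record"));
        System.out.println(counters);
    }
}
```

A test can then assert on the tallies in the same way it asserts on the output records.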

The TestDriver class provides a facility for setting a mock configuration: TestDriver#getConfiguration() lets you change only those parts of the configuration that you need to change.

Finally, MapReduceDriver is useful for testing the MR job as a whole, checking whether the map and reduce parts work together.

MRUnit is still a young project, just a couple of years old, but it is already interesting and helpful. And if I compare this approach to testing M/R jobs with the one presented by a colleague of mine, I prefer MRUnit to PigUnit. MRUnit is not better; it is made for testing “native” Java M/R jobs, and I like that implementation approach more. Pig script vs. Java M/R is a completely different topic.
