MongoDB: The Fifth Element

2 Comments

What has been started more than five weeks back as my “small” personal MongoDB experiment is now entering its fifth round. Still I am following the corresponding MongoDB course at 10gen and this week it is all about aggregation functions. You know this is when in relational databases “group by”-clauses show up and things get a little bit more complicated than the usual Selects and Updates. As it is already some kind of tradition now for this series another film title has been chosen to be the godfather for this (fifth) posting. Having worked for several years in a project implementing a Network Data Warehouse on top of Oracle I had my share in (relational) aggregation functions. Thus I am really curious how this will be done in MongoDB. Let’s start.

“$sort does pretty much what you’d expect. It sorts the documents. But there are few things to keep in mind when you use $sort and the first is that $sort can be a real memory hug.” -Quotes from the course

Aggregation Functions – Warmup

In this paragraph we are taking a first look at the syntax and general concepts used for the aggregation functions in MongoDB. For this we will execute what would be a “count(*)” SQL-query in a relational database. As a sample collection we are re-using one of the already used collections from an earlier example:

> db.ships.find().pretty()
{
	"_id" : ObjectId("5092b4e9ddb9073fa519a955"),
	"name" : "USS Defiant",
	"operator" : "Starfleet",
	"type" : "escort",
	"class" : "Defiant"
}
{
	"_id" : ObjectId("5092b7e7ddb9073fa519a956"),
	"name" : "USS Enterprise-D",
	"operator" : "Starfleet",
	"type" : "Explorer",
	"class" : "Galaxy"
}
{
	"_id" : ObjectId("50be2da30238f791c861f1e2"),
	"name" : "IKS Buruk",
	"operator" : "Klingon Empire",
	"type" : "Warship"
}

Now let’s assume we would like to know how many star ships are available in that collection for each of the operators (Starfleet and Klingon Empire). In SQL this would probably look something like:

SELECT operator, count(*) FROM ships GROUP BY operator;

Ok, one could get the impression that this looks slightly more complicated in MongoDB as we are doing this – as usual – as a JSON document:

> db.ships.aggregate([{$group : {_id : "$operator", num_ships : {$sum : 1}}}])
{
	"result" : [
		{
			"_id" : "Klingon Empire",
			"num_ships" : 1
		},
		{
			"_id" : "Starfleet",
			"num_ships" : 2
		}
	],
	"ok" : 1
}

First of all we are using the “aggregate”-function on our “ships”-collection. We are doing a group-by using the “$group”-function. Then we are defining the unique id for the resulting JSON document telling which field from our collection to use for this. The result of the actual count should then be shown in a field named “num_ships” (this could have any name). The calculation is done by summing up with the value of “1” for each document found. If the key does already exist it does an “upsert” for the number and for a key found for the first time it starts summing things up by one. The nice thing about the result is that it is a JSON document and thus all the known concepts we have seen so far do apply here.

“Sony makes laptops. They make other things too, but not in my data set.” -Quotes from the course

But why is everything related to the aggregation function wrapped in “[…]” indicating an array? The answer is that MongoDB is using a …

Pipeline Concept

Ok, now it is getting interesting. Most of you probably know the concepts of pipes from the Unix command shell. It is basically the possibility to execute an operation on some input and use the output as the input for the next command and so on. Often files are processed this way under Unix. A common example looks something like this:

find . -print | xargs grep something

Now MongoDB supports the same kind of approach in its aggregation framework. There is a set of possible stages and each of those is taken a set of documents as an input and is producing a resulting set of documents (or the final resulting JSON document at the end of the pipeline). This can then in turn again be used for the next stage an so on. There are stages that always create as many “output documents” as there have been “input documents”. But there are also stages that will reduce the amount of documents while processing and one that might even increase the amount of documents. There is no given order and stages can appear more often than once or not at all. The stages are:

  • $project – Change the set of documents by modifying keys and values. This is a 1:1 mapping, thus for each document that is processed there will be a resulting document.
  • $match – This is a filtering operation and thus this can reduce the amount of documents that are given as input to the next stage. This can be used for example if aggregation should only happen on a subset of the data. In our data set we could for example filter for operator=”starfleet” and perform some grouping operation afterwards.
  • $group – This does the actual aggregation and the possible operations are listed down below. As we are grouping by one or more keys this can have a reducing effect on the amount of documents.
  • $sort – Obviously this is sorting the documents one way or the other for the next stage. It should be noted that this might use a lot of memory. Thus if possible one should always try to reduce the amount of documents first.
  • $skip – With this it is possible to skip forward in the list of documents for a given amount of documents. This allows for example starting only from the 10th document. Typically this will be used together with “$sort” and especially together with “$limit”.
  • $limit – This limits the amount of documents to look at by the given number starting from the current position. Typically this is only used together with “$sort” and especially together with “$skip”. There is an abstract example shown below.
  • $unwind – This is used to unwind document that are using arrays. when using an array the data is kind of pre-joinded and this operation will be undone with this to have individual documents again. Thus with this stage we will increase the amount of documents for the next stage. It is required as grouping on something inside an array would otherwise not be possible.

I have to say that this concept sounds pretty appealing. Of course the first time I heard the phrase “similar to Unix pipes” I was already sure I will love this :-).

“The $project stage let’s you reshape the documents as they come to the pipeline.” -Quotes from the course

Due to the pure amount of information in this topic it would be probably too much having an example for each stage, but just to give an impression how this works the following shows the “$project” stage combined with a “$group” stage.

> db.ships.aggregate([{$project : {_id : 0, operator : {$toLower : "$operator"},
crew : {"$multiply" : ["$crew",10]}}}])
{
        "result" : [
                {
                        "operator" : "starfleet",
                        "crew" : 7500
                },
                {
                        "operator" : "starfleet",
                        "crew" : 500
                },
                {
                        "operator" : " klingon empire",
                        "crew" : 400
                },
        ],
        "ok" : 1
}

The above example shows the “$project” stage as such and the resulting set of documents. Now let’s add some grouping:

}
> db.ships.aggregate([{$project : {_id : 0, operator : {$toLower : "$operator"},crew : {"$multiply" : ["$crew",10]}}}, 
                      {$group : {_id : "$operator", num_ships : {$sum : "$crew"}}}])
{
        "result" : [
                {
                        "_id" : " klingon empire",
                        "num_ships" : 400
                },
                {
                        "_id" : "starfleet",
                        "num_ships" : 8000
                }
        ],
        "ok" : 1
}

Now we have first modified the data in the “$project” stage and then used the resulting set of documents as an input for the “$group” stage. This is the blueprint of usage for all the stages and combining them (multiple times if wanted).

“$match perfoms a filter on the dcouments as they pass through the pipe and it has a reducing effect.” -Quotes from the course

And as promised a short explanation how “$skip” and “$limit” work together:

o o o o o [o o o o o o o] o o o o o o o o
$skip 5 and then $limit 7         

Assuming every “o” is a document we can see that with “$skip” and “$limit” it is possible to just grep a subset of documents. Again, for this to make sense the set of documents must be for sure ordered in some way or otherwise pre-processed.

Aggregation Expressions

Of course there is a whole bunch of operations that can be used in the “$group”-stage during aggregation. The following table gives a quick overview on those:

Expression Description Example
$sum Sums up the defined value from all documents in the collection db.ships.aggregate([{$group : {_id : “$operator”, num_ships : {$sum : “$crew”}}}])
$avg Calculates the average of all given values from all documents in the collection. db.ships.aggregate([{$group : {_id : “$operator”, num_ships : {$avg : “$crew”}}}])
$min Gets the minimum of the corresponding values from all documents in the collection. db.ships.aggregate([{$group : {_id : “$operator”, num_ships : {$min : “$crew”}}}])
$max Gets the maximum of the corresponding values from all documents in the collection. db.ships.aggregate([{$group : {_id : “$operator”, num_ships : {$max : “$crew”}}}])
$push Pushes the value to an array in the resulting document. db.ships.aggregate([{$group : {_id : “$operator”, classes : {$push: “$class”}}}])
$addToSet Pushes the value to an array in the resulting document but does not create duplicates. db.ships.aggregate([{$group : {_id : “$operator”, classes : {$addToSet : “$class”}}}])
$first Gets the first document from the source documents according to the grouping. Typically this makes only sense together with some previously applied “$sort”-stage. db.ships.aggregate([{$group : {_id : “$operator”, first_class : {$first : “$class”}}}])
$last Gets the last document from the source documents according to the grouping. Typically this makes only sense together with some previously applied “$sort”-stage. db.ships.aggregate([{$group : {_id : “$operator”, last_class : {$last : “$class”}}}])

For the examples the collection used so far has been extended by a value for the amount of crew members on the ships:

> db.ships.find().pretty()
{
	"_id" : ObjectId("5092b4e9ddb9073fa519a955"),
	"class" : "Defiant",
	"crew" : 50,
	"name" : "USS Defiant",
	"operator" : "Starfleet",
	"type" : "escort"
}
{
	"_id" : ObjectId("5092b7e7ddb9073fa519a956"),
	"class" : "Galaxy",
	"crew" : 750,
	"name" : "USS Enterprise-D",
	"operator" : "Starfleet",
	"type" : "Explorer"
}
{
	"_id" : ObjectId("50be2da30238f791c861f1e2"),
	"crew" : 40,
	"name" : "IKS Buruk",
	"operator" : "Klingon Empire",
	"type" : "Warship"
}

I think it will get a bit too lengthy to show in addition all the results here for all the queries. I can tell that those are the expected ones and the more complicated ones I will still show in detail. Anyway there are a few operations that are not so common when coming from a relational word. Those are “$push” and “$addToSet”. Let’s take a closer look at those right away.

> db.ships.aggregate([{$group : {_id : "$operator", classes : {$addToSet : "$class"}}}])
{
        "result" : [
                {
                        "_id" : " Klingon Empire",
                        "classes" : [
                                "Warship"
                        ]
                },
                {
                        "_id" : "Starfleet",
                        "classes" : [
                                "Defiant",
                                "Galaxy"
                        ]
                }
        ],
        "ok" : 1
}

In the above example we have used “$addToSet” to create a new JSON document that is using the name of the operator as a key and then lists in an array – named classes – all the available ship classes for that operator. Well, “$push” would do the same thing, but it would create duplicates. But not with the data set from the above example, as there is a 1:1 mapping from operator to class. We would need a 1:n relation from operator to class to have a different effect when using “$push”.

Now one more example on “$first” and “$last”. Those are taking the first or last 😉 element from each “grouping block”. It does not really makes sense to look at this isolated, but we will do it anyway. As we do no sorting the original order of the documents is used and thus for the Klingon Empire the result is always the same as there is only one entry, but for the Federation we will pick different elements with $first and $last:

> db.ships.aggregate([{$group : {_id : "$operator", first_class : {$first : "$class"}}}])
{
        "result" : [
                {
                        "_id" : " Klingon Empire",
                        "first_class" : "Warship"
                },
                {
                        "_id" : "Starfleet",
                        "first_class" : "Galaxy"
                }
        ],
        "ok" : 1
}
>
> db.ships.aggregate([{$group : {_id : "$operator", last_class : {$last : "$class"}}}])
{
        "result" : [
                {
                        "_id" : " Klingon Empire",
                        "last_class" : "Warship"
                },
                {
                        "_id" : "Starfleet",
                        "last_class" : "Defiant"
                }
        ],
        "ok" : 1
}

Wrapup

Puh, this was a lot of stuff this time and probably the posting could have been double the length and still not cover everything properly with examples. Nevertheless I hope I got at least the basic concepts clear to some extend. For me I see one thing as the key feature here and this is the pipeline concept used by MongoDB. I find this an approach that is very easy to understand and well known from other technologies (if you would like to count “shell” as a technology ;-)). what I really like is that I can very easily test out the results of individual stages by executing them without the other stages. To some extend this is of course also possible with SQL, but there the concept is not that clearly incorporated.

“The aggregation framework is super cool!” -Quotes from the course (there was coming a but, which I had to cut ;-))

The following table gives an overview on how operations are mapped from SQL to MongoDB.

SQL MongoDB
WHERE $match
GROUP BY $group
HAVING $match
SELECT $project
ORDER BY $sort
LIMIT $limit
SUM $sum
COUNT() $sum
JOIN This cannot be directly matched as the whole concpet of joining does not exist as such in MongoDB. The “$unwind” operator is maybe the closest match here, because in MongoDB the data is usually pre-joined and this will create individual documents again (from an array).

Finally there are a few things to keep in mind and this is first of all that the result of an aggregation function is always one MongoDB document and this is limited to 16 MB. Another thing is related to sharding. This will be the topic of the next lesson and thus the next blog entry. But the one thing to keep in mind is that the load of aggregation cannot be balanced in all cases to the individual mongod instances, but it might put (heavy) load to the mongos router-instance. I guess more on this will come soon as I can see the next lectures already in my online class :-).


The MongoDB class series

Part 1 – MongoDB: First Contact
Part 2 – MongoDB: Second Round
Part 3 – MongoDB: Close Encounters of the Third Kind
Part 4 – MongoDB: I am Number Four
Part 5 – MongoDB: The Fith Element
Part 6 – MongoDB: The Sixth Sense
Part 7 – MongoDB: Tutorial Overview and Ref-Card

Java Supplemental Series

Part 1 – MongoDB: Supplemental – GRIDFS Example in Java
Part 2 – MongoDB: Supplemental – A complete Java Project – Part 1
Part 3 – MongoDB: Supplemental – A complete Java Project – Part 2

Author

Thomas Jaspers

Share on FacebookGoogle+Share on LinkedInTweet about this on TwitterShare on RedditDigg thisShare on StumbleUpon

More content about Big Data

Kommentare

Comment

Your email address will not be published. Required fields are marked *