MongoDB: I am Number Four


Here we go again, and I have to admit that my head is already spinning a little from all the MongoDB input of the last couple of weeks. With the fourth post in this series we have reached halftime in 10gen’s seven-week course on MongoDB. And another week, another film title in the post heading. It must be mentioned, though, that it was a close call with a suggestion made in the previous posting of this series. Spoiler warning ahead: The film giving the title for the next posting is already set in stone :-)

“Indexes turn out to be the single most important factor for performance in MongoDB and any relational databases.” -Quotes from the course

But enough of the prelude, let’s jump into the middle of another week with MongoDB. This week it is all about performance, as the above quote already indicates.

Indexes in General

Indexes are the most important factor for speeding up performance in MongoDB (just as in any relational database). And the good news for everyone with a lot of experience in relational databases: the concepts are not really that different in MongoDB. Instead of a “full table”-scan we are talking here about a “full collection”-scan, but either one means “death to your performance” when executed on a large amount of data.

MongoDB stores indexes as B-trees to speed up search operations. An index as such is an ordered list of the keys that are part of that index. Let’s look at an obvious example: we would like to store information on all planets in the universe in MongoDB. First of all we would have them grouped by galaxies, then by quadrants and finally by individual planets and the information on those (not shown in the example, as only the indexes are of interest here). Thus we would have:

Index: Galaxy, Quadrant, Planet    

Milky Way                                       |       Andromeda         |        ...
_______________________________________________________________________________________
Alpha                 | Beta | Gamma | Delta
_______________________________________________________________________________________
Earth | Kronos ...

The above index would allow us to query very quickly for any planet where the name of the planet, the quadrant and the galaxy are known. As known from relational databases, indexes are very strong if the whole index can be applied. Starting from the top, an index can at least be partially helpful; starting from the bottom, it is totally useless. In our example this means that if we only have the name of the planet, the index cannot be used at all. If we had the name of the galaxy and the name of the planet, at least the galaxy-part of the index could be used. Nothing really new here.
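To make the prefix rule a bit more concrete, here is a tiny JavaScript sketch (an illustration of the rule only, not MongoDB code) that computes which leading part of a compound index a query with the given fields could use:

```javascript
// Illustration only: which leading keys of a compound index a query can use.
// An index helps only via a contiguous prefix of its key list.
function usablePrefix(indexKeys, queryFields) {
  const fields = new Set(queryFields);
  let n = 0;
  while (n < indexKeys.length && fields.has(indexKeys[n])) n++;
  return indexKeys.slice(0, n);
}

const index = ["galaxy", "quadrant", "planet"];
console.log(usablePrefix(index, ["galaxy", "quadrant", "planet"])); // whole index usable
console.log(usablePrefix(index, ["galaxy", "planet"]));             // only [ 'galaxy' ]
console.log(usablePrefix(index, ["planet"]));                       // [] — index useless here
```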

“But the largest driver for performance is gonna be your algorithmic performance. And for a database server that’s gonna be how fast can we execute queries. And that is driven by whether or not the database can use an index to look data up.” – Quotes from the course

Another aspect of indexes is the cost they involve. Using indexes will speed up read-performance, but it will slow down write-performance, as additional index information needs to be written to disc. Of course this also increases the amount of disc space required. Thus one should not create indexes for everything, but only for those queries where they are really needed. And of course it is important to create proper indexes, so that they can be utilised by the database system (this is generally true and not specific to MongoDB).

This leads to the more practical part of creating indexes in MongoDB.

Basic Topics on Indexes

Creating indexes is pretty straightforward and we have already seen it at the end of the previous posting of this series. Let’s assume we have the following collection:

> db.universe.find().pretty()
{
	"_id" : ObjectId("50ad35ca0238f791c861f1d0"),
	"galaxy" : "Milky Way",
	"quadrant" : "Alpha",
	"planet" : "Earth",
	"moreParamsHere" : "1001"
}
{
	"_id" : ObjectId("50ad35da0238f791c861f1d1"),
	"galaxy" : "Milky Way",
	"quadrant" : "Alpha",
	"planet" : "Kronos",
	"moreParamsHere" : "6009"
}

Then we can easily create a new index using the “ensureIndex(…)”-command:

> db.universe.ensureIndex({galaxy : 1})

The ensureIndex-command gets the name of the attribute and a number that indicates the way the index is sorted. The value “1” means ascending and the value “-1” descending. For searching this does not make any difference in performance, but it will make a difference when the results are to be ordered.
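The effect of the direction values on ordering can be sketched in plain JavaScript (again just an illustration, not MongoDB internals; the sample documents are invented): a comparator built from a spec like {galaxy : 1, planet : -1} sorts ascending on the first key and descending on the second:

```javascript
// Illustration: how 1 / -1 in an index spec translate into sort order.
function comparator(spec) {
  const entries = Object.entries(spec); // e.g. [["galaxy", 1], ["planet", -1]]
  return (a, b) => {
    for (const [field, dir] of entries) {
      if (a[field] < b[field]) return -dir;
      if (a[field] > b[field]) return dir;
    }
    return 0;
  };
}

const docs = [
  { galaxy: "Milky Way", planet: "Earth" },
  { galaxy: "Andromeda", planet: "Xantar" }, // invented planet name
  { galaxy: "Milky Way", planet: "Kronos" },
];
docs.sort(comparator({ galaxy: 1, planet: -1 }));
console.log(docs.map((d) => `${d.galaxy}/${d.planet}`));
// [ 'Andromeda/Xantar', 'Milky Way/Kronos', 'Milky Way/Earth' ]
```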

To see all the indexes created in the database, the following command can be used. It can be seen that I have already played around quite a bit in my MongoDB test-database :-). In addition it can be seen that all “_id”-attributes are indexed by default.

> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.names", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.startrek", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.ships", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.instructors", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.candidates", "name" : "_id_" }
{ "v" : 1, "key" : { "candidates" : 1 }, "ns" : "test.instructors", "name" : "candidates_1" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.downloads.files", "name" : "_id_" }
{ "v" : 1, "key" : { "filename" : 1, "uploadDate" : 1 }, "ns" : "test.downloads.files", 
   "name" : "filename_1_uploadDate_1" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.downloads.chunks", "name" : "_id_" }
{ "v" : 1, "key" : { "files_id" : 1, "n" : 1 }, "unique" : true, "ns" : "test.downloads.chunks", 
   "name" : "files_id_1_n_1" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.downloads_meta", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.gtf_projects", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.universe", "name" : "_id_" }
{ "v" : 1, "key" : { "galaxy" : 1 }, "ns" : "test.universe", "name" : "galaxy_1" }

Of course this is not really the best way to dig into this, as the number of indexes might be a bit overwhelming in a bigger system. Therefore it is possible to check the available indexes directly on a specific collection:

> db.universe.getIndexes()
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"ns" : "test.universe",
		"name" : "_id_"
	},
	{
		"v" : 1,
		"key" : {
			"galaxy" : 1
		},
		"ns" : "test.universe",
		"name" : "galaxy_1"
	}
]

Of course we also need a way to drop indexes again. This should be fairly easy to memorise, as the command is just “dropIndex” instead of “ensureIndex”, but takes the same parameters. Listing the indexes again after dropping one, it can be seen that this very index is removed from the list. (And in a real system one would also recognise this from a potential drop in performance.)

> db.universe.dropIndex({galaxy : 1})
{ "nIndexesWas" : 2, "ok" : 1 }
> db.universe.getIndexes()
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"ns" : "test.universe",
		"name" : "_id_"
	}
]

Advanced Topics on Indexes

Up to here we have been dealing only with indexes on individual attributes of a document. In the example above (galaxy, quadrant, planet) we were talking about a compound index. Creating such a compound index is again very easy, as the syntax is very close to what one would expect. (Note that the usual values for the sort direction are 1 and -1; the 0 used for planet below was accepted by the MongoDB version used here and shows up in the index name, but 1 would be the conventional choice.)

> db.universe.ensureIndex({galaxy : 1, quadrant : 1, planet : 0})
> db.universe.getIndexes()
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"ns" : "test.universe",
		"name" : "_id_"
	},
	{
		"v" : 1,
		"key" : {
			"galaxy" : 1,
			"quadrant" : 1,
			"planet" : 0
		},
		"ns" : "test.universe",
		"name" : "galaxy_1_quadrant_1_planet_0"
	}
]

And by listing the indexes on the universe-collection we can see that the new index has been created.

There is one tricky thing about creating compound indexes in MongoDB. It is possible to create an index on an attribute that stores an array of values; all the array values are then indexed. This is called a “Multikey Index” in MongoDB. Pretty cool, but the restriction is that at most one attribute of a compound index may hold an array in any given document. In a relational world this would be quite straightforward, as the table structures are known well in advance. But with MongoDB it is possible to store an array-type value in an attribute at any time. If then by accident array values are stored in more than one attribute that is part of a compound index on that document, a runtime error will occur. This is something that I do not like too much, even though it can of course be handled in the implementation of an application.
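The restriction can be illustrated with a small JavaScript helper (my own sketch of the rule, not what MongoDB does internally; the attribute names are made up): for a given compound index, at most one of the indexed attributes of a document may hold an array:

```javascript
// Illustration: a document breaks the multikey rule if more than one
// attribute used in a compound index holds an array.
function violatesMultikeyRule(indexKeys, doc) {
  const arrayKeys = indexKeys.filter((k) => Array.isArray(doc[k]));
  return arrayKeys.length > 1;
}

const indexKeys = ["tags", "categories"]; // hypothetical compound index
console.log(violatesMultikeyRule(indexKeys, { tags: ["a", "b"], categories: "x" }));
// false — only one array attribute, this is fine
console.log(violatesMultikeyRule(indexKeys, { tags: ["a"], categories: ["x", "y"] }));
// true — two array attributes, MongoDB would reject this document
```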

“Good developers have an understanding of performance. They write with performance in mind and they can find their own performance problems in their programs.” -Quotes from the course

One feature most developers typically use heavily with relational databases is the creation of unique indexes. Of course this is also possible with MongoDB, and again JSON syntax is used for this by adding an additional “unique”-attribute to the document used to create the index.

> db.universe.dropIndex({galaxy : 1, quadrant : 1, planet : 0})
{ "nIndexesWas" : 2, "ok" : 1 }
> db.universe.ensureIndex({galaxy : 1, quadrant : 1, planet : 0},{unique : true})
> db.universe.getIndexes()
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"ns" : "test.universe",
		"name" : "_id_"
	},
	{
		"v" : 1,
		"key" : {
			"galaxy" : 1,
			"quadrant" : 1,
			"planet" : 0
		},
		"unique" : true,
		"ns" : "test.universe",
		"name" : "galaxy_1_quadrant_1_planet_0"
	}
]

So let’s drop the existing index first and then recreate it as unique. This can also be seen afterwards when checking the index, as it has the attribute “unique” set to “true”. There is one minor flaw in this: the “_id”-index is always unique by default, but the “unique”-attribute is not shown for it. This has to be kept in mind.

Index                              Command
Creating an index                  db.universe.ensureIndex({galaxy : 1})
Dropping an index                  db.universe.dropIndex({galaxy : 1})
Creating a compound index          db.universe.ensureIndex({galaxy : 1, quadrant : 1, planet : 0})
Dropping a compound index          db.universe.dropIndex({galaxy : 1, quadrant : 1, planet : 0})
Creating a unique compound index   db.universe.ensureIndex({galaxy : 1, quadrant : 1, planet : 0}, {unique : true})

With unique indexes there is one problem well known from relational databases. If there is already non-unique data in the database, one has to ensure its uniqueness before one is able to create the unique index. That is no different with MongoDB, but MongoDB does offer one very brute-force way to solve the problem: using {unique : true, dropDups : true} to create the index. This will delete all but one of the duplicated documents (according to the new index) in the collection. This cannot be undone and there is no way to determine which of the duplicated documents will be deleted. So this is a possibility, but not really a solution, and therefore I will not show any concrete example of this.

On using {dropDups : true} “This is a pretty big sledgehammer to solve the problem, but it will work.” -Quotes from the course

Now there is another potential problem in MongoDB that is due to its nature of allowing incomplete documents to be stored. Let’s assume we have the following collection:

> db.holodeck_programs.find()
> db.holodeck_programs.insert({name : 'Captain Proton'})
> db.holodeck_programs.insert({name : 'Delta Flyer', type : 'simulation' })
> db.holodeck_programs.insert({name : 'Warp Plasma', type : 'analyse' })
> db.holodeck_programs.insert({name : 'Emergency Medical Hologram'})
> db.holodeck_programs.find()
{ "_id" : ObjectId("50b3c2820238f791c861f1dd"), "name" : "Captain Proton" }
{ "_id" : ObjectId("50b3c28d0238f791c861f1de"), "name" : "Delta Flyer", "type" : "simulation" }
{ "_id" : ObjectId("50b3c29c0238f791c861f1df"), "name" : "Warp Plasma", "type" : "analyse" }
{ "_id" : ObjectId("50b3c2a10238f791c861f1e0"), "name" : "Emergency Medical Hologram" }

Now let’s try to create a unique index on the key “type”:

> db.holodeck_programs.ensureIndex({type : 1}, {unique : true})
E11000 duplicate key error index: test.holodeck_programs.$type_1  dup key: { : null }

This results in an error, which is not too big a surprise. Internally the value for “type” is null for every document where “type” is not explicitly set. Therefore we have duplicates in the database, and it is quite obvious that “dropDups” is not the solution here. But MongoDB would not be MongoDB if there was not another cool feature around the next corner for this kind of problem. (Just as a side note: it took me a long time to decide whether to write cool feature here or rather weird feature ;-).) Anyway, the solution is “Sparse Indexes”. If an index is defined as sparse, it is only applied to those documents where the corresponding key values are explicitly set. Let’s look at an example and the further implications of this:

> db.holodeck_programs.ensureIndex({type : 1}, {unique : true, sparse : true})
> db.holodeck_programs.find({type : 'simulation'})
{ "_id" : ObjectId("50b3c28d0238f791c861f1de"), "name" : "Delta Flyer", "type" : "simulation" }
> db.holodeck_programs.find().sort({name : 1})
{ "_id" : ObjectId("50b3c2820238f791c861f1dd"), "name" : "Captain Proton" }
{ "_id" : ObjectId("50b3c28d0238f791c861f1de"), "name" : "Delta Flyer", "type" : "simulation" }
{ "_id" : ObjectId("50b3c2a10238f791c861f1e0"), "name" : "Emergency Medical Hologram" }
{ "_id" : ObjectId("50b3c29c0238f791c861f1df"), "name" : "Warp Plasma", "type" : "analyse" }
> db.holodeck_programs.find().sort({type : 1})
{ "_id" : ObjectId("50b3c29c0238f791c861f1df"), "name" : "Warp Plasma", "type" : "analyse" }
{ "_id" : ObjectId("50b3c28d0238f791c861f1de"), "name" : "Delta Flyer", "type" : "simulation" }

When searching for a document by “type” everything works as usual, but one must keep in mind that now only those documents are considered that have the key “type”. Comparing the results of the two find-sort-commands, one can see the potential problem in this quite well. As sorting also takes advantage of existing indexes, this leads to the somewhat strange result that the documents not having the type-key are not shown at all when sorting by “type”. Another thing to keep in mind, I would say.

“Indexes are not costless, they take space on disc and they also take time to keep updated.” -Quotes from the course

The last topic in this section is the difference between creating indexes in the foreground and in the background. For me this is a topic that already tends more towards database administration, but it is good to know anyway. Creating an index in the foreground is fast, but it blocks other writers on the same database; this is the default. Even though it is fast, it can still take several minutes for a lot of documents. In a production environment this might be a problem, and then it might be a good idea to use {background : true}. This will slow down the index creation considerably (the course mentions a factor of up to five), but it does not block any other writers. When using replica sets there is one additional option. I do not really know much about replica sets yet, but they seem to be different instances of MongoDB that all serve the same database. In this case it is possible to temporarily remove one instance from the replica set, create the index there in the foreground and then add the instance back to the replica set. I think, to make an elaborated decision on this, I would need to know more about replica sets, but luckily there are still more lectures to come.

Indexes Explain-ed & Size of Indexes

For this part it would be good to have a really big collection, which I do not (yet) have and I am too lazy to create one now ;-). Nevertheless we can have a look at the explain-command of MongoDB.

> db.holodeck_programs.find({type : 'simulation'}).explain()
{
	"cursor" : "BtreeCursor type_1",
	"isMultiKey" : false,
	"n" : 1,
	"nscannedObjects" : 1,
	"nscanned" : 1,
	"nscannedObjectsAllPlans" : 1,
	"nscannedAllPlans" : 1,
	"scanAndOrder" : false,
	"indexOnly" : false,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"millis" : 0,
	"indexBounds" : {
		"type" : [
			[
				"simulation",
				"simulation"
			]
		]
	},
	"server" : "Thomass-MacBook-Pro.local:27017"
}
> db.holodeck_programs.find({name : 'Captain Proton'}).explain()
{
	"cursor" : "BasicCursor",
	"isMultiKey" : false,
	"n" : 1,
	"nscannedObjects" : 4,
	"nscanned" : 4,
	"nscannedObjectsAllPlans" : 4,
	"nscannedAllPlans" : 4,
	"scanAndOrder" : false,
	"indexOnly" : false,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"millis" : 0,
	"indexBounds" : {
		
	},
	"server" : "Thomass-MacBook-Pro.local:27017"
}

First, and probably most important, the “cursor”-value shows us whether an index has been used or not. For our first query MongoDB could use the index on “type”; when querying by “name” there is no index, and thus no index is used, indicated by the “BasicCursor” value. The number of scanned objects is also interesting and correlates well with the different queries. The “millis”-value tells us how long a query took in milliseconds. As there is hardly any data in the used collection, there is no difference visible here, but it is a very good indicator. Another interesting value is the “indexOnly” entry. If it is true, MongoDB can retrieve all the needed information from the index and does not need to load the document at all. This can be achieved with a compound index, for example:

> db.universe.find({galaxy : 'Milky Way'}, {galaxy : 1, quadrant : 1, planet : 1, _id : 0}).explain()
{
	"cursor" : "BtreeCursor galaxy_1_quadrant_1_planet_0",
	"isMultiKey" : false,
	"n" : 2,
	"nscannedObjects" : 2,
	"nscanned" : 2,
	"nscannedObjectsAllPlans" : 2,
	"nscannedAllPlans" : 2,
	"scanAndOrder" : false,
	"indexOnly" : true,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"millis" : 0,
	"indexBounds" : {
		"galaxy" : [
			[
				"Milky Way",
				"Milky Way"
			]
		],
		"quadrant" : [
			[
				{
					"$minElement" : 1
				},
				{
					"$maxElement" : 1
				}
			]
		],
		"planet" : [
			[
				{
					"$minElement" : 1
				},
				{
					"$maxElement" : 1
				}
			]
		]
	},
	"server" : "Thomass-MacBook-Pro.local:27017"
}

In the above example all queried values come from the same compound index, and thus MongoDB does not need to look at any document at all; all information can be retrieved from the index itself (“indexOnly” : true).

One very important aspect when creating indexes is whether or not it is possible for MongoDB to keep them in memory. Thus, knowing how big the indexes are can help in deciding whether to add main memory to the used servers as one solution. Or maybe certain indexes are not really needed and should be dropped. Using the “stats()”- and “totalIndexSize()”-commands the corresponding values can be inspected. All size values are given in bytes.

> db.holodeck_programs.stats()
{
	"ns" : "test.holodeck_programs",
	"count" : 4,
	"size" : 240,
	"avgObjSize" : 60,
	"storageSize" : 4096,
	"numExtents" : 1,
	"nindexes" : 2,
	"lastExtentSize" : 4096,
	"paddingFactor" : 1,
	"systemFlags" : 1,
	"userFlags" : 0,
	"totalIndexSize" : 16352,
	"indexSizes" : {
		"_id_" : 8176,
		"type_1" : 8176
	},
	"ok" : 1
}
> db.holodeck_programs.totalIndexSize()
16352

Phew, considering the amount of content in this blog post and the amount of time it took me to write it up to here, I fear I have to speed things up a bit. So the next paragraphs will be a bit more compressed.

Hints & Efficiency

MongoDB does a pretty good job of deciding which index to use for a particular query. Nevertheless it is possible to manually override this decision-making process by adding hints to the query. A hint is just added as an additional command and specifies the index that should be used. By the way, this is the next place where things get a bit complicated when considering sparse indexes: if we hint at a sparse index and thus force MongoDB to use it, it will show no documents where the corresponding attribute is not present. So the combination of sparse indexes and hints must be considered carefully.

It is also possible to completely disable the use of any index with “hint({$natural : 1})”. This can for example be helpful to examine how efficient an existing index is by temporarily not using it.

“And I know there are not that many students in that class, because I build that data set. So I have information it (MongoDB) does not necessarily have.”- Quotes from the course

The efficiency of indexes depends a lot on the type of query. Using operators like $gt, $lt or $ne will reduce the efficiency of the used indexes a lot. The same is true for regular expressions that do not match from the beginning of the string (i.e. that are not anchored with /^).
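The difference between an anchored and an unanchored regular expression can be sketched against a sorted list of keys (a strong simplification of what a B-tree scan does; the galaxy names are just sample data): an anchored prefix narrows the scan to a contiguous range, while an unanchored pattern forces an inspection of every key:

```javascript
// Illustration: how many index keys must be inspected on a sorted key list.
const keys = ["Andromeda", "Messier 81", "Milky Way", "Triangulum", "Whirlpool"];

// Anchored prefix (like /^Mi/): the sorted order gives a contiguous
// range ["Mi", "Mj"), so only keys inside that range are inspected.
const inRange = keys.filter((k) => k >= "Mi" && k < "Mj");

// Unanchored pattern (like /way/i): no range can be derived,
// so every single key has to be inspected.
const scannedAll = keys.length;

console.log(inRange.length, "key(s) inspected for the anchored pattern");   // 1
console.log(scannedAll, "keys inspected for the unanchored pattern");       // 5
```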

Logging & Profiling

By default MongoDB logs slow queries, where slow means a query running longer than 100 ms. To be able to demonstrate the output I would need a bigger collection than what I currently have for testing. Anyway, the message would show up in the shell window where MongoDB has been started.

The next step is profiling. Therefore the additional parameters

--profile <level>

and

--slowms <value>

can be used when starting mongod. Setting the profiling level to 0 means profiling is turned off. Setting it to 1 logs all queries slower (in milliseconds) than the value given with the second parameter. Finally, a value of 2 for the profile-setting enables logging of all queries. From the MongoDB shell the settings can be checked as follows:

> db.getProfilingLevel()
0
> db.getProfilingStatus()
{ "was" : 0, "slowms" : 100 }

And I can also change the profiling level from the MongoDB shell:

> db.setProfilingLevel(1, 5)
{ "was" : 0, "slowms" : 100, "ok" : 1 }
> db.getProfilingLevel()
1
> db.getProfilingStatus()
{ "was" : 1, "slowms" : 5 }

There is much more to profiling, like querying for various values using “db.system.profile.find()” with corresponding JSON documents as parameters describing what to profile. That could probably be a blog post of its own.

Top, Stats & Sharding

Finally, for this blog post, a very quick look at some more high-level commands and concepts. The commands “mongotop” and “mongostat” are designed to work similarly to the corresponding Unix commands.

Of course the following values do not tell too much on my idle MongoDB instance. But with mongotop it is possible to see how much time is spent on the different operations in the different collections.

thomass-macbook-pro:bin thomasjaspers$ ./mongotop
connected to: 127.0.0.1

                            ns       total        read       write		2012-11-30T22:24:27
        test.system.namespaces         0ms         0ms         0ms
          local.system.replset         0ms         0ms         0ms
                        admin.         0ms         0ms         0ms

The command mongostat gives a kind of snapshot of the overall MongoDB system. Both of these commands are probably meant more for system administrators, but it is of course good to know them.

thomass-macbook-pro:bin thomasjaspers$ ./mongostat
connected to: 127.0.0.1
insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
     0      0      0      0       0       1       0     0m  2.41g    28m      2 local:0.0%          0       0|0     0|0    62b     1k     1   23:25:04 
     0      0      0      0       0       1       0     0m  2.41g    28m      0     .:0.1%          0       0|0     0|0    62b     1k     1   23:25:05 
     0      0      0      0       0       1       0     0m  2.41g    28m      0 local:0.0%          0       0|0     0|0    62b     1k     1   23:25:06 
     0      0      0      0       0       1       0     0m  2.41g    28m      0 local:0.0%          0       0|0     0|0    62b     1k     1   23:25:07 

The mongostat command is often used to check the value for “idx miss %”. That value tells how well MongoDB is able to serve all index reads from memory. But be careful: the value will look very good if no index is used at all, which for sure does not mean that the application is fast.

“But ultimately to debug the performance of your programs you gonna need to do some profiling to figure out what is slow inside your programs.”- Quotes from the course

The very last thing one should at least know about is the concept of sharding. Here MongoDB runs in a kind of cluster (I only said a kind of ;-)). In this case there is a MongoDB instance, mongos, that operates as a router. An application only talks to this router. Behind it there can be many mongod instances distributed over a lot of servers. Now what is important to know is that in this scenario a so-called shard key is needed. It must be used for all insert operations, and it is the key the router uses to determine the instance to store the document on. All other operations (update, find, remove) can omit the shard key, but of course this will slow down performance, as some broadcasting is then needed to find the proper document. This is a solution for big collections, and for sure more information than this is required to implement it properly. This short paragraph was just meant to raise awareness that this feature exists in MongoDB.
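The routing idea can be sketched in a few lines of JavaScript (a conceptual illustration only, not how mongos actually works; the shard names and key ranges are made up): a range-based mapping from the shard key to a shard lets an insert go to exactly one instance, while a query without the shard key has to be broadcast to all of them:

```javascript
// Conceptual illustration of range-based shard routing on the key "galaxy".
const shards = [
  { name: "shard0", min: "A", max: "M" },       // galaxies A–L
  { name: "shard1", min: "M", max: "\uffff" },  // galaxies from M onwards
];

function routeInsert(doc) {
  // Insert: the shard key picks exactly one shard.
  return shards.find((s) => doc.galaxy >= s.min && doc.galaxy < s.max).name;
}

function routeQuery(query) {
  // Query: without the shard key every shard must be asked (broadcast).
  if (query.galaxy === undefined) return shards.map((s) => s.name);
  return [routeInsert(query)];
}

console.log(routeInsert({ galaxy: "Andromeda", planet: "Xantar" })); // shard0
console.log(routeQuery({ planet: "Earth" })); // [ 'shard0', 'shard1' ] — broadcast
```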

OK, that was it for this week’s lecture. There are three more lectures to come and thus three more blog postings. I am still enjoying my MongoDB experience, even though this was a tough one for me time-wise. But I am already curious what will come next :-).


The MongoDB class series

Part 1 – MongoDB: First Contact
Part 2 – MongoDB: Second Round
Part 3 – MongoDB: Close Encounters of the Third Kind
Part 4 – MongoDB: I am Number Four
Part 5 – MongoDB: The Fifth Element
Part 6 – MongoDB: The Sixth Sense
Part 7 – MongoDB: Tutorial Overview and Ref-Card

Java Supplemental Series

Part 1 – MongoDB: Supplemental – GRIDFS Example in Java
Part 2 – MongoDB: Supplemental – A complete Java Project – Part 1
Part 3 – MongoDB: Supplemental – A complete Java Project – Part 2

Author

Thomas Jaspers
