
MongoDB Text Search Explained


The upcoming release 2.4 of MongoDB will include first, experimental support for full text search (FTS). This feature was requested early in the history of MongoDB, as you can see from this JIRA ticket: SERVER-380. FTS first became available with the developer release 2.3.2.

Full Text Search 101

Before looking at how MongoDB implemented its initial full text search, we need to learn a little bit about the basics. There are (at least) two important concepts you need in order to understand full text search:

Stop Words

Stop words are used to filter words that are irrelevant for searching. Examples are is, at, the etc. Let’s have a look at the following sentence …

I am your father, Luke

… and these stop words: am, I, your. After applying the stop words, that’s what’s left of our sentence:

father Luke

The remaining words are processed in the next step. Please note that stop words are language dependent and may also vary from domain to domain.
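The filtering step can be sketched in a few lines of plain JavaScript. This is only an illustration, not MongoDB's implementation: the function name removeStopWords and the three-word stop list are assumptions taken from the example sentence, and a modern JavaScript runtime such as Node.js is assumed.

```javascript
// Hypothetical stop word list taken from the example above; real lists are
// much longer and language dependent.
const STOP_WORDS = new Set(["am", "i", "your"]);

// Split the text into words (dropping punctuation), then remove stop words.
// The comparison is case-insensitive; the original casing is kept.
function removeStopWords(text, stopWords) {
  return text
    .split(/[^A-Za-z']+/)
    .filter((word) => word.length > 0)
    .filter((word) => !stopWords.has(word.toLowerCase()));
}

const remaining = removeStopWords("I am your father, Luke", STOP_WORDS);
console.log(remaining.join(" ")); // father Luke
```

Passing a different set as the second argument is all it takes to switch to another language or domain.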

Stemming

Stemming is the process of reducing words to their root, base or .. well .. stem. Remember things like declension and conjugation? These typically change the form of a word while its stem stays the same. For example,

waiting, waited, waits

have all the same stem wait. This processing is also language dependent. Implementations for stemming are called stemmers.
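To make the idea concrete, here is a deliberately naive suffix stripper in plain JavaScript. It is not the Snowball algorithm and handles only the three example words reliably; the name toyStem, the suffix list and the minimum stem length are assumptions made for this sketch.

```javascript
// Toy suffix stripper: NOT the Snowball algorithm, only an illustration.
// Suffixes are tried in order; the first one that matches is removed.
const SUFFIXES = ["ing", "ed", "s"];

function toyStem(word) {
  const lower = word.toLowerCase();
  for (const suffix of SUFFIXES) {
    // Keep at least three characters so short words such as "is" survive.
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, lower.length - suffix.length);
    }
  }
  return lower;
}

// All three example words are reduced to "wait".
console.log(["waiting", "waited", "waits"].map(toyStem));
```

A real stemmer applies many more rules per language, which is why you would normally rely on an existing implementation.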

The following diagram sums up the whole process:

[Diagram mongo_fts_2: stop word filtering followed by stemming]

So let’s see how we can use MongoDB for full text search.

Enable Text Search

At the moment, text search is disabled by default. You have to enable it at server start with the following command line option:

$ ./mongod --setParameter textSearchEnabled=true

Create a text index

First of all, you define a special kind of index on a field, similar to geospatial indexes:

db.txt.ensureIndex( {txt: "text"} )

Language settings are important with FTS. MongoDB uses the open source stemmer Snowball and a custom set of stop words for every language supported by that stemmer. The default language is English.

If you have a look at the indexes, our special text index shows up:

> db.txt.getIndices()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "txt.txt",
                "name" : "_id_"
        },
        {
                "v" : 0,
                "key" : {
                        "_fts" : "text",
                        "_ftsx" : 1
                },
                "ns" : "txt.txt",
                "name" : "txt_text",
                "weights" : {
                        "txt" : 1
                },
                "default_language" : "english",
                "language_override" : "language"
        }
]

Insert documents

If you insert a document into the above collection, MongoDB applies stop word filtering and stemming to the content of the indexed text field. Each stem is added to the index, pointing to the current document.

db.txt.insert( {txt: "I am your father, Luke"} )

You can easily see that the stop word filtering happened, because there are only 2 keys in the index txt.txt.$txt_text:

> db.txt.validate()
{
        "ns" : "txt.txt",
         ...
        "nIndexes" : 2,
        "keysPerIndex" : {
                "txt.txt.$_id_" : 1,
                "txt.txt.$txt_text" : 2
        },
        ...
}

Search

If you want to perform a full text search, you run a command on the collection holding the text index:

db.txt.runCommand( "text", { search : "father" } )

Again, the language (this time the language of the search phrase) defaults to English.

The result looks like this:

> db.txt.runCommand("text", {search: "father"} )
{
        "queryDebugString" : "father||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.75,
                        "obj" : {
                                "_id" : ObjectId("50e820689068856d0ac6a801"),
                                "txt" : "I am your father, Luke"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 1,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 114
        },
        "ok" : 1
}

We have one hit for “father” using the index. The ObjectId of the document is returned along with the full text.

This doesn’t feel like rocket science? Ok, then try a more advanced example:

> db.txt.insert({txt: "I'm still waiting"})
> db.txt.insert({txt: "I waited for hours"})
> db.txt.insert({txt: "He waits"})
> db.txt.runCommand("text", {search: "wait"})
{
        "queryDebugString" : "wait||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 1,
                        "obj" : {
                                "_id" : ObjectId("50e82dc9c95b73b63ec5f5aa"),
                                "txt" : "He waits"
                        }
                },
                {
                        "score" : 0.75,
                        "obj" : {
                                "_id" : ObjectId("50e82db5c95b73b63ec5f5a9"),
                                "txt" : "I waited for hours"
                        }
                },
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50e82dabc95b73b63ec5f5a8"),
                                "txt" : "I'm still waiting"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 3,
                "nscannedObjects" : 0,
                "n" : 3,
                "timeMicros" : 148
        },
        "ok" : 1
}

That’s pretty cool, isn’t it? As you can see, the resulting documents are sorted in descending order according to the score. There is a metric applied that measures the distance between the search word and the indexed stems.
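If you want to post-process such a result on the client side, a small sketch in plain JavaScript could look like this. The result object is copied from the output above (ObjectIds and stats omitted); the helper names isSortedByScoreDescending and summarize are made up for this example. In the mongo shell you would assign the return value of runCommand to a variable instead of using the literal.

```javascript
// Result document shaped like the output above (ObjectIds and stats omitted).
const result = {
  results: [
    { score: 1, obj: { txt: "He waits" } },
    { score: 0.75, obj: { txt: "I waited for hours" } },
    { score: 0.6666666666666666, obj: { txt: "I'm still waiting" } },
  ],
  ok: 1,
};

// Return true if the scores never increase from one hit to the next.
function isSortedByScoreDescending(hits) {
  return hits.every((hit, i) => i === 0 || hits[i - 1].score >= hit.score);
}

// Reduce each hit to a compact "score<TAB>text" line.
function summarize(hits) {
  return hits.map((hit) => hit.score.toFixed(2) + "\t" + hit.obj.txt);
}

console.log(isSortedByScoreDescending(result.results));
console.log(summarize(result.results).join("\n"));
```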

Examples

All examples can be found on GitHub. Try them yourself.

Summary

Of course, this implementation of full text search won’t enable MongoDB to compete with search engines like Apache Solr or Elasticsearch, but it is a step in the right direction. I think there are many use cases where this kind of FTS is absolutely sufficient. And don’t forget: this is the first release. We will probably see other interesting features in the future.

If I had to write a wish list, I would write the following:

  • Enable users to provide their own stop word lists (w/o compiling). This could be done via a command line option pointing to a file or a new system collection like system.fts.stopwords
  • Use a stemmer implementation that supports more languages than Snowball does. What about all the Asian languages?
  • Introduce the concept of a dictionary in order to handle
    • synonyms,
    • irregular words and
    • compound words that are common in various European languages, something like the German words Volltextsuche (full text search) or Erdbeermarmeladenglas (jar of strawberry jam).

What’s next

In my next blog article I will have a closer look at more advanced features and non-English languages.

In the meantime: try text search yourself, especially if you have huge product data sets. Report any errors or suggestions to the Mongo JIRA.


Comments

  • 7 January 2013 by adebarbara

    Tobias, excellent explanation! I noticed this: “[…] won’t enable MongoDB to compete with search engines like Lucene or Elastic Search”. I think it would be better to say “[…] won’t enable MongoDB to compete with search engines like Apache Solr or Elastic Search”, since both projects are built on top of Apache Lucene?

    Thanks for the post!
    Andrés

  • We had to use Elastic Search (which is based on Lucene at its core). We found Elastic Search easy to set up but a nightmare to develop against with the C# NEST client.

    Thankfully MongoDB 2.4 has been released which has a new Free Text Search feature.

    MongoDB 2.4 Release Notes

    I’ve also written a simple post on how to use MongoDB Free Text Search.

    It has full instructions, working code examples and links to all the documentation; it might be helpful to developers getting started.

    • 1 August 2013 by Maziyar

      Hi Robert,

      Since you’ve had experience with ES and Solr, how does this new feature compare? It’s a blast because it’s built in. Can it replace Solr and ES?

  • 10 June 2014 by Cupidvogel

    Hi,

    So imagine I have large bodies of text, say movie reviews. I want a search based on complete or incomplete phrases in the collection of all reviews. Say, there are 200 reviews. Each review is mapped to a certain URL. Some of them contain the text “is the death of duty”. Now if I search by the text “death of duty”, I want the list of URLs whose review contains that text somewhere. Can that be implemented? Will it be fast enough, compared to what some search engines like Lucene offer?

    • Tobias Trelle

      Your scenario can be implemented w/ MongoDB’s full text search. Like w/ any other MongoDB query, you can restrict the set of fields to retrieve from the documents matched by your search criteria.

      I didn’t do any performance comparisons with other search engines. But you can’t compare Lucene w/ MongoDB. The first is a library, the second a database server. If text search is your core business, you might want to take a look at Solr or Elasticsearch.

      HTH, Tobias
