MongoDB Text Search Explained

7.1.2013 | 5 minutes of reading time

The upcoming release 2.4 of MongoDB will include initial, experimental support for full text search (FTS). This feature was requested early in the history of MongoDB, as you can see from the JIRA ticket SERVER-380. FTS first became available with the developer release 2.3.2.

Full Text Search 101

Before looking at how MongoDB implemented its initial full text search, we need to learn a little bit about the basics. There are (at least) two important concepts you need in order to understand full text search:

Stop Words

Stop words are used to filter out words that are irrelevant for searching. Examples are "is", "at" and "the". Let’s have a look at the following sentence …

I am your father, Luke

… and these stop words: am, I, your. After applying the stop words, that’s what’s left of our sentence:

father Luke

The remaining words are processed in the next step. Please note that stop words are language dependent and may also vary from domain to domain.

Stemming

Stemming is the process of reducing words to their root, base or .. well .. stem. Remember things like declension and conjugation? These typically change the form of a word while its stem stays the same. For example,

waiting, waited, waits

all have the same stem: wait. This processing is also language dependent. Implementations of stemming are called stemmers.
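The two steps can be sketched in a few lines of JavaScript. This is only a toy: the stop word list is the one from the example sentence, and `toyStem` is a hypothetical suffix stripper, not the stemmer MongoDB uses. Lowercasing the input is an assumption made for simplicity.

```javascript
// Toy illustration only: neither MongoDB's stop word list nor a real stemmer.
const STOP_WORDS = new Set(["am", "i", "your"]); // stop words from the example above

// Naive suffix stripping; real stemmers apply far more elaborate rules.
function toyStem(word) {
  for (const suffix of ["ing", "ed", "s"]) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Step 1: drop stop words. Step 2: reduce the remaining words to their stems.
function analyze(text) {
  return text
    .toLowerCase()
    .split(/[^a-z]+/) // split on anything that is not a letter
    .filter((w) => w.length > 0)
    .filter((w) => !STOP_WORDS.has(w))
    .map(toyStem);
}

console.log(analyze("I am your father, Luke")); // [ 'father', 'luke' ]
console.log(analyze("waiting, waited, waits")); // [ 'wait', 'wait', 'wait' ]
```

A real stemmer also has to deal with irregular forms that simple suffix stripping cannot handle.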

To sum up the whole process: stop words are filtered out first, then the remaining words are reduced to their stems.

So let’s see how we can use MongoDB for full text search.

Enable Text Search

For now, text search is disabled by default. You have to enable it at server start with the following command line option:

$ ./mongod --setParameter textSearchEnabled=true

Create a text index

First of all, you define a special kind of index on a field, similar to geospatial indexes:

db.txt.ensureIndex( {txt: "text"} )

Language settings are important with FTS. MongoDB uses the open source stemmer Snowball and a custom set of stop words for every language supported by that stemmer. The default language is English.

If you have a look at the indexes, our special text index shows up:

> db.txt.getIndices()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "txt.txt",
                "name" : "_id_"
        },
        {
                "v" : 0,
                "key" : {
                        "_fts" : "text",
                        "_ftsx" : 1
                },
                "ns" : "txt.txt",
                "name" : "txt_text",
                "weights" : {
                        "txt" : 1
                },
                "default_language" : "english",
                "language_override" : "language"
        }
]

Insert documents

If you insert a document into the above collection, MongoDB applies stop word filtering and stemming to the content of the indexed text field. Each stem is added to the index, pointing to the current document.

db.txt.insert( {txt: "I am your father, Luke"} )

You can easily see that the stop word filtering happened, because there are only 2 keys in the index txt.txt.$txt_text:

> db.txt.validate()
{
        "ns" : "txt.txt",
        ...
        "nIndexes" : 2,
        "keysPerIndex" : {
                "txt.txt.$_id_" : 1,
                "txt.txt.$txt_text" : 2
        },
        ...
}
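Why two keys? A toy inverted index makes this easier to see. The sketch below is not MongoDB's index format. The `Map`, the `insert` helper and the stop word list are assumptions made for illustration, and stemming is left out because the example sentence does not need it.

```javascript
// Toy inverted index: NOT MongoDB's on-disk index format.
const STOP_WORDS = new Set(["am", "i", "your"]); // stop words from the example sentence

// Hypothetical stand-in for stop word filtering; a real implementation
// would also stem each remaining word.
function analyze(text) {
  return text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w));
}

const index = new Map(); // index key -> Set of document ids

// Add every remaining word of the document to the index.
function insert(doc) {
  for (const key of analyze(doc.txt)) {
    if (!index.has(key)) index.set(key, new Set());
    index.get(key).add(doc._id);
  }
}

insert({ _id: 1, txt: "I am your father, Luke" });
console.log([...index.keys()]); // [ 'father', 'luke' ]
console.log(index.size);        // 2
```

Five words go in, but only two keys end up in the index, which matches the key count reported by validate().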

Search

If you want to perform a full text search, you run a command on the collection holding the text index:

db.txt.runCommand( "text", { search : "father" } )

Again, the language (this time the language of the search phrase) defaults to English.

The result looks like this:

> db.txt.runCommand("text", {search: "father"} )
{
        "queryDebugString" : "father||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.75,
                        "obj" : {
                                "_id" : ObjectId("50e820689068856d0ac6a801"),
                                "txt" : "I am your father, Luke"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 1,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 114
        },
        "ok" : 1
}

We have one hit for “father” using the index. The ObjectId of the document is returned along with the full text.

This doesn’t feel like rocket science? OK, then try a more advanced example:

> db.txt.insert({txt: "I'm still waiting"})
> db.txt.insert({txt: "I waited for hours"})
> db.txt.insert({txt: "He waits"})
> db.txt.runCommand("text", {search: "wait"})
{
        "queryDebugString" : "wait||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 1,
                        "obj" : {
                                "_id" : ObjectId("50e82dc9c95b73b63ec5f5aa"),
                                "txt" : "He waits"
                        }
                },
                {
                        "score" : 0.75,
                        "obj" : {
                                "_id" : ObjectId("50e82db5c95b73b63ec5f5a9"),
                                "txt" : "I waited for hours"
                        }
                },
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50e82dabc95b73b63ec5f5a8"),
                                "txt" : "I'm still waiting"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 3,
                "nscannedObjects" : 0,
                "n" : 3,
                "timeMicros" : 148
        },
        "ok" : 1
}

That’s pretty cool, isn’t it? As you can see, the resulting documents are sorted in descending order according to the score. There is a metric applied that measures the distance between the search word and the indexed stems.
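The three scores invite a guess at how the ranking could come about. The following toy model reproduces the result order and the scores from the output above, but every part of it is an assumption: the stop word list, the suffix-stripping `toyStem`, and above all the `guessScore` formula, which was simply fitted to the numbers shown and is not taken from the MongoDB documentation.

```javascript
// Toy model of the search above. Stop words, stemmer and score formula are
// all hypothetical and were chosen only to match the three scores shown.
const STOP_WORDS = new Set(["i", "he", "for"]);

function toyStem(word) {
  for (const suffix of ["ing", "ed", "s"]) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

function analyze(text) {
  return text
    .toLowerCase()
    .split(/[^a-z']+/) // keep apostrophes so that "i'm" stays one token
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w))
    .map(toyStem);
}

// Guessed formula: 0.5 + 0.5 * (occurrences of the term / number of stems).
function guessScore(stems, term) {
  const count = stems.filter((s) => s === term).length;
  return count === 0 ? 0 : 0.5 + (0.5 * count) / stems.length;
}

// The search phrase goes through the same analysis as the documents.
function search(docs, phrase) {
  const term = analyze(phrase)[0]; // single-word queries only
  return docs
    .map((txt) => ({ txt, score: guessScore(analyze(txt), term) }))
    .filter((hit) => hit.score > 0)
    .sort((a, b) => b.score - a.score); // best match first
}

const hits = search(
  ["I'm still waiting", "I waited for hours", "He waits"],
  "wait"
);
console.log(hits);
```

Under these assumptions, a document scores higher the fewer other stems it contains: "He waits" is left with a single stem and gets 1, while "I waited for hours" is left with two and gets 0.75.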

Examples

All examples can be found on GitHub. Try them yourself.

Summary

Of course, this implementation of full text search won’t enable MongoDB to compete with search engines like Apache Solr or Elasticsearch, but it is a step in the right direction. I think there are many use cases where this kind of FTS is absolutely sufficient. And don’t forget: this is the first release. We will probably see other interesting features in the future.

If I had to write a wish list, I would write the following:

- Enable users to provide their own stop word lists (without recompiling). This could be done via a command line option pointing to a file or via a new system collection like system.fts.stopwords.
- Use a stemmer implementation that supports more languages than the ones currently available. What about all the Asian languages?
- Introduce the concept of a dictionary in order to handle
  - synonyms,
  - irregular words and
  - compound words that are common in various European languages, something like the German words Volltextsuche (full text search) or Erdbeermarmeladenglas (jar of strawberry jam).

What’s next

In my next blog article I will have a closer look at more advanced features and non-English languages.

In the meantime: try text search yourself, especially if you have huge product data sets. Report any errors or suggestions to the MongoDB JIRA.


Blog author

Tobias Trelle

Software Architect
