Document classification, data extraction and everything

20.8.2019 | 6 minutes of reading time

Over time, the codecentric blog has published many posts about document classification and data extraction with Kofax and other products. This post puts them into context and points out what has changed since the older posts were written.

The codecentric division ‘Digital Integration’ uses Kofax Capture, Kofax Transformation Modules (KTM) and Kofax Total Agility, among other products, in customer projects. A large part of the posts therefore refers to these products.

The listed posts were published independently over the past years. To bring some clarity, I have grouped them into the following areas:

Best practices
Experience reports
Tips and tricks
Latest trends
The basis of everything

Best practices in document classification and data extraction

Independent of any specific project, these posts cover best practices for customer projects that involve document classification and data extraction.

The basis of mailroom automation is the classification of incoming documents into document types, because the data to be extracted differs from type to type. The following article explains the classification tools available in Kofax Transformation Modules.

Document classification with Kofax Transformation Modules (KTM)

Artificial intelligence, neural networks and machine learning play a role in this environment. Kofax Transformation Modules has offered AI mechanisms for years and adds new tools with every release:

Kofax Transformation Modules (KTM), AI and Machine Learning

Experience reports

The following blog posts are based on customer projects. Topics range from the details of processing SEPA mandates to document process automation in insurance companies.

At one of our customers, incoming SEPA mandates are processed automatically or manually, depending on whether handwritten notes occur in a certain area of the SEPA form. This post explains how this can be done with KTM tools:

Kofax Transformation Modules: SEPA Mandates and handwritten additional information – or: who scribbled on my form?
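
The linked post describes how this is done with KTM tools. Purely to illustrate the routing idea, and not any KTM feature, here is a minimal Python sketch. It assumes the form is available as a grayscale image (a list of pixel rows, 0 = black, 255 = white) and that the zone that should stay empty is known. The names `ink_ratio`, `route_mandate`, `DARK_THRESHOLD` and the threshold values are hypothetical.

```python
from typing import List, Tuple

# Hypothetical value: pixels darker than this count as "ink".
DARK_THRESHOLD = 128


def ink_ratio(image: List[List[int]], zone: Tuple[int, int, int, int]) -> float:
    """Return the fraction of dark pixels inside zone = (top, left, bottom, right)."""
    top, left, bottom, right = zone
    rows = [row[left:right] for row in image[top:bottom]]
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    dark = sum(1 for row in rows for pixel in row if pixel < DARK_THRESHOLD)
    return dark / total


def route_mandate(image: List[List[int]],
                  zone: Tuple[int, int, int, int],
                  max_ink_ratio: float = 0.02) -> str:
    """Send the form to manual processing if the zone contains too much ink."""
    return "manual" if ink_ratio(image, zone) > max_ink_ratio else "automatic"


# Usage: a blank 10x10 page, then the same page with a scribble in the zone.
blank_page = [[255] * 10 for _ in range(10)]
scribbled_page = [row[:] for row in blank_page]
for x in range(2, 6):
    scribbled_page[3][x] = 0  # four dark pixels inside the zone

ZONE = (2, 2, 6, 6)
```

A real form also contains pre-printed lines and scanner noise in the zone, so the threshold would have to be calibrated on sample documents.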

Document process automation is discussed at the beginning of every project. At that point, the project members often understand the term differently and have to develop a common understanding before the actual work starts. These different views are presented in the following blog post:

KTM and insurance companies: Document Process Automation

Ideally, a recognition process leads to fully automatic processing of incoming documents. Contract terminations offer potential for automation, as the termination date is often mentioned in the document. Practical problems may occur, but they can be solved with KTM tools:

Automatic termination of an insurance contract – how Kofax KTM may help

Tips and tricks

Every customer project runs into “small” problems that cannot be solved with standard tools. Finding a solution without resorting to external tools requires some creativity. Here are some tips and tricks that arose from our projects:

Scan and recognition products usually try to rotate captured documents into the ‘correct’ orientation so that people can read them without rotating the page manually. Sometimes this automation fails, especially with faxes, which may contain lines printed at 90° or 180° to the main text. The following article explains how to rotate such problem documents correctly and automatically:

Orientation problems with document processing (Kofax Transformation Modules)

KTM offers so-called ‘dictionaries’. Suppose you use a regular expression to extract a date that may appear in different formats: 01.09.2019, 01. September 2019, etc. A dictionary (a plain text file) can contain the names of the months and their abbreviations, and the regular expression can reference it. This saves a lot of typing when defining regular expressions, and you can change the dictionary without modifying your KTM project. All of this is standard KTM functionality. Sometimes, however, you want to search the dictionary from a script. The following post shows how:

Kofax Transformation Modules (KTM) – Dictionaries: Search by script
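
The linked post covers the KTM scripting side. To make the underlying idea concrete outside of KTM, here is a minimal Python sketch, assuming a dictionary with one entry per line. It builds the month alternatives of a date expression from the dictionary and also looks up a word in the dictionary directly. The dictionary content and the function names are hypothetical and do not reflect the KTM API.

```python
import re

# Hypothetical dictionary content. In a real project this would be a plain
# text file with one entry per line, maintained outside the project.
MONTH_DICTIONARY_TEXT = """
January
Jan
February
Feb
March
Mar
April
Apr
May
June
Jun
July
Jul
August
Aug
September
Sep
October
Oct
November
Nov
December
Dec
"""


def load_dictionary(text):
    """Return the non-empty lines of a plain text dictionary."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_date_pattern(months):
    """Compile a date regex whose month alternatives come from the dictionary."""
    # Longest entries first, so "September" is tried before "Sep".
    names = "|".join(re.escape(m) for m in sorted(months, key=len, reverse=True))
    return re.compile(
        r"\b\d{1,2}\.\s*(?:\d{1,2}\.|(?:" + names + r")\.?)\s*\d{4}\b",
        re.IGNORECASE,
    )


def find_dates(text, months):
    """Return all date strings found in the text, in order of appearance."""
    return [m.group(0) for m in build_date_pattern(months).finditer(text)]


def in_dictionary(word, months):
    """Script-style lookup: is the word a dictionary entry (case-insensitive)?"""
    return word.casefold() in {m.casefold() for m in months}


MONTHS = load_dictionary(MONTH_DICTIONARY_TEXT)
```

Because the month names live in the dictionary, supporting another spelling or language only requires a new dictionary line. The expression itself stays unchanged.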

The next piece of advice is no longer necessary and is only relevant for KTM versions 5 or lower. Machine-written data could always be extracted easily by free-form recognition. This was not possible with handwritten data, because full-page OCR engines were optimized for machine-written characters. The following post describes how to recognize handwritten data with free-form recognition tools. Kofax KTM 5.5 and higher offers a new full-text OCR engine that extracts both machine-written and handwritten data from a page.

Kofax Transformation Modules (KTM): ‘free-form recognition’ for handwritten numbers

The all-purpose tool for data extraction with KTM is the so-called format locator. The following two blog posts introduce these free-form recognition tools:

Kofax Transformation Modules – format locators and dynamic regular expressions

Kofax Transformation Modules – format locators and dynamic regular expressions – Part 2

Latest trends

For years, Kofax Capture and Kofax Transformation Modules have been the basis of many capture projects, and Kofax is the market leader in this area. To be prepared for advanced requirements, Kofax offers a product called Kofax Total Agility (KTA). In simple terms, KTA contains the products Kofax Capture, Kofax Transformation Modules and Kofax Import Connector, embedded in a flexible workflow engine. Daniel Brodka explains the extensive capabilities of KTA in this post:

Introduction of and first steps in Kofax Total Agility

A growing part of our business is Robotic Process Automation (RPA). Kofax provides the product Kapow as a platform to process data from structured or unstructured databases, files, email systems, websites, portals and even legacy mainframe systems or terminal emulations. Kapow fits well with the other Kofax products. It was recently renamed and is now called Kofax RPA. Stefan Blank has summarized the capabilities of Kofax RPA/Kapow by building an example robot:

Robotic Process Automation with Kofax Kapow™

The basis of everything

The successful capturing platform offered by Kofax is Kofax Capture. Kofax Capture alone, even without KTM, is enough to build good solutions. Stefan Blank shows how to do this, and how to create your own extensions to the platform, in this post about extended customizing of Kofax Capture:

Kofax Capture – Customisation beyond standard features

Stefan Blank wrote another blog post about a project-specific extension to Kofax Capture. This post is about adjustments of the scanning module to meet specific project requirements:

Kofax Capture Advanced Scan Api: A first approach

Kofax Capture includes a module for validating recognized data and entering further data manually. This module is called ‘Validation’. It provides a scripting language that can be used to adapt the behavior of the module to project requirements. For years, this language was ‘SBL’ Basic, which is compatible with the ‘old’ Visual Basic of the 90s. For some years now, it has also been possible to use .NET (VB, C#) as the development environment. The next post explains what you need to consider when switching from SBL to .NET:

Kofax Capture Validation Scripting – from SBL to VB.NET for Dummies

Barcodes are a popular mechanism for document separation. They may be attached as labels to the first page of a document or printed on a separate sheet inserted before the first page. In general, this kind of document separation works well. Sometimes, however, other barcodes that happen to appear on the documents are mistaken for separation barcodes, the batch is split in the wrong places and the document structure is destroyed. There is a remedy for this:

Kofax Capture – Document Separation and Barcodes
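
The linked post describes the remedy in Kofax Capture. As a product-independent illustration of the problem, the following Python sketch splits a batch only at barcodes that match a known separator pattern and ignores all others. This filter is an assumption chosen for illustration and not necessarily the approach of the linked post. The pattern `SEP-` plus six digits and the function name `split_batch` are hypothetical, and the sketch assumes the label variant in which the barcode page belongs to the document.

```python
import re
from typing import List, Optional

# Hypothetical separator format: "SEP-" followed by six digits.
SEPARATOR_PATTERN = re.compile(r"SEP-\d{6}")


def split_batch(page_barcodes: List[Optional[str]]) -> List[List[int]]:
    """Group page indices into documents.

    page_barcodes holds one entry per scanned page: the barcode value read
    from that page, or None if the page carries no barcode. A new document
    starts only at a barcode that matches the separator pattern, so foreign
    barcodes (for example product codes) do not split the batch.
    """
    documents: List[List[int]] = []
    for page_index, barcode in enumerate(page_barcodes):
        starts_new = (
            barcode is not None
            and SEPARATOR_PATTERN.fullmatch(barcode) is not None
        )
        # The first page always opens a document, even without a separator.
        if starts_new or not documents:
            documents.append([])
        documents[-1].append(page_index)
    return documents


# Usage: page 2 carries a foreign barcode and must not start a new document.
batch = ["SEP-000001", None, "4006381333931", "SEP-000002", None]
```

Splitting at every barcode would turn this batch into three documents. With the pattern filter it stays at two.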

Hopefully, this summary and grouping of the various blog posts has made the topics of data capture, document classification and data extraction clearer. If you have questions or remarks, please use the comment section below. We appreciate your feedback!

Blog author

Jürgen Voss
