Kofax Transformation Modules (KTM): ‘free-form recognition’ for handwritten numbers

19.7.2015 | 4 minutes of reading time

In contrast to form-based recognition, free-form recognition tries to find certain values (such as an insurance number) anywhere on a document. It helps if the value being searched for has a structure that can be matched with regular expressions. In addition, keywords located ‘near’ the value (for example ‘insurance number’, ‘ins nbr’, …) are often used to narrow the search.

Most established classification and extraction products offer tools of this kind. With machine-printed text, all of them deliver sufficient results.

At our customers' sites we use Kofax Transformation Modules (KTM) for document classification and data extraction. The KTM tools for free-form recognition of machine-printed text are the so-called ‘format locators’. You can read about them in earlier KTM blog articles (1).

This article describes how to find handwritten numbers with a known structure anywhere on a document.

In this example we are searching for handwritten insurance numbers on a document. These numbers have the following structure: 1x-xxxxxx-xx, where each x represents a digit between 0 and 9, for example 14-386723-89.
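The structure translates almost directly into a regular expression. The following is a minimal standalone Python sketch (outside KTM, only for trying out the pattern); the function name `is_insurance_number` is an assumption made for this illustration.

```python
import re

# Structure as described in the text: "1", one digit, "-", six digits, "-", two digits.
INSURANCE_NUMBER = re.compile(r"1[0-9]-[0-9]{6}-[0-9]{2}")

def is_insurance_number(text: str) -> bool:
    """Return True if the whole string has the form 1x-xxxxxx-xx."""
    return INSURANCE_NUMBER.fullmatch(text) is not None

print(is_insurance_number("14-386723-89"))  # True
print(is_insurance_number("24-386723-89"))  # False: does not start with 1
print(is_insurance_number("14-38672-89"))   # False: middle group has only five digits
```

`fullmatch` is used here because the whole string is checked; searching inside a longer recognition result needs a scanning search instead.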

This is the example document, which will be used in our KTM project:

Within the KTM project you first have to assign the example document to the appropriate document class (in our example the class ‘InsuranceDocs’). This can be done with any of the available KTM classification methods (see also: Document classification with KTM).

A field ‘InsuranceNumber’ and a locator ‘Numbers’ (Advanced Zone Locator) should be added to the document class ‘InsuranceDocs’:

This is the basic idea behind ‘free-form recognition’ for handwritten numbers:

  1. The Advanced Zone Locator reads the text of the page by sizing its zone large enough to cover the entire page (or at least the region where the handwritten numbers may occur).
  2. In our experience the RecoStar engine reads numerical characters better than the FineReader engine. Therefore RecoStar is used within the Advanced Zone Locator with a numerical recognition profile [0-9-].
  3. The result of the Advanced Zone Locator will be a string consisting of digits and hyphens (-).
  4. Within the script of the document class ‘InsuranceDocs’ the result string will be examined for insurance numbers using regular expressions.
  5. If possible the found insurance number should be checked against an inventory database and finally put into the extraction field ‘InsuranceNumber’.

Setup of the Advanced Zone Locator

Draw the zone over the region of the example page where the handwritten numbers may occur:

Set the zone recognition profile to a RecoStar zone engine with these settings:

Clear the ‘Registration failure makes zone invalid’ checkbox, because registration will always fail with unstructured documents and we want to keep the result in any case:

Testing the Advanced Zone Locator shows this result:

At first this looks somewhat messy, but the desired insurance number shows up in the fourth line from the bottom. This number still has to be extracted from the result string.
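To illustrate what "extracting from the result string" means, here is a small standalone Python sketch. The sample string is made up for this illustration and is not the actual locator output; the function name `find_insurance_numbers` is likewise an assumption. The pattern tolerates one optional whitespace character around each hyphen, because handwriting recognition often inserts stray gaps.

```python
import re

# "1", a digit 1-9, "-", six digits, "-", two digits,
# with optional whitespace around each hyphen.
PATTERN = re.compile(r"1[1-9]\s?-\s?\d{6}\s?-\s?\d{2}")

def find_insurance_numbers(ocr_text: str) -> list[str]:
    """Return all candidates found in the text, with whitespace removed."""
    return [re.sub(r"\s", "", match.group(0)) for match in PATTERN.finditer(ocr_text)]

# Made-up multi-line recognition result containing noise and one number.
sample = "7-1 33\n14 - 386723-89\n-- 2 0-1"

print(find_insurance_numbers(sample))  # ['14-386723-89']
```

Note that a pattern like this also matches inside a longer run of digits. If that turns out to be a problem on real documents, lookarounds such as `(?<!\d)` and `(?!\d)` are one way to tighten it.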

Extraction of the insurance number by scripting

In this example we use the ‘Document_AfterProcess’ event in the script of the document class ‘InsuranceDocs’ to extract the insurance number from the result string of the Advanced Zone Locator with a regular expression.

First of all, the library ‘Microsoft VBScript Regular Expressions 5.5’ has to be added as a reference to the script:

This Microsoft library enables your script to search string variables with regular expressions (Microsoft VBScript Regular Expressions 5.5 Description).

The complete KTM script looks like this:

Option Explicit

' Class script: InsuranceDocs

Private Sub Document_AfterProcess(ByVal pXDoc As CASCADELib.CscXDocument)
   Dim String_RecoStar As String
   Dim myRegExp As RegExp
   Dim myMatches As MatchCollection
   Dim InsNbr_Recostar As String

   'nothing to do if the advanced zone locator returned no alternative
   If pXDoc.Locators.ItemByName("Numbers").Alternatives.Count = 0 Then Exit Sub

   'get the first alternative from the advanced zone locator
   String_RecoStar = Trim(pXDoc.Locators.ItemByName("Numbers").Alternatives(0).SubFields.ItemByName("UF_Zone0").Text)

   Set myRegExp = New RegExp
   myRegExp.Global = True
   'define the regular expression for the insurance numbers
   'note: the second digit is restricted to 1-9 here; use 1\d if 0 can occur as well
   myRegExp.Pattern = "1[1-9]\s?-\s?\d{6}\s?-\s?\d{2}"

   Set myMatches = myRegExp.Execute(String_RecoStar)
   If myMatches.Count > 0 Then 'if something was found:
      'we just take the first result in this example...
      InsNbr_Recostar = myMatches.Item(0).Value
      'get rid of any whitespace the pattern tolerated (\s also matches tabs and line breaks)
      myRegExp.Pattern = "\s"
      InsNbr_Recostar = myRegExp.Replace(InsNbr_Recostar, "")
      If DB_Check(InsNbr_Recostar) Then 'if possible validate the number against a database
         'put the value into the InsuranceNumber field
         pXDoc.Fields.ItemByName("InsuranceNumber").Text = InsNbr_Recostar
         pXDoc.Fields.ItemByName("InsuranceNumber").Valid = True
      End If
   End If
End Sub

Function DB_Check(Number As String) As Boolean
   DB_Check = True 'just return True in this example
   'Implement the database validation of the extracted insurance number
End Function
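The database validation is left open in the script. As a rough standalone illustration of the idea (not KTM script code), the following Python sketch checks candidates against an in-memory SQLite table and returns the first one that is known. The table `policies`, its column `insurance_number`, and the function name `first_known_number` are all hypothetical; a real inventory database will look different.

```python
import sqlite3

def first_known_number(candidates, conn):
    """Return the first candidate that exists in the inventory table, else None."""
    for number in candidates:
        row = conn.execute(
            "SELECT 1 FROM policies WHERE insurance_number = ?", (number,)
        ).fetchone()
        if row is not None:
            return number
    return None

# Hypothetical in-memory inventory, only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policies (insurance_number TEXT PRIMARY KEY)")
conn.execute("INSERT INTO policies VALUES ('14-386723-89')")

print(first_known_number(["11-000000-00", "14-386723-89"], conn))  # 14-386723-89
```

Checking every regular expression match instead of only the first one makes the extraction more robust when the recognition result contains several strings that happen to fit the pattern.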

Processing the example document in KTM Project Builder finally produces a result like this:

(1) More codecentric blog articles about KTM:

KTM and insurance companies: Document Process Automation

Document classification with Kofax Transformation Modules (KTM)

Kofax Transformation Modules – format locators and dynamic regular expressions – Part 2

Kofax Transformation Modules – format locators and dynamic regular expressions
