Document classification with Kofax Transformation Modules (KTM)


Many of our customers use systems for automatic document classification and data extraction. ‘Kofax Transformation Modules’ (KTM) is one of these systems. Such data capture systems extract metadata from electronic images (the scanned pages of documents, faxes or emails) and release the data and the document to business applications.

In this article I will explain the different ways of document classification within KTM.

So far, two other articles about KTM have been published on the codecentric blog:

Kofax Transformation Modules – format locators and dynamic regular expressions – Part 1
Kofax Transformation Modules – format locators and dynamic regular expressions – Part 2

Before data can be extracted from a document, KTM needs to know the type of the document. Invoices have to be treated differently from, for example, insurance contracts. You want to extract the invoice number, invoice date and amounts from an invoice, but the insurance number and the insurance class from a contract.

So the document type has to be determined before data extraction can take place. In KTM this is done by the classification mechanism, which runs before extraction. As soon as a document has been classified, the metadata can be extracted.

Document classification in KTM can be done with various methods, which differ in complexity and in the effort needed for document preparation:


1. Classification by layout

This classification method tries to determine the type of a document from its graphical structure. It is the fastest way of classification, as no Optical Character Recognition (OCR) is needed. It can only be used where the document types can be clearly distinguished visually. Examples are application forms that differ in their design (structure, company logo, …). Forms in financial or insurance services are unsuitable, as these forms look very similar.

To use layout classification, you have to train KTM on the customer’s document types, but KTM keeps the manual effort for this very low. You collect some samples for the relevant document types and show KTM which samples are representative of which document type. KTM then learns the characteristic layout structures of each document type. This training can be done easily in the graphical user interface of KTM Project Builder.


2. Classification by content

The approach of (automatic) classification by content is similar to layout classification. The difference is that the content of the documents is used for classification instead of the layout. For this, the documents must first be read by OCR.

And what’s really nice: you don’t have to care about the meaning of the content. Just as with layout classification, you only have to prepare batches of samples for the document types. After the OCR read has occurred, you show KTM which samples are representative of each document type. After this setup KTM learns automatically which words, phrases or word combinations are characteristic of a document type. As with layout classification, this training takes place in KTM Project Builder.
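To make the idea of ‘learning characteristic words’ more concrete, here is a minimal Python sketch. It is not KTM’s algorithm (KTM’s internals are not described here) and it does not use any KTM API. It simply counts words per document type in a few made-up training texts and assigns a new text to the type whose vocabulary it overlaps with most. The function names `train` and `classify` and the sample texts are assumptions for illustration only.

```python
from collections import Counter


def train(samples):
    """Build a word-count model from samples.

    samples: dict mapping document type -> list of OCR texts (made-up data).
    """
    model = {}
    for doc_type, texts in samples.items():
        counts = Counter()
        for text in texts:
            counts.update(text.lower().split())
        model[doc_type] = counts
    return model


def classify(model, text):
    """Return the document type whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {}
    for doc_type, counts in model.items():
        total = sum(counts.values())
        # Sum the relative frequency (within this type's samples) of every word in the text.
        scores[doc_type] = sum(counts[w] / total for w in words)
    return max(scores, key=scores.get)


# Hypothetical training samples, standing in for OCR results of sample documents.
samples = {
    "Invoice": [
        "invoice number 4711 invoice date amount due",
        "invoice total amount net vat",
    ],
    "Contract": [
        "insurance number insurance class policy holder",
        "policy insurance contract premium",
    ],
}
model = train(samples)
```

With these samples, `classify(model, "please pay the invoice amount")` picks ‘Invoice’, because only the invoice samples contain the words ‘invoice’ and ‘amount’. A real system would of course need many more samples and a more robust scoring.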


3. Classification by instructions

When using layout or content classification you ‘only’ have to provide sufficient samples to KTM. The work of learning and evaluation will be done by KTM itself. With classification by instructions, the developer has to know the content of the documents and must be able to assess it in its business context. For each document type you manually define words, phrases and word combinations that are characteristic of the document type. So you need some subject-specific knowledge about the documents.

Classification by instructions is often used with general correspondence documents. For example: if ‘dunning’ and ‘dunning charge’ are both found on a document, the document is classified as ‘dunning procedure’.

To use classification by instructions, an OCR read must have been done beforehand. The instructions (words, phrases, word combinations) are again entered via KTM Project Builder.
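The principle behind such instructions can be sketched in a few lines of Python. This is only an illustration of the rule ‘all listed phrases must occur in the OCR text’, not KTM’s implementation; in KTM the instructions are configured in Project Builder, not coded. The function name `classify_by_instructions` and the rule table are assumptions; the ‘dunning’ rule is the example from above.

```python
def classify_by_instructions(ocr_text, rules):
    """Return the first document type whose required phrases all occur in the text.

    rules: list of (document type, list of required phrases), checked in order.
    The whole OCR text is searched; there is no restriction to a page region.
    """
    text = ocr_text.lower()
    for doc_type, phrases in rules:
        if all(phrase.lower() in text for phrase in phrases):
            return doc_type
    return None  # no instruction matched


# Rule table: the 'dunning' example from the article.
RULES = [
    ("dunning procedure", ["dunning", "dunning charge"]),
]
```

For a text such as "Second reminder: a dunning charge of 5 EUR applies." the function returns ‘dunning procedure’; for a text without these phrases it returns `None`, and another classification method would have to take over.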


4. Classification by script

4.1 Barcode

Sometimes a barcode on the document is sufficient to classify it. This is also possible with KTM, but you have to use KTM’s internal script language, which is comparable to Visual Basic.

You start with a barcode locator (BCode), which recognizes the barcode value and has to be defined at project level. A short script at project level then classifies the document into the desired document type:

' Class script: Project
Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
  If pXDoc.Locators.ItemByName("BCode").Alternatives.Count>0 Then
     If pXDoc.Locators.ItemByName("BCode").Alternatives(0).Confidence > 0.95 Then
       pXDoc.Reclassify "Barcodeantrag"
       Exit Sub 'only one reclassify
     End If
  End If
End Sub

The script is called in the event Document_BeforeClassifyXDoc, which is executed before all other classification mechanisms of KTM.

The script first checks whether the barcode locator has found anything at all and whether the confidence is greater than 95%. If so, the Reclassify method is used to classify the document as document type ‘Barcodeantrag’ (barcode application). After ‘Reclassify’ you have to exit the sub so that no further ‘Reclassify’ can happen in the following code. Multiple ‘Reclassify’ calls are possible in KTM, but be careful with them, as you might create infinite loops.

4.2 Format Locators, Advanced Zone Locators and Everything…

The ‘classification by script’ principle from 4.1 with barcode locators can be used with any other locator too. What matters is that the locator identifies a document type unambiguously. Furthermore, the locator has to be defined at project level, as otherwise the script in Document_BeforeClassifyXDoc will not be executed. The primary purpose of these locators is not data extraction; they merely serve as aids for classification.

A format locator defined at project level may identify the type of an insurance application and classify the document. The following image snippet shows a part of an insurance application for general liability insurance (‘Haftpflichtversicherung’).


A format locator (‘Antrag_Haft’) may search for the word ‘Haftpflichtversicherung’ (general liability insurance) above the word ‘Antrag’ (application) in a region in the upper left corner of the document. If the format locator finds a match, the document can be classified as document type ‘Antrag_Haftpflicht’.

The script looks like this (analogous to the barcode example):

' Class script: Project
Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
  If pXDoc.Locators.ItemByName("Antrag_Haft").Alternatives.Count>0 Then
     If pXDoc.Locators.ItemByName("Antrag_Haft").Alternatives(0).Confidence > 0.95 Then
       pXDoc.Reclassify "Antrag_Haftpflicht"
       Exit Sub 'only one reclassify
     End If
  End If
End Sub

If you want to use an ‘Advanced Zone Locator’ (Antrag_Haft_EZL) to classify the insurance application above, you have to adapt the script to the subfields of the zone locator:

' Class script: Project
Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
  If pXDoc.Locators.ItemByName("Antrag_Haft_EZL").Alternatives.Count>0 Then
     If pXDoc.Locators.ItemByName("Antrag_Haft_EZL").Alternatives(0).SubFields.ItemByName("UF_Zone0").Confidence > 0.95 Then
       pXDoc.Reclassify "Antrag_Haftpflicht"
       Exit Sub 'only one reclassify
     End If
  End If
End Sub

As you can see, scripting offers a lot of possibilities for classification. For example, you could use a database locator to identify the sender of a document (if an appropriate master data file exists) and classify the document according to the sender.

Forms often have a unique form number printed in the lower left corner, but rotated by 90°. An ‘Advanced Zone Locator’ can read the rotated number, and the document can then be classified by script using this unique number.
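All script variants in this section follow the same decision pattern: take the best alternative of a locator and reclassify if its confidence exceeds a threshold. If you want to reason about or unit-test that pattern outside KTM, it can be modelled in plain Python. The sketch below is not KTM script and uses no KTM API; locator results are represented as simple lists of confidence values (best first), and the function name `classify_by_locators` is an assumption for illustration.

```python
THRESHOLD = 0.95  # same cut-off as in the KTM scripts above


def classify_by_locators(locator_results, locator_to_class, threshold=THRESHOLD):
    """Return the document type of the first locator that matched confidently.

    locator_results: dict locator name -> list of alternative confidences, best first.
    locator_to_class: dict locator name -> document type, checked in insertion order.
    """
    for locator, doc_type in locator_to_class.items():
        alternatives = locator_results.get(locator, [])
        # Mirrors the script: at least one alternative, and confidence strictly above the threshold.
        if alternatives and alternatives[0] > threshold:
            return doc_type  # stop at the first match, like 'Exit Sub'
    return None  # leave the document to the other classification methods


# Locator and class names taken from the examples in this article.
LOCATOR_TO_CLASS = {
    "BCode": "Barcodeantrag",
    "Antrag_Haft": "Antrag_Haftpflicht",
}
```

For example, `classify_by_locators({"BCode": [], "Antrag_Haft": [0.98]}, LOCATOR_TO_CLASS)` yields ‘Antrag_Haftpflicht’, because the barcode locator found nothing and the format locator matched with high confidence. Keeping the locator-to-class mapping in one table makes the order of the checks explicit, which matters when more than one locator could match.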

Maybe this article gave you some motivation to experiment with KTM’s classification methods and scripting. Have fun 🙂

One more hint for the developers among our readers: my colleague Frank Engelen (he’s working in our Agile Software Factory) has just published an interesting article about data and document classification with the tool ‘RapidMiner’. With a little Java knowledge you may develop your own classification mechanism!

You can find his article here: Taking a look at Java-based Machine Learning by Classification

New: KTM and insurance companies: Document Process Automation


  • adrasha

    21 May 2014 by adrasha

    When using script in KTM, OCR need not be performed for all pages.
    Can we do the same thing in KC (Recognition Server)?

  • Arsh

    7 June 2014 by Arsh

    Hi Juergen,
    I am working on a banking project where most of the documents are unstructured and vary in nature. I am planning to use both layout and content classification.

    Is there any impact, for example on performance, from using both classification methods in one project? Please advise.


    • Jürgen Voss

      7 June 2014 by Jürgen Voss

      Yes, there is a performance impact when using different classification methods.
      Layout classification is the fastest, followed by content classification.
      Layout classification needs no OCR, which is why it is the fastest way of classification.
      On the other hand, if you use format locators for classification by scripting, it will take a considerable amount of processing time, especially if you define a lot of format locators for all your different document types.

      My experience in banking projects is that all the forms of one bank look very similar and layout classification doesn’t work very well. Even content classification doesn’t give good results, as the wording of the forms is very similar and you cannot define a region when using instructions.

      So I often end up using format locators at project level and doing the classification by scripting. It needs more work and the processing takes more time, but the results are more accurate.

      As most of your documents are unstructured, I would start with the ‘learning’ content classification, followed by classification with instructions.

      • Arsh

        Thanks Juergen. How does classification using a format locator differ from instruction classification? Could you please share some sample code for classification by scripting (format locator)?

        • Juergen Voss

          30 June 2014 by Juergen Voss

          Hi Arsh,

          classification with instructions looks for the occurrence of the given words/phrases on all pages that have been OCR’ed. You cannot define a region on a page to restrict the search to that region.

          If you use a format locator to look for the words/phrases, you have the possibility to restrict the search to the region(s) defined within the format locator.
          I have sent you an email with an example project, which demonstrates the use of a format locator to classify a document into a specific document class.

