
Kofax Transformation Modules – format locators and dynamic regular expressions – Part 2


Part 2: Dynamic regular expressions in KTM

In the first part of this blog article I explained the use of KTM format locators and regular expressions. In this part I will explain how KTM projects can be made more flexible by using KTM’s internal scripting language. To follow along, you should be familiar with that scripting language and with the KTM object model.

KTM format locators (see part 1) are static: once their expressions have been defined in the KTM Project Builder, they are used with exactly those values within the Kofax Capture workflow at runtime.

There might, however, be the (admittedly very rare) case that you have to change the regular expression of a format locator at runtime because of external conditions. Unfortunately this doesn’t work ‘out of the box’. But the rich KTM toolbox contains a library which enables this functionality.

Recently we had to set up a document classification/extraction project at a scan service provider who works for financial institutions. The challenge was to develop one project which would work for several clients. We had to deal with document types where the ‘static’ format locators described above could not deliver sufficient results. We needed a format locator whose regular expression could be modified at runtime, depending on client-specific data. Because KTM provides a VB-compatible scripting language, and with some knowledge of the KTM object model, we were able to master this challenge.

The documents of this special document type had the same layout and content for all clients. The only difference was a certain piece of text on the document, which depended on the client. Depending on the location of this text (upper left or upper right corner), an account number had to be read from a field located at the bottom left or the bottom right of the document.

The mapping between the client and the specific text part was provided in an initialization file outside of KTM:

100 ; Hamburg
110 ; Berlin
120 ; Bremen
:
:

If needed, this file can be edited at any time, independently of the KTM project. The client number (100, 110, 120, …) is read by the KTM project at runtime of the Kofax Capture scanning system.
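
A lookup against such a file could be sketched roughly as follows. This is only a minimal sketch and not the code of the original project: the function name GetClientText and its parameters are hypothetical, and it assumes one `number ; text` pair per line, as in the sample above. Lines without a semicolon are skipped.

```vb
' Hypothetical helper (not from the original project): returns the text that is
' mapped to a client number in a "number ; text" file, or "" if nothing is found.
Private Function GetClientText(ByVal IniPath As String, ByVal ClientNo As String) As String
   Dim FileNo As Integer
   Dim CurLine As String
   Dim SepPos As Long

   GetClientText = ""
   FileNo = FreeFile
   Open IniPath For Input As #FileNo
   Do While Not EOF(FileNo)
      Line Input #FileNo, CurLine
      ' Position of the separator between client number and client text
      SepPos = InStr(CurLine, ";")
      If SepPos > 0 Then
         If Trim(Left(CurLine, SepPos - 1)) = Trim(ClientNo) Then
            GetClientText = Trim(Mid(CurLine, SepPos + 1))
            Exit Do
         End If
      End If
   Loop
   Close #FileNo
End Function
```

With the sample file shown above, a call with the client number "110" would return Berlin. An unknown client number yields an empty string, which the calling script should handle before it builds a regular expression from it.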

Within the KTM project we defined a format locator, which checks if the client specific text (Hamburg, Berlin, Bremen, …) is printed in the upper left or upper right corner. The regular expression of this format locator was dynamically fed with the client’s specific text (Hamburg, Berlin, Bremen, …) during the runtime of the KTM project.  That way we succeeded in changing the regular expression of a format locator during runtime only by editing an external initialization file.
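
One detail is worth keeping in mind when a text from an external file is fed into a regular expression: characters such as a dot or a bracket have a special meaning there. The following is a minimal sketch of a helper that escapes such characters. The name EscapeRegEx is hypothetical, and the sketch assumes that the format locator treats the usual metacharacters specially and accepts a preceding backslash as escape, as the \d in the expressions of this article suggests.

```vb
' Hypothetical helper (not from the original project): puts a backslash in front
' of the usual regex metacharacters so that the client text is matched literally.
Private Function EscapeRegEx(ByVal Literal As String) As String
   Dim MetaChars As String
   Dim i As Long
   Dim Ch As String
   Dim Result As String

   MetaChars = "\.^$|?*+()[]{}"
   Result = ""
   For i = 1 To Len(Literal)
      Ch = Mid(Literal, i, 1)
      ' Escape the character if it is one of the metacharacters
      If InStr(MetaChars, Ch) > 0 Then Result = Result & "\"
      Result = Result & Ch
   Next i
   EscapeRegEx = Result
End Function
```

A plain text like Hamburg passes through unchanged, while a made-up value such as "St. Pauli" would become "St\. Pauli". The escaped string can then be assigned to the locator like any other regular expression.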

Because of the complexity of the described project, I will explain the dynamic change of the regular expression with the simple example of the insurance number from part 1 of this blog article.

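First of all you have to create a reference to a KTM library. The script further down declares a variable of type CscRegExpLib.CscRegExpLocator, so this must be the library that provides CscRegExpLib. The reference is added in the Project Builder scripting environment on the appropriate document class: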

[Screenshot: KTM-Referenz]

In part 1 of this article we set up a format locator FL_VSNR with the regular expression 20\d{2}/\d{1,10}:

[Screenshot: KTM-FL-Formats75]

In order to change the regular expression at runtime, you have to insert a scripting locator (SL_ChangeRE in the screenshot below) ABOVE the format locator whose regular expression is to be changed. Because it is defined above the format locator, the scripting locator is executed first and can change the regular expression before the format locator is evaluated.

[Screenshot: KTM-Scriptlocator]

The scripting locator SL_ChangeRE consists of the following piece of scripting, which changes the regular expression of the format locator FL_VSNR to the new value 20\d{2}/\d{2}:

' Class script: Dokumente
Private Sub SL_ChangeRE_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, _
            ByVal pLocator As CASCADELib.CscXDocField)

   Dim NewRegEx As String
   Dim oLocator As CscRegExpLib.CscRegExpLocator

   ' Get the format locator FL_VSNR
   Set oLocator = Project.ClassByName("Dokumente").Locators.ItemByName("FL_VSNR").LocatorMethod
   ' Set the new regular expression for FL_VSNR
   NewRegEx = "20\d{2}/\d{2}"
   oLocator.RegularExpressions(0).RegularExpression = NewRegEx
End Sub

The behaviour of this script can be tested directly within the KTM Project Builder with our test document:

*** Remark: Versicherungsnummer = insurance number ***

[Screenshot: KTM-Documentpart]

The original format locator (see part 1):

[Screenshot: KTM-FL-Formats75]

The document extraction will result in the known value 2011/47123, if the scripting locator is not used:

[Screenshot: KTM-Result1]

If the scripting locator is used, the result changes to 2011/47, because the new expression allows only two digits after the slash:

[Screenshot: KTM-Result2]
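
The difference between the two expressions can also be checked outside of KTM. The following minimal sketch is an illustration only: it uses the Windows VBScript.RegExp COM object and not the engine of the KTM format locator, and the function name FirstMatch is hypothetical. It assumes that both engines treat \d and the {n,m} quantifier in the same way.

```vb
' Illustration only: uses the Windows VBScript.RegExp COM object, not the KTM
' format locator, to show which part of a text an expression matches.
Private Function FirstMatch(ByVal Expr As String, ByVal Subject As String) As String
   Dim RegEx As Object
   Dim Matches As Object

   FirstMatch = ""
   Set RegEx = CreateObject("VBScript.RegExp")
   RegEx.Pattern = Expr
   Set Matches = RegEx.Execute(Subject)
   ' Return the first match, or "" if the expression does not match at all
   If Matches.Count > 0 Then FirstMatch = Matches.Item(0).Value
End Function

' FirstMatch("20\d{2}/\d{1,10}", "2011/47123") returns "2011/47123"
' FirstMatch("20\d{2}/\d{2}", "2011/47123")    returns "2011/47"
```

The second expression stops after two digits behind the slash, which is why the remaining digits 123 are missing from the result.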

If you take a look at the format locator in Project Builder after the extraction, you will see that the regular expression has actually been changed:

[Screenshot: KTM-Formatlocatorchanged]

I hope this look inside some parts of the KTM toolbox shows that KTM is indeed a very configurable product. I am looking forward to any further hints or tricks on the usage of KTM tools and its scripting language. Within the next months I will try to publish more articles about KTM here.


Comments

  • Mark Ortiz

    4 February 2013 by Mark Ortiz

    A very interesting article. I have not played around with the extra references in scripting, so it’s good to know that you’ve done some useful things with them. I probably would have had two format locators, and then chosen between them based on the client. But there are many ways to achieve the same results, and your way looks very useful. Thanks.

    • Jürgen Voss

      4 February 2013 by Jürgen Voss

      Mark, thanks for your kind words.
      As you said, we also tried the approach with two client-based format locators first. With this approach we used dictionaries within the format locators, which could be edited outside of KTM too. But the dictionary system in KTM was not flexible enough for the client’s text phrases, so we used the described solution of changing the regex.

      • Mark Ortiz

        5 February 2013 by Mark Ortiz

        I see, thanks. Without wishing to jump the gun on future posts – have you used any of the other references? There’s lots of things I’d like to try, if I had time. But the documentation isn’t good, so it’s more a case of trial and error. One thing I’d like to do is a ‘matrix’ of classifications. E.g. AFC finds the correspondence types. IC finds the organisation. They are then added up to classify as the organisation and type. Obviously it can already work this way to an extent, but not quite how I want it.

  • alpesh

    29 May 2015 by alpesh

    Hi,
    I am working on KTM and am facing the issue below. It would be a great help if you could mail me. mail: alpeshviranik@gmail.com
    When extracting text with a format locator, how can I extract multiple lines from the document? For example, in the following text:
    Bank name: WELLS FARGO
    BANK

    Date: January 6,
    2012

    • Jürgen Voss

      30 June 2015 by Jürgen Voss

      Hi Alpesh,

      let’s suppose you have defined a format locator ‘Locator’ which finds something on a page (e.g. WELLS FARGO).
      If you want to retrieve the next line ‘south’ of ‘WELLS FARGO’ (in your case ‘Date: January 6, 2012’), you could use the following script in Document_AfterExtract of the classified document class:

      Private Sub Document_AfterExtract(ByVal pXDoc As CASCADELib.CscXDocument)
         Dim LocatorLine As Integer
         Dim NextLine As String

         If pXDoc.Locators.ItemByName("Locator").Alternatives.Count > 0 Then
            ' Get the line number of the first locator alternative
            LocatorLine = pXDoc.Locators.ItemByName("Locator").Alternatives(0).Words.ItemByIndex(0).LineIndex
            ' Get the next line 'south' of the alternative
            ' (in production you should first check that this line exists)
            NextLine = pXDoc.TextLines.ItemByIndex(LocatorLine + 1).Text
            ' You could put the content of NextLine into a field, for example:
            'pxdoc.fields...=nextLine
         End If
      End Sub
