One of our insurance customers uses Kofax Capture and Kofax Transformation Modules (KTM) to capture and classify incoming documents (letters, faxes and emails). After classification, KTM also extracts the business data from the documents.
Because a document is classified as a contract termination early on, the insurance company can take action to convince the customer to withdraw the termination. The remaining terminations, however, should be processed automatically where possible. To enable this, KTM extracts the termination date from the documents. This date is then used to start the internal termination process.
When extracting the date with KTM format locators and regular expressions, we ran into problems as soon as a line break occurred within the termination phrase of a document. Below I explain how these problems were solved with KTM tools.
People express the termination date in two different ways:
1. Explicit date: …I terminate the contract as of 31.12.2016…
2. Indirect date: …I terminate the contract at the next possible date…
Both variants can easily be extracted with KTM format locators and regular expressions, as long as the phrase stays on one line.
1. Extraction of an explicitly mentioned date:
Remark: the German words ‘zum’ and ‘dem’ correspond to ‘as of’ in this context.
The locator will extract the desired date:
The problem occurs if the date and the terms (‘as of’, …) are separated not by a blank but by a line break:
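To make the failure concrete, here is a minimal Python sketch (not KTM script). It assumes that the pattern is applied to each text line separately. The pattern itself is hypothetical and only stands in for the regular expression of the format locator.

```python
import re

# Hypothetical pattern: a term ('zum' or 'dem'), one blank, then a DD.MM.YYYY date.
EXPLICIT_DATE = re.compile(r"\b(?:zum|dem) (\d{2}\.\d{2}\.\d{4})")

def find_dates_per_line(text):
    """Apply the pattern to each line separately and collect the dates found."""
    dates = []
    for line in text.splitlines():
        dates.extend(EXPLICIT_DATE.findall(line))
    return dates

same_line = "Ich kündige den Vertrag zum 31.12.2016."
broken = "Ich kündige den Vertrag zum\n31.12.2016."

print(find_dates_per_line(same_line))  # ['31.12.2016']
print(find_dates_per_line(broken))     # []
```

As soon as the term and the date end up on different lines, neither line matches the pattern on its own.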
In this case a combination of two format locators and a relation evaluator may help:
A format locator finds the terms (‘as of’, …) at the end of a line:
The $ character ensures that only terms at the end of a line are found.
Another format locator searches for dates at the beginning of a line:
The ^ character ensures that only dates at the beginning of a line are found.
The relation evaluator (in German: Geometrie-Evaluator) can now be used to find all dates (at the beginning of a line) that are located below the termination terms (at the end of a line):
The result of the relation evaluator can then be accepted or rejected based on its confidence, which depends on the distance between the two hits.
Instead of using the relation evaluator, a script could be used to check if the result of the ‘date format locator’ is one line below the result of the ‘terms format locator’.
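The logic of such a check can be sketched in Python (not KTM script) as follows. The sketch works on plain text lines rather than on the geometric positions a relation evaluator uses, and both patterns are hypothetical stand-ins for the two format locators.

```python
import re

# Hypothetical stand-ins for the two format locators.
TERM_AT_LINE_END = re.compile(r"\b(?:zum|dem)\s*$")            # '$': term ends the line
DATE_AT_LINE_START = re.compile(r"^\s*(\d{2}\.\d{2}\.\d{4})")  # '^': date starts the line

def dates_below_terms(text):
    """Return dates that start the line directly below a line ending in a term."""
    lines = text.splitlines()
    found = []
    # Pair every line with the line that follows it.
    for upper, lower in zip(lines, lines[1:]):
        if TERM_AT_LINE_END.search(upper):
            match = DATE_AT_LINE_START.match(lower)
            if match:
                found.append(match.group(1))
    return found

letter = (
    "hiermit kündige ich den Vertrag zum\n"
    "31.12.2016. Bitte bestätigen Sie dies.\n"
    "Geburtsdatum:\n"
    "01.02.1980"
)
print(dates_below_terms(letter))  # ['31.12.2016']
```

The date of birth in the last line is ignored because the line above it does not end with one of the terms.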
2. With indirectly mentioned dates a similar approach can be used:
If the document contains only an indirect date (…I terminate the contract at the next possible date…), the insurance company needs a ‘dummy date’ for their business process. In these cases KTM provides ‘01.01.1970’.
This can simply be done by using KTM’s format locators:
Remark: the German terms above stand for ‘I terminate the contract at the next possible date’.
In practice we have a long list of terms that indicate an indirectly mentioned date.
The format locator delivers a hit if one of the terms appears in the document:
If there is a hit, the date must be set by script to ‘01.01.1970’.
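The rule "any hit yields the dummy date" can be sketched in Python (not KTM script) as follows. The two terms are a hypothetical excerpt, not the customer's real list, and the sketch again assumes that matching happens line by line.

```python
import re

DUMMY_DATE = "01.01.1970"

# Hypothetical excerpt of the term list; the real list is much longer.
INDIRECT_TERMS = [
    r"zum nächstmöglichen (?:Termin|Zeitpunkt)",
    r"zum nächsten möglichen (?:Termin|Zeitpunkt)",
]

def indirect_date(text):
    """Return the dummy date if any term occurs within a single line, else None."""
    for line in text.splitlines():
        for term in INDIRECT_TERMS:
            if re.search(term, line, flags=re.IGNORECASE):
                return DUMMY_DATE
    return None

print(indirect_date("Ich kündige den Vertrag zum nächstmöglichen Termin."))   # 01.01.1970
print(indirect_date("Ich kündige den Vertrag zum nächstmöglichen\nTermin."))  # None
```

The second call shows the limitation of line-based matching: once the phrase is split across two lines, no term is found.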
However, the format locator fails again if there are line breaks within the searched terms:
Because many different ‘word/line break’ combinations are possible, there is no simple solution with a relation evaluator. A possible solution is to search for the terms in the full text of the document’s OCR result by script.
Ideally, the script reuses the terms that are already defined in the regular expressions of the format locator ‘DatumIndirekt’. That way the terms (in practice there are a lot of them) have to be maintained in one place only.
First of all, the script has to read the terms from the regex of the format locator. To achieve that, the KTM-DLL ‘Kofax Cascade Regular Expressions Locator’ has to be referenced in the scripting environment:
The following function will read the terms from the regex part of the format locator and search for them in the OCR fulltext of the first document page:
    Function TermFound(ByVal pXDoc As CASCADELib.CscXDocument) As Boolean
        'Returns True if one of the terms from locator 'DatumIndirekt' is found within the OCR fulltext
        Dim oLocator As CscRegExpLib.CscRegExpLocator
        Dim i As Integer
        Dim Terms() As String
        Dim Fulltext As String

        'Init return value with False
        TermFound = False

        'Get format locator 'DatumIndirekt'
        Set oLocator = Project.ClassByName("YourClass").Locators.ItemByName("DatumIndirekt").LocatorMethod
        ReDim Terms(oLocator.RegularExpressions.Count - 1)

        'Put the regular expressions from format locator 'DatumIndirekt' into the array Terms()
        For i = 0 To oLocator.RegularExpressions.Count - 1
            Terms(i) = oLocator.RegularExpressions.ItemByIndex(i).RegularExpression
        Next

        'Now get the OCR fulltext from page 1
        Fulltext = pXDoc.Pages.ItemByIndex(0).Text

        'Remove tabs, blanks, CR and LF
        Fulltext = Replace(Fulltext, Chr(9), "")   'Tab
        Fulltext = Replace(Fulltext, " ", "")      'Blank
        Fulltext = Replace(Fulltext, Chr(13), "")  'CR
        Fulltext = Replace(Fulltext, Chr(10), "")  'LF
        '...

        'Search for Terms() in Fulltext.
        'The terms are compared as literal text, so they must not contain regex syntax.
        'Blanks are removed from each term as well, because Fulltext no longer contains any.
        For i = 0 To UBound(Terms)
            If InStr(Fulltext, Replace(Terms(i), " ", "")) > 0 Then
                'Bingo! Return True
                TermFound = True
                Exit For
            End If
        Next
    End Function
This way the terms are maintained in one place only, the format locator ‘DatumIndirekt’, and all occurrences of the terms are found, with or without line breaks.
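The underlying idea, a whitespace-insensitive literal search, can also be tried outside KTM. The following Python sketch is only an illustration. The function names and the two sample terms are hypothetical, and the terms are assumed to be plain text without regex syntax.

```python
import re

def strip_whitespace(text):
    """Remove blanks, tabs, CR and LF (all whitespace) from the text."""
    return re.sub(r"\s+", "", text)

def term_found(fulltext, terms):
    """Return True if any term occurs in the full text, ignoring all whitespace."""
    compact = strip_whitespace(fulltext)
    # The terms must be stripped as well, otherwise their blanks prevent a match.
    return any(strip_whitespace(term) in compact for term in terms)

# Hypothetical sample terms, treated as literal text.
terms = ["zum nächstmöglichen Termin", "zum nächstmöglichen Zeitpunkt"]
ocr_text = "Ich kündige den Vertrag zum\nnächstmöglichen\tTermin."

print(term_found(ocr_text, terms))                             # True
print(term_found("Bitte senden Sie mir ein Angebot.", terms))  # False
```

Removing all whitespace makes the search robust against any line break position, at the price that word boundaries are no longer checked.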
If you want to maintain the terms in an external text file instead of inside KTM, a similar method can be used to change the regular expressions of a format locator during runtime: