Kofax Transformation Modules (KTM) offers several tools for document classification and data extraction. There are some older blog articles about these tools:
– Document classification
– Data extraction with format locators
– Machine Learning
With the release of Service Pack 6.3.1, the current KTM version 6.3 gained another interesting tool: Natural Language Processing (NLP).
Natural Language Processing analyses the text of a document and tries to understand the relations between words in order to gain information and knowledge.
The Kofax NLP package appears to be based on the Salience engine from Lexalytics.
Because NLP was introduced by a service pack, its documentation is still rudimentary. The main KTM documentation doesn’t contain anything about NLP; it will be updated with the next main KTM release. However, parts of the NLP documentation are included in the Readme of the service pack.
This article provides an overview of the NLP installation, the two new NLP-related locators (extraction tools), and custom extensions to entities and sentiments.
1. NLP package: installation
2. Entities and sentiments
3. Custom entities and sentiments
1. NLP package: installation
The NLP package is not included in the installation sources of KTM 6.3 or Service Pack 6.3.1. However, the service pack includes two new locators that make use of the NLP package.
NLP is a separate installation package, which can be downloaded from delivery.kofax.com. The download consists of three MSI installation files that cover different languages:
The three packages can be installed one after the other.
2. Entities and sentiments
The known tools (format locators, database locators, etc.) enable us to extract a lot of information from documents. But most of it is one-dimensional data such as dates, numbers, amounts or strings.
The new entity locator enables KTM to recognize objects like persons, products, places, URLs, email addresses, schools, organizations, cities and so on. This locator is configurable. The result can be a list of all found entities (no matter which type), or the search can be restricted to a single entity type.
The second new locator (sentiment locator) tries to determine the sentiment of a document. Is there a positive mood within the text, or is the document a complaint with negative wording? The sentiment locator cannot be configured, apart from the region settings known from the other locators.
2.1 Entity locator
The entity locator can be used on any document class in the KTM class tree, just like every other locator. There is one peculiarity: in the class properties you have to configure the language to be used by the new locator. You can select from the languages that were installed by the three MSI packages. This new setting can be found at the bottom of the class properties:
The following languages are selectable:
The upcoming examples will use the following letter:
The entity locator offers three different modes:
Mode 1: simple field and no filter on entity type
The result of this mode is a list of all entities found in the document:
Mode 2: simple field and filter on entity type
As expected, you will only get the results which meet the filter on entity type:
In both modes, the result list contains only the entity text and not the entity type. This is inconvenient, especially in mode 1, where no filter on entity type is set.
But there is a third mode:
Mode 3: table field with or without filter on entity type
Before using this mode, you first have to create a simple table model (the same kind of model that is used with the table locator). I named the columns simply Text, Confidence, EntityType and Sentiment, and the table model is named Entity. This mode delivers the most meaningful results:
The result is a table with values for confidence, value/text, type and sentiment for each row. However, I doubt that the sentiment value is meaningful in this context.
What are the advantages and disadvantages of mode 1/2 and mode 3?
In mode 1/2, the result (the best alternative) can be assigned directly to a KTM field. However, the field will only contain the value and not the type of the entity. The other alternatives of the locator can be queried by a KTM script as usual.
The results of mode 3 can only be assigned to a KTM table field, as they form a table. You get all alternatives in this table field, and you get access to all values (confidence, value, entity type, sentiment) instead of only the entity value as in mode 1/2.
Overall, I prefer the table mode of the entity locator because I have access to all result information at a single point.
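To make the difference between the modes more tangible, here is a minimal Python sketch that models the locator output as plain data. It does not use any KTM API: the class EntityRow, the helper functions, the entity type names and the sample rows are all hypothetical and only illustrate which information survives in each mode.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EntityRow:
    """One row of the table result; fields mirror the columns of the table model."""
    text: str
    confidence: float
    entity_type: str
    sentiment: float


def simple_field_result(rows: List[EntityRow], entity_type: Optional[str] = None) -> List[str]:
    """Mode 1 (entity_type is None) or mode 2 (filter set): only the entity text is kept."""
    return [r.text for r in rows if entity_type is None or r.entity_type == entity_type]


def table_result(rows: List[EntityRow], entity_type: Optional[str] = None) -> List[EntityRow]:
    """Mode 3: complete rows are kept, with or without a filter on entity type."""
    return [r for r in rows if entity_type is None or r.entity_type == entity_type]


# Hypothetical sample data, not taken from a real extraction run.
rows = [
    EntityRow("John Smith", 0.92, "Person", 0.0),
    EntityRow("Hamburg", 0.88, "Place", 0.1),
    EntityRow("info@example.com", 0.99, "Email", 0.0),
]

print(simple_field_result(rows))            # texts only, the types are lost
print(simple_field_result(rows, "Person"))  # texts of one entity type
print(table_result(rows, "Place"))          # full rows including the type
```

In the simple-field modes the caller can no longer tell a person from a place, which is exactly the drawback described above.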
2.2 Sentiment locator
This locator tries to find out the basic mood of a document.
The result of this locator is a value between -1.00 and +1.00 and may be assigned to a simple KTM field. Kofax explains the range of values as follows:
Positive: 0.12 to 1.00
Neutral: -0.025 to 0.11
Negative: -0.026 to -1.00
The sentiment locator automatically searches for a basic mood and has no configurable settings. But you can restrict the search area by specifying regions, as known from other KTM locators.
The properties page of the sentiment locator:
A test with the sample document above gives the following result:
The value of +0.227859 suggests a positive general mood of the document, which is correct in my opinion.
I cannot give a final assessment of this locator yet because I have not had enough documents with ‘moods’ available to me. But since the locator is very easy to use, you can just let it run in various projects and gain experience with it.
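If you want to turn the numeric result into a label, the ranges quoted by Kofax can be sketched as a small helper. This is my own illustration in Python, not part of KTM. Note that the published ranges leave small gaps (between 0.11 and 0.12, and between -0.026 and -0.025); the sketch assumes that values in these gaps count as neutral.

```python
def classify_sentiment(score: float) -> str:
    """Map a sentiment score to a label using the ranges quoted by Kofax.

    Assumption: values that fall into the gaps between the published
    ranges are treated as neutral.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score out of range [-1, 1]: {score}")
    if score >= 0.12:
        return "positive"
    if score <= -0.026:
        return "negative"
    return "neutral"


# The value from the sample document above.
print(classify_sentiment(0.227859))  # positive
```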
3. Custom entities and sentiments
3.1 Custom entities
The NLP of KTM 6.3.1 contains several predefined entities such as persons, places, etc. Furthermore, the NLP offers the possibility to define custom, project-specific entities and use them for your document processing.
An example of when to use custom entities:
A company provides products that are broken down into categories, and these categories each have multiple model numbers. For basic correspondence the model number is not needed but the category is always needed. Because of this, you can provide a custom entity file that contains a list of model numbers for each category. When extraction is performed, a model number is recognized but the category is returned in the extraction results.
The procedure for using custom entities:
The project specific entity file must be located within the path of the KTM project:
…project directory\Custom\SalienceData\en\salience\entities\<entity directory>
If the path doesn’t exist, you have to create it manually.
For other languages, replace en with the corresponding language code (de = German, es = Spanish, etc.).
The name of <entity directory> defines the entity type which will be returned by the entity locator in table mode.
The entity file with the custom content must be placed in the entity directory. The name of the entity file doesn’t matter, but the file extension must be .cdl.
The lines of the cdl-file must have the following structure:
Search Text<tab>Entity Label<tab>Entity Name
‘Entity Label’ is not supported yet, but it can be included in the file. Even if you don’t use an entity label, both <tab> characters must appear in the line.
‘Search Text’ is the text that is searched for in the document, and ‘Entity Name’ is returned as the result of the locator in the text column.
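Since a missing <tab> is an easy mistake to make, it can help to check the lines of a cdl-file before copying it into the project. The following Python sketch is only my own illustration of the line structure described above; the names CdlEntry and parse_cdl_line and the sample line are hypothetical and not part of KTM or the Salience engine.

```python
from typing import NamedTuple


class CdlEntry(NamedTuple):
    search_text: str
    entity_label: str  # may be empty, but its <tab> separators must be present
    entity_name: str


def parse_cdl_line(line: str) -> CdlEntry:
    """Split one line into Search Text, Entity Label and Entity Name."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        raise ValueError(
            f"expected 3 tab-separated fields, got {len(fields)}: {line!r}"
        )
    search_text, entity_label, entity_name = fields
    if not search_text or not entity_name:
        raise ValueError(f"search text and entity name must not be empty: {line!r}")
    return CdlEntry(search_text, entity_label, entity_name)


# Hypothetical line with an empty entity label: both tabs are still there.
print(parse_cdl_line("X100\t\tPrinters\n"))
```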
For example, suppose you need information about vehicle categories from your documents. First you create an entity file MyEntity.cdl:
VW Tiguan<tab>car<tab>Volkswagen (car)
VW Golf<tab>car<tab>Volkswagen (car)
VW Bulli<tab>truck<tab>Volkswagen (truck)
Skoda Octavia<tab>car<tab>Skoda (car)
Skoda Fabia<tab>car<tab>Skoda (car)
Honda CBR 650<tab>motorbike<tab>Honda (motorbike)
Honda CBR 123<tab>motorbike<tab>Honda (motorbike)
The file is stored in the following path:
The directory vehicles defines the type of your custom entity.
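If you maintain such lists outside of KTM, the entity file can also be generated. The following Python sketch builds the directory structure described above below a given project directory and writes the entries as tab-separated lines. The function names are hypothetical, and writing the file as UTF-8 is an assumption on my part; I have not verified which encoding the Salience engine expects.

```python
from pathlib import Path
from typing import Iterable, Tuple


def custom_entity_dir(project_dir: str, language: str, entity_type: str) -> Path:
    """Directory for custom entities of one type, relative to the KTM project."""
    return (
        Path(project_dir) / "Custom" / "SalienceData" / language
        / "salience" / "entities" / entity_type
    )


def write_cdl(
    project_dir: str,
    language: str,
    entity_type: str,
    file_name: str,
    entries: Iterable[Tuple[str, str, str]],
) -> Path:
    """Write (search text, entity label, entity name) tuples to a .cdl file."""
    if not file_name.lower().endswith(".cdl"):
        raise ValueError("the file extension must be .cdl")
    lines = []
    for entry in entries:
        if len(entry) != 3 or any("\t" in value for value in entry):
            raise ValueError(f"invalid entry: {entry!r}")
        lines.append("\t".join(entry))
    target_dir = custom_entity_dir(project_dir, language, entity_type)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the path if it is missing
    target = target_dir / file_name
    # Assumption: UTF-8 text, one entry per line.
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```

Calling write_cdl with your project directory, "en", "vehicles", "MyEntity.cdl" and the seven entries from the example would produce the file shown above. The Salience engine still has to be stopped once afterwards so that the new entities are picked up.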
Testing with a suitable document:
gives the following result:
Unfortunately, the custom entities do not appear as an entity type in the entity type filter.
In order to make the new custom entities known to KTM (in Project Builder), the underlying Salience engine must be stopped once. This can be done in the properties of the document class where the entity locator was defined:
The engine is automatically restarted during the next extraction run.
3.2 Custom sentiments
Similar to the entities, you can also add custom definitions to the sentiment locator. The NLP provides a sentiment definition file for many languages.
The English sentiment file is located at this path:
C:\Program Files (x86)\Common Files\Kofax\Salience6.4\en\salience\sentiment
and its name is general.hsd.
This is an extract from the provided general.hsd:
Each line contains the rated phrase, followed by a <tab> and the rating, a value between -1 and +1.
To change the rating of certain phrases or add new phrases you have to copy the file general.hsd into the path of your KTM project:
There you can edit the copied file general.hsd and add your custom phrases and ratings.
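The copy-and-edit step can also be scripted. The following Python sketch copies the installed general.hsd to a target file (only if it does not exist there yet) and then overrides or appends ratings. It is my own illustration under several assumptions: the function name is hypothetical, both file locations are passed in as parameters, the file is treated as UTF-8 text, and every line is expected to follow the phrase<tab>rating structure described above.

```python
import shutil
from pathlib import Path
from typing import Dict


def customise_sentiments(installed_hsd: Path, project_hsd: Path,
                         ratings: Dict[str, float]) -> None:
    """Copy general.hsd into the project once, then add or override ratings."""
    for phrase, rating in ratings.items():
        if not phrase or "\t" in phrase:
            raise ValueError(f"invalid phrase: {phrase!r}")
        if not -1.0 <= rating <= 1.0:
            raise ValueError(f"rating must be between -1 and +1: {rating}")

    project_hsd.parent.mkdir(parents=True, exist_ok=True)
    if not project_hsd.exists():
        shutil.copyfile(installed_hsd, project_hsd)  # the installed file stays untouched

    seen = set()
    output = []
    # Assumption: UTF-8 text with one "phrase<tab>rating" entry per line.
    for line in project_hsd.read_text(encoding="utf-8").splitlines():
        phrase = line.split("\t", 1)[0]
        if phrase in ratings:
            output.append(f"{phrase}\t{ratings[phrase]}")  # override existing rating
            seen.add(phrase)
        else:
            output.append(line)  # keep all other lines unchanged
    # Append phrases that were not in the file yet.
    output.extend(f"{p}\t{r}" for p, r in ratings.items() if p not in seen)
    project_hsd.write_text("\n".join(output) + "\n", encoding="utf-8")
```

Only the copy in the project is modified, so the file delivered with the NLP package remains available as a reference.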
In order to make the new custom sentiments known to KTM (in Project Builder), the underlying Salience engine must be stopped once, as described under ‘custom entities’.
The Natural Language Pack for Kofax Transformation Modules is an interesting and promising extension for document classification and data extraction. I hope this article was able to provide an initial overview of the possibilities of the NLP and to fill in for the Kofax documentation that is currently missing.