The topics AI, machine learning and deep learning are on everyone’s lips, and the media regularly publish articles on them. What many do not know is that Kofax Transformation Modules (KTM) also provides machine learning mechanisms. KTM is a system for the automatic classification of documents and the extraction of data fields (see also: Document classification with Kofax Transformation Modules).
KTM has always included machine learning tools, which can be used alone or together with rule-based free-form recognition. These neural network-based methods are briefly described here.
A KTM project consists of the following phases:
- Project preparation: Document types, data fields, clustering
- Project implementation: Classification and extraction design
- Production: Capturing, classification, extraction, manual validation
Prior to extraction, the classification of the document has to be done because different types of documents normally have different extraction fields. Once the classification has been successfully carried out, the document type-specific field extraction can be started.
KTM provides tools from the field of machine learning for the project preparation as well as for the implementation of the project and the production phase in order to train the system and improve the quality of the results successively.
Through training, learning systems recognize the context of a field and store it for future use. KTM does not memorize the absolute position of a field, but saves the environment in which the field is located. This can include words located nearby (and their distances to the field), the position relative to other fields, and also lines or similar objects. The newly learned context is available as soon as the next document is processed, so the field value can be extracted directly from a similar document, hopefully! “Hopefully” was inserted because such systems are not deterministic and some document types must be trained several times.
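The idea of storing the surroundings of a field instead of its absolute position can be illustrated with a small sketch. This is not KTM's implementation: the `Word` type, the function names and the `radius` parameter are assumptions made only for this example. Nearby words are stored together with their offset to the field, and on a new page every recognized word "votes" for a field position.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Word:
    """A recognized word with its page position (hypothetical structure)."""
    text: str
    x: float
    y: float


def learn_context(words, field_xy, radius=200.0):
    """Store nearby words with their offset to the field, not the field position itself."""
    fx, fy = field_xy
    return [
        (w.text.lower(), fx - w.x, fy - w.y)
        for w in words
        if abs(fx - w.x) <= radius and abs(fy - w.y) <= radius
    ]


def predict_position(context, words):
    """Every learned word found on the new page votes for a field position."""
    votes = [
        (w.x + dx, w.y + dy)
        for w in words
        for text, dx, dy in context
        if w.text.lower() == text
    ]
    if not votes:
        return None  # nothing familiar on this page: more training is needed
    # The median makes the estimate robust against single misleading votes.
    return (median(v[0] for v in votes), median(v[1] for v in votes))
```

If the same layout is shifted on the next page, for example by a different scan margin, the predicted position shifts with it, because only the relative offsets were stored.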
The KTM toolbox for machine learning consists of the following elements:
- Clustering Tool: Provides basic information about the document types. Which are the main (most important) types?
- Administrative training with examples of the main document types: Document type classification
- Administrative training with examples of the main document types: Extraction of the field data
- Production cycle: System learns by manual assignment of the document type
- Production cycle: System learns by manual field correction/data entry
The Clustering Tool
At the beginning of a recognition project, it should first be clarified which documents promise the most “profit”. Which document types are worthwhile for training – and which should not be considered initially?
KTM includes a tool (the clustering tool) that analyzes unsorted batches of documents and divides them into batches with similar characteristics. This sorting can be done according to graphical criteria as well as according to content. After using this tool, you usually have a very good impression of which document types are the main ones in a project and should therefore be trained first.
This example shows that one should first concentrate on processing the generated batches 1, 5 and 4. Batch 4 contains 36 “CAR Parts Co-Delivery Note” documents.
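To make the principle of content-based clustering tangible, here is a minimal sketch. It is not the algorithm used by the KTM clustering tool; the greedy grouping, the Jaccard similarity on word sets and the `threshold` value are assumptions chosen for brevity. The largest groups come first, which corresponds to the question of which document types to train first.

```python
def jaccard(a, b):
    """Share of common words between two word sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_documents(texts, threshold=0.5):
    """Greedily group documents whose word sets are similar enough."""
    clusters = []  # each cluster: {"tokens": word set of its first document, "members": [...]}
    for index, text in enumerate(texts):
        tokens = set(text.lower().split())
        best = max(clusters, key=lambda c: jaccard(tokens, c["tokens"]), default=None)
        if best is not None and jaccard(tokens, best["tokens"]) >= threshold:
            best["members"].append(index)
        else:
            clusters.append({"tokens": tokens, "members": [index]})
    # Largest clusters first: these are the candidates for the main document types.
    return sorted((c["members"] for c in clusters), key=len, reverse=True)
```

A real tool would also consider graphical features; this sketch only looks at the text content.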
Administrative training of document types
For this, you use the main document types that have been determined by the clustering tool. Within the KTM development environment, the document types are created manually, or they can be created automatically from the batches. For each document type, an administrator assigns a number of sample documents to the system for learning. This number is project-dependent, but in real projects about 20 documents per document type have proven sufficient. The training of the document types can be based on the layout and/or the text content of the documents.
The success of this training can be immediately checked using the non-trained examples of the document type batches.
In real life projects, I only trust the classification result obtained by learning when a certain confidence level has been reached (e.g. 80%). At lower values, additional document-type-specific rules are used to determine the document type.
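The combination of a learned result with rule-based fallbacks could look roughly like the following sketch. The function name, the keyword rules and the return values are assumptions for illustration only and do not represent KTM's API; only the 80% threshold is taken from the text above.

```python
def classify(learned_result, text, rules, min_confidence=0.80):
    """Trust the learned classification only above a confidence threshold.

    learned_result: (document type, confidence between 0 and 1)
    rules: hypothetical mapping of document type -> keywords that must all occur
    """
    doc_type, confidence = learned_result
    if confidence >= min_confidence:
        return doc_type, "learned"
    lowered = text.lower()
    for rule_type, keywords in rules.items():
        if all(keyword in lowered for keyword in keywords):
            return rule_type, "rule"
    return None, "manual"  # neither learning nor rules were sure: leave it to a person
```

For example, `classify(("Invoice", 0.55), "Delivery Note no. 17", {"DeliveryNote": ["delivery note"]})` ignores the low-confidence learned result and falls back to the keyword rule.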
Administrative training of field extraction
After the training of the document types, the extraction of the document type-specific data fields can be done in the next step. Similar to the training of the document types, a certain number of documents is taken per document type. Training is done by just showing the system the position on the document where the data for a field should be extracted. This is simply done by using mouse clicks. KTM does not memorize the absolute positions, but stores features (graphics, words, lines etc.) near the extraction position.
Again, the success of this training can be immediately checked using the non-trained examples of the document type batches.
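Such a check against non-trained examples amounts to a per-field hit rate. The following sketch shows the calculation under the assumption that an extractor is available as a plain function returning a dictionary of field values; `extract` and the data layout are hypothetical and not part of KTM.

```python
def field_accuracy(extract, held_out):
    """Compute the share of correctly extracted values per field.

    extract: callable that maps a document to a dict of field values (assumed)
    held_out: list of (document, expected field values) that were NOT used for training
    """
    correct, total = {}, {}
    for document, expected in held_out:
        result = extract(document)
        for name, value in expected.items():
            total[name] = total.get(name, 0) + 1
            if result.get(name) == value:
                correct[name] = correct.get(name, 0) + 1
    return {name: correct.get(name, 0) / total[name] for name in total}
```

Fields with a low rate are the ones for which further sample documents should be trained.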
Online Learning during production
After a pre-trained system has been set to production, KTM offers the possibility to further improve the classification and extraction during daily processing. This includes the optimization of main document types already trained in the preparation, but also the basic training and optimization of the previously neglected other document types.
The KTM validation module presents for validation all documents whose classification was uncertain or whose data fields were uncertain or empty after extraction. A user can manually correct the classification and/or the data fields, and the document may be marked for online learning if desired.
After that, the original document goes on to further processing and a copy is sent to the KTM learning mechanism. Depending on the configuration of the KTM system, either the system learns the changes directly and they are available when the next batch is processed, or an administrator must first check and release the new learning document.
The following diagram shows the flow of KTM processing and the integration of online learning:
However, direct online learning – without the control of an administrator – entails the risk that the system will learn incorrectly, since the person at the validation workplace directly releases a document for learning. Neural networks cannot be debugged like programs in classical development – there must be other ways to find the error (the wrongly trained document) and make corrections.
KTM provides a view of all trained documents per document type as well as the possibility to remove documents from a learning set or to reconfigure them. Nevertheless, one should not underestimate the effort for such a correction. Therefore, the release of new learning documents should be done by an administrator or a specialist, despite the delay in getting the new training set into production.
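The two operating modes, direct learning and learning after release by an administrator, can be sketched as a small data structure. The class and method names are assumptions for illustration and say nothing about how KTM stores its learning sets.

```python
from dataclasses import dataclass, field


@dataclass
class LearningSet:
    """Hypothetical model of a learning set with an optional review step."""
    require_review: bool = True
    pending: list = field(default_factory=list)   # waiting for an administrator
    trained: dict = field(default_factory=dict)   # document type -> document ids

    def _train(self, doc_id, doc_type):
        self.trained.setdefault(doc_type, []).append(doc_id)

    def submit(self, doc_id, doc_type):
        """Called when a validation user marks a document for online learning."""
        if self.require_review:
            self.pending.append((doc_id, doc_type))
        else:
            self._train(doc_id, doc_type)  # direct learning: active immediately

    def release(self, doc_id):
        """An administrator approves a pending learning document."""
        for item in list(self.pending):
            if item[0] == doc_id:
                self.pending.remove(item)
                self._train(*item)
                return True
        return False

    def remove(self, doc_id, doc_type):
        """Take a wrongly trained document out of the learning set again."""
        documents = self.trained.get(doc_type, [])
        if doc_id in documents:
            documents.remove(doc_id)
            return True
        return False
```

With `require_review=True`, nothing reaches the trained set until `release` is called, which is the safer configuration recommended above; `remove` corresponds to the later correction that one would prefer to avoid.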