Many of our customers use systems for automatic document classification and data extraction. ‘Kofax Transformation Modules’ (KTM) is one of these systems. Such data capture systems extract metadata from electronic images (the scanned pages of documents, faxes or emails) and release the data and the document to business applications.
In this article I will explain the different ways of document classification within KTM.
Up to now, two other articles about KTM were published in the codecentric blog:
Before data can be extracted from a document, KTM needs to know the type of the document. Invoices have to be treated differently from, for example, insurance contracts. From an invoice you want to extract the invoice number, invoice date and amounts, but from a contract the insurance number and the insurance class.
So the document type has to be determined first. In KTM this is done by the classification mechanism, which runs before extraction. As soon as a document has been classified, the metadata can be extracted.
Document classification in KTM can be done with various methods, which differ in complexity and in the effort needed to prepare the documents:
1. Classification by layout
This classification method tries to determine the document type from the graphical structure of the document. It is the fastest way to classify, as no Optical Character Recognition (OCR) is needed. It can only be used where the document types can be clearly distinguished visually. Examples are types of applications that differ in their design (structure, company logo, …). Forms in financial or insurance services are unsuitable, as these forms look very similar.
To use layout classification you have to train KTM on the customer’s document types, but KTM keeps the manual effort for this very low. You collect some samples for each document type and show KTM which samples are representative for which document type. KTM then learns the characteristic layout structures of each document type. This training can be done easily in the graphical user interface of KTM Project Builder.
2. Classification by content
The approach of (automatic) classification by content is similar to layout classification. The difference is that the content of the documents is used for classification instead of the layout. For this, the documents have to be read by OCR first.
And what’s really nice: you don’t have to care about the meaning of the content. Just as with layout classification, you only have to prepare batches of samples for the document types. After the OCR read has been done, you show KTM which samples are representative for each document type. KTM then learns automatically which words, phrases or word combinations are characteristic for a document type. This training takes place in KTM Project Builder, in the same way as for layout classification.
3. Classification by instructions
With layout or content classification you ‘only’ have to provide sufficient samples to KTM. The work of learning and evaluation is done by KTM itself. With classification by instructions, the developer has to know the content of the documents and be able to assess it in its business context. For each document type you manually define the words, phrases and word combinations that are characteristic for it. So you will need some subject-specific knowledge about the documents.
Classification by instruction is often used with general correspondence documents. For example: if ‘dunning’ and ‘dunning charge’ are both found on a document, this document must be classified as ‘dunning procedure’.
Classification by instructions also requires that the documents have been read by OCR first. The instructions (words, phrases, word combinations) are again entered via KTM Project Builder.
4. Classification by script
4.1 Barcode Locators
Sometimes a barcode on the document is sufficient to classify it. KTM supports this as well, but you have to use KTM’s internal script language, which is comparable to Visual Basic.
You start with a barcode locator (BCode), which recognizes the barcode value and has to be defined on project level. A short script on project level then classifies the document into the desired document type:
' Class script: Project
Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
   If pXDoc.Locators.ItemByName("BCode").Alternatives.Count > 0 Then
      If pXDoc.Locators.ItemByName("BCode").Alternatives(0).Confidence > 0.95 Then
         pXDoc.Reclassify "Barcodeantrag"
         Exit Sub 'only one reclassify
      End If
   End If
End Sub
The script will be called in the event Document_BeforeClassifyXDoc, which is executed before all other classification mechanisms of KTM.
The script first checks whether the barcode locator has found anything at all and whether the confidence of the first alternative is greater than 95%. If so, the Reclassify method is used to classify the document as document type ‘Barcodeantrag’ (barcode application). After Reclassify you have to exit the sub, so that no further Reclassify can happen in the code that follows. Multiple Reclassify calls are possible in KTM, but you should be careful with them, as you might create infinite loops.
4.2 Format Locators, Advanced Zone Locators and Everything…
The ‘classification by script’ principle shown in 4.1 with barcode locators can be used with any other locator too. What matters is that the locator identifies a document type unambiguously. Furthermore, the locator has to be defined on project level, as otherwise the script in Document_BeforeClassifyXDoc will not be executed. The primary goal of these locators is not data extraction. They are just resources for classification.
A format locator defined on project level may identify the type of an insurance application and classify the document. The following image snippet shows a part of an insurance application for general liability insurance (‘Haftpflichtversicherung’)
A format locator (‘Antrag_Haft’) may search for the word ‘Haftpflichtversicherung’ (general liability insurance) above the word ‘Antrag’ (application) in a region in the upper left corner of the document. If the format locator finds a match, the document can be classified as document type ‘Antrag_Haftpflicht’.
The scripting looks like this (equivalent to the barcode example):
' Class script: Project
Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
   If pXDoc.Locators.ItemByName("Antrag_Haft").Alternatives.Count > 0 Then
      If pXDoc.Locators.ItemByName("Antrag_Haft").Alternatives(0).Confidence > 0.95 Then
         pXDoc.Reclassify "Antrag_Haftpflicht"
         Exit Sub 'only one reclassify
      End If
   End If
End Sub
If you want to use an ‘Advanced Zone Locator’ (Antrag_Haft_EZL) to classify the insurance application above, you have to adjust the script so that it checks the subfields of the zone locator:
' Class script: Project
Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
   If pXDoc.Locators.ItemByName("Antrag_Haft_EZL").Alternatives.Count > 0 Then
      If pXDoc.Locators.ItemByName("Antrag_Haft_EZL").Alternatives(0).SubFields.ItemByName("UF_Zone0").Confidence > 0.95 Then
         pXDoc.Reclassify "Antrag_Haftpflicht"
         Exit Sub 'only one reclassify
      End If
   End If
End Sub
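Each of the scripts above defines the same event, so a project that uses several classification locators needs to combine the checks in one handler. The following is only a sketch of how that might look, reusing the locator and class names from the examples above. The helper function TopConfidence is a hypothetical name introduced here, and checking the barcode first is an assumption about which indicator you trust most.

```vb
' Class script: Project
' Sketch only: combines the barcode and format locator checks in one handler.

' Hypothetical helper: returns the confidence of the first alternative
' of a project-level locator, or 0 if the locator found nothing.
Private Function TopConfidence(pXDoc As CASCADELib.CscXDocument, LocatorName As String) As Double
   TopConfidence = 0
   If pXDoc.Locators.ItemByName(LocatorName).Alternatives.Count > 0 Then
      TopConfidence = pXDoc.Locators.ItemByName(LocatorName).Alternatives(0).Confidence
   End If
End Function

Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
   ' Assumption: the barcode is the most trusted indicator, so it is checked first.
   If TopConfidence(pXDoc, "BCode") > 0.95 Then
      pXDoc.Reclassify "Barcodeantrag"
      Exit Sub 'only one reclassify
   End If
   ' Otherwise fall back to the format locator.
   If TopConfidence(pXDoc, "Antrag_Haft") > 0.95 Then
      pXDoc.Reclassify "Antrag_Haftpflicht"
      Exit Sub 'only one reclassify
   End If
End Sub
```

Because every branch leaves the sub right after Reclassify, at most one reclassification happens per document, whatever the order of the checks.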
As you can see, scripting opens up a lot of possibilities for classification. For example, you could use a database locator to identify the sender of a document (if an appropriate master data file exists) and classify the document according to the sender.
Forms often have a unique form number printed in the lower left corner, rotated by 90°. An ‘Advanced Zone Locator’ can read the rotated number, and a script can then classify the document using this unique number.
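A rough sketch of such a form number classification could look like the script below. Everything specific in it is invented for illustration: the locator name ‘FormNumber_EZL’, the subfield name ‘Zone_FormNumber’, the form numbers and the class name ‘Antrag_Hausrat’. It also assumes that the recognized value of a subfield can be read from its Text property.

```vb
' Class script: Project
' Sketch only: locator name, subfield name, form numbers and the class
' "Antrag_Hausrat" are hypothetical and have to be replaced by your own.
Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
   Dim FormNumber As String
   FormNumber = ""

   ' Only accept the form number if the zone was read with high confidence.
   If pXDoc.Locators.ItemByName("FormNumber_EZL").Alternatives.Count > 0 Then
      If pXDoc.Locators.ItemByName("FormNumber_EZL").Alternatives(0).SubFields.ItemByName("Zone_FormNumber").Confidence > 0.95 Then
         ' Assumption: Text holds the recognized value of the subfield.
         FormNumber = Trim(pXDoc.Locators.ItemByName("FormNumber_EZL").Alternatives(0).SubFields.ItemByName("Zone_FormNumber").Text)
      End If
   End If

   ' Map the form number to a document type. Unknown or empty numbers match
   ' no case, so the other classification mechanisms still get their turn.
   Select Case FormNumber
      Case "F-1001"
         pXDoc.Reclassify "Antrag_Haftpflicht"
      Case "F-1002"
         pXDoc.Reclassify "Antrag_Hausrat"
   End Select
End Sub
```

The Select Case keeps the mapping in one place and guarantees that only one Reclassify is executed per document.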
Maybe this article gave you some motivation to experiment with KTM’s classification methods and scripting. Have fun!
One more hint for the developers among our readers: my colleague Frank Engelen (he’s working in our Agile Software Factory) has just published an interesting article about data and document classification with the tool ‘RapidMiner’. With a little Java knowledge you may develop your own classification mechanism!
You can find his article here: Taking a look at Java-based Machine Learning by Classification