Orientation problems with document processing (Kofax Transformation Modules)

Document classification and data extraction in companies have to deal with paper documents, emails and faxes. The orientation of the digitized documents (0°, 90°, 180°, 270°) usually doesn’t matter: during OCR processing the system detects the rotation of each document and aligns it for readability.

Sometimes this alignment mechanism fails. Faxes in particular include a fax header line, which is often rotated by 180° relative to the body text. This happens, for example, when the sheet is put into the fax machine the wrong way round. But paper documents, too, may contain notes or numbers printed to the left or right of the text and rotated by 90°. This blog post explains how to solve this problem with a small Kofax Transformation Modules (KTM) script.

Kofax Transformation Modules

With faxes in particular, the OCR engine reads the fax header line first, so the wrong orientation of the document body is not corrected:

[Image: document orientation]

Our customers often use “Kofax Transformation Modules” (KTM) for automated mailroom processing. The failed rotation alignment described above also occurs with KTM. Kofax offers a solution in Kofax Knowledgebase article 19794.

The KTM script from that article erases rectangular regions along the margins (top, left, right, bottom) so that the OCR engine no longer finds any text within them. The erasure happens only in memory; the source document remains unchanged. The width of these regions (100 in the example below) must be adjusted to the customer's documents.

Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
    Dim oImage As CscImage
    Dim lMargin As Long

    lMargin = 100

    'Get current image for page 1 (page index 0)
    Set oImage = pXDoc.CDoc.Pages(0).GetImage()

    'Erase a margin around the edge of the image (arguments: left, top, width, height)
    oImage.EraseRect 0, 0, lMargin, oImage.Height
    oImage.EraseRect oImage.Width - lMargin, 0, lMargin, oImage.Height
    oImage.EraseRect 0, 0, oImage.Width, lMargin
    oImage.EraseRect 0, oImage.Height - lMargin, oImage.Width, lMargin

    'Clean up memory
    Set oImage = Nothing
End Sub

To find the best width for the regions, I saved the modified document (which exists only in memory) as a TIF file to a temporary directory on the hard disk. With an image viewer I could then examine the result of the deletions (see the script below).

oImage.EraseRect deletes the regions by whiting them out, so the deleted regions are often hard to identify on a document with a white background. Instead you can use oImage.Redact, which marks the regions in black. This makes checking the region sizes much easier.

Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
    Dim oImage As CscImage
    Dim lMargin As Long

    lMargin = 100

    'Get current image for page 1 (page index 0)
    Set oImage = pXDoc.CDoc.Pages(0).GetImage()

    'Black out a margin around the edge of the image (arguments: left, top, width, height)
    oImage.Redact 0, 0, lMargin, oImage.Height
    oImage.Redact oImage.Width - lMargin, 0, lMargin, oImage.Height
    oImage.Redact 0, 0, oImage.Width, lMargin
    oImage.Redact 0, oImage.Height - lMargin, oImage.Width, lMargin

    'Save image to temp (for inspection only)
    oImage.Save "C:\temp\Redact.tif", CscImgFileFormatTIFFFaxG4

    'Clean up memory
    Set oImage = Nothing
End Sub

The source document:

[Image: source document]

The modified in-memory document:

[Image: modified document]

The effect of the script, namely the corrected rotation of the document, cannot be tested in KTM Project Builder; it only shows in the runtime environment. In Project Builder you can, however, save the redacted image to a TIF file with the oImage.Save call mentioned above and inspect the masked regions. Activate the oImage.Save line only while working in Project Builder and deactivate it for runtime, as otherwise the TIF file will also be written in the runtime environment.
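
One way to avoid forgetting to deactivate the line is to guard it with a switch. The following is only a sketch and is not taken from the Kofax article: the constant name DEBUG_SAVE_IMAGE is hypothetical, and the sketch assumes that the KTM script language accepts a typed Const declaration at module level, as VBA does. Apart from that it uses only the calls already shown in the scripts above.

```vb
'Hypothetical switch (not part of the Kofax article):
'set to True only while tuning the margin in Project Builder
Const DEBUG_SAVE_IMAGE As Boolean = False

Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
    Dim oImage As CscImage
    Dim lMargin As Long

    lMargin = 100

    'Get current image for page 1 (page index 0)
    Set oImage = pXDoc.CDoc.Pages(0).GetImage()

    If DEBUG_SAVE_IMAGE Then
        'Tuning: black out the margins so they are visible, then save a copy
        oImage.Redact 0, 0, lMargin, oImage.Height
        oImage.Redact oImage.Width - lMargin, 0, lMargin, oImage.Height
        oImage.Redact 0, 0, oImage.Width, lMargin
        oImage.Redact 0, oImage.Height - lMargin, oImage.Width, lMargin
        oImage.Save "C:\temp\Redact.tif", CscImgFileFormatTIFFFaxG4
    Else
        'Runtime: white out the margins and write nothing to disk
        oImage.EraseRect 0, 0, lMargin, oImage.Height
        oImage.EraseRect oImage.Width - lMargin, 0, lMargin, oImage.Height
        oImage.EraseRect 0, 0, oImage.Width, lMargin
        oImage.EraseRect 0, oImage.Height - lMargin, oImage.Width, lMargin
    End If

    'Clean up memory
    Set oImage = Nothing
End Sub
```

With the switch set to False, the script behaves like the first listing; setting it to True reproduces the inspection variant without editing the body of the event handler.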

Summary

By using this simple redaction functionality, we were able to align all faxes correctly and obtain correct OCR results for data extraction in our customer projects. With faxes, it is sufficient to redact only the upper margin, which contains the fax header line.
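
For the fax-only case, the script might be reduced to a single call. This is a minimal sketch under the same assumptions as the scripts above: the variable name lTopMargin is hypothetical, and the value 100 is a placeholder that has to be adjusted to the height of the fax header line in the customer's documents.

```vb
Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
    Dim oImage As CscImage
    Dim lTopMargin As Long

    'Placeholder value: adjust to the height of the fax header line
    lTopMargin = 100

    'Get current image for page 1 (page index 0)
    Set oImage = pXDoc.CDoc.Pages(0).GetImage()

    'White out only the strip at the top (arguments: left, top, width, height)
    oImage.EraseRect 0, 0, oImage.Width, lTopMargin

    'Clean up memory
    Set oImage = Nothing
End Sub
```

Erasing only the top strip leaves notes and numbers in the other margins available to the OCR engine, which matters if they are needed for extraction.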
