
Orientation problems with document processing (Kofax Transformation Modules)

7.7.2019 | 3 minutes of reading time

Document classification and data extraction in companies have to deal with paper documents, emails and faxes. The orientation of the digitized documents (0°, 90°, 180°, 270°) usually doesn’t matter: during OCR processing, the system detects the rotation of each document and aligns it for readability.

Sometimes this alignment mechanism fails. Faxes in particular include a fax header line, which is often rotated by 180° relative to the body text. This happens, for example, when the sheet is fed into the fax machine the wrong way round. But paper documents, too, may contain notes or numbers printed to the left or right of the text, rotated by 90°. This blog post explains how to solve this problem with a small Kofax Transformation Modules (KTM) script.

Kofax Transformation Modules

With faxes in particular, the OCR engine reads the fax header line first, and the wrong orientation of the document is then not corrected.

Our customers often use Kofax Transformation Modules (KTM) for automated mailroom processing. The failed rotation alignment described above also occurs with KTM. Kofax offers a solution in Kofax Knowledgebase article 19794.

The Kofax Transformation Modules script described there erases rectangular regions along the margins (top, left, right, bottom), so the OCR engine no longer finds the text within these regions. The erasure only happens in memory; the source document remains unchanged. The width of these regions (100 in the example below) must be adjusted to the customer's documents.

Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
   Dim oImage As CscImage
   Dim lMargin As Long

   lMargin = 100

   'Get current image for page 1
   Set oImage = pXDoc.CDoc.Pages(0).GetImage()

   'Erase a margin around the edge of the image (left, right, top, bottom)
   oImage.EraseRect 0, 0, lMargin, oImage.Height
   oImage.EraseRect oImage.Width - lMargin, 0, lMargin, oImage.Height
   oImage.EraseRect 0, 0, oImage.Width, lMargin
   oImage.EraseRect 0, oImage.Height - lMargin, oImage.Width, lMargin

   'Clean up memory
   Set oImage = Nothing
End Sub
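Because the margin has to be tuned per customer, it can help to think of it as a physical size rather than a bare number. The following is only a sketch: it assumes that the coordinates passed to EraseRect are pixels (as the use of oImage.Width and oImage.Height suggests) and that you know the scan resolution of your documents. The function name MarginInPixels and its parameters are hypothetical and not part of the KTM API.

```vb
'Hypothetical helper, not part of KTM: converts a margin given in
'millimeters into pixels for a known scan resolution (dots per inch).
'Assumption: the image coordinates used by EraseRect are pixels.
Private Function MarginInPixels(ByVal dMillimeters As Double, ByVal lDpi As Long) As Long
   '1 inch = 25.4 mm
   MarginInPixels = CLng(dMillimeters / 25.4 * lDpi)
End Function
```

Under these assumptions, a margin of 12.7 mm on a 200 dpi scan corresponds to the value 100 used in the script above, so lMargin = MarginInPixels(12.7, 200) would be an equivalent assignment.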

To find the best width for the regions, I saved the modified document (which exists only in memory) as a TIF file to a temporary directory on the hard disk. With an image viewer I was then able to examine the result of the erasures (see script below).

oImage.EraseRect deletes the regions by whiting them out, so it is often difficult to identify the deleted regions on a document with a white background. Instead you can use oImage.Redact, which marks the regions in black. This makes checking the region sizes much easier.

Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
   Dim oImage As CscImage
   Dim lMargin As Long

   lMargin = 100

   'Get current image for page 1
   Set oImage = pXDoc.CDoc.Pages(0).GetImage()

   'Redact (blacken) a margin around the edge of the image (left, right, top, bottom)
   oImage.Redact 0, 0, lMargin, oImage.Height
   oImage.Redact oImage.Width - lMargin, 0, lMargin, oImage.Height
   oImage.Redact 0, 0, oImage.Width, lMargin
   oImage.Redact 0, oImage.Height - lMargin, oImage.Width, lMargin

   'Save image to temp so the redacted regions can be inspected in a viewer
   oImage.Save("C:\temp\Redact.tif", CscImgFileFormatTIFFFaxG4)

   'Clean up memory
   Set oImage = Nothing
End Sub

The source document:

The modified in-memory document:

The result of the Kofax Transformation Modules script, the correct rotation of the document, cannot be tested in KTM Project Builder, but it works in the runtime environment. However, you can save the redacted image to a TIF file with the oImage.Save call mentioned above and check the result that way. The oImage.Save line should only be active within Project Builder. Please deactivate it for runtime, as otherwise the TIF file will also be written in the runtime environment.
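One way to follow this advice without commenting the line in and out by hand is a script-level switch. The sketch below assumes a constant named SAVE_DEBUG_IMAGE, which is purely hypothetical and has to be set manually; everything else is taken from the script above.

```vb
'Hypothetical switch: set to True while testing in Project Builder,
'and back to False before the project goes to the runtime environment.
Private Const SAVE_DEBUG_IMAGE As Boolean = False

Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
   Dim oImage As CscImage
   Dim lMargin As Long

   lMargin = 100

   'Get current image for page 1
   Set oImage = pXDoc.CDoc.Pages(0).GetImage()

   'Redact a margin around the edge of the image (left, right, top, bottom)
   oImage.Redact 0, 0, lMargin, oImage.Height
   oImage.Redact oImage.Width - lMargin, 0, lMargin, oImage.Height
   oImage.Redact 0, 0, oImage.Width, lMargin
   oImage.Redact 0, oImage.Height - lMargin, oImage.Width, lMargin

   'Write the TIF file only when the debug switch is on
   If SAVE_DEBUG_IMAGE Then
      oImage.Save("C:\temp\Redact.tif", CscImgFileFormatTIFFFaxG4)
   End If

   'Clean up memory
   Set oImage = Nothing
End Sub
```

The switch still has to be reset before deployment, but it keeps the decision in one place at the top of the script.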

Summary

By using this simple redaction functionality, we were able to align all faxes correctly and get correct OCR results for data extraction in our customer projects. With faxes, it is sufficient to redact only the upper margin, which contains the fax header line.
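Reduced to the upper margin, the script might look like the following sketch. It uses only the calls shown earlier in this post; the value 100 is again a placeholder that has to be adjusted to the height of the fax header line in your documents.

```vb
Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
   Dim oImage As CscImage
   Dim lMargin As Long

   'Placeholder: height of the strip that contains the fax header line
   lMargin = 100

   'Get current image for page 1
   Set oImage = pXDoc.CDoc.Pages(0).GetImage()

   'Erase only the top margin across the full width of the page
   oImage.EraseRect 0, 0, oImage.Width, lMargin

   'Clean up memory
   Set oImage = Nothing
End Sub
```

Erasing less of the page reduces the risk of removing body text that sits close to the left, right or bottom edge.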
