Proceedings of the Symposium on Document Image Understanding Technology, Greenbelt, MD (April 2003)
Predicting Optical Character Recognition (OCR) accuracy and choosing the best OCR engine for a given document page are two important, associated problems that must be addressed when processing degraded documents. In this paper we tackle both in the context of an intelligent document enhancement framework that we are developing. An examination of the overall system architecture shows that existing modules may be used more than once in the pipeline, performing a different function each time. Specifically, we ask whether a script identification engine, suitably trained, can distinguish not between different script types but between different degradation levels of the same script. We then craft a rule-based classifier that sends the document to the most suitable OCR engine. In our experiments we apply this framework to a common set of artifacts (broken and merged characters) and employ two OCR engines with different performance characteristics. Our results demonstrate a robust ability to discriminate between clean, broken-character, and merged-character documents. Finally, we show that using our system to choose between the two OCR engines yields better overall accuracy than either engine achieves on its own.
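The routing step described above can be illustrated with a minimal sketch. It assumes a classifier that returns one of the three degradation labels named in the abstract; the engine names, the rule table, and the default engine are hypothetical placeholders, since the abstract does not state which engine is preferred for which degradation class.

```python
from typing import Callable, Dict

# Degradation classes named in the abstract.
LABELS = ("clean", "broken", "merged")

# Hypothetical routing rules: which engine handles which class is an
# assumption made for illustration, not a result reported in the paper.
ROUTING_RULES: Dict[str, str] = {
    "clean": "engine_a",
    "broken": "engine_b",
    "merged": "engine_a",
}
DEFAULT_ENGINE = "engine_a"  # assumed fallback for unrecognized labels


def select_engine(label: str) -> str:
    """Return the name of the OCR engine assigned to a degradation label."""
    return ROUTING_RULES.get(label, DEFAULT_ENGINE)


def route_page(page,
               classify: Callable[[object], str],
               engines: Dict[str, Callable[[object], str]]) -> str:
    """Classify a page's degradation, then run the engine the rules select."""
    label = classify(page)
    return engines[select_engine(label)](page)


# Stand-in classifier and engines, for demonstration only.
def toy_classify(page):
    return page["degradation"]


toy_engines = {
    "engine_a": lambda page: "A:" + page["text"],
    "engine_b": lambda page: "B:" + page["text"],
}

result = route_page({"degradation": "broken", "text": "hello"},
                    toy_classify, toy_engines)
```

In a real system the stand-in classifier would be replaced by the retrained script identification engine, and the rule table would be derived from each engine's measured accuracy on each degradation class.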