By Simone Marinai
Document image analysis and recognition (DIAR) is a research field with roots in the first Optical Character Recognition (OCR) systems, which were applied to reading numeric check codes. Today, DIAR technology is used in a broad range of applications where information has to be extracted from structured documents existing in different media. Typical applications include handwritten character recognition, processing of textual web images, and information extraction from digital libraries. In the digital library community, much effort has been devoted to digitizing paper collections in order to archive them as document image collections. Large digital archives are currently available; however, they can be fully exploited only by accessing the information embedded in the digital images. Simply applying OCR packages solves these problems only partially, both because clean converted text is difficult to obtain and because OCR output lacks a structural description of the document. To tackle these problems, either layout analysis methods or document image retrieval approaches can be considered.
Scanning and storage
- Raw storage
- Image compression
- Document image compression
- Noise removal
- Skew detection
- Connected components computation
- Geometric and logic layout analysis
- Region segmentation and labeling
- Page classification
- Character segmentation
- Word recognition
- Processing of converted text
- Retrieval by layout similarity
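
As an illustration of one of the steps listed above, connected components computation, the following is a minimal sketch and not a method taken from this overview. It assumes a binarized page given as a list of rows in which 1 marks an ink pixel, uses 4-connectivity, and returns one bounding box per component; the function name `connected_components` and the sample `page` are hypothetical.

```python
from collections import deque


def connected_components(image):
    """Find 4-connected foreground regions in a binary image.

    `image` is a list of rows; 1 marks a foreground (ink) pixel, 0 background.
    Returns bounding boxes as (top, left, bottom, right), all inclusive,
    in the order the components are met during a row-by-row scan.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []

    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical 4x5 binarized page fragment with three separate blobs.
page = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
print(connected_components(page))
```

Bounding boxes of this kind are a common input to later steps such as region segmentation and character segmentation. Production systems would more likely rely on an image-processing library than on a pure-Python scan like this one.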