Friday 6 January 2012

Document image analysis and recognition


By Simone Marinai 
Document image analysis and recognition (DIAR) is a research field with roots in the first Optical Character Recognition (OCR) systems, which were applied to reading numeric check codes. Today, DIAR technology is used in a broad range of applications in which information has to be extracted from structured documents stored in different media. Typical applications include, among others, handwritten character recognition, processing of textual web images, and information extraction from digital libraries. In the digital library community, considerable effort has been devoted to digitizing paper collections in order to archive them as document image collections. Large digital archives are currently available; however, they can be fully exploited only by accessing the information embedded in the digital images. Simply applying OCR packages solves these problems only partially, both because clean converted text is difficult to obtain and because the output lacks a structural description of the document. To tackle these problems, either layout analysis methods or document image retrieval approaches can be considered.

Scanning and storage
  • Raw storage
  • Image compression
  • Document image compression
Image pre-processing
  • Noise removal
  • Skew detection
  • Connected components computation
Layout analysis
  • Geometric and logical layout analysis
  • Region segmentation and labeling
  • Page classification
OCR and handwriting recognition
  • Character segmentation
  • Word recognition
Document image retrieval
  • Processing of converted text
  • Retrieval by layout similarity
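
As a small illustration of one pre-processing step from the outline above, connected components computation on a binarized page can be sketched as follows. This is only a minimal sketch under stated assumptions, not the method of any particular DIAR system: the page is assumed to be already binarized into a list of rows with 1 for ink and 0 for background, 4-connectivity is assumed, and the function name connected_components is hypothetical.

```python
from collections import deque


def connected_components(image):
    """Label 4-connected foreground regions (pixels equal to 1).

    Assumption: `image` is a list of equal-length rows of 0/1 values.
    Returns (labels, count): `labels` has the same shape as `image`,
    with 0 for background and 1..count identifying each component.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    count = 0

    for r in range(rows):
        for c in range(cols):
            # Skip background pixels and pixels that already have a label.
            if image[r][c] != 1 or labels[r][c] != 0:
                continue
            # Start a new component and flood it with a breadth-first search.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Hypothetical toy "page" with three separate ink blobs.
page = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
labels, count = connected_components(page)
```

In a real pipeline, the bounding boxes of the labeled components would typically feed later stages such as skew detection or region segmentation; image-processing libraries offer optimized versions of this labeling step.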