Published Conference Proceedings - Paper
Word-Based Adaptive OCR for Historical Books
Kluzner, V, Tzadok, A, Shimony, Y, Walach, E & Antonacopoulos, A 2009, 'Word-Based Adaptive OCR for Historical Books', in: Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR2009), IEEE Computer Society, Los Alamitos, USA, pp. 501-505. Conference details: ICDAR2009, Barcelona, Spain, July 2009.
The aim of this work is to propose a new approach to the recognition of historical texts by providing an adaptive mechanism that automatically tunes itself to a specific book. The system is based on clustering together all the similar words in a book or text and handling each entire class simultaneously. The paper describes the architecture of such a system and the new algorithms that have been developed for robust word image comparison (including registration, optical-flow-based distortion compensation, and adaptive binarization). Results for a large dataset are also presented, demonstrating a recognition improvement of over 23%.
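The word-clustering idea can be illustrated with a minimal sketch. It is not the paper's algorithm: the comparison here is a plain pixel-difference ratio on equally sized binarized images, standing in for the registration, optical-flow compensation, and adaptive binarization the abstract mentions. The function names (`pixel_distance`, `cluster_words`, `propagate_labels`) and the threshold value are assumptions made for illustration only.

```python
from typing import List, Optional, Sequence

# A binarized word image: rows of 0/1 pixels (assumed representation).
Image = Sequence[Sequence[int]]


def pixel_distance(a: Image, b: Image) -> float:
    """Fraction of differing pixels; 1.0 if the image shapes differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 1.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def cluster_words(images: List[Image], threshold: float = 0.1) -> List[List[int]]:
    """Greedy clustering: an image joins the first cluster whose
    representative (its first member) is within `threshold`;
    otherwise it starts a new cluster."""
    clusters: List[List[int]] = []
    for idx, img in enumerate(images):
        for members in clusters:
            if pixel_distance(images[members[0]], img) <= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def propagate_labels(
    clusters: List[List[int]], cluster_labels: List[str], n_images: int
) -> List[Optional[str]]:
    """Handle each class as a whole: every member of a cluster
    receives the single label assigned to that cluster."""
    labels: List[Optional[str]] = [None] * n_images
    for members, label in zip(clusters, cluster_labels):
        for i in members:
            labels[i] = label
    return labels
```

Once the clusters are formed, a single recognition decision per cluster can be applied to all of its members through `propagate_labels`, which is one simple way to read "handling each entire class simultaneously".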
This work is one of the major results of IMPACT, a multi-million research project actively involving industry and academia, on improving OCR performance for large-scale digitization of historical documents. In the case of books (the majority of world-library holdings), the proposed OCR architecture supports a recognition system that can train itself as it progresses through the pages of a book. This is an important requirement for large-scale digitization, where human input is impractical or very costly and the material is printed using a variety of archaic conventions and fonts. Experiments with material from major European libraries demonstrate a significant improvement in recognition rate using this approach.