SEEK: Salford Environment for Expertise and Knowledge

Published Conference Proceedings - Paper
July 2009

A New Framework for Recognition of Heavily Degraded Characters in Historical Typewritten Documents Based on Semi-Supervised Clustering

Pletschacher, S, Hu, J & Antonacopoulos, A 2009, 'A New Framework for Recognition of Heavily Degraded Characters in Historical Typewritten Documents Based on Semi-Supervised Clustering', in: Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR2009), IEEE Computer Society, Los Alamitos, USA, pp.506-510. Conference details: ICDAR2009, Barcelona, Spain, July 2009.

Abstract

This paper presents a new semi-supervised clustering framework for the recognition of heavily degraded characters in historical typewritten documents, where off-the-shelf OCR typically fails. The constraints are generated using typographical (collection-independent) domain knowledge and are used to guide both sample (glyph set) partitioning and metric learning. Experimental results using simple features provide encouraging evidence that this approach can lead to significantly improved clustering results compared to simple K-Means clustering, as well as to clustering using a state-of-the-art OCR engine.
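As an illustration of how constraints can guide partitioning, the following minimal sketch shows a generic constrained K-Means in the style of COP-KMeans, using pairwise must-link and cannot-link constraints. It is not the framework from the paper: the abstract does not state the form of the constraints, and the function names, toy points, and constraint pairs below are assumptions made for illustration only. Metric learning is not covered.

```python
import math


def violates(i, cluster, labels, must_link, cannot_link):
    """Return True if placing sample i in `cluster` breaks a pairwise constraint."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # A must-link partner that is already assigned must share the cluster.
        if other is not None and labels[other] not in (None, cluster):
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        # A cannot-link partner must not already sit in this cluster.
        if other is not None and labels[other] == cluster:
            return True
    return False


def constrained_kmeans(points, centers, must_link=(), cannot_link=(), iters=20):
    """Hypothetical helper: K-Means whose assignment step respects the constraints."""
    centers = [list(c) for c in centers]
    labels = [None] * len(points)
    for _ in range(iters):
        new_labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest; take the first feasible one.
            order = sorted(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for k in order:
                if not violates(i, k, new_labels, must_link, cannot_link):
                    new_labels[i] = k
                    break
            else:
                raise ValueError(f"no feasible cluster for sample {i}")
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: each non-empty cluster moves to the mean of its members.
        for k in range(len(centers)):
            members = [points[i] for i, lab in enumerate(labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy 2-D "glyph features" (made up): sample 4 lies between the two groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (2.6, 3.0)]
init = [(0.0, 0.5), (5.0, 5.5)]

plain_labels, _ = constrained_kmeans(points, init)
linked_labels, _ = constrained_kmeans(points, init, must_link=[(0, 4)])
```

In this toy run, plain K-Means places the ambiguous sample with the second group, while a single must-link constraint to sample 0 moves it to the first group. The point of the sketch is only that a small amount of side information can change the partition where feature distances alone are unreliable.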

Notes

This is the first application of semi-supervised clustering to the recognition of degraded typewritten documents. Most administrative documents produced during much of the 20th century were typewritten, and they pose a significant challenge because they differ from modern, uniformly printed documents. Commercial OCR does not deal well with this significant class of historical documents (as the paper also demonstrates). The proposed approach provides the basis for a new system (under development) that requires little training and can work with a variety of typewritten documents, even significantly degraded ones, which is a key requirement for large-scale digitization.

Authors

SEEK Members

A Antonacopoulos

External Authors

Stefan Pletschacher

Jianying Hu

Publication Details

Conference Proceedings
Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR2009), IEEE Computer Society, Los Alamitos, USA, 2009, pp.506-510.

Conference Details
ICDAR2009, Barcelona, Spain, July 2009