Published Conference Proceedings - Paper
A New Framework for Recognition of Heavily Degraded Characters in Historical Typewritten Documents Based on Semi-Supervised Clustering
Pletschacher, S, Hu, J & Antonacopoulos, A 2009, 'A New Framework for Recognition of Heavily Degraded Characters in Historical Typewritten Documents Based on Semi-Supervised Clustering', in: Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR2009), IEEE Computer Society, Los Alamitos, USA, pp.56-510. Conference details: ICDAR2009, Barcelona, Spain, July 2009.
This paper presents a new semi-supervised clustering framework for the recognition of heavily degraded characters in historical typewritten documents, where off-the-shelf OCR typically fails. The constraints are generated using typographical (collection-independent) domain knowledge and are used to guide both sample (glyph set) partitioning and metric learning. Experimental results using simple features provide encouraging evidence that this approach can lead to significantly improved clustering results compared to simple K-Means clustering, as well as to clustering using a state-of-the-art OCR engine.
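To illustrate how pairwise constraints can steer a partitioning away from what plain K-Means would produce, here is a minimal sketch of a constrained K-Means assignment step in the style of COP-KMeans. It is not the authors' framework: the abstract does not specify the algorithm, and the metric-learning part is left out entirely. The function names, the toy feature vectors and the constraint pairs are all hypothetical and chosen only for illustration.

```python
import math


def violates(i, cluster, labels, must_link, cannot_link):
    """Return True if assigning sample i to `cluster` breaks a constraint."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # A must-link partner that is already assigned elsewhere is a conflict.
        if other is not None and labels[other] not in (None, cluster):
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        # A cannot-link partner already in this cluster is a conflict.
        if other is not None and labels[other] == cluster:
            return True
    return False


def constrained_kmeans(points, k, must_link=(), cannot_link=(), iters=20):
    """K-Means with hard pairwise constraints (illustrative sketch only)."""
    # Assumption: deterministic initialisation from the first k points.
    centers = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(iters):
        new_labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest, take the first feasible one.
            order = sorted(range(k), key=lambda c: math.dist(p, centers[c]))
            for c in order:
                if not violates(i, c, new_labels, must_link, cannot_link):
                    new_labels[i] = c
                    break
            if new_labels[i] is None:
                raise ValueError("no feasible cluster for sample %d" % i)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical 2-D glyph features; sample 4 lies between the two groups.
points = [(0.0, 0.0), (5.0, 0.0), (0.2, 0.0), (5.2, 0.0), (2.9, 0.0)]

plain = constrained_kmeans(points, 2)                       # [0, 1, 0, 1, 1]
guided = constrained_kmeans(points, 2, must_link=[(4, 0)])  # [0, 1, 0, 1, 0]
```

In this toy run the unconstrained clustering places the ambiguous sample 4 with the right-hand group, while a single must-link constraint (standing in for domain knowledge that samples 0 and 4 show the same character) moves it to the left-hand group. The result of this greedy scheme depends on the order in which samples are visited, which is one reason constraint-aware metric learning, as mentioned in the abstract, is often preferred over hard constraints alone.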
For the first time, semi-supervised clustering is applied to the recognition of degraded typewritten documents. For most of the 20th century, the majority of administrative documents were typewritten, which poses a significant challenge because they differ from modern, uniformly printed documents. Commercial OCR does not deal well with this significant class of historical documents (as is also demonstrated in the paper). The proposed approach provides the basis for a new system (under development) that requires little training and can work with a variety of typewritten documents even if they are significantly degraded, a key requirement for large-scale digitization.