A categorization system for handwritten documents

  • Original Paper
  • Published in: International Journal on Document Analysis and Recognition (IJDAR)

Abstract

This paper presents a complete system for categorizing handwritten documents, i.e. classifying them according to their topic. The approach first detects a set of discriminative keywords and then applies the well-known tf-idf representation to categorize each document. Two keyword extraction strategies are explored. The first performs full recognition of the whole document; however, its performance decreases sharply as the lexicon size increases. The second extracts only the discriminative keywords from the handwritten documents. This information extraction strategy relies on integrating a rejection model (or anti-lexicon model) into the recognition system. Experiments were carried out on a database of unconstrained handwritten documents from an industrial application that processes incoming mail. Results show that the discriminative keyword extraction system achieves better recall/precision tradeoffs than the full recognition strategy, and that it also outperforms full recognition on the categorization task.
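To make the keyword-then-tf-idf idea concrete, here is a minimal sketch of how keywords spotted in a document could be turned into a tf-idf vector and assigned a topic. It is not the system described in the paper: the lexicon, the training examples, the topic labels, and the nearest-neighbour cosine rule are all assumptions chosen for illustration, and the handwriting recognition step that would produce the keyword lists is not modelled at all.

```python
import math
from collections import Counter

# Hypothetical keyword lexicon and labelled training documents.
# Each document is represented by the keywords assumed to have been spotted in it.
LEXICON = ["invoice", "refund", "contract", "cancel"]
TRAIN = [
    (["invoice", "refund", "refund"], "billing"),
    (["contract", "cancel"], "termination"),
    (["invoice", "invoice"], "billing"),
    (["cancel", "cancel", "contract"], "termination"),
]

def fit_idf(docs, lexicon):
    """Inverse document frequency log(N / df) for each lexicon word."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    return {w: math.log(n / df[w]) if df[w] else 0.0 for w in lexicon}

def vectorize(doc, lexicon, idf):
    """tf-idf vector over the lexicon; words outside the lexicon are ignored."""
    tf = Counter(doc)
    return [tf[w] * idf[w] for w in lexicon]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def categorize(doc, train, lexicon, idf):
    """Label of the most similar training document, or None if nothing matches."""
    query = vectorize(doc, lexicon, idf)
    best_label, best_sim = None, 0.0
    for words, label in train:
        sim = cosine(query, vectorize(words, lexicon, idf))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

IDF = fit_idf([words for words, _ in TRAIN], LEXICON)
print(categorize(["refund", "invoice"], TRAIN, LEXICON, IDF))
```

Because only lexicon words contribute to the vector, a document in which no keyword is detected yields an all-zero vector and `categorize` returns `None`, which loosely mirrors the idea of ignoring everything outside the keyword set.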

Author information

Corresponding author

Correspondence to Clément Chatelain.

About this article

Cite this article

Paquet, T., Heutte, L., Koch, G. et al. A categorization system for handwritten documents. IJDAR 15, 315–330 (2012). https://doi.org/10.1007/s10032-011-0173-5

  • DOI: https://doi.org/10.1007/s10032-011-0173-5
