Published in: International Journal on Document Analysis and Recognition (IJDAR), Issue 3/2014

1 September 2014 | Original Paper

Mathematical formula identification and performance evaluation in PDF documents

Authors: Xiaoyan Lin, Liangcai Gao, Zhi Tang, Josef Baker, Volker Sorge



Abstract

An important initial step of mathematical formula recognition is to correctly identify the location of formulae within documents. Previous work in this area has traditionally focused on image-based documents; however, given the prevalence and popularity of the PDF format for dissemination, alternatives to image-based approaches are increasingly being explored. In this paper, we investigate the use of both machine learning techniques and heuristic rules to locate the boundaries of both isolated and embedded formulae within documents, based upon data extracted directly from PDF files. We propose four new features, along with preprocessing and post-processing techniques, for isolated formula identification. Furthermore, we compare, analyse and extensively tune nine state-of-the-art learning algorithms for a comprehensive evaluation of our proposed methods. The evaluation is carried out over a ground-truth dataset, which we have made publicly available, together with an application-adaptable, fine-grained evaluation metric. Our experimental results demonstrate that the overall accuracies of isolated and embedded formula identification are increased by 11.52 % and 10.65 %, respectively, compared with our previously proposed formula identification approach.


Metadata
Title
Mathematical formula identification and performance evaluation in PDF documents
Authors
Xiaoyan Lin
Liangcai Gao
Zhi Tang
Josef Baker
Volker Sorge
Publication date
1 September 2014
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Document Analysis and Recognition (IJDAR) / Issue 3/2014
Print ISSN: 1433-2833
Electronic ISSN: 1433-2825
DOI
https://doi.org/10.1007/s10032-013-0216-1
