DOI: 10.1145/3342558.3345399
research-article
Best Student Paper

Prediction of Mathematical Expression Declarations based on Spatial, Semantic, and Syntactic Analysis

Published: 23 September 2019

ABSTRACT

Mathematical expressions (MEs) and words are tightly bound together in most science, technology, engineering, and mathematics (STEM) documents, where they give, respectively, quantitative and qualitative descriptions of the system model under discussion. This paper proposes a general model for finding the co-reference relations between words and MEs, and builds on it a novel algorithm for predicting the natural language declarations of MEs (ME-Dec). The prediction algorithm is applied in a three-level framework. The first level is a customized tagger that identifies the syntactic roles of MEs and the part-of-speech (POS) tags of words in sentences that mix MEs and words. The second level screens the ME-Dec candidates, based on the hypothesis that most ME-Dec are noun phrases (NPs). A shallow chunker is trained with the fuzzy process mining algorithm, which takes the labeled POS tag sequences in the NTCIR-10 dataset as input and mines them for the frequent syntactic patterns of NPs. In the third level, the bonding model between MEs and ME-Dec candidates is trained on the NTCIR-10 training set, using distance, word stem, and POS tag as the spatial, semantic, and syntactic features, respectively. The final prediction is made by majority vote of an ensemble of Naïve Bayesian classifiers built on the three features. Evaluated on the NTCIR-10 test set, the proposed algorithm achieved average F1 scores of 75% in soft matching and 71% in strict matching, outperforming the state-of-the-art solutions by a margin of 5-18%.
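The third-level voting step can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes one Naïve Bayes classifier per feature (a bucketed distance, a word stem, and a POS tag pattern), Laplace smoothing, and a binary label saying whether a candidate noun phrase declares the ME. The class `SingleFeatureNB`, the helpers `train_ensemble` and `predict_declaration`, the distance buckets, and the toy rows are all hypothetical and are not taken from NTCIR-10.

```python
import math
from collections import Counter, defaultdict


class SingleFeatureNB:
    """Naive Bayes over one categorical feature, with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, values, labels):
        for value, label in zip(values, labels):
            self.class_counts[label] += 1
            self.value_counts[label][value] += 1
            self.vocab.add(value)
        return self

    def predict(self, value):
        total = sum(self.class_counts.values())
        k = len(self.vocab) + 1  # one extra slot reserved for unseen values
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            seen = self.value_counts[label][value]
            likelihood = (seen + self.alpha) / (n + self.alpha * k)
            score = math.log(n / total) + math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_ensemble(rows):
    """Train one classifier per feature column; the last column is the label."""
    labels = [row[-1] for row in rows]
    return [
        SingleFeatureNB().fit([row[i] for row in rows], labels)
        for i in range(3)
    ]


def predict_declaration(models, candidate):
    """Each classifier votes on its own feature; the majority label wins."""
    votes = [model.predict(value) for model, value in zip(models, candidate)]
    return Counter(votes).most_common(1)[0][0]


# Toy rows (hypothetical): (distance bucket, head-word stem, POS pattern, is_declaration)
train = [
    ("near", "veloc", "NN", True),
    ("near", "mass", "NN", True),
    ("near", "matrix", "NN", True),
    ("mid", "function", "JJ NN", True),
    ("far", "veloc", "NN", False),
    ("far", "paper", "NN", False),
    ("far", "section", "NN", False),
    ("near", "paper", "DT NN", False),
]
models = train_ensemble(train)

print(predict_declaration(models, ("near", "matrix", "NN")))
print(predict_declaration(models, ("far", "section", "NN")))
```

With three voters and a binary label, a strict majority always exists, so the sketch needs no tie-breaking rule at the ensemble level.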

Published in

DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019
September 2019
254 pages
ISBN: 9781450368872
DOI: 10.1145/3342558

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

DocEng '19 Paper Acceptance Rate: 30 of 77 submissions, 39%. Overall Acceptance Rate: 178 of 537 submissions, 33%.
