ABSTRACT
Mathematical expressions (MEs) and words are closely bound together in most science, technology, engineering, and mathematics (STEM) documents. They give, respectively, the quantitative and qualitative descriptions of the system model under discussion. This paper proposes a general model for finding the co-reference relations between words and MEs, on which we build a novel algorithm for predicting the natural language declarations of MEs (ME-Dec). The prediction algorithm is applied in a three-level framework. The first level is a customized tagger that identifies the syntactic roles of MEs and the part-of-speech (POS) tags of words in sentences that mix MEs and words. The second level screens the ME-Dec candidates based on the hypothesis that most ME-Dec are noun phrases (NPs). A shallow chunker is trained with the fuzzy process mining algorithm, which takes the labeled POS tag series in the NTCIR-10 dataset as input and mines them for the frequent syntactic patterns of NPs. In the third level, the bonding model between MEs and ME-Dec candidates is trained on the NTCIR-10 training set, using distance, word stem, and POS tag as the spatial, semantic, and syntactic features, respectively. The final prediction is made by the majority vote of an ensemble of Naïve Bayesian classifiers based on the three features. Evaluated on the NTCIR-10 test set, the proposed algorithm achieves average F1 scores of 75% in soft matching and 71% in strict matching, outperforming the state-of-the-art solutions by a margin of 5-18%.
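The third-level voting step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one categorical Naïve Bayes classifier per feature, Laplace smoothing, a distance feature bucketed into "near" and "far", and a handful of invented training candidates. The names `CategoricalNB`, `train_ensemble`, and `predict_majority` are hypothetical.

```python
from collections import Counter, defaultdict
import math


class CategoricalNB:
    """Naive Bayes over one categorical feature (assumed design, Laplace-smoothed)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, values, labels):
        for value, label in zip(values, labels):
            self.class_counts[label] += 1
            self.value_counts[label][value] += 1
            self.vocab.add(value)
        return self

    def predict(self, value):
        total = sum(self.class_counts.values())
        k = len(self.vocab) + 1  # one extra slot for unseen feature values
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            likelihood = (self.value_counts[label][value] + self.alpha) / (n + self.alpha * k)
            score = math.log(n / total) + math.log(likelihood)
            if score > best_score:
                best, best_score = label, score
        return best


FEATURES = ("distance", "stem", "pos")  # spatial, semantic, syntactic


def train_ensemble(candidates, labels):
    """Train one single-feature classifier per feature (an assumption of this sketch)."""
    return {f: CategoricalNB().fit([c[f] for c in candidates], labels) for f in FEATURES}


def predict_majority(ensemble, candidate):
    """Return the label chosen by most of the three classifiers."""
    votes = [clf.predict(candidate[f]) for f, clf in ensemble.items()]
    return Counter(votes).most_common(1)[0][0]


# Invented toy data: each candidate noun phrase is described by its bucketed
# distance to the ME, the stem of its head word, and its POS tag pattern.
# The label says whether the candidate is the declaration of the ME.
TRAIN = [
    ({"distance": "near", "stem": "veloc", "pos": "DT NN"}, True),
    ({"distance": "near", "stem": "mass", "pos": "DT NN"}, True),
    ({"distance": "near", "stem": "rate", "pos": "NN"}, True),
    ({"distance": "far", "stem": "paper", "pos": "DT NN"}, False),
    ({"distance": "far", "stem": "section", "pos": "NN"}, False),
    ({"distance": "far", "stem": "result", "pos": "JJ NN"}, False),
    ({"distance": "near", "stem": "section", "pos": "JJ NN"}, False),
]

ensemble = train_ensemble([c for c, _ in TRAIN], [y for _, y in TRAIN])
is_declaration = predict_majority(
    ensemble, {"distance": "near", "stem": "paper", "pos": "DT NN"}
)
```

In the last call the stem classifier votes against the candidate while the distance and POS classifiers vote for it, so the majority label is `True`. With three voters and two labels a tie cannot occur, which is one reason an odd-sized ensemble is convenient here.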
Index Terms
- Prediction of Mathematical Expression Declarations based on Spatial, Semantic, and Syntactic Analysis