Prediction of Mathematical Expression Declarations based on Spatial, Semantic, and Syntactic Analysis

Authors: Jason Lin, Xing Wang, Zelun Wang, Donald Beyette, Jyh-Charn Liu

Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA

DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019, September 2019, Article No. 15, Pages 1–10. https://doi.org/10.1145/3342558.3345399

Published: 23 September 2019

ABSTRACT

Mathematical expressions (MEs) and words are tightly bound together in most science, technology, engineering, and mathematics (STEM) documents, where they give, respectively, quantitative and qualitative descriptions of the system model under discussion. This paper proposes a general model for finding the co-reference relations between words and MEs, on which we build a novel algorithm for predicting the natural language declarations of MEs (ME-Dec). The prediction algorithm is applied in a three-level framework. The first level is a customized tagger that identifies the syntactic roles of MEs and the part-of-speech (POS) tags of words in sentences that mix MEs and words. The second level screens the ME-Dec candidates based on the hypothesis that most ME-Decs are noun phrases (NPs). A shallow chunker is trained with the fuzzy process mining algorithm, which takes the labeled POS tag sequences in the NTCIR-10 dataset as input and mines the frequent syntactic patterns of NPs. In the third level, the bonding model between MEs and ME-Dec candidates is trained on the NTCIR-10 training set, using distance, word stem, and POS tag as the spatial, semantic, and syntactic features, respectively. The final prediction is made by majority vote of an ensemble of Naïve Bayesian classifiers based on the three features. Evaluated on the NTCIR-10 test set, the proposed algorithm achieved average F1 scores of 75% and 71% in soft matching and strict matching, respectively, outperforming the state-of-the-art solutions by a margin of 5-18%.
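The third-level voting step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the feature encoding, the smoothing, or the training data, so the add-one smoothing, the integer word distance, the toy rows, and the names `SingleFeatureNB`, `train_ensemble`, and `predict_declaration` are all assumptions made for illustration. The sketch trains one Naive Bayes classifier per feature (distance, word stem, POS tag) and labels a candidate as an ME-Dec when at least two of the three classifiers agree.

```python
from collections import Counter, defaultdict


class SingleFeatureNB:
    """Categorical Naive Bayes over one feature, with add-one smoothing (assumed)."""

    def fit(self, values, labels):
        self.label_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)
        for value, label in zip(values, labels):
            self.value_counts[label][value] += 1
        self.vocab = set(values)
        return self

    def predict(self, value):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -1.0
        for label, n_label in sorted(self.label_counts.items()):
            prior = n_label / total
            # The extra +1 in the denominator reserves mass for unseen values.
            likelihood = (self.value_counts[label][value] + 1) / (
                n_label + len(self.vocab) + 1
            )
            score = prior * likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_ensemble(rows):
    """Train one classifier per feature column; the last column is the label."""
    labels = [row[-1] for row in rows]
    n_features = len(rows[0]) - 1
    return [
        SingleFeatureNB().fit([row[i] for row in rows], labels)
        for i in range(n_features)
    ]


def predict_declaration(ensemble, candidate):
    """Majority vote of the per-feature classifiers."""
    votes = [clf.predict(value) for clf, value in zip(ensemble, candidate)]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy rows: (distance to the ME in words, word stem, POS tag, is ME-Dec).
TRAIN = [
    (1, "matrix", "NN", True),
    (2, "vector", "NN", True),
    (1, "function", "NN", True),
    (6, "show", "VB", False),
    (8, "matrix", "NN", False),
    (7, "obtain", "VB", False),
]

ensemble = train_ensemble(TRAIN)
near_noun = predict_declaration(ensemble, (1, "vector", "NN"))
far_verb = predict_declaration(ensemble, (7, "show", "VB"))
far_noun = predict_declaration(ensemble, (8, "function", "NN"))
```

With two labels and three voters a tie cannot occur, which is one practical reason to vote over an odd number of single-feature classifiers. In the last candidate the distance classifier votes against the declaration, but the stem and POS classifiers outvote it.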

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Document analysis
2. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        1. Supervised learning by classification

Published in
DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019
September 2019
254 pages
ISBN: 9781450368872
DOI: 10.1145/3342558
General Chairs: Uwe Borghoff, Sonja Schimmler
Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Best Student Paper
Author Tags
Co-reference
Declaration extraction
Mathematical expression
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
DocEng '19 Paper Acceptance Rate: 30 of 77 submissions, 39%. Overall Acceptance Rate: 178 of 537 submissions, 33%.