ABSTRACT
Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) substantially reduces semantic ambiguity and enhances the power and efficiency of manipulating such data with database technology. Mining quality phrases is therefore a critical research problem in the field of databases. In this paper, we propose a new framework, integrated with phrasal segmentation, that extracts quality phrases from text corpora. The framework requires only limited training, yet the quality of the phrases it generates is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method.
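The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the two ideas it names, not the framework itself. It assumes that candidate phrases are contiguous word n-grams kept when they reach a minimum count, and that phrasal segmentation can be approximated by a dynamic program that picks the partition with the highest total phrase score. The function names, the `max_len` and `min_support` parameters, and the score table are all hypothetical.

```python
from collections import Counter


def count_candidate_phrases(docs, max_len=3, min_support=2):
    """Count contiguous word n-grams (n <= max_len) and keep the frequent ones.

    Assumption: whitespace tokenization and a raw-count threshold stand in for
    whatever candidate generation the framework really uses.
    """
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for i in range(len(tokens)):
            for n in range(1, max_len + 1):
                if i + n > len(tokens):
                    break
                counts[tuple(tokens[i:i + n])] += 1
    # One pass over the corpus; for a fixed max_len the work is linear in its size.
    return {phrase: c for phrase, c in counts.items() if c >= min_support}


def segment(tokens, scores, max_len=3):
    """Split tokens into phrases so that the summed phrase score is maximal.

    Assumption: `scores` maps word tuples to a hypothetical quality score;
    single words are always allowed and default to a score of 0.
    """
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: top score for tokens[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last phrase
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = tuple(tokens[start:end])
            if len(piece) == 1:
                score = scores.get(piece, 0.0)
            elif piece in scores:
                score = scores[piece]
            else:
                continue  # unknown multiword spans are not valid phrases
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end to recover the chosen phrases.
    phrases = []
    end = n
    while end > 0:
        start = back[end]
        phrases.append(" ".join(tokens[start:end]))
        end = start
    return phrases[::-1]


if __name__ == "__main__":
    docs = [
        "support vector machine is popular",
        "a support vector machine works",
    ]
    print(count_candidate_phrases(docs))
    toy_scores = {("support", "vector", "machine"): 3.0, ("support", "vector"): 1.0}
    print(segment("a support vector machine works".split(), toy_scores))
```

In this sketch the counting step touches each token at most `max_len` times, which is one simple way a method of this kind could keep time and space roughly linear in corpus size; how the actual framework achieves its linear scaling is not stated in the abstract.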
Index Terms
- Mining Quality Phrases from Massive Text Corpora