research-article

Mining Quality Phrases from Massive Text Corpora

Authors:
Jialu Liu, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Jingbo Shang, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Chi Wang, Microsoft Research, Redmond, USA
Xiang Ren, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Jiawei Han, University of Illinois at Urbana-Champaign, Urbana, IL, USA

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, May 2015, Pages 1729–1744
https://doi.org/10.1145/2723372.2751523

Published: 27 May 2015

ABSTRACT

Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency of manipulating such data using database technology. Thus, mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora and integrates this extraction with phrasal segmentation. The framework requires only limited training, yet the quality of the phrases it generates is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method.
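
The abstract names phrasal segmentation but does not spell out the algorithm, so the following is only a toy sketch of the general idea, not the authors' framework. It counts word n-grams in a corpus and then splits a sentence into phrases with dynamic programming. The function names (`count_ngrams`, `segment`), the `min_count` threshold, and the scoring rule (one point per word merged into a sufficiently frequent n-gram) are all assumptions made for illustration; the paper's learned quality estimate is not reproduced here.

```python
from collections import Counter


def count_ngrams(docs, max_len=3):
    """Count every contiguous word sequence of up to max_len tokens."""
    counts = Counter()
    for tokens in docs:
        for i in range(len(tokens)):
            for n in range(1, max_len + 1):
                if i + n <= len(tokens):
                    counts[tuple(tokens[i:i + n])] += 1
    return counts


def segment(tokens, counts, max_len=3, min_count=2):
    """Split tokens into phrases by dynamic programming.

    Assumed toy score: a multi-word segment is allowed only if it occurs
    at least min_count times, and it earns (length - 1) points.
    Single words are always allowed and earn 0 points.
    """
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for tokens[:i]
    back = [0] * (n + 1)              # back[i]: start index of last segment
    best[0] = 0
    for i in range(1, n + 1):
        for length in range(1, max_len + 1):
            j = i - length
            if j < 0:
                break
            gram = tuple(tokens[j:i])
            if length > 1 and counts[gram] < min_count:
                continue  # too rare to be treated as a phrase
            score = best[j] + (length - 1)
            if score > best[i]:
                best[i] = score
                back[i] = j
    # Walk the back-pointers to recover the segments in order.
    segments = []
    i = n
    while i > 0:
        j = back[i]
        segments.append(" ".join(tokens[j:i]))
        i = j
    return segments[::-1]


docs = [
    "support vector machine is a classifier".split(),
    "a support vector machine separates classes".split(),
    "the vector field is smooth".split(),
]
counts = count_ngrams(docs)
print(segment(docs[0], counts))
```

For a fixed `max_len`, both the counting pass and the segmentation pass do a bounded amount of work per token, so this sketch scales linearly with the number of tokens. That is a property of this toy code only and is not offered as evidence for the paper's scalability claim.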

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines


Published in
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN: 9781450327589
DOI: 10.1145/2723372
General Chair: Timos Sellis, RMIT University, Australia
Program Chairs: Susan B. Davidson, University of Pennsylvania, USA; Zack Ives, University of Pennsylvania, USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
phrasal segmentation
phrase mining
Qualifiers
- research-article
Acceptance Rates
SIGMOD '15 Paper Acceptance Rate: 106 of 415 submissions, 26%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%

Article Metrics
- Total Citations: 124
- Total Downloads: 1,753
- Downloads (Last 12 months): 45
- Downloads (Last 6 weeks): 8