DOI: 10.1145/2723372.2751523
research-article

Mining Quality Phrases from Massive Text Corpora

Published: 27 May 2015

ABSTRACT

Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency of manipulating such data using database technology. Thus, mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora and integrates this extraction with phrasal segmentation. The framework requires only limited training, but the quality of the generated phrases is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method.


    • Published in

      SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
      May 2015
      2110 pages
      ISBN: 9781450327589
      DOI: 10.1145/2723372

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      SIGMOD '15 paper acceptance rate: 106 of 415 submissions, 26%. Overall acceptance rate: 785 of 4,003 submissions, 20%.
