DOI: 10.1145/1401890.1401952
research-article

Mining multi-faceted overviews of arbitrary topics in a text collection

Published: 24 August 2008

ABSTRACT

A common task in many text mining applications is to generate a multi-faceted overview of a topic in a text collection. Such an overview not only serves directly as an informative summary of the topic, but also supports navigation to its different facets. Existing work casts this problem as a categorization problem and requires training examples for each facet. This has three limitations: (1) all facets are predefined, which may not fit the needs of a particular user; (2) training examples for each facet are often unavailable; (3) such an approach works only for a predefined type of topic. In this paper, we remove these limitations and study a more realistic setup of the problem, in which a user can flexibly describe each facet of an arbitrary topic with keywords, and we attempt to mine a multi-faceted overview in an unsupervised way. We take a probabilistic approach to this problem. Empirical experiments on different genres of text data show that our approach can effectively generate multi-faceted overviews for arbitrary topics; the generated overviews are comparable with those generated by supervised methods that use training examples, and are more informative than unstructured flat summaries. The method is quite general and can thus be applied to multiple text mining tasks in different application domains.

Published in

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '08 Paper Acceptance Rate: 118 of 593 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
