research-article

Towards the web of concepts: extracting concepts from large datasets

Published: 01 September 2010

Abstract

Concepts are sequences of words that represent real or imaginary entities or ideas that users are interested in. As a first step towards building a web of concepts that will form the backbone of the next generation of search technology, we develop a novel technique to extract concepts from large datasets. We approach the problem of concept extraction from corpora as a market-basket problem, adapting statistical measures of support and confidence. We evaluate our concept extraction algorithm on datasets containing data from a large number of users (e.g., the AOL query log data set), and we show that a high-precision concept set can be extracted.
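
The market-basket framing can be illustrated with a small sketch. The abstract does not give the paper's exact definitions, so the code below assumes the standard association-rule measures applied to word sequences: each query is a basket, the support of an n-gram is the number of queries containing it, and the confidence of an n-gram is its support divided by the support of its prefix (the n-gram without its last word). The function names, the thresholds, and the toy queries are all hypothetical, and this is not the authors' algorithm.

```python
from collections import Counter


def ngrams(words, n):
    """Return all contiguous n-word tuples in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def support_counts(queries, max_n=3):
    """Count, for every n-gram up to max_n words, how many queries contain it."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(words, n))
        counts.update(seen)  # each query counts at most once per n-gram
    return counts


def confidence(counts, gram):
    """Confidence of the rule prefix -> gram: support(gram) / support(prefix)."""
    prefix = gram[:-1]
    return counts[gram] / counts[prefix] if counts[prefix] else 0.0


def candidate_concepts(queries, min_support=2, min_confidence=0.5, max_n=3):
    """Return multi-word n-grams that pass both (assumed) thresholds."""
    counts = support_counts(queries, max_n)
    return sorted(
        " ".join(gram)
        for gram, count in counts.items()
        if len(gram) > 1
        and count >= min_support
        and confidence(counts, gram) >= min_confidence
    )


# Toy data, invented for illustration only.
queries = [
    "new york hotels",
    "new york weather",
    "cheap flights new york",
    "new car prices",
]
print(candidate_concepts(queries))  # → ['new york']
```

Here "new york" appears in three of the four queries (support 3) and follows "new" in three of the four queries containing "new" (confidence 0.75), so it passes both thresholds; every other multi-word sequence occurs only once and is filtered out by the support threshold.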

Published in

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages

Publisher

VLDB Endowment
