Abstract
Concepts are sequences of words that represent real or imaginary entities or ideas that users are interested in. As a first step towards building a web of concepts that will form the backbone of the next generation of search technology, we develop a novel technique to extract concepts from large datasets. We approach the problem of concept extraction from corpora as a market-basket problem, adapting the statistical measures of support and confidence. We evaluate our concept extraction algorithm on datasets drawn from a large number of users (e.g., the AOL query log dataset), and we show that a high-precision concept set can be extracted.
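The market-basket framing can be illustrated with a minimal sketch. It assumes each query is a "basket" of word sequences, defines the support of an n-gram as the number of queries containing it, and defines its confidence as that support divided by the support of the n-gram's prefix (all words but the last). These definitions, the thresholds, the function names, and the toy queries are illustrative assumptions, not the measures or data used in the paper.

```python
from collections import Counter


def ngrams(words, n):
    """Return all contiguous n-word sequences in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def support_counts(queries, max_n=3):
    """Count, for every n-gram of up to max_n words, how many queries contain it."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(words, n))
        counts.update(seen)  # each query contributes at most once per n-gram
    return counts


def extract_concepts(queries, min_support=2, min_confidence=0.6, max_n=3):
    """Keep multi-word n-grams that pass both (assumed) thresholds."""
    counts = support_counts(queries, max_n)
    concepts = {}
    for gram, support in counts.items():
        if len(gram) < 2 or support < min_support:
            continue
        # Assumed confidence: how often the prefix is followed by the last word.
        confidence = support / counts[gram[:-1]]
        if confidence >= min_confidence:
            concepts[" ".join(gram)] = (support, confidence)
    return concepts


# Hypothetical toy query log, invented for illustration only.
queries = [
    "new york hotels",
    "new york weather",
    "cheap flights new york",
    "new car prices",
    "weather forecast",
]
print(extract_concepts(queries))
```

In this toy log only "new york" survives: it appears in three queries, and three of the four queries containing "new" continue with "york". Raising `min_support` or `min_confidence` trades recall for precision, which is the kind of tuning a high-precision concept set would depend on.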
Index Terms
- Towards the web of concepts: extracting concepts from large datasets