Towards the web of concepts: extracting concepts from large datasets

Authors:
Aditya Parameswaran (Stanford University), Hector Garcia-Molina (Stanford University), Anand Rajaraman (Kosmix Corporation)

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, pp. 566–577
https://doi.org/10.14778/1920841.1920914

Published: 01 September 2010

Abstract

Concepts are sequences of words that represent real or imaginary entities or ideas that users are interested in. As a first step towards building a web of concepts that will form the backbone of the next generation of search technology, we develop a novel technique to extract concepts from large datasets. We approach the problem of concept extraction from corpora as a market-basket problem, adapting statistical measures of support and confidence. We evaluate our concept extraction algorithm on datasets containing data from a large number of users (e.g., the AOL query log data set), and we show that a high-precision concept set can be extracted.
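
To make the market-basket framing concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' algorithm: the abstract says only that support and confidence are adapted, so the definitions used here (support as the number of queries containing a word sequence, confidence as the support of a sequence divided by the support of its prefix), the thresholds, the function names, and the toy queries are all hypothetical.

```python
from collections import Counter


def ngrams(words, k):
    """Return all contiguous k-word sequences in a list of words."""
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]


def support_counts(queries, max_k=3):
    """Count, for each n-gram up to max_k words, how many queries contain it.

    Assumption: each query plays the role of one market basket.
    """
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        seen = set()  # count each n-gram at most once per query
        for k in range(1, max_k + 1):
            seen.update(ngrams(words, k))
        counts.update(seen)
    return counts


def prefix_confidence(counts, gram):
    """Assumed confidence: support(gram) / support(gram without its last word)."""
    prefix = gram[:-1]
    if not prefix or counts[prefix] == 0:
        return 0.0
    return counts[gram] / counts[prefix]


def candidate_concepts(queries, min_support=2, min_confidence=0.6, max_k=3):
    """Keep multi-word n-grams that pass both (hypothetical) thresholds."""
    counts = support_counts(queries, max_k)
    return sorted(
        gram
        for gram, count in counts.items()
        if len(gram) > 1
        and count >= min_support
        and prefix_confidence(counts, gram) >= min_confidence
    )


# Toy data invented for illustration; not taken from the AOL query log.
QUERIES = [
    "new york hotels",
    "new york weather",
    "cheap flights new york",
    "new car prices",
    "weather today",
]

if __name__ == "__main__":
    print(candidate_concepts(QUERIES))
```

On the toy queries, "new york" appears in three queries and "new" in four, so under these assumed definitions it has support 3 and confidence 0.75 and is the only sequence that clears both thresholds.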

Index Terms

Towards the web of concepts: extracting concepts from large datasets
1. Information systems
  1. Information retrieval

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
