invited-talk

A web of concepts

Authors:
Nilesh Dalvi, Ravi Kumar, Bo Pang, Raghu Ramakrishnan, Andrew Tomkins, Philip Bohannon, Sathiya Keerthi, Srujana Merugu

Yahoo! Research, Sunnyvale, CA, USA

PODS '09: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 2009, Pages 1–12
https://doi.org/10.1145/1559795.1559797

Published: 29 June 2009

ABSTRACT

We make the case for developing a web of concepts by starting with the current view of the web (composed of hyperlinked pages, or documents, each seen as a bag of words), extracting concept-centric metadata, and stitching it together to create a semantically rich aggregate view of all the information available on the web for each concept instance. The goal of building and maintaining such a web of concepts presents many challenges, but also offers the promise of enabling many powerful applications, including novel search and information discovery paradigms. We present the goal, motivate it with example usage scenarios and some analysis of Yahoo! logs, and discuss the challenges in building and leveraging such a web of concepts. We place this ambitious research agenda in the context of the state of the art in the literature, and describe various related ongoing efforts at Yahoo! Research.

Index Terms

A web of concepts
1. Information systems
  1. Information retrieval
  2. Information storage systems

Published in
PODS '09: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2009
298 pages
ISBN:9781605585536
DOI:10.1145/1559795
General Chair: Jan Paredaens, University of Antwerp, Belgium
Program Chair: Jianwen Su, University of California, Santa Barbara, USA
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
concepts
extraction
ranking
selection
Qualifiers
- invited-talk
Acceptance Rates
PODS '09 Paper Acceptance Rate: 26 of 97 submissions, 27%
Overall Acceptance Rate: 642 of 2,707 submissions, 24%
