ABSTRACT
We make the case for developing a web of concepts by starting with the current view of web (comprised of hyperlinked pages, or documents, each seen as a bag of words), extracting concept-centric metadata, and stitching it together to create a semantically rich aggregate view of all the information available on the web for each concept instance. The goal of building and maintaining such a web of concepts presents many challenges, but also offers the promise of enabling many powerful applications, including novel search and information discovery paradigms. We present the goal, motivate it with example usage scenarios and some analysis of Yahoo! logs, and discuss the challenges in building and leveraging such a web of concepts. We place this ambitious research agenda in the context of the state of the art in the literature, and describe various ongoing efforts at Yahoo! Research that are related.
- ]]J. Allan. Topic Detection and Tracking. Kluwer Academic, 2002.Google ScholarDigital Library
- ]]R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, pages 586--596, 2002. Google ScholarDigital Library
- ]]R.K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6:1817--1853, 2005. Google ScholarDigital Library
- ]]T. Anton. Xpath-wrapper induction by generating tree traversal patterns. In LWA, pages 126--133, 2005.Google Scholar
- ]]J. Atserias, H. Zaragoza, M. Ciaramita, and G. Attardi. Semantically annotated snapshot of the English Wikipedia. In LREC, 2008.Google Scholar
- ]]H. Bast, A. Chitea, F. Suchanek, and I. Weber. Ester: Efficient search on text, entities and relations. In SIGIR, pages 671--678, 2007. Google ScholarDigital Library
- ]]R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In VLDB, pages 119--128, 2001. Google ScholarDigital Library
- ]]O. Benjelloun, A.D. Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In VLDB, pages 953--964, 2006. Google ScholarDigital Library
- ]]T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):34--43, 2001.Google ScholarCross Ref
- ]]P.A. Bernstein and L. Haas. Information integration in the enterprise. CACM, 51(9):72--79, 2008. Google ScholarDigital Library
- ]]I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. In SDM, 2006.Google ScholarCross Ref
- ]]I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. TKDD, 1(1), 2007. Google ScholarDigital Library
- ]]R. Brachman and H. Levesque. Knowledge Representation and Reasoning. Morgan Kaufmann, 2004. Google ScholarDigital Library
- ]]P. Buneman, S. Khanna, and W.C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316--330, 2001. Google ScholarDigital Library
- ]]M.J. Cafarella, A. Halevy, D.Z. Wang, E. Wu, and Y. Zhang. Webtables: exploring the power of tables on the web. VLDB, 1(1):538--549, 2008. Google ScholarDigital Library
- ]]C. Cardie. Empirical methods in information extraction. AI Magazine, 18(4):65--79, 1997.Google ScholarDigital Library
- ]]S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In ICDE, pages 865--876, 2005. Google ScholarDigital Library
- ]]F. Chen, A. Doan, J. Yang, and R. Ramakrishnan. Efficient information extraction over evolving text data. In ICDE, pages 943--952, 2008. Google ScholarDigital Library
- ]]J. Cheney, P. Buneman, and B. Ludäscher. Report on the principles of provenance workshop. SIGMOD Record, 37(1):62--65, 2008. Google ScholarDigital Library
- ]]W.W. Cohen, P. Ravikumar, and S.E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IJCAI Workshop on Information Integration on the Web, pages 73--78, 2003.Google Scholar
- ]]V. Crescenzi, G. Mecca, and P. Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In VLDB, pages 109--118, 2001. Google ScholarDigital Library
- ]]N. Dalvi, P. Bohannon, and F. Sha. Robust web extraction : An approach based on a probabilistic tree-edit model. In SIGMOD, 2009. Google ScholarDigital Library
- ]]N. Dalvi, R. Kumar, B. Pang, and A. Tomkins. Matching reviews with objects using a language model. In Manuscript, 2008.Google Scholar
- ]]N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDB, 16(4):523--544, 2004. Google ScholarDigital Library
- ]]P. DeRose, W. Shen, F. Chen, A. Doan, and R. Ramakrishnan. Building structured web community portals: A top-down, compositional, and incremental approach. In VLDB, pages 399--410, 2007. Google ScholarDigital Library
- ]]A. Doan, J. Madhavan, P. Domingos, and A.Y. Halevy. Ontology matching: A machine learning approach. In Handbook on Ontologies, pages 385--404, 2004.Google ScholarCross Ref
- ]]A. Doan, R. Ramakrishnan, and S. Vaithyanathan. Managing information extraction: State of the art and research directions. In SIGMOD, pages 799--800, 2006. Google ScholarDigital Library
- ]]P. Domingos. Multi-relational record linkage. In KDD Workshop on Multi-Relational Data Mining, pages 31--48, 2004.Google Scholar
- ]]X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In SIGMOD, pages 85--96, 2005. Google ScholarDigital Library
- ]]O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D.S. Weld, and A. Yates. Web-scale information extraction in Knowitall: (preliminary results). In WWW, pages 100--110, 2004. Google ScholarDigital Library
- ]]I.P. Fellegi and A.B. Sunter. A theory for record linkage. JASA, 64:1183--1210, 1969.Google ScholarCross Ref
- ]]A.D. Fuxman and R.J. Miller. First-order query rewriting for inconsistent databases. In ICDT, pages 337--351, 2005. Google ScholarDigital Library
- ]]R. Gilleron, F. Jousse, I. Tellier, and M. Tommasi. XML document transformation with conditional random fields. In INEX, 2006.Google Scholar
- ]]M.N. Gubanov and P.A. Bernstein. Structural text search and comparison using automatically extracted schema. In WebDB, 2006.Google Scholar
- ]]A. Gupta and I.S. Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999. Google ScholarCross Ref
- ]]R. Gupta and S. Sarawagi. Creating probabilistic databases from information extraction models. In VLDB, pages 965--976, 2006. Google ScholarDigital Library
- ]]A.Y. Halevy, M.J. Franklin, and D. Maier. Principles of dataspace systems. In PODS, pages 1--9, 2006. Google ScholarDigital Library
- ]]R. Hall, C. Sutton, and A. McCallum. Unsupervised deduplication using cross-field dependencies. In KDD, pages 310--317, 2008. Google ScholarDigital Library
- ]]W. Han, D. Buttler, and C. Pu. Wrapping web data into XML. SIGMOD Record, 30(3):33--38, 2001. Google ScholarDigital Library
- ]]C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521--538, 1998. Google ScholarDigital Library
- ]]A. Jain, D. Kifer, A. Kirpal, S. Merugu, S. Keerthi, P. Bohannon, and R. Ramakrishnan. Concept-centric extraction: using domain knowledge and local learning. In Manuscript, 2008.Google Scholar
- ]]T.S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. IEEE Data Engineering Bulletin, 29(1):40--48, 2006.Google Scholar
- ]]D.V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain-independent data cleaning. In SDM, 2005.Google ScholarCross Ref
- ]]N. Kushmerick, D.S. Weld, and R.B. Doorenbos. Wrapper induction for information extraction. In IJCAI, pages 729--737, 1997.Google Scholar
- ]]J. Madhavan, L. Afanasiev, L. Antova, and A.Y. Halevy. Harnessing the deep web: Present and future. In CIDR, 2009.Google Scholar
- ]]A. McCallum. Information extraction: Distilling structured data from unstructured text. Queue, 3(9):48--57, 2005. Google ScholarDigital Library
- ]]A. McCallum and B. Wellner. Conditional models of identity uncertainty with application to noun coreference. In NIPS, 2004.Google Scholar
- ]]R. McCann, A. Kramnik, W. Shen, V. Varadarajan, O. Sobulo, and A. Doan. Integrating data from disparate sources: A mass collaboration approach. In ICDE, pages 487--488, 2005. Google ScholarDigital Library
- ]]I. Muslea, S. Minton, and C. Knoblock. STALKER: Learning extraction rules for semistructured. In AAAI: Workshop on AI and Information Integration, 1998.Google Scholar
- ]]J. Myllymaki and J. Jackson. Robust web data extraction with XML path expressions. Technical Report RJ 10245, IBM, 2002.Google Scholar
- ]]G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, 2001. Google ScholarDigital Library
- ]]H.B. Newcombe, J.M. Kennedy, S.J. Axford, and A.P. and James. Automatic linkage of vital records. Science, 130:954--959, 1959.Google ScholarCross Ref
- ]]H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, 2002.Google Scholar
- ]]E. Rahm and P.A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4):334--350, 2001. Google ScholarDigital Library
- ]]E. Rahm, A. Thor, D. Aumueller, H.H. Do, N. Golovin, and T. Kirsten. iFuice: Information fusion utilizing instance correspondences and peer mappings. In WebDB, pages 7--12, 2005.Google Scholar
- ]]R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In ICML, pages 759--766, 2007. Google ScholarDigital Library
- ]]R. Ramakrishnan and A. Tomkins. Toward a peopleweb. IEEE Computer, 40(8):63--72, 2007. Google ScholarDigital Library
- ]]A. Sahuguet and F. Azavant. Building light-weight wrappers for legacy web data-sources using W4F. In VLDB, pages 738--741, 1999. Google ScholarDigital Library
- ]]P. Singla and P. Domingos. Entity resolution with Markov logic. In ICDM, pages 572--582, 2006. Google ScholarDigital Library
- ]]S. Sundararajan and S. Keerthi. Graph based classification methods using inaccurate external classifier information. In Manuscript, 2008.Google Scholar
- ]]J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, pages 262--276, 2005.Google Scholar
Index Terms
- A web of concepts
Recommendations
Intelligent crawling of web applications for web archiving
WWW '12 Companion: Proceedings of the 21st International Conference on World Wide WebThe steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools used currently by Web archivists blindly crawl and store Web pages found while crawling, disregarding the kind of Web site currently ...
Ranking web sites using domain ontology concepts
Many web search engines retrieve enormous amounts of irrelevant information in answer to users' queries. The semantic web provides a promising approach to improve search operation. For specific domains, ontologies can capture concepts to help machines ...
Basic level of concepts in formal concept analysis
ICFCA'12: Proceedings of the 10th international conference on Formal Concept AnalysisThe paper presents a preliminary study on basic level of concepts in the framework of formal concept analysis (FCA). The basic level of concepts is an important phenomenon studied in the psychology of concepts. We argue that this phenomenon may be ...
Comments