DOI: 10.1145/1559795.1559797
invited-talk

A web of concepts

Published: 29 June 2009

ABSTRACT

We make the case for developing a web of concepts by starting with the current view of the web (composed of hyperlinked pages, or documents, each seen as a bag of words), extracting concept-centric metadata, and stitching it together to create a semantically rich aggregate view of all the information available on the web for each concept instance. The goal of building and maintaining such a web of concepts presents many challenges, but also offers the promise of enabling many powerful applications, including novel search and information discovery paradigms. We present the goal, motivate it with example usage scenarios and some analysis of Yahoo! logs, and discuss the challenges in building and leveraging such a web of concepts. We place this ambitious research agenda in the context of the state of the art in the literature, and describe various related ongoing efforts at Yahoo! Research.

        Published in

        PODS '09: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
        June 2009
        298 pages
        ISBN: 9781605585536
        DOI: 10.1145/1559795
        General Chair: Jan Paredaens
        Program Chair: Jianwen Su

        Copyright © 2009 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        PODS '09 paper acceptance rate: 26 of 97 submissions, 27%. Overall acceptance rate: 642 of 2,707 submissions, 24%.
