
Structured databases on the web: observations and implications

Published: 01 September 2004

Abstract

The Web has been rapidly "deepened" by the prevalence of databases online. With the potentially unlimited information hidden behind their query interfaces, this "deep Web" of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored frontier, measuring characteristics pertinent to both exploring and integrating structured Web sources. On one hand, our "macro" study surveys the deep Web at large, in April 2004, adopting the random IP-sampling approach, with one million samples. (How large is the deep Web? How is it covered by current directory services?) On the other hand, our "micro" study surveys source-specific characteristics over 441 sources in eight representative domains, in December 2002. (How "hidden" are deep-Web sources? How do search engines cover their data? How complex and expressive are query forms?) We report our observations and publish the resulting datasets to the research community. We conclude with several implications (of our own) which, while necessarily subjective, might help shape research directions and solutions.
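
The random IP-sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of the full 32-bit IPv4 space, and the example counts are assumptions made for illustration, and the step that actually probes each address for a query interface is left out. The sketch only shows how a hit count in a uniform sample might be extrapolated to the whole address space.

```python
import ipaddress
import random


def sample_ips(n, seed=0):
    """Draw n IPv4 addresses uniformly at random from the 32-bit space.

    Assumption: the whole 32-bit space is sampled; a real survey would
    likely exclude reserved and unroutable ranges first.
    """
    rng = random.Random(seed)
    return [str(ipaddress.IPv4Address(rng.getrandbits(32))) for _ in range(n)]


def estimate_total(hits, samples, address_space):
    """Extrapolate a sample hit count to the full address space.

    hits:          sampled addresses found to host a deep-Web site
    samples:       number of addresses sampled
    address_space: number of addresses the sample was drawn from
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    return hits / samples * address_space


if __name__ == "__main__":
    ips = sample_ips(5, seed=42)
    print(ips)
    # Hypothetical counts, not results from the paper.
    print(estimate_total(hits=5, samples=1000, address_space=2**32))
```

The estimate is a simple proportion, so its reliability depends on the sample being uniform over the same address space that is used for scaling.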

Published in

ACM SIGMOD Record, Volume 33, Issue 3
September 2004
94 pages
ISSN: 0163-5808
DOI: 10.1145/1031570

Copyright © 2004 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
