Abstract
The Web has been rapidly "deepened" by the prevalence of databases online. With potentially unlimited information hidden behind their query interfaces, this "deep Web" of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored frontier, measuring characteristics pertinent to both exploring and integrating structured Web sources. On the one hand, our "macro" study surveys the deep Web at large, in April 2004, using random IP sampling with one million samples. (How large is the deep Web? How well is it covered by current directory services?) On the other hand, our "micro" study surveys source-specific characteristics over 441 sources in eight representative domains, in December 2002. (How "hidden" are deep-Web sources? How do search engines cover their data? How complex and expressive are query forms?) We report our observations and publish the resulting datasets to the research community. We conclude with several implications (of our own) which, while necessarily subjective, might help shape research directions and solutions.