Structured databases on the web: observations and implications

Authors:
Kevin Chen-Chuan Chang, Bin He, Chengkai Li, Mitesh Patel, Zhen Zhang
University of Illinois at Urbana-Champaign

ACM SIGMOD Record, Volume 33, Issue 3, September 2004, pp. 61–70. https://doi.org/10.1145/1031570.1031584

Published: 01 September 2004

Abstract

The Web has been rapidly "deepened" by the prevalence of databases online. With the potentially unlimited information hidden behind their query interfaces, this "deep Web" of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored frontier, measuring characteristics pertinent to both exploring and integrating structured Web sources. On one hand, our "macro" study surveys the deep Web at large, in April 2004, adopting the random IP-sampling approach, with one million samples. (How large is the deep Web? How is it covered by current directory services?) On the other hand, our "micro" study surveys source-specific characteristics over 441 sources in eight representative domains, in December 2002. (How "hidden" are deep-Web sources? How do search engines cover their data? How complex and expressive are query forms?) We report our observations and publish the resulting datasets to the research community. We conclude with several implications (of our own) which, while necessarily subjective, might help shape research directions and solutions.
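
The random IP-sampling idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives neither the exact procedure nor the address population the authors extrapolated over, so everything below is an assumption: the sketch draws addresses uniformly from the whole IPv4 space and scales an observed hit count linearly. The names `sample_ips` and `extrapolate` are hypothetical, and no step for probing a host for query interfaces is shown.

```python
import ipaddress
import random

# Assumption: the population is the entire 32-bit IPv4 space,
# including reserved and unused ranges.
IPV4_SPACE = 2 ** 32


def sample_ips(n, seed=0):
    """Draw n IPv4 addresses uniformly at random (with replacement)."""
    rng = random.Random(seed)
    return [ipaddress.IPv4Address(rng.getrandbits(32)) for _ in range(n)]


def extrapolate(hits, samples, population=IPV4_SPACE):
    """Scale a count observed in the sample up to the whole population."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return hits * population / samples
```

As a purely illustrative calculation (not a result from the paper), `extrapolate(1, 1_000_000)` returns roughly 4295: under these assumptions, each item found in a one-million-address sample stands for about 4,295 items across the full address space.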

Index Terms

Structured databases on the web: observations and implications
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval

Index terms have been assigned to the content through auto-classification.

Published in

ACM SIGMOD Record Volume 33, Issue 3
September 2004
94 pages
ISSN:0163-5808
DOI:10.1145/1031570
Copyright © 2004 Authors
Publisher
Association for Computing Machinery
New York, NY, United States