Structured data on the web

Authors:
Michael J. Cafarella, University of Michigan, Ann Arbor, MI
Alon Halevy, Google Research, Mountain View, CA
Jayant Madhavan, Google Research, Mountain View, CA

Communications of the ACM, Volume 54, Issue 2, February 2011, pp. 72–79. https://doi.org/10.1145/1897816.1897839

Published: 01 February 2011

Abstract

Google's Web Tables and Deep Web Crawler identify and deliver this otherwise inaccessible resource directly to end users.

Published in

Communications of the ACM Volume 54, Issue 2
February 2011
115 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/1897816
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States