A brief survey of web data extraction tools

Authors:
Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, and Juliana S. Teixeira
Federal University of Minas Gerais, Belo Horizonte MG Brazil

ACM SIGMOD Record, Volume 31, Issue 2, June 2002, pp. 84–93, https://doi.org/10.1145/565117.565137

Published: 01 June 2002

Abstract

In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database. The approaches proposed in the literature to address the problem of Web data extraction use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval, databases, and ontologies. As a consequence, they present very distinct features and capabilities, which makes a direct comparison difficult. In this paper, we propose a taxonomy for characterizing Web data extraction tools, briefly survey major Web data extraction tools described in the literature, and provide a qualitative analysis of them. We hope this work will stimulate other studies aimed at a more comprehensive analysis of data extraction approaches and tools for Web data.

Index Terms

1. Information systems
  1. Information retrieval

Published in

ACM SIGMOD Record Volume 31, Issue 2
June 2002
112 pages
ISSN: 0163-5808
DOI: 10.1145/565117

Copyright © 2002 Authors
Publisher
Association for Computing Machinery
New York, NY, United States