A brief survey of web data extraction tools

Published: 01 June 2002

Abstract

In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database. The approaches proposed in the literature to address the problem of Web data extraction use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval, databases, and ontologies. As a consequence, they present very distinct features and capabilities, which makes a direct comparison difficult. In this paper, we propose a taxonomy for characterizing Web data extraction tools, briefly survey major Web data extraction tools described in the literature, and provide a qualitative analysis of them. We hope this work will stimulate other studies aimed at a more comprehensive analysis of data extraction approaches and tools for Web data.

Published in

ACM SIGMOD Record, Volume 31, Issue 2, June 2002, 112 pages
ISSN: 0163-5808
DOI: 10.1145/565117
Copyright © 2002 Authors
Publisher: Association for Computing Machinery, New York, NY, United States