Abstract
In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The problem matters because, once extracted, the data can be handled much like instances of a traditional database. The approaches proposed in the literature use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval, databases, and ontologies. As a consequence, they present very distinct features and capabilities, which makes a direct comparison difficult. In this paper, we propose a taxonomy for characterizing Web data extraction tools, briefly survey the major Web data extraction tools described in the literature, and provide a qualitative analysis of them. We hope this work will stimulate other studies aimed at a more comprehensive analysis of data extraction approaches and tools for Web data.
Index Terms
- A brief survey of web data extraction tools