Automatic generation of agents for collecting hidden Web pages for data extraction

https://doi.org/10.1016/j.datak.2003.10.003

Abstract

As the Web grows, more and more data has become available through dynamic forms of publication, such as legacy databases accessed through HTML forms (the so-called hidden Web). Integrating such data relies increasingly on the fast generation of agents that can automatically fetch pages for further processing. As a result, there is a growing need for tools that help users generate such agents. In this paper, we describe a method for automatically generating agents that collect hidden Web pages. This method uses a pre-existing data repository to identify the contents of these pages and takes advantage of navigation patterns commonly found among Web sites to identify the paths to follow. To demonstrate the accuracy of our method, we discuss the results of a number of experiments carried out with sites from different domains.

Introduction

With the popularization of the World-Wide Web, a huge amount of data from a number of different domains has become available. However, managing and querying such data is not trivial, since we cannot make use of traditional database techniques. One way to deal with this data is through the so-called Web wrappers [1], programs that extract unstructured data from Web pages and store it in suitable formats, such as XML documents or relational tables.

To accomplish their task, Web wrappers take as input a set of pages from a Web source. This set of pages is generally collected by Web agents such as spiders or crawlers. Traditionally, these agents cover only the so-called Publicly Indexable Web (PIW) [2], which corresponds to the set of Web pages reachable simply by following hyperlinks. However, recent studies point out that the great majority of the pages containing useful data on the Web lies outside the PIW [2], [3]. This large portion of the Web is generally called the hidden [1], [2] or deep [3] Web. Pages in the hidden Web are dynamically generated by programs in response to forms submitted to searchable databases. Thus, to allow wrappers to deal with the data on such pages, we need a special type of agent to collect them, a type we call the hidden Web agent.

To illustrate the functionality expected from a hidden Web agent, suppose we want to fetch a set of data-rich pages from the Bookpool Web site. We start from the site’s main page, depicted in Fig. 1(a). Although a simple search box is available there (see arrow 1), an advanced search form, where users can better specify their needs, is preferable. Selecting the “Search” hyperlink (arrow 2) takes us to the advanced search page depicted in Fig. 1(b). After filling in some fields and submitting the form, an answer page is presented, as shown in Fig. 2(a). Note that not all of the available information is presented in the first answer page. Thus, we select the “Next” hyperlink (arrow 3) to get the remaining pages, like the one in Fig. 2(b). An agent that fetches these pages must act as just described.
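The following sketch (ours, not the paper's implementation) illustrates this kind of navigation in Python using the requests and BeautifulSoup libraries. The entry URL, the form field names, and the “Search” and “Next” link labels are assumptions made for illustration; a real site will differ and its form may use GET instead of POST.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START_URL = "http://www.example-bookstore.com/"  # hypothetical entry point


def fetch(session, url):
    """Download a page and parse it."""
    return BeautifulSoup(session.get(url, timeout=30).text, "html.parser")


def collect_answer_pages(query_fields):
    session = requests.Session()
    main_page = fetch(session, START_URL)

    # Step 1: follow the "Search" hyperlink to reach the advanced search form.
    search_link = main_page.find("a", string="Search")
    form_page = fetch(session, urljoin(START_URL, search_link["href"]))

    # Step 2: fill in and submit the advanced search form (assumed to use POST).
    form = form_page.find("form")
    action = urljoin(START_URL, form.get("action", ""))
    answer = BeautifulSoup(session.post(action, data=query_fields).text, "html.parser")

    # Step 3: follow the thread of answer pages through the "Next" hyperlink.
    pages = [answer]
    while (next_link := answer.find("a", string="Next")) is not None:
        answer = fetch(session, urljoin(action, next_link["href"]))
        pages.append(answer)
    return pages


# Example call with hypothetical field names:
# pages = collect_answer_pages({"author": "Knuth", "format": "Paperback"})
```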

In this paper, we present a method to generate hidden Web agents for sites with common navigational characteristics. The method uses a set of heuristics and a sample data repository for automatically finding relevant forms, filling them in, and collecting pages containing useful data. We also describe the results of a number of experiments carried out with sites from different domains to evaluate the accuracy of our method. These results show that our method is successful on 80% of the sites we have used in our experiments.
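As a rough illustration of how a sample data repository can drive form filling, the sketch below matches each text field of a form against attribute names stored in a small repository and fills it with a stored sample value. The matching rule and the repository layout are our assumptions, not the heuristics defined in the paper.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data repository: attribute name -> example values.
REPOSITORY = {
    "author": ["Knuth", "Tanenbaum"],
    "title": ["The Art of Computer Programming"],
    "publisher": ["Addison-Wesley"],
}


def fill_form(form_html):
    """Map each text input to a sample value whose attribute name matches it."""
    soup = BeautifulSoup(form_html, "html.parser")
    values = {}
    for field in soup.find_all("input", attrs={"type": "text"}):
        name = (field.get("name") or "").lower()
        for attribute, samples in REPOSITORY.items():
            if attribute in name:  # naive substring match between field name and attribute
                values[field["name"]] = samples[0]
                break
    return values


form = '<form><input type="text" name="book_author"><input type="text" name="isbn"></form>'
print(fill_form(form))  # {'book_author': 'Knuth'}
```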

The rest of the paper is organized as follows. Section 2 overviews related work. In Section 3 we introduce the navigation pattern concept. Section 4 outlines our method for generating hidden Web agents. Experimental results are described in Section 5. Finally, in Section 6 we conclude the paper.

Section snippets

Related work

In the last few years, we have witnessed an increasing interest in exploring the information available in the hidden Web. The term hidden Web was used by Lawrence and Giles [2] to refer to the huge set of Web pages dynamically generated as the output of queries over databases (or other kinds of remote processing), which are usually produced as the result of filling and submitting HTML forms. Bergman [3] experimentally demonstrated that a great portion of the hidden Web consists of pages

Navigation patterns

Developing a general solution for automatically generating agents to fetch hidden Web pages is a very difficult, if not impossible, task, because a dynamic page can be generated in many different ways. Thus, we restrict the scope of our method to sites respecting two common navigation patterns, which reflect common ways in which users navigate the Web. We formalize the concept of navigation pattern as follows.

Definition 1

A navigation pattern is denoted by a quintuple NP = (P, Σ, σ, p0, T), where P is a set of
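Since the definition is only partially reproduced in this snippet, the sketch below merely illustrates one plausible, automaton-style reading of the quintuple: P as a set of page classes, Σ as an alphabet of navigation actions, σ as a transition function, p0 as the entry page, and T as the set of terminal (data-rich) pages. The names and this interpretation are our assumptions, not the authors' formalization.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class NavigationPattern:
    pages: Set[str]                          # P: page classes (states)
    actions: Set[str]                        # Σ: navigation actions
    transition: Dict[Tuple[str, str], str]   # σ: (page, action) -> page
    start: str                               # p0: entry page
    terminals: Set[str]                      # T: data-rich pages to collect


# A pattern matching the bookstore example: entry page -> search form ->
# a chain of answer pages reached repeatedly through a "next" action.
example = NavigationPattern(
    pages={"entry", "form", "answer"},
    actions={"follow_search_link", "submit_form", "follow_next"},
    transition={
        ("entry", "follow_search_link"): "form",
        ("form", "submit_form"): "answer",
        ("answer", "follow_next"): "answer",
    },
    start="entry",
    terminals={"answer"},
)
```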

Agent generation

To fulfill its task, a hidden Web agent must be able to simulate a user’s navigation through a Web site, i.e., it must follow links, fill in forms, and follow threads of answer pages.
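A minimal sketch of these three capabilities as an abstract interface, together with a driver that chains them, is shown below; the class, the method names, and the step order are illustrative assumptions, not the interface of the agents generated by the method.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class HiddenWebAgent(ABC):
    @abstractmethod
    def follow_link(self, page: str, label: str) -> str:
        """Return the page reached through the hyperlink with the given label."""

    @abstractmethod
    def fill_and_submit(self, page: str, values: dict) -> str:
        """Fill in the form found on `page` with `values`; return the answer page."""

    @abstractmethod
    def answer_thread(self, first_answer: str) -> Iterator[str]:
        """Yield the chain of answer pages linked by 'Next'-style hyperlinks."""


def run(agent: HiddenWebAgent, entry_page: str, values: dict) -> List[str]:
    """Simulate the user's navigation: follow link, submit form, walk answer thread."""
    form_page = agent.follow_link(entry_page, "Search")
    first_answer = agent.fill_and_submit(form_page, values)
    return list(agent.answer_thread(first_answer))
```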

The generation of such an agent can be accomplished in a variety of ways. The traditional approach is to write specific code in a general-purpose language (e.g., Java), but this is known to be very time consuming and error prone, making maintenance difficult. A different approach is adopted by the ASByE tool [20]

Experimental results

In our experiments, we evaluated all steps of our method for automatically generating hidden Web agents, verifying the accuracy of each heuristic. We used 30 Web sites from three different domains: books, CDs, and software (see Table 1). For each domain, we selected the top ten sites listed in Google’s Web directory, which ranks sites by popularity, and used their URLs as the sites’ entry points.

Almost all sites used have a simple search box on every page, as we can see from the large number of

Conclusions

In this paper, we presented a method for automatically generating agents for collecting pages from the hidden Web. Unlike other approaches, which are domain-dependent [15] or assume sites with a single and very simple navigation structure [16], [17], [18], our method is able to handle sites with more complex structures, generating agents with no user intervention.

Although restricted to sites that follow two navigation patterns, our method covers a large number of sites from different domains, as

Acknowledgements

This work was partially supported by Project SIAM (MCT/CNPq/PRONEX grant 76.97.1016.00), CNPq grant 46.7775/00-1, and research funding from the Brazilian National Program in Informatics (Decree-law 3800/01). The authors would like to thank the anonymous reviewers whose comments helped to improve this paper.


References (23)

  • V. Crescenzi, G. Mecca, P. Merialdo, RoadRunner: Towards automatic data extraction from large Web sites, in:...

    Juliano Palmieri Lage received a B.Sc. degree in Computer Science from the Federal University of Minas Gerais, Brazil and currently is an M.Sc. student in the same institution. His research interests include compression algorithms and query processing on compressed data, databases, semistructured data, and information retrieval systems.

Altigran Soares da Silva received a B.Sc. degree in Data Processing in 1990 from the Federal University of Amazonas (UFAM), Brazil, where he has held a position as lecturer since 1991. He received an M.Sc. (1995) and a Ph.D. (2002) in Computer Science from the Federal University of Minas Gerais (UFMG), Brazil. Currently, he is an associate professor at the Computer Science Department of UFAM and participates as an associate researcher of the UFMG Database Group. He has worked on a number of research projects funded by Brazilian national agencies such as the National Research Council (CNPq) and has served as an external reviewer and program committee member for conferences on databases and Web technology worldwide. In 2003, he is the organizing committee chair of the 18th Brazilian Symposium on Databases and the 17th Brazilian Symposium on Software Engineering. His main research interests include extraction and management of semi-structured data, Web information retrieval, and database modeling and design. He is a member of the Brazilian Computer Society and the ACM.

    Paulo B. Golgher is the Chief Technology Officer at Akwan Information Technologies, Brazil. His research interests include semistructured data, Web agents, and information retrieval. He received an M.Sc. in computer science from the Federal University of Minas Gerais.

    Alberto H.F. Laender received a B.Sc. degree in Electrical Engineering and an M.Sc. degree in Computer Science from the Federal University of Minas Gerais, Brazil, and a Ph.D. degree in Computing from the University of East Anglia, UK. He joined the Computer Science Department of the Federal University of Minas Gerais in 1975, where he is currently a Full Professor and the head of the Database Research Group. In 1997, he was a Visiting Scientist at the Hewlett–Packard Palo Alto Laboratories. He has served as a program committee member for several national and international conferences on databases and Web-related topics. He also served as a program committee co-chair for the 19th International Conference on Conceptual Modeling held in Salt Lake City, Utah, in October 2000, and as the program committee chair for the 9th International Symposium on String Processing and Information Retrieval held in Lisbon, Portugal, in September 2002. His research interests include conceptual database modeling, database design methods, database user interfaces, semistructured data, Web data management, and digital libraries.
