
About this book

Mining the World Wide Web: An Information Search Approach explores the concepts and techniques of Web mining, a promising and rapidly growing field of computer science research. Web mining is a multidisciplinary field, drawing on such areas as artificial intelligence, databases, data mining, data warehousing, data visualization, information retrieval, machine learning, markup languages, pattern recognition, statistics, and Web technology. Mining the World Wide Web presents the Web mining material from an information search perspective, focusing on issues relating to the efficiency, feasibility, scalability and usability of searching techniques for Web mining.
Mining the World Wide Web is designed for researchers and developers of Web information systems and also serves as an excellent supplemental reference to advanced level courses in data mining, databases and information retrieval.

Table of Contents

Frontmatter

Information Retrieval on the Web

Frontmatter

Chapter 1. Keyword-Based Search Engines

Abstract
The World Wide Web (WWW), also known as the Web, was introduced in 1992 at the Center for European Nuclear Research (CERN) in Switzerland [28]. What began as a means of facilitating data sharing in different formats among physicists at CERN is today a mammoth, heterogeneous, non-administered, distributed, global information system that is revolutionizing the information age. The Web is organized as a set of hypertext documents interconnected by hyperlinks, used in the Hypertext Markup Language (HTML) to construct links between documents. The many potential benefits the Web augurs have spurred research in information search/filtering [54, 154], Web/database integration [43, 168], Web querying systems [3, 150, 155, 187], and data mining [66, 252]. The Web has also brought together researchers from areas as diverse as communications, electronic publishing, language processing, and databases, as well as from multiple scientific and business domains.
George Chang, Marcus J. Healey, James A. M. McHugh, Jason T. L. Wang

Chapter 2. Query-Based Search Systems

Abstract
A shortcoming of the keyword-based search tools discussed in Chapter 1 is the lack of high-level querying facilities for information retrieval on the Web. Query languages have long been part of database management systems (DBMSs), the standard DBMS query language being the Structured Query Language (SQL) [104, 151, 221]. Such query languages not only provide a structured way of accessing the data stored in a database, but also hide details of the database structure from the user. Since the Web is often viewed as a gigantic database holding vast stores of information, some Web-oriented query systems have been developed. However, unlike the highly structured data found in a DBMS, information on the Web is stored mainly as files. The files can be generally categorized as:
  • Structured, such as flat databases and BibTeX files.
  • Semistructured, such as HTML, XML, and LaTeX files.
  • Unstructured, such as sound, image, pure text and executable files.
George Chang, Marcus J. Healey, James A. M. McHugh, Jason T. L. Wang
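
The three file categories in the Chapter 2 abstract can be illustrated with a minimal sketch. The extension table below is an assumption made for illustration only (the abstract names file kinds, not extensions), and the function name `categorize` is hypothetical; a real system would likely inspect content rather than file names.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the three categories named
# in the abstract. The specific extensions are illustrative assumptions.
CATEGORY_BY_EXTENSION = {
    ".bib": "structured",        # BibTeX files
    ".csv": "structured",        # a flat database (assumed format)
    ".html": "semistructured",
    ".htm": "semistructured",
    ".xml": "semistructured",
    ".tex": "semistructured",    # LaTeX files
    ".wav": "unstructured",      # sound
    ".png": "unstructured",      # image
    ".txt": "unstructured",      # pure text
    ".exe": "unstructured",      # executable
}


def categorize(path: str) -> str:
    """Return the category for a file path, or "unknown" if unmapped."""
    suffix = PurePosixPath(path).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "unknown")
```

For example, `categorize("refs.bib")` falls in the structured group, while a path with an unlisted extension is reported as `"unknown"` rather than guessed.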

Chapter 3. Mediators and Wrappers

Abstract
Search engines and directories, as described in Chapter 1, provide Internet users with rapid retrieval of information, but do not provide a database-like query language to retrieve information based, for example, on the underlying structure of the HTML documents. The lack of such query languages is due largely to the semistructured nature of Web data [1], which is unsuitable for retrieval and storage in a relational database form. Systems based on mediators and data warehouses have been introduced to overcome the inability to query Web data using a full-fledged database query language. In contrast to the approaches described in Chapter 2, which employ search engines as backends, mediators and data warehouses are based on a database management system (DBMS). Thus, the query languages in Chapter 2 would be implemented using search engines and directories as backends. Such a query system would generate, from the user’s query, a query or a set of queries that can be executed on the search engines. The responses returned from the search engines would then be compiled and supplied to the user. In the mediator or data warehouse approach, on the other hand, the user interacts with the DBMS, which in turn interacts with the Web.
George Chang, Marcus J. Healey, James A. M. McHugh, Jason T. L. Wang
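
The search-engine-backed approach described in the Chapter 3 abstract (translate the user's query into one or more engine queries, execute them, compile the responses) might be sketched as follows. This is not the book's implementation: the `translate` rule (splitting on `" OR "`), the `Backend` type, and the function names are all assumptions chosen to keep the example small, and the backends are plain callables standing in for real search engines.

```python
from typing import Callable, Iterable, List

# A backend is any callable that takes a keyword query and yields result
# URLs. It stands in for a real search engine (an assumption of this sketch).
Backend = Callable[[str], Iterable[str]]


def translate(user_query: str) -> List[str]:
    """Hypothetical translation step: one engine query per 'OR' branch."""
    return [part.strip() for part in user_query.split(" OR ") if part.strip()]


def run_query(user_query: str, backends: List[Backend]) -> List[str]:
    """Send each generated query to every backend and compile the results.

    Duplicate URLs are dropped; first-seen order is preserved.
    """
    seen = set()
    compiled: List[str] = []
    for engine_query in translate(user_query):
        for backend in backends:
            for url in backend(engine_query):
                if url not in seen:
                    seen.add(url)
                    compiled.append(url)
    return compiled
```

In the mediator or data warehouse approach, by contrast, the user's query would go to a DBMS rather than being fanned out to search engines in this way.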

Chapter 4. Multimedia Search Engines

Abstract
Multimedia is integral to both human and modern computer communications. As digital sound and imagery proliferate, the need to search for audio and visual information has increased. However, most popular search engines are still textual as described in Chapter 1, even though the diversity of Web content has transformed the Web from a merely textual to a multimedia-based repository. Web information content comes in a variety of audio, video, image, and text formats; a list of the most commonly found media formats and types is given in Table 4.1. The multimedia information is highly distributed, minimally indexed, and lacks appropriate schemas. The critical question in multimedia search is how to design a scalable visual information retrieval system. Such audio and visual information systems require large resources for transmission, storage and processing, factors which make indexing, retrieving, and managing visual information an immense challenge.
George Chang, Marcus J. Healey, James A. M. McHugh, Jason T. L. Wang

Data Mining on the Web

Frontmatter

Chapter 5. Data Mining

Abstract
Advances in storage, networking, processing power, and software technology have enabled us to efficiently store and retrieve digital data in databases, data warehouses, or other information repositories. What can we do with this accumulated data? Ignoring this data would be wasteful because much of the needed knowledge is waiting to be discovered in the repositories. The underlying thesis of data mining is to use techniques to discover and extract valuable knowledge in this data. The vast amount of data on the Web, in particular, has made data exploration tools essential. This chapter presents the basic concepts of data mining and knowledge discovery in databases.
George Chang, Marcus J. Healey, James A. M. McHugh, Jason T. L. Wang

Chapter 6. Text Mining

Abstract
The data mining techniques used in knowledge discovery as described in Chapter 5 were originally designed to extract information from structured data. However, most data on the Web is unstructured, stored in documents or in non-alpha-numeric form such as images. Most is textual, found in memos, e-mail messages, or similar documents. Previously developed techniques for data mining are unsuitable for analyzing such unstructured textual information. In this chapter, we present the main ideas of knowledge discovery in textual information, which is called text mining.
George Chang, Marcus J. Healey, James A. M. McHugh, Jason T. L. Wang

Chapter 7. Web Mining

Abstract
Data mining and knowledge discovery in large collections of data are known to be effective and useful, as discussed in Chapter 5. With the growth of online data on the Web, the opportunity has arisen to utilize data mining techniques to analyze data stored on Web servers across the Internet. The application of data mining techniques to Web data, called Web mining, is used to discover patterns in this sea of information. Web mining is an evolutionary step beyond the mere resource discovery and information extraction already supported by Web information retrieval systems such as the search engines and directories described in Chapter 1. In this chapter, we explore different ways of performing Web mining.
George Chang, Marcus J. Healey, James A. M. McHugh, Jason T. L. Wang

Chapter 8. Web Crawling Agents

Abstract
An essential component of information mining and pattern discovery on the Web is the Web Crawling Agent (WCA). General-purpose Web Crawling Agents, which were briefly described in Chapter 1, are intended to be used for building generic portals. The diverse and voluminous nature of Web documents presents formidable challenges to the design of high performance WCAs. They require both powerful processors and a tremendous amount of storage, and yet even then can only cover restricted portions of the Web. Nonetheless, despite their fundamental importance in providing Web services, the design of WCAs is not well-documented in the literature. This chapter describes the conceptual design and implementation of Web crawling agents.
George Chang, Marcus J. Healey, James A. M. McHugh, Jason T. L. Wang

A Case Study in Environmental Engineering

Frontmatter

Chapter 9. Envirodaemon

Abstract
Engineers and scientists have longed for instantaneous, distributed network access to the entire science and technology literature. These longings are well on their way to being realized as a result of the improvement and convergence of the computing and communications infrastructure and the indispensability of the Internet for scientific research. The size of the organization able to perform a search has decreased as groups of lay people and scientists can now search ‘digital libraries’ without the aid of trained reference librarians.
George Chang, Marcus J. Healey, James A. M. McHugh, Jason T. L. Wang

Backmatter
