
Decision Support Systems

Volume 44, Issue 2, January 2008, Pages 482–494

A machine learning approach to web page filtering using content and structure analysis

https://doi.org/10.1016/j.dss.2007.06.002

Abstract

As the Web continues to grow, it has become increasingly difficult to search for relevant information using traditional search engines. Topic-specific search engines provide an alternative way to support efficient information retrieval on the Web by providing more precise and customized searching in various domains. However, developers of topic-specific search engines need to address two issues: how to locate relevant documents (URLs) on the Web and how to filter out irrelevant documents from a set of documents collected from the Web. This paper reports our research in addressing the second issue. We propose a machine-learning-based approach that combines Web content analysis and Web structure analysis. We represent each Web page by a set of content-based and link-based features, which can be used as the input for various machine learning algorithms. The proposed approach was implemented using both a feedforward/backpropagation neural network and a support vector machine. Two experiments were designed and conducted to compare the proposed Web-feature approach with two existing Web page filtering methods: a keyword-based approach and a lexicon-based approach. The experimental results showed that the proposed approach generally performed better than the benchmark approaches, especially when the number of training documents was small. The proposed approach can be applied to topic-specific search engine development and other Web applications such as Web content management.

Introduction

The most popular way to look for information on the Web is to use Web search engines such as Google (www.google.com) and AltaVista (www.altavista.com). Many users begin their Web activities by submitting a query to a search engine. However, as the Web continues to grow and the number of indexable pages has exceeded eight billion, it has become increasingly difficult for search engines to keep an up-to-date and comprehensive search index. Users often find it difficult to locate useful, high-quality information using general-purpose search engines, especially when searching for specific information on a given topic.

Many vertical search engines, or topic-specific search engines, have been built to facilitate more efficient searching in various domains. These search engines alleviate the information overload problem to some extent by providing more precise results and more customized features [11]. For example, LawCrawler (www.lawcrawler.com) allows users to search for legal information and provides links to lawyers, legal resources, and relevant government Web sites. BuildingOnline (www.buildingonline.com) is a specialized search engine for the building industry, where users can search by manufacturers, architects, associations, contractors, etc. BioView.com (www.bioview.com) and SciSeek (www.sciseek.com) are two other examples that focus on scientific domains.

Although they provide a promising alternative for users, vertical search engines are not easy to build. There are two major challenges: (1) how to locate relevant documents on the Web, and (2) how to filter irrelevant documents out of a collection. This study addresses the second challenge and proposes a new approach. The remainder of the paper is structured as follows. Section 2 reviews existing work on vertical search engine development, text classification, and Web content and structure analysis. Section 3 discusses problems with existing Web page filtering approaches and poses our research questions. Section 4 describes our proposed approach in detail. Section 5 describes the experiments designed to evaluate our approach and presents the results. Section 6 concludes with a discussion and suggestions for future research.

Section snippets

Building vertical search engines

A good vertical search engine should contain as many relevant, high-quality pages and as few irrelevant, low-quality pages as possible. Given the Web's large size and diversity of content, it is not easy to build a comprehensive and relevant collection for a vertical search engine. There are two main problems:

  • The search engine needs to locate the URLs that point to relevant Web pages. To improve efficiency, it is necessary for the page collection system to predict which URL is the most likely

Research questions

Based on the review, we identified several problems with traditional approaches to Web page filtering. First, a manual approach is very labor-intensive and time-consuming. Although such an approach can achieve high quality, it is usually not feasible under limited resources. The keyword-based and lexicon-based approaches can automate the process, but both have shortcomings. A simple keyword-based approach cannot deal with the problem of polysemy, i.e., words having more than one semantic
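To make the polysemy problem concrete, the toy sketch below shows how a naive keyword filter can accept an irrelevant page because its keywords match in the wrong sense. The keyword list, threshold, and sample pages are invented for illustration and are not taken from the paper.

```python
# Hypothetical keyword-based filter; keywords and pages are invented examples.

MEDICAL_KEYWORDS = {"virus", "cancer", "operation", "cell"}

def keyword_filter(page_text: str, threshold: int = 2) -> bool:
    """Accept a page if it contains at least `threshold` topic keywords."""
    words = set(page_text.lower().split())
    return len(words & MEDICAL_KEYWORDS) >= threshold

# A computing page slips through because "virus", "operation", and "cell"
# are polysemous: the filter cannot tell word senses apart.
computing_page = "remove the virus before the next read operation on the memory cell"
medical_page = "the study reports a new cancer therapy targeting the tumor cell"

print(keyword_filter(computing_page))  # True (false positive)
print(keyword_filter(medical_page))    # True
```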

A Web-feature approach

To address the problems with current approaches in Web page filtering, we propose an approach that incorporates Web content and structure analysis into Web filtering. Instead of representing each document as a bag of words, each Web page is represented by a limited number of content and link features. This reduces the dimensionality (the number of attributes used) of the classifier and thus the number of training examples needed. The characteristics of Web structure also can be incorporated
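As a concrete illustration of this representation, the sketch below maps a page to a short, fixed-length feature vector and trains the two classifier types named in the abstract. It assumes scikit-learn, and the six feature names and all values are invented placeholders; the paper's actual 14 content- and link-based features are defined in Section 4.

```python
# A minimal sketch of the Web-feature representation, assuming scikit-learn.
# Feature names and values are illustrative placeholders, not the paper's.

import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def page_features(page: dict) -> np.ndarray:
    """Map a page to a short vector of content- and link-based scores."""
    return np.array([
        page["title_score"],      # topic terms appearing in the page title
        page["body_score"],       # topic terms in the page body
        page["anchor_score"],     # topic terms in anchor text of in-links
        page["in_links"],         # number of incoming links
        page["hub_score"],        # hub score from link analysis (e.g., HITS)
        page["authority_score"],  # authority score from link analysis
    ], dtype=float)

# Toy labeled data: one feature vector per page; 1 = relevant, 0 = irrelevant.
X = np.array([[3, 12, 4, 25, 0.7, 0.8],
              [0,  1, 0,  2, 0.1, 0.0],
              [2,  9, 3, 14, 0.5, 0.6],
              [1,  2, 0,  1, 0.0, 0.1]], dtype=float)
y = np.array([1, 0, 1, 0])

svm = SVC(kernel="linear").fit(X, y)            # support vector machine
nn = MLPClassifier(hidden_layer_sizes=(8,),     # feedforward net trained
                   max_iter=2000,               # with backpropagation
                   random_state=0).fit(X, y)

new_page = np.array([[2, 10, 2, 8, 0.4, 0.5]])
print(svm.predict(new_page), nn.predict(new_page))
```

Because the vector has only a handful of dimensions, far fewer labeled pages are needed than for a bag-of-words classifier over a full vocabulary.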

Experiment testbed

To evaluate the proposed approach, we conducted two experiments comparing it with traditional approaches. The medical field was chosen as the evaluation domain because many diverse users (including medical doctors, researchers, librarians, and the general public) seek important, high-quality health information on the Web. It is also important for such users to distinguish between Web pages of good and poor quality [14].
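The sketch below illustrates the style of comparison used in such experiments: classification accuracy is measured as the number of training documents varies. It assumes scikit-learn and uses randomly generated placeholder data, not the paper's medical testbed or its actual results.

```python
# Hedged sketch of the evaluation protocol; the data is synthetic.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                # stand-in feature vectors
y = (X[:, 0] + X[:, 3] > 0).astype(int)      # stand-in relevance labels

# Vary the training set size to mimic the focus on performance
# when the number of training documents is small.
for n_train in (20, 50, 200):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=n_train, random_state=0, stratify=y)
    clf = SVC(kernel="linear").fit(X_tr, y_tr)
    print(n_train, round(accuracy_score(y_te, clf.predict(X_te)), 3))
```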

A Web page testbed and a medical

Conclusion and future directions

In this paper, we have described a Web-feature approach to Web page classification that combines Web content analysis and Web structure analysis. We compared our approach with traditional text classification methods and found the experimental results encouraging. We believe that the proposed approach is useful for various Web applications, especially vertical search engine development.

While the Web-feature approach is promising, it would be interesting to examine which of the 14

Acknowledgements

This project has been supported in part by the following grants:

  • NSF Digital Library Initiative-2 (PI: H. Chen), “High-performance Digital Library Systems: From Information Retrieval to Knowledge Management,” IIS-9817473, April 1999–March 2002;

  • NIH/NLM Grant (PI: H. Chen), “UMLS Enhanced Dynamic Agents to Manage Medical Knowledge,” 1 R01 LM06919-1A1, February 2001–January 2004;

  • HKU Seed Funding for Basic Research (PI: M. Chau), “Using Content and Link Analysis in Developing Domain-specific Web

References (53)

  • O. Baujard et al., Trends in medical information retrieval on the Internet, Computers in Biology and Medicine (1998)

  • M. Chau et al., Building a scientific knowledge web portal: the NanoPort experience, Decision Support Systems (2006)

  • A. Arasu et al., Searching the Web, ACM Transactions on Internet Technology (2001)

  • S. Brin et al., The anatomy of a large-scale hypertextual web search engine

  • K.M.A. Chai et al., Bayesian online classifiers for text classification and filtering

  • S. Chakrabarti et al., Enhanced hypertext categorization using hyperlinks

  • S. Chakrabarti et al., Mining the Web's link structure, IEEE Computer (1999)

  • S. Chakrabarti et al., Focused crawling: a new approach to topic-specific Web resource discovery

  • M. Chau et al., Comparison of three vertical search spiders, IEEE Computer (2003)

  • M. Chau et al., Personalized and focused Web spiders

  • M. Chau et al., Incorporating Web analysis into neural networks: an example in Hopfield net searching, IEEE Transactions on Systems, Man, and Cybernetics (Part C) (2007)

  • H. Chen, Machine learning for information retrieval: neural networks, symbolic learning, and genetic algorithms, Journal of the American Society for Information Science (1995)

  • H. Chen et al., An intelligent personal spider (agent) for dynamic Internet/Intranet searching, Decision Support Systems (1998)

  • H. Chen et al., HelpfulMed: intelligent searching for medical information over the Internet, Journal of the American Society for Information Science and Technology (2003)

  • F.C. Cheong, Internet Agents: Spiders, Wanderers, Brokers, and Bots (1996)

  • J. Cho et al., Efficient crawling through URL ordering

  • W.W. Cohen, Text categorization and relational learning

  • W.W. Cohen et al., Context-sensitive learning methods for text categorization, ACM Transactions on Information Systems (1999)

  • M. Diligenti et al., Focused crawling using context graphs

  • S.T. Dumais et al., Inductive learning algorithms and representations for text categorization

  • J. Fürnkranz, Exploiting structural information for text categorization on the WWW

  • M. Iwayama et al., Cluster-based text categorization: a comparison of category search strategies

  • T. Joachims, Text categorization with support vector machines: learning with many relevant features

  • T. Joachims, Making large-scale SVM learning practical

  • T. Joachims et al., Composite kernels for hypertext categorization

  • J. Kleinberg, Authoritative sources in a hyperlinked environment


Michael Chau is an Assistant Professor and the BBA(IS)/BEng(CS) Coordinator in the School of Business at the University of Hong Kong. He received his PhD degree in management information systems from the University of Arizona and a bachelor's degree in computer science and information systems from the University of Hong Kong. His current research interests include information retrieval, Web mining, data mining, knowledge management, and security informatics. He has published more than 60 research articles in leading journals and conferences, including IEEE Computer, Journal of the American Society for Information Science and Technology, Decision Support Systems, ACM Transactions on Information Systems, and Communications of the ACM. More information can be found at http://www.business.hku.hk/~mchau/.

Hsinchun Chen is a McClelland Professor of Management Information Systems at the University of Arizona and was named Andersen Consulting Professor of the Year (1999). He received his BS degree from National Chiao-Tung University in Taiwan, his MBA from SUNY Buffalo, and his PhD in Information Systems from New York University. Dr. Chen is a Fellow of the IEEE and AAAS. He received the IEEE Computer Society 2006 Technical Achievement Award. He is author/editor of 13 books, 17 book chapters, and more than 130 SCI journal articles covering intelligence analysis, biomedical informatics, data/text/Web mining, digital library, knowledge management, and Web computing. Dr. Chen was ranked #8 in publication productivity in Information Systems (CAIS 2005) and #1 in Digital Library research (IP&M 2005) in two recent bibliometric studies. He serves on ten editorial boards, including ACM Transactions on Information Systems, IEEE Transactions on Systems, Man, and Cybernetics, Journal of the American Society for Information Science and Technology, and Decision Support Systems. Dr. Chen has served as a Scientific Counselor/Advisor of the National Library of Medicine (USA), Academia Sinica (Taiwan), and the National Library of China (China). He has been an advisor for major NSF, DOJ, NLM, DOD, DHS, and other international research programs in digital library, digital government, medical informatics, and national security research. Dr. Chen is founding director of the Artificial Intelligence Lab and the Hoffman E-Commerce Lab. He was conference co-chair of the ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2004 and has served as the conference/program co-chair for the past eight International Conferences of Asian Digital Libraries (ICADL), the premier digital library meeting in Asia that he helped develop. Dr. Chen is also (founding) conference co-chair of the IEEE International Conferences on Intelligence and Security Informatics (ISI) 2003–2007. Dr. Chen has also received numerous awards in information technology and knowledge management education and research, including the AT&T Foundation Award, the SAP Award, the Andersen Consulting Professor of the Year Award, the University of Arizona Technology Innovation Award, and the National Chiao-Tung University Distinguished Alumnus Award. Further information can be found at http://ai.arizona.edu/hchen/.
