Elsevier

Applied Soft Computing

Volume 4, Issue 4, September 2004, Pages 423-432
Applied Soft Computing

Web document clustering using a hybrid neural network

https://doi.org/10.1016/j.asoc.2004.02.003Get rights and content

Abstract

The list of documents returned by Internet search engines in response to a query these days can be quite overwhelming. There is an increasing need for organising this information and presenting it in a more compact and efficient manner. This paper describes a method developed for the automatic clustering of World Wide Web documents, according to their relevance to the user’s information needs, by using a hybrid neural network. The objective is to reduce the time and effort the user has to spend to find the information sought after. Clustering documents by features representative of their contents—in this case, key words and phrases—increases the effectiveness and efficiency of the search process. It is shown that a two-dimensional visual presentation of information on retrieved documents, instead of the traditional linear listing, can create a more user-friendly interface between a search engine and the user.

Introduction

The amount of textual information available through the World Wide Web has grown exponentially in recent years. Consequently, there is a growing need for more precise techniques for presenting documents relevant to a user’s information needs. We propose a hybrid artificial neural network-based technique to cluster the documents retrieved for presentation to the user in a two-dimensional manner.

There is a great deal of literature on document retrieval. The main focus is on the relevance of the documents to the user’s information needs. The problem is that a Web search engine requires a precise and comprehensive query expression in order to perform the search and ranking of documents so that only the relevant documents are presented to the user. However, a Web user tends to present only short queries. The need to accept short queries while overcoming the limitations of their imprecision and incompleteness has given rise to research into ways of improving the information retrieval (IR) results based on short queries. A number of research work on automatic keyword extraction and term expansion has been reported (e.g. [3], [19], [37], [46]).

Instead of extracting single word terms, phrases can be extracted. The usefulness of phrases in information retrieval has been well recognized [1], [7], [10], [12], [20], [30], [33], [42], [43]. However, since Croft et al.’s work reported in [7] in 1991, not much progress appears to have been made so far in using phrases to improve the performance of information retrieval. Narita and Ogawa [33] conducted experiments to extract phrases from natural language queries for query expansion. These experimental results indicated some improvement in the effectiveness of information retrieval where the query texts were long. Croft and Das [6] used user-identified phrases in Fagan’s algorithm and a probabilistic model without significant improvement in information retrieval.

Research has been conducted on the analysis and mapping of large collections of text documents using self-organising maps such as the WEBSOM [17], [18], [27], where similar documents are placed close to each other. Users are able to retrieve documents by selecting corresponding positions on the map. WEBSOM performs at least comparably with more conventional retrieval methods [26].

Research has also been conducted on using user feedback to improve relevance of information retrieval (e.g. [14], [36], [39]). However, it appears that there are upper limits to further improvements using relevance feedback [44]. Ad hoc feedback or blind feedback [3], [35] is another commonly used strategy. It is based on the assumption that search engines return the most relevant documents at the top of the list. A small set of documents is first retrieved based on the initial query submitted by the user, and then terms are extracted from these documents for query expansion. Conjunctions and disjunctions are usually used to join the terms together, to form a single long query. Thus, it is important that the initially retrieved documents are relevant to the user’s query, as otherwise documents subsequently retrieved will more likely be irrelevant to the user’s information needs. This will cause an alteration in focus of the search topic—known as query drift [31]. Khor and Khan [24] approached query expansion differently in that, instead of adding terms to the initial query, relevant key phrases were extracted to form separate queries. These were then used separately to retrieve the top-ranked documents. Basing on the assumption that search engines return documents ranked in descending order of relevance to the query terms, this method tends to retrieve more relevant documents than the conventional query expansion technique. We have adopted the key phrase extraction and query expansion method described in [17] to retrieve the relevant documents.

With the large collection of retrieved documents, it is necessary to process them into a format suitable for presentation to the user. This involves sorting and ranking the documents according to their relevance to the user’s information needs. The conventional procedure is to rank the retrieved documents in descending order of relevance according to certain predetermined criteria.

The usual outcome of the ranking process applied by a search engine is a long list of document titles. The main drawback of such an approach is that the user is still required to browse through this long list to select those that are actually considered to be of interest. Another shortcoming is that the resultant list of documents from a search engine does not make distinctions between the different concepts that may be present in the query, as the list inevitably has to be ranked sequentially. The problem lies mainly in the presentation of the list of document titles. These documents are usually listed serially irrespective of the similarity or dissimilarity in their contents—that is, it does not make distinctions between the different concepts. Thus, two documents appearing next to each other in the list may not necessarily be of a similar nature and vice versa. As the list of documents grows longer, the amount of time and effort needed to browse through the list to look for relevant documents increases. In an attempt to circumvent this problem, Lin et al. [28] used the Kohonen Feature Map to construct a self-organising semantic map such that documents of similar contents are placed close to one another. Document clustering is also commonly used to automatically organize large collection of retrieval results, by grouping together documents that belong to the same topic in order to facilitate user’s browsing of retrieval results [16]. Most clustering algorithms use the vector space model of information retrieval [40], in which text documents are represented as a set of points (or term vector) in a high dimensional vector space. Each direction of the vector space corresponds to a unique term in the document collection, and the component of a document vector along a given direction corresponds to the importance of that term to the document. Similarity between two texts is traditionally measured by the cosine of the angle between their vectors, though Cartesian distance is also used. Documents judged to be similar by this measure are grouped together by the clustering algorithm. These algorithms are notoriously slow because of the so-called curse of dimensionality [2] resulting from the high dimensionality of the vector space. Dimensionality reduction techniques are further discussed in Section 2.2.

We propose a hybrid ANN-based method for reducing dimensionality, and providing the user with documents clustered according to the similarity of their contents. The nature of a document’s contents is represented using key words and phrases found in them. Each cluster is associated with the group of key words and phrases found in documents belonging to it. This enables the user to be selective in choosing particular clusters of documents to browse—thus making significant savings in browsing time and effort.

Section 2 presents the document representation scheme used in our investigation, and discusses the resulting problem of high dimensionality. In Section 3, we describe the methodology and tools used for clustering and ranking documents for visual presentation. Section 4 contains details of experiments carried out to evaluate the accuracy of the clustering process implemented. The paper concludes with Section 5, which highlights the main contributions of the work and points to further investigations to be done in future.

Section snippets

Document representation and dimensionality

The issue of a suitable representation scheme for documents needs to be resolved before any analysis can be carried out on them for clustering purposes. The dimensionality associated with the chosen representation scheme is of concern as it impacts on the efficiency of any document analysis and clustering technique. Both these issues are discussed below.

Document clustering using PCA–ART

Fig. 1 shows the stages involved in the clustering and ranking of documents retrieved by a search engine. The process starts with the collection of documents retrieved by a search engine in response to the user query. The document retrieval process may be carried out in two phases (not shown in Fig. 1)—first using the query expression initially submitted by the user (see examples in Section 4.1), and then using an expanded query formed using key phrases extracted from documents retrieved in the

Experimental results

This section describes the experiments carried out to test the clustering performance of the PCA–ART1 neural network, including its effectiveness in extracting document features, and in filtering off less relevant documents. As our principal aim was to evaluate the performance of Web document clustering, we chose to construct our own test document collection to be subjected to the clustering process. This was done by submitting different queries to the Google search engine, and collecting the

Conclusion

There appears to be no reported study on the use of hybrid PCA–ART neural network for textual document processing. The investigation reported in this paper is an attempt to utilize a hybrid neural net consisting of a PCA and an ART1 network designed specifically for more intelligent and user-friendly information retrieval from the World Wide Web.

The problem of high dimensionality in term vectors used for representing documents has been dealt with through principal component analysis carried out

References (47)

  • P.G. Anick, S. Vaithyanathan, Exploiting clustering and phrases for context-based information retrieval, in:...
  • R. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, New Jersey,...
  • C. Buckley, G. Salton, et al., Automatic query expansion using smart: Trec 3, in: Proceedings of the 3rd Text REtrieval...
  • G.A Carpenter et al.

    A massively parallel architecture for a self-organizing neural pattern recognition machine

    J. Comput. Vision Graphics Image Process.

    (1987)
  • G.A Carpenter et al.

    The ART of adaptive pattern recognition by a self-organizing neural network

    J. IEEE Comput.

    (1988)
  • W.B. Croft, R. Das, Experiments with query acquisition and use in document retrieval systems, in: Proceedings of the...
  • W.B. Croft, H.R. Turtle, et al., The use of phrases and structured queries in information retrieval, in: Proceedings of...
  • S Deerwester et al.

    Indexing by latent semantic analysis

    J. Am. Soc. Information Sci.

    (1990)
  • K.I. Diamantaras, S.Y. Kung, Principal Component Neural Networks: Theory and Applications, Wiley, New York,...
  • J.L. Fagan, Automatic phrase indexing for document retrieval: an examination of syntactic and non-syntactic methods,...
  • J.L Fagan

    The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval

    J. Am. Soc. Information Sci.

    (1989)
  • E. Frank, G.W. Paynter, et al., Domain-specific keyphrase extraction, in: Proceedings of the 16th International Joint...
  • D. Guo, D. Peuquet, et al., Opening the black box: interactive hierarchical clustering for multivariate spatial...
  • D. Harman, Relevance feedback revisited, in: Proceedings of the 15th Annual International SIGIR’92, Copenhagen,...
  • M.H. Hassoun, Fundamentals of Artificial Neural Networks, MIT Press, Cambridge,...
  • M.A. Hearst, J.O. Pedersen, Reexamining the cluster hypothesis: scatter/gather on retrieval results, J. ACM...
  • T. Honkela, S. Kaski, et al., Exploration of full-text databases with self-organizing maps, in: Proceedings of the...
  • T. Honkela, S. Kaski, et al., WEBSOM—self-organizing maps of document collections, in: Proceedings f the WSOM’97,...
  • Y. Jing, W.B. Croft, An association thesaurus for information retrieval, in: Proceedings of the RIAO, Rockefeller...
  • H. Joho, M. Sanderson, Retrieving descriptive phrases from large amounts of free text, in: Proceedings of the 9th...
  • I.T. Jolliffe, Principal Component Analysis, Springr-Verlag, New York,...
  • K.S. Jones, A statistical interpretation of term specificity and its application in retrieval. Document retrieval...
  • L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York,...
  • Cited by (17)

    • Meta-cognitive RBF Network and its Projection Based Learning algorithm for classification problems

      2013, Applied Soft Computing Journal
      Citation Excerpt :

      In a classification problem, the objective is to learn the decision surface that accurately maps an input feature space to an output space of class labels. Several learning algorithms for different neural network architectures have been used in various problems in science, business, industry and medicine, including the handwritten character recognition [2], speech recognition [3], biomedical medical diagnosis [4], prediction of bankruptcy [5], text categorization [6] and information retrieval [7]. Among various architectures reported in the literature, Radial Basis Function (RBF) network gaining attention due to its localization property of Gaussian function, and widely used in classification problems.

    • Clustering of document collection - A weighting approach

      2009, Expert Systems with Applications
      Citation Excerpt :

      Also, this work contains a convenient method to predict the optimal number of clusters. In addition to the algorithms mentioned above, several heuristics algorithms, such as genetic algorithm (Alguliev & Aliguliyev, 2005; Aliguliyev, 2006; Garai & Chaudhuri, 2004; Laszlo & Mukherjee, 2006, 2007), neural network (Khan & Khor, 2004), ant algorithm (Azzag, Venturini, Oliver, & Guinot, 2007; Hartmann, 2005), particle swarm intelligence (Azzag et al., 2007; Cui, Potok, & Palathingal, 2005; Das, Abraham, & Konar, 2008; Hartmann, 2005) and differential evolution algorithm (Abraham, Das, & Konar, 2006; Das, Abraham, & Konar, 2008) have been proposed and evaluated for document clustering. In paper, Azzag et al. (2007) is presented a new model for data clustering, which is inspired from the self-assembly behavior of real ants.

    View all citing articles on Scopus
    View full text