Web document clustering using a hybrid neural network

doi:10.1016/j.asoc.2004.02.003

Applied Soft Computing

Volume 4, Issue 4, September 2004, Pages 423-432

https://doi.org/10.1016/j.asoc.2004.02.003 Get rights and content

Abstract

The list of documents returned by Internet search engines in response to a query these days can be quite overwhelming. There is an increasing need for organising this information and presenting it in a more compact and efficient manner. This paper describes a method developed for the automatic clustering of World Wide Web documents, according to their relevance to the user’s information needs, by using a hybrid neural network. The objective is to reduce the time and effort the user has to spend to find the information sought after. Clustering documents by features representative of their contents—in this case, key words and phrases—increases the effectiveness and efficiency of the search process. It is shown that a two-dimensional visual presentation of information on retrieved documents, instead of the traditional linear listing, can create a more user-friendly interface between a search engine and the user.

Introduction

The amount of textual information available through the World Wide Web has grown exponentially in recent years. Consequently, there is a growing need for more precise techniques for presenting documents relevant to a user’s information needs. We propose a hybrid artificial neural network-based technique to cluster the documents retrieved for presentation to the user in a two-dimensional manner.

There is a great deal of literature on document retrieval. The main focus is on the relevance of the documents to the user’s information needs. The problem is that a Web search engine requires a precise and comprehensive query expression in order to perform the search and ranking of documents so that only the relevant documents are presented to the user. However, a Web user tends to present only short queries. The need to accept short queries while overcoming the limitations of their imprecision and incompleteness has given rise to research into ways of improving the information retrieval (IR) results based on short queries. A number of research work on automatic keyword extraction and term expansion has been reported (e.g. [3], [19], [37], [46]).

Instead of extracting single word terms, phrases can be extracted. The usefulness of phrases in information retrieval has been well recognized [1], [7], [10], [12], [20], [30], [33], [42], [43]. However, since Croft et al.’s work reported in [7] in 1991, not much progress appears to have been made so far in using phrases to improve the performance of information retrieval. Narita and Ogawa [33] conducted experiments to extract phrases from natural language queries for query expansion. These experimental results indicated some improvement in the effectiveness of information retrieval where the query texts were long. Croft and Das [6] used user-identified phrases in Fagan’s algorithm and a probabilistic model without significant improvement in information retrieval.

Research has been conducted on the analysis and mapping of large collections of text documents using self-organising maps such as the WEBSOM [17], [18], [27], where similar documents are placed close to each other. Users are able to retrieve documents by selecting corresponding positions on the map. WEBSOM performs at least comparably with more conventional retrieval methods [26].

Research has also been conducted on using user feedback to improve relevance of information retrieval (e.g. [14], [36], [39]). However, it appears that there are upper limits to further improvements using relevance feedback [44]. Ad hoc feedback or blind feedback [3], [35] is another commonly used strategy. It is based on the assumption that search engines return the most relevant documents at the top of the list. A small set of documents is first retrieved based on the initial query submitted by the user, and then terms are extracted from these documents for query expansion. Conjunctions and disjunctions are usually used to join the terms together, to form a single long query. Thus, it is important that the initially retrieved documents are relevant to the user’s query, as otherwise documents subsequently retrieved will more likely be irrelevant to the user’s information needs. This will cause an alteration in focus of the search topic—known as query drift [31]. Khor and Khan [24] approached query expansion differently in that, instead of adding terms to the initial query, relevant key phrases were extracted to form separate queries. These were then used separately to retrieve the top-ranked documents. Basing on the assumption that search engines return documents ranked in descending order of relevance to the query terms, this method tends to retrieve more relevant documents than the conventional query expansion technique. We have adopted the key phrase extraction and query expansion method described in [17] to retrieve the relevant documents.

With the large collection of retrieved documents, it is necessary to process them into a format suitable for presentation to the user. This involves sorting and ranking the documents according to their relevance to the user’s information needs. The conventional procedure is to rank the retrieved documents in descending order of relevance according to certain predetermined criteria.

The usual outcome of the ranking process applied by a search engine is a long list of document titles. The main drawback of such an approach is that the user is still required to browse through this long list to select those that are actually considered to be of interest. Another shortcoming is that the resultant list of documents from a search engine does not make distinctions between the different concepts that may be present in the query, as the list inevitably has to be ranked sequentially. The problem lies mainly in the presentation of the list of document titles. These documents are usually listed serially irrespective of the similarity or dissimilarity in their contents—that is, it does not make distinctions between the different concepts. Thus, two documents appearing next to each other in the list may not necessarily be of a similar nature and vice versa. As the list of documents grows longer, the amount of time and effort needed to browse through the list to look for relevant documents increases. In an attempt to circumvent this problem, Lin et al. [28] used the Kohonen Feature Map to construct a self-organising semantic map such that documents of similar contents are placed close to one another. Document clustering is also commonly used to automatically organize large collection of retrieval results, by grouping together documents that belong to the same topic in order to facilitate user’s browsing of retrieval results [16]. Most clustering algorithms use the vector space model of information retrieval [40], in which text documents are represented as a set of points (or term vector) in a high dimensional vector space. Each direction of the vector space corresponds to a unique term in the document collection, and the component of a document vector along a given direction corresponds to the importance of that term to the document. Similarity between two texts is traditionally measured by the cosine of the angle between their vectors, though Cartesian distance is also used. Documents judged to be similar by this measure are grouped together by the clustering algorithm. These algorithms are notoriously slow because of the so-called curse of dimensionality [2] resulting from the high dimensionality of the vector space. Dimensionality reduction techniques are further discussed in Section 2.2.

We propose a hybrid ANN-based method for reducing dimensionality, and providing the user with documents clustered according to the similarity of their contents. The nature of a document’s contents is represented using key words and phrases found in them. Each cluster is associated with the group of key words and phrases found in documents belonging to it. This enables the user to be selective in choosing particular clusters of documents to browse—thus making significant savings in browsing time and effort.

Section 2 presents the document representation scheme used in our investigation, and discusses the resulting problem of high dimensionality. In Section 3, we describe the methodology and tools used for clustering and ranking documents for visual presentation. Section 4 contains details of experiments carried out to evaluate the accuracy of the clustering process implemented. The paper concludes with Section 5, which highlights the main contributions of the work and points to further investigations to be done in future.

Section snippets

Document representation and dimensionality

The issue of a suitable representation scheme for documents needs to be resolved before any analysis can be carried out on them for clustering purposes. The dimensionality associated with the chosen representation scheme is of concern as it impacts on the efficiency of any document analysis and clustering technique. Both these issues are discussed below.

Document clustering using PCA–ART

Fig. 1 shows the stages involved in the clustering and ranking of documents retrieved by a search engine. The process starts with the collection of documents retrieved by a search engine in response to the user query. The document retrieval process may be carried out in two phases (not shown in Fig. 1)—first using the query expression initially submitted by the user (see examples in Section 4.1), and then using an expanded query formed using key phrases extracted from documents retrieved in the

Experimental results

This section describes the experiments carried out to test the clustering performance of the PCA–ART1 neural network, including its effectiveness in extracting document features, and in filtering off less relevant documents. As our principal aim was to evaluate the performance of Web document clustering, we chose to construct our own test document collection to be subjected to the clustering process. This was done by submitting different queries to the Google search engine, and collecting the

Conclusion

There appears to be no reported study on the use of hybrid PCA–ART neural network for textual document processing. The investigation reported in this paper is an attempt to utilize a hybrid neural net consisting of a PCA and an ART1 network designed specifically for more intelligent and user-friendly information retrieval from the World Wide Web.

The problem of high dimensionality in term vectors used for representing documents has been dealt with through principal component analysis carried out

References (47)

P.G. Anick, S. Vaithyanathan, Exploiting clustering and phrases for context-based information retrieval, in:...
R. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, New Jersey,...
C. Buckley, G. Salton, et al., Automatic query expansion using smart: Trec 3, in: Proceedings of the 3rd Text REtrieval...
G.A Carpenter et al.
A massively parallel architecture for a self-organizing neural pattern recognition machine
J. Comput. Vision Graphics Image Process.
(1987)
G.A Carpenter et al.
The ART of adaptive pattern recognition by a self-organizing neural network
J. IEEE Comput.
(1988)
W.B. Croft, R. Das, Experiments with query acquisition and use in document retrieval systems, in: Proceedings of the...
W.B. Croft, H.R. Turtle, et al., The use of phrases and structured queries in information retrieval, in: Proceedings of...
S Deerwester et al.
Indexing by latent semantic analysis
J. Am. Soc. Information Sci.
(1990)
K.I. Diamantaras, S.Y. Kung, Principal Component Neural Networks: Theory and Applications, Wiley, New York,...
J.L. Fagan, Automatic phrase indexing for document retrieval: an examination of syntactic and non-syntactic methods,...

J.L Fagan

The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval

J. Am. Soc. Information Sci.

(1989)

E. Frank, G.W. Paynter, et al., Domain-specific keyphrase extraction, in: Proceedings of the 16th International Joint...

D. Guo, D. Peuquet, et al., Opening the black box: interactive hierarchical clustering for multivariate spatial...

D. Harman, Relevance feedback revisited, in: Proceedings of the 15th Annual International SIGIR’92, Copenhagen,...

M.H. Hassoun, Fundamentals of Artificial Neural Networks, MIT Press, Cambridge,...

M.A. Hearst, J.O. Pedersen, Reexamining the cluster hypothesis: scatter/gather on retrieval results, J. ACM...

T. Honkela, S. Kaski, et al., Exploration of full-text databases with self-organizing maps, in: Proceedings of the...

T. Honkela, S. Kaski, et al., WEBSOM—self-organizing maps of document collections, in: Proceedings f the WSOM’97,...

Y. Jing, W.B. Croft, An association thesaurus for information retrieval, in: Proceedings of the RIAO, Rockefeller...

H. Joho, M. Sanderson, Retrieving descriptive phrases from large amounts of free text, in: Proceedings of the 9th...

I.T. Jolliffe, Principal Component Analysis, Springr-Verlag, New York,...

K.S. Jones, A statistical interpretation of term specificity and its application in retrieval. Document retrieval...

L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York,...

Cited by (17)

Meta-cognitive RBF Network and its Projection Based Learning algorithm for classification problems
2013, Applied Soft Computing Journal
Citation Excerpt :
In a classification problem, the objective is to learn the decision surface that accurately maps an input feature space to an output space of class labels. Several learning algorithms for different neural network architectures have been used in various problems in science, business, industry and medicine, including the handwritten character recognition [2], speech recognition [3], biomedical medical diagnosis [4], prediction of bankruptcy [5], text categorization [6] and information retrieval [7]. Among various architectures reported in the literature, Radial Basis Function (RBF) network gaining attention due to its localization property of Gaussian function, and widely used in classification problems.
‘Meta-cognitive Radial Basis Function Network’ (McRBFN) and its ‘Projection Based Learning’ (PBL) algorithm for classification problems in sequential framework is proposed in this paper and is referred to as PBL-McRBFN. McRBFN is inspired by human meta-cognitive learning principles. McRBFN has two components, namely the cognitive component and the meta-cognitive component. The cognitive component is a single hidden layer radial basis function network with evolving architecture. In the cognitive component, the PBL algorithm computes the optimal output weights with least computational effort by finding analytical minima of the nonlinear energy function. The meta-cognitive component controls the learning process in the cognitive component by choosing the best learning strategy for the current sample and adapts the learning strategies by implementing self-regulation. In addition, sample overlapping conditions are considered for proper initialization of new hidden neurons, thus minimizes the misclassification. The interaction of cognitive component and meta-cognitive component address the what-to-learn, when-to-learn and how-to-learn human learning principles efficiently. The performance of the PBL-McRBFN is evaluated using a set of benchmark classification problems from UCI machine learning repository and two practical problems, viz., the acoustic emission signal classification and the mammogram for cancer classification. The statistical performance evaluation on these problems has proven the superior performance of PBL-McRBFN classifier over results reported in the literature.
SOMSE: A semantic map based meta-search engine for the purpose of web information customization
2011, Applied Soft Computing Journal
To combat information overload, systems that are often referred to as information customization systems are needed. Such systems act on the user's behalf and can rely on existing information services like search engines that do the resource-intensive part of the work. These systems will be sufficiently lightweight to run on an average PC and serve as personal assistants. Since such an assistant has relatively modest resource requirements it can reside on an individual user's machine. If the assistant resides on the user's machine, there is no need to turn down intelligence. The system can have substantial local intelligence. In this paper, we propose an information customization system that combines meta-search and unsupervised learning. A meta-search engine simultaneously searches multiple search engines and returns a single list of results. The results retrieved by this engine can be highly relevant, since it is usually grabbing the first items from the relevancy-ranked list of hits returned by the individual search engines. The Kohonen Feature Map is then used to construct a self-organizing semantic map such that documents of similar contents are placed close to one another.
Performance evaluation of density-based clustering methods
2009, Information Sciences
With the development of the World Wide Web, document clustering is receiving more and more attention as an important and fundamental technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. A good document clustering approach can assist computers in organizing the document corpus automatically into a meaningful cluster hierarchy for efficient browsing and navigation, which is very valuable for complementing the deficiencies of traditional information retrieval technologies. In this paper, we study the performance of different density-based criterion functions, which can be classified as internal, external or hybrid, in the context of partitional clustering of document datasets. In our study, a weight was assigned to each document, which defined its relative position in the entire collection. To show the efficiency of the proposed approach, the weighted methods were compared to their unweighted variants. To verify the robustness of the proposed approach, experiments were conducted on datasets with a wide variety of numbers of clusters, documents and terms. To evaluate the criterion functions, we used the WebKb, Reuters-21578, 20Newsgroups-18828, WebACE and TREC-5 datasets, as they are currently the most widely used benchmarks in document clustering research. To evaluate the quality of a clustering solution, a wide spectrum of indices, three internal validity indices and seven external validity indices, were used. The internal validity indices were used for evaluating the within-cluster scatter and between cluster separations. The external validity indices were used for comparing the clustering solutions produced by the proposed criterion functions with the “ground truth” results. Experiments showed that our approach significantly improves clustering quality. In this paper, we developed a modified differential evolution (DE) algorithm to optimize the criterion functions. This modification accelerates the convergence of DE and, unlike the basic DE algorithm, guarantees that the received solution will be feasible.
Clustering of document collection - A weighting approach
2009, Expert Systems with Applications
Citation Excerpt :
Also, this work contains a convenient method to predict the optimal number of clusters. In addition to the algorithms mentioned above, several heuristics algorithms, such as genetic algorithm (Alguliev & Aliguliyev, 2005; Aliguliyev, 2006; Garai & Chaudhuri, 2004; Laszlo & Mukherjee, 2006, 2007), neural network (Khan & Khor, 2004), ant algorithm (Azzag, Venturini, Oliver, & Guinot, 2007; Hartmann, 2005), particle swarm intelligence (Azzag et al., 2007; Cui, Potok, & Palathingal, 2005; Das, Abraham, & Konar, 2008; Hartmann, 2005) and differential evolution algorithm (Abraham, Das, & Konar, 2006; Das, Abraham, & Konar, 2008) have been proposed and evaluated for document clustering. In paper, Azzag et al. (2007) is presented a new model for data clustering, which is inspired from the self-assembly behavior of real ants.
Clustering algorithms are used to assess the interaction among documents by organizing documents into clusters such that document within a cluster are more similar to each other than are documents belonging to different clusters. Document clustering has been traditionally investigated as a means of improving the performance of search engines by pre-clustering the entire corpus, and a post-retrieval document browsing technique as well. It has long been studied as a post-retrieval document visualization technique. The purpose of present paper to show that assignment weight to documents improves clustering solution.
Designing evolving user profile in e-CRM with dynamic clustering of Web documents
2008, Data and Knowledge Engineering
Internet technology enables companies to capture new customers, track their performances and online behavior, and customize communications, products, services, and prices. Analyses of customers and customer interactions for electronic customer relationship management (e-CRM) can be performed by way of using data mining (DM), optimization methods, or combined approaches. One key issue in the analysis of access patterns on the Web is the clustering and classification of Web documents. Generally, the classification has its base on analytical models which assume a pre-fixed set of keywords (attributes) with predefined list of categories. This assumption is not realistic for large and evolving collections of documents such as World Wide Web. We propose a new approach to solve the problem of unknown number of evolving categories. The approach begins with the classification of test documents into a set of initial categories. A working prototype system which is based on Fuzzy Clustering CRM (FC-CRM) has been developed and presented to validate the proposed approach and illustrate how it handles the dynamic inflow of new documents.
MASACAD: A multi-agent approach to information customization for the purpose of academic advising of students
2007, Applied Soft Computing Journal
The growth and advancement in the Internet and the World Wide Web has led to an explosion in the amount of available information. This staggering amount of information has made it extremely difficult for users to locate and retrieve information that is actually relevant to their task at hand. Dealing with this problem of “information overload” will need tools to customize the information space. In this paper we present MASACAD, a multi-agent system that learns to advise students by mining the Web and discuss important problems in relationship to information customization systems and smooth the way for possible solutions. The main idea is to approach information customization using a multi-agent paradigm in combination with a number of aspects from the domains of machine learning, user modeling, and Web mining.

View all citing articles on Scopus

View full text

Web document clustering using a hybrid neural network

Abstract

Introduction

Section snippets

Document representation and dimensionality

Document clustering using PCA–ART

Experimental results

Conclusion

A massively parallel architecture for a self-organizing neural pattern recognition machine

J. Comput. Vision Graphics Image Process.

The ART of adaptive pattern recognition by a self-organizing neural network

J. IEEE Comput.

Indexing by latent semantic analysis

J. Am. Soc. Information Sci.

The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval

J. Am. Soc. Information Sci.