A clustering technique for news articles using WordNet

doi:10.1016/j.knosys.2012.06.015

Knowledge-Based Systems

Volume 36, December 2012, Pages 115-128

https://doi.org/10.1016/j.knosys.2012.06.015

Abstract

The Web is overcrowded with news articles, an information source that is overwhelming in both volume and diversity. Document clustering is a powerful technique that has been widely used for organizing data into smaller, manageable information kernels. Several approaches have been proposed; however, they suffer from problems such as synonymy, ambiguity and the lack of descriptive labels for the generated clusters. In this work, we investigate the application of a wide spectrum of clustering algorithms, as well as similarity measures, to news articles that originate from the Web. We also propose enhancing the standard k-means algorithm with external knowledge from WordNet hypernyms in a twofold manner: enriching the “bag of words” used prior to the clustering process and assisting the label generation procedure that follows it. Furthermore, we examine the effect that text preprocessing has on clustering. Operating on a corpus of news articles derived from major news portals, our comparison of the existing clustering methodologies revealed that k-means gives better aggregate results when it comes to efficiency. Despite the algorithm's simple nature, this advantage is amplified when it is accompanied by preliminary steps for data cleaning and normalization. Moreover, the proposed WordNet-enabled W-k means clustering algorithm significantly improves on standard k-means, and the presented cluster labeling process also generates useful, high-quality cluster tags.

Introduction

News articles flood the Web every day from a very large number of major and minor news portals around the globe. It is practically impossible for a single individual to keep track of an event, or a series of related events, from an unbiased and truly informative point of view. As the number of online information sources rapidly increases, so does the amount of available online news content. One of the most common approaches for organizing this immense amount of data is the use of clustering techniques. Object clustering refers to the process of partitioning a collection of objects into several sub-collections based on the similarity of their contents. In the case of user clustering, each sub-collection is called a user cluster and includes users who have revealed similar preferences in their selections of text articles while browsing through a document collection. Clustering has proven to be a useful technique for information retrieval, discovering interesting information kernels and distributions in the underlying data. In general, it helps construct meaningful partitions of large sets of objects based on various methodologies and heuristics. It plays a crucial role in organizing large collections. For example, it can be used (a) to structure query results, (b) to form the basis for further processing of the organized topical groups using other information retrieval techniques such as summarization, or (c) within recommendation systems, where it affects the quality of the suggestions made to end users. Clustering has also been exploited within the scope of machine learning [2] and as a time series mining task [17], which uses frequent itemsets to find association rules of items in large transactional databases.

Clustering of news articles can help by bringing the underlying content hierarchy of a huge number of articles within the reach of a single individual. Consequently, it can give information retrieval (IR) systems the potential to assist users in browsing and quickly locating the information they need.

However, there are several challenges that clustering techniques normally have to overcome. Among them is efficiency: generated clusters have to be well connected from a notional point of view, despite the diversity in content and size that the original documents might have. For example, news articles frequently belong to the same notional cluster even though they do not share common words. The reverse is also possible: news articles may share common words while being completely unrelated to each other. Ambiguity and synonymy are thus two of the major problems that document clustering techniques regularly fail to tackle. Furthermore, having IR systems simply generate clusters of documents is not enough per se. The reason is that it is virtually impossible for humans to conceptualize information by merely browsing through hundreds of documents belonging to the same cluster. Assigning meaningful labels to the generated clusters, however, can help users conveniently recognize the content of each generated set and thus easily analyze the results.
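The synonymy and ambiguity problems described above can be made concrete with a small sketch. It assumes a plain word-count representation and cosine similarity; the four headlines and the helper name `cosine_similarity` are invented for illustration and are not taken from the paper's corpus or system.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# Hypothetical headlines, invented for illustration only.
same_topic_a = "physician strike halts hospital care"
same_topic_b = "doctors walkout disrupts clinics"
unrelated_a = "apple unveils new phone"
unrelated_b = "apple harvest hits new record"

# Same topic, but no shared words: similarity is exactly zero.
print(cosine_similarity(same_topic_a, same_topic_b))
# Different topics, but shared words ("apple", "new"): similarity is positive.
print(cosine_similarity(unrelated_a, unrelated_b))
```

A purely lexical measure therefore ranks the unrelated pair as more similar than the related one, which is the failure that external knowledge about word relations is meant to reduce.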

In this manuscript, we describe a variety of document clustering techniques and evaluate their application to our data set: news articles originating from the Web. Our aim is to compare the resulting clusters and determine which technique is best suited to the extreme volume and diversity of news articles that an indexing system needs to address. Furthermore, we present a novel methodological approach to document clustering, and in particular to the clustering of news articles derived from the Web, that combines regular k-means with external information extracted from the WordNet database. Our approach combines keyword extraction and several information retrieval techniques. We also incorporate the proposed algorithm into our existing system [5] and evaluate its clustering results against those of regular k-means, using a large pool of Web news articles stored in the system’s database.

The rest of the manuscript is organized as follows: Section 2 gives a background of the related work regarding clustering methodologies as well as the use of the WordNet database on this field. In Section 3, we give a brief overview of our system which we are enhancing with clustering techniques. In Section 4 we describe the various clustering methodologies explored in this work, while in Section 5 we present the algorithmic approach of W-k means. In Section 6 we outline our experimental approach towards the clustering methodologies used and present our evaluation results. Section 7 concludes this manuscript with some remarks about the future work that is currently underway.

Related work

Clustering data in general has been heavily researched by the scientific community over the last 20 years. Especially for document clustering, a huge variety of techniques has been proposed. A major goal of document clustering is to improve the results of information retrieval systems in terms of precision/recall. This in turn leads to serving better filtered and adequate results to their users, helping in essence the decision making process.

Information flow

Our system, PeRSSonal [5], features a staged and modular approach for performing the various tasks concerning news articles that originate from the Web. The scope of the PeRSSonal system is the construction of a new-generation Web service that unifies many information retrieval tasks under a common framework. It delivers quality information targeted at end users who do not want, or do not have the time, to engage in the tedious task of filtering information. PeRSSonal consists of several

Clustering news articles

The overall clustering process as evaluated in this paper is depicted in Fig. 2.

The generated term-frequency vectors (‘bag of words’) for each article described in the previous section, which form a weighted scheme of the stemmed nouns existing in the original text, are given as input to the clustering subsystem. At this level, we used a twofold implementation/evaluation. Firstly, by applying a variety of clustering algorithms and distance metrics, we try to determine whether preprocessing has an
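As a rough illustration of how term-frequency vectors can feed a clustering step, the following is a minimal k-means sketch with a pluggable distance function. It is not the paper's implementation: the raw counts stand in for the weighted scheme (whose details are not given here), the token lists are hypothetical stemmed nouns, the initial centroids are fixed by hand to keep the run deterministic, and all function names are assumptions.

```python
from collections import Counter
from math import sqrt


def tf_vector(tokens, vocabulary):
    """Raw term counts over a fixed vocabulary (a stand-in for the weighted scheme)."""
    counts = Counter(tokens)
    return [float(counts[t]) for t in vocabulary]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def kmeans(vectors, initial_centroids, distance, iterations=10):
    """Plain k-means: alternate assignment and centroid update steps."""
    centroids = [list(c) for c in initial_centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: distance(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical articles, already reduced to stemmed nouns.
docs = [
    ["elect", "vote", "party"],
    ["vote", "party", "campaign"],
    ["match", "goal", "team"],
    ["team", "goal", "league"],
]
vocabulary = sorted({t for d in docs for t in d})
vectors = [tf_vector(d, vocabulary) for d in docs]

labels = kmeans(vectors, [vectors[0], vectors[2]], cosine_distance)
print(labels)  # [0, 0, 1, 1]
```

Passing the distance as a parameter mirrors the idea of trying several distance metrics with the same algorithm; swapping in a Euclidean function requires no other change.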

Algorithm approach for W-k means

In this section we present our algorithmic approach for exploiting the WordNet database within the scope of k-means. The WordNet lexical reference system organizes different linguistic relations into hierarchies. Most importantly, given any noun, verb, adjective or adverb, WordNet can provide results regarding hypernyms, hyponyms, meronyms or holonyms. Using these graph-like structures, we can search the WordNet database for all the hypernyms of a given set of words, then weigh them
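The idea of enriching a bag of words with hypernyms can be sketched as follows. This is an illustration under stated assumptions, not the W-k means procedure itself: `TOY_HYPERNYMS` is a small hand-written table standing in for real WordNet lookups, and the `weight` factor of 0.5 is an arbitrary choice, since the actual weighting scheme is not given in this excerpt.

```python
from collections import Counter

# Hypothetical hypernym chains, hand-written for illustration.
# A real system would obtain these from the WordNet database.
TOY_HYPERNYMS = {
    "doctor": ["medical_practitioner", "professional"],
    "physician": ["medical_practitioner", "professional"],
    "nurse": ["health_professional", "professional"],
    "hospital": ["medical_building", "building"],
}


def enrich_with_hypernyms(bag, hypernyms, weight=0.5):
    """Return a copy of `bag` with the hypernyms of each word added.

    Each hypernym receives `weight` times the frequency of the word that
    produced it. This weighting is an assumption made for the sketch.
    """
    enriched = Counter()
    for word, freq in bag.items():
        enriched[word] += freq
        for hypernym in hypernyms.get(word, []):
            enriched[hypernym] += weight * freq
    return enriched


article_a = Counter({"doctor": 2, "hospital": 1})
article_b = Counter({"physician": 1})

enriched_a = enrich_with_hypernyms(article_a, TOY_HYPERNYMS)
enriched_b = enrich_with_hypernyms(article_b, TOY_HYPERNYMS)

# Before enrichment the two bags share no terms; afterwards they
# overlap on the hypernyms that their words have in common.
print(sorted(set(article_a) & set(article_b)))    # []
print(sorted(set(enriched_a) & set(enriched_b)))  # ['medical_practitioner', 'professional']
```

Because the shared hypernyms now appear in both vectors, a distance measure computed on the enriched bags can register the two articles as related even though their original words differ.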

Experimental procedure

In this section we present our experimental procedure and its results. Our analysis consists of: (a) evaluating known clustering methodologies and distance measures when applied within the domain of news articles, (b) evaluating our WordNet-enabled k-means clustering and cluster labeling algorithm, and (c) comparing the proposed W-k means clustering results to those generated by two state-of-the-art generic clustering toolboxes: Cluto [14] and SenseClusters [16].

Conclusion

Within the scope of our indexing system, we have presented our evaluation results comparing some of the best clustering options currently available, applying them to the domain of news articles that originate from the Web. Among the many similarity measures that were used, k-means with the Euclidean and cosine measures produced the best results, based not only on the internal CI function but also on experimentation with real users. More specifically, we have found that hierarchical
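As a side note on the two measures named above, Euclidean distance and cosine similarity are closely related whenever the vectors are scaled to unit length: the squared Euclidean distance equals twice one minus the cosine similarity. The sketch below checks this identity numerically. Whether the vectors in the evaluated system were length-normalized is not stated in this excerpt, so this is a general property rather than a description of the experiments; the example vectors and function names are arbitrary.

```python
from math import isclose, sqrt


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    length = sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


# Arbitrary example vectors, scaled to unit length.
u = normalize([3.0, 0.0, 1.0])
v = normalize([1.0, 2.0, 0.0])

squared_distance = euclidean(u, v) ** 2
from_cosine = 2.0 * (1.0 - cosine_similarity(u, v))

# For unit-length vectors: ||u - v||^2 = 2 * (1 - cos(u, v)).
print(isclose(squared_distance, from_cosine))  # True
```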

Future work

In the future, we will evaluate W-k means with regard to time efficiency, using more clustering algorithms and larger document sets. We also plan to determine how well our approach scales with increasing numbers of articles, as is the case with online indexing services. Moreover, we will investigate using the clustering kernel for clustering system users based on their dynamic profiles, and we will proceed with evaluating more extensively the clustering module with

Acknowledgements

This research has been co-financed by the European Union (European Social Fund – ESF) and Greek national funds through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF) – Research Funding Program: Heracleitus II. Investing in knowledge society through the European Social Fund.

References (26)

A.A. Abdelmalek et al., Evaluation of text clustering methods using WordNet, The International Arab Journal of Information Technology (2010)
A.Y. Al-Omary et al., A new approach of clustering based machine-learning algorithm, Knowledge-Based Systems (2006)
D. Arthur, S. Vassilvitskii, On the Worst Case Complexity of the k-means Method, Technical Report, Stanford,...
D. Arthur, S. Vassilvitskii, k-Means++: the advantages of careful seeding, in: Proceedings of the Eighteenth Annual...
C. Bouras et al., PeRSSonal’s core functionality evaluation: enhancing text labeling through personalized summaries, Data and Knowledge Engineering Journal (2008)
C. Bouras et al., Improving text summarization using noun retrieval techniques, Lecture Notes in Computer Science, Knowledge-Based Intelligent Information and Engineering Systems (2008)
P.S. Bradley, U. Fayyad, Refining initial points for k-means clustering, in: Proceedings of the 15th International...
D. Carmel, H. Roitman, N. Zwerdling, Enhancing cluster labeling using wikipedia, in: Proceedings of the 32nd...
C.L. Chen et al., An integration of fuzzy association rules and WordNet for document clustering, Lecture Notes in Computer Science in Advances in Knowledge Discovery and Data Mining (2009)
W.H.E. Day et al., Efficient algorithms for agglomerative hierarchical clustering methods, Journal of Classification (1984)
A. El-Hamdouchi et al., Comparison of hierarchic agglomerative clustering methods for document retrieval, The Computer Journal (1989)
T.F. Gharib et al., Fuzzy document clustering approach using WordNet lexical categories
M. Hoon et al., The C Clustering Library (2003)

^☆: This manuscript is an extended version of the KES2010 conference paper named: “W-k means: Clustering News Articles using WordNet”.

¹: Tel.: +30 2610 996954.
