Elsevier

Knowledge-Based Systems

Volume 36, December 2012, Pages 115-128
A clustering technique for news articles using WordNet

https://doi.org/10.1016/j.knosys.2012.06.015

Abstract

The Web is overcrowded with news articles, an information source that is overwhelming in both volume and diversity. Document clustering is a powerful technique that has been widely used to organize data into smaller, more manageable information kernels. Several approaches have been proposed; however, they suffer from problems such as synonymy, ambiguity, and the lack of descriptive labels for the generated clusters. In this work, we investigate the application of a wide spectrum of clustering algorithms and similarity measures to news articles that originate from the Web. We also propose an enhancement of the standard k-means algorithm that uses external knowledge from WordNet hypernyms in a twofold manner: enriching the “bag of words” used prior to the clustering process, and assisting the label generation procedure that follows it. Furthermore, we examine the effect that text preprocessing has on clustering. Operating on a corpus of news articles derived from major news portals, our comparison of existing clustering methodologies revealed that k-means, despite its simple nature, gives better aggregate results in terms of efficiency. This advantage is amplified when the algorithm is accompanied by preliminary steps for data cleaning and normalization. Moreover, the proposed WordNet-enabled W-k means clustering algorithm significantly improves on standard k-means, and the presented cluster labeling process also generates useful, high-quality cluster tags.

Introduction

News articles flood the Web every day from a vast number of major and minor news portals around the globe. It is practically impossible for a single individual to keep track of an event, or a series of related events, from an unbiased and truly informative point of view. As the number of online information sources rapidly increases, so does the available online news content. One of the most common approaches for organizing this immense amount of data is the use of clustering techniques. Object clustering refers to the process of partitioning a collection of objects into several sub-collections based on the similarity of their contents. In the case of user clustering, each sub-collection is called a user cluster and includes users who have shown similar interests in their selections of text articles while browsing through a document collection. Clustering has proven to be a useful technique for information retrieval, as it discovers interesting information kernels and distributions in the underlying data. In general, it helps construct meaningful partitions of large sets of objects based on various methodologies and heuristics, and it plays a crucial role in organizing large collections. For example, it can be used (a) to structure query results, (b) to form the basis for further processing of the organized topical groups with other information retrieval techniques such as summarization, or (c) within recommendation systems, where it affects the quality of the suggestions made to end users. Clustering has also been exploited within the scope of machine learning [2], and as a time series mining task [17] which uses frequent itemsets to find association rules of items in large transactional databases.

Clustering of news articles can help by exposing the underlying content hierarchy of a huge number of articles, bringing it within the reach of a single individual. Consequently, it gives information retrieval (IR) systems the potential to relieve users of effort while browsing and to help them quickly locate the information they need.

However, there are several challenges that clustering techniques normally have to overcome. Among them is efficiency: generated clusters have to be well connected from a notional point of view, despite the diversity in content and size that the original documents might have. For example, it is common for news articles to belong to the same notional cluster even though they do not share common words. The converse is also possible: news articles may share common words while being completely unrelated to each other. Ambiguity and synonymy are thus two of the major problems that document clustering techniques regularly fail to tackle. Furthermore, having IR systems simply generate clusters of documents is not enough per se, because it is virtually impossible for humans to conceptualize information by merely browsing through hundreds of documents belonging to the same cluster. Assigning meaningful labels to the generated clusters, however, can help users conveniently recognize the content of each generated set and thus easily analyze the results.
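The two failure modes described above can be illustrated with a minimal sketch. It assumes a plain bag-of-words representation compared with cosine similarity; the four toy documents and the helper name `cosine` are invented for illustration and are not taken from the authors' corpus or system.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy documents (hypothetical examples).
doc_car = Counter("car maker recalls vehicles".split())
doc_auto = Counter("automobile manufacturer withdraws sedans".split())
doc_bank_river = Counter("flood damages river bank".split())
doc_bank_money = Counter("central bank raises rates".split())

# Synonymy: same topic, but no shared surface words, so the score is zero.
synonymy_score = cosine(doc_car, doc_auto)

# Ambiguity: unrelated topics, but the shared word "bank" yields a positive score.
ambiguity_score = cosine(doc_bank_river, doc_bank_money)
```

In this sketch the related pair scores lower than the unrelated pair, which is the kind of behavior that external lexical knowledge is meant to correct.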

In this manuscript, we describe a variety of document clustering techniques and evaluate their application on our data set: news articles originating from the Web. Our aim is to compare the resulting clusters and determine which technique is best suited to the very large volume and diversity of news articles that an indexing system needs to address. Furthermore, we present a novel methodological approach to document clustering, and in particular to the clustering of news articles derived from the Web, that combines regular k-means with external information extracted from the WordNet database. Our approach combines keyword extraction and several information retrieval techniques. We also incorporate the proposed algorithm into our existing system [5], evaluating the clustering results against regular k-means using a large pool of Web news articles stored in the system’s database.

The rest of the manuscript is organized as follows: Section 2 gives the background of related work on clustering methodologies, as well as on the use of the WordNet database in this field. In Section 3, we give a brief overview of our system, which we are enhancing with clustering techniques. In Section 4 we describe the various clustering methodologies explored in this work, while in Section 5 we present the algorithmic approach of W-k means. In Section 6 we outline our experimental approach to the clustering methodologies used and present our evaluation results. Section 7 concludes this manuscript with some remarks about the future work that is currently underway.

Related work

Clustering data in general has been heavily researched by the scientific community over the last 20 years. For document clustering in particular, a wide variety of techniques has been proposed. A major goal of document clustering is to improve the results of information retrieval systems in terms of precision/recall. This in turn leads to serving better filtered and adequate results to their users, which in essence helps the decision-making process.

Information flow

Our system, PeRSSonal [5], features a staged and modular approach for performing the various tasks concerning news articles that originate from the Web. The scope of the PeRSSonal system is the construction of a new-generation Web service that unifies many Information Retrieval tasks under a common framework. It delivers quality information targeted at end users who do not want, or do not have the time, to engage in the tedious task of filtering information. PeRSSonal consists of several

Clustering news articles

The overall clustering process as evaluated in this paper is depicted in Fig. 2.

The term-frequency vectors (‘bag of words’) generated for each article, as described in the previous section, form a weighted scheme of stemmed nouns from the original text and are given as input to the clustering subsystem. At this level, we used a twofold implementation/evaluation. Firstly, by applying a variety of clustering algorithms and distance metrics, we try to determine whether preprocessing has an
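As a rough illustration of feeding term-frequency vectors to a clustering step, the sketch below runs a plain Lloyd-style k-means with Euclidean distance on four toy articles. Several things here are assumptions made for brevity and are not taken from the source: raw term counts stand in for the weighted scheme of stemmed nouns, the first k vectors seed the centroids, and the names `tf_vector`, `squared_euclidean` and `kmeans` are hypothetical.

```python
from collections import Counter


def tf_vector(tokens, vocabulary):
    """Map a token list to a dense term-frequency vector over `vocabulary`."""
    counts = Counter(tokens)
    return [float(counts[t]) for t in vocabulary]


def squared_euclidean(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(vectors, k, iterations=20):
    """Lloyd-style k-means; the first k vectors seed the centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_euclidean(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy, already-tokenized articles (hypothetical data).
articles = [
    ["election", "vote", "parliament"],
    ["match", "goal", "team"],
    ["vote", "parliament", "minister"],
    ["team", "goal", "league"],
]
vocabulary = sorted({t for a in articles for t in a})
vectors = [tf_vector(a, vocabulary) for a in articles]
labels = kmeans(vectors, k=2)
```

With this seeding, the two politics articles and the two sports articles end up in separate clusters. Swapping `squared_euclidean` for a cosine-based distance would be one way to compare distance metrics on the same input.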

Algorithm approach for W-k means

In this section we present our algorithmic approach for exploiting the WordNet database within the scope of k-means. The WordNet lexical reference system organizes different linguistic relations into hierarchies. Most importantly, given any noun, verb, adjective or adverb, WordNet can provide results regarding hypernyms, hyponyms, meronyms or holonyms. Using these graph-like structures, we can search the WordNet database for all the hypernyms of a given set of words, then weigh them
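The idea of enriching a bag of words with hypernyms can be sketched as follows. This is only an illustration under stated assumptions: the dictionary `TOY_HYPERNYMS` is a hand-written stand-in for a real WordNet lookup, and the `hypernym_weight` scaling is a placeholder chosen for the example, not the weighting scheme used by W-k means, which is not reproduced here.

```python
from collections import Counter

# Hypothetical hypernym chains standing in for a WordNet query.
TOY_HYPERNYMS = {
    "car": ["motor_vehicle", "vehicle"],
    "truck": ["motor_vehicle", "vehicle"],
    "bicycle": ["wheeled_vehicle", "vehicle"],
}


def enrich_with_hypernyms(bag: Counter, hypernym_weight: float = 0.5) -> Counter:
    """Return a copy of `bag` with the hypernyms of its terms added.

    Each hypernym receives the source term's frequency scaled by
    `hypernym_weight` (an assumed weighting, for illustration only).
    """
    enriched = Counter(bag)
    for term, freq in bag.items():
        for hypernym in TOY_HYPERNYMS.get(term, []):
            enriched[hypernym] += hypernym_weight * freq
    return enriched


# Two toy articles that share no original terms.
doc_a = enrich_with_hypernyms(Counter({"car": 2, "road": 1}))
doc_b = enrich_with_hypernyms(Counter({"truck": 1, "driver": 1}))

# After enrichment the documents overlap on their common hypernyms.
shared_terms = sorted(doc_a.keys() & doc_b.keys())
```

Documents that originally had no words in common now share `motor_vehicle` and `vehicle`, so a distance measure computed on the enriched vectors can place them closer together. The same shared hypernyms are natural candidates when generating a cluster label.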

Experimental procedure

In this section we present our experimental procedure and its results. Our analysis consists of: (a) evaluating known clustering methodologies and distance measures when applied within the domain of news articles, (b) evaluating our WordNet-enabled k-means clustering and cluster labeling algorithm, and (c) comparing the proposed W-k means clustering results to those generated by two state-of-the-art generic clustering toolboxes: Cluto [14] and SenseClusters [16].

Conclusion

Within the scope of our indexing system, we have presented our evaluation results comparing some of the best clustering options currently available, applying them to the domain of news articles that originate from the Web. From the plethora of similarity measures that have been used, the application of Euclidean and cosine k-means produced the best results based not only on the internal CI function, but also on experimentation with real users. More specifically, we have found that hierarchical

Future work

For the future, we will be evaluating W-k means with regard to time efficiency using more clustering algorithms and larger document sets. We are also planning to determine how well our approach scales with increasing numbers of articles, as is the case with online indexing services. Moreover, we will investigate using the clustering kernel for clustering system users based on their dynamic profiles, and we will proceed with evaluating more extensively the clustering module with

Acknowledgements

This research has been co-financed by the European Union (European Social Fund – ESF) and Greek national funds through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF) – Research Funding Program: Heracleitus II. Investing in knowledge society through the European Social Fund.

References (26)

  • A.A. Abdelmalek et al., Evaluation of text clustering methods using WordNet, The International Arab Journal of Information Technology (2010)
  • A.Y. Al-Omary et al., A new approach of clustering based machine-learning algorithm, Knowledge-Based Systems (2006)
  • D. Arthur, S. Vassilvitskii, On the Worst Case Complexity of the k-means Method, Technical Report, Stanford,...
  • D. Arthur, S. Vassilvitskii, k-Means++: the advantages of careful seeding, in: Proceedings of the Eighteenth Annual...
  • C. Bouras et al., PeRSSonal’s core functionality evaluation: enhancing text labeling through personalized summaries, Data and Knowledge Engineering Journal (2008)
  • C. Bouras et al., Improving text summarization using noun retrieval techniques, Lecture Notes in Computer Science, Knowledge-Based Intelligent Information and Engineering Systems (2008)
  • P.S. Bradley, U. Fayyad, Refining initial points for k-means clustering, in: Proceedings of the 15th International...
  • D. Carmel, H. Roitman, N. Zwerdling, Enhancing cluster labeling using wikipedia, in: Proceedings of the 32nd...
  • C.L. Chen et al., An integration of fuzzy association rules and WordNet for document clustering, Lecture Notes in Computer Science in Advances in Knowledge Discovery and Data Mining (2009)
  • W.H.E. Day et al., Efficient algorithms for agglomerative hierarchical clustering methods, Journal of Classification (1984)
  • A. El-Hamdouchi et al., Comparison of hierarchic agglomerative clustering methods for document retrieval, The Computer Journal (1989)
  • T.F. Gharib et al., Fuzzy document clustering approach using WordNet lexical categories
  • M. Hoon et al., The C Clustering Library (2003)

    This manuscript is an extended version of the KES2010 conference paper named: “W-k means: Clustering News Articles using WordNet”.

    1 Tel.: +30 2610 996954.
