
01-10-2012

Using maximal spanning trees and word similarity to generate hierarchical clusters of non-redundant RSS news articles

Authors: Maria Soledad Pera, Yiu-Kai Dennis Ng

Published in: Journal of Intelligent Information Systems | Issue 2/2012


Abstract

RSS news articles that are either partially or completely duplicated in content are easily found on the Internet these days, requiring Web users to sort through the articles to identify non-redundant information. This manual filtering process is time-consuming and tedious. In this paper, we present a new filtering and clustering approach, called FICUS, which first identifies and eliminates redundant RSS news articles using a fuzzy set information retrieval approach and then clusters the remaining non-redundant RSS news articles according to their degrees of resemblance. FICUS uses a tree hierarchy to organize clusters of RSS news articles. The contents of the respective clusters are captured by representative keywords from the RSS news articles in the clusters so that searching and retrieval of similar RSS news articles is fast and efficient. FICUS is simple, since it uses pre-defined word-correlation factors to determine related (words in) RSS news articles and filter redundant ones, and it is supported by well-known yet simple mathematical models, such as the standard deviation, the vector space model, and probability theory, to generate clusters of non-redundant RSS news articles. Experiments performed on (test sets of) RSS news articles on various topics, which were downloaded from different online sources, verify the accuracy of FICUS in eliminating redundant RSS news articles, clustering similar RSS news articles together, and segregating different RSS news articles in terms of their contents. In addition, further empirical studies show that FICUS outperforms well-known approaches adopted for clustering RSS news articles.
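
To make the two-stage design concrete, below is a minimal, illustrative sketch of a filter-then-cluster pipeline in the spirit of FICUS. It is not the authors' implementation: the function names, the cosine similarity over bag-of-words vectors, and the two thresholds are assumptions standing in for the paper's fuzzy set redundancy test, word-correlation factors, and maximal-spanning-tree clustering.

    from collections import Counter
    from math import sqrt

    def tokenize(text):
        # Naive tokenizer; FICUS works on non-stop, stemmed words (see footnote 1).
        return [w for w in text.lower().split() if w.isalpha()]

    def cosine(a, b):
        # Cosine similarity between two bag-of-words Counters.
        dot = sum(a[w] * b[w] for w in a)
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def filter_redundant(articles, dup_threshold=0.9):
        # Stage 1: drop any article that is too similar to one already kept
        # (a stand-in for the fuzzy set redundancy test).
        kept, vectors = [], []
        for text in articles:
            vec = Counter(tokenize(text))
            if all(cosine(vec, v) < dup_threshold for v in vectors):
                kept.append(text)
                vectors.append(vec)
        return kept, vectors

    def cluster(vectors, link_threshold=0.3):
        # Stage 2: greedy single-link grouping of the non-redundant articles;
        # FICUS instead builds a maximal spanning tree over article similarities
        # and partitions it into a hierarchy of keyword-labeled clusters.
        clusters = []
        for i, vec in enumerate(vectors):
            for c in clusters:
                if any(cosine(vec, vectors[j]) >= link_threshold for j in c):
                    c.append(i)
                    break
            else:
                clusters.append([i])
        return clusters

Feeding in, say, two near-duplicate stories and one unrelated story keeps two articles and places them in separate clusters.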


Footnotes
1
Stopwords are commonly occurring words such as articles, prepositions, and conjunctions, which are poor discriminators in representing the content of a sentence (or RSS news article), whereas stemmed words are words reduced to their grammatical root. From now on, unless stated otherwise, whenever we refer to words, we mean non-stop, stemmed words.
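
As an illustration only (the paper does not specify its stopword list or stemming algorithm), such preprocessing might look as follows; the stopword set and the crude suffix-stripping rules are hypothetical stand-ins for a full stemmer such as Porter's.

    # Illustrative stopword subset; a real list contains a few hundred words.
    STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are"}

    def stem(word):
        # Crude suffix stripping, standing in for a real stemmer (e.g., Porter).
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def preprocess(sentence):
        # Lowercase, drop stopwords, and reduce the remaining words to stems.
        return [stem(w) for w in sentence.lower().split() if w not in STOPWORDS]

    # preprocess("The senators debated the spending bills")
    # -> ['senator', 'debat', 'spend', 'bill']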
 
2
From now on, whenever we use the term similar sentences (RSS news articles), we mean sentences (RSS news articles) that are semantically the same but different in terms of words used in the sentences (RSS news articles).
 
3
The title of an RSS news article is also treated as a sentence when determining the degree of resemblance among RSS news articles.
 
4
Standard deviation is a statistical measure that determines how spread out (i.e., the variability of) the values of a data set are.
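
For reference, the (population) standard deviation of values x_1, ..., x_n is

    \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(x_i - \bar{x}\bigr)^2},
    \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

(the sample version divides by n - 1 instead of n). For example, for the values 2, 4, 6 the mean is 4, the squared deviations are 4, 0, 4, and the standard deviation is sqrt(8/3) ≈ 1.63.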
 
5
If there is more than one keyword in a cluster that has the highest weight among all the keywords in the cluster, we treat them all as “representative” keywords.
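
A minimal sketch of this tie-handling rule, assuming cluster keyword weights are kept in a dictionary (the data structure and function name are illustrative, not the paper's):

    def representative_keywords(weights):
        # Return every keyword whose weight equals the maximum weight in the
        # cluster, so that ties are all treated as "representative" keywords.
        top = max(weights.values())
        return [kw for kw, w in weights.items() if w == top]

    # representative_keywords({"election": 0.8, "senate": 0.8, "vote": 0.5})
    # -> ['election', 'senate']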
 
6
To speed up the recursive partitioning process for finding the appropriate labels without sacrificing the quality, we remove stopwords, perform stemming on the words in the RSS news articles, and use the stems as labels.
 
7
 
8
Only single-topic RSS news articles were used in this empirical study, which follows the evaluation premise as detailed in Fung et al. (2003), Hu et al. (2008), and Xu and Gong (2004), in which multiple-topic RSS news articles are not considered.
 
9
A natural class is an element of a set of classes K for which the labels of the documents within each class, i.e., the terms or phrases (such as “sports” or “entertainment”) used to describe the contents of the documents in that class, are known in advance.
 
10
Delicious (delicious.com) is a social bookmarking web service that was developed to aid users in storing, sharing, and discovering web bookmarks. Delicious uses a non-hierarchical classification system which allows its users to tag their bookmarks with freely chosen index terms.
 
Literature
Croft, B., Metzler, D., & Strohman, T. (2010). Search engines: information retrieval in practice. Addison Wesley.
Dhillon, I., & Modha, D. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175.
Dhillon, I., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. In Proceedings of ACM international conference on knowledge discovery and data mining (SIGKDD) (pp. 191–200).
Fung, B., Wang, K., & Ester, M. (2003). Hierarchical document clustering using frequent itemsets. In Proceedings of SIAM international conference on data mining (ICDM) (pp. 59–70).
Hammouda, K., & Kamel, M. (2002). Phrase-based document similarity based on an index graph model. In Proceedings of IEEE international conference on data mining (ICDM) (pp. 203–210).
Hu, J., Fang, L., Cao, Y., Zeng, H., Li, H., Yang, Q., et al. (2008). Enhancing text clustering by leveraging Wikipedia semantics. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 179–186).
Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Koberstein, J., & Ng, Y.-K. (2006). Using word clusters to detect similar RSS news articles. In Proceedings of the international conference on knowledge science, engineering and management (KSEM) (pp. 215–228). LNAI 4092.
Larsen, B., & Aone, C. (1999). Fast and effective text mining using linear-time document clustering. In Proceedings of ACM international conference on knowledge discovery and data mining (SIGKDD) (pp. 16–22).
Li, X., Zaiane, O., & Li, Z. (2006). A comparative study on text clustering methods. In Proceedings of advanced data mining and applications (pp. 644–651).
Lim, S., & Ng, Y.-K. (2005). Categorization and information extraction of multilingual HTML documents. In Proceedings of the 9th international database engineering and application symposium (IDEAS) (pp. 415–422).
Liu, X., Gong, Y., Xu, W., & Zhu, S. (2002). Document clustering with cluster refinement and model selection capabilities. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 191–198).
Mitchell, T. (1997). Machine learning. McGraw Hill.
Ogawa, Y., Morita, T., & Kobayashi, K. (1991). A fuzzy document retrieval system using the keyword connection matrix and a learning method. Fuzzy Sets and Systems, 39, 163–179.
Pera, M., & Ng, Y.-K. (2009). Synthesizing correlated RSS news articles based on a fuzzy equivalence relation. International Journal of Web Information Systems (IJWIS), 5(1), 77–109.
Shafer, G. (1976). A mathematical theory of evidence. Princeton University Press.
Slonim, N., Friedman, N., & Tishby, N. (2002). Unsupervised document classification using sequential information maximization. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 129–136).
Slonim, N., & Tishby, N. (2001). The power of word clusters for text classification. In Proceedings of the 23rd European colloquium on information retrieval research (ECIR) (pp. 191–200).
Xu, W., & Gong, Y. (2004). Document clustering by concept factorization. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 202–209).
Xu, W., Liu, X., & Gong, Y. (2003). News article clustering based on non-negative matrix factorization. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 267–273).
Zheng, X., He, P., Tian, M., & Yuan, F. (2003). Algorithm of documents clustering based on minimum spanning tree. In Proceedings of the 2nd international conference on machine learning and cybernetics (pp. 199–203).
Zhong, S., & Ghosh, J. (2005). A comparative study of generative models for document clustering. Knowledge and Information Systems, 8(3), 374–384.
Zwillinger, D., Krantz, S., & Rosen, K. (Eds.) (1996). Standard mathematical tables and formulae (30th edition). CRC Press.
Metadata
Title
Using maximal spanning trees and word similarity to generate hierarchical clusters of non-redundant RSS news articles
Authors
Maria Soledad Pera
Yiu-Kai Dennis Ng
Publication date
01-10-2012
Publisher
Springer US
Published in
Journal of Intelligent Information Systems / Issue 2/2012
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI
https://doi.org/10.1007/s10844-012-0201-z
