
01-10-2012

Using maximal spanning trees and word similarity to generate hierarchical clusters of non-redundant RSS news articles

Authors: Maria Soledad Pera, Yiu-Kai Dennis Ng

Published in: Journal of Intelligent Information Systems | Issue 2/2012


Abstract

RSS news articles that are either partially or completely duplicated in content are easily found on the Internet these days, requiring Web users to sort through the articles to identify non-redundant information. This manual filtering process is time-consuming and tedious. In this paper, we present a new filtering and clustering approach, called FICUS, which first identifies and eliminates redundant RSS news articles using a fuzzy set information retrieval approach and then clusters the remaining non-redundant RSS news articles according to their degrees of resemblance. FICUS uses a tree hierarchy to organize clusters of RSS news articles. The contents of the respective clusters are captured by representative keywords from the RSS news articles in the clusters so that searching and retrieval of similar RSS news articles is fast and efficient. FICUS is simple, since it uses pre-defined word-correlation factors to determine related (words in) RSS news articles and filter redundant ones, and it is supported by well-known yet simple mathematical models, such as the standard deviation, the vector space model, and probability theory, to generate clusters of non-redundant RSS news articles. Experiments performed on (test sets of) RSS news articles on various topics, which were downloaded from different online sources, verify the accuracy of FICUS in eliminating redundant RSS news articles, clustering similar RSS news articles together, and segregating different RSS news articles in terms of their contents. In addition, further empirical studies show that FICUS outperforms well-known approaches adopted for clustering RSS news articles.
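
To make the two-stage design concrete, below is a minimal, illustrative sketch of a filter-then-cluster pipeline in the spirit of FICUS. It is not the authors' implementation: the function names, the cosine similarity over bag-of-words vectors, and the two thresholds are assumptions standing in for the paper's fuzzy set redundancy test, word-correlation factors, and maximal-spanning-tree clustering.

    from collections import Counter
    from math import sqrt

    def tokenize(text):
        # Naive tokenizer; FICUS works on non-stop, stemmed words (see footnote 1).
        return [w for w in text.lower().split() if w.isalpha()]

    def cosine(a, b):
        # Cosine similarity between two bag-of-words Counters.
        dot = sum(a[w] * b[w] for w in a)
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def filter_redundant(articles, dup_threshold=0.9):
        # Stage 1: drop any article that is too similar to one already kept
        # (a stand-in for the fuzzy set redundancy test).
        kept, vectors = [], []
        for text in articles:
            vec = Counter(tokenize(text))
            if all(cosine(vec, v) < dup_threshold for v in vectors):
                kept.append(text)
                vectors.append(vec)
        return kept, vectors

    def cluster(vectors, link_threshold=0.3):
        # Stage 2: greedy single-link grouping of the non-redundant articles;
        # FICUS instead builds a maximal spanning tree over article similarities
        # and partitions it into a hierarchy of keyword-labeled clusters.
        clusters = []
        for i, vec in enumerate(vectors):
            for c in clusters:
                if any(cosine(vec, vectors[j]) >= link_threshold for j in c):
                    c.append(i)
                    break
            else:
                clusters.append([i])
        return clusters

Feeding in, say, two near-duplicate stories and one unrelated story keeps two articles and places them in separate clusters.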


Footnotes
1
Stopwords are commonly occurring words such as articles, prepositions, and conjunctions, which are poor discriminators in representing the content of a sentence (or RSS news article), whereas stemmed words are words reduced to their grammatical root. From now on, unless stated otherwise, whenever we refer to words, we mean non-stop, stemmed words.
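
As an illustration only (the paper does not specify its stopword list or stemming algorithm), such preprocessing might look as follows; the stopword set and the crude suffix-stripping rules are hypothetical stand-ins for a full stemmer such as Porter's.

    # Illustrative stopword subset; a real list contains a few hundred words.
    STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are"}

    def stem(word):
        # Crude suffix stripping, standing in for a real stemmer (e.g., Porter).
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def preprocess(sentence):
        # Lowercase, drop stopwords, and reduce the remaining words to stems.
        return [stem(w) for w in sentence.lower().split() if w not in STOPWORDS]

    # preprocess("The senators debated the spending bills")
    # -> ['senator', 'debat', 'spend', 'bill']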
 
2
From now on, whenever we use the term similar sentences (RSS news articles), we mean sentences (RSS news articles) that are semantically the same but different in terms of words used in the sentences (RSS news articles).
 
3
The title of an RSS news article is also treated as a sentence when determining the degree of resemblance among RSS news articles.
 
4
Standard deviation is a statistical measure that determines how spread out (i.e., the variability of) the values of a data set are.
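
For reference, the (population) standard deviation of values x_1, ..., x_n is

    \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(x_i - \bar{x}\bigr)^2},
    \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

(the sample version divides by n - 1 instead of n). For example, for the values 2, 4, 6 the mean is 4, the squared deviations are 4, 0, 4, and the standard deviation is sqrt(8/3) ≈ 1.63.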
 
5
If there is more than one keyword in a cluster that has the highest weight among all the keywords in the cluster, we treat them all as “representative” keywords.
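
A minimal sketch of this tie-handling rule, assuming cluster keyword weights are kept in a dictionary (the data structure and function name are illustrative, not the paper's):

    def representative_keywords(weights):
        # Return every keyword whose weight equals the maximum weight in the
        # cluster, so that ties are all treated as "representative" keywords.
        top = max(weights.values())
        return [kw for kw, w in weights.items() if w == top]

    # representative_keywords({"election": 0.8, "senate": 0.8, "vote": 0.5})
    # -> ['election', 'senate']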
 
6
To speed up the recursive partitioning process for finding the appropriate labels without sacrificing the quality, we remove stopwords, perform stemming on the words in the RSS news articles, and use the stems as labels.
 
7
 
8
Only single-topic RSS news articles were used in this empirical study, which follows the evaluation premise as detailed in Fung et al. (2003), Hu et al. (2008), and Xu and Gong (2004), in which multiple-topic RSS news articles are not considered.
 
9
A natural class is an element of a set of classes K for which the labels of the documents within each class, i.e., the terms or phrases (such as “sports” or “entertainment”) used to describe the contents of the documents in that class, are known in advance.
 
10
Delicious (delicious.com) is a social bookmarking web service that was developed to aid users in storing, sharing, and discovering web bookmarks. Delicious uses a non-hierarchical classification system which allows its users to tag their bookmarks with freely chosen index terms.
 
Literature
Croft, B., Metzler, D., & Strohman, T. (2010). Search engines: information retrieval in practice. Addison Wesley.
Dhillon, I., & Modha, D. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175.
Dhillon, I., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. In Proceedings of ACM international conference on knowledge discovery and data mining (SIGKDD) (pp. 191–200).
Fung, B., Wang, K., & Ester, M. (2003). Hierarchical document clustering using frequent itemsets. In Proceedings of SIAM international conference on data mining (ICDM) (pp. 59–70).
Hammouda, K., & Kamel, M. (2002). Phrase-based document similarity based on an index graph model. In Proceedings of IEEE international conference on data mining (ICDM) (pp. 203–210).
Hu, J., Fang, L., Cao, Y., Zeng, H., Li, H., Yang, Q., et al. (2008). Enhancing text clustering by leveraging Wikipedia semantics. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 179–186).
Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Koberstein, J., & Ng, Y.-K. (2006). Using word clusters to detect similar RSS news articles. In Proceedings of the international conference on knowledge science, engineering and management (KSEM) (pp. 215–228). LNAI 4092.
Larsen, B., & Aone, C. (1999). Fast and effective text mining using linear-time document clustering. In Proceedings of ACM international conference on knowledge discovery and data mining (SIGKDD) (pp. 16–22).
Li, X., Zaiane, O., & Li, Z. (2006). A comparative study on text clustering methods. In Proceedings of advanced data mining and applications (pp. 644–651).
Lim, S., & Ng, Y.-K. (2005). Categorization and information extraction of multilingual HTML documents. In Proceedings of the 9th international database engineering and application symposium (IDEAS) (pp. 415–422).
Liu, X., Gong, Y., Xu, W., & Zhu, S. (2002). Document clustering with cluster refinement and model selection capabilities. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 191–198).
Mitchell, T. (1997). Machine learning. McGraw Hill.
Ogawa, Y., Morita, T., & Kobayashi, K. (1991). A fuzzy document retrieval system using the keyword connection matrix and a learning method. Fuzzy Sets and Systems, 39, 163–179.
Pera, M., & Ng, Y.-K. (2009). Synthesizing correlated RSS news articles based on a fuzzy equivalence relation. International Journal of Web Information Systems (IJWIS), 5(1), 77–109.
Shafer, G. (1976). A mathematical theory of evidence. Princeton University Press.
Slonim, N., Friedman, N., & Tishby, N. (2002). Unsupervised document classification using sequential information maximization. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 129–136).
Slonim, N., & Tishby, N. (2001). The power of word clusters for text classification. In Proceedings of the 23rd European colloquium on information retrieval research (ECIR) (pp. 191–200).
Xu, W., & Gong, Y. (2004). Document clustering by concept factorization. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 202–209).
Xu, W., Liu, X., & Gong, Y. (2003). News article clustering based on non-negative matrix factorization. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 267–273).
Zheng, X., He, P., Tian, M., & Yuan, F. (2003). Algorithm of documents clustering based on minimum spanning tree. In Proceedings of the 2nd international conference on machine learning and cybernetics (pp. 199–203).
Zhong, S., & Ghosh, J. (2005). A comparative study of generative models for document clustering. Knowledge and Information Systems, 8(3), 374–384.
Zwillinger, D., Krantz, S., & Rosen, K. (Eds.) (1996). Standard mathematical tables and formulae (30th edition). CRC Press.
Metadata
Title
Using maximal spanning trees and word similarity to generate hierarchical clusters of non-redundant RSS news articles
Authors
Maria Soledad Pera
Yiu-Kai Dennis Ng
Publication date
01-10-2012
Publisher
Springer US
Published in
Journal of Intelligent Information Systems / Issue 2/2012
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI
https://doi.org/10.1007/s10844-012-0201-z
