Published in: Soft Computing 3/2015

1 March 2015 | Methodologies and Application

An incremental mixed data clustering method using a new distance measure

Authors: Fakhroddin Noorbehbahani, Sayyed Rasoul Mousavi, Abdolreza Mirzaei

Abstract

Clustering is one of the most widely applied unsupervised machine learning tasks. Although several clustering algorithms exist for numeric data, more sophisticated algorithms that handle mixed data (numeric and categorical) more efficiently are still required. Other important issues in clustering are incremental learning and generating a sufficient number of clusters without specifying that number a priori. In this paper, we introduce a mixed data clustering method that is incremental and generates a sufficient number of clusters automatically. The proposed method is based on the Adjusted Self-Organizing Incremental Neural Network (ASOINN) algorithm and uses a new distance measure and new update rules to handle mixed data. It is compared comprehensively with ASOINN and three other clustering algorithms. Comparative experiments on various data sets, using several clustering evaluation measures, show the effectiveness of the proposed method.
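
The two ideas named in the abstract, a distance over mixed numeric and categorical attributes and incremental clustering that opens clusters on demand, can be illustrated with a minimal sketch. This is not the ASOINN-based method or the new distance measure of the paper, neither of which is specified in the abstract. The sketch assumes a simple Gower-style distance (range-normalised absolute difference for numeric attributes, 0/1 mismatch for categorical ones) and a fixed threshold that decides when a new cluster is created. All names (`mixed_distance`, `IncrementalMixedClusterer`, `threshold`, `ranges`) are hypothetical and chosen for illustration only.

```python
from collections import Counter


def mixed_distance(a, b, numeric_idx, ranges):
    """Gower-style distance between two records (an assumption, not the paper's measure).

    Numeric attributes contribute |a - b| / range; categorical attributes
    contribute 0 on a match and 1 on a mismatch. The result is the mean
    contribution over all attributes.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_idx:
            total += abs(x - y) / ranges[i]
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)


class IncrementalMixedClusterer:
    """Leader-style incremental clustering with a fixed distance threshold."""

    def __init__(self, numeric_idx, ranges, threshold):
        self.numeric_idx = set(numeric_idx)
        self.ranges = ranges        # attribute index -> value range, assumed known
        self.threshold = threshold  # hypothetical parameter, not from the paper
        self.prototypes = []        # one representative record per cluster
        self.counts = []            # number of records absorbed per cluster
        self.cat_counts = []        # per cluster: attribute index -> Counter

    def partial_fit(self, record):
        """Assign one record to a cluster, creating a new cluster if needed."""
        if self.prototypes:
            dists = [mixed_distance(record, p, self.numeric_idx, self.ranges)
                     for p in self.prototypes]
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] <= self.threshold:
                self._update(best, record)
                return best
        # No prototype is close enough: open a new cluster.
        self.prototypes.append(list(record))
        self.counts.append(1)
        self.cat_counts.append({i: Counter([v]) for i, v in enumerate(record)
                                if i not in self.numeric_idx})
        return len(self.prototypes) - 1

    def _update(self, k, record):
        """Update prototype k: running mean (numeric), running mode (categorical)."""
        self.counts[k] += 1
        n = self.counts[k]
        for i, v in enumerate(record):
            if i in self.numeric_idx:
                self.prototypes[k][i] += (v - self.prototypes[k][i]) / n
            else:
                self.cat_counts[k][i][v] += 1
                self.prototypes[k][i] = self.cat_counts[k][i].most_common(1)[0][0]


# Records arrive one at a time; the number of clusters is not fixed in advance.
data = [(1.0, "red"), (1.2, "red"), (9.0, "blue"), (8.8, "blue")]
model = IncrementalMixedClusterer(numeric_idx=[0], ranges={0: 10.0}, threshold=0.3)
labels = [model.partial_fit(r) for r in data]
print(labels)  # [0, 0, 1, 1]
```

In this sketch the number of clusters depends entirely on the assumed `threshold` and on the order in which records arrive. The abstract describes a network-based approach built on ASOINN, so this threshold rule should be read only as a stand-in for the general idea of creating clusters automatically.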

Metadata
Title
An incremental mixed data clustering method using a new distance measure
Authors
Fakhroddin Noorbehbahani
Sayyed Rasoul Mousavi
Abdolreza Mirzaei
Publication date
1 March 2015
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 3/2015
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-014-1296-7
