Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 1/2015

01.01.2015

Clustering categorical data in projected spaces

verfasst von: Mohamed Bouguessa

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 1/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The problem of clustering categorical data has been widely investigated and appropriate approaches have been proposed. However, the majority of the existing methods suffer from one or more of the following limitations: (1) difficulty detecting clusters of very low dimensionality embedded in high-dimensional spaces, (2) lack of an automatic mechanism for identifying relevant dimensions for each cluster, (3) lack of an outlier detection mechanism and (4) dependence on a set of parameters that need to be properly tuned. Most of the existing approaches are inadequate for dealing with these four issues in a unified framework. This motivates our effort to propose a fully automatic projected clustering algorithm for high-dimensional categorical data which is capable of facing the four aforementioned issues in a single framework. Our algorithm comprises two phases: (1) outlier handling and (2) clustering in projected spaces. The first phase of the algorithm is based on a probabilistic approach that exploits the beta mixture model to identify and eliminate outlier objects from a data set in a systematic way. In the second phase, the clustering process is based on a novel quality function that allows the identification of projected clusters of low dimensionality embedded in a high-dimensional space without any parameter setting by the user. The suitability of our proposal is demonstrated through empirical studies using synthetic and real data sets.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Aggarwal CC, Yu PS (2002) Redefining clustering for high dimensional applications. IEEE Trans Knowl Data Eng 14(2):210–225CrossRef Aggarwal CC, Yu PS (2002) Redefining clustering for high dimensional applications. IEEE Trans Knowl Data Eng 14(2):210–225CrossRef
Zurück zum Zitat Aggarwal CC, Procopiuc C, Wolf JL, Yu PS, Park JS (1999) Fast algorithm for Projected clustering. In: Proceedings of the ACM SIGMOD’99 conference, pp 61–72 Aggarwal CC, Procopiuc C, Wolf JL, Yu PS, Park JS (1999) Fast algorithm for Projected clustering. In: Proceedings of the ACM SIGMOD’99 conference, pp 61–72
Zurück zum Zitat Andritsos P, Tsaparas P, Miller RJ, Sevcik KC (2004) LIMBO: scalable clustering of categorical data. In: Proceedings of the 9th international conference on extending database technology (EDBT’04), pp 123–146 Andritsos P, Tsaparas P, Miller RJ, Sevcik KC (2004) LIMBO: scalable clustering of categorical data. In: Proceedings of the 9th international conference on extending database technology (EDBT’04), pp 123–146
Zurück zum Zitat Angiulli F, Pizzuti C (2005) Outlier mining in large high-dimensional data sets. IEEE Trans Knowl Data Eng 17(2):203–215CrossRefMathSciNet Angiulli F, Pizzuti C (2005) Outlier mining in large high-dimensional data sets. IEEE Trans Knowl Data Eng 17(2):203–215CrossRefMathSciNet
Zurück zum Zitat Bai L, Liang J, Dang C, Cao F (2011) A novel attribute weighting algorithm for clustering high-dimensional categorical data. Pattern Recognit 44(12):2843–2861CrossRefMATH Bai L, Liang J, Dang C, Cao F (2011) A novel attribute weighting algorithm for clustering high-dimensional categorical data. Pattern Recognit 44(12):2843–2861CrossRefMATH
Zurück zum Zitat Barbara D, Li Y, Couto J (2002) COOLCAT: an entropy-based algorithm for categorical clustering. In: Proceedings of the 11th ACM international conference on information and knowledge management (CIKM’02), pp 582–589 Barbara D, Li Y, Couto J (2002) COOLCAT: an entropy-based algorithm for categorical clustering. In: Proceedings of the 11th ACM international conference on information and knowledge management (CIKM’02), pp 582–589
Zurück zum Zitat Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum, New YorkCrossRefMATH Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum, New YorkCrossRefMATH
Zurück zum Zitat Bouguessa M (2011) An unsupervised approach for identifying spammers in social networks. In: Proceedings of the 23rd IEEE international conference on tools with artificial intelligence (ICTAI’11), pp 832–840 Bouguessa M (2011) An unsupervised approach for identifying spammers in social networks. In: Proceedings of the 23rd IEEE international conference on tools with artificial intelligence (ICTAI’11), pp 832–840
Zurück zum Zitat Bouguessa M, Wang S (2009) Mining projected clusters in high-dimensional spaces. IEEE Trans Knowl Data Eng 21(4):507–522CrossRef Bouguessa M, Wang S (2009) Mining projected clusters in high-dimensional spaces. IEEE Trans Knowl Data Eng 21(4):507–522CrossRef
Zurück zum Zitat Bouguessa M, Wang S, Sun H (2006) An objective approach to cluster validation. Pattern Recognit Lett 27(13):1419–1430CrossRef Bouguessa M, Wang S, Sun H (2006) An objective approach to cluster validation. Pattern Recognit Lett 27(13):1419–1430CrossRef
Zurück zum Zitat Bouguila N, Ziou D, Monga E (2006) Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications. Stat Comput 16(2):215–225CrossRefMathSciNet Bouguila N, Ziou D, Monga E (2006) Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications. Stat Comput 16(2):215–225CrossRefMathSciNet
Zurück zum Zitat Cesario E, Manco G, Ortale R (2007) Top-down parameter-free clustering of high-dimensional categorical data. IEEE Trans Knowl Data Eng 19(12):1607–1624CrossRef Cesario E, Manco G, Ortale R (2007) Top-down parameter-free clustering of high-dimensional categorical data. IEEE Trans Knowl Data Eng 19(12):1607–1624CrossRef
Zurück zum Zitat Das K, Schneider J (2007) Detecting anomalous records in categorical datasets. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’07), pp 220–229 Das K, Schneider J (2007) Detecting anomalous records in categorical datasets. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’07), pp 220–229
Zurück zum Zitat Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39(1):1–38MATHMathSciNet Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39(1):1–38MATHMathSciNet
Zurück zum Zitat Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Discov 14(1):63–97CrossRefMathSciNet Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Discov 14(1):63–97CrossRefMathSciNet
Zurück zum Zitat Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396CrossRef Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396CrossRef
Zurück zum Zitat Fisher DH (1987) Knowledge acquisition via incremental conceptual clustering. Mach Learn 2(2):139–172 Fisher DH (1987) Knowledge acquisition via incremental conceptual clustering. Mach Learn 2(2):139–172
Zurück zum Zitat Gan G, Wu J (2004) Subspace clustering for high dimensional categorical data. ACM SIGKDD Explor Newsl 6(2):87–94CrossRef Gan G, Wu J (2004) Subspace clustering for high dimensional categorical data. ACM SIGKDD Explor Newsl 6(2):87–94CrossRef
Zurück zum Zitat Ganti V, Gehrke J, Ramakrishnan R (1999) CACTUS: clustering categorical data using summaries. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’99), pp 73–83 Ganti V, Gehrke J, Ramakrishnan R (1999) CACTUS: clustering categorical data using summaries. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’99), pp 73–83
Zurück zum Zitat Guha S, Rastogi R, Shim K (2000) ROCK: a robust clustering algorithm for categorical attributes. Inf Syst 25(5):345–366CrossRef Guha S, Rastogi R, Shim K (2000) ROCK: a robust clustering algorithm for categorical attributes. Inf Syst 25(5):345–366CrossRef
Zurück zum Zitat He Z, Deng S, Xu X, Huang JZ (2006) A fast greedy algorithm for outlier mining. In: Proceedings of the 10th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD’06), pp 567–576 He Z, Deng S, Xu X, Huang JZ (2006) A fast greedy algorithm for outlier mining. In: Proceedings of the 10th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD’06), pp 567–576
Zurück zum Zitat Ji Y, Wu C, Liu P, Wang J, Coombes KR (2005) Applications of beta-mixture models in bioinformatics. Bioinformatics 21(9):2118–2122CrossRef Ji Y, Wu C, Liu P, Wang J, Coombes KR (2005) Applications of beta-mixture models in bioinformatics. Bioinformatics 21(9):2118–2122CrossRef
Zurück zum Zitat Jing L, Ng MK, Huang JZ (2007) An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Trans Knowl Data Eng 19(8):1026–1041CrossRef Jing L, Ng MK, Huang JZ (2007) An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Trans Knowl Data Eng 19(8):1026–1041CrossRef
Zurück zum Zitat Keogh E, Lonardi S, Ratanamahatana CA (2004) Towards parameter-free data mining. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04), pp 206–215 Keogh E, Lonardi S, Ratanamahatana CA (2004) Towards parameter-free data mining. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04), pp 206–215
Zurück zum Zitat Kim M, Ramakrishna RS (2006) Projected clustering for categorical datasets. Pattern Recognit Lett 27(12):1405–1417CrossRef Kim M, Ramakrishna RS (2006) Projected clustering for categorical datasets. Pattern Recognit Lett 27(12):1405–1417CrossRef
Zurück zum Zitat Koufakou A, Georgiopoulos M (2010) A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Min Knowl Discov 20(2):259–289CrossRefMathSciNet Koufakou A, Georgiopoulos M (2010) A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Min Knowl Discov 20(2):259–289CrossRefMathSciNet
Zurück zum Zitat Koufakou A, Ortiz EG, Georgiopoulos M, Anagnostopoulos GC, Reynolds KM (2007) A scalable and efficient outlier detection strategy for categorical data. In: Proceedings of the 19th IEEE international conference on tools with artificial intelligence (ICTAI’07), pp 210–217 Koufakou A, Ortiz EG, Georgiopoulos M, Anagnostopoulos GC, Reynolds KM (2007) A scalable and efficient outlier detection strategy for categorical data. In: Proceedings of the 19th IEEE international conference on tools with artificial intelligence (ICTAI’07), pp 210–217
Zurück zum Zitat Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1), art no 1 Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1), art no 1
Zurück zum Zitat Ma Z, Leijon A (2009) Beta mixture models and the application to image classification. In: Proceedings of the 16th IEEE international conference on image processing (ICIP’09), pp 2045–2048 Ma Z, Leijon A (2009) Beta mixture models and the application to image classification. In: Proceedings of the 16th IEEE international conference on image processing (ICIP’09), pp 2045–2048
Zurück zum Zitat Moise G, Sander J, Ester M (2008) Robust projected clustering. Knowl Inf Syst 14(3):273–298CrossRefMATH Moise G, Sander J, Ester M (2008) Robust projected clustering. Knowl Inf Syst 14(3):273–298CrossRefMATH
Zurück zum Zitat Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. Proc Very Large Databases Endow 2(1):1270–1281 Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. Proc Very Large Databases Endow 2(1):1270–1281
Zurück zum Zitat Otey ME, Ghoting A, Parthasarathy S (2006) Fast distributed outlier detection in mixed-attribute data sets. Data Min Knowl Discov 12(2–3):203–228CrossRefMathSciNet Otey ME, Ghoting A, Parthasarathy S (2006) Fast distributed outlier detection in mixed-attribute data sets. Data Min Knowl Discov 12(2–3):203–228CrossRefMathSciNet
Zurück zum Zitat Rodriguez-Baena DS, Perez-Pulido AJ, Aguilar-Ruiz JS (2011) A biclustering algorithm for extracting bit-patterns from binary datasets. Bioinformatics 27(19):2738–2745CrossRef Rodriguez-Baena DS, Perez-Pulido AJ, Aguilar-Ruiz JS (2011) A biclustering algorithm for extracting bit-patterns from binary datasets. Bioinformatics 27(19):2738–2745CrossRef
Zurück zum Zitat Smyth P (2000) Model selection for probabilistic clustering using cross-validated likelihood. Stat Comput 10(1):63–72CrossRef Smyth P (2000) Model selection for probabilistic clustering using cross-validated likelihood. Stat Comput 10(1):63–72CrossRef
Zurück zum Zitat Wang K, Xu C, Liu B (1999) Clustering transactions using large items. In: Proceedings of the 8th ACM international conference on information and knowledge management (CIKM’99), pp 483–490 Wang K, Xu C, Liu B (1999) Clustering transactions using large items. In: Proceedings of the 8th ACM international conference on information and knowledge management (CIKM’99), pp 483–490
Zurück zum Zitat Xiong T, Wang S, Mayers A, Monga E (2012) DHCC: divisive hierarchical clustering of categorical data. Data Min Knowl Discov 24(1):103–135CrossRefMATHMathSciNet Xiong T, Wang S, Mayers A, Monga E (2012) DHCC: divisive hierarchical clustering of categorical data. Data Min Knowl Discov 24(1):103–135CrossRefMATHMathSciNet
Zurück zum Zitat Yip KY, Cheung DW, Ng MK (2004) HARP: A practical projected clustering algorithm. IEEE Trans Knowl Data Eng 16(11):1387–1397CrossRef Yip KY, Cheung DW, Ng MK (2004) HARP: A practical projected clustering algorithm. IEEE Trans Knowl Data Eng 16(11):1387–1397CrossRef
Zurück zum Zitat Yip AM, Ng MK, Wu EH, Chan TF (2007) Strategies for identifying statistically significant dense regions in microarray data. IEEE/ACM Trans Comput Biol Bioinform 4(3):415–429CrossRef Yip AM, Ng MK, Wu EH, Chan TF (2007) Strategies for identifying statistically significant dense regions in microarray data. IEEE/ACM Trans Comput Biol Bioinform 4(3):415–429CrossRef
Zurück zum Zitat Zaki MJ, Peters M, Assent I, Seidl T (2007) CLICKS: an effective algorithm for mining subspace clusters in categorical datasets. Data Knowl Eng 60(1):51–70CrossRef Zaki MJ, Peters M, Assent I, Seidl T (2007) CLICKS: an effective algorithm for mining subspace clusters in categorical datasets. Data Knowl Eng 60(1):51–70CrossRef
Metadaten
Titel
Clustering categorical data in projected spaces
verfasst von
Mohamed Bouguessa
Publikationsdatum
01.01.2015
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 1/2015
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-013-0336-8

Weitere Artikel der Ausgabe 1/2015

Data Mining and Knowledge Discovery 1/2015 Zur Ausgabe