Skip to main content
Erschienen in: Knowledge and Information Systems 2/2015

01.05.2015 | Regular Paper

Fast and scalable support vector clustering for large-scale data analysis

verfasst von: Yuan Ping, Yun Feng Chang, Yajian Zhou, Ying Jie Tian, Yi Xian Yang, Zhili Zhang

Erschienen in: Knowledge and Information Systems | Ausgabe 2/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

As an important boundary-based clustering algorithm, support vector clustering (SVC) benefits multiple applications for its capability of handling arbitrary cluster shapes. However, its popularity is degraded by both its highly intensive pricey computation and poor label performance which are due to redundant kernel function matrix required by estimating a support function and ineffectively checking segmers between all pairs of data points, respectively. To address these two problems, a fast and scalable SVC (FSSVC) method is proposed in this paper to achieve significant improvement on efficiency while guarantees a comparable accuracy with the state-of-the-art methods. The heart of our approach includes (1) constructing the hypersphere and support function by cluster boundaries which prunes unnecessary computation and storage of kernel functions and (2) presenting an adaptive labeling strategy which decomposes clusters into convex hulls and then employs a convex-decomposition-based cluster labeling algorithm or cone cluster labeling algorithm on the basis of whether the radius of the hypersphere is greater than 1. Both theoretical analysis and experimental results (e.g., the first rank of a nonparametric statistical test) show the superiority of our method over the others, especially for large-scale data analysis under limited memory requirements.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Alzate C, Suykens JAK (2010) Multiway spectral clustering with out-of-sample extensions through weighted kernel pca. IEEE Trans Patt Anal Mach Intell 32(2):335–347CrossRef Alzate C, Suykens JAK (2010) Multiway spectral clustering with out-of-sample extensions through weighted kernel pca. IEEE Trans Patt Anal Mach Intell 32(2):335–347CrossRef
2.
Zurück zum Zitat Alzate C, Suykens JAK (2011) Sparse kernel spectral clustering models for large-scale data analysis. Neurocomputing 74(9):1382–1390CrossRef Alzate C, Suykens JAK (2011) Sparse kernel spectral clustering models for large-scale data analysis. Neurocomputing 74(9):1382–1390CrossRef
3.
Zurück zum Zitat Ban T, Abe S (2004) Spatially chunking support vector clustering algorithm. In: Proceedings of international joint conference on neural networks, pp 414–418 Ban T, Abe S (2004) Spatially chunking support vector clustering algorithm. In: Proceedings of international joint conference on neural networks, pp 414–418
4.
Zurück zum Zitat Ben-Hur A, Horn D, Siegelmann HT, Vapnik VN (2000) A support vector cluster method. In: Proceedings of 15th international conference on pattern recognition, vol 2, Barcelona, Spain, pp 724–727 Ben-Hur A, Horn D, Siegelmann HT, Vapnik VN (2000) A support vector cluster method. In: Proceedings of 15th international conference on pattern recognition, vol 2, Barcelona, Spain, pp 724–727
5.
Zurück zum Zitat Ben-Hur A, Horn D, Siegelmann HT, Vapnik VN (2001) Support vector clustering. J Mach Learn Res 2:125–137 Ben-Hur A, Horn D, Siegelmann HT, Vapnik VN (2001) Support vector clustering. J Mach Learn Res 2:125–137
6.
Zurück zum Zitat Bergmann G, Hommel G (1988) Improvements of general multiple test procedures for redundant systems of hypotheses. In: Multiple Hypotheses Testing, vol 70, pp. 100–115 Bergmann G, Hommel G (1988) Improvements of general multiple test procedures for redundant systems of hypotheses. In: Multiple Hypotheses Testing, vol 70, pp. 100–115
7.
Zurück zum Zitat Boyd S, Vandenberghe L (2009) Convex Optimization. cambridge university press, Cambridge Boyd S, Vandenberghe L (2009) Convex Optimization. cambridge university press, Cambridge
8.
Zurück zum Zitat Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Min Know Dis 2(2):121–167CrossRef Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Min Know Dis 2(2):121–167CrossRef
9.
Zurück zum Zitat Camastra F, Verri A (2005) A novel kernel method for clustering. IEEE Trans Patt Anal Mach Intell 27(5):801–805CrossRef Camastra F, Verri A (2005) A novel kernel method for clustering. IEEE Trans Patt Anal Mach Intell 27(5):801–805CrossRef
10.
Zurück zum Zitat Chiang JH, Hao PY (2003) A new kernel-based fuzzy clustering approach: Support vector clustering with cell growing. IEEE Trans Fuzzy Syst 11(4):518–527CrossRef Chiang JH, Hao PY (2003) A new kernel-based fuzzy clustering approach: Support vector clustering with cell growing. IEEE Trans Fuzzy Syst 11(4):518–527CrossRef
11.
Zurück zum Zitat Estivill-Castro V, Lee I (2000) Amoeba: Hierarchical clustering based on spatial proximity using delaunay diagram. In: Proceedings of the 9th international symposium on spatial data handling, pp 7a.26-7a.4 Estivill-Castro V, Lee I (2000) Amoeba: Hierarchical clustering based on spatial proximity using delaunay diagram. In: Proceedings of the 9th international symposium on spatial data handling, pp 7a.26-7a.4
12.
Zurück zum Zitat Estivill-Castro V, Lee I, Murray AT (2001) Criteria on proximity graphs for boundary extraction and spatial clustering. In: Proceedings of the 5th Pacific-Asia conference on knowledge discovery and data mining (PAKDD’01), vol 2035, pp 348–357 Estivill-Castro V, Lee I, Murray AT (2001) Criteria on proximity graphs for boundary extraction and spatial clustering. In: Proceedings of the 5th Pacific-Asia conference on knowledge discovery and data mining (PAKDD’01), vol 2035, pp 348–357
13.
Zurück zum Zitat Forghani Y, Yazdi HS, Effati S (2012) An extension to fuzzy support vector data description (fsvdd*). Patt Anal Appl 15(3):237–247CrossRefMathSciNet Forghani Y, Yazdi HS, Effati S (2012) An extension to fuzzy support vector data description (fsvdd*). Patt Anal Appl 15(3):237–247CrossRefMathSciNet
15.
Zurück zum Zitat Garcia S, Herrera F (2008) An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. J Mach Learn Res 9:2677–2694MATH Garcia S, Herrera F (2008) An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. J Mach Learn Res 9:2677–2694MATH
16.
Zurück zum Zitat Graven M, DiPasquo D, Freitag D, McCallum A, Mitchell T, Nigam K, Slattery S (1998) Learning to extract symbolic knowledge form the world wide web. In: Proceedings of 15th Nat’l conference for artificial intelligence (AAAI’98), pp 509–516. Madison, Wisconsin Graven M, DiPasquo D, Freitag D, McCallum A, Mitchell T, Nigam K, Slattery S (1998) Learning to extract symbolic knowledge form the world wide web. In: Proceedings of 15th Nat’l conference for artificial intelligence (AAAI’98), pp 509–516. Madison, Wisconsin
17.
Zurück zum Zitat Guo CH, Li F (2011) An improved algorithm for support vector clustering based on maximum entropy principle and kernel matrix. Exp Syst Appl 38(7):8138–8143CrossRef Guo CH, Li F (2011) An improved algorithm for support vector clustering based on maximum entropy principle and kernel matrix. Exp Syst Appl 38(7):8138–8143CrossRef
18.
Zurück zum Zitat Hersh WR, Buckley C, Leone TJ, Hickam DH (1994) Ohsumed: An interactive retrieval evaluation and new large test collection for research. In: Proceedings of 17th annual ACM SIGIR conference, pp 192–201 Hersh WR, Buckley C, Leone TJ, Hickam DH (1994) Ohsumed: An interactive retrieval evaluation and new large test collection for research. In: Proceedings of 17th annual ACM SIGIR conference, pp 192–201
19.
20.
Zurück zum Zitat Hurley J, Garcia-Palacios E, Sezer S (2011) Classifying network protocols: a ‘two-way’ flow approach. IET Commun 5(1):79–89CrossRef Hurley J, Garcia-Palacios E, Sezer S (2011) Classifying network protocols: a ‘two-way’ flow approach. IET Commun 5(1):79–89CrossRef
21.
Zurück zum Zitat Jung KH, Kim N, Lee J (2011) Dynamic pattern denoising method using multi-basin system with kernels. Patt Recogn 44(8):1698–1707CrossRefMATH Jung KH, Kim N, Lee J (2011) Dynamic pattern denoising method using multi-basin system with kernels. Patt Recogn 44(8):1698–1707CrossRefMATH
22.
Zurück zum Zitat Jung KH, Lee D, Lee J (2010) Fast support-based clustering method for large-scale problems. Patt Recogn 43(5):1975–1983CrossRefMATH Jung KH, Lee D, Lee J (2010) Fast support-based clustering method for large-scale problems. Patt Recogn 43(5):1975–1983CrossRefMATH
24.
Zurück zum Zitat Lang K (1995) Newsweeder: Learning to filter netnews. In: Proceedings 12th international conference machine learning (ICML’95), pp 331–339. Tahoe City. Lang K (1995) Newsweeder: Learning to filter netnews. In: Proceedings 12th international conference machine learning (ICML’95), pp 331–339. Tahoe City.
25.
Zurück zum Zitat Lee CH, Yang HC (2009) Construction of supervised and unsupervised learning systems for multilingual text categorization. Exp Syst Appl 2–1(36):2400–2410CrossRef Lee CH, Yang HC (2009) Construction of supervised and unsupervised learning systems for multilingual text categorization. Exp Syst Appl 2–1(36):2400–2410CrossRef
26.
Zurück zum Zitat Lee D, Jung KH, Lee J (2009) Constructing sparse kernel machines using attractors. IEEE Trans Pat Anal Mach Intell 20(4):721–729 Lee D, Jung KH, Lee J (2009) Constructing sparse kernel machines using attractors. IEEE Trans Pat Anal Mach Intell 20(4):721–729
27.
Zurück zum Zitat Lee D, Lee J (2007) Equilibrium-based support vector machine for semisupervised classification. IEEE Trans Neural Netw 18(2):578–583CrossRef Lee D, Lee J (2007) Equilibrium-based support vector machine for semisupervised classification. IEEE Trans Neural Netw 18(2):578–583CrossRef
28.
Zurück zum Zitat Lee J, Lee D (2005) An improved cluster labeling method for support vector clustering. IEEE Trans Patt Anal Mach Intell 27(3):461–464CrossRef Lee J, Lee D (2005) An improved cluster labeling method for support vector clustering. IEEE Trans Patt Anal Mach Intell 27(3):461–464CrossRef
29.
Zurück zum Zitat Lee J, Lee D (2006) Dynamic characterization of cluster structures for robust and inductive support vector clustering. IEEE Trans Patt Anal Mach Intell 28(11):1869–1874CrossRef Lee J, Lee D (2006) Dynamic characterization of cluster structures for robust and inductive support vector clustering. IEEE Trans Patt Anal Mach Intell 28(11):1869–1874CrossRef
30.
Zurück zum Zitat Lee SH, Daniels KM (2005) Gaussian kernel width selection and fast cluster labeling for support vector clustering. Technical Report No. 2005–009, USA Lee SH, Daniels KM (2005) Gaussian kernel width selection and fast cluster labeling for support vector clustering. Technical Report No. 2005–009, USA
31.
Zurück zum Zitat Lee SH, Daniels KM (2005) Cone cluster labeling for support vector clustering. In: Proceedings of 6th SIAM conference on data mining, pp 484–488 Lee SH, Daniels KM (2005) Cone cluster labeling for support vector clustering. In: Proceedings of 6th SIAM conference on data mining, pp 484–488
33.
Zurück zum Zitat Li YH (2011) Selecting training points for one-class support vector machines. Patt Recogn Lett 32(11):1517–1522CrossRef Li YH (2011) Selecting training points for one-class support vector machines. Patt Recogn Lett 32(11):1517–1522CrossRef
34.
Zurück zum Zitat Li YH, Maguire L (2011) Selecting critical patterns based on local geometrical and statistical information. IEEE Trans Patt Anal Mach Intell 33(6):1189–1201CrossRef Li YH, Maguire L (2011) Selecting critical patterns based on local geometrical and statistical information. IEEE Trans Patt Anal Mach Intell 33(6):1189–1201CrossRef
35.
Zurück zum Zitat Ling P, Zhou CG, Zhou X (2010) Improved support vector clustering. Eng Appl Artif Intell 23(4):552–559CrossRef Ling P, Zhou CG, Zhou X (2010) Improved support vector clustering. Eng Appl Artif Intell 23(4):552–559CrossRef
36.
Zurück zum Zitat Mehrotra S (1992) On the implementation of a primal-dual interior point method. SIAM J Optim 2: 575–601 Mehrotra S (1992) On the implementation of a primal-dual interior point method. SIAM J Optim 2: 575–601
37.
Zurück zum Zitat Peng JF, Zhou YJ, Wang C, Yang YX, Ping Y (2011) Early TCP traffic classification. J Appl Sci Electron Inform Eng 29(1):73–77 Peng JF, Zhou YJ, Wang C, Yang YX, Ping Y (2011) Early TCP traffic classification. J Appl Sci Electron Inform Eng 29(1):73–77
38.
Zurück zum Zitat Ping Y, Tian YJ, Zhou YJ, Yang YX (2012) Convex decomposition based cluster labeling method for support vector clustering. J Comput Sci Technol 27(2):428–442CrossRefMATHMathSciNet Ping Y, Tian YJ, Zhou YJ, Yang YX (2012) Convex decomposition based cluster labeling method for support vector clustering. J Comput Sci Technol 27(2):428–442CrossRefMATHMathSciNet
39.
Zurück zum Zitat Ping Y, Zhou YJ, Xue C, Yang YX (2012) Efficient representation of text with multiple perspectives. J China Uni Posts Telecommun 19(1):101–111 Ping Y, Zhou YJ, Xue C, Yang YX (2012) Efficient representation of text with multiple perspectives. J China Uni Posts Telecommun 19(1):101–111
40.
Zurück zum Zitat Ping Y, Zhou YJ, Yang YX (2011) A novel scheme for accelerating support vector clustering (in press) Ping Y, Zhou YJ, Yang YX (2011) A novel scheme for accelerating support vector clustering (in press)
41.
Zurück zum Zitat Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods: support vector learning. MIT Press, Cambridge Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods: support vector learning. MIT Press, Cambridge
42.
Zurück zum Zitat Puma-Villanueva WJ, Bezerra GB, Lima CAM, Zuben FJV (2005) Improving support vector clustering with ensembles. In: Proceedings of international joint conference on neural networks, Montreal, pp 13–15 Puma-Villanueva WJ, Bezerra GB, Lima CAM, Zuben FJV (2005) Improving support vector clustering with ensembles. In: Proceedings of international joint conference on neural networks, Montreal, pp 13–15
43.
Zurück zum Zitat Scholkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1472CrossRef Scholkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1472CrossRef
44.
Zurück zum Zitat Shamir O, Tishby N (2010) Stability and model selection in k-means clustering. Mach Learn 81(1): 213–243 Shamir O, Tishby N (2010) Stability and model selection in k-means clustering. Mach Learn 81(1): 213–243
45.
Zurück zum Zitat Sheskin D (2003) Handbook of parametric and nonparametric statistical procedures. Chapman & Hall, LondonCrossRef Sheskin D (2003) Handbook of parametric and nonparametric statistical procedures. Chapman & Hall, LondonCrossRef
46.
Zurück zum Zitat Su MY (2011) Using clustering to improve the KNN-based classifiers for online anomaly network traffic identification. J Netw Comput Appl 34(2):722–730CrossRef Su MY (2011) Using clustering to improve the KNN-based classifiers for online anomaly network traffic identification. J Netw Comput Appl 34(2):722–730CrossRef
47.
Zurück zum Zitat Tax DMJ, Duin PRW (1999) Support vector domain description. Patt Recogn Lett 11–13(20):1191–1199CrossRef Tax DMJ, Duin PRW (1999) Support vector domain description. Patt Recogn Lett 11–13(20):1191–1199CrossRef
48.
Zurück zum Zitat UNIBS: The UNIBS anonymized 2009 internet traces (Mar 18, 2010) UNIBS: The UNIBS anonymized 2009 internet traces (Mar 18, 2010)
49.
Zurück zum Zitat Veenman CJ, Reinders MJT, Backer E (2002) A maximum variance cluster algorithm. IEEE Trans Patt Anal Mach Intell 24(9):1273–1280CrossRef Veenman CJ, Reinders MJT, Backer E (2002) A maximum variance cluster algorithm. IEEE Trans Patt Anal Mach Intell 24(9):1273–1280CrossRef
50.
Zurück zum Zitat Wang CD, Lai JH (2013) Position regularized support vector domain description. Patt Recogn 46(3): 875–884 Wang CD, Lai JH (2013) Position regularized support vector domain description. Patt Recogn 46(3): 875–884
51.
Zurück zum Zitat Wang CD, Lai JH, Huang D, Zheng WS (2012) Svstream: A support vector based algorithm for clustering data streams. IEEE Trans Know Data Eng (in press), 1–14 . doi: 10.1109/TKDE.2011.263 Wang CD, Lai JH, Huang D, Zheng WS (2012) Svstream: A support vector based algorithm for clustering data streams. IEEE Trans Know Data Eng (in press), 1–14 . doi: 10.​1109/​TKDE.​2011.​263
52.
Zurück zum Zitat Wang DF, Shi L, Yeung DS, Tsang ECC, Heng PA (2007) Ellipsoidal support vector clustering for functional MRI analysis. Patt Recogn 40(10):2685–2695CrossRefMATH Wang DF, Shi L, Yeung DS, Tsang ECC, Heng PA (2007) Ellipsoidal support vector clustering for functional MRI analysis. Patt Recogn 40(10):2685–2695CrossRefMATH
53.
Zurück zum Zitat Wang JS, Chiang JC (2009) An efficient data preprocessing procedure for support vector clustering. J Univ Comput Sci 15(4):705–721 Wang JS, Chiang JC (2009) An efficient data preprocessing procedure for support vector clustering. J Univ Comput Sci 15(4):705–721
54.
Zurück zum Zitat Wu M, Scholkopf B (2007) A local learning approach for clustering. In: Advances in neural information processing systems (NIPS 2007), vol 19, pp 1529–1536. Vancouver, Canada (2007) Wu M, Scholkopf B (2007) A local learning approach for clustering. In: Advances in neural information processing systems (NIPS 2007), vol 19, pp 1529–1536. Vancouver, Canada (2007)
56.
Zurück zum Zitat Yang JH, Estivill-Castro V, Chalup SK (2002) Support vector clustering through proximity graph modelling. In: Proceedings of 9th international conference on neural information processing (ICONIP’02), pp 898–903. Orchid Country Club, Singapore Yang JH, Estivill-Castro V, Chalup SK (2002) Support vector clustering through proximity graph modelling. In: Proceedings of 9th international conference on neural information processing (ICONIP’02), pp 898–903. Orchid Country Club, Singapore
Metadaten
Titel
Fast and scalable support vector clustering for large-scale data analysis
verfasst von
Yuan Ping
Yun Feng Chang
Yajian Zhou
Ying Jie Tian
Yi Xian Yang
Zhili Zhang
Publikationsdatum
01.05.2015
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 2/2015
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-013-0724-9

Weitere Artikel der Ausgabe 2/2015

Knowledge and Information Systems 2/2015 Zur Ausgabe

Premium Partner