Published in: Data Mining and Knowledge Discovery 3/2005

01.11.2005

Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing

Authors: Hwanjo Yu, Jiong Yang, Jiawei Han, Xiaolei Li

Abstract

Support vector machines (SVMs) are promising methods for classification and regression analysis because of their solid mathematical foundations, which confer two desirable properties: margin maximization and nonlinear classification using kernels. Despite these prominent properties, however, SVMs are usually not chosen for large-scale data mining problems because their training complexity is highly dependent on the data set size. Unlike traditional pattern recognition and machine learning, real-world data mining applications often involve huge numbers of data records, so it is too expensive to perform multiple scans over the entire data set, and infeasible to hold the data set in memory. This paper presents a method, Clustering-Based SVM (CB-SVM), that maximizes SVM performance for very large data sets given a limited amount of resources, e.g., memory. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide the SVM with high-quality samples. These samples carry statistical summaries of the data and maximize the benefit of learning. Our analyses show that the training complexity of CB-SVM is quadratically dependent on the number of support vectors, which is usually much smaller than the size of the entire data set. Our experiments on synthetic and real-world data sets show that CB-SVM is highly scalable for very large data sets and very accurate in terms of classification.
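The core idea — train the SVM on compact cluster summaries produced by a single clustering pass rather than on every record — can be illustrated with a toy sketch. This is a hypothetical stand-in, not the paper's algorithm: scikit-learn's KMeans replaces the hierarchical micro-clustering, and plain cluster centroids replace the richer statistical summaries CB-SVM maintains.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: two well-separated Gaussian classes with 5,000 points each.
X_pos = rng.normal(loc=2.0, scale=1.0, size=(5000, 2))
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(5000, 2))

def summarize(X, k=50):
    """Summarize a class by k cluster centroids (stand-in for micro-clusters)."""
    return KMeans(n_clusters=k, n_init=5, random_state=0).fit(X).cluster_centers_

# Train the SVM on 100 centroids instead of 10,000 raw points.
X_train = np.vstack([summarize(X_pos), summarize(X_neg)])
y_train = np.array([1] * 50 + [-1] * 50)
clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

# The classifier trained only on summaries still separates the full data well.
X_all = np.vstack([X_pos, X_neg])
y_all = np.array([1] * 5000 + [-1] * 5000)
acc = clf.score(X_all, y_all)
print(f"trained on {len(X_train)} summaries; accuracy on full data: {acc:.3f}")
```

CB-SVM itself goes further than this flat sketch: the cluster hierarchy lets it replace coarse summaries with finer ones where they matter for the decision boundary, so accuracy approaches full-data training at a fraction of the cost.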


Footnotes
1
http://www.csie.ntu.edu.tw/~cjlin/libsvm
 
2
See http://www.cs.wisc.edu/dmi/asvm/ for the ASVM implementation.
 
3
We ran SEL with δ = 5 (starting from one positive and one negative sample and adding five samples at each round), which gave fairly good results among the values we tried. δ is commonly set below ten: if δ is too high, performance converges more slowly, so a larger amount of training data is needed to reach the same accuracy; if δ is too low, SEL may need too many iterations (Schohn and Cohn, 2000; Tong and Koller, 2000).
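The sampling loop this footnote describes can be sketched as follows. This is a hypothetical minimal version, not the implementation used in the experiments: scikit-learn's SVC, the linear kernel, and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def sel_loop(X, y, delta=5, rounds=20, seed=0):
    """Selective sampling (SEL) sketch: start from one positive and one
    negative sample, then each round add the delta unlabeled points that
    lie closest to the current decision boundary."""
    rng = np.random.default_rng(seed)
    chosen = {int(rng.choice(np.flatnonzero(y == 1))),
              int(rng.choice(np.flatnonzero(y == -1)))}
    for _ in range(rounds):
        idx = np.array(sorted(chosen))
        clf = SVC(kernel="linear").fit(X[idx], y[idx])
        rest = np.setdiff1d(np.arange(len(X)), idx)
        if rest.size == 0:
            break
        # |decision_function| is proportional to distance from the hyperplane.
        closest = rest[np.argsort(np.abs(clf.decision_function(X[rest])))[:delta]]
        chosen.update(int(i) for i in closest)
    idx = np.array(sorted(chosen))
    return SVC(kernel="linear").fit(X[idx], y[idx]), idx

# Toy data: two well-separated Gaussian classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (500, 2)), rng.normal(-2, 1, (500, 2))])
y = np.array([1] * 500 + [-1] * 500)

clf, used = sel_loop(X, y, delta=5, rounds=20)
print(len(used), round(clf.score(X, y), 3))
```

With δ = 5 and 20 rounds, the final model is trained on only 2 + 20·5 = 102 of the 1,000 points, which illustrates the trade-off the footnote describes between δ and the number of iterations.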
 
4
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
 
5
In machine learning theory, the model complexity or power of a classification function is often measured by its VC-dimension. SVMs with Gaussian kernels have infinite VC-dimension (Burges, 1998), meaning that they are able to classify arbitrary partitionings of a data set.
 
6
http://ftp.ics.uci.edu/pub/machine-learning-databases/covtype/
 
References
Agarwal, D.K. 2002. Shrinkage estimator generalizations of proximal support vector machines. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'02), pp. 173–182.
Burges, C.J.C. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167.
Cauwenberghs, G. and Poggio, T. 2000. Incremental and decremental support vector machine learning. In Proc. Advances in Neural Information Processing Systems (NIPS'00), pp. 409–415.
Chang, C.-C. and Lin, C.-J. 2001. Training nu-support vector classifiers: Theory and algorithms. Neural Computation, 13:2119–2147.
Collobert, R. and Bengio, S. 2001. SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143–160.
Devroye, L., Gyorfi, L., and Lugosi, G. (Eds.) 1996. A Probabilistic Theory of Pattern Recognition. Springer-Verlag.
Domingos, P. and Hulten, G. 2000. Mining high-speed data streams. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'00).
Fung, G. and Mangasarian, O.L. 2001. Proximal support vector machine classifiers. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'01), pp. 77–86.
Ganti, V., Ramakrishnan, R., and Gehrke, J. 1999. Clustering large datasets in arbitrary metric spaces. In Proc. Int. Conf. Data Engineering (ICDE'98).
Greiner, R., Grove, A.J., and Roth, D. 1996. Learning active classifiers. In Proc. Int. Conf. Machine Learning (ICML'96), pp. 207–215.
Guha, S., Rastogi, R., and Shim, K. 1998. CURE: An efficient clustering algorithm for large databases. In Proc. ACM SIGMOD Int. Conf. Management of Data (SIGMOD'98), pp. 73–84.
Joachims, T. 1998a. Text categorization with support vector machines. In Proc. European Conf. Machine Learning (ECML'98), pp. 137–142.
Joachims, T. 1998b. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Machines, B. Scholkopf, C. Burges, and A.J. Smola (Eds.). Cambridge, MA: MIT Press.
Karypis, G., Han, E.-H., and Kumar, V. 1999. Chameleon: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75.
Kivinen, J., Smola, A.J., and Williamson, R.C. 2001. Online learning with kernels. In Proc. Advances in Neural Information Processing Systems (NIPS'01), pp. 785–792.
Lee, Y.-J. and Mangasarian, O.L. 2001. RSVM: Reduced support vector machines. In Proc. SIAM Int. Conf. Data Mining.
Mangasarian, O.L. and Musicant, D.R. 2000. Active support vector machine classification. Tech. Rep., Computer Sciences Department, University of Wisconsin at Madison.
Platt, J. 1998. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Machines, B. Scholkopf, C. Burges, and A.J. Smola (Eds.). Cambridge, MA: MIT Press.
Scheffer, T. and Wrobel, S. 2002. Finding the most interesting patterns in a database quickly by using sequential sampling. Journal of Machine Learning Research.
Schohn, G. and Cohn, D. 2000. Less is more: Active learning with support vector machines. In Proc. Int. Conf. Machine Learning (ICML'00), pp. 839–846.
Scholkopf, B., Williamson, R.C., Smola, A.J., and Shawe-Taylor, J. 2000. SV estimation of a distribution's support. In Proc. Advances in Neural Information Processing Systems (NIPS'00), pp. 582–588.
Shih, L., Chang, Y.-H., Rennie, J., and Karger, D. 2002. Not too hot, not too cold: The bundled-SVM is just right! In Proc. Workshop on Text Learning at the Int. Conf. on Machine Learning.
Smola, A.J. and Scholkopf, B. 1998. A tutorial on support vector regression. NeuroCOLT2 Technical Report NC2-TR-1998-030.
Syed, N., Liu, H., and Sung, K. 1999. Incremental learning with support vector machines. In Proc. Workshop on Support Vector Machines at the Int. Joint Conf. on Artificial Intelligence (IJCAI'99).
Tong, S. and Koller, D. 2000. Support vector machine active learning with applications to text classification. In Proc. Int. Conf. Machine Learning (ICML'00), pp. 999–1006.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley and Sons.
Wang, W., Yang, J., and Muntz, R.R. 1997. STING: A statistical information grid approach to spatial data mining. In Proc. Int. Conf. Very Large Databases (VLDB'97), pp. 186–195.
Watanabe, O., Balcázar, J.L., and Dai, Y. 2001. A random sampling technique for training support vector machines. In Proc. Int. Conf. Data Mining (ICDM'01), pp. 43–50.
Yu, H., Han, J., and Chang, K.C. 2002. PEBL: Positive-example based learning for Web page classification using SVM. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'02), pp. 239–248.
Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: An efficient data clustering method for very large databases. In Proc. ACM SIGMOD Int. Conf. Management of Data (SIGMOD'96), pp. 103–114.
Metadata
Title
Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing
Authors
Hwanjo Yu
Jiong Yang
Jiawei Han
Xiaolei Li
Publication date
01.11.2005
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 3/2005
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-005-0005-7
