Published in: Data Mining and Knowledge Discovery 3/2005

01.11.2005

Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing

Authors: Hwanjo Yu, Jiong Yang, Jiawei Han, Xiaolei Li

Abstract

Support vector machines (SVMs) are promising methods for classification and regression analysis because of their solid mathematical foundations, which confer two desirable properties: margin maximization and nonlinear classification using kernels. Despite these prominent properties, however, SVMs are usually not chosen for large-scale data mining problems because their training complexity is highly dependent on the data set size. Unlike traditional pattern recognition and machine learning, real-world data mining applications often involve huge numbers of data records, so it is too expensive to perform multiple scans over the entire data set, and infeasible to hold the data set in memory. This paper presents a method, Clustering-Based SVM (CB-SVM), that maximizes SVM performance for very large data sets given a limited amount of resources, e.g., memory. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide the SVM with high-quality samples. These samples carry statistical summaries of the data and maximize the benefit of learning. Our analyses show that the training complexity of CB-SVM is quadratically dependent on the number of support vectors, which is usually much smaller than the size of the entire data set. Our experiments on synthetic and real-world data sets show that CB-SVM is highly scalable for very large data sets and very accurate in terms of classification.
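The core idea — train the SVM on compact cluster summaries produced by a single clustering pass rather than on every record — can be illustrated with a toy sketch. This is a hypothetical stand-in, not the paper's algorithm: scikit-learn's KMeans replaces the hierarchical micro-clustering, and plain cluster centroids replace the richer statistical summaries CB-SVM maintains.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: two well-separated Gaussian classes with 5,000 points each.
X_pos = rng.normal(loc=2.0, scale=1.0, size=(5000, 2))
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(5000, 2))

def summarize(X, k=50):
    """Summarize a class by k cluster centroids (stand-in for micro-clusters)."""
    return KMeans(n_clusters=k, n_init=5, random_state=0).fit(X).cluster_centers_

# Train the SVM on 100 centroids instead of 10,000 raw points.
X_train = np.vstack([summarize(X_pos), summarize(X_neg)])
y_train = np.array([1] * 50 + [-1] * 50)
clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

# The classifier trained only on summaries still separates the full data well.
X_all = np.vstack([X_pos, X_neg])
y_all = np.array([1] * 5000 + [-1] * 5000)
acc = clf.score(X_all, y_all)
print(f"trained on {len(X_train)} summaries; accuracy on full data: {acc:.3f}")
```

CB-SVM itself goes further than this flat sketch: the cluster hierarchy lets it replace coarse summaries with finer ones where they matter for the decision boundary, so accuracy approaches full-data training at a fraction of the cost.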


Footnotes
1
http://www.csie.ntu.edu.tw/~cjlin/libsvm
 
2
See http://www.cs.wisc.edu/dmi/asvm/ for the ASVM implementation.
 
3
We ran SEL with δ = 5 (starting from one positive and one negative sample and adding five samples at each round), which gave fairly good results among the values we tried. δ is commonly set below ten: if δ is too high, performance converges more slowly, so a larger amount of training data is needed to reach the same accuracy; if δ is too low, SEL may need too many iterations (Schohn and Cohn, 2000; Tong and Koller, 2000).
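The sampling loop this footnote describes can be sketched as follows. This is a hypothetical minimal version, not the implementation used in the experiments: scikit-learn's SVC, the linear kernel, and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def sel_loop(X, y, delta=5, rounds=20, seed=0):
    """Selective sampling (SEL) sketch: start from one positive and one
    negative sample, then each round add the delta unlabeled points that
    lie closest to the current decision boundary."""
    rng = np.random.default_rng(seed)
    chosen = {int(rng.choice(np.flatnonzero(y == 1))),
              int(rng.choice(np.flatnonzero(y == -1)))}
    for _ in range(rounds):
        idx = np.array(sorted(chosen))
        clf = SVC(kernel="linear").fit(X[idx], y[idx])
        rest = np.setdiff1d(np.arange(len(X)), idx)
        if rest.size == 0:
            break
        # |decision_function| is proportional to distance from the hyperplane.
        closest = rest[np.argsort(np.abs(clf.decision_function(X[rest])))[:delta]]
        chosen.update(int(i) for i in closest)
    idx = np.array(sorted(chosen))
    return SVC(kernel="linear").fit(X[idx], y[idx]), idx

# Toy data: two well-separated Gaussian classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (500, 2)), rng.normal(-2, 1, (500, 2))])
y = np.array([1] * 500 + [-1] * 500)

clf, used = sel_loop(X, y, delta=5, rounds=20)
print(len(used), round(clf.score(X, y), 3))
```

With δ = 5 and 20 rounds, the final model is trained on only 2 + 20·5 = 102 of the 1,000 points, which illustrates the trade-off the footnote describes between δ and the number of iterations.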
 
4
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
 
5
In machine learning theory, the model complexity or power of a classification function is often measured by its VC-dimension. SVMs with Gaussian kernels have infinite VC-dimension (Burges, 1998), meaning that they are able to classify arbitrary partitionings of a data set.
 
6
http://ftp.ics.uci.edu/pub/machine-learning-databases/covtype/
 
References
Agarwal, D.K. 2002. Shrinkage estimator generalizations of proximal support vector machines. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'02), pp. 173–182.
Burges, C.J.C. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167.
Cauwenberghs, G. and Poggio, T. 2000. Incremental and decremental support vector machine learning. In Proc. Advances in Neural Information Processing Systems (NIPS'00), pp. 409–415.
Chang, C.-C. and Lin, C.-J. 2001. Training nu-support vector classifiers: Theory and algorithms. Neural Computation, 13:2119–2147.
Collobert, R. and Bengio, S. 2001. SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143–160.
Devroye, L., Gyorfi, L., and Lugosi, G. (Eds.) 1996. A Probabilistic Theory of Pattern Recognition. Springer-Verlag.
Domingos, P. and Hulten, G. 2000. Mining high-speed data streams. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'00).
Fung, G. and Mangasarian, O.L. 2001. Proximal support vector machine classifiers. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'01), pp. 77–86.
Ganti, V., Ramakrishnan, R., and Gehrke, J. 1999. Clustering large datasets in arbitrary metric spaces. In Proc. Int. Conf. Data Engineering (ICDE'98).
Greiner, R., Grove, A.J., and Roth, D. 1996. Learning active classifiers. In Proc. Int. Conf. Machine Learning (ICML'96), pp. 207–215.
Guha, S., Rastogi, R., and Shim, K. 1998. CURE: An efficient clustering algorithm for large databases. In Proc. ACM SIGMOD Int. Conf. Management of Data (SIGMOD'98), pp. 73–84.
Joachims, T. 1998a. Text categorization with support vector machines. In Proc. European Conf. Machine Learning (ECML'98), pp. 137–142.
Joachims, T. 1998b. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Machines, B. Scholkopf, C. Burges, and A.J. Smola (Eds.). Cambridge, MA: MIT Press.
Karypis, G., Han, E.-H., and Kumar, V. 1999. Chameleon: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75.
Kivinen, J., Smola, A.J., and Williamson, R.C. 2001. Online learning with kernels. In Proc. Advances in Neural Information Processing Systems (NIPS'01), pp. 785–792.
Lee, Y.-J. and Mangasarian, O.L. 2001. RSVM: Reduced support vector machines. In Proc. SIAM Int. Conf. Data Mining.
Mangasarian, O.L. and Musicant, D.R. 2000. Active support vector machine classification. Tech. Rep., Computer Sciences Department, University of Wisconsin at Madison.
Platt, J. 1998. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Machines, B. Scholkopf, C. Burges, and A.J. Smola (Eds.). Cambridge, MA: MIT Press.
Scheffer, T. and Wrobel, S. 2002. Finding the most interesting patterns in a database quickly by using sequential sampling. Journal of Machine Learning Research.
Schohn, G. and Cohn, D. 2000. Less is more: Active learning with support vector machines. In Proc. Int. Conf. Machine Learning (ICML'00), pp. 839–846.
Scholkopf, B., Williamson, R.C., Smola, A.J., and Shawe-Taylor, J. 2000. SV estimation of a distribution's support. In Proc. Advances in Neural Information Processing Systems (NIPS'00), pp. 582–588.
Shih, L., Chang, Y.-H., Rennie, J., and Karger, D. 2002. Not too hot, not too cold: The bundled-SVM is just right! In Proc. Workshop on Text Learning at the Int. Conf. on Machine Learning.
Smola, A.J. and Scholkopf, B. 1998. A tutorial on support vector regression. NeuroCOLT2 Technical Report NC2-TR-1998-030.
Syed, N., Liu, H., and Sung, K. 1999. Incremental learning with support vector machines. In Proc. Workshop on Support Vector Machines at the Int. Joint Conf. on Artificial Intelligence (IJCAI'99).
Tong, S. and Koller, D. 2000. Support vector machine active learning with applications to text classification. In Proc. Int. Conf. Machine Learning (ICML'00), pp. 999–1006.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley and Sons.
Wang, W., Yang, J., and Muntz, R.R. 1997. STING: A statistical information grid approach to spatial data mining. In Proc. Int. Conf. Very Large Databases (VLDB'97), pp. 186–195.
Watanabe, O., Balcázar, J.L., and Dai, Y. 2001. A random sampling technique for training support vector machines. In Proc. Int. Conf. Data Mining (ICDM'01), pp. 43–50.
Yu, H., Han, J., and Chang, K.C. 2002. PEBL: Positive-example based learning for Web page classification using SVM. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'02), pp. 239–248.
Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: An efficient data clustering method for very large databases. In Proc. ACM SIGMOD Int. Conf. Management of Data (SIGMOD'96), pp. 103–114.
Metadata
Title
Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing
Authors
Hwanjo Yu
Jiong Yang
Jiawei Han
Xiaolei Li
Publication date
01.11.2005
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 3/2005
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-005-0005-7
