ABSTRACT
Support vector machines (SVMs) are promising methods for classification and regression analysis because of their solid mathematical foundations, which convey several salient properties that other methods hardly provide. Despite these properties, however, SVMs are not as favored for large-scale data mining as for pattern recognition or machine learning, because their training complexity is highly dependent on the size of the data set. Many real-world data mining applications involve millions or billions of data records, where even multiple scans of the entire data are too expensive to perform. This paper presents a new method, Clustering-Based SVM (CB-SVM), which is specifically designed for handling very large data sets. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once and provides the SVM with high-quality samples carrying statistical summaries of the data, chosen so that the summaries maximize the benefit of learning the SVM. CB-SVM aims to generate the best SVM boundary for very large data sets given a limited amount of resources. Our experiments on synthetic and real data sets show that CB-SVM is highly scalable for very large data sets while also achieving high classification accuracy.
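The core idea sketched in the abstract can be illustrated with a much-simplified toy version: summarize each class with cluster statistics (here plain k-means centroids weighted by cluster size, rather than the paper's hierarchical micro-clusters) and train the SVM on those summaries instead of on all raw points. This is a minimal sketch of the general clustering-then-SVM strategy, not the authors' CB-SVM algorithm; the function name `clustered_svm`, the choice of k-means, and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def clustered_svm(X, y, clusters_per_class=20, seed=0):
    """Train an SVM on per-class cluster centroids instead of raw points.

    Each centroid is weighted by its cluster size, so dense regions
    of the data still dominate the decision boundary even though the
    SVM only ever sees a few hundred summary points.
    """
    centroids, labels, weights = [], [], []
    for cls in np.unique(y):
        Xc = X[y == cls]
        k = min(clusters_per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
        counts = np.bincount(km.labels_, minlength=k)  # cluster sizes
        centroids.append(km.cluster_centers_)
        labels.append(np.full(k, cls))
        weights.append(counts)
    svm = SVC(kernel="rbf", gamma="scale")
    svm.fit(np.vstack(centroids), np.concatenate(labels),
            sample_weight=np.concatenate(weights).astype(float))
    return svm

# Toy demonstration: 10,000 points, but the SVM trains on only 40 centroids.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (5000, 2)),
               rng.normal(2.0, 1.0, (5000, 2))])
y = np.repeat([0, 1], 5000)
model = clustered_svm(X, y)
```

Unlike CB-SVM, this sketch clusters each class only once; the paper's method additionally "declusters" entries near the candidate boundary into finer micro-clusters so that the region that actually determines the support vectors is represented at higher resolution.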
Index Terms
- Classifying large data sets using SVMs with hierarchical clusters