ABSTRACT
Support vector machines (SVMs) are promising methods for classification and regression analysis because of their solid mathematical foundations, which convey several salient properties that other methods hardly provide. Despite these properties, however, SVMs are not as favored for large-scale data mining as for pattern recognition or machine learning, because their training complexity is highly dependent on the size of the data set. Many real-world data mining applications involve millions or billions of data records, where even multiple scans of the entire data are too expensive to perform. This paper presents a new method, Clustering-Based SVM (CB-SVM), which is specifically designed for handling very large data sets. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once and provides the SVM with high-quality samples carrying statistical summaries of the data, chosen so that the summaries maximize the benefit of learning the SVM. CB-SVM aims to generate the best SVM boundary for very large data sets given a limited amount of resources. Our experiments on synthetic and real data sets show that CB-SVM is highly scalable for very large data sets while also achieving high classification accuracy.
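The core idea sketched in the abstract can be illustrated with a much-simplified toy version: summarize each class with cluster statistics (here plain k-means centroids weighted by cluster size, rather than the paper's hierarchical micro-clusters) and train the SVM on those summaries instead of on all raw points. This is a minimal sketch of the general clustering-then-SVM strategy, not the authors' CB-SVM algorithm; the function name `clustered_svm`, the choice of k-means, and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def clustered_svm(X, y, clusters_per_class=20, seed=0):
    """Train an SVM on per-class cluster centroids instead of raw points.

    Each centroid is weighted by its cluster size, so dense regions
    of the data still dominate the decision boundary even though the
    SVM only ever sees a few hundred summary points.
    """
    centroids, labels, weights = [], [], []
    for cls in np.unique(y):
        Xc = X[y == cls]
        k = min(clusters_per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
        counts = np.bincount(km.labels_, minlength=k)  # cluster sizes
        centroids.append(km.cluster_centers_)
        labels.append(np.full(k, cls))
        weights.append(counts)
    svm = SVC(kernel="rbf", gamma="scale")
    svm.fit(np.vstack(centroids), np.concatenate(labels),
            sample_weight=np.concatenate(weights).astype(float))
    return svm

# Toy demonstration: 10,000 points, but the SVM trains on only 40 centroids.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (5000, 2)),
               rng.normal(2.0, 1.0, (5000, 2))])
y = np.repeat([0, 1], 5000)
model = clustered_svm(X, y)
```

Unlike CB-SVM, this sketch clusters each class only once; the paper's method additionally "declusters" entries near the candidate boundary into finer micro-clusters so that the region that actually determines the support vectors is represented at higher resolution.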
Index Terms
- Classifying large data sets using SVMs with hierarchical clusters