DOI: 10.1145/956750.956786
Article

Classifying large data sets using SVMs with hierarchical clusters

Published: 24 August 2003

ABSTRACT

Support vector machines (SVMs) are promising methods for classification and regression analysis because of their solid mathematical foundations, which convey several salient properties that other methods can hardly provide. Despite these prominent properties, however, SVMs are not as favored for large-scale data mining as they are for pattern recognition or machine learning, because the training complexity of SVMs is highly dependent on the size of the data set. Many real-world data mining applications involve millions or billions of data records, where even multiple scans of the entire data are too expensive to perform. This paper presents a new method, Clustering-Based SVM (CB-SVM), which is specifically designed for handling very large data sets. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide the SVM with high-quality samples carrying statistical summaries of the data, such that the summaries maximize the benefit of learning the SVM. CB-SVM tries to generate the best SVM boundary for very large data sets given a limited amount of resources. Our experiments on synthetic and real data sets show that CB-SVM is highly scalable for very large data sets while also achieving high classification accuracy.
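The paper's algorithm is not reproduced on this page, but the core idea described in the abstract — compress the data with a single clustering pass, then train the SVM on the cluster summaries rather than on every record — can be sketched as follows. This is a minimal toy version: it uses per-class k-means centroids as a stand-in for the paper's hierarchical micro-clustering (CF-tree) summaries, and all data, names, and parameter values here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic two-class data: 10,000 points total.
X_pos = rng.normal(loc=2.0, scale=1.0, size=(5000, 2))
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(5000, 2))

def summarize(X, k=50):
    """Cluster one class and keep the k centroids as its statistical summary."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.cluster_centers_

# Train the SVM on ~100 centroids instead of the 10,000 raw points.
centers = np.vstack([summarize(X_pos), summarize(X_neg)])
labels = np.array([1] * 50 + [-1] * 50)
svm = SVC(kernel="linear").fit(centers, labels)

# Evaluate the summary-trained boundary on the full data set.
X_all = np.vstack([X_pos, X_neg])
y_all = np.array([1] * 5000 + [-1] * 5000)
print("accuracy on full data:", svm.score(X_all, y_all))
```

Note what the actual paper adds beyond this sketch: the micro-clusters are built hierarchically in one scan, and CB-SVM then selectively de-clusters only the summaries near the decision boundary, refining the SVM where it matters instead of treating all clusters at one fixed granularity.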


Published in

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003, 736 pages
ISBN: 1581137370
DOI: 10.1145/956750

        Copyright © 2003 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


        Acceptance Rates

KDD '03 paper acceptance rate: 46 of 298 submissions (15%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
