DOI: 10.1145/775047.775141
Article

What's the code?: automatic classification of source code archives

Published: 23 July 2002

ABSTRACT

There are various source code archives on the World Wide Web. These archives are usually organized by application categories and programming languages. However, manually organizing source code repositories is not a trivial task since they grow rapidly and are very large (on the order of terabytes). We demonstrate machine learning methods for automatic classification of archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and the Sourceforge archives. Support vector machine (SVM) classifiers are trained on examples of a given programming language or programs in a specified category. We show that source code can be accurately and automatically classified into topical categories and can be identified to be in a specific programming language class.
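The language-identification setup described in the abstract can be sketched as one linear SVM per programming language, trained one-vs-rest. The sketch below is illustrative only and is not the authors' implementation: the bag-of-words tokenization, the bias-free linear model, the subgradient-descent training loop, the hyperparameters, and the tiny `TRAINING` set are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def featurize(source):
    """Map source text to an L2-normalized bag of word-like tokens (assumed features)."""
    counts = Counter(re.findall(r"[A-Za-z_]\w*", source))
    norm = math.sqrt(sum(n * n for n in counts.values())) or 1.0
    return {tok: n / norm for tok, n in counts.items()}


def score(weights, features):
    """Linear decision value w . x over sparse dictionaries."""
    return sum(weights.get(tok, 0.0) * val for tok, val in features.items())


def train_binary_svm(samples, targets, epochs=200, lr=0.1, lam=0.01):
    """Linear SVM without a bias term, fit by subgradient descent on the
    L2-regularized hinge loss. Targets are +1.0 or -1.0."""
    weights = {}
    for _ in range(epochs):
        for features, y in zip(samples, targets):
            violated = y * score(weights, features) < 1.0
            for tok in weights:  # shrink step from the L2 penalty
                weights[tok] *= 1.0 - lr * lam
            if violated:  # hinge-loss step only when the margin is below 1
                for tok, val in features.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * y * val
    return weights


def train_one_vs_rest(labeled_sources):
    """Train one binary SVM per label, treating all other labels as negatives."""
    samples = [featurize(src) for src, _ in labeled_sources]
    labels = sorted({label for _, label in labeled_sources})
    return {
        label: train_binary_svm(
            samples, [1.0 if lab == label else -1.0 for _, lab in labeled_sources]
        )
        for label in labels
    }


def predict(models, source):
    """Return the label whose classifier gives the highest decision value."""
    features = featurize(source)
    return max(models, key=lambda label: score(models[label], features))


# Hypothetical toy training set; a real archive would supply many files per class.
TRAINING = [
    ("#include <stdio.h>\nint main(void) { puts(msg); return 0; }", "c"),
    ("struct node { int value; struct node *next; };\n"
     "size_t n = sizeof(struct node);", "c"),
    ("import sys\ndef greet(name):\n    print(name)\n", "python"),
    ("for item in items:\n    if item is None:\n        continue\n    yield item\n",
     "python"),
    ("public class Hello { public static void main(String[] args) "
     "{ System.out.println(args.length); } }", "java"),
    ("import java.util.List;\n"
     "private final List<String> names = new ArrayList<>();", "java"),
]

MODELS = train_one_vs_rest(TRAINING)
```

Calling `predict(MODELS, "def area(radius):\n    print(radius)\n")` returns the label of the highest-scoring classifier. Topical classification could reuse the same structure with application categories as labels, although richer features than raw tokens would likely be needed.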

Published in

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
July 2002
719 pages
ISBN: 158113567X
DOI: 10.1145/775047

Copyright © 2002 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '02 paper acceptance rate: 44 of 307 submissions, 14%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%
