
What's the code?: automatic classification of source code archives

Authors:
Secil Ugurel, The Pennsylvania State University, University Park, PA
Robert Krovetz, NEC Research Institute, Princeton, NJ
C. Lee Giles, The Pennsylvania State University, University Park, PA

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 2002, Pages 632–638, https://doi.org/10.1145/775047.775141

Published: 23 July 2002

ABSTRACT

There are various source code archives on the World Wide Web. These archives are usually organized by application categories and programming languages. However, manually organizing source code repositories is not a trivial task since they grow rapidly and are very large (on the order of terabytes). We demonstrate machine learning methods for automatic classification of archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and the Sourceforge archives. Support vector machine (SVM) classifiers are trained on examples of a given programming language or programs in a specified category. We show that source code can be accurately and automatically classified into topical categories and can be identified to be in a specific programming language class.
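
The idea of training an SVM to recognize the programming language of a snippet can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the word-only tokenizer, the tiny training set, the hyperparameters, and the helper names (tokenize, featurize, train_linear_svm, predict) are all assumptions made for illustration, and the trainer is a plain sub-gradient descent on the hinge loss rather than a full SVM solver. It also handles only two classes, whereas the paper covers ten languages and eleven topics.

```python
import re
from collections import Counter

def tokenize(source):
    """Return identifier and keyword tokens of a snippet (punctuation is ignored)."""
    return re.findall(r"[A-Za-z_]\w*", source)

def featurize(source):
    """Map a snippet to a sparse bag-of-tokens vector: token -> count."""
    return Counter(tokenize(source))

def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(tok, 0.0) * val for tok, val in features.items())

def train_linear_svm(examples, epochs=20, lr=0.1, lam=0.01):
    """Minimize the L2-regularized hinge loss by sub-gradient descent.

    examples: list of (features, label) pairs with label in {+1, -1}.
    Returns the weight dict of a linear decision function (no bias term).
    """
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            margin = label * dot(weights, features)
            # Regularization step: shrink every weight toward zero.
            for tok in weights:
                weights[tok] *= 1.0 - lr * lam
            # Hinge-loss step: only examples inside the margin move the weights.
            if margin < 1.0:
                for tok, val in features.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * label * val
    return weights

def predict(weights, source, positive="C", negative="Python"):
    """Classify a snippet by the sign of the linear decision function."""
    return positive if dot(weights, featurize(source)) > 0 else negative

# Hypothetical toy training set: +1 marks C snippets, -1 marks Python snippets.
training_snippets = [
    ('#include <stdio.h>\nint main(void) { printf("hi"); return 0; }', +1),
    ("def add(a, b):\n    return a + b", -1),
    ("int add(int a, int b) { return a + b; }", +1),
    ("import sys\nfor line in sys.stdin:\n    print(line)", -1),
]
examples = [(featurize(src), label) for src, label in training_snippets]
weights = train_linear_svm(examples)

print(predict(weights, "void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }"))
print(predict(weights, 'import os\nfor name in os.listdir("."):\n    print(name)'))
```

On this toy data the two held-out snippets come out as C and Python respectively, because each shares tokens only with training examples of its own class. A realistic setup would need far more training data, richer features, and a multiclass scheme on top of the binary classifier.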

Published in
KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
July 2002
719 pages
ISBN: 158113567X
DOI: 10.1145/775047
Conference Chair: Osmar R. Zaïane, University of Alberta, Canada
General Chair: Randy Goebel, University of Alberta, Canada
Program Chairs: David Hand, Imperial College, UK; Daniel Keim, AT&T; Raymond Ng, University of British Columbia, Canada
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher: Association for Computing Machinery, New York, NY, United States
Acceptance Rates
KDD '02 paper acceptance rate: 44 of 307 submissions (14%)
Overall acceptance rate: 1,133 of 8,635 submissions (13%)