ABSTRACT
Source code archives abound on the World Wide Web, and they are usually organized by application category and programming language. Organizing these repositories by hand is not a trivial task, because they are very large (on the order of terabytes) and grow rapidly. We demonstrate machine learning methods for automatically classifying archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and SourceForge archives. Support vector machine (SVM) classifiers are trained on example programs written in a given programming language or belonging to a specified category. We show that source code can be classified accurately and automatically into topical categories, and that the programming language in which it is written can be identified.
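As a rough illustration of the idea of training one SVM per programming language, the following minimal Python sketch fits one-vs-rest linear SVMs on bag-of-word features by sub-gradient descent on the regularized hinge loss. It is not the authors' implementation: the three-language toy corpus, the word-only tokenizer, the absence of a bias term, and all function names and hyperparameters are assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: only alphabetic words (keywords, identifiers) are used as features.
WORD = re.compile(r"[A-Za-z_]+")

# Hypothetical toy corpus; the paper covers ten languages and real archives.
CORPUS = [
    ('#include <stdio.h>\nint main(void) { printf("hi\\n"); return 0; }', "c"),
    ("#include <stdlib.h>\nvoid swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }", "c"),
    ("def add(a, b):\n    return a + b", "python"),
    ("import sys\nfor line in sys.stdin:\n    print(line.strip())", "python"),
    ('public class Main { public static void main(String[] args) { System.out.println("hi"); } }', "java"),
    ("import java.util.List;\npublic class Box { private int size; public int getSize() { return size; } }", "java"),
]


def featurize(source):
    """Map source text to a bag-of-words count vector."""
    return Counter(WORD.findall(source))


def score(weights, x):
    """Linear decision value w . x (no bias term, for brevity)."""
    return sum(weights.get(tok, 0.0) * cnt for tok, cnt in x.items())


def train_linear_svm(samples, labels, epochs=200, lr=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss; labels are +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * score(weights, x)
            for tok in weights:               # shrink step from the L2 penalty
                weights[tok] *= 1.0 - lr * lam
            if margin < 1.0:                  # hinge loss is active for this sample
                for tok, cnt in x.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * y * cnt
    return weights


def train_one_vs_rest(corpus):
    """Train one binary SVM per language: that language against all others."""
    feats = [featurize(src) for src, _ in corpus]
    languages = sorted({lang for _, lang in corpus})
    return {
        lang: train_linear_svm(feats, [1 if l == lang else -1 for _, l in corpus])
        for lang in languages
    }


def classify(models, source):
    """Return the language whose SVM gives the highest decision value."""
    x = featurize(source)
    return max(models, key=lambda lang: score(models[lang], x))


models = train_one_vs_rest(CORPUS)
print(classify(models, "def greet(name):\n    print(name)"))  # prints "python"
```

The same one-vs-rest scheme would apply to topical classification by replacing the language labels with application categories. A real system would need far more training data and richer features than this sketch uses.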
Index Terms
- What's the code?: automatic classification of source code archives