Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities

https://doi.org/10.1016/j.sysarc.2010.06.003

Abstract

Software security failures are common and the problem is growing. A vulnerability is a weakness in the software that, when exploited, causes a security failure. It is difficult to detect vulnerabilities until they manifest themselves as security failures in the operational stage of software, because security concerns are often not addressed or known sufficiently early during the software development life cycle. Numerous studies have shown that complexity, coupling, and cohesion (CCC) related structural metrics are important indicators of the quality of software architecture, and software architecture is one of the most important and early design decisions that influences the final quality of the software system. Although these metrics have been successfully employed to indicate software faults in general, there are no systematic guidelines on how to use these metrics to predict vulnerabilities in software. If CCC metrics can be used to indicate vulnerabilities, these metrics could aid in the conception of more secure architecture, leading to more secure design and code and eventually better software. In this paper, we present a framework to automatically predict vulnerabilities based on CCC metrics. To empirically validate the framework and prediction accuracy, we conduct a large empirical study on fifty-two releases of Mozilla Firefox developed over a period of four years. To build vulnerability predictors, we consider four alternative data mining and statistical techniques – C4.5 Decision Tree, Random Forests, Logistic Regression, and Naïve-Bayes – and compare their prediction performances. We are able to correctly predict the majority of the vulnerability-prone files in Mozilla Firefox, with tolerable false positive rates. Moreover, the predictors built from the past releases can reliably predict the likelihood of having vulnerabilities in the future releases. The experimental results indicate that structural information from the non-security realm, such as complexity, coupling, and cohesion, is useful in vulnerability prediction.

Introduction

An increasing number of critical processes in the modern world are supported by software systems. Think of the current prevalence of air-traffic control and online banking. When combined with the growing dependence of valuable assets (including human health and wealth, or even human lives) on the security and dependability of computer support for these processes, we see that secure software is a core requirement of the modern world. Unfortunately, the number of software security failures is escalating. A security failure is a violation of or deviation from the security policy, and a security policy is “a statement of what is, and what is not, allowed as far as security is concerned” [1]. WhiteHat Security Inc. found that nine out of ten websites had at least one security failure when they conducted a security assessment of over 600 public-facing and pre-production websites between January 1, 2006 and February 22, 2008 [2]. The number of security-related software failures reported to the Computer Emergency Response Team Coordination Center (CERT/CC) has increased fivefold over the past seven years [3].

Security failures in a software system are the mishaps we wish to avoid, but they could not occur without the presence of vulnerabilities in the underlying software. “A vulnerability is an instance of a fault in the specification, development, or configuration of software such that its execution can violate an implicit or explicit security policy” [31]. A fault is an accidental condition that, when executed, may cause a functional unit to fail to perform its required or expected function [18]. We use the term ‘fault’ to denote any software fault or defect, and reserve vulnerability for those exploitable faults which might lead to a security failure.

Vulnerabilities are generally introduced during the development of software. However, it is difficult to detect vulnerabilities until they manifest themselves as security failures in the operational stage of the software, because security concerns are not always addressed or known sufficiently early during the Software Development Life Cycle (SDLC). Therefore, it would be very useful to know the characteristics of software artifacts that can indicate post-release vulnerabilities – vulnerabilities that are uncovered by at least one security failure during the operational phase of the software. Such indications can help software managers and developers take proactive action against potential vulnerabilities. For our work, we use the term ‘vulnerability’ to denote post-release vulnerabilities only.

Software metrics are often used to assess the ability of software to achieve a predefined goal [4]. A software metric is a measure of some property of a piece of software. Complexity, coupling, and cohesion (CCC) can be measured during various software development phases and are used to evaluate the quality of software [21]. The term software complexity is often applied to the interaction between a program and a programmer working on some programming task [69]. In this context, complexity measures typically depend on program size and control structure, among many other factors. High complexity hinders program comprehension [69]. Coupling refers to the level of interconnection and dependency among software entities. Entities are said to be highly coupled when they depend on each other to such an extent that a change in one necessitates changes in the others that depend upon it. Moreover, highly coupled entities are difficult to understand in isolation and difficult to reuse, because the dependent entities must be included. Cohesion refers to the degree to which a particular entity provides a single functionality to the software system as a whole [21]. Highly cohesive entities, which have only one responsibility, are more desirable than weakly cohesive entities, which perform many operations and are therefore likely to be less maintainable and reusable.
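To make the three notions concrete, the following is a minimal sketch of one possible measure for each. It is illustrative only and is not the metric suite used in this study, which is described in Section 3. The function names, the choice of decision nodes, and the sample input are assumptions of the sketch, and Python source is analyzed purely for convenience. Complexity is approximated in the style of McCabe's measure (one plus the number of decision points), coupling by a simple fan-out count of imported modules, and cohesion by the Chidamber and Kemerer lack-of-cohesion count (LCOM).

```python
import ast
from itertools import combinations

# Node types counted as decision points (an assumption of this sketch).
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISION_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # Each additional operand of and/or adds one more branch.
            decisions += len(node.values) - 1
    return 1 + decisions


def fan_out(source: str) -> int:
    """Coupling proxy: number of distinct modules the source imports."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return len(modules)


def lcom(method_attrs: dict) -> int:
    """Lack of cohesion: method pairs sharing no attribute (P) minus
    pairs sharing at least one (Q), floored at zero."""
    p = q = 0
    for a, b in combinations(method_attrs.values(), 2):
        if set(a) & set(b):
            q += 1
        else:
            p += 1
    return max(p - q, 0)


# Hypothetical input used only to exercise the functions.
SAMPLE = '''
import os
from json import loads

def check(path, strict):
    if os.path.exists(path) and strict:
        for line in open(path):
            if line.strip():
                return loads(line)
    return None
'''

complexity = cyclomatic_complexity(SAMPLE)   # two ifs, one and, one for
coupling = fan_out(SAMPLE)                   # os and json
cohesion_gap = lcom({"open": {"fd"}, "close": {"fd"}, "log": {"buf"}})
```

A higher complexity or fan-out value, or a higher LCOM value (that is, lower cohesion), marks an entity as harder to understand and maintain in the sense described above.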

Complexity, coupling, and cohesion-related structural measurements pertain to software architecture because “the software architecture of a system is the structure or structures of the system, which comprises software elements, the externally visible properties of those elements, and the relationships among them” [71]. These metrics provide complementary solutions that are potentially useful for architecture evaluation [73], leading to more secure software design and code, and eventually more secure and dependable software.

Numerous studies [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [21], [69] show that high complexity and coupling and low cohesion make understanding, developing, testing, and maintaining software difficult, and, as a side effect, may introduce faults in software systems. Our intuition is that these may also lead to the introduction of vulnerabilities – weaknesses that can be exploited by malicious users to compromise a software system. In fact, in one of our previous studies, we have shown that high coupling is likely to increase damage propagation when a system gets compromised [35].

Although CCC metrics have been successfully employed to indicate faults in general [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], the efficacy of these metrics to indicate vulnerabilities has not yet been extensively investigated. Very few works associate complexity and coupling with vulnerabilities. Shin and Williams [31], [32], [33] investigate how vulnerabilities can be inferred from (only) code complexity. A study by Traore et al. [30] uses the notion of “service coupling”, a measurement specific to service-oriented architecture. The effect of cohesion on vulnerabilities has never been studied before.

In this work, we explore how the likelihood of having vulnerabilities is affected by all three aforementioned aspects – complexity, coupling, and cohesion. This study combines some standard, traditional CCC metrics with CCC metrics for object-oriented architecture. Our objective is to investigate whether structural information from the non-security realm, such as complexity, coupling, and cohesion metrics, can be helpful in automatically predicting vulnerabilities in software.

The principal contributions of this research can be summarized as follows. First, a systematic framework to automatically predict vulnerability-prone entities from CCC metrics is proposed. Second, statistical and machine learning techniques are used to build the vulnerability predictors. In doing so, we compare the prediction performances of four alternative techniques, namely C4.5 Decision Tree, Random Forests, Logistic Regression and Naïve-Bayes. Among these, C4.5 Decision Tree, Random Forests, and Naïve-Bayes have not been applied in any kind of vulnerability prediction before. Third, an extensive empirical study is conducted on fifty-two releases of Mozilla Firefox [39] to validate the usefulness of CCC metrics in vulnerability prediction. In doing so, we provide a tool to automatically map vulnerabilities to entities by extracting information from software repositories such as security advisories, bug databases, and concurrent version systems.
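The mapping step mentioned above can be pictured with a minimal sketch. It is not the tool provided by this work. It assumes, hypothetically, that each security advisory lists the bug identifiers it fixes and that commit log messages in the version control system mention those identifiers as "bug NNNN". The advisory identifiers, bug numbers, file names, and the function map_vulnerabilities_to_files are all invented for illustration.

```python
import re
from collections import defaultdict

# Pattern for bug references in commit messages (an assumed convention).
BUG_REF = re.compile(r"\bbug\s*#?\s*(\d+)", re.IGNORECASE)


def map_vulnerabilities_to_files(advisories, commits):
    """Link advisories to files via the bug ids named in commit messages.

    advisories: {advisory_id: [bug_id, ...]}
    commits:    [(log_message, [changed_file, ...]), ...]
    Returns {file: set of advisory ids whose fixes touched that file}.
    """
    bug_to_advisories = defaultdict(set)
    for advisory_id, bug_ids in advisories.items():
        for bug_id in bug_ids:
            bug_to_advisories[bug_id].add(advisory_id)

    file_to_advisories = defaultdict(set)
    for message, files in commits:
        for match in BUG_REF.finditer(message):
            bug_id = int(match.group(1))
            # Bugs that no advisory mentions are ordinary faults; skip them.
            for advisory_id in bug_to_advisories.get(bug_id, ()):
                for path in files:
                    file_to_advisories[path].add(advisory_id)
    return dict(file_to_advisories)


# Hypothetical repository data.
advisories = {"ADV-1": [1001], "ADV-2": [1002, 1003]}
commits = [
    ("Fix bug 1001: add bounds check", ["a.cpp"]),
    ("Bug #1003 and bug 1002, follow-up", ["a.cpp", "b.cpp"]),
    ("Bug 2000: rename helper", ["c.cpp"]),
]
vulnerable_files = map_vulnerabilities_to_files(advisories, commits)
```

In this sketch a file counts as vulnerable if at least one advisory-linked fix touched it; files that only appear in commits for non-security bugs, such as c.cpp here, are left out of the result.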

The major implications of this research are as follows. First, automatic predictions of vulnerabilities will assist software practitioners in taking preventive actions against potential vulnerabilities during the early stages of the software lifecycle. Therefore, there will be a shift from a reactive to a proactive approach to dealing with vulnerabilities. Another implication of this research is that techniques to automatically predict fault-prone entities from CCC metrics can be adopted or leveraged to automatically predict vulnerability-prone entities as well, which has not been systematically done as of now. However, the results might not necessarily be the same as for software fault prediction. Although vulnerabilities can be viewed as exploitable faults in software, there is a need to specifically investigate the efficacy of predicting vulnerabilities from CCC metrics. Research has shown that vulnerable entities have distinctive characteristics from faulty-but-non-vulnerable entities in terms of code characteristics [32], [33], [34]. Moreover, it has been found that prediction of vulnerable functions from all functions provides better results than prediction of vulnerable functions from faulty functions [35]. Finally, it is implied that robust architecture and quality design and code are important for security and dependability. Hence, a relationship between the CCC metrics and vulnerabilities can lead to the conception of more secure software architecture, design, and code, and eventually more secure and dependable software.

The rest of the paper is organized as follows. In Section 2, we present the framework to predict vulnerability using CCC metrics. In Section 3, we provide background on CCC metrics and give brief overviews of the statistical and machine learning techniques used for vulnerability prediction. In Section 4, we discuss in detail how to predict vulnerability-prone entities using the framework. In Section 5, we report the vulnerability prediction results and discuss the implications of the results. Section 6 compares and contrasts the related work on fault and vulnerability prediction. Finally, we conclude the paper, discuss some limitations of our approaches, and outline avenues for future work in Section 7.

Section snippets

Overview of vulnerability-prediction framework

There are two main approaches to software vulnerability prediction. First, count-based techniques focus on predicting the number of vulnerabilities in a software system. Managers can use these predictions to determine if the software is ready for release or if it is likely to have many lurking vulnerabilities. An example of such work is [28]. Second, classification

Background

This section provides background on complexity, coupling, and cohesion (CCC) metrics that are hypothesized to affect vulnerability-proneness. It also furnishes brief overviews of the statistical and machine learning techniques used in this study to predict vulnerabilities.
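As a rough illustration of how one of these techniques turns CCC metrics into a prediction, the sketch below implements a small Naïve-Bayes classifier and applies it in the "learn from a past release, predict the next release" manner. It is not the implementation used in this study. The Gaussian model for each metric, the class name GaussianNaiveBayes, the three-metric feature layout, and all metric values are assumptions made for the example.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per (class, metric) pair."""

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        self.priors = {}
        self.stats = {}
        for label, members in grouped.items():
            self.priors[label] = len(members) / len(rows)
            self.stats[label] = []
            for column in zip(*members):
                mean = sum(column) / len(column)
                # Small constant avoids division by zero for flat columns.
                var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-9
                self.stats[label].append((mean, var))
        return self

    def _log_posterior(self, row, label):
        score = math.log(self.priors[label])
        for x, (mean, var) in zip(row, self.stats[label]):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (x - mean) ** 2 / (2 * var)
        return score

    def predict(self, rows):
        return [
            max(self.priors, key=lambda c: self._log_posterior(row, c))
            for row in rows
        ]


# Hypothetical per-file metrics: (complexity, fan-out, lack of cohesion).
past_release = [(25, 12, 9), (30, 15, 11), (28, 10, 8),
                (5, 2, 1), (7, 3, 0), (4, 1, 2)]
past_labels = [1, 1, 1, 0, 0, 0]          # 1 = had a reported vulnerability
next_release = [(27, 11, 10), (6, 2, 1)]  # files to be screened

model = GaussianNaiveBayes().fit(past_release, past_labels)
predictions = model.predict(next_release)
```

The same fit-then-predict shape would apply to the other three techniques; only the model that is learned from the labelled metrics differs.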

Predicting vulnerabilities

This section describes how to predict vulnerability-prone entities in software as outlined by the framework initially presented in Fig. 1 of Section 2. As an empirical evaluation of the framework, we conduct case studies on Mozilla Firefox to predict its vulnerability-prone files. This section begins by providing an overview of Mozilla Firefox (the source of data for our empirical evaluation). Then, in Section 4.2, we explain the dependent and independent variables of the prediction task at

Results and discussion

This section presents the results of predicting vulnerability-prone files in Mozilla Firefox based on their complexity, coupling, and cohesion (CCC) metrics. These results will help us quantitatively evaluate the usefulness of using CCC metrics for vulnerability prediction.
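For readers unfamiliar with how such results are usually quantified, the sketch below computes a few common measures from actual and predicted labels, including the false positive rate mentioned in the abstract. The exact set of measures reported in this study is not reproduced here, so the selection, the function names, and the label vectors are assumptions of the example.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels, 1 = vulnerability-prone."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif p:
            fp += 1
        elif a:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def prediction_measures(actual, predicted):
    """Common measures derived from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        # Share of vulnerable files that the predictor flags.
        "recall": ratio(tp, tp + fn),
        # Share of clean files that are wrongly flagged.
        "false_positive_rate": ratio(fp, fp + tn),
        # Share of flagged files that are actually vulnerable.
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical labels for eight files.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
measures = prediction_measures(actual, predicted)
```

Because vulnerable files are typically a small minority, recall and false positive rate tend to be more informative than accuracy alone: a predictor that flags nothing can still score a high accuracy.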

Related work

The related research is presented in three parts. First, we describe the research on fault prediction using complexity, coupling, and cohesion metrics [6], [7], [8], [9], [10], [11], [12]. Second, we compare and contrast recent work that predicts vulnerabilities from complexity and coupling metrics [30], [31], [32], [33]. Finally, we describe some studies that use other phenomena (e.g., import patterns or past vulnerabilities) to identify the vulnerable components in a software system [28], [29]

Conclusions

In this work, we investigate the efficacy of applying complexity, coupling, and cohesion metrics to automatically predict vulnerability-prone entities in software systems. We use four alternative statistical and machine learning techniques to build vulnerability predictors that learn from the CCC metrics and vulnerability history. The techniques are C4.5 Decision Trees, Random Forests, Logistic Regression, and Naïve-Bayes. We conduct an extensive empirical study on Mozilla Firefox to

Acknowledgments

This research is partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). The authors wish to thank Stephan Neuhaus of Saarland University, Saarbrücken, Germany for sharing his dataset (as of January 4th, 2007) on vulnerabilities in Mozilla Firefox and for his suggestions on how to obtain an updated dataset. We also thank Yonghee Shin of North Carolina State University, Raleigh, NC, USA for taking the time to answer our queries about her technique of

References (71)

  • G. Koru et al., An empirical comparison and characterization of high defect and high complexity modules, Journal of Systems and Software (2003)
  • M. Cartwright et al., An empirical investigation of an object-oriented software system, IEEE Transactions on Software Engineering (2000)
  • V. Basili et al., A validation of object-oriented design metrics as quality indicators, IEEE Transactions on Software Engineering (1996)
  • N. Nagappan, T. Ball, A. Zeller, Mining metrics to predict component failures, in: Proceedings of the 28th...
  • T. Menzies et al., Data mining static code attributes to learn defect predictors, IEEE Transactions on Software Engineering (2007)
  • H. Zhang, X. Zhang, M. Gu, Predicting defective software components from code complexity measures, in: Proceedings of...
  • W.M. Evanco et al., A composite complexity approach for software defect modelling, Software Quality Journal (1994)
  • N. Fenton et al., A probabilistic model for software defect prediction, IEEE Transactions on Software Engineering (2001)
  • IEEE, IEEE Std. 982.1-1988 IEEE Standard Dictionary of Measures to Produce Reliable Software, The Institute of...
  • S.R. Chidamber et al., A metrics suite for object oriented design, IEEE Transactions on Software Engineering (1994)
  • N.E. Fenton et al., Software Metrics: A Rigorous and Practical Approach (1997)
  • T.J. McCabe, A complexity measure, IEEE Transactions on Software Engineering (1976)
  • G.J. Myers, Composite/Structured Design, Van Nostrand Reinhold Company, New York,...
  • W.A. Harrison et al., A complexity measure based on nesting level, ACM Sigplan Notices (1981)
  • S. Henry et al., Software structure metrics based on information flow, IEEE Transactions on Software Engineering (1981)
  • K. Ayari, P. Meshkinfam, G. Antoniol, M. Di Penta, Threats on Building Models from CVS and Bugzilla Repositories: the...
  • S. Neuhaus, T. Zimmermann, A. Zeller, Predicting vulnerable software components, in: Proceedings of the 14th ACM...
  • M.Y. Liu, I. Traore, Empirical relations between attackability and coupling: a case study on DoS, in: Proceedings of...
  • Y. Shin, L. Williams, Is complexity really the enemy of software security? in: Proceedings of the Fourth ACM Workshop...
  • Y. Shin, L. Williams, An empirical model to predict security vulnerabilities using code complexity metrics, in:...
  • Y. Shin, Exploring complexity metrics as indicators of software vulnerability, in: Proceedings of the Third...
  • M. Gegick, L. Williams, M. Vouk, Predictive models for identifying software components prone to failure during security...
  • I. Chowdhury, B. Chan, M. Zulkernine, Security metrics for source code structures, in: Proceedings of the Fourth...
  • H. Malik, I. Chowdhury, H.M. Tsou, Z. Ziang, A.E. Hassan, Understanding the rationale for updating a function’s...
  • I. Chowdhury, M. Zulkernine, Can complexity, coupling, and cohesion metrics be used as early indicators of...

    Istehad Chowdhury is currently a research intern in Cloakware Inc., Canada. He received his M.Sc. degree from the Department of Electrical and Computer Engineering of Queen’s University, Canada in 2009, where he was a research assistant and a member of Queen’s Reliable Software Technology (QRST) research group. He received his B.Sc. degree in Computer Science from Independent University, Bangladesh in 2005. Before joining Queen’s, he was a lecturer in the Department of Computer Science of Stamford University, Bangladesh. He has been a member of ACM since 2001. His main research interest lies in the area of software engineering with special interest in software reliability and security, software metrics, empirical software engineering, and mining software repositories. More information about his research and publications can be found at http://www.cs.queensu.ca/~istehad.

    Mohammad Zulkernine is a faculty member of the School of Computing of Queen’s University, Canada, where he leads the Queen’s Reliable Software Technology (QRST) research group. He received his B.Sc. in Computer Science and Engineering from Bangladesh University of Engineering and Technology in 1993. Dr. Zulkernine received an M. Eng. in Computer Science and Systems Engineering from Muroran Institute of Technology, Japan in 1998. He received his Ph.D. from the Department of Electrical and Computer Engineering of the University of Waterloo, Canada in 2003, where he belonged to the university’s Bell Canada Software Reliability Laboratory. Dr. Zulkernine’s research focuses on software engineering (software reliability and security), automatic software monitoring and intrusion detection, and methods and tools for reliable and secure software. His research work is funded by a number of provincial and federal research organizations of Canada, and he has an industry research partnership with Bell Canada. He is a senior member of the IEEE and a member of the ACM. Dr. Zulkernine is also cross-appointed in the Department of Electrical and Computer Engineering of Queen’s University, and is a licensed professional engineer of the province of Ontario, Canada.
