A large-scale study on the usage of Java’s concurrent programming constructs

https://doi.org/10.1016/j.jss.2015.04.064

Highlights

  • An analysis of 2227 Java projects, comprising more than 650 million lines of code.

  • Seventy-seven percent of the projects create threads or employ a concurrency control mechanism.

  • Concurrent programming constructs are used both frequently and intensively.

  • Adoption of java.util.concurrent is moderate (23% of the concurrent projects use it).

  • Efficient and safe data structures, e.g., ConcurrentHashMap, are not yet widely used.

Abstract

In both academia and industry, there is a strong belief that multicore technology will radically change the way software is built. However, little is known about the current state of use of concurrent programming constructs. In this work, we present an empirical study of the usage of concurrent programming constructs in 2227 real-world, stable, and mature Java projects from SourceForge. We studied the usage of concurrent techniques in the most recent versions of these applications and how that usage has evolved over time. The main findings of our study are: (I) More than 75% of the latest versions of the projects either explicitly create threads or employ some concurrency control mechanism. (II) More than half of these projects exhibit at least 47 synchronized methods and 3 implementations of the Runnable interface per 100,000 LoC, which means that concurrent programming constructs are not only used often but also employed intensively. (III) The adoption of the java.util.concurrent library is only moderate (approximately 23% of the concurrent projects employ it). (IV) Efficient and thread-safe data structures, such as ConcurrentHashMap, are not yet widely used, despite the numerous advantages they offer.

Introduction

Multicore systems offer the potential for cheap, scalable, high-performance computing and also for significant reductions in power consumption. To achieve this potential, it is essential to take advantage of new heterogeneous architectures comprising collections of multiple processing elements. To leverage multicore technology, applications must be concurrent, which poses a challenge, since it is well known that concurrent programming is hard (Sutter, 2005). A number of programming languages provide constructs for concurrent programming. These solutions vary greatly in terms of abstraction, error-proneness, and performance. The Java programming language is particularly rich when it comes to concurrent programming constructs. For example, it includes the concept of a monitor, a low-level mechanism supporting both mutual exclusion and condition-based synchronization, as well as a high-level library (Lea, 2005), java.util.concurrent, also known as j.u.c., introduced in version 1.5 of the language.

In both academia and industry, there is a strong belief that multicore technology will radically change the way software is built. However, to the best of our knowledge, there is a lack of reliable information about the current state of the practice of the development of concurrent software in terms of the constructs that developers employ. In this work, we aim to partially fill this gap.

Specifically, we present an empirical study aimed at establishing the current state of the practical usage of concurrent programming constructs in Java applications. We have analyzed 2227 stable and mature Java projects comprising more than 600 million lines of code (LoC, counted without blank lines and comments) from SourceForge, one of the most popular open source code repositories. Our analysis encompasses several versions of these applications and is based on more than 50 source code metrics that we have automatically collected. We have also studied correlations among some of these metrics in an attempt to find trends in the use of concurrent programming constructs. We have chosen Java because it is a widely used object-oriented programming language. Moreover, as noted above, it includes support for multithreading with both low-level and high-level mechanisms. Additionally, it is the language with the highest number of projects in SourceForge.

Evidence on how concurrent programs are written can raise developer awareness about available mechanisms. It can also indicate how well-accepted some of these mechanisms are in practice. Moreover, it can inform researchers designing new mechanisms about the kinds of constructs that developers may be more willing to use. Tool vendors can also benefit by supporting developers in the use of lesser-known, more efficient mechanisms, for example, by implementing novel refactorings (Dig et al., 2009; Ishizaki et al., 2011; Schäfer et al., 2011a). Furthermore, results such as those uncovered by this study can help lecturers convince students of the importance of concurrent programming, not only for the future of software development, but also for the present.

Mining data from the SourceForge repository poses several challenges. Some of them are inherent to the process of obtaining reliable data. These derive mainly from two factors: scale and lack of a standard organization for source code repositories. Others pertain to transforming the data into useful information. Grechanik et al. (2010) discussed a few challenges that make it difficult to obtain evidence from source code. For example, getting the source code of all software versions is difficult because there is no naming pattern to define if a compressed file contains source code, binary code or something else. Furthermore, it is difficult to be sure that an error has not occurred during measurement, due to the number of projects and project versions. We address these challenges by creating an infrastructure for obtaining and processing large code bases, specifically targeting SourceForge. In addition, we have conducted a survey with the committers of some of these projects as an attempt to verify whether their beliefs are supported by our data.

Based on the data we have obtained, we propose to answer a number of research questions (RQ).

We found that more than 75% of the most recent versions of the examined projects include some form of concurrent programming, e.g., at least one occurrence of the synchronized keyword. In medium projects (20,001–100,000 LoC) this percentage grows to more than 90% and reaches 100% for large projects (over 100,000 LoC). In addition, the mean numbers (per 100,000 LoC) of synchronized methods, classes extending Thread, and classes implementing Runnable are, respectively, 66.75, 13, and 13.85. These results indicate that projects often use concurrent programming constructs and a considerable number do so intensively. On the other hand, perhaps counterintuitively, the overall percentage of concurrent projects has not seen significant change throughout the years, despite the pervasiveness of multicore machines.
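The per-100,000-LoC figures above are simple normalized densities. The following is a minimal sketch of that arithmetic, not the study's actual tooling: the class and method names are hypothetical, the project used in main is made up, and the "small" label for projects of 20,000 LoC or fewer is an assumption (the text above only names the medium and large ranges).

```java
// Illustrative sketch only; the paper does not show how its tooling computes these values.
class ConstructDensity {

    /** Occurrences of a construct normalized to 100,000 lines of code. */
    static double per100kLoc(long occurrences, long linesOfCode) {
        if (linesOfCode <= 0) {
            throw new IllegalArgumentException("linesOfCode must be positive");
        }
        return occurrences * 100_000.0 / linesOfCode;
    }

    /**
     * Size classes from the text: medium is 20,001 to 100,000 LoC, large is over 100,000.
     * Calling everything below that "small" is an assumption made for this sketch.
     */
    static String sizeClass(long linesOfCode) {
        if (linesOfCode > 100_000) {
            return "large";
        }
        if (linesOfCode > 20_000) {
            return "medium";
        }
        return "small";
    }

    public static void main(String[] args) {
        // A hypothetical 45,000-LoC project with 30 synchronized methods.
        System.out.printf("%s project, %.2f synchronized methods per 100k LoC%n",
                sizeClass(45_000), per100kLoc(30, 45_000));
    }
}
```

Normalizing by size is what makes a 45,000-LoC project comparable with a 500,000-LoC one; raw counts would mostly reflect project size.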

Our data shows that only 23.21% of the analyzed concurrent projects employ classes of the java.util.concurrent library. On the other hand, there has been a growth in the adoption of this library. However, this growth does not in general seem to be related to a decrease in the use of Java’s traditional concurrent programming constructs, with a few exceptions. Furthermore, projects that have been in active development more recently, i.e., had at least one version released since 2009, employ the java.util.concurrent library more intensively than the mean. Therefore, the percentage of active, mature projects that use that library is actually higher than 23.21%.

Most of the projects use synchronized blocks and methods. The volatile modifier, explicit locks (including variations such as read-write locks), and atomic variables are less common, albeit some of them seem to be growing in popularity. We also noticed a tendency of growth in the use of synchronized blocks. In particular, the growth in their use correlates positively with the growth in the use of atomic data types, explicit locks, and the volatile modifier.
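For readers less familiar with the constructs named above, the sketch below places them side by side: a synchronized block, an explicit lock, an atomic variable, and a volatile flag. It is an illustrative example written for this summary, not code from any analyzed project; the class CounterStyles and its members are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical class used only to contrast the constructs discussed in the text.
class CounterStyles {
    private final Object monitor = new Object();
    private int guardedByMonitor;                 // protected by a synchronized block

    private final ReentrantLock lock = new ReentrantLock();
    private int guardedByLock;                    // protected by an explicit lock

    private final AtomicInteger atomic = new AtomicInteger();

    // volatile guarantees visibility of writes; it does not make ++ atomic.
    private volatile boolean stopRequested;

    void incrementSynchronized() {
        synchronized (monitor) {
            guardedByMonitor++;
        }
    }

    void incrementWithLock() {
        lock.lock();
        try {
            guardedByLock++;
        } finally {
            lock.unlock();                        // always release, even on exceptions
        }
    }

    void incrementAtomic() {
        atomic.incrementAndGet();
    }

    void requestStop() { stopRequested = true; }

    boolean isStopRequested() { return stopRequested; }

    /** Runs the three counters from several threads and returns their final values. */
    static int[] runDemo(int threads, int incrementsPerThread) {
        CounterStyles c = new CounterStyles();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    c.incrementSynchronized();
                    c.incrementWithLock();
                    c.incrementAtomic();
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();                         // join makes the workers' writes visible
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] {c.guardedByMonitor, c.guardedByLock, c.atomic.get()};
    }

    public static void main(String[] args) {
        int[] totals = runDemo(4, 1_000);
        System.out.println(totals[0] + " " + totals[1] + " " + totals[2]);
    }
}
```

All three counters reach the same total; the mechanisms differ in how much ceremony they need and in what else they allow (for example, ReentrantLock supports timed and interruptible acquisition, which a synchronized block does not).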

We found that implementing the Runnable interface is the most common approach to defining new threads. Moreover, a considerable number of projects employ Executors to manage thread execution (11.14% of the concurrent projects). We observed that projects that employ executors exhibit a weak tendency to reduce the number of classes that explicitly extend the Thread class.
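The three approaches mentioned above can be sketched as follows. This is a minimal illustration under assumed names (ThreadCreationStyles, CountingTask, CountingThread are hypothetical), not code taken from the studied projects.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example contrasting Runnable, a Thread subclass, and an executor.
class ThreadCreationStyles {

    /** Style 1: the task is a Runnable, independent of how it is executed. */
    static final class CountingTask implements Runnable {
        private final AtomicInteger counter;

        CountingTask(AtomicInteger counter) { this.counter = counter; }

        @Override
        public void run() { counter.incrementAndGet(); }
    }

    /** Style 2: the task is welded to the thread by subclassing Thread. */
    static final class CountingThread extends Thread {
        private final AtomicInteger counter;

        CountingThread(AtomicInteger counter) {
            super("counting-thread");
            this.counter = counter;
        }

        @Override
        public void run() { counter.incrementAndGet(); }
    }

    /** Runs one task per style and returns how many tasks completed. */
    static int runAll() {
        AtomicInteger counter = new AtomicInteger();
        try {
            Thread viaRunnable = new Thread(new CountingTask(counter));
            Thread viaSubclass = new CountingThread(counter);
            viaRunnable.start();
            viaSubclass.start();
            viaRunnable.join();
            viaSubclass.join();

            // Style 3: an executor owns the threads; the caller only submits tasks.
            ExecutorService pool = Executors.newFixedThreadPool(2);
            pool.execute(new CountingTask(counter));
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("pool did not terminate in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("tasks completed: " + runAll());
    }
}
```

Note that the same CountingTask is reused unchanged with the executor, whereas the Thread subclass cannot be handed to a pool as a thread; this may be one reason executor adoption goes along with fewer Thread subclasses.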

We observed that developers are still using mostly Hashtable and HashMap, even though the former is thread-safe but inefficient and the latter is not thread-safe. Notwithstanding, there is a tendency towards the use of ConcurrentHashMap as a replacement for other associative data structures in a number of projects.
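To make the trade-off concrete: Hashtable serializes every call on a single lock, HashMap offers no thread-safety at all, and ConcurrentHashMap allows concurrent updates while still providing atomic compound operations such as merge. The sketch below shows the latter; the class SharedCounts and the key "hits" are hypothetical and chosen only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical example: several threads update one shared counter map.
class SharedCounts {

    static Map<String, Integer> countConcurrently(int threads, int updatesPerThread) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < updatesPerThread; j++) {
                    // merge performs the read-modify-write atomically per key,
                    // so no external synchronized block is needed.
                    counts.merge("hits", 1, Integer::sum);
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(4, 1_000));
    }
}
```

Replacing the ConcurrentHashMap with a plain HashMap in this sketch would make the result unpredictable, and separate get-then-put calls would reintroduce a race even on a thread-safe map, so the atomic merge call matters as much as the map type.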

A large number of concurrent projects include invocations of the notify(), notifyAll(), or wait() methods. At the same time, we noticed that a small number of projects have eliminated many uses of these methods, employing the CountDownLatch class, part of the java.util.concurrent library, instead. This number is not large enough for statistical analysis. Nevertheless, it indicates that mechanisms with simple semantics like CountDownLatch have potential to, in some contexts, replace lower-level, more traditional ones.
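The kind of replacement described above can be sketched as follows. The first half is a hand-written "wait until all workers finish" mechanism using wait() and notifyAll(); the second half expresses the same intent with CountDownLatch. The code is an assumption-laden illustration (WaitForWorkers and ManualLatch are hypothetical names), not a refactoring taken from the analyzed projects.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example: waiting for N workers, first by hand, then with CountDownLatch.
class WaitForWorkers {

    /** Hand-rolled equivalent built on the intrinsic monitor of this object. */
    static final class ManualLatch {
        private int remaining;

        ManualLatch(int count) { this.remaining = count; }

        synchronized void countDown() {
            if (remaining > 0 && --remaining == 0) {
                notifyAll();            // wake every thread blocked in await()
            }
        }

        synchronized void await() throws InterruptedException {
            while (remaining > 0) {     // loop guards against spurious wakeups
                wait();
            }
        }
    }

    static int withManualLatch(int workers) {
        ManualLatch latch = new ManualLatch(workers);
        AtomicInteger finished = new AtomicInteger();
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                finished.incrementAndGet();
                latch.countDown();
            }).start();
        }
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return finished.get();
    }

    static int withCountDownLatch(int workers) {
        CountDownLatch latch = new CountDownLatch(workers);
        AtomicInteger finished = new AtomicInteger();
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                finished.incrementAndGet();
                latch.countDown();
            }).start();
        }
        try {
            latch.await();              // no explicit monitor, loop, or notifyAll needed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return finished.get();
    }

    public static void main(String[] args) {
        System.out.println(withManualLatch(4) + " " + withCountDownLatch(4));
    }
}
```

The library version removes the parts that are easy to get wrong by hand: the guarded loop around wait(), the choice between notify() and notifyAll(), and holding the right monitor.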

Our data indicates that less than 3% of the concurrent projects implement the Thread.UncaughtExceptionHandler interface, which means that, in 97% of the concurrent projects, an exception stemming from a programming error might cause threads to die silently, potentially affecting the behavior of threads that interact with them. Moreover, analyzing these implementations, we discovered that developers often do not know what to do with uncaught exceptions in threads, even when they do implement a handler. This provides some indications that new exception handling mechanisms that explicitly address the needs of concurrent applications are called for.
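As a reminder of what such a handler looks like, here is a minimal sketch. Thread.UncaughtExceptionHandler is a functional interface, so a lambda suffices; the class UncaughtDemo, the thread name, and the choice to merely record the error are assumptions made for illustration, and what a real application should do in the handler is exactly the open question raised above.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical example: recording an exception that escapes a thread's run() method.
class UncaughtDemo {

    /** Starts a thread that fails and returns the exception seen by its handler. */
    static Throwable runFailingThread() {
        AtomicReference<Throwable> seen = new AtomicReference<>();

        Thread worker = new Thread(() -> {
            throw new IllegalStateException("programming error");
        }, "worker-1");

        // The handler runs in the dying thread, before the thread terminates.
        worker.setUncaughtExceptionHandler((thread, error) -> seen.set(error));

        worker.start();
        try {
            worker.join();              // returns only after the handler has run
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println("handler saw: " + runFailingThread());
    }
}
```

Without a handler, other threads receive no notification that the worker has died; with one, the application at least has a single place to log the failure or trigger a recovery action.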

To provide a basic intuition as to what developers believe to be true about the usage of concurrent programming constructs, we have also conducted a survey with more than 160 software developers. These developers are all committers of projects whose source code we have analyzed. This survey presented respondents with various questions, such as “What do you believe to be the most often used concurrent/parallel programming construct of the Java language?”. Throughout the paper, we contrast the results of this survey with data obtained by analyzing the Java source code.

This work makes the following contributions:

  • It is the first large-scale study on the usage of concurrent programming constructs in the Java language, including an analysis of how the usage of these constructs has evolved over time.

  • It presents a considerable amount of data pertaining to the current state of the practice of real concurrent projects and the evolution of these projects over time.

  • It presents results from a survey conducted with committers of some of the analyzed projects. This survey provides an overview of the perception of developers about the use of concurrent programming constructs.

The rest of the paper is organized as follows: Section 2 presents some background on concurrent programming in Java. Section 3 describes our survey setup and some initial results. Next, in Section 4, we describe the infrastructure we employed to download and extract the analyzed data. In Section 5 we present the results of our study organized in terms of the research questions. We then present the threats to the validity of this work in Section 6 and some implications in Section 7. Section 8 is dedicated to related work. Finally, in Section 9, we present our conclusions and discuss future directions.

Background

Before presenting our study, we provide a brief background on concurrent programming. A detailed presentation about concurrent programming concepts is available elsewhere (Tanenbaum, 2008).

Generally speaking, processes and threads are the main abstractions of concurrent programming. A process is a container that keeps all the information needed to run a program, for instance, the memory location where the process can read and write data. A thread, on the other hand, can be seen as a lightweight

Survey

We have conducted a survey with programmers to gather information about how developers perceive the usage of concurrent programming constructs in Java. Using this information, we can check whether the intuition of these developers is reflected in the source code of real systems. The questionnaire was designed according to the recommendations of Kitchenham and Pfleeger (2008), following the phases prescribed by the authors: planning, creating the questionnaire, defining the target

Study setting

This section describes the configuration of our study: our basic assumptions, our mining infrastructure, and the metrics suite that we employed.

We have built a set of tools to download projects from SourceForge, analyze the source code, and collect metrics from these projects. It comprises a crawler, a metrics collection tool, and some auxiliary shell scripts. We call this infrastructure Groundhog. Fig. 1 depicts the infrastructure we employed. Initially, the crawler populates the project

Study results

This section presents the results of our study. We organized the results in terms of the research questions.

Threats to validity

In a study such as this, there are always many limitations and threats to validity. First, to download the source code of the projects, we assumed that the source files were packaged in a file with the keywords “src” or “source” in its name. This is common practice in open source repositories. Nonetheless, it is not a rule and some projects are bound to adopt different naming conventions. We have ignored such projects. Moreover, obtaining the release date of some project versions was not

Study implications

This research has implications for different kinds of stakeholders. Five of these possible groups are discussed below.

Developers: Developers are now facing the problem of developing concurrent applications with more frequency, while keeping cost as low as possible and quality as high as possible. The results of our study provide some assistance to these developers. First, by showing that concurrent programming is already in widespread use and that they cannot ignore it (RQ1). Second, by

Related work

This section discusses related research.

Conclusion

This paper presents an empirical study of a large-scale Java open source repository. We found that developers employ mainly simple mutual exclusion constructs. These constructs are easy to understand (though difficult to reason about) and have been available in Java since its initial version, released more than 15 years ago. Almost 80% of the concurrent projects include at least one synchronized method. Still, less than 25% of the projects employ the abstractions implemented by the

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments. Fernando is partially supported by CNPq/Brazil (304755/2014-1, 487549/2012-0 and 477139/2013-2), FACEPE/Brazil (APQ-0839-1.03/14) and INES (CNPq 573964/2008-4, FACEPE APQ-1037-1.03/08, and FACEPE APQ-0388-1.03/14). Any opinions expressed here are from the authors and do not necessarily reflect the views of the sponsors.

References (44)

  • Lea, D. The java.util.concurrent synchronizer framework. Sci. Comput. Program. (2005)

  • Baxter, G. et al. Understanding the shape of Java software. SIGPLAN Not. (2006)

  • Burckhardt, S. et al. Concurrent programming with revisions and isolation types. Proceedings of OOPSLA’2010, Reno, USA (2010)

  • Christiansen, B.O. et al. Javelin: Internet-based parallel computing using Java. Concurr. Pract. Exp. (1997)

  • Collberg, C.S. et al. An empirical study of Java bytecode programs. Softw. Pract. Exp. (2007)

  • Dig, D. et al. How do programs become more concurrent? A story of program transformations. Technical Report (2008)

  • Dig, D. et al. Refactoring sequential Java code for concurrency via concurrent libraries. Proceedings of the 31st International Conference on Software Engineering, Vancouver, Canada (2009)

  • Dyer, R. et al. Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. ICSE’13: 35th International Conference on Software Engineering (2013)

  • Ferrari, A.F. Jpvm: network parallel computing in Java. ACM 1998 Workshop on Java for High-Performance Network Computing (1997)

  • Gil, J.Y. et al. Micro patterns in Java code. SIGPLAN Not. (2005)

  • Github, 2013. Top languages. https://github.com/languages (accessed...

  • Goetz, B. et al. Java Concurrency in Practice (2006)

  • Grechanik, M. et al. An empirical investigation into a large-scale Java open source code repository. Proceedings of the 4th International Symposium on Empirical Software Engineering and Measurement, Bolzano-Bozen, Italy (2010)

  • Grothoff, C. et al. Encapsulating objects with confined types. ACM Trans. Program. Lang. Syst. (2007)

  • Herlihy, M. et al. The Art of Multiprocessor Programming (2008)

  • Ihaka, R. et al. R: A language for data analysis and graphics. J. Comput. Graph. Stat. (1996)

  • Ishizaki, K. et al. Refactoring Java programs using concurrent libraries. Proceedings of the Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (2011)

  • Kitchenham, B.A. et al. Personal opinion surveys

  • Li, Z. et al. Have things changed now? An empirical study of bug characteristics in modern open source software. Proceedings of the 1st Workshop on Architectural and System Support for Improving Software Dependability (2006)

  • Lin, Y. et al. Check-then-act misuse of Java concurrent collections. Proceedings of the 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation (2013)

  • Lin, Y. et al. Retrofitting concurrency for Android applications through refactoring. Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (2014)

  • Lu, S. et al. Learning from mistakes: a comprehensive study on real world concurrency bug characteristics. SIGOPS Oper. Syst. Rev. (2008)

Gustavo Pinto is a Postdoctoral Researcher at Federal University of Pernambuco (UFPE). He received his M.Sc. degree in Computer Science from Federal University of Paraná (UFPR), and his Ph.D. from Federal University of Pernambuco (UFPE). His research interests include performance and energy consumption, concurrent programming, social aspects of software engineering, big data analytics, and refactoring.

Weslley Torres received his M.Sc. degree in Computer Science from Federal University of Pernambuco (UFPE) and is now a Ph.D. student in computer science, also at UFPE. His research interests cover concurrent programming and software evolution.

Benito Fernandes is an M.Sc. student in Computer Science at Universidade Federal de Pernambuco (UFPE). His research interests cover concurrent programming, energy efficiency and software evolution.

Fernando Castor is an assistant professor at the Universidade Federal de Pernambuco (UFPE), Brazil. His research aims to support developers in the construction of large-scale, dependable software systems, with a particular emphasis on error handling, concurrent programming, energy efficiency, and software evolution.

Roberto S. M. Barros received his B.Sc. and M.Sc. degrees in Computer Science from Universidade Federal de Pernambuco (UFPE), Brazil, in 1985 and 1988, respectively, and his Ph.D. degree in Computing Science from The University of Glasgow, Scotland (UK) in 1994. From 1985 to 1995 he worked as a systems analyst at UFPE, and since 1995 he has been a full-time Professor and Researcher, also at UFPE. His main research areas are software engineering, programming languages, XML, and machine learning from data streams with concept drift.