ABSTRACT
Background: The NASA datasets have previously been used extensively in studies of software defects. In 2013, Shepperd et al. presented an essential set of rules for removing erroneous data from the NASA datasets, making this data more reliable to use.
Objective: We have now found additional rules, not identified by Shepperd et al., that are necessary for removing problematic data.
Results: In this paper, we demonstrate the level of erroneous data still present even after cleaning using Shepperd et al.'s rules and apply our new rules to remove this erroneous data.
Conclusion: Even after systematic data cleaning of the NASA MDP datasets, we found new erroneous data. Data quality should always be explicitly considered by researchers before use.
- B. Ghotra, S. McIntosh, and A. E. Hassan. Revisiting the impact of classification techniques on the performance of defect prediction models. In 37th International Conference on Software Engineering (ICSE), 2015.
- D. Gray, D. Bowes, N. Davey, Y. Sun, and B. Christianson. The misuse of the NASA metrics data program data sets for automated software defect prediction. In Evaluation and Assessment in Software Engineering (EASE 2011), pages 96--103, 2011.
- D. Gray, D. Bowes, N. Davey, Y. Sun, and B. Christianson. Reflections on the NASA MDP data sets. IET Software, 6(6):549--558, Dec 2012.
- T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering, 38(6):1276--1304, Nov 2012.
- Y. Kamei and E. Shihab. Defect prediction: Accomplishments and future challenges. In 23rd IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2016.
- S. Lessmann, B. Baesens, C. Mues, and S. Pietsch. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 34(4):485--496, July 2008.
- R. Malhotra. A systematic review of machine learning techniques for software fault prediction. Applied Soft Computing, 27:504--518, 2015.
- M. Shepperd, Q. Song, Z. Sun, and C. Mair. Data quality: Some comments on the NASA software defect datasets. IEEE Transactions on Software Engineering, 39(9):1208--1215, Sept 2013.
- R. S. Wahono. A systematic literature review of software defect prediction: Research trends, datasets, methods and frameworks. Journal of Software Engineering, 1(1):1--16, 2015.