ABSTRACT
OBJECTIVE - To assess the extent and types of techniques used to manage quality within software engineering data sets. We consider this a particularly interesting question in the context of initiatives to promote sharing and secondary analysis of data sets. METHOD - We perform a systematic review of available empirical software engineering studies. RESULTS - Only 23 of the many hundreds of studies assessed explicitly considered data quality. CONCLUSIONS - First, the community needs to consider the quality and appropriateness of the data set being utilised; not all data sets are equal. Second, we need more research into means of identifying, and ideally repairing, noisy cases. Third, it should become routine to use sensitivity analysis to assess conclusion stability with respect to the assumptions that must be made concerning noise levels.
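The third conclusion, sensitivity analysis of conclusion stability under assumed noise levels, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two small productivity samples are invented, the noise model (a random multiplicative perturbation applied to a chosen fraction of cases) is only one of many possible models, and the function names `inject_noise`, `conclusion_holds`, and `sensitivity` are hypothetical. The idea is to state a conclusion on the clean data, repeat the analysis under several assumed noise rates, and report how often the conclusion survives.

```python
import random
import statistics


def inject_noise(values, noise_rate, scale, rng):
    """Return a copy of `values` in which a fraction `noise_rate` of the
    cases is multiplied by a random factor in [1 - scale, 1 + scale].
    This noise model is an illustrative assumption."""
    noisy = list(values)
    n_noisy = round(noise_rate * len(values))
    for i in rng.sample(range(len(values)), n_noisy):
        noisy[i] = values[i] * rng.uniform(1 - scale, 1 + scale)
    return noisy


def conclusion_holds(group_a, group_b):
    """The conclusion under test: group A has the higher median."""
    return statistics.median(group_a) > statistics.median(group_b)


def sensitivity(group_a, group_b, noise_rates, trials=200, scale=0.5, seed=0):
    """For each assumed noise rate, return the fraction of trials in which
    the conclusion still holds after noise is injected into both groups."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    stability = {}
    for rate in noise_rates:
        held = 0
        for _ in range(trials):
            a = inject_noise(group_a, rate, scale, rng)
            b = inject_noise(group_b, rate, scale, rng)
            held += conclusion_holds(a, b)
        stability[rate] = held / trials
    return stability


# Invented example data: productivity figures for two hypothetical groups.
GROUP_A = [12, 15, 14, 18, 16, 13, 17, 15]
GROUP_B = [11, 14, 12, 13, 15, 12, 14, 13]

if __name__ == "__main__":
    for rate, share in sensitivity(GROUP_A, GROUP_B, [0.0, 0.1, 0.3, 0.5]).items():
        print(f"assumed noise rate {rate:.1f}: conclusion held in {share:.0%} of trials")
```

A stability share that stays close to 1 across plausible noise rates suggests the conclusion is robust to the noise assumption; a share that drops quickly suggests the conclusion depends on the data being cleaner than can be assumed.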
Index Terms
- Data sets and data quality in software engineering