DOI: 10.1145/1370788.1370799
research-article

Data sets and data quality in software engineering

Published: 12 May 2008

ABSTRACT

OBJECTIVE - to assess the extent and types of techniques used to manage quality within software engineering data sets. We consider this a particularly interesting question in the context of initiatives to promote sharing and secondary analysis of data sets. METHOD - we perform a systematic review of available empirical software engineering studies. RESULTS - only 23 of the many hundreds of studies assessed explicitly considered data quality. CONCLUSIONS - first, the community needs to consider the quality and appropriateness of the data set being utilised; not all data sets are equal. Second, we need more research into means of identifying, and ideally repairing, noisy cases. Third, it should become routine to use sensitivity analysis to assess conclusion stability with respect to the assumptions that must be made concerning noise levels.
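
The third conclusion can be illustrated with a small sketch. The abstract does not prescribe a procedure, so everything here is an assumption made for illustration: the conclusion under test (one group of projects has a higher median than another), the multiplicative Gaussian noise model, the function names, and the sample values, which are invented. The idea is to perturb the data at several assumed noise levels and record how often the conclusion still holds.

```python
import random
import statistics


def conclusion_holds(group_a, group_b):
    """Hypothetical conclusion under test: group A has the higher median."""
    return statistics.median(group_a) > statistics.median(group_b)


def perturb(values, noise_level, rng):
    """Copy the values, scaling each by (1 + e) with e ~ Normal(0, noise_level).

    This noise model is an assumption; real measurement noise may differ.
    """
    return [v * (1.0 + rng.gauss(0.0, noise_level)) for v in values]


def sensitivity_analysis(group_a, group_b, noise_levels, trials=1000, seed=0):
    """Estimate, per assumed noise level, how often the conclusion survives.

    Returns a dict mapping each noise level to the fraction of trials
    in which the conclusion still held after perturbing both groups.
    """
    rng = random.Random(seed)  # fixed seed so the analysis is repeatable
    stability = {}
    for level in noise_levels:
        survived = 0
        for _ in range(trials):
            noisy_a = perturb(group_a, level, rng)
            noisy_b = perturb(group_b, level, rng)
            if conclusion_holds(noisy_a, noisy_b):
                survived += 1
        stability[level] = survived / trials
    return stability


if __name__ == "__main__":
    # Invented productivity figures, for illustration only.
    group_a = [12.0, 15.0, 11.0, 14.0, 13.0]
    group_b = [10.0, 12.5, 9.0, 11.5, 10.5]
    for level, share in sensitivity_analysis(group_a, group_b, [0.0, 0.1, 0.3]).items():
        print(f"assumed noise {level:.1f}: conclusion held in {share:.0%} of trials")
```

A conclusion whose survival rate falls sharply at modest assumed noise levels would deserve more caution than one that stays stable across the whole range.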

Published in

PROMISE '08: Proceedings of the 4th international workshop on Predictor models in software engineering
May 2008
108 pages
ISBN: 9781605580364
DOI: 10.1145/1370788

Copyright © 2008 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

PROMISE '08 paper acceptance rate: 13 of 16 submissions, 81%. Overall acceptance rate: 64 of 125 submissions, 51%.
