research-article

Data sets and data quality in software engineering

Authors:
Gernot A. Liebchen, Brunel University, Uxbridge, United Kingdom
Martin Shepperd, Brunel University, Uxbridge, United Kingdom

PROMISE '08: Proceedings of the 4th international workshop on Predictor models in software engineering, May 2008, Pages 39–44. https://doi.org/10.1145/1370788.1370799

Published: 12 May 2008

ABSTRACT

OBJECTIVE - to assess the extent and types of techniques used to manage quality within software engineering data sets. We consider this a particularly interesting question in the context of initiatives to promote sharing and secondary analysis of data sets.

METHOD - we perform a systematic review of available empirical software engineering studies.

RESULTS - only 23 of the many hundreds of studies assessed explicitly considered data quality.

CONCLUSIONS - first, the community needs to consider the quality and appropriateness of the data set being utilised; not all data sets are equal. Second, we need more research into means of identifying, and ideally repairing, noisy cases. Third, it should become routine to use sensitivity analysis to assess how stable conclusions are with respect to the assumptions that must be made concerning noise levels.
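
The third recommendation can be illustrated with a minimal sketch. It is not the authors' procedure: the data are synthetic, the "conclusion" is taken to be the sign of an ordinary least-squares slope of effort on size, and noise is modelled as replacing a fraction of effort values with uniform random draws. The names `slope`, `inject_noise` and `conclusion_stability` are hypothetical, chosen only for this illustration.

```python
import random
import statistics


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def inject_noise(ys, noise_level, rng):
    """Return a copy of ys with a fraction `noise_level` of values
    replaced by uniform draws from the observed range (assumed noise model)."""
    noisy = list(ys)
    k = round(noise_level * len(ys))
    lo, hi = min(ys), max(ys)
    for i in rng.sample(range(len(ys)), k):
        noisy[i] = rng.uniform(lo, hi)
    return noisy


def conclusion_stability(xs, ys, noise_levels, trials=200, seed=0):
    """For each assumed noise level, return the fraction of trials in which
    the sign of the slope agrees with the sign found on the original data."""
    rng = random.Random(seed)
    baseline_positive = slope(xs, ys) > 0
    result = {}
    for level in noise_levels:
        agree = sum(
            (slope(xs, inject_noise(ys, level, rng)) > 0) == baseline_positive
            for _ in range(trials)
        )
        result[level] = agree / trials
    return result


# Synthetic project data (illustrative only): effort grows with size.
data_rng = random.Random(42)
sizes = [data_rng.uniform(10, 500) for _ in range(60)]
efforts = [5.0 * s + data_rng.gauss(0, 200) for s in sizes]

stability = conclusion_stability(sizes, efforts, [0.0, 0.1, 0.3, 0.5])
for level, frac in stability.items():
    print(f"assumed noise {level:.0%}: conclusion held in {frac:.0%} of trials")
```

Reporting the agreement rate per assumed noise level, rather than a single result, shows how far a conclusion depends on the noise assumption. A different noise model or a different conclusion (for example, a ranking of prediction methods) would slot into the same loop.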

Index Terms

Data sets and data quality in software engineering
1. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems

Published in
PROMISE '08: Proceedings of the 4th international workshop on Predictor models in software engineering
May 2008
108 pages
ISBN: 9781605580364
DOI: 10.1145/1370788
General Chair: Gary Boetticher, University of Houston - Clear Lake, USA
Program Chair: Tom Ostrand, AT&T, USA
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data quality
data sets
empirical research
prediction
Acceptance Rates
PROMISE '08 Paper Acceptance Rate: 13 of 16 submissions, 81%. Overall Acceptance Rate: 64 of 125 submissions, 51%.