ABSTRACT
Background: There has been a long debate in the software engineering literature concerning how useful cross-company (CC) data are for software effort estimation (SEE) in comparison to within-company (WC) data. Studies indicate that models trained on CC data obtain either similar or worse performance than models trained solely on WC data.
Aims: We investigate whether CC data can help to increase SEE performance and, if so, under what conditions.
Method: We build on the observation that SEE is a class of online learning tasks operating in changing environments, a fact that most previous work has neglected. We analyse the performance of different approaches that use CC and WC data: (1) an approach not designed for changing environments, (2) approaches designed for changing environments, and (3) a new online learning approach able to identify when CC data are helpful or detrimental.
Results: The analysis reveals notable features of data sets commonly used in the SEE literature, showing that different subsets of CC data can be beneficial or detrimental depending on the moment in time. The newly proposed approach exploits this, successfully using CC data to improve performance over WC models.
Conclusions: This work shows that CC data can help to increase performance for SEE tasks. It also demonstrates that the online nature of software prediction tasks should be exploited and deserves consideration in future work.
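To make the idea of "identifying when CC data are helpful or detrimental" concrete, the following is a minimal illustrative sketch, not the approach proposed in the paper, whose details the abstract does not give. It assumes a pool containing one WC model and several models trained on different CC subsets, and it uses a simple hypothetical rule: each time a WC project is completed, every model except the one with the lowest absolute error has its weight multiplied by a penalty factor. The names `OnlineModelPool`, `beta`, `predict`, and `update` are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# A model maps a project's feature vector to a predicted effort.
Model = Callable[[Sequence[float]], float]


@dataclass
class OnlineModelPool:
    """Weighted pool of effort models (e.g. one WC model plus several CC models).

    Illustrative only: the weighting rule is an assumption, not the paper's method.
    """
    models: List[Model]
    beta: float = 0.5  # hypothetical penalty factor, expected to lie in (0, 1)
    weights: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start with equal trust in every model unless weights were supplied.
        if not self.weights:
            self.weights = [1.0 / len(self.models)] * len(self.models)

    def predict(self, features: Sequence[float]) -> float:
        # Weighted average of the individual model predictions.
        return sum(w * m(features) for w, m in zip(self.weights, self.models))

    def update(self, features: Sequence[float], actual_effort: float) -> None:
        # Called when a WC project finishes and its true effort becomes known.
        errors = [abs(m(features) - actual_effort) for m in self.models]
        best = min(range(len(errors)), key=errors.__getitem__)
        # Penalise every model except the best one, then renormalise.
        self.weights = [
            w if i == best else w * self.beta
            for i, w in enumerate(self.weights)
        ]
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]
```

With this scheme, a CC model that matches the current WC environment gains weight, and it loses weight again if the environment changes and another model becomes the most accurate. This mirrors, in a very simplified way, the observation that different CC subsets can help or hurt at different moments in time.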