Mining the past to determine the future: Problems and possibilities
Introduction
Modern data capture technologies and the capacity for data storage mean that we are experiencing a data deluge. This brings with it both opportunities and challenges. The opportunities arise from the possibility of discerning structures and patterns which would be undetectable in data sets with fewer points or a narrower range of variables. The challenges include searching through such vast data sets, as well as issues of data quality and of apparent structure arising by chance. Such issues are discussed by Hand, Blunt, Kelly and Adams (2000).
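The risk of apparent structure arising by chance can be illustrated with a small simulation. The following is a minimal sketch and is not taken from the paper: it assumes pure Gaussian noise, and the function names (`pearson`, `max_abs_correlation`, `simulate`) and the default sizes (20 observations, 60 variables, a subset of 5) are hypothetical choices made only for illustration. Because the simulated data contain no real relationships, any sizeable correlation it reports is spurious.

```python
import itertools
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def max_abs_correlation(columns):
    """Largest absolute correlation over all pairs of columns."""
    return max(
        abs(pearson(a, b)) for a, b in itertools.combinations(columns, 2)
    )


def simulate(n_obs=20, n_vars=60, n_few=5, seed=0):
    """Return the strongest pairwise correlation found among the first
    n_few noise variables and among all n_vars noise variables.

    All sizes and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Every column is independent noise: there is nothing real to find.
    columns = [
        [rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)
    ]
    return max_abs_correlation(columns[:n_few]), max_abs_correlation(columns)


if __name__ == "__main__":
    few, many = simulate()
    print(f"strongest |r| among 5 noise variables:  {few:.2f}")
    print(f"strongest |r| among 60 noise variables: {many:.2f}")
```

Since the smaller set of variables is a subset of the larger one, the second figure can never be below the first: widening the search can only raise the strongest "pattern" found, even though every variable is noise.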
Forecasting has always been an important statistical problem — indeed, it certainly predates the development of formal data analytic tools. But with the development of formal analytics, highly sophisticated forecasting methods have been developed, with particular tools created for the unique problems of different kinds of domain.
When the two areas come together (forecasting based on large masses of data and using the rapid development tools of data mining), new opportunities are created. But, as with data mining in general, such opportunities do not come without their caveats. The careless use of any sophisticated tool can lead to misleading conclusions, and data mining is no exception. It is my view that these dangers have been largely overlooked by the data mining community, and, now that the discipline is firmly established, they need to be addressed. In this paper I briefly summarise high-level notions of forecasting and data mining, and then look at some of these dangers. I illustrate these points using examples from various domains, though most come from the personal financial services sector, partly because I have considerable experience in that area, and partly because many of the dangers are particularly apparent in that area.
Forecasting
Economists joke that steering the economy is like steering a car by looking through the rear view mirror. Of course, one would never steer a car like that. To steer a car, one looks ahead, noting that one is approaching a bend in the road, that there is another vehicle bearing down on one, and that there is a cyclist just ahead on the near side. That is, in steering a car, one sees that certain things lie ahead, which will have to be taken into account. The presumption in this joke is that in
Data mining
The preface of my book Principles of Data Mining (Hand, Mannila, & Smyth, 2001) opened by defining data mining as ‘the science of extracting useful information from large data sets or databases’. I think that this brief definition is sufficiently broad that it will be non-controversial. However, the opening chapter of the book then included the more detailed definition: ‘the analysis of (often large) observational data sets to find unsuspected relationships and to summarise the data in novel
Problems
The combination of large data sets and observational data means that data mining exercises are often at risk of drawing misleading conclusions. In this section I describe just four of these dangers. These problems are certainly not things I alone have detected. Indeed, within the statistics community, they are problems which are well understood. However, the central philosophy of data mining (throw sufficient computer power at a large enough data set and interesting things will be revealed)
Conclusion
Forecasting is fundamentally an inferential problem. That is, it is not simply a question of summarising data, but is rather a question of generalising from the available data to new data — and in particular to new situations which are likely to arise in the future. In contrast, the early development of data mining by the computer science community put emphasis on the analysis of the data set to hand (e.g. the discovery of ‘frequent itemsets’ in large transaction databases). It is only
Acknowledgements
The author’s work on this paper was partially supported by a Royal Society Wolfson Research Merit Award.
References (17)
A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting (2000).
et al. The experimental study of physical mechanisms. Technometrics (1965).
Role of models in statistical analysis. Statistical Science (1990).
Artificial intelligence and psychiatry (1985).
Deconstructing statistical questions (with discussion). Journal of the Royal Statistical Society, Series A (1994).
Modelling consumer credit risk. IMA Journal of Management Mathematics (2001).
Reject inference in credit operations.
et al. Data mining for fun and profit. Statistical Science (2000).