Mining the past to determine the future: Problems and possibilities

doi:10.1016/j.ijforecast.2008.09.004

International Journal of Forecasting

Volume 25, Issue 3, July–September 2009, Pages 441-451

https://doi.org/10.1016/j.ijforecast.2008.09.004

Abstract

Technological advances mean that vast data sets are increasingly common. Such data sets provide us with unparalleled opportunities for modelling and predicting the likely outcome of future events. However, such data sets may also bring with them new challenges and difficulties. An awareness of these, and of the weaknesses as well as the possibilities of these large data sets, is necessary if useful forecasts are to be made. This paper looks at some of these difficulties, using illustrations with applications from various areas.

Introduction

Modern data capture technologies and the capacity for data storage mean that we are experiencing a data deluge. This brings with it both opportunities and challenges. The opportunities arise from the possibility of discerning structures and patterns which would be undetectable with data sets with fewer points or which did not include such a range of variables. The challenges include those of searching through such vast data sets, as well as issues of data quality and apparent structure arising by chance. Such issues are discussed by Hand, Blunt, Kelly and Adams (2000).
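The point about apparent structure arising by chance can be made concrete with a small simulation. The sketch below is an illustration only and is not taken from the paper: the function names, the sample size of 30 and the numbers of candidate variables are all assumptions chosen for the example. It correlates a pure-noise target with increasing numbers of pure-noise predictors and reports the strongest correlation found, which grows as more variables are searched even though no real relationship exists.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_chance_correlation(n_obs, n_vars, seed=0):
    """Largest |correlation| between a noise target and n_vars noise predictors.

    Hypothetical helper for illustration: every series is independent
    Gaussian noise, so any correlation found is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(candidate, target)))
    return best


if __name__ == "__main__":
    # Searching more candidate variables turns up a "stronger" relationship,
    # purely as a by-product of the search.
    for n_vars in (1, 10, 100, 1000):
        print(n_vars, round(max_chance_correlation(n_obs=30, n_vars=n_vars), 3))
```

Because the seed is fixed, each larger search includes the candidates of the smaller ones, so the reported maximum can only stay the same or increase. A forecast built on the "best" such variable would be expected to fail on new data.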

Forecasting has always been an important statistical problem — indeed, it certainly predates the development of formal data analytic tools. But with the development of formal analytics, highly sophisticated forecasting methods have been developed, with particular tools created for the unique problems of different kinds of domain.

When the two areas come together (forecasting based on large masses of data and using the rapidly developing tools of data mining), new opportunities are created. But, as with data mining in general, such opportunities do not come without their caveats. The careless use of any sophisticated tool can lead to misleading conclusions, and data mining is no exception. It is my view that these dangers have been largely overlooked by the data mining community, and, now that the discipline is firmly established, they need to be addressed. In this paper I briefly summarise high level notions of forecasting and data mining, and then look at some of these dangers. I illustrate these points using examples from various domains, though most come from the personal financial services sector, partly because I have considerable experience in that area, and partly because many of the dangers are particularly apparent in that area.

Section snippets

Forecasting

Economists joke that steering the economy is like steering a car by looking through the rear view mirror. Of course, one would never steer a car like that. To steer a car, one looks ahead, noting that one is approaching a bend in the road, that there is another vehicle bearing down on one, and that there is a cyclist just ahead on the near side. That is, in steering a car, one sees that certain things lie ahead, which will have to be taken into account. The presumption in this joke is that in

Data mining

The preface of my book Principles of Data Mining (Hand, Mannila, & Smyth, 2001) opened by defining data mining as ‘the science of extracting useful information from large data sets or databases’. I think that this brief definition is sufficiently broad that it will be non-controversial. However, the opening chapter of the book then included the more detailed definition: ‘the analysis of (often large) observational data sets to find unsuspected relationships and to summarise the data in novel

Problems

The combination of large data sets and observational data means that data mining exercises are often at risk of drawing misleading conclusions. In this section I describe just four of these dangers. These problems are certainly not things I alone have detected. Indeed, within the statistics community, they are problems which are well understood. However, the central philosophy of data mining (throw sufficient computer power at a large enough data set and interesting things will be revealed)

Conclusion

Forecasting is fundamentally an inferential problem. That is, it is not simply a question of summarising data, but is rather a question of generalising from the available data to new data — and in particular to new situations which are likely to arise in the future. In contrast, the early development of data mining by the computer science community put emphasis on the analysis of the data set to hand (e.g. the discovery of ‘frequent itemsets’ in large transaction databases). It is only

Acknowledgements

The author’s work on this paper was partially supported by a Royal Society Wolfson Research Merit Award.

References (17)

L.C. Thomas (2000). A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting.
G.E.P. Box et al. (1965). The experimental study of physical mechanisms. Technometrics.
D.R. Cox (1990). Role of models in statistical analysis. Statistical Science.
D.J. Hand (1985). Artificial intelligence and psychiatry.
D.J. Hand (1994). Deconstructing statistical questions (with discussion). Journal of the Royal Statistical Society, Series A.
D.J. Hand (2001). Modelling consumer credit risk. IMA Journal of Management Mathematics.
D.J. Hand. Reject inference in credit operations.
D.J. Hand et al. (2000). Data mining for fun and profit. Statistical Science.

There are more references available in the full text version of this article.

Cited by (23)

Forecasting: theory and practice
2022, International Journal of Forecasting
Citation Excerpt :
In time series and forecasting literature, an anomaly is mostly defined with respect to a specific context or its relation to past behaviours. The idea of a context is induced by the structure of the input data and the problem formulation (Chandola, Banerjee, & Kumar, 2007, 2009; Hand, 2009). Further, anomaly detection in forecasting literature has two main focuses, which are conflicting in nature: one demands special attention be paid to anomalies as they can be the main carriers of significant and often critical information such as fraud activities, disease outbreak, natural disasters, while the other down-grades the value of anomalies as it reflects data quality issues such as missing values, corrupted data, data entry errors, extremes, duplicates and unreliable values (Talagala, Hyndman, & Smith-Miles, 2020).
Forecasting has always been at the forefront of decision making and planning. The uncertainty that surrounds the future is both exciting and challenging, with individuals and organisations seeking to minimise risks and maximise utilities. The large number of forecasting applications calls for a diverse set of forecasting methods to tackle real-life challenges. This article provides a non-systematic review of the theory and the practice of forecasting. We provide an overview of a wide range of theoretical, state-of-the-art models, methods, principles, and approaches to prepare, produce, organise, and evaluate forecasts. We then demonstrate how such theoretical concepts are applied in a variety of real-life contexts.
We do not claim that this review is an exhaustive list of methods and applications. However, we wish that our encyclopedic presentation will offer a point of reference for the rich work that has been undertaken over the last decades, with some key insights for the future of forecasting theory and practice. Given its encyclopedic nature, the intended mode of reading is non-linear. We offer cross-references to allow the readers to navigate through the various topics. We complement the theoretical concepts and applications covered by large lists of free or open-source software implementations and publicly-available databases.
Machine learning loss given default for corporate debt
2021, Journal of Empirical Finance
Citation Excerpt :
Moreover, complex data-driven ML models can be especially sensitive to market disruptions or structural changes over time. Although the importance of out-of-time testing is self-evident and well recognized in the literature (e.g., Hand, 2009), most of the existing studies on ML models for corporate debt LGD are based on in-sample and out-of-sample analysis without rigorously testing model performance out-of-time. The limited number of studies that report out-of-time results investigate only one or two types of ML methods (e.g., regression tree in Bastos, 2010, regression tree and support vector regression in Tobback et al., 2014), use data that contain a limited number of observations (e.g., Bastos, 2010, and Tobback et al., 2014) or do not exhibit a bi-modal pattern (e.g., Tobback et al., 2014).
We apply multiple machine learning (ML) methods to model loss given default (LGD) for corporate debt using a common dataset that is cross-sectional but collected over different time periods and shows much variation over time. We investigate the efficacy of three cross-validation (CV) schemes for hyper-parameter tuning and bootstrap aggregation (Bagging) in preventing out-of-time model performance deterioration. The three CV methods are shuffled K-fold, unshuffled K-fold and sequential blocked, which completely destroys, keeps some and completely retains the chronological order in the data, respectively. We find that it is important to keep the chronological order in the data when creating the training and testing samples, and the more the chronological order that can be retained, the more stable the out-of-time ML LGD model performance. By contrast, although bagging improves out-of-time fit in some cases, its effectiveness is rather marginal relative to that from the unshuffled K-fold and sequential blocked CV methods. Substantial uncertainty in relative out-of-time performance remains, however, thus ongoing model performance monitoring and benchmarking are still essential for sound model risk management for corporate LGD and other ML models.
Predicting mortgage early delinquency with machine learning methods
2021, European Journal of Operational Research
Citation Excerpt :
Out-of-time investigation does not receive much attention in the existing literature either, as the findings in the prior studies on ML methods are typically based on in-sample and out-of-sample evidence (see for example, the discussions in Varian (2014)). However, conducting out-of-time investigation for ML methods is particularly important (see Hand (2009a) for a comprehensive explanation). A statistical model is built on consumer behavior observed in historical data, but such behavior could change in the future and the historical pattern observed at certain periods for a sub-population may not be generalized over time.
This paper investigates the performance of thirteen methods for modelling and predicting mortgage early delinquency probabilities. These models include variants of logit models, some commonly used machine learning methods, and variants of ensemble models. We find that heterogenous ensemble methods lead other methods in the training, out-of-sample, and out-of-time datasets in terms of risk classification. Nonetheless, various predictive accuracy performance measures yield different rankings among the thirteen methods and no method consistently dominates in this performance dimension in the training, out-of-sample, and out-of-time data. Lastly, predictive accuracy is a major challenge facing all mortgage early delinquency models, even in the training data.
Foresight by online communities – The case of renewable energies
2018, Technological Forecasting and Social Change
Citation Excerpt :
Especially, Big Data or Data Mining are common methods. Both concepts deal with the use of very large amounts of data that are obtained primarily via the Internet (Hand, 2009; Hassani and Silva, 2015). OCs can be a useful source of data, too.
Web 2.0 offers manifold ways in order to integrate community members via online communities (OCs) for innovation processes. OCs prove to be a valuable and dynamic source of information. External information sources are also important for foresight in order to be able to identify and monitor all relevant changes. However, traditional foresight methods are rather static in comparison with dynamic OCs. Thus, this study gives first insights into the use of OCs for foresight. First, based on literature, it is conceptually shown that OCs can contribute to foresight. Second, the question of how to assess the potential of OCs for foresight is considered. Renewable energies OCs are identified using a netnographic approach. One selected OC is analyzed in-depth by applying a prior developed criteria catalog which is based on Popper's (2008) foresight diamond. Each of its four dimensions – creativity, expertise, interaction, and evidence – is operationalized with measurement items taken from literature. In particular, the evidence dimension is supported by a text mining approach. Lastly, a focus group interview proves the usefulness of OCs for foresight. The findings show that OCs can contribute to each dimension of the foresight diamond and serve as an additional source of information for foresight.
An empirical comparison of classification algorithms for mortgage default prediction: Evidence from a distressed mortgage market
2016, European Journal of Operational Research
Citation Excerpt :
First, it specifically focuses on mortgages. Detailed accounts of the various modelling approaches to credit scoring are included in Crook et al. (2007), Crook and Bellotti (2009), Thomas (2009), Hand (2009b), and Martin (2013). However, with the exceptions of Galindo and Tamayo (2000), or Feldman and Gross (2005) and Kennedy, Namee, Delaney, O’Sullivan, and Watson (2013a), most of the literature concentrates on credit card or personal lending only.
This paper evaluates the performance of a number of modelling approaches for future mortgage default status. Boosted regression trees, random forests, penalised linear and semi-parametric logistic regression models are applied to four portfolios of over 300,000 Irish owner-occupier mortgages. The main findings are that the selected approaches have varying degrees of predictive power and that boosted regression trees significantly outperform logistic regression. This suggests that boosted regression trees can be a useful addition to the current toolkit for mortgage credit risk assessment by banks and regulators.
A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition
2012, Expert Systems with Applications
Multi-step ahead forecasting is still an open challenge in time series forecasting. Several approaches that deal with this complex problem have been proposed in the literature but an extensive comparison on a large number of tasks is still missing. This paper aims to fill this gap by reviewing existing strategies for multi-step ahead forecasting and comparing them in theoretical and practical terms. To attain such an objective, we performed a large scale comparison of these different strategies using a large experimental benchmark (namely the 111 series from the NN5 forecasting competition). In addition, we considered the effects of deseasonalization, input variable selection, and forecast combination on these strategies and on multi-step ahead forecasting at large. The following three findings appear to be consistently supported by the experimental results: Multiple-Output strategies are the best performing approaches, deseasonalization leads to uniformly improved forecast accuracy, and input selection is more effective when performed in conjunction with deseasonalization.
