
International Journal of Forecasting

Volume 25, Issue 3, July–September 2009, Pages 441–451

Mining the past to determine the future: Problems and possibilities

https://doi.org/10.1016/j.ijforecast.2008.09.004

Abstract

Technological advances mean that vast data sets are increasingly common. Such data sets provide us with unparalleled opportunities for modelling and predicting the likely outcome of future events. However, such data sets may also bring with them new challenges and difficulties. An awareness of these, and of the weaknesses as well as the possibilities of these large data sets, is necessary if useful forecasts are to be made. This paper looks at some of these difficulties, illustrating them with applications from various areas.

Introduction

Modern data capture technologies and the capacity for data storage mean that we are experiencing a data deluge. This brings with it both opportunities and challenges. The opportunities arise from the possibility of discerning structures and patterns which would be undetectable in data sets with fewer points, or in data sets which did not include such a range of variables. The challenges include those of searching through such vast data sets, as well as issues of data quality and apparent structure arising by chance. Such issues are discussed by Hand, Blunt, Kelly and Adams (2000).
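
To make the last point concrete, the following small simulation (illustrative only; not an example taken from the paper) shows how apparent structure arises by chance once many variables are screened: among a couple of hundred pure-noise variables there are nearly 20,000 pairwise correlations to inspect, and roughly 5% of them will clear a conventional significance threshold even though no genuine relationships exist.

import numpy as np

rng = np.random.default_rng(0)

# A "large" data set of pure noise: 1,000 records on 200 variables.
n_rows, n_vars = 1000, 200
X = rng.standard_normal((n_rows, n_vars))

# Correlate every pair of variables and count how many look "significant"
# at the conventional 5% level, despite the data containing no structure at all.
corr = np.corrcoef(X, rowvar=False)
threshold = 1.96 / np.sqrt(n_rows)  # approximate two-sided cut-off for |r| at p < 0.05

upper = np.triu_indices(n_vars, k=1)
n_pairs = len(upper[0])
n_spurious = int(np.sum(np.abs(corr[upper]) > threshold))

print(f"{n_spurious} of {n_pairs} variable pairs appear 'significant' by chance")
# With 19,900 pairs tested, on the order of a thousand exceed the threshold.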

Forecasting has always been an important statistical problem — indeed, it certainly predates the development of formal data analytic tools. But with the development of formal analytics, highly sophisticated forecasting methods have been developed, with particular tools created for the unique problems of different kinds of domain.

When the two areas come together — forecasting based on large masses of data and using the rapidly developing tools of data mining — new opportunities are created. But, as with data mining in general, such opportunities do not come without their caveats. The careless use of any sophisticated tool can lead to misleading conclusions, and data mining is no exception. It is my view that these dangers have been largely overlooked by the data mining community, and, now that the discipline is firmly established, they need to be addressed. In this paper I briefly summarise high-level notions of forecasting and data mining, and then look at some of these dangers. I illustrate these points using examples from various domains, though most come from the personal financial services sector, partly because I have considerable experience in that area, and partly because many of the dangers are particularly apparent there.

Section snippets

Forecasting

Economists joke that steering the economy is like steering a car by looking through the rear view mirror. Of course, one would never steer a car like that. To steer a car, one looks ahead, noting that one is approaching a bend in the road, that there is another vehicle bearing down on one, and that there is a cyclist just ahead on the near side. That is, in steering a car, one sees that certain things lie ahead, which will have to be taken into account. The presumption in this joke is that in…

Data mining

The preface of my book Principles of Data Mining (Hand, Mannila, & Smyth, 2001) opened by defining data mining as ‘the science of extracting useful information from large data sets or databases’. I think that this brief definition is sufficiently broad that it will be non-controversial. However, the opening chapter of the book then included the more detailed definition: ‘the analysis of (often large) observational data sets to find unsuspected relationships and to summarise the data in novel…

Problems

The combination of large data sets and observational data means that data mining exercises are often at risk of drawing misleading conclusions. In this section I describe just four of these dangers. These problems are certainly not things I alone have detected. Indeed, within the statistics community, they are problems which are well understood. However, the central philosophy of data mining — throw sufficient computer power at a large enough data set and interesting things will be revealed —

Conclusion

Forecasting is fundamentally an inferential problem. That is, it is not simply a question of summarising data, but is rather a question of generalising from the available data to new data — and in particular to new situations which are likely to arise in the future. In contrast, the early development of data mining by the computer science community put emphasis on the analysis of the data set to hand (e.g. the discovery of ‘frequent itemsets’ in large transaction databases). It is only…
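
As a hypothetical illustration of the kind of within-sample summary that this conclusion contrasts with forecasting, the sketch below counts frequent itemsets in a toy transaction database (the data, support threshold and helper function are invented for this example, not taken from the paper).

from collections import Counter
from itertools import combinations

# A toy transaction database: each transaction is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "butter", "milk"},
    {"bread", "butter", "milk"},
]

def frequent_itemsets(transactions, min_support=0.4, max_size=2):
    """Count every itemset of up to max_size items and keep those appearing
    in at least a fraction min_support of the transactions."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

print(frequent_itemsets(transactions))
# For example, ('bread', 'milk') appears in 3 of the 5 transactions (support 0.6).
# The computation only summarises the data set to hand; it makes no claim about
# transactions not yet observed, which is the inferential step forecasting requires.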

Acknowledgements

The author’s work on this paper was partially supported by a Royal Society Wolfson Research Merit Award.

References (17)

  • L.C. Thomas, A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers, International Journal of Forecasting (2000)
  • G.E.P. Box et al., The experimental study of physical mechanisms, Technometrics (1965)
  • D.R. Cox, Role of models in statistical analysis, Statistical Science (1990)
  • D.J. Hand, Artificial intelligence and psychiatry (1985)
  • D.J. Hand, Deconstructing statistical questions (with discussion), Journal of the Royal Statistical Society, Series A (1994)
  • D.J. Hand, Modelling consumer credit risk, IMA Journal of Management Mathematics (2001)
  • D.J. Hand, Reject inference in credit operations
  • D.J. Hand et al., Data mining for fun and profit, Statistical Science (2000)

Cited by (23)

  • Forecasting: theory and practice

    2022, International Journal of Forecasting
    Citation excerpt:

    In time series and forecasting literature, an anomaly is mostly defined with respect to a specific context or its relation to past behaviours. The idea of a context is induced by the structure of the input data and the problem formulation (Chandola, Banerjee, & Kumar, 2007, 2009; Hand, 2009). Further, anomaly detection in forecasting literature has two main focuses, which are conflicting in nature: one demands special attention be paid to anomalies as they can be the main carriers of significant and often critical information such as fraud activities, disease outbreak, natural disasters, while the other down-grades the value of anomalies as it reflects data quality issues such as missing values, corrupted data, data entry errors, extremes, duplicates and unreliable values (Talagala, Hyndman, & Smith-Miles, 2020).

  • Machine learning loss given default for corporate debt

    2021, Journal of Empirical Finance
    Citation excerpt:

    Moreover, complex data-driven ML models can be especially sensitive to market disruptions or structural changes over time. Although the importance of out-of-time testing is self-evident and well recognized in the literature (e.g., Hand, 2009), most of the existing studies on ML models for corporate debt LGD are based on in-sample and out-of-sample analysis without rigorously testing model performance out-of-time. The limited number of studies that report out-of-time results investigate only one or two types of ML methods (e.g., regression tree in Bastos, 2010, regression tree and support vector regression in Tobback et al., 2014), use data that contain a limited number of observations (e.g., Bastos, 2010, and Tobback et al., 2014) or do not exhibit a bi-modal pattern (e.g., Tobback et al., 2014).

  • Predicting mortgage early delinquency with machine learning methods

    2021, European Journal of Operational Research
    Citation excerpt:

    Out-of-time investigation does not receive much attention in the existing literature either, as the findings in the prior studies on ML methods are typically based on in-sample and out-of-sample evidence (see for example, the discussions in Varian (2014)). However, conducting out-of-time investigation for ML methods is particularly important (see Hand (2009a) for a comprehensive explanation). A statistical model is built on consumer behavior observed in historical data, but such behavior could change in the future and the historical pattern observed at certain periods for a sub-population may not be generalized over time.

  • Foresight by online communities – The case of renewable energies

    2018, Technological Forecasting and Social Change
    Citation excerpt:

    Especially, Big Data or Data Mining are common methods. Both concepts deal with the use of very large amounts of data that are obtained primarily via the Internet (Hand, 2009; Hassani and Silva, 2015). OCs can be a useful source of data, too.

  • An empirical comparison of classification algorithms for mortgage default prediction: Evidence from a distressed mortgage market

    2016, European Journal of Operational Research
    Citation excerpt:

    First, it specifically focuses on mortgages. Detailed accounts of the various modelling approaches to credit scoring are included in Crook et al. (2007), Crook and Bellotti (2009), Thomas (2009), Hand (2009b), and Martin (2013). However, with the exceptions of Galindo and Tamayo (2000), or Feldman and Gross (2005) and Kennedy, Namee, Delaney, O’Sullivan, and Watson (2013a), most of the literature concentrates on credit card or personal lending only.
