Information & Management

Volume 37, Issue 5, August 2000, Pages 271-281

Briefings
Methodological and practical aspects of data mining

https://doi.org/10.1016/S0378-7206(99)00051-8

Abstract

We describe the different stages in the data mining process and discuss some pitfalls, as well as guidelines for circumventing them. Although analysis receives the most attention, data selection and pre-processing are the most time-consuming activities and have a substantial influence on ultimate success. Successful data mining projects require the involvement of expertise in data mining, in the company data, and in the subject area concerned. Despite the attractive suggestion of ‘fully automatic’ data analysis, knowledge of the processes behind the data remains indispensable for avoiding the many pitfalls of data mining.

Introduction

Data mining is receiving more and more attention from the business community, as witnessed by frequent publications in the popular IT press and the growing number of tools appearing on the market. The commercial interest in data mining is mainly due to companies' increasing awareness that the vast amounts of data collected on customers and their behavior contain valuable information. If the hidden information can be made explicit, it can be used to improve vital business processes. Such developments are accompanied by the construction of data warehouses and data marts: integrated databases that are created specifically for the purpose of analysis rather than to support daily business transactions.

Many publications on data mining discuss the construction or application of algorithms to extract knowledge from data. The emphasis is generally on the analysis phase. When a data mining project is performed in an organizational setting, one discovers that there are other important activities in the process. These activities are often more time consuming and have an equally large influence on the ultimate success of the project.

Data mining is a multi-disciplinary field that lies at the intersection of statistics, machine learning, database management, and data visualization. A natural question comes to mind: to what extent does it provide a new perspective on data analysis? This question has received some attention within the community. A popular answer is that data mining is concerned with the extraction of knowledge from very large data sets. In our view, this is not the complete answer. Company databases are indeed often quite large, especially if one considers data on customer transactions. One should, however, take into account that:

  • Once the data mining question is specified accurately, only a small part of this large and heterogeneous database is of interest.

  • Even if the remaining dataset is large, a sample often suffices to construct accurate models, as the sketch below illustrates.
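
To make the second point concrete, the following minimal sketch compares a classifier fitted on an entire synthetic dataset with one fitted on a 5% random sample. It is an illustration only, not taken from the article, and it assumes scikit-learn and NumPy; the generated data merely stand in for a large table of customer records.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for a large table of customer data.
    X, y = make_classification(n_samples=200_000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Model fitted on all available training records.
    full_model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

    # Model fitted on a 5% random sample of the training records.
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X_train), size=len(X_train) // 20, replace=False)
    sample_model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train[idx], y_train[idx])

    print("full data :", accuracy_score(y_test, full_model.predict(X_test)))
    print("5% sample :", accuracy_score(y_test, sample_model.predict(X_test)))

The two test accuracies will typically differ only marginally, which is the point of the bullet above: for many modeling tasks a modest random sample suffices.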

If the contribution of the data mining perspective does not necessarily lie in the size of the dataset, where does it lie? Four aspects are of particular interest:
  • 1.

    There is a growing need for valid methods that cover the whole process (also called Knowledge Discovery in Databases or KDD), from problem formulation to the implementation of actions and monitoring of models. Methods are needed to identify the important steps, and indicate the required expertise and tools. Such methods are required to improve the quality and controllability of the process.

  • 2.

    If data mining is going to be used on a daily basis within organizations, better integration with the existing information systems infrastructure is required. It is, for example, important to couple analysis tools with data warehouses and to integrate data mining functionality with end-user software, such as marketing campaign schedulers.

  • 3.

    From a statistical viewpoint, such data are often of dubious value because of the absence of a study design. Since the data were not collected with a set of analysis questions in mind, they were not sampled from a pre-defined population, and data quality may be insufficient for the analysis requirements. These anomalies call for a study of problems related to the analysis of ‘non-random’ samples, data pollution, and missing data.

  • 4.

    Ease of interpretation is often understood to be a defining characteristic of data mining techniques. The demand for explainable models leads to a preference for techniques such as rule induction, classification trees and, more recently, Bayesian networks. Furthermore, explainable models encourage the explicit involvement of domain experts in the analysis process; a brief sketch of rule extraction from a classification tree follows this list.
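
As an illustration of point 4, the following sketch, which is not taken from the article and assumes scikit-learn together with its bundled iris dataset, fits a small classification tree and prints it as a set of readable if/then rules that a domain expert can inspect.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Fit a deliberately small tree so that the rule set stays short.
    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

    # Render the fitted tree as nested if/then rules, one line per split.
    print(export_text(tree, feature_names=list(iris.feature_names)))

Output of this kind, a handful of explicit conditions on named attributes, is what makes it feasible to discuss a model with a subject area expert.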

Section snippets

Required expertise

Successful data mining projects require a collaborative effort in a number of areas of expertise.

Stages of the data mining process

Data mining is an explorative and iterative process:

  • During data analysis, new knowledge is discovered and new hypotheses are formulated. These may lead to a focussing of the data mining question or to considering alternative questions.

  • During the process one may jump between the different stages; for example from the analysis to the data pre-processing stage.

There is a need for a sound method that describes the important stages and feedbacks of the process [4]. The method should ensure the

Model interpretability

Ease of model interpretation is an important requirement. The widespread use of classification trees and rule induction algorithms in data mining applications and tools aids in the interpretation of results. Often there is a trade-off between ease of model interpretation and predictive accuracy, and the goal of the modeling task determines which quality measure is considered more important. Ease of interpretation has two major advantages:

  • 1.

    The ‘end product’, that is the final model, is easy to
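
The trade-off mentioned above can be illustrated with a small sketch. It is not part of the original article; it assumes scikit-learn, and a random forest, a technique that postdates this paper, merely stands in for a less transparent model.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Interpretable model: a two-level tree that can be printed as a handful of rules.
    shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    # Less transparent model: an ensemble of 200 trees.
    forest = RandomForestClassifier(n_estimators=200, random_state=0)

    print("shallow tree:", cross_val_score(shallow_tree, X, y, cv=5).mean())
    print("forest      :", cross_val_score(forest, X, y, cv=5).mean())

The less interpretable model will usually score somewhat higher; whether that gain justifies giving up an inspectable model depends, as noted above, on the goal of the modeling task.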

Missing data

Data quality is a point of major concern in any information system, and also in the construction of data warehouses and in the subsequent analyses, which range from simple queries to OLAP and data mining [14], [19]. Although all aspects of data quality are relevant to data mining, we confine the discussion to the issue of completeness. If many data values are missing, the quality of the resulting information and models decreases accordingly. Consider the marketing department of a bank that wants to compute the average age
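
The bank example can be made concrete with a minimal sketch. The numbers are invented for illustration and the code assumes pandas and NumPy; the point is that the average age obtained by silently skipping missing values and the one obtained after an explicit imputation step are different quantities, and the choice between them should be a conscious one.

    import numpy as np
    import pandas as pd

    # Hypothetical customer ages; NaN marks customers whose age was never recorded.
    ages = pd.Series([23, 31, np.nan, 45, np.nan, 52, 38, np.nan, 27, 60])

    print("records        :", len(ages))
    print("missing        :", ages.isna().sum())
    print("mean (skip NaN):", ages.mean())                         # pandas skips NaN by default
    print("mean (imputed) :", ages.fillna(ages.median()).mean())   # one simple imputation choice

Which treatment is appropriate depends on why the values are missing; the EM algorithm of Dempster, Laird and Rubin (see the reference list) is one principled approach when a model for the incomplete data is available.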

Legal aspects

In data mining projects where personal data are used, it is important for management to be aware of the legislation concerning privacy. For example, in the Netherlands the National Consumers’ Association recently stated that personal data of Dutch citizens are stored in more than 100 different locations. Therefore, the code of law on privacy will be renewed and reinforced in the near future. The Dutch law on privacy protection (‘Wet Persoons Registratie’) dates from 1980 and it became

Tools

Many of the early data mining tools were almost exclusively concerned with the analysis stage. They were usually derived from algorithms developed in the research community, for example C4.5 [15] and CART [2], supplemented with a user-friendly GUI. An interactive GUI is not just a superficial ‘gimmick’; it encourages the involvement of the subject area expert and improves the efficiency of analysis. For frequent use in business, however, this functionality is insufficient. Also, many early systems require

Conclusions

Data mining or knowledge discovery in databases (KDD) is an exploratory and iterative process that consists of a number of stages. Data selection and data pre-processing are the most time-consuming activities, especially in the absence of a data warehouse. Data mining tools should therefore provide extensive support for data manipulation and combination. They should also provide easy access to the DBMSs in which the source data reside.
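
The call for easy access to the source DBMS can be made concrete with a minimal sketch, which is an illustration rather than the article's proposal: the analysis code pulls exactly the slice of warehouse data it needs through an SQL query. The table and column names are hypothetical, and an in-memory SQLite database stands in for the warehouse.

    import sqlite3
    import pandas as pd

    # A throwaway in-memory database stands in for the corporate data warehouse.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, age INTEGER, segment TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                     [(1, 34, "retail"), (2, 51, "retail"), (3, 45, "business")])

    # The analysis tool selects only the records and attributes it needs.
    frame = pd.read_sql("SELECT age, segment FROM customers WHERE segment = 'retail'", conn)
    print(frame)
    conn.close()

Coupling analysis code to the warehouse in this way keeps data selection explicit and repeatable, instead of relying on ad hoc file extracts.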

The commitment of a subject area expert, data mining expert as


References (19)

  • A. Subramanian et al., Strategic planning for data warehousing, Information and Management (1997)
  • I. Bratko et al., Applications of inductive logic programming, Communications of the ACM (1995)
  • L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, ...
  • A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM algorithm, Journal of the ...
  • U. Fayyad, D. Madigan, G. Piatetsky-Shapiro, P. Smyth, From data mining to knowledge discovery in databases, AI ...
  • U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI ...
  • J.H. Friedman et al., Bump hunting in high-dimensional data, Statistics and Computing (1999)
  • C. Glymour et al., Statistical themes and lessons for data mining, Data Mining and Knowledge Discovery (1997)
  • D.J. Hand, Data mining: statistics and more?, The American Statistician (1998)
There are more references available in the full text version of this article.


A. Feelders is an assistant professor at the Department of Economics and Business Administration of Tilburg University in the Netherlands. He received his Ph.D. in Artificial Intelligence from the same university, where he currently participates in the data mining research program. He worked as a consultant for a Dutch data mining company, where he was involved in many projects for banks and insurance companies. His current research interests include the application of data mining in finance and marketing. His articles have appeared in Computer Science in Economics and Management and IEEE Transactions on Systems, Man and Cybernetics. He is a member of the editorial board of the International Journal of Intelligent Systems in Accounting, Finance, and Management.

H. Daniels is a professor in Knowledge Management at the Erasmus University Rotterdam and an associate professor in Computer Science at the Department of Economics at Tilburg University. He received an M.Sc. in Mathematics from the Technical University of Eindhoven and a Ph.D. in Physics from Groningen University. He also worked as a project manager at the National Dutch Aerospace Laboratory. He has published many articles in international refereed journals, among which the International Journal of Intelligent Systems in Accounting, Finance, and Management, the Journal of Economic Dynamics and Control, and Computer Science in Economics and Management. His current research interest is mainly in knowledge management and data mining. He is a member of the editorial board of the journal Computational Economics.

M. Holsheimer is President of Data Distilleries. Previously Holsheimer spent several years at CWI, the Dutch Research Center for Mathematics and Computer Science. In 1993 he was asked to start data mining research at CWI, one of the first European centers to do so and now a leading institute in this area. Since the second half of the 1990s major banks and insurance companies in the Netherlands have expressed their need for data mining software and consultancy. Together with Martin Kersten and Arno Siebes, Holsheimer founded Data Distilleries in the summer of 1995.
