Information & Management

Volume 37, Issue 5, August 2000, Pages 271-281

Briefings
Methodological and practical aspects of data mining

https://doi.org/10.1016/S0378-7206(99)00051-8

Abstract

We describe the different stages in the data mining process and discuss some pitfalls, as well as guidelines for circumventing them. Although analysis receives the most attention, data selection and pre-processing are the most time-consuming activities and have a substantial influence on ultimate success. Successful data mining projects require the involvement of expertise in data mining, in the company data, and in the subject area concerned. Despite the attractive suggestion of ‘fully automatic’ data analysis, knowledge of the processes behind the data remains indispensable for avoiding the many pitfalls of data mining.

Introduction

Data mining is receiving more and more attention from the business community, as witnessed by frequent publications in the popular IT press and the growing number of tools appearing on the market. The commercial interest in data mining is mainly due to companies' increasing awareness that the vast amounts of data collected on customers and their behavior contain valuable information. If the hidden information can be made explicit, it can be used to improve vital business processes. Such developments are accompanied by the construction of data warehouses and data marts: integrated databases that are created specifically for the purpose of analysis rather than to support daily business transactions.

Many publications on data mining discuss the construction or application of algorithms to extract knowledge from data. The emphasis is generally on the analysis phase. When a data mining project is performed in an organizational setting, one discovers that there are other important activities in the process. These activities are often more time consuming and have an equally large influence on the ultimate success of the project.

Data mining is a multi-disciplinary field that lies at the intersection of statistics, machine learning, database management, and data visualization. A natural question comes to mind: to what extent does it provide a new perspective on data analysis? This question has received some attention within the community. A popular answer is that data mining is concerned with the extraction of knowledge from very large data sets. In our view, this is not the complete answer. Company databases are indeed often quite large, especially if one considers data on customer transactions. One should, however, take into account that:

  • Once the data mining question is specified accurately, only a small part of this large and heterogeneous database is of interest.

  • Even if the remaining dataset is large, a sample often suffices to construct accurate models, as the sketch below illustrates.
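
To make the second point concrete, the following minimal sketch compares a classifier fitted on an entire synthetic dataset with one fitted on a 5% random sample. It is an illustration only, not taken from the article, and it assumes scikit-learn and NumPy; the generated data merely stand in for a large table of customer records.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for a large table of customer data.
    X, y = make_classification(n_samples=200_000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Model fitted on all available training records.
    full_model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

    # Model fitted on a 5% random sample of the training records.
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X_train), size=len(X_train) // 20, replace=False)
    sample_model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train[idx], y_train[idx])

    print("full data :", accuracy_score(y_test, full_model.predict(X_test)))
    print("5% sample :", accuracy_score(y_test, sample_model.predict(X_test)))

The two test accuracies will typically differ only marginally, which is the point of the bullet above: for many modeling tasks a modest random sample suffices.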

If the contribution of the data mining perspective does not necessarily lie in the size of the dataset, where does it lie? Four aspects are of particular interest:
  • 1.

    There is a growing need for valid methods that cover the whole process (also called Knowledge Discovery in Databases or KDD), from problem formulation to the implementation of actions and monitoring of models. Methods are needed to identify the important steps, and indicate the required expertise and tools. Such methods are required to improve the quality and controllability of the process.

  • 2.

    If data mining is going to be used on a daily basis within organizations, better integration with the existing information systems infrastructure is required. It is, for example, important to couple analysis tools with data warehouses and to integrate data mining functionality with end-user software, such as marketing campaign schedulers.

  • 3.

    From a statistical viewpoint, such data are often of dubious value because of the absence of a study design. Since the data were not collected with a set of analysis questions in mind, they were not sampled from a pre-defined population, and data quality may be insufficient for the analysis requirements. These anomalies call for a study of problems related to the analysis of ‘non-random’ samples, data pollution, and missing data.

  • 4.

    Ease of interpretation is often understood to be a defining characteristic of data mining techniques. The demand for explainable models leads to a preference for techniques such as rule induction, classification trees and, more recently, Bayesian networks. Furthermore, explainable models encourage the explicit involvement of domain experts in the analysis process; a brief sketch of rule extraction from a classification tree follows this list.
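
As an illustration of point 4, the following sketch, which is not taken from the article and assumes scikit-learn together with its bundled iris dataset, fits a small classification tree and prints it as a set of readable if/then rules that a domain expert can inspect.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Fit a deliberately small tree so that the rule set stays short.
    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

    # Render the fitted tree as nested if/then rules, one line per split.
    print(export_text(tree, feature_names=list(iris.feature_names)))

Output of this kind, a handful of explicit conditions on named attributes, is what makes it feasible to discuss a model with a subject area expert.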

Section snippets

Required expertise

Successful data mining projects require a collaborative effort in a number of areas of expertise.

Stages of the data mining process

Data mining is an explorative and iterative process:

  • During data analysis, new knowledge is discovered and new hypotheses are formulated. These may lead to a focussing of the data mining question or to considering alternative questions.

  • During the process one may jump between the different stages; for example from the analysis to the data pre-processing stage.

There is a need for a sound method that describes the important stages and feedbacks of the process [4]. The method should ensure the

Model interpretability

Ease of model interpretation is an important requirement. The widespread use of classification trees and rule induction algorithms in data mining applications and tools aids in the interpretation of results. Often there is a trade-off between ease of model interpretation and predictive accuracy, and the goal of the modeling task determines which quality measure is considered more important. Ease of interpretation has two major advantages:

  • 1.

    The ‘end product’, that is the final model, is easy to
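
The trade-off mentioned above can be illustrated with a small sketch. It is not part of the original article; it assumes scikit-learn, and a random forest, a technique that postdates this paper, merely stands in for a less transparent model.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Interpretable model: a two-level tree that can be printed as a handful of rules.
    shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    # Less transparent model: an ensemble of 200 trees.
    forest = RandomForestClassifier(n_estimators=200, random_state=0)

    print("shallow tree:", cross_val_score(shallow_tree, X, y, cv=5).mean())
    print("forest      :", cross_val_score(forest, X, y, cv=5).mean())

The less interpretable model will usually score somewhat higher; whether that gain justifies giving up an inspectable model depends, as noted above, on the goal of the modeling task.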

Missing data

Data quality is a point of major concern in any information system, and also in the construction of data warehouses and in the subsequent analyses, which range from simple queries to OLAP and data mining [14], [19]. Although all aspects of data quality are relevant to data mining, we confine the discussion to the issue of completeness. If many data values are missing, the quality of the resulting information and models decreases accordingly. Consider the marketing department of a bank that wants to compute the average age
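
The bank example can be made concrete with a minimal sketch. The numbers are invented for illustration and the code assumes pandas and NumPy; the point is that the average age obtained by silently skipping missing values and the one obtained after an explicit imputation step are different quantities, and the choice between them should be a conscious one.

    import numpy as np
    import pandas as pd

    # Hypothetical customer ages; NaN marks customers whose age was never recorded.
    ages = pd.Series([23, 31, np.nan, 45, np.nan, 52, 38, np.nan, 27, 60])

    print("records        :", len(ages))
    print("missing        :", ages.isna().sum())
    print("mean (skip NaN):", ages.mean())                         # pandas skips NaN by default
    print("mean (imputed) :", ages.fillna(ages.median()).mean())   # one simple imputation choice

Which treatment is appropriate depends on why the values are missing; the EM algorithm of Dempster, Laird and Rubin (see the reference list) is one principled approach when a model for the incomplete data is available.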

Legal aspects

In data mining projects where personal data are used, it is important for management to be aware of the legislation concerning privacy. For example, in the Netherlands the National Consumers’ Association recently stated that personal data of Dutch citizens are stored in more than 100 different locations. Therefore, the code of law on privacy will be renewed and reinforced in the near future. The Dutch law on privacy protection (‘Wet Persoons Registratie’) dates from 1980 and it became

Tools

Many of the early data mining tools were almost exclusively concerned with the analysis stage. They were usually derived from algorithms developed in the research community, for example C4.5 [15] and CART [2], supplemented with a user-friendly GUI. An interactive GUI is not just a superficial ‘gimmick’; it encourages the involvement of the subject area expert and improves the efficiency of analysis. For frequent use in business, however, this functionality is insufficient. Also, many early systems require

Conclusions

Data mining or knowledge discovery in databases (KDD) is an exploratory and iterative process that consists of a number of stages. Data selection and data pre-processing are the most time-consuming activities, especially in the absence of a data warehouse. Data mining tools should therefore provide extensive support for data manipulation and combination. They should also provide easy access to the DBMSs in which the source data reside.
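
The call for easy access to the source DBMS can be made concrete with a minimal sketch, which is an illustration rather than the article's proposal: the analysis code pulls exactly the slice of warehouse data it needs through an SQL query. The table and column names are hypothetical, and an in-memory SQLite database stands in for the warehouse.

    import sqlite3
    import pandas as pd

    # A throwaway in-memory database stands in for the corporate data warehouse.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, age INTEGER, segment TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                     [(1, 34, "retail"), (2, 51, "retail"), (3, 45, "business")])

    # The analysis tool selects only the records and attributes it needs.
    frame = pd.read_sql("SELECT age, segment FROM customers WHERE segment = 'retail'", conn)
    print(frame)
    conn.close()

Coupling analysis code to the warehouse in this way keeps data selection explicit and repeatable, instead of relying on ad hoc file extracts.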

The commitment of a subject area expert, data mining expert as


References (19)

  • A. Subramanian et al., Strategic planning for data warehousing, Information and Management (1997)
  • I. Bratko et al., Applications of inductive logic programming, Communications of the ACM (1995)
  • L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, ...
  • A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM algorithm, Journal of the ...
  • U. Fayyad, D. Madigan, G. Piatetsky-Shapiro, P. Smyth, From data mining to knowledge discovery in databases, AI ...
  • U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI ...
  • J.H. Friedman et al., Bump hunting in high-dimensional data, Statistics and Computing (1999)
  • C. Glymour et al., Statistical themes and lessons for data mining, Data Mining and Knowledge Discovery (1997)
  • D.J. Hand, Data mining: statistics and more?, The American Statistician (1998)
There are more references available in the full text version of this article.


A. Feelders is an assistant professor at the Department of Economics and Business Administration of Tilburg University in the Netherlands. He received his Ph.D. in Artificial Intelligence from the same university, where he currently participates in the data mining research program. He worked as a consultant for a Dutch data mining company, where he was involved in many projects for banks and insurance companies. His current research interests include the application of data mining in finance and marketing. His articles have appeared in Computer Science in Economics and Management and IEEE Transactions on Systems, Man and Cybernetics. He is a member of the editorial board of the International Journal of Intelligent Systems in Accounting, Finance, and Management.

H. Daniels is a professor in Knowledge Management at the Erasmus University Rotterdam and an associate professor in Computer Science at the Department of Economics at Tilburg University. He received an M.Sc. in Mathematics from the Technical University of Eindhoven and a Ph.D. in Physics from Groningen University. He also worked as a project manager at the National Dutch Aerospace Laboratory. He has published many articles in international refereed journals, among which the International Journal of Intelligent Systems in Accounting, Finance, and Management, the Journal of Economic Dynamics and Control, and Computer Science in Economics and Management. His current research interest is mainly in knowledge management and data mining. He is a member of the editorial board of the journal Computational Economics.

M. Holsheimer is President of Data Distilleries. Previously Holsheimer spent several years at CWI, the Dutch Research Center for Mathematics and Computer Science. In 1993 he was asked to start data mining research at CWI, one of the first European centers to do so and now a leading institute in this area. Since the second half of the 1990s major banks and insurance companies in the Netherlands have expressed their need for data mining software and consultancy. Together with Martin Kersten and Arno Siebes, Holsheimer founded Data Distilleries in the summer of 1995.
