Computing, Artificial Intelligence and Information Technology
An approach to generate rules from neural networks for regression problems

https://doi.org/10.1016/S0377-2217(02)00792-0

Abstract

Artificial neural networks have been successfully applied to a variety of business problems involving classification and regression. They are especially useful for regression problems as they do not require prior knowledge about the data distribution. In many applications, it is desirable to extract knowledge from trained neural networks so that users can gain a better understanding of the solution. Existing research has focused primarily on extracting symbolic rules for classification problems, and few methods have been devised for regression problems. In order to fill this gap, we propose an approach to extract rules from neural networks that have been trained to solve regression problems. The extracted rules divide the data samples into groups. For all samples within a group, the network output is approximated by a linear function of the relevant input attributes of the data. The approach is illustrated on two example application problems. Experimental results show that the proposed approach generates rules that are more accurate than existing methods based on decision trees and linear regression.

Introduction

Artificial neural networks are powerful tools for business decision making [2], [5], [6], [15], [20], [21], [25], [27]. They work particularly well for problems involving classification and data fitting/regression. Neural networks often predict with higher accuracy than other techniques because of their capability to approximate any continuous function [3], [8]. One major drawback often associated with neural networks is their lack of explanation power: it is difficult to explain how a network arrives at its solution because of the complex non-linear mapping it applies to the input data. In many applications, it is desirable to extract knowledge from trained neural networks so that users can gain a better understanding of the problem at hand. The extracted knowledge is usually expressed as symbolic rules of the form: if condition, then consequence.

In order to generate comprehensible and useful rules from neural networks that have been trained to predict continuously valued variables, the rules must be sufficiently simple yet accurate. The condition of a rule describes a subregion of the input space, while the consequence is of the form Y=f(X), where f(X) is either a constant or a linear function of X, the attributes of the data. Rules of this type are easy to understand because of their similarity to the traditional statistical approach of parametric regression. Since a single rule will not normally approximate the non-linear mapping of the network well, one possible solution is to divide the input space of the data into subregions. Prediction for all samples in the same subregion is then performed by a single linear equation whose coefficients are determined by the weights of the network connections. With a finer division of the input space, more rules are produced and each rule approximates the network output more accurately. In general, however, too many rules, each of whose conditions is satisfied by only a handful of samples, do not provide meaningful or useful knowledge to the user. Hence, a balance must be struck between rule accuracy and rule simplicity. The sketch following this paragraph illustrates the rule format.
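To make the format concrete, here is a minimal sketch of how such a rule set could be represented and applied. The conditions, boundaries, and coefficients are purely illustrative assumptions, not values produced by any network.

```python
import numpy as np

# Hypothetical rule set in the format described above: each rule pairs a
# condition on the input attributes with a linear predicting function
# f(x) = intercept + coefs . x. All numbers are illustrative only.
rules = [
    (lambda x: x[0] < 0.5,         (1.2,  np.array([0.4, -0.1]))),
    (lambda x: 0.5 <= x[0] < 1.5,  (0.3,  np.array([1.1,  0.2]))),
    (lambda x: x[0] >= 1.5,        (-0.7, np.array([0.9,  0.5]))),
]

def predict(x):
    """Return the prediction of the first rule whose condition x satisfies."""
    for condition, (intercept, coefs) in rules:
        if condition(x):
            return intercept + coefs @ x
    raise ValueError("input falls outside all rule conditions")

print(predict(np.array([0.2, 1.0])))  # first rule applies: 1.2 + 0.4*0.2 - 0.1*1.0
```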

Most existing research has focused on extracting symbolic rules for solving classification problems, where the network outputs are discrete. A regression problem, on the other hand, has a continuous output. Few methods have been devised to extract rules from trained neural networks for regression [22]. Among these methods is a recently proposed algorithm by Setiono et al. [19], where piece-wise linear regression rules are extracted from pruned neural networks. The piece-wise linear equations are obtained as linear combinations of the linearized hidden unit activation functions. The non-linear hidden unit activation function is approximated by either a three-piece or a five-piece linear function chosen to minimize the approximation error, measured as the area bounded by the non-linear function and the approximating linear function. Test results on a wide range of problems show that the method can outperform another method that generates decision trees for regression. In order to reduce the number of regression rules, clustering of the hidden unit activations has been proposed prior to rule generation [17]. For the samples in each cluster, a regression equation is obtained by applying the statistical technique of multiple linear regression.

For applications where one is only interested in obtaining accurate predictions, the trained neural networks will suffice. In other applications, one may want to know more about the relationships between the input variables and the continuous output variable. One feasible approach is to replace the prediction of the trained neural network by a set of multiple linear regression equations without compromising the accuracy of the prediction. Our approach works on a network with a single hidden layer and one linear output unit. We restrict the number of hidden layers to reduce the errors incurred in approximating the hidden unit activation function by a piece-wise linear function, as these errors propagate from the input layer to the output layer through the units in the hidden layers. Experimental evidence indicates that neural networks with just one hidden layer perform as well as those with more than one hidden layer, and that the former are less prone to being trapped at a local minimum of the error function during training [26].

The hidden unit activation function of our neural networks is the hyperbolic tangent function. This function is used because it can be approximated easily by piece-wise linear functions with relatively good accuracy. We attempt to reduce the number of rules by pruning redundant network units. The continuous activation function of each hidden unit is then approximated locally by a three-piece linear function. Unlike our previous method, where this linear function is obtained by minimizing the area bounded by the hyperbolic tangent function and the linear approximating function [19], here we compute the coefficients of the piece-wise linear approximating function such that the sum of squared errors computed over the training data samples is minimized. By minimizing the sum of squared errors, a piece-wise linear approximating function that better fits the training data is obtained. The various combinations of the approximating linear functions divide the input space into subregions such that the function values for all inputs in the same subregion can be computed by a single predicting linear function of the inputs. A simplified sketch of this least-squares fit follows.
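As an illustration of the least-squares fitting idea, the sketch below fits a symmetric three-piece linear function to tanh over a sample of hidden-unit pre-activations. The parameterization (a saturating clip with a single slope coefficient and a breakpoint found by grid search) is a simplifying assumption for illustration; the paper's Eq. (5) may parameterize the three pieces differently.

```python
import numpy as np

def fit_three_piece(xi, grid=np.linspace(0.1, 3.0, 30)):
    """Fit a symmetric, continuous three-piece linear approximation to tanh
    by minimizing the sum of squared errors over the observed inputs xi.
    Assumed form:  L(x) = -c         for x < -x0
                   L(x) = (c/x0)*x  for -x0 <= x <= x0
                   L(x) = c         for x > x0
    The breakpoint x0 is found by grid search, the coefficient c by least squares.
    """
    y = np.tanh(xi)
    best = None
    for x0 in grid:
        # clip encodes all three pieces at once: L(x) = c * clip(x, -x0, x0) / x0,
        # which is linear in the single coefficient c.
        phi = np.clip(xi, -x0, x0) / x0
        c = (phi @ y) / (phi @ phi)          # one-dimensional least squares
        sse = np.sum((c * phi - y) ** 2)
        if best is None or sse < best[0]:
            best = (sse, x0, c)
    return best

xi = 2.0 * np.random.randn(1000)             # stand-in for pre-activation values
sse, x0, c = fit_three_piece(xi)
print(f"breakpoint x0={x0:.2f}, saturation c={c:.3f}, SSE={sse:.3f}")
```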

This paper is organized as follows. The next section describes our neural network training and pruning. In Section 3 we describe how the hidden unit activation function of the network is approximated locally by a three-piece linear function such that the sum of the squared errors of the approximation is minimized. In Section 4 we present our approach which generates a set of regression rules from a neural network. Two examples illustrate how the decision rules can be extracted from the pruned networks in Section 5. In Section 6 we present our results and compare them with those from other methods for regression. Finally, in Section 7 we conclude the paper.


Network training and pruning

We divide the available data samples (x_i, y_i), i = 1, 2, …, where x_i ∈ IR^N and y_i ∈ IR, randomly into a training set, a cross-validation set, and a test set. Using the training data set, a network with a single hidden layer consisting of H units is trained so as to minimize the sum of squared errors E(w,v) augmented with a penalty term P(w,v):

E(w,v) = \sum_{i=1}^{K} (\tilde{y}_i - y_i)^2 + P(w,v),

P(w,v) = \varepsilon_1 \left( \sum_{m=1}^{H} \sum_{l=1}^{N} \frac{w_{ml}^2}{1 + w_{ml}^2} + \sum_{m=1}^{H} \frac{v_m^2}{1 + v_m^2} \right) + \varepsilon_2 \left( \sum_{m=1}^{H} \sum_{l=1}^{N} w_{ml}^2 + \sum_{m=1}^{H} v_m^2 \right),

where K is the number of samples in the training data set, and \varepsilon_1 and \varepsilon_2 are positive penalty parameters.
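Read directly off the reconstructed objective, a sketch of the penalized error computation might look as follows; the weight shapes and the placeholder values of the two penalty parameters are our assumptions.

```python
import numpy as np

def penalized_error(w, v, y_pred, y, eps1=0.1, eps2=1e-4):
    """Sum of squared errors augmented with the penalty term P(w, v) above.
    w: (H, N) input-to-hidden weights, v: (H,) hidden-to-output weights,
    y_pred, y: (K,) predicted and target outputs on the training set.
    eps1 and eps2 are the penalty parameters (placeholder values).
    """
    sse = np.sum((y_pred - y) ** 2)
    p1 = np.sum(w**2 / (1.0 + w**2)) + np.sum(v**2 / (1.0 + v**2))
    p2 = np.sum(w**2) + np.sum(v**2)
    return sse + eps1 * p1 + eps2 * p2
```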

Approximating the network activation function

While a trained and pruned neural network can achieve higher prediction accuracy than other regression methods, it is usually difficult to explain or understand how its prediction is reached because of the non-linearity of the activation function involved in the computation of the network output (Eq. (3)). We attempt to explain the prediction of the network in terms of rules whose consequences are linear functions. The critical step in the rule extraction process is the approximation of the non-linear hidden unit activation function by a simpler, piece-wise linear function.
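For reference, the prediction of such a single-hidden-layer network with a linear output unit (the computation referred to as Eq. (3)) can be sketched as follows; bias terms are omitted for brevity.

```python
import numpy as np

def network_output(x, w, v):
    """Single-hidden-layer network with hyperbolic tangent hidden units and
    one linear output unit:  y~ = sum_m v_m * tanh(w_m . x).
    x: (N,) input, w: (H, N) input-to-hidden weights, v: (H,) output weights.
    """
    return v @ np.tanh(w @ x)
```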

Generating regression rules

A set of linear regression rules can be generated from a pruned network once the network hidden unit activation function tanh(ξ) has been approximated by the three-piece linear function as described in the previous section. The steps to generate the decision rules from a pruned neural network are as follows:

  1. For each hidden unit m = 1, 2, …, H, generate the three-piece linear approximation Lm(ξ) (Eq. (5)).

  2. Using the pair of points −ξm0 and ξm0 from the function Lm(ξ), divide the input space into 3^H subregions; within each subregion, every Lm(ξ) reduces to a single linear function of ξ (see the sketch after this list).
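The sketch below shows how the subregion enumeration in step 2 could be carried out. Since each hidden unit's approximation is linear, say Lm(ξ) = α + βξ, within a subregion, the network output reduces to one linear function of the inputs per subregion. The per-piece (alpha, beta) representation is our assumption, matching the simplified fit sketched earlier.

```python
import itertools
import numpy as np

def region_rules(pieces, w, v):
    """Enumerate the 3^H subregions and the linear predicting function of each.
    pieces[m][k] = (alpha, beta) of piece k of hidden unit m, so that the
    unit's activation on that piece is alpha + beta * (w[m] . x).
    Returns {piece combination: (intercept, coefs)} with the prediction
    y = intercept + coefs . x for every x falling in that subregion.
    """
    H = w.shape[0]
    rules = {}
    for combo in itertools.product(range(3), repeat=H):
        # y = sum_m v_m * (alpha_mk + beta_mk * w_m . x) collapses to one
        # intercept and one coefficient vector per subregion.
        intercept = sum(v[m] * pieces[m][k][0] for m, k in enumerate(combo))
        coefs = sum(v[m] * pieces[m][k][1] * w[m] for m, k in enumerate(combo))
        rules[combo] = (intercept, coefs)
    return rules
```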

Illustrative examples

We show in this section how the decision rules can be extracted from pruned neural networks for two application problems, Machine and Auto-mpg. These networks are selected because, after pruning, they have only one hidden unit and few input units left. The problems are chosen because their data sets have different combinations of attributes: one data set has only continuous attributes, while the other has both continuous and discrete attributes.

For comparison purposes, we also employ SAS to fit multiple linear regression models on the same data.

Experimental results

The proposed approach has been tested on benchmark approximation problems from five different domains. The data sets (see Table 3) are available from the UCI repository [1]. Following common practice in machine learning, and to allow for a consistent comparison with other existing regression methods, we performed a 10-fold cross-validation evaluation on each data set. The data were randomly divided into 10 subsets of equal size. Eight subsets were used for network training, one subset for cross-validation, and one subset for testing.
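A minimal sketch of this evaluation protocol, assuming the folds rotate so that each subset serves once as the test set (which subset plays the cross-validation role in each run is our assumption):

```python
import numpy as np

def ten_fold_splits(n_samples, seed=0):
    """Yield (train, validation, test) index arrays for the 10 runs:
    8 folds for network training, 1 for cross-validation (e.g. deciding
    when to stop training/pruning), and 1 held out for testing.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), 10)
    for i in range(10):
        test = folds[i]
        val = folds[(i + 1) % 10]
        train = np.concatenate([folds[j] for j in range(10)
                                if j not in (i, (i + 1) % 10)])
        yield train, val, test
```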

Conclusion

We have presented an approach that generates a set of linear equations from a neural network that has been trained and pruned for application problems involving regression. Linear equations that predict the continuous target values of data samples are obtained by locally approximating each hidden unit activation function by a three-piece linear function. The approximating piece-wise linear function of each hidden unit is computed such that it minimizes the sum of squared errors on the training data samples.

References (27)

  • S. Dutta et al., Decision support in non-conservative domains: Generalization with neural networks, Decision Support Systems (1994)
  • K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks (1991)
  • R.L. Wilson et al., Bankruptcy prediction using neural networks, Decision Support Systems (1994)
  • C. Blake, C.J. Merz, UCI repository of machine learning databases, Department of Information and Computer Science,...
  • J.R. Coakley et al., Artificial neural networks applied to ratio analysis in the analytical review process, Intelligent Systems in Accounting, Finance and Management (1993)
  • G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems (1989)
  • J.E. Dennis et al., Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1983)
  • V.S. Desai et al., The efficacy of neural networks in predicting returns on stock and bond indices, Decision Sciences (1998)
  • P. Ein-Dor et al., Attributes of the performance of central processing units: A relative performance prediction model, Communications of the ACM (1987)
  • G. John, R. Kohavi, K. Pfleger, Irrelevant features and the subset selection problem, in: Proceedings of the 11th...
  • W. Khattree, D.N. Naik, Applied multivariate statistics with SAS software, SAS Institute, Cary, NC,...
  • D. Kilpatrick et al., Numeric prediction using instance-based learning with encoding length selection, in: Progress in Connectionist-Based Information Systems (1998)
  • M.-C. Ludl, G. Widmer, Relative unsupervised discretization for regression problems, in: R.A. Mantaras, E. Plaza...