Computing, Artificial Intelligence and Information Technology
An approach to generate rules from neural networks for regression problems

https://doi.org/10.1016/S0377-2217(02)00792-0

Abstract

Artificial neural networks have been successfully applied to a variety of business problems involving classification and regression. They are especially useful for regression problems as they do not require prior knowledge about the data distribution. In many applications, it is desirable to extract knowledge from trained neural networks so that users can gain a better understanding of the solution. Existing research has focused primarily on extracting symbolic rules for classification problems, and few methods have been devised for regression problems. In order to fill this gap, we propose an approach to extract rules from neural networks that have been trained to solve regression problems. The extracted rules divide the data samples into groups. For all samples within a group, the network output is approximated by a linear function of the relevant input attributes of the data. The approach is illustrated on two example application problems. Experimental results show that the proposed approach generates rules that are more accurate than existing methods based on decision trees and linear regression.

Introduction

Artificial neural networks are powerful tools for business decision making [2], [5], [6], [15], [20], [21], [25], [27]. They work particularly well for problems involving classification and data fitting/regression. Neural networks often predict with higher accuracy than other techniques because of their capability to approximate any continuous function [3], [8]. One major drawback often associated with neural networks is their lack of explanation power: it is difficult to explain how a network arrives at its solution because of the complex non-linear mapping it applies to the input data. In many applications, it is desirable to extract knowledge from trained neural networks so that users can gain a better understanding of the problem at hand. The extracted knowledge is usually expressed as symbolic rules of the form: if condition, then consequence.

In order to generate comprehensible and useful rules from neural networks that have been trained to predict continuously valued variables, the rules must be sufficiently simple yet accurate. The condition of a rule describes a subregion of the input space, while the consequence is of the form Y=f(X), where f(X) is either a constant or a linear function of X, the attributes of the data. Rules of this type are easy to understand because of their similarity to the traditional statistical approach of parametric regression. Since a single rule will not normally approximate the non-linear mapping of the network well, one possible solution is to divide the input space of the data into subregions. Prediction for all samples in the same subregion is then performed by a single linear equation whose coefficients are determined by the weights of the network connections. With a finer division of the input space, more rules are produced and each rule approximates the network output more accurately. In general, however, too many rules, each of whose conditions is satisfied by only a handful of samples, do not provide meaningful or useful knowledge to the user. Hence, a balance must be struck between rule accuracy and rule simplicity. The sketch following this paragraph illustrates the rule format.
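To make the format concrete, here is a minimal sketch of how such a rule set could be represented and applied. The conditions, boundaries, and coefficients are purely illustrative assumptions, not values produced by any network.

```python
import numpy as np

# Hypothetical rule set in the format described above: each rule pairs a
# condition on the input attributes with a linear predicting function
# f(x) = intercept + coefs . x. All numbers are illustrative only.
rules = [
    (lambda x: x[0] < 0.5,         (1.2,  np.array([0.4, -0.1]))),
    (lambda x: 0.5 <= x[0] < 1.5,  (0.3,  np.array([1.1,  0.2]))),
    (lambda x: x[0] >= 1.5,        (-0.7, np.array([0.9,  0.5]))),
]

def predict(x):
    """Return the prediction of the first rule whose condition x satisfies."""
    for condition, (intercept, coefs) in rules:
        if condition(x):
            return intercept + coefs @ x
    raise ValueError("input falls outside all rule conditions")

print(predict(np.array([0.2, 1.0])))  # first rule applies: 1.2 + 0.4*0.2 - 0.1*1.0
```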

Most existing research has focused on extracting symbolic rules for solving classification problems, where the network outputs are discrete. A regression problem, on the other hand, has a continuous output. Few methods have been devised to extract rules from trained neural networks for regression [22]. Among these methods is a recently proposed algorithm by Setiono et al. [19], where piece-wise linear regression rules are extracted from pruned neural networks. The piece-wise linear equations are obtained as linear combinations of the linearized hidden unit activation functions. The non-linear hidden unit activation function is approximated by either a three-piece or a five-piece linear function chosen to minimize the approximation error, measured as the area bounded by the non-linear function and the approximating linear function. Test results on a wide range of problems show that the method can outperform another method that generates decision trees for regression. In order to reduce the number of regression rules, clustering of the hidden unit activations has been proposed prior to rule generation [17]. For the samples in each cluster, a regression equation is obtained by applying the statistical technique of multiple linear regression.

For applications where one is only interested in obtaining accurate predictions, the trained neural networks will suffice. In other applications, one may want to know more about the relationships between the input variables and the continuous output variable. One feasible approach is to replace the prediction of the trained neural network by a set of multiple linear regression equations without compromising the accuracy of the prediction. Our approach works on a network with a single hidden layer and one linear output unit. We restrict the number of hidden layers to reduce the errors incurred in approximating the hidden unit activation function by a piece-wise linear function, as these errors propagate from the input layer to the output layer through the units in the hidden layers. Experimental evidence indicates that neural networks with just one hidden layer perform as well as those with more than one hidden layer, and that the former are less prone to being trapped at a local minimum of the error function during training [26].

The hidden unit activation function of our neural networks is the hyperbolic tangent function. This function is used because it can be approximated easily by piece-wise linear functions with relatively good accuracy. We attempt to reduce the number of rules by pruning redundant network units. The continuous activation function of each hidden unit is then approximated locally by a three-piece linear function. Unlike our previous method, where this linear function is obtained by minimizing the area bounded by the hyperbolic tangent function and the linear approximating function [19], here we compute the coefficients of the piece-wise linear approximating function such that the sum of squared errors computed over the training data samples is minimized. By minimizing the sum of squared errors, a piece-wise linear approximating function that better fits the training data is obtained. The various combinations of the approximating linear functions divide the input space into subregions such that the function values for all inputs in the same subregion can be computed by a single predicting linear function of the inputs. A simplified sketch of this least-squares fit follows.
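As an illustration of the least-squares fitting idea, the sketch below fits a symmetric three-piece linear function to tanh over a sample of hidden-unit pre-activations. The parameterization (a saturating clip with a single slope coefficient and a breakpoint found by grid search) is a simplifying assumption for illustration; the paper's Eq. (5) may parameterize the three pieces differently.

```python
import numpy as np

def fit_three_piece(xi, grid=np.linspace(0.1, 3.0, 30)):
    """Fit a symmetric, continuous three-piece linear approximation to tanh
    by minimizing the sum of squared errors over the observed inputs xi.
    Assumed form:  L(x) = -c         for x < -x0
                   L(x) = (c/x0)*x  for -x0 <= x <= x0
                   L(x) = c         for x > x0
    The breakpoint x0 is found by grid search, the coefficient c by least squares.
    """
    y = np.tanh(xi)
    best = None
    for x0 in grid:
        # clip encodes all three pieces at once: L(x) = c * clip(x, -x0, x0) / x0,
        # which is linear in the single coefficient c.
        phi = np.clip(xi, -x0, x0) / x0
        c = (phi @ y) / (phi @ phi)          # one-dimensional least squares
        sse = np.sum((c * phi - y) ** 2)
        if best is None or sse < best[0]:
            best = (sse, x0, c)
    return best

xi = 2.0 * np.random.randn(1000)             # stand-in for pre-activation values
sse, x0, c = fit_three_piece(xi)
print(f"breakpoint x0={x0:.2f}, saturation c={c:.3f}, SSE={sse:.3f}")
```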

This paper is organized as follows. The next section describes our neural network training and pruning. In Section 3 we describe how the hidden unit activation function of the network is approximated locally by a three-piece linear function such that the sum of the squared errors of the approximation is minimized. In Section 4 we present our approach which generates a set of regression rules from a neural network. Two examples illustrate how the decision rules can be extracted from the pruned networks in Section 5. In Section 6 we present our results and compare them with those from other methods for regression. Finally, in Section 7 we conclude the paper.


Network training and pruning

We divide the available data samples (x_i, y_i), i = 1, 2, …, where x_i ∈ IR^N and y_i ∈ IR, randomly into a training set, a cross-validation set, and a test set. Using the training data set, a network with a single hidden layer consisting of H units is trained so as to minimize the sum of squared errors E(w,v) augmented with a penalty term P(w,v):

E(w,v) = \sum_{i=1}^{K} (\tilde{y}_i - y_i)^2 + P(w,v),

P(w,v) = \varepsilon_1 \left( \sum_{m=1}^{H} \sum_{l=1}^{N} \frac{w_{ml}^2}{1 + w_{ml}^2} + \sum_{m=1}^{H} \frac{v_m^2}{1 + v_m^2} \right) + \varepsilon_2 \left( \sum_{m=1}^{H} \sum_{l=1}^{N} w_{ml}^2 + \sum_{m=1}^{H} v_m^2 \right),

where K is the number of samples in the training data set, and \varepsilon_1 and \varepsilon_2 are positive penalty parameters.
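Read directly off the reconstructed objective, a sketch of the penalized error computation might look as follows; the weight shapes and the placeholder values of the two penalty parameters are our assumptions.

```python
import numpy as np

def penalized_error(w, v, y_pred, y, eps1=0.1, eps2=1e-4):
    """Sum of squared errors augmented with the penalty term P(w, v) above.
    w: (H, N) input-to-hidden weights, v: (H,) hidden-to-output weights,
    y_pred, y: (K,) predicted and target outputs on the training set.
    eps1 and eps2 are the penalty parameters (placeholder values).
    """
    sse = np.sum((y_pred - y) ** 2)
    p1 = np.sum(w**2 / (1.0 + w**2)) + np.sum(v**2 / (1.0 + v**2))
    p2 = np.sum(w**2) + np.sum(v**2)
    return sse + eps1 * p1 + eps2 * p2
```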

Approximating the network activation function

While a trained and pruned neural network can achieve higher prediction accuracy than other regression methods, it is usually difficult to explain or understand how its prediction is reached because of the non-linearity of the activation function involved in the computation of the network output (Eq. (3)). We attempt to explain the prediction of the network in terms of rules whose consequences are linear functions. The critical step in the rule extraction process is the approximation of the non-linear hidden unit activation function by a simpler, piece-wise linear function.
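For reference, the prediction of such a single-hidden-layer network with a linear output unit (the computation referred to as Eq. (3)) can be sketched as follows; bias terms are omitted for brevity.

```python
import numpy as np

def network_output(x, w, v):
    """Single-hidden-layer network with hyperbolic tangent hidden units and
    one linear output unit:  y~ = sum_m v_m * tanh(w_m . x).
    x: (N,) input, w: (H, N) input-to-hidden weights, v: (H,) output weights.
    """
    return v @ np.tanh(w @ x)
```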

Generating regression rules

A set of linear regression rules can be generated from a pruned network once the network hidden unit activation function tanh(ξ) has been approximated by the three-piece linear function as described in the previous section. The steps to generate the decision rules from a pruned neural network are as follows:

  1. For each hidden unit m = 1, 2, …, H, generate the three-piece linear approximation Lm(ξ) (Eq. (5)).

  2. Using the pair of points −ξm0 and ξm0 from the function Lm(ξ), divide the input space into 3^H subregions; within each subregion, every Lm(ξ) reduces to a single linear function of ξ (see the sketch after this list).
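The sketch below shows how the subregion enumeration in step 2 could be carried out. Since each hidden unit's approximation is linear, say Lm(ξ) = α + βξ, within a subregion, the network output reduces to one linear function of the inputs per subregion. The per-piece (alpha, beta) representation is our assumption, matching the simplified fit sketched earlier.

```python
import itertools
import numpy as np

def region_rules(pieces, w, v):
    """Enumerate the 3^H subregions and the linear predicting function of each.
    pieces[m][k] = (alpha, beta) of piece k of hidden unit m, so that the
    unit's activation on that piece is alpha + beta * (w[m] . x).
    Returns {piece combination: (intercept, coefs)} with the prediction
    y = intercept + coefs . x for every x falling in that subregion.
    """
    H = w.shape[0]
    rules = {}
    for combo in itertools.product(range(3), repeat=H):
        # y = sum_m v_m * (alpha_mk + beta_mk * w_m . x) collapses to one
        # intercept and one coefficient vector per subregion.
        intercept = sum(v[m] * pieces[m][k][0] for m, k in enumerate(combo))
        coefs = sum(v[m] * pieces[m][k][1] * w[m] for m, k in enumerate(combo))
        rules[combo] = (intercept, coefs)
    return rules
```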

Illustrative examples

We show in this section how the decision rules can be extracted from pruned neural networks for two application problems, Machine and Auto-mpg. These networks are selected because, after pruning, they have only one hidden unit and few input units left. The problems are chosen because their data sets have different combinations of attributes: one data set has only continuous attributes, while the other has both continuous and discrete attributes.

For comparison purposes, we also employ SAS to fit multiple linear regression models on the same data.

Experimental results

The proposed approach has been tested on benchmark approximation problems from five different domains. The data sets (see Table 3) are available from the UCI repository [1]. Following common practice in machine learning, and to allow for a consistent comparison with other existing regression methods, we performed a 10-fold cross-validation evaluation on each data set. The data were randomly divided into 10 subsets of equal size. Eight subsets were used for network training, one subset for cross-validation, and one subset for testing.
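A minimal sketch of this evaluation protocol, assuming the folds rotate so that each subset serves once as the test set (which subset plays the cross-validation role in each run is our assumption):

```python
import numpy as np

def ten_fold_splits(n_samples, seed=0):
    """Yield (train, validation, test) index arrays for the 10 runs:
    8 folds for network training, 1 for cross-validation (e.g. deciding
    when to stop training/pruning), and 1 held out for testing.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), 10)
    for i in range(10):
        test = folds[i]
        val = folds[(i + 1) % 10]
        train = np.concatenate([folds[j] for j in range(10)
                                if j not in (i, (i + 1) % 10)])
        yield train, val, test
```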

Conclusion

We have presented an approach that generates a set of linear equations from a neural network that has been trained and pruned for application problems involving regression. Linear equations that predict the continuous target values of data samples are obtained by locally approximating each hidden unit activation function by a three-piece linear function. The approximating piece-wise linear function of each hidden unit is computed such that it minimizes the sum of squared errors on the training data samples.

References (27)

  • S. Dutta et al., Decision support in non-conservative domains: Generalization with neural networks, Decision Support Systems (1994)
  • K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks (1991)
  • R.L. Wilson et al., Bankruptcy prediction using neural networks, Decision Support Systems (1994)
  • C. Blake, C.J. Merz, UCI repository of machine learning databases, Department of Information and Computer Science,...
  • J.R. Coakley et al., Artificial neural networks applied to ratio analysis in the analytical review process, Intelligent Systems in Accounting, Finance and Management (1993)
  • G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems (1989)
  • J.E. Dennis et al., Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1983)
  • V.S. Desai et al., The efficacy of neural networks in predicting returns on stock and bond indices, Decision Sciences (1998)
  • P. Ein-Dor et al., Attributes of the performance of central processing units: A relative performance prediction model, Communications of the ACM (1987)
  • G. John, R. Kohavi, K. Pfleger, Irrelevant features and the subset selection problem, in: Proceedings of the 11th...
  • W. Khattree, D.N. Naik, Applied multivariate statistics with SAS software, SAS Institute, Cary, NC,...
  • D. Kilpatrick et al., Numeric prediction using instance-based learning with encoding length selection, in: Progress in Connectionist-Based Information Systems (1998)
  • M.-C. Ludl, G. Widmer, Relative unsupervised discretization for regression problems, in: R.A. Mantaras, E. Plaza...