Fuzzy clusterwise linear regression analysis with symmetrical fuzzy output variable

doi:10.1016/j.csda.2006.06.001

Computational Statistics & Data Analysis

Volume 51, Issue 1, 1 November 2006, Pages 287-313

https://doi.org/10.1016/j.csda.2006.06.001

Abstract

Traditional regression analysis is usually applied to homogeneous observations. However, in many real situations the observations are not homogeneous, and traditional regression then fits the data poorly. To improve the goodness of fit in these cases, it is more suitable to apply the so-called clusterwise regression analysis. The aim of clusterwise linear regression analysis is to embed clustering techniques into regression analysis, so that the clustering methods overcome the heterogeneity problem in regression analysis. Furthermore, by integrating cluster analysis into the regression framework, the regression parameters (regression analysis) and membership degrees (cluster analysis) can be estimated simultaneously by optimizing one single objective function. In this paper clusterwise linear regression is analyzed in a fuzzy framework. In particular, a fuzzy clusterwise linear regression model (FCWLR model) with symmetrical fuzzy output and crisp input variables is suggested for performing fuzzy cluster analysis within a fuzzy linear regression framework. A fitting index is proposed for measuring the goodness of fit of the suggested FCWLR model with fuzzy output. To illustrate the usefulness of the FCWLR model in practice, several applications to artificial and real datasets are shown.

Introduction

From a statistical perspective, regression analysis is used to study the dependence relationship between a real phenomenon (dependent variable or output variable) and other real phenomena (explanatory variables, independent variables or input variables). Traditional regression analysis is suitable when the observations are homogeneous. However, in many real cases the observations are not homogeneous, and traditional regression then yields a poorly fitting model. To improve the goodness of fit, it is more suitable to use the so-called clusterwise regression analysis, in which clustering techniques are embedded into regression analysis. In this way, the clustering methods are used to overcome the heterogeneity problem in regression analysis. To explain more clearly the aim and the practical usefulness of clusterwise regression analysis, we consider the following illustrative example of clusterwise regression applied to a market segmentation problem in business, drawn from Lau et al. (1999): “The manager collects a sample of the sales and income data from a set of customers. If the customers have homogeneous income elasticity (i.e., the regression coefficient $β$), $β$ can simply be estimated by regression of sales on income. In real business, customers are heterogeneous and income elasticity will vary with customers of different clusters in the sample. The major tasks for the manager are: (i) use the income elasticity as the basis to divide customers into mutually exclusive segments, (ii) estimate the average income elasticity for each segment, (iii) identify the members of each segment. If we ignore the income elasticity differences among segments, the income elasticity estimated from the regression of sales on income will certainly be biased and inaccurate.
In other words, if we want to model the parameter heterogeneity in the traditional regression, the appropriate statistical analysis will involve the simultaneous applications of the cluster analysis and regression model. One straightforward approach is the two stage method. In stage 1, we apply cluster analysis to the dataset to divide customers into segments. In stage 2, we perform regression for each segment to estimate the income elasticity. The problem is that the functions optimized in stages 1 and 2 are two different objective functions which are not necessarily related. A better formulation is to integrate the cluster analysis into regression framework, so that the income elasticities and segment membership parameters can be estimated simultaneously by optimizing one single objective function”.

In the literature, there are many theoretical works on clusterwise regression analysis (see, for example, De Sarbo and Cron, 1988; De Sarbo et al., 1989; De Veaux, 1989; Hathaway and Bezdek, 1993; Hathaway et al., 1996; Hennig, 2000; Hennig, 2003; Hong and Chao, 2002; Lau et al., 1999; Leşki, 2004; Preda and Saporta, 2005; Quandt and Ramsey, 1978; Shao and Wu, 2005; Spath, 1979; Yang and Ko, 1997; Van Aelest et al., 2006; Wedel and De Sarbo, 1995). Furthermore, clusterwise regression analysis finds application in several fields, such as market segmentation and business, socio-economics, biology, engineering, and so on (see, for instance, Aurifeille and Quester, 2003; De Sarbo and Cron, 1988; Hosmer, 1974; Lau et al., 1999; Wedel and Steenkamp, 1991).

In this paper clusterwise linear regression is analyzed in a fuzzy framework. In particular, we propose a fuzzy clusterwise linear regression model (FCWLR model) with symmetrical fuzzy output and crisp input variables for performing fuzzy cluster analysis within a fuzzy linear regression framework. We build our FCWLR model by combining Bezdek's approach to fuzzy cluster analysis (Bezdek, 1981) with the linear regression model with fuzzy output variable $(\tilde{Y})$ and crisp explanatory variables $(X_{1}, \dots, X_{k})$ suggested by Coppi and D’Urso (2003): $\left\{\begin{array}{l} m_{i} = m_{i}^{*} + e_{i}, \quad m_{i}^{*} = x_{i}^{'} a, \\ {}_{(-)}s_{i} = {}_{(-)}s_{i}^{*} + {}_{(-)}ε_{i}, \quad {}_{(-)}s_{i} = m_{i} - l_{i}, \quad {}_{(-)}s_{i}^{*} = m_{i}^{*} - l_{i}^{*}, \quad l_{i}^{*} = m_{i}^{*} b + d, \\ {}_{(+)}s_{i} = {}_{(+)}s_{i}^{*} + {}_{(+)}ε_{i}, \quad {}_{(+)}s_{i} = m_{i} + l_{i}, \quad {}_{(+)}s_{i}^{*} = m_{i}^{*} + l_{i}^{*}, \end{array}\right.$ where $x_{i}^{'}$ is the $(1 \times (k + 1))$ vector containing the scalar 1 and the values of the k crisp input variables observed on the ith unit, $m_{i}$ and $m_{i}^{*}$ are, respectively, the ith observed center and the ith interpolated center, $l_{i}$ and $l_{i}^{*}$ are, respectively, the ith observed spread and the ith interpolated spread, $a$ is the $((k + 1) \times 1)$ vector of regression parameters for $m_{i}$, $b$ and $d$ are the regression parameters for the other models, and $e_{i}, {}_{(-)}ε_{i}, {}_{(+)}ε_{i}$ are the residuals.

In matrix form, we can write the previous model as follows: $\left\{\begin{array}{l} m = m^{*} + e, \quad m^{*} = Xa, \\ {}_{(-)}s = {}_{(-)}s^{*} + {}_{(-)}ε, \quad {}_{(-)}s = m - l, \quad {}_{(-)}s^{*} = m^{*} - l^{*}, \quad l^{*} = m^{*} b + 1 d, \\ {}_{(+)}s = {}_{(+)}s^{*} + {}_{(+)}ε, \quad {}_{(+)}s = m + l, \quad {}_{(+)}s^{*} = m^{*} + l^{*}, \end{array}\right.$ where $1$ is the $(n \times 1)$ vector of all 1's, $X$ is the $(n \times (k + 1))$ matrix containing the vector $1$ concatenated with the k crisp input variables, $m$ and $m^{*}$ are, respectively, the $(n \times 1)$ vectors of observed centers and interpolated centers, $l$ and $l^{*}$ are, respectively, the $(n \times 1)$ vectors of observed spreads and interpolated spreads, $a$ is the $((k + 1) \times 1)$ vector of regression parameters for $m$, $b$ and $d$ are the regression parameters for the other models, and $e, {}_{(-)}ε, {}_{(+)}ε$ are the $(n \times 1)$ vectors of residuals.

Note that the above fuzzy regression model is based on three linear models. The first one interpolates the centers of the fuzzy observations; the second and third ones yield the lower and upper bounds (centers $\pm$ spreads) by building further linear models on top of the first one. The model is hence capable of taking into account possible linear relations between the size of the spreads and the magnitude of the estimated centers. This is often the case in realistic applications, where dependence between centers and spreads is likely to occur (for instance, the uncertainty or fuzziness concerning a measurement may depend on its magnitude) (Coppi and D’Urso, 2003, D’Urso, 2003).
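The three linked linear models can be made concrete with a short numerical sketch. The code below only evaluates the matrix-form relations $m^{*} = Xa$, $l^{*} = m^{*} b + 1 d$ and the bounds $m^{*} \mp l^{*}$ for given parameter values; it does not estimate them. The function names and the toy values of `X`, `a`, `b` and `d` are hypothetical and chosen purely for illustration.

```python
import numpy as np

def fitted_fuzzy_output(X, a, b, d):
    """Interpolated centers, spreads and bounds for given parameters a, b, d."""
    m_star = X @ a                # centers:      m* = X a
    l_star = m_star * b + d       # spreads:      l* = m* b + 1 d
    lower_star = m_star - l_star  # lower bounds: (-)s* = m* - l*
    upper_star = m_star + l_star  # upper bounds: (+)s* = m* + l*
    return m_star, l_star, lower_star, upper_star

def residuals(m, l, X, a, b, d):
    """Residuals e, (-)eps and (+)eps of the three linear models."""
    m_star, _, lower_star, upper_star = fitted_fuzzy_output(X, a, b, d)
    e = m - m_star
    eps_lower = (m - l) - lower_star
    eps_upper = (m + l) - upper_star
    return e, eps_lower, eps_upper

# Toy example (illustrative values only): n = 4 units, k = 1 input variable.
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
a = np.array([1.0, 2.0])  # hypothetical center parameters
b, d = 0.1, 0.5           # hypothetical spread parameters
m_star, l_star, lower_star, upper_star = fitted_fuzzy_output(X, a, b, d)
```

With $b > 0$ the interpolated spreads grow with the interpolated centers, which is the dependence between fuzziness and magnitude described above.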

Furthermore, to assess the performance of the proposed FCWLR model, we suggest a suitable fitting measure, i.e., the $R^{2}$ coefficient.

The paper is organized as follows. In Section 2, we define the fuzzy data, i.e., the symmetrical fuzzy data, and in Section 3 we consider a particular distance measure between symmetrical fuzzy data. Then, in Section 4, we propose an FCWLR model with symmetrical fuzzy output variable and crisp input variables. In particular, we formalize the model and solve the associated optimization problem; furthermore, to measure the fit of our model, we propose the $R^{2}$ coefficient and prove the decomposition of the total deviation. In Section 5, our model is applied to several datasets to show its performance in applications. Some concluding remarks are given in Section 6.

Section snippets

Fuzzy data

Models based on fuzzy data are widely used in several fields. “Sometimes, such models are used as simpler alternatives to probabilistic models (Laviolette et al., 1995). Other times they are, more appropriately, used to study data which, for their intrinsic nature, cannot be known or quantified exactly, and, hence, are correctly regarded as vague or fuzzy. A typical example of fuzzy data is a human judgment or a linguistic term. The concept of fuzzy number can be effectively used to describe

Distance measures between symmetrical fuzzy data

In the literature, several topological measures have been generalized to the fuzzy framework. Restricting our interest to distance measures between fuzzy data, we can consider, e.g., the following references: Bertoluzza et al. (1995), Coppi and D’Urso (2003), D’Urso and Giordani (2006), Diamond and Kloeden (1994), Kim and Kim (2004), Näther (2000), Yang and Ko (1996), Yang and Liu (1999), Tran and Duckstein (2002a, 2002b). In particular, a squared Euclidean distance between a

The model

In a fuzzy framework, there are several real situations in which the fuzzy observations are not homogeneous. For this reason, it is very useful, in these cases, to utilize a FCWLR analysis, in which we embed the techniques of fuzzy clustering into fuzzy regression analysis. In this way, fuzzy clustering methods are utilized for overcoming the heterogeneity problem in fuzzy regression analysis.

In this section, we propose a new FCWLR model.

In our model, for the clustering framework, we follow the

Algorithm

1. We fix $α$ and $r$ and set initial values of $U$, $b$ and $d$.
2. We compute the parameters $A$, $b$, $d$ by means of formulas (4.3.2)–(4.3.4).
3. We compute the new matrix $U$ by means of formula (4.3.1).
4. Letting $U^{(τ)}$ denote the membership degrees matrix at the $τ$th iteration, we compare $U^{(τ)}$ with $U^{(τ + 1)}$ using a convenient matrix norm: if $∥U^{(τ + 1)} - U^{(τ)}∥ < υ$ (where $υ$ is a small positive number fixed by the researcher) stop; otherwise, return to step 2.
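The alternating structure of these steps can be sketched in code. Formulas (4.3.1)–(4.3.4) are not reproduced in this excerpt, so the sketch below substitutes a simpler, well-known stand-in: fuzzy clusterwise regression for a crisp output, with weighted least squares for the cluster coefficients and a Bezdek-type membership update based on squared residuals. It is not the authors' estimator for fuzzy output; the spread parameters $b$, $d$ and the constant $α$ are omitted, and the function name, defaults and toy data are assumptions made for illustration.

```python
import numpy as np

def fuzzy_clusterwise_regression(X, y, n_clusters=2, r=2.0, tol=1e-6,
                                 max_iter=200, seed=0):
    """Crisp-output stand-in for the alternating scheme (not formulas 4.3.1-4.3.4)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Step 1: random initial membership matrix U whose rows sum to 1.
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    A = np.zeros((p, n_clusters))
    for _ in range(max_iter):
        # Step 2: weighted least squares per cluster, with weights u_ic^r.
        for c in range(n_clusters):
            w = U[:, c] ** r
            Xw = X * w[:, None]
            A[:, c] = np.linalg.lstsq(Xw.T @ X, Xw.T @ y, rcond=None)[0]
        # Squared residual of every unit under every cluster's regression.
        E = np.maximum((y[:, None] - X @ A) ** 2, 1e-12)  # avoid division by zero
        # Step 3: Bezdek-type membership update from the squared residuals.
        inv = E ** (-1.0 / (r - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Step 4: stop once the membership matrix barely changes (Frobenius norm).
        converged = np.linalg.norm(U_new - U) < tol
        U = U_new
        if converged:
            break
    return A, U

# Toy data (illustrative only): two crisp regression lines sharing one input.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones(40), np.concatenate([x, x])])
y = np.concatenate([1.0 + 2.0 * x, 10.0 - 1.0 * x])
A, U = fuzzy_clusterwise_regression(X, y)
```

Each column of `A` holds the coefficients of one cluster and each row of `U` the membership degrees of one unit. Like other alternating schemes of this kind, the result may depend on the initial `U`, so several random starts are a common precaution.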

We observe that the convergence of the algorithm depends

Applicative examples

As to the practical utilization of FCWLR model for symmetrical fuzzy output in this field of data analysis, several potential examples might be mentioned, ranging from marketing segmentation to biology and engineering problems.

In the following, we illustrate the performance of our model through several examples. In particular, in Section 5.1, we use our model to interpolate three modified versions of the Yang–Ko dataset (1997). In Section 5.2 we apply our

Conclusions

In this paper, we have suggested a fuzzy clusterwise linear regression model (FCWLR model) for symmetrical fuzzy output. Furthermore, to measure the fit of our model, we have proposed the $R^{2}$ coefficient and proved the decomposition of the total deviation. To show its performance in applications, our model has been applied to different datasets.

Interesting questions for future research include:

1. Simulation study in order to analyze in depth the computational performances of our

Acknowledgment

We would like to express our gratitude to the co-editor and the referees whose comments improved the quality of the paper.

References (39)

J.-M. Aurifeille et al. Predicting business ethical tolerance in international markets: a concomitant clusterwise regression analysis. Internat. Business Rev. (2003)
P. D’Urso. Linear regression analysis for fuzzy/crisp input and fuzzy/crisp output data. Comput. Statist. Data Anal. (2003)
P. D’Urso et al. An “orderwise” polynomial regression procedure for fuzzy data. Fuzzy Sets and Systems (2002)
P. D’Urso et al. A weighted fuzzy c-means clustering model for fuzzy data. Comput. Statist. Data Anal. (2006)
R.D. De Veaux. Mixtures of linear regressions. Comput. Statist. Data Anal. (1989)
C. Hennig. Clusters, outliers, and regression: fixed point clusters. J. Multivariate Anal. (2003)
D.S. Kim et al. Some properties of a new metric on the space of fuzzy numbers. Fuzzy Sets and Systems (2004)
K. Lau et al. A mathematical programming approach to clusterwise regression model and its extensions. European J. Oper. Res. (1999)
C. Preda et al. Clusterwise PLS regression on a stochastic process. Comput. Statist. Data Anal. (2005)
Q. Shao et al. A consistent procedure for determining the number of clusters in regression clustering. J. Statist. Plann. Inference (2005)
L.T. Tran et al. Comparison of fuzzy numbers using a fuzzy distance measure. Fuzzy Sets and Systems (2002)
L.T. Tran et al. Multiobjective fuzzy regression with central tendency and possibilistic properties. Fuzzy Sets and Systems (2002)
M.S. Yang et al. On a class of fuzzy c-numbers clustering procedures for fuzzy data. Fuzzy Sets and Systems (1996)
M.S. Yang et al. Fuzzy clustering procedures for conical fuzzy vector data. Fuzzy Sets and Systems (1999)
C. Bertoluzza et al. On a new class of distance between fuzzy numbers. Mathware Soft Comput. (1995)
J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms (1981)
Cohen, E.A., 1980. Inharmonic tone perception, Unpublished Ph.D. Dissertation, Stanford...
R. Coppi et al. Regression analysis with fuzzy informational paradigm: a least-squares approach using membership function information. Internat. J. Pure Appl. Math. (2003)
Coppi, R., D’Urso, P., Giordani, P., Santoro, A., 2006b. Least squares estimation of a linear regression model with LR...
