Homogeneity analysis using absolute deviations

https://doi.org/10.1016/j.csda.2004.03.007

Abstract

Homogeneity analysis is a technique for making graphical representations of categorical multivariate data sets. Such data sets can also be represented by the adjacency matrix of a bipartite graph. Homogeneity analysis optimizes a weighted least-squares criterion, and the optimal graph layout is computed by an alternating least squares algorithm. Heiser (Comput. Statist. Data Anal. (1987) 337) looked at homogeneity analysis under a criterion that is more robust to outliers, namely a weighted least absolute deviations criterion. In this paper, we take an in-depth look at the mathematical structure of this problem and show that the graph drawings are created by reciprocal computation of multivariate medians. Several algorithms for computing the solution are investigated, and applications to actual data suggest that the resulting p-dimensional drawings (p⩾2) are degenerate, in the sense that all object points are clustered at p+1 locations. We also examine some variations of the criterion used and conclude that the degenerate solutions observed are a consequence of the normalization constraint employed in this class of problems.

Introduction

Homogeneity Analysis (also known as Multiple Correspondence Analysis (MCA)) is a well-known technique for making graphical representations of categorical multivariate data (Gifi, 1990). It can also be presented as a technique for producing informative layouts of bipartite graphs (Michailidis and de Leeuw, 1998; De Leeuw and Michailidis, 2000a, 2000b).

The setting is as follows: data have been collected for N objects on J categorical variables, with $k_j$ categories for variable j. Let $K=\sum_{j=1}^{J}k_j$ be the total number of categories in the data set. Then a graph G, with nodes (vertices) corresponding to the N objects and the K categories, and with edges linking the object nodes to the category nodes (thus reflecting which objects belong to which categories), contains the same information as the original data set. The latter information is usually represented in matrix form through a binary (0-1) matrix $W=\{w_{ij}\,|\,i=1,\ldots,N,\ j=1,\ldots,K\}$. It can easily be shown that the $(N+K)\times(N+K)$ matrix
$$A=\begin{pmatrix}0 & W\\ W' & 0\end{pmatrix}$$
corresponds to the adjacency matrix of our graph. The multivariate data graph G defined above, with vertex set V and edge set E, has a special structure, namely that the N nodes corresponding to the objects are not connected among themselves, and similarly for the K category nodes. This can also be seen from the two zero submatrices in the adjacency matrix A of G. Thus, we are dealing with a bipartite graph.
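To make the construction concrete, the following minimal sketch (not from the paper; the toy data and category labels are invented for illustration) builds the indicator matrix W and the bipartite adjacency matrix A with NumPy:

```python
import numpy as np

# Toy categorical data: N = 4 objects, J = 2 variables (labels invented).
data = [
    ["red", "small"],
    ["blue", "small"],
    ["red", "large"],
    ["green", "large"],
]

# Enumerate the categories of each variable; K = sum of k_j.
categories = []  # list of (variable index, category label)
for j in range(len(data[0])):
    for level in sorted({row[j] for row in data}):
        categories.append((j, level))

N, K = len(data), len(categories)

# Indicator matrix W: w_ij = 1 iff object i belongs to category j.
W = np.zeros((N, K))
for i, row in enumerate(data):
    for j, (var, level) in enumerate(categories):
        if row[var] == level:
            W[i, j] = 1.0

# Adjacency matrix of the bipartite graph G: A = [[0, W], [W', 0]].
A = np.block([[np.zeros((N, N)), W],
              [W.T, np.zeros((K, K))]])

print(W)        # each row contains exactly J ones
print(A.shape)  # (N + K, N + K)
```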

A drawing of the graph G is a mapping of its vertex set V into p-dimensional space. Adjacent points in the graph are connected by lines in the drawing. This goes in the direction of making a picture of the data, and when things work out well, a picture is worth a lot of numbers, especially when these numbers are just zeros and ones as several examples in the literature have shown (Gifi, 1990; Michailidis and de Leeuw, 1998).

The quality of the drawing is measured by the loss function
$$\mathrm{pull}_2(X,Y)=\sum_{i=1}^{N}\sum_{j=1}^{K}w_{ij}\,d^2(x_i,y_j),\qquad(1.1)$$
where the $x_i$ contain the coordinates of the N objects, the $y_j$ the coordinates of the K categories of all the variables in p-dimensional space, and d denotes the Euclidean distance. The objective is to arrange the vertices (objects and categories) of the graph in such a way that the loss is small. Thus points which are connected by lines should be close, i.e. the lines in the drawing should be short.
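Loss (1.1) translates directly into code. In the hypothetical helper below, W, X and Y are NumPy arrays of shapes N×K, N×p and K×p; power=1 recovers the pull_1 loss discussed below:

```python
import numpy as np

def pull(W, X, Y, power=2):
    """Weighted sum of Euclidean distances between connected object
    points X and category points Y.  power=2 gives pull_2 (eq. 1.1);
    power=1 gives pull_1 (eq. 1.2)."""
    # dist[i, j] = ||x_i - y_j||
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return np.sum(W * dist**power)
```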

If we design algorithms to minimize pull_2(X,Y), then we must make sure that the perfect, but trivial, solution X=Y=0 is excluded. This is done by imposing normalization constraints. For example, in MCA drawings are normalized by requiring that X′X=I. Under this normalization the solution to problem (1.1) is characterized by the centroid principle (Gifi, 1990), namely that the category points are located at the center of gravity of the objects they belong to. An additional advantage of this normalization is that the optimal solution is given by an eigenvalue problem (Gifi, 1990). The p=2-dimensional solutions for the Guttman–Bell and sleeping bags data sets (for their description see Section 4), which illustrate the centroid principle, are given in Fig. 1.
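The centroid principle suggests a simple reciprocal-averaging scheme. The loop below is an illustrative alternating least squares sketch under the X′X=I constraint (a simplification for exposition, not the exact algorithm of the MCA literature; it assumes every object and category has at least one edge):

```python
import numpy as np

def mca_als(W, p=2, iters=200, seed=0):
    """Alternating least squares for pull_2 under X'X = I (sketch).
    Y-step: centroid principle -- each category point is the average
    of the object points connected to it.  X-step: each object moves
    to the average of its category points; X is then centered and
    re-orthonormalized."""
    rng = np.random.default_rng(seed)
    N, K = W.shape
    X = np.linalg.qr(rng.standard_normal((N, p)))[0]  # X'X = I
    for _ in range(iters):
        Y = (W.T @ X) / W.sum(axis=0)[:, None]  # centroids of objects
        Z = (W @ Y) / W.sum(axis=1)[:, None]    # averages of categories
        Z -= Z.mean(axis=0)                     # exclude the trivial solution
        X = np.linalg.qr(Z)[0]                  # project back onto X'X = I
    return X, Y
```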

However, MCA has a few drawbacks; the major ones are: (i) the influence of objects with ‘rare’ profiles that tend to dominate the solution (Michailidis and de Leeuw, 1998), as can be seen on the left part of the picture for the Guttman–Bell drawing and (ii) the presence of horseshoes (De Leeuw et al., 1980).

One possible solution to these problems is to use a more ‘robust’ loss function, such as
$$\mathrm{pull}_1(X,Y)=\sum_{i=1}^{N}\sum_{j=1}^{K}w_{ij}\,d(x_i,y_j),\qquad(1.2)$$
i.e. the same loss function as (1.1), but without squaring the distance. The same normalization is used as before, requiring that X′X=I. This is a special case of a very general framework introduced in Michailidis and de Leeuw (2001), in which the square of the distance in the definition of the loss function (1.1) is replaced by a general function φ(d). Robust estimation has a very long history in statistics (Huber, 1981). The case (1.2) was discussed earlier by Heiser (1987), in the context of correspondence analysis (the graphical representation of a two-way table), who gave an algorithm and an example that correspond to our framework. The example showed clustering, in the sense that many of the objects and categories in the optimal drawing on the plane were collapsed into single points, and only very few distinct points were left. Heiser (1987, p. 349) made the following comments regarding this clustering phenomenon.

How should we appreciate this result? There are perhaps two views. One is that in the process of mapping the original table into a spatial configuration too much of the fine detail is lost, and that the approach leads to a dead end. The other is that it appears to be possible to devise a class of clustering techniques that is smoothly related to a more continuous representation, and that seems to avoid the usual combinatorial complications.

In Fig. 2, the optimal graph drawings of the Guttman–Bell and sleeping bags data sets under loss function (1.2) are shown. In both cases a very strong clustering pattern emerges for the object points; i.e. all of the object nodes occupy only three locations. On the other hand, the category points still seem to obey some form of the centroid principle for the Guttman–Bell example.

Experience with many other categorical data sets, with varying numbers of objects, variables and categories per variable, confirms the above empirical finding, namely that the optimal 2-dimensional layout places the object points at only three locations (Michailidis and de Leeuw, 2000). Analogously, the three-dimensional layouts consist of four object locations. Finally, for p=1 the analogous result, that the optimal solution consists of two points only, is rigorously proven in De Leeuw and Michailidis (2003). Obviously, such solutions are totally uninteresting from a data analysis point of view, since they are unable to uncover interesting patterns in the data. Hence, it is of great interest to gain insight into the origins of this phenomenon and to examine possible alternatives that overcome the problem.

The paper is organized as follows: Section 2 discusses the structure of the loss function (1.2) and presents several optimization algorithms for computing the optimal solution. In Section 3, the structure of the optimal solution is investigated and in Section 4 the performance of the various algorithms is examined. Finally, in Section 5 we look into other loss functions and present some potential solutions to the strong clustering problem observed.


The loss function and its optimization

Our objective is to minimize loss function (1.2) over all N×p matrices X satisfying X′X=I and over all K×p matrices Y. The N×K matrix $W=\{w_{ij}\}$ is the off-diagonal part of the adjacency matrix A of the bipartite multivariate data graph G.

For purposes of regularization, to avoid problems with differentiability and division by zero, we actually define
$$d_\varepsilon^2(x_i,y_j)\triangleq(x_i-y_j)'(x_i-y_j)+\varepsilon^2$$
throughout, where ε is small, and we minimize
$$\mathrm{pull}_{(1,\varepsilon)}(X,Y)=\sum_{i=1}^{N}\sum_{j=1}^{K}w_{ij}\,d_\varepsilon(x_i,y_j).$$
In the remainder of the paper we will
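The regularized loss is straightforward to evaluate; a minimal sketch, with the default value of ε chosen arbitrarily for illustration:

```python
import numpy as np

def pull1_eps(W, X, Y, eps=1e-6):
    """Regularized pull_1: sum_ij w_ij * sqrt(||x_i - y_j||^2 + eps^2).
    The eps term keeps the loss differentiable everywhere and avoids
    division by zero in gradient or majorization updates."""
    sq = np.sum((X[:, None, :] - Y[None, :, :])**2, axis=2)
    return np.sum(W * np.sqrt(sq + eps**2))
```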

Guttman–Bell dataset

This small dataset, dealing with attitudes of social groups (also analyzed in Guttman (1968) and in Gifi (1990)), consists of 7 objects and 5 variables with a total of K=17 categories. In Fig. 3, the homogeneity analysis solutions under the pull_2 and the pull_1 loss functions are given. The lines indicate which objects under the pull_2 solution are mapped to the three points that describe the pull_1 solution. This mapping of the object points to one of the 3 locations is completely determined by the

The structure of the optimal solution

The block relaxation algorithm has provided insight into the structure of the optimal solution with respect to the category points. Since the $y_j$'s must correspond to multivariate medians, their position in the optimal graph layout is completely determined by this requirement.
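Since each category point is a multivariate (spatial) median of the object points connected to it, it can be computed by Weiszfeld's classical iteration. A minimal sketch, using the same ε-regularization as above:

```python
import numpy as np

def spatial_median(points, weights=None, eps=1e-6, iters=100):
    """Weiszfeld iteration for the weighted multivariate (spatial) median:
    the point minimizing the weighted sum of Euclidean distances to
    `points`.  Each category point y_j in the optimal layout solves such
    a problem over the object points connected to it."""
    points = np.asarray(points, dtype=float)
    w = np.ones(len(points)) if weights is None else np.asarray(weights, float)
    y = np.average(points, axis=0, weights=w)  # start from the weighted mean
    for _ in range(iters):
        dist = np.sqrt(np.sum((points - y)**2, axis=1) + eps**2)
        coef = w / dist
        y = (coef[:, None] * points).sum(axis=0) / coef.sum()
    return y
```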

Moreover, on the basis of extensive numerical experience (see previous section and also Michailidis and de Leeuw (2000)) we make the following conjecture.

Conjecture 4.1

The p-dimensional optimal solution X that minimizes the pull(X,Y)

Discussion: other loss functions and potential solutions

The pull_1(X,Y) function used so far is a special case of the more general class of functions defined by
$$\mathrm{pull}_\beta(X,Y)=\sum_{i=1}^{N}\sum_{j=1}^{K}w_{ij}\,d(x_i,y_j)^\beta,\qquad\beta\in[1,2].$$
This is a family of convex functions with growth rates slower than quadratic. The class contains as extreme cases both the pull_2 and the pull_1 functions. An application of Young's inequality shows that we can construct a majorization algorithm to minimize members of this class under the X′X=I normalization constraint. Specifically we have that d(x_i,y_j
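For illustration, members of the β-family are easy to evaluate numerically with the generic pull helper from the introduction; the toy snippet below (invented 0-1 weights, arbitrary configuration) compares the extreme and an intermediate case:

```python
import numpy as np

rng = np.random.default_rng(0)
W = (rng.random((7, 17)) < 0.3).astype(float)     # toy 0-1 weights
X = np.linalg.qr(rng.standard_normal((7, 2)))[0]  # X'X = I
Y = rng.standard_normal((17, 2))

for beta in (1.0, 1.5, 2.0):
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    print(beta, np.sum(W * dist**beta))  # pull_beta for this drawing
```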

Acknowledgements

The authors would like to thank the AE and an anonymous referee for useful suggestions that improved the presentation. The work of George Michailidis was supported in part by NSF under grants IIS-9988095 and DMS-0214171.

References

  • Heiser, W.J., 1987. Correspondence analysis with least absolute residuals. Comput. Statist. Data Anal.
  • De Leeuw, J., Michailidis, G., 2000a. Graph layout techniques and multidimensional data analysis. In: Bruss, F.T., Le...
  • De Leeuw, J., et al., 2000. Block relaxation algorithms in statistics. J. Comput. Graphical Statist.
  • De Leeuw, J., Michailidis, G., 2003. Weber correspondence analysis: the one dimensional case. Technical Report #343,...
  • De Leeuw, J., et al., 1980. Nonlinear principal components analysis with B-splines. Methods Oper. Res.
  • Drezner, Z. (Ed.), 1995. Facility Location. Springer, New...
  • Edelman, A., et al., 1999. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl.
  • Gifi, A., 1990. Nonlinear Multivariate Analysis.
  • Guttman, L., 1968. A general non-metric technique for fitting the smallest coordinate space for a configuration of points. Psychometrika.
  • Huber, P., 1981. Robust Statistics.
  • Kiers, H.A.L., 1997. Weighted least squares fitting using ordinary least squares algorithms. Psychometrika.