Skip to main content
Log in

Model-based clustering of multivariate ordinal data relying on a stochastic binary search algorithm

  • Published:
Statistics and Computing Aims and scope Submit manuscript

Abstract

We design a probability distribution for ordinal data by modeling the process generating data, which is assumed to rely only on order comparisons between categories. Contrariwise, most competitors often either forget the order information or add a non-existent distance information. The data generating process is assumed, from optimality arguments, to be a stochastic binary search algorithm in a sorted table. The resulting distribution is natively governed by two meaningful parameters (position and precision) and has very appealing properties: decrease around the mode, shape tuning from uniformity to a Dirac, identifiability. Moreover, it is easily estimated by an EM algorithm since the path in the stochastic binary search algorithm can be considered as missing values. Using then the classical latent class assumption, the previous univariate ordinal model is straightforwardly extended to model-based clustering for multivariate ordinal data. Parameters of this mixture model are estimated by an AECM algorithm. Both simulated and real data sets illustrate the great potential of this model by its ability to parsimoniously identify particularly relevant clusters which were unsuspected by some traditional competitors.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

Notes

  1. Ranking data occur when some judges are asking to sort m objects, quoted by \(1,\ldots ,m\), according to a preference order. Thus, a ranking data is a permutation of the whole data set \(\{1,\ldots ,m\}\). Ordinal data occur when some judges are asking to give a note among the set of ordered notes \(1<\cdots <m\). Thus, an ordinal data is a unique element of \(\{1,\ldots ,m\}\).

  2. We will do the slight abuse of notation \(\hbox {p}(x;\mu ,\pi )=\hbox {p}\Big (\{x\};\mu ,\pi \Big )\).

  3. For avoiding the label cluster dependency, we keep the best distance over all label permutations.

  4. Note that a two component multinomial model is not performed in this univariate situation since it is not identifiable.

  5. http://www.aeres-evaluation.com/

  6. All partitions are better displayed in the first and the third axes than in the first and second axes.

  7. We conjecture that all propositions hold for any m value. In particular, we have proved the simplest of them (Propositions (5.1), (5.2), (5.4)) in this general situation (these general proofs are not given in this paper).

References

  • Agresti, A.: Analysis of Ordinal Categorical Data. Wiley Series in Probability and Statistics. Wiley-Interscience, New York (2010)

    Book  MATH  Google Scholar 

  • Allman, E.S., Matias, C., Rhodes, J.A.: Identifiability of parameters in latent structure models with many observed variables. Ann. Stat. 37(6A), 3099–3132 (2009)

    Article  MathSciNet  MATH  Google Scholar 

  • Bartholomew, D.J., Knott, M., Moustaki, I.: Latent Variable Models and Factor Analysis: A Unified Approach. Wiley, Hoboken (2011)

    Book  MATH  Google Scholar 

  • Biernacki, C., Jacques, J.: A generative model for rank data based on sorting algorithm. Comput. Stat. Data Anal. 58, 162–176 (2013)

    Article  MathSciNet  Google Scholar 

  • Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. J. Pattern Recogn. Soc. 28, 781–793 (1995)

    Article  Google Scholar 

  • D’Elia, A., Piccolo, D.: A mixture model for preferences data analysis. Comput. Stat. Data Anal. 49(3), 917–934 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  • Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977). With discussion

    MathSciNet  MATH  Google Scholar 

  • Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Soc. 97, 611–612 (2002)

    Article  MathSciNet  MATH  Google Scholar 

  • Giordan, M., Diana, G.: A clustering method for categorical ordinal data. Commun. Stat. Theory Methods 40, 1315–1334 (2011)

    Article  MathSciNet  MATH  Google Scholar 

  • Goodman, L.A., Kruskal, W.H.: Measures of association for cross classifications. J. Am. Stat. Assoc. 49, 732–764 (1954)

    MATH  Google Scholar 

  • Goodman, L.A.: Explanatory latent structure models using both identifiable and unidentifiable models. Biometrika 61, 215–231 (1974)

    Article  MathSciNet  MATH  Google Scholar 

  • Gouget, C.: Utilisation des modèles de mélange pour la classification automatique de données ordinales. Ph.D. thesis, Université de Technologie de Compiègne (2006)

  • Hartigan, J.A., Wong, M.A.: Algorithm as 1326: a k-means clustering algorithm. Appl. Stat. 28, 100–108 (1978)

    Article  MATH  Google Scholar 

  • Iannario, M., Piccolo, D.: A program in r for cub models inference. Technical report, Universita di Napoli Federico II, Version 2.0, www.dipstat.unina.it (2009)

  • Jacques, J., Biernacki, C.: Model-based clustering for multivariate partial ranking data. J. Stat. Plan. Inference 149, 201–217 (2014)

    Article  MathSciNet  MATH  Google Scholar 

  • Jacques, J., Grimonprez, Q., Biernacki, C.: Rankcluster: an r package for clustering multivariate partial rankings. R J. (in press) (2014)

  • Jacques, J., Preda, C.: Functional data clustering: a survey. Adv. Data Anal. Classif. 8(3), 231–255 (2014)

    Article  MathSciNet  Google Scholar 

  • Jollois, F-X., Nadif, M.: Classification de données ordinales : modèles et algorithmes. In: Proceedings of the 41th Conference of the French Statistical Society, Bordeaux, France (2011)

  • Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, Hoboken (1990)

    Book  Google Scholar 

  • Kendall, M.G.: The treatment of ties in ranking problems. Biometrika 33, 239–251 (1945)

    Article  MathSciNet  MATH  Google Scholar 

  • Knuth, D.E.: Sorting and Searching: The Art of Computer Programming, vol. 3, 2nd edn. Addison-Wesley Professional, Boston (1998)

    MATH  Google Scholar 

  • Lewis, S.J.G., Foltynie, T., Blackwell, A.D., Robbins, T.W., Owen, A.M., Barker, R.A.: Heterogeneity of parkinson’s disease in the early clinical stages using a data driven approach. J. Neurol. Neurosurg. Psychiatry 76, 343–348 (2003)

    Article  Google Scholar 

  • Lipsitz, S.R., Kim, K., Zhao, L.: Analysis of repeated categorical data using generalized estimating equations. Stat. Med. 13, 1149–1163 (1994)

    Article  Google Scholar 

  • Manisera, M., Zuccolotto, P.: Modeling rating data with nonlinear CUB models. Comput. Stat. Data Anal. 78, 100–118 (2014)

  • Marbac, M., Biernacki, C., Vandewalle, V.: Model-based clustering of Gaussian copulas for mixed data. ArXiv e-prints, May (2014)

  • Matechou, E., Liu, I., Pledger, S., Arnold, R.: Biclustering models for ordinal data. In: NZSA 2011 Conference, Auckland, New-Zealand (2011)

  • McCullagh, P.: Regression models for ordinal data. J. R. Stat. Soc. Ser. B 42, 109–142 (1980)

    MathSciNet  MATH  Google Scholar 

  • McLachlan, G., Peel, D.: Applied Probability and Statistics. Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley-Interscience, New York (2000)

    Book  MATH  Google Scholar 

  • McParland, D., Gormley, C.: Algorithms from and for Nature and Life: Studies in Classification, Data Analysis, and Knowledge Organization, chapter Clustering Ordinal Data via Latent Variable Models. Springer, New York (2013)

    Google Scholar 

  • Melnykov, V., Maitra, R.: Finite mixture models and model-based clustering. Stat. Surv. 4, 80–116 (2010)

    Article  MathSciNet  MATH  Google Scholar 

  • Nelsen, R.B.: An Introduction to Copulas. Springer, New York (1999)

    Book  MATH  Google Scholar 

  • Podani, J.: Braun-blanquet’s legacy and data analysis in vegetation science. J. Veg. Sci. 17, 113–117 (2006)

    Article  Google Scholar 

  • Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

    Article  MathSciNet  MATH  Google Scholar 

  • Somers, R.H.: A new assymmetric measure of association for ordinal variables. Am. Sociol. Rev. 27, 799–811 (1962)

    Article  Google Scholar 

  • Stevens, S.S.: On the theory of scales of measurement. Science 103(2684), 677–680 (1946)

    Article  MATH  Google Scholar 

  • Vermunt, J.K.: A general class of nonparametric models for ordinal categorical data. Sociol. Methodol. 29, 187–223 (1999)

    Article  Google Scholar 

  • Vermunt, J.K., Magidson, J.: Technical Guide for Latent GOLD 4.0: Basic and Advanced. Statistical Innovations Inc., Belmont (2005)

    Google Scholar 

  • Wolfe, J.H.: Pattern clustering by multivariate mixture analysis. Multivar. Behav. Res. 5, 329–350 (1970)

    Article  Google Scholar 

  • Xu, R., Wunsch, D.C.: Clustering. Wiley, Hoboken (2009)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Julien Jacques.

Appendices

Appendix 1: The BOS properties: statements and proofs

From Equations (7), (8) and (10) in Sect. 2.2, it is easily seen that the BOS model has a polynomial expression related to \(\pi \) of degree less than or equal to \(m-1\):

$$\begin{aligned} \hbox {p}(x ; \mu ,\pi ) = \sum _{j=0}^{m-1} a_j(m,\mu ,x) \pi ^{j}. \end{aligned}$$
(18)

From this statement, we have developed a formal Matlab code computing the exact expression of each coefficient \(a_j(m,\mu ,x)\) for any x, \(\mu \), and m values. For instance, for \(m=5\), \(\mu =2\), and \(x=4\), we obtain the following exact expression:

$$\begin{aligned} \hbox {p}(4 ; 2,\pi )= & {} a_0(5,2,4) + a_1(5,2,4)\pi + a_2(5,2,4)\pi ^{2}\\&+\, a_3(5,2,4)\pi ^{3} + a_4(5,2,4)\pi ^4\\= & {} \frac{1}{5} - \frac{33}{200}\pi - \frac{457}{7200}\pi ^{2} + \frac{2}{75}\pi ^{3} + \frac{13}{7200}\pi ^4. \end{aligned}$$

This formal calculus tool is used in the proofs of all the following properties. Its advantage is to provide very straightforward proofs. Its drawback is to provide proofs for only some selected values of m. However, we have chosen \(m\in \{1,\ldots ,8\}\) since it answers to most practical situationsFootnote 7. For instance in the reference book Agresti (2010) all ordinal data have less than 8 categories.

Proposition 5.1

(\(\pi =0\): Uniformity.) If \(\pi =0\), then \(\forall (\mu ,x) \in \{1,\ldots ,m\}^2\), \(\hbox {p}(x;\mu ,\pi )=m^{-1}\).

Proof

The formal calculus leads to \(a_0(m,\mu ,x)=m^{-1}\) for all \(m\in \{1,\ldots ,8\}\). Conclusion follows by setting \(\pi =0\) in (18).

Proposition 5.2

(\(\pi =1\): Dirac in \(\mu \).) If \(\pi =1\), then \(\forall (x,\mu ) \in \{1,\ldots ,m\}^2\), \(\hbox {p}(x;\mu ,\pi )={\mathbb I}(x=\mu )\).

Proof

The formal calculus leads to \(\sum _{j=0}^{m-1}a_j(m,\mu ,\mu )=1\) for all \(m\in \{1,\ldots ,8\}\). Conclusion follows by setting \(\pi =1\) in (18).

Proposition 5.3

(\(\mu \): mode if \(\pi >0\).) If \(\pi >0\), then \(\forall (\mu ,x) \in \{1,\ldots ,m\}^2\) such that \(\mu \ne x\), \(\hbox {p}(\mu ;\mu ,\pi ) > \hbox {p}(x;\mu ,\pi )\).

Proof

For \(\pi =1\), Proposition 5.2 already established that \(\mu \) is the mode. We consider now \(0<\pi <1\). From (18), \(\hbox {p}(\mu ;\mu ,\pi ) - \hbox {p}(x;\mu ,\pi )\) is a polynomial of degree less than or equal to \(m-1\). Moreover, from Proposition 5.1, \(\pi =0\) is a root of \(\hbox {p}(\mu ;\mu ,\pi ) - \hbox {p}(x;\mu ,\pi )\), thus we can factorize it by \(\pi \)

$$\begin{aligned} \hbox {p}(\mu ;\mu ,\pi ) - \hbox {p}(x;\mu ,\pi ) = \pi \left( \sum _{j=0}^{m-2} b_j(m,\mu ,x) \pi ^j \right) .\nonumber \\ \end{aligned}$$
(19)

Since \(0<\pi < 1\) implies that \(0<\pi ^j< 1\) for any \(j>1\), the following lower bound is directly obtained

$$\begin{aligned}&\hbox {p}(\mu ;\mu ,\pi ) - \hbox {p}(x;\mu ,\pi ) \nonumber \\&\quad > \pi \left( b_0(m,\mu ,x) + \sum _{j=1}^{m-2} b_j^-(m,\mu ,x) \right) , \end{aligned}$$
(20)

where \(b_j^-(m,\mu ,x)=\min (0,b_j(m,\mu ,x))\). All \(b_j(m,\mu ,x)\) are obtained from a formal polynomial Euclidian division for all \(m\in \{1,\ldots ,8\}\), allowing to show by formal calculus that \( b_0(m,\mu ,x) + \sum _{j=1}^{m-2} b_j^-(m,\mu ,x) > 0\) for all \(m\in \{1,\ldots ,8\}\). Conclusion follows.

Proposition 5.4

(\(\mu \): absolute growing of its probability with \(\pi \). ) \(\forall \mu \in \{1,\ldots ,m\}\), \(\hbox {p}(\mu ;\mu ,\pi )\) is an increasing function of \(\pi \).

Proof

The formal calculus leads to \(a_j(m,\mu ,\mu )\ge 0\) for all \(m\in \{1,\ldots ,8\}\) and for all \(j\in \{1,\ldots ,m-1\}\). Conclusion follows directly from (18).

Proposition 5.5

( \(\mu \): relative growing of its probability with \(\pi \).) \(\forall (\mu ,x) \in \{1,\ldots ,m\}^2\) such that \(\mu \ne x\), \(\hbox {p}(\mu ;\mu ,\pi )- \hbox {p}(x;\mu ,\pi )\) is an increasing function of \(\pi \).

Proof

The key is to prove the differential (in \(\pi \)) expression \(\hbox {p}'(\mu ;\mu ,\pi )- \hbox {p}'(x;\mu ,\pi )>0\). From (18), \(\hbox {p}'(\mu ;\mu ,\pi ) - \hbox {p}'(x;\mu ,\pi )\) is a polynomial of degree less than or equal to \(m-2\)

$$\begin{aligned} \hbox {p}'(\mu ;\mu ,\pi ) - \hbox {p}'(x;\mu ,\pi ) = \sum _{j=0}^{m-2} b_j(m,\mu ,x) \pi ^j. \end{aligned}$$
(21)

Since \(0<\pi < 1\) implies that \(0<\pi ^j< 1\) for any \(j>1\), the following lower bound is directly obtained

$$\begin{aligned} \hbox {p}'(\mu ;\mu ,\pi ) - \hbox {p}'(x;\mu ,\pi ) > b_0(m,\mu ,x) + \sum _{j=1}^{m-2} b_j^-(m,\mu ,x),\nonumber \\ \end{aligned}$$
(22)

where \(b_j^-(m,\mu ,x)=\min (0,b_j(m,\mu ,x))\). All \(b_j(m,\mu ,x)\) are obtained from a formal derivation of a polynomial for all \(m\in \{1,\ldots ,8\}\), allowing to show by formal calculus again that \( b_0(m,\mu ,x) + \sum _{j=1}^{m-2} b_j^-(m,\mu ,x) > 0\) for all \(m\in \{1,\ldots ,8\}\). Conclusion follows.

Proposition 5.6

(Decreasing around \(\mu \) if \(0<\pi <1\).) \(\forall (x,x') \in \{1,\ldots ,m\}^2\) such that \(x'<x<\mu \) or \(\mu >x>x'\), \(\hbox {p}(x;\mu ,\pi )> \hbox {p}(x';\mu ,\pi )\).

Proof

From (18), \(\hbox {p}(x;\mu ,\pi ) - \hbox {p}(x';\mu ,\pi )\) is a polynomial of degree less than or equal to \(m-1\). Moreover, from Propositions 5.1 and 5.2 respectively, \(\pi =0\) and \(\pi =1\) are roots of \(\hbox {p}(x;\mu ,\pi ) - \hbox {p}(x';\mu ,\pi )\), thus we can factorize it by \(\pi (1-\pi )\)

$$\begin{aligned}&\hbox {p}(x;\mu ,\pi ) - \hbox {p}(x';\mu ,\pi ) \nonumber \\&\quad = \pi (1-\pi )\left( \sum _{j=0}^{m-3} b_j(m,\mu ,x,x') \pi ^j \right) . \end{aligned}$$
(23)

Since \(0<\pi < 1\) implies that \(0<\pi ^j< 1\) for any \(j>1\), the following lower bound is directly obtained

$$\begin{aligned}&\hbox {p}(x;\mu ,\pi ) - \hbox {p}(x';\mu ,\pi ) \nonumber \\&\quad > \pi (1-\pi )\left( b_0(m,\mu ,x,x') + \sum _{j=1}^{m-3} b_j^-(m,\mu ,x,x') \right) ,\nonumber \\ \end{aligned}$$
(24)

where \(b_j^-(m,\mu ,x,x')=\min (0,b_j(m,\mu ,x,x'))\). All \(b_j\) \((m,\mu ,x,x')\) are obtained from a formal polynomial Euclidian division for all \(m\in \{1,\ldots ,8\}\), allowing to show by formal calculus again that \( b_0(m,\mu ,x,x') + \sum _{j=1}^{m-2} b_j^-(m,\mu ,x,\) \(x') > 0\) for all \(m\in \{1,\ldots ,8\}\). Conclusion follows.

Proposition 5.7

(Identifiability) If \(0<\pi \le 1\), identifiability holds.

Proof

The identifiability problem could concern \(\mu \) and/or \(\pi \).

  • First, there exists no couple \((\mu ,\mu ')\in \{1,\ldots ,m\}^2\) with \(\mu \ne \mu '\) such that \(\hbox {p}(x;\mu ,\pi )=\hbox {p}(x;\mu ',\pi ')\) for any \(x\in \{1,\ldots ,m\}\) and any \((\pi ,\pi ')\in (0,1]^2\). Indeed, otherwise both \(\mu \) and \(\mu '\) should be modes of both distributions, which is impossible by unicity of the mode (Propositions 5.3 and 5.6).

  • Second, there exists no couple \((\pi ,\pi ')\in (0,1]^2\) with \(\pi \ne \pi '\) such that \(\hbox {p}(x;\mu ,\pi )=\hbox {p}(x;\mu ,\pi ')\) for any \(x\in \{1,\ldots ,m\}\) and any \(\mu \in \{1,\ldots ,m\}\) since \(\hbox {p}(x;\mu ,\pi )\) is a polynomial of order at least one. This last statement comes from the fact that \(\hbox {p}(x;\mu ,\pi )\) varies from 1 / m to 0 or 1 when \(\pi \) varies (respectively when \(x\ne \mu \) and \(x=\mu \)) from Propositions 5.1 and 5.2, thus depends on the value of \(\pi \).

Appendix 2: AERES data

See Table 11.

Table 11 Bachelor’s degrees evaluations (March 2011) of 23 French universities on the ordinal scale A\(+\), A, B and C. The four evaluation criteria are described in Sect. 4.2

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Biernacki, C., Jacques, J. Model-based clustering of multivariate ordinal data relying on a stochastic binary search algorithm. Stat Comput 26, 929–943 (2016). https://doi.org/10.1007/s11222-015-9585-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11222-015-9585-2

Keywords

Navigation