
Affine-transformation invariant clustering models


We develop a cluster process which is invariant with respect to unknown affine transformations of the feature space and does not require the number of clusters to be known in advance. Specifically, the proposed method can identify clusters invariant under (I) orthogonal transformations, (II) scaling-coordinate orthogonal transformations, and (III) arbitrary nonsingular linear transformations, corresponding to models I, II, and III, respectively, and represents the clusters with the proposed heatmap of the similarity matrix. The proposed Metropolis-Hastings algorithm leads to an irreducible and aperiodic Markov chain and identifies clusters reasonably well across various applications. Both the synthetic and real data examples show that the proposed method can be widely applied in many fields, especially for finding the number of clusters and identifying clusters of samples of interest in aerial photography and genomic data.


Clustering of objects that is invariant with respect to affine transformations of the feature vectors is an important research topic, since objects may be recorded from different angles and positions, so that their coordinates vary and their nearest neighbors may belong to other clusters. For example, the longitude, latitude, and altitude coordinates of an object recorded by devices mounted on aircraft or satellites change across observation times. In this situation, distance-based clustering methods, including k-means (MacQueen 1967), hierarchical clustering (Ward 1963), clustering based on principal components, spectral clustering (Ng et al. 2001), and others (Jain and Dubes 1988; Ozawa 1985), may fail to identify the correct clusters by grouping nearest points. Another category is distribution-based clustering methods (Banfield and Raftery 1993; Fraley and Raftery 1998; Fraley and Raftery 2002; Fraley and Raftery 2007; McCullagh and Yang 2008; Vogt et al. 2010), which specify a partition as a parameter in a likelihood function and estimate it under a Bayesian framework.

In certain areas of application, the goal is to cluster objects i=1,…,n into disjoint subsets based on their feature vectors \(Y_{i} \in \mathbb {R}^{d}\). In this paper, we propose group invariance by considering three cases of a cluster process that are invariant with respect to three groups of affine transformations \(g\colon \mathbb {R}^{d}\to \mathbb {R}^{d}\) acting on the feature space. The group invariance implies that two feature configurations Y and Y′ in \(\mathbb {R}^{n\times d}\) determine the same clustering, or probability distribution on clusterings, if they belong to the same group orbit, which is an equivalence class. For example, if the feature space is Euclidean and the group is that of Euclidean isometries or congruences, the clustering is a function only of the maximal invariant, which is the array of Euclidean distances \(D_{ij}=\|Y_{i}-Y_{j}\|\). Image data such as aerial photography and three-dimensional protein structures are two motivating examples: the shape and relative locations of the data may vary with the viewer's angle and location.

Our goal is to develop a novel clustering method which can identify clusters of Y=(Y1,…,Yn) even when all Yi’s are mapped by an unknown affine transformation \(Y^{\prime }_{i} = \boldsymbol {a} + A Y_{i}\), where \(\boldsymbol {a}=(a_{1},\ldots,a_{d})\in \mathbb {R}^{d}\) and \(A\in \mathbb {R}^{d\times d}\) is nonsingular. Affine-invariant clustering is important when the clusters are not well-separated in the observational space. Although there is previous work on affine-invariant clustering methods (Fitzgibbon and Zisserman 2002; Begelfor and Werman 2006; Shioda and Tunçel 2007; Brubaker and Vempala 2008; Kumar and Orlin 2008; Garcìa-Escudero et al. 2010; Lee et al. 2014), these existing methods handle problems different from ours: they aim to cluster the same item observed from different angles or mapped by different unknown affine transformations. In our problem setting, by contrast, a single unknown affine transformation is applied to all objects.

The affine transformations consist of three types: (1) index permutations, rotations, a single scaling applied to all variables, and location translations, which correspond to the first type of covariance structure and are named model I; this transformation and covariance structure σ2Id were also adopted by Vogt et al. (2010); (2) each variable may have a different scaling transformation, corresponding to the second type of covariance structure and named model II; (3) the variables are transformed by a nonsingular matrix, named model III, where the observed variables may be linear combinations of latent variables from model I. These models cover fairly general clustering situations arising in practice.

McCullagh and Yang (2008) constructed a Dirichlet cluster process together with a random partition representing the clustering. In this paper, we follow their setup and extend their framework. We assume that the random partition of objects follows the Ewens distribution (Ewens 1972), and we propose a likelihood of the responses which is invariant with respect to affine transformations.

Cluster process and prior distributions

In this paper, an \(\mathbb {R}^{d}\)-valued cluster process (Y,B) means a random partition B of the natural numbers, together with an infinite sequence Y1,Y2,… of random vectors in the state space \(\mathbb {R}^{d}\). The restriction of such a process to a finite sample [n]={1,…,n} of units or specimens consists of the restricted partition B[n] accompanied by the finite sequence Y[n]=(Y1,…,Yn). A partition B[n]:[n]×[n]→{0,1} is the partition of the sample units expressed as a binary matrix with Bi,j=1 if Yi and Yj are in the same cluster (denoted i∼j), and Bi,j=0 otherwise (McCullagh and Yang 2008). For example, when n=3, the partition {{1,2},3} and the cluster labels 112 correspond to the equivalence relation

$$B= \left(\begin{array}{lll} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \\ \end{array} \right). $$

Notice that the relation encoded by B is transitive, i.e., Bi,j=1 and Bj,k=1 imply Bi,k=1, so that individuals i, j, and k all belong to the same cluster.
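As a concrete sketch (using NumPy; the helper name `partition_matrix` is ours, not from the paper), the relation matrix B can be built directly from cluster labels, and any B built this way is automatically symmetric and transitive:

```python
import numpy as np

def partition_matrix(labels):
    """B[i, j] = 1 iff units i and j carry the same cluster label."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

# cluster labels 112, i.e. the partition {{1, 2}, {3}}
B = partition_matrix([1, 1, 2])
```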

The term cluster process implies infinite exchangeability, which means that the joint distribution pn of (Y[n],B[n]) is symmetric (McCullagh and Yang 2006) or invariant under permutations of indices (Pitman 2002), and pn is the marginal distribution of pn+1 under deletion of the (n+1)th unit from the sample.

Similar to McCullagh and Yang (2008), we construct an exchangeable Gaussian mixture as a simple example of a cluster process. First, B follows some infinitely exchangeable random partition distribution p. Secondly, the conditional distribution of the samples Y, regarded as a matrix (Yi,r) of order n×d, given B (say with cluster label cl(Yi)=l) and θ, is Gaussian with mean and variance as follows

$$\text{\(\textrm{\textup{E}}\)}\left(Y_{i} \,|\, B, \boldsymbol{\mu}_{l}\right) = \boldsymbol{ \mu}_{l}~,\qquad {\text{Cov}}\left(Y_{i,r}, Y_{j,s} \,|\, B,\theta\right) = \left(\delta_{i,j} + \theta B_{i,j}\right) \Sigma_{r,s}, $$

where \(\boldsymbol { \mu }_{l}=\left (\mu _{l1},\ldots,\mu _{ld}\right) \in \mathbb {R}^{d}\) is the centroid of cluster l, δ is Kronecker’s delta, that is, δi,j=1 if i=j and 0 if i≠j; θ>0 is a ratio parameter connecting the within- and between-cluster covariance matrices; and Σ=(Σr,s) is a positive definite matrix of order d×d, known as the within-cluster covariance matrix. In our settings, the between-cluster covariance matrix is simply θΣ, the cluster centroids μ1,…,μk are iid from N(μ,θΣ), and the mean of Y given B and μ1,…,μk is

$$\text{\(\textrm{\textup{E}}\)}\left(Y \,|\, B, \boldsymbol{\mu}_{1},\ldots,\boldsymbol{\mu}_{k}\right) = \left(\boldsymbol{ \mu}_{cl(Y_{1})}, \ldots, \boldsymbol{ \mu}_{cl(Y_{n})}\right) $$

and the covariance of Y given B can also be represented by the covariance of its vector form \(\text {Vec}(Y)=\left (Y_{11}, \hdots, Y_{1d}, \hdots, Y_{n1}, \hdots, Y_{nd} \right)^{\intercal }\) as

$${\text{Cov}}\left(\text{Vec}(Y) \,|\, B,\theta\right) = (I_{n} + \theta B) \otimes \Sigma $$

which is an nd×nd matrix, with “⊗” indicating the Kronecker product. Σ, the column covariance of Y, is assumed identical for all clusters; In+θB imposes an exchangeable structure on the row covariance of Y; and θ scales the covariance between rows belonging to the same cluster. There exist competing algorithms that are affine-equivariant and do not impose this requirement (Shioda and Tunçel 2007; Kumar and Orlin 2008; Garcìa-Escudero et al. 2010; Lee et al. 2014). The identity matrix itself is also a partition, in which each cluster consists of one element.
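A small numerical sketch of this covariance (NumPy; the specific values of θ, B, and Σ are illustrative choices of ours):

```python
import numpy as np

n, d, theta = 3, 2, 0.5
B = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])   # partition {{1, 2}, {3}}
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])        # within-cluster covariance

# Cov(Vec(Y) | B, theta) = (I_n + theta * B) otimes Sigma, an nd x nd matrix;
# Vec(Y) stacks the rows Y_1, ..., Y_n, so the row covariance comes first.
cov = np.kron(np.eye(n) + theta * B, Sigma)
```

Since In+θB and Σ are both positive definite, so is their Kronecker product.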

Given the number of clusters k, the cluster sizes (n1,…,nk) may follow a multinomial distribution with category probabilities π=(π1,…,πk), where π follows an exchangeable Dirichlet distribution Dir (λ/k,…,λ/k). After integrating out π, the partition B follows a Dirichlet-multinomial prior

$$p_{n}(B|\lambda, k) =\frac{ k!}{(k - \#B)!}\frac{\Gamma(\lambda)\prod_{b\in B}\Gamma(n_{b}+\lambda/k)}{\Gamma(n+\lambda)[\Gamma(\lambda/k)]^{\#B}} $$

where #B ≤ k denotes the number of clusters present in the partition B and nb is the size of cluster b (MacEachern 1994; Dahl 2005; McCullagh and Yang 2008). The limit as k→∞ is well defined and known as the Ewens sampling formula (ESF) with parameter λ>0

$$ p_{n}\left(n_{1},\ldots,n_{k}|\lambda\right) = \frac{\Gamma(\lambda)\lambda^{\# B}}{\Gamma(n+\lambda)}\prod_{b\in B}\Gamma(n_{b}), $$

which is also known as the Chinese restaurant process (CRP) (Ewens 1972; Neal 2000; Blei and Jordan 2006; Crane 2016). McCullagh and Yang (2008) provided a framework with a finite number of clusters and general covariance structures. In this paper, we adopt the CRP prior for the partition B, which implies k=∞ in the population, together with the proposed Gaussian likelihood to obtain affine-invariant clusters. Note that #B≤n for any given sample size n.
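To make the ESF concrete, here is a sketch (the function name is ours) of the probability of a set partition with given block sizes, checked against n = 3 and λ = 1, where the five set partitions of {1,2,3} receive probabilities 1/3 (one block), 1/6 each (the three two-block partitions), and 1/6 (all singletons), summing to 1:

```python
from math import gamma, prod

def ewens_prob(block_sizes, lam):
    """ESF: p_n(B | lam) = Gamma(lam) * lam**#B / Gamma(n + lam) * prod_b Gamma(n_b)."""
    n = sum(block_sizes)
    return (gamma(lam) * lam ** len(block_sizes) / gamma(n + lam)
            * prod(gamma(nb) for nb in block_sizes))
```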

We choose a proper prior distribution for the variance ratio θ: the symmetric F-family

$$p(\theta)\propto \frac{\theta^{\alpha-1}}{(1+\theta)^{2\alpha}} $$

with α>0 allowing a range of reasonable choices (Chaloner 1987).

We propose a sampling procedure to estimate the partition B and the parameter θ from conditional probabilities. Since the conditional distribution of θ does not have a recognized form, we propose to use a discrete version \(\left \{p(\theta _{j})\right \}^{J}_{j=1},\) where J is a predetermined moderately large integer. Based on our experience, J=100 works reasonably well for the real data examples that we have examined.
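A sketch of this discretization (NumPy; the log-spaced grid between 2⁻³ and 2¹⁰ is an illustrative choice, and the variable names are ours):

```python
import numpy as np

alpha, J = 1.0, 100
theta_grid = 2.0 ** np.linspace(-3, 10, J)            # J support points for theta
w = theta_grid ** (alpha - 1) / (1 + theta_grid) ** (2 * alpha)
p_theta = w / w.sum()                                 # normalized discrete prior {p(theta_j)}
```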

Affine-transformation invariant clustering

The affine-transformation invariant clustering identified in this manuscript remains invariant even when the objects are mapped by an unknown affine transformation. The conditional distribution on partitions of [n]={1,…,n} is determined by the finite sequence Y=(Y1,…,Yn) regarded as a configuration of n labeled points in \(\mathbb {R}^{d}\). The exchangeability condition implies that any permutation π of the sequence induces a corresponding permutation in B, i.e., pn(Bπ | Y=yπ)=pn(B | Y=y), where \(y^{\pi }_{i} = y_{\pi (i)}\) and \(B^{\pi }_{i,j} = B_{\pi (i), \pi (j)}\). In many cases, it is reasonable to assume additional symmetries involving transformations in \(\mathbb {R}^{d}\), for example pn(B | Y)=pn(B | −Y). We are asking, in effect, whether two labeled configurations Y and Y′ that are geometrically equivalent in \(\mathbb {R}^{d}\) should determine the same conditional distribution on sample partitions.

If the state space \(\mathbb {R}^{d}\) is regarded as a d-dimensional Euclidean space with the standard Euclidean inner product and Euclidean metric, the configurations Y and Y′ are congruent if there exist a vector \(\boldsymbol {a}=(a_{1},\ldots,a_{d})\in \mathbb {R}^{d}\) and an orthogonal matrix \(A\in \mathbb {R}^{d\times d}\) such that \(Y^{\prime }_{i} = \boldsymbol {a} + A Y_{i}\) for each i. Equivalently, the n×n arrays of squared Euclidean distances \(D_{ij}=\|Y_{i}-Y_{j}\|^{2}\) and \(D^{\prime }_{ij} = \left \|Y'_{i} - Y'_{j}\right \|^{2}\) are equal. The configurations are geometrically similar if \(Y^{\prime }_{i} = \boldsymbol {a} + b Y_{i}\) for \(b \in \mathbb {R}\) and b≠0, implying that the arrays of distances are proportional: D′=b2D.

The geometric equivalence is defined by regarding the observation Y as a group orbit rather than a point. In general, the group is the affine group \({GA}\left (\mathbb {R}^{d}\right)\), \({\mathcal {G}}=\mathbb {R}^{d} \times L\), where L is the collection of all d×d nonsingular matrices, with the operation (a1,A1)(a2,A2)=(a1+A1a2,A1A2) for \(\boldsymbol {a}_{i}\in \mathbb {R}^{d}, A_{i}\in L\) with i=1,2, which is consistent with composition of affine transformations. The orbit of an element \(Y = (Y_{1}, \ldots, Y_{n})^{\intercal } \in \mathbb {R}^{n\times d}\) is defined as

$$\begin{array}{*{20}l} \text{Orb}(Y) = \left\{X \in \mathbb{R}^{n\times d}\ :\ \exists g\in {\mathcal{G}}\text{ s.t.} X = g\star Y\right\}, \end{array} $$

where the group action is that \({\mathcal {G}}\) acts on \(\mathbb {R}^{n\times d}\) as

$$\begin{array}{*{20}l} (\boldsymbol{a},A)\star Y = \left(\boldsymbol{a} + AY_{1}, \ldots, \boldsymbol{a} + A Y_{n}\right)^{\intercal} = \boldsymbol{1}_{n} \boldsymbol{a}^{\intercal} + Y A^{\intercal} \end{array} $$

where 1n is a length-n vector of 1’s. It can be verified that the vector form satisfies Vec((a,A)⋆Y)=1n⊗a+(In⊗A)Vec(Y). If

$$\text{Vec}(Y) \sim N\left(\boldsymbol{1}_{n} \otimes \boldsymbol{\mu}, (I_{n} + \theta B)\otimes \Sigma \right), $$

then an element in the same orbit

$$\text{Vec}\left((\boldsymbol{a},A)\star Y\right) \sim N\left(\boldsymbol{1}_{n} \otimes \left(\boldsymbol{a} + A\boldsymbol{\mu}\right), \left(I_{n} + \theta B\right)\otimes \left(A \Sigma A^{\intercal}\right) \right) $$

More specifically,

$$\begin{array}{@{}rcl@{}} \text{Vec}\left((-\boldsymbol{\mu}, I_{d})\star Y\right) &\sim& N\left(\boldsymbol{0}, (I_{n} + \theta B)\otimes \Sigma) \right)\\ \text{Vec}\left((-T^{-1} \boldsymbol{\mu}, T^{-1})\star Y\right) &\sim& N\left(\boldsymbol{0}, (I_{n} + \theta B)\otimes I_{d} \right) \end{array} $$

where T is a d×d nonsingular matrix satisfying \(\Sigma = T T^{\intercal }\).
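The group operation above can be checked numerically; a minimal sketch (NumPy; the helper name `act` and the particular values of a and A are ours):

```python
import numpy as np

def act(a, A, Y):
    """Group action (a, A) * Y: each row Y_i maps to a + A Y_i."""
    return np.ones((Y.shape[0], 1)) @ a[None, :] + Y @ A.T

rng = np.random.default_rng(1)
Y = rng.standard_normal((5, 2))
a1, A1 = np.array([1.0, -2.0]), np.array([[2.0, 1.0], [0.0, 3.0]])
a2, A2 = np.array([0.5, 0.5]), np.array([[1.0, 0.0], [1.0, 1.0]])

# composition rule: (a1, A1)(a2, A2) = (a1 + A1 a2, A1 A2)
lhs = act(a1, A1, act(a2, A2, Y))
rhs = act(a1 + A1 @ a2, A1 @ A2, Y)
```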

Theorem 1

If nd, then all \(Y\in \mathbb {R}^{n\times d}\) of full rank n belong to the same orbit. If n=d+1, then all \(Y\in \mathbb {R}^{n\times d}\) satisfying \(\text {rank}(Y) = \text {rank}(Y - \boldsymbol {1}_{n} \boldsymbol {1}_{n}^{\intercal } Y/n) = d\) belong to the same orbit.

The proof of Theorem 1 is relegated to the Appendix A. According to the proof, if n=d+1, then rank(Y)=d implies that \(\text {rank}(Y - \boldsymbol {1}_{n} \boldsymbol {1}_{n}^{\intercal } Y/n)\) is either d or d−1. The case of d−1 only occupies a lower-dimensional subspace.

According to Theorem 1, for nd+1, the action is essentially transitive in the sense that all configurations of n distinct points in \(\mathbb {R}^{d}\) belong to the same orbit: all other orbits are negligible in that they have Lebesgue measure zero. As a result, the observation Y regarded as a group orbit \({\mathcal {G}} Y\) is uninformative for clustering unless n>d+1. We name the orbit and group action defined above as model III.

In model I, which is the case considered in Vogt et al. (2010), the covariance between features is proportional to the identity matrix. The group is \({\mathcal {G}}=\mathbb {R}^{d} \times \mathbb {R}\setminus \{0\}\) with the operation (a1,b1)(a2,b2)=(a1+b1a2,b1b2) for \(\boldsymbol {a}_{i}\in \mathbb {R}^{d}, b_{i}\in \mathbb {R}\setminus \{0\}, i=1,2\). The orbit of an element \(Y\in \mathbb {R}^{n\times d}\) and the group action are defined as in (1) and (2) with A replaced by b. Then \(\left (\boldsymbol {a}, b\right)\star Y = \boldsymbol {1}_{n} \boldsymbol {a}^{\intercal } + bY\) and Vec((a,b)⋆Y)=1n⊗a+bVec(Y). If

$$\text{Vec}(Y) \sim N\left(\boldsymbol{1}_{n} \otimes \boldsymbol{\mu},\ (I_{n} + \theta B)\otimes \sigma^{2}I_{d}\right), $$

then Vec((−μ,1)⋆Y)∼N(0,(In+θB)⊗σ2Id) and Vec((−μ/σ,1/σ)⋆Y)∼N(0,(In+θB)⊗Id), which correspond to elements in Orb(Y).

In essence, the observation is not regarded as a point in \(\mathbb {R}^{n\times d}\) but is treated as a group orbit generated by the group of rigid transformations, or similarity transformations if scalar multiples are permitted. In statistical terms, this approach meshes with the sub-model in which the matrix Σ in model I is a scaled identity matrix Id. An equivalent way of saying the same thing for n>d is that the column-centered sample matrix \(\tilde Y = Y - \mathbf {1}_{n} \mathbf {1}_{n}^{\intercal } Y /n\) determines the sample covariance matrix \(S = \left (\tilde Y^{\intercal } \tilde Y\right)/(n-1)\) and hence the Mahalanobis metric \(\|x - x^{*}\|^{2} = (x - x^{*})^{\intercal } S^{-1} (x - x^{*})\) in the state space (Mahalanobis 1936; Gnanadesikan and Kettenring 1972). One implication is that the n×n matrix \(D=(D_{ij})=\left((Y_{i}-Y_{j})^{\intercal} S^{-1} (Y_{i}-Y_{j})\right)\) of standardized inter-point Mahalanobis distances is maximal invariant, and the conditional distribution on sample partitions depends on Y only through this matrix.
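The maximal invariance can be checked numerically: the Mahalanobis distance matrix is unchanged when every row of Y is mapped by the same affine transformation. A sketch (NumPy; `mahalanobis_matrix` is our name, and the random A is almost surely nonsingular):

```python
import numpy as np

def mahalanobis_matrix(Y):
    """Pairwise squared Mahalanobis distances in the sample-covariance metric of Y."""
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / (Y.shape[0] - 1)
    Sinv = np.linalg.inv(S)
    diff = Y[:, None, :] - Y[None, :, :]
    return np.einsum("ijk,kl,ijl->ij", diff, Sinv, diff)

rng = np.random.default_rng(2)
Y = rng.standard_normal((8, 3))
A = rng.standard_normal((3, 3))
a = rng.standard_normal(3)
D1 = mahalanobis_matrix(Y)
D2 = mahalanobis_matrix(a + Y @ A.T)   # rows mapped to a + A Y_i
```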

In practice, the d variables are sometimes measured on scales that are not commensurate with one another, so the state space seldom has a natural metric. In this case, we regard Y and Y′ as equivalent configurations if, for each feature Y·,j, there are \(a_{j}\in \mathbb {R}\) and \(b_{j} \in \mathbb {R}\setminus \{0\}\) such that \(Y^{\prime }_{\cdot,j} = a_{j} + b_{j} Y_{\cdot,j}\). In model II, the group is the affine group \({GA}(\mathbb {R})^{d}\), \({\mathcal {G}}=\mathbb {R}^{d} \times D\), where D={diag{b1,…,bd} : bi≠0, i=1,…,d}, with the operation (a1,A1)(a2,A2)=(a1+A1a2,A1A2) for \(\boldsymbol {a}_{i}\in \mathbb {R}^{d}, A_{i}\in D\) with i=1,2. The orbit of an element \(Y\in \mathbb {R}^{n\times d}\) and the group action are defined as in (1) and (2) with A∈D. If

$$\text{Vec}(Y) \sim N\left(\boldsymbol{1}_{n} \otimes \boldsymbol{\mu}, (I_{n} + \theta B)\otimes \text{diag}\{\sigma_{1}^{2}, \ldots, \sigma_{d}^{2}\}\right) $$

then \( \text {Vec}\left ((-\boldsymbol {\mu }, I_{d}) \star Y\right) \sim N(\mathbf {0}, (I_{n} + \theta B)\otimes \text {diag}\left \{\sigma _{1}^{2}, \ldots, \sigma _{d}^{2}\right \}) \), and furthermore Vec((a,A)Y)N(0,(In+θB)Id) with \(\boldsymbol {a} = -\left (\mu _{1}/\sigma _{1}, \ldots, \mu _{d}/\sigma _{d}\right)^{\intercal }\) and \(A = \text {diag}\left \{\sigma _{1}^{-1}, \ldots, \sigma _{d}^{-1}\right \}\), which correspond to elements of the group orbit. No linear combinations are permitted here, so that the integrity of the variables is preserved.

Moreover, in some cases, the location information or shapes of objects from aerial photography applications may be distorted by the viewer’s angle or position, so that the variables may be strongly correlated. A more extreme approach avoids the metric assumption by regarding Y and Y′ as equivalent configurations if there exist a vector \(\boldsymbol {a}\in \mathbb {R}^{d}\) and a nonsingular matrix \(A\in \mathbb {R}^{d\times d}\) such that \(Y^{\prime }_{i} = \boldsymbol {a} + A Y_{i}\) for all i, in which case \(A^{\intercal } A\) is a positive definite matrix. Consequently, models I, II, and III specify the structure of the covariance matrix between features, and the partition B of Y is affine invariant and identical to the partition B of the group orbit \({\mathcal {G}}Y \subset \mathbb {R}^{n\times d}\), which is independent of the mean.

3.1 Gaussian marginal probabilities

The distribution of the column-centered group orbit, \({\mathcal {G}}Y\), is assumed to be a Gaussian distribution

$$N\left(\mathbf{0}, (I_{n} + \theta B)\otimes A^{\intercal} A\right) $$

which depends only on In+θB and \(A^{\intercal } A\). Indeed, it can be verified that for any \((\boldsymbol {a}, A) \in {\mathcal {G}}\), the two distributions of group orbits induced by \(N(\boldsymbol {1}_{n} \otimes \boldsymbol {\mu }, (I_{n} + \theta B)\otimes \Sigma)\) and \(N(\boldsymbol {1}_{n} \otimes (\boldsymbol {a} + A\boldsymbol {\mu }), (I_{n} + \theta B)\otimes \left (A \Sigma A^{\intercal }\right))\) respectively are the same.

McCullagh (2008) studied d time series with an autocorrelation Γ and n observations in time or space, following three Gaussian distribution models N(0,Γ⊗Σ) under different assumptions on Σ as follows:

$$\begin{array}{*{20}l} \text{Model I: } \Sigma &= \sigma^{2} I_{d} \end{array} $$
$$\begin{array}{*{20}l} \text{Model II: } \Sigma &= \text{diag}\left\{\sigma^{2}_{1},\cdots,\sigma^{2}_{d}\right\} \end{array} $$
$$\begin{array}{*{20}l} \text{Model III: } \Sigma &\in PD_{d} \end{array} $$

where PDd is the collection of d×d symmetric positive definite matrices. These three models correspond to our three models of affine transformed equivalence classes which we discussed in the previous section. In this paper, we set (In+θB) as Γ and \(A^{\intercal } A\) as Σ. Following (McCullagh 2008), the log-likelihood based on Y for all three models is:

$$\begin{array}{@{}rcl@{}} l\left(\Gamma,\Sigma | Y\right) &=& -\frac12 \log\det(\Gamma\otimes\Sigma)-\frac{1}{2} \text{tr}\left(Y^{\intercal}\Gamma^{-1}Y\Sigma^{-1}\right)\\ &=& -\frac{d}{2}\log \det(\Gamma)-\frac{n}{2}\log\det(\Sigma)-\frac{1}{2}\text{tr}\left(Y^{\intercal}\Gamma^{-1}Y\Sigma^{-1}\right), \end{array} $$

Lemma 1

(In+θB)−1=In−θWB, where W=diag{(1+θN1)−1,…,(1+θNn)−1} and Ni is the ith element of the vector N=B1n.

According to Lemma 1 and its proof, which is relegated to Appendix A, Γ=In+θB is always nonsingular for θ>0, and its inverse Γ−1=(In+θB)−1=In−θWB can be obtained explicitly. To ensure that \(Y^{\intercal } \Gamma ^{-1} Y\) is positive definite with probability 1 (McCullagh 2008), as well as informative group orbits (see Theorem 1 and its subsequent discussion), we assume n>d+1.
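Lemma 1 is easy to verify numerically; a sketch with an illustrative partition and θ of our own choosing:

```python
import numpy as np

B = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
theta = 0.7
n = B.shape[0]

# N = B 1_n gives, for each unit, the size of its cluster
N = B.sum(axis=1)
W = np.diag(1.0 / (1.0 + theta * N))
inv = np.eye(n) - theta * W @ B        # claimed closed-form inverse of I_n + theta B
```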

After plugging in the maximum likelihood estimator of Σ which for model III is \(\hat {\Sigma }_{\Gamma }=Y^{\intercal }\Gamma ^{-1}Y/n\), for model II is \(\text {diag}\left (\hat {\Sigma }_{\Gamma }\right)\), and for model I is \(\text {tr}\left (\hat {\Sigma }_{\Gamma }\right)I_{d}/d\) (McCullagh 2008), the profile likelihood of Γ is

$$L_{p}\left(\Gamma^{-1} | {\mathcal{G}} Y\right) = \left\{\begin{array}{ll} {\text{det}}\left(\Gamma^{-1}\right)^{d/2} / {\text{tr}}\left(Y^{\intercal} \Gamma^{-1} Y\right)^{nd/2} & \text{(I)} \\ {\text{det}}\left(\Gamma^{-1}\right)^{d/2} / \prod_{r=1}^{d}\left(Y_{(r)}^{\intercal} \Gamma^{-1} Y_{(r)}\right)^{n/2} & (II) \\ {\text{det}}\left(\Gamma^{-1}\right)^{d/2} / \det\left(Y^{\intercal} \Gamma^{-1} Y\right)^{n/2} & (III) \end{array}\right. $$

where \(Y_{(r)} \in \mathbb {R}^{n}\) is the rth column of Y, r=1,…,d.
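The three profile log-likelihoods can be sketched as follows (the function is ours; additive constants are dropped and log-determinants are used for numerical stability):

```python
import numpy as np

def log_profile_lik(Y, Gamma_inv, model):
    """log L_p(Gamma^{-1} | GY) for models I, II, III, up to additive constants."""
    n, d = Y.shape
    logdet_Ginv = np.linalg.slogdet(Gamma_inv)[1]
    Q = Y.T @ Gamma_inv @ Y                       # Y^T Gamma^{-1} Y, a d x d matrix
    if model == "I":
        return 0.5 * d * logdet_Ginv - 0.5 * n * d * np.log(np.trace(Q))
    if model == "II":
        return 0.5 * d * logdet_Ginv - 0.5 * n * np.log(np.diag(Q)).sum()
    if model == "III":
        return 0.5 * d * logdet_Ginv - 0.5 * n * np.linalg.slogdet(Q)[1]
    raise ValueError(model)
```

For d=1 the trace, the diagonal product, and the determinant of Q coincide, so all three models give the same value, consistent with the nesting discussed below.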

The conditional distribution on partitions of [n] depends on the group orbit and the assumptions made regarding Σ. For model I, with Σ∝Id in the Gaussian model, the likelihood depends only on the distance matrix D, so the likelihood is constant on the orbits associated with the larger group of Euclidean similarities. Therefore, for model I, the similarity transformation can be generalized to \(Y^{\prime }_{i} = \boldsymbol { a }+ A Y_{i}\) with \(A^{\intercal } A = \sigma ^{2} I_{d}\) and σ≠0, implying that the arrays of distances are proportional: D′=σ2D. Consequently, there is a representative element of the group orbit with feature mean vector 0, so that Vec(Y)∼N(0,(In+θB)⊗σ2Id).

For model II, the affine transformation can be generalized to \(Y^{\prime }_{i} = \boldsymbol {a} + A Y_{i}\) for all i, where \(\boldsymbol {a} \in \mathbb {R}^{d}\) and \(A\in \mathbb {R}^{d\times d}\) with \(A^{\intercal } A\) a diagonal matrix with positive diagonal entries. As a result, there is a representative element of the group orbit with feature mean vector 0, so that \(\text {Vec}(Y) \sim N\left (\mathbf {0}, (I_{n} + \theta B)\otimes \text {diag}\left \{\sigma ^{2}_{1},\ldots,\sigma ^{2}_{d}\right \}\right)\). This amounts to working with \({GA}(\mathbb {R})^{d}\), the general affine group acting independently on the d columns of Y. For model III, Σ is an arbitrary matrix in PDd, the group is \({GA}\left (\mathbb {R}^{d}\right)\), and n>d+1. These three models are nested: model I ⊂ model II ⊂ model III.

Affine invariance in \(\mathbb {R}^{d}\) is a strong requirement, but it comes at a small cost for moderate d provided that d/n is small. When \(d/n \leq 1\), \(Y^{\intercal } \Gamma ^{-1} Y\) is positive definite with probability one (McCullagh 2008), so model III applies. However, when d/n<1 but not small, model III may be inefficient because some eigenvalues of \(Y^{\intercal } \Gamma ^{-1} Y\), and hence \(\det \left (Y^{\intercal } \Gamma ^{-1} Y\right)\), are close to zero (Dempster 1972; Stein 1975), and the profile likelihood of Γ becomes unstable. In contrast, model II is less computationally expensive than model III, and model I is the most efficient.

Markov chain Monte Carlo algorithm for sampling partitions

We sample from the prior and posterior distributions of θ and B discussed in Section 2 through a Markov chain Monte Carlo (MCMC) algorithm for sampling partitions. The update of θ is obtained by Gibbs sampling (Geman and Geman 1984) according to the conditional distribution \(p_{n}\left (\theta _{j}|B,{\mathcal {G}}Y\right)\propto p(\theta _{j}) \times L_{p}\left (\Gamma ^{-1} | {\mathcal {G}}Y\right),\) where \(p(\theta _{j})\propto \theta _{j}^{\alpha -1}/\left (1+\theta _{j}\right)^{2\alpha }\) for j=1,…,J. For instance, α=1 and the discrete set \(\{2^{-3},2^{-2},\ldots,2^{10}\}\) for the range of θ are used as the default setting in our experiments. For updating B, the conditional distribution on partitions is

$$p_{n}\left(B |\theta, {\mathcal{G}} Y\right) \propto p_{n}(B|\lambda) \times L_{p}\left(\Gamma^{-1} | {\mathcal{G}}Y\right), $$

where pn(B|λ) is the Ewens distribution, and a Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) is used to choose the next B; λ is set to 1 in the following applications. After discarding a burn-in portion of the resulting Markov chain, we use the average of the partition matrices as the similarity matrix to make inference on the partition. The proposal distribution \(q\left (B^{(i+1)}|B^{(i)}, {\mathcal {G}}Y\right)\) is proportional to exp(−a×dxc), where dxc is the distance between each point and the corresponding cluster centroid and a is a scale hyperparameter, set to 2 in our experiments. More specifically, a partition candidate B is generated by re-assigning the label of each point with a probability that decreases with the distance between the point and the corresponding centroid.
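The proposal and accept/reject steps can be sketched as follows (NumPy; all names are ours, and for brevity this sketch treats the proposal as symmetric, whereas Algorithm 1 uses the full Hastings ratio with q):

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_labels(Y, labels, a=2.0):
    """Re-assign each point with probability proportional to exp(-a * d_xc)."""
    ks = np.unique(labels)
    centroids = np.stack([Y[labels == k].mean(axis=0) for k in ks])
    d = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
    p = np.exp(-a * d)
    p /= p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(ks, p=row) for row in p])

def mh_step(Y, labels, log_target):
    """One Metropolis-Hastings update of the partition (symmetric-proposal sketch)."""
    cand = propose_labels(Y, labels)
    if np.log(rng.uniform()) < log_target(cand) - log_target(labels):
        return cand
    return labels
```

In the full algorithm, `log_target` would be the log of p(θ) × pn(B|λ) × Lp(Γ⁻¹|𝒢Y) evaluated at the current θ.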

Since Algorithm 1 is a Metropolis-Hastings algorithm, it satisfies the detailed balance condition, and therefore the generated Markov chain has a stationary distribution (Chib and Greenberg 1995; Gamerman 1997; Robert and Casella 2010). Since we leave a small but positive probability that the partition stays the same in the Gibbs sampling, and the discrete posterior of θ is always positive, the transition probability

$$p_{n}(\theta^{(k+1)},B^{(k+1)}|\theta^{(k)},B^{(k)})>0 $$

where θ(k+1)=θ(k) and B(k+1)=B(k); therefore, the (θ,B)-valued Markov chain constructed by Algorithm 1 is aperiodic.

Lemma 2

If n>d+1, the (θ,B)-valued Markov chain constructed by Algorithm 1 is aperiodic.

Since there is always a positive probability that the partition splits further, down to the finest partition in which each element is its own cluster, all possible partitions communicate with each other, so the (θ,B)-valued Markov chain constructed by Algorithm 1 is irreducible. Given the sample size n, the size of the state space of B (the Bell number; Bell 1934) and the size of the state space of θ are both finite, so irreducibility also implies positive recurrence. Consequently, the (θ,B)-valued Markov chain constructed by Algorithm 1 is ergodic (Isaacson and Madsen 1976; Gilks et al. 1996). These properties are summarized in the following lemmas and theorem, whose proofs are relegated to Appendix A.

Lemma 3

If n>d+1, the (θ,B)-valued Markov chain constructed by Algorithm 1 is irreducible, and thus is positive recurrent.

Theorem 2

(Ergodic theorem) If n>d+1, the (θ,B)-valued Markov chain constructed by Algorithm 1 converges to its stationary distribution \(p_{n}\left (\theta,B|{\mathcal {G}}Y\right)\propto p(\theta)\times p_{n}(B|\lambda) \times L_{p}\left (\Gamma ^{-1}|{\mathcal {G}}Y\right)\). More specifically, for any real-valued function f satisfying \(\sum _{(\theta, B)} |f(\theta, B)| p_{n}(\theta, B|{\mathcal {G}}Y) < \infty \), we have

$$\frac{1}{n+1} \sum_{i=0}^{n} f\left(\theta^{(i)}, B^{(i)}\right) \longrightarrow \sum_{(\theta, B)} f(\theta, B) p_{n}(\theta, B|{\mathcal{G}}Y) $$

almost surely for all initial value (θ(0),B(0)).

Analysis of simulated and real data

We test the proposed Bayesian cluster process with Algorithm 1 on both synthetic and real data. Algorithm 1 with model I and point-wise updating is equivalent to the method of Vogt et al. (2010). If there is no prior information on the number of clusters, users can set the initial partition B to In, in which each observation is its own block. In practice, we use a number of initial clusters randomly sampled from a discrete uniform distribution over a user-chosen range. The clustering result is represented by the estimated similarity matrix, the average of the sampled partition matrices

$$S=\sum_{k=n_{0}+1}^{N}\frac{ B^{(k)}}{N-n_{0}}, $$

where n0 is the number of burn-in iterations. Furthermore, we define a dissimilarity matrix D as \(\boldsymbol { 1}_{n}\boldsymbol { 1}_{n}^{\intercal }-S\). The dissimilarity matrix D can be displayed as a heatmap, which represents the matrix in grayscale with white as 1, black as 0, and shades of gray for values between 0 and 1. The heatmap of an unordered similarity matrix is hard to read with the naked eye, and the equivalence relation needs to be decoded from the matrix B. In practice, however, users can identify clusters by including the row and column names of the similarity matrix to find which individuals are clustered together. Additionally, the heatmap function of the stats R package can permute the order of individuals to obtain cluster blocks with hierarchical dendrograms. It is challenging to monitor convergence of the Markov chain because the sampled clusters are random and may vary in each iteration. To assess convergence, we run Algorithm 1 ten times for each data set and stop the chain when the number of clusters remains the same over the given chain length (Chang and Fisher 2013).
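Computing S and D from sampled partition matrices can be sketched as follows (NumPy; the function name and the toy samples are ours):

```python
import numpy as np

def similarity_and_dissimilarity(B_samples, n0):
    """S averages the post-burn-in partition matrices; D = 1 1^T - S."""
    kept = np.asarray(B_samples[n0:], dtype=float)
    S = kept.mean(axis=0)
    D = np.ones_like(S) - S
    return S, D

B1 = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
B2 = np.array([[1, 0, 0], [0, 1, 1], [0, 1, 1]])
S, D = similarity_and_dissimilarity([B1, B1, B2], n0=1)  # burn in the first sample
```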

5.1 Illustrative simulated data

Four clusters on the vertices of a unit square. Three simulated data sets are generated for illustration. In the simulation study, 1000 initial burn-in iterations were discarded, and 2000 post-burn-in samples of B under each model were used to calculate D. We first applied the proposed cluster process with model I to synthetic data with four clusters centered at the four vertices of a unit square. For each vertex μk, we generate 20 points from N(μk,(1/4)I2) for k=1,…,4 (see Fig. 1, left panel). We call the data XI and then apply model I to cluster XI using the average within- and between-cluster distances. The resulting heatmap successfully recovers the true clusters for most of the points (not shown here).

Fig. 1

The scatter plots for XI, XII, and XIII of the unit-square synthetic data, from left to right. The leftmost panel shows the original features, which form four clusters of equal size 20 at the vertices of the unit square; the middle panel shows the features after scaling each dimension differently, so that clusters 1 and 2 appear merged, as do clusters 3 and 4; the right panel shows that the transformed features are aligned along a straight line

Then we transform the data by \(X_{II}=X_{I}\times \left (\begin {array}{ll} 3 & 0 \\ 0 & 1/3 \\ \end {array} \right).\) The transformed features appear to form two groups (see Fig. 1, middle panel): clusters (1,2) and clusters (3,4). The cluster process with model I does not work well in this case, while the heatmap based on model II, without knowledge of the transformation, reveals the true clusters for most of the points (not shown here).

Furthermore, we transform the data by \(X_{III}=X_{I}\times \left(\begin{array}{cc} 4.1 & 2.1 \\ 1.9 & 1.1 \end{array}\right)\). The transformed features align along a straight line (see Fig. 1, right panel). The transformed data XIII is more difficult to cluster than XI and XII, since the four original clusters are no longer well separated after the transformation.
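For reference, the two transformations can be applied as follows (a sketch with our own variable names). The second matrix is nonsingular, but its second singular value is small relative to the first, which is why the transformed points are nearly aligned along a line:

```python
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
X_I = np.vstack([rng.normal(mu, 0.5, size=(20, 2)) for mu in centers])

# Model II setting: scale the two coordinates differently.
A_II = np.array([[3.0, 0.0], [0.0, 1.0 / 3.0]])
X_II = X_I @ A_II

# Model III setting: an arbitrary nonsingular linear map whose second
# singular value is much smaller than its first.
A_III = np.array([[4.1, 2.1], [1.9, 1.1]])
X_III = X_I @ A_III
s = np.linalg.svd(A_III, compute_uv=False)  # singular values of A_III
```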

The resulting heatmap using model III with the initial clusters assigned randomly and uniformly from {1,2,3,4} reveals the true four clusters for most of the points (see Fig. 2).

Fig. 2

The heatmap of the similarity matrix using model III reveals the true four clusters for most of the transformed data XIII

5.2 Applications to real data

Besides the synthetic data, we also evaluate the performance of the proposed approach on real data. We run 3000 MCMC iterations, discard the first 1000 as burn-in, and use the heatmap of the matrix S to visualize the clusters. The accuracy rate is the average proportion of identical elements between the estimated partition matrix B and the true matrix B. We compare the accuracy rates with k-means (MacQueen 1967) and Mclust, using the R package ‘mclust’ with its default settings (Fraley and Raftery 2002). We chose the ‘mclust’ package because Mclust is a model-based clustering approach using the Gaussian mixture model, which assumes a Gaussian distribution for each component under one of three types of covariance structures (the modelNames argument of Mclust): 1. spherical (EII), 2. diagonal (VVI), and 3. general (VVV), corresponding to our proposed models I, II, and III, respectively. The main difference is that Mclust obtains clusters with an expectation–maximization (EM) algorithm (Dempster 1972; McLachlan and Peel 2000), whereas our method uses a Metropolis-Hastings algorithm with the profile likelihood of Γ to sample clusters.
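The accuracy measure can be sketched as follows (our own code with hypothetical names; it compares entries of the two partition matrices and is therefore unaffected by label switching):

```python
import numpy as np

def partition_matrix(labels):
    """B[i, j] = 1 iff points i and j are assigned to the same cluster."""
    z = np.asarray(labels)
    return (z[:, None] == z[None, :]).astype(int)

def b_matrix_accuracy(est_labels, true_labels):
    """Average proportion of identical entries between the estimated
    and the true partition matrices B."""
    B_est = partition_matrix(est_labels)
    B_true = partition_matrix(true_labels)
    return float(np.mean(B_est == B_true))

# Relabeling the clusters leaves the accuracy unchanged:
a = b_matrix_accuracy([0, 0, 1, 1], [1, 1, 0, 0])  # -> 1.0
```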

Model I: Gene expression data of leukemia patients The gene expression microarray data (Dua and Graff 2019) have been used to study genetic disorders, for example by identifying diagnostic or prognostic biomarkers or by clustering and classifying diseases (Dudoit et al. 2002). For example, Golub et al. (1999) classified acute leukemia patients into two subtypes, Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML). For illustration purposes, we use the training set of the leukemia data, which consists of 3051 genes and 38 tumor mRNA samples. Pretending that we do not know the label information, we cluster the 38 samples according to their 3051 features (gene expression levels). The two clusters comprise 27 ALL cases and 11 AML cases. Since the number of features exceeds the sample size, our approach is not directly applicable to this dataset. Therefore, we first reduce the dimension by projecting the data onto the subspace spanned by the first twenty principal components (PCs) (Jolliffe 1986). Note that these PC scores are orthonormal, satisfying the assumption of model I. The resulting heatmap based on model I (Fig. 3) reveals the cluster of the 11 AML cases. The accuracy rate using the proposed model I with the initial clusters assigned randomly and uniformly from {1,2} is 0.9164, while the accuracy rates of k-means and Mclust are 0.6994 and 0.5886, respectively. We note that Mclust resulted in only one cluster.
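The dimension-reduction step can be sketched as follows (our own code, not the authors'; the projection uses the SVD of the column-centered data matrix, and random data stand in for the leukemia expression matrix):

```python
import numpy as np

def pc_scores(X, n_components):
    """Project the rows of X onto the leading principal components
    via the SVD of the column-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # n x n_components score matrix

# 38 samples with 3051 features, as in the leukemia training set,
# reduced to the first 20 PCs so that the sample size exceeds the dimension.
X = np.random.default_rng(1).normal(size=(38, 3051))
scores = pc_scores(X, 20)

# The PC scores are uncorrelated: their cross-product matrix is diagonal.
cov = scores.T @ scores
```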

Fig. 3

The heatmap of the similarity matrix using model I identifies the ALL group in the left upper corner and the AML group in the right bottom corner

Model II: Geographic coordinate data of Denmark’s 3D Road Network

This three-dimensional road network dataset of geographic coordinates includes the altitude, latitude, and longitude of each road segment in North Jutland in northern Denmark; it is publicly available at the UC Irvine Machine Learning Repository (Kaul 2013; Dua and Graff 2019). Since the three spatial dimensions are orthogonal, the dataset satisfies the assumption of model II, so we use it to demonstrate model II. Three subjects with the road map OSM IDs 144552912 (19 observations), 125829151 (13 observations), and 145752974 (14 observations) are used for the clustering analysis. Note that each object may have several observations measured from different angles, and the altitude values are extracted from NASA’s Shuttle Radar Topography Mission (SRTM) data (Jarvis et al. 2008). The accuracy rate using model II with the initial clusters assigned randomly and uniformly from {1,2,3,4,5} is 1, while the accuracy rates of k-means with k=3 and Mclust are 0.7486 and 0.9490, respectively. The resulting heatmap using model II (Fig. 4) correctly reveals the three clusters.

Fig. 4

The heatmap of the similarity matrix using model II correctly reveals three clusters corresponding to the three road segments in the Denmark 3D road map data

Model III: Iris data

The iris dataset (Fisher 1936) contains three species, Setosa, Versicolor, and Virginica, with four features: the sepal length, sepal width, petal length, and petal width in centimeters. Each species consists of 50 iris flowers. The data points are clustered by their four features; here, d=4, n=150, and k=3. The heatmap of the similarity matrix using model III correctly reflects the three clusters corresponding to the three iris species for most points (Fig. 5). The accuracy rate using the proposed model III with the initial clusters assigned randomly and uniformly from {1,2,3} is 0.9087, while the accuracy rates of k-means with k=3 and Mclust are 0.7740 and 0.7763, respectively. We note that both k-means and Mclust produce two clusters by grouping Versicolor and Virginica together.

Fig. 5

The heatmap of the similarity matrix using model III correctly reveals three clusters corresponding to the three species of iris for most points

Concluding discussion

The proposed clustering method is invariant under different groups of affine transformations and is computationally efficient. It identifies clusters for most samples without knowing the number of clusters in advance, although it may split a large cluster into several small ones. These problems are handled with an exchangeable partition prior, which avoids label-switching problems, and the partition-valued Markov chain in the MCMC algorithm is invariant under linear transformations for the three types of covariance structures. The advantage of replacing the Dirichlet-multinomial prior with its limiting process is that we do not need to know the number of clusters in advance; the disadvantage is that it may be computationally less efficient when the number of clusters is known. Note that the proposed approach does not target the partition maximizing the posterior distribution. Instead, it estimates the expected partition, or equivalently the similarity matrix.

The three clustering models are based on the covariance matrix between variables. There are guidelines for telling which model works best in practice, based on the experimental design or on testing the sample covariance matrix. If the features are orthonormal or orthogonal, then model I or model II is applicable, respectively; models I and II run faster than model III due to the structure of the covariance matrix. Otherwise, model III can be used in general, and it works reasonably well across various applications.
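A rough diagnostic in this spirit (entirely our own heuristic, with hypothetical names and an arbitrary tolerance, not a procedure from the paper) inspects the sample correlation matrix of the features:

```python
import numpy as np

def suggest_model(X, tol=0.1):
    """Heuristic model choice: near-identity correlations with equal scales
    suggest model I; uncorrelated but differently scaled features suggest
    model II; otherwise fall back to the general model III."""
    R = np.corrcoef(X, rowvar=False)
    off = np.abs(R - np.eye(R.shape[0])).max()  # largest off-diagonal |corr|
    if off < tol:
        sd = X.std(axis=0)
        return "I" if np.ptp(sd) / sd.mean() < tol else "II"
    return "III"

# Uncorrelated features with equal scales, then rescaled, then mixed.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
suggest_model(X)                                       # -> "I"
suggest_model(X * np.array([1.0, 10.0]))               # -> "II"
suggest_model(X @ np.array([[1.0, 1.0], [0.0, 1.0]]))  # -> "III"
```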

Since we use the profile likelihood of Γ in our model, we do not sample the covariance matrix directly, and Lemma 1 and Theorem 1 imply that when n>d+1, the proposed Metropolis-Hastings algorithm works. However, the maximum likelihood estimator (MLE) of the general unstructured covariance matrix will be less efficient if the diagonal covariance structure is actually correct, because it will tend to have small eigenvalues and a large determinant of the inverse covariance matrix (i.e., Γ−1). Indeed, when using model III, Γ may be nearly singular even if n>d+1. This may make the sampling less efficient, i.e., the acceptance rate may become small (Roberts and Rosenthal 2001). Although the stationary distribution of the Markov chain of sampled clusters produced by Algorithm 1 is independent of the initial clusters according to Theorem 2, in practice we suggest sampling the initial clusters from a discrete uniform distribution over a user-specified range, instead of starting with each individual as its own cluster, in order to obtain convergent sampled clusters without a long Markov chain. This lets Algorithm 1 sample more efficiently from a smaller collection of candidate partitions.

The proposed clustering algorithm produces the desired clusters within 2000 iterations after 1000 burn-in iterations in our experiments. The main contributions of our work are as follows. 1) The proposed three clustering models with three types of covariance structures handle general affine transformations; in contrast, Vogt et al. (2010) only dealt with the case of model I. 2) Algorithm 1 is efficient, since it updates all individuals’ clusters instead of a single individual’s cluster per iteration, and it ensures that the resulting partition-valued Markov chain is ergodic and converges in distribution. 3) The experiments show the advantages of our cluster process, which successfully identifies the true clusters using the proposed distance matrix; in particular, when the clusters are not well separated, the probabilistic similarity matrix can still reveal the relationships through hierarchical approaches. The proposed method can be used to extract interesting information from aerial photography, genomic data, and data with attributes on different scales, especially when nearest neighbors may belong to different clusters in the feature space. In future work, the proposed method can be improved by modeling the mean of each cluster with regression on covariates or with non-Gaussian distributions.

Appendix A

Proof of Theorem 1: For any \(Y \in \mathbb {R}^{n\times d}\), denote \(\tilde {Y} = Y - \boldsymbol {1}_{n} \boldsymbol {1}_{n}^{\intercal } Y/n\). Let \(\tilde {Y}_{(n-1)}\) be the (n−1)×d matrix consisting of the first n−1 rows of \(\tilde {Y}\). Since \(\boldsymbol {1}_{n}^{\intercal } \tilde {Y} = 0\), then \(\text {rank}(\tilde {Y}) = \text {rank}(\tilde {Y}_{(n-1)})\).

If \(n \le d+1\) and \(\text {rank}(\tilde {Y}_{(n-1)}) = n-1\), that is, \(\tilde {Y}_{(n-1)}\) is of full row rank, then there exists an orthogonal matrix \(O\in \mathbb {R}^{d\times d}\) (column permutations), such that \(\tilde {Y}_{(n-1)} O = (U,V)\), where \(U\in \mathbb {R}^{(n-1)\times (n-1)}\) is of full rank, and \(V\in \mathbb {R}^{(n-1)\times (d+1-n)}\). We let

$$\boldsymbol{a} = \frac{1}{n} Y^{\intercal} \boldsymbol{1}_{n}, \>\> A = O\left(\begin{array}{ll} U^{\intercal} & \boldsymbol{0}\\ V^{\intercal} & I_{d+1-n} \end{array}\right), \>\> Z = \left(\begin{array}{ll} I_{n-1} & \boldsymbol{0}\\ -\boldsymbol{1}_{n-1}^{\intercal} & \boldsymbol{0} \end{array}\right). $$

It can be verified that \(Y=(\boldsymbol {a},A)Z\). That is, \(Y \in \text {Orb}(Z)\), where Z is a constant matrix.

If \(n \le d\) and rank(Y)=n, then \(\text {rank}(\tilde {Y}_{(n-1)}) = n-1\), since \(\tilde {Y}_{(n-1)} = W Y\), where \(W = (I_{n-1} - \boldsymbol {1}_{n-1} \boldsymbol {1}_{n-1}^{\intercal }/n, -\boldsymbol {1}_{n-1}/n)\) is of full row rank n−1.

Suppose n=d+1 and rank(Y)=d=n−1. Without any loss of generality, we assume rank(Y(n−1))=n−1, where Y(n−1) consists of the first n−1 rows of Y. Then \(Y_{n}=c_{1}Y_{1}+\cdots +c_{n-1}Y_{n-1}\) for some \(c_{1}, \ldots, c_{n-1} \in \mathbb {R}\) and \(\tilde {Y}_{(n-1)} = D Y_{(n-1)}\), where \(D = I_{n-1} - \boldsymbol {1}_{n-1} \boldsymbol {1}_{n-1}^{\intercal }/n - \boldsymbol {1}_{n-1} \boldsymbol {c}^{\intercal }/n\), and \(\boldsymbol {c}^{\intercal } = (c_{1}, \ldots, c_{n-1})\). It can be verified that if \(c_{1}+\cdots +c_{n-1}\neq 1\), then rank(D)=n−1 and \(\text {rank}(\tilde {Y}_{(n-1)}) = n-1\); if \(c_{1}+\cdots +c_{n-1}= 1\), then rank(D)=n−2 and \(\text {rank}(\tilde {Y}_{(n-1)}) = n-2\). Note that if n=d+1 but \(\text {rank}(\tilde {Y}_{(n-1)}) = n-2\), then \(Y \notin \text {Orb}(Z)\). □

Proof of Lemma 1: Suppose the partition matrix B consists of k blocks with block sizes n1,…,nk, where k≥1, ni>0 for i=1,…,k, and \(n_{1}+\cdots +n_{k}=n\).

We first assume that \(B = \text {diag}\left \{\boldsymbol {1}_{n_{1}} \boldsymbol {1}_{n_{1}}^{\intercal }, \ldots, \boldsymbol {1}_{n_{k}} \boldsymbol {1}_{n_{k}}^{\intercal }\right \}\), which is in its standard form. Then \(B = LL^{\intercal }\) with \(L = \text {diag}\{\boldsymbol {1}_{n_{1}}, \ldots, \boldsymbol {1}_{n_{k}}\} \in \mathbb {R}^{n\times k}\) and \(I_{n} + \theta B = I_{n} + E E^{\intercal }\) with \(E = \sqrt {\theta } L\).

According to the Sherman-Morrison-Woodbury formula (see, for example, Section 2.1.4 in Golub and Van Loan (2013)), for matrices \(A \in \mathbb {R}^{n\times n}\) and \(U, V\in \mathbb {R}^{n\times k}, (A + UV^{\intercal })^{-1} = A^{-1} - A^{-1} U(I + V^{\intercal } A^{-1} U)^{-1} V^{\intercal } A^{-1}\) if both A and \(I+V^{\intercal } A^{-1} U\) are nonsingular. In our case, A=In is nonsingular, U=V=E, and \(I + V^{\intercal } A^{-1} U = I_{k} + E^{\intercal } E = \text {diag}\{1+\theta n_{1}, \ldots, 1+\theta n_{k}\}\) is also nonsingular. Thus

$$\begin{array}{@{}rcl@{}} & & (I_{n} + E E^{\intercal})^{-1}\\ &=& I_{n} - E\left(I_{k} + E^{\intercal} E\right)^{-1} E^{\intercal}\\ &=& I_{n} - \theta L\left(I_{k} + \theta L^{\intercal} L\right)^{-1} L^{\intercal}\\ &=& I_{n} - \theta \cdot \text{diag} \left\{\frac{1}{1+\theta n_{1}} \boldsymbol{1}_{n_{1}} \boldsymbol{1}_{n_{1}}^{\intercal}, \ldots, \frac{1}{1+\theta n_{k}} \boldsymbol{1}_{n_{k}} \boldsymbol{1}_{n_{k}}^{\intercal}\right\}\\ &=& I_{n} - \theta \cdot \text{diag} \left\{\frac{1}{1+\theta n_{1}} I_{n_{1}}, \ldots, \frac{1}{1+\theta n_{k}} I_{n_{k}}\right\} \cdot \text{diag} \left\{ \boldsymbol{1}_{n_{1}} \boldsymbol{1}_{n_{1}}^{\intercal}, \ldots, \boldsymbol{1}_{n_{k}} \boldsymbol{1}_{n_{k}}^{\intercal}\right\}\\ &=& I_{n} - \theta W B. \end{array} $$

In general, by row-switching and column-switching transformations, we can always transform B into its standard form. That is, there exists an orthogonal matrix O such that \(B_{r} = OBO^{\intercal }\) is in standard form. Let \(W_{r} = OWO^{\intercal }\). Then \((I_{n} + \theta B)^{-1} = O^{\intercal } (I_{n} + \theta B_{r})^{-1} O = O^{\intercal } (I_{n} - \theta W_{r} B_{r})O = I_{n} - \theta \cdot O^{\intercal } W_{r} O \cdot O^{\intercal } B_{r} O = I_{n} - \theta WB\). □
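The resulting identity \((I_{n} + \theta B)^{-1} = I_{n} - \theta WB\) can also be checked numerically; the following is our own sanity check, not part of the paper:

```python
import numpy as np

# Numerical check of the Lemma 1 identity
# (I_n + theta B)^{-1} = I_n - theta W B for a partition matrix B.
theta = 0.7
sizes = [3, 2, 4]                                 # block sizes n_1, ..., n_k
n = sum(sizes)
labels = np.repeat(np.arange(len(sizes)), sizes)
B = (labels[:, None] == labels[None, :]).astype(float)

# W = diag{ (1 + theta n_1)^{-1} I_{n_1}, ..., (1 + theta n_k)^{-1} I_{n_k} }
W = np.diag(np.concatenate(
    [np.full(m, 1.0 / (1.0 + theta * m)) for m in sizes]))

lhs = np.linalg.inv(np.eye(n) + theta * B)
rhs = np.eye(n) - theta * W @ B
assert np.allclose(lhs, rhs)
```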

Proof of Lemma 3: In our case, the Markov chain built by Algorithm 1 is actually a discrete chain. It is irreducible since pn(θ(k+1),B(k+1)|θ(k),B(k))>0 for each pair of states. As a direct conclusion of Theorem 4.1 in Gilks et al. (1996), our Markov chain is positive recurrent. □

Proof of Theorem 2: Algorithm 1 is a Gibbs sampler plus a Metropolis-Hastings component for sampling B(i+1). Given B(i) and θ(i+1), the Metropolis-Hastings ratio with proposal distribution \(q(B | B^{(i)}, {\mathcal {G}}Y)\) and target distribution \(p_{n}(\theta, B | {\mathcal {G}} Y)\) is

$$\begin{array}{@{}rcl@{}} R(B^{(i)}, B^{*}) &=& \frac{p_{n}\left(\theta^{(i+1)}, B^{*} | {\mathcal{G}}Y\right) \cdot q(B^{(i)} | B^{*}, {\mathcal{G}} Y)}{p_{n}\left(\theta^{(i+1)}, B^{(i)} | {\mathcal{G}}Y\right) \cdot q\left(B^{*} | B^{(i)}, {\mathcal{G}} Y\right)}\\ &=& \frac{p\left(\theta^{(i+1)}\right) \cdot p_{n}\left(B^{*} | \lambda\right) \cdot L_{p}\left(\Gamma(\theta^{(i+1)}, B^{*})^{-1} | {\mathcal{G}}Y\right) \cdot q(B^{(i)} | B^{*}, {\mathcal{G}}Y)}{p(\theta^{(i+1)}) \cdot p_{n}(B^{(i)} | \lambda) \cdot L_{p}\left(\Gamma(\theta^{(i+1)}, B^{(i)})^{-1} | {\mathcal{G}}Y\right) \cdot q\left(B^{*} | B^{(i)}, {\mathcal{G}}Y\right)} \\ &=& \frac{p_{n}\left(B^{*} | \lambda\right) \cdot L_{p}\left(\Gamma(\theta^{(i+1)}, B^{*})^{-1} | {\mathcal{G}}Y\right) \cdot q(B^{(i)} | B^{*}, {\mathcal{G}}Y)}{p_{n}(B^{(i)} | \lambda) \cdot L_{p}\left(\Gamma(\theta^{(i+1)}, B^{(i)})^{-1} | {\mathcal{G}}Y\right) \cdot q(B^{*} | B^{(i)}, {\mathcal{G}}Y)} \end{array} $$

which is exactly R in Algorithm 1. Since Metropolis-Hastings algorithms satisfy the detailed balance condition, the target distribution \(p_{n}(\theta, B | {\mathcal {G}}Y)\) is a stationary distribution. By Lemmas 2 and 3, the convergence statements follow as a direct conclusion of Theorems 4.3 and 4.4 in Gilks et al. (1996). □


Availability of data and materials

The datasets are from simulation and the UCI Machine Learning Repository and are available as per JSDA policy.



Abbreviations

ALL: Acute lymphoblastic leukemia

AML: Acute myeloid leukemia

MCMC: Markov chain Monte Carlo

PC: Principal components

SRTM: NASA’s Shuttle Radar Topography Mission


References

  1. Banfield, J. D., Raftery, A. E.: Model-based Gaussian and non-Gaussian clustering. Biometrics. 49, 803–821 (1993).

  2. Begelfor, E., Werman, M.: Affine invariance revisited. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2087–2094 (2006).

  3. Bell, E. T.: Exponential polynomials. Ann. Math. 35, 258–277 (1934).

  4. Blei, D., Jordan, M.: Variational inference for Dirichlet process mixtures. Bayesian Anal. 1, 121–144 (2006).

  5. Brubaker, S. C., Vempala, S.: Isotropic PCA and affine-invariant clustering. In: Forty-Ninth Annual IEEE Symposium on Foundations of Computer Science (2008).

  6. Chaloner, K.: A Bayesian approach to the estimation of variance components in the unbalanced one-way random-effects model. Technometrics. 29, 323–337 (1987).

  7. Chang, J., Fisher, J. W.: Parallel sampling of DP mixture models using sub-cluster splits. In: NIPS’13 Proceedings of the 26th International Conference on Neural Information Processing Systems, pp. 620–628 (2013).

  8. Chib, S., Greenberg, E.: Understanding the Metropolis-Hastings algorithm. Am. Stat. 49(4), 327–335 (1995).

  9. Crane, H.: The ubiquitous Ewens sampling formula. Stat. Sci. 31, 1–19 (2016).

  10. Dahl, D. B.: Sequentially-allocated merge-split sampler for conjugate and nonconjugate Dirichlet process mixture models. Technical Report, Department of Statistics, Texas A&M University (2005).

  11. Dempster, A. P.: Covariance selection. Biometrics. 28(1), 157–175 (1972).

  12. Dua, D., Graff, C.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine (2019).

  13. Dudoit, S., Fridlyand, J., Speed, T. P.: Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97(457), 77–87 (2002).

  14. Ewens, W. J.: The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3, 87–112 (1972).

  15. Fisher, R. A.: The use of multiple measurements in taxonomic problems. Ann. Eugenics. 7, 179–188 (1936).

  16. Fitzgibbon, A., Zisserman, A.: On affine invariant clustering and automatic cast listing in movies. In: European Conference on Computer Vision 2002, pp. 304–320 (2002).

  17. Fraley, C., Raftery, A. E.: How many clusters? Which clustering methods? Answers via model-based cluster analysis. Comput. J. 41, 578–588 (1998).

  18. Fraley, C., Raftery, A. E.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002).

  19. Fraley, C., Raftery, A. E.: Bayesian regularization for normal mixture estimation and model-based clustering. J. Classif. 24, 155–181 (2007).

  20. Gamerman, D.: Efficient sampling from the posterior distribution in generalized linear models. Stat. Comput. 7, 57–68 (1997).

  21. García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: A review of robust clustering methods. ADAC. 4, 89–109 (2010).

  22. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6(6), 721–741 (1984).

  23. Gilks, W. R., Richardson, S., Spiegelhalter, D. J.: Markov Chain Monte Carlo in Practice. Chapman & Hall, New York (1996).

  24. Gnanadesikan, R., Kettenring, J. R.: Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics. 28(1), 81–124 (1972).

  25. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., Lander, E. S.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 286, 531–537 (1999).

  26. Golub, G. H., Van Loan, C. F.: Matrix Computations, 4th edn. Johns Hopkins University Press, Baltimore (2013).

  27. Hastings, W. K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 57(1), 97–109 (1970).

  28. Isaacson, D. L., Madsen, R. W.: Markov Chains. Wiley, New York (1976).

  29. Jain, A. K., Dubes, R. C.: Algorithms for Clustering Data. Prentice Hall, Upper Saddle River (1988).

  30. Jarvis, A., Reuter, H. I., Nelson, A., Guevara, E.: Hole-filled seamless SRTM data V4. International Centre for Tropical Agriculture (CIAT) (2008).

  31. Jolliffe, I. T.: Principal Component Analysis. Springer, New York (1986).

  32. Kaul, M.: Building accurate 3D spatial networks to enable next generation intelligent transportation systems. In: Proceedings of the International Conference on Mobile Data Management (IEEE MDM), vol. 1, pp. 137–146. Milan, Italy (2013).

  33. Kumar, M., Orlin, J. B.: Scale-invariant clustering with minimum volume ellipsoids. Comput. Oper. Res. 35, 1017–1029 (2008).

  34. Lee, H., Yoo, J.-H., Park, D.: Data clustering method using a modified Gaussian kernel metric and kernel PCA. ETRI J. 36(3), 333–342 (2014).

  35. MacEachern, S. N.: Estimating normal means with a conjugate-style Dirichlet process prior. Commun. Stat. Simul. Comput. 23, 727–741 (1994).

  36. MacQueen, J. B.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press, Berkeley (1967).

  37. Mahalanobis, P. C.: On the generalized distance in statistics. In: Proceedings of the National Institute of Sciences of India (1936).

  38. McCullagh, P.: Marginal likelihood for parallel series. Bernoulli. 14(3), 593–603 (2008).

  39. McCullagh, P., Yang, J.: Stochastic classification models. In: Proceedings of the International Congress of Mathematicians, vol. III, pp. 669–686. Madrid (2006).

  40. McCullagh, P., Yang, J.: How many clusters? Bayesian Anal. 3, 101–120 (2008).

  41. McLachlan, G., Peel, D.: Finite Mixture Models. Wiley, New York (2000).

  42. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953).

  43. Neal, R. M.: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 9, 249–265 (2000).

  44. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Syst. 14, 849–856 (2001).

  45. Ozawa, K.: A stratificational overlapping cluster scheme. Pattern Recognit. 18, 279–286 (1985).

  46. Pitman, J.: Combinatorial Stochastic Processes. Lecture notes for the St. Flour course, École d’Été de Probabilités de Saint-Flour XXXII-2002. Dept. of Statistics, U.C. Berkeley (2002).

  47. Robert, C. P., Casella, G.: Introducing Monte Carlo Methods with R. Springer (2010).

  48. Roberts, G., Rosenthal, J.: Optimal scaling for various Metropolis-Hastings algorithms. Stat. Sci. 16(4), 351–367 (2001).

  49. Shioda, R., Tunçel, L.: Clustering via minimum volume ellipsoids. Comput. Optim. Appl. 37, 247–295 (2007).

  50. Stein, C.: Estimation of a covariance matrix. Rietz Lecture, IMS-ASA Annual Meeting (1975).

  51. Vogt, J. E., Prabhakaran, S., Fuchs, T. J., Roth, V.: The translation-invariant Wishart-Dirichlet process for clustering distance data. In: Proceedings of the 27th International Conference on Machine Learning, pp. 1111–1118 (2010).

  52. Ward, J. H.: Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963).

Acknowledgements


The authors thank Peter McCullagh for his insightful comments and suggestions on an early version of this paper. The authors are grateful to the Editor-in-Chief, the Associate Editor and anonymous reviewers for their constructive comments and suggestions which led to remarkable improvement of the paper.


Funding

This work was supported by National Science Foundation grants (DMS-1924792, DMS-1924859), the LAS Award for Faculty of Science at the University of Illinois at Chicago, and the In-House Award at the University of Central Florida.

Author information




Authors’ contributions

Hsin-Hsiung Huang wrote the draft of the manuscript, developed the algorithms, and conducted the experiments. Jie Yang proposed the methods and the initial algorithms. Both authors read and approved the final manuscript.

Authors’ information

Hsin-Hsiung Huang, Ph.D., is an Associate Professor in the Department of Statistics and Data Science at the University of Central Florida. Jie Yang, Ph.D., is an Associate Professor in the Department of Mathematics, Statistics, and Computer Science at the University of Illinois at Chicago.

Corresponding author

Correspondence to Hsin-Hsiung Huang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Huang, HH., Yang, J. Affine-transformation invariant clustering models. J Stat Distrib App 7, 10 (2020).


Keywords

  • Dirichlet process
  • Ewens process
  • Metropolis-Hastings algorithm
  • Markov chain Monte Carlo sampling
  • Unsupervised learning