Affine-transformation invariant clustering models
Journal of Statistical Distributions and Applications volume 7, Article number: 10 (2020)
Abstract
We develop a cluster process that is invariant with respect to unknown affine transformations of the feature space and does not require the number of clusters to be known in advance. Specifically, the proposed method identifies clusters that are invariant under (I) orthogonal transformations, (II) scaling-coordinate orthogonal transformations, and (III) arbitrary nonsingular linear transformations, corresponding to models I, II, and III, respectively, and represents the clusters with a heatmap of the similarity matrix. The proposed Metropolis-Hastings algorithm yields an irreducible and aperiodic Markov chain and identifies clusters reasonably well in various applications. Both the synthetic and the real data examples show that the proposed method could be widely applied in many fields, especially for finding the number of clusters and identifying clusters of samples of interest in aerial photography and genomic data.
Introduction
Clustering objects in a way that is invariant with respect to affine transformations of the feature vectors is an important research topic, since objects may be recorded from different angles and positions, so that their coordinates vary and their nearest neighbors may belong to other clusters. For example, the longitude, latitude, and altitude coordinates of an object recorded by devices on aircraft or satellites change across observation times. In this situation, distance-based clustering methods, including k-means (MacQueen 1967), hierarchical clustering (Ward 1963), clustering based on principal components, spectral clustering (Ng et al. 2001), and others (Jain and Dubes 1988; Ozawa 1985), may fail to identify the correct clusters because they group nearest points. Another category consists of distribution-based clustering methods (Banfield and Raftery 1993; Fraley and Raftery 1998; Fraley and Raftery 2002; Fraley and Raftery 2007; McCullagh and Yang 2008; Vogt et al. 2010), which may specify a partition as a parameter in a likelihood function and estimate it under a Bayesian framework.
In certain areas of application, the goal is to cluster objects i=1,…,n into disjoint subsets based on their feature vectors \(Y_{i} \in \mathbb {R}^{d}\). In this paper, we propose group invariance by considering three cases of a cluster process that are invariant with respect to three groups \({\mathcal {G}}\) of affine transformations \(g\colon \mathbb {R}^{d}\to \mathbb {R}^{d}\) acting on the feature space. Group invariance implies that the feature configurations Y and Y^{′} in \(\mathbb {R}^{n\times d}\) determine the same clustering, or the same probability distribution on clusterings, if they belong to the same group orbit, which is an equivalence class. For example, if the feature space is Euclidean and \({\mathcal {G}}\) is the group of Euclidean isometries or congruences, the clustering is a function only of the maximal invariant, which is the array of Euclidean distances D_{ij}=∥Y_{i}−Y_{j}∥. Image data such as aerial photography and three-dimensional protein structures are two motivating examples: the shape and relative locations of the data may vary with the viewer’s angle and location.
Our goal is to develop a novel clustering method which can identify clusters of Y=(Y_{1},…,Y_{n}) even when all Y_{i}’s are mapped by an unknown affine transformation \(Y^{\prime }_{i} = \boldsymbol {a} + A Y_{i}\), where \(\boldsymbol {a}=(a_{1},\ldots,a_{d})\in \mathbb {R}^{d}\) and \(A\in \mathbb {R}^{d\times d}\) is nonsingular. Affine-invariant clustering is important when the clusters are not well separated in the observational space. Although there is previous work on affine-invariant clustering methods (Fitzgibbon and Zisserman 2002; Begelfor and Werman 2006; Shioda and Tunçel 2007; Brubaker and Vempala 2008; Kumar and Orlin 2008; García-Escudero et al. 2010; Lee et al. 2014), these existing methods address problems different from ours. They aim to cluster the same item observed from different angles or mapped by different unknown affine transformations. In our problem setting, by contrast, we consider only one unknown affine transformation that is applied to all objects.
The affine transformations consist of three types: (1) index permutations, rotations, a single common scaling of all variables, and location translations, which correspond to the first type of covariance structure and are named model I, whose transformation and covariance structure σ^{2}I_{d} were also adopted by Vogt et al. (2010); (2) transformations in which each variable may have a different scaling, which correspond to the second type of covariance structure and are named model II; (3) transformations of the variables by a nonsingular matrix, named model III, where the observed variables may be linear combinations of some latent variables in model I. These models cover fairly general clustering situations.
McCullagh and Yang (2008) constructed a Dirichlet cluster process together with a random partition representing the clustering. In this paper, we follow their setup and extend their framework. We assume that the random partition of objects follows the Ewens distribution (Ewens 1972), and we propose a likelihood of the responses which is invariant with respect to affine transformations.
Cluster process and prior distributions
In this paper, an \(\mathbb {R}^{d}\)-valued cluster process (Y,B) means a random partition B of the natural numbers, together with an infinite sequence Y_{1},Y_{2},… of random vectors in the state space \(\mathbb {R}^{d}\). The restriction of such a process to a finite sample [n]={1,…,n} of units or specimens consists of the restricted partition B[n] accompanied by the finite sequence Y[n]=(Y_{1},…,Y_{n}). The partition B[n]:[n]×[n]→{0,1} of the sample units is expressed as a binary cluster-factor matrix with B_{i,j}=1 if Y_{i} and Y_{j} belong to the same cluster (denoted i∼j), and B_{i,j}=0 otherwise (McCullagh and Yang 2008). For example, when n=3, the partition {{1,2},{3}} and the cluster labels 112 correspond to the equivalence relation
\[B = \left (\begin {array}{lll} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \\ \end {array} \right).\]
Notice that the relation encoded by B is transitive, i.e., for any individuals i,j,k, B_{i,j}=1 and B_{j,k}=1 imply B_{i,k}=1.
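To make the cluster-factor matrix concrete, the following minimal sketch builds B from a vector of cluster labels and checks that it encodes an equivalence relation. The function names are hypothetical and only illustrate the definition above; they are not part of the original method.

```python
import numpy as np

def labels_to_partition_matrix(labels):
    """Return the n x n binary matrix B with B[i, j] = 1 iff labels[i] == labels[j]."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def is_equivalence_relation(B):
    """Check reflexivity, symmetry, and transitivity of a 0/1 matrix."""
    B = np.asarray(B)
    reflexive = np.all(np.diag(B) == 1)
    symmetric = np.array_equal(B, B.T)
    # B @ B counts two-step links i ~ j ~ k; transitivity means no new links appear.
    transitive = np.array_equal((B @ B > 0).astype(int), B)
    return bool(reflexive and symmetric and transitive)

# Cluster labels 112 for n = 3, i.e. the partition {{1,2},{3}}.
B = labels_to_partition_matrix([1, 1, 2])
print(B)
print(is_equivalence_relation(B))
```

Note that relabeling the clusters (for example 221 instead of 112) gives the same matrix, which is why working with B avoids label switching.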
The term cluster process implies infinite exchangeability, which means that the joint distribution p_{n} of (Y[n],B[n]) is symmetric (McCullagh and Yang 2006) or invariant under permutations of indices (Pitman 2002), and p_{n} is the marginal distribution of p_{n+1} under deletion of the (n+1)th unit from the sample.
Similar to McCullagh and Yang (2008), we construct an exchangeable Gaussian mixture as a simple example of a cluster process. First, B∼p is some infinitely exchangeable random partition. Second, the conditional distribution of the samples Y, regarded as a matrix (Y_{i,r}) of order n×d, given B (say the cluster label cl(Y_{i})=l) and θ, is Gaussian with mean and covariance as follows
\[E\left (Y_{i,r} \mid cl(Y_{i})=l, \boldsymbol {\mu }_{l}\right) = \mu _{lr}, \qquad \text {cov}\left (Y_{i,r}, Y_{j,s} \mid B, \theta \right) = \left (\delta _{i,j} + \theta B_{i,j}\right)\Sigma _{r,s},\]
where \(\boldsymbol { \mu }_{l}=\left (\mu _{l1},\ldots,\mu _{ld}\right) \in \mathbb {R}^{d}\) is the centroid of cluster l, δ is Kronecker’s delta, that is, δ_{i,j}=1 if i=j and 0 if i≠j, θ>0 is a ratio parameter connecting the within- and between-cluster covariance matrices, and Σ=(Σ_{r,s}) is a positive definite matrix of order d×d, known as the within-cluster covariance matrix. In our settings, the between-cluster covariance matrix is simply θΣ, the cluster centroids μ_{1},…,μ_{k} are iid from N(μ,θΣ), and the mean of Y given B and μ_{1},…,μ_{k} is
\[E\left (Y \mid B, \boldsymbol {\mu }_{1}, \ldots, \boldsymbol {\mu }_{k}\right) = \left (\boldsymbol {\mu }_{cl(Y_{1})}, \ldots, \boldsymbol {\mu }_{cl(Y_{n})}\right)^{\intercal },\]
and the covariance of Y given B can also be represented by the covariance of its vector form \(\text {Vec}(Y)=\left (Y_{11}, \hdots, Y_{1d}, \hdots, Y_{n1}, \hdots, Y_{nd} \right)^{\intercal }\) as
\[\text {Cov}\left (\text {Vec}(Y) \mid B\right) = (I_{n} + \theta B)\otimes \Sigma,\]
which is an nd×nd matrix, with “⊗” indicating the Kronecker product. Σ, the column covariance of Y, is assumed identical for all clusters, I_{n}+θB is assumed to be an exchangeable structure for the row covariance of Y, and θ is the product of the standard deviations of two rows. There exist competing algorithms that are affine-equivariant and do not impose this requirement (Shioda and Tunçel 2007; Kumar and Orlin 2008; García-Escudero et al. 2010; Lee et al. 2014). The identity matrix itself is also a partition in which each cluster consists of one element.
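As an illustration of the Kronecker structure, the sketch below assembles the nd×nd covariance (I_{n}+θB)⊗Σ for a small partition and draws one sample of Y. It assumes the row-major Vec convention stated above; the function names and the numerical values of θ, Σ, and μ are hypothetical choices for the example.

```python
import numpy as np

def cluster_covariance(B, theta, Sigma):
    """Covariance of row-major Vec(Y): (I_n + theta * B) Kronecker Sigma."""
    n = B.shape[0]
    return np.kron(np.eye(n) + theta * B, Sigma)

def sample_Y(B, theta, Sigma, mu, rng):
    """Draw one n x d sample with mean 1_n (x) mu and the covariance above."""
    n, d = B.shape[0], Sigma.shape[0]
    mean = np.kron(np.ones(n), mu)
    vec_y = rng.multivariate_normal(mean, cluster_covariance(B, theta, Sigma))
    return vec_y.reshape(n, d)          # undo the row-major vectorization

B = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # illustrative within-cluster covariance
theta = 3.0                                   # illustrative variance ratio
C = cluster_covariance(B, theta, Sigma)
Y = sample_Y(B, theta, Sigma, mu=np.zeros(2), rng=np.random.default_rng(0))
print(C.shape, Y.shape)
```

The (i,j) block of C equals (δ_{i,j}+θB_{i,j})Σ, so units in different clusters are uncorrelated while units in the same cluster share the covariance θΣ.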
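Given the number of clusters k, the cluster sizes (n_{1},…,n_{k}) may follow a multinomial distribution with category probabilities π=(π_{1},…,π_{k}), where π follows an exchangeable Dirichlet distribution Dir(λ/k,…,λ/k). After integrating out π, the partition B follows a Dirichlet-multinomial prior
\[p_{n}(B \mid \lambda, k) = \frac {k!}{(k - \#B)!}\, \frac {\Gamma (\lambda) \prod _{b\in B} \Gamma (n_{b} + \lambda /k)}{\Gamma (n + \lambda) \left [\Gamma (\lambda /k)\right ]^{\#B}},\]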
where #B≤k denotes the number of clusters present in the partition B and n_{b} is the size of cluster b (MacEachern 1994; Dahl 2005; McCullagh and Yang 2008). The limit as k→∞ is well defined and known as the Ewens sampling formula (ESF) with parameter λ>0
\[p_{n}(B \mid \lambda) = \frac {\Gamma (\lambda)\, \lambda ^{\#B}}{\Gamma (n + \lambda)} \prod _{b\in B} \Gamma (n_{b}),\]
which is also known as the Chinese restaurant process (CRP) (Ewens 1972; Neal 2000; Blei and Jordan 2006; Crane 2016). McCullagh and Yang (2008) provided a framework with a finite number of clusters and general covariance structures. In this paper, we adopt the CRP prior for the partition B, which implies k=∞ in the population, together with the proposed Gaussian likelihood to obtain affine-invariant clusters. Note that #B≤n for any given sample size n.
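The prior on partitions can be sketched in a few lines. The code below evaluates the log probability of a partition under the standard form of the Ewens sampling formula, log p_{n}(B | λ) = log Γ(λ) + #B log λ − log Γ(n+λ) + Σ_{b} log Γ(n_{b}), and draws a partition with the sequential Chinese restaurant construction. The function names are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
import math
from collections import Counter

import numpy as np

def ewens_log_prob(labels, lam=1.0):
    """Log probability of the partition induced by `labels` under the Ewens formula."""
    sizes = Counter(labels).values()
    n = len(labels)
    return (math.lgamma(lam) + len(sizes) * math.log(lam) - math.lgamma(n + lam)
            + sum(math.lgamma(nb) for nb in sizes))

def sample_crp(n, lam=1.0, rng=None):
    """Seat n customers one at a time; return their table (cluster) labels."""
    rng = np.random.default_rng() if rng is None else rng
    labels, sizes = [], []
    for i in range(n):
        # Join table b with probability n_b / (i + lam); open a new one with lam / (i + lam).
        probs = np.array(sizes + [lam], dtype=float) / (i + lam)
        b = int(rng.choice(len(probs), p=probs))
        if b == len(sizes):
            sizes.append(0)
        sizes[b] += 1
        labels.append(b)
    return labels

# The five partitions of three units; their probabilities sum to one.
all_partitions = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [0, 1, 2]]
total = sum(math.exp(ewens_log_prob(p, lam=1.0)) for p in all_partitions)
print(total)
print(sample_crp(10, lam=1.0, rng=np.random.default_rng(0)))
```

The number of clusters is not fixed in advance: it is determined by how many new tables are opened, which is the property exploited by the CRP prior.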
We choose a proper prior distribution for the variance ratio θ, the symmetric F-family
\[p(\theta) \propto \frac {\theta ^{\alpha -1}}{(1+\theta)^{2\alpha }},\]
with α>0 allowing a range of reasonable choices (Chaloner 1987).
We propose a sampling procedure to estimate the partition B and the parameter θ from conditional probabilities. Since the conditional distribution of θ does not have a recognized form, we propose to use a discrete version \(\left \{p(\theta _{j})\right \}^{J}_{j=1},\) where J is a predetermined moderately large integer. Based on our experience, J=100 works reasonably well for the real data examples that we have examined.
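A discretized prior of this kind can be sketched as follows. The code assumes the density p(θ) ∝ θ^{α−1}/(1+θ)^{2α} and, as an example grid, the powers of two {2^{−3},…,2^{10}} that the paper reports as its default range for θ; the function name is hypothetical.

```python
import numpy as np

def discrete_theta_prior(grid, alpha=1.0):
    """Normalized prior weights p(theta_j) on a finite grid of theta values."""
    grid = np.asarray(grid, dtype=float)
    # Work on the log scale: (alpha - 1) log(theta) - 2 alpha log(1 + theta).
    log_w = (alpha - 1.0) * np.log(grid) - 2.0 * alpha * np.log1p(grid)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

grid = 2.0 ** np.arange(-3, 11)        # 2^-3, 2^-2, ..., 2^10
prior = discrete_theta_prior(grid, alpha=1.0)
print(len(grid), prior.sum())
```

Because the support is finite, the conditional distribution of θ can later be sampled exactly by normalizing prior times likelihood over the grid.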
Affine-transformation invariant clustering
The affine-transformation invariant clustering identified in this manuscript is invariant even when the objects are mapped by an unknown affine transformation. The conditional distribution on partitions of [n]={1,…,n} is determined by the finite sequence Y=(Y_{1},…,Y_{n}) regarded as a configuration of n labeled points in \(\mathbb {R}^{d}\). The exchangeability condition implies that any permutation π of the sequence induces a corresponding permutation in B, i.e. p_{n}(B^{π} | Y=y^{π})=p_{n}(B | Y=y), where \(y^{\pi }_{i} = y_{\pi (i)}\) and \(B^{\pi }_{i,j} = B_{\pi (i), \pi (j)}\). In many cases, it is reasonable to assume additional symmetries involving transformations in \(\mathbb {R}^{d}\), for example p_{n}(B | Y)=p_{n}(B | −Y). We are asking, in effect, whether two labeled configurations Y and Y^{′} which are geometrically equivalent in \(\mathbb {R}^{d}\) should determine the same conditional distribution on sample partitions.
If the state space \(\mathbb {R}^{d}\) is regarded as a d-dimensional Euclidean space with the standard Euclidean inner product and Euclidean metric, the configurations Y and Y^{′} are congruent if there exist a vector \(\boldsymbol {a}=(a_{1},\ldots,a_{d})\in \mathbb {R}^{d}\) and an orthogonal matrix \(A\in \mathbb {R}^{d\times d}\) such that \(Y^{\prime }_{i} = \boldsymbol {a} + A Y_{i}\) for each i. Equivalently, the n×n arrays of squared Euclidean distances D_{ij}=∥Y_{i}−Y_{j}∥^{2} and \(D^{\prime }_{ij} = \left \|Y'_{i} - Y'_{j}\right \|^{2}\) are equal. The configurations are geometrically similar if \(Y^{\prime }_{i} = \boldsymbol {a} + b Y_{i}\) for \(b \in \mathbb {R}\) and b≠0, implying that the arrays of distances are proportional, D^{′}=b^{2}D.
The geometric equivalence is defined by regarding the observation Y as a group orbit rather than a point. In general, the group is the affine group \({GA}\left (\mathbb {R}^{d}\right)\), that is, \({\mathcal {G}}=\mathbb {R}^{d} \times L\), where L is the collection of all d×d nonsingular matrices, with the operation (a_{1},A_{1})∘(a_{2},A_{2})=(a_{1}+A_{1}a_{2},A_{1}A_{2}) for \(\boldsymbol {a}_{i}\in \mathbb {R}^{d}, A_{i}\in L\) with i=1,2, which is consistent with compositions of affine transformations. The orbit of an element \(Y = (Y_{1}, \ldots, Y_{n})^{\intercal } \in \mathbb {R}^{n\times d}\) is defined as
\[\text {Orb}(Y) = {\mathcal {G}}Y = \left \{ g \star Y \mid g \in {\mathcal {G}} \right \}, \qquad (1)\]
where the group action is that \({\mathcal {G}}\) acts on \(\mathbb {R}^{n\times d}\) as
\[(\boldsymbol {a}, A) \star Y = \boldsymbol {1}_{n} \boldsymbol {a}^{\intercal } + Y A^{\intercal }, \qquad (2)\]
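where 1_{n} is a length-n vector of 1’s. It can be verified that its vector form is Vec((a,A)⋆Y)=1_{n}⊗a+(I_{n}⊗A)Vec(Y). If
\[\text {Vec}(Y) \sim N\left (\boldsymbol {1}_{n} \otimes \boldsymbol {\mu }, (I_{n} + \theta B)\otimes \Sigma \right),\]
then an element in the same orbit satisfies
\[\text {Vec}\left ((\boldsymbol {a}, A) \star Y\right) \sim N\left (\boldsymbol {1}_{n} \otimes (\boldsymbol {a} + A\boldsymbol {\mu }), (I_{n} + \theta B)\otimes \left (A \Sigma A^{\intercal }\right)\right).\]
More specifically,
\[\text {Vec}\left (\left (-T^{-1}\boldsymbol {\mu }, T^{-1}\right) \star Y\right) \sim N\left (\mathbf {0}, (I_{n} + \theta B)\otimes I_{d}\right),\]
where T is a d×d nonsingular matrix satisfying \(\Sigma = T T^{\intercal }\).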
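The vectorization identity and the group operation can be checked numerically. The sketch below assumes that the action maps every row Y_{i} to a+AY_{i} and that Vec stacks the rows of Y; the function name `act` and the random test values are hypothetical.

```python
import numpy as np

def act(a, A, Y):
    """Group action (a, A) * Y: every row Y_i is mapped to a + A Y_i."""
    return np.ones((Y.shape[0], 1)) @ a[None, :] + Y @ A.T

rng = np.random.default_rng(0)
n, d = 5, 3
Y = rng.normal(size=(n, d))
a1, a2 = rng.normal(size=d), rng.normal(size=d)
A1, A2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Vec((a, A) * Y) = 1_n (x) a + (I_n (x) A) Vec(Y), with row-major Vec.
lhs = act(a1, A1, Y).reshape(-1)
rhs = np.kron(np.ones(n), a1) + np.kron(np.eye(n), A1) @ Y.reshape(-1)
print(np.allclose(lhs, rhs))

# Composition rule: (a1, A1) o (a2, A2) = (a1 + A1 a2, A1 A2).
composed = act(a1 + A1 @ a2, A1 @ A2, Y)
sequential = act(a1, A1, act(a2, A2, Y))
print(np.allclose(composed, sequential))
```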
Theorem 1
If n≤d, then all \(Y\in \mathbb {R}^{n\times d}\) of full rank n belong to the same orbit. If n=d+1, then all \(Y\in \mathbb {R}^{n\times d}\) satisfying \(\text {rank}(Y) = \text {rank}(Y - \boldsymbol {1}_{n} \boldsymbol {1}_{n}^{\intercal } Y/n) = d\) belong to the same orbit.
The proof of Theorem 1 is relegated to Appendix A. According to the proof, if n=d+1, then rank(Y)=d implies that \(\text {rank}(Y - \boldsymbol {1}_{n} \boldsymbol {1}_{n}^{\intercal } Y/n)\) is either d or d−1. The case of d−1 only occupies a lower-dimensional subspace.
According to Theorem 1, for n≤d+1, the action is essentially transitive in the sense that all configurations of n distinct points in \(\mathbb {R}^{d}\) belong to the same orbit: all other orbits are negligible in that they have Lebesgue measure zero. As a result, the observation Y regarded as a group orbit \({\mathcal {G}} Y\) is uninformative for clustering unless n>d+1. We name the orbit and group action defined above as model III.
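The case n=d+1 can be illustrated numerically: for d=2, any generic triangle in the plane can be mapped onto any other by a single affine map, so three points carry no clustering information. The sketch below solves for (a,A) directly; the function name is hypothetical, and the two random configurations are assumed to be in general position.

```python
import numpy as np

def affine_map_between(Y, Y_prime):
    """Solve [1_n, Y] [a^T; A^T] = Y' for (a, A) when n = d + 1."""
    n, d = Y.shape
    M = np.hstack([np.ones((n, 1)), Y])      # square (d+1) x (d+1) system
    coef = np.linalg.solve(M, Y_prime)       # first row is a^T, remaining rows are A^T
    return coef[0], coef[1:].T

rng = np.random.default_rng(1)
d = 2
Y = rng.normal(size=(d + 1, d))              # a generic triangle
Y_prime = rng.normal(size=(d + 1, d))        # another generic triangle
a, A = affine_map_between(Y, Y_prime)
mapped = a + Y @ A.T                         # apply Y_i -> a + A Y_i to every row
print(np.allclose(mapped, Y_prime), np.linalg.det(A))
```

With n>d+1 points the linear system becomes overdetermined, which is why distinct orbits, and hence informative configurations, only appear beyond that threshold.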
In model I, which is the case considered in Vogt et al. (2010), the covariance between features is proportional to an identity matrix. The group is \({\mathcal {G}}=\mathbb {R}^{d} \times \mathbb {R}\setminus \{0\}\) with the operation (a_{1},b_{1})∘(a_{2},b_{2})=(a_{1}+b_{1}a_{2},b_{1}b_{2}) for \(\boldsymbol {a}_{i}\in \mathbb {R}^{d}, b_{i}\in \mathbb {R}\setminus \{0\}, i=1,2\). The orbit of an element \(Y\in \mathbb {R}^{n\times d}\) and the group action are defined similarly as in (1) and (2) with A replaced by b. Then \(\left (\boldsymbol {a}, b\right)\star Y = \boldsymbol {1}_{n} \boldsymbol {a}^{\intercal } + bY\) and Vec((a,b)⋆Y)=1_{n}⊗a+bVec(Y). If
\[\text {Vec}(Y) \sim N\left (\boldsymbol {1}_{n} \otimes \boldsymbol {\mu }, (I_{n} + \theta B)\otimes \sigma ^{2} I_{d}\right),\]
then Vec((−μ,1)⋆Y)∼N(0,(I_{n}+θB)⊗σ^{2}I_{d}) and Vec((−μ/σ,1/σ)⋆Y)∼N(0,(I_{n}+θB)⊗I_{d}), which correspond to elements in Orb(Y).
In essence, the observation is not regarded as a point in \(\mathbb {R}^{n\times d}\) but is treated as a group orbit generated by the group of rigid transformations, or similarity transformations if scalar multiples are permitted. In statistical terms, this approach meshes with the submodel in which the matrix Σ in model I is a scalar multiple of the identity matrix I_{d}. An equivalent way of saying the same thing for n>d is that the column-centered sample matrix \(\tilde Y = Y - \mathbf {1}_{n} \mathbf {1}_{n}^{\intercal } Y /n\) determines the sample covariance matrix \(S = \left (\tilde Y^{\intercal } \tilde Y\right)/(n-1)\) and hence the Mahalanobis metric \(\|x - x^{*}\|^{2} = (x - x^{*})^{\intercal } S^{-1} (x - x^{*})\) in the state space (Mahalanobis 1936; Gnanadesikan and Kettenring 1972). One implication is that the n×n matrix D=(D_{ij})=(∥Y_{i}−Y_{j}∥^{2}) of standardized interpoint Mahalanobis distances is maximal invariant, and the conditional distribution on sample partitions depends on Y only through this matrix.
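The invariance of the standardized interpoint distances can be verified numerically. The sketch below computes the matrix of squared Mahalanobis distances from the column-centered sample covariance and checks that it is unchanged when every row is mapped by Y_{i}→a+AY_{i}. The function name and the particular (well-conditioned) random A are hypothetical choices for the illustration.

```python
import numpy as np

def mahalanobis_distance_matrix(Y):
    """n x n matrix of squared Mahalanobis distances based on the sample covariance S."""
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)                     # column-centered sample matrix
    S = Yc.T @ Yc / (n - 1)
    S_inv = np.linalg.inv(S)
    diff = Y[:, None, :] - Y[None, :, :]        # pairwise differences, shape (n, n, d)
    return np.einsum('ijr,rs,ijs->ij', diff, S_inv, diff)

rng = np.random.default_rng(2)
Y = rng.normal(size=(30, 3))
a = rng.normal(size=3)
A = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)   # a nonsingular linear map
D = mahalanobis_distance_matrix(Y)
D_prime = mahalanobis_distance_matrix(a + Y @ A.T)
print(np.allclose(D, D_prime))
```

The cancellation happens because the transformed covariance is ASA^{⊺} while the differences are multiplied by A, so the quadratic form is unchanged.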
In practice, the d variables are sometimes measured on scales that are not commensurate with one another, so the state space seldom has a natural metric. In this case, we regard Y and Y^{′} as equivalent configurations if, for each feature Y_{·,j}, there are \(a_{j}\in \mathbb {R}\) and \(b_{j} \in \mathbb {R}\setminus \{0\}\) such that \(Y^{\prime }_{\cdot,j} = a_{j} + b_{j} Y_{\cdot,j}\). In model II, the group is the affine group \({GA}(\mathbb {R})^{d}\), that is, \({\mathcal {G}}=\mathbb {R}^{d} \times D\), where D={diag{b_{1},…,b_{d}}∣b_{i}≠0, i=1,…,d}, with the operation (a_{1},A_{1})∘(a_{2},A_{2})=(a_{1}+A_{1}a_{2},A_{1}A_{2}) for \(\boldsymbol {a}_{i}\in \mathbb {R}^{d}, A_{i}\in D\) with i=1,2. The orbit of an element \(Y\in \mathbb {R}^{n\times d}\) and the group action are defined in (1) and (2) with A∈D. If
\[\text {Vec}(Y) \sim N\left (\boldsymbol {1}_{n} \otimes \boldsymbol {\mu }, (I_{n} + \theta B)\otimes \text {diag}\left \{\sigma _{1}^{2}, \ldots, \sigma _{d}^{2}\right \}\right),\]
then \( \text {Vec}\left ((-\boldsymbol {\mu }, I_{d}) \star Y\right) \sim N(\mathbf {0}, (I_{n} + \theta B)\otimes \text {diag}\left \{\sigma _{1}^{2}, \ldots, \sigma _{d}^{2}\right \}) \), and furthermore Vec((a,A)⋆Y)∼N(0,(I_{n}+θB)⊗I_{d}) with \(\boldsymbol {a} = \left (-\mu _{1}/\sigma _{1}, \ldots, -\mu _{d}/\sigma _{d}\right)^{\intercal }\) and \(A = \text {diag}\left \{\sigma _{1}^{-1}, \ldots, \sigma _{d}^{-1}\right \}\), which correspond to elements of the group orbit. No linear combinations are permitted here, so that the integrity of the variables is preserved.
Moreover, in some cases, the location information or shapes of objects in aerial photography applications may be distorted by the viewer’s angle or position, so that the variables may be strongly correlated. A more extreme approach avoids the metric assumption by regarding Y and Y^{′} as equivalent configurations if there exist a vector \(\boldsymbol {a}\in \mathbb {R}^{d}\) and a nonsingular matrix \(A\in \mathbb {R}^{d\times d}\), with \(A^{\intercal } A\) positive definite, such that \(Y^{\prime }_{i} = \boldsymbol {a} + A Y_{i}\) for all i. Consequently, models I, II, and III specify the structure of the covariance matrix between features, and the partition B of Y is affine invariant and the same as the partition B of the group orbit \({\mathcal {G}}Y \subset \mathbb {R}^{n\times d}\), which is independent of the mean.
3.1 Gaussian marginal probabilities
The distribution of the column-centered group orbit, \({\mathcal {G}}Y\), is assumed to be the Gaussian distribution
\[N\left (\mathbf {0}, (I_{n} + \theta B)\otimes \left (A^{\intercal } A\right)\right),\]
which depends only on I_{n}+θB and \(A^{\intercal } A\). Indeed, it can be verified that for any \((\boldsymbol {a}, A) \in {\mathcal {G}}\), the two distributions of group orbits induced by N(1_{n}⊗μ,(I_{n}+θB)⊗Σ) and \(N\left (\boldsymbol {1}_{n} \otimes (\boldsymbol {a} + A\boldsymbol {\mu }), (I_{n} + \theta B)\otimes \left (A \Sigma A^{\intercal }\right)\right)\), respectively, are the same.
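McCullagh (2008) studied d time series with autocorrelation Γ and n observations in time or space following three Gaussian distribution models N(0,Γ⊗Σ) under different assumptions on Σ as follows:
\[\text {I: } \Sigma = \sigma ^{2} I_{d}; \qquad \text {II: } \Sigma = \text {diag}\left \{\sigma _{1}^{2}, \ldots, \sigma _{d}^{2}\right \}; \qquad \text {III: } \Sigma \in PD_{d},\]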
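where PD_{d} is the collection of d×d symmetric positive definite matrices. These three models correspond to the three models of affine-transformed equivalence classes discussed in the previous section. In this paper, we set (I_{n}+θB) as Γ and \(A^{\intercal } A\) as Σ. Following McCullagh (2008), the log-likelihood based on Y for all three models is, up to an additive constant:
\[l(\Gamma, \Sigma; Y) = -\frac {d}{2}\log \det \Gamma - \frac {n}{2}\log \det \Sigma - \frac {1}{2}\text {tr}\left (Y^{\intercal } \Gamma ^{-1} Y \Sigma ^{-1}\right).\]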
Lemma 1
(I_{n}+θB)^{−1}=I_{n}−θWB, where W=diag{(1+θN_{1})^{−1},…,(1+θN_{n})^{−1}} and N_{i} is the ith element of the vector N=B1_{n}, that is, the size of the cluster containing unit i.
According to Lemma 1 and its proof, which is relegated to Appendix A, Γ=I_{n}+θB is always nonsingular for θ>0 and its inverse Γ^{−1}=(I_{n}+θB)^{−1}=I_{n}−θWB can be obtained explicitly. To ensure that \(Y^{\intercal } \Gamma ^{-1} Y\) is positive definite with probability 1 (McCullagh 2008), as well as informative group orbits (see Theorem 1 and its subsequent discussion), we assume n>d+1.
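Lemma 1 is easy to confirm numerically. The sketch below compares the closed-form inverse I_{n}−θWB with a direct matrix inversion, and also uses the block structure of B to obtain log det Γ as the sum of log(1+θn_{b}) over clusters, which follows from det(I_{m}+θ1_{m}1_{m}^{⊺})=1+θm. The function names and the example partition are hypothetical.

```python
import numpy as np

def gamma_inverse(B, theta):
    """Closed-form inverse of Gamma = I_n + theta * B from Lemma 1."""
    n = B.shape[0]
    N = B @ np.ones(n)                     # N_i = size of the cluster containing unit i
    W = np.diag(1.0 / (1.0 + theta * N))
    return np.eye(n) - theta * W @ B

def log_det_gamma(B, theta):
    """log det(I_n + theta * B) = sum over clusters of log(1 + theta * n_b)."""
    N = B.sum(axis=1)
    # Each cluster of size n_b contributes n_b equal terms, hence the division by N_i.
    return float(np.sum(np.log1p(theta * N) / N))

labels = np.array([0, 0, 0, 1, 1, 2])
B = (labels[:, None] == labels[None, :]).astype(float)
theta = 2.5
closed_form = gamma_inverse(B, theta)
direct = np.linalg.inv(np.eye(len(labels)) + theta * B)
print(np.allclose(closed_form, direct))
```

Avoiding a general n×n inversion in this way matters in practice, because Γ^{−1} has to be recomputed every time the partition or θ changes.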
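After plugging in the maximum likelihood estimator of Σ, which for model III is \(\hat {\Sigma }_{\Gamma }=Y^{\intercal }\Gamma ^{-1}Y/n\), for model II is \(\text {diag}\left (\hat {\Sigma }_{\Gamma }\right)\), and for model I is \(\text {tr}\left (\hat {\Sigma }_{\Gamma }\right)I_{d}/d\) (McCullagh 2008), the profile likelihood of Γ is
\[L_{p}\left (\Gamma ^{-1} \mid {\mathcal {G}}Y\right) \propto \det \left (\Gamma ^{-1}\right)^{d/2} \times \left \{\begin {array}{ll} \text {tr}\left (Y^{\intercal }\Gamma ^{-1}Y\right)^{-nd/2} & \text {for model I,} \\ \prod _{r=1}^{d} \left (Y_{(r)}^{\intercal }\Gamma ^{-1}Y_{(r)}\right)^{-n/2} & \text {for model II,} \\ \det \left (Y^{\intercal }\Gamma ^{-1}Y\right)^{-n/2} & \text {for model III,} \\ \end {array} \right.\]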
where \(Y_{(r)} \in \mathbb {R}^{n}\) is the rth column of Y, r=1,…,d.
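A minimal sketch of a profile log-likelihood of this type is given below. It is an assumption-laden illustration, not the authors' code: it takes the zero-mean Gaussian model N(0,Γ⊗Σ), column-centers Y, plugs in the estimators of Σ quoted above, drops additive constants, and uses Lemma 1 for Γ^{−1}. The function name is hypothetical.

```python
import numpy as np

def profile_loglik(Y, labels, theta, model="III"):
    """Profile log-likelihood of Gamma = I_n + theta * B, up to an additive constant."""
    labels = np.asarray(labels)
    n, d = Y.shape
    Yc = Y - Y.mean(axis=0)                                # assumed column centering
    B = (labels[:, None] == labels[None, :]).astype(float)
    N = B.sum(axis=1)
    Gamma_inv = np.eye(n) - theta * (B / (1.0 + theta * N)[:, None])   # Lemma 1
    log_det_Gamma_inv = -np.sum(np.log1p(theta * N) / N)
    M = Yc.T @ Gamma_inv @ Yc
    if model == "I":        # Sigma proportional to the identity
        fit = -0.5 * n * d * np.log(np.trace(M))
    elif model == "II":     # diagonal Sigma
        fit = -0.5 * n * np.sum(np.log(np.diag(M)))
    elif model == "III":    # unrestricted positive definite Sigma
        fit = -0.5 * n * np.linalg.slogdet(M)[1]
    else:
        raise ValueError("model must be 'I', 'II', or 'III'")
    return 0.5 * d * log_det_Gamma_inv + fit

rng = np.random.default_rng(3)
Y = rng.normal(size=(12, 2))
two_clusters = [0] * 6 + [1] * 6
three_clusters = [0] * 4 + [1] * 4 + [2] * 4
A = np.array([[2.0, 1.0], [0.5, 3.0]])      # an arbitrary nonsingular map
Y_mapped = 1.5 + Y @ A.T

# Under model III, an affine map shifts every value by the same constant,
# so comparisons between partitions are unaffected.
before = profile_loglik(Y, two_clusters, 2.0) - profile_loglik(Y, three_clusters, 2.0)
after = profile_loglik(Y_mapped, two_clusters, 2.0) - profile_loglik(Y_mapped, three_clusters, 2.0)
print(before, after)
```

The same check with a diagonal A works for model II, and with a scaled orthogonal A for model I, mirroring the nesting of the three transformation groups.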
The conditional distribution on partitions of [n] depends on the group orbit and the assumptions made regarding Σ. For model I, with Σ∝I_{d} in the Gaussian model, the likelihood depends only on the distance matrix D, so the likelihood is constant on the orbits associated with the larger group of Euclidean similarities. Therefore, for model I, the similarity transformation can be generalized to \(Y^{\prime }_{i} = \boldsymbol { a }+ A Y_{i}\) with \(A^{\intercal } A = \sigma ^{2} I_{d}\) and σ≠0, implying that the arrays of distances are proportional, D^{′}=σ^{2}D. Consequently, there is a representative element of the group orbit with feature mean vector 0, so that Vec(Y)∼N(0,(I_{n}+θB)⊗σ^{2}I_{d}).
For model II, the affine transformation can be generalized to \(Y^{\prime }_{i} = \boldsymbol {a} + A Y_{i}\) for all i, where \(\boldsymbol {a} \in \mathbb {R}^{d}\) and \(A\in \mathbb {R}^{d\times d}\) with \(A^{\intercal } A\) a diagonal matrix with positive diagonal entries. As a result, there is a representative element of the group orbit with feature mean vector 0, so that \(\text {Vec}(Y) \sim N\left (\mathbf {0}, (I_{n} + \theta B)\otimes \text {diag}\left \{\sigma ^{2}_{1},\ldots,\sigma ^{2}_{d}\right \}\right)\). This amounts to working with \({GA}(\mathbb {R})^{d} \), the general affine group acting independently on the d columns of Y. For model III, Σ is an arbitrary matrix in PD_{d}, the group is \({GA}\left (\mathbb {R}^{d}\right)\), and n>d+1. These three models are nested: model I⊂model II⊂model III.
Affine invariance in \(\mathbb {R}^{d}\) is a strong requirement, which comes at a small cost for moderate d provided that d/n is small. When \(d/n \leq 1\), \(Y^{\intercal } \Gamma ^{-1} Y\) is positive definite with probability one (McCullagh 2008), so model III works. However, when d/n<1 is not small, model III may be inefficient because some eigenvalues of \(Y^{\intercal } \Gamma ^{-1} Y\), and hence \(\det \left (Y^{\intercal } \Gamma ^{-1} Y\right)\), are close to zero (Dempster 1972; Stein 1975). As a result, the profile likelihood of Γ becomes unstable. In contrast, model II is less computationally expensive than model III, and model I is the most efficient one.
Markov chain Monte Carlo algorithm for sampling partitions
We use the prior and posterior distributions of θ and B discussed in Section 2 in a Markov chain Monte Carlo (MCMC) algorithm for sampling partitions. At each iteration, θ is updated by Gibbs sampling (Geman and Geman 1984) according to the conditional distribution \(p_{n}\left (\theta _{j} \mid B,{\mathcal {G}}Y\right)\propto p(\theta _{j}) \times L_{p}\left (\Gamma ^{-1} \mid {\mathcal {G}}Y\right),\) where \(p(\theta _{j})\propto \theta _{j}^{\alpha -1}/\left (1+\theta _{j}\right)^{2\alpha }\) for j=1,…,J. For instance, α=1 and the discrete set {2^{−3},2^{−2},…,2^{10}} for the range of θ are used as the default setting in our experiments. For updating B, the conditional distribution on partitions is
\[p_{n}\left (B \mid \theta, {\mathcal {G}}Y\right) \propto p_{n}(B \mid \lambda) \times L_{p}\left (\Gamma ^{-1} \mid {\mathcal {G}}Y\right),\]
where p_{n}(B | λ) is the Ewens distribution, and a Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) is used to choose the next B. The parameter λ is set to 1 in the following applications. After discarding a burn-in portion of the resulting Markov chain, we use the average of the sampled partition matrices as the similarity matrix to make inference on the partition. The proposal distribution \(q\left (B^{(i+1)} \mid B^{(i)}, {\mathcal {G}}Y\right)\) is proportional to exp(−a×d_{xc}), where d_{xc} is the distance between each point and the corresponding cluster centroid and a is a scale hyperparameter, which was set to 2 in our experiments. More specifically, a partition candidate B^{∗} is generated by reassigning the label of each point with probability proportional to the reciprocal of the distance between the point and the corresponding centroid.
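The two update steps can be sketched generically. The code below is a hedged illustration rather than Algorithm 1 itself: `gibbs_update_theta` samples θ from prior times likelihood on a finite grid, and `mh_accept` applies the usual Metropolis-Hastings acceptance rule to a proposed partition. Both function names are hypothetical, the likelihood is passed in as a user-supplied function, and the toy likelihood at the bottom is a stand-in used only to make the example runnable.

```python
import numpy as np

def gibbs_update_theta(log_lik, grid, log_prior, rng):
    """Draw theta from p(theta_j | B, Y), proportional to prior times likelihood on the grid."""
    log_post = log_prior + np.array([log_lik(t) for t in grid])
    post = np.exp(log_post - log_post.max())     # subtract the max for numerical stability
    post /= post.sum()
    return grid[rng.choice(len(grid), p=post)], post

def mh_accept(log_target_new, log_target_old, log_q_forward, log_q_reverse, rng):
    """Accept a proposed partition with probability min(1, target ratio * proposal ratio)."""
    log_ratio = (log_target_new - log_target_old) + (log_q_reverse - log_q_forward)
    return bool(np.log(rng.uniform()) < min(0.0, log_ratio))

rng = np.random.default_rng(0)
grid = 2.0 ** np.arange(-3, 11)                  # default grid quoted in the text
log_prior = -2.0 * np.log1p(grid)                # alpha = 1: p(theta) proportional to (1 + theta)^-2
toy_log_lik = lambda t: -0.5 * (np.log(t) - 1.0) ** 2   # hypothetical stand-in likelihood
theta, post = gibbs_update_theta(toy_log_lik, grid, log_prior, rng)
print(theta, post.round(3))
```

In an actual run, `log_lik` would evaluate the profile likelihood at the current partition, and the target in `mh_accept` would be the log of the Ewens prior plus the log profile likelihood.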
Since Algorithm 1 is a Metropolis-Hastings algorithm, it satisfies the detailed balance condition, and therefore the generated Markov chain has a stationary distribution (Chib and Greenberg 1995; Gamerman 1997; Robert and Casella 2010). Since we leave a small but positive probability that the partition stays the same in the Gibbs sampling, and the discrete posterior of θ is always positive, the transition probability
\[P\left (\theta ^{(k+1)}, B^{(k+1)} \mid \theta ^{(k)}, B^{(k)}\right) > 0,\]
where θ^{(k+1)}=θ^{(k)} and B^{(k+1)}=B^{(k)}, and hence the (θ,B)-valued Markov chain constructed by Algorithm 1 is aperiodic.
Lemma 2
If n>d+1, the (θ,B)-valued Markov chain constructed by Algorithm 1 is aperiodic.
Since there is always a positive chance that the partition can be split further into the simplest partition, in which each element is a cluster, all possible partitions communicate with each other, so that the (θ,B)-valued Markov chain constructed by Algorithm 1 is irreducible. Given the sample size n, the size of the state space of B, known as the Bell number (Bell 1934), and the size of the state space of θ are both finite, so irreducibility also implies positive recurrence. Consequently, the (θ,B)-valued Markov chain constructed by Algorithm 1 is ergodic (Isaacson and Madsen 1976; Gilks et al. 1996). These properties are summarized in the following lemma and theorem, whose proofs are relegated to Appendix A.
Lemma 3
If n>d+1, the (θ,B)-valued Markov chain constructed by Algorithm 1 is irreducible, and thus is positive recurrent.
Theorem 2
(Ergodic theorem) If n>d+1, the (θ,B)-valued Markov chain constructed by Algorithm 1 converges to its stationary distribution \(p_{n}\left (\theta,B \mid {\mathcal {G}}Y\right)\propto p(\theta)\times p_{n}(B \mid \lambda) \times L_{p}\left (\Gamma ^{-1} \mid {\mathcal {G}}Y\right)\). More specifically, for any real-valued function f satisfying \(\sum _{(\theta, B)} f(\theta, B) p_{n}(\theta, B \mid {\mathcal {G}}Y) < \infty \), we have
\[\frac {1}{K}\sum _{k=1}^{K} f\left (\theta ^{(k)}, B^{(k)}\right) \longrightarrow \sum _{(\theta, B)} f(\theta, B)\, p_{n}\left (\theta, B \mid {\mathcal {G}}Y\right) \quad \text {as } K\to \infty \]
almost surely for all initial values (θ^{(0)},B^{(0)}).
Analysis of simulated and real data
We test the proposed Bayesian cluster process with Algorithm 1 on both synthetic and real data. Algorithm 1 with model I and pointwise updating is equivalent to the method of Vogt et al. (2010). If there is no prior information on the number of clusters, users can set the initial partition B to I_{n}, in which each observation is a block. In practice, we use initial cluster labels sampled randomly from a discrete uniform distribution over a range chosen by the user. The clustering result is represented by the estimated similarity matrix, the average of the sampled partition matrices after burn-in,
\[S = \frac {1}{M - n_{0}} \sum _{t=n_{0}+1}^{M} B^{(t)},\]
where M denotes the total number of iterations and n_{0} is the number of burn-in iterations. Furthermore, we also define a dissimilarity matrix D as \(\boldsymbol { 1}_{n}\boldsymbol { 1}_{n}^{\intercal } - S.\) The dissimilarity matrix D can be displayed as a heatmap, which represents a matrix in grayscale with white as 1, black as 0, and shades of gray for values between 0 and 1. In the original ordering of the individuals, the clusters cannot be recognized in the heatmap with the naked eye, and the equivalence relation needs to be decoded from the matrix B. In practice, however, users can identify clusters by including the row and column names of the similarity matrix to find which individuals are clustered together. Additionally, the heatmap function of the stats R package can permute the order of the individuals to display cluster blocks together with hierarchical dendrograms. It is challenging to monitor convergence of the Markov chain because the sampled clusters are random and may vary in each iteration. To determine convergence, we run Algorithm 1 ten times for each data set and stop the chain when we observe that the number of clusters remains the same over the given chain length (Chang and Fisher 2013).
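The similarity and dissimilarity matrices can be computed from stored label vectors as sketched below. The function name and the five toy samples are hypothetical; the code simply averages the partition matrices B over the iterations kept after burn-in.

```python
import numpy as np

def similarity_matrix(label_samples, burn_in):
    """Average the partition matrices B over the iterations kept after burn-in."""
    kept = np.asarray(label_samples)[burn_in:]
    B = (kept[:, :, None] == kept[:, None, :]).astype(float)   # shape (iterations, n, n)
    return B.mean(axis=0)

# Five toy iterations for n = 4 units; the first one is discarded as burn-in.
samples = [[0, 0, 1, 1],
           [5, 5, 5, 5],
           [0, 0, 1, 1],
           [1, 1, 0, 0],    # same partition as the previous row with labels swapped
           [0, 0, 1, 2]]
S = similarity_matrix(samples, burn_in=1)
D = 1.0 - S                 # dissimilarity matrix 1_n 1_n^T - S
print(S)
```

Because the average is taken over partition matrices rather than label vectors, swapping cluster labels between iterations has no effect on S.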
5.1 Illustrative simulated data
Four clusters on the vertices of a unit square
Three simulated data sets are generated for illustration. In the simulation study, 1000 initial burn-in iterations were discarded, and 2000 sampled partitions B based on each model were used to calculate D. We first applied the proposed cluster process with model I to the synthetic data for four clusters centered at the four vertices of a unit square. For each vertex μ_{k}, we generate 20 points from N(μ_{k},(1/4)I_{2}) for k=1,…,4 (see Fig. 1, the left panel). We call the data X_{I}, and then apply model I to cluster X_{I} with the average within- and between-cluster distances. The resulting heatmap successfully reveals the true clusters for most of the points (not shown here).
Then we transform the data by \(X_{II}=X_{I}\times \left (\begin {array}{ll} 3 & 0 \\ 0 & 1/3 \\ \end {array} \right).\) The transformed features seem to have two groups (see Fig. 1, the middle panel), clusters (1,2) and clusters (3,4). The cluster process with model I does not work well for this case, while the heatmap based on model II without knowing the transformation can reveal the true clusters for most of the points (not shown here).
Furthermore we transform the data by \(X_{III}=X_{I}\times \left (\begin {array}{ll} 4.1 & 2.1 \\ 1.9 & 1.1 \\ \end {array} \right).\) The transformed features are aligned in a straight line (see Fig. 1, the right panel). The transformed data X_{III} is more difficult to cluster than X_{I} and X_{II}, since the original four clusters are transformed to be not well separated.
The resulting heatmap using model III with the initial clusters assigned randomly and uniformly from {1,2,3,4} reveals the true four clusters for most of the points (see Fig. 2).
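The three synthetic data sets can be regenerated along the following lines. This is a sketch under stated assumptions: the vertex coordinates (0,0), (0,1), (1,0), (1,1), the random seed, and the function name are choices made here, N(μ_{k},(1/4)I_{2}) is read as covariance (1/4)I_{2}, and the transformation matrices use the entries exactly as printed in the text.

```python
import numpy as np

def simulate_square_clusters(points_per_cluster=20, var=0.25, rng=None):
    """Draw points around the four vertices of a unit square with covariance var * I_2."""
    rng = np.random.default_rng() if rng is None else rng
    centers = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # assumed vertices
    labels = np.repeat(np.arange(4), points_per_cluster)
    X = centers[labels] + np.sqrt(var) * rng.normal(size=(labels.size, 2))
    return X, labels

rng = np.random.default_rng(0)
X_I, labels = simulate_square_clusters(rng=rng)

T_II = np.array([[3.0, 0.0], [0.0, 1.0 / 3.0]])     # coordinate-wise rescaling
T_III = np.array([[4.1, 2.1], [1.9, 1.1]])          # entries as printed in the text
X_II = X_I @ T_II
X_III = X_I @ T_III

print(X_I.shape, np.linalg.det(T_III))
```

With the printed entries the determinant of the third matrix is small relative to its entries, which is consistent with the transformed points falling close to a straight line.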
5.2 Applications to real data
Besides the synthetic data, we also evaluate the performance of the proposed approach on real data. We run 3000 MCMC iterations, discard the first 1000 iterations as burn-in, and use the heatmap of the matrix S to visualize the clusters. The accuracy rate is the average proportion of elements on which the estimated partition matrix B agrees with the true matrix B. We compare the accuracy rates with those of k-means (MacQueen 1967) and Mclust, using the R package ‘mclust’ with its default settings (Fraley and Raftery 2002). We chose the R package ‘mclust’ because Mclust is a model-based clustering approach using the Gaussian mixture model, which assumes a Gaussian distribution for each component under one of three types of covariance structure (the Mclust argument modelNames): 1. Spherical (EII), 2. Diagonal (VVI), and 3. General (VVV), which we compare with our proposed models I, II, and III, respectively. The main difference is that Mclust obtains clusters with an expectation–maximization (EM) algorithm (Dempster 1972; McLachlan and Peel 2000), whereas our method uses a Metropolis-Hastings algorithm with the profile likelihood of Γ to sample clusters.
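The accuracy rate for a single estimated partition can be sketched as the proportion of matching entries between two partition matrices. The function name is hypothetical, and counting all n×n entries, including the diagonal, is an assumption; it is consistent with the value 0.5886 reported below for a single cluster on the leukemia data with 27 and 11 cases, since (27^{2}+11^{2})/38^{2}≈0.5886.

```python
import numpy as np

def partition_accuracy(labels_est, labels_true):
    """Proportion of entries on which the estimated and true partition matrices agree."""
    e = np.asarray(labels_est)
    t = np.asarray(labels_true)
    B_est = e[:, None] == e[None, :]
    B_true = t[:, None] == t[None, :]
    return float(np.mean(B_est == B_true))

truth = [0] * 27 + [1] * 11          # two true groups of sizes 27 and 11
one_cluster = [0] * 38               # a method that returns a single cluster
print(round(partition_accuracy(one_cluster, truth), 4))
print(partition_accuracy([1] * 27 + [0] * 11, truth))   # relabeling does not matter
```

Averaging this quantity over the sampled partitions of the Markov chain gives an accuracy rate for the whole run.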
Model I: Gene expression data of Leukemia patients
Gene expression microarray data (Dua and Graff 2019) have been used to study genetic disorders, for example to identify diagnostic or prognostic biomarkers or to cluster and classify diseases (Dudoit et al. 2002). For example, Golub et al. (1999) classified patients with acute leukemia into two subtypes, Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML). For illustration purposes, we use the training set of the leukemia data, which consists of 3051 genes and 38 tumor mRNA samples. Pretending that we do not know the label information, we would like to cluster the 38 samples according to their 3051 features (gene expression levels). The two clusters comprise 27 ALL cases and 11 AML cases. Since the number of features is larger than the sample size, our approach is not directly applicable to this dataset. Therefore, we first reduce the dimension by projecting the data onto the subspace spanned by the first twenty principal components (PC) (Jolliffe 1986). Note that these PC scores are orthonormal, which satisfies the assumption of model I. The resulting heatmap based on model I (Fig. 3) reveals the cluster of the 11 AML cases. The accuracy rate using the proposed model I with the initial clusters assigned randomly and uniformly from {1,2} is 0.9164, while the accuracy rates of k-means and Mclust are 0.6994 and 0.5886, respectively. We noticed that Mclust resulted in only one cluster.
Model II: Geographic coordinate data of Denmark’s 3D Road Network
This three-dimensional road network dataset of geographic coordinates includes the altitude, latitude, and longitude of each road segment in North Jutland in northern Denmark, and is publicly available at the UC Irvine Machine Learning Repository (Kaul 2013; Dua and Graff 2019). Since the three spatial features are orthogonal, the dataset satisfies the assumption of model II, so we use it to demonstrate model II. Three objects with the road-map OSM IDs 144552912 (19 observations), 125829151 (13 observations), and 145752974 (14 observations) are used for the clustering analysis. Note that each object may have several observations measured from different angles, and the altitude values are extracted from NASA’s Shuttle Radar Topography Mission (SRTM) data (Jarvis et al. 2008). The accuracy rate using model II with the initial clusters assigned randomly and uniformly from {1,2,3,4,5} is 1, while the accuracy rates of k-means with k=3 and Mclust are 0.7486 and 0.9490, respectively. The resulting heatmap using model II (Fig. 4) reveals the 3 clusters correctly.
Model III: Iris data
The iris dataset (Fisher 1936) contains three species (Setosa, Versicolor, and Virginica) with four features: sepal length, sepal width, petal length, and petal width, all measured in centimeters. Each species consists of 50 iris flowers. The data points are clustered by their four features. Here, d=4, n=150, k=3. The heatmap of the similarity matrix using model III correctly reflects three clusters corresponding to the three species of iris for most points (Fig. 5). The accuracy rate using the proposed model III with the initial clusters assigned randomly and uniformly from {1,2,3} is 0.9087, while the accuracy rates of k-means with k=3 and Mclust are 0.7740 and 0.7763, respectively. We noticed that both k-means and Mclust result in two clusters by grouping Versicolor and Virginica into one cluster.
Concluding discussion
The proposed clustering method is invariant under different groups of affine transformations and is computationally efficient. It identifies clusters for most samples without knowing the number of clusters in advance, although it may split a large cluster into several small clusters. These problems are handled with an exchangeable partition prior, which avoids label-switching problems, and the partition-valued MCMC algorithm is invariant under linear transformations under the three types of covariance structures. The advantage of replacing the Dirichlet-multinomial prior with its limiting process is that we do not need to know the number of clusters in advance. The disadvantage is that it may be less efficient computationally if the number of clusters is known. Note that the proposed approach does not target the partition maximizing the posterior distribution. Instead, it estimates the expected partition, that is, the similarity matrix.
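One way to read the "expected partition" is as the average, over sampled partitions, of the indicator that two samples share a cluster. The sketch below illustrates that reading under stated assumptions: the function name similarity_matrix and the toy label vectors are hypothetical, and the paper's exact estimator may differ. It also shows why label switching is harmless here, since the co-clustering indicator does not depend on how clusters are numbered.

```python
import numpy as np

def similarity_matrix(label_samples):
    """Average co-clustering indicator over sampled label vectors.

    label_samples: array-like of shape (n_samples, n_items).
    Entry (i, j) of the result is the fraction of samples in which
    items i and j were assigned to the same cluster.
    """
    labels = np.asarray(label_samples)
    # same[t, i, j] is True when items i and j share a cluster in sample t
    same = labels[:, :, None] == labels[:, None, :]
    return same.mean(axis=0)

# Three toy partitions of four items; the second is the first with labels switched.
samples = [[1, 1, 2, 2],
           [2, 2, 1, 1],
           [1, 1, 1, 2]]
S = similarity_matrix(samples)
```

A heatmap of S, or a hierarchical clustering applied to 1 - S, could then be used to read off the clusters.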
The three clustering models are based on the covariance matrix between variables. In practice, the experimental design or a test on the sample covariance matrix can indicate which model works best. If the features are orthonormal or orthogonal, then model I or model II is applicable, respectively. Models I and II run faster than model III because of the structure of the covariance matrix. Otherwise, model III can be used in general. It works reasonably well across various applications.
Since we use the profile likelihood of Γ in our model, we do not sample the covariance matrix directly, and Lemma 1 and Theorem 1 imply that the proposed Metropolis-Hastings algorithm works when n>d+1. However, the maximum likelihood estimator (MLE) of the general unstructured covariance matrix will be less efficient if the diagonal covariance structure is actually correct, because it will tend to have small eigenvalues and a large determinant of the inverse covariance matrix (i.e. Γ^{−1}). Indeed, when using model III, Γ may be nearly singular even if n>d+1. This may make the sampling less efficient, i.e. the acceptance rate may become small (Roberts and Rosenthal 2001). Although the stationary distribution of the Markov chain of sampled clusters produced by Algorithm 1 is independent of the initial clusters according to Theorem 2, in practice we suggest sampling the initial clusters from a discrete uniform distribution over a range given by the user, instead of setting each individual as its own cluster, in order to obtain convergent sampled clusters without running a long Markov chain. This lets Algorithm 1 sample more efficiently from a smaller collection of partition candidates.
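The suggested initialization can be sketched as below. This is an illustrative helper, not part of Algorithm 1 as published: the name initial_partition, its arguments, and the seed are assumptions. It draws each initial label uniformly from {1, ..., max_label} and builds the corresponding 0/1 partition matrix B, whose (i, j) entry is 1 when samples i and j start in the same cluster.

```python
import numpy as np

def initial_partition(n, max_label, seed=None):
    """Draw initial cluster labels uniformly from {1, ..., max_label}.

    Returns the label vector and the n x n partition matrix B.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(1, max_label + 1, size=n)   # upper bound is exclusive
    B = (labels[:, None] == labels[None, :]).astype(int)
    return labels, B

# Example sized like the road network experiment: 19 + 13 + 14 = 46 observations,
# with labels drawn from {1, 2, 3, 4, 5}.
labels, B = initial_partition(n=46, max_label=5, seed=1)
```

Starting from at most max_label blocks, rather than from n singleton clusters, restricts the first iterations to a much smaller set of candidate partitions.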
In our experiments, the proposed clustering algorithm produces the desired clusters with 2000 iterations after 1000 burn-in iterations. The main contributions of our work are as follows: 1) The three proposed clustering models, with three types of covariance structures, can handle general cases of affine transformations. In contrast, Vogt et al. (2010) only dealt with the case of model I. 2) Algorithm 1 is efficient, since it updates the clusters of all individuals, instead of a single individual’s cluster, per iteration. It also ensures that the resulting partition-valued Markov chain is ergodic and convergent in distribution. 3) The experiments show the advantages of our cluster process, which successfully identifies the true clusters using the proposed distance matrix. In particular, if the clusters are not well separated, the similarity matrix, with its probabilistic nature, can still reveal the relationships through hierarchical approaches. The proposed method could be used to extract interesting information from aerial photography, genomic data, and data with attributes on different scales, especially when the nearest neighbors may belong to different clusters in the feature space. The proposed method can be improved in future work by modeling the mean of each cluster with regression on covariates or with non-Gaussian distributions.
Appendix A
Proof of Theorem 1: For any \(Y \in \mathbb {R}^{n\times d}\), denote \(\tilde {Y} = Y - \boldsymbol {1}_{n} \boldsymbol {1}_{n}^{\intercal } Y/n\). Let \(\tilde {Y}_{(n-1)}\) be the (n−1)×d matrix consisting of the first n−1 rows of \(\tilde {Y}\). Since \(\boldsymbol {1}_{n}^{\intercal } \tilde {Y} = 0\), then \(\text {rank}(\tilde {Y}) = \text {rank}(\tilde {Y}_{(n-1)})\).
If n≤d+1 and \(\text {rank}(\tilde {Y}_{(n-1)}) = n-1\), that is, \(\tilde {Y}_{(n-1)}\) is of full row rank, then there exists an orthogonal matrix \(O\in \mathbb {R}^{d\times d}\) (column permutations), such that \(\tilde {Y}_{(n-1)} O = (U,V)\), where \(U\in \mathbb {R}^{(n-1)\times (n-1)}\) is of full rank, and \(V\in \mathbb {R}^{(n-1)\times (d+1-n)}\). We let
It can be verified that Y=(a,A)⋆Z. That is, Y∈Orb(Z), where Z is a constant matrix.
If n≤d and rank(Y)=n, then \(\text {rank}(\tilde {Y}_{(n-1)}) = n-1\), since \(\tilde {Y}_{(n-1)} = W Y\), where \(W = (I_{n-1} - \boldsymbol {1}_{n-1} \boldsymbol {1}_{n-1}^{\intercal }/n, -\boldsymbol {1}_{n-1}/n)\) is of full row rank n−1.
Suppose n=d+1 and rank(Y)=d=n−1. Without loss of generality, we assume rank(Y_{(n−1)})=n−1, where Y_{(n−1)} consists of the first n−1 rows of Y. Then Y_{n}=c_{1}Y_{1}+⋯+c_{n−1}Y_{n−1} for some \(c_{1}, \ldots, c_{n-1} \in \mathbb {R}\) and \(\tilde {Y}_{(n-1)} = D Y_{(n-1)}\), where \(D = I_{n-1} - \boldsymbol {1}_{n-1} \boldsymbol {1}_{n-1}^{\intercal }/n - \boldsymbol {1}_{n-1} \boldsymbol {c}^{\intercal }/n\), and \(\boldsymbol {c}^{\intercal } = (c_{1}, \ldots, c_{n-1})\). Since D is a rank-one update of the identity, \(\det (D) = (1 - c_{1} - \cdots - c_{n-1})/n\). Hence if c_{1}+⋯+c_{n−1}≠1, then rank(D)=n−1 and \(\text {rank}(\tilde {Y}_{(n-1)}) = n-1\); if c_{1}+⋯+c_{n−1}=1, then rank(D)=n−2 and \(\text {rank}(\tilde {Y}_{(n-1)}) = n-2\). Note that if n=d+1 but \(\text {rank}(\tilde {Y}_{(n-1)}) = n-2\), then Y∉Orb(Z). □
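The rank dichotomy in the last case of the proof can be checked numerically. The following sketch assumes a randomly generated full-rank block for the first n−1 rows and two hand-picked coefficient vectors c; the helper name centered_rank is hypothetical. It centers Y, drops the last row, and compares the resulting rank with n−1 (when the coefficients do not sum to 1) and n−2 (when they do).

```python
import numpy as np

def centered_rank(Y):
    """Rank of the first n-1 rows of the column-centered matrix Y - 1 1^T Y / n."""
    n = Y.shape[0]
    Y_tilde = Y - np.ones((n, n)) @ Y / n
    return np.linalg.matrix_rank(Y_tilde[:-1])

d = 4
n = d + 1
rng = np.random.default_rng(0)
Y_top = rng.normal(size=(n - 1, d))            # first n-1 rows, full rank almost surely

c_generic = np.array([0.5, -1.0, 2.0, 0.25])   # coefficients sum to 1.75, not 1
c_affine = np.full(n - 1, 1.0 / (n - 1))       # coefficients sum to exactly 1

# The last row is the linear combination c_1 Y_1 + ... + c_{n-1} Y_{n-1}.
Y_generic = np.vstack([Y_top, c_generic @ Y_top])
Y_affine = np.vstack([Y_top, c_affine @ Y_top])

rank_generic = centered_rank(Y_generic)        # expected n - 1
rank_affine = centered_rank(Y_affine)          # expected n - 2
```

In the second case the last row is an affine combination of the others, so all n points lie in a (d−1)-dimensional affine subspace, which is the degenerate situation excluded from Orb(Z).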
Proof of Lemma 1: Suppose the partition matrix B consists of k blocks with block sizes n_{1},…,n_{k}, where k≥1,n_{i}>0 for i=1,…,k, and n_{1}+⋯+n_{k}=n.
We first assume that \(B = \text {diag}\left \{\boldsymbol {1}_{n_{1}} \boldsymbol {1}_{n_{1}}^{\intercal }, \ldots, \boldsymbol {1}_{n_{k}} \boldsymbol {1}_{n_{k}}^{\intercal }\right \}\), which is in its standard form. Then \(B = LL^{\intercal }\) with \(L = \text {diag}\{\boldsymbol {1}_{n_{1}}, \ldots, \boldsymbol {1}_{n_{k}}\} \in \mathbb {R}^{n\times k}\) and \(I_{n} + \theta B = I_{n} + E E^{\intercal }\) with \(E = \sqrt {\theta } L\).
According to the Sherman-Morrison-Woodbury formula (see, for example, Section 2.1.4 in Golub and Van Loan (2013)), for matrices \(A \in \mathbb {R}^{n\times n}\) and \(U, V\in \mathbb {R}^{n\times k}, (A + UV^{\intercal })^{-1} = A^{-1} - A^{-1} U(I + V^{\intercal } A^{-1} U)^{-1} V^{\intercal } A^{-1}\) if both A and \(I+V^{\intercal } A^{-1} U\) are nonsingular. In our case, A=I_{n} is nonsingular, U=V=E, and \(I + V^{\intercal } A^{-1} U = I_{k} + E^{\intercal } E = \text {diag}\{1+\theta n_{1}, \ldots, 1+\theta n_{k}\}\) is also nonsingular. Thus
In general, by row-switching and column-switching transformations, we can always transform B into its standard form. That is, there exists an orthogonal matrix O such that \(B_{r} = OBO^{\intercal }\) is in standard form. Let \(W_{r} = OWO^{\intercal }\). Then \((I_{n} + \theta B)^{-1} = O^{\intercal } (I_{n} + \theta B_{r})^{-1} O = O^{\intercal } (I_{n} - \theta W_{r} B_{r})O = I_{n} - \theta \cdot O^{\intercal } W_{r} O \cdot O^{\intercal } B_{r} O = I_{n} - \theta WB\). □
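The identity established in this proof can be verified numerically on a small example. The sketch below makes one assumption that is not restated in this appendix: W is taken to be the diagonal matrix whose i-th entry is 1/(1+θ n_{b(i)}), where n_{b(i)} is the size of the block containing sample i, which is the form implied by the Woodbury step with \(I_{k} + E^{\intercal } E = \text {diag}\{1+\theta n_{1}, \ldots, 1+\theta n_{k}\}\). The label vector and the value of θ are arbitrary illustrative choices.

```python
import numpy as np

def partition_matrix(labels):
    """0/1 matrix with entry (i, j) equal to 1 when labels[i] == labels[j]."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

labels = np.array([2, 1, 2, 3, 1, 2])    # deliberately NOT in standard (block-sorted) form
theta = 0.7                               # arbitrary positive value
n = len(labels)

B = partition_matrix(labels)
block_sizes = B.sum(axis=1)               # size of the block containing each sample
W = np.diag(1.0 / (1.0 + theta * block_sizes))   # ASSUMED form of W (see lead-in)

lhs = np.linalg.inv(np.eye(n) + theta * B)       # direct O(n^3) inverse
rhs = np.eye(n) - theta * W @ B                  # closed form from Lemma 1
max_abs_error = np.abs(lhs - rhs).max()
```

The practical benefit of the closed form is that the inverse never has to be computed explicitly when the likelihood is evaluated for a candidate partition.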
Proof of Lemma 3: In our case, the Markov chain built by Algorithm 1 is actually a discrete chain. It is irreducible since \(p_{n}(\theta ^{(k+1)}, B^{(k+1)} \mid \theta ^{(k)}, B^{(k)}) > 0\) for each pair of states. As a direct conclusion of Theorem 4.1 in Gilks et al. (1996), our Markov chain is positive recurrent. □
Proof of Theorem 2: Algorithm 1 is a Gibbs sampler plus a Metropolis-Hastings component for sampling B^{(i+1)}. Given B^{(i)} and θ^{(i+1)}, the Metropolis-Hastings ratio with proposal distribution \(q(B \mid B^{(i)}, {\mathcal {G}}Y)\) and target distribution \(p_{n}(\theta, B \mid {\mathcal {G}} Y)\) is
which is exactly R in Algorithm 1. Since Metropolis-Hastings algorithms satisfy the detailed balance condition, the target distribution \(p_{n}(\theta, B \mid {\mathcal {G}}Y)\) is a stationary distribution. By Lemmas 2 and 3, the convergence statements follow as a direct conclusion of Theorems 4.3 and 4.4 in Gilks et al. (1996). □
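The detailed balance argument used in this proof can be illustrated on a toy finite state space. The sketch below is a generic Metropolis-Hastings kernel, not Algorithm 1 itself: the three-state target, the proposal matrix, and the function name mh_transition_matrix are all illustrative assumptions. It builds the full transition matrix, then checks that the probability flow π_i P_{ij} is symmetric and that the target is stationary.

```python
import numpy as np

def mh_transition_matrix(target, proposal):
    """Metropolis-Hastings transition matrix on a finite state space.

    target:   1-D array of target probabilities (all positive).
    proposal: square row-stochastic matrix, proposal[i, j] = q(j | i).
    """
    m = len(target)
    P = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j and proposal[i, j] > 0:
                # Metropolis-Hastings ratio for the move i -> j
                ratio = (target[j] * proposal[j, i]) / (target[i] * proposal[i, j])
                P[i, j] = proposal[i, j] * min(1.0, ratio)
        P[i, i] = 1.0 - P[i].sum()        # rejected proposals keep the chain at i
    return P

target = np.array([0.5, 0.3, 0.2])
proposal = np.array([[0.2, 0.5, 0.3],
                     [0.4, 0.2, 0.4],
                     [0.6, 0.3, 0.1]])
P = mh_transition_matrix(target, proposal)
flow = target[:, None] * P                # flow[i, j] = pi_i * P[i, j]
```

Because every entry of P is positive in this example, the toy chain is also irreducible and aperiodic, which mirrors the conditions invoked through Lemmas 2 and 3.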
Availability of data and materials
The datasets are from simulation and the UCI Machine Learning Repository and are available as per JSDA policy.
Abbreviations
ALL: Acute lymphoblastic leukemia
AML: Acute myeloid leukemia
MCMC: Markov chain Monte Carlo
PC: Principal components
SRTM: NASA’s Shuttle Radar Topography Mission
References
Banfield, J. D., Raftery, A. E.: Model-based Gaussian and non-Gaussian Clustering. Biometrics. 49, 803–821 (1993).
Begelfor, E., Werman, M.: Affine Invariance Revisited. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2087–2094 (2006).
Bell, E. T.: Exponential polynomials. Ann. Math. 35, 258–277 (1934).
Blei, D., Jordan, M.: Variational inference for Dirichlet process mixtures. Bayesian Anal. 1, 121–144 (2006).
Brubaker, S. C., Vempala, S.: Isotropic PCA and affine-invariant clustering. In: Forty Ninth Annual IEEE Symposium on Foundations of Computer Science (2008).
Chaloner, K.: A Bayesian approach to the estimation of variance components in the unbalanced one-way random-effects model. Technometrics. 29, 323–337 (1987).
Chang, J., Fisher, J. W.: Parallel sampling of DP mixture models using sub-clusters splits. In: NIPS’13 Proceedings of the 26th International Conference on Neural Information Processing Systems, pp. 620–628 (2013).
Chib, S., Greenberg, E.: Understanding the Metropolis-Hastings Algorithm. Am. Stat. 49(4), 327–335 (1995).
Crane, H.: The ubiquitous Ewens sampling formula. Stat. Sci. 31, 1–19 (2016).
Dahl, D. B.: Sequentially-allocated merge-split sampler for conjugate and nonconjugate Dirichlet process mixture models (2005). Technical Report, Department of Statistics, Texas A&M University.
Dempster, A. P.: Covariance selection. Biometrics. 28(1), 157–175 (1972).
Dua, D., Graff, C.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine (2019). http://archive.ics.uci.edu/ml.
Dudoit, S., Fridlyand, J., Speed, T. P.: Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97(457), 77–87 (2002).
Ewens, W. J.: The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3, 87–112 (1972).
Fisher, R. A.: The use of multiple measurements in taxonomic problems. Ann. Eugenics. 7, 179–188 (1936).
Fitzgibbon, A., Zisserman, A.: On Affine Invariant Clustering and Automatic Cast Listing in Movies. In: European Conference on Computer Vision 2002, pp. 304–320 (2002).
Fraley, C., Raftery, A. E.: How many clusters? Which clustering methods? Answers via model-based cluster analysis. Comput. J. 41, 578–588 (1998).
Fraley, C., Raftery, A. E.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002).
Fraley, C., Raftery, A. E.: Bayesian regularization for normal mixture estimation and model-based clustering. J. Classif. 24, 155–181 (2007).
Gamerman, D.: Efficient Sampling from the Posterior Distribution in Generalized Linear Models. Stat. Comput. 7, 57–68 (1997).
García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: A review of robust clustering methods. ADAC. 4, 89–109 (2010).
Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6(6), 721–741 (1984).
Gilks, W. R., Richardson, S., Spiegelhalter, D. J.: Markov Chain Monte Carlo in Practice. Chapman & Hall, New York (1996).
Gnanadesikan, R., Kettenring, J. R.: Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data. Biometrics. 28(1), 81–124 (1972).
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., Lander, E. S.: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science. 286, 531–537 (1999).
Golub, G. H., Van Loan, C. F.: Matrix Computations. 4th edition. Johns Hopkins University Press, Baltimore (2013).
Hastings, W. K.: Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika. 57(1), 97–109 (1970).
Isaacson, D. L., Madsen, R. W.: Markov Chains. Wiley, New York (1976).
Jain, A. K., Dubes, R. C.: Algorithms for clustering data. Prentice Hall, Upper Saddle River (1988).
Jarvis, A., Reuter, H. I., Nelson, A., Guevara, E.: Hole-filled seamless SRTM data V4, International Centre for Tropical Agriculture (CIAT) (2008). http://srtm.csi.cgiar.org.
Jolliffe, I. T.: Principal Component Analysis (1986).
Kaul, M.: Building Accurate 3D Spatial Networks to Enable Next Generation Intelligent Transportation Systems. In: Proceedings of International Conference on Mobile Data Management (IEEE MDM), Vol 1., pp. 137–146. Milan, Italy (2013).
Kumar, M., Orlin, J. B.: Scale-invariant clustering with minimum volume ellipsoids. Comput. Oper. Res. 35, 1017–1029 (2008).
Lee, H., Yoo, J. H., Park, D.: Data clustering method using a modified Gaussian kernel metric and kernel PCA. ETRI J. 36(3), 333–342 (2014).
MacEachern, S. N.: Estimating normal means with a conjugate-style Dirichlet process prior. Commun. Stat. Simul. Comput. 23, 727–741 (1994).
MacQueen, J. B.: Some Methods for Classification and Analysis of Multivariate Observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press, Berkeley (1967).
Mahalanobis, P. C.: On the Generalized Distance in Statistics. In: Proceedings of the National Institute of Sciences of India (1936).
McCullagh, P.: Marginal likelihood for parallel series. Bernoulli. 14(3), 593–603 (2008).
McCullagh, P., Yang, J.: Stochastic classification models. In: Proceedings of the International Congress of Mathematicians, vol. III, pp. 669–686, Madrid (2006).
McCullagh, P., Yang, J.: How many clusters? Bayesian Anal. 3, 101–120 (2008).
McLachlan, G., Peel, D.: Finite Mixture Models. Wiley, New York (2000).
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953).
Neal, R. M.: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 9, 249–265 (2000).
Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm. Adv. Neural Inform. Process. Syst. 14, 849–856 (2001).
Ozawa, K.: A stratificational overlapping cluster scheme. Pattern Recognit. 18, 279–286 (1985).
Pitman, J.: Combinatorial Stochastic Processes. (621). In: Ecole d’Ete de Probabilites de Saint-Flour XXXII-2002. Dept. Statistics, U.C. Berkeley (2002). Lecture notes for St. Flour course.
Robert, C. P., Casella, G.: Introducing Monte Carlo Methods with R. Springer (2010).
Roberts, G., Rosenthal, J.: Optimal Scaling for Various Metropolis-Hastings Algorithms. Stat. Sci. 16(4), 351–367 (2001).
Shioda, R., Tunçel, L.: Clustering via minimum volume ellipsoids. Comput. Optim. Appl. 37, 247–295 (2007).
Stein, C.: Estimation of a covariance matrix. Reitz Lecture, IMS-ASA Annual Meeting in 1975 (1975).
Vogt, J. E., Prabhakaran, S., Fuchs, T. J., Roth, V.: The translation-invariant Wishart-Dirichlet process for clustering distance data. Proceedings of the 27th International Conference on Machine Learning, 1111–1118 (2010).
Ward, J. H.: Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 58, 236–244 (1963).
Acknowledgements
The authors thank Peter McCullagh for his insightful comments and suggestions on an early version of this paper. The authors are grateful to the Editor-in-Chief, the Associate Editor, and the anonymous reviewers for their constructive comments and suggestions, which led to a remarkable improvement of the paper.
Funding
National Science Foundation grants (DMS-1924792, DMS-1924859), the LAS Award for Faculty of Science at the University of Illinois at Chicago, and the In-House Award at the University of Central Florida.
Author information
Authors’ contributions
Hsin-Hsiung Huang wrote the draft of the manuscript, developed the algorithms, and conducted the experiments. Jie Yang proposed the methods and the initial algorithms. Both authors read and approved the final manuscript.
Authors’ information
Hsin-Hsiung Huang, Ph.D., is an Associate Professor in the Department of Statistics and Data Science at the University of Central Florida. Jie Yang, Ph.D., is an Associate Professor in the Department of Mathematics, Statistics, and Computer Science at the University of Illinois at Chicago.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Huang, H.-H., Yang, J. Affine-transformation invariant clustering models. J Stat Distrib App 7, 10 (2020). https://doi.org/10.1186/s40488-020-00111-y
DOI: https://doi.org/10.1186/s40488-020-00111-y
Keywords
 Dirichlet process
 Ewens process
 Metropolis-Hastings algorithm
 Markov chain Monte Carlo sampling
 Unsupervised learning