Affine-transformation invariant clustering models

We develop a cluster process that is invariant with respect to unknown affine transformations of the feature space and does not require the number of clusters to be known in advance. Specifically, the proposed method identifies clusters invariant under (I) orthogonal transformations, (II) coordinate-wise scaling combined with orthogonal transformations, and (III) arbitrary nonsingular linear transformations, corresponding to models I, II, and III respectively, and represents the clusters with the proposed heatmap of the similarity matrix. The proposed Metropolis-Hastings algorithm yields an irreducible and aperiodic Markov chain that identifies clusters reasonably well across various applications. Both synthetic and real data examples show that the proposed method can be widely applied in many fields, especially for finding the number of clusters and identifying clusters of samples of interest in aerial photography and genomic data.

We consider three groups of affine transformations g : R^d → R^d acting on the feature space, yielding three cases of an invariant cluster process. The group invariance implies that two feature configurations Y and Y′ in R^{n×d} determine the same clustering, or probability distribution on clusterings, if they belong to the same group orbit, which is an equivalence class. For example, if the feature space is Euclidean and G is the group of Euclidean isometries or congruences, the clustering is a function only of the maximal invariant, which is the array of Euclidean distances D_ij = ||Y_i − Y_j||. Image data such as aerial photography and three-dimensional protein structures are two motivating examples: the shape and relative locations of the data may vary with the viewer's angle and location.
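As a quick numerical illustration (our own sketch, not from the paper), the array of Euclidean distances is indeed unchanged when every point is mapped by the same rotation plus translation, so any clustering that depends on Y only through this array is invariant under Euclidean isometries:

```python
import numpy as np

def pairwise_distances(Y):
    """Squared Euclidean distance matrix D_ij = ||Y_i - Y_j||^2."""
    sq = np.sum(Y**2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 3))

# An arbitrary Euclidean isometry: orthogonal A (rotation/reflection) plus translation a.
A, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # QR gives an orthogonal matrix
a = rng.normal(size=3)
Y_prime = Y @ A.T + a

# The distance array is the maximal invariant: it is unchanged by the map.
assert np.allclose(pairwise_distances(Y), pairwise_distances(Y_prime))
```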
Our goal is to develop a novel clustering method that can identify clusters of Y = (Y_1, ..., Y_n) even when all Y_i's are mapped by an unknown affine transformation Y′_i = a + AY_i, where a = (a_1, ..., a_d) ∈ R^d and A ∈ R^{d×d} is nonsingular. Affine-invariant clustering is important when the clusters are not well separated in the observational space. Although there is previous work on affine-invariant clustering methods (Fitzgibbon and Zisserman 2002; Begelfor and Werman 2006; Shioda and Tunçel 2007; Brubaker and Vempala 2008; Kumar and Orlin 2008; Garcìa-Escudero et al. 2010; Lee et al. 2014), these existing methods handle different problems from ours: they aim to cluster the same item observed from different angles or mapped by different unknown affine transformations. In our problem setting, by contrast, a single unknown affine transformation is applied to all objects.
The affine transformations consist of three types: (1) index permutations, rotations, a common scaling of all variables, and location translations, under the first type of covariance structure, σ²I_d; this case is named model I, and its transformations and covariance structure were also adopted by Vogt et al. (2010); (2) each variable may have its own scaling transformation, under the second (diagonal) type of covariance structure, named model II; (3) the variables are transformed by a nonsingular matrix, named model III, in which the observed variables may be linear combinations of latent variables from model I. These models cover fairly general clustering situations.
McCullagh and Yang (2008) constructed a Dirichlet cluster process together with a random partition representing the clustering. In this paper, we follow their setup and extend their framework: we assume that the random partition of objects follows the Ewens distribution (Ewens 1972), and we propose a likelihood of the responses that is invariant with respect to affine transformations.

Cluster process and prior distributions
In this paper, an R^d-valued cluster process (Y, B) means a random partition B of the natural numbers together with an infinite sequence Y_1, Y_2, ... of random vectors in the state space R^d. The restriction of such a process to a finite sample [n] = {1, ..., n} is denoted (Y[n], B[n]). Notice that the relation encoded by B is transitive: if individuals i, j, k belong to the same cluster, then B_{i,j} = 1 and B_{j,k} = 1 imply B_{i,k} = 1. The term cluster process implies infinite exchangeability, which means that the joint distribution p_n of (Y[n], B[n]) is symmetric (McCullagh and Yang 2006), or invariant under permutations of indices (Pitman 2002), and that p_n is the marginal distribution of p_{n+1} under deletion of the (n + 1)th unit from the sample.
Similar to McCullagh and Yang (2008), we construct an exchangeable Gaussian mixture as a simple example of a cluster process. First, B ~ p is some infinitely exchangeable random partition. Secondly, the conditional distribution of the sample Y, regarded as a matrix (Y_{i,r}) of order n × d, given B (say with cluster label cl(Y_i) = l) and θ, is Gaussian with mean μ_l and within-cluster covariance Σ. Here θ > 0 is a ratio parameter connecting the within- and between-cluster covariance matrices, and Σ = (Σ_{r,s}) is a positive definite matrix of order d × d, known as the within-cluster covariance matrix. In our setting, the between-cluster covariance matrix is simply θΣ, and the cluster centroids μ_1, ..., μ_k are iid from N(μ, θΣ). Given B and μ_1, ..., μ_k, the mean of Y_i is μ_{cl(Y_i)}, and the covariance of Y given B can be represented through its vector form, Cov(Vec(Y)) = (I_n + θB) ⊗ Σ, an nd × nd matrix with "⊗" denoting the Kronecker product. The column covariance Σ of Y is assumed identical for all clusters, I_n + θB is assumed as an exchangeable structure for the row covariance of Y, and the covariance between two distinct rows in the same cluster is θΣ. There exist competing algorithms that are affine-equivariant and do not impose this requirement (Shioda and Tunçel 2007; Kumar and Orlin 2008; Garcìa-Escudero et al. 2010; Lee et al. 2014). The identity matrix itself is also a partition, in which each cluster consists of one element. Given the number of clusters k, the cluster sizes (n_1, ..., n_k) may follow a multinomial distribution with category probabilities π = (π_1, ..., π_k), where π follows an exchangeable Dirichlet distribution Dir(λ/k, ..., λ/k). After integrating out π, the partition B follows a Dirichlet-multinomial prior
p(B | λ, k) = [k!/(k − #B)!] [Γ(λ)/Γ(n + λ)] ∏_{b∈B} Γ(n_b + λ/k)/Γ(λ/k),
where #B ≤ k denotes the number of clusters present in the partition B and n_b is the size of cluster b (MacEachern 1994; Dahl 2005; McCullagh and Yang 2008).
The limit as k → ∞ is well defined and known as the Ewens sampling formula (ESF) with parameter λ > 0, also known as the Chinese restaurant process (CRP) (Ewens 1972; Neal 2000; Blei and Jordan 2006; Crane 2016). McCullagh and Yang (2008) provided a framework with a finite number of clusters and general covariance structures. In this paper, we adopt the CRP prior for the partition B, which implies k = ∞ in the population, together with the proposed Gaussian likelihood to obtain affine-invariant clusters. Note that #B ≤ n for any given sample size n.
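The CRP prior can be simulated sequentially using the standard seating probabilities, n_b/(i + λ) for joining an existing cluster b and λ/(i + λ) for opening a new one. The following sketch (our own illustration, not code from the paper) draws one partition:

```python
import numpy as np

def sample_crp_partition(n, lam, rng):
    """Draw a partition of [n] from the Chinese restaurant process
    with concentration parameter lam (the ESF parameter lambda)."""
    labels = np.zeros(n, dtype=int)
    counts = [1]                      # first customer opens the first table
    for i in range(1, n):
        probs = np.array(counts + [lam], dtype=float)
        probs /= i + lam              # P(existing b) = n_b/(i+lam), P(new) = lam/(i+lam)
        b = rng.choice(len(probs), p=probs)
        if b == len(counts):
            counts.append(1)          # open a new cluster
        else:
            counts[b] += 1
        labels[i] = b
    return labels

rng = np.random.default_rng(1)
labels = sample_crp_partition(50, lam=1.0, rng=rng)
print(labels.max() + 1)               # number of clusters, #B <= n
```

By exchangeability, the distribution of the induced partition does not depend on the order in which the points are seated.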
We choose a proper prior distribution for the variance ratio θ: the symmetric F-family with shape parameter α > 0, which allows a range of reasonable choices (Chaloner 1987).
We propose a sampling procedure to estimate the partition B and the parameter θ from conditional probabilities. Since the conditional distribution of θ does not have a recognized form, we propose to use a discrete version {p(θ_j)}_{j=1}^J, where J is a predetermined, moderately large integer. Based on our experience, J = 100 works reasonably well for the real-data examples we have examined.
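A discretized Gibbs update for θ can be sketched as follows. The toy log-posterior below is a hypothetical stand-in for the actual combination of the Gaussian likelihood and the F-family prior, which this excerpt does not spell out; only the grid mechanics are the point:

```python
import numpy as np

def sample_theta_discrete(log_post, grid, rng):
    """Gibbs step for theta on a fixed grid {theta_1, ..., theta_J}:
    evaluate the unnormalized log posterior on the grid, normalize,
    and draw one grid point."""
    logp = np.array([log_post(t) for t in grid])
    logp -= logp.max()                 # stabilize before exponentiating
    p = np.exp(logp)
    p /= p.sum()
    return rng.choice(grid, p=p)

# Hypothetical unnormalized log posterior standing in for p(theta | Y, B).
log_post = lambda t: -0.5 * (np.log(t) - 1.0) ** 2
grid = np.geomspace(2.0**-3, 2.0**10, 100)     # J = 100 points, as in the text
rng = np.random.default_rng(2)
theta = sample_theta_discrete(log_post, grid, rng)
```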

Affine-transformation invariant clustering
The affine-transformation invariant clustering identified in this manuscript is invariant even when the objects are mapped by an unknown affine transformation. The conditional distribution on partitions of [n] = {1, ..., n} is determined by the finite sequence Y = (Y_1, ..., Y_n), regarded as a configuration of n labeled points in R^d. The exchangeability condition implies that any permutation π of the sequence induces a corresponding permutation in B, i.e. p_n(B^π | Y = y^π) = p_n(B | Y = y), where y^π_i = y_{π(i)} and B^π_{i,j} = B_{π(i),π(j)}. In many cases, it is reasonable to assume additional symmetries involving transformations in R^d, for example p_n(B | Y) = p_n(B | −Y). We are asking, in effect, whether two labeled configurations Y and Y′ that are geometrically equivalent in R^d should determine the same conditional distribution on sample partitions.
If the state space R^d is regarded as a d-dimensional Euclidean space with the standard Euclidean inner product and metric, the configurations Y and Y′ are congruent if there exist a vector a = (a_1, ..., a_d) ∈ R^d and an orthogonal matrix A ∈ R^{d×d} such that Y′_i = a + AY_i for each i; equivalently, their n × n arrays of squared Euclidean distances coincide. The geometric equivalence is defined by regarding the observation Y as a group orbit rather than a point. In general, the group is the affine group GA(R^d), G = R^d × L, where L is the collection of all d × d nonsingular matrices, with the operation (a_1, A_1) ∘ (a_2, A_2) = (a_1 + A_1 a_2, A_1 A_2) for a_i ∈ R^d and A_i ∈ L, i = 1, 2, which is consistent with the composition of affine transformations. The orbit of an element Y = (Y_1, ..., Y_n) ∈ R^{n×d} is defined as
Orb(Y) = {(a, A)Y : (a, A) ∈ G},   (1)
where G acts on R^{n×d} by
(a, A)Y = (a + AY_1, ..., a + AY_n) = 1_n a^T + Y A^T,   (2)
with 1_n a length-n vector of 1's. It can be verified that the action is affine on the vector form Vec(Y); in particular, Y T^T is an element of the same orbit as Y for any d × d nonsingular matrix T satisfying Σ = T T^T.
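The group operation can be checked directly: composing two affine maps point-by-point agrees with applying the composed group element (a_1 + A_1 a_2, A_1 A_2) once. A small numerical sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
a1, a2 = rng.normal(size=d), rng.normal(size=d)
A1, A2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # generic, hence nonsingular

# Composition in the group: (a1, A1) o (a2, A2) = (a1 + A1 a2, A1 A2).
a12, A12 = a1 + A1 @ a2, A1 @ A2

# Applying the two affine maps in sequence to a point y ...
y = rng.normal(size=d)
composed = a1 + A1 @ (a2 + A2 @ y)

# ... agrees with applying the composed group element once.
assert np.allclose(composed, a12 + A12 @ y)
```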
Theorem 1 If n ≤ d, then all Y ∈ R^{n×d} of full rank n belong to the same orbit. If n = d + 1, then all Y of rank d with rank(Y − 1_n 1_n^T Y/n) = d belong to the same orbit.
The proof of Theorem 1 is relegated to Appendix A. According to the proof, if n = d + 1, then rank(Y) = d implies that rank(Y − 1_n 1_n^T Y/n) is either d or d − 1, and the case of d − 1 occupies only a lower-dimensional subspace.
According to Theorem 1, for n ≤ d + 1, the action is essentially transitive in the sense that all configurations of n distinct points in R d belong to the same orbit: all other orbits are negligible in that they have Lebesgue measure zero. As a result, the observation Y regarded as a group orbit GY is uninformative for clustering unless n > d + 1. We name the orbit and group action defined above as model III.
In model I, which is the case considered in Vogt et al. (2010), the covariance between features is proportional to the identity matrix. The group consists of translations composed with scalar multiples of orthogonal matrices, G = {(a, σA) : a ∈ R^d, σ ≠ 0, A^T A = I_d}; the orbit of an element Y ∈ R^{n×d} and the group action are defined as in (1) and (2). In essence, the observation is not regarded as a point in R^{n×d} but is treated as a group orbit generated by the group of rigid transformations, or similarity transformations if scalar multiples are permitted. In statistical terms, this approach meshes with the submodel in which the matrix Σ in model I is a scaled identity matrix σ²I_d. An equivalent way of saying the same thing for n > d is that the column-centered sample matrix Ỹ = Y − 1_n 1_n^T Y/n determines the sample covariance matrix S = Ỹ^T Ỹ/(n − 1) and hence the Mahalanobis metric (Mahalanobis 1936; Gnanadesikan and Kettenring 1972). One implication is that the n × n matrix of pairwise Mahalanobis distances is a maximal invariant, and the conditional distribution on sample partitions depends on Y only through this matrix. In practice, the d variables are sometimes measured on scales that are not commensurate with one another, so the state space seldom has a natural metric. In this case, we regard Y and Y′ as equivalent configurations if, for each feature Y_{·,j}, there are a_j ∈ R and b_j ∈ R \ {0} such that Y′_{·,j} = a_j + b_j Y_{·,j}. In model II, the group is the affine group GA(R)^d acting independently on the d columns, and the orbit of an element Y ∈ R^{n×d} and the group action are defined as in (1) and (2), restricted to diagonal matrices A, which correspond to elements of the group orbit. No linear combinations are permitted here, so the integrity of the variables is preserved.
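The role of the Mahalanobis metric as an invariant can be illustrated numerically. The sketch below (ours, not from the paper) checks that pairwise Mahalanobis distances based on the sample covariance are unchanged under an arbitrary nonsingular affine map, which is the invariance underlying model III:

```python
import numpy as np

def mahalanobis_matrix(Y):
    """Pairwise squared Mahalanobis distances based on the sample
    covariance of the column-centered configuration."""
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / (len(Y) - 1)          # sample covariance, needs n > d
    Sinv = np.linalg.inv(S)
    diffs = Y[:, None, :] - Y[None, :, :]
    return np.einsum('ijk,kl,ijl->ij', diffs, Sinv, diffs)

rng = np.random.default_rng(4)
Y = rng.normal(size=(20, 3))               # n > d + 1
A = rng.normal(size=(3, 3))                # generic nonsingular matrix
a = rng.normal(size=3)
Y_prime = Y @ A.T + a

# The map rescales the covariance exactly as it rescales the differences,
# so the Mahalanobis distance array is unchanged.
assert np.allclose(mahalanobis_matrix(Y), mahalanobis_matrix(Y_prime))
```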
Moreover, in some cases the locations or shapes of objects in aerial photography may be distorted by the viewer's angle or position, so that the variables may be strongly correlated. A more extreme approach avoids the metric assumption by regarding Y and Y′ as equivalent configurations if there exist a vector a ∈ R^d and a nonsingular matrix A ∈ R^{d×d} such that Y′_i = a + AY_i for all i, where A^T A is positive definite. Consequently, models I, II, and III specify the structure of the covariance matrix between features, and the partition B of Y is affine invariant: it coincides with the partition of the group orbit GY ⊂ R^{n×d}, which is independent of the mean.

Gaussian marginal probabilities
The distribution of the column-centered group orbit GY is assumed Gaussian, N(0, (I_n + θB) ⊗ AΣA^T), which depends only on I_n + θB and AΣA^T. Indeed, it can be verified that for any (a, A) ∈ G, the two distributions of group orbits induced by N(1_n ⊗ μ, (I_n + θB) ⊗ Σ) and N(1_n ⊗ (a + Aμ), (I_n + θB) ⊗ (AΣA^T)) are the same.
McCullagh (2008) studied d time series with autocorrelation and n observations in time or space, following the Gaussian model N(0, Σ ⊗ Λ) under three assumptions on Λ: Model I: Λ = σ²I_d; Model II: Λ diagonal with positive entries; Model III: Λ ∈ PD_d, where PD_d is the collection of d × d symmetric positive definite matrices. These three models correspond to our three models of affine-transformed equivalence classes discussed in the previous section. In this paper, we set I_n + θB as the first covariance factor and AΣA^T as the second.
Following McCullagh (2008), the log-likelihood based on Y for all three models is, up to an additive constant,
ℓ(θ, B, Λ) = −(d/2) log det(I_n + θB) − (n/2) log det Λ − (1/2) tr(Λ^{−1} Ỹ^T (I_n + θB)^{−1} Ỹ).
According to Lemma 1 and its proof, which is relegated to Appendix A, I_n + θB is always nonsingular for θ > 0 and its inverse (I_n + θB)^{−1} = I_n − θWB can be obtained explicitly. To ensure that Ỹ^T (I_n + θB)^{−1} Ỹ is positive definite with probability 1 (McCullagh 2008), as well as informative group orbits (see Theorem 1 and its subsequent discussion), we assume n > d + 1.
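The explicit inverse in Lemma 1 can be checked numerically. In the sketch below (ours), we take W = diag(1/(1 + θ n_{b(i)})), where n_{b(i)} is the size of point i's cluster; the exact form of W is not reproduced in this excerpt, so this choice is an assumption, consistent with the Woodbury identity and verified against a direct matrix inverse:

```python
import numpy as np

# Partition of n = 5 points into clusters {0,1} and {2,3,4}; B_ij = 1 iff
# i and j share a cluster, so B is block-diagonal with blocks of ones.
labels = np.array([0, 0, 1, 1, 1])
B = (labels[:, None] == labels[None, :]).astype(float)
theta = 0.7
n_b = np.array([np.sum(labels == l) for l in labels])  # cluster size per point

# Assumed form of W (see lead-in): W = diag(1/(1 + theta * n_b)),
# giving (I + theta*B)^{-1} = I - theta * W @ B.
W = np.diag(1.0 / (1.0 + theta * n_b))
inv_explicit = np.eye(5) - theta * W @ B

# Verify against the numerical inverse.
assert np.allclose(inv_explicit, np.linalg.inv(np.eye(5) + theta * B))
```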
After plugging in the maximum likelihood estimator of Λ, which for model III is Λ̂ = Ỹ^T (I_n + θB)^{−1} Ỹ/n, the conditional distribution on partitions of [n] depends on the group orbit and the assumptions made regarding Λ. For group I, with Λ ∝ I_d in the Gaussian model, the likelihood depends only on the distance matrix D, so the likelihood is constant on the orbits associated with the larger group of Euclidean similarities. Therefore, for model I, the similarity transformation can be generalized to Y′_i = a + AY_i with A^T A = σ²I_d and σ ≠ 0, implying that the arrays of distances are proportional, D′ = σ²D. Consequently, there is a representative element of the group orbit with feature mean vector 0, so that Vec(Y) ~ N(0, (I_n + θB) ⊗ σ²I_d).
For model II, the affine transformation can be generalized to Y′_i = a + AY_i, where a ∈ R^d and A ∈ R^{d×d} with A^T A a diagonal matrix with positive diagonal entries. As a result, there is a representative element of the group orbit with feature mean vector 0, so that Vec(Y) ~ N(0, (I_n + θB) ⊗ diag(σ²_1, ..., σ²_d)). This amounts to working with GA(R)^d, the general affine group acting independently on the d columns of Y. For model III, Λ is an arbitrary matrix in PD_d, the group is GA(R^d), and n > d + 1. The three models are nested: model I ⊂ model II ⊂ model III.
Affine invariance in R^d is a strong requirement, which comes at a small cost for moderate d provided that d/n is small. When d/n < 1, Ỹ^T (I_n + θB)^{−1} Ỹ is positive definite with probability one (McCullagh 2008), so model III works. However, when d/n < 1 but not small, model III may be inefficient because some eigenvalues of Ỹ^T (I_n + θB)^{−1} Ỹ, and hence its determinant, are close to zero (Dempster 1972; Stein 1975). As a result, the profile likelihood of Λ becomes unstable. In contrast, model II is less computationally expensive than model III, and model I is the most efficient of the three.

Markov chain Monte Carlo algorithm for sampling partitions
We use the prior and posterior distributions of θ and B discussed in Section 2 through a Markov chain Monte Carlo (MCMC) algorithm for sampling partitions. The update of θ is obtained by Gibbs sampling (Geman and Geman 1984) from its discretized conditional distribution; for instance, α = 1 and the discrete set {2^{−3}, 2^{−2}, ..., 2^{10}} for the range of θ are used as the default setting in our experiments. For updating B, the conditional distribution on partitions is p(B | Y, θ) ∝ p_n(B | λ) p(Y | B, θ), where p_n(B | λ) is the Ewens distribution, and a Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) is used to draw the next B; λ is set to 1 in the following applications. After burning in a certain number of iterations of the resulting Markov chain, we use the average of the partition matrices as the similarity matrix to make inference on the partition.
In the proposal distribution, d_xc denotes the distance between each point and the centroid of cluster c, and a is a scale hyperparameter, set to 2 in our experiments. More specifically, a partition candidate B* is generated by re-assigning the label of each point with probability proportional to the reciprocal of the distance between the point and each cluster centroid.
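A minimal sketch of the proposal step (our own illustration; the exact way the scale hyperparameter a enters the weights is not spelled out in this excerpt, so adding it to the distance as a guard against division by zero is our assumption):

```python
import numpy as np

def propose_partition(Y, labels, a, rng):
    """Candidate B*: re-assign each point's label with probability
    proportional to the reciprocal of its distance to each cluster
    centroid; a is a scale hyperparameter (assumed additive guard)."""
    clusters = np.unique(labels)
    centroids = np.array([Y[labels == c].mean(axis=0) for c in clusters])
    new = np.empty_like(labels)
    for i, y in enumerate(Y):
        d_xc = np.linalg.norm(centroids - y, axis=1)
        w = 1.0 / (d_xc + a)          # reciprocal-distance weights
        new[i] = clusters[rng.choice(len(clusters), p=w / w.sum())]
    return new

rng = np.random.default_rng(5)
Y = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])
labels = np.repeat([0, 1], 10)
proposal = propose_partition(Y, labels, a=2.0, rng=rng)
```

In the full algorithm, the candidate B* would then be accepted or rejected with the usual Metropolis-Hastings ratio against p(B | Y, θ).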
Since Algorithm 1 is a Metropolis-Hastings algorithm, it satisfies the detailed balance condition, and therefore the generated Markov chain has a stationary distribution (Chib and Greenberg 1995; Gamerman 1997; Robert and Casella 2010). Since we leave a small but positive probability that the partition stays the same, and the discrete posterior of θ is always positive in the Gibbs sampling, the transition probability from any state to itself is positive, which yields the following lemma.

Lemma 2 If n > d + 1, the (θ, B)-valued Markov chain constructed by Algorithm 1 is aperiodic.
Since there is always a positive probability that the partition splits further, down to the finest partition in which each element is its own cluster, all possible partitions communicate with each other, so the (θ, B)-valued Markov chain constructed by Algorithm 1 is irreducible. Given the sample size n, the size of the state space of B (the Bell number; Bell 1934) and the size of the state space of θ are both finite, so irreducibility also implies positive recurrence. Consequently, the (θ, B)-valued Markov chain constructed by Algorithm 1 is ergodic (Isaacson and Madsen 1976; Gilks et al. 1996). These properties are summarized in the following lemma and theorem, whose proofs are relegated to Appendix A.

Analysis of simulated and real data
We test the proposed Bayesian cluster process with Algorithm 1 on both synthetic and real data. Algorithm 1 with model I and point-wise updating is equivalent to the method of Vogt et al. (2010). If there is no prior information on the number of clusters, users can set the initial partition B to I_n, in which each observation is its own block. In practice, we use an initial number of clusters sampled from a discrete uniform distribution over a range chosen by the user. The clustering result is represented by the average of the sampled partition matrices, S = (1/(N − n_0)) Σ_{t=n_0+1}^N B^{(t)}, where N is the chain length and n_0 is the number of burn-in iterations. Furthermore, we also define a dissimilarity matrix D = 1_n 1_n^T − S. The dissimilarity matrix D can be displayed as a heatmap, a grayscale image with white as 1, black as 0, and shades of gray for values in between. The heatmap of the raw similarity matrix cannot be recognized with the naked eye, since the equivalence relation needs to be decoded from the matrix B. In practice, however, users can identify clusters by including the row and column names of the similarity matrix to see which individuals are clustered together; additionally, the heatmap function of the stats R package can permute the order of individuals to display cluster blocks with hierarchical dendrograms. It is challenging to monitor convergence of the Markov chain because the sampled clusters are random and may vary from iteration to iteration.

Huang and Yang Journal of Statistical Distributions and Applications
To determine convergence, we run Algorithm 1 ten times for each data set and stop the chain when the number of clusters remains the same over a given chain length (Chang and Fisher 2013).
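The similarity and dissimilarity matrices described above can be sketched as follows, using three hand-written toy "draws" of a partition of four points in place of actual MCMC output:

```python
import numpy as np

def partition_matrix(labels):
    """B_ij = 1 iff points i and j share a cluster label."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def similarity_matrix(B_samples, n_burn):
    """Posterior similarity matrix: average of the sampled partition
    matrices B^(t) after discarding the first n_burn iterations."""
    return np.mean(B_samples[n_burn:], axis=0)

# Three toy MCMC draws of the partition of 4 points.
draws = np.array([partition_matrix(l) for l in
                  [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]])
S = similarity_matrix(draws, n_burn=1)   # keep the last two draws
D = np.ones_like(S) - S                  # dissimilarity matrix 1_n 1_n' - S
print(S[0, 1], S[1, 2])                  # -> 0.5 0.5
```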

Illustrative simulated data
Four clusters on the vertices of a unit square
Three simulated data sets are generated for illustration. In the simulation study, 1000 initial burn-in iterations were discarded, and 2000 post-burn-in samples of B under each model were used to calculate D.
We first apply the proposed cluster process with model I to synthetic data with four clusters centered at the four vertices of a unit square. For each vertex μ_k, we generate 20 points from N(μ_k, (1/4)I_2), k = 1, ..., 4 (see Fig. 1, left panel). We call these data X_I and apply model I to cluster X_I using the average within- and between-cluster distances. The resulting heatmap successfully reveals the true clusters for most of the points (not shown here).
Then we transform the data by X_II = X_I diag(3, 1/3). The transformed features appear to form two groups (see Fig. 1, middle panel): clusters (1, 2) and clusters (3, 4). The cluster process with model I does not work well in this case, while the heatmap based on model II, without knowledge of the transformation, reveals the true clusters for most of the points (not shown here).
Furthermore, we transform the data by X_III = X_I A with A = (4.1, 2.1; 1.9, 1.1). The transformed features are aligned nearly along a straight line (see Fig. 1, right panel). The transformed data X_III are more difficult to cluster than X_I and X_II, since the original four clusters are no longer well separated after the transformation.
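The three synthetic configurations can be generated as follows (a sketch mirroring the description above; the random seed and sampler are our own choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# X_I: 20 points around each vertex of the unit square, sd = 1/2 (variance 1/4).
vertices = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X_I = np.vstack([rng.normal(mu, 0.5, size=(20, 2)) for mu in vertices])

# X_II: unequal per-coordinate scaling (the model II setting).
X_II = X_I @ np.diag([3.0, 1.0 / 3.0])

# X_III: a generic nonsingular linear map (the model III setting);
# this A has a small determinant, squeezing the clusters toward a line.
A = np.array([[4.1, 2.1], [1.9, 1.1]])
X_III = X_I @ A
```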
The resulting heatmap using model III with the initial clusters assigned randomly and uniformly from {1, 2, 3, 4} reveals the true four clusters for most of the points (see Fig. 2).

Applications to real data
Besides the synthetic data, we also evaluate the performance of the proposed approach on real data. We run 3000 MCMC iterations and burn in the first 1000 iterations.

Model I: Gene expression data of leukemia patients
Gene expression microarray data (Dua and Graff 2019) have been used to study genetic disorders, for example to identify diagnostic or prognostic biomarkers or to cluster and classify diseases (Dudoit et al. 2002). Golub et al. (1999) classified patients with acute leukemia into two subtypes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). For illustration purposes, we use the training set of the leukemia data, which consists of 3051 genes and 38 tumor mRNA samples. Pretending that we do not know the label information, we cluster the 38 samples according to their 3051 features (gene expression levels); the two true clusters comprise 27 ALL cases and 11 AML cases. Since the number of features is larger than the sample size, our approach is not applicable to this dataset directly. Therefore, we first reduce the dimension by projecting the data onto the subspace spanned by the first twenty principal components (PCs) (Jolliffe 1986). Note that the PC scores are orthogonal, which satisfies the assumption of model I. The resulting heatmap based on model I (Fig. 3) reveals the cluster of the 11 AML cases. The accuracy rate of the proposed model I, with initial clusters assigned randomly and uniformly from {1, 2}, is 0.9164, while the accuracy rates of k-means and Mclust are 0.6994 and 0.5886, respectively. We note that Mclust produced only one cluster.
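The dimension-reduction step can be sketched as follows. The random matrix stands in for the 38 × 3051 leukemia expression matrix (hypothetical data, not the actual dataset), and `pc_scores` is our own helper name:

```python
import numpy as np

def pc_scores(X, n_components):
    """First principal-component scores via SVD of the centered matrix;
    used to bring d below n - 1 before applying the cluster process."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Stand-in for the 38 x 3051 expression matrix (samples x genes).
rng = np.random.default_rng(7)
X = rng.normal(size=(38, 3051))
Z = pc_scores(X, 20)
assert Z.shape == (38, 20)        # now n = 38 > d + 1 = 21 holds
```

The resulting score columns are mutually orthogonal, consistent with the model I assumption noted above.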

Model II: Geographic coordinate data of Denmark's 3D Road Network
This three-dimensional road network dataset of geographic coordinates includes the altitude, latitude, and longitude of each road segment in North Jutland in northern Denmark, and is publicly available at the UC Irvine Machine Learning Repository (Kaul 2013; Dua and Graff 2019). Since the three spatial features are orthogonal, the dataset satisfies the assumption of model II, so we use it to demonstrate model II. Three subjects with road map OSM IDs 144552912 (19 observations), 125829151 (13 observations), and 145752974 (14 observations) are used for the clustering analysis. Note that each object may have several observations measured from different angles, and the altitude values are extracted from NASA's Shuttle Radar Topography Mission (SRTM) data (Jarvis et al. 2008). The accuracy rate of model II, with initial clusters assigned randomly and uniformly from {1, 2, 3, 4, 5}, is 1, while the accuracy rates of k-means with k = 3 and Mclust are 0.7486 and 0.9490, respectively. The resulting heatmap using model II (Fig. 4) reveals the 3 clusters correctly.

Model III: Iris data
The iris dataset (Fisher 1936) contains three species, Setosa, Versicolor, and Virginica, with four features: sepal length and width and petal length and width, measured in centimeters. Each species consists of 50 iris flowers, and the data points are clustered by their four features; here d = 4, n = 150, k = 3. The heatmap of the similarity matrix using model III correctly reflects the three clusters corresponding to the three species for most points (Fig. 5). The accuracy rate of the proposed model III, with initial clusters assigned randomly and uniformly from {1, 2, 3}, is 0.9087, while the accuracy rates of k-means with k = 3 and Mclust are 0.7740 and 0.7763, respectively. We note that both k-means and Mclust produced two clusters, grouping Versicolor and Virginica together.
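Since cluster labels are only defined up to permutation, accuracy rates like those above require matching the estimated labels to the true ones. One common way to do this (our own sketch; the paper does not spell out its exact metric) is accuracy under the best label permutation:

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true, pred):
    """Accuracy under the best matching of predicted labels to true labels
    (suitable for a small number of clusters; exhaustive over permutations)."""
    true, pred = np.asarray(true), np.asarray(pred)
    ks = np.unique(pred)
    best = 0.0
    for perm in permutations(np.unique(true), len(ks)):
        mapping = dict(zip(ks, perm))
        relabeled = np.array([mapping[p] for p in pred])
        best = max(best, float(np.mean(relabeled == true)))
    return best

# Toy example: same clusters as truth, permuted labels, one misassigned point.
true = [0] * 5 + [1] * 5 + [2] * 5
pred = [2] * 5 + [0] * 4 + [1] + [1] * 5
print(clustering_accuracy(true, pred))   # -> 0.9333... (14 of 15 correct)
```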

Concluding discussion
The proposed clustering method is invariant under different groups of affine transformations and is computationally efficient. It identifies clusters for most samples without knowing the number of clusters in advance, though it may split a large cluster into several small clusters. These issues are handled with an exchangeable partition prior, which avoids label-switching problems, and the partition-valued chain in the MCMC algorithm is invariant under linear transformations for the three types of covariance structures. The advantage of replacing the Dirichlet-multinomial prior with its limiting process is that we do not need to know the number of clusters in advance; the disadvantage is that it may be computationally less efficient when the number of clusters is known. Note that the proposed approach does not target the partition maximizing the posterior distribution; instead, it estimates the expected partition, i.e., the similarity matrix.
The three clustering models are based on the covariance matrix between variables, and there are guidelines for telling which model works best in practice, based on the experimental design or on examining the sample covariance matrix. If the features are orthonormal or orthogonal, then model I or model II is applicable, respectively; models I and II run faster than model III owing to the structure of the covariance matrix. Otherwise, model III can be used in general, and it works reasonably well across various applications. Since we use the profile likelihood of Λ in our model, we do not sample the covariance matrix directly, and Lemma 1 and Theorem 1 imply that the proposed Metropolis-Hastings algorithm works whenever n > d + 1. However, the maximum likelihood estimator (MLE) of a general unstructured covariance matrix will be less efficient if a diagonal covariance structure is actually correct, because it will tend to have small eigenvalues and a large determinant of the inverse covariance matrix. Indeed, when using model III, even if n > d + 1, the estimated covariance may be near singular, which can make the sampling less efficient, i.e., the acceptance rate may become small (Roberts and Rosenthal 2001). Although the stationary distribution of the sampled clusters' Markov chain under Algorithm 1 is independent of the initial clusters according to Theorem 2, we suggest in practice setting the initial clusters by sampling from a discrete uniform distribution over a user-given range, rather than making each individual its own cluster, in order to obtain convergent sampled clusters without a long Markov chain. This lets Algorithm 1 sample more efficiently from a smaller collection of partition candidates.
The proposed clustering algorithm produces the desired clusters with 2000 iterations after 1000 burn-in iterations in our experiments. The main contributions of our work are: 1) the proposed three clustering models, with three types of covariance structures, handle general cases of affine transformations, whereas Vogt et al. (2010) dealt only with the case of model I; 2) Algorithm 1 is efficient, since it updates all individuals' cluster labels per iteration instead of a single individual's, and it ensures that the resulting partition-valued Markov chain is ergodic and convergent in distribution; 3) the experiments show the advantages of our cluster process, which successfully identifies the true clusters using the proposed distance matrix; in particular, when the clusters are not well separated, the probabilistic similarity matrix can still reveal the relationships through hierarchical approaches. The proposed method could be used to extract information from aerial photography, genomic data, and data with attributes on different scales, especially when nearest neighbors may belong to different clusters in the feature space. In future work, the method could be improved by modeling the mean of each cluster with regression on covariates or by allowing non-Gaussian distributions.

Appendix A (fragment): D = I_{n−1} − 1_{n−1}1_{n−1}^T/n − 1_{n−1}c^T/n, with c = (c_1, ..., c_{n−1})^T. It can be verified that if c_1 + ··· + c_{n−1} ≠ 1, then rank(D) = n − 1 and rank(Ỹ_{(n−1)}) = n − 1; if c_1 + ··· + c_{n−1} = 1, then rank(D) = n − 2 and rank(Ỹ_{(n−1)}) = n − 2. Note that if n = d + 1 but rank(Ỹ_{(n−1)}) = n − 2, then Y ∉ Orb(Z).