Combining assumptions and graphical network into gene expression data analysis
Journal of Statistical Distributions and Applications volume 8, Article number: 9 (2021)
Abstract
Background
Analyzing gene expression data rigorously requires taking assumptions into consideration and also relies on using information about network relations that exist among genes. Combining these different elements can not only improve statistical power, but also provide a better framework through which gene expression can be properly analyzed.
Material and methods
We propose a novel statistical model that combines assumptions and gene network information into the analysis. Assumptions are important since every test statistic is valid only when the required assumptions hold. We therefore propose hybrid p-values and show that, under the null hypothesis of primary interest, these p-values are uniformly distributed. The proposed hybrid p-values take assumptions into consideration. We incorporate gene network information into the analysis because neighboring genes share biological functions. This correlation is taken into account via similar prior probabilities for neighboring genes.
Results
Our approach is compared with other approaches through a series of simulations. Areas under the ROC curve (AUCs) are computed to compare the different methodologies; the AUC based on our methodology is larger than the others. For regression analysis, the AUC from our proposed method exceeds the AUCs of the Spearman test and of the Pearson test. In addition, true negative rates (TNRs), also known as specificities, are higher with our approach than with the other approaches. For the two-group comparison analysis, for instance, with a sample size of n=10, the specificity corresponding to our proposed methodology is 0.716146, and the specificities for the t-test and the rank sum test are 0.689223 and 0.69797, respectively. Our method, which combines assumptions and network information into the analysis, is shown to be more powerful.
Conclusions
These proposed procedures are introduced as a general class of methods that can incorporate procedure selection, account for multiple testing, and incorporate graphical network information into the analysis. We obtain very good performance in simulations and in real data analysis.
Introduction
Gene expression data can be analyzed in a multiple testing setting as well as with many other statistical methods. The validity of each test depends on the underlying distributional assumptions of the test. A proper analysis of gene expression data requires taking assumptions, usually normality, into consideration (Pounds and Fofana 2012; Pounds and Rai 2009). In addition to incorporating distributional assumptions into the overall testing, it may also be informative to incorporate any prior knowledge of association between entities (Bowman and George 1995). Such associations are often recorded by graphical networks (Wei and Pan 2008). Combining these different elements, besides gaining statistical power, provides a framework through which analysis of gene expression data can be improved. We propose a novel statistical approach that incorporates testing for distributional assumption validity with prior information provided by a gene graphical network. In particular, we use graphical networks to incorporate spatial dependence into the analysis of gene expression data. The spatial correlation is taken into account by assuming similar prior probabilities for neighboring genes.
We compare our approach with other methods through a series of simulations and demonstrate that the hybrid-network method improves power over other approaches in most of the settings. The comparison of the different methodologies is based on specificities and/or the area under the ROC curve (AUC). The specificity of a test, also called the true negative rate, is the proportion of genuinely negative samples that test negative using the test in question. A receiver operating characteristic (ROC) curve shows the performance of a classification model at all classification thresholds. An ROC curve is constructed by plotting sensitivities (true positive rates) on the y-axis against false positive rates on the x-axis.
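The quantities used for these comparisons can be illustrated with a minimal sketch. The labels and scores below are hypothetical toy values, not results from the study, and the function names are chosen here for illustration only; a score at or above the threshold is treated as a positive call.

```python
def confusion_rates(labels, scores, threshold):
    """Return (sensitivity, specificity); scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(labels, scores):
    """ROC points (false positive rate, true positive rate), sorted by FPR."""
    points = {(0.0, 0.0), (1.0, 1.0)}
    for t in set(scores):
        sens, spec = confusion_rates(labels, scores, t)
        points.add((1.0 - spec, sens))
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data: 1 = truly positive, 0 = truly negative.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
sens, spec = confusion_rates(labels, scores, 0.5)
area = auc(roc_points(labels, scores))
print(sens, spec, area)
```

In this toy example one of the four true negatives scores above the threshold, so the specificity is 3/4, while the AUC summarizes performance over all thresholds at once.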
The network analysis we use is the conditional autoregressive (CAR) model. CAR models are commonly used to represent spatial autocorrelation in data relating to a set of non-overlapping areal units. Those models are typically specified in a hierarchical Bayesian framework, with inference based on Markov chain Monte Carlo (MCMC) simulation. The most widely used software to fit CAR models is WinBUGS or OpenBUGS. In our work, we use an R function, BUGS(·), that helps run OpenBUGS inside R. Another R function, CARBayes(·), described in Lee (2013), can be used for Bayesian spatial modeling with conditional autoregressive priors. With CARBayes, the spatial adjacency information can be specified as a neighbourhood matrix, whereas with BUGS(·) the user has to specify an adjacency matrix.
Material and methods
Network information can be represented by directed or undirected graphs. Graphs are structures of discrete mathematics and have found applications in scientific disciplines that consider networks of interacting elements, such as genes that interact by sharing some biological resemblances. A graph consists of a set of nodes and a set of edges that connect the nodes. Usually the nodes are the entities of interest. For instance, each gene can be considered a node and the edges the relationships among the genes. A graph can be used in a practical way by developing software to translate between representations, a process sometimes referred to as “coercion”.
In data analysis, graphs provide a data structure for knowledge representation, for example in the Gene Ontology (GO). Many studies incorporate gene network information in data analysis through the GO project. Graphs provide a computational object that can easily and naturally be used to reflect physical objects and relationships of interest. Graphs are important to statistical methodology for exploratory data analysis. A knowledge-representation graph can be juxtaposed with observed data to guide the discovery of important phenomena in the data. In statistical inference, inferential statements can be made about relations between genes due to significantly frequent co-citation, or about relations between gene expression and protein complexes (Wei and Pan 2008).
A graph may be directed or undirected. A directed edge is an ordered pair of endvertices that can be represented graphically as an arrow drawn between the endvertices. In such an ordered pair the first vertex is called initial vertex or tail and the second the terminal vertex or head. An undirected graph disregards any sense of direction and treats both head and tail identically, see Fig. 1.
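As a small illustration of translating between graph representations, the sketch below turns an edge list into an adjacency matrix for both the undirected and the directed case. The gene labels and edges are hypothetical, and the function name is chosen here for illustration.

```python
def adjacency_matrix(nodes, edges, directed=False):
    """Build a 0/1 adjacency matrix from a list of (tail, head) edges."""
    index = {name: i for i, name in enumerate(nodes)}
    adj = [[0] * len(nodes) for _ in nodes]
    for tail, head in edges:
        adj[index[tail]][index[head]] = 1
        if not directed:
            # An undirected edge treats head and tail identically.
            adj[index[head]][index[tail]] = 1
    return adj


genes = ["g1", "g2", "g3", "g4"]                      # hypothetical gene labels
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]    # hypothetical edges
undirected = adjacency_matrix(genes, edges)
directed = adjacency_matrix(genes, edges, directed=True)

# Neighbor lists recovered from the undirected matrix.
neighbors = {g: [genes[j] for j, a in enumerate(row) if a]
             for g, row in zip(genes, undirected)}
print(neighbors)
```

The undirected matrix is symmetric, whereas the directed one records only the tail-to-head direction of each arrow.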
Theory/Calculation
Statistical models for hybrid testing
Consider the following multiple hypothesis tests
with θ_{g} a parameter for gene g, G the total number of genes, H_{0g} the null hypothesis, and H_{1g} the alternative hypothesis. Suppose two mutually exclusive test procedures, M_{1} and M_{2}, can be used to perform these statistical tests. When M_{1} is used, suppose T_{1}={T_{11},⋯,T_{1G}} represents the test statistics and P_{1}={P_{11},⋯,P_{1G}} the corresponding set of p-values, and suppose T_{2}={T_{21},⋯,T_{2G}} and P_{2}={P_{21},⋯,P_{2G}} are the corresponding quantities for procedure M_{2}.
Suppose A_{g}=i indicates that the assumption for gene g holds for procedure M_{i} for testing H_{0g} vs H_{1g}, i=1,2. For testing
suppose T_{a}={T_{a1},⋯,T_{aG}} are the test statistics obtained from A_{g} with the corresponding set of p-values P_{a}={P_{a1},⋯,P_{aG}}.
From this method, we define an appropriate summary statistic and denote it by P={P_{1},⋯,P_{G}} with
The following theorem states the distribution of P_{g} under the null hypothesis H_{0g} of Eq. (1).
Theorem
(Hybrid p-values). Suppose there are only two mutually exclusive procedures M_{1} and M_{2} that can be used to test the null hypothesis
Let P_{1} be the p-value obtained if method M_{1} is used for testing the null hypothesis H_{0}, and P_{2} be the p-value if method M_{2} is used instead. Let P be defined by
Then P is uniformly distributed under the null hypothesis H_{0}. □
Proof
First, we recall some basic probability theory. Let M_{1},M_{2},⋯,M_{n} be a partition of a sample space Ω, that is \(M_{i}\cap M_{j}=\varnothing \ \forall i\neq j\) and \( \bigcup _{i=1}^{n} M_{i}=\Omega.\) Then, for any event E⊂Ω,
\( \begin {array}{lll} E&=& E\cap \left (\bigcup _{i=1}^{n} M_{i}\right) \\ &=& \bigcup _{i=1}^{n}(E\cap M_{i})\\ \end {array} \)
and then for any probability \(\mathbb {P}\),
\( \begin {array}{lll} \mathbb {P}(E)&=&\mathbb {P}\left (E\cap \left (\bigcup _{i=1}^{n} M_{i}\right)\right)\\ &=&\mathbb {P}\left (\bigcup _{i=1}^{n}\left (E\cap M_{i}\right)\right)\\ &=&\sum _{i=1}^{n}\mathbb {P} (E\cap M_{i}) \text { (law of total probability)}\\ &=&\sum _{i=1}^{n}\mathbb {P}(E\mid M_{i})\times \mathbb {P}(M_{i}) \text { (multiplication rule).} \\ \end {array} \)
Also, the law of total probability holds for conditional probability, that is
\( \begin {array}{lcl} \mathbb {P}(E\mid B)&=&\sum _{i=1}^{n}\mathbb {P} (E\cap M_{i}\mid B), \ \forall \text { event } B \text { (law of total probability)}\\ &=&\sum _{i=1}^{n}\mathbb {P}(E\mid M_{i}, B)\times \mathbb {P}(M_{i}\mid B) \text { (multiplication rule).} \\ \end {array} \)
We recall that \(\mathbb {P}(\Omega)=\mathbb {P}\left (\bigcup _{i=1}^{n} M_{i}\right)=\sum _{i=1}^{n}\mathbb {P}(M_{i})=1.\)
The goal is to show that the hybrid p-value, P, follows a uniform distribution under the null hypothesis (H_{0}); that is, \(F_{P}(p)=\mathbb {P}(P< p\mid H_0)=p, \forall p \in (0,1),\) with F_{P} the cumulative distribution function of P.
Under the null hypothesis (H_{0}) of primary interest (the gene is not expressed, say) and under M_{1} and M_{2}, respectively, both P_{1} and P_{2} are uniformly distributed, that is, \(\mathbb {P}(P_1< p\mid M_{1}, H_0)=p\) and \(\mathbb {P}(P_2< p\mid M_{2}, H_0)=p\); see Pounds and Rai (2009), for instance. Recall that P is a random variable, since P_{1} and P_{2} are random variables. For the proof, we consider M_{1} and M_{2} as two events. The notation ∣H_{0} means under the null hypothesis (H_{0}).
\( \begin {array}{lcl} \mathbb {P}(P< p\mid H_0)&=&\mathbb {P}\left \{(P< p)\cap [M_{1}\cup M_{2}]\mid H_{0}\right \} \text { (since }M_{1}\text { and }M_{2}\text { form a partition)}\\ &=&\mathbb {P}\left \{(P< p)\cap M_{1}\mid H_{0}\right \}+\mathbb {P}\left \{(P< p)\cap M_{2}\mid H_{0}\right \} \\ & & \text {(since }M_{1}\text { and }M_{2}\text { are mutually exclusive; law of total probability)}\\ &=&\mathbb {P}(P< p\mid M_{1}, H_0)\mathbb {P}(M_{1}\mid H_0)+\\ & & \mathbb {P}(P< p\mid M_{2}, H_0)\mathbb {P}(M_{2}\mid H_0)\\ & & \text { (multiplication rule)} \\ &=&\mathbb {P}(P_1< p\mid M_{1}, H_0)\mathbb {P}(M_{1}\mid H_0)+\mathbb {P}\left (P_2< p\mid M_{2}, H_{0}\right)\mathbb {P}(M_{2}\mid H_0)\\ & & \text { (since }P=P_{1}\text { under }M_{1}\text { and }P=P_{2}\text { under }M_{2}\text {)} \\ &=&p\mathbb {P}(M_{1}\mid H_0)+p\mathbb {P}(M_{2}\mid H_0)\\ &=&p\mathbb {P}(M_{1}\mid H_0)+p(1-\mathbb {P}(M_{1}\mid H_0))\\ &=&p. \end {array} \) □
Thus P is uniformly distributed under H_{0}.
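The mixture argument of the proof can be checked numerically with a minimal sketch. It makes a simplifying assumption that is stronger than the theorem requires: the choice between M_{1} and M_{2} is drawn independently of the two p-values, with a hypothetical selection probability `prob_m1`. Under that assumption, the empirical distribution function of the hybrid p-value should stay close to the diagonal.

```python
import random


def simulate_hybrid_pvalues(n, prob_m1, seed=0):
    """Draw n hybrid p-values under H0 (selection independent of the p-values)."""
    rng = random.Random(seed)
    hybrid = []
    for _ in range(n):
        p1, p2 = rng.random(), rng.random()   # each uniform under its own procedure
        use_m1 = rng.random() < prob_m1       # event M1 occurs with probability prob_m1
        hybrid.append(p1 if use_m1 else p2)
    return hybrid


def empirical_cdf(values, p):
    """Fraction of values strictly below p."""
    return sum(v < p for v in values) / len(values)


pvals = simulate_hybrid_pvalues(n=20000, prob_m1=0.3)
max_gap = max(abs(empirical_cdf(pvals, p) - p)
              for p in (0.05, 0.25, 0.5, 0.75, 0.95))
print(max_gap)
```

Changing `prob_m1` leaves the result essentially unchanged, which mirrors the last steps of the proof where the selection probabilities sum to one.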
Now, transform the p-values by
\( Z_{g}=\Phi ^{-1}(1-P_{g}), \)
where Φ is the cumulative distribution function of the standard normal distribution N(0,1), and P_{g} is the p-value corresponding to test g. The null distribution of Z_{g} is the standard normal under H_{0g} of Eq. (1). Assume that under the alternative \( Z_{g}\sim N\left (\mu _{1},\sigma _{1}^{2}\right),\) then
where \(\phi (\cdot ;\mu _{1},\sigma _{1}^2)\) is the probability density function of \(N(\mu _{1},\sigma _{1}^2),\) and f is a density function.
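The transformation from p-values to z-statistics, written later in the text as z_{g}=Φ^{−1}(1−p_{g}), can be sketched with the Python standard library. The clipping constant `eps` is an assumption added here to keep the quantile function finite for p-values of exactly 0 or 1.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def p_to_z(p, eps=1e-12):
    """Map a p-value to a z-statistic via z = Phi^{-1}(1 - p)."""
    p = min(max(p, eps), 1.0 - eps)   # guard against p = 0 or p = 1
    return STD_NORMAL.inv_cdf(1.0 - p)


z_values = [p_to_z(p) for p in (0.5, 0.05, 0.001)]
print(z_values)
```

Small p-values map to large positive z-statistics, so evidence against the null hypothesis appears in the right tail.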
Bayesian hierarchical models for spatial data
Conditional autoregressive (CAR) models are commonly used to represent spatial autocorrelation in data relating to a set of non-overlapping areal units. Such data are prevalent in many fields, such as agriculture (Besag and Higdon 1999) and epidemiology (Lee 2011). There are three different CAR priors commonly used to model spatial autocorrelation. Each model is a special case of a Gaussian Markov random field (GMRF) that can be written in a general form as
where Q is a precision matrix that controls the spatial autocorrelation structure of the random effects and is based on a non-negative symmetric G×G neighborhood or weight matrix W=(w_{kj}), where w_{kj}=1 if genes k and j are neighboring genes and w_{kj}=0 otherwise, and ϕ=(ϕ_{1},⋯,ϕ_{G}) is a set of random effects. CAR priors are commonly specified as a set of G univariate full conditional distributions ξ(ϕ_{k}∣ϕ_{−k}) for k=1,⋯,G, where ϕ_{−k}=(ϕ_{1},⋯,ϕ_{k−1},ϕ_{k+1},⋯,ϕ_{G}) and G is the total number of genes (Lee 2013; Lee 2011). The first CAR prior, proposed by Besag et al. (1991), is given by
The conditional expectation is the average of the random effects in neighboring genes, while the conditional variance is inversely proportional to the number of neighbors. The inverse proportionality arises because, if random effects are spatially correlated, then the more neighbors a node has, the more information its neighbors carry about the value of its random effect (subject-specific effect). This first CAR prior is used to implement the hybrid-network methodology, as in Wei and Pan (2008). The second CAR prior, proposed by Leroux et al. (1999), is given by
while the third CAR prior proposed by Stern and Cressie (1999) is defined by
where ρ is a spatial autocorrelation parameter, with ρ=0 corresponding to independence and ρ=1 corresponding to strong spatial autocorrelation. A uniform prior on the unit interval is specified for ρ, that is, ρ∼U(0,1), while the usual uniform prior on (0,M_{τ}) is assigned to τ^{2}, with the default value being M_{τ}=1000. The intrinsic CAR prior of Besag et al. (1991) is obtained from the second and third CAR priors when ρ=1, while when ρ=0 the two priors differ in the denominator of the conditional variances.
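The verbal description of the first (intrinsic) CAR prior can be sketched as follows: the conditional mean of a random effect is the average over its neighbors, and the conditional variance is τ^{2} divided by the number of neighbors. The weight matrix, the random effects and the value of τ^{2} below are hypothetical, and the sketch assumes a 0/1 weight matrix with a zero diagonal.

```python
def icar_conditional(k, phi, W, tau2):
    """Conditional mean and variance of phi[k] given the other effects (intrinsic CAR)."""
    n_k = sum(W[k])                      # number of neighbors of node k
    if n_k == 0:
        raise ValueError("node has no neighbors; the conditional is undefined")
    mean = sum(w * v for w, v in zip(W[k], phi)) / n_k
    return mean, tau2 / n_k


# Hypothetical path graph 0 - 1 - 2 - 3 and hypothetical random effects.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
phi = [1.0, 2.0, 4.0, 8.0]

mean_end, var_end = icar_conditional(0, phi, W, tau2=1.0)   # one neighbor
mean_mid, var_mid = icar_conditional(1, phi, W, tau2=1.0)   # two neighbors
print(mean_end, var_end, mean_mid, var_mid)
```

The interior node has two neighbors and therefore half the conditional variance of the end node, which is the inverse proportionality described above.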
Standard and spatial normal mixture model
Multiple testing is often an essential step in the analysis of high-dimensional data, such as genomic or proteomic data. The data analysis can be based on p-values, z-scores, t-scores, etc. These test statistics are obtained from data reduction techniques. The hybrid p-values discussed in the “Statistical models for hybrid testing” section are an example. Consider, for example, a test statistic Z. We can assume that across hypotheses g=1,⋯,G the test statistic Z_{g} follows a two-component mixture with density f as in (5). From this two-component mixture, two different types of mixture models, the standard and spatial normal mixture models, are considered. While spatial normal mixture models consider network information in the analysis, the standard normal mixture models do not.
Standard normal mixture model
In a standard two-component mixture model, Z_{g} has a density function f of the form
\( f(z_{g})=\pi _{0}f_{0}(z_{g})+(1-\pi _{0})f_{1}(z_{g}), \)
where π_{0} is the proportion of genes that are not expressed (null hypothesis), f_{0} is the distribution of Z_{g} under the null hypothesis, and f_{1} is the distribution of Z_{g} under the alternative hypothesis.
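A minimal sketch of the two-component mixture follows, assuming a standard normal null component and a normal alternative component as described earlier in the text. The parameter values are hypothetical, and the posterior null probability is obtained by applying Bayes' theorem to the two components.

```python
from statistics import NormalDist


def mixture_density(z, pi0, mu1, sigma1):
    """Two-component density: pi0 * N(0, 1) + (1 - pi0) * N(mu1, sigma1^2)."""
    f0 = NormalDist(0.0, 1.0).pdf(z)
    f1 = NormalDist(mu1, sigma1).pdf(z)
    return pi0 * f0 + (1.0 - pi0) * f1


def posterior_null(z, pi0, mu1, sigma1):
    """Posterior probability that a statistic z comes from the null component."""
    f0 = NormalDist(0.0, 1.0).pdf(z)
    return pi0 * f0 / mixture_density(z, pi0, mu1, sigma1)


# Hypothetical parameter values for illustration.
pi0, mu1, sigma1 = 0.8, 2.0, 1.0
for z in (0.0, 1.0, 3.0):
    print(z, round(posterior_null(z, pi0, mu1, sigma1), 4))
```

At z=1 the two component densities coincide for these values, so the posterior null probability equals the prior proportion π_{0}; larger z-statistics push it toward zero.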
Spatial normal mixture model
In a spatial normal mixture model, one defines gene-specific prior probabilities
where T_{g} is defined by
therefore, the marginal distribution of Z_{g} is
where z_{g} is the expression value of gene g for g=1,⋯,G, and π_{g1}=1−π_{g0}. It is believed that genes on the same network, that is a group of genes with the same function, share the same prior probability of expression while different networks have possibly varying prior probabilities. The prior probabilities π_{gs}, based on a gene network, are related to two latent Markov random fields x_{s}={x_{gs};g=1,⋯,G},s=0,1 by a logistic transformation:
Each of the G-dimensional latent vectors x_{s} is distributed according to an intrinsic Gaussian conditional autoregression model (ICAR) (Besag and Kooperberg 1999). The distribution of each spatial latent variable x_{gs} conditional on x_{−gs}={x_{ks};k≠g} depends only on its direct neighbors. To be more specific,
where δ_{g} is the set of indices for the neighbors of gene g, and m_{g} is the corresponding number of neighbors. The other model specifications are as follows
g=1,⋯,G and s=0,1. The network structure is summarized in a matrix format called an adjacency matrix: Adj=(a_{ij}), i=1,⋯,G; j=1,⋯,G, where
Prior distributions
In a standard normal mixture model, a beta distribution is often assumed as the prior distribution for π_{0}. In a spatial normal mixture model, gene-specific prior probabilities are introduced. For the spatial normal mixture model, the prior probabilities π_{gs}, based on a gene network, are related to two latent Markov random fields (MRFs), as mentioned previously. From Eq. (14), we assume priors on the variance components \(\sigma _{{cs}}^{2}\sim \text {inverse-gamma}(0.01, 0.01);\) the corresponding precision \(\frac {1}{\sigma _{{cs}}^{2}}\) has a gamma(0.01,0.01) distribution with mean 1 and variance 100. \(\sigma _{{cs}}^{2}\) acts as a smoothing parameter for the spatial field and consequently controls the degree of dependency among the prior probabilities of the genes. The size of \(\sigma _{{cs}}^{2}\) determines how similar the π_{gs} are. The smaller the \(\sigma _{{cs}}^{2}\) are, the more similar the π_{gs}.
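The link between the two latent fields and the gene-specific prior probabilities can be sketched as below. The exact form of the logistic transformation is an assumption made here: π_{gs} is taken to be exp(x_{gs}) divided by exp(x_{g0})+exp(x_{g1}), so that π_{g0}+π_{g1}=1 for every gene. The latent values are hypothetical.

```python
import math


def prior_probabilities(x0, x1):
    """Map two latent fields to gene-specific priors (pi_g0, pi_g1).

    Assumed form: pi_gs = exp(x_gs) / (exp(x_g0) + exp(x_g1)).
    """
    priors = []
    for a, b in zip(x0, x1):
        m = max(a, b)                       # subtract the max for numerical stability
        e0, e1 = math.exp(a - m), math.exp(b - m)
        priors.append((e0 / (e0 + e1), e1 / (e0 + e1)))
    return priors


# Hypothetical latent values for three genes.
x0 = [0.0, 2.0, -1.0]
x1 = [0.0, 0.0, 1.0]
priors = prior_probabilities(x0, x1)
print(priors)
```

Because neighboring genes have similar latent values under the CAR prior, this mapping gives them similar prior probabilities, and a smaller variance component makes the latent values, and hence the priors, more alike.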
Maximum a posteriori estimation
An estimation of a standard mixture model via maximum a posteriori estimation (MAPE) is used to show the effectiveness of Bayesian estimation for mixture models. Consider a standard mixture model, Eq. (10), with
with \(\boldsymbol \theta _s=\left (\mu _{s},\sigma _{s}^{2} \right), s=0,1\) and Z is a gene expression test statistic. A direct approach to estimate π_{0},π_{1},θ_{0}, and θ_{1} is to compute the likelihood function
and the log likelihood as
Obtaining MAPEs of the parameters directly is not possible. To estimate the parameters, the expectation-maximization (EM) algorithm may be used. In order to use the EM algorithm, define latent variables v={(v_{gk},z_{gk})∣k=1,⋯,n and g=1,⋯,G} where
where G_{0} (genes not expressed) and G_{1} (expressed genes) are the null and alternative groups, respectively, and n is the sample size common to all genes. Including the latent variables gives the complete data: the observed z^{′}s and the unobserved v^{′}s. The maximum a posteriori function for the complete data is
Taking the logarithm of Eq. (19), we get the log maximum a posteriori function as
The EM algorithm can be used to obtain MAPEs of π_{0},π_{1},θ_{0} and θ_{1}, if (z_{1k},z_{2k},⋯,z_{Gk}) are assumed to be independent.
However, since there is a graphical network among genes, (z_{1k},z_{2k},⋯,z_{Gk}) are not independent. To take the gene graphical network into account, a Bayesian methodology is used. Network information is brought into the analysis by generating latent variables according to GMRFs as in Eq. (14). After assigning prior distributions to the parameters, posterior distributions can be found using a partial Gibbs sampler and Metropolis-Hastings steps. We use the OpenBUGS software to get the MAPEs of π_{0},π_{1},θ_{0}, and θ_{1}.
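For the independent case, the EM iteration for a two-component normal mixture can be sketched as follows. This is a plain maximum-likelihood version with no prior terms, so it illustrates the E-step and M-step structure rather than the MAPE or the OpenBUGS fit used in the paper. The simulated data, starting values and iteration count are assumptions chosen for illustration.

```python
import math
import random


def normal_pdf(z, mu, var):
    return math.exp(-(z - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_normals(z, n_iter=200):
    """EM for pi0 * N(mu0, var0) + (1 - pi0) * N(mu1, var1), assuming independent z."""
    zs = sorted(z)
    half = len(zs) // 2
    pi0 = 0.5
    # Crude starting values: means of the lower and upper halves of the data.
    mu = [sum(zs[:half]) / half, sum(zs[half:]) / (len(zs) - half)]
    var = [1.0, 1.0]
    for _ in range(n_iter):
        # E-step: probability that each observation belongs to component 0.
        w0 = []
        for x in z:
            a = pi0 * normal_pdf(x, mu[0], var[0])
            b = (1.0 - pi0) * normal_pdf(x, mu[1], var[1])
            w0.append(a / (a + b))
        # M-step: weighted proportion, means and variances.
        n0 = sum(w0)
        n1 = len(z) - n0
        pi0 = n0 / len(z)
        mu[0] = sum(w * x for w, x in zip(w0, z)) / n0
        mu[1] = sum((1.0 - w) * x for w, x in zip(w0, z)) / n1
        var[0] = max(sum(w * (x - mu[0]) ** 2 for w, x in zip(w0, z)) / n0, 1e-6)
        var[1] = max(sum((1.0 - w) * (x - mu[1]) ** 2 for w, x in zip(w0, z)) / n1, 1e-6)
    return pi0, mu, var


# Hypothetical data: 70% null N(0, 1) statistics and 30% alternative N(3, 1).
rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) if rng.random() < 0.7 else rng.gauss(3.0, 1.0)
     for _ in range(2000)]
pi0_hat, mu_hat, var_hat = em_two_normals(z)
print(pi0_hat, mu_hat, var_hat)
```

The independence assumption enters through the E-step, where each weight depends only on its own observation; this is the assumption that the network-based Bayesian model relaxes.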
Statistical inference
The decision rule for rejecting or accepting the null hypotheses is based on posterior probabilities. For each gene g, g=1,⋯,G, the point estimate \(\hat {p}(H_{0g}\mid Data)\) of p(H_{0g}∣Data) is computed and compared to a threshold τ. H_{0g} is rejected when this point estimate is less than τ.
The p-values p_{g} obtained from the hybrid method are transformed, and the transformed statistics z_{g}=Φ^{−1}(1−p_{g}) are used, with Φ^{−1} the standard normal quantile function. Through Bayesian modeling, network information is added to the analysis. With Bayesian inference, the posterior estimates are \(\hat {\pi }_{g0}=\hat {p}(H_{0g}\mid Data)\). Inferences for the Bayesian hierarchical models are obtained using MCMC simulations, with a combination of Gibbs sampling and Metropolis steps. Gibbs sampling is used for full conditional posteriors with closed forms. For those that are not in closed form, the Metropolis-Hastings algorithm is used.
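The decision rule amounts to a simple comparison per gene, sketched below. The gene names and posterior estimates are hypothetical, and the default threshold of 0.1 is only an example value.

```python
def call_expressed(posterior_null, tau=0.1):
    """Reject H0 (call the gene expressed) when the posterior null estimate is below tau."""
    return {gene: p < tau for gene, p in posterior_null.items()}


# Hypothetical posterior estimates of p(H0 | Data) for three genes.
estimates = {"geneA": 0.62, "geneB": 0.04, "geneC": 0.10}
calls = call_expressed(estimates, tau=0.1)
print(calls)
```

The inequality is strict, so a gene whose estimate equals the threshold is not called expressed.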
Results
Simulations
To compare the hybrid-network method with other methods, we conducted simulation studies designed to mimic real data analysis. We conducted standard two-group comparison studies (treatment vs control), k-group (k>2) comparison (ANOVA), and regression analysis. The k-group comparison is directly applicable to a genomic study comparing human ependymoma, a brain tumor that occurs in three distinct anatomic regions: Posterior Fossa (PF), Spine (SP), and Supratentorial (ST). Regression analysis is often useful to determine whether, for example, gene expression levels are related to a particular covariate such as DNA synthesis rate (INHIBO).
For each of the three types of analyses conducted in the simulation studies, two different tests can be used. The first one requires the normality assumption, while the second may be appropriate when the normality assumption does not hold. For the two-group comparison, the hybrid-network method chooses between the standard t-test for normally distributed data and the Wilcoxon test when the normality assumption fails. For the k-group (k>2) comparison, the hybrid-network method chooses between the standard ANOVA test and the Kruskal-Wallis test. For the regression analysis, the Pearson test for linear dependency is chosen when the normality assumption holds and the Spearman test when it does not.
In Eq. (12), we use \(\hat {\pi }_{g0}\), the estimate of π_{g0}. The decision rule consists of rejecting the null hypothesis, H_{0g}, for gene g, if \(\hat {\pi }_{g0}\) is less than a threshold, τ. The conclusion is then that the corresponding gene g is expressed. In cancer data analysis, for instance, if a gene is expressed, health researchers will target that gene in the search for a cure.
The comparison of the different methodologies is mainly based on specificities (not rejecting the null hypotheses when they are true, sometimes called true negatives). We could provide both specificities and sensitivities (rejecting the null hypotheses when they are not true, sometimes called true positives), but we have decided to compute only specificities because the simulations are computationally intensive.
K-group comparison study
In a group comparison study, gene expression data can be modeled as:
where Y_{gij} is the expression level for gene g of the j^{th} individual in the i^{th} group,
k is the number of groups, n_{i} is the sample size of group i, and
In a two-group comparison (k=2), interest is in statistical tests of the form
g=1,⋯,G. Some gene expression levels may be normally distributed while others are not. In the two-group comparison study, two tests are often used. The t-test is used when the normality assumption holds, and the Wilcoxon test, a nonparametric test, is often used when the normality assumption does not hold. For each gene g, a t-test statistic, a Wilcoxon-Mann-Whitney rank sum test statistic, and a Shapiro-Wilk test statistic are computed. The adequacy of the t-test statistics is diagnosed through residuals, which we compute from the t-test model. We define the residual for observation j in treatment i for gene g as
where \(\hat {Y}_{{gij}}\) is an estimate of the corresponding observation Y_{gij} obtained as follows:
If the model is adequate, residuals should be structureless; that is, they should contain no obvious patterns. Through an analysis of residuals, many types of model inadequacies and violations of the underlying assumptions can be discovered. We use the residuals to check for normality. A probit plot of residuals is an extremely useful procedure to test for normality. If the underlying error distribution is normal, this plot will resemble a straight line. Also outliers can be detected through residuals. Outliers show up on probability plots as being very different from the main body of the data. Plotting the residuals in time order of data collection is helpful in detecting correlation between the residuals. This is useful for checking independence assumptions on the errors.
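A residual check of this kind can be sketched as follows, under two assumptions made here for illustration: the fitted value for an observation is its group mean, and the normal probability plot uses the plotting positions (i−0.5)/n. The expression values are hypothetical.

```python
from statistics import NormalDist, mean


def group_residuals(groups):
    """Residuals y_ij minus the mean of group i (fitted value assumed to be the group mean)."""
    return [y - mean(g) for g in groups for y in g]


def normal_plot_points(residuals):
    """(theoretical normal quantile, ordered residual) pairs for a probability plot."""
    n = len(residuals)
    std = NormalDist()
    return [(std.inv_cdf((i - 0.5) / n), r)
            for i, r in enumerate(sorted(residuals), start=1)]


# Hypothetical expression values for one gene in two groups.
groups = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
residuals = group_residuals(groups)
points = normal_plot_points(residuals)
print(residuals)
print(points)
```

If the errors are normal, the pairs fall close to a straight line; points far from the line at either end suggest outliers or heavy tails.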
To compare the hybrid-network method with other methods, we perform a simulation study. In this setup, there are two groups, with sample sizes of 5, 10, 25, and 50. The number of genes whose expression has a normal distribution, N(μ,1), is 30. For these genes, μ=0 under the null hypothesis and μ=1 under the alternative. The remaining genes have lognormally distributed expression, lognormal(μ,1), with μ=0 in some cases and μ=1 in other cases. A graphical network with 212 edges, Fig. 2, is built among the genes. We translate this graphical network into an adjacency matrix.
The results, presented in Table 1, show that the hybrid-network procedure dominates the other methodologies in most of the settings, since the hybrid-network test specificities are higher than the specificities of the other methods. When the sample size is equal to 5, for instance, the specificity corresponding to the t-test is 0.571726, the specificity corresponding to the Wilcoxon test is 0.557244, and the specificity for the hybrid-network test is 0.575314.
Hybrid ANOVA-Kruskal-Wallis study
In a k-group comparison study, a statistical model can be written as Eq. (21). For the model (21), μ_{g} is a parameter common to all treatments for gene g called the overall mean, and τ_{gi} is a parameter unique to the i^{th} treatment for gene g called the i^{th} treatment effect. Consider the following multiple hypothesis tests
or equivalently, by using the effects models
The hypotheses may be tested using an ANOVA test or the Kruskal-Wallis test, depending on the normality assumption. If the normality assumption is valid, the ANOVA test is more powerful than the Kruskal-Wallis test; the latter may be more powerful when the normality assumption does not hold. The proposed methodology, hybrid-network, combines a test of assumptions and graphical network information into the analysis. For each gene g, an ANOVA p-value, \(P_{g}^{a},\) a Kruskal-Wallis p-value, \(P_{g}^{w},\) and a Shapiro-Wilk p-value, \(P_{g}^{s},\) are computed. We define a hybrid p-value, \(P_{g}^{h},\) as
for g=1,⋯,G, where α is a given threshold. The hybrid p-value \(P_{g}^{h}\) is transformed into a hybrid z-statistic, \(z_{g}^{h},\) as follows:
We use \(z_{g}^{h}\) to build a CAR model from the given network with the marginal distribution of \(z_{g}^{h}\) given by
where \(z_{g}^{h}\) is the expression value for gene g,g=1,⋯,G.
The prior probabilities π_{gs}, based on a gene network, are related to two latent Markov random fields x_{s}={x_{gs};g=1,⋯,G},s=0,1 by a logistic transformation:
The distribution of each spatial latent variable x_{gs} conditional on x_{−gs}={x_{ks};k≠g} depends only on its direct neighbors. The proposed CAR prior distribution from (Besag and Kooperberg 1999) is used as
where δ_{g} is the set of indices for the neighbors of gene g, and m_{g} is the corresponding number of neighbors.
The hybrid-network methodology is compared to other methods through a series of simulations. The setup of these simulations consists of three groups, with sample sizes of 5, 10, 25, and 50. The number of genes with the normal distribution N(μ,1), with μ=0 under the null hypothesis and μ=1 under the alternative, is 30. The number of genes with the lognormal distribution, lognormal(μ,1), with μ=0 in some cases and μ=1 in other cases, is 7, and the number of genes with the Cauchy distribution, Cauchy(θ,1), with θ=0 in some cases and θ=1 in other cases, is 7. A graphical network is built among genes with 212 edges. We present the simulation results in Table 2. They show that the hybrid-network procedure dominates the other procedures in most of the cases. When the sample size is 25, for instance, the specificities from the ANOVA test, the Kruskal-Wallis test, and the hybrid-network test are 0.89141, 0.918197, and 0.929054, respectively.
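A data-generating step for a setup of this kind can be sketched as below. Several details are assumptions made here rather than taken from the study: how the location parameters are assigned to the three groups under the alternative, the random seed, and the use of the inverse-CDF formula for Cauchy draws.

```python
import math
import random


def simulate_gene(rng, dist, shifts, n):
    """One gene: a list of groups, group i drawn from `dist` with location shifts[i]."""
    draw = {
        "normal": lambda m: rng.gauss(m, 1.0),
        "lognormal": lambda m: rng.lognormvariate(m, 1.0),
        # Cauchy(m, 1) via the inverse CDF: m + tan(pi * (U - 1/2)).
        "cauchy": lambda m: m + math.tan(math.pi * (rng.random() - 0.5)),
    }[dist]
    return [[draw(m) for _ in range(n)] for m in shifts]


rng = random.Random(2021)
design = [("normal", 30), ("lognormal", 7), ("cauchy", 7)]   # gene counts from the text

# Null genes: the same location parameter (0) in all three groups.
null_data = [simulate_gene(rng, dist, (0, 0, 0), n=25)
             for dist, count in design for _ in range(count)]

# One hypothetical alternative gene: the third group is shifted to location 1.
alt_gene = simulate_gene(rng, "normal", (0, 0, 1), n=25)
print(len(null_data), len(alt_gene), len(alt_gene[0]))
```

Each simulated gene would then be passed through the parametric, rank-based and normality tests before the hybrid p-value is formed.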
Regression analysis
In microarray regression analysis, a statistical model can be written as
where Y_{gj} is the gene expression level for the g^{th} gene in the j^{th} individual with
and some
The question is whether a response variable and a covariate are correlated. To test for correlation between gene expression and a covariate such as a phenotype, the analysis can be based on Pearson test p-values (P^{p}) and on Spearman test p-values (P^{sp}). We can use Shapiro-Wilk p-values (P^{s}) to test the normality assumption. Consider the regression analysis in matrix format
where
We denote the least squares estimators of β_{g} as b_{g}
Let the vector of the fitted values \(\hat {Y}_{{gi}}\) be denoted by \(\hat {\mathbf {Y}}_{g},\) and the vector of the residual terms \(e_{{gi}}=Y_{{gi}}-\hat {Y}_{{gi}}\) by e_{g}. The fitted values are represented by
and the residuals by
For each gene g, we compute its Pearson p-value, \(P_{g}^{p},\) and its Spearman p-value, \(P_{g}^{sp}.\) A Shapiro-Wilk test of normality is performed on the residuals from the Pearson test, giving a p-value \(P_{g}^{s}\) for each gene g. Finally, a hybrid p-value, \(P_{g}^{h},\) is computed as
where α is a given threshold.
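The selection step behind the hybrid p-value can be sketched as follows. The direction of the rule is an assumption consistent with the description of the method: the parametric p-value is kept when the normality test does not reject at level α, and the rank-based p-value is used otherwise. The numerical p-values are hypothetical.

```python
def hybrid_pvalue(p_parametric, p_rank, p_normality, alpha=0.05):
    """Pick the parametric p-value if normality is not rejected, else the rank-based one."""
    return p_parametric if p_normality >= alpha else p_rank


# Hypothetical (Pearson, Spearman, Shapiro-Wilk) p-values for three genes.
genes = {
    "gene1": (0.030, 0.200, 0.40),   # normality not rejected: keep the Pearson p-value
    "gene2": (0.030, 0.200, 0.01),   # normality rejected: use the Spearman p-value
    "gene3": (0.700, 0.650, 0.05),   # boundary case: p_normality equals alpha
}
hybrid = {g: hybrid_pvalue(*p) for g, p in genes.items()}
print(hybrid)
```

The same function applies to the group comparisons by passing the t-test or ANOVA p-value as the parametric argument and the Wilcoxon or Kruskal-Wallis p-value as the rank-based one.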
Each hybrid p-value, \(P_{g}^{h},\) is transformed into a hybrid z-statistic, \(z_{g}^{h},\) as follows:
Using \(z_{g}^{h},\) the marginal distribution of \(z_{g}^{h}\) is given as
where \(z_{g}^{h}\) is the expression value of gene, g,g=1,⋯,G.
We compare the hybrid-network method with the other procedures through a simulation setup with a sample size of 25. The number of genes with the normal distribution, N(μ,1), is 30, with μ=0 under the null hypothesis and μ=1 under the alternative, and the number of genes with the lognormal distribution, lognormal(μ,1), with μ=0 in some cases and μ=1 in other cases, is 14. We vary the cutoff point, τ, as in Wei and Pan (2008). A graphical network with 212 neighbors is built among the genes. The results of the analysis are presented in Fig. 3. To compare hybrid testing with other methods, we use AUCs to judge the performance of the proposed method. A greater AUC corresponds to a better methodology. The results show that the hybrid-network method performs better than the other competing procedures.
Application to human ependymoma microarray
We compare the hybridnetwork procedure with the ttest and the Wilcoxon test using human ependymoma data. The data consists of gene expression levels, gene annotation, sample annotation, and a gene graphical network. Figure 4 illustrates a graphical network of the genes under consideration, and Table 3 is a subset of the human ependymoma expression data. In this analysis, there are two groups, the sample sizes are n_{1}=37 for group1, n_{2}=42 for group2, with the total number of genes of 102, and the number of edges is 196. The data and the R codes can be requested from the corresponding author.
Using ShapiroWilk pvalues, it appears that some of the expression data are normally distributed and the others are not, with ShapiroWilk test pvalues less than α=5% for some genes. Figure 5 shows histograms of pvalues from the ttest, pvalues from the rank sum test, and pvalues based on the ShapiroWilk test of normality, respectively. The last graph of Fig. 5 presents the plot of the pvalues from the ttest with respect to the corresponding pvalues from the rank sum test. Using the ttest when the normality assumption is assumed, and the Wilcoxon test otherwise. We apply the hybridtesting procedure to analyze the data. We incorporate a graphical network to accommodate interactions between genes, as these have been noted to play a crucial role in cell functions (Shojaie and Michailidis 2009).
In order to compare the hybridnetwork procedure with the other procedures, we report results for the first six genes. We use box plots as visual methods of comparing groups. Under each Box plot, we report the results, \(\hat \pi _{\cdot 0},\) with t representing the ttest statistic, rs for Wilcoxon test statistic, and hybN for hybridnetwork statistic. We also present the pvalues from Shapiro Wilk test (Shp) under each box plot. The results are reported on Fig. 6.
With a cutoff point of τ=0.1 (τ is a classification threshold that plays a role similar to the level of significance α; see Wei and Pan (2008)), all three methods find that genes AKT1, ATF2, and CDC25B are not expressed. Only the hybridnetwork test finds that the other three genes, ARHGEF2, BDNF, and BRAF, are expressed. This finding is in accordance with the box plot results. To avoid bias in choosing which genes to analyze, we first sort the genes and then use the R head(·) function to select the first six results for comparing the different methodologies.
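The selection and classification steps can be sketched as follows. The sketch assumes that the reported quantity for each gene is an estimated probability of the null state, that a gene is called expressed when this estimate falls below τ, and that genes are sorted alphabetically by name; these conventions, the function name `first_genes_calls`, and the gene names and values are assumptions for illustration, not results from the ependymoma data.

```python
def first_genes_calls(null_prob, tau=0.1, k=6):
    """Sort genes by name, keep the first k, and classify each one.

    null_prob: dict mapping gene name -> estimated null probability
               (assumed meaning of the reported statistic).
    Returns a dict mapping gene name -> True if called expressed.
    """
    selected = sorted(null_prob)[:k]   # mimics sorting followed by head(., k)
    return {gene: null_prob[gene] < tau for gene in selected}


# Hypothetical gene names and values, for illustration only.
example = {"geneB": 0.04, "geneA": 0.50, "geneC": 0.10}
calls = first_genes_calls(example, tau=0.1, k=2)
```

Choosing the genes by a fixed rule such as sorted order, rather than by inspecting their results, is what keeps the reported subset from being cherry-picked.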
Discussion and conclusion
To the best of our knowledge, the hybridnetwork procedure is the first to incorporate both assumptions and a graphical network into the analysis of gene expression data. It has a broad variety of applications and entails several layers of complexity.
In simulations and in real data analysis, we show that the hybridnetwork procedures perform well. They can be applied to group comparison analysis and to regression analysis. In the near future, we plan to implement a hybridnetwork routine that will help researchers analyze gene expression data more appropriately. In future research, we also plan to apply this method to next generation sequencing data.
Availability of data and materials
Data and code are available upon request from the corresponding author.
Abbreviations
AUC: Area under the ROC curve
ROC: Receiver operating characteristic
GMRF: Gaussian Markov random field
References
Besag, J., Higdon, D.: Bayesian Analysis of Agricultural Field Experiments. J. R. Stat. Soc. Ser. B. 61, 691–746 (1999).
Besag, J., Kooperberg, C.: On conditional and intrinsic autoregressions. Biometrika. 82, 733–746 (1995).
Besag, J., York, J., Mollié, A.: Bayesian Image Restoration with Two Applications in Spatial Statistics. Ann. Inst. Stat. Math. 43, 1–59 (1991).
Bowman, D., George, E. O.: Saturated Model for Analyzing Exchangeable Binary Data: Applications to Clinical and Developmental Toxicity Studies. J. Am. Stat. Assoc. 90, 431 (1995).
Lee, D.: A Comparison of Conditional Autoregressive Models Used in Bayesian Disease Mapping. Spat. Spatiotemporal Epidemiol. 2, 79–89 (2011).
Lee, D.: CARBayes: An R package for Bayesian Spatial Model with Conditional Autoregressive Priors. J. Stat. Softw. 55, 13 (2013).
Leroux, B., Lei, X., Breslow, N.: Estimation of Disease Rates in Small Areas: A New Mixed Model for Spatial Dependence. In: Halloran, M. E., Berry, D. (eds.) Statistical Models in Epidemiology, the Environment, and Clinical Trials, pp. 135–178. Springer, New York (1999).
Pounds, S., Fofana, D.: Hybrid Multiple Testing (2012). http://www.bioconductor.org/packages/2.12/bioc/html/HybridMTest.html. [accessed 12.17.12].
Pounds, S., Rai, S. N.: Assumption adequacy averaging as a concept to develop more robust methods for differential gene expression analysis. Comput. Stat. Data Anal. 53, 1604–1612 (2009).
Shojaie, A., Michailidis, G.: Analysis of Gene Sets Based on the Underlying Regulatory Network. J. Comput. Biol. 16, 407–426 (2009).
Stern, H., Cressie, N.: Inference for extremes in disease mapping. In: Lawson, A., Biggeri, A., Boehning, D., Lesaffre, E., Viel, J. E., Bertollini, R. (eds.) Disease Mapping and Risk Assessment for Public Health, pp. 63–84. Wiley, Chichester (1999).
Wei, P., Pan, W.: Incorporating Gene Networks into Statistical Tests for Genomic Data via a Spatially Correlated Mixture Model. Bioinformatics. 24, 404–411 (2008).
Acknowledgments
A version of this work was presented at the Joint Statistical Meetings, JSM (American Statistical Association, Chicago, IL, 2016) and was accepted for the JSM Proceedings. The authors are highly indebted to participants at this conference for their valuable comments.
Funding
The International Conference on Statistical Distributions and Applications (ICOSDA).
Author information
Contributions
DF: literature search, model design, simulation, coding, data analysis, data interpretation, writing, critical revision. EOG: model design, writing, critical revision. DB: writing, critical revision, data acquisition. All authors have read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fofana, D., George, E.O. & Bowman, D. Combining assumptions and graphical network into gene expression data analysis. J Stat Distrib App 8, 9 (2021). https://doi.org/10.1186/s40488-021-00126-z