Multiclass analysis and prediction with network structured covariates
Journal of Statistical Distributions and Applications volume 6, Article number: 6 (2019)
Abstract
Technological advances in data acquisition are producing complex structured data sets. Recent developments in classification with multiclass responses make it possible to incorporate the dependence structure of predictors. The available methods, however, are hindered by restrictive requirements. They basically assume a common network structure for the predictors of all subjects, without taking into account the heterogeneity that may exist across classes. Furthermore, they mainly focus on the case where the predictors are normally distributed. In this paper, we propose classification methods which address these limitations. Our methods can handle possibly class-dependent network structures of variables and allow the predictors to follow a distribution in the exponential family, which includes normal distributions as a special case. Our methods are computationally easy to implement. Numerical studies are conducted to demonstrate the satisfactory performance of the proposed methods.
Introduction
In contemporary statistical inference and machine learning theory, classification and prediction are of great importance and many approaches have been proposed. Typical methods include the support vector machine (SVM), linear discriminant analysis (LDA), and K-nearest neighbors (KNN) (Hastie et al. 2008; James et al. 2017). These methods have widespread applications, and extensions accommodating complex settings have been proposed. For example, Lee and Lee (2003) studied multicategory support vector machines for classification of multiple types of cancer. Cristianini and Shawe-Taylor (2000) presented comprehensive discussions of SVM methods. Guo et al. (2007) discussed the LDA method and its application in microarray data analysis. Safo and Ahn (2016) considered multiclass analysis by performing generalized sparse linear discriminant analysis. Regarding multiclass classification problems, Bagirov et al. (2003) proposed a new algorithm for multiclass cancer data. Bicciato et al. (2003) presented disjoint models for multiclass cancer analysis using the principal component technique. Liu et al. (2005) proposed a genetic algorithm (GA)-based method to carry out multiclass cancer classification.
Recent developments in classification further incorporate the dependence structure of predictors. For example, Cetiner and Akgul (2014) developed a graphical-model-based method for multi-label classification. Zhu and Pan (2009) proposed the network-based support vector machine for binary classification of microarray samples. Zi et al. (2016) discussed identification of rheumatoid arthritis-related genes by using the network-based support vector machine. Cai et al. (2018) considered network linear discriminant analysis. Huttenhower et al. (2007) proposed the nearest neighbor network approach. In the Bayesian paradigm, various classification approaches that accommodate network structures have been explored, such as Bielza et al. (2011), Miguel Hernández-Lobato et al. (2011), Baladandayuthapani et al. (2014), and Peterson et al. (2015).
Although there have been methods handling network structures in classification, those methods basically assume a common network structure for the predictors of all subjects without taking into account possible heterogeneity across classes. To overcome this shortcoming, in this paper we propose classification methods that take possibly class-dependent network structures of predictors into account. Our methods utilize graphical model theory and allow the predictors to follow an exponential family distribution, instead of a restrictive normal distribution. Furthermore, we develop a prediction criterion for multiclass classification which accommodates pairwise dependence structures among the predictors. Our methods incorporate informative predictors with pairwise dependence structures into classification procedures, and they are computationally easy to implement.
The remainder of the paper is organized as follows. In “Data structure and framework” section, we introduce the data structure and review a convenient multiclass classification method for simple settings. In “Classification with predictor graphical structures accommodated” section, we describe the basics of graphical model theory and propose two methods for multiclass classification to accommodate network structures of predictors. In “Evaluation of the performance” section, we describe the criteria for evaluating the performance of the proposed methods, and briefly review several competing classification methods for comparisons. In “Numerical studies” section, we conduct simulation studies to assess the performance of the proposed methods, and apply the proposed methods to analyze a real dataset for illustration. A general discussion is presented in the last section.
Data structure and framework
In this section, we present the data structure with multiclass responses and introduce the basic notation.
Notation
Suppose the data of n subjects come from I classes, where I is an integer no smaller than 2 and the classes are free of order, i.e., they are nominal. Let n_{i} be the class size in class i with i=1,⋯,I, and hence \(n = \sum \limits _{i=1}^{I} n_{i}\). Define Y_{ik}=i for class i=1,⋯,I and subject k=1,⋯,n_{i}, and let \(Y = \left (Y_{11}, Y_{12}, \cdots, Y_{1n_{1}}, Y_{21}, \cdots, Y_{2n_{2}}, \cdots, Y_{I1}, \cdots, Y_{In_{I}} \right)^{\top }\) denote the n-dimensional random vector of responses. Let Y_{·j} denote the jth component of Y. In other words, if we ignore the class information, then Y_{·j} represents the response (or the class membership) for the jth subject in the sample, where j=1,⋯,n.
For i=1,⋯,I, let \(X_{li} = \left (X_{li1},\cdots,X_{li{n_{i}}} \right)^{\top }\) denote the lth predictor (or covariate) vector associated with class i, where l=1,⋯,p for a positive integer p. We write \(X_{l} = \left (X_{l1}^{\top },\cdots, X_{lI}^{\top } \right)^{\top }\) for l=1,⋯,p, and let X=(X_{1},⋯,X_{p}) denote the n×p matrix of predictors. Let X_{·j}=(X_{·j1},⋯,X_{·jp})^{⊤} denote the jth row of X, which represents the p-dimensional predictor vector for the jth subject. Without loss of generality, the {X_{·j},Y_{·j}} are treated as independent and identically distributed (i.i.d.) for j=1,⋯,n. We let lower case letters represent realized values of the corresponding random variables. For example, x_{·j} stands for a realized value of X_{·j}. The data structure is shown in Table 1.
The objective here is to use the observed data to build models in order to predict the class label for a new subject using his/her observed predictor measurement.
Logistic regression model for multiclass response
With the multiclass response, we may consider the use of the logistic regression model by adapting the discussion of Agresti (2012, Section 7.1). For i=1,⋯,I and j=1,⋯,n, let π_{ij}(x_{·j})=P(Y_{·j}=i | X_{·j}=x_{·j}) denote the conditional probability that subject j is selected from class i, given the predictor information X_{·j}=x_{·j}.
Because of the constraint \(\sum \limits _{i=1}^{I} \pi _{ij}(x_{\cdot j}) = 1\) for every j=1,⋯,n, we need to model only (I−1) of the π_{ij}(x_{·j}) rather than all of them. Without loss of generality, we take the Ith conditional probability π_{Ij}(x_{·j}) as the reference and then consider the logistic model
$$\log \left\{ \frac{\pi_{ij}(x_{\cdot j})}{\pi_{Ij}(x_{\cdot j})} \right\} = \gamma_{0i} + \gamma_{i}^{\top} x_{\cdot j} \qquad (1)$$
for i=1,⋯,I−1 and j=1,⋯,n, where \(\gamma = \left (\gamma _{01}, \gamma _{1}^{\top }, \gamma _{02}, \gamma _{2}^{\top },\cdots,\gamma _{0,I-1}, \gamma _{I-1}^{\top } \right)^{\top }\) is the vector of parameters with the intercepts γ_{0i} and the p-dimensional vectors γ_{i} of parameters.
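Equivalently, (1) shows that for i=1,⋯,I−1 and j=1,⋯,n,
$$\pi_{ij}(x_{\cdot j}) = \frac{\exp \left( \gamma_{0i} + \gamma_{i}^{\top} x_{\cdot j} \right)}{1 + \sum \limits_{k=1}^{I-1} \exp \left( \gamma_{0k} + \gamma_{k}^{\top} x_{\cdot j} \right)} \qquad (2)$$
and
$$\pi_{Ij}(x_{\cdot j}) = \frac{1}{1 + \sum \limits_{k=1}^{I-1} \exp \left( \gamma_{0k} + \gamma_{k}^{\top} x_{\cdot j} \right)}. \qquad (3)$$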
Since the distribution of the Y_{ij} can be delineated by a multinomial distribution, the likelihood function for the observed data is given by
$$L(\gamma) = \prod \limits_{j=1}^{n} \prod \limits_{i=1}^{I} \left\{ \pi_{ij}(x_{\cdot j}) \right\}^{\mathbb{I}(y_{\cdot j} = i)}, \qquad (4)$$
where \(\mathbb {I}(\cdot)\) is the indicator function and π_{ij}(x_{·j}) is determined by (2) or (3). Estimation of γ can proceed by maximizing (4). Let \(\widehat {\gamma } = \left (\widehat {\gamma }_{01}, \widehat {\gamma }_{1}^{\top }, \widehat {\gamma }_{02}, \widehat {\gamma }_{2}^{\top },\cdots, \widehat {\gamma }_{0,I-1}, \widehat {\gamma }_{I-1}^{\top } \right)^{\top } \) denote the resulting maximum likelihood estimate of γ.
To predict the class label for a new subject with a p-dimensional predictor vector \(\widetilde {x}\), we first calculate the right-hand sides of (2) and (3) with the \(\left (\gamma _{0i}, \gamma _{i}^{\top } \right)^{\top }\) replaced by the corresponding estimates obtained from the training data, and let \(\widehat {\pi }_{1},\cdots,\widehat {\pi }_{I}\) denote the corresponding values. Let i^{∗} denote the index which corresponds to the largest value of \(\left \{ \widehat {\pi }_{1},\cdots,\widehat {\pi }_{I} \right \}\). Then the class label for this new subject is predicted as i^{∗}.
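The prediction rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `baseline_logit_probs` and `predict_class` and the numerical values of the intercepts and slopes are hypothetical, standing in for estimates that would come from maximizing the likelihood on training data. The last class plays the role of the reference class I.

```python
import math

def baseline_logit_probs(x, intercepts, slopes):
    """Class probabilities with the last class as the reference.

    intercepts[i] and slopes[i] hold the (hypothetical) fitted intercept and
    slope vector for non-reference class i+1; the reference class has none.
    """
    # Linear predictors for the I-1 non-reference classes.
    etas = [b0 + sum(b * v for b, v in zip(b1, x))
            for b0, b1 in zip(intercepts, slopes)]
    denom = 1.0 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]
    probs.append(1.0 / denom)  # probability of the reference class I
    return probs

def predict_class(x, intercepts, slopes):
    """Return the 1-based label of the class with the largest probability."""
    probs = baseline_logit_probs(x, intercepts, slopes)
    return 1 + max(range(len(probs)), key=probs.__getitem__)

# Hypothetical estimates for I = 3 classes and p = 2 predictors.
intercepts = [0.5, -0.2]
slopes = [[1.0, -1.0], [0.3, 0.8]]
x_new = [0.4, 1.2]
probs = baseline_logit_probs(x_new, intercepts, slopes)
label = predict_class(x_new, intercepts, slopes)
```

By construction the I probabilities sum to one, so taking the largest one is the same as comparing the I−1 linear predictors against zero and against each other.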
Classification with predictor graphical structures accommodated
In this section, we propose two classification methods for prediction which incorporate the network structure of the predictors. We first describe the use of graphical models to characterize the association structure of the predictors, and then explore two methods of building prediction models using the identified association structures.
Predictor network structure
Graphical models are useful for characterizing the network structures of the predictors. Here we describe how graphical models can delineate possible association structures of the predictors. For j=1,⋯,n, we use an undirected graph, denoted as G_{j}=(V_{j},E_{j}), to describe the relationship among the components of X_{·j}=(X_{·j1},⋯,X_{·jp})^{⊤}, where V_{j}={1,⋯,p} includes all the indices of predictors and V_{j}×V_{j} contains all pairs with unequal coordinates. A covariate X_{·jr} is called a vertex of the graph G_{j} if r∈V_{j}; a pair of predictors {X_{·jr},X_{·js}} is called an edge of the graph G_{j} if (r,s)∈E_{j}⊂V_{j}×V_{j}. In the setting we consider, the sets V_{j} and E_{j} are common for j=1,⋯,n, so we let V and E denote the vertex set and the edge set of the graph, respectively.
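To characterize the distribution of the predictor X_{·j}, we consider the graphical model with the exponential family distribution,
$$f(x_{\cdot j}; \beta, \Theta) = \exp \left\{ \sum \limits_{r \in V} \beta_{r} B(x_{\cdot jr}) + \sum \limits_{(s,t) \in E} \theta_{st} B(x_{\cdot js}) B(x_{\cdot jt}) + \sum \limits_{r \in V} C(x_{\cdot jr}) - A(\beta, \Theta) \right\}, \qquad (5)$$
where β=(β_{1},⋯,β_{p})^{⊤} is a p-dimensional vector of parameters, Θ=[θ_{st}] is a p×p symmetric matrix with zero diagonal elements, and B(·) and C(·) are given functions. The function A(β,Θ) is the normalizing constant which makes (5) integrate to 1; this function is also called the log-partition function, given by
$$A(\beta, \Theta) = \log \int \exp \left\{ \sum \limits_{r \in V} \beta_{r} B(x_{\cdot jr}) + \sum \limits_{(s,t) \in E} \theta_{st} B(x_{\cdot js}) B(x_{\cdot jt}) + \sum \limits_{r \in V} C(x_{\cdot jr}) \right\} dx_{\cdot j}.$$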
Formulation (5) gives a broad class of models which covers most commonly used distributions. For example, if \(B(x) = \frac {x}{\sigma }\) and \(C(x) = -\frac {x^{2}}{2 \sigma ^{2}}\) where σ is a positive constant, then (5) yields the well-known Gaussian graphical model (Friedman et al. 2008; Hastie et al. 2015; Lee and Hastie 2015). If B(x)=x and C(x)=0 with x∈{0,1}, then with the β_{r} set to be zero, (5) reduces to
$$f(x_{\cdot j}; \Theta) = \exp \left\{ \sum \limits_{(s,t) \in E} \theta_{st} x_{\cdot js} x_{\cdot jt} - A(\Theta) \right\}, \qquad (6)$$
which is the Ising model without the singletons (Ravikumar et al. 2010).
To focus on featuring the pairwise association among the components of X_{·j}, similar to the structure of (6), we consider the following graphical model
$$f(x_{\cdot j}; \Theta) = \exp \left\{ \sum \limits_{(s,t) \in E} \theta_{st} x_{\cdot js} x_{\cdot jt} + \sum \limits_{s \in V} C(x_{\cdot js}) - A(\Theta) \right\}, \qquad (7)$$
where the function A(Θ) is the normalizing constant, and the θ_{st} and C(·) are defined as for (5). Model (7) is a special case of (5) which constrains the main effects parameters β_{r} in (5) to be zero; a nonzero parameter θ_{st} implies that X_{·js} and X_{·jt} are conditionally dependent given the other predictors.
To estimate Θ, one may apply the likelihood method using the distribution (7) directly. Alternatively, a simpler estimation method can be carried out based on a conditional distribution derived from (7) (Meinshausen and Bühlmann 2006; Hastie et al. 2015, p.254). For every s∈V, let X_{·j,V∖{s}} denote the (p−1)-dimensional subvector of X_{·j} with its sth component deleted, i.e., X_{·j,V∖{s}}=(X_{·j1},⋯,X_{·j,s−1},X_{·j,s+1},⋯,X_{·jp})^{⊤}. By some algebra, we have
$$f \left( x_{\cdot js} \mid x_{\cdot j, V \setminus \{s\}}; \theta_{s} \right) = \exp \left\{ x_{\cdot js} \sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} + C(x_{\cdot js}) - D \left( \sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} \right) \right\}, \qquad (8)$$
where D(·) is the normalizing constant ensuring that (8) integrates to one, and θ_{s}=(θ_{s1},⋯,θ_{s,s−1},θ_{s,s+1},⋯,θ_{sp})^{⊤} is a (p−1)-dimensional vector of parameters indicating the relationship of X_{·js} with all other predictors X_{·jr} for r∈{1,⋯,p}∖{s} associated with (8).
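Let ℓ(θ_{s}) be the log-likelihood for θ_{s} multiplied by \( \frac {1}{n}\) with the constant omitted, i.e.,
$$\ell(\theta_{s}) = \frac{1}{n} \sum \limits_{j=1}^{n} \left\{ x_{\cdot js} \sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} - D \left( \sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} \right) \right\}.$$
Then an estimator of θ_{s} can be obtained as
$$\widehat{\theta}_{s}(\lambda) = \underset{\theta_{s}}{\text{argmin}} \left\{ -\ell(\theta_{s}) + \lambda \left\| \theta_{s} \right\|_{1} \right\}, \qquad (9)$$
where λ is a tuning parameter and ∥·∥_{1} is the L_{1}-norm. In principle, the L_{1}-norm in (9) may be replaced by other penalty functions such as the weighted L_{1}-norm (Zou 2006) and the nonconcave function (Fan and Li 2001). Here we focus on using the L_{1}-norm, the well-known LASSO penalty (Tibshirani 1996), to determine informative pairwise dependent predictors. The LASSO penalty is frequently considered when dealing with graphical models, and it has been implemented in R. For instance, the R packages huge and XMRF use the LASSO penalty to determine the network structure.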
We comment that the estimator obtained from (9) depends on the choice of the tuning parameter λ. There is no unique way of selecting a suitable tuning parameter, and methods such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), cross-validation (CV), and generalized cross-validation (GCV) may be considered. As suggested by Wang et al. (2007), BIC tends to outperform the others in many situations, especially in settings with a penalized likelihood function. Consequently, here we employ the BIC approach to select the tuning parameter λ.
Define
$$\text{BIC}(\lambda) = -2 n \, \ell \left\{ \widehat{\theta}_{s}(\lambda) \right\} + \log (n) \times \text{df} \left\{ \widehat{\theta}_{s}(\lambda) \right\}, \qquad (10)$$
where \(\text {df} \left \{ \widehat {\theta }_{s}(\lambda) \right \}\) represents the number of nonzero elements in \(\widehat {\theta }_{s}(\lambda)\) for a given λ. The optimal tuning parameter λ, denoted by \(\widehat {\lambda }\), is determined by minimizing (10) within a suitable range of λ. As a result, the estimator of θ_{s} is determined by \(\widehat {\theta }_{s} = \widehat {\theta }_{s}\left (\widehat {\lambda }\right)\).
The preceding procedure is repeated for all s∈V and yields the estimator \(\widehat {\theta }_{s}\) for all s∈V. One point requires attention. For (s,t)∈E, the estimates \(\widehat {\theta }_{st}\) and \(\widehat {\theta }_{ts}\) are not necessarily identical, although θ_{st} and θ_{ts} are constrained to be equal. To overcome this problem, we apply the AND rule (Meinshausen and Bühlmann 2006; Hastie et al. 2015, p.255): the final estimates of \(\widehat {\theta }_{st}\) and \(\widehat {\theta }_{ts}\) are both set to their maximum if neither \(\widehat {\theta }_{st}\) nor \(\widehat {\theta }_{ts}\) is zero, and both are set to zero if either of them is zero.
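The symmetrization step can be illustrated with a small Python sketch. It takes a p×p table of neighbourhood-wise estimates, where row s holds the estimate of θ_{s}, applies the AND rule as described in the text (maximum when both entries are nonzero, zero otherwise), and lists the surviving edges. The function names and the numerical values are hypothetical, and vertices are indexed from 0 purely for coding convenience.

```python
def and_rule(theta):
    """Symmetrize neighbourhood-wise estimates theta[s][t] with the AND rule."""
    p = len(theta)
    sym = [[0.0] * p for _ in range(p)]
    for s in range(p):
        for t in range(s + 1, p):
            a, b = theta[s][t], theta[t][s]
            # Keep the pair only if both regressions selected it.
            if a != 0.0 and b != 0.0:
                sym[s][t] = sym[t][s] = max(a, b)
    return sym

def edge_set(sym):
    """Edges (s, t) with s < t whose symmetrized estimate is nonzero."""
    p = len(sym)
    return {(s, t) for s in range(p) for t in range(s + 1, p)
            if sym[s][t] != 0.0}

# Hypothetical estimates for p = 3 predictors (0-based vertex labels).
theta_hat = [
    [0.0, 0.4, 0.0],
    [0.6, 0.0, 0.2],
    [0.3, 0.0, 0.0],
]
theta_sym = and_rule(theta_hat)
edges = edge_set(theta_sym)
```

In this toy input only the pair (0, 1) is selected by both neighbourhood regressions, so it is the only edge retained.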
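To determine an estimated set of edges, we define
$$\widehat{N}(s) = \left\{ t \in V \setminus \{s\} : \widehat{\theta}_{st} \neq 0 \right\}$$
for s∈V. Then
$$\widehat{E} = \left\{ (s,t) : s \in V, \ t \in \widehat{N}(s) \right\}$$
is taken as the set of the edges that are estimated to exist. The R package ‘huge’ can be used to display the resulting graph.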
Under mild regularity conditions, the estimated set of edges \(\widehat {E}\) approximates the true network structure E accurately, as stated in the following result, which is available in Ravikumar et al. (2010, Section 2.2) and Theorem 5 (b) of Yang et al. (2015).
Proposition 1
(Network Recovery) Suppose E is the set of edges, and let \(\widehat {E}\) be the estimated set of edges. Under the regularity conditions in Meinshausen and Bühlmann (2006), we have that as n→∞,
$$P \left( \widehat{E} = E \right) \rightarrow 1.$$
Logistic regression with homogeneous graphically structured predictors
To incorporate the network structures of the predictors into building a prediction model, in the next two subsections, we present two methods which can be readily implemented using the R package huge and the R function glm for fitting a logistic regression model.
In the first method, called the logistic regression with homogeneous graphically structured predictors (LRHomoGraph) method, we consider the case where the subjects in different classes share a common network structure in the predictors. To build a prediction model, we make use of the development of the logistic model with multiclass responses, discussed by Agresti (2007, Section 6.1) and Agresti (2012, Section 7.1).
We first identify the pairwise dependence of the predictors using the measurements of all the subjects without distinguishing their class labels. Let \(\widehat {\theta }_{st}\) be the estimate for θ_{st} obtained for (9) by using all the predictor measurements of {X_{·j}:j=1,⋯,n}, and let \(\widehat {E} = \left \{ (s,t) : \widehat {\theta }_{st} \neq 0 \right \}\) denote the resulting estimated set of edges.
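Next, for i=1,⋯,I and j=1,⋯,n, we let
$$p_{ij}(x_{\cdot j}) = P \left( Y_{\cdot j} = i \mid X_{\cdot j} = x_{\cdot j} \right)$$
be the conditional probability of Y_{·j}=i given X_{·j}=x_{·j}. Consider the logistic regression model
$$\log \left\{ \frac{p_{ij}(x_{\cdot j})}{p_{Ij}(x_{\cdot j})} \right\} = \alpha_{i0} + \sum \limits_{(s,t) \in \widehat{E}} \alpha_{i,st} \, x_{\cdot js} x_{\cdot jt}$$
for i=1,2,⋯,I−1, where (α_{i0},α_{i,st})^{⊤} is the vector of parameters associated with class i and the constraint \(\sum \limits _{i=1}^{I} p_{ij}(x_{\cdot j}) = 1\) is imposed for every j=1,⋯,n.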
For subject j=1,⋯,n, we let \(Y_{ij}^{\ast } = 1\) if subject j is in class i and \(Y_{ij}^{\ast } = 0\) otherwise, and hence, \(\sum \limits _{i=1}^{I} Y_{ij}^{\ast } = 1\) for every j. Let \(y_{ij}^{\ast }\) denote a realized value of \(Y_{ij}^{\ast }\). For i=1,⋯,I and j=1,⋯,n, the likelihood function is given by (Agresti 2012, p.273)
$$L(\alpha) = \prod \limits_{j=1}^{n} \prod \limits_{i=1}^{I} \left\{ p_{ij}(x_{\cdot j}) \right\}^{y_{ij}^{\ast}}, \qquad (13)$$
where \(\alpha = \left (\alpha _{10}, \alpha _{1\cdot }^{\top },\cdots, \alpha _{(I-1)0}, \alpha _{(I-1)\cdot }^{\top } \right)^{\top }\) is the vector of parameters with vector \(\alpha _{i\cdot } = \left (\alpha _{i,st} : (s,t) \in \widehat {E} \right)^{\top }\) for i=1,⋯,I−1.
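The estimator \(\widehat {\alpha }\) can be derived by maximizing (13) with respect to α. Therefore, for the realization x_{·j} of the p-dimensional vector X_{·j}, p_{ij}(x_{·j}) with i=1,⋯,I−1 is estimated as
$$\widehat{p}_{ij}(x_{\cdot j}) = \frac{\exp \left( \widehat{\alpha}_{i0} + \sum \limits_{(s,t) \in \widehat{E}} \widehat{\alpha}_{i,st} \, x_{\cdot js} x_{\cdot jt} \right)}{1 + \sum \limits_{k=1}^{I-1} \exp \left( \widehat{\alpha}_{k0} + \sum \limits_{(s,t) \in \widehat{E}} \widehat{\alpha}_{k,st} \, x_{\cdot js} x_{\cdot jt} \right)}, \qquad (14)$$
and p_{Ij}(x_{·j}) is estimated as
$$\widehat{p}_{Ij}(x_{\cdot j}) = \frac{1}{1 + \sum \limits_{k=1}^{I-1} \exp \left( \widehat{\alpha}_{k0} + \sum \limits_{(s,t) \in \widehat{E}} \widehat{\alpha}_{k,st} \, x_{\cdot js} x_{\cdot jt} \right)}. \qquad (15)$$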
Finally, to predict the class label for a new subject with a p-dimensional predictor \(\widetilde {x}\), we first calculate the right-hand sides of (14) and (15) at \(\widetilde {x}\), and let \(\widetilde {\widehat {p}}_{1},\cdots,\widetilde {\widehat {p}}_{I}\) denote the corresponding values. Let i^{∗} denote the index which corresponds to the largest value of \(\left \{ \widetilde {\widehat {p}}_{1},\cdots,\widetilde {\widehat {p}}_{I} \right \}\), i.e., \(i^{\ast } = \underset {1 \leq i \leq I}{\text {argmax}} \widetilde {\widehat {p}}_{i}\). Then the class label for this new subject is predicted as i^{∗}.
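A minimal Python sketch of this prediction step follows. It assumes, as a reading of the method rather than a statement of the authors' code, that the linear predictor of each non-reference class is an intercept plus a weighted sum of the products x_{s}x_{t} over the estimated edges. The names `edge_features` and `homograph_probs`, the edge list, and all coefficient values are hypothetical.

```python
import math

def edge_features(x, edges):
    """Products x_s * x_t for each estimated edge (s, t), 0-based indices."""
    return [x[s] * x[t] for s, t in edges]

def homograph_probs(x, edges, alpha0, alpha):
    """Class probabilities from a shared edge set, last class as reference."""
    z = edge_features(x, edges)
    etas = [a0 + sum(a * v for a, v in zip(a_i, z))
            for a0, a_i in zip(alpha0, alpha)]
    denom = 1.0 + sum(math.exp(e) for e in etas)
    return [math.exp(e) / denom for e in etas] + [1.0 / denom]

# Hypothetical common edge set and coefficients for I = 3 classes, p = 3.
edges = [(0, 1), (1, 2)]
alpha0 = [0.1, -0.4]                 # intercepts for classes 1 and 2
alpha = [[0.5, -1.0], [1.5, 0.2]]    # one coefficient per edge and class
x_new = [1.0, 2.0, -1.0]

probs = homograph_probs(x_new, edges, alpha0, alpha)
predicted = 1 + max(range(len(probs)), key=probs.__getitem__)
```

Only predictors that appear in at least one estimated edge enter the model, which is how the graph acts as a feature selector here.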
Logistic regression with class-dependent graphically structured predictors
We now present an alternative to the method described in “Logistic regression with homogeneous graphically structured predictors” section. Instead of pooling all the covariates to feature the covariate network structure, this method, called the logistic regression with class-dependent graphically structured covariates (LRClassGraph) method, stratifies the covariate information by class when characterizing the covariate network structures.
We first introduce a binary surrogate response variable \(Y_{ij}^{i}\) for every i and j, where i=1,⋯,I and j=1,⋯,n. Let
$$Y_{ij}^{i} = \left\{ \begin{array}{ll} 1, & \text{if subject } j \text{ is in class } i, \\ 0, & \text{otherwise}, \end{array} \right.$$
and define \(Y^{i} = \left (0,\cdots,0,Y_{i1}^{i},\cdots,Y_{in_{i}}^{i},0,\cdots,0 \right)^{\top }\) to be an n-dimensional vector whose elements corresponding to class i are respectively \(Y_{i1}^{i},\cdots,Y_{in_{i}}^{i}\), and other elements are zero. That is, \(Y^{i} = (\underbrace {0,\cdots,0}_{n_{1} + \cdots + n_{i-1}}, \underbrace {1,\cdots,1}_{n_{i}}, \underbrace {0,\cdots,0}_{n_{i+1}+ \cdots + n_{I}})^{\top }\) with i=1,⋯,I. Now we implement the following steps.

Step 1: (Class-Dependent Predictor Network) For each class i=1,⋯,I, we apply the procedure described in “Predictor network structure” section to determine the network structure of predictors in class i. Let \(\widehat {E}^{i} = \left \{ (s,t) : \widehat {\theta }_{st}^{i} \neq 0 \right \}\) denote an estimated set of edges for class i, where \(\widehat {\theta }_{st}^{i}\) is the estimate of θ_{st} derived from (9) using the predictor measurements in class i.

Step 2: (Class-Dependent Model Building) For each class i=1,⋯,I, fit a logistic regression model using the surrogate response vector Y^{i} with the estimated covariate network structure \(\widehat {E}^{i}\) incorporated. Specifically, for the jth component \(Y^{i}_{j}\) of \(Y^{i}\), define \({\pi }_{j}^{i}(x_{\cdot j}) = P\left (Y_{j}^{i} = 1 \mid X_{\cdot j} = x_{\cdot j} \right)\) and consider the logistic regression model
$$\log \left\{ \frac{\pi_{j}^{i}(x_{\cdot j})}{1 - \pi_{j}^{i}(x_{\cdot j})} \right\} = \gamma_{0}^{i} + \sum \limits_{(s,t) \in \widehat{E}^{i}} \gamma_{st}^{i} \, x_{\cdot js} x_{\cdot jt}, \qquad (16)$$
where j=1,⋯,n and \(\left (\gamma _{0}^{i}, \gamma _{st}^{i}\right)^{\top }\) is the vector of parameters associated with class i. By the theory of maximum likelihood (e.g., Agresti 2012), we obtain the estimate \(\left (\widehat {\gamma }_{0}^{i}, \widehat {\gamma }_{st}^{i} \right)^{\top }\) of \(\left (\gamma _{0}^{i}, \gamma _{st}^{i} \right)^{\top }\).

Step 3: (Prediction) For a realization x_{·j} of the p-dimensional vector X_{·j}, based on (16), \({\pi }_{j}^{i}(x_{\cdot j})\) can be estimated by
$$\widehat{\pi}_{j}^{i}(x_{\cdot j}) = \frac{\exp \left( \widehat{\gamma}_{0}^{i} + \sum \limits_{(s,t) \in \widehat{E}^{i}} \widehat{\gamma}_{st}^{i} \, x_{\cdot js} x_{\cdot jt} \right)}{1 + \exp \left( \widehat{\gamma}_{0}^{i} + \sum \limits_{(s,t) \in \widehat{E}^{i}} \widehat{\gamma}_{st}^{i} \, x_{\cdot js} x_{\cdot jt} \right)}. \qquad (17)$$
To predict the class label for a new subject with a p-dimensional covariate vector \(\widetilde {x}\), we first calculate (17) with x_{·j} replaced by \(\widetilde {x}\) for i=1,⋯,I, and let \(\widetilde {\widehat {\pi }^{1}},\cdots,\widetilde {\widehat {\pi }^{I}}\) denote the corresponding values. Let i^{∗} denote the index which corresponds to the largest value of \(\left \{ \widetilde {\widehat {\pi }^{1}},\cdots,\widetilde {\widehat {\pi }^{I}} \right \}\), i.e.,
$$i^{\ast} = \underset{1 \leq i \leq I}{\text{argmax}} \ \widetilde{\widehat{\pi}^{i}}.$$
Then the class label for this new subject is predicted as i^{∗}.
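The class-by-class scoring can be sketched as follows in Python. The sketch assumes that each class has its own estimated edge set and its own binary logistic fit whose linear predictor is an intercept plus a weighted sum of products x_{s}x_{t} over that class's edges; the function name `class_score`, the `models` list, and every numerical value are hypothetical.

```python
import math

def class_score(x, edges, gamma0, gamma):
    """Estimated probability that a subject belongs to one given class."""
    eta = gamma0 + sum(g * x[s] * x[t] for g, (s, t) in zip(gamma, edges))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical class-specific edge sets and coefficients (0-based indices).
models = [
    {"edges": [(0, 1)], "gamma0": -1.0, "gamma": [2.0]},
    {"edges": [(0, 2), (1, 2)], "gamma0": 0.2, "gamma": [-0.5, 1.0]},
    {"edges": [], "gamma0": -0.3, "gamma": []},
]
x_new = [1.0, 1.5, -0.5]

scores = [class_score(x_new, m["edges"], m["gamma0"], m["gamma"])
          for m in models]
predicted = 1 + max(range(len(scores)), key=scores.__getitem__)
```

Unlike the probabilities from a single multinomial fit, these I scores come from separate binary fits and need not sum to one; only their ordering is used for prediction.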
Comparison of decision boundaries
As noted in “Logistic regression with homogeneous graphically structured predictors” and “Logistic regression with class-dependent graphically structured predictors” sections, while both the LRHomoGraph and LRClassGraph methods employ logistic regression for classification, they differ in the way they feature predictor structures. We may further compare them in terms of decision boundaries.
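First, we examine the decision boundaries for the LRHomoGraph method. For i≠k, the boundary between the ith and kth classes is determined by
$$\widehat{p}_{ij}(x_{\cdot j}) = \widehat{p}_{kj}(x_{\cdot j})$$
for a new instance with the predictor value x_{·j}, where \(\widehat {p}_{ij}(x_{\cdot j})\) and \(\widehat {p}_{kj}(x_{\cdot j})\) are given by (14) or (15). To be more specific, for any i=1,...,I−1, if k=1,...,I−1 and k≠i, then by (14), the boundary between the ith and kth classes is
$$\left( \widehat{\alpha}_{i0} - \widehat{\alpha}_{k0} \right) + \sum \limits_{(s,t) \in \widehat{E}} \left( \widehat{\alpha}_{i,st} - \widehat{\alpha}_{k,st} \right) x_{\cdot js} x_{\cdot jt} = 0, \qquad (19)$$
and the boundary between the ith and Ith classes is, by (15),
$$\widehat{\alpha}_{i0} + \sum \limits_{(s,t) \in \widehat{E}} \widehat{\alpha}_{i,st} \, x_{\cdot js} x_{\cdot jt} = 0. \qquad (20)$$
Similarly, the decision boundaries for the LRClassGraph method can be determined based on (17). For i≠k, equating \(\widehat {\pi }_{j}^{i}(x_{\cdot j})\) and \( \widehat {\pi }_{j}^{k}(x_{\cdot j})\) for a covariate value x_{·j} gives the boundary between the ith and kth classes
$$\widehat{\gamma}_{0}^{i} + \sum \limits_{(s,t) \in \widehat{E}^{i}} \widehat{\gamma}_{st}^{i} \, x_{\cdot js} x_{\cdot jt} = \widehat{\gamma}_{0}^{k} + \sum \limits_{(s,t) \in \widehat{E}^{k}} \widehat{\gamma}_{st}^{k} \, x_{\cdot js} x_{\cdot jt}. \qquad (21)$$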
Comparing (21) to (19) or (20) shows that decision boundaries for both the LRHomoGraph and LRClassGraph methods are all quadratic surfaces determined by the features selected from the graphical models. However, the way of incorporating the features is different for the two methods. The boundaries (21) are determined by the quadratic terms identified using instances from classes i and k separately, but the quadratic terms in the boundary (19) or (20) are not distinguished by the class labels. In addition, the coefficients \(\widehat {\gamma }_{st}^{i}\) and \(\widehat {\alpha }_{i,st}\) associated with the decision boundaries are generally different.
Evaluation of the performance
In this section we discuss the evaluation of the procedures proposed in “Logistic regression with homogeneous graphically structured predictors” and “Logistic regression with class-dependent graphically structured predictors” sections. For comparisons, we also examine some conventional classification methods in machine learning, including support vector machine (SVM), linear discriminant analysis (LDA), K-nearest neighbor (KNN), and extreme gradient boosting (XGBOOST). We first describe commonly used measures for assessing the prediction error, and then we briefly review the four classification methods.
Criteria for performances
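In this subsection, we describe several criteria for evaluating the performance of prediction. To show the overall performance of prediction, we consider either micro-averaged metrics or macro-averaged metrics (Parambath et al. 2018). For subject j=1,⋯,n, let \(\widehat {y}_{\cdot j}\) denote the predicted class label. For class i=1,⋯,I, we calculate the number of the true positives, the number of the false positives, and the number of the false negatives, respectively, given by
$$\text{TP}_{i} = \sum \limits_{j=1}^{n} \mathbb{I} \left( y_{\cdot j} = i, \widehat{y}_{\cdot j} = i \right), \quad \text{FP}_{i} = \sum \limits_{j=1}^{n} \mathbb{I} \left( y_{\cdot j} \neq i, \widehat{y}_{\cdot j} = i \right),$$
and
$$\text{FN}_{i} = \sum \limits_{j=1}^{n} \mathbb{I} \left( y_{\cdot j} = i, \widehat{y}_{\cdot j} \neq i \right),$$
where \(\mathbb {I}(\cdot)\) is the indicator function. For micro-averaged metrics, we define precision and recall, respectively, given by
$$\text{PRE}_{\text{micro}} = \frac{\sum \limits_{i=1}^{I} \text{TP}_{i}}{\sum \limits_{i=1}^{I} \left( \text{TP}_{i} + \text{FP}_{i} \right)} \quad \text{and} \quad \text{REC}_{\text{micro}} = \frac{\sum \limits_{i=1}^{I} \text{TP}_{i}}{\sum \limits_{i=1}^{I} \left( \text{TP}_{i} + \text{FN}_{i} \right)}.$$
Then the Micro-F-score is defined as
$$F_{\text{micro}} = \frac{2 \times \text{PRE}_{\text{micro}} \times \text{REC}_{\text{micro}}}{\text{PRE}_{\text{micro}} + \text{REC}_{\text{micro}}}.$$
On the other hand, for macro-averaged metrics, for i=1,⋯,I, let \(PRE_{i} = \frac {\text {TP}_{i}}{\text {TP}_{i} + \text {FP}_{i}}\) denote precision for class i, and let \(REC_{i} = \frac {\text {TP}_{i}}{\text {TP}_{i} + \text {FN}_{i}}\) denote recall for class i. Then the overall precision and recall are, respectively, defined as
$$\text{PRE}_{\text{macro}} = \frac{1}{I} \sum \limits_{i=1}^{I} PRE_{i} \quad \text{and} \quad \text{REC}_{\text{macro}} = \frac{1}{I} \sum \limits_{i=1}^{I} REC_{i},$$
and the Macro-F-score is defined as
$$F_{\text{macro}} = \frac{2 \times \text{PRE}_{\text{macro}} \times \text{REC}_{\text{macro}}}{\text{PRE}_{\text{macro}} + \text{REC}_{\text{macro}}}.$$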
In principle, higher values of PRE, REC and F based on both micro and macro reflect better performance of methods (Parambath et al. 2018; Sokolova et al. 2006).
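As an illustration of these criteria, the following Python sketch computes micro- and macro-averaged precision, recall, and F-scores from true and predicted labels. It assumes that the overall macro precision and recall are the simple averages of the class-wise values and that each F-score is the harmonic mean of the corresponding precision and recall; the function name `f_scores` and the toy labels are hypothetical.

```python
def f_scores(y_true, y_pred, classes):
    """Return (F_micro, F_macro); assumes every class is both observed
    and predicted at least once, so no denominator is zero."""
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for y, yhat in zip(y_true, y_pred):
        if y == yhat:
            tp[y] += 1
        else:
            fp[yhat] += 1   # wrongly assigned to class yhat
            fn[y] += 1      # missed for class y

    # Micro averaging: pool the counts over classes first.
    tp_sum = sum(tp.values())
    pre_micro = tp_sum / (tp_sum + sum(fp.values()))
    rec_micro = tp_sum / (tp_sum + sum(fn.values()))
    f_micro = 2 * pre_micro * rec_micro / (pre_micro + rec_micro)

    # Macro averaging: average the class-wise precision and recall.
    pre_macro = sum(tp[c] / (tp[c] + fp[c]) for c in classes) / len(classes)
    rec_macro = sum(tp[c] / (tp[c] + fn[c]) for c in classes) / len(classes)
    f_macro = 2 * pre_macro * rec_macro / (pre_macro + rec_macro)
    return f_micro, f_macro

# Hypothetical labels for n = 6 subjects and I = 3 classes.
y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 3, 3]
f_micro, f_macro = f_scores(y_true, y_pred, classes=[1, 2, 3])
```

When every subject receives exactly one predicted label, the pooled false positives and false negatives coincide, so micro precision, micro recall, and the Micro-F-score all equal the overall accuracy.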
Support vector machine for multiclass responses
The support vector machine (SVM) was originally designed for two-class classification (Hastie et al. 2008, Sec. 12.2), and its extensions to multiclass responses have been discussed by many authors. An early extension of the SVM to multiclass classification is the one-against-all method (Hsu and Lin 2002). The main idea is that the ith SVM is trained with all subjects in the ith class given positive labels and all other subjects given negative labels. This type of SVM for multiclass classification, however, ignores the heterogeneity among the subjects in each class.
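A useful multiclass SVM is the one-against-one method (Knerr et al. 1990), which is implemented in the R package e1071. Different from the one-against-all method, the one-against-one method constructs I(I−1)/2 pairwise classifiers, each obtained by applying the binary SVM to the data from two selected classes. To see this, for i_{1},i_{2}∈{1,⋯,I} with i_{1}<i_{2}, we consider the following optimization
$$\begin{array}{ll} \underset{w^{i_{1}i_{2}}, \, b^{i_{1}i_{2}}, \, \xi^{i_{1}i_{2}}}{\min} & \frac{1}{2} \left( w^{i_{1}i_{2}} \right)^{\top} w^{i_{1}i_{2}} + C \sum \limits_{j: \, y_{\cdot j} \in \{i_{1}, i_{2}\}} \xi_{j}^{i_{1}i_{2}} \\ \text{subject to} & \left( w^{i_{1}i_{2}} \right)^{\top} \phi(x_{\cdot j}) + b^{i_{1}i_{2}} \geq 1 - \xi_{j}^{i_{1}i_{2}}, \ \text{if} \ y_{\cdot j} = i_{1}, \\ & \left( w^{i_{1}i_{2}} \right)^{\top} \phi(x_{\cdot j}) + b^{i_{1}i_{2}} \leq -1 + \xi_{j}^{i_{1}i_{2}}, \ \text{if} \ y_{\cdot j} = i_{2}, \\ & \xi_{j}^{i_{1}i_{2}} \geq 0, \end{array} \qquad (24)$$
where ϕ(·) is a nonlinear mapping from a p-dimensional vector to a q-dimensional vector with q>p (Hsu and Lin 2002), \(w^{i_{1}i_{2}}\) is a q-dimensional vector of parameters associated with the comparison between classes i_{1} and i_{2}, \(b^{i_{1}i_{2}}\) is a scalar, \(\xi _{j}^{i_{1}i_{2}}\) is the slack variable for the soft margin solution, and C is a cost parameter controlling the balance between maximizing the margin and minimizing the training error.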
Solving (24) for arbitrary i_{1},i_{2}∈{1,⋯,I} with i_{1}<i_{2} yields I(I−1)/2 classifiers, which can then be used to classify a new instance, say \(\widetilde {X} = \widetilde {x}\). This can be done through a voting process (Hsu and Lin 2002). Specifically, let \(\mathcal {L} = \left \{(1,2), (1,3), \cdots, (1,I), (2,3), \cdots, (2,I),\cdots, (I-1,I)\right \}\) be the collection of all pairs of class labels, which includes I(I−1)/2 elements. For each class i with i=1,⋯,I, we let vote(i) denote the number of votes for class i. Then we carry out the following three steps.

Step 1: For class i=1,⋯,I, the initial value of vote(i) is set as 0.

Step 2: For any given class i, we consider a subcollection of \(\mathcal {L}\), \(\left \{ (i,i'): i'=i+1,\cdots,I \right \}\), which is associated with class i. Calculate \(\text {sign}\left \{ (w^{ii'})^{\top } \phi (\widetilde {x}) + b^{ii'} \right \}\) repeatedly for i^{′}=i+1,⋯,I and then determine the values of vote(i) and vote(i^{′}) iteratively by the rule:
$$\begin{array}{l} \text{If}\ \text{sign}\left\{ (w^{ii'})^{\top} \phi(\widetilde{x}) + b^{ii'} \right\} > 0,\ \text{then we let} \\ \qquad vote(i) = vote(i) + 1; \\ \text{otherwise}, \\ \qquad vote(i') = vote(i') + 1; \end{array}$$
where the vote count on the right-hand side of each equation is the value determined by the previous step, the vote count on the left-hand side represents the newly determined value, and i^{′}=i+1,⋯,I.

Step 3: Repeat Step 2 for i=1,⋯,I. In this way, we determine all the final values of vote(1),⋯,vote(I). Let i^{∗} denote the class index corresponding to the largest value of {vote(1),⋯,vote(I)}, i.e., \(i^{\ast } = \underset {1 \leq i \leq I}{\text {argmax}} \left \{vote(i)\right \}\). Then we let i^{∗} be the predicted class for the new instance.
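The voting process can be sketched in Python as below. The sketch takes the pairwise decision values as given, since fitting the pairwise SVMs is outside its scope; the function name `ovo_vote` and the decision values are hypothetical. Ties are broken in favour of the smallest class index, which is an assumption of this sketch and not something the text specifies.

```python
def ovo_vote(decision_values, n_classes):
    """One-against-one voting.

    decision_values[(i, k)] is the value of the pairwise decision function
    for classes i < k evaluated at the new instance; a positive value is a
    vote for class i, otherwise the vote goes to class k.
    """
    votes = {i: 0 for i in range(1, n_classes + 1)}  # Step 1: start at zero
    for (i, k), value in decision_values.items():    # Steps 2 and 3
        if value > 0:
            votes[i] += 1
        else:
            votes[k] += 1
    # max() returns the first key with the largest count, i.e. the
    # smallest class index in case of a tie.
    winner = max(votes, key=votes.get)
    return winner, votes

# Hypothetical decision values for I = 3 classes.
decision_values = {(1, 2): -0.7, (1, 3): 0.4, (2, 3): 1.2}
winner, votes = ovo_vote(decision_values, n_classes=3)
```

Each of the I(I−1)/2 pairs contributes exactly one vote, so the vote counts always sum to I(I−1)/2.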
Linear discriminant analysis
The idea of LDA is to model the distribution of the predictors X_{·j} separately for each of the classes Y_{·j}, and then use the Bayes theorem to obtain the conditional probabilities P(Y_{·j}=i | X_{·j}=x_{·j}) (e.g., James et al. 2017). For i=1,⋯,I and j=1,⋯,n, let f_{ji}(x_{·j}) denote the conditional probability density function of the predictor X_{·j} taking value x_{·j} given that subject j comes from the ith class. Let π_{i,j}=P(Y_{·j}=i) denote the probability that the jth subject is randomly selected from class i. It is immediate that \(\sum \limits _{i=1}^{I} \pi _{i,j} = 1\) for j=1,⋯,n. By some algebra (Hastie et al. 2008, p.108) and the Bayes theorem, we obtain the posterior probability
$$P \left( Y_{\cdot j} = i \mid X_{\cdot j} = x_{\cdot j} \right) = \frac{\pi_{i,j} \, f_{ji}(x_{\cdot j})}{\sum \limits_{l=1}^{I} \pi_{l,j} \, f_{jl}(x_{\cdot j})} \qquad (25)$$
for i=1,⋯,I and j=1,⋯,n.
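To compare two classes i and l with i≠l, we calculate the log-ratio of (25) for classes i and l, given by
$$\log \left\{ \frac{P \left( Y_{\cdot j} = i \mid X_{\cdot j} = x_{\cdot j} \right)}{P \left( Y_{\cdot j} = l \mid X_{\cdot j} = x_{\cdot j} \right)} \right\} = \log \left\{ \frac{f_{ji}(x_{\cdot j})}{f_{jl}(x_{\cdot j})} \right\} + \log \left( \frac{\pi_{i,j}}{\pi_{l,j}} \right). \qquad (26)$$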
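To elaborate on the idea, we particularly consider the case where the conditional distribution f_{ji}(x_{·j}) of X_{·j} given Y_{·j}=i is assumed to be the normal distribution N(μ_{i},Σ_{i}) with the probability density function
$$f_{ji}(x_{\cdot j}) = \frac{1}{(2 \pi)^{p/2} \left| \Sigma_{i} \right|^{1/2}} \exp \left\{ -\frac{1}{2} \left( x_{\cdot j} - \mu_{i} \right)^{\top} \Sigma_{i}^{-1} \left( x_{\cdot j} - \mu_{i} \right) \right\}. \qquad (27)$$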
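If the covariance matrices Σ_{i} in (27) are assumed to be common, i.e., Σ_{i}=Σ for every i where Σ is a positive definite matrix, (26) becomes
$$\log \left( \frac{\pi_{i,j}}{\pi_{l,j}} \right) - \frac{1}{2} \left( \mu_{i} + \mu_{l} \right)^{\top} \Sigma^{-1} \left( \mu_{i} - \mu_{l} \right) + x_{\cdot j}^{\top} \Sigma^{-1} \left( \mu_{i} - \mu_{l} \right). \qquad (28)$$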
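If (28) >0, then
$$P \left( Y_{\cdot j} = i \mid X_{\cdot j} = x_{\cdot j} \right) > P \left( Y_{\cdot j} = l \mid X_{\cdot j} = x_{\cdot j} \right),$$
showing that subject j with predictors X_{·j}=x_{·j} is more likely to be selected from class i than from class l. Consequently, setting (28) equal to zero defines a boundary between classes i and l which is a linear function of x_{·j}.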
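Motivated by the form of (28), we consider a linear function in x
$$\delta_{i}(x) = x^{\top} \Sigma^{-1} \mu_{i} - \frac{1}{2} \mu_{i}^{\top} \Sigma^{-1} \mu_{i} + \log \pi_{i}, \qquad (29)$$
where μ_{i},π_{i}, and Σ are estimated by \(\widehat {\mu }_{i} = \frac {1}{n_{i}} \sum \limits _{y_{\cdot j} = i} x_{\cdot j}, \widehat {\pi }_{i} = \frac {n_{i}}{n}\), and \(\widehat {\Sigma } = \frac {1}{n-I} \sum \limits _{i=1}^{I} \sum \limits _{y_{\cdot j} = i} \left (x_{\cdot j} - \widehat {\mu }_{i} \right)\left (x_{\cdot j} - \widehat {\mu }_{i} \right)^{\top }\), respectively. That is, (29) can be estimated by
$$\widehat{\delta}_{i}(x) = x^{\top} \widehat{\Sigma}^{-1} \widehat{\mu}_{i} - \frac{1}{2} \widehat{\mu}_{i}^{\top} \widehat{\Sigma}^{-1} \widehat{\mu}_{i} + \log \widehat{\pi}_{i}. \qquad (30)$$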
Function (30) is called the linear discriminant function and is used to determine the class label for a new instance (James et al. 2017, p.143; Hastie et al. 2008, p. 109). For the prediction of a new subject with covariate \(\widetilde {x}\), we first calculate \(\widehat {\delta }_{i}(\widetilde {x})\) using (30) for i=1,⋯,I. Next, we find i^{∗} which is defined as
$$i^{\ast} = \underset{1 \leq i \leq I}{\text{argmax}} \ \widehat{\delta}_{i}(\widetilde{x}),$$
and the class label for this subject is then predicted as i^{∗}.
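To make the discriminant rule concrete, here is a small Python sketch restricted to a single predictor (p=1), so that the common covariance matrix reduces to a scalar pooled variance and no matrix inversion is needed. This restriction, the function names `fit_lda_1d` and `discriminant`, and the toy data are assumptions made for illustration only.

```python
import math

def fit_lda_1d(xs, ys):
    """Class means, class proportions, and pooled variance for p = 1."""
    classes = sorted(set(ys))
    n = len(xs)
    means, priors = {}, {}
    ss = 0.0
    for c in classes:
        xc = [x for x, y in zip(xs, ys) if y == c]
        means[c] = sum(xc) / len(xc)
        priors[c] = len(xc) / n
        ss += sum((x - means[c]) ** 2 for x in xc)
    var = ss / (n - len(classes))  # pooled variance with divisor n - I
    return means, priors, var

def discriminant(x, mean, prior, var):
    """Scalar version of the linear discriminant function."""
    return x * mean / var - mean ** 2 / (2 * var) + math.log(prior)

# Hypothetical training data: n = 8 subjects in I = 3 classes.
xs = [1.0, 1.2, 0.8, 3.0, 3.3, 2.7, 5.1, 4.9]
ys = [1, 1, 1, 2, 2, 2, 3, 3]
means, priors, var = fit_lda_1d(xs, ys)

x_new = 2.9
scores = {c: discriminant(x_new, means[c], priors[c], var) for c in means}
predicted = max(scores, key=scores.get)
```

For p>1 the same computation applies with the scalar variance replaced by the pooled covariance matrix and the divisions replaced by multiplication with its inverse.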
K-nearest neighbor
The third classification method we compare with is the Knearest neighbor (KNN) method which is a nonparametric approach. The key idea of KNN is to use the available instances to estimate the conditional probability of Y_{·j} given X_{·j}, and then classify a new instance to a certain class based on the highest estimated conditional probability.
For a positive integer K and a new instance \(\widetilde {x}\) of predictors \(\widetilde {x}\), the first step of KNN is to identify K points which are closest to \(\widetilde {x}\); let \(\mathcal {N}_{0}\left (\widetilde {x} \right)\) denote the set containing such Knearest points of \(\widetilde {x}\). Next, for i=1,⋯,I, we calculate
Finally, let i^{∗} denote the class label which corresponds to the largest value of \(\left \{ \widehat {\pi }_{1},\cdots, \widehat {\pi }_{I} \right \}\). Then the class label for this new subject is predicted as i^{∗}.
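The two steps can be sketched as follows. This is a minimal illustration that assumes \(\widehat {\pi }_{i}\) is the fraction of the K nearest points carrying label i (the usual KNN estimate described by James et al. 2017) and that closeness is measured by Euclidean distance; the function name knn_predict is hypothetical, and ties are broken in favor of the first class.

```python
import numpy as np

def knn_predict(x_new, X, y, k):
    """Classify x_new by the largest class proportion among its k nearest points."""
    dist = np.linalg.norm(X - x_new, axis=1)           # Euclidean distance (an assumption)
    neighbors = np.argsort(dist, kind="stable")[:k]    # indices of the k closest points
    classes = np.unique(y)
    # pi_hat_i: share of the k neighbors whose label equals class i
    pi_hat = np.array([(y[neighbors] == c).mean() for c in classes])
    return classes[np.argmax(pi_hat)], pi_hat
```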
For the KNN method, a crucial issue is the selection of K. A small value of K usually yields an over-flexible decision boundary, giving a classifier with a small bias but a large variance. In contrast, with a large K the boundary becomes less flexible and close to linear, giving a classifier with a small variance but a large bias. To determine an optimal K from the theoretical perspective, James et al. (2017, p. 184 and p. 186) suggested using the cross-validation method; from the computational viewpoint, however, the choice of K is sometimes based on a random guess, as commented by James et al. (2017, p. 167).
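As an illustration of selecting K by cross-validation, the sketch below computes the leave-one-out misclassification rate for each candidate K and keeps the smallest. It is a hedged example only: the Euclidean distance, the candidate grid, and the function names loo_error and choose_k are assumptions, not part of the procedure used in the numerical studies.

```python
import numpy as np

def loo_error(X, y, k):
    """Leave-one-out misclassification rate of KNN with k neighbors."""
    n = len(y)
    classes = np.unique(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)       # a point may not be its own neighbor
    errors = 0
    for j in range(n):
        neighbors = np.argsort(dist[j], kind="stable")[:k]
        votes = np.array([(y[neighbors] == c).sum() for c in classes])
        errors += int(classes[np.argmax(votes)] != y[j])
    return errors / n

def choose_k(X, y, candidates):
    """Return the candidate k with the smallest leave-one-out error, and all errors."""
    errs = {k: loo_error(X, y, k) for k in candidates}
    best = min(errs, key=errs.get)       # first candidate wins among ties
    return best, errs
```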
Extreme gradient boosting
The extreme gradient boosting (XGBOOST) method is a tree-based ensemble method created under the gradient boosting framework (e.g., Chen and Guestrin 2016) and can be implemented by the R package xgboost.
Let \(\mathcal {F}\) denote the space of functions representing regression trees. Each \(f \in {\mathcal {F}}\) has the form \(f(x) = w_{q(x)}\), where \(q: \mathbb {R}^{p} \rightarrow {\mathcal {L}}\) reflects the structure of the tree f and maps an example to the corresponding leaf index, \({\mathcal {L}}\) is the set of the leaf indices, \(w \in \mathbb {R}^{T}\) is the vector of leaf weights, and T is the number of leaves in the tree. Suppose that K regression trees \(f_{k}(\cdot) \in {\mathcal {F}}\), k=1,⋯,K, are used to predict the output:
for an example with the input x_{·j}.
To learn the set of functions used for classification, we minimize the regularized objective function
where Ω is the regularization used to measure the model complexity, given by
with tuning parameters γ and λ. Here L(·) is the loss function which measures how well the model fits the training data. With the multiclass classification problem discussed in “Classification with predictor graphical structures accommodated” section, we specify L(·) as
with \(p_{{ij}} = \frac {\exp \left (\widehat {y}_{{ij}}\right)}{1 + \sum \limits _{l=1}^{I-1} \exp \left (\widehat {y}_{{lj}}\right)}\) for i=1,…,I−1 and \(p_{{Ij}} = 1 - \sum \limits _{i=1}^{I-1} p_{{ij}}\).
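The class probabilities p_{ij} can be computed from the fitted scores as sketched below, with class I acting as the reference class. The function name class_probabilities and the array layout (one row per non-reference class, one column per subject) are assumptions for illustration. The code rescales the exponentials before dividing, which gives the same values as the formula but avoids numerical overflow for large scores.

```python
import numpy as np

def class_probabilities(scores):
    """scores: (I-1, n) array of y_hat_{ij}, i = 1..I-1. Returns an (I, n) array of p_{ij}."""
    scores = np.asarray(scores, dtype=float)
    shift = np.maximum(scores.max(axis=0), 0.0)   # rescaling constant for stability
    expo = np.exp(scores - shift)
    ref = np.exp(-shift)                          # the "1" in the denominator, rescaled
    denom = ref + expo.sum(axis=0)
    p_first = expo / denom                        # p_{ij} for i = 1..I-1
    p_last = ref / denom                          # equals 1 - sum_i p_{ij}
    return np.vstack([p_first, p_last])
```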
While the formulation of the objective function in (31) makes the tradeoff between predictive accuracy and model complexity conceptually transparent, minimizing (31) cannot be directly carried out using traditional optimization procedures. One approach is to build the model additively with the gradient boosting tree algorithm, using a second-order approximation to the objective function at each iteration. Specifically, at iteration t, we define
with \( \widehat {y}_{\cdot j}^{(0)} = 0\), and hence the objective function
Applying the second-order approximation to (33) gives
where g_{j} and h_{j} are the first and second order gradients of the loss function \(L(y_{\cdot j},\widehat {y}^{(t-1)})\) with respect to \(\widehat {y}^{(t-1)}\), respectively.
Let I_{m}={j:q(x_{·j})=m} denote the instance set of leaf m. Then by (32), (34) becomes
For a given tree structure q(·), minimizing (35) gives the optimal weight \(w_{m}^{\ast }\) of leaf m and the optimal value of (35), respectively, given by
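To make this step concrete, the sketch below uses the standard results of Chen and Guestrin (2016): writing \(G_{m} = \sum _{j \in I_{m}} g_{j}\) and \(H_{m} = \sum _{j \in I_{m}} h_{j}\), and assuming the regularization \(\Omega (f) = \gamma T + \frac {1}{2} \lambda \|w\|^{2}\), the optimal leaf weight is \(w_{m}^{\ast } = -G_{m}/(H_{m} + \lambda)\) and the optimal objective value is \(-\frac {1}{2} \sum _{m=1}^{T} G_{m}^{2}/(H_{m} + \lambda) + \gamma T\). The function names and the 0-based leaf indexing are assumptions for illustration.

```python
import numpy as np

def optimal_leaves(g, h, leaf_index, n_leaves, lam, gamma):
    """Optimal leaf weights and objective value for a fixed tree structure."""
    g, h = np.asarray(g, dtype=float), np.asarray(h, dtype=float)
    G = np.bincount(leaf_index, weights=g, minlength=n_leaves)   # sum of g_j per leaf
    H = np.bincount(leaf_index, weights=h, minlength=n_leaves)   # sum of h_j per leaf
    w_star = -G / (H + lam)
    obj_star = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * n_leaves
    return w_star, obj_star

def approx_objective(w, g, h, leaf_index, lam, gamma):
    """Second-order objective for arbitrary leaf weights w (constant terms dropped)."""
    g, h = np.asarray(g, dtype=float), np.asarray(h, dtype=float)
    wj = np.asarray(w, dtype=float)[leaf_index]   # weight of the leaf each instance falls in
    return np.sum(g * wj + 0.5 * h * wj ** 2) + gamma * len(w) + 0.5 * lam * np.sum(np.square(w))
```

Evaluating approx_objective at w_star reproduces obj_star, and any other choice of leaf weights gives a larger value, which is a quick numerical check of the closed form.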
Numerical studies
In this section, we first conduct simulation studies to evaluate the performance of the procedures proposed in “Classification with predictor graphical structures accommodated” section, and then we apply the procedures to analyze a real dataset to illustrate their usage. The proposed methods are compared with the classification methods reviewed in “Evaluation of the performance” section as well as the usual multiclass logistic regression model in “Logistic regression model for multiclass response” section. The R functions svm (package e1071), lda (package MASS), and knn.cv (package class), together with the R package xgboost, are used to implement the SVM, LDA, KNN, and XGBOOST methods, respectively.
Simulation study
For class i=1,⋯,I, the predictors are generated from the multivariate normal distribution with mean zero and covariance matrix \(\Sigma _{i} = \Omega _{i}^{-1}\), where Ω_{i} is a matrix associated with the network structure in class i with all diagonal elements 1 and off-diagonal elements 0 or 1; for s≠t, entry (s,t) is 1 if the edge exists between X_{s} and X_{t} and 0 otherwise. The relationship between a multivariate normal distribution N(0,Σ_{i}) and the Gaussian graphical model with edges determined by \(\Omega _{i} = \Sigma _{i}^{-1}\) is discussed by Hastie et al. (2015, p. 246 and p. 263).
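Sampling from N(0, Ω^{−1}) for a given precision matrix can be sketched as follows. This is an illustrative example, not the simulation code used in the paper: the function names, the 0-based edge list, and in particular the off-diagonal value 0.3 in the usage below are hypothetical, the latter chosen only so that the small example matrix is positive definite. The sampler uses the Cholesky factor of Ω directly, so Ω never has to be inverted explicitly, and it raises an error when Ω is not positive definite.

```python
import numpy as np

def precision_from_edges(p, edges, weight):
    """Build a p x p matrix with unit diagonal and `weight` at each edge (s, t)."""
    omega = np.eye(p)
    for s, t in edges:
        omega[s, t] = omega[t, s] = weight
    return omega

def sample_ggm(omega, n, rng):
    """Draw n rows from N(0, omega^{-1}) using the Cholesky factor of omega."""
    chol = np.linalg.cholesky(omega)    # raises LinAlgError if omega is not positive definite
    z = rng.standard_normal((omega.shape[0], n))
    # If omega = L L', then x = (L')^{-1} z has covariance (L L')^{-1} = omega^{-1}
    x = np.linalg.solve(chol.T, z)
    return x.T

# Hypothetical usage: three predictors with a single edge between the first two
omega = precision_from_edges(3, [(0, 1)], weight=0.3)
x = sample_ggm(omega, n=50000, rng=np.random.default_rng(2019))
```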
We specifically consider two scenarios of network structures where the dimension of predictors is p=12. In the first scenario we specify Ω_{i} to reflect the network structures displayed in Fig. 1. For example, element (1,5) for Ω_{1} is 1, but element (1,5) for Ω_{i} is 0 if i=2,3,4. For a given class i and a subject j in this class, we calculate \(\pi _{j}^{i}(x_{\cdot j})\) by (16) where we set \(\gamma _{0}^{i} = \gamma _{{st}}^{i} = 1\). The outcome measurements are set to be \(Y_{j}^{i} = 1\) if \(\pi _{j}^{i}(x_{\cdot j}) > c\), and \(Y_{j}^{i} = 0\) otherwise, where the threshold c is chosen such that the size in class i equals n_{i}.
In the second scenario, Ω_{i} is taken as the identity matrix for i=1,⋯,I, so that the predictors have no network structure. For subject j, the predictor X_{·j} is generated from the multivariate normal distribution with mean zero and identity covariance matrix. To generate Y_{·j} for subject j, we first calculate π_{ij}(x_{·j}) for every i=1,⋯,I by (2) and (3) where γ_{0i} and γ_{i} are both set as log(i)+1 for class i. Then we set Y_{·j}=i^{∗} if \(i^{\ast } = \underset {i}{\text {argmax}} \pi _{{ij}}(x_{\cdot j})\). This process is continued until the desired size n_{i} is achieved for i=1,⋯,I. We consider the case with I=4 and n_{i}=50 for i=1,⋯,I and run 500 simulations. We use criteria (22) and (23) to report the performance of each method. The results are summarized in Table 2. The proposed LRClassGraph method outperforms all the other classification methods, with larger values of PRE, REC and F from both the micro and macro viewpoints. The SVM performs second best, and the LRHomoGraph method ranks third, followed by the XGBOOST method.
To understand how the proposed methods perform in binary classification, we repeat the preceding simulations by setting I to be 2 and taking the network structures of classes 1 and 2 when considering scenario 1. The results are in Table 3. When covariates are associated with a network structure, the proposed LRClassGraph method still performs the best, and its improvement over existing classifiers is much more noticeable for I=2 than for I=4. Interestingly, when covariates are uncorrelated, unlike the multiclass case with I=4, the LRHomoGraph method outperforms the LRClassGraph method; and in this case, the SVM is the best classifier.
Glass identification dataset
We analyze a dataset concerning glass identification. The study of classification of glass types was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence if it is correctly identified. It is of interest to predict the glass type based on the information of the predictors.
The dataset contains 7 types of glass, including

building_windows_float_processed (Glass1),

building_windows_non_float_processed (Glass2),

vehicle_windows_float_processed (Glass3),

vehicle_windows_non_float_processed (Glass4),

containers (Glass5),

tableware (Glass6), and

headlamps (Glass7),
and the 9 predictors are the refractive index (RI) together with the contents of Sodium (NA), Magnesium (MG), Aluminum (AL), Silicon (SI), Potassium (K), Calcium (CA), Barium (BA), and Iron (FE). The complete dataset is available at https://archive.ics.uci.edu/ml/datasets/glass+identification. The sample sizes in the classes are, respectively, n_{1}=70, n_{2}=76, n_{3}=17, n_{4}=0, n_{5}=13, n_{6}=9, and n_{7}=29, yielding the total sample size \(n = \sum \limits _{i=1}^{7} n_{i} = 214\). To see the correlation among the predictors, we draw a scatter plot of those 9 predictors, displayed in Fig. 2. It is seen that some predictors, such as RI and CA, are highly correlated, and that many pairs of predictors are correlated to some degree.
We first present the network structures for different chemical materials in each class. The network structure for each class is determined by (9) and (11). The graphical results are reported in Fig. 4. It is seen that the network structure of the predictors is different from class to class. We notice that RI has no connection with other variables in every class and the predictor FE also has no connection with others except in class 6.
We next evaluate the performance of our proposed methods as opposed to the conventional approaches, SVM, LDA, KNN, and XGBOOST, which are respectively implemented by the R functions svm (package e1071), lda (package MASS), and knn.cv (package class) and the R package xgboost. To examine the performance of LRHomoGraph proposed in “Logistic regression with homogeneous graphically structured predictors” section, we first construct the network structure, displayed in Fig. 3, of the predictors with the class information ignored, and we then apply the procedure described in “Logistic regression with homogeneous graphically structured predictors” section. To implement the LRClassGraph method in “Logistic regression with class-dependent graphically structured predictors” section, we apply model (16) with respect to the six different network structures in Fig. 4, and then determine the predictive class using (18).
To measure the classification results in each class, we define the misclassification rate in class i to be
The results obtained from SVM, LDA, KNN, XGBOOST, and the proposed methods are reported in Table 4. In each class, the misclassification rates of our proposed methods are smaller than those of the other methods, and the LRClassGraph method yields the smallest misclassification rate for each class. Among the four competing methods, the SVM outperforms the other three.
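A class-wise misclassification rate of this kind can be computed as sketched below. The sketch assumes that the rate in class i is the proportion of the n_{i} subjects whose true label is i but whose predicted label differs from i; this reading, and the function name misclassification_by_class, are assumptions for illustration.

```python
import numpy as np

def misclassification_by_class(y_true, y_pred):
    """Map each observed class label to the share of its subjects that are misclassified."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        c.item(): float(np.mean(y_pred[y_true == c] != c))
        for c in np.unique(y_true)
    }
```

Classes with no observed subjects, such as a class with n_{i}=0, do not appear in the result.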
Finally, we use criteria (22) and (23) to compare the overall performance of all the methods and summarize the results in Table 5. Both LRHomoGraph and LRClassGraph produce higher values of the F, PRE and REC measures, under both the micro and macro versions, implying that our proposed methods perform better than the other multiclass classification methods considered here. In addition, we implement the two methods in “Classification with predictor graphical structures accommodated” section with models (12) and (17) extended to include the linear term in each predictor; we denote those methods as LRHomoGraph+main and LRClassGraph+main, respectively, and report the results in the last two columns of Table 5. Such an extension of the models, however, does not help increase the values of these measures.
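For readers who wish to reproduce measures of this type, the sketch below computes micro and macro versions of PRE, REC, and F from the true and predicted labels. It follows the common definitions (micro: pool the counts over classes before taking ratios; macro: average the class-wise ratios) and computes each F as the harmonic mean of the corresponding PRE and REC. Whether these coincide exactly with criteria (22) and (23) is an assumption, and the function name micro_macro_scores is hypothetical.

```python
import numpy as np

def micro_macro_scores(y_true, y_pred):
    """Micro and macro precision, recall, and F for single-label multiclass predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in classes], dtype=float)
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in classes], dtype=float)
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in classes], dtype=float)

    def ratio(num, den):
        # class-wise ratio, set to 0 where the denominator is 0
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

    def f_score(pre, rec):
        return 0.0 if pre + rec == 0 else 2 * pre * rec / (pre + rec)

    micro_pre = tp.sum() / (tp.sum() + fp.sum())
    micro_rec = tp.sum() / (tp.sum() + fn.sum())
    macro_pre = ratio(tp, tp + fp).mean()
    macro_rec = ratio(tp, tp + fn).mean()
    return {
        "micro": (micro_pre, micro_rec, f_score(micro_pre, micro_rec)),
        "macro": (macro_pre, macro_rec, f_score(macro_pre, macro_rec)),
    }
```

With exactly one predicted label per subject, the micro versions of PRE and REC both reduce to the overall proportion of correct predictions, so the macro versions are the more informative ones when class sizes are unbalanced.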
Discussion
In this paper, we propose logistic regression methods for prediction with data whose predictors have network structures. In our methods, we first identify the network structures of the predictors for every class using graphical models, and then we capitalize on the identified network structures to fit a logistic regression model for classification and prediction. Simulation studies demonstrate that in the presence of network structures for covariates, our proposed methods produce more precise classification results than conventional methods, such as SVM, LDA, KNN, and XGBOOST. To allow interested readers to use the algorithms developed in “Classification with predictor graphical structures accommodated” section, the implementation procedures will be posted on CRAN.
Our development here focuses on examining pairwise dependence structures among predictors using the formulation (7). This is primarily driven by the consideration that such a dependence structure is intuitively interpretable and arises in many problems. Extensions that accommodate triple-wise or higher-order dependence structures among predictors, or that include the main effects (i.e., single-variable effects), can be carried out by extending (7) to the form (9.5) of Hastie et al. (2015). Such extensions are, in principle, straightforward to implement, but the issue of overfitting may arise. In addition, underlying constraints on the model parameters may become a complex concern in numerical implementation. Discussions on this aspect were given by many authors, including Yang et al. (2015), Yi (2017), and Yi et al. (2017). Our discussion in this paper is directed to using exponential family distributions to accommodate continuous predictors. It is easy to extend our methods to accommodate mixed graphical models which feature both continuous and discrete predictors.
In obtaining the estimator (9), we use the L_{1}-norm, or LASSO, penalty, a choice driven by its popularity as well as the availability of implementation software (e.g., R packages huge and XMRF). However, the methods described in “Classification with predictor graphical structures accommodated” section are not confined to the LASSO penalty; they apply as well when other penalty functions are used. For instance, the elastic-net, SCAD, adaptive LASSO, and L_{2}-norm penalties can replace the LASSO penalty in deriving the estimator (9), and the remaining procedures developed in “Classification with predictor graphical structures accommodated” section still carry through. As noted by a referee, it will be interesting to conduct numerical studies with different penalty functions to compare how results may differ with and without incorporating the network structure in the analysis. Though in this paper we are not able to exhaust numerical explorations for all possible penalty functions, the implementation framework presented in “Classification with predictor graphical structures accommodated” section allows users to take any penalty function that suits their own problems.
Finally, we comment that several aspects of the methods described in “Classification with predictor graphical structures accommodated” section warrant further research. As pointed out by a referee, our methods are developed for problems with low-dimensional data (i.e., p<n) and they are not applicable to data with p≥n. In the current digital world, it is not uncommon to handle data with thousands of predictor variables but a much smaller sample size. In such circumstances, dimension reduction or feature screening techniques would be employed before proceeding with formal data analysis. It is interesting to generalize our methods to handle high-dimensional data with p being of a polynomial order of n or even ultra high-dimensional data with p being of an exponential order of n.
Our methods basically involve two steps in using measurements for the covariates and class labels. In the first step, we utilize undirected graphs to examine the covariate measurements alone, and the class information only comes into play in the second step when using logistic regression for classification. Alternatively, one may consider using directed acyclic graphs to feature conditional independencies among variables and develop probabilistic graphical models for classification. To evaluate the performance of the proposed methods, we focus on the comparisons with the competing classifiers reviewed in “Evaluation of the performance” section. While those algorithms cover a good range of available classifiers, they are far from exhaustive. Despite the frequentist nature of our methods, it is interesting to compare the proposed methods to the Bayesian network classifiers which have proven useful in applications (e.g., Geiger and Heckerman 1996; Pérez et al. 2006; Bielza and Larrañaga 2014). Furthermore, it is worthwhile to employ rigorous hypothesis testing procedures to evaluate whether the differences in the results obtained from different classifiers are statistically significant.
References
Agresti, A.: An Introduction to Categorical Data Analysis. Wiley, New York (2007).
Agresti, A.: Categorical Data Analysis. Wiley, New York (2012).
Bagirov, A. M., Ferguson, B., Ivkovic, S., Saunders, G., Yearwood, J.: New algorithms for multiclass cancer diagnosis using tumor gene expression signatures. Bioinformatics. 19, 1800–1807 (2003).
Baladanddayuthapani, V., Talluri, R., Ji, Y., Coombes, K. R., Lu, Y., Hennessy, B. T., Davies, M. A., Mallick, B. K.: Bayesian sparse graphical models for classification with application to protein expression data. Ann. Appl. Stat. 8, 1443–1468 (2014).
Bicciato, S., Luchini, A., Bello, C. D.: PCA disjoint models for multiclass cancer analysis using gene expression data. Bioinformatics. 19, 571–578 (2003).
Bielza, C., Li, G., Larrañaga, P.: Multi-dimensional classification with Bayesian networks. Int. J. Approx. Reason. 52, 705–727 (2011).
Bielza, C., Larrañaga, P.: Discrete Bayesian network classifiers: A survey. ACM Comput. Surv. 47, 1–43 (2014).
Cai, W., Guan, G., Pan, R., Zhu, X., Wang, H.: Network linear discriminant analysis. Comput. Stat. Data Anal. 117, 32–44 (2018).
Cetiner, M., Akgul, Y. S.: Information Sciences and Systems 2014. In: T., C., E., G., R., L. (eds.) 2nd, pp. 53–76. Springer, New York (2014).
Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), pp. 785–794. ACM, San Francisco (2016). http://doi.org/10.1145/2939672.2939785.
Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000).
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001).
Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 9, 432–441 (2008).
Geiger, D., Heckerman, D.: Knowledge representation and inference in similarity networks and Bayesian multinets. Artif. Intell. 82, 45–74 (1996).
Guo, Y., Hastie, T., Tibshirani, R.: Regularized linear discriminant analysis and its application in microarrays. Biostatistics. 8, 86–100 (2007).
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2008).
Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC press, New York (2015).
Hsu, C. W., Lin, C. J.: A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 13, 415–425 (2002).
Huttenhower, C., Flamholz, A. I., Landis, J. N., Sahi, S., Myers, C. L., Olszewski, K. L., Hibbs, M. A., Siemers, N. O., Troyanskaya, O. G., Coller, H. A.: Nearest neighbor networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics. 8, 1–13 (2007).
James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning: with Applications in R. Springer, New York (2017).
Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: A stepwise procedure for building and training neural network. In: F.F., S., J., H. (eds.) Neurocomputing: Algorithms, Architectures and Applications. 1st, pp. 41–50. Springer, Berlin (1990).
Lee, Y., Lee, C. K.: Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics. 19, 1132–1139 (2003).
Lee, J., Hastie, T. J.: Learning the structure of mixed graphical models. J. Comput. Graph. Stat. 24, 230–253 (2015).
Liu, J. J., Cutler, G., Li, W., Pan, Z., Peng, S., Hoey, T., Chen, L., Ling, X. B.: Multiclass cancer classification and biomarker discovery using GA-based algorithms. Bioinformatics. 21, 2691–2697 (2005).
Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34, 1436–1462 (2006).
Hernández-Lobato, J. M., Hernández-Lobato, D., Suárez, A.: Network-based sparse Bayesian classification. Pattern Recognit. 44, 886–900 (2011).
Parambath, S. A. P., Usunier, N., Grandvalet, Y.: Optimizing pseudo-linear performance measures: Application to F-measure (2018). arXiv:1505.00199v4. Accessed 1 Jan 2018.
Pérez, A., Larrañaga, P., Inza, I.: Supervised classification with conditional Gaussian networks: Increasing the structure complexity from naive Bayes. Int. J. Approx. Reason. 43, 1–25 (2006).
Peterson, C. B., Stingo, F. C., Vannucci, M.: Joint Bayesian variable and graph selection for regression models with network-structured predictors. Stat. Med. 35, 1017–1031 (2015).
Ravikumar, P., Wainwright, M. J., Lafferty, J.: High-dimensional Ising model selection using ℓ_{1}-regularized logistic regression. Ann. Stat. 38, 1287–1319 (2010).
Safo, S. E., Ahn, J.: General sparse multiclass linear discriminant analysis. Comput. Stat. Data Anal. 99, 81–90 (2016).
Sokolova, M., Japkowicz, N., Szpakowicz, S.: AI 2006: Advances in Artificial Intelligence. In: A., S., B., K. (eds.) 1st, pp. 53–76. Springer, Berlin (2006).
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Royal Stat. Soc. Ser. B. 58, 267–288 (1996).
Wang, H., Li, R., Tsai, C.: Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 94, 553–568 (2007).
Yang, E., Ravikumar, P., Allen, G. I., Liu, Z.: Graphical models via univariate exponential family distribution. J. Mach. Learn. Res. 16, 3813–3847 (2015).
Yi, G. Y.: Composite likelihood/pseudolikelihood. Wiley StatsRef: Stat. Ref. Online (2017). https://doi.org/10.1002/9781118445112.stat07855.
Yi, G. Y., He, W., Li, H.: A class of flexible models for analysis of complex structured correlated data with application to clustered longitudinal data. Stat. 6, 448–461 (2017).
Zhu, S. X. Y., Pan, W.: Network-based support vector machine for classification of microarray samples. BMC Bioinformatics. 10, 1–11 (2009).
Zi, X., Liu, Y., Gao, P.: Mutual information network-based support vector machine for identification of rheumatoid arthritis-related genes. Int. J. Clin. Experiment. Med. 9, 11764–11771 (2016).
Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006).
Acknowledgements
This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and partially supported by a Collaborative Research Team Project of the Canadian Statistical Sciences Institute (CANSSI).
Author information
Contributions
The first two authors lead the project with equal contributions including writing the paper; the last two authors participate in the project with equal contributions.
Corresponding author
Correspondence to Grace Y. Yi.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional information
Qihuang Zhang and Wenqing He participate in the project with equal contributions.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Keywords
 F-score
 Logistic regression model
 Multiclassification
 Network structure