 Methodology
 Open Access
Multiclass analysis and prediction with network structured covariates
 Li-Pang Chen†^{1},
 Grace Y. Yi†^{1},
 Qihuang Zhang†^{1} and
 Wenqing He†^{2}
https://doi.org/10.1186/s40488-019-0094-2
© The Author(s) 2019
 Received: 25 October 2018
 Accepted: 6 May 2019
 Published: 6 June 2019
Abstract
Technological advances in data acquisition are leading to the production of complex structured data sets. Recent developments in classification with multiclass responses make it possible to incorporate the dependence structure of predictors. The available methods, however, are hindered by restrictive requirements: they basically assume a common network structure for the predictors of all subjects, without taking into account the heterogeneity existing in different classes, and they mainly focus on the case where the distribution of the predictors is normal. In this paper, we propose classification methods which address these limitations. Our methods are flexible in handling possibly class-dependent network structures of variables, and they allow the predictors to follow a distribution in the exponential family, which includes normal distributions as a special case. Our methods are computationally easy to implement. Numerical studies are conducted to demonstrate the satisfactory performance of the proposed methods.
Keywords
 F-score
 Logistic regression model
 Multi-classification
 Network structure
Introduction
In contemporary statistical inference and machine learning, classification and prediction are of great importance, and many approaches have been proposed. Typical methods include the support vector machine (SVM), linear discriminant analysis (LDA), and K-nearest neighbors (KNN) (Hastie et al. 2008; James et al. 2017). These methods have widespread applications, and their extensions to accommodate complex settings have been proposed. For example, Lee and Lee (2003) studied multicategory support vector machines for classification of multiple types of cancer. Cristianini and Shawe-Taylor (2000) presented comprehensive discussions of SVM methods. Guo et al. (2007) discussed the LDA method and its application in microarray data analysis. Safo and Ahn (2016) considered multiclass analysis by performing generalized sparse linear discriminant analysis. Regarding multiclass classification problems, Bagirov et al. (2003) proposed a new algorithm for multiclass cancer data. Bicciato et al. (2003) presented disjoint models for multiclass cancer analysis using the principal component technique. Liu et al. (2005) proposed a genetic algorithm (GA)-based method to carry out multiclass cancer classification.
Recent developments in classification further incorporate the dependence structure of predictors. For example, Cetiner and Akgul (2014) developed a graphical-model-based method for multi-label classification. Zhu and Pan (2009) proposed the network-based support vector machine for binary classification of microarray samples. Zi et al. (2016) discussed identification of rheumatoid-arthritis-related genes using the network-based support vector machine. Cai et al. (2018) considered network linear discriminant analysis. Huttenhower et al. (2007) proposed the nearest neighbor network approach. In the Bayesian paradigm, various classification approaches accommodating network structures have been explored, such as Bielza et al. (2011), Miguel Hernández-Lobato et al. (2011), Baladandayuthapani et al. (2014), and Peterson et al. (2015).
Although there have been methods handling network structures in classification, those methods basically assume a common network structure for the predictors of all subjects, without taking into account possible heterogeneity across classes. To overcome these shortcomings, in this paper we propose classification methods with possibly class-dependent network structures of predictors taken into account. Our methods utilize graphical model theory and allow the predictors to follow an exponential family distribution, instead of a restrictive normal distribution. Furthermore, we develop a prediction criterion for multiclass classification which accommodates pairwise dependence structures among the predictors. Our methods incorporate informative predictors with pairwise dependence structures into classification procedures, and they are computationally easy to implement.
The remainder of the paper is organized as follows. In “Data structure and framework” section, we introduce the data structure and review a convenient multiclass classification method for simple settings. In “Classification with predictor graphical structures accommodated” section, we describe the basics of graphical model theory and propose two methods for multiclass classification that accommodate network structures of predictors. In “Evaluation of the performance” section, we describe the criteria for evaluating the performance of the proposed methods, and briefly review several competing classification methods for comparison. In “Numerical studies” section, we conduct simulation studies to assess the performance of the proposed methods, and apply them to a real dataset for illustration. A general discussion is presented in the last section.
Data structure and framework
In this section, we present the data structure with multiclass responses and introduce the basic notation.
Notation
Suppose the data of n subjects come from I classes, where I is an integer no smaller than 2 and the classes are free of order, i.e., they are nominal. Let n_{i} be the class size in class i with i=1,⋯,I, and hence \(n = \sum \limits _{i=1}^{I} n_{i}\). Define Y_{ik}=i for class i=1,⋯,I and subject k=1,⋯,n_{i}, and let \(Y = \left (Y_{11}, Y_{12}, \cdots, Y_{1n_{1}}, Y_{21}, \cdots, Y_{2n_{2}}, \cdots, Y_{I1}, \cdots, Y_{In_{I}} \right)^{\top }\) denote the n-dimensional random vector of responses. Let Y_{·j} denote the jth component of Y. In other words, if we ignore the class information, then Y_{·j} represents the response (or the class membership) for the jth subject in the sample, where j=1,⋯,n.
Two ways to display data with a multiclass response and predictors
Data displayed with class label used  Data displayed without distinguishing class label  

Class  Subject  Predictor  Response  Subject  Predictor  Response  
1  1  X _{111}  X _{211}  ⋯  X _{ p11}  Y_{11}=1  1  X _{·11}  ⋯  X _{·1 p}  Y _{·1} 
2  X _{112}  X _{212}  ⋯  X _{ p12}  Y_{12}=1  2  X _{·21}  ⋯  X _{·2 p}  Y _{·2}  
3  X _{113}  X _{213}  ⋯  X _{ p13}  Y_{13}=1  3  X _{·31}  ⋯  X _{·3 p}  Y _{·3}  
⋮  ⋮  ⋮  ⋯  ⋮  ⋮  ⋮  ⋮  ⋯  ⋮  ⋮  
n _{1}  \(X_{11n_{1}}\)  \(X_{21n_{1}}\)  ⋯  \(X_{p1n_{1}}\)  \(Y_{1n_{1}} = 1\)  n _{1}  \(X_{\cdot n_{1}1}\)  ⋯  \(X_{\cdot n_{1} p}\)  \(Y_{\cdot n_{1}}\)  
2  1  X _{121}  X _{221}  ⋯  X _{ p21}  Y_{21}=2  n_{1}+1  \(X_{\cdot, n_{1}+1,1}\)  ⋯  \(X_{\cdot, n_{1}+1, p}\)  \(Y_{\cdot, n_{1}+1}\) 
2  X _{122}  X _{222}  ⋯  X _{ p22}  Y_{22}=2  n_{1}+2  \(X_{\cdot, n_{1}+2,1}\)  ⋯  \(X_{\cdot, n_{1}+2, p}\)  \(Y_{\cdot, n_{1}+2}\)  
3  X _{123}  X _{223}  ⋯  X _{ p23}  Y_{23}=2  n_{1}+3  \(X_{\cdot, n_{1}+3, 1}\)  ⋯  \(X_{\cdot, n_{1}+3, p}\)  \(Y_{\cdot, n_{1}+3}\)  
⋮  ⋮  ⋮  ⋯  ⋮  ⋮  ⋮  ⋮  ⋯  ⋮  ⋮  
n _{2}  \(X_{12n_{2}}\)  \(X_{22n_{2}}\)  ⋯  \(X_{p2n_{2}}\)  \(Y_{2n_{2}} = 2\)  n_{1}+n_{2}  \(X_{\cdot, n_{1}+n_{2},1}\)  ⋯  \(X_{\cdot, n_{1}+n_{2}, p}\)  \(Y_{\cdot, n_{1}+n_{2}}\)  
⋮  ⋮  ⋮  ⋮  ⋯  ⋮  ⋮  ⋮  ⋮  ⋯  ⋮  ⋮ 
I  1  X _{1 I1}  X _{2 I1}  ⋯  X _{ pI1}  Y_{I1}=I  n−n_{I}+1  \(X_{\cdot, n-n_{I}+1,1}\)  ⋯  \(X_{\cdot, n-n_{I}+1, p}\)  \(Y_{\cdot, n-n_{I}+1}\) 
2  X _{1 I2}  X _{2 I2}  ⋯  X _{ pI2}  Y_{I2}=I  n−n_{I}+2  \(X_{\cdot, n-n_{I}+2,1}\)  ⋯  \(X_{\cdot, n-n_{I}+2, p}\)  \(Y_{\cdot, n-n_{I}+2}\)  
3  X _{1 I3}  X _{2 I3}  ⋯  X _{ pI3}  Y_{I3}=I  n−n_{I}+3  \(X_{\cdot, n-n_{I}+3,1}\)  ⋯  \(X_{\cdot, n-n_{I}+3, p}\)  \(Y_{\cdot, n-n_{I}+3}\)  
⋮  ⋮  ⋮  ⋯  ⋮  ⋮  ⋮  ⋮  ⋯  ⋮  ⋮  
n _{ I}  \(X_{1In_{I}}\)  \(X_{2In_{I}}\)  ⋯  \(X_{pIn_{I}}\)  \(Y_{In_{I}} = I\)  n  X _{·, n,1}  ⋯  X _{·, n,p}  Y _{·, n} 
The objective here is to use the observed data to build models in order to predict the class label for a new subject using his/her observed predictor measurement.
Logistic regression model for multiclass response
With the multiclass response, we may consider the use of the logistic regression model by adapting the discussion of Agresti (2012, Section 7.1). For i=1,⋯,I and j=1,⋯,n, let π_{ij}(x_{·j})=P(Y_{·j}=i∣X_{·j}=x_{·j}) denote the conditional probability that subject j is in class i, given the predictor information X_{·j}=x_{·j}.
for i=1,⋯,I−1 and j=1,⋯,n, where \(\gamma = \left (\gamma _{01}, \gamma _{1}^{\top }, \gamma _{02}, \gamma _{2}^{\top },\cdots,\gamma _{0,I-1}, \gamma _{I-1}^{\top } \right)^{\top }\) is the vector of parameters with intercepts γ_{0i} and p-dimensional vectors γ_{i} of parameters.
where π_{ij}(x_{·j}) is determined by (2) or (3). Estimation of γ can proceed by maximizing (4). Let \(\widehat {\gamma } = \left (\widehat {\gamma }_{01}, \widehat {\gamma }_{1}^{\top }, \widehat {\gamma }_{02}, \widehat {\gamma }_{2}^{\top },\cdots, \widehat {\gamma }_{0,I-1}, \widehat {\gamma }_{I-1}^{\top } \right)^{\top } \) denote the resulting maximum likelihood estimate of γ.
To predict the class label for a new subject with a p-dimensional predictor vector \(\widetilde {x}\), we first calculate the right-hand side of (2) and (3) with \(\left (\gamma _{0i}, \gamma _{i}^{\top } \right)^{\top }\) replaced by the corresponding estimate obtained from the training data, and let \(\widehat {\pi }_{1},\cdots,\widehat {\pi }_{I}\) denote the corresponding values. Let i^{∗} denote the index corresponding to the largest value of \(\left \{ \widehat {\pi }_{1},\cdots,\widehat {\pi }_{I} \right \}\). Then the class label for this new subject is predicted as i^{∗}.
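As an illustration, this prediction rule (compute the fitted baseline-category probabilities, then take the argmax) can be sketched in Python with NumPy. The coefficient values below are hypothetical, standing in for the maximum likelihood estimates \(\widehat{\gamma}\) that glm-type software would return.

```python
import numpy as np

def predict_class(gamma0, Gamma, x_new):
    """Baseline-category logit prediction with class I as the reference.

    gamma0 : (I-1,) intercepts; Gamma : (I-1, p) slope vectors;
    x_new  : (p,) predictor vector. Returns the predicted class in 1..I.
    """
    eta = gamma0 + Gamma @ x_new      # log-odds of classes 1..I-1 versus class I
    eta = np.append(eta, 0.0)         # reference class I has log-odds 0
    pi = np.exp(eta - eta.max())      # unnormalized probabilities (numerically stable)
    pi /= pi.sum()
    return int(np.argmax(pi)) + 1     # index i* of the largest fitted probability

# Hypothetical fitted values for I = 3 classes and p = 2 predictors
gamma0 = np.array([0.5, -0.2])
Gamma = np.array([[1.0, -1.0],
                  [0.3,  0.8]])
print(predict_class(gamma0, Gamma, np.array([2.0, -1.0])))
```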
Classification with predictor graphical structures accommodated
In this section, we propose two classification methods for prediction which incorporate the network structure of the predictors. We first describe the use of graphical models to characterize the association structure of the predictors, and then explore two ways of building prediction models using the identified association structures.
Predictor network structure
Graphical models are useful for characterizing the network structures of the predictors. Here we describe how graphical models delineate possible association structures of the predictors. For j=1,⋯,n, we use an undirected graph, denoted G_{j}=(V_{j},E_{j}), to describe the relationship among the components of X_{·j}=(X_{·j1},⋯,X_{·jp})^{⊤}, where V_{j}={1,⋯,p} includes all the indices of predictors and V_{j}×V_{j} contains all pairs with unequal coordinates. A covariate X_{·jr} is called a vertex of the graph G_{j} if r∈V_{j}; a pair of predictors {X_{·jr},X_{·js}} is called an edge of the graph G_{j} if (r,s)∈E_{j}⊂V_{j}×V_{j}. In the setting we consider, the sets V_{j} and E_{j} are common for j=1,⋯,n, so we let V and E denote the vertex set and the edge set of the graph, respectively.
which is the Ising model without the singletons (Ravikumar et al. 2010).
where the function A(Θ) is the normalizing constant, and the θ_{st} and C(·) are defined as for (5). Model (7) is a special case of (5) which constrains the main-effect parameters β_{r} in (5) to be zero; a nonzero parameter θ_{st} implies that X_{·js} and X_{·jt} are conditionally dependent given the other predictors.
where D(·) is the normalizing constant ensuring that (8) integrates to one, and θ_{s}=(θ_{s1},⋯,θ_{s,s−1},θ_{s,s+1},⋯,θ_{sp})^{⊤} is a (p−1)-dimensional vector of parameters indicating the relationship of X_{·js} with all the other predictors X_{·jr} for r∈{1,⋯,p}∖{s} associated with (8).
where λ is a tuning parameter and ∥·∥_{1} is the L_{1}-norm. In principle, the L_{1}-norm in (9) may be replaced by other penalty functions such as the weighted L_{1}-norm (Zou 2006) and the non-concave penalty (Fan and Li 2001). Here we focus on using the L_{1}-norm, the well-known LASSO penalty (Tibshirani 1996), to determine informative pairwise-dependent predictors. The LASSO penalty is frequently considered when dealing with graphical models, and it has been implemented in R; for instance, the R packages huge and XMRF use the LASSO penalty to determine the network structure.
We comment that the estimator obtained from (9) depends on the choice of the tuning parameter λ. There is no unique way of selecting a suitable tuning parameter; methods such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), cross-validation (CV), and generalized cross-validation (GCV) may be considered. As suggested by Wang et al. (2007), BIC tends to outperform the others in many situations, especially in settings with a penalized likelihood function. Consequently, here we employ the BIC approach to select the tuning parameter λ.
where \(\text {df} \left \{ \widehat {\theta }_{s}(\lambda) \right \}\) represents the number of nonzero elements in \(\widehat {\theta }_{s}(\lambda)\) for a given λ. The optimal tuning parameter λ, denoted by \(\widehat {\lambda }\), is determined by minimizing (10) within a suitable range of λ. As a result, the estimator of θ_{s} is determined by \(\widehat {\theta }_{s} = \widehat {\theta }_{s}\left (\widehat {\lambda }\right)\).
The preceding procedure is repeated for all s∈V and yields the estimator \(\widehat {\theta }_{s}\) for all s∈V. There is an important point to note. For (s,t)∈E, the estimates \(\widehat {\theta }_{st}\) and \(\widehat {\theta }_{ts}\) are not necessarily identical, although θ_{st} and θ_{ts} are constrained to be equal. To overcome this problem, we apply the AND rule (Meinshausen and Bühlmann 2006; Hastie et al. 2015, p.255): the final estimates of \(\widehat {\theta }_{st}\) and \(\widehat {\theta }_{ts}\) are taken as their maximum if both are nonzero, and are set to zero if either of them is zero.
is taken as the set of edges that are estimated to exist. The R package huge can be used to display the graphical results.
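To make the node-wise procedure concrete, the following Python sketch carries out a simplified Gaussian (least-squares) version of the neighborhood selection: each predictor is regressed on the others with an L1 penalty, the tuning parameter is chosen per node by a BIC-type criterion as in (10), and the AND rule symmetrizes the supports. This is a stand-in for, not a reproduction of, the exponential-family pseudo-likelihood (9); all function names and the lambda grid are illustrative.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent solver for the L1-penalized least-squares problem."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual excluding j
            rho = X[:, j] @ r / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return beta

def neighborhood_select(X, lambdas):
    """Regress each X_s on the remaining predictors, pick lambda per node by a
    BIC-type criterion, then keep edge (s, t) only if both directions are
    nonzero (the AND rule)."""
    n, p = X.shape
    support = np.zeros((p, p), dtype=bool)
    for s in range(p):
        others = [r for r in range(p) if r != s]
        best_bic, best_beta = np.inf, None
        for lam in lambdas:
            beta = lasso_cd(X[:, others], X[:, s], lam)
            rss = ((X[:, s] - X[:, others] @ beta) ** 2).sum()
            df = int((beta != 0).sum())               # nonzero elements, as in (10)
            bic = n * np.log(rss / n) + np.log(n) * df
            if bic < best_bic:
                best_bic, best_beta = bic, beta
        support[s, others] = best_beta != 0
    return {(s, t) for s in range(p) for t in range(s + 1, p)
            if support[s, t] and support[t, s]}       # AND rule

# Toy data: X1 strongly tied to X0, X2 independent of both
rng = np.random.default_rng(1)
n = 400
x0 = rng.normal(size=n)
x1 = x0 + 0.3 * rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
X -= X.mean(axis=0)
edges = neighborhood_select(X, lambdas=[0.05, 0.1, 0.2, 0.4])
print(edges)
```

In this toy setting the procedure should recover the single edge between the first two predictors; in practice one would use the pseudo-likelihood (9) with the packages named above.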
Under mild regularity conditions, the estimated set of edges \(\widehat {E}\) approximates the true network structure E accurately, as shown in the following proposition, which is available in Ravikumar et al. (2010, Section 2.2) and Theorem 5(b) of Yang et al. (2015).
Proposition 1
Logistic regression with homogeneous graphically structured predictors
To incorporate the network structures of the predictors into building a prediction model, in the next two subsections, we present two methods which can be readily implemented using the R package huge and the R function glm for fitting a logistic regression model.
In the first method, called the logistic regression with homogeneous graphically structured predictors (LRHomoGraph) method, we consider the case where the subjects in different classes share a common network structure in the predictors. To build a prediction model, we make use of the development of the logistic model with multiclass responses, discussed by Agresti (2007, Section 6.1) and Agresti (2012, Section 7.1).
We first identify the pairwise dependence of the predictors using the measurements of all the subjects without distinguishing their class labels. Let \(\widehat {\theta }_{st}\) be the estimate of θ_{st} obtained from (9) using all the predictor measurements {X_{·j}:j=1,⋯,n}, and let \(\widehat {E} = \left \{ (s,t) : \widehat {\theta }_{st} \neq 0 \right \}\) denote the resulting estimated set of edges.
for i=1,2,⋯,I−1, where (α_{i0},α_{i,st})^{⊤} is the vector of parameters associated with class i and the constraint \(\sum \limits _{i=1}^{I} p_{ij}(x) = 1\) is imposed for every j=1,⋯,n.
where \(\alpha = \left (\alpha _{10}, \alpha _{1\cdot }^{\top },\cdots, \alpha _{(I-1)0}, \alpha _{(I-1)\cdot }^{\top } \right)^{\top }\) is the vector of parameters with vector \(\alpha _{i\cdot } = \left (\alpha _{i,st} : (s,t) \in \widehat {E} \right)^{\top }\) for i=1,⋯,I−1.
Finally, to predict the class label for a new subject with a p-dimensional predictor \(\widetilde {x}\), we first calculate the right-hand side of (14) and (15), and let \(\widetilde {\widehat {p}}_{1},\cdots,\widetilde {\widehat {p}}_{I}\) denote the corresponding values. Let i^{∗} denote the index corresponding to the largest value of \(\left \{ \widetilde {\widehat {p}}_{1},\cdots,\widetilde {\widehat {p}}_{I} \right \}\), i.e., \(i^{\ast } = \underset {1 \leq i \leq I}{\text {argmax}} \widetilde {\widehat {p}}_{i}\). Then the class label for this new subject is predicted as i^{∗}.
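The covariates that enter the LRHomoGraph fit are the pairwise products x_{s}x_{t} over the estimated edge set \(\widehat{E}\). A minimal sketch of this feature construction, with a hypothetical edge set, is:

```python
import numpy as np

def edge_features(x, E_hat):
    """Map a p-dimensional predictor vector to the pairwise products x_s * x_t
    over the estimated edge set -- the covariates used by the LRHomoGraph model.
    Edges are sorted so the feature order is reproducible."""
    return np.array([x[s] * x[t] for (s, t) in sorted(E_hat)])

# Hypothetical estimated edge set over p = 4 predictors (0-based indices)
E_hat = {(0, 1), (2, 3)}
x = np.array([1.0, 2.0, -1.0, 3.0])
print(edge_features(x, E_hat))
```

The resulting features would then be passed to a multinomial logistic fit, e.g., via glm or similar routines in R as described above.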
Logistic regression with class-dependent graphically structured predictors
We now present an alternative to the method described in “Logistic regression with homogeneous graphically structured predictors” section. Instead of pooling all the covariates to characterize the covariate network structure, this method, called the logistic regression with class-dependent graphically structured covariates (LRClassGraph) method, stratifies the covariate information by class when characterizing the covariate network structures.
Then the class label for this new subject is predicted as i^{∗}.
Comparison of decision boundaries
As noted in “Logistic regression with homogeneous graphically structured predictors” and “Logistic regression with class-dependent graphically structured predictors” sections, while both the LRHomoGraph and LRClassGraph methods employ logistic regression to classify, they differ in the way they characterize predictor structures. Furthermore, we may compare their differences in terms of decision boundaries.
Comparing (21) with (19) or (20) shows that the decision boundaries for both the LRHomoGraph and LRClassGraph methods are quadratic surfaces determined by the features selected from the graphical models. However, the two methods incorporate the features differently. The boundaries (21) are determined by the quadratic terms identified using instances from classes i and k separately, whereas the quadratic terms in the boundary (19) or (20) are not distinguished by the class labels. In addition, the coefficients \(\widehat {\gamma }_{st}^{i}\) and \(\widehat {\alpha }_{i,st}\) associated with the decision boundaries are generally different.
Evaluation of the performance
In this section we discuss the evaluation of the procedures proposed in “Logistic regression with homogeneous graphically structured predictors” and “Logistic regression with class-dependent graphically structured predictors” sections. For comparison, we also examine some conventional classification methods in machine learning, including the support vector machine (SVM), linear discriminant analysis (LDA), K-nearest neighbor (KNN), and extreme gradient boosting (XGBOOST). We first describe commonly used measures of assessing prediction error, and then briefly review the four classification methods.
Criteria for performance
In principle, higher values of PRE, REC, and F, under both the micro and macro versions, reflect better performance of a method (Parambath et al. 2018; Sokolova et al. 2006).
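Assuming the usual confusion-matrix definitions of these quantities, the micro and macro versions can be computed as follows. The macro F below is taken as the harmonic mean of macro precision and macro recall, one common convention; exact definitions vary across references, so treat this as an illustrative sketch.

```python
import numpy as np

def micro_macro_scores(C):
    """Precision, recall, and F from a confusion matrix C whose rows index
    true classes and columns predicted classes. Assumes every class is
    predicted and observed at least once (no zero row/column sums)."""
    tp = np.diag(C).astype(float)
    pred = C.sum(axis=0).astype(float)     # column sums: predicted counts
    true = C.sum(axis=1).astype(float)     # row sums: actual counts
    pre_micro = tp.sum() / pred.sum()
    rec_micro = tp.sum() / true.sum()
    f_micro = 2 * pre_micro * rec_micro / (pre_micro + rec_micro)
    pre_macro = np.mean(tp / pred)         # average of per-class precisions
    rec_macro = np.mean(tp / true)         # average of per-class recalls
    f_macro = 2 * pre_macro * rec_macro / (pre_macro + rec_macro)
    return pre_micro, rec_micro, f_micro, pre_macro, rec_macro, f_macro

C = np.array([[2, 1],
              [0, 3]])
print(micro_macro_scores(C))
```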
Support vector machine for multiclass responses
The support vector machine (SVM) was originally designed for two-class classification (Hastie et al. 2008, Sec. 12.2), and its extensions to multiclass responses have been discussed by many authors. An early extension of the SVM to multiclass classification is the one-against-all method (Hsu and Lin 2002). The main idea is that the ith SVM is trained with positive labels for all subjects in the ith class and negative labels for all other subjects. This type of SVM for multiclass classification, however, ignores the heterogeneity among the subjects in each class.
where ϕ(·) is a nonlinear mapping from a p-dimensional vector to a q-dimensional vector with q>p (Hsu and Lin 2002), \(w^{i_{1}i_{2}}\) is a q-dimensional vector of parameters associated with the comparison between classes i_{1} and i_{2}, \(b^{i_{1}i_{2}}\) is a scalar, \(\xi _{j}^{i_{1}i_{2}}\) is the slack variable for the soft-margin solution, and C is a cost parameter controlling the balance between maximizing the margin and minimizing the training error.
Step 1. For class i=1,⋯,I, the initial value of vote(i) is set as 0.
Step 2. For any given class i, we consider the sub-collection of \(\mathcal {L}\), \(\left \{ (i,i'): i'=i+1,\cdots,I \right \}\), which is associated with class i. Calculate \(\text {sign}\left \{ (w^{ii'})^{\top } \phi (\widetilde {x}) + b^{ii'} \right \}\) repeatedly for i^{′}=i+1,⋯,I, and then determine the values of vote(i) and vote(i^{′}) iteratively by the rule: if \(\text {sign}\left \{ (w^{ii'})^{\top } \phi (\widetilde {x}) + b^{ii'} \right \} > 0\), then vote(i)=vote(i)+1; otherwise, vote(i^{′})=vote(i^{′})+1,
where vote(i^{′}) on the right-hand side of the equation is the value determined by the previous step, vote(i^{′}) on the left-hand side represents the newly determined value, and i^{′}=i+1,⋯,I.
Step 3. Repeat Step 2 for i=1,⋯,I. In this way, we determine the final values of vote(1),⋯,vote(I). Let i^{∗} denote the class index corresponding to the largest value of {vote(1),⋯,vote(I)}, i.e., \(i^{\ast } = \underset {1 \leq i \leq I}{\text {argmax}} \left \{vote(i)\right \}\). Then we let i^{∗} be the predicted class for the new instance.
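The voting scheme in Steps 1–3 can be sketched as follows. The pairwise decision function below is a toy stand-in for the fitted SVM scores \((w^{ii'})^{\top}\phi(\widetilde{x}) + b^{ii'}\); in practice those come from trained pairwise SVMs.

```python
def ovo_predict(decision, I, x):
    """One-against-one voting. decision(i, ip, x) returns the sign of the
    pairwise score for classes (i, ip). Classes are labeled 1..I; ties are
    broken in favor of the smaller label (a convention assumed here)."""
    vote = {i: 0 for i in range(1, I + 1)}
    for i in range(1, I + 1):
        for ip in range(i + 1, I + 1):
            if decision(i, ip, x) > 0:
                vote[i] += 1       # class i wins this pairwise contest
            else:
                vote[ip] += 1      # class ip wins
    return max(vote, key=lambda i: (vote[i], -i))

# Toy stand-in for fitted pairwise SVM scores: the class whose index is
# closer to the scalar input x wins each contest.
decision = lambda i, ip, x: 1 if abs(x - i) < abs(x - ip) else -1
print(ovo_predict(decision, I=3, x=2.2))
```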
Linear discriminant analysis
for i=1,⋯,I and j=1,⋯,n.
and the class label for this subject is then predicted as i^{∗}.
K-nearest neighbor
The third classification method we compare with is the K-nearest neighbor (KNN) method, which is a nonparametric approach. The key idea of KNN is to use the available instances to estimate the conditional probability of Y_{·j} given X_{·j}, and then classify a new instance into the class with the highest estimated conditional probability.
Finally, let i^{∗} denote the class label which corresponds to the largest value of \(\left \{ \widehat {\pi }_{1},\cdots, \widehat {\pi }_{I} \right \}\). Then the class label for this new subject is predicted as i^{∗}.
For the KNN method, a crucial issue is the selection of K. A small value of K usually yields an over-flexible decision boundary, which makes the classifier have small bias but large variance. Conversely, with a large K, the boundary becomes less flexible and close to linear, and the classifier has small variance but large bias. To determine an optimal K from the theoretical perspective, James et al. (2017, p. 184 and p. 186) suggested using cross-validation to select K; from the computational viewpoint, however, a choice of K is sometimes based on a random guess, as commented by James et al. (2017, p. 167).
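A minimal sketch of KNN with K chosen by leave-one-out cross-validation, in the spirit of the suggestion of James et al.; the data, grid of K values, and function names are illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, K):
    """Classify x_new by majority vote among its K nearest training points."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    nearest = y_train[np.argsort(d)[:K]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

def choose_K(X, y, K_grid):
    """Pick K by leave-one-out cross-validation error."""
    errors = []
    for K in K_grid:
        wrong = sum(
            knn_predict(np.delete(X, j, axis=0), np.delete(y, j), X[j], K) != y[j]
            for j in range(len(y))
        )
        errors.append(wrong / len(y))
    return K_grid[int(np.argmin(errors))]

# Two well-separated toy classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)), rng.normal(3, 0.5, size=(20, 2))])
y = np.array([1] * 20 + [2] * 20)
K_best = choose_K(X, y, K_grid=[1, 3, 5, 7])
print(K_best, knn_predict(X, y, np.array([3.0, 3.0]), K_best))
```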
Extreme gradient boosting
Extreme gradient boosting (XGBOOST) is a tree-based ensemble method built under the gradient boosting framework (e.g., Chen and Guestrin 2016) and can be implemented using the R package xgboost.
for an example with the input x_{·j}.
with \(p_{{ij}} = \frac {\exp \left (\widehat {y}_{{ij}}\right)}{1 + \sum \limits _{l=1}^{I-1} \exp \left (\widehat {y}_{{lj}}\right)}\) for i=1,…,I−1 and \(p_{{Ij}} = 1 - \sum \limits _{i=1}^{I-1} p_{{ij}}\).
where g_{j} and h_{j} are, respectively, the first- and second-order gradients of the loss function \(L(y_{\cdot j},\widehat {y}^{(t-1)})\) with respect to \(\widehat {y}^{(t-1)}\).
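Assuming a softmax form of the class probabilities, these first- and second-order gradients have the familiar closed forms \(p_{ij} - \mathbf{1}\{y_{\cdot j}=i\}\) and \(p_{ij}(1-p_{ij})\); a small numerical sketch (the score values are hypothetical):

```python
import numpy as np

def softmax_grad_hess(y_hat, labels):
    """First- and second-order gradients of the multiclass log loss with
    respect to the current scores y_hat -- the g_j and h_j quantities used
    in the second-order approximation of the boosting objective.

    y_hat : (n, I) current scores; labels : (n,) true classes in 0..I-1.
    """
    p = np.exp(y_hat - y_hat.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)        # softmax probabilities
    onehot = np.eye(y_hat.shape[1])[labels]
    g = p - onehot                           # gradient: p_ij - 1{y_j = i}
    h = p * (1.0 - p)                        # diagonal second-order terms
    return g, h

y_hat = np.array([[0.2, 1.0, -0.5]])
g, h = softmax_grad_hess(y_hat, np.array([1]))
print(g, h)
```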
Numerical studies
In this section, we first conduct simulation studies to evaluate the performance of the procedures proposed in “Classification with predictor graphical structures accommodated” section, and then apply the procedures to a real dataset to illustrate their usage. The discussion is carried out in contrast to the classification methods reviewed in “Evaluation of the performance” section, as well as the usual multiclass logistic regression model in “Logistic regression model for multiclass response” section. The R functions svm (package e1071), lda (MASS), knn.cv (class), and xgboost (xgboost) are used to implement the SVM, LDA, KNN, and XGBOOST methods, respectively.
Simulation study
For class i=1,⋯,I, the predictors are generated from the multivariate normal distribution with mean zero and covariance matrix \(\Sigma _{i} = \Omega _{i}^{-1}\), where Ω_{i} is a matrix associated with the network structure in class i, with all diagonal elements 1 and off-diagonal elements 0 or 1; for s≠t, entry (s,t) is 1 if an edge exists between X_{s} and X_{t} and 0 otherwise. The relationship between a multivariate normal distribution N(0,Σ_{i}) and the Gaussian graphical model with edges determined by \(\Omega _{i} = \Sigma _{i}^{-1}\) is discussed by Hastie et al. (2015, p.246 and p.263).
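This generating scheme can be sketched as follows: build a precision matrix Ω from an adjacency pattern, invert it to obtain Σ, and draw predictors from N(0,Σ). The edge weight 0.4 (rather than 1) and the chain structure are assumptions made here purely to keep the illustrative Ω positive definite.

```python
import numpy as np

p = 4
adj = np.zeros((p, p))
for s in range(p - 1):                 # chain structure: edges (s, s+1)
    adj[s, s + 1] = adj[s + 1, s] = 1
Omega = np.eye(p) + 0.4 * adj          # precision matrix encoding the network
Sigma = np.linalg.inv(Omega)           # covariance used to generate predictors

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=20000)
print(np.round(np.cov(X, rowvar=False), 2))   # should approximate Sigma
```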
Simulation study with and without network structures for covariates, respectively, indicated by Scenarios 1 and 2: I=4
Scenario  Criteria  Agresti  SVM  LDA  KNN  XGBOOST  LRHomoGraph  LRClassGraph 

1  PRE _{ micro}  0.635  0.830  0.640  0.678  0.690  0.841  0.890 
REC _{ micro}  0.635  0.830  0.640  0.700  0.690  0.841  0.890  
F _{ micro}  0.635  0.830  0.640  0.689  0.690  0.841  0.890  
PRE _{ macro}  0.637  0.843  0.643  0.686  0.688  0.847  0.898  
REC _{ macro}  0.635  0.830  0.640  0.704  0.690  0.842  0.891  
F _{ macro}  0.636  0.836  0.641  0.695  0.689  0.844  0.894  
2  PRE _{ micro}  0.703  0.855  0.739  0.672  0.790  0.851  0.861 
REC _{ micro}  0.717  0.855  0.734  0.672  0.790  0.851  0.866  
F _{ micro}  0.710  0.855  0.736  0.672  0.790  0.851  0.863  
PRE _{ macro}  0.706  0.805  0.740  0.704  0.792  0.859  0.860  
REC _{ macro}  0.717  0.855  0.733  0.672  0.790  0.862  0.866  
F _{ macro}  0.711  0.830  0.736  0.687  0.791  0.860  0.863 
Simulation study with and without network structures for covariates, respectively, indicated by Scenarios 1 and 2: I=2
Scenario  Criteria  Agresti  SVM  LDA  KNN  XGBOOST  LRHomoGraph  LRClassGraph 

1  PRE _{ micro}  0.625  0.835  0.625  0.685  0.615  0.825  0.965 
REC _{ micro}  0.625  0.835  0.625  0.685  0.615  0.825  0.965  
F _{ micro}  0.625  0.835  0.625  0.685  0.615  0.825  0.965  
PRE _{ macro}  0.625  0.866  0.626  0.688  0.615  0.828  0.965  
REC _{ macro}  0.626  0.835  0.625  0.685  0.615  0.825  0.965  
F _{ macro}  0.625  0.850  0.626  0.686  0.615  0.825  0.965  
2  PRE _{ micro}  0.860  0.985  0.850  0.565  0.775  0.825  0.795 
REC _{ micro}  0.860  0.985  0.850  0.565  0.775  0.825  0.795  
F _{ micro}  0.860  0.985  0.850  0.565  0.775  0.825  0.795  
PRE _{ macro}  0.861  0.981  0.605  0.850  0.775  0.820  0.770  
REC _{ macro}  0.860  0.984  0.605  0.850  0.775  0.820  0.772  
F _{ macro}  0.861  0.982  0.605  0.850  0.775  0.820  0.771 
Glass identification dataset
We analyze a dataset concerning glass identification. The study of classifying glass types was motivated by criminological investigation: glass left at the scene of a crime can be used as evidence if it is correctly identified. It is of interest to predict the glass type based on the information of the predictors. The glass types are:
building_windows_float_processed (Glass1),
building_windows_non_float_processed (Glass2),
vehicle_windows_float_processed (Glass3),
vehicle_windows_non_float_processed (Glass4),
containers (Glass5),
tableware (Glass6), and
headlamps (Glass7).
We first present the network structures of the different chemical materials in each class. The network structure for each class is determined by (9) and (11). The graphical results are reported in Fig. 4. It is seen that the network structure of the predictors differs from class to class. We notice that RI has no connection with the other variables in any class, and the predictor FE also has no connection with the others except in class 6.
Classification results for glass data
Glass1  Glass2  Glass3  Glass5  Glass6  Glass7  

Agresti  Glass1  52  19  10  0  0  0 
Glass2  18  54  7  3  0  2  
Glass3  0  0  0  0  0  0  
Glass5  0  1  0  9  0  0  
Glass6  0  0  0  0  9  0  
Glass7  0  2  0  1  0  27  
MIS_{i}  0.257  0.289  1.000  0.307  0.000  0.069  
SVM  Glass1  59  15  10  0  0  1 
Glass2  11  61  7  0  1  0  
Glass3  0  0  0  0  0  0  
Glass5  0  0  0  13  0  0  
Glass6  0  0  0  0  8  0  
Glass7  0  0  0  0  0  28  
MIS_{i}  0.157  0.197  1.000  0.000  0.111  0.034  
LDA  Glass1  46  16  3  0  1  0 
Glass2  14  41  3  2  1  1  
Glass3  10  12  11  0  0  1  
Glass5  0  4  0  10  0  2  
Glass6  0  3  0  0  7  1  
Glass7  0  0  0  1  0  24  
MIS_{i}  0.343  0.461  0.353  0.231  0.222  0.172  
KNN  Glass1  51  17  14  0  0  1 
Glass2  12  52  1  3  2  2  
Glass3  7  2  2  0  0  0  
Glass5  0  3  0  8  0  1  
Glass6  0  2  0  0  4  2  
Glass7  0  0  0  2  3  22  
MIS_{i}  0.271  0.316  0.882  0.385  0.556  0.241  
XGBOOST  Glass1  56  7  7  0  0  1 
Glass2  8  61  5  6  2  0  
Glass3  5  2  4  0  0  0  
Glass5  0  4  0  6  1  2  
Glass6  1  1  1  0  6  0  
Glass7  0  1  0  1  0  26  
MIS_{i}  0.200  0.197  0.765  0.538  0.333  0.103  
LRHomoGraph  Glass1  60  15  8  0  0  1 
Glass2  10  60  6  0  0  0  
Glass3  0  1  3  0  0  0  
Glass5  0  0  0  13  0  0  
Glass6  0  0  0  0  9  0  
Glass7  0  0  0  0  0  28  
MIS_{i}  0.143  0.211  0.824  0.000  0.000  0.034  
LRClassGraph  Glass1  61  10  8  0  0  1 
Glass2  9  66  6  0  0  0  
Glass3  0  0  3  0  0  0  
Glass5  0  0  0  13  0  0  
Glass6  0  0  0  0  9  0  
Glass7  0  0  0  0  0  28  
MIS_{i}  0.129  0.132  0.824  0.000  0.000  0.034 
Overall performance of classification methods applied to glass data
Agresti  SVM  LDA  KNN  XGBOOST  LRHomoGraph  LRClassGraph  LRHomoGraph+main  LRClassGraph+main  
PRE _{ micro}  0.706  0.790  0.649  0.653  0.677  0.808  0.841  0.776  0.783 
REC _{ micro}  0.706  0.790  0.650  0.653  0.730  0.808  0.841  0.776  0.794 
F _{ micro}  0.706  0.790  0.649  0.653  0.703  0.808  0.841  0.776  0.788 
PRE _{ macro}  0.681  0.743  0.651  0.583  0.615  0.876  0.929  0.817  0.816 
REC _{ macro}  0.680  0.755  0.703  0.563  0.644  0.800  0.814  0.774  0.816 
F _{ macro}  0.680  0.749  0.676  0.573  0.629  0.836  0.868  0.795  0.816 
Discussion
In this paper, we propose logistic regression methods for prediction with data having network structures in the predictors. In our methods, we first identify the network structures of the predictors for every class using graphical models, and then capitalize on the identified network structures to fit a logistic regression model for classification and prediction. Simulation studies demonstrate that, in the presence of network structures in the covariates, our proposed methods produce more accurate classification results than conventional methods such as SVM, LDA, KNN, and XGBOOST. To allow interested readers to use the algorithms developed in “Classification with predictor graphical structures accommodated” section, the implementation procedures will be posted at CRAN.
Our development here focuses on examining pairwise dependence structures among predictors using the formulation (7). This is primarily driven by the consideration that such a dependence structure is intuitively interpretable and commonly exists in many problems. Extensions to accommodate triple-wise or higher-order dependence structures among predictors, or even the main effects (i.e., single-variable effects), can be carried out by extending (7) to the form (9.5) of Hastie et al. (2015). Such extensions are, in principle, straightforward to implement technically, but the issue of overfitting may arise. In addition, underlying constraints on the model parameters may become a complex concern in numerical implementation. Discussions on this aspect were given by many authors, including Yang et al. (2015), Yi (2017), and Yi et al. (2017). Our discussion in this paper is directed to using the exponential family distribution to handle continuous predictors. It is easy to extend our methods to accommodate mixed graphical models which feature both continuous and discrete predictors.
In obtaining the estimator (9), we use the L_{1}-norm or the LASSO penalty, driven by its popularity as well as the availability of implementation software (e.g., the R packages huge and XMRF). However, the methods described in the “Classification with predictor graphical structures accommodated” section are not confined to the LASSO penalty; they apply equally when other penalty functions are used. For instance, the elastic-net, SCAD, adaptive LASSO, or L_{2}-norm penalties can replace the LASSO penalty in deriving the estimator (9), and the remaining procedures developed in that section still carry through. As noted by a referee, it would be interesting to conduct numerical studies with different penalty functions to compare how results differ with and without incorporating the network structure in the analysis. Though we cannot exhaust numerical explorations of all possible penalty functions in this paper, the implementation framework presented in the “Classification with predictor graphical structures accommodated” section allows users to take any penalty function that suits their own problems.
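To illustrate how an alternative penalty changes the estimation, the sketch below (our own illustration, not code from the paper) contrasts the proximal (thresholding) operators of the LASSO and SCAD penalties. Within a coordinate-descent algorithm, swapping one operator for the other is one concrete way a different penalty enters the computation of an estimator such as (9).

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO (L1) proximal operator: shrinks every coefficient by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD proximal operator (Fan and Li 2001): shrinks small coefficients
    like the LASSO but leaves large coefficients nearly unbiased."""
    az = np.abs(z)
    return np.where(
        az <= 2 * lam,
        np.sign(z) * np.maximum(az - lam, 0.0),          # LASSO-type shrinkage
        np.where(az <= a * lam,
                 ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),  # transition
                 z))                                      # no shrinkage

# A large effect: the LASSO still shrinks it; SCAD leaves it unchanged.
print(soft_threshold(np.array([5.0]), 1.0))   # [4.]
print(scad_threshold(np.array([5.0]), 1.0))   # [5.]
```

The adaptive LASSO fits the same template with a coefficient-specific lam, while the L_{2}-norm penalty replaces thresholding by proportional shrinkage.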
Finally, we comment that several aspects of the methods described in the “Classification with predictor graphical structures accommodated” section warrant further research. As pointed out by a referee, our methods are developed for problems with low-dimensional data (i.e., p<n) and are not applicable to data with p≥n. In the current digital world, it is not uncommon to handle data with thousands of predictor variables but a much smaller sample size. In such circumstances, dimension reduction or feature screening techniques would be employed before proceeding with formal data analysis. It would be interesting to generalize our methods to handle high-dimensional data with p of a polynomial order of n, or even ultra high-dimensional data with p of an exponential order of n.
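As an illustration of such a pre-processing step (a hypothetical sketch, not part of the proposed methods), marginal correlation screening in the spirit of sure independence screening can reduce p below n before the graphical-model step is applied:

```python
import numpy as np

def marginal_screen(X, y, d):
    """Rank predictors by absolute marginal Pearson correlation with the
    class label and keep the top d; a simple screening rule in the spirit
    of sure independence screening."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    denom = Xc.std(axis=0) * yc.std() * len(y)
    corr = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    keep = np.argsort(corr)[::-1][:d]          # indices of the d largest
    return np.sort(keep)

rng = np.random.default_rng(1)
n, p = 50, 200                                 # p >> n
X = rng.standard_normal((n, p))
# Binary labels driven by columns 3 and 7 only.
y = (X[:, 3] + X[:, 7] + 0.5 * rng.standard_normal(n) > 0).astype(float)
kept = marginal_screen(X, y, d=20)
print(kept)  # should typically include columns 3 and 7
```

After screening, the retained predictors satisfy d<n, and the graphical-model and logistic-regression steps can proceed as before.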
Our methods involve two steps in using the covariate measurements and class labels. In the first step, we utilize undirected graphs to examine the covariate measurements alone; the class information comes into play only in the second step, when logistic regression is used for classification. Alternatively, one may consider using directed acyclic graphs to feature conditional independencies among variables and develop probabilistic graphical models for classification. To evaluate the performance of the proposed methods, we focus on comparisons with the competing classifiers reviewed in the “Evaluation of the performance” section. While those algorithms cover a good range of available classifiers, the comparisons are by no means exhaustive. Despite the frequentist nature of our methods, it would be interesting to compare them to Bayesian network classifiers, which have proven useful in applications (e.g., Geiger and Heckerman 1996; Pérez et al. 2006; Bielza and Larrañaga 2014). Furthermore, it is worthwhile to employ rigorous hypothesis testing procedures to evaluate whether the differences in the results obtained from different classifiers are statistically significant.
Declarations
Acknowledgements
This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and partially supported by a Collaborative Research Team Project of the Canadian Statistical Sciences Institute (CANSSI).
Authors’ contributions
The first two authors led the project with equal contributions, including writing the paper; the last two authors participated in the project with equal contributions.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 Agresti, A.: An Introduction to Categorical Data Analysis. Wiley, New York (2007).
 Agresti, A.: Categorical Data Analysis. Wiley, New York (2012).
 Bagirov, A. M., Ferguson, B., Ivkovic, S., Saunders, G., Yearwood, J.: New algorithms for multiclass cancer diagnosis using tumor gene expression signatures. Bioinformatics. 19, 1800–1807 (2003).
 Baladandayuthapani, V., Talluri, R., Ji, Y., Coombes, K. R., Lu, Y., Hennessy, B. T., Davies, M. A., Mallick, B. K.: Bayesian sparse graphical models for classification with application to protein expression data. Ann. Appl. Stat. 8, 1443–1468 (2014).
 Bicciato, S., Luchini, A., Bello, C. D.: PCA disjoint models for multiclass cancer analysis using gene expression data. Bioinformatics. 19, 571–578 (2003).
 Bielza, C., Li, G., Larrañaga, P.: Multi-dimensional classification with Bayesian networks. Int. J. Approx. Reason. 52, 705–727 (2011).
 Bielza, C., Larrañaga, P.: Discrete Bayesian network classifiers: a survey. ACM Comput. Surv. 47, 1–43 (2014).
 Cai, W., Guan, G., Pan, R., Zhu, X., Wang, H.: Network linear discriminant analysis. Comput. Stat. Data Anal. 117, 32–44 (2018).
 Cetiner, M., Akgul, Y. S.: In: T., C., E., G., R., L. (eds.) Information Sciences and Systems 2014, 2nd edn, pp. 53–76. Springer, New York (2014).
 Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), pp. 785–794. ACM, San Francisco (2016). http://doi.org/10.1145/2939672.2939785.
 Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000).
 Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001).
 Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 9, 432–441 (2008).
 Geiger, D., Heckerman, D.: Knowledge representation and inference in similarity networks and Bayesian multinets. Artif. Intell. 82, 45–74 (1996).
 Guo, Y., Hastie, T., Tibshirani, R.: Regularized linear discriminant analysis and its application in microarrays. Biostatistics. 8, 86–100 (2007).
 Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2008).
 Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, New York (2015).
 Hsu, C. W., Lin, C. J.: A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 13, 415–425 (2002).
 Huttenhower, C., Flamholz, A. I., Landis, J. N., Sahi, S., Myers, C. L., Olszewski, K. L., Hibbs, M. A., Siemers, N. O., Troyanskaya, O. G., Coller, H. A.: Nearest neighbor networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics. 8, 1–13 (2007).
 James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning: with Applications in R. Springer, New York (2017).
 Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: F.F., S., J., H. (eds.) Neurocomputing: Algorithms, Architectures and Applications, 1st edn, pp. 41–50. Springer, Berlin (1990).
 Lee, Y., Lee, C. K.: Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics. 19, 1132–1139 (2003).
 Lee, J., Hastie, T. J.: Learning the structure of mixed graphical models. J. Comput. Graph. Stat. 24, 230–253 (2015).
 Liu, J. J., Cutler, G., Li, W., Pan, Z., Peng, S., Hoey, T., Chen, L., Ling, X. B.: Multiclass cancer classification and biomarker discovery using GA-based algorithms. Bioinformatics. 21, 2691–2697 (2005).
 Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34, 1436–1462 (2006).
 Miguel Hernández-Lobato, J., Hernández-Lobato, D., Suárez, A.: Network-based sparse Bayesian classification. Pattern Recognit. 44, 886–900 (2011).
 Parambath, S. A. P., Usunier, N., Grandvalet, Y.: Optimizing pseudo-linear performance measures: application to F-measure (2018). arXiv:1505.00199v4. Accessed 1 Jan 2018.
 Pérez, A., Larrañaga, P., Inza, I.: Supervised classification with conditional Gaussian networks: increasing the structure complexity from naive Bayes. Int. J. Approx. Reason. 43, 1–25 (2006).
 Peterson, C. B., Stingo, F. C., Vannucci, M.: Joint Bayesian variable and graph selection for regression models with network-structured predictors. Stat. Med. 35, 1017–1031 (2015).
 Ravikumar, P., Wainwright, M. J., Lafferty, J.: High-dimensional Ising model selection using ℓ _{1}-regularized logistic regression. Ann. Stat. 38, 1287–1319 (2010).
 Safo, S. E., Ahn, J.: General sparse multiclass linear discriminant analysis. Comput. Stat. Data Anal. 99, 81–90 (2016).
 Sokolova, M., Japkowicz, N., Szpakowicz, S.: In: A., S., B., K. (eds.) AI 2006: Advances in Artificial Intelligence, 1st edn, pp. 53–76. Springer, Berlin (2006).
 Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Royal Stat. Soc. Ser. B. 58, 267–288 (1996).
 Wang, H., Li, R., Tsai, C.: Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 94, 553–568 (2007).
 Yang, E., Ravikumar, P., Allen, G. I., Liu, Z.: Graphical models via univariate exponential family distributions. J. Mach. Learn. Res. 16, 3813–3847 (2015).
 Yi, G. Y.: Composite likelihood/pseudolikelihood. Wiley StatsRef: Stat. Ref. Online (2017). https://doi.org/10.1002/9781118445112.stat07855.
 Yi, G. Y., He, W., Li, H.: A class of flexible models for analysis of complex structured correlated data with application to clustered longitudinal data. Stat. 6, 448–461 (2017).
 Zhu, Y., Shen, X., Pan, W.: Network-based support vector machine for classification of microarray samples. BMC Bioinformatics. 10, 1–11 (2009).
 Zi, X., Liu, Y., Gao, P.: Mutual information network-based support vector machine for identification of rheumatoid arthritis-related genes. Int. J. Clin. Exp. Med. 9, 11764–11771 (2016).
 Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006).