Multiclass analysis and prediction with network structured covariates

Chen, Li-Pang; Yi, Grace Y.; Zhang, Qihuang; He, Wenqing

doi:10.1186/s40488-019-0094-2

Methodology
Open access
Published: 06 June 2019

Journal of Statistical Distributions and Applications volume 6, Article number: 6 (2019)

Abstract

Technological advances associated with data acquisition are leading to the production of complex structured data sets. The recent development on classification with multiclass responses makes it possible to incorporate the dependence structure of predictors. The available methods, however, are hindered by the restrictive requirements. Those methods basically assume a common network structure for predictors of all subjects without taking into account the heterogeneity existing in different classes. Furthermore, those methods mainly focus on the case where the distribution of predictors is normal. In this paper, we propose classification methods which address these limitations. Our methods are flexible in handling possibly class-dependent network structures of variables and allow the predictors to follow a distribution in the exponential family which includes normal distributions as a special case. Our methods are computationally easy to implement. Numerical studies are conducted to demonstrate the satisfactory performance of the proposed methods.

Introduction

In contemporary statistical inference and machine learning theory, classification and prediction are of great importance and many approaches have been proposed. Those methods typically include the support vector machine (SVM), linear discriminant analysis (LDA), and K-nearest neighbors (KNN) (Hastie et al. 2008; James et al. 2017). These methods have widespread applications and their extensions to accommodating complex settings have been proposed. For example, Lee and Lee (2003) studied multicategory support vector machines for classification of multiple types of cancer. Cristianini and Shawe-Taylor (2000) presented comprehensive discussions of SVM methods. Guo et al. (2007) discussed the LDA method and its application in microarray data analysis. Safo and Ahn (2016) considered the multiclass analysis by performing the generalized sparse linear discriminant analysis. Regarding analysis of multiclass classification problems, Bagirov et al. (2003) proposed a new algorithm for multiclass cancer data. Bicciato et al. (2003) presented disjoint models for multiclass cancer analysis using the principal component technique. Liu et al. (2005) proposed the genetic algorithm (GA)-based algorithm to carry out multiclass cancer classification.

Recent developments in classification further incorporate the dependence structure of predictors. For example, Cetiner and Akgul (2014) developed a graphical-model-based method for multi-label classification. Zhu and Pan (2009) proposed the network-based support vector machine for binary classification of microarray samples. Zi et al. (2016) discussed identification of rheumatoid arthritis-related genes by using the network-based support vector machine. Cai et al. (2018) considered the network linear discriminant analysis. Huttenhower et al. (2007) proposed the nearest neighbor network approach. In the Bayesian paradigm, various classification approaches that accommodate network structures have been explored, such as Bielza et al. (2011), Miguel Hernández-Lobato et al. (2011), Baladandayuthapani et al. (2014), and Peterson et al. (2015).

Although there are methods that handle network structures in classification, they basically assume a common network structure for the predictors of all subjects, without taking into account possible heterogeneity across classes. To overcome this shortcoming, in this paper we propose classification methods with possibly class-dependent network structures of predictors taken into account. Our methods utilize the graphical model theory and allow the predictors to follow an exponential family distribution, instead of a restrictive normal distribution. Furthermore, we develop a prediction criterion for multiclass classification which accommodates pairwise dependence structures among the predictors. Our methods incorporate informative predictors with pairwise dependence structures into classification procedures, and they are computationally easy to implement.

The remainder of the paper is organized as follows. In “Data structure and framework” section, we introduce the data structure and review a convenient multiclass classification method for simple settings. In “Classification with predictor graphical structures accommodated” section, we describe the basics of graphical model theory and propose two methods for multiclass classification to accommodate network structures of predictors. In “Evaluation of the performance” section, we describe the criteria for evaluating the performance of the proposed methods, and briefly review several competing classification methods for comparisons. In “Numerical studies” section, we conduct simulation studies to assess the performance of the proposed methods, and apply the proposed methods to analyze a real dataset for illustration. A general discussion is presented in the last section.

Data structure and framework

In this section, we present the data structure with multiclass responses and introduce the basic notation.

Notation

Suppose the data of n subjects come from I classes, where I is an integer no smaller than 2 and the classes are free of order, i.e., they are nominal. Let n_i be the class size in class i with i=1,⋯,I, and hence $n = \sum \limits _{i=1}^{I} n_{i}$. Define Y_ik=i for class i=1,⋯,I and subject k=1,⋯,n_i, and let $Y = \left (Y_{11}, Y_{12}, \cdots, Y_{1n_{1}}, Y_{21}, \cdots, Y_{2n_{2}}, \cdots, Y_{I1}, \cdots, Y_{In_{I}} \right)^{\top }$ denote the n-dimensional random vector of response. Let Y_·j denote the jth component of Y. In other words, if we ignore the class information, then Y_·j represents the response (or the class membership) for the jth subject in the sample, where j=1,⋯,n.

For i=1,⋯,I, let $X_{li} = \left (X_{li1},\cdots,X_{li{n_{i}}} \right)^{\top }$ denote the lth predictor (or covariate) vector associated with class i, where l=1,⋯,p for a positive integer p. We write $X_{l} = \left (X_{l1}^{\top },\cdots, X_{lI}^{\top } \right)^{\top }$ for l=1,⋯,p, and let X=(X₁,⋯,X_p) denote the n×p matrix of predictors. Let X_·j=(X_·j1,⋯,X_·jp)^⊤ denote the jth row of X, which represents the p-dimensional predictor vector for the jth subject. Without loss of generality, the {X_·j,Y_·j} are treated as independent and identically distributed (i.i.d.) for j=1,⋯,n. We let lower case letters represent realized values for the corresponding random variables. For example, x_·j stands for a realized value of X_·j. The data structure is shown in Table 1.

Table 1 Two ways to display data with a multiclass response and predictors

The objective here is to use the observed data to build models in order to predict the class label for a new subject using his/her observed predictor measurement.

Logistic regression model for multiclass response

With the multiclass response, we may consider the use of the logistic regression model by adapting the discussion of Agresti (2012, Section 7.1). For i=1,⋯,I and j=1,⋯,n, let π_ij(x_·j)=P(Y_·j=i|X_·j=x_·j) denote the conditional probability that subject j is selected from class i, given the predictor information X_·j=x_·j.

Noting the constraint $\sum \limits _{i=1}^{I} \pi _{ij}(x_{\cdot j}) = 1$ for every j=1,⋯,n, we need to model only (I−1) of the π_ij(x_·j) rather than all of them. Without loss of generality, we take the Ith conditional probability π_Ij(x_·j) as the reference and then consider the logistic model

$$\begin{array}{@{}rcl@{}} \log \left\{ \frac{\pi_{ij}(x_{\cdot j})}{\pi_{Ij}(x_{\cdot j})} \right\} = \gamma_{0i} + \gamma_{i}^{\top} x_{\cdot j} \end{array} $$

(1)

for i=1,⋯,I−1 and j=1,⋯,n, where $\gamma = \left (\gamma _{01}, \gamma _{1}^{\top }, \gamma _{02}, \gamma _{2}^{\top },\cdots,\gamma _{0,I-1}, \gamma _{I-1}^{\top } \right)^{\top }$ is the vector of parameters with the intercepts γ_0i and a p-dimensional vector γ_i of parameters.

Equivalently, (1) shows that for i=1,⋯,I−1 and j=1,⋯,n,

$$\begin{array}{@{}rcl@{}} \pi_{ij}(x_{\cdot j}) = \frac{\exp\left(\gamma_{0i} + \gamma_{i}^{\top} x_{\cdot j} \right)}{1 + \sum \limits_{l=1}^{I-1} \exp\left(\gamma_{0l} + \gamma_{l}^{\top} x_{\cdot j} \right)} \end{array} $$

(2)

and

$$\begin{array}{@{}rcl@{}} \pi_{Ij}(x_{\cdot j}) = 1 - \sum \limits_{i=1}^{I-1} \pi_{ij}(x_{\cdot j}). \end{array} $$

(3)

Since the distribution of the Y_ij can be described by a multinomial distribution, the likelihood function for the observed data is given by

$$\begin{array}{@{}rcl@{}} L\left(\gamma \right) = \prod \limits_{i=1}^{I} \left\{ \prod \limits_{j=1}^{n} \pi_{ij}(x_{\cdot j})^{y_{ij}} \right\}, \end{array} $$

(4)

where π_ij(x_·j) is determined by (2) or (3), and y_ij equals 1 if subject j belongs to class i and 0 otherwise. Estimation of γ proceeds by maximizing (4). Let $\widehat {\gamma } = \left (\widehat {\gamma }_{01}, \widehat {\gamma }_{1}^{\top }, \widehat {\gamma }_{02}, \widehat {\gamma }_{2}^{\top },\cdots, \widehat {\gamma }_{0,I-1}, \widehat {\gamma }_{I-1}^{\top } \right)^{\top } $ denote the resulting maximum likelihood estimate of γ.

To predict the class label for a new subject with a p-dimensional predictor vector $\widetilde {x}$, we first calculate the right-hand side of (2) and (3) with the $\left (\gamma _{0i}, \gamma _{i}^{\top } \right)^{\top }$ replaced by the corresponding estimate obtained for the training data and let $\widehat {\pi }_{1},\cdots,\widehat {\pi }_{I}$ denote the corresponding values. Let i^∗ denote the index which corresponds to the largest value of $\left \{ \widehat {\pi }_{1},\cdots,\widehat {\pi }_{I} \right \}$. Then the class label for this new subject is predicted as i^∗.
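The prediction rule above can be sketched in a few lines of plain Python. This is only an illustration of (2) and (3) followed by the argmax step; the function names and the numerical values of the fitted parameters are hypothetical and are not taken from the paper.

```python
import math

def class_probabilities(x, intercepts, slopes):
    """Evaluate (2) and (3) with class I as the reference class.

    intercepts: list of I-1 values (gamma_0i).
    slopes: list of I-1 coefficient lists (gamma_i), each of length p.
    """
    # Linear predictors gamma_0i + gamma_i^T x for the I-1 non-reference classes.
    etas = [b0 + sum(b * v for b, v in zip(b_vec, x))
            for b0, b_vec in zip(intercepts, slopes)]
    denom = 1.0 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]   # equation (2)
    probs.append(1.0 - sum(probs))                # equation (3)
    return probs

def predict_label(x, intercepts, slopes):
    """Return the 1-based label i* with the largest estimated probability."""
    probs = class_probabilities(x, intercepts, slopes)
    return max(range(len(probs)), key=probs.__getitem__) + 1

# Hypothetical fitted values for I = 3 classes and p = 2 predictors.
gamma0_hat = [0.5, -0.2]
gamma_hat = [[1.0, -1.0], [0.3, 0.8]]
x_new = [0.4, 1.2]
probs = class_probabilities(x_new, gamma0_hat, gamma_hat)
label = predict_label(x_new, gamma0_hat, gamma_hat)
```

In practice the estimates would come from a multinomial logistic fit on the training data; the sketch only covers the step from fitted parameters to a predicted label.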

Classification with predictor graphical structures accommodated

In this section, we propose two classification methods for prediction which incorporate the network structure of the predictors. We first describe the use of graphical models to facilitate the association structure of the predictors, and then explore two methods of building prediction models using the identified association structures.

Predictor network structure

Graphical models are useful for characterizing the network structures of the predictors. Here we describe the way of using graphical models to delineate possible association structures of the predictors. For j=1,⋯,n, we use an undirected graph, denoted as G_j=(V_j,E_j), to describe the relationship among the components of X_·j=(X_·j1,⋯,X_·jp)^⊤, where V_j={1,⋯,p} includes all the indices of predictors and E_j is a subset of the pairs in V_j×V_j with unequal coordinates. A covariate X_·jr is called a vertex of the graph G_j if r∈V_j; a pair of predictors {X_·jr,X_·js} is called an edge of the graph G_j if (r,s)∈E_j⊂V_j×V_j. In the setting we consider, the sets V_j and E_j are common for j=1,⋯,n, so we let V and E denote the vertex set and the edge set of the graph, respectively.

To characterize the distribution of the predictor X_·j, we consider the graphical model with the exponential family distribution,

$$ {}f(x_{\cdot j};\beta,\Theta) = \exp \left\{ \sum \limits_{r \in V} \beta_{r} B(x_{\cdot jr}) + \sum \limits_{(s,t) \in E} \theta_{st} B(x_{\cdot js})B(x_{\cdot jt}) + \sum \limits_{r \in V} C(x_{\cdot jr}) - A(\beta,\Theta) \right\}, $$

(5)

where β=(β₁,⋯,β_p)^⊤ is a p-dimensional vector of parameters, Θ=[θ_st] is a p×p symmetric matrix with zero diagonal elements, and B(·) and C(·) are given functions. The function A(β,Θ) is the normalizing constant which makes (5) integrate to 1; this function is also called the log-partition function, given by

$$\begin{array}{@{}rcl@{}} A(\beta,\Theta) = \log \int \exp \left\{ \sum \limits_{r \in V} \beta_{r} B(x_{\cdot jr}) + \sum \limits_{(s,t) \in E} \theta_{st} B(x_{\cdot js})B(x_{\cdot jt}) + \sum \limits_{r \in V} C(x_{\cdot jr}) \right\} dx_{\cdot j}. \end{array} $$

Formulation (5) gives a broad class of models which essentially covers most commonly used distributions. For example, if $B(x) = \frac {x}{\sigma }$ and $C(x) = -\frac {x^{2}}{2 \sigma ^{2}}$ where σ is a positive constant, then (5) yields the well-known Gaussian graphical model (Friedman et al. 2008; Hastie et al. 2015; Lee and Hastie 2015). If B(x)=x and C(x)=0 with x∈{0,1}, then with the β_r set to be zero, (5) reduces to

$$\begin{array}{@{}rcl@{}} \exp \left\{ \sum \limits_{(s,t) \in E} \theta_{st} x_{\cdot js} x_{\cdot jt} - A(\Theta) \right\}, \end{array} $$

(6)

which is the Ising model without the singletons (Ravikumar et al. 2010).

To focus on featuring the pairwise association among the components of X_·j, similar to the structure of (6), we consider the following graphical model

$$\begin{array}{@{}rcl@{}} f(x_{\cdot j};\Theta) = \exp \left\{ \sum \limits_{(s,t) \in E} \theta_{st} x_{\cdot js} x_{\cdot jt} + \sum \limits_{r \in V} C(x_{\cdot jr}) - A(\Theta) \right\}, \end{array} $$

(7)

where the function A(Θ) is the normalizing constant, and the θ_st and C(·) are defined as for (5). Model (7) is a special case of (5) which constrains the main effect parameters β_r in (5) to be zero; a nonzero parameter θ_st implies that X_·js and X_·jt are conditionally dependent given the other predictors.

To estimate Θ, one may apply the likelihood method using the distribution (7) directly. Alternatively, a simpler estimation method can be carried out based on a conditional distribution derived from (7) (Meinshausen and Bühlmann 2006; Hastie et al. 2015, p.254). For every s∈V, let X_·j,V∖{s} denote the (p−1)-dimensional subvector of X_·j with its sth component deleted, i.e., X_·j,V∖{s}=(X_·j1,⋯,X_·j,s−1,X_·j,s+1,⋯,X_·jp)^⊤. By some algebra, we have

$$ f\left(x_{\cdot js} | x_{\cdot j,V \setminus \{s\}}; \theta_{s}\right) = \exp \left\{ x_{\cdot js} \left(\sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} \right) + C\left(x_{\cdot js} \right) - D \left(\sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} \right) \right\}, $$

(8)

where D(·) is the normalizing constant ensuring that (8) integrates to one, and θ_s=(θ_s1,⋯,θ_s,s−1,θ_s,s+1,⋯,θ_sp)^⊤ is a (p−1)-dimensional vector of parameters indicating the relationship of X_·js with all other predictors X_·jr for r∈{1,⋯,p}∖{s} associated with (8).

Let ℓ(θ_s) be the log-likelihood for θ_s multiplied by $- \frac {1}{n}$ with the constant term omitted, i.e.,

$$\begin{array}{@{}rcl@{}} \ell \left(\theta_{s} \right) &=& - \frac{1}{n} \log \left\{ \prod \limits_{j=1}^{n} f\left(x_{\cdot js} | x_{\cdot j, V \setminus \{s\}} ; \theta_{s}\right) \right\} \\ &=& \frac{1}{n} \sum \limits_{j=1}^{n} \left\{ -x_{\cdot js} \left(\sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} \right) + D \left(\sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} \right) \right\}. \end{array} $$

Then an estimator of θ_s can be obtained as

$$\begin{array}{@{}rcl@{}} \widehat{\theta}_{s}(\lambda) = \underset{\theta_{s}}{\text{argmin}} \left\{ \ell \left(\theta_{s} \right) + \lambda \left\| \theta_{s} \right\|_{1} \right\}, \end{array} $$

(9)

where λ is a tuning parameter and ∥·∥₁ is the L₁-norm. In principle, the L₁-norm in (9) may be replaced by other penalty functions such as the weighted L₁-norm (Zou 2006) and the nonconcave function (Fan and Li 2001). Here we focus on using the L₁-norm, the well-known LASSO penalty (Tibshirani 1996), to determine informative pairwise dependent predictors. The LASSO penalty is frequently considered when dealing with graphical models; it has been implemented in R. For instance, R packages huge and XMRF use the LASSO penalty to determine the network structure.
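To make the optimization in (9) concrete, the following sketch treats one special case only: the Gaussian setting with C(x) = −x²/2, for which D(η) = η²/2 up to an additive constant, so that (9) reduces to a LASSO regression of the sth predictor on the remaining ones. The coordinate descent routine and all names in it are illustrative assumptions; it is not the algorithm used inside the R packages huge or XMRF.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator associated with the L1 penalty."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def neighborhood_lasso(X, s, lam, n_iter=200):
    """Minimize (9) for node s, assuming D(eta) = eta^2 / 2 (Gaussian case).

    X is a list of n rows, each a list of p predictor values (0-based columns).
    Returns a dict {t: theta_st} for all t != s.
    """
    n, p = len(X), len(X[0])
    others = [t for t in range(p) if t != s]
    theta = {t: 0.0 for t in others}
    # Residual x_js - sum_t theta_st x_jt; equals x_js while theta is zero.
    resid = [row[s] for row in X]
    for _ in range(n_iter):
        for t in others:
            col = [row[t] for row in X]
            norm = sum(c * c for c in col) / n
            if norm == 0.0:
                continue
            # Average product of column t with the residual that excludes t.
            rho = sum(c * (r + theta[t] * c) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / norm
            if new != theta[t]:
                delta = new - theta[t]
                resid = [r - delta * c for r, c in zip(resid, col)]
                theta[t] = new
    return theta

# Toy data: column 0 is exactly 0.8 times column 1; column 2 is unrelated noise.
col1 = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
col2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
X = [[0.8 * a, a, b] for a, b in zip(col1, col2)]
theta_hat = neighborhood_lasso(X, s=0, lam=0.1)
```

On this toy input the penalty shrinks the coefficient of column 1 slightly below 0.8 and sets the coefficient of column 2 exactly to zero, which is the sparsity behaviour that the edge selection relies on.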

We comment that the estimator obtained from (9) depends on the choice of the tuning parameter λ. There is no unique way of selecting a suitable tuning parameter, and methods such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the Cross Validation (CV), and the Generalized Cross Validation (GCV) may be considered in the selection of the tuning parameter. As suggested by Wang et al. (2007), BIC tends to outperform the others in many situations, especially in the setting with a penalized likelihood function. Consequently, here we employ the BIC approach to select the tuning parameter λ.

Define

$$\begin{array}{@{}rcl@{}} BIC(\lambda) = 2n \ell \left(\widehat{\theta}_{s}(\lambda) \right) + \log(n)\times \text{df} \left\{ \widehat{\theta}_{s}(\lambda) \right\}, \end{array} $$

(10)

where $\text {df} \left \{ \widehat {\theta }_{s}(\lambda) \right \}$ represents the number of non-zero elements in $\widehat {\theta }_{s}(\lambda)$ for a given λ. The optimal tuning parameter λ, denoted by $\widehat {\lambda }$, is determined by minimizing (10) within a suitable range of λ. As a result, the estimator of θ_s is determined by $\widehat {\theta }_{s} = \widehat {\theta }_{s}\left (\widehat {\lambda }\right)$.
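The selection step based on (10) amounts to a small grid search. The sketch below assumes that, for each candidate λ, the value of ℓ at the fitted estimate and the estimate itself are already available; the helper names and the numbers in the grid are hypothetical.

```python
import math

def bic(ell_value, theta, n):
    """BIC in (10): 2 n ell(theta_hat(lambda)) + log(n) * (number of nonzero entries)."""
    df = sum(1 for v in theta if v != 0.0)
    return 2.0 * n * ell_value + math.log(n) * df

def select_lambda(fits, n):
    """fits maps each candidate lambda to (ell_value, theta); return the BIC minimizer."""
    return min(fits, key=lambda lam: bic(fits[lam][0], fits[lam][1], n))

# Hypothetical fits over a small grid of lambda values for n = 100 subjects.
fits = {
    0.01: (-0.520, [0.41, 0.07, -0.03]),
    0.10: (-0.510, [0.38, 0.00, 0.00]),
    0.50: (-0.300, [0.00, 0.00, 0.00]),
}
best_lambda = select_lambda(fits, n=100)
```

Here the middle value wins: it gives up little in fit relative to the smallest λ while using one nonzero coefficient instead of three.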

The preceding procedure is repeated for all s∈V and yields the estimator $\widehat {\theta }_{s}$ for all s∈V. One point requires attention. For (s,t)∈E, the estimates $\widehat {\theta }_{st}$ and $\widehat {\theta }_{ts}$ are not necessarily identical although θ_st and θ_ts are constrained to be equal. To overcome this problem, we apply the AND rule (Meinshausen and Bühlmann 2006; Hastie et al. 2015, p.255): the final estimates of $\widehat {\theta }_{st}$ and $\widehat {\theta }_{ts}$ are set to their maximum if both $\widehat {\theta }_{st}$ and $\widehat {\theta }_{ts}$ are nonzero, and both are set to zero if either of them is zero.

To determine an estimated set of edges, we define

$$\begin{array}{@{}rcl@{}} \widehat{\mathcal{N}}(s) = \left\{ t \in V : \widehat{\theta}_{st} \neq 0 \right\} \end{array} $$

for s∈V. Then

$$\begin{array}{@{}rcl@{}} \widehat{E} = \left\{ (s,t) : s \in \widehat{\mathcal{N}}(t) \ \text{and} \ t \in \widehat{\mathcal{N}}(s) \right\} \end{array} $$

(11)

is taken as the set of the edges that are estimated to exist. The R package ‘huge’ can be used to display the estimated graph.
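The AND rule and the edge set in (11) can be illustrated as follows. The sketch assumes the node-wise estimates are stored in a p×p nested list with 0-based indices, and it records each undirected edge once as a pair (s, t) with s < t; both conventions, like the input values, are choices made for this example.

```python
def and_rule(theta):
    """Symmetrize node-wise estimates with the AND rule.

    theta[s][t] is the estimate of theta_st from the regression of predictor s
    on the others (the diagonal is ignored).
    Returns the symmetrized matrix and the estimated edge set.
    """
    p = len(theta)
    sym = [[0.0] * p for _ in range(p)]
    edges = set()
    for s in range(p):
        for t in range(s + 1, p):
            a, b = theta[s][t], theta[t][s]
            if a != 0.0 and b != 0.0:
                # Both regressions select the pair: keep it, using the maximum.
                sym[s][t] = sym[t][s] = max(a, b)
                edges.add((s, t))
            # Otherwise both entries stay at zero and no edge is recorded.
    return sym, edges

# Hypothetical node-wise estimates for p = 3 predictors.
theta_raw = [
    [0.0, 0.6, 0.0],
    [0.5, 0.0, 0.2],
    [0.1, 0.0, 0.0],
]
theta_sym, edge_set = and_rule(theta_raw)
```

Only the pair (0, 1) survives here, because it is the only pair selected by both of the corresponding node-wise regressions.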

Under mild regularity conditions, the estimated set of edges $\widehat {E}$ approximates the true network structure E accurately, as stated below; the result is available in Ravikumar et al. (2010, Section 2.2) and Theorem 5 (b) of Yang et al. (2015).

Proposition 1

(Network Recovery) Suppose E is the set of edges, and let $\widehat {E}$ be the estimated set of edges. Under the regularity conditions in Meinshausen and Bühlmann (2006), we have that as n→∞,

$$\begin{array}{@{}rcl@{}} P\left(\widehat{E} = E\right) \rightarrow 1. \end{array} $$

Logistic regression with homogeneous graphically structured predictors

To incorporate the network structures of the predictors into building a prediction model, in the next two subsections, we present two methods which can be readily implemented using the R package huge and the R function glm for fitting a logistic regression model.

In the first method, called the logistic regression with homogeneous graphically structured predictors (LR-HomoGraph) method, we consider the case where the subjects in different classes share a common network structure in the predictors. To build a prediction model, we make use of the development of the logistic model with multiclass responses, discussed by Agresti (2007, Section 6.1) and Agresti (2012, Section 7.1).

We first identify the pairwise dependence of the predictors using the measurements of all the subjects without distinguishing their class labels. Let $\widehat {\theta }_{st}$ be the estimate for θ_st obtained for (9) by using all the predictor measurements of {X_·j:j=1,⋯,n}, and let $\widehat {E} = \left \{ (s,t) : \widehat {\theta }_{st} \neq 0 \right \}$ denote the resulting estimated set of edges.

Next, for i=1,⋯,I and j=1,⋯,n, we let

$$\begin{array}{@{}rcl@{}} p_{ij}(x_{\cdot j}) = P\left(\left. Y_{\cdot j} = i \right| X_{\cdot j} = x_{\cdot j} \right) \end{array} $$

be the conditional probability of Y_·j=i given X_·j=x_·j. Consider the logistic regression model

$$\begin{array}{@{}rcl@{}} p_{ij}(x_{\cdot j}) = \frac{\exp\left(\alpha_{i0} + \sum \limits_{(s,t) \in \widehat{E}} \alpha_{i,st} x_{\cdot js} x_{\cdot jt} \right)}{1 + \sum \limits_{l=1}^{I-1} \exp\left(\alpha_{l0} + \sum \limits_{(s,t) \in \widehat{E}} \alpha_{l,st} x_{\cdot js} x_{\cdot jt} \right)} \end{array} $$

(12)

for i=1,2,⋯,I−1, where (α_i0,α_i,st)^⊤ is the vector of parameters associated with class i and the constraint $\sum \limits _{i=1}^{I} p_{ij}(x_{\cdot j}) = 1$ is imposed for every j=1,⋯,n.

For subject j=1,⋯,n, we let $Y_{ij}^{\ast } = 1$ if subject j is in class i and $Y_{ij}^{\ast } = 0$ otherwise, and hence, $\sum \limits _{i=1}^{I} Y_{ij}^{\ast } = 1$ for every j. Let $y_{ij}^{\ast }$ denote a realized value of $Y_{ij}^{\ast }$. For i=1,⋯,I and j=1,⋯,n, the likelihood function is given by (Agresti 2012, p.273)

$$\begin{array}{@{}rcl@{}} L(\alpha) = \prod \limits_{i=1}^{I} \left\{ \prod \limits_{j=1}^{n} p_{ij}(x_{\cdot j})^{y_{ij}^{\ast}} \right\}, \end{array} $$

(13)

where $\alpha = \left (\alpha _{10}, \alpha _{1\cdot }^{\top },\cdots, \alpha _{(I-1)0}, \alpha _{(I-1)\cdot }^{\top } \right)^{\top }$ is the vector of parameters with vector $\alpha _{i\cdot } = \left (\alpha _{i,st} : (s,t) \in \widehat {E} \right)^{\top }$ for i=1,⋯,I−1.

The estimator $\widehat {\alpha }$ can be derived by maximizing (13) with respect to α. Therefore, for the realization x_·j of the p-dimensional vector X_·j, p_ij(x_·j) is estimated as

$$\begin{array}{@{}rcl@{}} \widehat{p}_{ij}(x_{\cdot j}) = \frac{\exp\left(\widehat{\alpha}_{i0} + \sum \limits_{(s,t) \in \widehat{E}} \widehat{\alpha}_{i,st} x_{\cdot js} x_{\cdot jt} \right)}{1 + \sum \limits_{l=1}^{I-1} \exp\left(\widehat{\alpha}_{l0} + \sum \limits_{(s,t) \in \widehat{E}} \widehat{\alpha}_{l,st} x_{\cdot js} x_{\cdot jt} \right)}\ \ \text{for} \ \ i = 1,\cdots,I-1, \end{array} $$

(14)

and p_Ij(x_·j) is estimated as

$$\begin{array}{@{}rcl@{}} \widehat{p}_{Ij}(x_{\cdot j}) = 1 - \sum\limits_{i=1}^{I-1} \widehat{p}_{ij}(x_{\cdot j}). \end{array} $$

(15)

Finally, to predict the class label for a new subject with a p-dimensional predictor $\widetilde {x}$, we first calculate the right-hand side of (14) and (15), and let $\widetilde {\widehat {p}}_{1},\cdots,\widetilde {\widehat {p}}_{I}$ denote the corresponding values. Let i^∗ denote the index which corresponds to the largest value of $\left \{ \widetilde {\widehat {p}}_{1},\cdots,\widetilde {\widehat {p}}_{I} \right \}$, i.e., $i^{\ast } = \underset {1 \leq i \leq I}{\text {argmax}} \widetilde {\widehat {p}}_{i}$. Then the class label for this new subject is predicted as i^∗.
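The only difference from the plain multinomial logistic model is that the linear predictor is built from the products x_s x_t over the estimated edges. A minimal sketch of (14), (15) and the argmax step is given below; the edge set, the parameter values, and the function names are hypothetical, and predictor indices are 0-based.

```python
import math

def edge_products(x, edges):
    """Return the products x_s * x_t for each (s, t) in the estimated edge set."""
    return [x[s] * x[t] for s, t in edges]

def homograph_probabilities(x, edges, alpha0, alpha):
    """Evaluate (14) and (15); alpha0[i] and alpha[i] belong to class i+1, i = 0..I-2."""
    z = edge_products(x, edges)
    etas = [a0 + sum(a * v for a, v in zip(a_vec, z))
            for a0, a_vec in zip(alpha0, alpha)]
    denom = 1.0 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]   # equation (14)
    probs.append(1.0 - sum(probs))                # equation (15), reference class I
    return probs

edges_hat = [(0, 1), (1, 2)]            # hypothetical estimated edge set
alpha0_hat = [0.2, -0.4]                # hypothetical intercepts for I = 3
alpha_hat = [[1.5, -0.5], [0.0, 2.0]]   # hypothetical coefficients, one per edge
x_tilde = [1.0, 2.0, -1.0]
p_tilde = homograph_probabilities(x_tilde, edges_hat, alpha0_hat, alpha_hat)
i_star = max(range(len(p_tilde)), key=p_tilde.__getitem__) + 1
```

The same `edge_products` helper could be applied to every training subject to form the design matrix that a multinomial logistic routine would be fitted on.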

Logistic regression with class-dependent graphically structured predictors

We now present an alternative to the method described in “Logistic regression with homogeneous graphically structured predictors” section. Instead of pooling all the covariates to feature the covariate network structure, this method, called the logistic regression with class-dependent graphically structured covariates (LR-ClassGraph) method, stratifies the covariate information by class when characterizing the covariate network structures.

We first introduce a binary, surrogate response variable $Y_{ij}^{i}$ for every i and j, where i=1,⋯,I and j=1,⋯,n. Let

$$\begin{array}{@{}rcl@{}} Y_{ij}^{i} = \left\{ \begin{array}{c c} 1, & Y_{ij} = i,\\ 0, & \text{otherwise}, \end{array} \right. \end{array} $$

and define $Y^{i} = \left (0,\cdots,0,Y_{i1}^{i},\cdots,Y_{in_{i}}^{i},0,\cdots,0 \right)^{\top }$ to be an n-dimensional vector whose elements corresponding to class i are respectively $Y_{i1}^{i},\cdots,Y_{in_{i}}^{i}$, and other elements are zero. That is, $Y^{i} = (\underbrace {0,\cdots,0}_{n_{1} + \cdots + n_{i-1}}, \underbrace {1,\cdots,1}_{n_{i}}, \underbrace {0,\cdots,0}_{n_{i+1}+ \cdots + n_{I}})^{\top }$ with i=1,⋯,I. Now we implement the following steps.

Step 1: (Class-Dependent Predictor Network) For each class i=1,⋯,I, we apply the procedure described in “Predictor network structure” section to determine the network structure of predictors in class i. Let $\widehat {E}^{i} = \left \{ (s,t) : \widehat {\theta }_{st}^{i} \neq 0 \right \}$ denote an estimated set of edges for class i, where $\widehat {\theta }_{st}^{i}$ is the estimate of θ_st derived from (9) based on using the predictor measurements in class i.

Step 2: (Class-Dependent Model Building) For each class i=1,⋯,I, fit a logistic regression model using the surrogate response vector Yⁱ with the estimated covariates network structure $\widehat {E}^{i}$ incorporated. Specifically, for the jth component $Y^{i}_{j}$ of $Y^{i}$, define ${\pi }_{j}^{i}(x_{\cdot j}) = P\left (Y_{j}^{i} = 1 | X_{\cdot j} = x_{\cdot j} \right)$ and consider the logistic regression model

$$\begin{array}{@{}rcl@{}} \text{logit} \left\{ {\pi}_{j}^{i}(x_{\cdot j})\right\} = \gamma_{0}^{i} + \sum \limits_{(s,t) \in \widehat{E}^{i}} \gamma_{st}^{i} x_{\cdot js} x_{\cdot jt}, \end{array} $$

(16)

where j=1,⋯,n, and $\left (\gamma _{0}^{i}, \gamma _{st}^{i}\right)^{\top }$ is the vector of parameters associated with class i. By the theory of maximum likelihood (e.g., Agresti 2012), we obtain the estimate $\left (\widehat {\gamma }_{0}^{i}, \widehat {\gamma }_{st}^{i} \right)^{\top }$ of $\left (\gamma _{0}^{i}, \gamma _{st}^{i} \right)^{\top }$.

Step 3: (Prediction) For a realization x_·j of the p-dimensional vector X_·j, based on (16), ${\pi }_{j}^{i}(x_{\cdot j})$ can be estimated by

$$\begin{array}{@{}rcl@{}} \widehat{\pi}_{j}^{i}(x_{\cdot j}) = \frac{\exp\left(\widehat{\gamma}_{0}^{i} + \sum \limits_{(s,t) \in \widehat{E}^{i}} \widehat{\gamma}_{st}^{i} x_{\cdot js} x_{\cdot jt}\right)}{1+\exp\left(\widehat{\gamma}_{0}^{i} + \sum \limits_{(s,t) \in \widehat{E}^{i}} \widehat{\gamma}_{st}^{i} x_{\cdot js} x_{\cdot jt}\right)}\ \ \ \text{for} \ \ i = 1,\cdots, I. \\ \end{array} $$

(17)

To predict the class label for a new subject with a p-dimensional covariate vector $\widetilde {x}$, we first calculate (17) with x_·j replaced by $\widetilde {x}$ for i=1,⋯,I, and let $\widetilde {\widehat {\pi }^{1}},\cdots,\widetilde {\widehat {\pi }^{I}}$ denote the corresponding values. Let i^∗ denote the index which corresponds to the largest value of $\left \{ \widetilde {\widehat {\pi }^{1}},\cdots,\widetilde {\widehat {\pi }^{I}} \right \}$, i.e.,

$$\begin{array}{@{}rcl@{}} \widetilde{\widehat{\pi}^{i^{\ast}}} = \max \limits_{1 \leq i \leq I} \widetilde{\widehat{\pi}^{i}}. \end{array} $$

(18)

Then the class label for this new subject is predicted as i^∗.
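The prediction step of LR-ClassGraph can be sketched as I separate binary logistic scores, each using its own edge set, followed by the maximization in (18). The per-class edge sets and coefficients below are hypothetical, as are the function names; predictor indices are 0-based.

```python
import math

def class_score(x, edges, gamma0, gamma):
    """Evaluate (17) for one class: inverse logit of gamma0 + sum of gamma_st * x_s * x_t."""
    eta = gamma0 + sum(g * x[s] * x[t] for (s, t), g in zip(edges, gamma))
    return 1.0 / (1.0 + math.exp(-eta))

def predict_classgraph(x, models):
    """models[i] = (edges, gamma0, gamma) for class i+1; return (i*, scores) as in (18)."""
    scores = [class_score(x, e, g0, g) for e, g0, g in models]
    i_star = max(range(len(scores)), key=scores.__getitem__) + 1
    return i_star, scores

# Hypothetical class-specific edge sets and fitted coefficients for I = 3.
models_hat = [
    ([(0, 1)], -1.0, [2.0]),
    ([(1, 2)], 0.5, [1.0]),
    ([(0, 2), (1, 2)], 0.0, [-1.0, 0.5]),
]
x_tilde = [1.0, 2.0, -1.0]
i_star, scores = predict_classgraph(x_tilde, models_hat)
```

Because the I models are fitted separately, the scores are not constrained to sum to one; the rule in (18) only compares their magnitudes.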

Comparison of decision boundaries

As noted in “Logistic regression with homogeneous graphically structured predictors” and “Logistic regression with class-dependent graphically structured predictors” sections, while both the LR-HomoGraph and LR-ClassGraph methods employ logistic regression to classify classes, they differ in how they characterize predictor structures. We may further compare them in terms of decision boundaries.

First, we examine the decision boundaries for the LR-HomoGraph method. For i≠k, the boundary between the ith and kth classes is determined by

$$\widehat{p}_{ij}(x_{\cdot j}) =\widehat{p}_{kj}(x_{\cdot j}) $$

for a new instance with the predictor value x_·j, where $\widehat {p}_{ij}(x_{\cdot j})$ and $\widehat {p}_{kj}(x_{\cdot j})$ are given by (14) or (15). To be more specific, for any i=1,...,I−1, if k=1,...,I−1 and k≠i, then by (14), the boundary between the ith and kth classes is

$$ \sum \limits_{(s,t) \in \widehat{E}} (\widehat{\alpha}_{i,st}-\widehat{\alpha}_{k,st}) x_{\cdot js} x_{\cdot jt} +(\widehat{\alpha}_{i0} - \widehat{\alpha}_{k0})=0; $$

(19)

and the boundary between the ith and Ith classes is, by (15),

$$ \sum \limits_{(s,t) \in \widehat{E}} \widehat{\alpha}_{i,st} x_{\cdot js} x_{\cdot jt} +\widehat{\alpha}_{i0}=0. $$

(20)

Similarly, the decision boundaries for the LR-ClassGraph method can be determined based on (17). For i≠k, equating $\widehat {\pi }_{j}^{i}(x_{\cdot j})$ and $ \widehat {\pi }_{j}^{k}(x_{\cdot j})$ for a covariate value x_·j gives the boundary between the ith and kth classes

$$ \sum \limits_{(s,t) \in \widehat{E}^{i}} \widehat{\gamma}_{st}^{i} x_{\cdot js} x_{\cdot jt}-\sum \limits_{(s,t) \in \widehat{E}^{k}} \widehat{\gamma}_{st}^{k} x_{\cdot js} x_{\cdot jt} +\left(\widehat{\gamma}_{0}^{i} -\widehat{\gamma}_{0}^{k}\right)=0. $$

(21)

Comparing (21) to (19) or (20) shows that decision boundaries for both the LR-HomoGraph and LR-ClassGraph methods are all quadratic surfaces determined by the features selected from the graphical models. However, the way of incorporating the features is different for the two methods. The boundaries (21) are determined by the quadratic terms identified using instances from classes i and k separately, but the quadratic terms in the boundary (19) or (20) are not distinguished by the class labels. In addition, the coefficients $\widehat {\gamma }_{st}^{i}$ and $\widehat {\alpha }_{i,st}$ associated with the decision boundaries are generally different.

Evaluation of the performance

In this section we discuss the evaluation of the procedures proposed in “Logistic regression with homogeneous graphically structured predictors” and “Logistic regression with class-dependent graphically structured predictors” sections. For comparisons, we also examine some conventional classification methods in machine learning, including support vector machine (SVM), linear discriminant analysis (LDA), K-nearest neighbor (KNN), and extreme gradient boosting (XGBOOST). We first describe the measures of assessing the prediction error that are commonly used, and then we briefly review the four classification methods.

Criteria for performances

In this subsection, we describe several criteria of evaluating the performance for prediction. To show the overall performance of prediction, we consider either micro averaged metrics or macro averaged metrics (Parambath et al. 2018). For subject j=1,⋯,n, let $\widehat {y}_{\cdot j}$ denote the predicted class label. For class i=1,⋯,I, we calculate the number of the true positives, the number of the false positives, and the number of the false negatives, respectively, given by

$$\text{TP}_{i} = \sum \limits_{j=1}^{n} \mathbb{I} \left(y_{\cdot j} = i, \widehat{y}_{\cdot j} = i\right), \text{FP}_{i} = \sum \limits_{j=1}^{n} \mathbb{I} \left(y_{\cdot j} \neq i, \widehat{y}_{\cdot j} = i\right), $$

and

$$\begin{array}{@{}rcl@{}} \text{FN}_{i} = \sum \limits_{j=1}^{n} \mathbb{I} \left(y_{\cdot j} = i, \widehat{y}_{\cdot j} \neq i\right), \end{array} $$

where $\mathbb {I}(\cdot)$ is the indicator function. For micro averaged metrics, we define precision and recall, respectively, given by

$$\begin{array}{@{}rcl@{}} PRE_{micro} = \frac{\sum \limits_{i=1}^{I} \text{TP}_{i}}{\sum \limits_{i=1}^{I} \text{TP}_{i} + \sum \limits_{i=1}^{I} \text{FP}_{i}} \ \ \text{and} \ \ REC_{micro} = \frac{\sum \limits_{i=1}^{I} \text{TP}_{i}}{\sum \limits_{i=1}^{I} \text{TP}_{i} + \sum \limits_{i=1}^{I} \text{FN}_{i}}. \end{array} $$

Then Micro-F-score is defined as

$$\begin{array}{@{}rcl@{}} F_{micro} = 2 \times \frac{PRE_{micro} \times REC_{micro}}{PRE_{micro} + REC_{micro}}. \end{array} $$

(22)

On the other hand, for macro averaged metrics, for i=1,⋯,I, let $PRE_{i} = \frac {\text {TP}_{i}}{\text {TP}_{i} + \text {FP}_{i}}$ denote precision for class i, and let $REC_{i} = \frac {\text {TP}_{i}}{\text {TP}_{i} + \text {FN}_{i}}$ denote recall for class i. Then the overall precision and recall are, respectively, defined as

$$\begin{array}{@{}rcl@{}} PRE_{macro} = \frac{1}{I} \sum \limits_{i=1}^{I} PRE_{i} \ \ \text{and} \ \ REC_{macro} = \frac{1}{I} \sum \limits_{i=1}^{I} REC_{i}; \end{array} $$

and Macro-F-score is defined as

$$\begin{array}{@{}rcl@{}} F_{macro} = 2 \times \frac{PRE_{macro} \times REC_{macro}}{PRE_{macro} + REC_{macro}}. \end{array} $$

(23)

In principle, higher values of PRE, REC and F, under both micro and macro averaging, reflect better performance of a method (Parambath et al. 2018; Sokolova et al. 2006).
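As an illustration of criteria (22) and (23), the following Python sketch computes the per-class counts and both averaged F-scores from lists of true and predicted labels. It is not the code used in the paper; the function names and the convention of reporting a ratio with zero denominator as 0 are assumptions made for this example.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, classes):
    """Per-class TP, FP and FN counts, returned as dicts keyed by class label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for yt, yp in zip(y_true, y_pred):
        if yt == yp:
            tp[yt] += 1
        else:
            fp[yp] += 1  # the predicted class gains a false positive
            fn[yt] += 1  # the true class gains a false negative
    return ({c: tp[c] for c in classes},
            {c: fp[c] for c in classes},
            {c: fn[c] for c in classes})


def safe_div(num, den):
    # Assumed convention: a ratio with zero denominator is reported as 0.
    return num / den if den else 0.0


def f_score(pre, rec):
    return safe_div(2 * pre * rec, pre + rec)


def micro_macro(y_true, y_pred, classes):
    tp, fp, fn = confusion_counts(y_true, y_pred, classes)
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    # Micro averaging pools the counts over classes before taking ratios.
    pre_micro = safe_div(tp_sum, tp_sum + fp_sum)
    rec_micro = safe_div(tp_sum, tp_sum + fn_sum)
    # Macro averaging takes the ratios per class and then averages them.
    pre_macro = sum(safe_div(tp[c], tp[c] + fp[c]) for c in classes) / len(classes)
    rec_macro = sum(safe_div(tp[c], tp[c] + fn[c]) for c in classes) / len(classes)
    return {
        "PRE_micro": pre_micro, "REC_micro": rec_micro,
        "F_micro": f_score(pre_micro, rec_micro),
        "PRE_macro": pre_macro, "REC_macro": rec_macro,
        "F_macro": f_score(pre_macro, rec_macro),
    }


metrics = micro_macro([1, 1, 2, 2, 3, 3], [1, 2, 2, 2, 3, 1], classes=[1, 2, 3])
print(metrics)
```

Because every misclassified subject contributes exactly one false positive and one false negative, the micro averaged precision and recall coincide whenever all predicted labels belong to the listed classes; the macro averaged versions generally differ.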

Support vector machine for multiclass responses

Support vector machine (SVM) was originally designed for two-class classification (Hastie et al. 2008, Sec. 12.2), and its extensions to multiclass responses have been discussed by many authors. An early extension of the SVM to accommodate multiclass classification is the one-against-all method (Hsu and Lin 2002). The main idea is that the ith SVM is trained with all subjects in the ith class given positive labels and all other subjects given negative labels. This type of SVM for multiclass classification, however, ignores the heterogeneity among the subjects in each class.

A useful multiclass SVM is the one-against-one method (Knerr et al. 1990), which is implemented in the R package e1071. Unlike the one-against-all method, the one-against-one method constructs I(I−1)/2 pairwise classifiers, each obtained by applying the binary SVM to the data from two selected classes. Specifically, for i₁,i₂∈{1,⋯,I} with i₁<i₂, we consider the following optimization problem

$$\begin{array}{*{20}l} &\mathop{\text{min}}_{w^{i_{1}i_{2}},b^{i_{1}i_{2}},\xi_{j}^{i_{1}i_{2}}} \quad \left\{ \frac{1}{2}\left(w^{i_{1}i_{2}}\right)^{\top} w^{i_{1}i_{2}} + C \sum_{j=1}^{n}\xi_{j}^{i_{1}i_{2}} \right\} \\ \text{subject to}&\\ &\text{for} \ \ j=1,\ldots,n,\ \ \\ &\qquad\ \xi_{j}^{i_{1}i_{2}} \ge 0,\\ &\qquad\left\{\left(w^{i_{1}i_{2}}\right)^{\top} \phi(X_{\cdot j}) + b^{i_{1}i_{2}} \right\} \ge 1-\xi_{j}^{i_{1}i_{2}}, \ \text{if} \ Y_{\cdot j} = i_{1}, \\ &\qquad \left\{ \left(w^{i_{1}i_{2}}\right)^{\top} \phi(X_{\cdot j}) + b^{i_{1}i_{2}}\right\} \le -1+\xi_{j}^{i_{1}i_{2}}, \ \text{if} \ Y_{\cdot j} = i_{2}, \end{array} $$

(24)

where ϕ(·) is a non-linear mapping from a p-dimensional vector to a q-dimensional vector with q>p (Hsu and Lin 2002), $w^{i_{1}i_{2}}$ is a q-dimensional vector of parameters associated with the comparison between classes i₁ and i₂, $b^{i_{1}i_{2}}$ is a scalar, $\xi _{j}^{i_{1}i_{2}}$ is the slack variable for the soft margin solution, and C is a cost parameter controlling the balance between maximizing the margin and minimizing the training error.

Solving (24) for every pair i₁,i₂∈{1,⋯,I} with i₁<i₂ yields I(I−1)/2 classifiers, which can then be used to classify a new instance, say $\widetilde {X} = \widetilde {x}$. This can be done through a voting process (Hsu and Lin 2002). Specifically, let $\mathcal {L} = \left \{(1,2), (1,3), \cdots, (1,I), (2,3), \cdots, (2,I),\cdots, (I-1,I)\right \}$ be the collection of all pairwise class labels, which includes I(I−1)/2 elements. For each class i with i=1,⋯,I, we let vote(i) denote the number of votes received by class i. Then we carry out the following three steps.

Step 1: For class i=1,⋯,I, the initial value of vote(i) is set as 0.
Step 2: For any given class i, we consider a subcollection of $\mathcal {L}$, $\left \{ (i,i'): i'=i+1,\cdots,I \right \}$, which is associated with class i. Calculate $\text {sign}\left \{ (w^{ii'})^{\top } \phi (\widetilde {x}) + b^{ii'} \right \}$ repeatedly for i^′=i+1,⋯,I and then determine the values of vote(i) and vote(i^′) iteratively by the rule:
$$\begin{array}{@{}rcl@{}} &&\text{If}\ \text{sign}\left\{ (w^{ii'})^{\top} \phi(\widetilde{x}) + b^{ii'} \right\} > 0,\ \text{then we let} \\ & \ \ & \ \ \ \ \ \ \ \ \ \ vote(i) = vote(i) + 1; \\ &&\text{otherwise}, \\ &\ \ & \ \ \ \ \ \ \ \ \ \ vote(i') = vote(i') + 1; \end{array} $$

where vote(i^′) on the right-hand-side of the equation is a value determined by the previous step, vote(i^′) on the left-hand-side of the equation represents a newly determined value, and i^′=i+1,⋯,I.
Step 3: Repeat Step 2 for i=1,⋯,I. In this way, we determine all the final values of vote(1),⋯,vote(I). Let i^∗ denote the class index corresponding to the largest value of {vote(1),⋯,vote(I)}, i.e., $i^{\ast } = \underset {1 \leq i \leq I}{\text {argmax}} \left \{vote(i)\right \}$. Then we let i^∗ be the predicted class for the new instance.
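The voting steps can be sketched in a few lines of Python. The sketch assumes that the I(I−1)/2 decision values $(w^{ii'})^{\top } \phi (\widetilde {x}) + b^{ii'}$ have already been computed by fitted pairwise SVMs and are supplied in a dictionary; the function name, the dictionary layout, and the tie-breaking rule (the smallest class index wins) are illustrative assumptions, not part of the paper or of e1071.

```python
def ovo_predict(decision, n_classes):
    """One-against-one voting.

    decision[(i, k)] with i < k holds the decision value of the classifier
    comparing class i (positive side) against class k (negative side).
    Classes are labelled 1, ..., n_classes.
    """
    vote = [0] * (n_classes + 1)  # index 0 is unused
    for i in range(1, n_classes):
        for k in range(i + 1, n_classes + 1):
            if decision[(i, k)] > 0:
                vote[i] += 1
            else:
                vote[k] += 1
    # max() returns the first maximiser, so ties go to the smallest class
    # index; this tie-breaking rule is an assumption of the sketch.
    best = max(range(1, n_classes + 1), key=lambda c: vote[c])
    return best, vote[1:]


# Hypothetical decision values for I = 3 classes.
label, votes = ovo_predict({(1, 2): -0.4, (1, 3): 0.9, (2, 3): 1.3}, n_classes=3)
print(label, votes)
```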

Linear discriminant analysis

The idea of LDA is to model the distribution of the predictors X_·j separately for each of the classes Y_·j, and then use the Bayes theorem to obtain the conditional probabilities P(Y_·j=i|X_·j=x_·j) (e.g., James et al. 2017). For i=1,⋯,I and j=1,⋯,n, let f_j|i(x_·j) denote the conditional probability density function of the predictor X_·j taking value x_·j given that subject j comes from the ith class. Let π_i,j=P(Y_·j=i) denote the probability that the jth subject is randomly selected from class i. It is immediate that $\sum \limits _{i=1}^{I} \pi _{i,j} = 1$ for j=1,⋯,n. By some algebra (Hastie et al. 2008, p.108) and the Bayes theorem, we obtain the posterior probability

$$\begin{array}{@{}rcl@{}} P\left(Y_{\cdot j} = i | X_{\cdot j} = x_{\cdot j} \right) = \frac{f_{j|i}(x_{\cdot j})\pi_{i,j}}{\sum \limits_{l = 1}^{I} f_{j|l}(x_{\cdot j})\pi_{l,j}} \end{array} $$

(25)

for i=1,⋯,I and j=1,⋯,n.

To compare two classes i and l with i≠l, we calculate the log-ratio of (25) for classes i and l, given by

$$\begin{array}{@{}rcl@{}} \log \left\{ \frac{P\left(Y_{\cdot j} = i | X_{\cdot j} = x_{\cdot j} \right)}{P\left(Y_{\cdot j} = l | X_{\cdot j} = x_{\cdot j} \right)} \right\} &=& \log \left(\frac{f_{j|i}(x_{\cdot j})}{f_{j|l}(x_{\cdot j})} \right) + \log \left(\frac{\pi_{i,j}}{\pi_{l,j}} \right). \end{array} $$

(26)

To elaborate on the idea, we particularly consider the case where the conditional distribution f_j|i(x_·j) of X_·j given Y_·j=i is assumed to be the normal distribution N(μ_i,Σ_i) with the probability density function

$$\begin{array}{@{}rcl@{}} f_{j|i}(x_{\cdot j}) = \frac{1}{\left(2 \pi \right)^{p/2} \left| \Sigma_{i} \right|^{1/2}} \exp \left\{- \frac{1}{2} \left(x_{\cdot j} - \mu_{i} \right)^{\top} \Sigma_{i}^{-1} \left(x_{\cdot j} - \mu_{i} \right) \right\}. \end{array} $$

(27)

If the covariance matrices Σ_i in (27) are assumed to be common, i.e., Σ_i=Σ for every i where Σ is a positive definite matrix, (26) becomes

$$\begin{array}{@{}rcl@{}} \log \left(\frac{\pi_{i,j}}{\pi_{l,j}} \right) - \frac{1}{2} \left(\mu_{i} + \mu_{l} \right)^{\top} \Sigma^{-1} \left(\mu_{i} - \mu_{l} \right) + x_{\cdot j}^{\top} \Sigma^{-1} \left(\mu_{i} - \mu_{l} \right). \end{array} $$

(28)

If (28) >0, then

$$P\left(Y_{\cdot j} = i | X_{\cdot j} = x_{\cdot j} \right) > P\left(Y_{\cdot j} = l | X_{\cdot j} = x_{\cdot j} \right), $$

showing that subject j with predictors X_·j=x_·j is more likely to be selected from class i than from class l. Consequently, (28) defines a boundary between classes i and l which is a linear function of x_·j.

Motivated by the form of (28), we consider a linear function in x

$$\begin{array}{@{}rcl@{}} \delta_{i}(x) = \log \left(\pi_{i} \right) - \frac{1}{2} \mu_{i}^{\top} \Sigma^{-1} \mu_{i} + x^{\top} \Sigma^{-1} \mu_{i}, \end{array} $$

(29)

where μ_i,π_i, and Σ are estimated by $\widehat {\mu }_{i} = \frac {1}{n_{i}} \sum \limits _{y_{\cdot j} = i} x_{\cdot j}, \widehat {\pi }_{i} = \frac {n_{i}}{n}$, and $\widehat {\Sigma } = \frac {1}{n-I} \sum \limits _{i=1}^{I} \sum \limits _{y_{\cdot j} = i} \left (x_{\cdot j} - \widehat {\mu }_{i} \right)\left (x_{\cdot j} - \widehat {\mu }_{i} \right)^{\top }$, respectively. That is, (29) can be estimated by

$$\begin{array}{@{}rcl@{}} \widehat{\delta}_{i}(x) = \log \left(\widehat{\pi}_{i} \right) - \frac{1}{2} \widehat{\mu}_{i}^{\top} \widehat{\Sigma}^{-1} \widehat{\mu}_{i} + x^{\top} \widehat{\Sigma}^{-1} \widehat{\mu}_{i}. \end{array} $$

(30)

Function (30) is called the linear discriminant function and is used to determine the class label for a new instance (James et al. 2017, p.143; Hastie et al. 2008, p. 109). For the prediction of a new subject with covariate $\widetilde {x}$, we first calculate $\widehat {\delta }_{i}(\widetilde {x})$ using (30) for i=1,⋯,I. Next, we find i^∗ which is defined as

$$\begin{array}{@{}rcl@{}} i^{\ast} = \underset{i=1,\cdots,I}{\text{argmax}}\ \widehat{\delta}_{i}(\widetilde{x}); \end{array} $$

and the class label for this subject is then predicted as i^∗.
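To make the estimation and prediction steps in (30) concrete, here is a minimal Python sketch using only the standard library. It is not the lda function of the R package MASS; the function names are hypothetical, and the sketch assumes that the pooled covariance estimate is nonsingular so that the small Gauss-Jordan solver can be applied.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[r][n] / m[r][r] for r in range(n)]


def fit_lda(xs, ys):
    """Estimate class means, priors and Sigma^{-1} mu_i with a pooled covariance."""
    classes = sorted(set(ys))
    n, p = len(xs), len(xs[0])
    mu, prior = {}, {}
    pooled = [[0.0] * p for _ in range(p)]
    for c in classes:
        rows = [x for x, y in zip(xs, ys) if y == c]
        mu[c] = [sum(col) / len(rows) for col in zip(*rows)]
        prior[c] = len(rows) / n
        for x in rows:
            d = [xi - mi for xi, mi in zip(x, mu[c])]
            for s in range(p):
                for t in range(p):
                    pooled[s][t] += d[s] * d[t] / (n - len(classes))
    coef = {c: solve(pooled, mu[c]) for c in classes}  # Sigma^{-1} mu_i
    return classes, mu, prior, coef


def delta(x, c, mu, prior, coef):
    """Estimated linear discriminant function (30) for class c."""
    lin = sum(xi * ai for xi, ai in zip(x, coef[c]))
    quad = sum(mi * ai for mi, ai in zip(mu[c], coef[c]))
    return math.log(prior[c]) - 0.5 * quad + lin


def predict_lda(x, model):
    classes, mu, prior, coef = model
    return max(classes, key=lambda c: delta(x, c, mu, prior, coef))


# Toy data: two well separated classes in two dimensions.
xs = [[0, 0], [1, 0], [0, 1], [1, 1], [4, 4], [5, 4], [4, 5], [5, 5]]
ys = [1, 1, 1, 1, 2, 2, 2, 2]
model = fit_lda(xs, ys)
print(predict_lda([0.2, 0.3], model), predict_lda([4.8, 4.1], model))
```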

K-nearest neighbor

The third classification method we compare with is the K-nearest neighbor (KNN) method which is a non-parametric approach. The key idea of KNN is to use the available instances to estimate the conditional probability of Y_·j given X_·j, and then classify a new instance to a certain class based on the highest estimated conditional probability.

For a positive integer K and a new instance with predictor value $\widetilde {x}$, the first step of KNN is to identify the K points which are closest to $\widetilde {x}$; let $\mathcal {N}_{0}\left (\widetilde {x} \right)$ denote the set containing such K-nearest points of $\widetilde {x}$. Next, for i=1,⋯,I, we calculate

$$\begin{array}{@{}rcl@{}} \widehat{\pi}_{i} = \frac{1}{K} \sum \limits_{j' \in \mathcal{N}_{0}(\widetilde{x})} \mathbb{I}(y_{\cdot j'} = i). \end{array} $$

Finally, let i^∗ denote the class label which corresponds to the largest value of $\left \{ \widehat {\pi }_{1},\cdots, \widehat {\pi }_{I} \right \}$. Then the class label for this new subject is predicted as i^∗.
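A minimal Python sketch of this procedure is given below. The paper does not state which distance defines "closest", so the Euclidean distance used here is an assumption, as are the function name and the tie-breaking rule (the smallest class label wins). The sketch is not the knn.cv function of the R package class.

```python
import math
from collections import Counter


def knn_predict(x_new, xs, ys, k):
    """Classify x_new by the majority class among its k nearest training points."""
    # Indices of training points sorted by (assumed) Euclidean distance to x_new.
    order = sorted(range(len(xs)), key=lambda j: math.dist(xs[j], x_new))
    neighbours = order[:k]
    counts = Counter(ys[j] for j in neighbours)
    # Estimated conditional class probabilities among the k neighbours.
    pi_hat = {c: counts[c] / k for c in sorted(set(ys))}
    # max() keeps the first maximiser, so ties go to the smallest class label.
    label = max(pi_hat, key=pi_hat.get)
    return label, pi_hat


xs = [[0, 0], [1, 0], [0, 1], [1, 1], [4, 4], [5, 4], [4, 5], [5, 5]]
ys = [1, 1, 1, 1, 2, 2, 2, 2]
print(knn_predict([0.4, 0.6], xs, ys, k=3))
```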

For the KNN method, a crucial issue is the selection of K. A small value of K usually yields an over-flexible decision boundary, which gives the classifier a small bias but a large variance. In contrast, with a large K, the boundary becomes less flexible and is close to linear, and the classifier has a small variance but a large bias. To determine an optimal K, James et al. (2017, p. 184 and p. 186) suggested using the cross-validation method; from the computational viewpoint, however, the choice of K is sometimes based on a random guess, as commented by James et al. (2017, p. 167).

Extreme gradient boosting

The extreme gradient boosting (XGBOOST) is a tree based ensemble method created under the gradient boosting framework (e.g., Chen and Guestrin 2016) and can be implemented by the R package xgboost.

Let $\mathcal {F}$ denote the space of functions representing regression trees f. For $f \in {\mathcal {F}}$ we write $f(x) = w_{q(x)}$, where $q: \mathbb {R}^{p} \rightarrow {\mathcal {L}}$ reflects the structure of the tree f that maps an example to the corresponding leaf index, ${\mathcal {L}}$ is the set of the leaf indices, $w \in \mathbb {R}^{T}$ is the vector of leaf weights, and T is the number of leaves in the tree. Suppose that K regression trees in ${\mathcal {F}}$, $f_{k}(\cdot) \in {\mathcal {F}}$ with k=1,⋯,K, are used to predict the output:

$$\begin{array}{@{}rcl@{}} \widehat{y}_{\cdot j} = \sum \limits_{k=1}^{K} f_{k}\left(x_{\cdot j} \right) \end{array} $$

for an example with the input x_·j.

To learn the set of functions used for classification, we minimize the regularized objective function

$$\begin{array}{@{}rcl@{}} \mathcal{L}(y,\widehat{y}) = \sum \limits_{j=1}^{n} L(y_{\cdot j},\widehat{y}_{\cdot j}) + \sum \limits_{k=1}^{K} \Omega(f_{k}), \end{array} $$

(31)

where Ω is the regularization used to measure the model complexity, given by

$$ \Omega(f) = \gamma T+ \frac{1}{2}\lambda \left\|w\right\|^{2} $$

(32)

with tuning parameters γ and λ. Here L(·) is the loss function which measures how well the model fits the training data. With the multiclass classification problem discussed in “Classification with predictor graphical structures accommodated” section, we specify L(·) as

$$\begin{array}{@{}rcl@{}} \sum \limits_{j=1}^{n} L(y_{\cdot j},\widehat{y}_{\cdot j}) = - \sum \limits_{i=1}^{I} \sum \limits_{j=1}^{n} y_{{ij}} \log \left(p_{{ij}} \right) \end{array} $$

with $p_{{ij}} = \frac {\exp \left (\widehat {y}_{{ij}}\right)}{1 + \sum \limits _{l=1}^{I-1} \exp \left (\widehat {y}_{{lj}}\right)}$ for i=1,…,I−1 and $p_{{Ij}} = 1 - \sum \limits _{i=1}^{I-1} p_{{ij}}$.
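The class probabilities and the loss above can be evaluated as in the following Python sketch, which treats class I as the reference class exactly as in the displayed formula. The function names are hypothetical, and no safeguard against numerical overflow of the exponential is included.

```python
import math


def class_probs(scores):
    """Map the I-1 scores of one subject to I class probabilities.

    scores holds yhat_{1j}, ..., yhat_{I-1,j}; class I is the reference class.
    """
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 - sum(probs))  # p_{Ij} = 1 - sum of the other probabilities
    return probs


def cross_entropy(labels, score_rows):
    """Multiclass log-loss; labels take values in 1, ..., I."""
    return -sum(math.log(class_probs(scores)[y - 1])
                for y, scores in zip(labels, score_rows))


# With all scores equal to zero, every class has probability 1/3 when I = 3.
print(class_probs([0.0, 0.0]))
print(cross_entropy([1, 3], [[0.0, 0.0], [0.0, 0.0]]))
```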

While the objective function in (31) transparently balances the tradeoff between predictive accuracy and model complexity, it cannot be minimized directly using traditional optimization procedures. One approach is to build the trees iteratively, as in the gradient boosting tree algorithm, using a second order approximation to the objective function at each iteration. Specifically, at iteration t, we define

$$\begin{array}{@{}rcl@{}} \widehat{y}_{\cdot j}^{(t)} = \sum \limits_{k=1}^{t} f_{k}\left(x_{\cdot j} \right) = \widehat{y}_{\cdot j}^{(t-1)} + f_{t}\left(x_{\cdot j} \right) \end{array} $$

with $ \widehat {y}_{\cdot j}^{(0)} = 0$, and hence the objective function

$$\begin{array}{@{}rcl@{}} \mathcal{L}^{(t)}(y,\widehat{y}) = \sum \limits_{j=1}^{n} L\left(y_{\cdot j},\widehat{y}_{\cdot j}^{(t)}\right) + \Omega(f_{t}). \end{array} $$

(33)

Applying the second-order approximation to (33) gives

$$ \mathcal{L}^{(t)}(y,\widehat{y}) \approx \sum \limits_{j=1}^{n} \left\{ L\left(y_{\cdot j},\widehat{y}_{\cdot j}^{(t-1)}\right) + g_{j} f_{t}\left(x_{\cdot j} \right) + \frac{1}{2} h_{j} f_{t}^{2}\left(x_{\cdot j} \right) \right\} + \Omega(f_{t}), $$

(34)

where g_j and h_j are the first and second order derivatives of the loss function $L\left(y_{\cdot j},\widehat {y}_{\cdot j}^{(t-1)}\right)$ with respect to $\widehat {y}_{\cdot j}^{(t-1)}$, respectively.

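Let I_m={j:q(x_·j)=m} denote the instance set of leaf m. Then by (32), after dropping the term $L\left(y_{\cdot j},\widehat {y}_{\cdot j}^{(t-1)}\right)$, which does not depend on f_t, (34) becomes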

$$\begin{array}{@{}rcl@{}} \mathcal{L}^{(t)}(y,\widehat{y}) &\approx & \sum \limits_{m=1}^{T} \left\{ \left(\sum \limits_{j \in I_{m}} g_{j} \right) w_{m} + \frac{1}{2} \left(\sum \limits_{j \in I_{m}} h_{j} + \lambda \right) w_{m}^{2} \right\} + \gamma T. \end{array} $$

(35)

For a given tree structure q(·), minimizing (35) gives the optimal weight $\widehat {w}_{m}$ of leaf m and the optimal value $\widehat {\mathcal {L}}^{(t)}$ of (35), respectively, given by

$$\widehat w_{m} = - \frac{\sum \limits_{j \in I_{m}} g_{j}}{\sum \limits_{j \in I_{m}} h_{j} + \lambda} \ \ \ \text{and} \ \ \ \widehat {\mathcal{L}}^{(t)} = - \frac{1}{2} \sum \limits_{m=1}^{T} \frac{\left(\sum \limits_{j \in I_{m}} g_{j}\right)^{2}}{\sum \limits_{j \in I_{m}} h_{j} + \lambda} + \gamma T. $$
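The closed-form leaf weights and the resulting objective value can be checked numerically with the short Python sketch below. It takes the gradients g_j, the second derivatives h_j, and the leaf assignment of each instance as given inputs; the function names and the toy numbers are hypothetical, and the sketch does not reproduce the internals of the xgboost package.

```python
def leaf_solution(g, h, leaf_of, n_leaves, lam, gamma):
    """Optimal leaf weights and objective value for a fixed tree structure.

    g, h     : first and second order derivatives, one pair per instance
    leaf_of  : leaf index (0, ..., n_leaves - 1) of each instance
    lam, gamma : the tuning parameters lambda and gamma in (32)
    """
    G = [0.0] * n_leaves
    H = [0.0] * n_leaves
    for gj, hj, m in zip(g, h, leaf_of):
        G[m] += gj
        H[m] += hj
    weights = [-G[m] / (H[m] + lam) for m in range(n_leaves)]
    value = -0.5 * sum(G[m] ** 2 / (H[m] + lam) for m in range(n_leaves))
    return weights, value + gamma * n_leaves


def objective_35(weights, g, h, leaf_of, lam, gamma):
    """Evaluate the approximate objective (35) at arbitrary leaf weights."""
    total = gamma * len(weights)
    for m, w in enumerate(weights):
        G = sum(gj for gj, leaf in zip(g, leaf_of) if leaf == m)
        H = sum(hj for hj, leaf in zip(h, leaf_of) if leaf == m)
        total += G * w + 0.5 * (H + lam) * w ** 2
    return total


g = [1.0, -2.0, 0.5, 0.5]
h = [1.0, 1.0, 1.0, 1.0]
leaf_of = [0, 0, 1, 1]
w_hat, best = leaf_solution(g, h, leaf_of, n_leaves=2, lam=1.0, gamma=0.1)
print(w_hat, best)
```

Evaluating (35) at the returned weights reproduces the optimal value, and perturbing any weight increases the objective, since (35) is a convex quadratic in each w_m whenever the sum of h_j over the leaf plus λ is positive.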

Numerical studies

In this section, we first conduct simulation studies to evaluate the performance of the procedures proposed in the “Classification with predictor graphical structures accommodated” section, and then we apply the procedures to analyze a real dataset to illustrate their usage. The proposed procedures are compared with the classification methods reviewed in the “Evaluation of the performance” section as well as the usual multiclass logistic regression model in the “Logistic regression model for multiclass response” section. The R functions svm (package e1071), lda (package MASS), and knn.cv (package class), together with the R package xgboost, are used to implement the SVM, LDA, KNN, and XGBOOST methods, respectively.

Simulation study

For class i=1,⋯,I, the predictors are generated from the multivariate normal distribution with mean zero and covariance matrix $\Sigma _{i} = \Omega _{i}^{-1}$, where Ω_i is a matrix associated with the network structure in class i with all diagonal elements 1 and off-diagonal elements 0 or 1; for s≠t, entry (s,t) is 1 if the edge exists between X_s and X_t and 0 otherwise. The relationship between a multivariate normal distribution N(0,Σ_i) and the Gaussian graphical model with edges determined by $\Omega _{i} = \Sigma _{i}^{-1}$ is discussed by Hastie et al. (2015, p.246 and p.263).

We specifically consider two scenarios of network structures where the dimension of predictors is p=12. In the first scenario we specify Ω_i to reflect the network structures displayed in Fig. 1. For example, element (1,5) for Ω₁ is 1, but element (1,5) for Ω_i is 0 if i=2,3,4. For a given class i and a subject j in this class, we calculate $\pi _{j}^{i}(x_{\cdot j})$ by (16) where we set $\gamma _{0}^{i} = \gamma _{{st}}^{i} = 1$. The outcome measurements are set to be $Y_{j}^{i} = 1$ if $\pi _{j}^{i}(x_{\cdot j}) > c$, and $Y_{j}^{i} = 0$ otherwise, where the threshold c is chosen such that the size in class i equals n_i.

In the second scenario, Ω_i is taken as the identity matrix for i=1,⋯,I, showing that the predictors have no network structures. For subject j, the predictor X_·j is generated from the multivariate normal distribution with mean zero and identity covariance matrix. To generate Y_·j for subject j, we first calculate π_ij(x_·j) for every i=1,⋯,I by (2) and (3) where γ_0i and γ_i are both set as log(i)+1 for class i. Then we set Y_·j=i^∗, where $i^{\ast } = \underset {i}{\text {argmax}} \pi _{{ij}}(x_{\cdot j})$. We continue this process until the desired size n_i is achieved for i=1,⋯,I. We consider the case with I=4 and n_i=50 for i=1,⋯,I and run 500 simulations. We use criteria (22) and (23) to report the performance of each method. The results are summarized in Table 2. It is seen that the proposed LR-ClassGraph method outperforms all the other classification methods, with larger values of PRE, REC and F from both micro and macro viewpoints. The SVM performs the second best, and the performance of the LR-HomoGraph method is ranked the third, followed by that of the XGBOOST method.

Table 2 Simulation study with and without network structures for covariates, respectively, indicated by Scenarios 1 and 2: I=4

To understand how the proposed methods perform in binary classification, we repeat the preceding simulations by setting I to be 2 and taking the network structures of classes 1 and 2 when considering scenario 1. The results are in Table 3. When covariates are associated with a network structure, the proposed LR-ClassGraph method still performs the best, and the improvement of the LR-ClassGraph method over existing classifiers is considerably more noticeable for I=2 than for I=4. Interestingly, when covariates are uncorrelated, unlike the multiclass case with I=4, the LR-HomoGraph method outperforms the LR-ClassGraph method; and in this case, the SVM is the best classifier.

Table 3 Simulation study with and without network structures for covariates, respectively, indicated by Scenarios 1 and 2: I=2

Glass identification dataset

We analyze a dataset concerning glass identification. The study of classification of glass types was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence if it is correctly identified. It is of interest to predict the glass type based on the information of the predictors.

The dataset contains 7 types of glass, including

building_windows_float_processed (Glass-1),
building_windows_non_float_processed (Glass-2),
vehicle_windows_float_processed (Glass-3),
vehicle_windows_non_float_processed (Glass-4),
containers (Glass-5),
tableware (Glass-6), and
headlamps (Glass-7),

and the predictors include 9 measurements: refractive index (RI), Sodium (NA), Magnesium (MG), Aluminum (AL), Silicon (SI), Potassium (K), Calcium (CA), Barium (BA), and Iron (FE). The complete dataset is available at https://archive.ics.uci.edu/ml/datasets/glass+identification. The sample size in each class is, respectively, n₁=70,n₂=76,n₃=17,n₄=0,n₅=13,n₆=9, and n₇=29, yielding the total sample size $n = \sum \limits _{i=1}^{7} n_{i} = 214$. To see the correlation among the predictors, we draw a scatter plot of those 9 predictors, displayed in Fig. 2. It is seen that some predictors, such as RI and CA, are highly correlated, and that many pairs of predictors are correlated to some degree.

We first present the network structures for different chemical materials in each class. The network structure for each class is determined by (9) and (11). The graphical results are reported in Fig. 4. It is seen that the network structure of the predictors is different from class to class. We notice that RI has no connection with other variables in every class and the predictor FE also has no connection with others except in class 6.

We next evaluate the performance of our proposed methods against the conventional approaches, SVM, LDA, KNN, and XGBOOST, which are respectively implemented by the R functions svm (package e1071), lda (package MASS), and knn.cv (package class), and by the R package xgboost. To examine the performance of LR-HomoGraph proposed in the “Logistic regression with homogeneous graphically structured predictors” section, we first construct the network structures, displayed in Fig. 3, of the predictors with the class information ignored, and we then apply the procedure described in that section. To implement the LR-ClassGraph method in the “Logistic regression with class-dependent graphically structured predictors” section, we apply model (16) with respect to the six different network structures in Fig. 4, and then determine the predicted class using (18).

To measure the classification results in each class, we define the misclassification rate in class i to be

$$\begin{array}{@{}rcl@{}} \text{MIS}_{i} = \frac{1}{n_{i}} \sum \limits_{j=1}^{n} \mathbb{I} \left(y_{\cdot j} = i, \widehat{y}_{\cdot j} \neq i\right) \ \ \text{for} \ \ i=1,\cdots,I. \end{array} $$
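A short Python sketch of this class-wise measure is given below. The function name is hypothetical, and returning None for a class with no observations (such as Glass-4, for which n₄=0) is a convention assumed for this example, since MIS_i is undefined in that case.

```python
def misclassification_rates(y_true, y_pred, classes):
    """Class-wise misclassification rate MIS_i for each class in `classes`."""
    rates = {}
    for c in classes:
        n_c = sum(1 for yt in y_true if yt == c)
        wrong = sum(1 for yt, yp in zip(y_true, y_pred) if yt == c and yp != c)
        # MIS_i is undefined when the class is empty; None marks that case.
        rates[c] = wrong / n_c if n_c else None
    return rates


# Toy labels; class 4 has no observations.
print(misclassification_rates([1, 1, 2, 2, 3, 3], [1, 2, 2, 2, 3, 1], [1, 2, 3, 4]))
```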

The results obtained from SVM, LDA, KNN, XGBOOST, and the proposed methods are reported in Table 4. The misclassification rates of our proposed methods in each class are smaller than those of the other methods, and the LR-ClassGraph method yields the smallest misclassification rate for each class. Among the four competing methods, the SVM outperforms the other three.

Table 4 Classification results for glass data

Finally, we use criteria (22) and (23) to compare the overall performance of all the methods and summarize the results in Table 5. It is clear that both LR-HomoGraph and LR-ClassGraph produce higher values of the F, PRE and REC measures under both micro and macro averaging, implying that our proposed methods perform better than the other multiclass classification methods considered here. In addition, we further implement the two methods in the “Classification with predictor graphical structures accommodated” section by respectively extending models (12) and (17) with the linear terms in each predictor included; we denote those methods as LR-HomoGraph+main and LR-ClassGraph+main, respectively, and report the results in the last two columns of Table 5. Such an extension of the models, however, does not help increase the values of these measures.

Table 5 Overall performance of classification methods applied to glass data

Discussion

In this paper, we propose to use logistic regression methods to make a prediction for data with network structures in predictors. In our methods, we first identify the network structures of the predictors for every class using graphical models, and then we capitalize on the identified network structures for the predictors to fit a logistic regression model to do classification and prediction. Simulation studies demonstrate that in the presence of network structures for covariates, our proposed methods produce more precise classification results than conventional methods, such as SVM, LDA, KNN, and XGBOOST. To allow interested readers to use the algorithms developed in “Classification with predictor graphical structures accommodated” section, the implementation procedures will be posted at CRAN.

Our development here focuses on examining pairwise dependence structures among predictors using the formulation (7). This is primarily driven by the consideration that such a dependence structure is intuitively interpretable and commonly exists in many problems. Extensions to accommodate triplewise or higher order dependence structures among predictors, possibly together with the main effects (i.e., single variable effects), can be carried out by extending (7) to the form (9.5) of Hastie et al. (2015). Such extensions are, in principle, straightforward to implement technically, but the issue of overfitting may arise. In addition, underlying constraints on the model parameters may become a complex concern in numerical implementation. Discussions on this aspect were given by many authors, including Yang et al. (2015), Yi (2017), and Yi et al. (2017). Our discussion in this paper is directed to using the exponential family distribution to handle continuous predictors. It is easy to extend our methods to accommodate mixed graphical models which feature both continuous and discrete predictors.

In obtaining the estimator (9), we use the L₁-norm or the LASSO penalty, which is driven by its popularity as well as the availability of the implementation software packages (e.g., R packages huge and XMRF). However, the methods described in the “Classification with predictor graphical structures accommodated” section are not confined to the LASSO penalty. Our methods apply as well when other penalty functions are used. For instance, penalty functions such as the elastic-net, SCAD, adaptive LASSO, and L₂-norm penalties can be used to replace the LASSO penalty in deriving the estimator (9); the remaining procedures developed in that section still carry through. It will be interesting to conduct numerical studies for the use of different penalty functions to compare how results may differ with and without incorporating the network structure in the analysis, as noted by a referee. Though in this paper we are not able to exhaust numerical explorations for all possible penalty functions, the implementation framework presented in the “Classification with predictor graphical structures accommodated” section allows the users to take any penalty functions that suit their own problems.

Finally, we comment that several aspects of the methods described in the “Classification with predictor graphical structures accommodated” section warrant further research. As pointed out by a referee, our methods are developed for problems with low dimensional data (i.e., p<n) and they are not applicable to data with p≥n. In the current digital world, it is not uncommon to handle data with thousands of predictor variables but a much smaller sample size. In such circumstances, dimension reduction or feature screening techniques are typically employed before proceeding with formal data analysis. It is interesting to generalize our methods to handle high-dimensional data with p being of a polynomial order of n or even ultra high-dimensional data with p being of an exponential order of n.

Our methods basically involve two steps in using measurements for the covariates and class labels. In the first step, we utilize undirected graphs to examine the covariate measurements alone, and the class information only comes into play in the second step when using logistic regression for classification. Alternatively, one may consider using directed acyclic graphs to feature conditional independencies among variables and develop probabilistic graphical models for classification. To evaluate the performance of the proposed methods, we focus on the comparisons with the competing classifiers reviewed in the “Evaluation of the performance” section. While those algorithms cover a good range of available classifiers, they are by no means exhaustive. Despite the frequentist nature of our methods, it is interesting to compare the proposed methods to the Bayesian network classifiers which have proven useful in applications (e.g., Geiger and Heckerman 1996; Pérez et al. 2006; Bielza and Larrañaga 2014). Furthermore, it is worthwhile to employ rigorous hypothesis testing procedures to evaluate whether the differences in the results obtained from different classifiers are statistically significant.

References

Agresti, A.: An Introduction to Categorical Data Analysis. Wiley, New York (2007).
Book Google Scholar
Agresti, A.: Categorical Data Analysis. Wiley, New York (2012).
MATH Google Scholar
Bagirov, A. M., Ferguson, B., Ivkovic, S., Saunders, G., Yearwood, J.: New algorithms for multi-class cancer diagnosis using tumor gene expression signatures. Bioinformatics. 19, 1800–1807 (2003).
Article Google Scholar
Baladanddayuthapani, V., Talluri, R., Ji, Y., Coombes, K. R., Lu, Y., Hennessy, B. T., Davies, M. A., Mallick, B. K.: Bayesian sparse graphical models for classification with application to protein expression data. Ann. Appl. Stat. 8, 1443–1468 (2014).
Article MathSciNet Google Scholar
Bicciato, S., Luchini, A., Bello, C. D.: Pca disjoint models for multiclass cancer analysis using gene expression data. Bioinformatics. 19, 571–578 (2003).
Article Google Scholar
Bielza, C., Li, G., Larrañaga, P.: Multi-dimensional classification with bayesian networks. Int. J. Approx. Reason. 52, 705–727 (2011).
Article MathSciNet Google Scholar
Bielza, C., Larrañaga, P.: Discrete bayesian network classifiers: A survey. ACM Comput. Surv. 47, 1–43 (2014).
Article Google Scholar
Cai, W., Guan, G., Pan, R., Zhu, X., Wang, H.: Network linear discriminant analysis. Comput. Stat. Data Anal. 117, 32–44 (2018).
Article MathSciNet Google Scholar
Cetiner, M., Akgul, Y. S.: Information Sciences and Systems 2014. In: In: T., C., E., G., R., L. (eds.) 2nd, pp. 53–76. Springer, New York (2014).
Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. In: Proceedings of KDD ’16 Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, San Francisco (2016). http://doi.org/10.1145/2939672.2939785.
Google Scholar
Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000).
Book Google Scholar
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001).
Article MathSciNet Google Scholar
Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 9, 432–441 (2008).
Article Google Scholar
Geiger, D., Heckerman, D.: Knowledge representation and inference in similarity networks and bayesian multinets. Artif. Intell. 82, 45–74 (1996).
Article MathSciNet Google Scholar
Guo, Y., Hastie, T., Tibshirani, R.: Regularized linear discriminant analysis and its application in microarrays. Biostatistics. 8, 86–100 (2007).
Article Google Scholar
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2008).
MATH Google Scholar
Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC press, New York (2015).
Book Google Scholar
Hsu, C. -W., Lin, C. -J.: A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 13, 415–425 (2002).
Article Google Scholar
Huttenhower, C., Flamholz, A. I., Landis, J. N., Sahi, S., Myers, C. L., Olszewski, K. L., Hibbs, M. A., Siemers, N. O., Troyanskaya, O. G., Coller, H. A.: Nearest neighbor networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics. 8, 1–13 (2007).
Article Google Scholar
James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning: with Applications in R. Springer, New York (2017).
MATH Google Scholar
Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: A stepwise procedure for building and training neural network. In: Fogelman Soulié, F., Hérault, J. (eds.) Neurocomputing: Algorithms, Architectures and Applications, pp. 41–50. Springer, Berlin (1990).
Lee, Y., Lee, C.-K.: Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics. 19, 1132–1139 (2003).
Lee, J., Hastie, T. J.: Learning the structure of mixed graphical models. J. Comput. Graph. Stat. 24, 230–253 (2015).
Liu, J. J., Cutler, G., Li, W., Pan, Z., Peng, S., Hoey, T., Chen, L., Ling, X. B.: Multiclass cancer classification and biomarker discovery using GA-based algorithms. Bioinformatics. 21, 2691–2697 (2005).
Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34, 1436–1462 (2006).
Miguel Hernández-Lobato, J., Hernández-Lobato, D., Suárez, A.: Network-based sparse Bayesian classification. Pattern Recognit. 44, 886–900 (2011).
Parambath, S. A. P., Usunier, N., Grandvalet, Y.: Optimizing pseudo-linear performance measures: Application to F-measure (2018). arXiv:1505.00199v4. Accessed 1 Jan 2018.
Pérez, A., Larrañaga, P., Inza, I.: Supervised classification with conditional Gaussian networks: Increasing the structure complexity from naive Bayes. Int. J. Approx. Reason. 43, 1–25 (2006).
Peterson, C. B., Stingo, F. C., Vannucci, M.: Joint Bayesian variable and graph selection for regression models with network-structured predictors. Stat. Med. 35, 1017–1031 (2015).
Ravikumar, P., Wainwright, M. J., Lafferty, J.: High-dimensional Ising model selection using ℓ₁-regularized logistic regression. Ann. Stat. 38, 1287–1319 (2010).
Safo, S. E., Ahn, J.: General sparse multi-class linear discriminant analysis. Comput. Stat. Data Anal. 99, 81–90 (2016).
Sokolova, M., Japkowicz, N., Szpakowicz, S.: AI 2006: Advances in Artificial Intelligence. In: Sattar, A., Kang, B. (eds.) 1st, pp. 53–76. Springer, Berlin (2006).
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Royal Stat. Soc. Ser. B. 58, 267–288 (1996).
Wang, H., Li, R., Tsai, C.: Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 94, 553–568 (2007).
Yang, E., Ravikumar, P., Allen, G. I., Liu, Z.: Graphical models via univariate exponential family distributions. J. Mach. Learn. Res. 16, 3813–3847 (2015).
Yi, G. Y.: Composite likelihood/pseudolikelihood. Wiley StatsRef: Stat. Ref. Online (2017). https://doi.org/10.1002/9781118445112.stat07855.
Yi, G. Y., He, W., Li, H.: A class of flexible models for analysis of complex structured correlated data with application to clustered longitudinal data. Stat. 6, 448–461 (2017).
Zhu, Y., Shen, X., Pan, W.: Network-based support vector machine for classification of microarray samples. BMC Bioinformatics. 10, 1–11 (2009).
Zi, X., Liu, Y., Gao, P.: Mutual information network-based support vector machine for identification of rheumatoid arthritis-related genes. Int. J. Clin. Experiment. Med. 9, 11764–11771 (2016).
Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006).

Acknowledgements

This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and partially supported by a Collaborative Research Team Project of the Canadian Statistical Sciences Institute (CANSSI).

Author information

Li-Pang Chen and Grace Y. Yi led the project with equal contributions, including writing the paper.

Authors and Affiliations

Department of Statistics and Actuarial Science, University of Waterloo, 200 University Ave W, Waterloo, N2L 3G1, Canada
Li-Pang Chen, Grace Y. Yi & Qihuang Zhang
Department of Statistical and Actuarial Sciences, University of Western Ontario, 1151 Richmond St North, London, N6A 5B7, Canada
Wenqing He

Contributions

The first two authors led the project with equal contributions, including writing the paper; the last two authors participated in the project with equal contributions.

Corresponding author

Correspondence to Grace Y. Yi.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional information

Qihuang Zhang and Wenqing He participated in the project with equal contributions.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Chen, LP., Yi, G.Y., Zhang, Q. et al. Multiclass analysis and prediction with network structured covariates. J Stat Distrib App 6, 6 (2019). https://doi.org/10.1186/s40488-019-0094-2

Received: 25 October 2018
Accepted: 06 May 2019
Published: 06 June 2019
DOI: https://doi.org/10.1186/s40488-019-0094-2
