Approximations of conditional probability density functions in Lebesgue spaces via mixture of experts models
Journal of Statistical Distributions and Applications volume 8, Article number: 13 (2021)
Abstract
Mixture of experts (MoE) models are widely applied to conditional probability density estimation problems. We demonstrate the richness of the class of MoE models by proving denseness results in Lebesgue spaces, when the input and output variables are both compactly supported. We further prove an almost uniform convergence result when the input is univariate. Auxiliary lemmas are proved regarding the richness of the soft-max gating function class, and its relationship to the class of Gaussian gating functions.
Introduction
Mixture of experts (MoE) models are a widely applicable class of conditional probability density approximations that have been used as solution methods across the spectrum of statistical and machine learning problems (Yuksel et al. 2012; Masoudnia and Ebrahimpour 2014; Nguyen and Chamroukhi 2018).
Let \(\mathbb {Z}=\mathbb {X}\times \mathbb {Y}\), where \(\mathbb {X}\subseteq \mathbb {R}^{d}\) and \(\mathbb {Y}\subseteq \mathbb {R}^{q}\), for \(d,q\in \mathbb {N}\). Suppose that the input and output random variables, \(\boldsymbol {X}\in \mathbb {X}\) and \(\boldsymbol {Y}\in \mathbb {Y}\), are related via the conditional probability density function (PDF) f(y|x) in the functional class:
where λ denotes the Lebesgue measure. The MoE approach seeks to approximate the unknown target conditional PDF f by a function of the MoE form:
where \(\mathbf {Gate}=\left (\text {Gate}_{k}\right)_{k\in [K]}\in \mathcal {G}^{K}\) (\([K]=\left \{ 1,\dots,K\right \} \)), \(\text {Expert}_{1},\dots,\text {Expert}_{K}\in \mathcal {E}\), and \(K\in \mathbb {N}\). Here, we say that m is a K-component MoE model with gates arising from the class \(\mathcal {G}^{K}\) and experts arising from \(\mathcal {E}\), where \(\mathcal {E}\) is a class of PDFs with support \(\mathbb {Y}\).
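To make the K-component MoE form concrete, the following sketch evaluates \(m\left (\boldsymbol {y}|\boldsymbol {x}\right)=\sum _{k=1}^{K}\text {Gate}_{k}\left (\boldsymbol {x}\right)\text {Expert}_{k}\left (\boldsymbol {y}\right)\) numerically; the parameter values are illustrative, with soft-max gates and univariate Gaussian experts as one admissible choice of gate and expert classes:

```python
import numpy as np

def softmax_gates(x, a, b):
    """Soft-max gating vector: Gate_k(x) proportional to exp(a_k + b_k . x)."""
    logits = a + b @ x                  # shape (K,)
    logits -= logits.max()              # numerical stabilisation
    w = np.exp(logits)
    return w / w.sum()

def moe_density(y, x, a, b, means, scales):
    """K-component MoE m(y|x) = sum_k Gate_k(x) * Expert_k(y),
    with univariate Gaussian experts as a concrete expert class."""
    gates = softmax_gates(x, a, b)
    experts = np.exp(-0.5 * ((y - means) / scales) ** 2) / (scales * np.sqrt(2 * np.pi))
    return float(gates @ experts)

rng = np.random.default_rng(0)
K, d = 3, 2
a, b = rng.normal(size=K), rng.normal(size=(K, d))
means, scales = np.array([-1.0, 0.0, 1.0]), np.array([0.5, 0.3, 0.7])
x = np.array([0.2, 0.8])

# the gates form a probability vector, so m(.|x) is itself a PDF in y;
# check mass by a Riemann sum over a fine grid
g = softmax_gates(x, a, b)
ys = np.linspace(-8, 8, 4001)
dy = ys[1] - ys[0]
mass = dy * sum(moe_density(y, x, a, b, means, scales) for y in ys)
```

Since each expert integrates to one and the gates sum to one at every x, the mixture integrates to one for every fixed x, which is the defining property of a conditional PDF approximation.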
The most popular choices for \(\mathcal {G}^{K}\) are the parametric soft-max and Gaussian gating classes:
and
respectively, where
and
Here,
is the multivariate normal density function with mean vector ν and covariance matrix \(\boldsymbol {\Sigma }, \boldsymbol {\pi }^{\top }=\left (\pi _{1},\dots,\pi _{K}\right)\) is a vector of weights in the simplex:
and \(\mathbb {S}_{d}\) is the class of d×d symmetric positive definite matrices. The soft-max and Gaussian gating classes were first introduced by Jacobs et al. (1991); Jordan and Xu (1995), respectively. Typically, one chooses experts that arise from some location-scale class:
where ψ is a PDF on \(\mathbb {R}^{q}\), in the sense that \(\psi :\mathbb {R}^{q}\rightarrow \left [0,\infty \right)\) and \(\int _{\mathbb {R}^{q}}\psi \left (\boldsymbol {y}\right)\mathrm {d}\lambda \left (\boldsymbol {y}\right)=1\).
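The location-scale construction guarantees that each expert is itself a PDF on \(\mathbb {R}^{q}\), whatever the base density ψ, since the substitution \(\boldsymbol {t}=\left (\boldsymbol {y}-\boldsymbol {\mu }\right)/\sigma \) preserves unit mass. A quick numerical check with q=1 and a logistic ψ (an illustrative, non-Gaussian choice):

```python
import numpy as np

def location_scale_pdf(y, mu, sigma, psi):
    """Expert(y; mu, sigma) = psi((y - mu)/sigma) / sigma**q, with q = 1 here."""
    return psi((y - mu) / sigma) / sigma

# any base PDF psi works; take the standard logistic density as an example
psi = lambda t: np.exp(-t) / (1.0 + np.exp(-t)) ** 2

# Riemann-sum check that the shifted and rescaled density still has unit mass
ys = np.linspace(-60.0, 60.0, 20001)
dy = ys[1] - ys[0]
mass = dy * np.sum(location_scale_pdf(ys, mu=2.0, sigma=3.0, psi=psi))
```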
We shall say that \(f\in \mathcal {L}_{p}\left (\mathbb {Z}\right)\) for any p∈[1,∞) if
where \(\mathbf {1}_{\mathbb {Z}}\) is the indicator function that takes value 1 when \(\boldsymbol {z}\in \mathbb {Z}\), and 0 otherwise. Further, we say that \(f\in \mathcal {L}_{\infty }\left (\mathbb {Z}\right)\) if
We shall refer to \(\left \Vert \cdot \right \Vert _{p,\mathbb {Z}}\) as the \(\mathcal {L}_{p}\) norm on \(\mathbb {Z}\), for p∈[1,∞], and where the context is obvious, we shall drop the reference to \(\mathbb {Z}\).
Suppose that the target conditional PDF f is in the class \(\mathcal {F}_{p}=\mathcal {F}\cap \mathcal {L}_{p}\). We address the problem of approximating f, with respect to the \(\mathcal {L}_{p}\) norm, using MoE models in the soft-max and Gaussian gated classes,
and
by showing that both \(\mathcal {M}_{S}^{\psi }\) and \(\mathcal {M}_{G}^{\psi }\) are dense in the class \(\mathcal {F}_{p}\), when \(\mathbb {X}=\left [0,1\right ]^{d}\) and \(\mathbb {Y}\) is a compact subset of \(\mathbb {R}^{q}\). Our denseness results are enabled by the indicator function approximation result of Jiang and Tanner (1999a), and the finite mixture model denseness theorems of Nguyen et al. (2020); Nguyen et al. (2020a).
Our theorems contribute to a sustained line of interest in the approximation capabilities of MoE models. Related to our results are contributions regarding the approximation capabilities of the conditional expectation function of the classes \(\mathcal {M}_{S}^{\psi }\) and \(\mathcal {M}_{G}^{\psi }\) (Wang and Mendel 1992; Zeevi et al. 1998; Jiang and Tanner 1999a; Krzyzak and Schafer 2005; Mendes and Jiang 2012; Nguyen et al. 2016; Nguyen et al. 2019) and the approximation capabilities of subclasses of \(\mathcal {M}_{S}^{\psi }\) and \(\mathcal {M}_{G}^{\psi }\), with respect to the Kullback–Leibler divergence (Jiang and Tanner 1999b; Norets 2010; Norets and Pelenis 2014). Our results can be seen as complements to the Kullback–Leibler approximation theorems of Norets (2010); Norets and Pelenis (2014), by the relationship between the Kullback–Leibler divergence and the \(\mathcal {L}_{2}\) norm (Zeevi and Meir 1997). That is, when f>1/κ, for all \(\left (\boldsymbol {x},\boldsymbol {y}\right)\in \mathbb {Z}\) and some constant κ>0, we have that the integrated conditional Kullback–Leibler divergence considered by Norets and Pelenis (2014):
satisfies
and thus a good approximation in the integrated Kullback–Leibler divergence is guaranteed if one can find a good approximation in the \(\mathcal {L}_{2}\) norm, which is guaranteed by our main result.
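A small numerical illustration of this chain of inequalities follows; the particular f, m, and κ are illustrative choices satisfying the lower-bound condition. With \(f\left (y|x\right)=1+0.5\sin \left (2\pi \left (x+y\right)\right)\) on [0,1]×[0,1], m the uniform conditional density, and κ=2 (so that both densities exceed 1/κ), the integrated KL divergence is dominated by κ times the squared \(\mathcal {L}_{2}\) distance:

```python
import numpy as np

# uniform grid on [0,1]^2 (endpoint excluded so means are exact Riemann sums)
xs = np.linspace(0.0, 1.0, 200, endpoint=False)
ys = np.linspace(0.0, 1.0, 200, endpoint=False)
X, Y = np.meshgrid(xs, ys, indexing="ij")

f = 1.0 + 0.5 * np.sin(2.0 * np.pi * (X + Y))   # target conditional PDF, f >= 1/2
m = np.ones_like(f)                             # crude approximation: uniform density
kappa = 2.0                                     # 1/kappa lower-bounds both f and m

# on [0,1]^2, the double integral is just the grid mean
kl = (f * np.log(f / m)).mean()                 # integrated conditional KL divergence
l2sq = ((f - m) ** 2).mean()                    # squared L2 distance
```

Here kl is roughly 0.06 while kappa * l2sq is 0.25, consistent with the stated domination of the KL divergence by the squared \(\mathcal {L}_{2}\) norm.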
The remainder of the manuscript proceeds as follows. The main result is presented in Section 2. Technical lemmas are provided in Section 3. The proofs of our results are then presented in Section 4. Proofs of required lemmas that do not appear elsewhere are provided in Section 5. A summary of our work and some conclusions are drawn in Section 6.
Main results
Denote the class of bounded functions on \(\mathbb {Z}\) by
and write its norm as \(\left \Vert f\right \Vert _{\mathcal {B}\left (\mathbb {Z}\right)}=\sup _{\boldsymbol {z}\in \mathbb {Z}}\left |f\left (\boldsymbol {z}\right)\right |\). Further, let \(\mathcal {C}\) denote the class of continuous functions. Note that if \(\mathbb {Z}\) is compact and \(f\in \mathcal {C}\), then \(f\in \mathcal {B}\).
Theorem 1
Assume that \(\mathbb {X}=\left [0,1\right ]^{d}\) for \(d\in \mathbb {N}\). There exists a sequence \(\left \{ m_{K}^{\psi }\right \}_{K\in \mathbb {N}}\subset \mathcal {M}_{S}^{\psi }\), such that if \(\mathbb {Y}\subset \mathbb {R}^{q}\) is compact, \(f\in \mathcal {F}\cap \mathcal {C}\), and \(\psi \in \mathcal {C}\left (\mathbb {R}^{q}\right)\) is a PDF on support \(\mathbb {R}^{q}\), then \({\lim }_{K\rightarrow \infty }\left \Vert f-m_{K}^{\psi }\right \Vert _{p}=0\), for p∈[1,∞).
Since convergence in Lebesgue spaces does not imply point-wise modes of convergence, the following result is also useful and interesting in some restricted scenarios. Here, we note that the mode of convergence is almost uniform, which implies almost everywhere convergence and convergence in measure (cf. Bartle 1995, Lem. 7.10 and Thm. 7.11). The almost uniform convergence of \(\left \{ m_{K}^{\psi }\right \}_{K\in \mathbb {N}}\) to f in the following result is to be understood in the sense of Bartle (1995), Def. 7.9. That is, for every δ>0, there exists a set \(\mathbb {E}_{\delta }\subset \mathbb {Z}\) with \(\lambda \left (\mathbb {E}_{\delta }\right)<\delta \), such that \(\left \{ m_{K}^{\psi }\right \}_{K\in \mathbb {N}}\) converges to f, uniformly on \(\mathbb {Z}\backslash \mathbb {E}_{\delta }\).
Theorem 2
Assume that \(\mathbb {X}=\left [0,1\right ]\). There exists a sequence \(\left \{ m_{K}^{\psi }\right \}_{K\in \mathbb {N}}\subset \mathcal {M}_{S}^{\psi }\), such that if \(\mathbb {Y}\subset \mathbb {R}^{q}\) is compact, \(f\in \mathcal {F}\cap \mathcal {C}\), and \(\psi \in \mathcal {C}\left (\mathbb {R}^{q}\right)\) is a PDF on support \(\mathbb {R}^{q}\), then \({\lim }_{K\rightarrow \infty }m_{K}^{\psi }=f\), almost uniformly.
The following result establishes the connection between the gating classes \(\mathcal {G}_{S}^{K}\) and \(\mathcal {G}_{G}^{K}\).
Lemma 1
For each \(K\in \mathbb {N}, \mathcal {G}_{S}^{K}\subset \mathcal {G}_{G}^{K}\). Further, if we define the class of Gaussian gating vectors with equal covariance matrices:
where
then \(\mathcal {G}_{E}^{K}\subset \mathcal {G}_{S}^{K}\).
We can directly apply Lemma 1 to establish the following corollary to Theorems 1 and 2, regarding the approximation capability of the class \(\mathcal {M}_{G}^{\psi }\).
Corollary 1
Theorems 1 and 2 hold when \(\mathcal {M}_{S}^{\psi }\) is replaced by \(\mathcal {M}_{G}^{\psi }\) in their statements.
Technical lemmas
Let \(\mathbb {K}^{n}=\left \{ \left (k_{1},\dots,k_{d}\right)\in [n]^{d}\right \} \) and \(\kappa :\mathbb {K}^{n}\rightarrow \left [n^{d}\right ]\) be a bijection for each \(n\in \mathbb {N}\). For each \(\left (k_{1},\dots,k_{d}\right)\in \mathbb {K}^{n}\) and \(k\in \left [n^{d}\right ]\), we define \(\mathbb {X}_{k}^{n}=\mathbb {X}_{\kappa \left (k_{1},\dots,k_{d}\right)}^{n}=\prod _{i=1}^{d}\mathbb {I}_{k_{i}}^{n}\), where \(\mathbb {I}_{k_{i}}^{n}=\left [\left (k_{i}-1\right)/n,k_{i}/n\right)\) for ki∈[n−1], and \(\mathbb {I}_{n}^{n}=\left [\left (n-1\right)/n,1\right ]\).
We call \(\left \{ \mathbb {X}_{k}^{n}\right \}_{k\in \left [n^{d}\right ]}\) a fine partition of \(\mathbb {X}\), in the sense that \(\mathbb {X}=\left [0,1\right ]^{d}=\bigcup _{k=1}^{n^{d}}\mathbb {X}_{k}^{n}\), for each n, and that \(\lambda \left (\mathbb {X}_{k}^{n}\right)=n^{-d}\) gets smaller, as n increases. The following result from Jiang and Tanner (1999a) establishes the approximation capability of soft-max gates.
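The fine partition and the bijection κ can be made concrete as follows; the lexicographic enumeration of the multi-indices is our illustrative choice of κ:

```python
import itertools
import numpy as np

def fine_partition(n, d):
    """Enumerate the n**d cells X_k^n of [0,1]^d as lists of per-axis intervals;
    kappa(k_1,...,k_d) is the position in the lexicographic enumeration."""
    cells = []
    for ks in itertools.product(range(1, n + 1), repeat=d):
        # axis-i factor is [(k_i-1)/n, k_i/n), closed on the right when k_i = n
        cells.append([((k - 1) / n, k / n) for k in ks])
    return cells

def cell_index(x, n):
    """Locate the unique multi-index (k_1,...,k_d) whose cell contains x."""
    return tuple(min(int(np.floor(xi * n)) + 1, n) for xi in x)

n, d = 4, 2
cells = fine_partition(n, d)
vol = (1.0 / n) ** d          # every cell has Lebesgue measure n**(-d)
```

Each point of \(\left [0,1\right ]^{d}\) falls in exactly one cell, and every cell has measure n^{-d}, which shrinks as the partition is refined.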
Lemma 2
(Jiang and Tanner, 1999, p. 1189) For each \(n\in \mathbb {N}, p\in \left [1,\infty \right)\) and ε>0, there exist gating functions
for some \(\boldsymbol {\gamma }\in \mathbb {G}_{S}^{n^{d}}\), such that
When d=1, we also have the following almost uniform convergence alternative to Lemma 2.
Lemma 3
Let \(\mathbb {X}=\left [0,1\right ]\). Then, for each \(n\in \mathbb {N}\), there exists a sequence of gating functions:
defined by \(\left \{ \boldsymbol {\gamma }_{l}\right \}_{l\in \mathbb {N}}\subset \mathbb {G}_{S}^{n}\), such that
almost uniformly, simultaneously for all k∈[n].
For PDF ψ on support \(\mathbb {R}^{q}\), define the class of finite mixture models by
We require the following result, from Nguyen et al. (2020), regarding the approximation capabilities of \(\mathcal {H}^{\psi }\).
Lemma 4
(Nguyen et al., 2020a, Thm. 2(b)) If \(f\in \mathcal {C}\left (\mathbb {Y}\right)\) is a PDF on \(\mathbb {Y}, \psi \in \mathcal {C}\left (\mathbb {R}^{q}\right)\) is a PDF on \(\mathbb {R}^{q}\), and \(\mathbb {Y}\subset \mathbb {R}^{q}\) is compact, then there exists a sequence \(\left \{ h_{K}^{\psi }\right \}_{K\in \mathbb {N}}\subset \mathcal {H}^{\psi }\), such that \({\lim }_{K\rightarrow \infty }\left \Vert f-h_{K}^{\psi }\right \Vert _{\mathcal {B}\left (\mathbb {Y}\right)}=0\).
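The flavour of Lemma 4 can be illustrated numerically: a continuous PDF on a compact set, here the Beta(2,2) density on [0,1], is approximated in sup norm by a finite mixture of rescaled copies of ψ. The grid-based construction below (component locations on a grid, weights proportional to f, bandwidth shrinking with K) is a simple illustrative scheme, not the construction used in the proof of the cited theorem:

```python
import numpy as np

f = lambda y: 6.0 * y * (1.0 - y)            # Beta(2,2): a continuous PDF on [0,1]
phi = lambda t: np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)

def mixture_approx(K, y):
    """h_K(y) = sum_k c_k * phi((y - mu_k)/sigma) / sigma, with K components on
    a grid, weights proportional to f(mu_k), and bandwidth sigma = 2/K."""
    mu = (np.arange(K) + 0.5) / K
    sigma = 2.0 / K
    c = f(mu)
    c = c / c.sum()                          # normalise so h_K is a PDF
    return (c[:, None] * phi((y[None, :] - mu[:, None]) / sigma) / sigma).sum(axis=0)

# sup-norm error on an interior grid (away from boundary smoothing bias)
y = np.linspace(0.15, 0.85, 400)
errs = [np.max(np.abs(f(y) - mixture_approx(K, y))) for K in (10, 80)]
```

The sup-norm error shrinks as K grows, in line with the uniform convergence guaranteed by the lemma.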
Proofs of main results
4.1 Proof of Theorem 1
To prove the result, it suffices to show that for each ε>0, there exists an \(m_{K}^{\psi }\in \mathcal {M}_{S}^{\psi }\), such that
The main steps of the proof are as follows. We firstly approximate f(y|x) by
where \(\boldsymbol {x}_{k}^{n}\in \mathbb {X}_{k}^{n}\), for each \(k\in \left [n^{d}\right ]\), such that
for all n≥N1(ε), for some sufficiently large \(N_{1}(\epsilon)\in \mathbb {N}\). Then we approximate υn(y|x) by
where \(\boldsymbol {\gamma }_{n}\in \mathbb {G}_{S}^{n^{d}}\) and \(\mathbf {Gate}=\left (\text {Gate}_{k}\left (\cdot ;\boldsymbol {\gamma }_{n}\right)\right)_{k\in \left [n^{d}\right ]}\in \mathcal {G}_{S}^{n^{d}}\), so that
using Lemma 2.
Finally, we approximate ηn(y|x) by \(m_{K_{n}}^{\psi }\left (\boldsymbol {y}|\boldsymbol {x}\right)\), where
and
for \(n_{k}\in \mathbb {N}\) (\(k\in \left [n^{d}\right ]\)), such that \(K_{n}=\sum _{k=1}^{n^{d}}n_{k}\). Here, we establish that there exists \(N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\in \mathbb {N}\), so that when \(n_{k}\ge N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\),
Results (2)–(7) then imply that for each ε>0, there exist \(N_{1}(\epsilon)\), \(\boldsymbol {\gamma }_{n}\), and \(N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\), such that for all \(K_{n}=\sum _{k=1}^{n^{d}}n_{k}\), where \(n_{k}\ge N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\) (for each \(k\in \left [n^{d}\right ]\)) and n≥N1(ε), the following inequality holds, via an application of the triangle inequality:
We now focus our attention on proving each of the results (2)–(7). To prove (2), we note that since f is uniformly continuous (because \(\mathbb {Z}=\mathbb {X}\times \mathbb {Y}\) is compact, and \(f\in \mathcal {C}\)), there exists a function (1) such that for all ε>0,
We can construct such an approximation by considering the fact that as n increases, the diameter \(\delta _{n}=\sup _{k\in \left [n^{d}\right ]}\text {diam}\left (\mathbb {X}_{k}^{n}\right)\) of the fine partition goes to zero. By the uniform continuity of f, for every ε>0, there exists a δ(ε)>0, such that if ∥(x1,y1)−(x2,y2)∥<δ(ε), then |f(y1|x1)−f(y2|x2)|<ε, for all pairs \(\left (\boldsymbol {x}_{1},\boldsymbol {y}_{1}\right),\left (\boldsymbol {x}_{2},\boldsymbol {y}_{2}\right)\in \mathbb {Z}\). Here, ∥·∥ denotes the Euclidean norm. Furthermore, for any \(\left (\boldsymbol {x},\boldsymbol {y}\right)\in \mathbb {Z}\), we have
by the triangle inequality.
Since \(\boldsymbol {x}_{k}^{n}\in \mathbb {X}_{k}^{n}\), for each k and n, we have the fact that \(\left \Vert \left (\boldsymbol {x},\boldsymbol {y}\right)-\left (\boldsymbol {x}_{k}^{n},\boldsymbol {y}\right)\right \Vert <\delta _{n}\) for \(\left (\boldsymbol {x},\boldsymbol {y}\right)\in \mathbb {X}_{k}^{n}\times \mathbb {Y}\). By uniform continuity, for each ε, we can find a sufficiently small δ(ε), such that \(\left |f\left (\boldsymbol {y}|\boldsymbol {x}\right)-f\left (\boldsymbol {y}|\boldsymbol {x}_{k}^{n}\right)\right |<\varepsilon \), if \(\left \Vert \left (\boldsymbol {x},\boldsymbol {y}\right)-\left (\boldsymbol {x}_{k}^{n},\boldsymbol {y}\right)\right \Vert <\delta (\epsilon)\), for all k. The desired result (8) can be obtained by noting that the right hand side of (9) consists of only one non-zero summand for any \(\left (\boldsymbol {x},\boldsymbol {y}\right)\in \mathbb {Z}\), and by choosing \(n\in \mathbb {N}\) sufficiently large, so that δn<δ(ε).
By (8), we have the fact that υn→f, point-wise. We can bound υn as follows:
where the right-hand side is a constant and is therefore in \(\mathcal {L}_{p}\), since \(\mathbb {Z}\) is compact. An application of the Lebesgue dominated convergence theorem in \(\mathcal {L}_{p}\) then yields (2).
Next we write
Since the norm arguments are separable in x and y, we apply Fubini’s theorem to get
Because \(f\in \mathcal {B}\) and \(n^{d}\) is finite, for any fixed \(n\in \mathbb {N}\), we have \(C_{1}(n)=\sum _{k=1}^{n^{d}}\left \Vert f\left (\boldsymbol {y}|\boldsymbol {x}_{k}^{n}\right)\right \Vert _{p,\mathbb {Y}}<\infty \). For each ε>0, we need to choose a \(\boldsymbol {\gamma }_{n}\in \mathbb {G}_{S}^{n^{d}}\), such that
which can be achieved via a direct application of Lemma 2. We have thus shown (4).
Lastly, we are required to approximate \(f\left (\boldsymbol {y}|\boldsymbol {x}_{k}^{n}\right)\) for each \(k\in \left [n^{d}\right ]\), by a function of form (6). Since \(\mathbb {Y}\) is compact and f and ψ are continuous, we can apply Lemma 4 directly. Note that over a set of finite measure, convergence in \(\left \Vert \cdot \right \Vert _{\mathcal {B}}\) implies convergence in the \(\mathcal {L}_{p}\) norm, for all p∈[1,∞] (cf. Oden and Demkowicz 2010, Prop. 3.9.3).
We can then write (5) as
where \(\boldsymbol {\gamma }_{n}=\left (a_{n,1},\dots,a_{n,n^{d}},\boldsymbol {b}_{n,1},\dots,\boldsymbol {b}_{n,n^{d}}\right)\). From (11), we observe that \(m_{K_{n}}^{\psi }\in \mathcal {M}_{S}^{\psi }\), with \(K_{n}=\sum _{k=1}^{n^{d}}n_{k}\).
To obtain (7), we write
By separability and Fubini’s theorem, we then have
Let \(C_{2}\left (n,\boldsymbol {\gamma }_{n}\right)=\sup _{k\in \left [n^{d}\right ]}\left \Vert \text {Gate}_{k}\left (\boldsymbol {x};\boldsymbol {\gamma }_{n}\right)\right \Vert _{p,\mathbb {X}}\). Then, we apply Lemma 4 \(n^{d}\) times to establish the existence of a constant \(N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\in \mathbb {N}\), such that for all \(k\in \left [n^{d}\right ]\) and \(n_{k}\ge N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\),
Thus, we have
which completes our proof.
4.2 Proof of Theorem 2
The proof is procedurally similar to that of Theorem 1, and thus we only highlight the important differences. Firstly, for any ε>0, we approximate f(y|x) by υn(y|x) of form (1), with d=1. Result (2) implies uniform convergence, in the sense that there exists an \(N_{1}(\epsilon)\in \mathbb {N}\), such that for all n≥N1(ε),
We now seek to approximate υn by ηn of form (3), with γn=γl for some \(l\in \mathbb {N}\). Upon application of Lemma 3, it follows that for each \(k\in \left [n^{d}\right ]\) and ε>0, there exists a measurable set \(\mathbb {B}_{k}(\varepsilon)\subseteq \mathbb {X}\), such that
and
for all l≥Mk(ε,n), for some \(M_{k}\left (\epsilon,n\right)\in \mathbb {N}\). Here, \((\cdot)^{\complement }\) is the set complement operator.
Since \(f\in \mathcal {B}\), we have the bound \(C(n)=\sum _{k=1}^{n^{d}}\left \Vert f\left (\boldsymbol {y}|\boldsymbol {x}_{k}^{n}\right)\right \Vert _{\mathcal {B}\left (\mathbb {Y}\right)}<\infty \). Write \(\mathbb {B}(\varepsilon)=\bigcup _{k=1}^{n^{d}}\mathbb {B}_{k}(\varepsilon)\). Then, \(\mathbb {B}^{\complement }(\varepsilon)=\bigcap _{k=1}^{n^{d}}\mathbb {B}_{k}^{\complement }(\varepsilon)\),
and
for all \(l\ge M\left (\epsilon,n\right)=\max _{k\in \left [n^{d}\right ]}M_{k}\left (\epsilon,n\right)\). Here we use the fact that the supremum over an intersection of sets is less than or equal to the minimum of the suprema over the individual sets.
Upon defining \(\mathbb {C}(\varepsilon)=\mathbb {B}(\varepsilon)\times \mathbb {Y}\subset \mathbb {Z}\), we observe that
and \(\mathbb {C}(\varepsilon)\subset \mathbb {B}(\varepsilon)\times \mathbb {Y}\). Note also that
and
It follows that
Since \(\mathbb {B}(\varepsilon)\times \mathbb {Y}^{\complement }\) and \(\mathbb {B}^{\complement }(\varepsilon)\times \mathbb {Y}^{\complement }\) are empty, via separability, we have
Recall that the \(\sum _{k=1}^{n^{d}}\left \Vert f\left (\boldsymbol {y}|\boldsymbol {x}_{k}^{n}\right)\right \Vert _{\mathcal {B\left (\mathbb {Y}\right)}}=C(n)<\infty \) and that we can choose l≥M(ε,n) so that
and thus
as required.
Finally, by noting that for each \(k\in \left [n^{d}\right ]\), both (6) and \(f\left (\cdot |\boldsymbol {x}_{k}^{n}\right)\) are continuous over \(\mathbb {Y}\), we apply Lemma 4 to obtain an \(N_{2}\left (\epsilon,n,l\right)\in \mathbb {N}\), such that for any ε>0 and nk≥N2(ε,n,l), we have
Here \(M_{1}=\sup _{k\in \left [n^{d}\right ]}\left \Vert \text {Gate}_{k}\left (\cdot ;\boldsymbol {\gamma }_{l}\right)\right \Vert _{\mathcal {B\left (\mathbb {X}\right)}}<\infty \), since Gatek(x;γl) is continuous in x, and \(\mathbb {X}\) is compact. Therefore, for all \(K_{n}=\sum _{k=1}^{n^{d}}n_{k}, n_{k}\ge N_{2}\left (\epsilon,n,l\right)\),
In summary, via (12), (13), and (14), for each ε>0, there exists a \(\mathbb {C}(\varepsilon)\subset \mathbb {Z}\) and constants \(N_{1}(\epsilon),M\left (\epsilon,n\right),N_{2}\left (\epsilon,n,l\right)\in \mathbb {N}\), such that for all \(K_{n}=\sum _{k=1}^{n^{d}}n_{k}\), with nk≥N2(ε,n,l),l≥M(ε,n), and n≥N1(ε), it follows that \(\lambda \left (\mathbb {C}(\varepsilon)\right)<\varepsilon \), and
This completes the proof.
Proofs of lemmas
5.1 Proof of Lemma 1
We firstly prove that any gating vector from \(\mathcal {G}_{S}^{K}\) can be equivalently represented as an element of \(\mathcal {G}_{G}^{K}\). For any \(\boldsymbol {x}\in \mathbb {R}^{d}, d\in \mathbb {N}, k\in [K], a_{k}\in \mathbb {R}, \boldsymbol {b}_{k}\in \mathbb {R}^{d}\), and \(K\in \mathbb {N}\), choose \(\boldsymbol {\nu }_{k}=\boldsymbol {b}_{k}, \tau _{k}=a_{k}+\boldsymbol {b}_{k}^{\top }\boldsymbol {b}_{k}/2\) and
This implies that \(\sum _{l=1}^{K}\pi _{l}=1, \pi _{l}>0\), for all l∈[K], and
where I is the identity matrix of appropriate size. This proves that \(\mathcal {G}_{S}^{K}\subset \mathcal {G}_{G}^{K}\).
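This direction of the proof can be checked numerically: with \(\boldsymbol {\nu }_{k}=\boldsymbol {b}_{k}\), identity covariance matrices, and mixing weights \(\pi _{k}\propto \exp \left (a_{k}+\boldsymbol {b}_{k}^{\top }\boldsymbol {b}_{k}/2\right)\), the Gaussian gating vector coincides with the soft-max gating vector, since the common factor \(\exp \left (-\left \Vert \boldsymbol {x}\right \Vert ^{2}/2\right)\) cancels in the normalisation. A verification sketch with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 4, 3
a = rng.normal(size=K)                  # soft-max parameters a_k
b = rng.normal(size=(K, d))             # soft-max parameters b_k

def softmax_gate(x):
    s = a + b @ x
    w = np.exp(s - s.max())
    return w / w.sum()

def gaussian_gate(x):
    """Gaussian gates with nu_k = b_k, Sigma_k = I, and
    pi_k proportional to exp(a_k + b_k . b_k / 2), as in the proof."""
    pi = np.exp(a + 0.5 * np.sum(b * b, axis=1))
    pi = pi / pi.sum()
    dens = np.exp(-0.5 * np.sum((x - b) ** 2, axis=1)) / (2.0 * np.pi) ** (d / 2)
    w = pi * dens
    return w / w.sum()

x = rng.normal(size=d)                  # the two gating vectors agree at any x
```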
Next, to show that \(\mathcal {G}_{E}^{K}\subset \mathcal {G}_{S}^{K}\), we write
and note that
Thus, we have
Next, notice that we can write
where \(\alpha _{l}=a_{l}-a_{k}\) and \(\boldsymbol {\beta }_{l}=\boldsymbol {b}_{l}-\boldsymbol {b}_{k}\). We now choose ak and bk, such that for every l∈[K],
and
To complete the proof, we choose
and \(\boldsymbol {b}_{k}=\boldsymbol {\Sigma }^{-1}\boldsymbol {\nu }_{k}\) (so that \(\boldsymbol {b}_{k}^{\top }\boldsymbol {x}=\boldsymbol {\nu }_{k}^{\top }\boldsymbol {\Sigma }^{-1}\boldsymbol {x}\), by the symmetry of \(\boldsymbol {\Sigma }\)), for each k∈[K].
5.2 Proof of Lemma 3
For l∈[0,∞), write
where \(x\in \mathbb {X}=\left [0,1\right ]\), and ck=(k−1)/(2k). We identify that Gate=(Gatek(x,l))k∈[n] belongs to the class \(\mathcal {G}_{S}^{n}\). The proof of the Section 4 Proposition from Jiang and Tanner (1999a) reveals that for all k∈[n],
almost everywhere in λ, as l→∞. The result then follows via an application of Egorov’s theorem (cf. Folland 1999, Thm. 2.33).
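The mechanism behind Lemma 3 can be visualised numerically: soft-max gates whose linear scores are scaled by l concentrate on the cells of the fine partition as l→∞. The particular scores below, \(l\left (kx-k\left (k-1\right)/\left (2n\right)\right)\), are our illustrative choice of parameters in \(\mathbb {G}_{S}^{n}\), arranged so that the k-th score dominates exactly on the k-th subinterval; they are not necessarily the constants used in the lemma's proof:

```python
import numpy as np

def gates(x, n, l):
    """Soft-max gates with linear scores l*(k*x - k*(k-1)/(2n)), k = 1..n.
    Adjacent scores cross at x = k/n, so score k dominates exactly on
    [(k-1)/n, k/n), and the gates approach indicator functions as l grows."""
    k = np.arange(1, n + 1)
    s = l * (k * x - k * (k - 1) / (2.0 * n))
    w = np.exp(s - s.max())
    return w / w.sum()

n = 4
x = 0.6                       # lies in the 3rd cell [0.5, 0.75)
soft = gates(x, n, l=5.0)     # diffuse gating vector
hard = gates(x, n, l=500.0)   # nearly the indicator of the 3rd cell
```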
Summary and conclusions
Using the recent mixture model approximation results of Nguyen et al. (2020) and Nguyen et al. (2020a), and the indicator approximation theorem of Jiang and Tanner (1999a) (cf. Section 3), we have proved two approximation theorems (Theorems 1 and 2) regarding the class of soft-max gated MoE models with experts arising from arbitrary location-scale families of conditional density functions. Via an equivalence result (Lemma 1), the results of Theorems 1 and 2 also extend to the setting of Gaussian gated MoE models (Corollary 1), which can be seen as a generalization of the soft-max gated MoE models.
Although we explicitly make the assumption that \(\mathbb {X}=\left [0,1\right ]^{d}\), for the sake of mathematical argument (so that we can make direct use of Lemma 2), a simple shift-and-scale argument can be used to generalize our result to cases where \(\mathbb {X}\) is any generic compact domain. The compactness assumption regarding the input domain is common in the MoE and mixture of regression models literature, as per the works of Jiang and Tanner (1999b); Jiang and Tanner (1999a); Norets (2010); Montuelle and Le Pennec (2014); Pelenis (2014); Devijver (2015a); Devijver (2015b).
The assumption permits the application of the result to settings where the input X is assumed to be a non-random design vector that takes values in some compact set \(\mathbb {X}\). This is often the case when there is only a finite number of possible design vector elements that X can take. Otherwise, the assumption also permits the scenario where X is a random element with a compactly supported distribution, such as uniformly distributed, or beta distributed inputs. Unfortunately, the case of random X over an unbounded domain (e.g., if X has a multivariate Gaussian distribution) is not covered under our framework. An extension to such cases would require a more general version of Lemma 2, which we believe is a nontrivial direction for future work.
Like the input, we also assume that the output domain is restricted to a compact set \(\mathbb {Y}\). However, the output domain of the approximating class of MoE models is not restricted to \(\mathbb {Y}\) (i.e., we allow ψ to be a PDF over \(\mathbb {R}^{q}\)). The restriction placed on \(\mathbb {Y}\) is also common in the mixture approximation literature, as per the works of Zeevi and Meir (1997); Li and Barron (1999); Rakhlin et al. (2005), and is also often made in the context of nonparametric regression (see, e.g., Gyorfi et al. 2002; Cucker and Zhou 2007). Here, our use of the compactness of \(\mathbb {Y}\) is to bound the integral of vn, in (10). A more nuanced approach, such as via the use of generalized Lebesgue spaces (see e.g., Castillo and Rafeiro 2010; Cruze-Uribe and Fiorenza 2013), may lead to results for unbounded \(\mathbb {Y}\). This is another exciting future direction of our research program.
A trivial modification to the proof of Lemma 4 allows us to replace the assumption that f is a PDF with a sub-PDF assumption (i.e., \(\int _{\mathbb {Y}}f\mathrm {d}\lambda \le 1\)), instead. This in turn permits us to replace the assumption that f(·|x) is a conditional PDF in Theorems 1 and 2 with sub-PDF assumptions as well (i.e., for each \(\boldsymbol {x}\in \mathbb {X}, \int _{\mathbb {Y}}f\left (\boldsymbol {y}|\boldsymbol {x}\right)\mathrm {d}\lambda \left (\boldsymbol {y}\right)\le 1\)). Thus, in this modified form, we have a useful interpretation for situations when the output Y is unbounded. That is, when Y is unbounded, we can say that the conditional PDF f can be arbitrarily well approximated in \(\mathcal {L}_{p}\) norm by a sequence \(\left \{ m_{K}^{\psi }\right \}_{K\in \mathbb {N}}\) of either soft-max or Gaussian gated MoEs over any compact subdomain \(\mathbb {Y}\) of the unbounded domain of Y. Thus, although we cannot provide guarantees over the entire domain of Y, we are able to guarantee arbitrary approximate fidelity over any arbitrarily large compact subdomain. This is a useful result in practice, since one is often not interested in the entire domain of Y, but only in some subdomain where the probability of Y is concentrated. This version of the result resembles traditional denseness results in approximation theory, such as those of Cheney and Light (2000), Ch. 20.
Finally, our results can be directly applied to provide approximation guarantees for a large number of currently used models in applied statistics and machine learning research. Particularly, our approximation guarantees are applicable to the recent MoE models of Ingrassia et al. (2012); Chamroukhi et al. (2013); Ingrassia et al. (2014); Chamroukhi (2016); Nguyen and McLachlan (2016); Deleforge et al. (2015a); Deleforge et al. (2015b); Kalliovirta et al. (2016); Perthame et al. (2018), among many others. Here, we may guarantee that the underlying data generating processes, if satisfying our assumptions, can be adequately well approximated by sufficiently complex forms of the models considered in each of the aforementioned work.
The rate and manner at which a good approximation can be achieved, as a function of the number of experts K and the sample size, is a currently active research area, with pioneering work conducted in Cohen and Le Pennec (2012); Montuelle and Le Pennec (2014). More recent results in this direction appear in Nguyen et al. (2020b); Nguyen et al. (2021); Nguyen et al. (2021).
Availability of data and materials
Not applicable.
Code availability
Not applicable.
Abbreviations
- MoE: Mixture of experts
- PDF: Probability density function
References
Bartle, R.: The Elements of Integration and Lebesgue Measure. Wiley, New York (1995).
Castillo, R. E., Rafeiro, H.: An Introductory Course in Lebesgue Spaces. Springer, Switzerland (2010).
Chamroukhi, F.: Robust mixture of experts modeling using the t distribution. Neural Netw. 79, 20–36 (2016).
Chamroukhi, F., Mohammed, S., Trabelsi, D., Oukhellou, L., Amirat, Y.: Joint segmentation of multivariate time series with hidden process regression for human activity recognition. Neurocomputing. 120, 633–644 (2013).
Cheney, W., Light, W.: A Course in Approximation Theory. Brooks/Cole, Pacific Grove (2000).
Cohen, S., Le Pennec, E.: Conditional density estimation by penalized likelihood model selection and application. ArXiv (arXiv:1103.2021) (2012).
Cruze-Uribe, D. V., Fiorenza, A.: Variable Lebesgue Spaces: Foundations and Harmonic Analysis. Birkhauser, Basel (2013).
Cucker, F., Zhou, D. -X.: Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge (2007).
Deleforge, A., Forbes, F., Horaud, R.: High-dimensional regression with Gaussian mixtures and partially-latent response variables. Stat. Comput. 25, 893–911 (2015).
Deleforge, A., Forbes, F., Horaud, R.: Acoustic space learning for sound-source separation and localization on binaural manifolds. Int. J. Neural Syst. 25, 1440003 (2015).
Devijver, E.: An ℓ1-oracle inequality for the Lasso in multivariate finite mixture of multivariate Gaussian regression models. ESAIM: Probab. Stat. 19, 649–670 (2015).
Devijver, E.: Finite mixture regression: a sparse variable selection by model selection for clustering. Electron. J. Stat. 9, 2642–2674 (2015).
Folland, G. B.: Real Analysis: Modern Techniques and Their Applications. Wiley, New York (1999).
Gyorfi, L., Kohler, M., Krzyzak, A., Walk, H.: A Distribution-free Theory Of Nonparametric Regression. Springer, New York (2002).
Ingrassia, S., Minotti, S. C., Punzo, A.: Model-based clustering via linear cluster-weighted models. Comput. Stat. Data Anal. 71, 159–182 (2014).
Ingrassia, S., Minotti, S. C., Vittadini, G.: Local statistical modeling via a cluster-weighted approach with elliptical distributions. J. Classif. 29, 363–401 (2012).
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., Hinton, G. E.: Adaptive mixtures of local experts. Neural Comput. 3, 79–87 (1991).
Jiang, W., Tanner, M. A.: On the approximation rate of hierachical mixtures-of-experts for generalized linear models. Neural Comput. 11, 1183–1198 (1999).
Jiang, W., Tanner, M. A.: Hierachical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation. Ann. Stat. 27, 987–1011 (1999).
Jordan, M. I., Xu, L.: Convergence results for the EM approach to mixtures of experts architectures. Neural Netw. 8, 1409–1431 (1995).
Kalliovirta, L., Meitz, M., Saikkonen, P.: Gaussian mixture vector autoregression. J. Econ. 192, 485–498 (2016).
Krzyzak, A., Schafer, D.: Nonparametric regression estimation by normalized radial basis function networks. IEEE Trans. Inf. Theory. 51, 1003–1010 (2005).
Li, J. Q., Barron, A. R.: Mixture density estimation. In: Solla, S. A., Leen, T. K., Mueller, K. R. (eds.)Advances in Neural Information Processing Systems. MIT Press, Cambridge (1999).
Masoudnia, S., Ebrahimpour, R.: Mixture of experts: a literature survey. Artif. Intell. Rev. 42, 275–293 (2014).
Mendes, E. F., Jiang, W.: On convergence rates of mixture of polynomial experts. Neural Comput. 24, 3025–3051 (2012).
Montuelle, L., Le Pennec, E.: Mixture of Gaussian regressions model with logistic weights, a penalized maximum likelihood approach. Electron. J. Stat. 8, 1661–1695 (2014).
Nguyen, H. D., Chamroukhi, F.: Practical and theoretical aspects of mixture-of-experts modeling: an overview. WIREs Data Min. Knowl. Disc. 8(4), 1246 (2018).
Nguyen, H. D., Chamroukhi, F., Forbes, F.: Approximation results regarding the multiple-output Gaussian gated mixture of linear experts model. Neurocomputing. 366, 208–214 (2019).
Nguyen, T. T., Chamroukhi, F., Nguyen, H. D., Forbes, F.: Non-asymptotic model selection in block-diagonal mixture of polynomial experts models. arXiv preprint arXiv:2104.08959 (2021). http://arxiv.org/abs/2104.08959.
Nguyen, T. T., Chamroukhi, F., Nguyen, H. D., McLachlan, G. J.: Approximation of probability density functions via location-scale finite mixtures in Lebesgue spaces. arXiv:2008.09787 (2020).
Nguyen, H. D., Lloyd-Jones, L. R., McLachlan, G. J.: A universal approximation theorem for mixture-of-experts models. Neural Comput. 28, 2585–2593 (2016).
Nguyen, H. D., McLachlan, G. J.: Laplace mixture of linear experts. Comput. Stat. Data Anal. 93, 177–191 (2016).
Nguyen, T. T., Nguyen, H. D., Chamroukhi, F., Forbes, F.: A non-asymptotic penalization criterion for model selection in mixture of experts models. ArXiv (arXiv:2104.02640) (2021).
Nguyen, T. T., Nguyen, H. D., Chamroukhi, F., McLachlan, G. J.: Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Math. Stat. 7, 1750861 (2020).
Nguyen, T. T., Nguyen, H. D., Chamroukhi, F., McLachlan, G. J.: An l1-oracle inequality for the Lasso in mixture-of-experts regression models. ArXiv (arXiv:2009.10622) (2020).
Norets, A.: Approximation of conditional densities by smooth mixtures of regressions. Ann. Stat. 38, 1733–1766 (2010).
Norets, A., Pelenis, J.: Posterior consistency in conditional density estimation by covariate dependent mixtures. Econ. Theory. 30, 606–646 (2014).
Oden, J. T., Demkowicz, L. F.: Applied Functional Analysis. CRC Press, Boca Raton (2010).
Pelenis, J.: Bayesian regression with heteroscedastic error density and parametric mean function. J. Econ. 178, 624–638 (2014).
Perthame, E., Forbes, F., Deleforge, A.: Inverse regression approach to robust nonlinear high-to-low dimensional mapping. J. Multivar. Anal. 163, 1–14 (2018).
Rakhlin, A., Panchenko, D., Mukherjee, S.: Risk bounds for mixture density estimation. ESAIM: Probab. Stat. 9, 220–229 (2005).
Wang, L. -X., Mendel, J. M.: Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Trans. Neural Netw. 3, 807–814 (1992).
Yuksel, S. E., Wilson, J. N., Gader, P. D.: Twenty years of mixture of experts. IEEE Trans. Neural Netw. Learn. Syst. 23, 1177–1193 (2012).
Zeevi, A. J., Meir, R.: Density estimation through convex combinations of densities: approximation and estimation bounds. Neural Comput. 10, 99–109 (1997).
Zeevi, A. J., Meir, R., Maiorov, V.: Error bounds for functional approximation and estimation using mixtures of experts. IEEE Trans. Inf. Theory. 44, 1010–1025 (1998).
Acknowledgements
Hien Duy Nguyen and Geoffrey John McLachlan are funded by Australian Research Council grants: DP180101192 and IC170100035. TrungTin Nguyen is supported by a "Contrat doctoral" from the French Ministry of Higher Education and Research. Faicel Chamroukhi is funded by the French National Research Agency (ANR) grant SMILES ANR-18-CE40-0014. The authors also thank the Editor and Reviewer, whose careful and considerate comments led to improvements in the text.
Funding
HDN and GJM are funded by Australian Research Council grants: DP180101192 and IC170100035. FC is funded by ANR grant: SMILES ANR-18-CE40-0014.
Author information
Authors and Affiliations
Contributions
All authors contributed equally to the exposition and to the mathematical derivations. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
None.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Nguyen, H.D., Nguyen, T., Chamroukhi, F. et al. Approximations of conditional probability density functions in Lebesgue spaces via mixture of experts models. J Stat Distrib App 8, 13 (2021). https://doi.org/10.1186/s40488-021-00125-0