Approximations of conditional probability density functions in Lebesgue spaces via mixture of experts models
Journal of Statistical Distributions and Applications volume 8, Article number: 13 (2021)
Abstract
Mixture of experts (MoE) models are widely applied to conditional probability density estimation problems. We demonstrate the richness of the class of MoE models by proving denseness results in Lebesgue spaces, when the input and output variables are both compactly supported. We further prove an almost uniform convergence result when the input is univariate. Auxiliary lemmas are proved regarding the richness of the soft-max gating function class, and its relationship to the class of Gaussian gating functions.
Introduction
Mixture of experts (MoE) models are a widely applicable class of conditional probability density approximations that have been used as solution methods across the spectrum of statistical and machine learning problems (Yuksel et al. 2012; Masoudnia and Ebrahimpour 2014; Nguyen and Chamroukhi 2018).
Let \(\mathbb {Z}=\mathbb {X}\times \mathbb {Y}\), where \(\mathbb {X}\subseteq \mathbb {R}^{d}\) and \(\mathbb {Y}\subseteq \mathbb {R}^{q}\), for \(d,q\in \mathbb {N}\). Suppose that the input and output random variables, \(\boldsymbol {X}\in \mathbb {X}\) and \(\boldsymbol {Y}\in \mathbb {Y}\), are related via the conditional probability density function (PDF) f(y|x) in the functional class:
where λ denotes the Lebesgue measure. The MoE approach seeks to approximate the unknown target conditional PDF f by a function of the MoE form:
where \(\mathbf {Gate}=\left (\text {Gate}_{k}\right)_{k\in [K]}\in \mathcal {G}^{K}\) (\([K]=\left \{ 1,\dots,K\right \} \)), \(\text {Expert}_{1},\dots,\text {Expert}_{K}\in \mathcal {E}\), and \(K\in \mathbb {N}\). Here, we say that m is a K-component MoE model with gates arising from the class \(\mathcal {G}^{K}\) and experts arising from \(\mathcal {E}\), where \(\mathcal {E}\) is a class of PDFs with support \(\mathbb {Y}\).
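To make the K-component MoE form concrete, the following sketch evaluates \(m\left (\boldsymbol {y}|\boldsymbol {x}\right)=\sum _{k=1}^{K}\text {Gate}_{k}\left (\boldsymbol {x}\right)\text {Expert}_{k}\left (\boldsymbol {y}\right)\) numerically; the parameter values are illustrative, with soft-max gates and univariate Gaussian experts as one admissible choice of gate and expert classes:

```python
import numpy as np

def softmax_gates(x, a, b):
    """Soft-max gating vector: Gate_k(x) proportional to exp(a_k + b_k . x)."""
    logits = a + b @ x                  # shape (K,)
    logits -= logits.max()              # numerical stabilisation
    w = np.exp(logits)
    return w / w.sum()

def moe_density(y, x, a, b, means, scales):
    """K-component MoE m(y|x) = sum_k Gate_k(x) * Expert_k(y),
    with univariate Gaussian experts as a concrete expert class."""
    gates = softmax_gates(x, a, b)
    experts = np.exp(-0.5 * ((y - means) / scales) ** 2) / (scales * np.sqrt(2 * np.pi))
    return float(gates @ experts)

rng = np.random.default_rng(0)
K, d = 3, 2
a, b = rng.normal(size=K), rng.normal(size=(K, d))
means, scales = np.array([-1.0, 0.0, 1.0]), np.array([0.5, 0.3, 0.7])
x = np.array([0.2, 0.8])

# the gates form a probability vector, so m(.|x) is itself a PDF in y;
# check mass by a Riemann sum over a fine grid
g = softmax_gates(x, a, b)
ys = np.linspace(-8, 8, 4001)
dy = ys[1] - ys[0]
mass = dy * sum(moe_density(y, x, a, b, means, scales) for y in ys)
```

Since each expert integrates to one and the gates sum to one at every x, the mixture integrates to one for every fixed x, which is the defining property of a conditional PDF approximation.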
The most popular choices for \(\mathcal {G}^{K}\) are the parametric soft-max and Gaussian gating classes:
and
respectively, where
and
Here,
is the multivariate normal density function with mean vector ν and covariance matrix \(\boldsymbol {\Sigma }, \boldsymbol {\pi }^{\top }=\left (\pi _{1},\dots,\pi _{K}\right)\) is a vector of weights in the simplex:
and \(\mathbb {S}_{d}\) is the class of d×d symmetric positive definite matrices. The soft-max and Gaussian gating classes were first introduced by Jacobs et al. (1991); Jordan and Xu (1995), respectively. Typically, one chooses experts that arise from some location-scale class:
where ψ is a PDF on \(\mathbb {R}^{q}\), in the sense that \(\psi :\mathbb {R}^{q}\rightarrow \left [0,\infty \right)\) and \(\int _{\mathbb {R}^{q}}\psi \left (\boldsymbol {y}\right)\mathrm {d}\lambda \left (\boldsymbol {y}\right)=1\).
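The location-scale construction guarantees that each expert is itself a PDF on \(\mathbb {R}^{q}\), whatever the base density ψ, since the substitution \(\boldsymbol {t}=\left (\boldsymbol {y}-\boldsymbol {\mu }\right)/\sigma \) preserves unit mass. A quick numerical check with q=1 and a logistic ψ (an illustrative, non-Gaussian choice):

```python
import numpy as np

def location_scale_pdf(y, mu, sigma, psi):
    """Expert(y; mu, sigma) = psi((y - mu)/sigma) / sigma**q, with q = 1 here."""
    return psi((y - mu) / sigma) / sigma

# any base PDF psi works; take the standard logistic density as an example
psi = lambda t: np.exp(-t) / (1.0 + np.exp(-t)) ** 2

# Riemann-sum check that the shifted and rescaled density still has unit mass
ys = np.linspace(-60.0, 60.0, 20001)
dy = ys[1] - ys[0]
mass = dy * np.sum(location_scale_pdf(ys, mu=2.0, sigma=3.0, psi=psi))
```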
We shall say that \(f\in \mathcal {L}_{p}\left (\mathbb {Z}\right)\) for any p∈[1,∞) if
where \(\mathbf {1}_{\mathbb {Z}}\) is the indicator function that takes value 1 when \(\boldsymbol {z}\in \mathbb {Z}\), and 0 otherwise. Further, we say that \(f\in \mathcal {L}_{\infty }\left (\mathbb {Z}\right)\) if
We shall refer to \(\left \Vert \cdot \right \Vert _{p,\mathbb {Z}}\) as the \(\mathcal {L}_{p}\) norm on \(\mathbb {Z}\), for p∈[1,∞], and where the context is obvious, we shall drop the reference to \(\mathbb {Z}\).
Suppose that the target conditional PDF f is in the class \(\mathcal {F}_{p}=\mathcal {F}\cap \mathcal {L}_{p}\). We address the problem of approximating f, with respect to the \(\mathcal {L}_{p}\) norm, using MoE models in the soft-max and Gaussian gated classes,
and
by showing that both \(\mathcal {M}_{S}^{\psi }\) and \(\mathcal {M}_{G}^{\psi }\) are dense in the class \(\mathcal {F}_{p}\), when \(\mathbb {X}=\left [0,1\right ]^{d}\) and \(\mathbb {Y}\) is a compact subset of \(\mathbb {R}^{q}\). Our denseness results are enabled by the indicator function approximation result of Jiang and Tanner (1999a), and the finite mixture model denseness theorems of Nguyen et al. (2020); Nguyen et al. (2020a).
Our theorems contribute to a sustained line of interest in the approximation capabilities of MoE models. Related to our results are contributions regarding the approximation capabilities of the conditional expectation function of the classes \(\mathcal {M}_{S}^{\psi }\) and \(\mathcal {M}_{G}^{\psi }\) (Wang and Mendel 1992; Zeevi et al. 1998; Jiang and Tanner 1999a; Krzyzak and Schafer 2005; Mendes and Jiang 2012; Nguyen et al. 2016; Nguyen et al. 2019) and the approximation capabilities of subclasses of \(\mathcal {M}_{S}^{\psi }\) and \(\mathcal {M}_{G}^{\psi }\), with respect to the Kullback–Leibler divergence (Jiang and Tanner 1999b; Norets 2010; Norets and Pelenis 2014). Our results can be seen as complements to the Kullback–Leibler approximation theorems of Norets (2010); Norets and Pelenis (2014), by the relationship between the Kullback–Leibler divergence and the \(\mathcal {L}_{2}\) norm (Zeevi and Meir 1997). That is, when f>1/κ, for all \(\left (\boldsymbol {x},\boldsymbol {y}\right)\in \mathbb {Z}\) and some constant κ>0, we have that the integrated conditional Kullback–Leibler divergence considered by Norets and Pelenis (2014):
satisfies
and thus a good approximation in the integrated Kullback–Leibler divergence is guaranteed if one can find a good approximation in the \(\mathcal {L}_{2}\) norm, which is guaranteed by our main result.
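A small numerical illustration of this chain of inequalities follows; the particular f, m, and κ are illustrative choices satisfying the lower-bound condition. With \(f\left (y|x\right)=1+0.5\sin \left (2\pi \left (x+y\right)\right)\) on [0,1]×[0,1], m the uniform conditional density, and κ=2 (so that both densities exceed 1/κ), the integrated KL divergence is dominated by κ times the squared \(\mathcal {L}_{2}\) distance:

```python
import numpy as np

# uniform grid on [0,1]^2 (endpoint excluded so means are exact Riemann sums)
xs = np.linspace(0.0, 1.0, 200, endpoint=False)
ys = np.linspace(0.0, 1.0, 200, endpoint=False)
X, Y = np.meshgrid(xs, ys, indexing="ij")

f = 1.0 + 0.5 * np.sin(2.0 * np.pi * (X + Y))   # target conditional PDF, f >= 1/2
m = np.ones_like(f)                             # crude approximation: uniform density
kappa = 2.0                                     # 1/kappa lower-bounds both f and m

# on [0,1]^2, the double integral is just the grid mean
kl = (f * np.log(f / m)).mean()                 # integrated conditional KL divergence
l2sq = ((f - m) ** 2).mean()                    # squared L2 distance
```

Here kl is roughly 0.06 while kappa * l2sq is 0.25, consistent with the stated domination of the KL divergence by the squared \(\mathcal {L}_{2}\) norm.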
The remainder of the manuscript proceeds as follows. The main result is presented in Section 2. Technical lemmas are provided in Section 3. The proofs of our results are then presented in Section 4. Proofs of required lemmas that do not appear elsewhere are provided in Section 5. A summary of our work and some conclusions are drawn in Section 6.
Main results
Denote the class of bounded functions on \(\mathbb {Z}\) by
and write its norm as \(\left \Vert f\right \Vert _{\mathcal {B}\left (\mathbb {Z}\right)}=\sup _{\boldsymbol {z}\in \mathbb {Z}}\left |f\left (\boldsymbol {z}\right)\right |\). Further, let \(\mathcal {C}\) denote the class of continuous functions. Note that if \(\mathbb {Z}\) is compact and \(f\in \mathcal {C}\), then \(f\in \mathcal {B}\).
Theorem 1
Assume that \(\mathbb {X}=\left [0,1\right ]^{d}\) for \(d\in \mathbb {N}\). There exists a sequence \(\left \{ m_{K}^{\psi }\right \}_{K\in \mathbb {N}}\subset \mathcal {M}_{S}^{\psi }\), such that if \(\mathbb {Y}\subset \mathbb {R}^{q}\) is compact, \(f\in \mathcal {F}\cap \mathcal {C}\), and \(\psi \in \mathcal {C}\left (\mathbb {R}^{q}\right)\) is a PDF on support \(\mathbb {R}^{q}\), then \({\lim }_{K\rightarrow \infty }\left \Vert f-m_{K}^{\psi }\right \Vert _{p}=0\), for p∈[1,∞).
Since convergence in Lebesgue spaces does not imply point-wise modes of convergence, the following result is also useful and interesting in some restricted scenarios. Here, we note that the mode of convergence is almost uniform, which implies almost everywhere convergence and convergence in measure (cf. Bartle 1995, Lem. 7.10 and Thm. 7.11). The almost uniform convergence of \(\left \{ m_{K}^{\psi }\right \}_{K\in \mathbb {N}}\) to f in the following result is to be understood in the sense of Bartle (1995), Def. 7.9. That is, for every δ>0, there exists a set \(\mathbb {E}_{\delta }\subset \mathbb {Z}\) with \(\lambda \left (\mathbb {E}_{\delta }\right)<\delta \), such that \(\left \{ m_{K}^{\psi }\right \}_{K\in \mathbb {N}}\) converges to f, uniformly on \(\mathbb {Z}\backslash \mathbb {E}_{\delta }\).
Theorem 2
Assume that \(\mathbb {X}=\left [0,1\right ]\). There exists a sequence \(\left \{ m_{K}^{\psi }\right \}_{K\in \mathbb {N}}\subset \mathcal {M}_{S}^{\psi }\), such that if \(\mathbb {Y}\subset \mathbb {R}^{q}\) is compact, \(f\in \mathcal {F}\cap \mathcal {C}\), and \(\psi \in \mathcal {C}\left (\mathbb {R}^{q}\right)\) is a PDF on support \(\mathbb {R}^{q}\), then \({\lim }_{K\rightarrow \infty }m_{K}^{\psi }=f\), almost uniformly.
The following result establishes the connection between the gating classes \(\mathcal {G}_{S}^{K}\) and \(\mathcal {G}_{G}^{K}\).
Lemma 1
For each \(K\in \mathbb {N}, \mathcal {G}_{S}^{K}\subset \mathcal {G}_{G}^{K}\). Further, if we define the class of Gaussian gating vectors with equal covariance matrices:
where
then \(\mathcal {G}_{E}^{K}\subset \mathcal {G}_{S}^{K}\).
We can directly apply Lemma 1 to establish the following corollary to Theorems 1 and 2, regarding the approximation capability of the class \(\mathcal {M}_{G}^{\psi }\).
Corollary 1
Theorems 1 and 2 hold when \(\mathcal {M}_{S}^{\psi }\) is replaced by \(\mathcal {M}_{G}^{\psi }\) in their statements.
Technical lemmas
Let \(\mathbb {K}^{n}=\left \{ \left (k_{1},\dots,k_{d}\right)\in [n]^{d}\right \} \) and \(\kappa :\mathbb {K}^{n}\rightarrow \left [n^{d}\right ]\) be a bijection for each \(n\in \mathbb {N}\). For each \(\left (k_{1},\dots,k_{d}\right)\in \mathbb {K}^{n}\) and \(k\in \left [n^{d}\right ]\), we define \(\mathbb {X}_{k}^{n}=\mathbb {X}_{\kappa \left (k_{1},\dots,k_{d}\right)}^{n}=\prod _{i=1}^{d}\mathbb {I}_{k_{i}}^{n}\), where \(\mathbb {I}_{k_{i}}^{n}=\left [\left (k_{i}-1\right)/n,k_{i}/n\right)\) for ki∈[n−1], and \(\mathbb {I}_{n}^{n}=\left [\left (n-1\right)/n,1\right ]\).
We call \(\left \{ \mathbb {X}_{k}^{n}\right \}_{k\in \left [n^{d}\right ]}\) a fine partition of \(\mathbb {X}\), in the sense that \(\mathbb {X}=\left [0,1\right ]^{d}=\bigcup _{k=1}^{n^{d}}\mathbb {X}_{k}^{n}\), for each n, and that \(\lambda \left (\mathbb {X}_{k}^{n}\right)=n^{-d}\) gets smaller, as n increases. The following result from Jiang and Tanner (1999a) establishes the approximation capability of soft-max gates.
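The fine partition and the bijection κ can be made concrete as follows; the lexicographic enumeration of the multi-indices is our illustrative choice of κ:

```python
import itertools
import numpy as np

def fine_partition(n, d):
    """Enumerate the n**d cells X_k^n of [0,1]^d as lists of per-axis intervals;
    kappa(k_1,...,k_d) is the position in the lexicographic enumeration."""
    cells = []
    for ks in itertools.product(range(1, n + 1), repeat=d):
        # axis-i factor is [(k_i-1)/n, k_i/n), closed on the right when k_i = n
        cells.append([((k - 1) / n, k / n) for k in ks])
    return cells

def cell_index(x, n):
    """Locate the unique multi-index (k_1,...,k_d) whose cell contains x."""
    return tuple(min(int(np.floor(xi * n)) + 1, n) for xi in x)

n, d = 4, 2
cells = fine_partition(n, d)
vol = (1.0 / n) ** d          # every cell has Lebesgue measure n**(-d)
```

Each point of \(\left [0,1\right ]^{d}\) falls in exactly one cell, and every cell has measure n^{-d}, which shrinks as the partition is refined.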
Lemma 2
(Jiang and Tanner, 1999, p. 1189) For each \(n\in \mathbb {N}, p\in \left [1,\infty \right)\) and ε>0, there exist gating functions
for some \(\boldsymbol {\gamma }\in \mathbb {G}_{S}^{n^{d}}\), such that
When d=1, we also have the following almost uniform convergence alternative to Lemma 2.
Lemma 3
Let \(\mathbb {X}=\left [0,1\right ]\). Then, for each \(n\in \mathbb {N}\), there exists a sequence of gating functions:
defined by \(\left \{ \boldsymbol {\gamma }_{l}\right \}_{l\in \mathbb {N}}\subset \mathbb {G}_{S}^{n}\), such that
almost uniformly, simultaneously for all k∈[n].
For PDF ψ on support \(\mathbb {R}^{q}\), define the class of finite mixture models by
We require the following result, from Nguyen et al. (2020), regarding the approximation capabilities of \(\mathcal {H}^{\psi }\).
Lemma 4
(Nguyen et al., 2020a, Thm. 2(b)) If \(f\in \mathcal {C}\left (\mathbb {Y}\right)\) is a PDF on \(\mathbb {Y}, \psi \in \mathcal {C}\left (\mathbb {R}^{q}\right)\) is a PDF on \(\mathbb {R}^{q}\), and \(\mathbb {Y}\subset \mathbb {R}^{q}\) is compact, then there exists a sequence \(\left \{ h_{K}^{\psi }\right \}_{K\in \mathbb {N}}\subset \mathcal {H}^{\psi }\), such that \({\lim }_{K\rightarrow \infty }\left \Vert f-h_{K}^{\psi }\right \Vert _{\mathcal {B}\left (\mathbb {Y}\right)}=0\).
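The flavour of Lemma 4 can be illustrated numerically: a continuous PDF on a compact set, here the Beta(2,2) density on [0,1], is approximated in sup norm by a finite mixture of rescaled copies of ψ. The grid-based construction below (component locations on a grid, weights proportional to f, bandwidth shrinking with K) is a simple illustrative scheme, not the construction used in the proof of the cited theorem:

```python
import numpy as np

f = lambda y: 6.0 * y * (1.0 - y)            # Beta(2,2): a continuous PDF on [0,1]
phi = lambda t: np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)

def mixture_approx(K, y):
    """h_K(y) = sum_k c_k * phi((y - mu_k)/sigma) / sigma, with K components on
    a grid, weights proportional to f(mu_k), and bandwidth sigma = 2/K."""
    mu = (np.arange(K) + 0.5) / K
    sigma = 2.0 / K
    c = f(mu)
    c = c / c.sum()                          # normalise so h_K is a PDF
    return (c[:, None] * phi((y[None, :] - mu[:, None]) / sigma) / sigma).sum(axis=0)

# sup-norm error on an interior grid (away from boundary smoothing bias)
y = np.linspace(0.15, 0.85, 400)
errs = [np.max(np.abs(f(y) - mixture_approx(K, y))) for K in (10, 80)]
```

The sup-norm error shrinks as K grows, in line with the uniform convergence guaranteed by the lemma.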
Proofs of main results
4.1 Proof of Theorem 1
To prove the result, it suffices to show that for each ε>0, there exists an \(m_{K}^{\psi }\in \mathcal {M}_{S}^{\psi }\), such that
The main steps of the proof are as follows. We firstly approximate f(y|x) by
where \(\boldsymbol {x}_{k}^{n}\in \mathbb {X}_{k}^{n}\), for each \(k\in \left [n^{d}\right ]\), such that
for all n≥N1(ε), for some sufficiently large \(N_{1}(\epsilon)\in \mathbb {N}\). Then we approximate υn(y|x) by
where \(\boldsymbol {\gamma }_{n}\in \mathbb {G}_{S}^{n^{d}}\) and \(\mathbf {Gate}=\left (\text {Gate}_{k}\left (\cdot ;\boldsymbol {\gamma }_{n}\right)\right)_{k\in \left [n^{d}\right ]}\in \mathcal {G}_{S}^{n^{d}}\), so that
using Lemma 2.
Finally, we approximate ηn(y|x) by \(m_{K_{n}}^{\psi }\left (\boldsymbol {y}|\boldsymbol {x}\right)\), where
and
for \(n_{k}\in \mathbb {N}\) (\(k\in \left [n^{d}\right ]\)), such that \(K_{n}=\sum _{k=1}^{n^{d}}n_{k}\). Here, we establish that there exists \(N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\in \mathbb {N}\), so that when \(n_{k}\ge N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\),
Results (2)–(7) then imply that for each ε>0, there exist \(N_{1}(\epsilon)\), \(\boldsymbol {\gamma }_{n}\), and \(N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\), such that for all \(K_{n}=\sum _{k=1}^{n^{d}}n_{k}\), where \(n_{k}\ge N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\) (for each \(k\in \left [n^{d}\right ]\)) and n≥N1(ε), the following inequality holds, via an application of the triangle inequality:
We now focus our attention on proving each of the results (2)–(7). To prove (2), we note that since f is uniformly continuous (because \(\mathbb {Z}=\mathbb {X}\times \mathbb {Y}\) is compact, and \(f\in \mathcal {C}\)), there exists a function (1) such that for all ε>0,
We can construct such an approximation by considering the fact that as n increases, the diameter \(\delta _{n}=\sup _{k\in \left [n^{d}\right ]}\text {diam}\left (\mathbb {X}_{k}^{n}\right)\) of the fine partition goes to zero. By the uniform continuity of f, for every ε>0, there exists a δ(ε)>0, such that if ∥(x1,y1)−(x2,y2)∥<δ(ε), then |f(y1|x1)−f(y2|x2)|<ε, for all pairs \(\left (\boldsymbol {x}_{1},\boldsymbol {y}_{1}\right),\left (\boldsymbol {x}_{2},\boldsymbol {y}_{2}\right)\in \mathbb {Z}\). Here, ∥·∥ denotes the Euclidean norm. Furthermore, for any \(\left (\boldsymbol {x},\boldsymbol {y}\right)\in \mathbb {Z}\), we have
by the triangle inequality.
Since \(\boldsymbol {x}_{k}^{n}\in \mathbb {X}_{k}^{n}\), for each k and n, we have the fact that \(\left \Vert \left (\boldsymbol {x},\boldsymbol {y}\right)-\left (\boldsymbol {x}_{k}^{n},\boldsymbol {y}\right)\right \Vert <\delta _{n}\) for \(\left (\boldsymbol {x},\boldsymbol {y}\right)\in \mathbb {X}_{k}^{n}\times \mathbb {Y}\). By uniform continuity, for each ε, we can find a sufficiently small δ(ε), such that \(\left |f\left (\boldsymbol {y}|\boldsymbol {x}\right)-f\left (\boldsymbol {y}|\boldsymbol {x}_{k}^{n}\right)\right |<\varepsilon \), if \(\left \Vert \left (\boldsymbol {x},\boldsymbol {y}\right)-\left (\boldsymbol {x}_{k}^{n},\boldsymbol {y}\right)\right \Vert <\delta (\epsilon)\), for all k. The desired result (8) can be obtained by noting that the right hand side of (9) consists of only one non-zero summand for any \(\left (\boldsymbol {x},\boldsymbol {y}\right)\in \mathbb {Z}\), and by choosing \(n\in \mathbb {N}\) sufficiently large, so that δn<δ(ε).
By (8), we have the fact that υn→f, point-wise. We can bound υn as follows:
where the right-hand side is a constant and is therefore in \(\mathcal {L}_{p}\), since \(\mathbb {Z}\) is compact. An application of the Lebesgue dominated convergence theorem in \(\mathcal {L}_{p}\) then yields (2).
Next we write
Since the norm arguments are separable in x and y, we apply Fubini’s theorem to get
Because \(f\in \mathcal {B}\) and \(n^{d}\) is finite, for any fixed \(n\in \mathbb {N}\), we have \(C_{1}(n)=\sum _{k=1}^{n^{d}}\left \Vert f\left (\boldsymbol {y}|\boldsymbol {x}_{k}^{n}\right)\right \Vert _{p,\mathbb {Y}}<\infty \). For each ε>0, we need to choose a \(\boldsymbol {\gamma }_{n}\in \mathbb {G}_{S}^{n^{d}}\), such that
which can be achieved via a direct application of Lemma 2. We have thus shown (4).
Lastly, we are required to approximate \(f\left (\boldsymbol {y}|\boldsymbol {x}_{k}^{n}\right)\) for each \(k\in \left [n^{d}\right ]\), by a function of form (6). Since \(\mathbb {Y}\) is compact and f and ψ are continuous, we can apply Lemma 4 directly. Note that over a set of finite measure, convergence in \(\left \Vert \cdot \right \Vert _{\mathcal {B}}\) implies convergence in the \(\mathcal {L}_{p}\) norm, for all p∈[1,∞] (cf. Oden and Demkowicz 2010, Prop. 3.9.3).
We can then write (5) as
where \(\boldsymbol {\gamma }_{n}=\left (a_{n,1},\dots,a_{n,n^{d}},\boldsymbol {b}_{n,1},\dots,\boldsymbol {b}_{n,n^{d}}\right)\). From (11), we observe that \(m_{K_{n}}^{\psi }\in \mathcal {M}_{S}^{\psi }\), with \(K_{n}=\sum _{k=1}^{n^{d}}n_{k}\).
To obtain (7), we write
By separability and Fubini’s theorem, we then have
Let \(C_{2}\left (n,\boldsymbol {\gamma }_{n}\right)=\sup _{k\in \left [n^{d}\right ]}\left \Vert \text {Gate}_{k}\left (\boldsymbol {x};\boldsymbol {\gamma }_{n}\right)\right \Vert _{p,\mathbb {X}}\). Then, we apply Lemma 4 \(n^{d}\) times to establish the existence of a constant \(N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\in \mathbb {N}\), such that for all \(k\in \left [n^{d}\right ]\) and \(n_{k}\ge N_{2}\left (\epsilon,n,\boldsymbol {\gamma }_{n}\right)\),
Thus, we have
which completes our proof.
4.2 Proof of Theorem 2
The proof is procedurally similar to that of Theorem 1, and thus we only highlight the important differences. Firstly, for any ε>0, we approximate f(y|x) by υn(y|x) of form (1), with d=1. Result (2) implies uniform convergence, in the sense that there exists an \(N_{1}(\epsilon)\in \mathbb {N}\), such that for all n≥N1(ε),
We now seek to approximate υn by ηn of form (3), with γn=γl for some \(l\in \mathbb {N}\). Upon application of Lemma 3, it follows that for each \(k\in \left [n^{d}\right ]\) and ε>0, there exists a measurable set \(\mathbb {B}_{k}(\varepsilon)\subseteq \mathbb {X}\), such that
and
for all l≥Mk(ε,n), for some \(M_{k}\left (\epsilon,n\right)\in \mathbb {N}\). Here, \((\cdot)^{\complement }\) is the set complement operator.
Since \(f\in \mathcal {B}\), we have the bound \(C(n)=\sum _{k=1}^{n^{d}}\left \Vert f\left (\boldsymbol {y}|\boldsymbol {x}_{k}^{n}\right)\right \Vert _{\mathcal {B}\left (\mathbb {Y}\right)}<\infty \). Write \(\mathbb {B}(\varepsilon)=\bigcup _{k=1}^{n^{d}}\mathbb {B}_{k}(\varepsilon)\). Then, \(\mathbb {B}^{\complement }(\varepsilon)=\bigcap _{k=1}^{n^{d}}\mathbb {B}_{k}^{\complement }(\varepsilon)\),
and
for all \(l\ge M\left (\epsilon,n\right)=\max _{k\in \left [n^{d}\right ]}M_{k}\left (\epsilon,n\right)\). Here we use the fact that the supremum over an intersection of sets is less than or equal to the minimum of the suprema over the individual sets.
Upon defining \(\mathbb {C}(\varepsilon)=\mathbb {B}(\varepsilon)\times \mathbb {Y}\subset \mathbb {Z}\), we observe that
and \(\mathbb {C}(\varepsilon)\subset \mathbb {B}(\varepsilon)\times \mathbb {Y}\). Note also that
and
It follows that
Since \(\mathbb {B}(\varepsilon)\times \mathbb {Y}^{\complement }\) and \(\mathbb {B}^{\complement }(\varepsilon)\times \mathbb {Y}^{\complement }\) are empty, via separability, we have
Recall that the \(\sum _{k=1}^{n^{d}}\left \Vert f\left (\boldsymbol {y}|\boldsymbol {x}_{k}^{n}\right)\right \Vert _{\mathcal {B\left (\mathbb {Y}\right)}}=C(n)<\infty \) and that we can choose l≥M(ε,n) so that
and thus
as required.
Finally, by noting that for each \(k\in \left [n^{d}\right ]\), both (6) and \(f\left (\cdot |\boldsymbol {x}_{k}^{n}\right)\) are continuous over \(\mathbb {Y}\), we apply Lemma 4 to obtain an \(N_{2}\left (\epsilon,n,l\right)\in \mathbb {N}\), such that for any ε>0 and nk≥N2(ε,n,l), we have
Here \(M_{1}=\sup _{k\in \left [n^{d}\right ]}\left \Vert \text {Gate}_{k}\left (\cdot ;\boldsymbol {\gamma }_{l}\right)\right \Vert _{\mathcal {B\left (\mathbb {X}\right)}}<\infty \), since Gatek(x;γl) is continuous in x, and \(\mathbb {X}\) is compact. Therefore, for all \(K_{n}=\sum _{k=1}^{n^{d}}n_{k}, n_{k}\ge N_{2}\left (\epsilon,n,l\right)\),
In summary, via (12), (13), and (14), for each ε>0, there exists a \(\mathbb {C}(\varepsilon)\subset \mathbb {Z}\) and constants \(N_{1}(\epsilon),M\left (\epsilon,n\right),N_{2}\left (\epsilon,n,l\right)\in \mathbb {N}\), such that for all \(K_{n}=\sum _{k=1}^{n^{d}}n_{k}\), with nk≥N2(ε,n,l),l≥M(ε,n), and n≥N1(ε), it follows that \(\lambda \left (\mathbb {C}(\varepsilon)\right)<\varepsilon \), and
This completes the proof.
Proofs of lemmas
5.1 Proof of Lemma 1
We firstly prove that any gating vector from \(\mathcal {G}_{S}^{K}\) can be equivalently represented as an element of \(\mathcal {G}_{G}^{K}\). For any \(\boldsymbol {x}\in \mathbb {R}^{d}, d\in \mathbb {N}, k\in [K], a_{k}\in \mathbb {R}, \boldsymbol {b}_{k}\in \mathbb {R}^{d}\), and \(K\in \mathbb {N}\), choose \(\boldsymbol {\nu }_{k}=\boldsymbol {b}_{k}, \tau _{k}=a_{k}+\boldsymbol {b}_{k}^{\top }\boldsymbol {b}_{k}/2\) and
This implies that \(\sum _{l=1}^{K}\pi _{l}=1, \pi _{l}>0\), for all l∈[K], and
where I is the identity matrix of appropriate size. This proves that \(\mathcal {G}_{S}^{K}\subset \mathcal {G}_{G}^{K}\).
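This direction of the proof can be checked numerically: with \(\boldsymbol {\nu }_{k}=\boldsymbol {b}_{k}\), identity covariance matrices, and mixing weights \(\pi _{k}\propto \exp \left (a_{k}+\boldsymbol {b}_{k}^{\top }\boldsymbol {b}_{k}/2\right)\), the Gaussian gating vector coincides with the soft-max gating vector, since the common factor \(\exp \left (-\left \Vert \boldsymbol {x}\right \Vert ^{2}/2\right)\) cancels in the normalisation. A verification sketch with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 4, 3
a = rng.normal(size=K)                  # soft-max parameters a_k
b = rng.normal(size=(K, d))             # soft-max parameters b_k

def softmax_gate(x):
    s = a + b @ x
    w = np.exp(s - s.max())
    return w / w.sum()

def gaussian_gate(x):
    """Gaussian gates with nu_k = b_k, Sigma_k = I, and
    pi_k proportional to exp(a_k + b_k . b_k / 2), as in the proof."""
    pi = np.exp(a + 0.5 * np.sum(b * b, axis=1))
    pi = pi / pi.sum()
    dens = np.exp(-0.5 * np.sum((x - b) ** 2, axis=1)) / (2.0 * np.pi) ** (d / 2)
    w = pi * dens
    return w / w.sum()

x = rng.normal(size=d)                  # the two gating vectors agree at any x
```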
Next, to show that \(\mathcal {G}_{E}^{K}\subset \mathcal {G}_{S}^{K}\), we write
and note that
Thus, we have
Next, notice that we can write
where \(\alpha _{l}=a_{l}-a_{k}\) and \(\boldsymbol {\beta }_{l}=\boldsymbol {b}_{l}-\boldsymbol {b}_{k}\). We now choose ak and bk, such that for every l∈[K],
and
To complete the proof, we choose
and \(\boldsymbol {b}_{k}=\boldsymbol {\Sigma }^{-1}\boldsymbol {\nu }_{k}\) (so that \(\boldsymbol {b}_{k}^{\top }\boldsymbol {x}=\boldsymbol {\nu }_{k}^{\top }\boldsymbol {\Sigma }^{-1}\boldsymbol {x}\), by the symmetry of \(\boldsymbol {\Sigma }\)), for each k∈[K].
5.2 Proof of Lemma 3
For l∈[0,∞), write
where \(x\in \mathbb {X}=\left [0,1\right ]\), and ck=(k−1)/(2k). We identify that Gate=(Gatek(x,l))k∈[n] belongs to the class \(\mathcal {G}_{S}^{n}\). The proof of the Section 4 Proposition from Jiang and Tanner (1999a) reveals that for all k∈[n],
almost everywhere in λ, as l→∞. The result then follows via an application of Egorov’s theorem (cf. Folland 1999, Thm. 2.33).
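The mechanism behind Lemma 3 can be visualised numerically: soft-max gates whose linear scores are scaled by l concentrate on the cells of the fine partition as l→∞. The particular scores below, \(l\left (kx-k\left (k-1\right)/\left (2n\right)\right)\), are our illustrative choice of parameters in \(\mathbb {G}_{S}^{n}\), arranged so that the k-th score dominates exactly on the k-th subinterval; they are not necessarily the constants used in the lemma's proof:

```python
import numpy as np

def gates(x, n, l):
    """Soft-max gates with linear scores l*(k*x - k*(k-1)/(2n)), k = 1..n.
    Adjacent scores cross at x = k/n, so score k dominates exactly on
    [(k-1)/n, k/n), and the gates approach indicator functions as l grows."""
    k = np.arange(1, n + 1)
    s = l * (k * x - k * (k - 1) / (2.0 * n))
    w = np.exp(s - s.max())
    return w / w.sum()

n = 4
x = 0.6                       # lies in the 3rd cell [0.5, 0.75)
soft = gates(x, n, l=5.0)     # diffuse gating vector
hard = gates(x, n, l=500.0)   # nearly the indicator of the 3rd cell
```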
Summary and conclusions
Using the recent mixture model approximation results of Nguyen et al. (2020) and Nguyen et al. (2020a), and the indicator approximation theorem of Jiang and Tanner (1999a) (cf. Section 3), we have proved two approximation theorems (Theorems 1 and 2) regarding the class of soft-max gated MoE models with experts arising from arbitrary location-scale families of conditional density functions. Via an equivalence result (Lemma 1), the results of Theorems 1 and 2 also extend to the setting of Gaussian gated MoE models (Corollary 1), which can be seen as a generalization of the soft-max gated MoE models.
Although we explicitly make the assumption that \(\mathbb {X}=\left [0,1\right ]^{d}\), for the sake of mathematical argument (so that we can make direct use of Lemma 2), a simple shift-and-scale argument can be used to generalize our result to cases where \(\mathbb {X}\) is any generic compact domain. The compactness assumption regarding the input domain is common in the MoE and mixture of regression models literature, as per the works of Jiang and Tanner (1999b); Jiang and Tanner (1999a); Norets (2010); Montuelle and Le Pennec (2014); Pelenis (2014); Devijver (2015a); Devijver (2015b).
The assumption permits the application of the result to settings where the input X is assumed to be a non-random design vector that takes values in some compact set \(\mathbb {X}\). This is often the case when there is only a finite number of possible design vector elements that X can take. Otherwise, the assumption also permits the scenario where X is a random element with a compactly supported distribution, such as uniformly distributed, or beta distributed inputs. Unfortunately, the case of random X over an unbounded domain (e.g., if X has a multivariate Gaussian distribution) is not covered under our framework. An extension to such cases would require a more general version of Lemma 2, which we believe is a nontrivial direction for future work.
Like the input, we also assume that the output domain is restricted to a compact set \(\mathbb {Y}\). However, the output domain of the approximating class of MoE models is not restricted to \(\mathbb {Y}\) (i.e., we allow ψ to be a PDF over \(\mathbb {R}^{q}\)). The restriction placed on \(\mathbb {Y}\) is also common in the mixture approximation literature, as per the works of Zeevi and Meir (1997); Li and Barron (1999); Rakhlin et al. (2005), and is also often made in the context of nonparametric regression (see, e.g., Gyorfi et al. 2002; Cucker and Zhou 2007). Here, our use of the compactness of \(\mathbb {Y}\) is to bound the integral of vn, in (10). A more nuanced approach, such as via the use of generalized Lebesgue spaces (see e.g., Castillo and Rafeiro 2010; Cruze-Uribe and Fiorenza 2013), may lead to results for unbounded \(\mathbb {Y}\). This is another exciting future direction of our research program.
A trivial modification to the proof of Lemma 4 allows us to replace the assumption that f is a PDF with a sub-PDF assumption (i.e., \(\int _{\mathbb {Y}}f\mathrm {d}\lambda \le 1\)), instead. This in turn permits us to replace the assumption that f(·|x) is a conditional PDF in Theorems 1 and 2 with sub-PDF assumptions as well (i.e., for each \(\boldsymbol {x}\in \mathbb {X}, \int _{\mathbb {Y}}f\left (\boldsymbol {y}|\boldsymbol {x}\right)\mathrm {d}\lambda \left (\boldsymbol {y}\right)\le 1\)). Thus, in this modified form, we have a useful interpretation for situations when the output Y is unbounded. That is, when Y is unbounded, we can say that the conditional PDF f can be arbitrarily well approximated in \(\mathcal {L}_{p}\) norm by a sequence \(\left \{ m_{K}^{\psi }\right \}_{K\in \mathbb {N}}\) of either soft-max or Gaussian gated MoEs over any compact subdomain \(\mathbb {Y}\) of the unbounded domain of Y. Thus, although we cannot provide guarantees over the entire domain of Y, we are able to guarantee arbitrary approximate fidelity over any arbitrarily large compact subdomain. This is a useful result in practice, since one is often not interested in the entire domain of Y, but only in some subdomain where the probability of Y is concentrated. This version of the result resembles traditional denseness results in approximation theory, such as those of Cheney and Light (2000), Ch. 20.
Finally, our results can be directly applied to provide approximation guarantees for a large number of currently used models in applied statistics and machine learning research. Particularly, our approximation guarantees are applicable to the recent MoE models of Ingrassia et al. (2012); Chamroukhi et al. (2013); Ingrassia et al. (2014); Chamroukhi (2016); Nguyen and McLachlan (2016); Deleforge et al. (2015a); Deleforge et al. (2015b); Kalliovirta et al. (2016); Perthame et al. (2018), among many others. Here, we may guarantee that the underlying data generating processes, if satisfying our assumptions, can be adequately well approximated by sufficiently complex forms of the models considered in each of the aforementioned work.
The rate and manner at which a good approximation can be achieved, as a function of the number of experts K and the sample size, is a currently active research area, with pioneering work conducted in Cohen and Le Pennec (2012); Montuelle and Le Pennec (2014). More recent results in this direction appear in Nguyen et al. (2020b); Nguyen et al. (2021); Nguyen et al. (2021).
Availability of data and materials
Not applicable.
Code availability
Not applicable.
Abbreviations
- MoE: Mixture of experts
- PDF: Probability density function
References
Bartle, R.: The Elements of Integration and Lebesgue Measure. Wiley, New York (1995).
Castillo, R. E., Rafeiro, H.: An Introductory Course in Lebesgue Spaces. Springer, Switzerland (2010).
Chamroukhi, F.: Robust mixture of experts modeling using the t distribution. Neural Netw. 79, 20–36 (2016).
Chamroukhi, F., Mohammed, S., Trabelsi, D., Oukhellou, L., Amirat, Y.: Joint segmentation of multivariate time series with hidden process regression for human activity recognition. Neurocomputing. 120, 633–644 (2013).
Cheney, W., Light, W.: A Course in Approximation Theory. Brooks/Cole, Pacific Grove (2000).
Cohen, S., Le Pennec, E.: Conditional density estimation by penalized likelihood model selection and application. ArXiv (arXiv:1103.2021) (2012).
Cruze-Uribe, D. V., Fiorenza, A.: Variable Lebesgue Spaces: Foundations and Harmonic Analysis. Birkhauser, Basel (2013).
Cucker, F., Zhou, D. -X.: Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge (2007).
Deleforge, A., Forbes, F., Horaud, R.: High-dimensional regression with Gaussian mixtures and partially-latent response variables. Stat. Comput. 25, 893–911 (2015).
Deleforge, A., Forbes, F., Horaud, R.: Acoustic space learning for sound-source separation and localization on binaural manifolds. Int. J. Neural Syst. 25, 1440003 (2015).
Devijver, E.: An ℓ1-oracle inequality for the Lasso in multivariate finite mixture of multivariate Gaussian regression models. ESAIM: Probab. Stat. 19, 649–670 (2015).
Devijver, E.: Finite mixture regression: a sparse variable selection by model selection for clustering. Electron. J. Stat. 9, 2642–2674 (2015).
Folland, G. B.: Real Analysis: Modern Techniques and Their Applications. Wiley, New York (1999).
Gyorfi, L., Kohler, M., Krzyzak, A., Walk, H.: A Distribution-free Theory Of Nonparametric Regression. Springer, New York (2002).
Ingrassia, S., Minotti, S. C., Punzo, A.: Model-based clustering via linear cluster-weighted models. Comput. Stat. Data Anal. 71, 159–182 (2014).
Ingrassia, S., Minotti, S. C., Vittadini, G.: Local statistical modeling via a cluster-weighted approach with elliptical distributions. J. Classif. 29, 363–401 (2012).
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., Hinton, G. E.: Adaptive mixtures of local experts. Neural Comput. 3, 79–87 (1991).
Jiang, W., Tanner, M. A.: On the approximation rate of hierachical mixtures-of-experts for generalized linear models. Neural Comput. 11, 1183–1198 (1999).
Jiang, W., Tanner, M. A.: Hierachical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation. Ann. Stat. 27, 987–1011 (1999).
Jordan, M. I., Xu, L.: Convergence results for the EM approach to mixtures of experts architectures. Neural Netw. 8, 1409–1431 (1995).
Kalliovirta, L., Meitz, M., Saikkonen, P.: Gaussian mixture vector autoregression. J. Econ. 192, 485–498 (2016).
Krzyzak, A., Schafer, D.: Nonparametric regression estimation by normalized radial basis function networks. IEEE Trans. Inf. Theory. 51, 1003–1010 (2005).
Li, J. Q., Barron, A. R.: Mixture density estimation. In: Solla, S. A., Leen, T. K., Mueller, K. R. (eds.)Advances in Neural Information Processing Systems. MIT Press, Cambridge (1999).
Masoudnia, S., Ebrahimpour, R.: Mixture of experts: a literature survey. Artif. Intell. Rev. 42, 275–293 (2014).
Mendes, E. F., Jiang, W.: On convergence rates of mixture of polynomial experts. Neural Comput. 24, 3025–3051 (2012).
Montuelle, L., Le Pennec, E.: Mixture of Gaussian regressions model with logistic weights, a penalized maximum likelihood approach. Electron. J. Stat. 8, 1661–1695 (2014).
Nguyen, H. D., Chamroukhi, F.: Practical and theoretical aspects of mixture-of-experts modeling: an overview. WIREs Data Min. Knowl. Disc. 8(4), 1246 (2018).
Nguyen, H. D., Chamroukhi, F., Forbes, F.: Approximation results regarding the multiple-output Gaussian gated mixture of linear experts model. Neurocomputing. 366, 208–214 (2019).
Nguyen, T. T., Chamroukhi, F., Nguyen, H. D., Forbes, F.: Non-asymptotic model selection in block-diagonal mixture of polynomial experts models. arXiv preprint arXiv:2104.08959 (2021). http://arxiv.org/abs/2104.08959.
Nguyen, T. T., Chamroukhi, F., Nguyen, H. D., McLachlan, G. J.: Approximation of probability density functions via location-scale finite mixtures in Lebesgue spaces. arXiv:2008.09787 (2020).
Nguyen, H. D., Lloyd-Jones, L. R., McLachlan, G. J.: A universal approximation theorem for mixture-of-experts models. Neural Comput. 28, 2585–2593 (2016).
Nguyen, H. D., McLachlan, G. J.: Laplace mixture of linear experts. Comput. Stat. Data Anal. 93, 177–191 (2016).
Nguyen, T. T., Nguyen, H. D., Chamroukhi, F., Forbes, F.: A non-asymptotic penalization criterion for model selection in mixture of experts models. ArXiv (arXiv:2104.02640) (2021).
Nguyen, T. T., Nguyen, H. D., Chamroukhi, F., McLachlan, G. J.: Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Math. Stat. 7, 1750861 (2020).
Nguyen, T. T., Nguyen, H. D., Chamroukhi, F., McLachlan, G. J.: An l1-oracle inequality for the Lasso in mixture-of-experts regression models. ArXiv (arXiv:2009.10622) (2020).
Norets, A.: Approximation of conditional densities by smooth mixtures of regressions. Ann. Stat. 38, 1733–1766 (2010).
Norets, A., Pelenis, J.: Posterior consistency in conditional density estimation by covariate dependent mixtures. Econ. Theory. 30, 606–646 (2014).
Oden, J. T., Demkowicz, L. F.: Applied Functional Analysis. CRC Press, Boca Raton (2010).
Pelenis, J.: Bayesian regression with heteroscedastic error density and parametric mean function. J. Econ. 178, 624–638 (2014).
Perthame, E., Forbes, F., Deleforge, A.: Inverse regression approach to robust nonlinear high-to-low dimensional mapping. J. Multivar. Anal. 163, 1–14 (2018).
Rakhlin, A., Panchenko, D., Mukherjee, S.: Risk bounds for mixture density estimation. ESAIM: Probab. Stat. 9, 220–229 (2005).
Wang, L. -X., Mendel, J. M.: Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Trans. Neural Netw. 3, 807–814 (1992).
Yuksel, S. E., Wilson, J. N., Gader, P. D.: Twenty years of mixture of experts. IEEE Trans. Neural Netw. Learn. Syst. 23, 1177–1193 (2012).
Zeevi, A. J., Meir, R.: Density estimation through convex combinations of densities: approximation and estimation bounds. Neural Comput. 10, 99–109 (1997).
Zeevi, A. J., Meir, R., Maiorov, V.: Error bounds for functional approximation and estimation using mixtures of experts. IEEE Trans. Inf. Theory. 44, 1010–1025 (1998).
Acknowledgements
Hien Duy Nguyen and Geoffrey John McLachlan are funded by Australian Research Council grants: DP180101192 and IC170100035. TrungTin Nguyen is supported by a "Contrat doctoral" from the French Ministry of Higher Education and Research. Faicel Chamroukhi is funded by the French National Research Agency (ANR) grant SMILES ANR-18-CE40-0014. The authors also thank the Editor and Reviewer, whose careful and considerate comments led to improvements in the text.
Funding
HDN and GJM are funded by Australian Research Council grants: DP180101192 and IC170100035. FC is funded by ANR grant: SMILES ANR-18-CE40-0014.
Author information
Authors and Affiliations
Contributions
All authors contributed equally to the exposition and to the mathematical derivations. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
None.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Nguyen, H.D., Nguyen, T., Chamroukhi, F. et al. Approximations of conditional probability density functions in Lebesgue spaces via mixture of experts models. J Stat Distrib App 8, 13 (2021). https://doi.org/10.1186/s40488-021-00125-0