Approximations of conditional probability density functions in Lebesgue spaces via mixture of experts models

Mixture of experts (MoE) models are widely applied to conditional probability density estimation problems. We demonstrate the richness of the class of MoE models by proving denseness results in Lebesgue spaces, when the input and output variables are both compactly supported. We further prove an almost uniform convergence result when the input is univariate. Auxiliary lemmas are proved regarding the richness of the soft-max gating function class, and its relationship to the class of Gaussian gating functions.


1 Introduction
Mixture of experts (MoE) models are a widely applicable class of conditional probability density approximators that have been considered as solution methods across the spectrum of statistical and machine learning problems; see, for example, the reviews of Yuksel et al. (2012), Masoudnia & Ebrahimpour (2014), and Nguyen & Chamroukhi (2018).
Let Z = X × Y, where X ⊆ R^d and Y ⊆ R^q, for d, q ∈ N. Suppose that the input and output random variables, X ∈ X and Y ∈ Y, are related via the conditional probability density function (PDF) f(y|x) in the functional class

F = { f : Z → [0, ∞) : ∫_Y f(y|x) dλ(y) = 1, for each x ∈ X },

where λ denotes the Lebesgue measure. The MoE approach seeks to approximate the unknown target conditional PDF f by a function of the MoE form

m(y|x) = ∑_{k=1}^{K} Gate_k(x) Expert_k(y),

where Gate = (Gate_k)_{k∈[K]} ∈ G^K ([K] = {1, ..., K}), Expert_1, ..., Expert_K ∈ E, and K ∈ N. Here, we say that m is a K-component MoE model with gates arising from the class G^K and experts arising from E, where E is a class of PDFs with support Y.
The most popular choices for G^K are the parametric soft-max and Gaussian gating classes,

G^K_S = { Gate : Gate_k(x) = exp(a_k^⊤ x + b_k) / ∑_{l=1}^{K} exp(a_l^⊤ x + b_l), a_k ∈ R^d, b_k ∈ R, k ∈ [K] }

and

G^K_G = { Gate : Gate_k(x) = π_k φ(x; ν_k, Σ_k) / ∑_{l=1}^{K} π_l φ(x; ν_l, Σ_l), π ∈ Π_{K−1}, ν_k ∈ R^d, Σ_k ∈ S_d, k ∈ [K] },

respectively. Here,

φ(x; ν, Σ) = |2πΣ|^{−1/2} exp( −(x − ν)^⊤ Σ^{−1} (x − ν) / 2 )

is the multivariate normal density function with mean vector ν and covariance matrix Σ, π^⊤ = (π_1, ..., π_K) is a vector of weights in the simplex

Π_{K−1} = { π ∈ (0, 1)^K : ∑_{k=1}^{K} π_k = 1 },

and S_d is the class of d × d symmetric positive definite matrices. The soft-max and Gaussian gating classes were first introduced by Jacobs et al. (1991) and Jordan & Xu (1995), respectively. Typically, one chooses experts that arise from some location-scale class,

E_ψ = { Expert : Expert(y) = σ^{−q} ψ((y − μ)/σ), μ ∈ R^q, σ ∈ (0, ∞) },

where ψ is a PDF with respect to R^q, in the sense that ψ : R^q → [0, ∞) and ∫_{R^q} ψ(y) dλ(y) = 1.
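As a concrete illustration of these definitions, the following Python sketch evaluates a soft-max gated MoE conditional density with location-scale experts generated by the standard normal PDF ψ. All parameter values (A, b, mu, sigma) are hypothetical and chosen only for demonstration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters for a K = 3 component soft-max gated MoE with
# univariate input (d = 1) and output (q = 1).
A = np.array([[4.0], [0.0], [-4.0]])   # gate slopes a_k (K x d)
b = np.array([-2.0, 0.0, -2.0])        # gate intercepts b_k
mu = np.array([-1.0, 0.0, 1.0])        # expert locations mu_k
sigma = np.array([0.3, 0.5, 0.3])      # expert scales sigma_k

def softmax_gates(x):
    """Gate_k(x) = exp(a_k' x + b_k) / sum_l exp(a_l' x + b_l)."""
    logits = A @ np.atleast_1d(x) + b
    logits -= logits.max()             # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()

def moe_density(y, x):
    """m(y|x) = sum_k Gate_k(x) * sigma_k^{-1} psi((y - mu_k)/sigma_k)."""
    gates = softmax_gates(x)
    experts = norm.pdf(y, loc=mu, scale=sigma)
    return float(gates @ experts)

# The conditional density integrates to (approximately) one for each x.
ys = np.linspace(-5, 5, 2001)
for x in (-1.0, 0.0, 1.0):
    mass = np.trapz([moe_density(y, x) for y in ys], ys)
    print(f"x = {x:+.1f}: integral of m(.|x) ~ {mass:.4f}")
```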
We shall say that f ∈ L_p, for p ∈ [1, ∞), if

‖f‖_{p,Z} = ( ∫_Z |f(z)|^p 1_Z(z) dλ(z) )^{1/p} < ∞,

where 1_Z is the indicator function that takes value 1 when z ∈ Z, and 0 otherwise. Further, we say that f ∈ L_∞ if

‖f‖_{∞,Z} = ess sup_{z∈Z} |f(z)| < ∞.

We shall refer to ‖•‖_{p,Z} as the L_p norm on Z, for p ∈ [1, ∞], and where the context is obvious, we shall drop the reference to Z. Let M^ψ_S and M^ψ_G denote the classes of MoE models of the form above, with gates arising from G^K_S and G^K_G, respectively, experts arising from E_ψ, and K ∈ N. Suppose that the target conditional PDF f is in the class F_p = F ∩ L_p. We address the problem of approximating f, with respect to the L_p norm, using MoE models in the soft-max and Gaussian gated classes, by showing that both M^ψ_S and M^ψ_G are dense in the class F_p, when X = [0, 1]^d and Y is a compact subset of R^q. Our denseness results are enabled by the indicator function approximation result of Jiang & Tanner (1999b), and the finite mixture model denseness theorems of Nguyen et al. (2020a) and Nguyen et al. (2020c).
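To make the approximation criterion concrete, the following sketch computes ‖f − m‖_{p,Z} by grid quadrature for a hypothetical target f and a deliberately crude approximation m; both densities are illustrative assumptions, not objects from this paper.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical target f(y|x): a normal density whose mean moves with x.
def f(y, x):
    return norm.pdf(y, loc=2.0 * x - 1.0, scale=0.5)

# Hypothetical crude approximation m(y|x): a single fixed normal density.
def m(y, x):
    return norm.pdf(y, loc=0.0, scale=1.0)

def lp_distance(p, nx=200, ny=400):
    """Grid quadrature of ||f - m||_{p,Z} on Z = [0,1] x [-4,4]."""
    xs = np.linspace(0.0, 1.0, nx)
    ys = np.linspace(-4.0, 4.0, ny)
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    diff = np.abs(f(Y, X) - m(Y, X)) ** p
    inner = np.trapz(diff, ys, axis=1)   # integrate over y, for each x
    return np.trapz(inner, xs) ** (1.0 / p)

print("L1 distance:", lp_distance(1))
print("L2 distance:", lp_distance(2))
```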
Our theorems contribute to a sustained interest in the approximation capabilities of MoE models. Related to our results are contributions regarding the approximation capabilities of the conditional expectation function of the classes M^ψ_S and M^ψ_G (Jiang & Tanner, 1999b, Krzyzak & Schafer, 2005, Mendes & Jiang, 2012, Nguyen et al., 2019, 2016, Wang & Mendel, 1992, Zeevi et al., 1998), and the approximation capabilities of subclasses of M^ψ_S and M^ψ_G, with respect to the Kullback-Leibler divergence (Jiang & Tanner, 1999a, Norets, 2010, Norets & Pelenis, 2014). Our results can be seen as complements to the Kullback-Leibler approximation theorems of Norets (2010) and Norets & Pelenis (2014), by the relationship between the Kullback-Leibler divergence and the L_2 norm (Zeevi & Meir, 1997). That is, when f > 1/κ, for all (x, y) ∈ Z and some constant κ > 0, the integrated conditional Kullback-Leibler divergence considered by Norets & Pelenis (2014),

𝒦(f, m) = ∫_X ∫_Y f(y|x) log[ f(y|x) / m(y|x) ] dλ(y) dλ(x),

satisfies 𝒦(f, m) ≤ κ ‖f − m‖²_{2,Z}, and thus a good approximation in the integrated Kullback-Leibler divergence is guaranteed if one can find a good approximation in the L_2 norm, which is guaranteed by our main result.
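The following toy computation illustrates (but does not prove) this relationship on a discretized domain: it evaluates the integrated conditional Kullback-Leibler divergence and the bound κ‖f − m‖²_2 for hypothetical densities that are kept bounded away from zero, so that a finite κ exists.

```python
import numpy as np

# Discretize Z = [0,1] x [0,1]; all densities below are hypothetical.
xs = np.linspace(0.0, 1.0, 101)
ys = np.linspace(0.0, 1.0, 201)
X, Y = np.meshgrid(xs, ys, indexing="ij")

def normalize(g):
    """Normalize g(y|x) so that it integrates to one in y for each x."""
    mass = np.trapz(g, ys, axis=1)
    return g / mass[:, None]

f = normalize(0.2 + np.exp(-20.0 * (Y - X) ** 2))        # target f(y|x)
m = normalize(0.2 + np.exp(-20.0 * (Y - 0.5) ** 2))      # approximation m(y|x)

kl_inner = np.trapz(f * np.log(f / m), ys, axis=1)       # KL(f(.|x) || m(.|x))
kl = np.trapz(kl_inner, xs)                              # integrated KL
l2_sq = np.trapz(np.trapz((f - m) ** 2, ys, axis=1), xs) # ||f - m||_2^2
kappa = 1.0 / min(f.min(), m.min())

print(f"integrated KL          = {kl:.5f}")
print(f"kappa * ||f - m||_2^2  = {kappa * l2_sq:.5f}")   # upper bound here
```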
The remainder of the manuscript proceeds as follows.The main result is presented in Section 2. Technical lemmas are provided in Section 3. The proofs of our results are then presented in Section 4. Proofs of required lemmas that do not appear elsewhere are provided in Section 5. A summary of our work and some conclusions are drawn in Section 6.

2 Main results
Denote the class of bounded functions on Z by

B = { f : Z → R : sup_{z∈Z} |f(z)| < ∞ },

and write its norm as ‖f‖_{B(Z)} = sup_{z∈Z} |f(z)|. Further, let C denote the class of continuous functions on Z. Note that if Z is compact and f ∈ C, then f ∈ B.

Theorem 1. Assume that X = [0, 1]^d, for d ∈ N, and that Y is a compact subset of R^q. If f ∈ F_p ∩ C, for p ∈ [1, ∞), and ψ is a continuous PDF on R^q, then there exists a sequence (m^ψ_K)_{K∈N} ⊂ M^ψ_S, such that lim_{K→∞} ‖f − m^ψ_K‖_{p,Z} = 0.

Since convergence in Lebesgue spaces does not imply point-wise modes of convergence, the following result is also useful and interesting in some restricted scenarios. Here, we note that the mode of convergence is almost uniform, which implies almost everywhere convergence and convergence in measure (cf. Bartle 1995, Lem. 7.10 and Thm. 7.11). The almost uniform convergence of (m^ψ_K)_{K∈N} to f in the following result is to be understood in the sense of Bartle (1995, Def. 7.9). That is, for every δ > 0, there exists a set E_δ ⊂ Z with λ(E_δ) < δ, such that (m^ψ_K)_{K∈N} converges uniformly to f on E^∁_δ.

Theorem 2. Assume that X = [0, 1] (i.e., d = 1) and that Y is a compact subset of R^q. If f ∈ F ∩ C and ψ is a continuous PDF on R^q, then there exists a sequence (m^ψ_K)_{K∈N} ⊂ M^ψ_S that converges almost uniformly to f.

The following result establishes the connection between the gating classes G^K_S and G^K_G.

Lemma 1. For each K ∈ N, any gating vector in G^K_S can be equivalently represented as an element of G^K_G, and any gating vector in G^K_G with equal covariance matrices Σ_1 = ⋯ = Σ_K can be equivalently represented as an element of G^K_S.
Further, if we define the class of Gaussian gating vectors with equal covariance matrices,

G̃^K_G = { Gate ∈ G^K_G : Σ_1 = ⋯ = Σ_K },

then Lemma 1 implies that G^K_S = G̃^K_G ⊆ G^K_G. We can directly apply Lemma 1 to establish the following corollary to Theorems 1 and 2, regarding the approximation capability of the class M^ψ_G.
Corollary 1. Theorems 1 and 2 hold when M ψ S is replaced by M ψ G in their statements.
3 Technical lemmas

For each n ∈ N, let {X^n_k : k ∈ [n^d]} denote the partition of X = [0, 1]^d into n^d congruent hypercubes of side length 1/n, and let x^n_k denote a representative point (e.g., the center) of X^n_k. Note that {X^n_k : k ∈ [n^d]} is a partition of X for each n, and that λ(X^n_k) = n^{−d} gets smaller, as n increases. The following result from Jiang & Tanner (1999b) establishes the approximation capability of soft-max gates.

Lemma 2 (Jiang & Tanner, 1999b). For each n ∈ N, there exists a sequence of gating vectors (Gate^γ)_{γ>0} ⊂ G^{n^d}_S, such that for each k ∈ [n^d] and p ∈ [1, ∞),

lim_{γ→∞} ‖Gate^γ_k − 1_{X^n_k}‖_{p,X} = 0.

By Egorov's theorem (cf. Bartle 1995), the convergence in Lemma 2 can be strengthened to an almost uniform mode, which we record as Lemma 3: for each k ∈ [n^d] and ε > 0, there exists a measurable set B_k(ε) ⊆ X, with λ(B_k(ε)) < ε, such that Gate^γ_k → 1_{X^n_k} uniformly on B^∁_k(ε), as γ → ∞.

For a PDF ψ on support R^q, define the class of finite mixture models by

H_ψ = { h : h(y) = ∑_{k=1}^{K} π_k σ^{−q}_k ψ((y − μ_k)/σ_k), π ∈ Π_{K−1}, μ_k ∈ R^q, σ_k ∈ (0, ∞), k ∈ [K], K ∈ N }.

We require the following result, from Nguyen et al. (2020a), regarding the approximation capabilities of H_ψ.

Lemma 4 (Nguyen et al., 2020a). Let g be a PDF on a compact set Y ⊂ R^q. If g and ψ are continuous, then there exists a sequence (h_K)_{K∈N} ⊂ H_ψ, such that lim_{K→∞} ‖g − h_K‖_{B(Y)} = 0.
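To visualize Lemma 2, the following sketch (for d = 1) uses one hypothetical choice of soft-max parameters, scaled by a factor γ, whose component-wise maximizers align with the partition cells; as γ grows, the gates approach the cell indicators.

```python
import numpy as np

# A minimal sketch of the indicator approximation in d = 1: soft-max gates
# with parameters scaled by gamma concentrate on the cells
# X_k^n = [(k-1)/n, k/n) of the partition of X = [0,1] as gamma grows.
# The slopes/intercepts below are one hypothetical choice that makes the
# affine function a_k * x + b_k maximal exactly on the k-th cell.
n = 5
k = np.arange(1, n + 1)
a = k.astype(float)                      # slopes a_k = k
b = -k * (k - 1) / (2.0 * n)             # intercepts: lines cross at x = k/n

def gates(x, gamma):
    logits = gamma * (a * x + b)
    logits -= logits.max()               # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()

# As gamma grows, Gate_k(x) approaches the indicator of the cell holding x.
for gamma in (1.0, 10.0, 100.0):
    x = 0.5                              # lies in the third cell, [0.4, 0.6)
    print(f"gamma = {gamma:6.1f}: Gate(x=0.5) = {np.round(gates(x, gamma), 3)}")
```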
4 Proofs of main results

Proof of Theorem 1
To prove the result, it suffices to show that for each ǫ > 0, there exist K ∈ N and m^ψ_K ∈ M^ψ_S, such that ‖f − m^ψ_K‖_p ≤ ǫ. The main steps of the proof are as follows. We firstly approximate f(y|x) by

υ_n(y|x) = ∑_{k∈[n^d]} 1_{X^n_k}(x) f(y|x^n_k),   (1)

where

‖f − υ_n‖_p ≤ ǫ/3,   (2)

for all n ≥ N_1(ǫ), for some sufficiently large N_1(ǫ) ∈ N. Then we approximate υ_n(y|x) by

η_n(y|x) = ∑_{k∈[n^d]} Gate^{γ_n}_k(x) f(y|x^n_k),   (3)

where

‖υ_n − η_n‖_p ≤ ǫ/3,   (4)

using Lemma 2. Finally, we approximate η_n(y|x) by m^ψ_{K_n}(y|x), where

m^ψ_{K_n}(y|x) = ∑_{k∈[n^d]} Gate^{γ_n}_k(x) h_{n_k}(y),   (5)

and

h_{n_k}(y) = ∑_{l=1}^{n_k} π_{k,l} σ^{−q}_{k,l} ψ((y − μ_{k,l})/σ_{k,l}) ∈ H_ψ.   (6)

Here, we establish that there exists N_2(ǫ, n, γ_n) ∈ N, such that

‖η_n − m^ψ_{K_n}‖_p ≤ ǫ/3,   (7)

for all k ∈ [n^d] and n_k ≥ N_2(ǫ, n, γ_n). Results (2)-(7) then imply that for each ǫ > 0, there exist N_1(ǫ), γ_n, and N_2(ǫ, n, γ_n), such that for all n ≥ N_1(ǫ) and n_k ≥ N_2(ǫ, n, γ_n), ‖f − m^ψ_{K_n}‖_p ≤ ǫ. The following inequality results from an application of the triangle inequality:

‖f − m^ψ_{K_n}‖_p ≤ ‖f − υ_n‖_p + ‖υ_n − η_n‖_p + ‖η_n − m^ψ_{K_n}‖_p.

We now focus our attention on proving each of the results (2)-(7). To prove (2), we note that since f is uniformly continuous (because Z = X × Y is compact, and f ∈ C), there exists a function of form (1), such that for every ε > 0,

sup_{(x,y)∈Z} |f(y|x) − υ_n(y|x)| < ε,   (8)

for all sufficiently large n. We can construct such an approximation by considering the fact that, as n increases, the diameter δ_n = sup_{k∈[n^d]} diam(X^n_k) of the cells of the partition goes to zero. By the uniform continuity of f, for every ε > 0, there exists a δ(ε) > 0, such that ‖(x, y) − (x′, y′)‖ < δ(ε) implies |f(y|x) − f(y′|x′)| < ε. Here, ‖•‖ denotes the Euclidean norm. Furthermore, for any (x, y) ∈ Z, we have

|f(y|x) − υ_n(y|x)| ≤ ∑_{k∈[n^d]} 1_{X^n_k}(x) |f(y|x) − f(y|x^n_k)|,   (9)

by the triangle inequality. Since x^n_k ∈ X^n_k, for each k and n, we have the fact that ‖(x, y) − (x^n_k, y)‖ < δ_n, for (x, y) ∈ X^n_k × Y. By uniform continuity, for each ε, we can find a sufficiently small δ(ε), such that |f(y|x) − f(y|x^n_k)| < ε, for all k. The desired result (8) can be obtained by noting that the right-hand side of (9) consists of only one non-zero summand for any (x, y) ∈ Z, and by choosing n ∈ N sufficiently large, so that δ_n < δ(ε). By (8), we have the fact that υ_n → f, point-wise. We can bound υ_n as follows:

υ_n(y|x) ≤ sup_{(x,y)∈Z} f(y|x) = ‖f‖_{B(Z)},   (10)

where the right-hand side is a constant and is therefore in L_p, since Z is compact. An application of the Lebesgue dominated convergence theorem in L_p then yields (2).
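The first step of this construction is easy to check numerically. The sketch below (with a hypothetical target density f) refines the partition of X = [0, 1] and shows that the piecewise approximation υ_n approaches f in L_1.

```python
import numpy as np
from scipy.stats import norm

# A numeric sketch of the first proof step: the piecewise-in-x approximation
# v_n(y|x) = sum_k 1{x in X_k^n} f(y|x_k^n) converges to f in L_p as the
# partition of X = [0,1] is refined. The target f below is hypothetical.
def f(y, x):
    return norm.pdf(y, loc=np.sin(2 * np.pi * x), scale=0.4)

def v_n(y, x, n):
    # Representative points x_k^n are taken as cell midpoints.
    k = np.minimum(np.floor(x * n), n - 1)    # index of the cell containing x
    return f(y, (k + 0.5) / n)

xs = np.linspace(0.0, 1.0, 401)
ys = np.linspace(-3.0, 3.0, 401)
X, Y = np.meshgrid(xs, ys, indexing="ij")
for n in (2, 8, 32, 128):
    err = np.trapz(np.trapz(np.abs(f(Y, X) - v_n(Y, X, n)), ys, axis=1), xs)
    print(f"n = {n:4d}: ||f - v_n||_1 ~ {err:.4f}")
```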
Next, we write

‖υ_n − η_n‖^p_p = ∫_X ∫_Y | ∑_{k∈[n^d]} [1_{X^n_k}(x) − Gate^{γ_n}_k(x)] f(y|x^n_k) |^p dλ(y) dλ(x).

Since the norm arguments are separable in x and y, we apply Fubini's theorem to get

‖υ_n − η_n‖^p_p ≤ λ(Y) ‖f‖^p_{B(Z)} ∫_X ( ∑_{k∈[n^d]} |1_{X^n_k}(x) − Gate^{γ_n}_k(x)| )^p dλ(x),

so that (4) holds upon making the right-hand side sufficiently small, which can be achieved via a direct application of Lemma 2. We have thus shown (4). Lastly, we are required to approximate f(y|x^n_k), for each k ∈ [n^d], by a function of form (6). Since Y is compact and f and ψ are continuous, we can apply Lemma 4 directly. Note that over a set of finite measure, convergence in ‖•‖_B implies convergence in L_p norm, for all p ∈ [1, ∞] (cf. Oden & Demkowicz 2010, Prop. 3.9.3).
We can then write (5) as

m^ψ_{K_n}(y|x) = ∑_{k∈[n^d]} Gate^{γ_n}_k(x) h_{n_k}(y),

where K_n = ∑_{k=1}^{n^d} n_k. To obtain (7), we write

‖η_n − m^ψ_{K_n}‖^p_p = ∫_X ∫_Y | ∑_{k∈[n^d]} Gate^{γ_n}_k(x) [f(y|x^n_k) − h_{n_k}(y)] |^p dλ(y) dλ(x).

By separability and Fubini's theorem, we then have

‖η_n − m^ψ_{K_n}‖^p_p ≤ λ(X) ∑_{k∈[n^d]} ‖f(•|x^n_k) − h_{n_k}‖^p_{p,Y}.

Then, we apply Lemma 4 n^d times to establish the existence of a constant N_2(ǫ, n, γ_n) ∈ N, such that for all k ∈ [n^d] and n_k ≥ N_2(ǫ, n, γ_n),

‖f(•|x^n_k) − h_{n_k}‖_{p,Y} ≤ ǫ / (3 n^{d/p}).

Thus, we have ‖η_n − m^ψ_{K_n}‖_p ≤ ǫ/3, which completes our proof.

Proof of Theorem 2
The proof is procedurally similar to that of Theorem 1, and thus we only seek to highlight the important differences. Firstly, for any ǫ > 0, we approximate f(y|x) by υ_n(y|x) of form (1), with d = 1. The argument used to establish (2) (cf. (8)) implies uniform convergence, in the sense that there exists an N_1(ǫ) ∈ N, such that for all n ≥ N_1(ǫ),

sup_{(x,y)∈Z} |f(y|x) − υ_n(y|x)| ≤ ǫ/3.   (12)

We now seek to approximate υ_n by η_n of form (3), with γ_n = γ_l for some l ∈ N. Upon application of Lemma 3, it follows that for each k ∈ [n^d] and ε > 0, there exists a measurable set B_k(ε) ⊆ X, with λ(B_k(ε)) < ε, such that Gate^{γ_l}_k → 1_{X^n_k} uniformly on B^∁_k(ε), as l → ∞. Here, (•)^∁ is the set complement operator.
Since f ∈ B, we have the bound

|υ_n(y|x) − η_n(y|x)| ≤ ‖f‖_{B(Z)} ∑_{k∈[n^d]} |1_{X^n_k}(x) − Gate^{γ_l}_k(x)|,

and

sup_{x∈∩_{k∈[n^d]} B^∁_k(ε)} |1_{X^n_k}(x) − Gate^{γ_l}_k(x)| ≤ sup_{x∈B^∁_k(ε)} |1_{X^n_k}(x) − Gate^{γ_l}_k(x)|, for each k ∈ [n^d].

Here we use the fact that the supremum over some intersection of sets is less than or equal to the minimum of the suprema over each individual set. Upon defining

B(ε) = ∪_{k∈[n^d]} B_k(ε) and C(ε) = B(ε) × Y,

it follows that λ(B(ε)) ≤ n^d ε and

C^∁(ε) = [B^∁(ε) × Y] ∪ [B(ε) × Y^∁] ∪ [B^∁(ε) × Y^∁].

Since B(ε) × Y^∁ and B^∁(ε) × Y^∁ are empty, via separability, we have

sup_{(x,y)∈C^∁(ε)} |υ_n(y|x) − η_n(y|x)| ≤ ‖f‖_{B(Z)} sup_{x∈B^∁(ε)} ∑_{k∈[n^d]} |1_{X^n_k}(x) − Gate^{γ_l}_k(x)|.

Recall that the gates converge uniformly to the corresponding indicator functions on B^∁(ε) = ∩_{k∈[n^d]} B^∁_k(ε), and thus there exists an N(ǫ, n, ε) ∈ N, such that for all l ≥ N(ǫ, n, ε),

sup_{(x,y)∈C^∁(ε)} |υ_n(y|x) − η_n(y|x)| ≤ ǫ/3,   (13)

as required.
Finally, by noting that for each k ∈ [n^d], both (6) and f(•|x^n_k) are continuous over Y, we apply Lemma 4 to obtain an N_2(ǫ, n, l) ∈ N, such that for any ǫ > 0 and n_k ≥ N_2(ǫ, n, l), we have

sup_{(x,y)∈Z} |η_n(y|x) − m^ψ_{K_n}(y|x)| ≤ ǫ/3.   (14)

Here, K_n = ∑_{k=1}^{n^d} n_k. In summary, via (12), (13), and (14), for each ǫ > 0 and any ε > 0, there exist a set C(ε) ⊂ Z, with λ(C(ε)) ≤ n^d ε λ(Y), and constants N_1(ǫ), N(ǫ, n, ε), and N_2(ǫ, n, l), such that for all n ≥ N_1(ǫ), l ≥ N(ǫ, n, ε), and n_k ≥ N_2(ǫ, n, l),

sup_{(x,y)∈C^∁(ε)} |f(y|x) − m^ψ_{K_n}(y|x)| ≤ ǫ.

Since ε > 0 is arbitrary, this establishes the almost uniform convergence of (m^ψ_K)_{K∈N} to f. This completes the proof.
5 Proofs of lemmas

Proof of Lemma 1
We firstly prove that any gating vector from G^K_S can be equivalently represented as an element of G^K_G. For any Gate ∈ G^K_S, with parameters a_k ∈ R^d and b_k ∈ R, k ∈ [K], we may write

Gate_k(x) = exp(a_k^⊤ x + b_k) / ∑_{l=1}^{K} exp(a_l^⊤ x + b_l) = exp(a_k^⊤ x + b_k − ‖x‖²/2) / ∑_{l=1}^{K} exp(a_l^⊤ x + b_l − ‖x‖²/2).

This implies that

Gate_k(x) = π_k φ(x; ν_k, I) / ∑_{l=1}^{K} π_l φ(x; ν_l, I),

upon setting ν_k = a_k and π_k ∝ exp(b_k + ‖a_k‖²/2), where I is the identity matrix of appropriate size. This proves that G^K_S ⊆ G̃^K_G ⊆ G^K_G.

Next, let Gate ∈ G̃^K_G, with weights π ∈ Π_{K−1}, mean vectors ν_1, ..., ν_K, and common covariance matrix Σ, and note that

π_k φ(x; ν_k, Σ) = exp( ā_k^⊤ x + b̄_k − x^⊤ Σ^{−1} x / 2 − log |2πΣ| / 2 ),

where ā_k = Σ^{−1} ν_k and b̄_k = log π_k − ν_k^⊤ Σ^{−1} ν_k / 2. Thus, we have

Gate_k(x) = exp(ā_k^⊤ x + b̄_k) / ∑_{l=1}^{K} exp(ā_l^⊤ x + b̄_l),

since the factors that do not depend on k cancel in the ratio. Next, notice that we can write

Gate_k(x) = 1 / ∑_{l=1}^{K} exp(α_l^⊤ x + β_l),

where α_l = ā_l − ā_k and β_l = b̄_l − b̄_k. To complete the proof, we choose the soft-max parameters a_l = ā_l and b_l = b̄_l, for every l ∈ [K], so that Gate ∈ G^K_S, and hence G̃^K_G ⊆ G^K_S.
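The first direction of this argument can be verified numerically. The sketch below converts hypothetical soft-max parameters (a_k, b_k) into the Gaussian gating parameters ν_k = a_k, Σ_k = I, and π_k ∝ exp(b_k + ‖a_k‖²/2), and confirms that the two gating vectors coincide.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Numeric check of the first direction of Lemma 1: a soft-max gating vector
# with parameters (a_k, b_k) equals a Gaussian gating vector with nu_k = a_k,
# Sigma_k = I, and weights pi_k proportional to exp(b_k + ||a_k||^2 / 2).
rng = np.random.default_rng(0)
K, d = 4, 3
A = rng.normal(size=(K, d))              # soft-max slopes a_k (hypothetical)
b = rng.normal(size=K)                   # soft-max intercepts b_k

def softmax_gate(x):
    logits = A @ x + b
    w = np.exp(logits - logits.max())    # stabilized soft-max
    return w / w.sum()

def gaussian_gate(x):
    pi = np.exp(b + 0.5 * np.sum(A ** 2, axis=1))
    pi = pi / pi.sum()                   # normalize to the simplex
    dens = np.array([multivariate_normal.pdf(x, mean=A[k], cov=np.eye(d))
                     for k in range(K)])
    w = pi * dens
    return w / w.sum()

x = rng.normal(size=d)
print(np.allclose(softmax_gate(x), gaussian_gate(x)))   # expected: True
```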

6 Summary and conclusions
Using the recent mixture model approximation results of Nguyen et al. (2020a) and Nguyen et al. (2020c), and the indicator approximation theorem of Jiang & Tanner (1999b) (cf. Section 3), we have proved two approximation theorems (Theorems 1 and 2) regarding the class of soft-max gated MoE models with experts arising from arbitrary location-scale families of conditional density functions. Via an equivalence result (Lemma 1), the results of Theorems 1 and 2 also extend to the setting of Gaussian gated MoE models (Corollary 1), which can be seen as a generalization of the soft-max gated MoE models.
Although we explicitly make the assumption that X = [0, 1]^d, for the sake of mathematical argument (so that we can make direct use of Lemma 2), a simple shift-and-scale argument can be used to generalize our results to cases where X is any generic compact domain. The compactness assumption regarding the input domain is common in the MoE and mixture of regression models literature, as per the works of Jiang & Tanner (1999a), Jiang & Tanner (1999b), Norets (2010), Montuelle & Le Pennec (2014), Pelenis (2014), Devijver (2015a), and Devijver (2015b).
The assumption permits the application of the results to settings where the input X is assumed to be a non-random design vector that takes values in some compact set X. This is often the case when there is only a finite number of possible design vector values that X can take. Otherwise, the assumption also permits the scenario where X is a random element with a compactly supported distribution, such as uniformly distributed or beta distributed inputs. Unfortunately, the case of random X over an unbounded domain (e.g., if X has a multivariate Gaussian distribution) is not covered under our framework. An extension to such cases would require a more general version of Lemma 2, which we believe is a nontrivial direction for future work.
Like the input, we also assume that the output domain is restricted to a compact set Y. However, the output domain of the approximating class of MoE models is not restricted to Y (i.e., we allow ψ to be a PDF over R^q). The restriction placed on Y is also common in the mixture approximation literature, as per the works of Zeevi & Meir (1997), Li & Barron (1999), and Rakhlin et al. (2005), and is also often made in the context of nonparametric regression (see, e.g., Gyorfi et al., 2002 and Cucker & Zhou, 2007). Here, our use of the compactness of Y is to bound the integral of υ_n, in (10). A more nuanced approach, such as via the use of generalized Lebesgue spaces (see, e.g., Castillo & Rafeiro, 2010 and Cruz-Uribe & Fiorenza, 2013), may lead to results for unbounded Y. This is another exciting future direction of our research program.
A trivial modification to the proof of Lemma 4 allows us to replace the assumption that f is a PDF with a sub-PDF assumption (i.e., ∫_Y f dλ ≤ 1), instead. This in turn permits us to replace the assumption that f(•|x) is a conditional PDF in Theorems 1 and 2 with sub-PDF assumptions as well (i.e., for each x ∈ X, ∫_Y f(y|x) dλ(y) ≤ 1). Thus, in this modified form, we have a useful interpretation for situations where the output Y is unbounded. That is, when Y is unbounded, we can say that the conditional PDF f can be arbitrarily well approximated in L_p norm by a sequence (m^ψ_K)_{K∈N} of either soft-max or Gaussian gated MoEs over any compact subdomain Y of the unbounded domain of Y. Thus, although we cannot provide guarantees over the entire domain of Y, we are able to guarantee arbitrary approximation fidelity over any arbitrarily large compact subdomain. This is a useful result in practice, since one is often not interested in the entire domain of Y, but only in some subdomain where the probability of Y is concentrated. This version of the result resembles traditional denseness results in approximation theory, such as those of Cheney & Light (2000, Ch. 20).
Finally, our results can be directly applied to provide approximation guarantees for a large number of currently used models in applied statistics and machine learning research. In particular, our approximation guarantees are applicable to the recent MoE models of Ingrassia et al. (2012), Chamroukhi et al. (2013), Ingrassia et al. (2014), Chamroukhi (2016), Nguyen & McLachlan (2016), Deleforge et al. (2015b), Deleforge et al. (2015a), Kalliovirta et al. (2016), and Perthame et al. (2018), among many others. Here, we may guarantee that the underlying data generating processes, if satisfying our assumptions, can be adequately well approximated by sufficiently complex forms of the models considered in each of the aforementioned works.
The rate and manner in which good approximations can be achieved, as a function of the number of experts K and the sample size, is a currently active research area, with pioneering work conducted in Cohen & Le Pennec (2012) and Montuelle & Le Pennec (2014). More recent results in this direction appear in Nguyen et al. (2020b), Nguyen et al. (2021b), and Nguyen et al. (2021a).

List of abbreviations
MoE: Mixture of experts
PDF: Probability density function