Alternative approaches for econometric modeling of panel data using mixture distributions
Journal of Statistical Distributions and Applications volume 4, Article number: 9 (2017)
Abstract
The economic researcher is sometimes confronted with panel datasets that come from a population made up of a finite number of subpopulations. Within each subpopulation the individuals may also be heterogeneous with respect to some unobserved characteristics. A good understanding of the behavior of the observed individuals may then require the ability to identify the groups to which they belong and to study their behavior across and within groups. This is not a complicated exercise when a group indicator variable is available in the dataset. However, such a variable may not be included in the dataset; as a result, the econometrician is forced to work with the marginal distribution of the observed response variable, which takes the form of a mixture distribution.
One can model a given response variable with a variety of mixture distributions. In this paper, I present several related mixture models. The most flexible one is an extension of the model by Kim et al. (2008) to the panel data setting.
I review the estimation of some of these models by the Expectation-Maximization (EM) algorithm. The intent is to exploit the good convergence properties of this algorithm when it is difficult to find good starting values for a Newton-type algorithm. I also discuss how to compare these models and ultimately identify the one that provides the best fit to the data set under investigation. As an application, I examine the investment behavior of U.S. manufacturing firms.
Introduction
To model the heterogeneity of economic agents, I present a series of panel data mixture models of increasing flexibility and complexity and show how they can be used to handle at least two types of heterogeneity: heterogeneity with respect to group membership, and heterogeneity with respect to within-group differences in individual characteristics. I also review the estimation of some of these models via the Expectation-Maximization (EM) algorithm. The objective is to take advantage of the good convergence properties of this algorithm when it is difficult to find good starting values for a Newton-type algorithm. Finally, I review some statistical tests that can be used to choose the best model among those discussed in this paper.
Heterogeneity is an important problem faced by the statistician or the econometrician trying to infer the behavior of economic agents from available data sets. Economic decision makers are heterogeneous in their characteristics, and they usually operate in different environments. As a result, their behavior generates data whose distributions are sometimes difficult to approximate with the traditional single-component econometric models. To deal with this problem, economists often divide their sample into groups using observed variables such as time (in time series) or other individual characteristics (in time series and longitudinal data). The groups obtained this way are usually static and may differ from alternative groups obtained using different observed variables.
While this strategy may allow researchers to draw some useful conclusions, it is less attractive than an approach that uses multiple characteristics to determine group membership. It is also less flexible than an approach that allows an individual to change group membership depending on the evolution of his characteristics and of the conditions that he is facing. Lastly, it is much less flexible than an approach that offers a unified, one-step way to make inference about both group membership and behavior. Mixture-of-distributions models offer such flexibility. These models are justified not only in theory, because they offer a natural way to model heterogeneity, but also in practice, since they can provide a semiparametric approximation to the nonstandard distributions of some economic variables at a reasonable cost (McLachlan and Peel 2000). Mixtures of distributions are in fact at the crossroads between parametric and nonparametric families of distributions. They are parametric because each component distribution usually belongs to a parametric family of distributions, and they are nonparametric because it is possible to provide a very good approximation to the distribution of some variables by increasing the number of components of the mixture (Fink 2007).
Among the economic variables whose study can benefit from the application of mixture distributions, one can cite firms’ investment, household consumption, money demand, household use of healthcare, etc. Finite mixture distributions are commonly used in econometrics, mainly in cross-sectional and time series analyses. Following Hamilton (1988), some versions of the hidden Markov models have been extensively used in macroeconometrics to model business cycle fluctuations under the name of Markov switching regression models. Nevertheless, applications of mixtures of distributions in the panel data setting appear to be limited. In many cases the panel data set is treated almost the same way as a cross section. In some rare cases, as in Deb and Trivedi (2013), the dependence of the observations within each unit is modeled using individual-specific effects. However, if the panel data set is viewed as a collection of time series, it is not difficult to extend the hidden Markov models used in time series analysis to the panel data setting. This is the point of view adopted in this paper and also by Asea and Blomberg (1998) as well as Atman (2007) and Maruotti (2007). The most flexible models presented in this paper extend the time series model by Kim et al. (2008). I allow the Markov chains to be time-inhomogeneous and nonstationary, and I introduce within-group heterogeneity in the component distributions using the specification by Mundlak (1978). The models are closest to those of Atman (2007) and Maruotti (2007). A related set of models applied to panel count data can also be found in Trivedi and Hyppolite (2012).
The models
Several alternative mixture distributions can be used to model the bivariate process formed by an economic agent’s decision and its group membership. In the following sections, nine such models are described, from the simplest to the most complicated. All of the models are assumed to have two components, but the extension to more than two components is not difficult.
The models can be used to study several different economic phenomena, such as household consumption under financial constraints, firms’ investment under financial constraints, household demand for money, household use of healthcare, etc. In what follows I use the example of investment choices under financial constraints to motivate the specifications.
A finite mixture model with constant mixing proportions (\(\mathcal {M}_{1}\))
Consider the vector of random variables (Y _{ it },W _{ it })^{′}, where Y _{ it } represents agent i’s decision at time t (t=1,…,T _{ i }; i=1,…,n) while W _{ it } is a discrete random variable indicating the group to which the agent belongs at that time.
In a model about firms’ investment decisions under financing constraints, Y _{ it } would represent firm i’s investment rate at time t, while W _{ it } would be the firm’s financial status at that time. Y _{ it } and W _{ it } are assumed to be dependent in the sense that the agent’s decision depends on the group he belongs to; more precisely I assume that
where ϕ(.) is the density function of a univariate normal distribution and x _{ it } is a row vector of covariates including individual characteristics that influence the agent’s decisions, and β _{1} and β _{2} are column vectors of parameters. The joint density of (y _{ it },w _{ it })^{′} is given by
and the marginal density of y _{ it } is
Let
When W _{ it } follows a Bernoulli distribution with parameter π, the marginal density becomes
This is a classical finite mixture of distributions with constant weights π and 1−π.
Parameters Estimation
The parameters of the preceding model can be estimated using maximum likelihood.
The complete-data likelihood is
while the marginal likelihood is
Since W _{ it } is missing, maximizing the marginal likelihood appears to be the most natural estimation approach. However, the Expectation-Maximization (EM) algorithm (Dempster et al. 1977) offers a much simpler alternative. This algorithm maximizes the complete-data likelihood after augmenting the data for the missing variable W _{ it } during the expectation step.
The two main steps of the algorithm are the following:

E-step (Expectation step)
During this step, the intermediate quantity
$$Q\left(\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}\right)=\mathbb{E}_{w_{it}}\left(\log L^{c}(\boldsymbol{\theta})\mid \boldsymbol{\theta}^{\prime}\right) $$
is computed.
M-step (Maximization step)
During this step, the following maximization problem is solved:
$$\begin{aligned} &\hat{\boldsymbol{\theta}}=\underset{\boldsymbol{\theta}}{\text{argmax}}\ {Q}\left(\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}\right)\\ &\qquad \text{Subject to:}\\ &\qquad \text{Various appropriate constraints.} \end{aligned} $$
For the model considered here
Defining
and,
the expected complete-data log-likelihood can be written as
After solving the system of equations derived from the first order conditions we get
Once we have an estimate of \(\mathbb {E}_{w_{it}}(\mathbb {I}(w_{it}=1)\mid y_{it};\boldsymbol {\theta }^{\prime })\), computing the preceding estimators is simple. In fact,
So, if we know π, (β _{1},σ _{1}) and (β _{2},σ _{2}), we can find an estimate of \(\mathbb {E}\left (w_{it}\mid y_{it};\boldsymbol {\theta }^{\prime }\right)\). The EM algorithm for this model can be summarized as follows:

1. Choose initial values \(\boldsymbol {\theta }^{0}=\left (\pi ^{0},\boldsymbol {\beta }_{1}^{0},\sigma _{1}^{0},\boldsymbol {\beta }_{2}^{0},\sigma _{2}^{0}\right)\).
2. Compute E(w _{ it }|y _{ it };θ ^{0}) for each observation.
3. Substitute E(w _{ it }|y _{ it };θ ^{0}) in the complete-data log-likelihood.
4. Find new values for the parameters \(\boldsymbol {\theta }^{1}=\left (\pi ^{1},\boldsymbol {\beta }_{1}^{1},\sigma _{1}^{1},\boldsymbol {\beta }_{2}^{1},\sigma _{2}^{1}\right)\) by maximizing the resulting expected complete-data log-likelihood.
5. Compute \(\text {error}=\frac {L\left (\boldsymbol {\theta }^{1}\right)-L\left (\boldsymbol {\theta }^{0}\right)}{L\left (\boldsymbol {\theta }^{0}\right)}\).
6. If the error is higher than a chosen tolerance level, return to step 2 with the latest parameter estimates.
7. Otherwise, stop; the last estimates are the maximum likelihood estimates.
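The seven steps above can be sketched in Python for a two-component mixture of normal regressions. This is a minimal illustration under stated assumptions, not the estimation code used for the paper: it assumes an intercept plus a single covariate, uses simulated data, and monitors the relative change in the log-likelihood in step 5. The function names `weighted_ols` and `em_mixture` are hypothetical.

```python
import math
import random

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def weighted_ols(x, y, w):
    """Weighted least squares for y = b0 + b1*x; returns (b0, b1, sigma)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    var = sum(wi * (yi - b0 - b1 * xi) ** 2 for wi, xi, yi in zip(w, x, y)) / sw
    return b0, b1, math.sqrt(var)

def log_likelihood(x, y, pi, c1, c2):
    """Marginal log-likelihood of the two-component mixture."""
    return sum(
        math.log(pi * normal_pdf(yi, c1[0] + c1[1] * xi, c1[2])
                 + (1.0 - pi) * normal_pdf(yi, c2[0] + c2[1] * xi, c2[2]))
        for xi, yi in zip(x, y))

def em_mixture(x, y, pi, c1, c2, tol=1e-8, max_iter=500):
    history = [log_likelihood(x, y, pi, c1, c2)]  # step 1: starting values
    for _ in range(max_iter):
        # Step 2 (E-step): tau = E(w_it | y_it; theta), the posterior
        # probability that the observation comes from component 1.
        tau = []
        for xi, yi in zip(x, y):
            f1 = pi * normal_pdf(yi, c1[0] + c1[1] * xi, c1[2])
            f2 = (1.0 - pi) * normal_pdf(yi, c2[0] + c2[1] * xi, c2[2])
            tau.append(f1 / (f1 + f2))
        # Steps 3-4 (M-step): closed-form updates given tau.
        pi = sum(tau) / len(tau)
        c1 = weighted_ols(x, y, tau)
        c2 = weighted_ols(x, y, [1.0 - t for t in tau])
        # Steps 5-7: stop when the relative change is below the tolerance.
        history.append(log_likelihood(x, y, pi, c1, c2))
        if abs(history[-1] - history[-2]) <= tol * abs(history[-2]):
            break
    return pi, c1, c2, history

# Simulated example (all numbers are illustrative assumptions).
rng = random.Random(0)
x = [rng.uniform(0.0, 1.0) for _ in range(400)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.3) if rng.random() < 0.6
     else -1.0 - 1.0 * xi + rng.gauss(0.0, 0.3) for xi in x]
pi_hat, comp1, comp2, history = em_mixture(
    x, y, 0.5, (0.5, 0.0, 1.0), (-0.5, 0.0, 1.0))
```

The `history` list records the log-likelihood at every iteration, so the monotonicity property discussed next can be checked directly when debugging.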
This algorithm is attractive not only because it provides an intuitive interpretation of the estimation, but also because of its monotone and global convergence properties. It has been proved (McLachlan and Krishnan 1997) that the log-likelihood is nondecreasing from one iteration to the next. This property is very useful for detecting programming errors. Moreover, the global convergence property allows for more flexibility in the choice of starting values than is possible with a Newton-type algorithm.
However, the EM algorithm is criticized not only because it converges slowly, but also because it does not automatically supply an estimate of the covariance matrix of the parameters (McLachlan and Krishnan 1997). The Hessian needed to estimate the information matrix in the maximum likelihood setting is not used in the computations. Several solutions to this problem have been proposed in the literature; the most notable one is provided by Louis (1982).
Note that according to this model the probability that an economic agent belongs to a certain group remains the same in every period. In a dynamic economic environment this assumption is too restrictive. For example, the financial status of a firm cannot be determined by flipping a coin; it is more likely to depend on the firm’s performance, its characteristics and the economic conditions it is facing. Thus, several observed variables should help in determining group membership, and a more realistic model should allow for covariate-dependent mixing proportions.
A finite mixture model with smoothly varying mixing proportions (\(\mathcal {M}_{2}\))
Suppose
where
z _{ it } is a row vector of covariates that impact the probability for an agent i to belong to a certain group and γ is a column vector of parameters.
The group membership equation (Eq. 11) could be modeled with the logistic distribution. However, since I want to compare all the models, I would also need to model endogeneity in the same setting, which is not straightforward with the logistic distribution. It is therefore better to use the normal distribution and take advantage of the convenient properties of the conditional distributions of a partitioned normal random vector. Let
The joint density of (Y _{ it },W _{ it }) is
and the marginal density is
where Φ(.) is the univariate cumulative distribution function of a standard normal random variable.
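As a rough numerical illustration of this marginal density, the sketch below evaluates a two-component normal mixture whose weight on the first component is Φ(z _{ it } γ). The sign convention (which component receives the weight Φ(z _{ it } γ)) and all function and argument names are assumptions made for the example; they depend on how the group membership equation is normalized.

```python
import math

def std_normal_cdf(v):
    """Phi(v), the standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def m2_density(y, xb1, sd1, xb2, sd2, zg):
    """Mixture density with weight Phi(zg) on component 1 (assumed convention).

    xb1 and xb2 stand for x_it * beta_1 and x_it * beta_2;
    zg stands for z_it * gamma.
    """
    p1 = std_normal_cdf(zg)
    return p1 * normal_pdf(y, xb1, sd1) + (1.0 - p1) * normal_pdf(y, xb2, sd2)

# With zg = 0 the two components are weighted equally; a larger zg shifts
# weight toward component 1.
d_equal = m2_density(0.2, 0.5, 1.0, -0.5, 1.0, 0.0)
d_tilted = m2_density(0.2, 0.5, 1.0, -0.5, 1.0, 1.5)
```

Setting the slope coefficients in γ to zero leaves a constant weight, which is how this specification nests the constant-proportion mixture.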
Parameters Estimation
This model can also be estimated using the EM algorithm. The complete-data likelihood is
and the marginal likelihood
The intermediate EM quantity is
Using Eqs. (1)-(10), the intermediate EM quantity can be rewritten as
The first-order conditions for the maximization of Q(θ,θ ^{′}) do not produce a closed-form solution for γ, but the estimators for (β _{1},σ _{1},β _{2},σ _{2}) are the same as before. Since the intermediate EM quantity is separable in the different groups of parameters, \(\hat {\boldsymbol {\gamma }}\) can be obtained separately using a Newton-type method:
Also
The EM algorithm can be implemented exactly as before.
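The Newton-type update for γ can be illustrated with a small Fisher-scoring routine. The sketch assumes that the γ-part of the intermediate quantity is a probit log-likelihood in which the E-step posterior probabilities τ _{ it } play the role of fractional responses. It handles only an intercept and one covariate so that the 2×2 linear system can be solved by hand, and it uses the expected rather than the observed information. All names are hypothetical.

```python
import math

def phi(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def Phi(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def score_and_info(z, tau, g0, g1):
    """Score and expected information of sum tau*log(Phi) + (1-tau)*log(1-Phi)."""
    s0 = s1 = i00 = i01 = i11 = 0.0
    for zi, ti in zip(z, tau):
        v = g0 + g1 * zi
        p, d = Phi(v), phi(v)
        r = (ti - p) * d / (p * (1.0 - p))  # derivative with respect to the index
        h = d * d / (p * (1.0 - p))         # expected information weight
        s0 += r
        s1 += r * zi
        i00 += h
        i01 += h * zi
        i11 += h * zi * zi
    return (s0, s1), (i00, i01, i11)

def fit_gamma(z, tau, g0=0.0, g1=0.0, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        (s0, s1), (i00, i01, i11) = score_and_info(z, tau, g0, g1)
        det = i00 * i11 - i01 * i01
        d0 = (i11 * s0 - i01 * s1) / det    # solve the 2x2 system I * step = score
        d1 = (i00 * s1 - i01 * s0) / det
        g0, g1 = g0 + d0, g1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return g0, g1

# Illustrative check: if tau equals Phi(0.3 + 0.8 z) exactly, the maximizer
# of the weighted probit criterion is (0.3, 0.8).
z = [-1.0 + 2.0 * i / 49.0 for i in range(50)]
tau = [Phi(0.3 + 0.8 * zi) for zi in z]
g0_hat, g1_hat = fit_gamma(z, tau)
```

Within EM this routine would be called once per M-step, warm-started at the previous value of γ.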
An Endogenous Switching Regression Model (\(\mathcal {M}_{3}\))
The preceding models are not as common in economics as they are in statistics. The more general model known as the switching regression model seems to be preferred, and it has been used in several papers in the literature on firms’ investment. One reason why this model is more popular in econometrics may be, as pointed out by Kim et al. (2008), that the authors were mainly interested in modeling limited dependent variables. The main difference between the preceding models and the switching regression model is that in the switching regression model the distributions of the component error terms are defined on the whole population, while in the mixture models they are defined only on the corresponding subpopulations (Maddala 1999). The model can be presented as follows:
Let
σ _{12} does not enter the density function and is therefore not estimable.
The joint density for (Y _{ it },W _{ it }) is given by
Similarly
The joint density becomes
The complete-data likelihood is
and the marginal likelihood is
which takes the form of a mixture of two distributions. However, since ε _{ it } is dependent on u _{2i t } and u _{1i t } and since u _{2i t }≠u _{1i t }, the weights do not necessarily add up to one, which is another difference between this model and the regular finite mixture of two normal distributions. Note that
Thus
or
Thus,
Similarly,
By plugging Eqs. (14) and (15) in Eq. (12), the marginal likelihood becomes
When σ _{ ε1}=σ _{ ε2}=0, the preceding likelihood is the same as in the previous model and the weights add up to one. Xiaoqiang and Schiantarelli (1998), Hovakimian and Titman (2006) and Almeida and Campello (2007) use classical econometric methods to estimate the preceding endogenous switching regression model with fixed effects.
This model can also be estimated with the EM algorithm, but the intermediate EM quantity is no longer separable in the parameters, which makes this approach less appealing than the direct maximization of the log of the marginal likelihood. Maximizing Q(θ,θ ^{′}) at each iteration is potentially as computationally involved as the one-step maximization of the marginal likelihood. However, if one has difficulty finding good starting values for a Newton-type algorithm, one can still benefit from the good convergence properties of the EM algorithm via the simpler model \(\mathcal {M}_{2}\). As indicated before, if the correlations between the components and the group membership equation are zero, \(\mathcal {M}_{3}\) is identical to \(\mathcal {M}_{2}\), and as a result the latter will provide very good starting values for the former. One just has to apply the EM algorithm to \(\mathcal {M}_{2}\) and use the solution as the starting value for \(\mathcal {M}_{3}\).
An Endogenous Switching Regression Model with Random Effect (\(\mathcal {M}_{4}\))
The endogenous switching regression can be extended by adding random effects in the components to capture within-group heterogeneity, which is a very important issue in the panel data setting considered in this paper. Let
Following Mundlak (1978) I assume
α _{ i1} and α _{ i2} capture within-group heterogeneity, which is decomposed into two parts: a fixed-effect part (x _{.i } ζ _{0} and x _{.i } ζ _{1}) and a random-effect part (ξ _{ i0} and ξ _{ i1}) uncorrelated with the exogenous variables, where
This specification of the firm-specific effect is interesting because in practice one expects that some of the exogenous variables will be correlated with the agent’s unobserved characteristics, which may also contain a random component. Moreover, the use of two different random effects, one for each component distribution, allows the data to dictate whether or not those agents who fall more often in a given group have the same unobserved specific characteristics as those who fall most of the time in the other group. When ζ _{1} and ζ _{2} equal zero one obtains the usual random effect specification.
Let
Then
Assuming that the responses y _{ it } are independent conditional on the random effects, the unconditional likelihood is
Since the random effects are assumed to follow a normal distribution, the double integral is computed using Gauss-Hermite quadrature. To put the integral in a convenient form, I first need to write the vector of correlated normally distributed random effects as a linear function of a vector of standard normal random variables. This is done using the spectral decomposition of Σ
where Λ is a diagonal matrix whose diagonal elements are the eigenvalues of Σ while S is the corresponding matrix of eigenvectors. Let
If I write
I then have:
where z _{1} and z _{2} are independent univariate standard normal random variables. The integral can then be approximated as
using R-point one-dimensional Gauss-Hermite weights w _{ r } and nodes z _{ r }, r=1,…,R.
One can alternatively use a Cholesky decomposition, but as noted by Jäckel (2005), the spectral decomposition provides a better rotation of the sampling points, which makes the evaluation of the integral potentially more robust. Another issue is wasted computation time. The two-dimensional standard normal density, for example, has circular level curves centered at the origin, and its mass is concentrated within circles of radius less than or equal to 3. However, the set of sampling points obtained by taking the Cartesian product of one-dimensional sets of sampling points forms a square in two dimensions. The mass at the points located near the corners of the square is almost zero and does not contribute to the integral, which wastes computation time. One way to deal with this issue is to use what is called “pruning” (Jäckel 2005), which is a way of eliminating these unimportant points. This can be done by rewriting the integral approximation as:
where
Using the MATLAB function mherzo.m written by Zhang and Jin (1996), I have generated 9-point one-dimensional Gauss-Hermite weights w _{ r } and nodes z _{ r } (r=1,…,9).
The Cartesian products of the nodes with and without pruning are shown in Fig. 1.
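The quadrature scheme described in this section can be sketched in pure Python. The sketch is illustrative and is not the MATLAB or FORTRAN code used for the paper: `gauss_hermite` computes nodes and weights for the weight function exp(−x²) by Newton iteration on the Hermite recurrence, `spectral_factor` builds S Λ^{1/2} for a 2×2 covariance matrix, and `expect_2d` evaluates the product rule with an optional pruning threshold. The threshold used in the example (smallest weight times largest weight divided by R) is an assumed choice for illustration, and all function names are hypothetical.

```python
import math

def gauss_hermite(n, tol=1e-14, max_iter=100):
    """Nodes x_r, weights w_r: sum_r w_r f(x_r) ~ integral of exp(-x^2) f(x)."""
    nodes, weights = [0.0] * n, [0.0] * n
    z = 0.0
    for i in range((n + 1) // 2):
        # Starting guesses for the roots, largest root first.
        if i == 0:
            z = math.sqrt(2 * n + 1) - 1.85575 * (2 * n + 1) ** (-1.0 / 6.0)
        elif i == 1:
            z -= 1.14 * n ** 0.426 / z
        elif i == 2:
            z = 1.86 * z - 0.86 * nodes[0]
        elif i == 3:
            z = 1.91 * z - 0.91 * nodes[1]
        else:
            z = 2.0 * z - nodes[i - 2]
        for _ in range(max_iter):
            p1, p2 = math.pi ** -0.25, 0.0
            for j in range(1, n + 1):  # orthonormal Hermite recurrence
                p1, p2 = (z * math.sqrt(2.0 / j) * p1
                          - math.sqrt((j - 1.0) / j) * p2), p1
            deriv = math.sqrt(2.0 * n) * p2
            step = p1 / deriv
            z -= step                  # Newton update toward the root
            if abs(step) <= tol:
                break
        nodes[i], nodes[n - 1 - i] = z, -z
        weights[i] = weights[n - 1 - i] = 2.0 / (deriv * deriv)
    return nodes, weights

def spectral_factor(a, b, c):
    """A = S * Lambda^(1/2) for Sigma = [[a, b], [b, c]], so that A A' = Sigma."""
    if b == 0.0:
        return [[math.sqrt(a), 0.0], [0.0, math.sqrt(c)]]
    half_gap = math.hypot(0.5 * (a - c), b)
    cols = []
    for lam in (0.5 * (a + c) + half_gap, 0.5 * (a + c) - half_gap):
        v1, v2 = lam - c, b            # eigenvector for eigenvalue lam
        norm = math.hypot(v1, v2)
        cols.append((math.sqrt(lam) * v1 / norm, math.sqrt(lam) * v2 / norm))
    return [[cols[0][0], cols[1][0]], [cols[0][1], cols[1][1]]]

def expect_2d(g, a, b, c, R=9, threshold=None):
    """Approximate E[g(xi1, xi2)] for (xi1, xi2) ~ N(0, [[a, b], [b, c]])."""
    x, w = gauss_hermite(R)
    A = spectral_factor(a, b, c)
    total, used = 0.0, 0
    for xr, wr in zip(x, w):
        for xs, ws in zip(x, w):
            if threshold is not None and wr * ws < threshold:
                continue               # pruned: negligible product weight
            z1, z2 = math.sqrt(2.0) * xr, math.sqrt(2.0) * xs
            xi1 = A[0][0] * z1 + A[0][1] * z2
            xi2 = A[1][0] * z1 + A[1][1] * z2
            total += wr * ws * g(xi1, xi2)
            used += 1
    return total / math.pi, used

x9, w9 = gauss_hermite(9)
theta = min(w9) * max(w9) / 9.0        # assumed pruning threshold
full, n_full = expect_2d(lambda u, v: u * v, 1.0, 0.5, 2.0)
pruned, n_pruned = expect_2d(lambda u, v: u * v, 1.0, 0.5, 2.0, threshold=theta)
```

In the example the integrand is the product of the two random effects, so the exact answer is the covariance 0.5; the pruned rule evaluates fewer grid points at the cost of a small approximation error.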
One should note that Gaussian quadrature, and numerical integration methods in general, suffer from the curse of dimensionality: the number of function evaluations required to approximate the integral to a given degree of accuracy increases exponentially with the dimension of the integral. Monte Carlo integration or a monomial rule may be less costly. González et al. (2006) show that Monte Carlo and quasi-Monte Carlo methods can not only reduce computation time but also provide better accuracy in the case of logistic regressions.
Alternatively, one can use the h-likelihood method of Lee and Nelder (1996) and bypass the computation of the integral. In this case the random effects are treated as additional parameters that are estimated together with the other parameters. For panel data with a large number of units, this method significantly increases the number of parameters to be estimated.
A Hidden Markov Model (\(\mathcal {M}_{5}\))
One problem with the models already presented is that they do not allow the group membership at time t to depend on the group membership at time t−1. When the groups are made up of firms with the same financial status, one should note that several of the variables used in the literature to determine the presence or absence of financial constraints, such as the firm’s size or the fraction of its assets that can be used as collateral, are likely to be time-dependent. As a result, the firm’s financial status at time t potentially depends on its status at time t−1. One way to capture this time dependence is to make the following assumption
Let
I assume that w _{ it } is an unobserved variable following a first-order Markov chain on a discrete state space. The bivariate discrete-time process (Y _{ it },W _{ it }), where the Y _{ it } are independent conditional on w _{ it }, is a hidden Markov model (Cappé et al. 2005). Thus, the joint density for (Y _{ it },W _{ it }) is given by
where I _{ i(t−1)} denotes the information about firm i available up to time t−1.
If, for a given firm i, the path of the chain is \(\{w_{i1}=j_{1},w_{i2}=j_{2},\ldots,w_{iT_{i}}=j_{T_{i}}\}\), the joint density for this firm would be
Note that the total number of possible paths is \(2^{T_{i}}\) for firm i. Suppose that the initial probability vector for firm i is
The joint density can be rewritten as
The preceding is true if we know a priori the full path of the state variable, w _{ it }. If we don’t, the joint density can be written as
The marginal density for firm i for the observed data is then
If
then the marginal density can be rewritten using vector-matrix operations (MacDonald and Zucchini 1997)
where 1 ^{′} is a column vector of ones.
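To make the vector-matrix form concrete, the sketch below evaluates the marginal likelihood of one firm's series as the product (initial vector) × B _{1} × P × B _{2} × … × P × B _{ T } × (vector of ones), where B _{ t } is the diagonal matrix of component densities at time t, and checks it against brute-force summation over all 2^{T} state paths. It assumes constant component means in place of x _{ it } β _{ j } and a constant transition matrix; the numbers and function names are illustrative assumptions.

```python
import itertools
import math

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def hmm_loglik(y, init, P, means, sds):
    """log of init' B_1 P B_2 ... P B_T 1, rescaled at each step for stability."""
    k = len(init)
    alpha = [init[j] * normal_pdf(y[0], means[j], sds[j]) for j in range(k)]
    loglik = 0.0
    for t in range(len(y)):
        if t > 0:
            # Multiply by P, then by the diagonal density matrix B_t.
            alpha = [sum(alpha[i] * P[i][j] for i in range(k))
                     * normal_pdf(y[t], means[j], sds[j]) for j in range(k)]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

def brute_force_lik(y, init, P, means, sds):
    """Sum the joint density over all state paths (feasible only for small T)."""
    k = len(init)
    total = 0.0
    for path in itertools.product(range(k), repeat=len(y)):
        p = init[path[0]] * normal_pdf(y[0], means[path[0]], sds[path[0]])
        for t in range(1, len(y)):
            p *= P[path[t - 1]][path[t]] * normal_pdf(
                y[t], means[path[t]], sds[path[t]])
        total += p
    return total

y_i = [0.3, 1.1, -0.4, 0.9, 1.5]
init = [0.6, 0.4]
P = [[0.8, 0.2], [0.3, 0.7]]
means, sds = [0.0, 1.0], [0.5, 0.7]
fast = hmm_loglik(y_i, init, P, means, sds)
slow = math.log(brute_force_lik(y_i, init, P, means, sds))
```

The recursion needs on the order of T operations per firm, whereas the path enumeration grows as 2^{T}, which is why the matrix form is the one used in practice.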
Parameters Estimation (EM Algorithm)
Let θ=(_{ i } π _{1}, _{ i } P _{11}, _{ i } P _{22},β _{1},β _{2}) be the vector of parameters of the model. With N firms, the dimension of the vector θ is
which is a large number of parameters. To reduce the number of parameters to be estimated, I assume that
then,
The complete-data likelihood is given by
and the complete-data log-likelihood is
or
The intermediate quantity of EM is
To get the preceding expectation, only \(E_{\boldsymbol {\theta }^{\prime }}(\mathbb {I}(w_{it}=j)\mid \boldsymbol {\Im }_{iT_{i}})\) and \(E_{\boldsymbol {\theta }^{\prime }}(\mathbb {I}(w_{i(t-1)}=k,w_{it}=l)\mid \boldsymbol {\Im }_{iT_{i}})\) need to be evaluated. Note that
where
The EM algorithm for this model proceeds as follows:

1. Choose initial values θ _{0}.
2. Compute \(p(w_{it}=j\mid \boldsymbol {\Im }_{iT_{i}};\boldsymbol {\theta }_{0})\) and \(p(w_{i(t-1)}=k,w_{it}=l\mid \boldsymbol {\Im }_{iT_{i}};\boldsymbol {\theta }_{0})\) for each observation.
3. Substitute the computed probabilities in the intermediate EM quantity \(\mathcal {Q}(\boldsymbol {\theta },\boldsymbol {\theta }_{0})\).
4. Solve
$$\begin{aligned} &\boldsymbol{\theta}_{1}=\underset{\boldsymbol{\theta}}{\text{argmax}}\ \mathcal{Q}(\boldsymbol{\theta},\boldsymbol{\theta}_{0})\\ &\text{subject to:}\\ &\sum_{j=1}^{2}\pi_{j}=1\\ &\sum_{l=1}^{2}P_{kl}=1;k=1,2\\ &0\leq\pi_{j}\leq 1, j=1,2. \end{aligned} $$
5. Repeat step 2 after replacing θ _{0} by θ _{1}.
6. Keep going until convergence.
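The posterior probabilities needed in step 2 can be obtained with a scaled forward-backward pass. The sketch below is a generic Python illustration for a single firm and a constant transition matrix, not the FORTRAN code used for the paper. It rescales the forward probabilities at every period, in the spirit of the scaling discussed below, and also shows the standard closed-form transition update that the pairwise posteriors imply. All names are hypothetical.

```python
import math

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def forward_backward(dens, init, P):
    """dens[t][j] = f(y_t | w_t = j). Returns (post_state, post_pair, loglik)."""
    T, k = len(dens), len(init)
    alpha, scale, prev = [], [], None
    for t in range(T):
        if t == 0:
            a = [init[j] * dens[0][j] for j in range(k)]
        else:
            a = [sum(prev[i] * P[i][j] for i in range(k)) * dens[t][j]
                 for j in range(k)]
        c = sum(a)                      # scaling constant for period t
        prev = [v / c for v in a]
        alpha.append(prev)
        scale.append(c)
    beta = [[1.0] * k for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(P[i][j] * dens[t + 1][j] * beta[t + 1][j]
                       for j in range(k)) / scale[t + 1] for i in range(k)]
    # p(w_t = j | all data) and p(w_{t-1} = i, w_t = j | all data)
    post_state = [[alpha[t][j] * beta[t][j] for j in range(k)] for t in range(T)]
    post_pair = [[[alpha[t - 1][i] * P[i][j] * dens[t][j] * beta[t][j] / scale[t]
                   for j in range(k)] for i in range(k)] for t in range(1, T)]
    return post_state, post_pair, sum(math.log(c) for c in scale)

def update_transition(post_pair):
    """Closed-form M-step for a constant transition matrix (one firm)."""
    k = len(post_pair[0])
    P_new = []
    for i in range(k):
        row = [sum(pp[i][j] for pp in post_pair) for j in range(k)]
        total = sum(row)
        P_new.append([v / total for v in row])
    return P_new

y_i = [0.3, 1.1, -0.4, 0.9, 1.5]
dens = [[normal_pdf(v, 0.0, 0.5), normal_pdf(v, 1.0, 0.7)] for v in y_i]
post_state, post_pair, loglik = forward_backward(
    dens, [0.6, 0.4], [[0.8, 0.2], [0.3, 0.7]])
P_new = update_transition(post_pair)
```

With a panel, the sums in `update_transition` would also run over firms; the single-firm version is kept here for brevity.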
It should be noted that the forward-backward algorithm used to obtain α _{ it }(k) and \(\check {\beta }_{it}(k)\) is subject to numerical underflow. To avoid this problem, the FORTRAN codes used for this algorithm apply the scaling method proposed by Rabiner (1989). The version of the EM algorithm just presented is also known as the Baum-Welch algorithm. Step 4 is called the M-step or maximization step. The Lagrangian for the problem is
Assume, as before, that f(y _{ it }|w _{ it }=l) is the density function of the normal distribution. Then,
Let
then
The first order conditions for the maximization problem are the following:
Combining Eqs. (18) and (22) one gets
Thus,
which implies
From Eq. (20) one gets
From Eq. (21) one obtains
One main drawback of the HMM with a constant transition matrix is that the probability for a firm to move from one state to another does not depend on any observable, which is unrealistic for the reasons considered in the case of the first model.
HMM Model with Time dependent Transition Matrix (\(\mathcal {M}_{6}\))
To relax the constraint imposed on the preceding model by the constant transition probabilities, a transition matrix whose components are functions of some observables can be used. Suppose
where
The preceding equation means that it is possible to predict the financial situation of firm i at time t using its situation at time t−1 and some exogenous variables z _{ it }. Thus,
the transition matrix is then
This is a time-heterogeneous transition matrix. This matrix is different from the specifications in Asea and Blomberg (1998), Atman (2007) and Maruotti (2007). It is also possible to use a probit or logit model for each row of the transition matrix. In fact, when the Markov chain has more than two states, a multinomial probit or logit model would be the most convenient choice. However, for a chain with two states, the current specification appears to be better, since it involves a smaller number of parameters and offers a simple way to test for time dependence by testing the hypothesis λ=0.
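A small sketch can show how a single extra parameter λ generates the whole transition matrix. It assumes, purely for illustration, a probit index of the form z _{ it } γ + λ w _{ i(t−1)} with states coded 0 and 1, so that the probability of being in state 1 at time t is Φ of that index. This functional form, the coding of the states and the function names are assumptions of the example.

```python
import math

def std_normal_cdf(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def transition_matrix(z, gamma, lam):
    """Rows: previous state 0 or 1; columns: current state 0 or 1."""
    index = sum(zj * gj for zj, gj in zip(z, gamma))
    rows = []
    for prev_state in (0, 1):
        p_to_1 = std_normal_cdf(index + lam * prev_state)
        rows.append([1.0 - p_to_1, p_to_1])
    return rows

# Same covariates, with and without dependence on the previous state.
P_dep = transition_matrix([1.0, 0.4], [-0.2, 0.5], lam=1.2)
P_indep = transition_matrix([1.0, 0.4], [-0.2, 0.5], lam=0.0)
```

When λ=0 the two rows coincide, so the current state no longer depends on the previous one; this is the restriction tested by the hypothesis λ=0.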
Parameters Estimation
The complete-data log-likelihood function looks the same as in the previous section. The only difference is that the transition probabilities now depend on the parameters γ and λ. As a result, instead of estimating the transition matrix, I have to estimate γ and λ. Note that there are no closed-form solutions for the first-order conditions with respect to γ and λ, so the M-step of the EM algorithm includes a Newton-Raphson maximization step.
where _{ i } P _{ kl }, k=1,2; l=1,2; i=1,…,n are given in the preceding transition matrix. The HMM presented in this section does not account for within-group heterogeneity, which opens the door for a possible extension.
Hidden Markov Model with Time Varying Transition Matrix and Random Effects (\(\mathcal {M}_{7}\))
Even though the groups are homogeneous with respect to the financial characteristics used to form them, there are still some unobserved characteristics with respect to which the firms within a given group can be considered heterogeneous. One such characteristic is the difference in management. To take account of this additional source of heterogeneity, I introduce an unobserved firm-specific variable in each of the two components. Let
ξ _{ ji } (j=1,2) are random effects that are uncorrelated with x _{ it } and \(\bar {\boldsymbol {x}}_{.i}\). The conditional expectations will be modeled such that
I also assume that the random effect is independent of the firm’s financial situation, captured by the variable w _{ it }, and that conditional on the random effects and \(\{w_{it}\}_{1}^{T_{i}}\), the investment observations are independent. The complete-data likelihood can be written as
The complete-data log-likelihood is then
The intermediate EM quantity is given by
A closed-form solution for the maximization of the intermediate EM quantity \(\mathcal {Q}(\boldsymbol {\theta },\boldsymbol {\theta }^{\prime })\) exists only for the first component. The other three components have to be maximized using a Newton-type method. Let the Lagrangian for the first component be
The first order conditions are
Thus,
Using Eq. (34) in Eq. (35), I get
since
Thus
where
Let
then
The integrals are computed using Gauss-Hermite quadrature.
Hidden Markov Model with Time Varying Transition Matrix and endogeneity (\(\mathcal {M}_{8}\))
An alternative way of extending model \(\mathcal {M}_{6}\) is to assume that the states of the Markov chain and the response variable are dependent. More precisely, we can assume
together with Eqs. (28), (29) and
The last distributional assumptions make the states of the Markov chains and the response variable y _{ it } interdependent. The resulting model is an extension to the panel data setting of a modified version of the model by Kim et al. (2008). The transition matrix of the current model uses fewer parameters, and the correlations between the state-indicator variable and the component distributions are allowed to be different.
Parameters estimation
Because of the interdependence between the states of the Markov chains and the response variable, during the maximization step of the EM algorithm the parameters of the transition matrix and of the component distributions have to be estimated together. As a result, the EM algorithm does not have any computational advantage over a Newton-type algorithm applied to the marginal likelihood. The latter can be written as in Eq. (16) after some suitable transformation. Note that
Thus, to evaluate this conditional density the current state and the previous state are both needed. The computation of the likelihood therefore requires conditional densities that depend on the current state and the previous state. Since the Markov chain has two states, four conditional densities result. To write the likelihood as in Eq. (16), the Markov chain has to be rewritten as a four-state chain. Let WW _{ it } be the new Markov chain with state space
The event ww _{ it }=kl is equivalent to w _{ i(t−1)}=k and w _{ it }=l. The transition matrix associated with the new chain can be written as
Since the component densities now depend on the current state and the previous state, if the initial distribution of the old state-indicator variable (w _{ it }) is still the distribution at time 1, the first observation of each firm will not enter the computation of the likelihood. One alternative is to assume that the initial distribution is the distribution at time 0. In this case, even when the correlations between the state-indicator variable and the component distributions are zero, the likelihood of the current model will not be equal to the likelihood of model \(\mathcal {M}_{7}\). As a result, when testing for endogeneity, direct tests on the correlation coefficients may be preferred to the likelihood ratio test comparing models \(\mathcal {M}_{7}\) and \(\mathcal {M}_{8}\). Given the preceding assumption, the initial distribution of the new state-indicator variable is
Let
With these transformations, the marginal likelihood is given by Eq. (16). The component densities are
The conditional distribution of ε _{ it } given u _{1i t } is given in Eq. (13). Using this conditional distribution the previous expression becomes
Similarly,
Hidden Markov Model with Time Varying Transition Matrix, endogeneity and random effects (\(\mathcal {M}_{9}\))
For a panel data set, a natural extension of the previous model is obtained by adding random effects in the components using the specifications in Eqs. (30)-(32). The main difference between the likelihood of the current model and that of model \(\mathcal {M}_{8}\) is the introduction of a double integral in the former. More formally, if for each unit the response variables are assumed to be independent conditional on the random effects and if one maintains the assumption that the units are independent, the marginal likelihood is
Model Identification
The parameters of all the models previously presented are not automatically identified. In theory the log-likelihoods are all unbounded and a maximum likelihood estimator may not exist. Also, they all suffer from non-identification due to label switching. The log-likelihood is invariant under permutation of the components, which makes it difficult to distinguish the unconstrained component from the constrained component.
As suggested in the literature (Frühwirth-Schnatter 2006), this identification problem can be solved by the use of a set of constraints. These constraints may come from economic theory. In the case of firms’ physical investment one may be tempted to argue that a firm that has no trouble financing its investment activities should have a higher investment to capital ratio than when it has trouble obtaining funds, ceteris paribus. However, economic theory can only support the idea that a constrained firm is likely to choose a rate of investment below its optimal rate. Given the heterogeneity of the firms, it is possible that the majority of the constrained firms have a higher optimal rate of investment than the unconstrained ones. As a result, the previous constraint would be misleading. Thus, identification constraints should be chosen with care.
Another identification problem is associated with the use of a mixture model with too many components (overfitting). If the data set is generated by a single component, attempting to fit a mixture of two components may produce a component with a very small number of observations. In the case of a mixture with constant mixing proportions, the weight of that component will be very close to zero. As a consequence the log-likelihood will be approximately the same for any choice of the parameters associated with that component.
Another issue that makes the identification of the parameters of these models difficult is the fact that the log-likelihoods are generally multimodal. Since the optimizers that will be used to maximize the log-likelihood can only find local maxima, the parameter estimates will be highly dependent on the starting values. To deal with this problem the log-likelihood maximization will be repeated several times with different starting values, and the parameter estimates will be chosen to be the vector of estimates that corresponds to the highest log-likelihood, provided that it does not have the characteristics of a spurious maximizer. Each time the starting values are generated using either the K-means clustering algorithm (MacQueen 1967; Fink 2007) or a random classification scheme where each observation is randomly assigned to one group by flipping a fair coin. Note that the K-means algorithm does not produce the same classification at each run since the initial assignments are random. With these procedures, I try to increase the probability of finding a vector of starting values that falls in the basin of attraction of the highest log-likelihood.
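The multi-start strategy can be sketched as follows. This is a simplified illustration, not the estimation code used in the paper: it assumes a two-component normal mixture without covariates, uses only the coin-flip classification scheme (not K-means), and treats a run as spurious when a component is empty or its standard deviation falls below a hypothetical threshold `min_sigma`. All function names and the simulated data are assumptions.

```python
import math
import random


def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def m_step(y, resp):
    """Weight, mean and s.d. of both components; resp[i] is the weight of
    observation i in the first component."""
    params = []
    for r in (resp, [1.0 - a for a in resp]):
        w = sum(r)
        mu = sum(a * v for a, v in zip(r, y)) / w
        var = sum(a * (v - mu) ** 2 for a, v in zip(r, y)) / w
        params.append((w / len(y), mu, math.sqrt(var)))
    return params


def em_fit(y, resp, n_iter=300, min_sigma=1e-3):
    """Run EM from an initial classification.

    Returns (loglik, params), or None if the run looks spurious.
    """
    loglik, params = None, None
    for _ in range(n_iter):
        if not 0.0 < sum(resp) < len(y):
            return None  # one component is empty
        params = m_step(y, resp)
        if min(p[2] for p in params) < min_sigma:
            return None  # degenerate component: treat as spurious
        dens = [[pi * normal_pdf(v, mu, s) for (pi, mu, s) in params]
                for v in y]
        # The tiny constant guards against log(0) after numerical underflow.
        totals = [d0 + d1 + 1e-300 for d0, d1 in dens]
        loglik = sum(math.log(t) for t in totals)
        resp = [d[0] / t for d, t in zip(dens, totals)]
    return loglik, params


def multistart(y, n_starts=10, seed=0):
    """Keep the non-spurious run with the highest log-likelihood."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        # Random classification: a fair coin assigns each observation.
        resp = [float(rng.random() < 0.5) for _ in y]
        fit = em_fit(y, resp)
        if fit is not None and (best is None or fit[0] > best[0]):
            best = fit
    return best


# Simulated data, for illustration only.
data_rng = random.Random(1)
y = ([data_rng.gauss(0.0, 1.0) for _ in range(120)]
     + [data_rng.gauss(4.0, 1.0) for _ in range(80)])
best = multistart(y)
```

A fixed number of EM iterations is used here for brevity; in practice one would add a convergence criterion and a more careful check for spurious maximizers.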
Inferences
Inferences will be based on the asymptotic properties of the maximum likelihood estimator. As discussed in the previous section, the likelihoods of the models presented in this paper do not have an absolute maximum. However, for model \(\mathcal {M}_{1}\), Kiefer (1978) has shown that it is possible to find a closed set that contains the true value of the vector of parameters and in which there exists a unique consistent estimator. One requirement for this set is that it does not contain π=0, π=1, σ _{2}=0, and σ _{1}=0. That estimator is asymptotically normal with a covariance matrix equal to the inverse of the information matrix. Choi and Zhou (2002) proved similar results for a class of models with covariate-dependent mixing proportions.
Douc and Mathias (2001) prove the consistency and the asymptotic normality of the maximum likelihood estimator of a general hidden Markov model for both stationary and nonstationary Markov chains. The asymptotic covariance is, as usual, the inverse of the information matrix.
Robust Standard errors
According to the results stated above, the standard errors of the estimated parameters can be obtained by taking the square root of the diagonal of the negative inverse of the Hessian of the log-likelihood. However, the target applications are panel data. Since the likelihoods of the mixture models (\(\mathcal {M}_{1}\)–\(\mathcal {M}_{4}\)) ignore the time series properties of the data, dynamic misspecification is likely to be an issue. As a result, robust standard errors should be provided. These standard errors can be estimated using the following sandwich form
where \(\nabla ^{2}{L_{it}(\hat {\boldsymbol {\theta }})}\) is the Hessian of the log-likelihood for the observation associated with firm i at time t evaluated at the maximum likelihood estimator. \(\hat {\mathbf {B}}\) can be computed as in Wooldridge (2002)
where \(\nabla {L_{it}(\hat {\boldsymbol {\theta }})}\) is a row vector containing the gradient of the log-likelihood for firm i at time t. In the preceding case the firm identification variable is used as a cluster variable.
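For reference, a standard cluster-robust sandwich estimator that is consistent with the definitions given here can be written as below. This is a sketch under stated assumptions rather than the author's exact display: the symbol \(\hat {\mathbf {A}}\), the summation limits, and the absence of a finite-sample correction are assumed.

```latex
\[
\widehat{\operatorname{Var}}(\hat{\boldsymbol{\theta}})
  = \hat{\mathbf{A}}^{-1}\,\hat{\mathbf{B}}\,\hat{\mathbf{A}}^{-1},
\qquad
\hat{\mathbf{A}}
  = \sum_{i=1}^{N}\sum_{t=1}^{T_{i}} \nabla^{2} L_{it}(\hat{\boldsymbol{\theta}}),
\qquad
\hat{\mathbf{B}}
  = \sum_{i=1}^{N}
    \Bigl(\sum_{t=1}^{T_{i}} \nabla L_{it}(\hat{\boldsymbol{\theta}})\Bigr)^{\prime}
    \Bigl(\sum_{t=1}^{T_{i}} \nabla L_{it}(\hat{\boldsymbol{\theta}})\Bigr).
\]
```

Because the gradients are row vectors, each term of \(\hat {\mathbf {B}}\) is an outer product of a firm's summed score with itself, which is what makes the firm the cluster. The robust standard errors are the square roots of the diagonal elements of the resulting matrix.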
If the sample is relatively small, one can alternatively use a parametric or nonparametric bootstrap. In the nonparametric case an appropriate resampling method is the Moving Blocks Bootstrap as described in Cameron and Trivedi (2005). Nevertheless, for the hidden Markov models, where the time series properties of the data are very important, resampling among the units as proposed by Kapetanios (2008) may be even more appropriate.
I should note that for the models considered in this paper bootstrapping requires some care. Because the likelihoods are potentially multimodal, the highest local maximum may not be reached at each repetition.
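Resampling among the units can be sketched in a few lines of Python. The sketch below only illustrates the general idea of drawing whole units with replacement while keeping each unit's time series intact; it is not claimed to reproduce the exact procedure of Kapetanios (2008), and the data layout and function name are hypothetical.

```python
import random


def resample_units(panel, rng):
    """Draw len(panel) units with replacement, keeping each series intact."""
    ids = list(panel)
    drawn = [rng.choice(ids) for _ in ids]
    # Re-key the draws so that a unit drawn twice enters as two separate units.
    return {new_id: list(panel[old_id]) for new_id, old_id in enumerate(drawn)}


# Hypothetical toy panel: firm id -> ordered observations of one variable.
panel = {"A": [0.1, 0.2, 0.3], "B": [0.5, 0.4], "C": [0.0, -0.1, 0.2, 0.1]}
boot = resample_units(panel, random.Random(42))
```

Each bootstrap panel would then be passed to the estimation routine, ideally with several starting values, since a single start may stop at a lower local maximum.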
Statistical tests
The statistical tests that will be considered have four objectives: 1) to determine the number of components of the mixtures, 2) to choose the best mixture among the models with a given number of components, 3) to test for endogeneity and 4) to test for random effects.
As stated in McLachlan and Peel (2000) choosing the number of components for a mixture is difficult. The preceding authors provide a long discussion about this issue in their book. One important problem is that in some cases one may not be able to find evidence that favors a model of a given number of components over another model that contains more or fewer components. In such situations they advocate choosing the model with the smaller number of components.
For the applications targeted in this paper the possible number of components will be inferred from economic theory. The main issue will then be how to find the distribution of the chosen test statistic under the null hypothesis.
Let k _{ x } and k _{ z } be respectively the dimension of the row vector x _{ it } and the row vector z _{ it }. Let Ω _{ m } be the parameter space of model \(\mathcal {M}_{m}\), m=1,…,9.
P(2) is the space of positive definite matrices of dimension 2.
I want to test the hypothesis of a one-component model versus a two-component model represented by any of \(\mathcal {M}_{1}\)–\(\mathcal {M}_{9}\). The null hypothesis can be stated as
which means
In all cases the null hypothesis falls on the boundary of the parameter space, as can be seen from Eq. (38) to Eq. (47). As a consequence the regularity conditions used to derive the asymptotic distribution of the likelihood ratio test break down. Note also that under H _{0} the parameters of the component distribution with zero mixing proportion are not identifiable. The asymptotic distribution of the likelihood ratio test is not the expected χ ^{2} distribution. For example, in the case of a one-component binomial distribution versus a two-component distribution, Chernoff and Lander (1995) show that the distribution of twice the logarithm of the likelihood ratio is a mixture of three distributions, two of which are χ ^{2}. Goffinet and Loisel (1992) found similar nonstandard results. A review of these issues can be found in McLachlan and Peel (2000).
Since the asymptotic distribution of the likelihood ratio is not standard, an interesting alternative approach is to approximate the distribution of this statistic empirically using a parametric bootstrap (McLachlan and Krishman 1997; Davidson and Hinkley 1997). The procedure is as follows:

1. Compute the maximum likelihood estimator (β,σ) for the one-component model.
2. Generate a sample \(y_{it}^{*},\ t=1,\ldots,T_{i},\ i=1,\ldots,N\) from ϕ(x _{ it } β,σ).
3. Use \(y_{it}^{*}\) and the other covariates to obtain (β _{ m },σ _{ m }) for the one-component model and θ _{ m } for the alternative two-component model.
4. Use these parameters to compute the likelihood ratio t _{ m }.
5. Repeat this process 999 times to obtain a sequence \(\{t_{m}\}_{m=1}^{999}\).
The p-value for the test is then computed as
where t is the observed likelihood ratio. Note that for the random effect models \(\mathcal {M}_{4}\) and \(\mathcal {M}_{7}\) this procedure is likely to be time consuming because of the computation of the double integral. An alternative is to use information criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
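The bootstrap procedure can be sketched as a generic driver. The callables `fit_null`, `simulate_null`, and `lr_stat` are hypothetical placeholders for the model-specific estimation and simulation code. The p-value formula, (1 + number of bootstrap statistics at least as large as t) / (1 + number of replications), is a common convention and is an assumption here, since the original display equation was not recovered.

```python
def pvalue_from_draws(t_obs, t_boot):
    """Share of bootstrap statistics at least as large as the observed one,
    counting the observed statistic itself as one draw."""
    exceed = sum(1 for t in t_boot if t >= t_obs)
    return (1 + exceed) / (1 + len(t_boot))


def bootstrap_lr_pvalue(t_obs, fit_null, simulate_null, lr_stat, n_boot=999):
    """Parametric bootstrap p-value for a likelihood ratio statistic.

    fit_null() returns the one-component estimates, simulate_null(params)
    draws one artificial sample under the null, and lr_stat(sample) refits
    both models on that sample and returns the likelihood ratio.
    """
    null_params = fit_null()                  # step 1
    t_boot = []
    for _ in range(n_boot):                   # step 5
        sample = simulate_null(null_params)   # step 2
        t_boot.append(lr_stat(sample))        # steps 3 and 4
    return pvalue_from_draws(t_obs, t_boot)
```

With 999 replications the smallest attainable p-value under this convention is 1/1000, which is one reason an odd-looking number of replications is often chosen.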
The test for endogeneity is essentially the test of model \(\mathcal {M}_{3}\) versus model \(\mathcal {M}_{2}\). The null hypothesis can be stated as
The alternative hypothesis is that at least one of the coefficients of correlation is different from zero. As can be seen from Eq. (40), the boundary problem no longer exists and the likelihood ratio statistic (twice the difference in maximized log-likelihoods) has an asymptotic chi-square distribution with two degrees of freedom. Alternatively, the test can also be conducted using a t-statistic.
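As a small worked example, the upper-tail probability of a chi-square distribution with two degrees of freedom has the closed form exp(−x/2), so the p-value of this test can be computed without special libraries. The function name is hypothetical; the value 19.5 is the likelihood ratio statistic reported in the application for \(\mathcal {M}_{2}\) versus \(\mathcal {M}_{3}\).

```python
import math


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0) if x > 0 else 1.0


# Likelihood ratio statistic reported for model M2 versus model M3.
p_value = chi2_sf_2df(19.5)
```

The resulting p-value is well below conventional significance levels, so the null of no endogeneity would be rejected on this criterion even though 19.5 is small relative to the other statistics reported.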
The test for the presence or absence of random effects is also problematic. The same boundary problem discussed above is encountered. The null hypothesis of no random effect can be stated as follows:
The preceding matrix is positive semidefinite and H _{0} falls on the boundary of the parameter spaces Ω _{4} and Ω _{7}. As before, the distribution of the likelihood ratio statistic is not the expected χ ^{2} distribution. Stram and Lee (1994) have studied this problem for one-component linear models and showed that the asymptotic distribution of the likelihood ratio statistic is a mixture of chi-square distributions.
The next important test to consider is the test of an independent mixture versus a dependent mixture (HMM). This corresponds to the test of model \(\mathcal {M}_{1}\) versus \(\mathcal {M}_{5}\), and \(\mathcal {M}_{2}\) versus \(\mathcal {M}_{6}\). In the first case the null hypothesis is
and in the second case
or
In both cases the rows of the transition matrices are the same under the null hypothesis. The asymptotic null distribution of the likelihood ratio is valid in these cases. In the case where λ=0 under the null hypothesis a t-test is also appropriate.
Application: Firms’ investment and financing constraints
The basic intertemporal investment model by Hayashi (1982) assumes that a firm chooses the level of its next period capital stock by maximizing the expected discounted value of dividends. In reality, it is not always possible for certain firms to finance the level of investment that maximizes profit. This situation may arise because of the existence of information asymmetry between the firm’s managers and the potential suppliers of funds. Without the ability to evaluate accurately the profitability of the firm’s projects, the suppliers of funds may be unwilling to finance the firm’s investment or they may be willing to supply only a fraction of the funds needed by the firm. As a result, investment may not be financed to the level that is optimal in the absence of constraints. One way of accounting for this issue is by adding a borrowing constraint to the Hayashi (1982) model (Adda and Cooper 2003). The Euler equation from the resulting model would imply two different relationships between investment and its determinants depending on whether the constraint is binding or not. If this model is a good approximation for a firm’s investment behavior, at each point in time the firm will fall in one of two groups: the group of firms that are financially constrained (borrowing constraint is binding) and the group of firms that are not financially constrained. Since the observed data do not generally include any variable that indicates group membership, this setting is well suited for the use of finite mixture models of the kinds presented in this paper.
In this application two variables are modeled: the change in firm i’s investment to capital ratio at time t (Δ I _{ it }), and the financial status of the firm at time t (W _{ it }). Given the potential interdependence of the variables, they will be modeled as a bivariate process (Δ I _{ it },w _{ it })^{′}, t=1,…, T _{ i }, i=1,…,n. Under the assumption that at any point in time a firm can be either financially constrained or not, W _{ it } is an unobserved dichotomous random variable. Assuming that the models for Δ I _{ it } are obtained by taking the first difference of the models in level for I _{ it }, individual-specific effects or random effects will not appear in the models for Δ I _{ it }. As a result, the most appropriate models to estimate are \(\mathcal {M}_{1}\), \(\mathcal {M}_{2}\), \(\mathcal {M}_{3}\), \(\mathcal {M}_{5}\), \(\mathcal {M}_{6}\), and \(\mathcal {M}_{8}\). I estimated those models with data on 2263 US manufacturing firms obtained from the COMPUSTAT dataset for the period between 1974 and 2005. To compare the results with those obtained from some previous studies, I have generated the data almost exactly as stated in Hovakimian and Titman (2006). The definitions of the main variables are presented in Table 1.
Bootstrap likelihood ratio tests of one component versus two show that a two-component distribution is favored over a one-component one in all considered cases.
Table 2 presents regular likelihood ratio tests comparing the different twocomponent models.
It is clear that the more flexible models are favored in all the cases. The evidence appears to be overwhelming (observed likelihood ratios are higher than 710) in all the cases except in the case of model \(\mathcal {M}_{2}\) versus \(\mathcal {M}_{3}\) (observed likelihood ratio is 19.5); also, Table 3 shows that only the coefficient of correlation between the second component and the choice equation is statistically significant. As a result, the addition of endogeneity may not have caused a big improvement in the fit of the data to this model. However, the correlation coefficients are both significant for model \(\mathcal {M}_{8}\) (Table 3).
The most important result here is that the HMM model \(\mathcal {M}_{5}\) strongly outperforms the mixture model \(\mathcal {M}_{1}\) and the same is true for the HMM model \(\mathcal {M}_{6}\) versus the mixture model \(\mathcal {M}_{2}\). Moreover, the test statistic on lambda (see the transition matrix of model \(\mathcal {M}_{6}\)) is quite large (31.25), which implies that this parameter is significantly different from 0 and reinforces the idea that the firms’ financial states are time-dependent. Of the two hidden Markov models, the likelihood ratio test reveals that the best one is the one that allows for a covariate-dependent transition matrix.
The results of the likelihood ratio tests are also confirmed by the information criteria AIC and BIC since the most flexible Hidden Markov Model shows the lowest values. Moreover, these criteria make possible the comparison between the non-nested models \(\mathcal {M}_{3}\) and \(\mathcal {M}_{6}\) and models \(\mathcal {M}_{3}\) and \(\mathcal {M}_{5}\). Even though the hidden Markov models do not account for endogeneity, they fit the data much better than the endogenous mixture generally used in the literature. Neglecting time-dependence is then more problematic than neglecting endogeneity.
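For readers who want to reproduce this kind of comparison, the two criteria can be computed as in the sketch below. The log-likelihoods, parameter counts, and sample size are hypothetical placeholders, not results from this paper, and the choice of the sample size entering the BIC in a panel (total observations versus number of firms) is an assumption left to the user.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion: lower is better."""
    return 2.0 * n_params - 2.0 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_obs) - 2.0 * loglik


# Hypothetical (log-likelihood, number of parameters) pairs.
candidates = {"mixture": (-1250.0, 9), "hmm": (-1100.0, 11)}
n_obs = 5000  # hypothetical sample size
ranking = sorted(candidates, key=lambda m: bic(*candidates[m], n_obs))
```

Because both criteria depend only on the maximized log-likelihood and the number of parameters, they can rank non-nested models, which the likelihood ratio test cannot.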
Nevertheless, even though it is clear that a two-component distribution fits the data better, it is not obvious which component should be labeled as financially constrained. So, to interpret the results in Table 3 I first need to find some criteria to label the components. For this reason I choose the identification criteria from the literature. Financially unconstrained firms are expected to be big, old and not highly leveraged; they are expected to pay dividends regularly and to have a bond rating; and they may face lower growth opportunities and may be less interested in carrying large cash balances. The justification of these criteria is reviewed in Hovakimian and Titman (2006). Applying these criteria to models \(\mathcal {M}_{2}\) and \(\mathcal {M}_{3}\) one can identify the financially constrained component as the one that has the largest standard deviation. The same is true for model \(\mathcal {M}_{6}\).
The results then suggest that investment is more responsive to cash flow and asset sales in the financially constrained state as was signaled in Hovakimian and Titman (2006). Since the standard deviation of investment is much higher for the financially constrained group (0.28 versus 0.08), one can conclude that the change in fixed capital investment is much more volatile for firms that spend a long time in the financially constrained state.
The most important results come from the HMM models. The estimated prior transition matrix and the average of the posterior transition matrices for the time-homogeneous model are
The financially unconstrained state appears to be quite persistent. While a firm that is currently constrained has a higher probability to stay in that state next period, it also has a significant probability (43%) to become unconstrained.
Even though the Markov chain was not assumed to be stationary, the estimated transition matrices clearly admit a stationary distribution. The stationary probability vector associated with the second transition matrix is (p _{1},p _{2})=(0.29,0.71), which means that in the long run a higher proportion of the observations (71%) is expected to be classified as unconstrained. This is consistent with the estimated mixing proportions for model \(\mathcal {M}_{1}\). Similar results are obtained for the exogenous HMM model with covariate-dependent transition matrix.
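The link between a two-state transition matrix and its stationary vector can be made concrete with a short sketch. For off-diagonal probabilities p12 and p21 the stationary distribution is (p21, p12)/(p12 + p21). In the example, state 1 is the constrained state and p12 = 0.43 is the constrained-to-unconstrained probability quoted in the text; p21 = 0.176 is a hypothetical value chosen so that the result is close to the reported vector, not an estimate taken from the paper.

```python
def stationary_two_state(p12, p21):
    """Stationary distribution (pi1, pi2) of a two-state Markov chain with
    off-diagonal transition probabilities p12 and p21."""
    total = p12 + p21
    return (p21 / total, p12 / total)


# p12 is quoted in the text; p21 is a hypothetical illustrative value.
p12, p21 = 0.43, 0.176
pi1, pi2 = stationary_two_state(p12, p21)
```

With these inputs the stationary probabilities round to 0.29 and 0.71, which shows how a persistent unconstrained state combined with a sizeable exit probability from the constrained state yields a long-run majority of unconstrained observations.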
Conclusion
I have presented nine alternative mixture models that may be of interest for making inference from available economic panel data sets. I have also reviewed the maximum likelihood estimation of six of them via the well-known Expectation-Maximization algorithm. A series of possible tests are also discussed. These tests can be used to identify, among the proposed models, the one that fits the data best.
Estimation of the hidden Markov models with random effects may be time consuming because, for each unit, the log-likelihood at each point in time depends on all the previous observations of that unit; moreover, this likelihood has to be computed repeatedly for each vector of abscissae or each vector of draws of the random effects. If, however, the log-likelihood is programmed in FORTRAN or C as opposed to MATLAB or R, the computation time may be reduced significantly, but performing bootstrap tests may still require a long time. Nevertheless, the models considered in this paper are very flexible and can be used to account for several potential sources of heterogeneity in panel data.
Finally, as an application I used the models without random effects to study the differences in the investment behavior of firms when they are financially constrained and when they are not, and also to learn about the process that governs the evolution of a firm’s financial status over time.
References
Adda, J, Cooper, R: Dynamic Economics: quantitative methods and applications. The MIT Press, Cambridge (2003).
Almeida, H, Campello, M: Financial constraints, asset tangibility and corporate investment. Rev. Financ. Stud. 20(5), 1429–1460 (2007).
Asea, PK, Blomberg, B: Lending cycles. J. Econ. 83, 89–128 (1998).
Atman, RM: Mixed hidden Markov models: An extension of the hidden Markov model to the longitudinal data setting. J. Am. Stat. Assoc. 102(477), 201–210 (2007).
Cameron, AC, Trivedi, P: Microeconometrics: Methods and Applications. Cambridge University Press, Cambridge (2005).
Cappé, O, Moulines, E, Rydén, T: Inference in Hidden Markov Models. Springer, New York (2005).
Chernoff, H, Lander, E: Asymptotic distribution of the likelihood ratio test that a mixture of two binomials is a single binomial. J. Stat. Plan. Infer. 43, 19–40 (1995).
Choi, KC, Zhou, X: Large sample properties of mixture models with covariates for competing risks. J. Multivar. Anal. 82, 331–366 (2002).
Davidson, AC, Hinkley, DV: Bootstrap methods and their applications. Cambridge University Press, Cambridge (1997).
Deb, P, Trivedi, P: Finite mixture for panels with fixed effects. J. Econ. Methods. 2, 31–35 (2013).
Dempster, AP, Laird, NM, Rubin, DB: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39(1), 1–38 (1977).
Douc, R, Mathias, C: Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli. 3, 381–420 (2001).
Fink, GA: Markov Models for Pattern Recognition: From Theory to Applications. 1st ed. Springer-Verlag, New York (2007).
Frühwirth-Schnatter, S: Finite Mixture and Markov Switching Models. Springer-Verlag, New York (2006).
Goffinet, B, Loisel, P: Testing in normal mixture models when the proportions are known. Biometrika. 79, 842–846 (1992).
González, J, Tuerlinckx, F, Boeck, PD, Cools, R: Numerical integration in logistic-normal models. Comput. Stat. Data Anal. 51, 1525–1548 (2006).
Hamilton, JD: Rational-expectations econometric analysis of changes in regime: An investigation of the term structure of interest rates. J. Econ. Dyn. Control. 12, 385–423 (1988).
Hayashi, F: Tobin’s marginal q and average q: A neoclassical interpretation. Econometrica. 50, 215–224 (1982).
Hovakimian, G, Titman, S: Corporate investment with financial constraints: Sensitivity of investment to funds from voluntary asset sales. J. Money Credit Bank. 38(2), 357–374 (2006).
Jäckel, P: A note on multivariate Gauss-Hermite quadrature (2005). http://www.btinternet.com/pjaeckel/ANoteOnMultivariateGaussHermiteQuadrature.pdf. Accessed 14 Nov 2009.
Kapetanios, G: A bootstrap procedure for panel data sets with many cross-sectional units. Econ. J. 11, 377–395 (2008).
Kiefer, NM: Discrete parameter variation: Efficient estimation of a switching regression model. Econometrica. 46, 427–434 (1978).
Kim, CJ, Piger, J, Startz, R: Estimation of Markov regime-switching regression models with endogenous switching. J. Econ. 143, 263–273 (2008).
Lee, Y, Nelder, JA: Hierarchical generalized linear models. J. R. Stat. Soc. 58, 619–678 (1996).
Louis, TA: Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. 44, 226–233 (1982).
MacDonald, IL, Zucchini, W: Hidden Markov and Other Models for Discrete-valued Time Series. Chapman & Hall, Boca Raton (1997).
MacQueen, JB: Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press, Berkeley (1967).
Maddala, GS: Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press, Cambridge (1999).
Maruotti, A: Hidden Markov Models for Longitudinal Data. PhD thesis. Università degli Studi di Roma (2007).
McLachlan, G, Krishman, T: The EM Algorithm and Extensions. Wiley-Interscience, New York (1997).
McLachlan, G, Peel, D: Finite Mixture Models. 1st ed. Wiley-Interscience, New York (2000).
Mundlak, Y: On the pooling of time series and cross section data. Econometrica. 46, 69–85 (1978).
Rabiner, LR: A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of the IEEE, vol. 77, pp. 257–286 (1989).
Stram, DO, Lee, JW: Variance components testing in the longitudinal mixed effects model. Biometrics. 50, 1171–1177 (1994).
Trivedi, P, Hyppolite, J: Alternative approaches for econometric analysis of panel count data using dynamic latent class models (with application to doctor visits data). Health Economics. 21, 101–128 (2012).
Wooldridge, JM: Econometric Analysis of Cross Section And Panel Data. The MIT Press, Cambridge (2002).
Xiaoqiang, H, Schiantarelli, F: Investment and capital markets imperfections: A switching regression approach using U.S. firm panel data. Rev. Econ. Stat. 80(3), 466–479 (1998).
Zhang, S, Jin, J: Computation of special functions. Wiley-Interscience, New York (1996).
Acknowledgement
I am grateful for all the help and support received from Professor Pravin Trivedi in completing this project. I am thankful for the suggestions obtained from two anonymous referees.
Funding
I did not receive any financial support in writing this paper.
Ethics declarations
Competing interests
I declare that I have no competing interests in publishing this paper.
Additional information
Authors’ contribution
I, Judex Hyppolite, am the only author of this paper.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Keywords
 Panel data
 Mixture of distributions
 Hidden Markov models
 Heterogeneity