Approximating the distributions of runs and patterns
Journal of Statistical Distributions and Applications volume 1, Article number: 5 (2014)
Abstract
The distribution theory of runs and patterns has been successfully used in a variety of applications including, for example, nonparametric hypothesis testing, reliability theory, quality control, DNA sequence analysis, general applied probability and computer science. The exact distributions of the number of runs and patterns are often very hard to obtain or computationally problematic, especially when the pattern is complex and n is very large. Normal, Poisson and compound Poisson approximations are frequently used to approximate these distributions. In this manuscript, we (i) study the asymptotic relative error of the normal, Poisson, compound Poisson, finite Markov chain imbedding and large deviation approximations; and (ii) provide some numerical studies comparing these approximations with the exact probabilities for moderately sized n. Both theoretical and numerical results show that, in the relative sense, the finite Markov chain imbedding approximation performs best in the left tail and the large deviation approximation performs best in the right tail.
AMS Subject Classification
Primary 60E05; Secondary 60J10
Introduction and notation
Let \{X_i\}_{i=1}^{n} be a sequence of m-state trials (m≥2) taking values in the set \mathcal{S}=\{s_1,\dots,s_m\} of m symbols. For simplicity, \{X_i\}_{i=1}^{n} will be denoted {X_{ i }} and n will be allowed to be ∞. A simple pattern \Lambda = s_{i_1}s_{i_2}\cdots s_{i_\ell}, of length ℓ, is the juxtaposition of ℓ (not necessarily distinct) symbols from \mathcal{S}. Given a simple pattern Λ, we let X_{ n }(Λ) denote the number of either nonoverlapping or overlapping occurrences of Λ in the sequence \{X_i\}_{i=1}^{n}, where the method of counting will be made clear by the context. The waiting time W(Λ,x) until the x’th occurrence of the simple pattern Λ in \{X_i\}_{i=1}^{n} is thus defined by
W(\Lambda,x) = \inf\{n : X_n(\Lambda) = x\},
and, by convention, the waiting time for the first occurrence is denoted W(Λ)=W(Λ,1). Finally, we define the inter arrival times
W_i(\Lambda) = W(\Lambda,i) - W(\Lambda,i-1), \quad i = 1,2,\dots,
where W(Λ,0):=0.
We say that two patterns Λ_{1} and Λ_{2} are distinct if neither Λ_{1} appears in Λ_{2} nor Λ_{2} appears in Λ_{1}. If Λ_{1},…,Λ_{ r } are pairwise distinct simple patterns, we define the compound pattern \Lambda = \bigcup_{i=1}^{r} \Lambda_i, where an occurrence of any Λ_{ i } is considered an occurrence of Λ. For a compound pattern Λ=Λ_{1}∪⋯∪Λ_{ r }, we similarly define X_{ n }(Λ) to be the number of occurrences of Λ, that is, of any of the Λ_{ i }, in \{X_i\}_{i=1}^{n}.
The waiting times W(Λ,x), W(Λ) and W_{ i }(Λ) are then defined as above, and often referred to as sooner waiting times.
From these definitions it is easy to see that, for any simple or compound pattern Λ, x and n, the events {X_{ n }(Λ)<x} and {W(Λ,x)>n} are equivalent and hence
\mathbb{P}\{X_n(\Lambda) < x\} = \mathbb{P}\{W(\Lambda,x) > n\}, (1)
which provides a convenient way of studying the exact and approximate distribution of X_{ n }(Λ) through the waiting time distributions of W(Λ,x).
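The duality between counts and waiting times can be checked directly on a single realised sequence. The following is a minimal Python sketch, not the authors' code: the helper names (`occurrence_end_times`, `count_occurrences`, `waiting_time`) are illustrative, and nonoverlapping counting is assumed to mean that the left-to-right scan restarts immediately after each counted occurrence.

```python
def occurrence_end_times(seq, pattern, overlapping=False):
    """Return the 1-based trial indices at which an occurrence of `pattern` is completed."""
    ends = []
    ell = len(pattern)
    last_end = 0  # end index of the most recently counted occurrence
    for n in range(ell, len(seq) + 1):
        if seq[n - ell:n] == pattern:
            # Nonoverlapping counting: the occurrence must start after the last counted one.
            if overlapping or n - ell >= last_end:
                ends.append(n)
                last_end = n
    return ends


def count_occurrences(seq, pattern, n, overlapping=False):
    """X_n(pattern): number of occurrences completed within the first n trials."""
    return sum(1 for t in occurrence_end_times(seq, pattern, overlapping) if t <= n)


def waiting_time(seq, pattern, x, overlapping=False):
    """W(pattern, x): trial index of the x-th occurrence, or None if it is not observed."""
    ends = occurrence_end_times(seq, pattern, overlapping)
    return ends[x - 1] if x <= len(ends) else None


if __name__ == "__main__":
    seq = "SSSSSFSSS"
    for n in range(len(seq) + 1):
        for x in range(1, 6):
            w = waiting_time(seq, "SS", x)
            # {X_n < x} holds exactly when the x-th occurrence has not arrived by trial n.
            assert (count_occurrences(seq, "SS", n) < x) == (w is None or w > n)
```

Within the observed sequence, a waiting time of `None` is treated as "larger than every n considered", which is all the equivalence needs.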
Throughout this paper, unless specified otherwise, we assume that the trials {X_{ i }} are either independent and identically distributed (i.i.d.) or first order Markov dependent; the pattern Λ is either simple or compound; and the counting of occurrences of Λ is in a nonoverlapping fashion.
The distribution of the number of runs and patterns in a sequence of multistate trials or random permutations of a set of integers has been successfully used in various fields in applied probability, statistics and discrete mathematics. Examples include reliability theory, quality control, DNA sequence analysis, psychology, ecology, astronomy, nonparametric tests, successions, and the Eulerian and Simon-Newcomb numbers (the latter three being defined for permutations). Two recent books, Balakrishnan and Koutras (2002) and Fu and Lou (2003), provide some scope of the distribution theory of runs and patterns, and Martin et al. (2010) and Nuel et al. (2010) provide some extensions to sets of sequences.
Given a pattern Λ, the exact distribution of X_{ n }(Λ) traditionally has been determined using combinatoric analysis on a case by case basis. The formulae for these distributions are often very complex and computationally problematic. Even for many simple patterns, their distributions in terms of combinatoric analysis remain unknown, especially when the {X_{ i }} are Markov dependent multistate trials.
The waiting time W(Λ) for the first occurrence of certain types of runs and patterns has been studied by many authors. See, for example, Blom and Thorburn (1982), Gerber and Li (1981), Schwager (1983), and Solov’ev (1966). More recently, Fu and Koutras (1994) developed a method for determining the exact distributions of X_{ n }(Λ) and W(Λ) for any simple or compound Λ in either i.i.d. or Markov dependent trials (see also Fu and Lou 2003). The method was referred to as the Finite Markov Chain Imbedding (FMCI) technique, which can be easily described as follows: given a simple or compound pattern Λ, there exists a finite Markov chain {Y_{ i }} defined on a finite state space, say Ω={1,…,d,α}, with an absorbing state α and transition probability matrix of the form
M = \begin{pmatrix} N & c \\ 0 & 1 \end{pmatrix}, (2)
where c is a column vector. The distribution of the waiting time for Λ is given by
\mathbb{P}\{W(\Lambda) = n\} = \xi_0 N^{n-1}(I - N)\mathbf{1}^{\prime}, (3)
where ξ_{0} is the initial distribution, N is the essential transition probability matrix (i.e. the substochastic matrix consisting of only the transient states of {Y_{ i }}) as defined in (2), I is a d×d identity matrix and 1=(1,1,…,1) is a 1×d row vector. Furthermore, the random variable X_{ n }(Λ), the number of occurrences of Λ in {X_{ i }}, is also finite Markov chain imbeddable and its distribution is given by
where the essential transition probability matrix N_{ x } has the form
the matrix N is given by (2), and the matrix C defines the “continuation” transition probabilities from one occurrence to the next and depends on c in (2).
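As an illustration of the imbedding for the simplest case, the sketch below builds the essential matrix N for a run of k successes in i.i.d. Bernoulli(p) trials and evaluates the waiting time probabilities in the standard FMCI form ξ_0 N^{n−1}(I−N)1′. This is a minimal sketch under stated assumptions: the state labelling (state j is the current run length), the function names, and the brute-force check are illustrative choices and are not taken from the paper.

```python
from fractions import Fraction
from itertools import product


def run_matrix(k, p):
    """Essential matrix N for a run of k successes; state j = current run length (0..k-1)."""
    q = 1 - p
    N = [[Fraction(0)] * k for _ in range(k)]
    for j in range(k):
        N[j][0] = q              # a failure resets the run to length 0
        if j + 1 < k:
            N[j][j + 1] = p      # a success extends the run; from j = k-1 it is absorbed
    return N


def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def waiting_time_pmf(N, xi0, n):
    """P{W = n} = xi0 N^(n-1) (I - N) 1', for n >= 1."""
    v = list(xi0)
    for _ in range(n - 1):
        v = vec_mat(v, N)
    # (I - N) 1' has entries 1 - (row sum of N): the one-step absorption probabilities.
    absorb = [1 - sum(row) for row in N]
    return sum(vi * ai for vi, ai in zip(v, absorb))


def brute_force_pmf(k, p, n):
    """The same probability by enumerating all 2^n outcomes (feasible only for tiny n)."""
    total = Fraction(0)
    for w in product("SF", repeat=n):
        s = "".join(w)
        # First occurrence at trial n: the run ends at n and did not occur earlier.
        if s.endswith("S" * k) and "S" * k not in s[:-1]:
            total += p ** s.count("S") * (1 - p) ** s.count("F")
    return total


if __name__ == "__main__":
    p = Fraction(1, 2)
    N = run_matrix(3, p)
    xi0 = [Fraction(1), Fraction(0), Fraction(0)]
    for n in range(1, 11):
        assert waiting_time_pmf(N, xi0, n) == brute_force_pmf(3, p, n)
```

Exact rational arithmetic is used so that the matrix formula and the enumeration can be compared with equality rather than a tolerance; for large n one would switch to floating point.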
If the pattern Λ is long and complex and n is very large, then the computation of \mathbb{P}\{X_n(\Lambda)=x\} can become problematic and, to overcome this problem, various asymptotic approximations have been developed for these probabilities.
In real applications, if the exact distribution is not available or is hard to compute, it is important to know which approximations perform well and are easy to compute. Furthermore, it is important to know how these approximations perform with respect to each other and the exact distribution from both a theoretical and numerical standpoint. The aims of this manuscript are twofold: (i) we first study the asymptotic relative error of the normal, Poisson (or compound Poisson), and FMCI approximations with respect to the exact distribution; and (ii) we then provide a numerical study of these three approximations with the exact probabilities in cases where x is fixed and n→∞ and when n is fixed and x varies. As an important byproduct, the FMCI technique allows the normal and Poisson approximations to be applied in more cases, for example, the distribution of compound patterns and patterns in Markov dependent trials.
The approximations
Normal approximation
The normal approximation is one of the most popular for approximating the distribution of the number of runs or patterns X_{ n }(Λ) in statistics. In general, when Λ is simple or compound, the trials are i.i.d., and the counting is nonoverlapping, by appealing to (1) and renewal arguments, it has been shown that X_{ n }(Λ) is asymptotically normally distributed (cf. Fu and Lou 2007; Karlin and Taylor 1975). The form of the approximation is
\lim_{n\to\infty} \mathbb{P}\left\{ \frac{X_n(\Lambda) - n/\mu_W}{\sqrt{n\sigma_W^2/\mu_W^3}} \le x \right\} = \Phi(x), (6)
where Φ(·) denotes the standard normal distribution function and μ_{ W } and \sigma_W^2 are the mean and variance of W(Λ) respectively, which are given by
\mu_W = \xi_0 (I-N)^{-1}\mathbf{1}^{\prime}, (7)
\sigma_W^2 = \xi_0 (I+N)(I-N)^{-2}\mathbf{1}^{\prime} - \mu_W^2. (8)
Given a pattern Λ, it is well known that the mean μ_{ W } and the variance \sigma_W^2 are difficult to obtain via combinatoric arguments, especially when Λ is a compound pattern or the trials are Markov dependent. For example, as pointed out in Karlin (2005) and Kleffe and Borodovski (1992), approximate values of μ_{ W } and \sigma_W^2 must sometimes be used. Since W(Λ) is finite Markov chain imbeddable, (7) and (8) provide the exact values.
The limit in (6) is appropriate when the sequence of inter arrival times {W_{ i }(Λ)} are i.i.d., which is the case for simple and compound patterns when the {X_{ i }} are i.i.d. and counting is nonoverlapping. When occurrences of Λ correspond to a delayed renewal process, which can occur for Markov dependent trials and/or overlapping counting, we could use the mean and variance of W_{2}(Λ) for the normalizing constants, which are easily obtained by modifying ξ_{0} in (7) and (8). Even more general cases can be handled by making use of a functional central limit theorem for Markov chains (see, for example, (Meyn and Tweedie 1993, §17.4) and (Asmussen 2003, Theorem 7.2, pg. 30) for the details).
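The normalizing constants can be computed exactly from the essential matrix N. The sketch below assumes the standard matrix expressions for a waiting time with distribution ξ_0 N^{n−1}(I−N)1′, namely a mean of ξ_0(I−N)^{−1}1′ and a second moment of ξ_0(I+N)(I−N)^{−2}1′, and applies them to a run of three successes with p = 1/2. The function names and the success-run chain are illustrative assumptions rather than the authors' implementation.

```python
from fractions import Fraction


def run_matrix(k, p):
    """Essential matrix N for a run of k successes; state j = current run length."""
    q = 1 - p
    N = [[Fraction(0)] * k for _ in range(k)]
    for j in range(k):
        N[j][0] = q
        if j + 1 < k:
            N[j][j + 1] = p
    return N


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (exact when entries are Fractions)."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]


def waiting_time_moments(N, xi0):
    """Return (mean, variance) of the waiting time associated with (xi0, N)."""
    d = len(N)
    i_minus_n = [[(1 if i == j else 0) - N[i][j] for j in range(d)] for i in range(d)]
    ones = [Fraction(1)] * d
    m = solve(i_minus_n, ones)      # (I - N)^{-1} 1'
    u = solve(i_minus_n, m)         # (I - N)^{-2} 1'
    n_u = [sum(N[i][j] * u[j] for j in range(d)) for i in range(d)]
    mean = sum(x * mi for x, mi in zip(xi0, m))
    # Second moment: xi0 (I + N)(I - N)^{-2} 1'
    second = sum(x * (ui + nui) for x, ui, nui in zip(xi0, u, n_u))
    return mean, second - mean ** 2


if __name__ == "__main__":
    N = run_matrix(3, Fraction(1, 2))
    xi0 = [Fraction(1), Fraction(0), Fraction(0)]
    mean, var = waiting_time_moments(N, xi0)
    print(mean, var)  # 14 142
```

Solving two linear systems avoids forming (I−N)^{−1} explicitly, which is the usual choice when d is large.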
Poisson and compound Poisson approximations
It is well known that, in a sequence of Bernoulli (p) trials, if n p→λ as n→∞, then the probability of k successes in n trials can be approximated by a Poisson probability with parameter λ, denoted P(\lambda). This idea has been extended to certain patterns Λ and, under certain conditions, the distribution of X_{ n }(Λ) can be approximated by a Poisson distribution with parameter μ_{ n } in the sense that
d_{TV}\big(\mathcal{L}(X_n(\Lambda)), P(\mu_n)\big) \le \varepsilon_n, (9)
where \mathcal{L}(\cdot) denotes the distribution (law) of a random variable and d_{TV}(·,·) denotes the total variation distance.
The primary tool used to obtain μ_{ n } and the bound ε_{ n } is the Stein-Chen method (Chen 1975), and this method has been refined by various authors, including Arratia et al. (1990), Barbour and Eagleson (1983, 1984, 1987), Barbour and Hall (1984), Godbole (1990a, 1990b, 1991), Godbole and Schaffner (1993), and Holst et al. (1988). This method has also been extended to compound Poisson approximations for the distributions of runs and patterns, and Barbour and Chryssaphinou (2001) provide an excellent theoretical review of these approximations.
In practice, {\mu}_{n}=\mathbb{E}{X}_{n}\left(\Lambda \right) or the expectation of a closely related run statistic is used (cf. Balakrishnan and Koutras 2002, §5.2.3) so that, in the former case,
Finding \mathbb{E}{X}_{n}\left(\Lambda \right) and the bound ε_{ n } is usually done on a case by case basis. For the mathematical details, the books (Barbour et al. 1992a) and (Balakrishnan and Koutras 2002) are recommended.
Let P_c(\lambda,\nu) denote the compound Poisson distribution, that is, the distribution of the random variable \sum_{j=1}^{M} Y_j where the random variable M has a Poisson distribution with parameter λ and the Y_{ j } are i.i.d. having distribution ν. A compound Poisson distribution for approximating nonnegative random variables was suggested in Barbour et al. (1992b) (see also Barbour et al. (1995, 1996)). The approximation is formulated similarly to the Poisson approximation:
The distribution of N_{n,k}, the number of nonoverlapping occurrences of k consecutive successes in n i.i.d. Bernoulli trials, is one of the most important in this area and one of the most studied in the literature. Reversing the roles of S (success) and F (failure), the reliability of a consecutive-k-out-of-n system, denoted C(k,n : F), is given by \mathbb{P}\{N_{n,k}=0\}. Even in this simple case (i.e. Λ=S S⋯S), there are several ways to apply the Poisson approximation techniques. For example, (Godbole 1991, Theorem 2) shows that approximating N_{n,k} with a P(\mathbb{E}N_{n,k}) distribution works well if certain conditions hold. Godbole and Schaffner (1993, pg. 340) suggest an improved Poisson approximation for word patterns.
The primary difficulty in applying the Poisson approximation is the determination of the optimal parameter μ_{ n }, which is highly dependent on the structure of the pattern Λ. In particular, if Λ is long and has several uneven overlapping subpatterns, then finding μ_{ n } by their method can be very tedious. In the sequel, we show that even the (asymptotic) best choice for μ_{ n } for Poisson approximations does not perform well in the relative sense.
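To make the Poisson approximation concrete for N_{n,k}, the sketch below computes the exact distribution of the number of nonoverlapping success runs by a direct dynamic program and compares it with a Poisson probability whose parameter is the exact mean of N_{n,k}. It is a minimal illustration under stated assumptions: the function names are hypothetical, and taking the mean from the exact distribution is done here only for demonstration (in practice it would be obtained separately).

```python
import math


def exact_run_count_pmf(n, k, p):
    """Exact P{N_{n,k} = x}, x = 0..n//k, by dynamic programming over (count, run length)."""
    q = 1.0 - p
    max_x = n // k
    # prob[x][j]: probability of x completed runs and a current partial run of length j
    prob = [[0.0] * k for _ in range(max_x + 1)]
    prob[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * k for _ in range(max_x + 1)]
        for x in range(max_x + 1):
            for j in range(k):
                w = prob[x][j]
                if w == 0.0:
                    continue
                new[x][0] += w * q              # failure: the partial run is lost
                if j + 1 < k:
                    new[x][j + 1] += w * p      # success: the partial run grows
                else:
                    new[x + 1][0] += w * p      # k-th success: count one run and restart
        prob = new
    return [sum(row) for row in prob]


def poisson_pmf(mu, x):
    """Poisson(mu) probability of the value x."""
    return math.exp(-mu) * mu ** x / math.factorial(x)


if __name__ == "__main__":
    n, k, p = 500, 8, 0.3
    pmf = exact_run_count_pmf(n, k, p)
    mu_n = sum(x * w for x, w in enumerate(pmf))   # exact mean of N_{n,k}
    for x in range(3):
        print(x, pmf[x], poisson_pmf(mu_n, x))
```

With these parameters the run is a rare event, which is the regime where the Poisson approximation is expected to do well; increasing p makes the discrepancy visible.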
FMCI approximations
Approximations based on the FMCI approach depend on the spectral decomposition of the essential transition probability matrix N.
Let N be a w×w essential transition probability matrix associated with a finite Markov chain {Y_{ n }:n≥0} corresponding to the distribution of the waiting time W(Λ). Let 1>λ_{1}≥λ_{2}≥⋯≥λ_{ w } denote the ordered eigenvalues of N, repeated according to their algebraic multiplicities, with associated (right) eigenvectors \eta_1^{\prime},\eta_2^{\prime},\dots,\eta_w^{\prime}. When the geometric multiplicity of λ_{ i } is less than its algebraic multiplicity, we will use vectors of 0’s for the unspecified eigenvectors. The fact that λ_{1} can be taken as a positive real number and that η_{1} can be taken to be nonnegative are consequences of the Perron-Frobenius Theorem for nonnegative matrices (cf. Seneta 1981).
Definition 1
We will say that {Y_{ n }:n≥0}, or equivalently, N, satisfies the FMCI Approximation Conditions if

(i) there exist constants a_{1},…,a_{ w } such that
\mathbf{1}^{\prime}=\sum_{i=1}^{w} a_i \eta_i^{\prime}, (12)
(ii) λ_{1} has algebraic multiplicity g and λ_{1}>λ_{ j } for all j>g.
Verifying these conditions is usually straightforward. They certainly hold if N is irreducible and aperiodic, but they also hold in many other cases. For example, (12) requires only that 1^{′} is in the linear space spanned by \{\eta_1^{\prime},\eta_2^{\prime},\dots,\eta_w^{\prime}\}, which can hold even when N is defective (not diagonalizable). Condition (ii) requires that the communication classes corresponding to λ_{1} are aperiodic. That is, if Ψ is a communication class and N[Ψ] corresponds to the substochastic matrix N restricted to the states in Ψ, with largest eigenvalue λ_{1}[Ψ], then all Ψ such that λ_{1}[Ψ]=λ_{1} should be aperiodic. We also mention that the algebraic multiplicity of λ_{1} is the number of communication classes Ψ such that λ_{1}[Ψ]=λ_{1}.
Fu and Johnson (2009) give the following theorem.
Theorem 1
Let {X_{ i }} be a sequence of i.i.d. trials taking values in \mathcal{S}, let Λ be a simple pattern of length ℓ with d×d essential transition probability matrix N and let X_{ n }(Λ) be the number of nonoverlapping occurrences of Λ in {X_{ i }}. If N satisfies the FMCI approximation conditions then, for any fixed x≥0,
where a=\sum_{j=1}^{g} a_j (\xi_0 \eta_j^{\prime}). If g=1, as is usually the case, then a=a_{1}(ξ_{0}η_{1}^{′}).
Given a pattern Λ, the approximation in (13) requires finding the Markov chain imbedding associated with the waiting time W(Λ), the essential transition probability matrix N as well as its eigenvalues and associated eigenvectors. Usually, these steps are rather simple and can be easily automated together with (13). Even for very large n and large ℓ, say n=1,000,000 and ℓ=50, the CPU time is negligible. Fu and Johnson (2009) also provide details on extending these results to compound patterns, overlapping counting and Markov dependent trials.
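The eigenvalue step is easy to automate. As a minimal sketch (not the authors' code), the following computes the dominant eigenvalue λ_1 of the essential matrix for a run of three successes by power iteration and checks that the tail probability ξ_0 N^n 1′, the probability that the pattern has not occurred by trial n, decays geometrically at rate λ_1. The success-run chain, the function names and the choice of power iteration are assumptions made for illustration; power iteration is appropriate here because this N is irreducible and aperiodic.

```python
def run_matrix(k, p):
    """Essential matrix N for a run of k successes; state j = current run length."""
    q = 1.0 - p
    N = [[0.0] * k for _ in range(k)]
    for j in range(k):
        N[j][0] = q
        if j + 1 < k:
            N[j][j + 1] = p
    return N


def dominant_eigenvalue(N, iters=500):
    """Power iteration for the largest eigenvalue of a nonnegative primitive matrix."""
    d = len(N)
    v = [1.0] * d
    lam = 0.0
    for _ in range(iters):
        w = [sum(N[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = max(w)                 # v is scaled so that max(v) = 1, hence max(Nv) -> lambda_1
        v = [wi / lam for wi in w]
    return lam


def survival(N, xi0, n):
    """P{W > n} = xi0 N^n 1': no occurrence of the pattern in the first n trials."""
    d = len(N)
    v = list(xi0)
    for _ in range(n):
        v = [sum(v[i] * N[i][j] for i in range(d)) for j in range(d)]
    return sum(v)


if __name__ == "__main__":
    N = run_matrix(3, 0.5)
    lam1 = dominant_eigenvalue(N)
    ratio = survival(N, [1.0, 0.0, 0.0], 201) / survival(N, [1.0, 0.0, 0.0], 200)
    print(lam1, ratio)   # the two values agree to many decimal places
```

For matrices with several communicating classes or periodic structure, a general eigenvalue routine would be a safer choice than power iteration.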
For the purpose of comparing these approximations, we prefer to write (13) as
Note that the approximation has three parts: a constant part; a polynomial in n of degree x; and a third (dominant) part which converges to 0 exponentially fast as n→∞.
More precisely, the FMCI approximation in (13) may be written as
Since λ_{g+1}<λ_{1}, the term (λ_{g+1}/λ_{1})^{n/(x+1)−ℓ} tends to 0 exponentially as n→∞ and hence is negligible if n/(x+1)−ℓ is moderate or large (say ≥50).
Large deviation approximation
Fu et al. (2012) provide the following large deviation approximation for right-tail probabilities for the number of nonoverlapping occurrences of simple patterns Λ. The reasons for providing only the right-tail large deviation approximation are (i) all of the above-mentioned approximations fail to approximate the extreme right-tail probabilities and (ii) the FMCI approximation provides an accurate approximation for left-tail probabilities.
Theorem 2
Let \epsilon =x{\mu}_{W}^{2}/(1+x{\mu}_{W}) and let
be the moment generating function of W(Λ). Then
where
, τ is the solution to h^{′}(ε,τ)=0, and
Comparisons and relative error
For a given n, x and pattern Λ, we define the relative error of an approximation with respect to the exact probability \mathbb{P}\{X_n(\Lambda)=x\} as
where A stands for the approximate probability and E stands for the exact probability \mathbb{P}\{X_n(\Lambda)=x\}. This quantity, R(x:E,A), goes from −∞ to ∞ and treats the importance of overestimation the same as underestimation. It is clear that R(x:E,A)>0 implies that the approximation is overestimating the exact probability and that R(x:E,A)<0 implies that the approximation is underestimating the exact probability. Since, for fixed x, the probability \mathbb{P}\{X_n(\Lambda)=x\} converges to 0 exponentially fast as n→∞, it follows that R(x:E,A)→±∞ implies that the approximation tends to 0 with the wrong rate. If R(x:E,A) is near 0 then the approximation is close to the exact probability \mathbb{P}\{X_n(\Lambda)=x\}.
Note that R(x:E,A) is a function of x, n and the method of approximation used. The following theorem provides the asymptotic relative error for the Normal approximation (N), the Poisson approximation (P(μ_{ n })) and the finite Markov chain imbedding approximation (F).
Theorem 3
Let {X_{ i }} be a sequence of i.i.d. multistate trials taking values in \mathcal{S} and let Λ be a simple pattern defined on \mathcal{S}. Then, for every fixed x, we have,
where the exact probability is computed using (4) and
Proof
Given a pattern Λ and x, for the finite Markov chain imbedding approximation we have
and hence (i) follows immediately from the definition of R(x:E,A) and Theorem 1.
For the Poisson approximation we have, since E/F∼1 by (i),
and hence
If \liminf_n \mu_n/n > -\ln\lambda_1 then exp{n lnλ_{1}+μ_{ n }} tends to ∞ exponentially fast, which overrides the polynomial term, and hence R(x:E,P(μ_{ n }))→−∞ as n→∞ for all fixed x. Similarly, if \limsup_n \mu_n/n < -\ln\lambda_1, then R(x:E,P(μ_{ n }))→∞ as n→∞ for all fixed x. Furthermore, if \lim_n \mu_n/n = -\ln\lambda_1, then the ratio yields
and this completes the proof of (ii). Note also that, if \limsup_n \mu_n/n > -\ln\lambda_1 and \liminf_n \mu_n/n < -\ln\lambda_1, then \lim_n R(x:E,P(\mu_n)) will not exist.
For the normal approximation we have that X_{ n }(Λ) is approximately normal with mean n/μ_{ W } and variance n{\sigma}_{W}^{2}/{\mu}_{W}^{3} and hence
Hence, provided n>μ_{ W }(x+1/2), we have
Therefore, as in the proof of (ii), we are interested in the asymptotics of F/N, which yields
We may rewrite the argument of the exponential function as
making it clear that (24) converges to ∞ if -\mu_W/(2\sigma_W^2) \ge \ln\lambda_1 and 0 otherwise. Therefore, R(x:E,N)→∞ if -\mu_W/(2\sigma_W^2) \ge \ln\lambda_1 and R(x:E,N)→−∞ if -\mu_W/(2\sigma_W^2) < \ln\lambda_1 and the proof of (iii) is complete.
Theorem 3 (ii) implies that asymptotically (for fixed x and n→∞), the Poisson approximation performs poorly (in the relative sense) regardless of the value μ_{ n } used. When Λ is simple and does not have overlapping subpatterns, taking \mu_n = \mathbb{E}X_n(\Lambda) is normally recommended for the Poisson approximation (cf. Arratia et al. 1990). In this case, nonoverlapping and overlapping counting are equivalent. The following corollary shows that, for fixed x, the Poisson approximation will (asymptotically) always overestimate the exact probability in the following sense.
Corollary 1
Let Λ be a simple pattern defined on an i.i.d. sequence of multistate trials. For \mu_n = \mathbb{E}X_n(\Lambda), we have
for all fixed x.
Proof
Recall that, in this case, X_{ n }(Λ) is a renewal process with i.i.d. interrenewal times with mean \mu_W = \mathbb{E}W(\Lambda) and hence, by the elementary renewal theorem, we have \mathbb{E}X_n(\Lambda)/n \to 1/\mu_W so that \mathbb{E}X_n(\Lambda) \sim n/\mu_W. Therefore, by Theorem 3 (ii), it is sufficient to show that n/μ_{ W }<−n lnλ_{1} for all sufficiently large n, or, equivalently,
1/\mu_W < -\ln\lambda_1.
Now, since 0<\lambda_1\in\mathbb{R} is a dominant eigenvalue of N, it follows that: 0<(1-\lambda_1)^{-1}\in\mathbb{R} is a dominant eigenvalue of the matrix (I−N)^{−1}=A=(a_{ i j }); a_{ i j }≥0 with at least one a_{ i j }>0; and A 1^{′}=(I−N)^{−1}1^{′}≤μ_{ W }1^{′}. Hence, by a simple corollary to the Perron-Frobenius Theorem for nonnegative matrices (cf. Karlin and Taylor 1975, Corollary 2.2, pg. 551), we have
where a_{ij}^{(n)} = (A^n)_{ij}. Therefore, provided μ_{ W }<∞,
which completes the proof.
Corollary 1 implies that, if {\mu}_{n}\sim \mathbb{E}{X}_{n}\left(\Lambda \right), then the Poisson approximation will always overestimate the exact probability as n→∞. Together with Theorem 3 (ii), this implies that using μ_{ n }∼−n lnλ_{1} results in the best Poisson approximation as n→∞.
We also comment that, for the normal approximation, both -\mu_W/(2\sigma_W^2) < \ln\lambda_1 and -\mu_W/(2\sigma_W^2) \ge \ln\lambda_1 are possible. As a simple example, suppose we have a sequence of i.i.d. Bernoulli (p) trials and Λ=S S S. If p=1/2, we obtain
and
However, with p=0.9, we obtain
and
Thus, R(x:E,N)→±∞ are both possible depending on x, the pattern, and the probability structure of the {X_{ i }}.
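The two cases in this example can be reproduced numerically. The sketch below is a hedged illustration rather than the authors' computation: it uses the standard closed forms for the mean and variance of the waiting time for a run of k successes, and it obtains λ_1 by bisection on the characteristic equation λ^k = q(λ^{k−1} + pλ^{k−2} + ⋯ + p^{k−1}) of the success-run essential matrix. Both the closed forms and the function names are assumptions brought in for the illustration.

```python
import math


def run_mean_var(k, p):
    """Closed-form mean and variance of the waiting time for k consecutive successes."""
    q = 1.0 - p
    mean = (1.0 - p ** k) / (q * p ** k)
    var = (1.0 - (2 * k + 1) * q * p ** k - p ** (2 * k + 1)) / (q ** 2 * p ** (2 * k))
    return mean, var


def run_lambda1(k, p):
    """Largest eigenvalue of the success-run essential matrix, found by bisection on (0, 1)."""
    q = 1.0 - p

    def f(lam):
        # Characteristic polynomial: negative below the positive root, positive above it.
        return lam ** k - q * sum(p ** j * lam ** (k - 1 - j) for j in range(k))

    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for p in (0.5, 0.9):
        mean, var = run_mean_var(3, p)
        lam1 = run_lambda1(3, p)
        # Compare the normal decay rate mu_W / (2 sigma_W^2) with the exact rate -ln(lambda_1).
        print(p, mean / (2.0 * var), -math.log(lam1))
```

For p = 1/2 the normal rate is smaller than −ln λ_1, and for p = 0.9 it is larger, so the normal approximation decays too slowly in the first case and too quickly in the second.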
Numerical comparisons
In the previous section we showed that, for fixed x and n→∞, the approximation based on the finite Markov chain imbedding technique outperforms the Poisson and normal approximations. In practice, however, one is interested in the performance of these approximations not only when x is fixed and n→∞, but also when n is fixed (at some moderate value) and x varies. The reason we consider only large or moderate n in our numerical study is that, for small n, the FMCI technique easily gives the exact results. In this section we present some numerical experiments to illustrate the advantages (and disadvantages) of the methods discussed.
The approximations we compare are: the finite Markov chain approximation in (13) (FMCI); the Poisson approximation with \mu_n = n/\mu_W (\sim \mathbb{E}X_n(\Lambda)), where μ_{ W } is calculated using (7) (Poisson); the normal approximation given in (6) (Normal); and the large deviation approximation given in Theorem 2 (LD), which is only for right-tail probabilities.
Reliability of C(k,n:F) systems
A consecutive-k-out-of-n:F system is a system of n independent and linearly connected components, each with common (continuous) lifetime distribution F, in which the system fails if k consecutive components fail. At a given time t>0, the probability that a component is working is p=1−F(t) and the probability that a single component has failed is q=1−p. Hence the probability that the system has failed is the probability that k (or more) consecutive components have failed, which is equivalent to the probability of k consecutive failures in a sequence of n Bernoulli trials with success probability p. Barbour et al. (1995) present a table of various bounds for system reliability based on a Poisson approximation and a compound Poisson approximation and compare these to bounds found in Fu (1985). Table 1 shows the exact probabilities and relative errors for the FMCI and Poisson approximations as well as the compound Poisson approximation in Barbour et al. (1995) (CP).
The FMCI approximation performs very well for the parameters tested here. As expected, the Poisson and compound Poisson approximations perform well when n q^{k} is relatively small. When the reliability of the system is relatively low, the Poisson and compound Poisson approximations begin to degrade.
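The behaviour described here can be explored with a short computation. The following sketch, written under stated assumptions, evaluates the exact reliability of a consecutive-k-out-of-n:F system by propagating the distribution of the current failure run, and compares it with the simple Poisson value exp(−n/μ_W), where μ_W is taken from the standard closed form for the mean waiting time until k consecutive failures. The function names and the parameter values are illustrative and do not reproduce Table 1.

```python
import math


def reliability_exact(n, k, q):
    """P{no k consecutive failed components among n}, with component failure probability q."""
    p = 1.0 - q
    state = [1.0] + [0.0] * (k - 1)     # state[j]: the trailing failure run has length j
    for _ in range(n):
        new = [0.0] * k
        new[0] = p * sum(state)          # a working component resets the failure run
        for j in range(k - 1):
            new[j + 1] = q * state[j]    # a failed component extends the run
        state = new                      # mass that would reach length k is dropped (system failed)
    return sum(state)


def reliability_poisson(n, k, q):
    """Poisson approximation exp(-mu_n) with mu_n = n / mu_W."""
    p = 1.0 - q
    mu_w = (1.0 - q ** k) / (p * q ** k)   # closed-form mean waiting time for k failures in a row
    return math.exp(-n / mu_w)


if __name__ == "__main__":
    for n, k, q in [(100, 3, 0.1), (1000, 3, 0.1), (1000, 3, 0.3)]:
        print(n, k, q, reliability_exact(n, k, q), reliability_poisson(n, k, q))
```

Comparing the rows with small and large n q^k gives a quick feel for when the Poisson value starts to drift from the exact reliability.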
Approximating the distribution of N_{ n,k }
Recall that N_{n,k} is the number of nonoverlapping occurrences of k consecutive successes in {X_{ i }} (i.e. N_{n,k}=X_{ n }(Λ) with Λ=S S⋯S of length k). By reversing the roles of success and failure, the reliability of C(k,n : F) systems can be related to the distribution of N_{n,k}. In this section we present some examples of approximating \mathbb{P}\{{N}_{n,k}=x\} with the approximations FMCI, Normal, Poisson and LD.
Figure 1 shows the relative error R(x:E,A) in these approximations for (a) N_{2000,4}; (b) N_{5000,4}; and (c) N_{250000,6} when the probability of success is p=0.3. On all of the figures, the top axis is on a standard z-scale making use of the asymptotic mean and variance of X_{ n }(Λ), namely n/\mu_W and n\sigma_W^2/\mu_W^3, respectively.
We notice that the Finite Markov chain imbedding approximation (FMCI) performs very well in the left tail of the distribution in all cases. Its performance degrades as x gets large but its performance is more consistent than both the Poisson and Normal approximations in this case. The large deviation approximation performs well in the right tail in all cases. In (c), the FMCI approximation performs very well throughout most of the support. The Poisson approximations also perform well over most of the x considered. The normal approximation performs well in the neighbourhood of \mathbb{E}{X}_{n}\left(\Lambda \right) but not in the tails.
As the probability of success p increases, the FMCI approximation still performs very well in the left tail, but its performance tends to degrade more quickly as x increases. The Poisson approximations also quickly degrade as p increases since \mathbb{E}N_{n,k} increases. For larger p, the Normal approximation tends to work better near the mean. In the far left tail, the FMCI approximation is preferred and in the far right tail, the LD approximation is preferred.
Biological sequences
Sequences of DNA nucleotides are of great interest (as are sequences of amino acids and other biological sequences). Figure 2 shows the relative errors for approximating \mathbb{P}\{X_n(\Lambda)=x\} with Λ=A C G (n=1,000 and 10,000) and Λ=C A T T A G (n=500,000). We see that the FMCI approximation again performs very well in the left tail, although, in (b), the performance degrades somewhat as x gets large. The large deviation approximation performs very well in the right tail, especially when x is greater than 3 standard deviations above the mean. While it is difficult to give a rule of thumb, the FMCI approximation seems to perform very well when x \le \mathcal{O}(n^{1/2}). The normal approximation works best within a few standard deviations of the mean and performs best in this region when \mathbb{E}X_n(\Lambda) is relatively large.
Discussion and conclusions
The finite Markov chain imbedding approximations (FMCI and LD) provide an alternative to the usual normal and Poisson approximations for the distributions of runs and patterns. While the FMCI approximation is simple, accurate and fast, it has one disadvantage over the normal and Poisson approximations — it requires the use of the FMCI technique, which is nontraditional and less known in the Statistics community, except in the area of system reliability (cf. Cui et al. 2010). On the other hand, the FMCI technique does not require the rather strong conditions necessary for the Poisson techniques, such as n p^{k}→λ. This condition is seldom satisfied in practical applications. For example, in DNA sequence analysis, the probabilities p_{ A }, p_{ C }, p_{ G } and p_{ T } do not tend to 0 as n increases. They may not all be in the neighbourhood of 1/4 but they are bounded away from 0.
For all of the numerical results in the previous section, the exact probabilities \mathbb{P}\{X_n(\Lambda)=x\} were obtained via the FMCI technique and their CPU times were only a few seconds, or less than a minute even in the case of Λ=C A T T A G and n=500,000. Based on our experience, if the length of the pattern is less than 20 and n is less than 1,000,000, the exact probability should be computed.
References
Arratia R, Goldstein L, Gordon L: Poisson approximation and the Chen-Stein method. Stat. Sci 1990, 5(4):403–434.
Asmussen S: Applied Probability and Queues. Springer, New York; 2003.
Balakrishnan N, Koutras MV: Runs and Scans with Applications. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], New York; 2002.
Barbour AD, Eagleson GK: Poisson approximation for some statistics based on exchangeable trials. Adv. Appl. Probab 1983, 15(3):585–600.
Barbour AD, Eagleson GK: Poisson convergence for dissociated statistics. J. Roy. Statist. Soc. Ser. B 1984, 46(3):397–402.
Barbour AD, Eagleson GK: An improved Poisson limit theorem for sums of dissociated random variables. J. Appl. Probab 1987, 24(3):586–599.
Barbour AD, Hall P: On the rate of Poisson convergence. Math. Proc. Cambridge Philos. Soc 1984, 95(3):473–480.
Barbour AD, Chryssaphinou O: Compound Poisson approximation: a user’s guide. Ann. Appl. Probab 2001, 11(3):964–1002.
Barbour AD, Holst L, Janson S: Poisson Approximation. Oxford Studies in Probability. Oxford Science Publications; 1992a.
Barbour AD, Chen LHY, Loh WL: Compound Poisson approximation for nonnegative random variables via Stein’s method. Ann. Probab 1992b, 20(4):1843–1866.
Barbour AD, Chryssaphinou O, Roos M: Compound Poisson approximation in reliability theory. IEEE T. Reliab 1995, 44(3):398–402.
Barbour AD, Chryssaphinou O, Roos M: Compound Poisson approximation in systems reliability. Naval Res. Logist 1996, 43(2):251–264.
Blom G, Thorburn D: How many random digits are required until given sequences are obtained? J. Appl. Probab 1982, 19(3):518–531.
Chen LHY: Poisson approximation for dependent trials. Ann. Probab 1975, 3(3):534–545.
Cui L, Xu Y, Zhao X: Developments and applications of the finite Markov chain imbedding approach in reliability. IEEE T. Reliab 2010, 59(4):685–690.
Fu JC: Reliability of a large consecutive-k-out-of-n:F system. IEEE T. Reliab 1985, R-34: 120–127.
Fu JC, Johnson BC: Approximate probabilities for runs and patterns in i.i.d. and Markov dependent multistate trials. Adv. Appl. Probab 2009, 41(1):292–308.
Fu JC, Koutras MV: Distribution theory of runs: a Markov chain approach. J. Amer. Statist. Assoc 1994, 89(427):1050–1058.
Fu JC, Lou WYW: Distribution Theory of Runs and Patterns and Its Applications. World Scientific Publishing Co. Inc, River Edge; 2003.
Fu JC, Lou WYW: On the normal approximation for the distribution of the number of simple or compound patterns in a random sequence of multistate trials. Methodol. Comput. Appl. Probab 2007, 9(2):195–205.
Fu JC, Johnson BC, Chang YM: Approximating the extreme right-hand tail probability for the distribution of the number of patterns in a sequence of multistate trials. J. Stat. Plan. Infer 2012, 142(2):473–480.
Gerber HU, Li SYR: The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain. Stochastic Process. Appl 1981, 11(1):101–108.
Godbole AP: Degenerate and Poisson convergence criteria for success runs. Statist. Probab. Lett 1990a, 10(3):247–255.
Godbole AP: Specific formulae for some success run distributions. Statist. Probab. Lett 1990b, 10(2):119–124.
Godbole AP: Poisson approximations for runs and patterns of rare events. Adv. Appl. Probab 1991, 23(4):851–865.
Godbole AP, Schaffner AA: Improved Poisson approximations for word patterns. Adv. Appl. Probab 1993, 25(2):334–347.
Holst L, Kennedy JE, Quine MP: Rates of Poisson convergence for some coverage and urn problems using coupling. J. Appl. Probab 1988, 25(4):717–724.
Karlin S: Statistical signals in bioinformatics. Proc. Natl. Acad. Sci. U. S. A 2005, 102(38):13355–13362.
Karlin S, Taylor HM: A First Course in Stochastic Processes. Academic Press [A subsidiary of Harcourt Brace Jovanovich, Publishers], New York-London; 1975.
Kleffe J, Borodovski M: First and second moment of counts of words in random text generated by Markov chains. Comp Applic Biosci 1992, 8: 433–441.
Martin J, Regad L, Camproux AC, Nuel G: Finite Markov chain embedding for the exact distribution of patterns in a set of random sequences. In Advances in Data Analysis. Statistics for Industry and Technology. Edited by: Skiadas CH. Birkhäuser, Boston; 2010.
Meyn SP, Tweedie RL: Markov Chains and Stochastic Stability. Communications and Control Engineering Series. 1993.
Nuel G, Regad L, Martin J, Camproux AC: Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data. Algorithm Mol. Biol 2010, 5(1):1–18.
Schwager SJ: Run probabilities in sequences of Markov-dependent trials. J. Amer. Statist. Assoc 1983, 78(381):168–180.
Seneta E: Nonnegative Matrices and Markov Chains. Springer, New York; 1983.
Solov’ev AD: A combinatorial identity and its application to the problem on the first occurrence of a rare event. Teor. Verojatnost. i Primenen 1966, 11: 313–320.
Acknowledgements
This work was supported, in part, by the Natural Sciences and Engineering Research Council of Canada.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
BJ and JF contributed equally to the mathematical details. BJ performed the numerical comparisons and prepared the manuscript. Both authors read and approved the final manuscript.
Brad C Johnson and James C Fu contributed equally to this work.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Johnson, B.C., Fu, J.C. Approximating the distributions of runs and patterns. J Stat Distrib App 1, 5 (2014). https://doi.org/10.1186/2195583215