# Approximating the distributions of runs and patterns

- Brad C Johnson (corresponding author)
- James C Fu

**1**:5

https://doi.org/10.1186/2195-5832-1-5

© Johnson and Fu; licensee Springer. 2014

**Received:** 14 November 2013

**Accepted:** 7 March 2014

**Published:** 11 June 2014

## Abstract

The distribution theory of runs and patterns has been successfully used in a variety of applications including, for example, nonparametric hypothesis testing, reliability theory, quality control, DNA sequence analysis, general applied probability and computer science. The exact distributions of the number of runs and patterns are often very hard to obtain or computationally problematic, especially when the pattern is complex and *n* is very large. Normal, Poisson and compound Poisson approximations are frequently used to approximate these distributions. In this manuscript, we (i) study the asymptotic relative error of the normal, Poisson, compound Poisson, finite Markov chain imbedding and large deviation approximations; and (ii) provide some numerical studies comparing these approximations with the exact probabilities for moderately sized *n*. Both theoretical and numerical results show that, in the relative sense, the finite Markov chain imbedding approximation performs best in the left tail and the large deviation approximation performs best in the right tail.

### AMS Subject Classification

Primary 60E05; Secondary 60J10

## Introduction and notation

Let ${\left\{{X}_{i}\right\}}_{i=1}^{n}$ be a sequence of *m*-state trials (*m*≥2) taking values in the set $\mathcal{S}=\{{s}_{1},\dots ,{s}_{m}\}$ of *m* symbols. For simplicity, ${\left\{{X}_{i}\right\}}_{i=1}^{n}$ will be denoted {*X*_{i}} and *n* will be allowed to be *∞*. A *simple pattern* $\Lambda ={s}_{{i}_{1}}{s}_{{i}_{2}}\cdots {s}_{{i}_{\ell}}$, of length *ℓ*, is the juxtaposition of *ℓ* (not necessarily distinct) symbols from $\mathcal{S}$. Given a simple pattern *Λ*, we let *X*_{n}(*Λ*) denote the number of either non-overlapping or overlapping occurrences of *Λ* in the sequence ${\left\{{X}_{i}\right\}}_{i=1}^{n}$, where the method of counting will be made clear by the context. The waiting time *W*(*Λ*,*x*) until the *x*’th occurrence of the simple pattern *Λ* in ${\left\{{X}_{i}\right\}}_{i=1}^{n}$ is thus defined by

$$W(\Lambda ,x)=\inf \{n:{X}_{n}(\Lambda )=x\},$$

and we write *W*(*Λ*)=*W*(*Λ*,1). Finally, we define the inter arrival times

$${W}_{i}(\Lambda )=W(\Lambda ,i)-W(\Lambda ,i-1),\quad i=1,2,\dots ,$$

where *W*(*Λ*,0):=0.

Two simple patterns *Λ*_{1} and *Λ*_{2} are distinct if neither *Λ*_{1} appears in *Λ*_{2} nor *Λ*_{2} appears in *Λ*_{1}. If *Λ*_{1},…,*Λ*_{r} are pairwise distinct simple patterns, we define the compound pattern $\Lambda =\bigcup _{i=1}^{r}{\Lambda}_{i}$, where an occurrence of any *Λ*_{i} is considered an occurrence of *Λ*. For a compound pattern *Λ*=*Λ*_{1}∪⋯∪*Λ*_{r}, we similarly define *X*_{n}(*Λ*) as the number of occurrences of *Λ* in {*X*_{i}}.

The waiting times *W*(*Λ*,*x*), *W*(*Λ*) and *W*_{i}(*Λ*) are then defined as above, and are often referred to as *sooner* waiting times.

For any *Λ*, *x* and *n*, the events {*X*_{n}(*Λ*)<*x*} and {*W*(*Λ*,*x*)>*n*} are equivalent and hence

$$\mathbb{P}\{{X}_{n}(\Lambda )<x\}=\mathbb{P}\{W(\Lambda ,x)>n\},\qquad (1)$$

which provides a convenient way of studying the exact and approximate distribution of *X*_{n}(*Λ*) through the waiting time distributions of *W*(*Λ*,*x*).
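The equivalence of the events {*X*_{n}(*Λ*)<*x*} and {*W*(*Λ*,*x*)>*n*} can be checked empirically. The following is a minimal sketch, assuming non-overlapping counting done greedily from left to right; the function names, the test pattern and the random seed are illustrative choices and are not taken from the paper.

```python
import random

def count_nonoverlapping(seq, pattern):
    """X_n(pattern): non-overlapping occurrences, counted left to right."""
    count, i, m = 0, 0, len(pattern)
    while i + m <= len(seq):
        if seq[i:i + m] == pattern:
            count += 1
            i += m      # counting restarts after a completed occurrence
        else:
            i += 1
    return count

def waiting_time(seq, pattern, x):
    """W(pattern, x): first n with X_n = x, or None if never reached in seq."""
    for n in range(1, len(seq) + 1):
        if count_nonoverlapping(seq[:n], pattern) == x:
            return n
    return None

def duality_holds(seq, pattern, max_x=5):
    """Check {X_n < x} <=> {W(x) > n} for every prefix length n and x = 1..max_x."""
    for x in range(1, max_x + 1):
        w = waiting_time(seq, pattern, x)
        for n in range(len(seq) + 1):
            lhs = count_nonoverlapping(seq[:n], pattern) < x
            rhs = (w is None) or (w > n)   # None means the x'th occurrence never arrives
            if lhs != rhs:
                return False
    return True

random.seed(1)  # arbitrary seed, used only for reproducibility
sample = "".join(random.choice("ACGT") for _ in range(300))
print(duality_holds(sample, "AC"))
```

Because the count is built from left to right, the count on a prefix never changes when the sequence is extended, which is what makes the two events coincide.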

Throughout this paper, unless specified otherwise, we assume that the trials {*X*_{i}} are either independent and identically distributed (i.i.d.) or first order Markov dependent; the pattern *Λ* is either simple or compound; and the counting of occurrences of *Λ* is in a non-overlapping fashion.

The distributions of the number of runs and patterns in a sequence of multi-state trials or random permutations of a set of integers have been successfully used in various fields in applied probability, statistics and discrete mathematics. Examples include reliability theory, quality control, DNA sequence analysis, psychology, ecology, astronomy, nonparametric tests, successions, and the Eulerian and Simon-Newcomb numbers (the latter three being defined for permutations). Two recent books, Balakrishnan and Koutras (2002) and Fu and Lou (2003), provide some scope of the distribution theory of runs and patterns, and Martin et al. (2010) and Nuel et al. (2010) provide some extensions to sets of sequences.

Given a pattern *Λ*, the exact distribution of *X*_{n}(*Λ*) traditionally has been determined using combinatorial analysis on a case by case basis. The formulae for these distributions are often very complex and computationally problematic. Even for many simple patterns, their distributions in terms of combinatorial analysis remain unknown, especially when the {*X*_{i}} are Markov dependent multi-state trials.

The distribution of the waiting time *W*(*Λ*) for the first occurrence of certain types of runs and patterns has been studied by many authors. See, for example, Blom and Thorburn (1982), Gerber and Li (1981), Schwager (1983), and Solov’ev (1966). More recently, Fu and Koutras (1994) developed a method for determining the exact distributions of *X*_{n}(*Λ*) and *W*(*Λ*) for any simple or compound *Λ* in either i.i.d. or Markov dependent trials (see also Fu and Lou 2003). The method was referred to as the Finite Markov Chain Imbedding (FMCI) technique, which can be easily described as follows: given a simple or compound pattern *Λ*, there exists a finite Markov chain {*Y*_{i}} defined on a finite state space, say *Ω*={1,…,*d*,*α*}, with an absorbing state *α* and transition probability matrix of the form

$$\mathbf{M}=\begin{pmatrix}\mathbf{N} & \mathbf{c}\\ \mathbf{0} & 1\end{pmatrix},\qquad (2)$$

such that the distribution of the waiting time *W*(*Λ*) is given by (3), where **ξ**_{0} is the initial distribution, **N** is the *essential transition probability matrix* (i.e. the sub-stochastic matrix consisting of only the transient states of {*Y*_{i}}) as defined in (2), **I** is a *d*×*d* identity matrix and **1**=(1,1,…,1) is a 1×*d* row-vector. Furthermore, the random variable *X*_{n}(*Λ*), the number of occurrences of *Λ* in {*X*_{i}}, is also finite Markov chain imbeddable and its distribution is given by (4), in which the matrix **N**_{x} has the block form (5), the matrix **N** is given by (2), and the matrix **C** defines the “continuation” transition probabilities from one occurrence to the next and depends on **c** in (2).
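To make the imbedding concrete, here is a minimal sketch for the simplest case, a run of *k* successes in i.i.d. Bernoulli(*p*) trials. It assumes the standard FMCI tail identity P{*W*(*Λ*)>*n*} = **ξ**_{0}**N**^{n}**1**′ with the chain started in the "no current run" state; the state labelling and all function names are illustrative assumptions, and the result is cross-checked against brute-force enumeration.

```python
from itertools import product

def run_essential_matrix(k, p):
    """Essential matrix N for the first run of k successes.

    State j (0 <= j < k) is the length of the current success run.
    """
    q = 1.0 - p
    N = [[0.0] * k for _ in range(k)]
    for j in range(k):
        N[j][0] += q            # a failure resets the run
        if j + 1 < k:
            N[j][j + 1] += p    # a success extends the run
        # from state k-1 a success is absorbed (the pattern occurs), so it is not in N
    return N

def survival(N, n):
    """P{W > n} = xi_0 N^n 1', with xi_0 = (1, 0, ..., 0)."""
    d = len(N)
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(n):
        v = [sum(v[i] * N[i][j] for i in range(d)) for j in range(d)]
    return sum(v)

def survival_brute(k, p, n):
    """The same probability by enumerating all 2^n sequences (small n only)."""
    total = 0.0
    for seq in product("SF", repeat=n):
        s = "".join(seq)
        if "S" * k not in s:
            total += p ** s.count("S") * (1.0 - p) ** s.count("F")
    return total

N = run_essential_matrix(3, 0.3)
print(survival(N, 10), survival_brute(3, 0.3, 10))
```

The matrix approach costs on the order of *n*·*d*² operations, whereas the enumeration grows as 2^{n}; this is the practical appeal of the imbedding.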

If the pattern *Λ* is long and complex and *n* is very large, then the computation of $\mathbb{P}\{{X}_{n}(\Lambda )=x\}$ can become problematic and, to overcome this problem, various asymptotic approximations have been developed for these probabilities.

In real applications, if the exact distribution is not available or is hard to compute, it is important to know which approximations perform well and are easy to compute. Furthermore, it is important to know how these approximations perform with respect to each other and the exact distribution from both a theoretical and numerical standpoint. The aims of this manuscript are two-fold: (i) we first study the asymptotic relative error of the normal, Poisson (or compound Poisson), and FMCI approximations with respect to the exact distribution; and (ii) we then provide a numerical study comparing these three approximations with the exact probabilities in cases where *x* is fixed and *n*→*∞* and when *n* is fixed and *x* varies. As an important byproduct, the FMCI technique allows the normal and Poisson approximations to be applied in more cases, for example, the distribution of compound patterns and patterns in Markov dependent trials.

## The approximations

### Normal approximation

The normal approximation is frequently used to approximate the distribution of *X*_{n}(*Λ*) in Statistics. In general, when *Λ* is simple or compound, the trials are i.i.d., and the counting is non-overlapping, by appealing to (1) and renewal arguments, it has been shown that *X*_{n}(*Λ*) is asymptotically normally distributed (cf. Fu and Lou 2007; Karlin and Taylor 1975). The form of the approximation is given in (6), where *Φ*(·) denotes the standard normal distribution function and *μ*_{W} and ${\sigma}_{W}^{2}$ are the mean and variance of *W*(*Λ*) respectively, which are given by (7) and (8).

Given a pattern *Λ*, it is well known that the mean *μ*_{W} and the variance ${\sigma}_{W}^{2}$ are difficult to obtain via combinatoric arguments, especially when *Λ* is a compound pattern or the trials are Markov dependent. For example, as pointed out in Karlin (2005) and Kleffe and Borodovski (1992), approximate values of *μ*_{W} and ${\sigma}_{W}^{2}$ must sometimes be used. Since *W*(*Λ*) is finite Markov chain imbeddable, (7) and (8) provide the exact values.
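As an illustration of how exact moments follow from the imbedding, the sketch below computes the mean and variance of *W*(*Λ*) from an essential matrix. It assumes the identities E*W* = **ξ**_{0}(**I**−**N**)^{−1}**1**′ and E*W*² = **ξ**_{0}(**I**+**N**)(**I**−**N**)^{−2}**1**′, which follow from P{*W*>*n*} = **ξ**_{0}**N**^{n}**1**′ by summing the tail probabilities; whether these coincide in form with (7) and (8) is an assumption. The small linear solver and the function names are illustrative.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def waiting_time_moments(N, xi0):
    """Mean and variance of W from the essential matrix N and initial distribution xi0."""
    d = len(N)
    I_minus_N = [[(1.0 if i == j else 0.0) - N[i][j] for j in range(d)] for i in range(d)]
    m1 = solve(I_minus_N, [1.0] * d)   # (I - N)^{-1} 1'
    m2 = solve(I_minus_N, m1)          # (I - N)^{-2} 1'
    mean = sum(xi0[i] * m1[i] for i in range(d))
    # E[W^2] = xi0 (I + N) (I - N)^{-2} 1'
    second = sum(xi0[i] * (m2[i] + sum(N[i][j] * m2[j] for j in range(d)))
                 for i in range(d))
    return mean, second - mean ** 2

# Pattern SSS with p = 1/2; state j is the current run length.
N_sss = [[0.5, 0.5, 0.0],
         [0.5, 0.0, 0.5],
         [0.5, 0.0, 0.0]]
print(waiting_time_moments(N_sss, [1.0, 0.0, 0.0]))
```

For *Λ*=*SSS* with fair trials this reproduces the classical values of 14 for the mean and 142 for the variance of the waiting time.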

The limit in (6) is appropriate when the inter arrival times {*W*_{i}(*Λ*)} are i.i.d., which is the case for simple and compound patterns when the {*X*_{i}} are i.i.d. and counting is non-overlapping. When occurrences of *Λ* correspond to a delayed renewal process, which can occur for Markov dependent trials and/or overlapping counting, we could use the mean and variance of *W*_{2}(*Λ*) for the normalizing constants, which are easily obtained by modifying **ξ**_{0} in (7) and (8). Even more general cases can be handled by making use of a functional central limit theorem for Markov chains (see, for example, Meyn and Tweedie (1993, §17.4) and Asmussen (2003, Theorem 7.2, pg. 30) for the details).

### Poisson and compound Poisson approximations

It is well known that, in a sequence of *n* i.i.d. Bernoulli(*p*) trials, if *n**p*→*λ* as *n*→*∞*, then the probability of *k* successes in *n* trials can be approximated by a Poisson probability with parameter *λ*, denoted $P\left(\lambda \right)$. This idea has been extended to certain patterns *Λ* and, under certain conditions, the distribution of *X*_{n}(*Λ*) can be approximated by a Poisson distribution with parameter *μ*_{n} in the sense that

$${d}_{TV}\left(\mathcal{L}({X}_{n}(\Lambda )),P({\mu}_{n})\right)\le {\epsilon}_{n},$$

where $\mathcal{L}(\cdot )$ denotes the distribution (law) of a random variable and *d*_{TV}(·,·) denotes the total variation distance.
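For the classical Bernoulli case, the total variation distance can be computed directly. The sketch below evaluates the distance between a Binomial(*n*,*p*) law and a Poisson law with the same mean, and compares it with Le Cam's classical bound *n**p*², which is a standard result and is not taken from this paper. The function name and the parameter values are illustrative.

```python
from math import comb, exp

def tv_binomial_poisson(n, p):
    """Total variation distance between Binomial(n, p) and Poisson(n * p)."""
    lam = n * p
    pois = exp(-lam)            # Poisson pmf at 0
    total, pois_mass = 0.0, 0.0
    for k in range(n + 1):
        binom = comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        total += abs(binom - pois)
        pois_mass += pois
        pois *= lam / (k + 1)   # pmf recursion: P(k+1) = P(k) * lam / (k+1)
    # Poisson mass beyond n, where the binomial pmf is zero
    total += max(0.0, 1.0 - pois_mass)
    return 0.5 * total

n, p = 100, 0.02
print(tv_binomial_poisson(n, p), n * p ** 2)
```

The distance is small because *p* is small here; for patterns whose symbol probabilities stay bounded away from 0, no comparable guarantee is available from this bound.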

The primary tool used to obtain *μ*_{n} and the bound *ε*_{n} is the Stein-Chen method (Chen 1975), which has been refined by various authors, including Arratia et al. (1990), Barbour and Eagleson (1983), Barbour and Eagleson (1984), Barbour and Eagleson (1987), Barbour and Hall (1984), Godbole (1990a), Godbole (1990b), Godbole (1991), Godbole and Schaffner (1993), and Holst et al. (1988). This method has also been extended to compound Poisson approximations for the distributions of runs and patterns; Barbour and Chryssaphinou (2001) provide an excellent theoretical review of these approximations.

Finding $\mathbb{E}{X}_{n}\left(\Lambda \right)$ and the bound *ε*_{n} is usually done on a case by case basis. For the mathematical details, the books (Barbour et al. 1992a) and (Balakrishnan and Koutras 2002) are recommended.

A compound Poisson distribution is the distribution of a random sum ${\sum}_{j=1}^{M}{Y}_{j}$, where *M* has a Poisson distribution with parameter *λ* and the *Y*_{j} are i.i.d. having distribution *ν*. A compound Poisson distribution for approximating nonnegative random variables was suggested in Barbour et al. (1992b) (see also Barbour et al. (1995, 1996)). The approximation is formulated similarly to the Poisson approximation.

The distribution of *N*_{n,k}, the number of non-overlapping occurrences of *k* consecutive successes in *n* i.i.d. Bernoulli trials, is one of the most important in this area and one of the most studied in the literature. Reversing the roles of *S* (success) and *F* (failure), the reliability of a consecutive-*k*-out-of-*n*:F system, denoted *C*(*k*,*n* : *F*), is given by $\mathbb{P}\{{N}_{n,k}=0\}$. Even in this simple case (i.e. *Λ*=*S* *S*⋯*S*), there are several ways to apply the Poisson approximation techniques. For example, Godbole (1991, Theorem 2) shows that approximating *N*_{n,k} with a $P\left(\mathbb{E}{N}_{n,k}\right)$ distribution works well if certain conditions hold. Godbole and Schaffner (1993, pg. 340) suggest an improved Poisson approximation for word patterns.

The primary difficulty in applying the Poisson approximation is the determination of the optimal parameter *μ*_{n}, which is highly dependent on the structure of the pattern *Λ*. In particular, if *Λ* is long and has several uneven overlapping sub-patterns, then finding *μ*_{n} by these methods can be very tedious. In the sequel, we show that even the (asymptotic) best choice for *μ*_{n} for Poisson approximations does not perform well in the relative sense.

### FMCI approximations

Approximations based on the FMCI approach depend on the spectral decomposition of the essential transition probability matrix **N**.

Let **N** be a *w*×*w* essential transition probability matrix associated with a finite Markov chain {*Y*_{n}:*n*≥0} corresponding to the distribution of the waiting time *W*(*Λ*). Let 1>*λ*_{1}≥|*λ*_{2}|≥⋯≥|*λ*_{w}| denote the ordered eigenvalues of **N**, repeated according to their algebraic multiplicities, with associated (right) eigenvectors ${\mathit{\eta}}_{1}^{\prime},{\mathit{\eta}}_{2}^{\prime},\cdots ,{\mathit{\eta}}_{w}^{\prime}$. When the geometric multiplicity of *λ*_{i} is less than its algebraic multiplicity, we will use vectors of 0’s for the unspecified eigenvectors. The fact that *λ*_{1} can be taken as a positive real number and that **η**_{1} can be taken to be non-negative are consequences of the Perron-Frobenius Theorem for non-negative matrices (*cf.* Seneta 1981).

#### Definition 1

We say that {*Y*_{n}:*n*≥0}, or equivalently, **N**, satisfies the *FMCI Approximation Conditions* if

- (i) there exist constants *a*_{1},…,*a*_{w} such that ${\mathbf{1}}^{\prime}=\sum _{i=1}^{w}{a}_{i}{\mathit{\eta}}_{i}^{\prime}$, (12)
- (ii) *λ*_{1} has algebraic multiplicity *g* and *λ*_{1}>|*λ*_{j}| for all *j*>*g*.

Verifying these conditions is usually straightforward. They certainly hold if **N** is irreducible and aperiodic, but also hold in many other cases. For example, (12) requires only that **1**^{′} is in the linear space spanned by $\{{\mathit{\eta}}_{1}^{\prime},{\mathit{\eta}}_{2}^{\prime},\cdots ,{\mathit{\eta}}_{w}^{\prime}\}$, which can hold even when **N** is defective (not diagonalizable). Condition (ii) requires that the communication classes corresponding to *λ*_{1} are aperiodic. That is, if *Ψ* is a communication class and **N**[*Ψ*] denotes the substochastic matrix **N** restricted to the states in *Ψ*, with largest eigenvalue *λ*_{1}[*Ψ*], then all *Ψ* such that *λ*_{1}[*Ψ*]=*λ*_{1} should be aperiodic. We also mention that the algebraic multiplicity of *λ*_{1} is the number of communication classes *Ψ* such that *λ*_{1}[*Ψ*]=*λ*_{1}.
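A short worked derivation may help to show why these two conditions lead to a geometric approximation. It is a sketch under two assumptions that are not stated in this form in the text: the standard FMCI tail identity $\mathbb{P}\{W(\Lambda)>n\}=\xi_0\mathbf{N}^n\mathbf{1}'$, and that every $\eta_i'$ with $a_i\neq 0$ in (12) is a genuine eigenvector of **N**.

```latex
\begin{aligned}
\mathbb{P}\{W(\Lambda) > n\}
  &= \xi_0 \mathbf{N}^{n} \mathbf{1}'
   = \sum_{i=1}^{w} a_i \, \xi_0 \mathbf{N}^{n} \eta_i'
   = \sum_{i=1}^{w} a_i \lambda_i^{n} \, (\xi_0 \eta_i') \\
  &= \lambda_1^{n} \Bigl[ \sum_{j=1}^{g} a_j (\xi_0 \eta_j')
     + \sum_{i>g} a_i \Bigl(\frac{\lambda_i}{\lambda_1}\Bigr)^{n} (\xi_0 \eta_i') \Bigr].
\end{aligned}
```

Condition (i) is what allows the first expansion, and condition (ii) makes every ratio $|\lambda_i/\lambda_1|$ with $i>g$ strictly less than 1, so the second sum vanishes geometrically and the tail behaves like $a\lambda_1^n$ with $a=\sum_{j=1}^{g}a_j(\xi_0\eta_j')$.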

Fu and Johnson (2009) give the following theorem.

#### Theorem 1

*Let* {*X*_{i}} *be a sequence of i.i.d. trials taking values in* $\mathcal{S}$, *let* *Λ* *be a simple pattern of length* *ℓ* *with* *d*×*d* *essential transition probability matrix* **N** *and let* *X*_{n}(*Λ*) *be the number of non-overlapping occurrences of* *Λ* *in* {*X*_{i}}. *If* **N** *satisfies the FMCI approximation conditions then, for any fixed* *x*≥0, *the approximation (13) holds, where* $a=\sum _{j=1}^{g}{a}_{j}\left({\mathit{\xi}}_{0}{\mathit{\eta}}_{j}^{\prime}\right)$. *If* *g*=1, *as is usually the case, then* $a={a}_{1}\left({\mathit{\xi}}_{0}{\mathit{\eta}}_{1}^{\prime}\right)$.

Given a pattern *Λ*, the approximation in (13) requires finding the Markov chain imbedding associated with the waiting time *W*(*Λ*), the essential transition probability matrix **N** as well as its eigenvalues and associated eigenvectors. Usually, these steps are rather simple and can be easily automated together with (13). Even for very large *n* and large *ℓ*, say *n*=1,000,000 and *ℓ*=50, the CPU time is negligible. Fu and Johnson (2009) also provide details on extending these results to compound patterns, overlapping counting and Markov dependent trials.

Note that the approximation has three parts: a constant part; a polynomial in *n* of degree *x*; and a third (dominant) part which converges to 0 exponentially fast as *n*→*∞*.

Since |*λ*_{g+1}|<*λ*_{1}, the term |*λ*_{g+1}/*λ*_{1}|^{n/(x+1)−ℓ} tends to 0 exponentially as *n*→*∞* and hence is negligible if *n*/(*x*+1)−*ℓ* is moderate or large (say ≥50).

### Large deviation approximation

Fu et al. (2012) provide the following large deviation approximation for right-tail probabilities for the number of non-overlapping occurrences for simple patterns *Λ*. The reasons for providing only the right-tail large deviation approximation are (i) all of the above mentioned approximations fail to approximate the extreme right-tail probabilities and (ii) the FMCI approximation provides an accurate approximation for left-tail probabilities.

#### Theorem 2

*Let* $\epsilon =x{\mu}_{W}^{2}/(1+x{\mu}_{W})$ *and let*

*be the moment generating function of* *W*(*Λ*). *Then*

*where* *τ* *is the solution to* *h*^{′}(*ε*,*τ*)=0, *and*

## Comparisons and relative error

For given *n*, *x* and pattern *Λ*, we define the relative error of an approximation with respect to the exact probability $\mathbb{P}\{{X}_{n}(\Lambda )=x\}$ as

$$R(x:E,A)=\frac{A-E}{\min (A,E)},$$

where *A* stands for the approximate probability and *E* stands for the exact probability $\mathbb{P}\{{X}_{n}(\Lambda )=x\}$. This quantity, *R*(*x*:*E*,*A*), goes from −*∞* to *∞* and treats the importance of overestimation the same as underestimation. It is clear that *R*(*x*:*E*,*A*)>0 implies that the approximation is overestimating the exact probability and that *R*(*x*:*E*,*A*)<0 implies that the approximation is underestimating the exact probability. Since, for fixed *x*, the probability $\mathbb{P}\{{X}_{n}(\Lambda )=x\}$ converges to 0 exponentially fast as *n*→*∞*, it follows that *R*(*x*:*E*,*A*)→±*∞* implies that the approximation tends to 0 with the wrong rate. If *R*(*x*:*E*,*A*) is near 0 then the approximation is close to the exact probability $\mathbb{P}\{{X}_{n}(\Lambda )=x\}$.
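A minimal sketch of the relative error follows, assuming it takes the form (*A*−*E*)/min(*A*,*E*). This form is an inference: it has the properties just described (unbounded in both directions, symmetric treatment of over- and underestimation) and it is consistent with the Poisson column of Table 1. The function name is illustrative.

```python
def relative_error(exact, approx):
    """R(x : E, A), assumed here to be (A - E) / min(A, E)."""
    return (approx - exact) / min(exact, approx)

# Overestimating by a factor of 2 and underestimating by a factor of 2
# receive the same magnitude with opposite signs.
print(relative_error(0.5, 1.0), relative_error(1.0, 0.5))
```

Dividing by the smaller of the two probabilities is what makes the measure symmetric; dividing by *E* alone would bound underestimation at −1 while leaving overestimation unbounded.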

Note that *R*(*x*:*E*,*A*) is a function of *x*, *n* and the method of approximation used. The following theorem provides the asymptotic relative error for the Normal approximation (N), the Poisson approximation (*P*(*μ*_{n})) and the finite Markov chain imbedding approximation (F).

### Theorem 3

*Let* {*X*_{i}} *be a sequence of i.i.d. multi-state trials taking values in* $\mathcal{S}$ *and let* *Λ* *be a simple pattern defined on* $\mathcal{S}$. *Then, for every fixed* *x*, *we have*,

*where the exact probability is computed using (4) and*

### Proof

Given *Λ* and *x*, for the finite Markov chain imbedding approximation we have

and hence (i) follows immediately from the definition of *R*(*x*:*E*,*A*) and Theorem 1.

For (ii), since *E*/*F*∼1 by (i), it suffices to consider the ratio of the Poisson and FMCI approximations. If ${\liminf}_{n}{\mu}_{n}/n>-\ln {\lambda}_{1}$, then $\exp \{-(n\ln {\lambda}_{1}+{\mu}_{n})\}$ tends to 0 exponentially fast which overrides the polynomial term and hence *R*(*x*:*E*,*P*(*μ*_{n}))→−*∞* as *n*→*∞* for all fixed *x*. Similarly, if ${\limsup}_{n}{\mu}_{n}/n<-\ln {\lambda}_{1}$, then *R*(*x*:*E*,*P*(*μ*_{n}))→*∞* as *n*→*∞* for all fixed *x*. Furthermore, if ${\lim}_{n}{\mu}_{n}/n=-\ln {\lambda}_{1}$, then the ratio yields

and this completes the proof of (ii). Note also that, if ${\limsup}_{n}{\mu}_{n}/n>-\ln {\lambda}_{1}$ and ${\liminf}_{n}{\mu}_{n}/n<-\ln {\lambda}_{1}$, then ${\lim}_{n}R(x:E,P({\mu}_{n}))$ will not exist.

For (iii), *X*_{n}(*Λ*) is approximately normal with mean *n*/*μ*_{W} and variance $n{\sigma}_{W}^{2}/{\mu}_{W}^{3}$, which gives the normal approximation *N* to the exact probability. For *n*>*μ*_{W}(*x*+1/2), forming the ratio *F*/*N* yields (24), making it clear that (24) converges to *∞* if ${\mu}_{W}/(2{\sigma}_{W}^{2})\ge -\ln {\lambda}_{1}$ and 0 otherwise. Therefore, *R*(*x*:*E*,*N*)→*∞* if ${\mu}_{W}/(2{\sigma}_{W}^{2})\ge -\ln {\lambda}_{1}$ and *R*(*x*:*E*,*N*)→−*∞* if ${\mu}_{W}/(2{\sigma}_{W}^{2})<-\ln {\lambda}_{1}$ and the proof of (iii) is complete.

Theorem 3 (ii) implies that asymptotically (for fixed *x* and *n*→*∞*), the Poisson approximation performs poorly (in the relative sense) regardless of the value *μ*_{n} used. When *Λ* is simple and does not have overlapping sub-patterns, taking ${\mu}_{n}=\mathbb{E}{X}_{n}\left(\Lambda \right)$ is normally recommended for the Poisson approximation (*cf.* Arratia et al. 1990). In this case, non-overlapping and overlapping counting are equivalent. The following corollary shows that, for fixed *x*, the Poisson approximation will (asymptotically) always overestimate the exact probability in the following sense.

### Corollary 1

*Let* *Λ* *be a simple pattern defined on an i.i.d. sequence of multi-state trials. For* ${\mu}_{n}=\mathbb{E}{X}_{n}\left(\Lambda \right)$, *we have* $R(x:E,P({\mu}_{n}))\to \infty$ *as* *n*→*∞* *for all fixed* *x*.

### Proof

Since *X*_{n}(*Λ*) is a renewal process with i.i.d. inter-renewal times with mean ${\mu}_{W}=\mathbb{E}W\left(\Lambda \right)$, the elementary renewal theorem gives $\mathbb{E}{X}_{n}\left(\Lambda \right)/n\to 1/{\mu}_{W}$, so that $\mathbb{E}{X}_{n}\left(\Lambda \right)\sim n/{\mu}_{W}$. Therefore, by Theorem 3 (ii), it is sufficient to show that *n*/*μ*_{W}<−*n* ln*λ*_{1} for all sufficiently large *n*, or, equivalently, $1/{\mu}_{W}<-\ln {\lambda}_{1}$. Since *λ*_{1} is the largest eigenvalue of **N**, it follows that: $0<{(1-{\lambda}_{1})}^{-1}\in \mathbb{R}$ is a dominant eigenvalue of the matrix (**I**−**N**)^{−1}=**A**=(*a*_{ij}); *a*_{ij}≥0 with at least one *a*_{ij}>0; and **A1**^{′}=(**I**−**N**)^{−1}**1**^{′}≤*μ*_{W}**1**^{′}. Hence, by a simple corollary to the Perron-Frobenius Theorem for nonnegative matrices (cf. Karlin and Taylor 1975, Corollary 2.2, pg. 551), we have ${(1-{\lambda}_{1})}^{-1}\le {\mu}_{W}$ and, since *μ*_{W}<*∞* and $\ln y<y-1$ for $0<y<1$,

$$\frac{1}{{\mu}_{W}}\le 1-{\lambda}_{1}<-\ln {\lambda}_{1},$$

which completes the proof.

Corollary 1 implies that, if ${\mu}_{n}\sim \mathbb{E}{X}_{n}\left(\Lambda \right)$, then the Poisson approximation will always overestimate the exact probability as *n*→*∞*. Together with Theorem 3 (ii), this implies that using *μ*_{n}∼−*n* ln*λ*_{1} results in the best Poisson approximation as *n*→*∞*.
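A small worked example may make the gap between the two rates concrete. The numbers below are computed here for illustration, for the pattern $\Lambda=SSS$ in fair Bernoulli trials, and are not quoted from the paper; they assume the essential matrix whose states record the current run length, for which $\lambda_1$ is the largest root of the characteristic equation shown.

```latex
\begin{aligned}
\mu_W &= \frac{1-p^{3}}{q\,p^{3}} = 14, &
\frac{1}{\mu_W} &\approx 0.0714, \\
\lambda_1^{3} &= \tfrac12\lambda_1^{2} + \tfrac14\lambda_1 + \tfrac18
  \;\Rightarrow\; \lambda_1 \approx 0.9196, &
-\ln\lambda_1 &\approx 0.0838 .
\end{aligned}
```

With $\mu_n=n/\mu_W$ the Poisson probability of zero occurrences decays like $e^{-0.0714\,n}$, whereas the exact probability decays like $e^{-0.0838\,n}$, so their ratio grows without bound, in line with Corollary 1.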

For example, consider i.i.d. Bernoulli(*p*) trials and *Λ*=*SSS*. If *p*=1/2, we obtain

If *p*=0.9, we obtain

Thus, *R*(*x*:*E*,*N*)→±*∞* are both possible depending on *x*, the pattern, and the probability structure of the {*X*_{i}}.

## Numerical comparisons

In the previous section we showed that, for fixed *x* and *n*→*∞*, the approximation based on the finite Markov chain imbedding technique outperforms the Poisson and normal approximations. In practice, however, one is interested in the performance of these approximations not only when *x* is fixed and *n*→*∞*, but also when *n* is fixed (at some moderate value) and *x* varies. The reason we consider only large or moderate *n* in our numerical study is that, for small *n*, the FMCI technique easily gives the exact results. In this section we present some numerical experiments to illustrate the advantages (and disadvantages) of the methods discussed.

The approximations we compare are: the finite Markov chain approximation in (13) (FMCI); the Poisson approximation with ${\mu}_{n}=n/{\mu}_{W}\ (\sim \mathbb{E}{X}_{n}(\Lambda ))$, where *μ*_{W} is calculated using (7) (Poisson); the normal approximation given in (6) (Normal); and the large deviation approximation given in Theorem 2 (LD), which is only for right-tail probabilities.

### Reliability of C(k,n:F) systems

A consecutive-*k*-out-of-*n*:F system is a system of *n* independent and linearly connected components, each with common (continuous) lifetime distribution *F*, in which the system fails if *k* consecutive components fail. At a given time *t*>0, the probability a component is working is *p*=1−*F*(*t*) and the probability a single component has failed is *q*=1−*p*. Hence the probability the system has failed is the probability that *k* (or more) consecutive components have failed, which is equivalent to the probability of *k* consecutive failures in a sequence of *n* Bernoulli trials with success probability *p*. Barbour et al. (1995) present a table of various bounds for system reliability based on a Poisson approximation and a compound Poisson approximation and compare these to bounds found in Fu (1985). Table 1 shows the exact probabilities and relative errors for the FMCI and Poisson approximations as well as the compound Poisson approximation in Barbour et al. (1995) (CP).

**Table 1: Approximation errors for *C*(*k*,*n* : *F*) systems.** The Exact column is the exact reliability; the FMCI, Poisson and CP columns are relative errors.

| n | k | q | Exact | FMCI | Poisson | CP |
|---|---|---|---|---|---|---|
| 5 | 2 | 0.01 | 0.99960 | 0.00000 | -0.00010 | 0.00000 |
| 5 | 2 | 0.10 | 0.96309 | 0.00000 | -0.00788 | 0.00119 |
| 5 | 2 | 0.25 | 0.79980 | -0.00002 | -0.02697 | 0.04654 |
| 10 | 2 | 0.01 | 0.99911 | 0.00000 | -0.00010 | 0.00000 |
| 10 | 2 | 0.10 | 0.91975 | 0.00000 | -0.00728 | 0.00312 |
| 10 | 2 | 0.25 | 0.61180 | 0.00000 | -0.00869 | 0.12266 |
| 10 | 4 | 0.01 | 1.00000 | 0.00000 | 0.00000 | 0.00000 |
| 10 | 4 | 0.10 | 0.99936 | 0.00000 | -0.00026 | 0.00000 |
| 10 | 4 | 0.25 | 0.97855 | 0.00000 | -0.00776 | 0.00038 |
| 50 | 2 | 0.01 | 0.99516 | 0.00000 | -0.00010 | 0.00000 |
| 50 | 2 | 0.10 | 0.63633 | 0.00000 | -0.00251 | 0.01871 |
| 50 | 2 | 0.25 | 0.07173 | 0.00000 | 0.14441 | 0.96838 |
| 50 | 4 | 0.01 | 1.00000 | 0.00000 | 0.00000 | 0.00000 |
| 50 | 4 | 0.10 | 0.99577 | 0.00000 | -0.00026 | 0.00000 |
| 50 | 4 | 0.25 | 0.86897 | 0.00000 | -0.00663 | 0.00312 |
| 100 | 2 | 0.01 | 0.99024 | 0.00000 | -0.00010 | 0.00000 |
| 100 | 2 | 0.10 | 0.40151 | 0.00000 | 0.00343 | 0.03854 |
| 100 | 2 | 0.25 | 0.00492 | 0.00000 | 0.36933 | 2.97133 |
| 100 | 4 | 0.01 | 1.00000 | 0.00000 | 0.00000 | 0.00000 |
| 100 | 4 | 0.10 | 0.99129 | 0.00000 | -0.00026 | 0.00001 |
| 100 | 4 | 0.25 | 0.74908 | 0.00000 | -0.00523 | 0.00656 |
| 500 | 4 | 0.20 | 0.52721 | 0.00000 | -0.00086 | 0.00611 |
| 1,000 | 4 | 0.20 | 0.27696 | 0.00000 | 0.00183 | 0.01232 |
| 10,000 | 5 | 0.20 | 0.07710 | 0.00000 | 0.00183 | 0.00560 |

The FMCI approximation performs very well for the parameters tested here. As expected, the Poisson and compound Poisson approximations perform well when *n**q*^{k} is relatively small. When the reliability of the system is relatively low, the Poisson and compound Poisson approximations begin to degrade.
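Several entries of Table 1 can be reproduced with a short sketch. It rests on assumptions not spelled out in this text: the exact reliability is computed as **ξ**_{0}**N**^{n}**1**′ for a run of *k* failures; the FMCI value for *x*=0 is taken to be *a**λ*_{1}^{n} with *a* obtained from left and right dominant eigenvectors (assuming *λ*_{1} is simple); the Poisson value is exp(−*n*/*μ*_{W}); and the relative error is (*A*−*E*)/min(*A*,*E*). All function names are illustrative.

```python
from math import exp

def failure_run_matrix(k, q):
    """Essential matrix for a run of k failures; state j = current failure-run length."""
    p = 1.0 - q
    N = [[0.0] * k for _ in range(k)]
    for j in range(k):
        N[j][0] += p              # a working component resets the run
        if j + 1 < k:
            N[j][j + 1] += q      # another failure extends the run
    return N

def vec_mat(v, N):
    return [sum(v[i] * N[i][j] for i in range(len(N))) for j in range(len(N))]

def mat_vec(N, v):
    return [sum(N[i][j] * v[j] for j in range(len(N))) for i in range(len(N))]

def power_iteration(step, d, iters=2000):
    """Dominant eigenvalue and eigenvector (max entry scaled to 1) of a primitive matrix."""
    v, lam = [1.0] * d, 1.0
    for _ in range(iters):
        w = step(v)
        lam = max(w)
        v = [x / lam for x in w]
    return lam, v

def relative_error(exact, approx):
    return (approx - exact) / min(exact, approx)

def table_row(n, k, q):
    """Exact reliability and relative errors of the FMCI and Poisson approximations."""
    N = failure_run_matrix(k, q)
    v = [1.0] + [0.0] * (k - 1)                  # xi_0: no failures seen yet
    for _ in range(n):
        v = vec_mat(v, N)
    exact = sum(v)                               # P{no run of k failures among n}
    lam, eta = power_iteration(lambda x: mat_vec(N, x), k)   # right eigenvector
    _, u = power_iteration(lambda x: vec_mat(x, N), k)       # left eigenvector
    a = eta[0] * sum(u) / sum(ui * ei for ui, ei in zip(u, eta))
    fmci = a * lam ** n                          # dominant-eigenvalue approximation
    mu_w = (1 - q ** k) / ((1 - q) * q ** k)     # mean waiting time for k failures
    poisson = exp(-n / mu_w)                     # Poisson(n / mu_W) mass at 0
    return exact, relative_error(exact, fmci), relative_error(exact, poisson)

for n in (5, 10, 100):
    print(n, table_row(n, 2, 0.25))
```

For *k*=2 and *q*=0.25 the printed values agree with the corresponding Exact, FMCI and Poisson entries of Table 1 to the displayed precision; the CP column is not reproduced because its construction is given in Barbour et al. (1995) rather than here.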

### Approximating the distribution of *N*_{n,k}

Recall that *N*_{n,k} is the number of non-overlapping occurrences of *k* consecutive successes in {*X*_{i}} (i.e. *N*_{n,k}=*X*_{n}(*Λ*) with *Λ*=*S* *S*⋯*S* of length *k*). By reversing the roles of success and failure, the reliability of *C*(*k*,*n* : *F*) systems can be related to the distribution of *N*_{n,k}. In this section we present some examples of approximating $\mathbb{P}\{{N}_{n,k}=x\}$ with the approximations FMCI, Normal, Poisson and LD.

The figures show the relative error *R*(*x*:*E*,*A*) in these approximations for (a) *N*_{2000,4}; (b) *N*_{5000,4}; and (c) *N*_{250000,6} when the probability of success is *p*=0.3. On all of the figures, the top axis is on a standard *z*-scale making use of the asymptotic mean and variance of *X*_{n}(*Λ*), namely, $z=(x-n/{\mu}_{W})/\sqrt{n{\sigma}_{W}^{2}/{\mu}_{W}^{3}}$.

We notice that the Finite Markov chain imbedding approximation (FMCI) performs very well in the left tail of the distribution in all cases. Its performance degrades as *x* gets large but its performance is more consistent than both the Poisson and Normal approximations in this case. The large deviation approximation performs well in the right tail in all cases. In (c), the FMCI approximation performs very well throughout most of the support. The Poisson approximations also perform well over most of the *x* considered. The normal approximation performs well in the neighbourhood of $\mathbb{E}{X}_{n}\left(\Lambda \right)$ but not in the tails.
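To place a count *x* on the standard *z*-scale used for the figures, one can standardize with the asymptotic mean *n*/*μ*_{W} and variance *n**σ*_{W}²/*μ*_{W}³ of *X*_{n}(*Λ*). The sketch below does this for success runs, using the well-known closed forms for the waiting-time mean and variance; the function names are illustrative and the values it prints are computed here rather than read from the figures.

```python
from math import sqrt

def run_waiting_moments(k, p):
    """Mean and variance of the waiting time for k consecutive successes."""
    q = 1.0 - p
    mu_w = (1 - p ** k) / (q * p ** k)
    var_w = (1 - (2 * k + 1) * q * p ** k - p ** (2 * k + 1)) / (q ** 2 * p ** (2 * k))
    return mu_w, var_w

def asymptotic_mean_sd(n, k, p):
    """Asymptotic mean n / mu_W and standard deviation sqrt(n * var_W / mu_W^3) of N_{n,k}."""
    mu_w, var_w = run_waiting_moments(k, p)
    return n / mu_w, sqrt(n * var_w / mu_w ** 3)

def z_scale(x, n, k, p):
    """Standardized position of the count x."""
    mean, sd = asymptotic_mean_sd(n, k, p)
    return (x - mean) / sd

print(asymptotic_mean_sd(2000, 4, 0.3))
print(z_scale(25, 2000, 4, 0.3))
```

For *N*_{2000,4} with *p*=0.3 the asymptotic mean is a little over 11 occurrences, so counts in the twenties already lie several standard deviations into the right tail.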

As the probability of success *p* increases, the FMCI approximation still performs very well in the left tail, but its performance tends to degrade more quickly as *x* increases. The Poisson approximations also quickly degrade as *p* increases since $\mathbb{E}{N}_{n,k}$ increases. For larger *p*, the Normal approximation tends to work better near the mean. In the far left tail, the FMCI approximation is preferred and in the far right tail, the LD approximation is preferred.

### Biological sequences

For biological sequences, we consider the patterns *Λ*=*ACG* (*n*=1,000 and 10,000) and *Λ*=*CATTAG* (*n*=500,000). We see that the FMCI approximation again performs very well in the left tail, although, in (b), the performance degrades somewhat as *x* gets large. The large deviation approximation performs very well in the right tail, especially when *x* is greater than 3 standard deviations above the mean. While it is difficult to give a rule of thumb, the FMCI approximation seems to perform very well when $x\le \mathcal{O}\left({n}^{1/2}\right)$. The normal approximation works best within a few standard deviations of the mean and performs best in this region when $\mathbb{E}{X}_{n}\left(\Lambda \right)$ is relatively large.

## Discussion and conclusions

The finite Markov chain imbedding approximations (FMCI and LD) provide an alternative to the usual normal and Poisson approximations for the distributions of runs and patterns. While the FMCI approximation is simple, accurate and fast, it has one disadvantage compared with the normal and Poisson approximations: it requires the use of the FMCI technique, which is non-traditional and less known in the Statistics community, except in the area of system reliability (*cf.* Cui et al. 2010). On the other hand, the FMCI technique does not require the rather strong conditions necessary for the Poisson techniques, such as *n**p*^{k}→*λ*. This condition is seldom satisfied in practical applications. For example, in DNA sequence analysis, the probabilities *p*_{A}, *p*_{C}, *p*_{G} and *p*_{T} do not tend to 0 as *n* increases. They may not all be in the neighbourhood of 1/4 but they are bounded away from 0.

For all of the numeric results in the previous section, the exact probabilities $\mathbb{P}\{{X}_{n}(\Lambda )=x\}$ are obtained via the FMCI technique and their CPU times ranged from a few seconds to less than a minute, even in the case of *Λ*=*C* *A* *T* *T* *A* *G* and *n*=500,000. Based on our experience, if the length of the pattern is less than 20 and *n* is less than 1,000,000, the exact probability should be computed.

## Declarations

### Acknowledgements

This work was supported, in part, by the Natural Sciences and Engineering Research Council of Canada.

## References

- Arratia R, Goldstein L, Gordon L: Poisson approximation and the Chen-Stein method. *Stat. Sci* 1990, 5(4):403–434.
- Asmussen S: *Applied Probability and Queues*. Springer, New York; 2003.
- Balakrishnan N, Koutras MV: *Runs and Scans with Applications. Wiley Series in Probability and Statistics*. Wiley-Interscience [John Wiley & Sons], New York; 2002.
- Barbour AD, Eagleson GK: Poisson approximation for some statistics based on exchangeable trials. *Adv. Appl. Probab* 1983, 15(3):585–600.
- Barbour AD, Eagleson GK: Poisson convergence for dissociated statistics. *J. Roy. Statist. Soc. Ser. B* 1984, 46(3):397–402.
- Barbour AD, Eagleson GK: An improved Poisson limit theorem for sums of dissociated random variables. *J. Appl. Probab* 1987, 24(3):586–599.
- Barbour AD, Hall P: On the rate of Poisson convergence. *Math. Proc. Cambridge Philos. Soc* 1984, 95(3):473–480.
- Barbour AD, Chryssaphinou O: Compound Poisson approximation: a user’s guide. *Ann. Appl. Probab* 2001, 11(3):964–1002.
- Barbour AD, Holst L, Janson S: *Poisson Approximation. Oxford Studies in Probability*. Oxford Science Publications; 1992a.
- Barbour AD, Chen LHY, Loh W-L: Compound Poisson approximation for nonnegative random variables via Stein’s method. *Ann. Probab* 1992b, 20(4):1843–1866.
- Barbour AD, Chryssaphinou O, Roos M: Compound Poisson approximation in reliability theory. *IEEE T. Reliab* 1995, 44(3):398–402.
- Barbour AD, Chryssaphinou O, Roos M: Compound Poisson approximation in systems reliability. *Naval Res. Logist* 1996, 43(2):251–264.
- Blom G, Thorburn D: How many random digits are required until given sequences are obtained? *J. Appl. Probab* 1982, 19(3):518–531.
- Chen LHY: Poisson approximation for dependent trials. *Ann. Probab* 1975, 3(3):534–545.
- Cui L, Xu Y, Zhao X: Developments and applications of the finite Markov chain imbedding approach in reliability. *IEEE T. Reliab* 2010, 59(4):685–690.
- Fu JC: Reliability of a large consecutive-k-out-of-n:F system. *IEEE T. Reliab* 1985, R-34:120–127.
- Fu JC, Johnson BC: Approximate probabilities for runs and patterns in i.i.d. and Markov dependent multi-state trials. *Adv. Appl. Probab* 2009, 41(1):292–308.
- Fu JC, Koutras MV: Distribution theory of runs: a Markov chain approach. *J. Amer. Statist. Assoc* 1994, 89(427):1050–1058.
- Fu JC, Lou WYW: *Distribution Theory of Runs and Patterns and Its Applications*. World Scientific Publishing Co. Inc, River Edge; 2003.
- Fu JC, Lou WYW: On the normal approximation for the distribution of the number of simple or compound patterns in a random sequence of multi-state trials. *Methodol. Comput. Appl. Probab* 2007, 9(2):195–205.
- Fu JC, Johnson BC, Chang Y-M: Approximating the extreme right-hand tail probability for the distribution of the number of patterns in a sequence of multi-state trials. *J. Stat. Plan. Infer* 2012, 142(2):473–480.
- Gerber HU, Li S-YR: The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain. *Stochastic Process. Appl* 1981, 11(1):101–108.
- Godbole AP: Degenerate and Poisson convergence criteria for success runs. *Statist. Probab. Lett* 1990a, 10(3):247–255.
- Godbole AP: Specific formulae for some success run distributions. *Statist. Probab. Lett* 1990b, 10(2):119–124.
- Godbole AP: Poisson approximations for runs and patterns of rare events.
*Adv. Appl. Probab*1991, 23(4):851–865.MathSciNetView ArticleMATHGoogle Scholar - Godbole AP, Schaffner AA: Improved Poisson approximations for word patterns.
*Adv. Appl. Probab*1993, 25(2):334–347.MathSciNetView ArticleMATHGoogle Scholar - Holst L, Kennedy JE, Quine MP: Rates of Poisson convergence for some coverage and urn problems using coupling.
*J. Appl. Probab*1988, 25(4):717–724.MathSciNetView ArticleMATHGoogle Scholar - Karlin S: Statistical signals in bioinformatics.
*Proc. Natl. Acad. Sci. U. S. A*2005, 102(38):13355–13362.View ArticleGoogle Scholar - Karlin S, Taylor HM:
*A First Course in Stochastic Processes*. Academic Press [A subsidiary of Harcourt Brace Jovanovich, Publishers], New York-London; 1975.MATHGoogle Scholar - Kleffe J, Borodovski M: First and second moment of counts of words in random text generated by Markov chains.
*Comp Applic Biosci*1992, 8: 443–441.Google Scholar - Martin J, Regad L, Camproux A-C, Nuel G: Finite Markov chain embedding for the exact distribution of patterns in a set of random sequences. In
*Advances in Data Analysis. Statistics for Industry and Technology*. Edited by: Skiadas CH. Birkhäuser, Boston; 2010.Google Scholar - Meyn SP, Tweedie RL: Markov Chains and Stochastic Stability. Communications and Control Engineering Series. 1993.View ArticleGoogle Scholar
- Nuel G, Regad L, Martin J, Camproux A-C: Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data.
*Algorithm Mol. Biol*2010, 5(1):1–18.View ArticleGoogle Scholar - Schwager SJ: Run probabilities in sequences of Markov-dependent trials.
*J. Amer. Statist. Assoc*1983, 78(381):168–180.MathSciNetView ArticleMATHGoogle Scholar - Seneta E:
*Non-negative Matrices and Markov Chains*. Springer, New York; 1983.MATHGoogle Scholar - Solov’ev AD: A combinatorial identity and its application to the problem on the first occurrence of a rare event.
*Teor. Verojatnost. i Primenen*1966, 11: 313–320.MathSciNetGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.