Approximating the distributions of runs and patterns

Johnson, Brad C; Fu, James C

doi:10.1186/2195-5832-1-5

Research
Open access
Published: 11 June 2014


Journal of Statistical Distributions and Applications volume 1, Article number: 5 (2014)


Abstract

The distribution theory of runs and patterns has been successfully used in a variety of applications including, for example, nonparametric hypothesis testing, reliability theory, quality control, DNA sequence analysis, general applied probability and computer science. The exact distributions of the number of runs and patterns are often very hard to obtain or computationally problematic, especially when the pattern is complex and n is very large. Normal, Poisson and compound Poisson approximations are frequently used to approximate these distributions. In this manuscript, we (i) study the asymptotic relative error of the normal, Poisson, compound Poisson, finite Markov chain imbedding and large deviation approximations; and (ii) provide some numerical studies comparing these approximations with the exact probabilities for moderately sized n. Both theoretical and numerical results show that, in the relative sense, the finite Markov chain imbedding approximation performs the best in the left tail and the large deviation approximation performs best in the right tail.

AMS Subject Classification

Primary 60E05; Secondary 60J10

Introduction and notation

Let $\{X_{i}\}_{i = 1}^{n}$ be a sequence of m-state trials (m≥2) taking values in the set $S = \{s_{1}, \dots, s_{m}\}$ of m symbols. For simplicity, $\{X_{i}\}_{i = 1}^{n}$ will be denoted {X_i} and n will be allowed to be ∞. A simple pattern $Λ = s_{i_{1}} s_{i_{2}} \dots s_{i_{ℓ}}$, of length ℓ, is the juxtaposition of ℓ (not necessarily distinct) symbols from S. Given a simple pattern Λ, we let X_n(Λ) denote the number of either non-overlapping or overlapping occurrences of Λ in the sequence $\{X_{i}\}_{i = 1}^{n}$, where the method of counting will be made clear by the context. The waiting time W(Λ,x) until the x’th occurrence of the simple pattern Λ in $\{X_{i}\}_{i = 1}^{n}$ is thus defined by

W (Λ, x) = inf {n \in N : X_{n} (Λ) = x},

and, by convention, the waiting time for the first occurrence is denoted W(Λ)=W(Λ,1). Finally, we define the inter arrival times

W_{i} (Λ) = W (Λ, i) - W (Λ, i - 1), for i = 1, 2, \dots,

where W(Λ,0):=0.

We say that two patterns Λ₁ and Λ₂ are distinct if neither Λ₁ appears in Λ₂ nor Λ₂ appears in Λ₁. If Λ₁,…,Λ_r are pairwise distinct simple patterns, we define the compound pattern $Λ = ⋃_{i = 1}^{r} Λ_{i}$ , where an occurrence of any Λ_i is considered an occurrence of Λ. For a compound pattern Λ=Λ₁∪⋯∪Λ_r, we similarly define

X_{n} (Λ) = \sum_{j = 1}^{r} X_{n} (Λ_{j}) .

The waiting times W(Λ,x), W(Λ) and W_i(Λ) are then defined as above, and often referred to as sooner waiting times.

From these definitions it is easy to see that, for any simple or compound pattern Λ, x and n, the events {X_n(Λ)<x} and {W(Λ,x)>n} are equivalent and hence

P {X_{n} (Λ) < x} = P {W (Λ, x) > n},

(1)

which provides a convenient way of studying the exact and approximate distribution of X_n(Λ) through the waiting time distributions of W(Λ,x).
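The equivalence in (1) can be checked directly on a small sequence. The following is a minimal sketch, assuming non-overlapping counting by a left-to-right scan that restarts after each completed occurrence; the function names and the test sequence are illustrative and not part of the original development.

```python
def count_nonoverlapping(seq, pattern):
    """X_n(pattern): non-overlapping occurrences found by a left-to-right scan."""
    count, i, ell = 0, 0, len(pattern)
    while i + ell <= len(seq):
        if seq[i:i + ell] == pattern:
            count += 1
            i += ell          # restart the scan after a completed occurrence
        else:
            i += 1
    return count


def waiting_time(seq, pattern, x):
    """W(pattern, x): smallest n with X_n = x, or None if not reached in seq."""
    for n in range(1, len(seq) + 1):
        if count_nonoverlapping(seq[:n], pattern) == x:
            return n
    return None


# Check {X_n < x} <=> {W(x) > n} for every prefix of a hypothetical sequence.
seq, pattern = "SSSSFSSS", "SSS"
for x in (1, 2, 3):
    w = waiting_time(seq, pattern, x)
    for n in range(1, len(seq) + 1):
        lhs = count_nonoverlapping(seq[:n], pattern) < x
        rhs = (w is None) or (w > n)   # None means the x'th occurrence never arrives
        assert lhs == rhs
```

A waiting time that is never reached within the observed sequence is treated as exceeding every n, which is the natural reading of the infimum over an empty set.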

Throughout this paper, unless specified otherwise, we assume that the trials {X_i} are either independent and identically distributed (i.i.d.) or first order Markov dependent; the pattern Λ is either simple or compound; and the counting of occurrences of Λ is in a non-overlapping fashion.

The distribution of the number of runs and patterns in a sequence of multi-state trials or random permutations of a set of integers has been successfully used in various fields in applied probability, statistics and discrete mathematics. Examples include reliability theory, quality control, DNA sequence analysis, psychology, ecology, astronomy, nonparametric tests, successions, and the Eulerian and Simon-Newcomb numbers (the latter 3 being defined for permutations). Two recent books, Balakrishnan and Koutras (2002) and Fu and Lou (2003), provide some scope of the distribution theory of runs and patterns and Martin et al. (2010) and Nuel et al. (2010) provide some extensions to sets of sequences.

Given a pattern Λ, the exact distribution of X_n(Λ) traditionally has been determined using combinatoric analysis on a case by case basis. The formulae for these distributions are often very complex and computationally problematic. Even for many simple patterns, their distributions in terms of combinatoric analysis remain unknown, especially when the {X_i} are Markov dependent multi-state trials.

The waiting time W(Λ) for the first occurrence of certain types of runs and patterns has been studied by many authors. See, for example, Blom and Thorburn (1982), Gerber and Li (1981), Schwager (1983), and Solov’ev (1966). More recently, Fu and Koutras (1994) developed a method for determining the exact distributions of X_n(Λ) and W(Λ) for any simple or compound Λ in either i.i.d. or Markov dependent trials (see also Fu and Lou 2003). The method was referred to as the Finite Markov Chain Imbedding (FMCI) technique, which can be easily described as follows: given a simple or compound pattern Λ, there exists a finite Markov chain {Y_i} defined on a finite state space, say Ω={1,…,d,α}, with an absorbing state α and transition probability matrix of the form

[\begin{array}{cc} N & c \\ 0 & 1 \end{array}],

(2)

where c is a column vector. The distribution of the waiting time for Λ is given by

P {W (Λ) = n} = ξ_{0} N^{n - 1} (I - N) 1^{'}

(3)

where ξ₀ is the initial distribution, N is the essential transition probability matrix (i.e. the sub-stochastic matrix consisting of only the transient states of {Y_i}) as defined in (2), I is a d×d identity matrix and 1=(1,1,…,1) is a 1×d row-vector. Furthermore, the random variable X_n(Λ), the number of occurrences of Λ in {X_i}, is also finite Markov chain imbeddable and its distribution is given by

P {X_{n} (Λ) < x} = P {W (Λ, x) > n} = ξ_{0} N_{x}^{n} 1^{'},

(4)

where the essential transition probability matrix N_x has the form

N_{x} = [\begin{array}{ccccc} N & C & & & 0 \\ & N & C & & \\ & & ⋱ & ⋱ & \\ & & & N & C \\ 0 & & & & N \end{array}],

(5)

the matrix N is given by (2), and the matrix C defines the “continuation” transition probabilities from one occurrence to the next and depends on c in (2).
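To make (3) and (4) concrete, the sketch below builds the essential matrix N for a run of k consecutive successes in Bernoulli(p) trials and evaluates both formulas by repeated matrix-vector products. The choice of state space (the current run length 0,…,k−1) and the names `run_chain`, `waiting_time_pmf` and `prob_no_occurrence` are assumptions made for illustration; only the case x=1 of (4), where N_1=N, is covered.

```python
def mat_vec(M, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def run_chain(p, k):
    """Essential matrix N for a success run of length k (state j = current run length)."""
    q = 1.0 - p
    N = [[0.0] * k for _ in range(k)]
    for j in range(k):
        N[j][0] = q              # a failure resets the run
        if j + 1 < k:
            N[j][j + 1] = p      # a success extends the run; from k-1 it absorbs
    return N


def waiting_time_pmf(N, xi0, n):
    """P{W = n} = xi0 N^(n-1) (I - N) 1', as in (3)."""
    v = [1.0 - s for s in mat_vec(N, [1.0] * len(N))]   # (I - N) 1'
    for _ in range(n - 1):
        v = mat_vec(N, v)
    return sum(a * b for a, b in zip(xi0, v))


def prob_no_occurrence(N, xi0, n):
    """P{X_n = 0} = P{W > n} = xi0 N^n 1', i.e. (4) with x = 1."""
    v = [1.0] * len(N)
    for _ in range(n):
        v = mat_vec(N, v)
    return sum(a * b for a, b in zip(xi0, v))


N = run_chain(0.5, 3)            # pattern SSS with p = 1/2
xi0 = [1.0, 0.0, 0.0]            # start with no successes accumulated
print(waiting_time_pmf(N, xi0, 3), prob_no_occurrence(N, xi0, 4))
```

Applying N to a vector n times costs O(n d²) operations and avoids forming N^n explicitly, which is usually sufficient for the moderate n considered here.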

If the pattern Λ is long and complex and n is very large, then the computation of $P {X_{n} (Λ) = x}$ can become problematic and, to overcome this problem, various asymptotic approximations have been developed for these probabilities.

In real applications, if the exact distribution is not available or is hard to compute, it is important to know which approximations perform well and are easy to compute. Furthermore, it is important to know how these approximations perform with respect to each other and the exact distribution from both a theoretical and numerical standpoint. The aims of this manuscript are two-fold: (i) we first study the asymptotic relative error of the normal, Poisson (or compound Poisson), and FMCI approximations with respect to the exact distribution; and (ii) we then provide a numerical study comparing these three approximations with the exact probabilities in cases where x is fixed and n→∞ and when n is fixed and x varies. As an important byproduct, the FMCI technique allows the normal and Poisson approximations to be applied in more cases, for example, the distribution of compound patterns and patterns in Markov dependent trials.

The approximations

Normal approximation

The normal approximation is one of the most popular for approximating the distribution of the number of runs or patterns X_n(Λ) in Statistics. In general, when Λ is simple or compound, the trials are i.i.d., and the counting is non-overlapping, by appealing to (1) and renewal arguments, it has been shown that X_n(Λ) is asymptotically normally distributed (cf. Fu and Lou 2007; Karlin and Taylor 1975). The form of the approximation is

lim_{n \to \infty} P \{\frac{X_{n} (Λ) - n / μ_{W}}{\sqrt{n σ_{W}^{2} μ_{W}^{- 3}}} \leq u\} = Φ (u),

(6)

where Φ(·) denotes the standard normal distribution function and μ_W and $σ_{W}^{2}$ are the mean and variance of W(Λ) respectively, which are given by

μ_{W} = ξ_{0} {(I - N)}^{- 1} 1^{'}, and

(7)

σ_{W}^{2} = ξ_{0} (I + N) {(I - N)}^{- 2} 1^{'} - μ_{W}^{2} .

(8)

Given a pattern Λ, it is well known that the mean μ_W and the variance $σ_{W}^{2}$ are difficult to obtain via combinatoric arguments, especially when Λ is a compound pattern or the trials are Markov dependent. For example, as pointed out in Karlin (2005) and Kleffe and Borodovski (1992), approximate values of μ_W and $σ_{W}^{2}$ must sometimes be used. Since W(Λ) is finite Markov chain imbeddable, (7) and (8) provide the exact values.
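As an illustration of (7) and (8), the sketch below evaluates both formulas in exact rational arithmetic for Λ=SSS with p=1/2. Solving the linear systems (I−N)y=1′ and (I−N)z=y in place of forming the inverse is an implementation choice, and the helper names are hypothetical.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A y = b exactly by Gauss-Jordan elimination (A square, nonsingular)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def waiting_time_moments(N, xi0):
    """Return (mu_W, sigma_W^2) from (7) and (8)."""
    d = len(N)
    I_minus_N = [[Fraction(int(i == j)) - N[i][j] for j in range(d)] for i in range(d)]
    y = solve(I_minus_N, [Fraction(1)] * d)      # (I - N)^{-1} 1'
    z = solve(I_minus_N, y)                      # (I - N)^{-2} 1'
    mean = sum(a * b for a, b in zip(xi0, y))    # (7)
    Nz = [sum(N[i][j] * z[j] for j in range(d)) for i in range(d)]
    second = sum(a * (zi + nzi) for a, zi, nzi in zip(xi0, z, Nz))  # xi0 (I + N)(I - N)^{-2} 1'
    return mean, second - mean ** 2              # (8)


half = Fraction(1, 2)
N = [[half, half, 0], [half, 0, half], [half, 0, 0]]   # run-length chain for SSS
mean, var = waiting_time_moments(N, [1, 0, 0])
print(mean, var)
```

Using `Fraction` avoids rounding in the two linear solves, so the result can be compared directly with closed-form values when they are available.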

The limit in (6) is appropriate when the sequence of inter arrival times {W_i(Λ)} are i.i.d., which is the case for simple and compound patterns when the {X_i} are i.i.d. and counting is non-overlapping. When occurrences of Λ correspond to a delayed renewal process, which can occur for Markov dependent trials and/or overlapping counting, we could use the mean and variance of W₂(Λ) for the normalizing constants, which are easily obtained by modifying ξ₀ in (7) and (8). Even more general cases can be handled by making use of a functional central limit theorem for Markov chains (see, for example, (Meyn and Tweedie 1993, §17.4) and (Asmussen 2003, Theorem 7.2, pg. 30) for the details).
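For point probabilities, the limit in (6) is typically applied with a continuity correction, integrating the limiting normal density over [x−1/2, x+1/2] (the same form that appears in the proof of Theorem 3). A minimal sketch, with a hypothetical function name and μ_W, σ_W² supplied by the caller, might look as follows.

```python
from math import erf, sqrt


def normal_approx(x, n, mu_w, var_w):
    """Approximate P{X_n = x} by the N(n/mu_W, n var_W / mu_W^3) mass on [x-1/2, x+1/2]."""
    mean = n / mu_w
    sd = sqrt(n * var_w / mu_w ** 3)

    def cdf(t):
        # Normal distribution function with the asymptotic mean and standard deviation.
        return 0.5 * (1.0 + erf((t - mean) / (sd * sqrt(2.0))))

    return cdf(x + 0.5) - cdf(x - 0.5)


# Example: pattern SSS with p = 1/2 (mu_W = 14, var_W = 142) and n = 1400.
probs = [normal_approx(x, 1400, 14.0, 142.0) for x in range(0, 201)]
```

Because the difference of two distribution function values loses precision far in the tails, a log-scale evaluation would be preferable there; this sketch is intended for the central region only.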

Poisson and compound Poisson approximations

It is well known that, in a sequence of Bernoulli (p) trials, if n p→λ as n→∞, then the probability of k successes in n trials can be approximated by a Poisson probability with parameter λ, denoted $P (λ)$ . This idea has been extended to certain patterns Λ and, under certain conditions, the distribution of X_n(Λ) can be approximated by a Poisson distribution with parameter μ_n in the sense that

d_{TV} (ℒ (X_{n} (Λ)), P (μ_{n})) < ε_{n},

(9)

where $ℒ (\cdot)$ denotes the distribution (law) of a random variable and d_TV(·,·) denotes the total variation distance.

The primary tool used to obtain μ_n and the bound ε_n is the Stein-Chen method (Chen 1975), and this method has been refined by various authors, including Arratia et al. (1990), Barbour and Eagleson (1983), Barbour and Eagleson (1984), Barbour and Eagleson (1987), Barbour and Hall (1984), Godbole (1990a), Godbole (1990b), Godbole (1991), Godbole and Schaffner (1993), and Holst et al. (1988). This method has also been extended to compound Poisson approximations for the distributions of runs and patterns; Barbour and Chryssaphinou (2001) provide an excellent theoretical review of these approximations.

In practice, $μ_{n} = E X_{n} (Λ)$ or the expectation of a closely related run statistic is used (cf. Balakrishnan and Koutras 2002, §5.2.3) so that, in the former case,

P {X_{n} (Λ) = x} \approx \frac{{(E X_{n} (Λ))}^{x}}{x!} exp \{- E X_{n} (Λ)\} .

(10)

Finding $E X_{n} (Λ)$ and the bound ε_n is usually done on a case by case basis. For the mathematical details, the books (Barbour et al. 1992a) and (Balakrishnan and Koutras 2002) are recommended.
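Once a value for μ_n has been chosen, evaluating (10) is straightforward. The sketch below computes the right-hand side of (10) in log space so that large x or μ_n do not overflow; the function name is hypothetical, and the example uses μ_n=n/μ_W as a stand-in for E X_n(Λ), which is an assumption justified only asymptotically.

```python
from math import exp, lgamma, log


def poisson_approx(x, mu_n):
    """Right-hand side of (10): mu_n^x / x! * exp(-mu_n), evaluated in log space."""
    return exp(x * log(mu_n) - lgamma(x + 1) - mu_n)


# Example: pattern SSS with p = 1/2 (mu_W = 14), n = 200, so mu_n = 200 / 14.
mu_n = 200 / 14.0
approx_zero = poisson_approx(0, mu_n)   # approximates P{X_200 = 0}
```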

Let $P_{c} (λ, ν)$ denote the compound Poisson distribution, that is, the distribution of the random variable $\sum_{j = 1}^{M} Y_{j}$ where the random variable M has a Poisson distribution with parameter λ and the Y_j are i.i.d. having distribution ν. A compound Poisson distribution for approximating nonnegative random variables was suggested in Barbour et al. (1992b) (see also Barbour et al. (1995, 1996)). The approximation is formulated similarly to the Poisson approximation:

d_{TV} (ℒ (X_{n} (Λ)), P_{c} (λ, ν)) < ε_{n} .

(11)

The distribution of N_n,k, the number of non-overlapping occurrences of k consecutive successes in n i.i.d. Bernoulli trials, is one of the most important in this area and one of the most studied in the literature. Reversing the roles of S (success) and F (failure), the reliability of a consecutive-k-out-of-n:F system, denoted C(k,n : F), is given by $P {N_{n, k} = 0}$ . Even in this simple case (i.e. Λ=S S⋯S), there are several ways to apply the Poisson approximation techniques. For example, (Godbole 1991, Theorem 2) shows that approximating N_n,k with a $P (E N_{n, k})$ distribution works well if certain conditions hold. Godbole and Schaffner (1993, pg. 340) suggest an improved Poisson approximation for word patterns.

The primary difficulty in applying the Poisson approximation is the determination of the optimal parameter μ_n, which is highly dependent on the structure of the pattern Λ. In particular, if Λ is long and has several uneven overlapping sub-patterns, then finding μ_n by their method can be very tedious. In the sequel, we show that even the (asymptotic) best choice for μ_n for Poisson approximations does not perform well in the relative sense.

FMCI approximations

Approximations based on the FMCI approach depend on the spectral decomposition of the essential transition probability matrix N.

Let N be a w×w essential transition probability matrix associated with a finite Markov chain {Y_n:n≥0} corresponding to the distribution of the waiting time W(Λ). Let 1>λ₁≥|λ₂|≥⋯≥|λ_w| denote the ordered eigenvalues of N, repeated according to their algebraic multiplicities, with associated (right) eigenvectors $η_{1}^{'}, η_{2}^{'}, \dots, η_{w}^{'}$ . When the geometric multiplicity of λ_i is less than its algebraic multiplicity, we will use vectors of 0’s for the unspecified eigenvectors. The fact that λ₁ can be taken as a positive real number and that η₁ can be taken to be non-negative are consequences of the Perron-Frobenius Theorem for non-negative matrices (cf. Seneta 1981).

Definition 1

We will say that {Y_n:n≥0}, or equivalently, N, satisfies the FMCI Approximation Conditions if

(i) there exist constants a₁,…,a_w such that

$1^{'} = \sum_{i = 1}^{w} a_{i} η_{i}^{'},$

(12)

(ii) λ₁ has algebraic multiplicity g and λ₁>|λ_j| for all j>g.

Verifying these conditions is usually straightforward. They certainly hold if N is irreducible and aperiodic, but they hold in many other cases as well. For example, (12) requires only that 1^′ is in the linear space spanned by $\{η_{1}^{'}, η_{2}^{'}, \dots, η_{w}^{'}\}$ , which can hold even when N is defective (not diagonalizable). Condition (ii) requires that the communication classes corresponding to λ₁ are aperiodic. That is, if Ψ is a communication class and N[Ψ] corresponds to the substochastic matrix N restricted to the states in Ψ, with largest eigenvalue λ₁[Ψ], then all Ψ such that λ₁[Ψ]=λ₁ should be aperiodic. We also mention that the algebraic multiplicity of λ₁ is the number of communication classes Ψ such that λ₁[Ψ]=λ₁.

Fu and Johnson (2009) give the following theorem.

Theorem 1

Let {X_i} be a sequence of i.i.d. trials taking values in S, let Λ be a simple pattern of length ℓ with d×d essential transition probability matrix N and let X_n(Λ) be the number of non-overlapping occurrences of Λ in {X_i}. If N satisfies the FMCI approximation conditions then, for any fixed x≥0,

P {X_{n} (Λ) = x} \sim a^{x + 1} (\binom{n - x (ℓ - 1)}{x}) {(1 - λ_{1})}^{x} λ_{1}^{n - x},

(13)

where $a = \sum_{j = 1}^{g} a_{j} (ξ_{0} η_{j}^{'})$ . If g=1, as is usually the case, then a=a₁(ξ₀η₁^′).

Given a pattern Λ, the approximation in (13) requires finding the Markov chain imbedding associated with the waiting time W(Λ), the essential transition probability matrix N as well as its eigenvalues and associated eigenvectors. Usually, these steps are rather simple and can be easily automated together with (13). Even for very large n and large ℓ, say n=1,000,000 and ℓ=50, the CPU time is negligible. Fu and Johnson (2009) also provide details on extending these results to compound patterns, overlapping counting and Markov dependent trials.
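The steps just described can be sketched for the case g=1. The code below is an illustration rather than the implementation used for the numerical results: it assumes λ₁ is a simple eigenvalue, finds it by power iteration, and obtains the constant a through the left eigenvector v of N, using the standard identity a₁=(v1′)/(vη₁′). All function names are hypothetical.

```python
from math import comb


def dominant_eigen(N, iters=500):
    """Power iteration for lambda_1 with right (eta) and left (v) eigenvectors."""
    d = len(N)
    eta, v, lam = [1.0] * d, [1.0] * d, 0.0
    for _ in range(iters):
        new_eta = [sum(N[i][j] * eta[j] for j in range(d)) for i in range(d)]
        new_v = [sum(v[i] * N[i][j] for i in range(d)) for j in range(d)]
        lam = sum(new_eta) / sum(eta)            # converges to lambda_1
        s_eta, s_v = sum(new_eta), sum(new_v)
        eta = [e / s_eta for e in new_eta]       # renormalise to avoid underflow
        v = [u / s_v for u in new_v]
    return lam, eta, v


def fmci_constant(N, xi0):
    """Return (lambda_1, a) with a = a_1 (xi0 eta_1'), assuming g = 1."""
    lam, eta, v = dominant_eigen(N)
    dot = lambda u, w: sum(a * b for a, b in zip(u, w))
    a = dot(v, [1.0] * len(N)) * dot(xi0, eta) / dot(v, eta)
    return lam, a


def fmci_approx(n, x, ell, lam, a):
    """Right-hand side of (13)."""
    return a ** (x + 1) * comb(n - x * (ell - 1), x) * (1 - lam) ** x * lam ** (n - x)


# Pattern SSS with p = 1/2: compare (13) at x = 0 with the exact xi0 N^n 1'.
N = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.5, 0.0, 0.0]]
xi0 = [1.0, 0.0, 0.0]
lam, a = fmci_constant(N, xi0)
n = 200
vec = [1.0] * 3
for _ in range(n):
    vec = [sum(N[i][j] * vec[j] for j in range(3)) for i in range(3)]
exact = vec[0]
approx = fmci_approx(n, 0, 3, lam, a)
print(lam, approx / exact)
```

Power iteration is adequate here because condition (ii) guarantees a spectral gap; for larger or nearly periodic chains a general eigenvalue routine would be the safer choice.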

For the purpose of comparing these approximations, we prefer to write (13) as

\begin{align} P {X_{n} (Λ) = x} & \sim a^{x + 1} {(\frac{1 - λ_{1}}{λ_{1}})}^{x} (\binom{n - x (ℓ - 1)}{x}) exp {n ln λ_{1}} \end{align}

(14)

Note that the approximation has three parts: a constant part; a polynomial in n of degree x; and a third (dominant) part which converges to 0 exponentially fast as n→∞.

More precisely, the FMCI approximation in (13) may be written as

P {X_{n} (Λ) = x} = a^{x + 1} {(\frac{1 - λ_{1}}{λ_{1}})}^{x} (\binom{n - x (ℓ - 1)}{x}) exp {n ln λ_{1}} [1 + o ({|\frac{λ_{g + 1}}{λ_{1}}|}^{n / (x + 1) - ℓ})] .

(15)

Since |λ_{g+1}|<λ₁, the term |λ_{g+1}/λ₁|^{n/(x+1)−ℓ} tends to 0 exponentially as n→∞ and hence is negligible if n/(x+1)−ℓ is moderate or large (say ≥50).

Large deviation approximation

Fu et al. (2012) provide the following large deviation approximation for right-tail probabilities for the number of non-overlapping occurrences for simple patterns Λ. The reasons for providing only the right-tail large deviation approximation are (i) all of the above mentioned approximations fail to approximate the extreme right-tail probabilities and (ii) the FMCI approximation provides an accurate approximation for left-tail probabilities.

Theorem 2

Let $ε = x μ_{W}^{2} / (1 + x μ_{W})$ and let

φ_{W} (t) = 1 + (e^{t} - 1) ξ_{0} {(I - e^{t} N)}^{- 1} 1^{'},

(16)

be the moment generating function of W(Λ). Then

P {X_{n} (Λ) \geq E X_{n} (Λ) + nx} = e^{- nβ (ε, Λ)} \frac{1}{\sqrt{n}} \{b_{0} + b_{1} n^{- 1} + \dots + b_{m} n^{- m} + O (n^{- m - 1})\},

(17)

where

β (x, Λ) = (\frac{1}{μ_{W}} + x) h (ε, τ) = (\frac{1}{μ_{W}} + x) [- \frac{τ μ_{W}}{1 + x μ_{W}} - ln φ_{W (Λ)} (- τ)],

(18)

h (ε, t) = εt - ln φ_{μ_{W} - W (Λ)} (t),

τ is the solution to h^′(ε,τ)=0, and

\begin{aligned} b_{0} & = \frac{1}{στ \sqrt{2 π (μ_{W}^{- 1} + x)}}, \\ b_{1} & = \frac{1}{στ \sqrt{2 π {(μ_{W}^{- 1} + x)}^{3}}} \{- \frac{1}{σ^{2} τ^{2}} + \frac{h^{(3)} (ε, τ)}{2 τ σ^{4}} - \frac{h^{(4)} (ε, τ)}{8 σ^{4}} - \frac{5 {(h^{(3)} (ε, τ))}^{2}}{24 σ^{6}}\}, \\ σ & = \sqrt{- h^{′′} (ε, τ)} . \end{aligned}

(19)
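The moment generating function in (16) is the only pattern-specific ingredient of Theorem 2, so it is worth seeing how it can be evaluated. The sketch below replaces the matrix inverse by the series (I−e^t N)⁻¹1′=Σ_k (e^t N)^k 1′, which is a substitution made here for simplicity and converges only when e^t λ₁<1; the function name and the truncation length are assumptions.

```python
from math import exp


def mgf_waiting_time(N, xi0, t, terms=5000):
    """phi_W(t) from (16), with (I - e^t N)^{-1} 1' summed as a truncated series."""
    d = len(N)
    scale = exp(t)
    vec = [1.0] * d              # holds (e^t N)^k 1'
    total = [0.0] * d            # running sum over k
    for _ in range(terms):
        total = [s + u for s, u in zip(total, vec)]
        vec = [scale * sum(N[i][j] * vec[j] for j in range(d)) for i in range(d)]
    return 1.0 + (scale - 1.0) * sum(a * s for a, s in zip(xi0, total))


# Pattern SSS with p = 1/2: the derivative of phi_W at 0 should be close to mu_W.
N = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.5, 0.0, 0.0]]
xi0 = [1.0, 0.0, 0.0]
h = 1e-4
slope = (mgf_waiting_time(N, xi0, h) - mgf_waiting_time(N, xi0, -h)) / (2 * h)
print(slope)
```

Finding τ from h′(ε,τ)=0 then reduces to a one-dimensional root search on derivatives of ln φ_W, which is not attempted in this sketch.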

Comparisons and relative error

For a given n, x and pattern Λ, we define the relative error of an approximation with respect to the exact probability $P {X_{n} (Λ) = x}$ as

R (x : E, A) = sgn (A - E) [max (\frac{E}{A}, \frac{A}{E}) - 1],

where A stands for the approximate probability and E stands for the exact probability $P {X_{n} (Λ) = x}$ . This quantity, R(x:E,A), goes from −∞ to ∞ and treats the importance of overestimation the same as underestimation. It is clear that R(x:E,A)>0 implies that the approximation is overestimating the exact probability and that R(x:E,A)<0 implies that the approximation is underestimating the exact probability. Since, for fixed x, the probability $P {X_{n} (Λ) = x}$ converges to 0 exponentially fast as n→∞, it follows that R(x:E,A)→±∞ implies that the approximation tends to 0 with the wrong rate. If R(x:E,A) is near 0 then the approximation is close to the exact probability $P {X_{n} (Λ) = x}$ .
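The definition of R(x:E,A) translates directly into code. The following minimal sketch uses a hypothetical function name and assumes both probabilities are strictly positive.

```python
def relative_error(exact, approx):
    """R(x : E, A) = sgn(A - E) * [max(E / A, A / E) - 1] for positive E and A."""
    ratio = max(exact / approx, approx / exact)
    sign = (approx > exact) - (approx < exact)   # +1 overestimate, -1 underestimate
    return sign * (ratio - 1.0)


# Doubling and halving the exact value give errors of equal size and opposite sign.
print(relative_error(0.25, 0.5), relative_error(0.5, 0.25))
```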

Note that R(x:E,A) is a function of x, n and the method of approximation used. The following theorem provides the asymptotic relative error for the Normal approximation (N), the Poisson approximation (P(μ_n)) and the finite Markov chain imbedding approximation (F).

Theorem 3

Let {X_i} be a sequence of i.i.d. multi-state trials taking values in S and let Λ be a simple pattern defined on S. Then, for every fixed x, we have,

(i) lim_{n \to \infty} R (x : E, F) = 0;

(20)

(ii) lim_{n \to \infty} R (x : E, P (μ_{n})) = \begin{cases} \infty, & if {lim sup}_{n} μ_{n} / n < - ln λ_{1}; \\ c (x), & if {lim}_{n} μ_{n} / n = - ln λ_{1}; \\ - \infty, & if {lim inf}_{n} μ_{n} / n > - ln λ_{1}; \end{cases}

(21)

(iii) lim_{n \to \infty} R (x : E, N) = \begin{cases} \infty, & if μ_{W} / 2 σ_{W}^{2} < - ln λ_{1}; \\ - \infty, & if μ_{W} / 2 σ_{W}^{2} \geq - ln λ_{1}; \end{cases}

(22)

where the exact probability is computed using (4) and

c (x) = a^{x + 1} {(\frac{λ_{1} - 1}{λ_{1} ln λ_{1}})}^{x} - 1 .

Proof

Given a pattern Λ and x, for the finite Markov chain imbedding approximation we have

lim_{n \to \infty} \frac{P {X_{n} (Λ) = x}}{a^{x + 1} {(\frac{1 - λ_{1}}{λ_{1}})}^{x} (\binom{n - x (ℓ - 1)}{x}) exp {n ln λ_{1}}} = 1

and hence (i) follows immediately from the definition of R(x:E,A) and Theorem 1.

For the Poisson approximation we have, since E/F∼1 by (i),

\frac{E}{P (μ_{n})} = \frac{E}{F} \times \frac{F}{P (μ_{n})} \sim \frac{F}{P (μ_{n})}

and hence

\begin{align} \frac{E}{P (μ_{n})} & = \frac{P {X_{n} (Λ) = x}}{\frac{μ_{n}^{x}}{x!} exp {- μ_{n}}} \\ \sim \frac{a^{x + 1} {(\frac{1 - λ_{1}}{λ_{1}})}^{x} (\binom{n - x (ℓ - 1)}{x}) exp {n ln λ_{1}}}{\frac{μ_{n}^{x}}{x!} exp {- μ_{n}}} . \end{align}

(23)

If ${liminf}_{n} μ_{n} / n > - ln λ_{1}$ then exp{n lnλ₁+μ_n} tends to ∞ exponentially fast, which overrides the polynomial term, so that E/P(μ_n)→∞ and hence R(x:E,P(μ_n))→−∞ as n→∞ for all fixed x. Similarly, if ${limsup}_{n} μ_{n} / n < - ln λ_{1}$ , then exp{n lnλ₁+μ_n} tends to 0 exponentially fast and R(x:E,P(μ_n))→∞ as n→∞ for all fixed x. Furthermore, if ${lim}_{n} μ_{n} / n = - ln λ_{1}$ , then the ratio yields

lim_{n \to \infty} R (x : E, P (- n ln λ_{1})) = a^{x + 1} {(\frac{λ_{1} - 1}{λ_{1} ln λ_{1}})}^{x} - 1

and this completes the proof of (ii). Note also that, if ${limsup}_{n} μ_{n} / n > - ln λ_{1}$ and ${liminf}_{n} μ_{n} / n < - ln λ_{1}$ , then ${lim}_{n} R (x : E, P (μ_{n}))$ will not exist.

For the normal approximation we have that X_n(Λ) is approximately normal with mean n/μ_W and variance $n σ_{W}^{2} / μ_{W}^{3}$ and hence

P {X_{n} (Λ) = x} \approx N = \int_{x - 1 / 2}^{x + 1 / 2} \frac{1}{\sqrt{2 πn σ_{W}^{2} μ_{W}^{- 3}}} exp \{- \frac{{(t - n / μ_{W})}^{2}}{2 n σ_{W}^{2} μ_{W}^{- 3}}\} dt

Hence, provided n>μ_W(x+1/2), we have

\begin{align} N \leq \frac{1}{\sqrt{2 πn σ_{W}^{2} μ_{W}^{- 3}}} exp \{- \frac{{(x + 1 / 2 - n / μ_{W})}^{2}}{2 n σ_{W}^{2} μ_{W}^{- 3}}\} . \end{align}

Therefore, as in the proof of (ii), we are interested in the asymptotics of F/N, which yields

\begin{align} \frac{F}{N} & \sim \sqrt{\frac{2 πn σ_{W}^{2}}{μ_{W}^{3}}} a^{x + 1} {(\frac{1 - λ_{1}}{λ_{1}})}^{x} (\binom{n - x (ℓ - 1)}{x}) \\ \times exp \{n ln λ_{1} + \frac{{(x + 1 / 2 - n / μ_{W})}^{2}}{2 n σ_{W}^{2} μ_{W}^{- 3}}\} . \end{align}

(24)

We may rewrite the argument of the exponential function as

n [ln λ_{1} + \frac{μ_{W}}{2 σ_{W}^{2}} {(\frac{μ_{W} (x + 1 / 2)}{n} - 1)}^{2}],

making it clear that (24) converges to ∞ if $μ_{W} / 2 σ_{W}^{2} \geq - ln λ_{1}$ and 0 otherwise. Since F/N→∞ means that the normal approximation underestimates the exact probability, R(x:E,N)→−∞ if $μ_{W} / 2 σ_{W}^{2} \geq - ln λ_{1}$ and R(x:E,N)→∞ if $μ_{W} / 2 σ_{W}^{2} < - ln λ_{1}$ and the proof of (iii) is complete.

Theorem 3 (ii) implies that asymptotically (for fixed x and n→∞), the Poisson approximation performs poorly (in the relative sense) regardless of the value μ_n used. When Λ is simple and does not have overlapping sub-patterns, taking $μ_{n} = E X_{n} (Λ)$ is normally recommended for the Poisson approximation (cf. Arratia et al. 1990). In this case, non-overlapping and overlapping counting are equivalent. The following corollary shows that, for fixed x, the Poisson approximation will (asymptotically) always overestimate the exact probability in the following sense.

Corollary 1

Let Λ be a simple pattern defined on an i.i.d. sequence of multi-state trials. For $μ_{n} = E X_{n} (Λ)$ , we have

lim_{n \to \infty} R (x : E, P (μ_{n})) = \infty

for all fixed x.

Proof

Recall that, in this case, X_n(Λ) is a renewal process with i.i.d. inter-renewal times with mean $μ_{W} = EW (Λ)$ and hence, by the elementary renewal theorem, we have $E X_{n} (Λ) / n \to 1 / μ_{W}$ so that $E X_{n} (Λ) \sim n / μ_{W}$ . Therefore, by Theorem 3 (ii), it is sufficient to show that n/μ_W<−n lnλ₁ for all sufficiently large n, or

e^{- 1 / μ_{W}} > λ_{1} .

Now, since $0 < λ_{1} \in ℝ$ is a dominant eigenvalue of N, it follows that: $0 < {(1 - λ_{1})}^{- 1} \in ℝ$ is a dominant eigenvalue of the matrix (I−N)⁻¹=A=(a_{ij}); a_{ij}≥0 with at least one a_{ij}>0; and A 1^′=(I−N)⁻¹1^′≤μ_W1^′. Hence, by a simple corollary to the Perron-Frobenius Theorem for nonnegative matrices (cf. Karlin and Taylor 1975, Corollary 2.2, pg. 551), we have

\frac{1}{1 - λ_{1}} = \underset{n \to \infty}{limsup} {(max_{i, j} | a_{ij}^{(n)} |)}^{1 / n} \leq μ_{W},

where $a_{ij}^{(n)} = {(A^{n})}_{ij}$ . Therefore, provided μ_W<∞,

e^{- 1 / μ_{W}} > 1 - \frac{1}{μ_{W}} \geq λ_{1},

which completes the proof.
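As a quick numerical check of this chain of inequalities, take the values reported in the example for Λ=S S S with p=1/2 later in this section (μ_W=14 and λ₁=0.9196434); the decimal values of the first two terms are rounded here.

```latex
e^{-1/\mu_W} = e^{-1/14} \approx 0.93106 \;>\; 1 - \frac{1}{14} \approx 0.92857 \;\geq\; \lambda_1 = 0.9196434 .
```

Equivalently, 1/μ_W≈0.07143 is smaller than −lnλ₁≈0.08377, so the first case of Theorem 3 (ii) applies for this pattern.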

Corollary 1 implies that, if $μ_{n} \sim E X_{n} (Λ)$ , then the Poisson approximation will always overestimate the exact probability as n→∞. Together with Theorem 3 (ii), this implies that using μ_n∼−n lnλ₁ results in the best Poisson approximation as n→∞.

We also comment that, for the normal approximation, both $μ_{W} / 2 σ_{W}^{2} < - ln λ_{1}$ and $μ_{W} / 2 σ_{W}^{2} \geq - ln λ_{1}$ are possible. As a simple example, suppose we have a sequence of i.i.d. Bernoulli (p) trials and Λ=S S S. If p=1/2, we obtain

μ_{W} = 14, σ_{W}^{2} = 142 and λ_{1} = 0.9196434,

and

\frac{μ_{W}}{2 σ_{W}^{2}} = 0.04929577 < - ln λ_{1} = 0.08376932 .

However, with p=0.9, we obtain

μ_{W} = 3.717421, σ_{W}^{2} = 2.145694 and λ_{1} = 0.5419067;

and

\frac{μ_{W}}{2 σ_{W}^{2}} = 0.8662513 > - ln λ_{1} = 0.6126614 .

Thus, R(x:E,N)→±∞ are both possible depending on x, the pattern, and the probability structure of the {X_i}.

Numerical comparisons

In the previous section we showed that, for fixed x and n→∞, the approximation based on the finite Markov chain imbedding technique outperforms the Poisson and normal approximations. In practice, however, one is interested in the performance of these approximations not only when x is fixed and n→∞, but also when n is fixed (at some moderate value) and x varies. The reason we consider only large or moderate n in our numerical study is that, for small n, the FMCI technique easily gives the exact results. In this section we present some numerical experiments to illustrate the advantages (and disadvantages) of the methods discussed.

The approximations we compare are: the finite Markov chain approximation in (13) (FMCI); the Poisson approximation with $μ_{n} = n / μ_{W} (\sim E X_{n} (Λ))$ where μ_W is calculated using (7) (Poisson); the normal approximation given in (6) (Normal); and the large deviation approximation given in Theorem 2 (LD), which is only for right-tail probabilities.

Reliability of C(k,n:F) systems

A consecutive-k-out-of-n:F system is a system of n independent and linearly connected components, each with common (continuous) lifetime distribution F, in which the system fails if k consecutive components fail. At a given time t>0, the probability a component is working is p=1−F(t) and the probability a single component has failed is q=1−p. Hence the probability the system has failed is the probability that k (or more) consecutive components have failed, which is equivalent to the probability of k consecutive failures in a sequence of n Bernoulli trials with success probability p. Barbour et al. (1995) present a table of various bounds for system reliability based on a Poisson approximation and a compound Poisson approximation and compare these to bounds found in Fu (1985). Table 1 shows the exact probabilities and relative errors for the FMCI and Poisson approximations as well as the compound Poisson approximation in Barbour et al. (1995) (CP).

Table 1 Approximation errors for C(k,n : F) systems


The FMCI approximation performs very well for the parameters tested here. As expected, the Poisson and compound Poisson approximations perform well when n q^k is relatively small. When the reliability of the system is relatively low, the Poisson and compound Poisson approximations begin to degrade.

Approximating the distribution of N_n,k

Recall that N_n,k is the number of non-overlapping occurrences of k consecutive successes in {X_i} (i.e. N_n,k=X_n(Λ) with Λ=S S⋯S of length k). By reversing the roles of success and failure, the reliability of C(k,n : F) systems can be related to the distribution of N_n,k. In this section we present some examples of approximating $P {N_{n, k} = x}$ with the approximations FMCI, Normal, Poisson and LD.

Figure 1 shows the relative error R(x:E,A) in these approximations for (a) N_2000,4; (b) N_5000,4; and (c) N_250000,6 when the probability of success is p=0.3. On all of the figures, the top axis is on a standard z-scale making use of the asymptotic mean and variance of X_n(Λ) — namely,

z = \frac{x - n / μ_{W}}{\sqrt{n σ_{W}^{2} μ_{W}^{- 3}}} .

We notice that the Finite Markov chain imbedding approximation (FMCI) performs very well in the left tail of the distribution in all cases. Its performance degrades as x gets large but its performance is more consistent than both the Poisson and Normal approximations in this case. The large deviation approximation performs well in the right tail in all cases. In (c), the FMCI approximation performs very well throughout most of the support. The Poisson approximations also perform well over most of the x considered. The normal approximation performs well in the neighbourhood of $E X_{n} (Λ)$ but not in the tails.

As the probability of success p increases, the FMCI approximation still performs very well in the left tail, but its performance tends to degrade more quickly as x increases. The Poisson approximations also quickly degrade as p increases since $E N_{n, k}$ increases. For larger p, the Normal approximation tends to work better near the mean. In the far left tail, the FMCI approximation is preferred and in the far right tail, the LD approximation is preferred.

Biological sequences

Sequences of DNA nucleotides are of great interest (as are sequences of amino acids and other biological sequences). Figure 2 shows the relative errors for approximating $P {X_{n} (Λ) = x}$ with Λ=A C G (n=1,000 and 10,000) and Λ=C A T T A G (n=500,000). We see that the FMCI approximation again performs very well in the left tail, although, in (b), the performance degrades somewhat as x gets large. The large deviation approximation performs very well in the right tail, especially when x is greater than 3 standard deviations above the mean. While it is difficult to give a rule of thumb, the FMCI approximation seems to perform very well when $x \leq O (n^{1 / 2})$ . The normal approximation works best within a few standard deviations of the mean and performs best in this region when $E X_{n} (Λ)$ is relatively large.

Discussion and conclusions

The finite Markov chain imbedding approximations (FMCI and LD) provide an alternative to the usual normal and Poisson approximations for the distributions of runs and patterns. While the FMCI approximation is simple, accurate and fast, it has one disadvantage over the normal and Poisson approximations — it requires the use of the FMCI technique, which is non-traditional and less known in the Statistics community, except in the area of system reliability (cf. Cui et al. 2010). On the other hand, the FMCI technique does not require the rather strong conditions necessary for the Poisson techniques, such as n p^k→λ. This condition is seldom satisfied in practical applications. For example, in DNA sequence analysis, the probabilities p_A, p_C, p_G and p_T do not tend to 0 as n increases. They may not all be in the neighbourhood of 1/4 but they are bounded away from 0.

For all of the numeric results in the previous section, the exact probabilities $P {X_{n} (Λ) = x}$ are obtained via the FMCI technique, and their CPU times ranged from a few seconds to less than a minute, even in the case of Λ=C A T T A G and n=500,000. Based on our experience, if the length of the pattern is less than 20 and n is less than 1,000,000, the exact probability should be computed.

References

Arratia R, Goldstein L, Gordon L: Poisson approximation and the Chen-Stein method. Stat. Sci 1990, 5(4):403–434.
Asmussen S: Applied Probability and Queues. Springer, New York; 2003.
Balakrishnan N, Koutras MV: Runs and Scans with Applications. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], New York; 2002.
Barbour AD, Eagleson GK: Poisson approximation for some statistics based on exchangeable trials. Adv. Appl. Probab 1983, 15(3):585–600.
Barbour AD, Eagleson GK: Poisson convergence for dissociated statistics. J. Roy. Statist. Soc. Ser. B 1984, 46(3):397–402.
Barbour AD, Eagleson GK: An improved Poisson limit theorem for sums of dissociated random variables. J. Appl. Probab 1987, 24(3):586–599.
Barbour AD, Hall P: On the rate of Poisson convergence. Math. Proc. Cambridge Philos. Soc 1984, 95(3):473–480.
Barbour AD, Chryssaphinou O: Compound Poisson approximation: a user’s guide. Ann. Appl. Probab 2001, 11(3):964–1002.
Barbour AD, Holst L, Janson S: Poisson Approximation. Oxford Studies in Probability. Oxford Science Publications; 1992a.
Barbour AD, Chen LHY, Loh W-L: Compound Poisson approximation for nonnegative random variables via Stein’s method. Ann. Probab 1992b, 20(4):1843–1866.
Barbour AD, Chryssaphinou O, Roos M: Compound Poisson approximation in reliability theory. IEEE T. Reliab 1995, 44(3):398–402.
Barbour AD, Chryssaphinou O, Roos M: Compound Poisson approximation in systems reliability. Naval Res. Logist 1996, 43(2):251–264.
Blom G, Thorburn D: How many random digits are required until given sequences are obtained? J. Appl. Probab 1982, 19(3):518–531.
Chen LHY: Poisson approximation for dependent trials. Ann. Probab 1975, 3(3):534–545.
Cui L, Xu Y, Zhao X: Developments and applications of the finite Markov chain imbedding approach in reliability. IEEE T. Reliab 2010, 59(4):685–690.
Fu JC: Reliability of a large consecutive-k-out-of-n:F system. IEEE T. Reliab 1985, R-34: 120–127.
Fu JC, Johnson BC: Approximate probabilities for runs and patterns in i.i.d. and Markov dependent multi-state trials. Adv. Appl. Probab 2009, 41(1):292–308.
Fu JC, Koutras MV: Distribution theory of runs: a Markov chain approach. J. Amer. Statist. Assoc 1994, 89(427):1050–1058.
Fu JC, Lou WYW: Distribution Theory of Runs and Patterns and Its Applications. World Scientific Publishing Co. Inc, River Edge; 2003.
Fu JC, Lou WYW: On the normal approximation for the distribution of the number of simple or compound patterns in a random sequence of multi-state trials. Methodol. Comput. Appl. Probab 2007, 9(2):195–205.
Fu JC, Johnson BC, Chang Y-M: Approximating the extreme right-hand tail probability for the distribution of the number of patterns in a sequence of multi-state trials. J. Stat. Plan. Infer 2012, 142(2):473–480.
Gerber HU, Li S-YR: The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain. Stochastic Process. Appl 1981, 11(1):101–108.
Godbole AP: Degenerate and Poisson convergence criteria for success runs. Statist. Probab. Lett 1990a, 10(3):247–255.
Godbole AP: Specific formulae for some success run distributions. Statist. Probab. Lett 1990b, 10(2):119–124.
Godbole AP: Poisson approximations for runs and patterns of rare events. Adv. Appl. Probab 1991, 23(4):851–865.
Godbole AP, Schaffner AA: Improved Poisson approximations for word patterns. Adv. Appl. Probab 1993, 25(2):334–347.
Holst L, Kennedy JE, Quine MP: Rates of Poisson convergence for some coverage and urn problems using coupling. J. Appl. Probab 1988, 25(4):717–724.
Karlin S: Statistical signals in bioinformatics. Proc. Natl. Acad. Sci. U. S. A 2005, 102(38):13355–13362.
Karlin S, Taylor HM: A First Course in Stochastic Processes. Academic Press [A subsidiary of Harcourt Brace Jovanovich, Publishers], New York-London; 1975.
Kleffe J, Borodovski M: First and second moment of counts of words in random text generated by Markov chains. Comp Applic Biosci 1992, 8: 443–441.
Martin J, Regad L, Camproux A-C, Nuel G: Finite Markov chain embedding for the exact distribution of patterns in a set of random sequences. In Advances in Data Analysis. Statistics for Industry and Technology. Edited by: Skiadas CH. Birkhäuser, Boston; 2010.
Meyn SP, Tweedie RL: Markov Chains and Stochastic Stability. Communications and Control Engineering Series. 1993.
Nuel G, Regad L, Martin J, Camproux A-C: Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data. Algorithm Mol. Biol 2010, 5(1):1–18.
Schwager SJ: Run probabilities in sequences of Markov-dependent trials. J. Amer. Statist. Assoc 1983, 78(381):168–180.
Seneta E: Non-negative Matrices and Markov Chains. Springer, New York; 1983.
Solov’ev AD: A combinatorial identity and its application to the problem on the first occurrence of a rare event. Teor. Verojatnost. i Primenen 1966, 11: 313–320.

Acknowledgements

This work was supported, in part, by the Natural Sciences and Engineering Research Council of Canada.

Author information

Authors and Affiliations

Department of Statistics, University of Manitoba, Winnipeg, Canada
Brad C Johnson & James C Fu

Corresponding author

Correspondence to Brad C Johnson.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

BJ and JF contributed equally to the mathematical details. BJ performed the numerical comparisons and prepared the manuscript. Both authors read and approved the final manuscript.

Brad C Johnson and James C Fu contributed equally to this work.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Johnson, B.C., Fu, J.C. Approximating the distributions of runs and patterns. J Stat Distrib App 1, 5 (2014). https://doi.org/10.1186/2195-5832-1-5

Received: 14 November 2013
Accepted: 07 March 2014
Published: 11 June 2014
DOI: https://doi.org/10.1186/2195-5832-1-5
