Joint distribution of k-tuple statistics in zero-one sequences of Markov-dependent trials

Abstract

We consider a sequence of n, n≥3, zero (0) - one (1) Markov-dependent trials. We focus on k-tuples of 1s; i.e. runs of 1s of length at least equal to a fixed integer number k, 1≤k≤n. The statistics denoting the number of k-tuples of 1s, the number of 1s in them and the distance between the first and the last k-tuple of 1s in the sequence, are defined. The work provides, in a closed form, the exact conditional joint distribution of these statistics given that the number of k-tuples of 1s in the sequence is at least two. The case of independent and identical 0−1 trials is also covered in the study. A numerical example illustrates further the theoretical results.

Introduction

Run counting statistics defined on a sequence of binary (zero (0) - one (1)) random variables (RVs), along with their exact and approximate distributions, have been extensively studied in the literature. Their popularity is due to the fact that such statistics appear as useful theoretical models in many research areas including statistics (e.g. hypothesis testing), engineering (e.g. system reliability and quality control), biology (e.g. population genetics and DNA sequence analysis), computer science (e.g. encoding/decoding/transmission of digital information) and financial engineering (e.g. insurance and risk analysis).

In such applications, a key point is understanding how 1s and 0s are distributed and combined as elements of a 0−1 sequence (finite or infinite, memoryless or not), eventually forming runs of 1s or 0s which are enumerated according to certain counting schemes. Each scheme defines how runs of the same symbol or strings (patterns) of both symbols are formed and, consequently, enumerated. A counting scheme may depend on, among other considerations, whether overlapping counting is allowed and whether counting restarts from scratch once a run/string of a certain size has been enumerated.

The counting scheme as well as the intrinsic uncertainty of a 0−1 sequence are often suggested by the applications. Probabilistic models in common use for the internal structure of a 0−1 sequence include the model of a sequence whose elements are independent of each other and models that assume some kind of dependence among its elements. The methods used to derive exact/approximate, marginal/joint probability distributions include combinatorial analysis, generating functions, the finite Markov chain imbedding technique, recursive schemes as well as normal, Poisson and large deviation approximations.

For extensive reviews of the recent literature on the distribution theory of runs and patterns we refer to Balakrishnan and Koutras (2002) and Fu and Lou (2003). Current works on the subject include, among others, those of Antzoulakos and Chadjiconstantinidis (2001); Eryilmaz (2006, 2015, 2016, 2017); Eryilmaz and Yalcin (2011); Johnson and Fu (2014); Koutras (2003); Koutras et al. (2016); Makri and Psillakis (2015); Makri et al. (2013) and Mytalas and Zazanis (2013, 2014).

In this article we derive expressions for a conditional distribution of a trivariate statistic. Its components denote the number of runs of 1s of length exceeding a fixed threshold number, the number of 1s in such runs of 1s and the length of the minimum sequence’s segment in which these runs are concentrated. The study is developed on a sequence of two-state (0−1) Markov-dependent trials. The runs are enumerated according to Mood’s (1940) counting scheme.

More specifically, the manuscript is organized as follows. In Section 2 we present some preliminary material, including notation and definitions, necessary to develop our results, which are obtained in Section 4. In Section 3 we give a motivation along with a statement of the aim of the work. A numerical example, shown in Section 5, clarifies the theoretical results of Section 4. A discussion of the results as well as a note on future work are given in Section 6.

Throughout the article, for integers n, m, \({n\choose m}\) denotes the extended binomial coefficient (see Feller (1968), pp. 50, 63), \(\lfloor x\rfloor\) stands for the greatest integer less than or equal to x and \(\delta_{i,j}\) denotes the Kronecker delta function of the integer arguments i and j. Further, for α>β, we apply the conventions \(\sum _{i=\alpha }^{\beta }y_{i}=0\), \(\prod _{i=\alpha }^{\beta }y_{i}=1\), \(\sum _{i=\alpha }^{\beta }\mathbf {Y}^{(i)}=\mathbf {O}\equiv {\scriptsize \left (\begin {array}{cc} 0 &0\\ 0 & 0 \end {array}\right)}\), \(\prod _{i=\alpha }^{\beta }\mathbf {Y}^{(i)}=\mathbf {I}\equiv {\scriptsize \left (\begin {array}{cc} 1 &0\\ 0 & 1 \end {array}\right)}\), where \(y_{i}\) and \(\mathbf{Y}^{(i)}\) are scalars and 2×2 matrices, respectively.

Preliminaries

2.1 Run counting statistics

Let \(\{X_{t}\}_{t=1}^{n}\), n≥1, be the first n trials of a binary (0−1) sequence of RVs, \(X_{t}=x_{t}\in\{0,1\}\). A run of 1s is a (sub)sequence of \(\{X_{t}\}_{t=1}^{n}\) consisting of consecutive 1s, the number of which is referred to as its length, preceded and succeeded by 0s or by nothing.

Given a fixed integer k, 1≤k≤n, a k-tuple of 1s is a run of 1s of length k or more. In the paper we will deal with the following statistics defined on a 0−1 sequence \(\{X_{t}\}_{t=1}^{n}\). For details see, e.g. Makri et al. (2015) and the references therein.

(I) \(G_{n,k}\) denoting the number of k-tuples of 1s, 1≤k≤n. In particular, \(G_{n,1}\) denotes the number of 1-tuples of 1s, i.e. it represents the number \(R_{n}\equiv G_{n,1}\) of all runs of 1s in the sequence. Using the convention \(X_{0}=X_{n+1}\equiv 0\), we can define \(G_{n,k}\) as

$$ G_{n,k}=\sum_{i=k}^{n}E_{n,i},\, 1\leq k\leq n, $$
(1)

where

$$E_{n,i}=\sum_{j=i}^{n}J_{j},\, J_{j}=\left(1-X_{j-i}\right)\left(1-X_{j+1}\right)\prod_{r=j-i+1}^{j}X_{r}. $$

(II) \(S_{n,k}\) denoting the number of 1s in the \(G_{n,k}\) k-tuples of 1s; i.e. \(S_{n,k}\) represents the sum of lengths of the \(G_{n,k}\) k-tuples of 1s, 1≤k≤n. In particular, \(S_{n,1}\) represents the number of all 1s in the sequence; hence, the number of 0s, \(Y_{n}\), in the sequence is \(Y_{n}=n-S_{n,1}\). \(S_{n,k}\) is formally defined as

$$ S_{n,k}=\sum_{i=k}^{n}{iE}_{n,i}, \,1\leq k\leq n. $$
(2)

Readily, \(kG_{n,k}\leq S_{n,k}\).

(III) \(L_{n}\), n≥1, denoting the length of the longest run of 1s in the sequence. By setting

$$\Lambda_{n}=\{i:G_{n,i}>0, 1\leq i\leq n\}, $$

we have that

$$ L_{n}=\max \{k: k\in \Lambda_{n}\},\, \text{if}\, \Lambda_{n}\neq \emptyset;\,0,\,\text{otherwise}. $$
(3)

Readily, \(L_{n}<k\) iff \(G_{n,k}<1\).

(IV) For \(G_{n,k}\geq 1\), 1≤k≤n, \(D_{n,k}\) denotes the distance (number of trials) between and including the first 1 of the first k-tuple of 1s and the last 1 of the last k-tuple of 1s in the sequence. If there is only one k-tuple of 1s in the sequence then \(D_{n,k}\) denotes its length. That is, \(D_{n,k}\) represents the size (length) of the minimum (sub)sequence of \(\{X_{t}\}_{t=1}^{n}\) in which all \(G_{n,k}\) k-tuples of 1s are concentrated. In particular, \(D_{n,1}\) represents the length of the minimum segment of the sequence containing all \(R_{n}\) runs of 1s or, in other words, all \(S_{n,1}\) 1s appearing in the sequence. For \(G_{n,k}\geq 1\), 1≤k≤n, \(D_{n,k}\) can be formally defined as

$$ D_{n,k}=U_{n,k}^{(2)}-U_{n,k}^{(1)}+1, $$
(4)

where

$$U_{n,k}^{(1)}=\min\{j:I_{j}=1, 1\leq j\leq n-k+1\}, $$
$$U_{n,k}^{(2)}=\max\{j:I_{j-k+1}=1, k\leq j\leq n\}, $$
$$I_{j}=\prod_{r=j}^{j+k-1}X_{r},\, 1\leq j\leq n-k+1. $$

Readily, \(D_{n,k}=S_{n,k}=L_{n}\), if \(G_{n,k}=1\) and \(D_{n,k}>S_{n,k}>L_{n}\), if \(G_{n,k}>1\).

(V) For \(G_{n,k}\geq 1\), 1≤k≤n, set \(\mathbf{V}_{n,k}=(D_{n,k},G_{n,k},S_{n,k})\). This is the RV we focus on in the article.

Example: By way of illustration consider the trials 1110001100010001010011101111001001001001 numbered from 1 to 40. Then, \(L_{40}=4\) and \(\mathbf{V}_{40,1}=(40,11,19)\), \(\mathbf{V}_{40,2}=(28,4,12)\), \(\mathbf{V}_{40,3}=(28,3,10)\), \(\mathbf{V}_{40,4}=(4,1,4)\).
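The statistics (I)-(V) can be read off directly from the run lengths of a realized sequence. The following Python sketch is only an illustration of the definitions; the function names `run_lengths_of_ones`, `longest_run` and `v_statistic` are ours and do not appear in the paper. It is applied to the 40 trials of the example above.

```python
from itertools import groupby


def run_lengths_of_ones(x):
    """Return (start, length) of every maximal run of 1s; positions are 1-based."""
    runs, pos = [], 1
    for symbol, group in groupby(x):
        length = len(list(group))
        if symbol == 1:
            runs.append((pos, length))
        pos += length
    return runs


def longest_run(x):
    """L_n: length of the longest run of 1s (0 if the sequence has no 1s)."""
    return max((length for _, length in run_lengths_of_ones(x)), default=0)


def v_statistic(x, k):
    """V_{n,k} = (D_{n,k}, G_{n,k}, S_{n,k}); None when there is no k-tuple of 1s."""
    tuples = [(start, length) for start, length in run_lengths_of_ones(x) if length >= k]
    if not tuples:
        return None
    g = len(tuples)                              # number of k-tuples
    s = sum(length for _, length in tuples)      # number of 1s inside them
    first_start = tuples[0][0]
    last_start, last_length = tuples[-1]
    d = (last_start + last_length - 1) - first_start + 1   # first 1 to last 1, inclusive
    return (d, g, s)


trials = [int(c) for c in "1110001100010001010011101111001001001001"]
print(longest_run(trials))
for k in range(1, 5):
    print(k, v_statistic(trials, k))
```

Run on the example sequence, the sketch reproduces the values of \(L_{40}\) and \(\mathbf{V}_{40,k}\), k=1,2,3,4, listed above.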

2.2 Internal structure’s models

A general enough model for the internal structure of a \(0-1 \{X_{t}\}_{t=1}^{n}\), n≥2, is that of the first n trials of a homogeneous 0−1 Markov chain of first order (HMC1). On such a model we will develop our results. Accordingly, we next state the necessary notation/definitions.

Let \(\{X_{t}\}_{t\geq 1}\) be a HMC1 with state space \({\cal{A}}=\{0,1\}\), one-step transition probability matrix

$$\mathbf{P}=(p_{ij})=\left(\begin{array}{cc} p_{00} & p_{01} \\ p_{10} & p_{11} \\ \end{array} \right), $$

with

$$ p_{ij}=P\left(X_{t}=j\mid X_{t-1}=i\right),\,i,j\in {\cal{A}},\,\sum_{j\in \cal{A}}p_{ij}=1,\,i\in {\cal{A}},\,t\geq 2 $$
(5)

and probability distribution vector at time t

$$\mathbf{p}^{(t)}=\left(p_{0}^{(t)}, p_{1}^{(t)}\right), $$

with

$$ p_{i}^{(t)}=P(X_{t}=i),\, i\in {\cal{A}},\, \sum_{i\in \cal{A}}p_{i}^{(t)}=1,\, t\geq 1. $$
(6)

Readily, because of the homogeneity of \(\{X_{t}\}_{t\geq 1}\), it holds

$$\mathbf{p}^{(t)}=\mathbf{p}^{(t-1)}\mathbf{P}=\mathbf{p}^{(1)}\mathbf{P}^{t-1 },\,t\geq 2;\,\mathbf{p}^{(1)},\,t=1\,\,\text{and}\,\,\mathbf{P}^{t-1}=\left(p_{ij}^{(t-1)}\right),\, t\geq 2, $$

with

$$p_{i}^{(t)}=\mathbf{p}^{(t)}\mathbf{e}_{i}^{'},\, i\in {\mathcal{A}},\,t\geq 1, $$
$$ p_{ij}^{(t-1)}=P(X_{t-1+m}=j\mid X_{m}=i)=\mathbf{e}_{i}\mathbf{P}^{t-1}\mathbf{e}_{j}^{'},\,i,j\in {\mathcal{A}},\, t\geq 2,\, m\geq 1, $$
(7)

where \(\mathbf {e}_{i}^{'}\) is the transpose (i.e. the column vector) of the row vector \(\mathbf{e}_{i}\), \(i\in {\mathcal {A}}\), with \(\mathbf{e}_{0}=(1,0)\) and \(\mathbf{e}_{1}=(0,1)\).

In particular, for \(p_{01}+p_{10}\neq 0\), i.e. \(\mathbf{P}\neq \mathbf{I}\), it holds

$$ \mathbf{P}^{t-1}=\left(p_{01}+p_{10}\right)^{-1}\left\{\left(\begin{array}{cc} p_{10} & p_{01} \\ p_{10} & p_{01} \\ \end{array} \right)+(1-p_{01}-p_{10})^{t-1}\left(\begin{array}{cc} p_{01} & -p_{01} \\ -p_{10} & p_{10} \\ \end{array} \right)\right\},\, t\geq 2, $$
(8)
$$ p_{0}^{(t)}=p_{0}^{(1)}\left(1-p_{01}-p_{10}\right)^{t-1}+p_{10}\left(p_{01}+p_{10}\right)^{-1}\left[1-\left(1-p_{01}-p_{10}\right)^{t-1}\right],\, t\geq 1. $$
(9)
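Equations (8) and (9) can be checked numerically against repeated matrix multiplication. The sketch below is a plain-Python illustration under the assumption \(p_{01}+p_{10}\neq 0\); the helper names (`mat_pow`, `closed_form_power`, `p0_at`) and the numerical values of the transition probabilities are ours, chosen only for the check.

```python
def mat_mul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][m] * b[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]


def mat_pow(p, e):
    """p raised to the non-negative integer power e by repeated multiplication."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(e):
        result = mat_mul(result, p)
    return result


def closed_form_power(p01, p10, t):
    """P^{t-1} as given by Eq. (8); requires p01 + p10 != 0."""
    s = p01 + p10
    lam = (1.0 - s) ** (t - 1)
    return [[(p10 + lam * p01) / s, (p01 - lam * p01) / s],
            [(p10 - lam * p10) / s, (p01 + lam * p10) / s]]


def p0_at(p0_init, p01, p10, t):
    """p_0^{(t)} as given by Eq. (9)."""
    s = p01 + p10
    lam = (1.0 - s) ** (t - 1)
    return p0_init * lam + p10 / s * (1.0 - lam)


# Illustrative values (assumptions, not taken from the paper).
p01, p10, p0_init = 0.3, 0.2, 0.4
P = [[1 - p01, p01], [p10, 1 - p10]]
t = 6
direct = mat_pow(P, t - 1)
closed = closed_form_power(p01, p10, t)
print(direct)
print(closed)
# p^{(t)} = p^{(1)} P^{t-1}: first component computed directly and via Eq. (9)
print(p0_init * direct[0][0] + (1 - p0_init) * direct[1][0], p0_at(p0_init, p01, p10, t))
```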

The setup of a 0−1 HMC1 \(\{X_{t}\}_{t=1}^{n}\), n≥2, covers the case of a 0−1 sequence of independent and identically distributed (IID) RVs, too. This is so, because a \(0-1 \{X_{t}\}_{t=1}^{n}\), n≥2, IID sequence with

$$ P(X_{t}=1)=1-P(X_{t}=0)=p_{1},\, 1\leq t\leq n, $$
(10)

is a particular HMC1 with

$$p_{ij}=1-p_{1},\, j=0;\, p_{1}, j=1,\, i\in {\mathcal{A}},\,p_{ij}^{(t-1)}=p_{ij},\,i,j\in {\mathcal{A}},\,t\geq 2, $$
$$ p_{1}^{(t)}=p_{1}=1-p_{0}^{(t)},\, 1\leq t\leq n. $$
(11)

2.3 A combinatorial result

In combinatorial analysis which will be used in Section 4, the following result, recalled from Makri et al. (2007), is useful. The coefficient

$$ H_{m}(\alpha,r,k)=\sum_{j=0}^{\left\lfloor\frac{\alpha}{k+1}\right\rfloor}(-1)^{j}{m\choose j}{\alpha-(k+1)j+r-1\choose \alpha-(k+1)j}, $$
(12)

represents the number of allocations of α indistinguishable balls into r distinguishable cells where each of the m, 0≤m≤r, specified cells is occupied by at most k balls. Equivalently, it gives the number of nonnegative integer solutions of the linear equation \(x_{1}+x_{2}+\ldots+x_{r}=\alpha\) with the restrictions, for m≥1, \(0\leq x_{i_{j}}\leq k\), 1≤j≤m, for some specific m-combination \(\{i_{1},i_{2},\ldots,i_{m}\}\) of {1,2,…,r}, and no restrictions on the \(x_{j}\)s, 1≤j≤r, for m=0.

Moreover, \(H_{r}(\alpha,r,k)\) is Riordan’s (1964, p. 104) coefficient

$$ C(\alpha,r,k)=\sum_{j=0}^{\left\lfloor\frac{\alpha}{k+1}\right\rfloor}(-1)^{j}{r\choose j}{\alpha-(k+1)j+r-1\choose \alpha-(k+1)j}. $$
(13)
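The inclusion-exclusion sums (12) and (13) are easy to compare with a direct enumeration of the solutions of \(x_{1}+\ldots+x_{r}=\alpha\). The Python sketch below is an illustration only; it assumes α≥0, r≥1, 0≤m≤r and k≥0, so that ordinary (non-extended) binomial coefficients suffice, and it takes the first m cells as the restricted ones. The names `H`, `C` and `brute_force` are ours.

```python
from itertools import product
from math import comb


def H(m, alpha, r, k):
    """Eq. (12): allocations of alpha balls into r cells, m specified cells holding <= k balls.

    Assumes alpha >= 0, r >= 1, 0 <= m <= r, k >= 0.
    """
    total = 0
    for j in range(alpha // (k + 1) + 1):
        rest = alpha - (k + 1) * j
        total += (-1) ** j * comb(m, j) * comb(rest + r - 1, rest)
    return total


def C(alpha, r, k):
    """Eq. (13): Riordan's coefficient, i.e. H with all r cells restricted."""
    return H(r, alpha, r, k)


def brute_force(m, alpha, r, k):
    """Count the solutions of x_1+...+x_r = alpha with x_1,...,x_m <= k by enumeration."""
    count = 0
    for x in product(range(alpha + 1), repeat=r):
        if sum(x) == alpha and all(xi <= k for xi in x[:m]):
            count += 1
    return count


print(H(2, 5, 3, 2), brute_force(2, 5, 3, 2))
print(C(4, 3, 2), brute_force(3, 4, 3, 2))
```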

Motivation and aim of the work

In a study of a 0−1 sequence \(\{X_{t}\}_{t=1}^{n}\), n≥3, it is reasonable for one to be interested in the probabilistic behavior of the RV \(\mathbf{V}_{n,k}=(D_{n,k},G_{n,k},S_{n,k})\). This happens because jointly its components provide a more refined view of the internal clustering structure of the sequence than the information extracted by each one alone.

Interpreting a k-tuple of 1s as a cluster of consecutive 1s of size at least k, \(D_{n,k}\) represents the size of the minimum segment of \(\{X_{t}\}_{t=1}^{n}\) in which \(G_{n,k}\) clusters of size at least k and at most \(L_{n}\) are concentrated. The overall density of the \(G_{n,k}\) clusters, with respect to the number of 1s in them, as well as of the minimum concentration segment is evaluated by \(S_{n,k}\). Large values of \(D_{n,k}\) suggest that these \(G_{n,k}\) clusters spread over the interval between the left and the right side of the sequence, whereas small values of \(D_{n,k}\) indicate rather that the clusters are concentrated in a small segment of the sequence, leaving the remaining part(s) of the sequence empty of such clusters.

In addition to this information, a large value of \(S_{n,k}\) paired with a small value of \(G_{n,k}\) indicates the existence of clusters of 1s of a large size and therefore a trend, whereas the same value of \(S_{n,k}\) paired with a large value of \(G_{n,k}\) indicates rather a distribution of clusters of small size in the (sub)sequence in which they are concentrated.

Therefore, based on the former interpretation, the motivation for the study as well as the usefulness of the statistic \(\mathbf{V}_{n,k}=(D_{n,k},G_{n,k},S_{n,k})\) is apparent. In the sequel, we assume that \(G_{n,k}\geq 2\), so that there are at least two k-tuples of 1s in the sequence and, accordingly, the distance \(D_{n,k}\) is not degenerate. Moreover, this assumption is a common one in an application area of \(D_{n,k}\); e.g., in detecting pattern (tandem or non-tandem direct) repeats in DNA sequences (Benson 1999).

For 1≤k≤n, set

$$ {\mathcal{M}}_{n,k}=\{G_{n,k}\geq 2\},\,\alpha_{n,k}=P\left({\mathcal{M}}_{n,k}\right) $$
(14)

and for n≥3, \(1\leq k\leq \lfloor (n-1)/2\rfloor\), define

$$ \Omega_{n,k}=\left\{(d,m,s): 2k+1\leq d\leq n, 2k\leq s\leq d-1, 2\leq m\leq \min\left(\lfloor s/k\rfloor, d-s+1\right)\right\} $$
(15)

and for \((d,m,s)\in \Omega_{n,k}\),

$$h_{n,k}(d,m,s)=P\left(\mathbf{V}_{n,k}=(d,m,s), {\cal {M}}_{n,k}\right), $$
$$ v_{n,k}(d,m,s)=P\left(\mathbf{V}_{n,k}=(d,m,s)\mid {\cal {M}}_{n,k}\right)=h_{n,k}(d,m,s)/\alpha_{n,k}. $$
(16)
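The support \(\Omega_{n,k}\) of Eq. (15) is a finite set of integer triples and can be listed mechanically. The short Python sketch below does so; the function name `omega` is ours and the code is meant only as an illustration of the index ranges in (15).

```python
def omega(n, k):
    """List the triples (d, m, s) of Omega_{n,k} in Eq. (15)."""
    triples = []
    for d in range(2 * k + 1, n + 1):                       # 2k+1 <= d <= n
        for s in range(2 * k, d):                           # 2k <= s <= d-1
            for m in range(2, min(s // k, d - s + 1) + 1):  # 2 <= m <= min(floor(s/k), d-s+1)
                triples.append((d, m, s))
    return triples


print(omega(3, 1))
print(omega(4, 1))
print(len(omega(8, 1)), len(omega(8, 2)), len(omega(8, 3)))
```

For n=4 and k=1, for instance, the three listed triples correspond to the sequences 1010 and 0101 (d=3), 1001 (d=4, s=2) and 1101, 1011 (d=4, s=3).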

The paper provides exact closed form expressions for \(\alpha_{n,k}\), \(h_{n,k}(d,m,s)\) and eventually for \(v_{n,k}(d,m,s)\) when \(\mathbf{V}_{n,k}\) is defined on a 0−1 HMC1/IID. The expressions are obtained via combinatorial analysis.

More specifically, closed formulae are established for the first time for \(h_{n,k}(d,m,s)\), \(1\leq k\leq \lfloor (n-1)/2\rfloor\), when \(\mathbf{V}_{n,k}\) is defined on a 0−1 HMC1 with given \(\mathbf{P}\) and \(\mathbf{p}^{(1)}\). Since the general framework of HMC1 covers IID sequences as a particular case, the implied expressions for \(v_{n,k}(d,m,s)\) are alternative to those obtained for \(v_{n,k}(d,m,s)\), \(1\leq k\leq \lfloor (n-1)/2\rfloor\), by Makri et al. (2015) for IID sequences.

Moreover, for n≥3, \(1\leq k\leq \lfloor (n-1)/2\rfloor\), 2k+1≤d≤n, let

$$f_{n,k}(d)=P\left(D_{n,k}=d\mid {\cal {M}}_{n,k}\right). $$

Therefore, since

$$ f_{n,k}(d)=\sum_{s=2k}^{d-1}\sum_{m=2}^{\min\left(\lfloor s/k\rfloor,d-s+1\right)}v_{n,k}(d,m,s)=\alpha_{n,k}^{-1}\sum_{s=2k}^{d-1}\sum_{m=2}^{\min\left(\lfloor s/k\rfloor,d-s+1\right)}h_{n,k}(d,m,s), $$
(17)

the work provides closed form expressions for determining \(f_{n,k}(d)\) for HMC1 and IID 0−1 sequences \(\{X_{t}\}_{t=1}^{n}\). These expressions are alternative to those derived, for IID sequences, by Makri et al. (2015) for \(1\leq k\leq \lfloor (n-1)/2\rfloor\) as well as to those obtained, for HMC1, by Arapis et al. (2016) for k=1 and by Arapis et al. (2017) for \(1\leq k\leq \lfloor (n-1)/2\rfloor\).

Results

In a 0−1 sequence \(\{X_{t}\}_{t=1}^{n}\), n≥2, for 0≤y≤n, \(0\leq r\leq \lfloor (n+1)/2\rfloor\) and \((i,j)\in \{0,1\}^{2}\), define

$$B_{n}^{(i,j)}(y,r)=\{X_{1}=i,X_{n}=j,Y_{n}=y,G_{n,1}=r\}, $$
$$\pi_{n}^{(i,j)}(y,r)=P(B_{n}^{(i,j)}(y,r)). $$

Accordingly, for a HMC1 \(\{X_{t}\}_{t=1}^{n}\), n≥2, with given \(\mathbf{P}\) and \(\mathbf{p}^{(1)}\), it holds

$$ \pi_{n}^{(i,j)}(y,r)=\left(p_{1}^{(1)}\right)^{i}\left(1-p_{1}^{(1)}\right)^{1-i}p_{00}^{y-r-1+i+j}\left(1-p_{00}\right)^{r-i} \left(1-p_{11}\right)^{r-j}p_{11}^{n-y-r}, $$
(18)

for \(2-(i+j)\leq y\leq n-(i+j)\), \(1-\delta_{y,0}-\delta_{y,n}+\delta_{i+j,2}\leq r\leq \min\{n-y,\,y-1+i+j\}\) and \(\pi _{n}^{(i,j)}(y,r)=0\), otherwise.

Consequently, \(\pi _{n}^{(i,j)}(y,r)\), for a 0−1 IID sequence, reduces to

$$ \pi_{n}^{(i,j)}(y,r)=\pi_{n}(y)=p_{1}^{n-y}(1-p_{1})^{y},\, 0\leq y\leq n. $$
(19)
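Equation (18) depends on a sequence only through its first and last symbols, its number of 0s and its number of runs of 1s. A quick way to see this is to compare (18) with the product of transition probabilities along each individual sequence. The Python sketch below does this for all sequences of a small length; the function names and the numerical parameter values are illustrative assumptions of ours.

```python
from itertools import product


def pi(n, i, j, y, r, p00, p11, p1_init):
    """Eq. (18): probability of one 0-1 sequence of length n with X_1=i, X_n=j,
    y zeros and r runs of 1s, under a HMC1 with the given parameters."""
    start = p1_init if i == 1 else 1.0 - p1_init
    return (start * p00 ** (y - r - 1 + i + j) * (1 - p00) ** (r - i)
            * (1 - p11) ** (r - j) * p11 ** (n - y - r))


def path_probability(x, p00, p11, p1_init):
    """Probability of the sequence x obtained by multiplying transition probabilities."""
    P = [[p00, 1 - p00], [1 - p11, p11]]
    prob = p1_init if x[0] == 1 else 1.0 - p1_init
    for a, b in zip(x, x[1:]):
        prob *= P[a][b]
    return prob


def count_runs_of_ones(x):
    """Number of runs of 1s: count the positions where a run of 1s starts."""
    return sum(1 for t, v in enumerate(x) if v == 1 and (t == 0 or x[t - 1] == 0))


# Illustrative parameter values (assumptions, not taken from the paper).
p00, p11, p1_init = 0.7, 0.6, 0.3
n = 6
max_gap = 0.0
for x in product((0, 1), repeat=n):
    y, r = x.count(0), count_runs_of_ones(x)
    gap = abs(pi(n, x[0], x[-1], y, r, p00, p11, p1_init)
              - path_probability(x, p00, p11, p1_init))
    max_gap = max(max_gap, gap)
print(max_gap)  # largest absolute discrepancy over all 2^n sequences
```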

Theorem 1

For n≥3, \((d,m,s)\in \Omega_{n,1}\), \(0<p_{1}^{(1)}<1\), it holds

$$\begin{array}{@{}rcl@{}} h_{n,1}(d,m,s)&=& {s-1\choose m-1}{d-s-1\choose m-2}\pi_{d}^{(1,1)}(d-s,m)\varepsilon_{n}(d) \end{array} $$
(20)

where \(\varepsilon_{n}(d)=1\), if n=d; \(p_{00}^{n-d-2}\left \{p_{10}p_{00}+p_{0}^{(1)}(p_{1}^{(1)})^{-1}p_{01}\left [(n-d-1)p_{10}+p_{00}\right ]\right \}\), if n≥d+1.

Proof

For d=3,…,n−2, i=2,3,…,n−d, s=2,3,…,d−1, m=2,3,…, min{s,d−s+1} an element of the event \(\Gamma _{i,d,m,s}=\{U_{n,1}^{(1)}=i, D_{n,1}=d, R_{n}=m, S_{n,1}=s\}\) is a 0−1 sequence of length n with probability

$$p_{0}^{(1)}p_{00}^{i-2}p_{01}\left[\pi_{d}^{(1,1)}(d-s,m)\left(p_{1}^{(1)}\right)^{-1}\right]p_{10}p_{00}^{n-i-d}. $$

Fix i. Then the number of elements of the event \(\Gamma_{i,d,m,s}\) is \({s-1\choose m-1}{d-s-1\choose m-2}\), since the number of allocations of s 1s in m runs of 1s is \({s-1\choose m-1}\) and the number of allocations of d−s 0s in m−1 runs of 0s is \({d-s-1\choose m-2}\), so that

$$P\left(\Gamma_{i,d,m,s}\right)={s-1\choose m-1}{d-s-1\choose m-2}p_{0}^{(1)}p_{01}\left[\pi_{d}^{(1,1)}(d-s,m)\left(p_{1}^{(1)}\right)^{-1}\right]p_{10}p_{00}^{n-d-2}. $$

We use similar reasoning for the remaining cases. Then, summing with respect to i, we get the result. □

For a sequence \(\{X_{t}\}_{t=1}^{n}\) of 0−1 IID RVs, \(h_{n,1}(d,m,s)\) reduces to the explicit formula given in the next Corollary.

Corollary 1

For n≥3, \((d,m,s)\in \Omega_{n,1}\), \(0<p_{1}<1\), it is true that

$$ h_{n,1}(d,m,s)=(n-d+1){s-1\choose m-1}{d-s-1\choose m-2}p_{1}^{s}(1-p_{1})^{n-s}.\quad \diamondsuit $$
(21)
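For small n, Theorem 1 and Corollary 1 can be checked by enumerating all \(2^{n}\) binary sequences and accumulating their probabilities by the value of \((D_{n,1},G_{n,1},S_{n,1})\). The following Python sketch performs such a check; it is an illustration under the assumption \(0<p_{00},p_{11}<1\), and the function names as well as the parameter values are ours.

```python
from itertools import product
from math import comb, isclose


def pi11(d, y, r, p00, p11, p1_init):
    """Eq. (18) with i = j = 1 for a segment of length d."""
    return (p1_init * p00 ** (y - r + 1) * (1 - p00) ** (r - 1)
            * (1 - p11) ** (r - 1) * p11 ** (d - y - r))


def epsilon(n, d, p00, p11, p1_init):
    """epsilon_n(d) of Theorem 1."""
    if n == d:
        return 1.0
    p01, p10, p0_init = 1 - p00, 1 - p11, 1 - p1_init
    return p00 ** (n - d - 2) * (
        p10 * p00 + p0_init / p1_init * p01 * ((n - d - 1) * p10 + p00))


def h_n1(n, d, m, s, p00, p11, p1_init):
    """Eq. (20): h_{n,1}(d,m,s) for a HMC1."""
    return (comb(s - 1, m - 1) * comb(d - s - 1, m - 2)
            * pi11(d, d - s, m, p00, p11, p1_init) * epsilon(n, d, p00, p11, p1_init))


def h_n1_iid(n, d, m, s, p1):
    """Eq. (21): h_{n,1}(d,m,s) for IID trials."""
    return ((n - d + 1) * comb(s - 1, m - 1) * comb(d - s - 1, m - 2)
            * p1 ** s * (1 - p1) ** (n - s))


def brute_force_h(n, p00, p11, p1_init):
    """P(V_{n,1} = (d,m,s), G_{n,1} >= 2) by enumeration of all 2^n sequences."""
    P = [[p00, 1 - p00], [1 - p11, p11]]
    table = {}
    for x in product((0, 1), repeat=n):
        ones = [t for t, v in enumerate(x) if v == 1]
        m = sum(1 for t in ones if t == 0 or x[t - 1] == 0)   # number of runs of 1s
        if m < 2:
            continue
        prob = p1_init if x[0] == 1 else 1 - p1_init
        for a, b in zip(x, x[1:]):
            prob *= P[a][b]
        key = (ones[-1] - ones[0] + 1, m, len(ones))            # (d, m, s)
        table[key] = table.get(key, 0.0) + prob
    return table


params = (0.7, 0.6, 0.3)   # p00, p11, p1^(1): illustrative values only
table = brute_force_h(7, *params)
print(all(isclose(h_n1(7, d, m, s, *params), prob, rel_tol=1e-9)
          for (d, m, s), prob in table.items()))
```

The same enumeration with \(p_{00}=1-p_{1}\), \(p_{11}=p_{1}^{(1)}=p_{1}\) can be compared with `h_n1_iid`, which is how Eq. (21) arises as a special case of Eq. (20).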

In order to derive for HMC1, in the forthcoming Theorem 2, \(h_{n,k}(d,m,s)\), 5≤2k+1≤n, we next recall, in Lemma 1, a result from (Makri et al.: On the concentration of runs of ones of length exceeding a threshold in a Markov chain, submitted).

Lemma 1

For \((i,j)\in \{0,1\}^{2}\), n≥2, set \(\lambda _{n,k}^{(i,j)}(x)=P(G_{n,k}=x,X_{1}=i,X_{n}=j)\), x=0,1. Then, it holds that:

(I) For 2≤k≤n−2+i+j,

$$\lambda_{n,k}^{(i,j)}(0)=\sum_{y=1}^{n-(i+j)}\sum_{r=i+j}^{y-1+i+j} {y-1\choose r-i-j}C(n-y-r,r,k-2)\pi_{n}^{(i,j)}(y,r), $$
$$ {}\lambda_{n,k}^{(i,j)}(1)=\pi_{n}^{(i,j)}(0,1)\delta_{2,i+j}+\sum_{y=1}^{n-k}\sum_{r=1}^{y-1+i+j} r{y-1\choose r-i-j}H_{r-1}(n-y-r-k+1,r,k-2)\pi_{n}^{(i,j)}(y,r). $$
(22)

(II) For k>n−2+i+j,

$$\lambda_{n,k}^{(i,j)}(0)=\left(p_{1}^{(1)}\right)^{i}\left(1-p_{1}^{(1)}\right)^{1-i}p_{ij}^{(n-1)}, $$
$$ \lambda_{n,k}^{(i,j)}(1)=0. $$
(23)
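Part (I) of Lemma 1 can be evaluated directly and, for small n, compared with an exhaustive enumeration. The Python sketch below is an illustration restricted to the range 2≤k≤n−2+i+j of Eq. (22); it uses an integer version of the extended binomial coefficient (zero for a negative lower index), and all function names and parameter values are assumptions of ours.

```python
from itertools import product
from math import factorial, isclose


def binom(x, r):
    """Extended binomial coefficient for integers: 0 if r < 0, else x(x-1)...(x-r+1)/r!."""
    if r < 0:
        return 0
    num = 1
    for t in range(r):
        num *= x - t
    return num // factorial(r)


def H(m, alpha, r, k):
    """Eq. (12); the sum is empty, hence 0, when alpha < 0."""
    return sum((-1) ** j * binom(m, j)
               * binom(alpha - (k + 1) * j + r - 1, alpha - (k + 1) * j)
               for j in range(alpha // (k + 1) + 1))


def pi(n, i, j, y, r, p00, p11, p1_init):
    """Eq. (18)."""
    start = p1_init if i == 1 else 1 - p1_init
    return (start * p00 ** (y - r - 1 + i + j) * (1 - p00) ** (r - i)
            * (1 - p11) ** (r - j) * p11 ** (n - y - r))


def lam(n, k, i, j, x, p00, p11, p1_init):
    """lambda_{n,k}^{(i,j)}(x), x in {0, 1}, from Eq. (22); assumes 2 <= k <= n-2+i+j."""
    args = (p00, p11, p1_init)
    if x == 0:
        # C(alpha, r, k-2) of Eq. (13) is H with all r cells restricted.
        return sum(binom(y - 1, r - i - j) * H(r, n - y - r, r, k - 2)
                   * pi(n, i, j, y, r, *args)
                   for y in range(1, n - (i + j) + 1)
                   for r in range(i + j, y + i + j))
    total = pi(n, 1, 1, 0, 1, *args) if i + j == 2 else 0.0
    for y in range(1, n - k + 1):
        for r in range(1, y + i + j):
            total += (r * binom(y - 1, r - i - j)
                      * H(r - 1, n - y - r - k + 1, r, k - 2) * pi(n, i, j, y, r, *args))
    return total


def brute_lam(n, k, i, j, x, p00, p11, p1_init):
    """P(G_{n,k} = x, X_1 = i, X_n = j) by enumeration of all 2^n sequences."""
    P = [[p00, 1 - p00], [1 - p11, p11]]
    total = 0.0
    for seq in product((0, 1), repeat=n):
        if seq[0] != i or seq[-1] != j:
            continue
        runs = "".join(map(str, seq)).split("0")
        if sum(1 for run in runs if len(run) >= k) != x:
            continue
        prob = p1_init if i == 1 else 1 - p1_init
        for a, b in zip(seq, seq[1:]):
            prob *= P[a][b]
        total += prob
    return total


params = (0.7, 0.6, 0.3)   # p00, p11, p1^(1): illustrative values only
print(lam(7, 3, 1, 0, 0, *params), brute_lam(7, 3, 1, 0, 0, *params))
```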

Theorem 2

For n≥5, \(2\leq k\leq \lfloor (n-1)/2\rfloor\), \((d,m,s)\in \Omega_{n,k}\), \(0<p_{1}^{(1)}<1\), it holds

$$\begin{array}{*{20}l} {}h_{n,k}(d,m,s)\,=\,p_{11}^{2k-2}\left(p_{1}^{(1)}\right)^{-1}\!\! \sum_{i=1}^{n-d+1}\!\!\!\ell_{i-1,k}^{(\alpha)}\ell_{n-d-i+1,k}^{(\beta)}\!\!\!\!\!\!\!\!\!\sum_{r=m}^{m+\left\lfloor\frac{d-s-m+1}{2}\right\rfloor}\! \sum_{y=r-1}^{d-s-r+m}\!\!\!\gamma_{d,m,s}(y,\!r)\pi_{d-\!2k+\!2}^{(1,1)}(y,\!r), \end{array} $$
(24)

where

$${}\ell_{n,k}^{(\alpha)}=p_{1}^{(1)},\,\text{for}\,n=0;\quad p_{0}^{(n)}p_{01},\,\text{for}\, 1\leq n\leq k;\quad p_{01}\left[\lambda_{n,k}^{(0,0)}(0)+\lambda_{n,k}^{(1,0)}(0)\right],\,\text{for}\,n\geq k+1, $$
$$ {}\ell_{n,k}^{(\beta)}=1,\,\text{for}\,n=0;\quad p_{10},\,\text{for}\, 1\leq n\leq k;\quad \!p_{10}(p_{0}^{(1)})^{-1}\left[\lambda_{n,k}^{(0,0)}(0)+\lambda_{n,k}^{(0,1)}(0)\right],\,\text{for}\,n\geq k+1 $$
(25)

and

$$\begin{array}{@{}rcl@{}} {}\gamma_{d,m,s}(y,r)\,=\,{y-1\choose r-2}{r-2\choose m-2}{s-mk+m-1\choose m-1}C(d-y-s-r+m,r-m,k\,-\,2). \end{array} $$
(26)

Proof

For \(1\leq r_{1}\leq r_{2}\leq n\) let \(Y_{r_{1},r_{2}}\), \(R_{r_{1},r_{2}}\), \(L_{r_{1},r_{2}}\), \(S_{r_{1},r_{2},k}\), \(D_{r_{1},r_{2},k}\), \(G_{r_{1},r_{2},k}\) be RVs defined on the subsequence \(X_{r_{1}}, X_{r_{1}+1},\ldots,X_{r_{2}}\) of \(\{X_{t}\}_{t=1}^{n}\). For m≥2 define the event

$$\begin{array}{@{}rcl@{}} \lefteqn{\Delta_{r_{1},r_{1}+d-1}(d,s,m,y,r)}\\ & & =\{D_{r_{1},r_{1}+d-1,k}=d, G_{r_{1},r_{1}+d-1,k}=m, S_{r_{1},r_{1}+d-1,k}=s, Y_{r_{1},r_{1}+d-1}=y, R_{r_{1},r_{1}+d-1}=r\}. \end{array} $$

An element of this event is a 0−1 sequence of length d, starting and ending with a 1, for which the \(y_{j}\)’s and \(z_{j}\)’s, representing the lengths of the failure and success runs, respectively, satisfy the conditions:

  (a) \(y_{1}+y_{2}+\ldots+y_{r-1}=y\), \(y_{j}\geq 1\), 1≤j≤r−1.

  (b) \(z_{1}+z_{i_{1}}+z_{i_{2}}+\ldots +z_{i_{m-2}}+z_{r}=s\), \(z_{j}\geq k\), \(j\in \{1,i_{1},i_{2},\ldots,i_{m-2},r\}\), for some specific combination \(\{1,i_{1},i_{2},\ldots,i_{m-2},r\}\) of {1,2,…,r−1,r} among the \({r-2\choose m-2}\) ones.

  (c) \(z_{i_{m-1}}+z_{i_{m}}+\ldots +z_{i_{r-2}}=d-y-s\), \(1\leq z_{i_{j}}\leq k-1\), m−1≤j≤r−2, for \(\{i_{m-1},\ldots,i_{r-2}\}=\{1,2,\ldots,r\}-\{1,i_{1},i_{2},\ldots,i_{m-2},r\}\).

Fix \(i_{1},i_{2},\ldots,i_{m-2}\). Then the number of such sequences, i.e. the number of solutions of the system (a)-(c), is

$${y-1\choose r-2}C(d-y-s-r+m,r-m,k-2){s-mk+m-1\choose m-1} $$

and each such sequence has probability

$$p_{1}^{(1)}p_{11}^{k-1}(p_{1}^{(1)})^{-1}\pi_{d-2k+2}^{(1,1)}(y,r)p_{11}^{k-1}=p_{11}^{2k-2}\pi_{d-2k+2}^{(1,1)}(y,r). $$

Hence,

$$\begin{array}{@{}rcl@{}} P(\Delta_{r_{1},r_{1}+d-1}(d,s,m,y,r))&=& p_{11}^{2k-2}\pi_{d-2k+2}^{(1,1)}(y,r){r-2\choose m-2}{y-1\choose r-2}{s-mk+m-1\choose m-1}\\ & & \times C(d-y-s-r+m,r-m,k-2). \end{array} $$

For k+2≤i≤n−k−d, m≥2, we have that

$$\begin{array}{@{}rcl@{}} \lefteqn{P\left(U_{n,k}^{(1)}=i, D_{n,k}=d, G_{n,k}=m, S_{n,k}=s, Y_{i,i+d-1}=y, R_{i,i+d-1}=r\right)}\\ &=&P\Big\{\left[(L_{1,i-1}<k, X_{i-1}=0)\cap\left[(X_{1}=0)\cup(X_{1}=1)\right]\right]\cap \Delta_{i,i+d-1}(d,s,m,y,r)\\ & &\cap\left[(L_{i+d,n}<k, X_{i+d}=0)\cap\left[(X_{n}=0)\cup(X_{n}=1)\right]\right]\Big\}\\ &= & \left[\lambda_{i-1,k}^{(0,0)}(0)+\lambda_{i-1,k}^{(1,0)}(0)\right]p_{01}\\ & &\times \left(p_{1}^{(1)}\right)^{-1}P\left(\Delta_{i,i+d-1}(d,s,m,y,r)\right) p_{10}\left[\lambda_{n-i-d+1,k}^{(0,0)}(0)+\lambda_{n-i-d+1,k}^{(0,1)}(0)\right]/p_{0}^{(1)}\\ &=& \left[\lambda_{i-1,k}^{(0,0)}(0)+\lambda_{i-1,k}^{(1,0)}(0)\right]p_{01}\left(p_{1}^{(1)}\right)^{-1}p_{11}^{2k-2}\pi_{d-2k+2}^{(1,1)}(y,r)\\ & & \times{r-2\choose m-2}{y-1\choose r-2}{s-mk+m-1\choose m-1}\\ & & \times C(d-y-s-r+m,r-m,k-2)p_{10}\left(p_{0}^{(1)}\right)^{-1} \left[\lambda_{n-i-d+1,k}^{(0,0)}(0)+\lambda_{n-i-d+1,k}^{(0,1)}(0)\right]. \end{array} $$

By similar reasoning we get the remaining cases of i, i.e. 1≤i≤k+1 and n−d+1−k≤i≤n−d+1. Then, summing with respect to i, y and r, we get the result. □

Having found \(h_{n,k}(d,m,s)\), we next proceed to obtain \(v_{n,k}(d,m,s)\). To this end, the required probabilities \(\alpha_{n,k}\) for HMC1 are recalled, in Lemma 2, from Arapis et al. (2016) for k=1, and they are computed via Lemma 1 for \(2\leq k\leq \lfloor (n-1)/2\rfloor\).

Lemma 2

For n≥k≥1, the probability \(\alpha_{n,k}\), for HMC1, is computed via the expressions:

(I) For k=1,

$$\begin{array}{@{}rcl@{}} \alpha_{n,1}&=&1-p_{00}^{n-3}\left\{p_{00}\left(1+(n-2)p_{01}\right)+\frac{(n-1)(n-2)}{2}p_{0}^{(1)}p_{01}^{2}\right\},\quad \text{if}\quad p_{00}=p_{11} \end{array} $$

and

$$\begin{array}{@{}rcl@{}} \alpha_{n,1}&=& 1-p_{0}^{(1)}p_{00}^{n-1}-p_{11}^{n-2}\left(p_{1}^{(1)}+p_{0}^{(1)}p_{01}\right)-p_{00} \left(p_{0}^{(1)}p_{01}+p_{1}^{(1)}p_{10}\right)\frac{p_{11}^{n-2}-p_{00}^{n-2}}{p_{11}-p_{00}} \\ & &\quad -p_{0}^{(1)}p_{01}p_{10}\frac{p_{11}^{n-1}-p_{00}^{n-2} \left[p_{11}+(n-2)\left(p_{11}-p_{00}\right)\right]}{\left(p_{11}-p_{00}\right)^{2}},\quad \text{if}\quad p_{00}\neq p_{11}. \end{array} $$
(27)

(II) For 2≤k≤n,

$$\begin{array}{@{}rcl@{}}\alpha_{n,k}=1-\sum_{(i,j)\in \{0,1\}^{2}}\left[\lambda_{n,k}^{(i,j)}(0)+\lambda_{n,k}^{(i,j)}(1)\right]. \end{array} $$
(28)
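Since \(\alpha_{n,1}=P(G_{n,1}\geq 2)\) is one minus the probability of at most one run of 1s, Eq. (27) can be verified for small n by enumeration. The Python sketch below implements both branches of (27) and a brute-force counterpart; it is an illustration assuming \(0<p_{00},p_{11}<1\), with function names and parameter values chosen by us.

```python
from itertools import product
from math import isclose


def alpha_n1(n, p00, p11, p1_init):
    """alpha_{n,1} = P(G_{n,1} >= 2) from Eq. (27)."""
    p01, p10, p0_init = 1 - p00, 1 - p11, 1 - p1_init
    if isclose(p00, p11):
        # First branch of (27): p00 = p11.
        return 1 - p00 ** (n - 3) * (
            p00 * (1 + (n - 2) * p01) + (n - 1) * (n - 2) / 2 * p0_init * p01 ** 2)
    # Second branch of (27): p00 != p11.
    diff = p11 - p00
    return (1 - p0_init * p00 ** (n - 1)
            - p11 ** (n - 2) * (p1_init + p0_init * p01)
            - p00 * (p0_init * p01 + p1_init * p10)
            * (p11 ** (n - 2) - p00 ** (n - 2)) / diff
            - p0_init * p01 * p10
            * (p11 ** (n - 1) - p00 ** (n - 2) * (p11 + (n - 2) * diff)) / diff ** 2)


def brute_alpha(n, p00, p11, p1_init):
    """P(at least two runs of 1s) by enumeration of all 2^n sequences."""
    P = [[p00, 1 - p00], [1 - p11, p11]]
    total = 0.0
    for x in product((0, 1), repeat=n):
        runs = sum(1 for t, v in enumerate(x) if v == 1 and (t == 0 or x[t - 1] == 0))
        if runs < 2:
            continue
        prob = p1_init if x[0] == 1 else 1 - p1_init
        for a, b in zip(x, x[1:]):
            prob *= P[a][b]
        total += prob
    return total


# Illustrative parameter sets (p00, p11, p1^(1)); the first matches the setting of Table 2.
for params in ((0.9, 0.9, 0.5), (0.7, 0.6, 0.3)):
    print(alpha_n1(8, *params), brute_alpha(8, *params))
```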

Theorem 3

For n≥3, \(1\leq k\leq \lfloor (n-1)/2\rfloor\), \((d,m,s)\in \Omega_{n,k}\), \(0<p_{1}^{(1)}<1\), the PMF \(v_{n,k}(d,m,s)\) for a HMC1, with given \(\mathbf{P}\) and \(\mathbf{p}^{(1)}\), is calculated by

$$ v_{n,k}(d,m,s)=\alpha_{n,k}^{-1}h_{n,k}(d,m,s), $$
(29)

where \(\alpha_{n,k}\) and \(h_{n,k}(d,m,s)\) are provided by Lemma 2 and Theorems 1 (for k=1) and 2 (for \(2\leq k\leq \lfloor (n-1)/2\rfloor\)), respectively.

Remark 1

For IID sequences, in implementing Theorem 3, one has to take into consideration Eqs. (10) - (11), (19) and (21). Moreover, to speed up calculations, one can factor \(\pi_{n}(y)\) out of the inner summation in (22), since by (19) it does not depend on r.

A numerical example

In this example we compute some indicative numerical values for the two models (HMC1 and IID) of 0−1 sequences \(\{X_{t}\}_{t=1}^{n}\) considered in the paper. The common length was taken small, n=8, partly because of space limitations and partly so that the required computations can also be carried out with a hand/pocket calculator, which gives insight into the formulae developed in the Results section. The sequences used are as follows. Table 1: an IID sequence with \(p_{1}=0.5\). Table 2: a HMC1 sequence with \(p_{00}=p_{11}=0.9\), \(p_{1}^{(1)}=0.5\).

Table 1 0−1 IID sequence with \(p_{1}=0.5\)
Table 2 0−1 HMC1 sequence with \(p_{00}=p_{11}=0.9\), \(p_{1}^{(1)}=0.5\)

Both tables depict, for k=1,2,3, \(v_{8,k}(d,m,s)\), \((d,m,s)\in \Omega_{8,k}\), and \(f_{8,k}(d)\), 2k+1≤d≤8, illustrating the numeric values of the involved probabilities. \(v_{8,k}(d,m,s)\) and \(f_{8,k}(d)\) were computed via Eqs. (29) and (17), respectively.
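As a hedged illustration of how such values can be produced, the Python sketch below evaluates \(f_{8,1}(d)\) for the IID setting of Table 1 by combining Eqs. (21), (27), (29) and (17). It covers k=1 only and relies on the \(p_{00}=p_{11}\) branch of (27), which for IID trials applies when \(p_{1}=0.5\); the function names are ours and the printed numbers are not copied from the paper's tables.

```python
from math import comb, isclose


def h_iid(n, d, m, s, p1):
    """Eq. (21): h_{n,1}(d,m,s) for IID trials."""
    return ((n - d + 1) * comb(s - 1, m - 1) * comb(d - s - 1, m - 2)
            * p1 ** s * (1 - p1) ** (n - s))


def alpha_symmetric(n, p00, p0_init):
    """First branch of Eq. (27), valid when p00 = p11."""
    p01 = 1 - p00
    return 1 - p00 ** (n - 3) * (
        p00 * (1 + (n - 2) * p01) + (n - 1) * (n - 2) / 2 * p0_init * p01 ** 2)


def f_iid_k1(n, p1):
    """f_{n,1}(d), 3 <= d <= n, via Eqs. (17) and (29); sketch for p1 = 0.5 only."""
    if not isclose(p1, 0.5):
        raise ValueError("this sketch only implements the p00 = p11 branch of Eq. (27)")
    alpha = alpha_symmetric(n, 1 - p1, 1 - p1)
    f = {}
    for d in range(3, n + 1):
        total = 0.0
        for s in range(2, d):                            # 2k <= s <= d-1 with k = 1
            for m in range(2, min(s, d - s + 1) + 1):    # 2 <= m <= min(s, d-s+1)
                total += h_iid(n, d, m, s, p1)
        f[d] = total / alpha
    return f


f8 = f_iid_k1(8, 0.5)
for d, value in f8.items():
    print(d, round(value, 6))
print(sum(f8.values()))
```

The last printed value is the total mass of \(f_{8,1}\), which serves as a consistency check between Eqs. (21) and (27).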

Discussion and further study

In this article we have derived exact closed form expressions for the PMF \(v_{n,k}(d,m,s)\), n≥3, \(1\leq k\leq \lfloor (n-1)/2\rfloor\), \((d,m,s)\in \Omega_{n,k}\), of the RV \(\mathbf{V}_{n,k}\) defined on a 0−1 sequence of homogeneous Markov-dependent trials. The method used is a combinatorial one, relying on results that exploit the internal structure of such a sequence.

As noted in the Introduction, the application domain of runs spans a diverse range of fields. Some indicative ones are discussed next.

Encoding, compression and transmission of digital information call for an understanding of the distributions of runs of 1s or 0s. Such knowledge helps in analyzing, and also in comparing, several techniques used in communication networks. In such networks 0−1 data ranging from a few kilobytes (e.g. e-mails) to many gigabytes of greedy multimedia applications (e.g. video on demand) are heavily encoded, decoded and eventually processed under security. For details, see e.g., Sinha and Sinha (2009), Makri and Psillakis (2011a) and Tabatabaei and Zivic (2015).

An area where the study of runs of 1s and 0s has become increasingly useful is the field of bioinformatics or computational biology. For instance, molecular biologists design similarity tests between two DNA sequences where a 1 is interpreted as a match of the sequences at a given position and everything else as a 0. Moreover, the probabilistic analysis of such sequences according to the form, the length and the number of detected patterns, as well as the positions and the lengths of the segments of the sequence in which they are concentrated, may suggest a functional reason for the internal structure of the examined sequence. The latter facts might be useful in suggesting a further investigation of the underlying sequence(s) by biologists. See, e.g. Avery and Henderson (1999), Benson (1999) and Nuel et al. (2010).

Another active area where run statistics, in particular \(G_{n,k}\) and \(S_{n,k}\), have interesting statistical applications is that connected to hypothesis testing; e.g., in tests of randomness. For a systematic study of such a topic we refer to, among others, the works of Koutras and Alexandrou (1997) and Antzoulakos et al. (2003).

Accordingly, it is reasonable for one to use the exact expressions obtained for \(v_{n,k}(d,m,s)\) in applications like the ones mentioned above. This is so because this distribution, as a joint one, is more flexible than each one of its marginals which have been used in such applications. See, e.g. Lou (2003), Makri and Psillakis (2011b) and Arapis et al. (2016).

Moreover, in handling 0−1 sequences of large length, with or without dependent elements, a Monte Carlo simulation based on Eqs. (1) - (4) would be a useful tool for obtaining approximate values of \(v_{n,k}(d,m,s)\). In addition, the general approximating methods suggested by Johnson and Fu (2014) might be helpful in deriving approximate values for \(f_{n,k}(d)\).
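A minimal sketch of such a Monte Carlo scheme is given below, assuming a HMC1 specified by \(p_{00}\), \(p_{11}\) and \(p_{1}^{(1)}\). It simulates sequences, evaluates \((D_{n,k},G_{n,k},S_{n,k})\) from the run lengths (equivalent to Eqs. (1) - (4)) and keeps only the realizations with \(G_{n,k}\geq 2\). The function names, the sample size and the seed are illustrative choices of ours, not part of the paper.

```python
import random
from collections import Counter
from itertools import groupby


def simulate_chain(n, p00, p11, p1_init, rng):
    """Draw X_1,...,X_n from a two-state homogeneous first-order Markov chain."""
    x = [1 if rng.random() < p1_init else 0]
    for _ in range(n - 1):
        stay = p11 if x[-1] == 1 else p00
        x.append(x[-1] if rng.random() < stay else 1 - x[-1])
    return x


def v_statistic(x, k):
    """(D_{n,k}, G_{n,k}, S_{n,k}) of a realized sequence; None if there is no k-tuple."""
    runs, pos = [], 0
    for symbol, group in groupby(x):
        length = len(list(group))
        if symbol == 1 and length >= k:
            runs.append((pos, length))
        pos += length
    if not runs:
        return None
    d = runs[-1][0] + runs[-1][1] - runs[0][0]
    return (d, len(runs), sum(length for _, length in runs))


def estimate_v(n, k, p00, p11, p1_init, n_samples=20000, seed=0):
    """Relative frequencies of (d, m, s) among simulated sequences with G_{n,k} >= 2."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        v = v_statistic(simulate_chain(n, p00, p11, p1_init, rng), k)
        if v is not None and v[1] >= 2:
            counts[v] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: c / total for key, c in counts.items()}


estimate = estimate_v(8, 1, 0.9, 0.9, 0.5)
for key in sorted(estimate):
    print(key, round(estimate[key], 4))
```

The estimates are subject to sampling error, which grows when \(\alpha_{n,k}\) is small because few simulated sequences satisfy \(G_{n,k}\geq 2\).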

References

  • Antzoulakos, DL, Bersimis, S, Koutras, MV: On the distribution of the total number of run lengths. Ann. Inst. Statist. Math. 55, 865–884 (2003).

  • Antzoulakos, DL, Chadjiconstantinidis, S: Distributions of numbers of success runs of fixed length in Markov dependent trials. Ann. Inst. Statist. Math. 53, 559–619 (2001).

  • Arapis, AN, Makri, FS, Psillakis, ZM: On the length and the position of the minimum sequence containing all runs of ones in a Markovian binary sequence. Statist. Probab. Lett. 116, 45–54 (2016).

  • Arapis, AN, Makri, FS, Psillakis, ZM: Distribution of statistics describing concentration of runs in non homogeneous Markov-dependent trials. Commun. Statist. Theor. Meth. (2017). doi:10.1080/03610926.2017.1337144.

  • Avery, PJ, Henderson, D: Fitting Markov chain models to discrete state series such as DNA sequences. Appl. Statist. 48(Part 1), 53–61 (1999).

  • Balakrishnan, N, Koutras, MV: Runs and Scans with Applications. Wiley, New York (2002).

  • Benson, G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).

  • Eryilmaz, S: Some results associated with the longest run statistic in a sequence of Markov dependent trials. Appl. Math. Comput. 175, 119–130 (2006).

  • Eryilmaz, S: Discrete time shock models involving runs. Statist. Probab. Lett. 107, 93–100 (2015).

  • Eryilmaz, S: Generalized waiting time distributions associated with runs. Metrika. 79, 357–368 (2016).

  • Eryilmaz, S: The concept of weak exchangeability and its applications. Metrika. 80, 259–271 (2017).

  • Eryilmaz, S, Yalcin, F: Distribution of run statistics in partially exchangeable processes. Metrika. 73, 293–304 (2011).

  • Feller, W: An Introduction to Probability Theory and Its Applications. 3rd Ed., Vol. I. Wiley, New York (1968).

  • Fu, JC, Lou, WYW: Distribution Theory of Runs and Patterns and Its Applications: A finite Markov chain imbedding approach. World Scientific, River Edge (2003).

  • Johnson, BC, Fu, JC: Approximating the distributions of runs and patterns. J. Stat. Distrib. Appl. 1:5, 1–15 (2014).

  • Koutras, MV: Applications of Markov chains to the distribution of runs and patterns. In: Shanbhag, DN, Rao, CR (eds.) Handbook of Statistics, pp. 431–472. Elsevier, North-Holland (2003).

  • Koutras, MV, Alexandrou, V: Non-parametric randomness tests based on success runs of fixed length. Statist. Probab. Lett. 32, 393–404 (1997).

  • Koutras, VM, Koutras, MV, Yalcin, F: A simple compound scan statistic useful for modeling insurance and risk management problems. Insur. Math. Econ. 69, 202–209 (2016).

  • Lou, WYW: The exact distribution of the k-tuple statistic for sequence homology. Statist. Probab. Lett. 61, 51–59 (2003).

  • Makri, FS, Philippou, AN, Psillakis, ZM: Success run statistics defined on an urn model. Adv. Appl. Prob. 39, 991–1019 (2007).

  • Makri, FS, Psillakis, ZM: On success runs of a fixed length in Bernoulli sequences: Exact and asymptotic results. Comput. Math. Appl. 61, 761–772 (2011a).

  • Makri, FS, Psillakis, ZM: On runs of length exceeding a threshold: normal approximation. Stat. Papers. 52, 531–551 (2011b).

  • Makri, FS, Psillakis, ZM: On ℓ-overlapping runs of ones of length k in sequences of independent binary random variables. Commun. Statist. Theor. Meth. 44, 3865–3884 (2015).

  • Makri, FS, Psillakis, ZM, Arapis, AN: Counting runs of ones with overlapping parts in binary strings ordered linearly and circularly. Intern. J. Statist. Probab. 2, 50–60 (2013).

  • Makri, FS, Psillakis, ZM, Arapis, AN: Length of the minimum sequence containing repeats of success runs. Statist. Probab. Lett. 96, 28–37 (2015).

  • Mood, AM: The distribution theory of runs. Ann. Math. Statist. 11, 367–392 (1940).

  • Mytalas, GC, Zazanis, MA: Central limit theorem approximations for the number of runs in Markov-dependent binary sequences. J. Statist. Plann. Infer. 143, 321–333 (2013).

  • Mytalas, GC, Zazanis, MA: Central limit theorem approximations for the number of runs in Markov-dependent multi-type sequences. Commun. Statist. Theor. Meth. 43, 1340–1350 (2014).

  • Nuel, G, Regad, L, Martin, J, Camproux, A-C: Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data. Algorithm Mol. Biol. 5, 1–18 (2010).

  • Riordan, J: An Introduction to Combinatorial Analysis. Second Ed. John Wiley, New York (1964).

  • Sinha, K, Sinha, BP: On the distribution of runs of ones in binary trials. Comput. Math. Appl. 58, 1816–1829 (2009).

  • Tabatabaei, SAH, Zivic, N: A review of approximate message authentication codes. In: Zivic, N (ed.) Robust Image Authentication in the Presence of Noise, pp. 106–127. Springer International Publishing AG, Cham (ZG), Switzerland (2015).

Acknowledgements

The authors wish to thank the Editor for the thorough reading, and the anonymous reviewers for useful comments and suggestions which improved the article.

Author information

Contributions

The authors, ANA, FSM and ZMP with the consultation of each other carried out this work and drafted the manuscript together. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Frosso S. Makri.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Arapis, A., Makri, F. & Psillakis, Z. Joint distribution of k-tuple statistics in zero-one sequences of Markov-dependent trials. J Stat Distrib App 4, 26 (2017). https://doi.org/10.1186/s40488-017-0080-5

  • DOI: https://doi.org/10.1186/s40488-017-0080-5
