Open Access

Joint distribution of k-tuple statistics in zero-one sequences of Markov-dependent trials

  • Anastasios N. Arapis1,
  • Frosso S. Makri1 and
  • Zaharias M. Psillakis2
Journal of Statistical Distributions and Applications 2017, 4:26

https://doi.org/10.1186/s40488-017-0080-5

Received: 29 March 2017

Accepted: 18 October 2017

Published: 15 November 2017

Abstract

We consider a sequence of n, n≥3, zero (0) - one (1) Markov-dependent trials. We focus on k-tuples of 1s; i.e. runs of 1s of length at least equal to a fixed integer k, 1≤k≤n. The statistics denoting the number of k-tuples of 1s, the number of 1s in them and the distance between the first and the last k-tuple of 1s in the sequence are defined. The work provides, in a closed form, the exact conditional joint distribution of these statistics given that the number of k-tuples of 1s in the sequence is at least two. The case of independent and identically distributed 0−1 trials is also covered in the study. A numerical example illustrates further the theoretical results.

Keywords

Exact distributions; Runs; Binary trials; Markov chain

AMS Subject Classification

Primary 60E05, 62E15; Secondary 60J10, 60C05

Introduction

Run counting statistics defined on a sequence of binary (zero (0) - one (1)) random variables (RVs), along with their exact and approximate distributions, have been extensively studied in the literature. Their popularity is due to the fact that such statistics appear as useful theoretical models in many research areas including statistics (e.g. hypothesis testing), engineering (e.g. system reliability and quality control), biology (e.g. population genetics and DNA sequence analysis), computer science (e.g. encoding/decoding/transmission of digital information) and financial engineering (e.g. insurance and risk analysis).

In such applications, a key point is understanding how 1s and 0s are distributed and combined as elements of a 0−1 sequence (finite or infinite, memoryless or not), eventually forming runs of 1s or 0s which are enumerated according to certain counting schemes. Each scheme defines how runs of the same symbol, or strings (patterns) of both symbols, are formed and consequently enumerated. A counting scheme may depend on, among other considerations, whether overlapping counting is allowed and whether the counting starts from scratch once a run/string of a certain size has been enumerated.

The counting scheme as well as the intrinsic uncertainty of a 0−1 sequence are often suggested by the applications. Commonly used probabilistic models for the internal structure of a 0−1 sequence either take its elements to be independent of one another or assume some kind of dependence among them. The methods used to derive exact/approximating, marginal/joint probability distributions include combinatorial analysis, generating functions, the finite Markov chain imbedding technique and recursive schemes, as well as normal, Poisson and large deviation approximations.

For extensive reviews of the recent literature on the distribution theory of runs and patterns we refer to Balakrishnan and Koutras (2002) and Fu and Lou (2003). Current works on the subject include, among others, those of Antzoulakos and Chadjiconstantinidis (2001); Eryilmaz (2006, 2015, 2016, 2017); Eryilmaz and Yalcin (2011); Johnson and Fu (2014); Koutras (2003); Koutras et al. (2016); Makri and Psillakis (2015); Makri et al. (2013) and Mytalas and Zazanis (2013, 2014).

In this article we derive expressions for a conditional distribution of a trivariate statistic. Its components denote the number of runs of 1s of length exceeding a fixed threshold, the number of 1s in such runs and the length of the minimum segment of the sequence in which these runs are concentrated. The study is developed on a sequence of two-state (0−1) Markov-dependent trials. The runs are enumerated according to Mood’s (1940) counting scheme.

More specifically, the manuscript is organized as follows. In Section 2 we present some preliminary material, including notation and definitions, necessary to develop our results. In Section 3 we give the motivation along with a statement of the aim of the work. The main results are obtained in Section 4, and a numerical example presented in Section 5 clarifies them. A discussion of the results as well as a note on future work are given in Section 6.

Throughout the article, for integers n, m, \({n\choose m}\) denotes the extended binomial coefficient (see, Feller (1968), pp. 50, 63), \(\lfloor x\rfloor\) stands for the greatest integer less than or equal to x and δ ij denotes the Kronecker delta function of the integer arguments i and j. Further, for α>β, we apply the conventions \(\sum _{i=\alpha }^{\beta }y_{i}=0\), \(\prod _{i=\alpha }^{\beta }y_{i}=1\), \(\sum _{i=\alpha }^{\beta }\mathbf {Y}^{(i)}=\mathbf {O}\equiv {\scriptsize \left (\begin {array}{cc} 0 &0\\ 0 & 0 \end {array}\right)}\), \(\prod _{i=\alpha }^{\beta }\mathbf {Y}^{(i)}=\mathbf {I}\equiv {\scriptsize \left (\begin {array}{cc} 1 &0\\ 0 & 1 \end {array}\right)}\), where y i and Y (i) are scalars and 2×2 matrices, respectively.

Preliminaries

2.1 Run counting statistics

Let \(\{X_{t}\}_{t=1}^{n}\), n≥1, be the first n trials of a binary (0−1) sequence of RVs, X t =x t {0,1}. A run of 1s, is a (sub)sequence of \(\{X_{t}\}_{t=1}^{n}\) consisting of consecutive 1s, the number of which is referred to as its length, preceded and succeeded by 0s or by nothing.

Given a fixed integer k, 1≤k≤n, a k-tuple of 1s is a run of 1s of length k or more. In the paper we will deal with the following statistics defined on a 0−1 sequence \(\{X_{t}\}_{t=1}^{n}\). For details see, e.g. Makri et al. (2015) and the references therein.

(I) G n,k denoting the number of k-tuples of 1s, 1≤k≤n. In particular, G n,1 denotes the number of 1-tuples of 1s, i.e. it represents the number R n ≡G n,1 of all runs of 1s in the sequence. Using the convention X 0=X n+1≡0, we can define G n,k as
$$ G_{n,k}=\sum_{i=k}^{n}E_{n,i},\, 1\leq k\leq n, $$
(1)
where
$$E_{n,i}=\sum_{j=i}^{n}J_{j},\, J_{j}=\left(1-X_{j-i}\right)\left(1-X_{j+1}\right)\prod_{r=j-i+1}^{j}X_{r}. $$
(II) S n,k denoting the number of 1s in the G n,k k-tuples of 1s; i.e. S n,k represents the sum of lengths of the G n,k k-tuples of 1s, 1≤kn. In particular S n,1 represents the number of all 1s in the sequence; hence, the number of 0s, Y n , in the sequence is Y n =nS n,1. S n,k is formally defined as
$$ S_{n,k}=\sum_{i=k}^{n}{iE}_{n,i}, \,1\leq k\leq n. $$
(2)

Readily, kG n,k ≤S n,k .

(III) L n , n≥1, denoting the length of the longest run of 1s in the sequence. By setting
$$\Lambda_{n}=\{i:G_{n,i}>0, 1\leq i\leq n\}, $$
we have that
$$ L_{n}=\max \{k: k\in \Lambda_{n}\},\, \text{if}\, \Lambda_{n}\neq \emptyset;\,0,\,\text{otherwise}. $$
(3)

Readily L n <k iff G n,k <1.

(IV) For G n,k ≥1, 1≤k≤n, D n,k denotes the distance (number of trials) between and including the first 1 of the first k-tuple of 1s and the last 1 of the last k-tuple of 1s in the sequence. If there is only one k-tuple of 1s in the sequence then D n,k denotes its length. That is, D n,k represents the size (length) of the minimum (sub)sequence of \(\{X_{t}\}_{t=1}^{n}\) in which all G n,k k-tuples of 1s are concentrated. In particular, D n,1 represents the length of the minimum segment of the sequence containing all R n runs of 1s or, in other words, all S n,1 1s appearing in the sequence. For G n,k ≥1, 1≤k≤n, D n,k can be formally defined as
$$ D_{n,k}=U_{n,k}^{(2)}-U_{n,k}^{(1)}+1, $$
(4)
where
$$U_{n,k}^{(1)}=\min\{j:I_{j}=1, 1\leq j\leq n-k+1\}, $$
$$U_{n,k}^{(2)}=\max\{j:I_{j-k+1}=1, k\leq j\leq n\}, $$
$$I_{j}=\prod_{r=j}^{j+k-1}X_{r},\, 1\leq j\leq n-k+1. $$
Readily, D n,k =S n,k =L n , if G n,k =1 and D n,k >S n,k >L n , if G n,k >1.

(V) For G n,k ≥1, 1≤k≤n, set V n,k =(D n,k ,G n,k ,S n,k ). This is the RV we focus on in the article.

Example: By way of illustration consider the trials 1110001100010001010011101111001001001001 numbered from 1 to 40. Then, L 40=4 and V 40,1=(40,11,19), V 40,2=(28,4,12), V 40,3=(28,3,10), V 40,4=(4,1,4).
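Since all of the above statistics are purely combinatorial functions of the observed sequence, they are straightforward to compute directly. The following Python sketch (function names are ours, not from the paper) reproduces the values of the illustration:

```python
def runs_of_ones(seq):
    """Return (start, length) pairs of the maximal runs of 1s in a 0-1 list."""
    runs, i, n = [], 0, len(seq)
    while i < n:
        if seq[i] == 1:
            j = i
            while j < n and seq[j] == 1:
                j += 1
            runs.append((i, j - i))  # 0-based start position, run length
            i = j
        else:
            i += 1
    return runs

def V(seq, k):
    """(D_{n,k}, G_{n,k}, S_{n,k}) as in Section 2.1; (0, 0, 0) when G_{n,k} = 0."""
    kt = [(s, l) for s, l in runs_of_ones(seq) if l >= k]  # the k-tuples of 1s
    if not kt:
        return (0, 0, 0)
    # span from the first 1 of the first k-tuple to the last 1 of the last one
    d = (kt[-1][0] + kt[-1][1] - 1) - kt[0][0] + 1
    return (d, len(kt), sum(l for _, l in kt))

seq = [int(c) for c in "1110001100010001010011101111001001001001"]
L = max(l for _, l in runs_of_ones(seq))   # longest run of 1s, L_40
print(L, V(seq, 1), V(seq, 2), V(seq, 3), V(seq, 4))
# -> 4 (40, 11, 19) (28, 4, 12) (28, 3, 10) (4, 1, 4)
```

The printed values agree with L 40=4 and V 40,1=(40,11,19), V 40,2=(28,4,12), V 40,3=(28,3,10), V 40,4=(4,1,4) above.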

2.2 Internal structure’s models

A general enough model for the internal structure of a 0−1 sequence \(\{X_{t}\}_{t=1}^{n}\), n≥2, is that of the first n trials of a homogeneous 0−1 Markov chain of first order (HMC1). On such a model we will develop our results. Accordingly, we next state the necessary notation/definitions.

Let {X t } t≥1 be a HMC1 with state space \({\mathcal{A}}=\{0,1\}\), one step transition probability matrix
$$\mathbf{P}=(p_{ij})=\left(\begin{array}{cc} p_{00} & p_{01} \\ p_{10} & p_{11} \\ \end{array} \right), $$
with
$$ p_{ij}=P\left(X_{t}=j\mid X_{t-1}=i\right),\,i,j\in {\cal{A}},\,\sum_{j\in \cal{A}}p_{ij}=1,\,i\in {\cal{A}},\,t\geq 2 $$
(5)
and probability distribution vector at time t
$$\mathbf{p}^{(t)}=\left(p_{0}^{(t)}, p_{1}^{(t)}\right), $$
with
$$ p_{i}^{(t)}=P(X_{t}=i),\, i\in {\cal{A}},\, \sum_{i\in \cal{A}}p_{i}^{(t)}=1,\, t\geq 1. $$
(6)
Readily, because of the homogeneity of {X t } t≥1, it holds
$$\mathbf{p}^{(t)}=\mathbf{p}^{(t-1)}\mathbf{P}=\mathbf{p}^{(1)}\mathbf{P}^{t-1 },\,t\geq 2;\,\mathbf{p}^{(1)},\,t=1\,\,\text{and}\,\,\mathbf{P}^{t-1}=\left(p_{ij}^{(t-1)}\right),\, t\geq 2, $$
with
$$p_{i}^{(t)}=\mathbf{p}^{(t)}\mathbf{e}_{i}^{'},\, i\in {\mathcal{A}},\,t\geq 1, $$
$$ p_{ij}^{(t-1)}=P(X_{t-1+m}=j\mid X_{m}=i)=\mathbf{e}_{i}\mathbf{P}^{t-1}\mathbf{e}_{j}^{'},\,i,j\in {\mathcal{A}},\, t\geq 2,\, m\geq 1, $$
(7)

where \(\mathbf {e}_{i}^{'}\) is the transpose (i.e. the column vector) of the row vector e i , \(i\in {\mathcal {A}}\), with e 0=(1,0) and e 1=(0,1).

In particular, for p 01+p 10≠0, i.e. P≠I, it holds
$$ \mathbf{P}^{t-1}=\left(p_{01}+p_{10}\right)^{-1}\left\{\left(\begin{array}{cc} p_{10} & p_{01} \\ p_{10} & p_{01} \\ \end{array} \right)+(1-p_{01}-p_{10})^{t-1}\left(\begin{array}{cc} p_{01} & -p_{01} \\ -p_{10} & p_{10} \\ \end{array} \right)\right\},\, t\geq 2, $$
(8)
$$ p_{0}^{(t)}=p_{0}^{(1)}\left(1-p_{01}-p_{10}\right)^{t-1}+p_{10}\left(p_{01}+p_{10}\right)^{-1}\left[1-\left(1-p_{01}-p_{10}\right)^{t-1}\right],\, t\geq 1. $$
(9)
The setup of a 0−1 HMC1 \(\{X_{t}\}_{t=1}^{n}\), n≥2, covers the case of a 0−1 sequence of independent and identically distributed (IID) RVs, too. This is so, because a 0−1 IID sequence \(\{X_{t}\}_{t=1}^{n}\), n≥2, with
$$ P(X_{t}=1)=1-P(X_{t}=0)=p_{1},\, 1\leq t\leq n, $$
(10)
is a particular HMC1 with
$$p_{ij}=1-p_{1},\, j=0;\, p_{1}, j=1,\, i\in {\mathcal{A}},\,p_{ij}^{(t-1)}=p_{ij},\,i,j\in {\mathcal{A}},\,t\geq 2, $$
$$ p_{1}^{(t)}=p_{1}=1-p_{0}^{(t)},\, 1\leq t\leq n. $$
(11)
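The closed forms (8) and (9) are easy to check numerically against direct matrix powering; the Python sketch below does so for an illustrative HMC1 whose parameter values are our own choice:

```python
# Check Eq. (8) for P^{t-1} and Eq. (9) for p_0^{(t)} against direct
# matrix multiplication, for an arbitrary illustrative HMC1.
p00, p11 = 0.9, 0.7
p01, p10 = 1 - p00, 1 - p11
p0_1 = 0.6                       # p_0^{(1)} = P(X_1 = 0)
p1_1 = 1 - p0_1

def matmul(A, B):
    return [[sum(A[i][r] * B[r][j] for r in range(2)) for j in range(2)]
            for i in range(2)]

P = [[p00, p01], [p10, p11]]
Pt = [[1.0, 0.0], [0.0, 1.0]]    # will hold P^{t-1}
s = p01 + p10                    # assumed nonzero, i.e. P != I
for t in range(2, 10):
    Pt = matmul(Pt, P)           # direct powering: Pt = P^{t-1}
    lam = (1 - p01 - p10) ** (t - 1)
    closed = [[(p10 + lam * p01) / s, (p01 - lam * p01) / s],
              [(p10 - lam * p10) / s, (p01 + lam * p10) / s]]    # Eq. (8)
    assert all(abs(Pt[i][j] - closed[i][j]) < 1e-12
               for i in range(2) for j in range(2))
    p0_t = p0_1 * lam + (p10 / s) * (1 - lam)                    # Eq. (9)
    # compare with p^{(t)} = p^{(1)} P^{t-1}, first component
    assert abs(p0_t - (p0_1 * Pt[0][0] + p1_1 * Pt[1][0])) < 1e-12
print("Eqs. (8)-(9) agree with direct matrix powering")
```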

2.3 A combinatorial result

In the combinatorial analysis that will be used in Section 4, the following result, recalled from Makri et al. (2007), is useful. The coefficient
$$ H_{m}(\alpha,r,k)=\sum_{j=0}^{\left\lfloor\frac{\alpha}{k+1}\right\rfloor}(-1)^{j}{m\choose j}{\alpha-(k+1)j+r-1\choose \alpha-(k+1)j}, $$
(12)

represents the number of allocations of α indistinguishable balls into r distinguishable cells where each of the m, 0≤m≤r, specified cells is occupied by at most k balls. Equivalently, it gives the number of nonnegative integer solutions of the linear equation x 1+x 2+…+x r =α with the restrictions, for m≥1, \(0\leq x_{i_{j}}\leq k\), 1≤j≤m, for some specific m-combination {i 1,i 2,…,i m } of {1,2,…,r}, and no restrictions on the x j ’s, 1≤j≤r, for m=0.

Moreover, H r (α,r,k) is Riordan’s (1964, p. 104) coefficient
$$ C(\alpha,r,k)=\sum_{j=0}^{\left\lfloor\frac{\alpha}{k+1}\right\rfloor}(-1)^{j}{r\choose j}{\alpha-(k+1)j+r-1\choose \alpha-(k+1)j}. $$
(13)
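The combinatorial meaning of (12) can be confirmed by brute force for small parameters; in the sketch below (helper names are ours) the m restricted cells are taken, without loss of generality, to be the first m:

```python
from math import comb
from itertools import product

def H(m, a, r, k):
    """The coefficient H_m(alpha, r, k) of Eq. (12)."""
    return sum((-1) ** j * comb(m, j)
               * comb(a - (k + 1) * j + r - 1, a - (k + 1) * j)
               for j in range(a // (k + 1) + 1))

def count_allocations(a, r, k, m):
    """Allocations of a balls into r cells, the first m cells holding <= k balls."""
    return sum(1 for x in product(range(a + 1), repeat=r)
               if sum(x) == a and all(xi <= k for xi in x[:m]))

for a in range(7):
    for r in range(1, 4):
        for k in range(1, 4):
            for m in range(r + 1):
                assert H(m, a, r, k) == count_allocations(a, r, k, m)
# By construction, H_r(a, r, k) coincides with Riordan's C(a, r, k) of Eq. (13).
print("Eq. (12) verified for all small cases tried")
```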

Motivation and aim of the work

In a study of a 0−1 sequence \(\{X_{t}\}_{t=1}^{n}\), n≥3, it is reasonable for one to be interested in the probabilistic behavior of RV V n,k =(D n,k ,G n,k ,S n,k ). This happens because jointly its components provide a more refined view of the internal clustering structure of the sequence than the information extracted by each one alone.

Interpreting a k-tuple of 1s as a cluster of consecutive 1s of size at least k, D n,k represents the size of the minimum segment of \(\{X_{t}\}_{t=1}^{n}\) in which the G n,k clusters of size at least k and at most L n are concentrated. S n,k evaluates the overall density of the G n,k clusters, with respect to the number of 1s in them, as well as that of the minimum concentration segment. Large values of D n,k suggest that these G n,k clusters spread over the interval between the left and the right side of the sequence, whereas small values of D n,k indicate rather that the clusters are concentrated in a segment of the sequence of small size, leaving the remaining part(s) of the sequence empty of such clusters.

In addition to this information, a large value of S n,k paired with a small value of G n,k indicates the existence of clusters of 1s of a large size, and therefore a trend, whereas the same value of S n,k paired with a large value of G n,k rather indicates a distribution of clusters of small size in the (sub)sequence in which they are concentrated.

Therefore, based on the former interpretation, the motivation for the study as well as the usefulness of the statistic V n,k =(D n,k ,G n,k ,S n,k ) is apparent. In the sequel, we assume that G n,k ≥2 so that there are at least two k-tuples of 1s in the sequence and hence the distance D n,k is not degenerate. Moreover, this assumption is a common one in application areas of D n,k ; e.g., in detecting pattern (tandem or non-tandem direct) repeats in DNA sequences (Benson 1999).

For 1≤k≤n, set
$$ {\mathcal{M}}_{n,k}=\{G_{n,k}\geq 2\},\,\alpha_{n,k}=P\left({\mathcal{M}}_{n,k}\right) $$
(14)
and for n≥3, 1≤k≤(n−1)/2, define
$$ \Omega_{n,k}=\left\{(d,m,s): 2k+1\leq d\leq n, 2k\leq s\leq d-1, 2\leq m\leq \min\left(\lfloor s/k\rfloor, d-s+1\right)\right\} $$
(15)
and for (d,m,s)∈Ω n,k ,
$$h_{n,k}(d,m,s)=P\left(\mathbf{V}_{n,k}=(d,m,s), {\cal {M}}_{n,k}\right), $$
$$ v_{n,k}(d,m,s)=P\left(\mathbf{V}_{n,k}=(d,m,s)\mid {\cal {M}}_{n,k}\right)=h_{n,k}(d,m,s)/\alpha_{n,k}. $$
(16)

The paper provides exact closed form expressions for α n,k , h n,k (d,m,s) and eventually for v n,k (d,m,s) when V n,k is defined on a 0−1 HMC1/IID sequence. The expressions are obtained via combinatorial analysis.

More specifically, closed formulae are established for the first time for h n,k (d,m,s), 1≤k≤(n−1)/2, when V n,k is defined on a 0−1 HMC1 with given P and p (1). Since the general frame of HMC1 covers IID sequences as a particular case, the so implied expressions for v n,k (d,m,s) are alternative to those obtained for v n,k (d,m,s), 1≤k≤(n−1)/2, by Makri et al. (2015) for IID sequences.

Moreover, for n≥3, 1≤k≤(n−1)/2, 2k+1≤d≤n, let
$$f_{n,k}(d)=P\left(D_{n,k}=d\mid {\cal {M}}_{n,k}\right). $$
Therefore, since
$$ f_{n,k}(d)=\sum_{s=2k}^{d-1}\sum_{m=2}^{\min\left(\lfloor s/k\rfloor,d-s+1\right)}v_{n,k}(d,m,s)=\alpha_{n,k}^{-1}\sum_{s=2k}^{d-1}\sum_{m=2}^{\min\left(\lfloor s/k\rfloor,d-s+1\right)}h_{n,k}(d,m,s), $$
(17)

the work provides closed form expressions for determining f n,k (d) for HMC1 and IID 0−1 sequences \(\{X_{t}\}_{t=1}^{n}\). These expressions are alternative to those derived, for IID sequences, by Makri et al. (2015) for 1≤k≤(n−1)/2 as well as to those obtained, for HMC1, by Arapis et al. (2016) for k=1 and by Arapis et al. (2017) for 1≤k≤(n−1)/2.

Results

In a 0−1 sequence \(\{X_{t}\}_{t=1}^{n}\), n≥2, for 0≤y≤n, 0≤r≤(n+1)/2 and \((i,j)\in \{0,1\}^{2}\), define
$$B_{n}^{(i,j)}(y,r)=\{X_{1}=i,X_{n}=j,Y_{n}=y,G_{n,1}=r\}, $$
and let \(\pi_{n}^{(i,j)}(y,r)\) denote the probability of a single 0−1 sequence belonging to \(B_{n}^{(i,j)}(y,r)\); by the homogeneity of the chain all such sequences are equiprobable, so this probability is well defined (cf. (19) and Lemma 1).
Accordingly, for a HMC1 \(\{X_{t}\}_{t=1}^{n}\), n≥2, with given P and p (1), it holds
$$ \pi_{n}^{(i,j)}(y,r)=\left(p_{1}^{(1)}\right)^{i}\left(1-p_{1}^{(1)}\right)^{1-i}p_{00}^{y-r-1+i+j}\left(1-p_{00}\right)^{r-i} \left(1-p_{11}\right)^{r-j}p_{11}^{n-y-r}, $$
(18)

for 2−(i+j)≤y≤n−(i+j), 1−δ y,0−δ y,n +δ i+j,2≤r≤ min{n−y,y−1+i+j} and \(\pi _{n}^{(i,j)}(y,r)=0\), otherwise.

Consequently, \(\pi _{n}^{(i,j)}(y,r)\), for a 0−1 IID sequence, reduces to
$$ \pi_{n}^{(i,j)}(y,r)=\pi_{n}(y)=p_{1}^{n-y}(1-p_{1})^{y},\, 0\leq y\leq n. $$
(19)

Theorem 1

For n≥3, (d,m,s)∈Ω n,1, \(0<p_{1}^{(1)}<1\), it holds
$$\begin{array}{@{}rcl@{}} h_{n,1}(d,m,s)&=& {s-1\choose m-1}{d-s-1\choose m-2}\pi_{d}^{(1,1)}(d-s,m)\varepsilon_{n}(d) \end{array} $$
(20)

where ε n (d)=1, if n=d; \(p_{00}^{n-d-2}\left \{p_{10}p_{00}+p_{0}^{(1)}(p_{1}^{(1)})^{-1}p_{01}\left [(n-d-1)p_{10}+p_{00}\right ]\right \}\), if n≥d+1.

Proof

For d=3,…,n−2, i=2,3,…,n−d, s=2,3,…,d−1, m=2,3,…, min{s,d−s+1} an element of the event \(\Gamma _{i,d,m,s}=\{U_{n,1}^{(1)}=i, D_{n,1}=d, R_{n}=m, S_{n,1}=s\}\) is a 0−1 sequence of length n with probability
$$p_{0}^{(1)}p_{00}^{i-2}p_{01}\left[\pi_{d}^{(1,1)}(d-s,m)\left(p_{1}^{(1)}\right)^{-1}\right]p_{10}p_{00}^{n-i-d}. $$
Fix i. Then the number of elements of the event Γ i,d,m,s is \({s-1\choose m-1}{d-s-1\choose m-2}\), since the number of allocations of s 1s in m runs of 1s is \({s-1\choose m-1}\) and the number of allocations of d−s 0s in m−1 runs of 0s is \({d-s-1\choose m-2}\), so that
$$P\left(\Gamma_{i,d,m,s}\right)={s-1\choose m-1}{d-s-1\choose m-2}p_{0}^{(1)}p_{01}\left[\pi_{d}^{(1,1)}(d-s,m)\left(p_{1}^{(1)}\right)^{-1}\right]p_{10}p_{00}^{n-d-2}. $$

Similar reasoning applies to the remaining cases. Then, summing with respect to i, we get the result. □

For a sequence \(\{X_{t}\}_{t=1}^{n}\) of 0−1 IID RVs, h n,1(d,m,s) reduces to the explicit formula given in the next Corollary.

Corollary 1

For n≥3, (d,m,s)∈Ω n,1, 0<p 1<1, it is true that
$$ h_{n,1}(d,m,s)=(n-d+1){s-1\choose m-1}{d-s-1\choose m-2}p_{1}^{s}(1-p_{1})^{n-s}.\quad \diamondsuit $$
(21)
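For a length as small as n=8, formula (21) can be verified exhaustively by enumerating all 2^n binary sequences; the following sketch does this for an arbitrary p 1 (the helper function and its name are ours):

```python
from itertools import product
from math import comb

n, p1 = 8, 0.3   # illustrative values of our own choice

def dgs(seq, k):
    """(D_{n,k}, G_{n,k}, S_{n,k}) of Section 2.1; (0, 0, 0) when G_{n,k} = 0."""
    runs, i = [], 0
    while i < len(seq):
        if seq[i] == 1:
            j = i
            while j < len(seq) and seq[j] == 1:
                j += 1
            if j - i >= k:                 # keep only the k-tuples of 1s
                runs.append((i, j - i))
            i = j
        else:
            i += 1
    if not runs:
        return (0, 0, 0)
    d = runs[-1][0] + runs[-1][1] - runs[0][0]
    return (d, len(runs), sum(l for _, l in runs))

# Brute-force h_{8,1}(d, m, s) = P(V_{8,1} = (d, m, s), G_{8,1} >= 2)
brute = {}
for seq in product((0, 1), repeat=n):
    d, m, s = dgs(seq, 1)
    if m >= 2:
        pr = p1 ** sum(seq) * (1 - p1) ** (n - sum(seq))
        brute[(d, m, s)] = brute.get((d, m, s), 0.0) + pr

for (d, m, s), pr in brute.items():
    formula = ((n - d + 1) * comb(s - 1, m - 1) * comb(d - s - 1, m - 2)
               * p1 ** s * (1 - p1) ** (n - s))        # Eq. (21)
    assert abs(pr - formula) < 1e-12
print("Corollary 1 confirmed for n = 8")
```

For instance, for (d,m,s)=(3,2,2) the only admissible pattern is 101, placed in any of the n−2=6 positions, in agreement with (21).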

In order to derive, in the forthcoming Theorem 2, h n,k (d,m,s) for HMC1, 5≤2k+1≤n, we next recall, in Lemma 1, a result from (Makri et al.: On the concentration of runs of ones of length exceeding a threshold in a Markov chain, submitted).

Lemma 1

For \((i,j)\in \{0,1\}^{2}\), n≥2, set \(\lambda _{n,k}^{(i,j)}(x)=P(G_{n,k}=x,X_{1}=i,X_{n}=j)\), x=0,1. Then, it holds that:

(I) For 2≤k≤n−2+i+j,
$$\lambda_{n,k}^{(i,j)}(0)=\sum_{y=1}^{n-(i+j)}\sum_{r=i+j}^{y-1+i+j} {y-1\choose r-i-j}C(n-y-r,r,k-2)\pi_{n}^{(i,j)}(y,r), $$
$$ {}\lambda_{n,k}^{(i,j)}(1)=\pi_{n}^{(i,j)}(0,1)\delta_{2,i+j}+\sum_{y=1}^{n-k}\sum_{r=1}^{y-1+i+j} r{y-1\choose r-i-j}H_{r-1}(n-y-r-k+1,r,k-2)\pi_{n}^{(i,j)}(y,r). $$
(22)
(II) For k>n−2+i+j,
$$\lambda_{n,k}^{(i,j)}(0)=\left(p_{1}^{(1)}\right)^{i}\left(1-p_{1}^{(1)}\right)^{1-i}p_{ij}^{(n-1)}, $$
$$ \lambda_{n,k}^{(i,j)}(1)=0. $$
(23)
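Since (22) is central to Theorem 2, it is worth confirming numerically; the sketch below (parameter values and helper names are our own) checks case (I) against complete enumeration for a small n:

```python
from math import comb
from itertools import product, groupby

p00, p11, p1_1 = 0.6, 0.7, 0.3   # illustrative HMC1 parameters
p01, p10, p0_1 = 1 - p00, 1 - p11, 1 - p1_1
n = 7

def bino(a, m):
    # extended binomial coefficient: 1 for m == 0, else 0 outside 0 <= m <= a
    return 1 if m == 0 else (comb(a, m) if 0 <= m <= a else 0)

def C(a, r, k):                  # Riordan's coefficient, Eq. (13); 0 for a < 0
    return sum((-1) ** j * bino(r, j)
               * bino(a - (k + 1) * j + r - 1, a - (k + 1) * j)
               for j in range(a // (k + 1) + 1))

def Hc(m, a, r, k):              # Eq. (12)
    return sum((-1) ** j * bino(m, j)
               * bino(a - (k + 1) * j + r - 1, a - (k + 1) * j)
               for j in range(a // (k + 1) + 1))

def pi(nn, i, j, y, r):          # Eq. (18): probability of a single sequence
    return (p1_1 ** i * p0_1 ** (1 - i) * p00 ** (y - r - 1 + i + j)
            * p01 ** (r - i) * p10 ** (r - j) * p11 ** (nn - y - r))

def lam_formula(k, i, j, x):     # Eq. (22), case (I): 2 <= k <= n-2+i+j
    if x == 0:
        return sum(bino(y - 1, r - i - j) * C(n - y - r, r, k - 2)
                   * pi(n, i, j, y, r)
                   for y in range(1, n - i - j + 1)
                   for r in range(i + j, y + i + j))
    tot = pi(n, 1, 1, 0, 1) if i == j == 1 else 0.0
    return tot + sum(r * bino(y - 1, r - i - j)
                     * Hc(r - 1, n - y - r - k + 1, r, k - 2)
                     * pi(n, i, j, y, r)
                     for y in range(1, n - k + 1)
                     for r in range(1, y + i + j))

def seq_prob(seq):               # HMC1 probability of a specific sequence
    pr = p1_1 if seq[0] else p0_1
    for a, b in zip(seq, seq[1:]):
        pr *= [[p00, p01], [p10, p11]][a][b]
    return pr

def G(seq, k):                   # number of k-tuples of 1s
    return sum(1 for v, g in groupby(seq) if v == 1 and len(list(g)) >= k)

for k in range(2, 6):
    for i, j, x in product((0, 1), (0, 1), (0, 1)):
        brute = sum(seq_prob(s) for s in product((0, 1), repeat=n)
                    if s[0] == i and s[-1] == j and G(s, k) == x)
        assert abs(brute - lam_formula(k, i, j, x)) < 1e-12
print("Lemma 1(I) confirmed for n = 7, k = 2,...,5")
```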

Theorem 2

For n≥5, 2≤k≤(n−1)/2, (d,m,s)∈Ω n,k , \(0<p_{1}^{(1)}<1\), it holds
$$\begin{array}{*{20}l} {}h_{n,k}(d,m,s)\,=\,p_{11}^{2k-2}\left(p_{1}^{(1)}\right)^{-1}\!\! \sum_{i=1}^{n-d+1}\!\!\!\ell_{i-1,k}^{(\alpha)}\ell_{n-d-i+1,k}^{(\beta)}\!\!\!\!\!\!\!\!\!\sum_{r=m}^{m+\left\lfloor\frac{d-s-m+1}{2}\right\rfloor}\! \sum_{y=r-1}^{d-s-r+m}\!\!\!\gamma_{d,m,s}(y,\!r)\pi_{d-\!2k+\!2}^{(1,1)}(y,\!r), \end{array} $$
(24)
where
$${}\ell_{n,k}^{(\alpha)}=p_{1}^{(1)},\,\text{for}\,n=0;\quad p_{0}^{(n)}p_{01},\,\text{for}\, 1\leq n\leq k;\quad p_{01}\left[\lambda_{n,k}^{(0,0)}(0)+\lambda_{n,k}^{(1,0)}(0)\right],\,\text{for}\,n\geq k+1, $$
$$ {}\ell_{n,k}^{(\beta)}=1,\,\text{for}\,n=0;\quad p_{10},\,\text{for}\, 1\leq n\leq k;\quad \!p_{10}(p_{0}^{(1)})^{-1}\left[\lambda_{n,k}^{(0,0)}(0)+\lambda_{n,k}^{(0,1)}(0)\right],\,\text{for}\,n\geq k+1 $$
(25)
and
$$\begin{array}{@{}rcl@{}} {}\gamma_{d,m,s}(y,r)\,=\,{y-1\choose r-2}{r-2\choose m-2}{s-mk+m-1\choose m-1}C(d-y-s-r+m,r-m,k\,-\,2). \end{array} $$
(26)

Proof

For 1≤r 1 ≤r 2 ≤n let \(Y_{r_{1},r_{2}}\), \(R_{r_{1},r_{2}}\), \(L_{r_{1},r_{2}}\), \(S_{r_{1},r_{2},k}\), \(D_{r_{1},r_{2},k}\), \(G_{r_{1},r_{2},k}\) be RVs defined on the subsequence \(X_{r_{1}}, X_{r_{1}+1},\ldots,X_{r_{2}}\) of \(\{X_{t}\}_{t=1}^{n}\). For m≥2 define the event
$$\begin{array}{@{}rcl@{}} \lefteqn{\Delta_{r_{1},r_{1}+d-1}(d,s,m,y,r)}\\ & & =\{D_{r_{1},r_{1}+d-1,k}=d, G_{r_{1},r_{1}+d-1,k}=m, S_{r_{1},r_{1}+d-1,k}=s, Y_{r_{1},r_{1}+d-1}=y, R_{r_{1},r_{1}+d-1}=r\}. \end{array} $$
An element of this event is a 0 - 1 sequence of length d, starting and ending with a 1, for which y j ’s and z j ’s, representing the lengths of the failure and success runs, respectively, satisfy the conditions:
  1. (a)

    y 1+y 2+…+y r−1=y, y j ≥1, 1≤jr−1.

     
  2. (b)

    \(\phantom {\dot {i}\!}z_{1}+z_{i_{1}}+z_{i_{2}}+\ldots +z_{i_{m-2}}+z_{r}=s\), z j ≥k, j∈{1,i 1,i 2,…,i m−2,r}, for some specific combination {1,i 1,i 2,…,i m−2,r} of {1,2,…,r−1,r} among the \({r-2\choose m-2}\) ones.

     
  3. (c)

    \(z_{i_{m-1}}+z_{i_{m}}+\ldots +z_{i_{r-2}}=d-y-s\), \(1\leq z_{i_{j}}\leq k-1\), m−1≤j≤r−2, for {i m−1,…,i r−2}={1,2,…,r}−{1,i 1,i 2,…,i m−2,r}.

     
Fix i 1,i 2,…,i m−2. Then the number of such sequences, i.e. the number of solutions of the system (a)-(c), is
$${y-1\choose r-2}C(d-y-s-r+m,r-m,k-2){s-mk+m-1\choose m-1} $$
and each such sequence has probability
$$p_{1}^{(1)}p_{11}^{k-1}(p_{1}^{(1)})^{-1}\pi_{d-2k+2}^{(1,1)}(y,r)p_{11}^{k-1}=p_{11}^{2k-2}\pi_{d-2k+2}^{(1,1)}(y,r). $$
Hence,
$$\begin{array}{@{}rcl@{}} P(\Delta_{r_{1},r_{1}+d-1}(d,s,m,y,r))&=& p_{11}^{2k-2}\pi_{d-2k+2}^{(1,1)}(y,r){r-2\choose m-2}{y-1\choose r-2}{s-mk+m-1\choose m-1}\\ & & \times C(d-y-s-r+m,r-m,k-2). \end{array} $$
For k+2≤i≤n−d−k, m≥2, we have that
$$\begin{array}{@{}rcl@{}} \lefteqn{P\left(U_{n,k}^{(1)}=i, D_{n,k}=d, G_{n,k}=m, S_{n,k}=s, Y_{i,i+d-1}=y, R_{i,i+d-1}=r\right)}\\ &=&P\Big\{\left[(L_{1,i-1}<k, X_{i-1}=0)\cap\left[(X_{1}=0)\cup(X_{1}=1)\right]\right]\cap \Delta_{i,i+d-1}(d,s,m,y,r)\\ & &\cap\left[(L_{i+d,n}<k, X_{i+d}=0)\cap\left[(X_{n}=0)\cup(X_{n}=1)\right]\right]\Big\}\\ &= & \left[\lambda_{i-1,k}^{(0,0)}(0)+\lambda_{i-1,k}^{(1,0)}(0)\right]p_{01}\\ & &\times \left(p_{1}^{(1)}\right)^{-1}P\left(\Delta_{i,i+d-1}(d,s,m,y,r)\right) p_{10}\left[\lambda_{n-i-d+1,k}^{(0,0)}(0)+\lambda_{n-i-d+1,k}^{(0,1)}(0)\right]/p_{0}^{(1)}\\ &=& \left[\lambda_{i-1,k}^{(0,0)}(0)+\lambda_{i-1,k}^{(1,0)}(0)\right]p_{01}\left(p_{1}^{(1)}\right)^{-1}p_{11}^{2k-2}\pi_{d-2k+2}^{(1,1)}(y,r)\\ & & \times{r-2\choose m-2}{y-1\choose r-2}{s-mk+m-1\choose m-1}\\ & & \times C(d-y-s-r+m,r-m,k-2)p_{10}\left(p_{0}^{(1)}\right)^{-1} \left[\lambda_{n-i-d+1,k}^{(0,0)}(0)+\lambda_{n-i-d+1,k}^{(0,1)}(0)\right]. \end{array} $$

By similar reasoning we get the remaining cases of i, i.e. 1≤i≤k+1 and n−d+1−k≤i≤n−d+1. Then summing with respect to i, y and r we get the result. □

Having found h n,k (d,m,s), we next proceed to obtain v n,k (d,m,s). To accomplish this, the required probabilities α n,k for HMC1 are recalled, in Lemma 2, from Arapis et al. (2016) for k=1, and they are computed via Lemma 1 for 2≤k≤(n−1)/2.

Lemma 2

For n≥k≥1, the probability α n,k , for HMC1, is computed via the expressions:(I) For k=1,
$$\begin{array}{@{}rcl@{}} \alpha_{n,1}&=&1-p_{00}^{n-3}\left\{p_{00}\left(1+(n-2)p_{01}\right)+\frac{(n-1)(n-2)}{2}p_{0}^{(1)}p_{01}^{2}\right\},\quad \text{if}\quad p_{00}=p_{11} \end{array} $$
and
$$\begin{array}{@{}rcl@{}} \alpha_{n,1}&=& 1-p_{0}^{(1)}p_{00}^{n-1}-p_{11}^{n-2}\left(p_{1}^{(1)}+p_{0}^{(1)}p_{01}\right)-p_{00} \left(p_{0}^{(1)}p_{01}+p_{1}^{(1)}p_{10}\right)\frac{p_{11}^{n-2}-p_{00}^{n-2}}{p_{11}-p_{00}} \\ & &\quad -p_{0}^{(1)}p_{01}p_{10}\frac{p_{11}^{n-1}-p_{00}^{n-2} \left[p_{11}+(n-2)\left(p_{11}-p_{00}\right)\right]}{\left(p_{11}-p_{00}\right)^{2}},\quad \text{if}\quad p_{00}\neq p_{11}. \end{array} $$
(27)
(II) For 2≤k≤n,
$$\begin{array}{@{}rcl@{}}\alpha_{n,k}=1-\sum_{(i,j)\in \{0,1\}^{2}}\left[\lambda_{n,k}^{(i,j)}(0)+\lambda_{n,k}^{(i,j)}(1)\right]. \end{array} $$
(28)
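Lemma 2(I) can also be validated by exhaustive enumeration for small n; the sketch below checks both branches of (27) (the parameter values are our own illustrations, the first triple being that of Table 2 below):

```python
from itertools import groupby, product

def alpha_formula(n, p00, p11, p0_1):
    """Eq. (27): alpha_{n,1} = P(G_{n,1} >= 2) for a HMC1."""
    p01, p10, p1_1 = 1 - p00, 1 - p11, 1 - p0_1
    if p00 == p11:
        return 1 - p00 ** (n - 3) * (p00 * (1 + (n - 2) * p01)
                                     + (n - 1) * (n - 2) / 2 * p0_1 * p01 ** 2)
    D = p11 - p00
    return (1 - p0_1 * p00 ** (n - 1)
            - p11 ** (n - 2) * (p1_1 + p0_1 * p01)
            - p00 * (p0_1 * p01 + p1_1 * p10)
              * (p11 ** (n - 2) - p00 ** (n - 2)) / D
            - p0_1 * p01 * p10
              * (p11 ** (n - 1)
                 - p00 ** (n - 2) * (p11 + (n - 2) * D)) / D ** 2)

def alpha_brute(n, p00, p11, p0_1):
    """P(at least two runs of 1s), by summing over all 2^n sequences."""
    P = [[p00, 1 - p00], [1 - p11, p11]]
    total = 0.0
    for seq in product((0, 1), repeat=n):
        if sum(1 for v, _ in groupby(seq) if v == 1) >= 2:
            pr = (1 - p0_1) if seq[0] else p0_1
            for a, b in zip(seq, seq[1:]):
                pr *= P[a][b]
            total += pr
    return total

for args in [(8, 0.9, 0.9, 0.5), (8, 0.8, 0.6, 0.3), (5, 0.2, 0.7, 0.9)]:
    assert abs(alpha_formula(*args) - alpha_brute(*args)) < 1e-12
print("Lemma 2(I) confirmed on both branches of Eq. (27)")
```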

Theorem 3

For n≥3, 1≤k≤(n−1)/2, (d,m,s)∈Ω n,k , \(0<p_{1}^{(1)}<1\), the PMF v n,k (d,m,s) for a HMC1, with given P and p (1), is calculated by
$$ v_{n,k}(d,m,s)=\alpha_{n,k}^{-1}h_{n,k}(d,m,s), $$
(29)

where α n,k and h n,k (d,m,s) are provided by Lemma 2 and Theorems 1 (for k=1) and 2 (for 2≤k≤(n−1)/2), respectively.

Remark 1

For IID sequences, in implementing Theorem 3, one has to take into consideration Eqs. (10) - (11), (19) and (21). Moreover, to speed up the calculations, one may factor π n (y) out of the inner summation in (22), since it does not depend on r.

A numerical example

In this example we compute some indicative numerics concerning the two models (i.e. HMC1 and IID) of 0−1 sequences \(\{X_{t}\}_{t=1}^{n}\) considered in the paper. The common length was taken small, i.e. n=8, so that the required computations can also be carried out by a hand/pocket calculator, making it possible to gain insight into the formulae developed in Section Results, and also because of space limitations. The sequences that have been used are as follows. Table 1: an IID sequence with p 1=0.5. Table 2: a HMC1 sequence with p 00=p 11=0.9, \(p_{1}^{(1)}=0.5\).
Table 1

0−1 IID sequence with p 1=0.5

| s | m | d=3 | d=4 | d=5 | d=6 | d=7 | d=8 |
|---|---|---|---|---|---|---|---|
| v 8,1(d,m,s) |  |  |  |  |  |  |  |
| 2 | 2 | 0.02739726 | 0.02283105 | 0.01826484 | 0.01369863 | 0.00913242 | 0.00456621 |
| 3 | 2 |  | 0.04566210 | 0.03652968 | 0.02739726 | 0.01826484 | 0.00913242 |
| 3 | 3 |  |  | 0.01826484 | 0.02739726 | 0.02739726 | 0.01826484 |
| 4 | 2 |  |  | 0.05479452 | 0.04109589 | 0.02739726 | 0.01369863 |
| 4 | 3 |  |  |  | 0.04109589 | 0.05479452 | 0.04109589 |
| 4 | 4 |  |  |  |  | 0.00913242 | 0.01369863 |
| 5 | 2 |  |  |  | 0.05479452 | 0.03652968 | 0.01826484 |
| 5 | 3 |  |  |  |  | 0.05479452 | 0.05479452 |
| 5 | 4 |  |  |  |  |  | 0.01826484 |
| 6 | 2 |  |  |  |  | 0.04566210 | 0.02283105 |
| 6 | 3 |  |  |  |  |  | 0.04566210 |
| 7 | 2 |  |  |  |  |  | 0.02739726 |
| f 8,1(d) |  | 0.02739726 | 0.06849315 | 0.12785388 | 0.20547945 | 0.28310503 | 0.28767123 |
| v 8,2(d,m,s) |  |  |  |  |  |  |  |
| 4 | 2 |  |  | 0.18518519 | 0.09259259 | 0.07407407 | 0.05555556 |
| 5 | 2 |  |  |  | 0.18518519 | 0.07407407 | 0.07407407 |
| 6 | 2 |  |  |  |  | 0.11111111 | 0.05555556 |
| 6 | 3 |  |  |  |  |  | 0.01851852 |
| 7 | 2 |  |  |  |  |  | 0.07407407 |
| f 8,2(d) |  |  |  | 0.18518519 | 0.27777778 | 0.25925925 | 0.27777778 |
| v 8,3(d,m,s) |  |  |  |  |  |  |  |
| 6 | 2 |  |  |  |  | 0.40000000 | 0.20000000 |
| 7 | 2 |  |  |  |  |  | 0.40000000 |
| f 8,3(d) |  |  |  |  |  | 0.40000000 | 0.60000000 |

Table 2

0−1 HMC1 sequence with p 00=p 11=0.9, \(p_{1}^{(1)}=0.5\)

| s | m | d=3 | d=4 | d=5 | d=6 | d=7 | d=8 |
|---|---|---|---|---|---|---|---|
| v 8,1(d,m,s) |  |  |  |  |  |  |  |
| 2 | 2 | 0.00914441 | 0.00872875 | 0.00831310 | 0.00789744 | 0.00748179 | 0.03366804 |
| 3 | 2 |  | 0.01745750 | 0.01662619 | 0.01579488 | 0.01496357 | 0.06733609 |
| 3 | 3 |  |  | 0.00010263 | 0.00019500 | 0.00027710 | 0.00166262 |
| 4 | 2 |  |  | 0.02493929 | 0.02369233 | 0.02244536 | 0.10100413 |
| 4 | 3 |  |  |  | 0.00029250 | 0.00055421 | 0.00374089 |
| 4 | 4 |  |  |  |  | 0.00000114 | 0.00001539 |
| 5 | 2 |  |  |  | 0.03158977 | 0.02992715 | 0.13467217 |
| 5 | 3 |  |  |  |  | 0.00055421 | 0.00498786 |
| 5 | 4 |  |  |  |  |  | 0.00002053 |
| 6 | 2 |  |  |  |  | 0.03740894 | 0.16834021 |
| 6 | 3 |  |  |  |  |  | 0.00415655 |
| 7 | 2 |  |  |  |  |  | 0.20200826 |
| f 8,1(d) |  | 0.00914441 | 0.02618626 | 0.04998121 | 0.07946192 | 0.11361346 | 0.72161274 |
| v 8,2(d,m,s) |  |  |  |  |  |  |  |
| 4 | 2 |  |  | 0.02225160 | 0.02081956 | 0.01806565 | 0.08228685 |
| 5 | 2 |  |  |  | 0.04163913 | 0.03569068 | 0.16259088 |
| 6 | 2 |  |  |  |  | 0.05353602 | 0.24091210 |
| 6 | 3 |  |  |  |  |  | 0.00099141 |
| 7 | 2 |  |  |  |  |  | 0.32121613 |
| f 8,2(d) |  |  |  | 0.02225160 | 0.06245869 | 0.10729236 | 0.80799735 |
| v 8,3(d,m,s) |  |  |  |  |  |  |  |
| 6 | 2 |  |  |  |  | 0.06896552 | 0.31034483 |
| 7 | 2 |  |  |  |  |  | 0.62068966 |
| f 8,3(d) |  |  |  |  |  | 0.06896552 | 0.93103448 |

Both tables depict, for k=1,2,3, v 8,k (d,m,s), (d,m,s)∈Ω 8,k and f 8,k (d), 2k+1≤d≤8, illustrating the numeric values of the involved probabilities. v 8,k (d,m,s) and f 8,k (d) were computed via Eqs. (29) and (17), respectively.

Discussion and further study

In this article we have derived exact closed form expressions for the PMF v n,k (d,m,s), n≥3, 1≤k≤(n−1)/2, (d,m,s)∈Ω n,k , of the RV V n,k defined on a 0−1 sequence of homogeneous Markov-dependent trials. The method used is a combinatorial one relying on results that exploit the internal structure of such a sequence.

As noted in the Introduction, the application domain of runs spans a diverse range of fields. Indicative potential ones are discussed next.

Encoding, compression and transmission of digital information call for understanding the distributions of runs of 1s or 0s. Such knowledge helps in analyzing, and also in comparing, several techniques used in communication networks. In such networks 0−1 data ranging from a few kilobytes (e.g. e-mails) to many gigabytes of greedy multimedia applications (e.g. video on demand) are encoded, decoded and eventually processed under security requirements. For details, see e.g., Sinha and Sinha (2009), Makri and Psillakis (2011a) and Tabatabaei and Zivic (2015).

An area where the study of runs of 1s and 0s has become increasingly useful is the field of bioinformatics or computational biology. For instance, molecular biologists design similarity tests between two DNA sequences where a 1 is interpreted as a match of the sequences at a given position and everything else as a 0. Moreover, the probabilistic analysis of such sequences according to the form, the length and the number of detected patterns, as well as the positions and the lengths of the segments of the sequence in which they are concentrated, may suggest a functional reason for the internal structure of the examined sequence. The latter facts might be useful in suggesting further investigation of the underlying sequence(s) by biologists. See, e.g. Avery and Henderson (1999), Benson (1999) and Nuel et al. (2010).

Another active area where run statistics, in particular G n,k and S n,k , have interesting statistical applications is that connected to hypothesis testing; e.g., in tests of randomness. For a systematic study of such a topic, we refer, among others, to the works of Koutras and Alexandrou (1997) and Antzoulakos et al. (2003).

Accordingly, it is reasonable to use the exact expressions obtained for v n,k (d,m,s) in applications like the ones mentioned above. This is so because this distribution, as a joint one, is more flexible than each of its marginals, which have been used in such applications. See, e.g. Lou (2003), Makri and Psillakis (2011b) and Arapis et al. (2016).

Moreover, in handling 0 - 1 sequences of large length, with dependent or independent elements, a Monte Carlo simulation based on Eqs. (1) - (4) would be a useful tool for obtaining approximate values of v n,k (d,m,s). In addition, the general approximating methods suggested by Johnson and Fu (2014) might be helpful in deriving approximate values of f n,k (d).
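As an illustration of the Monte Carlo route for a sequence with dependent elements, the sketch below re-estimates the entry f 8,1(8) ≈ 0.7216 of Table 2 by simulating the HMC1 of that table (the sample size, the seed and the helper logic are arbitrary choices of ours):

```python
import random
from itertools import groupby

random.seed(12345)
n, p00, p11, p1_1 = 8, 0.9, 0.9, 0.5   # the HMC1 of Table 2
N = 200_000                             # number of simulated sequences

hits = cond = 0
for _ in range(N):
    seq = [1 if random.random() < p1_1 else 0]
    for _ in range(n - 1):
        stay = p11 if seq[-1] == 1 else p00   # probability of repeating state
        seq.append(seq[-1] if random.random() < stay else 1 - seq[-1])
    runs = sum(1 for v, _ in groupby(seq) if v == 1)
    if runs >= 2:                       # condition on M_{8,1} = {G_{8,1} >= 2}
        cond += 1
        ones = [t for t, v in enumerate(seq) if v == 1]
        if ones[-1] - ones[0] + 1 == n:  # D_{8,1} = 8
            hits += 1

est = hits / cond
print(f"Monte Carlo estimate of f_8,1(8): {est:.4f} (exact 0.72161274)")
```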

Declarations

Acknowledgements

The authors wish to thank the Editor for the thorough reading, and the anonymous reviewers for useful comments and suggestions which improved the article.

Authors’ contributions

The authors, ANA, FSM and ZMP with the consultation of each other carried out this work and drafted the manuscript together. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

(1)
Department of Mathematics, University of Patras
(2)
Department of Physics, University of Patras

References

  1. Antzoulakos, DL, Bersimis, S, Koutras, MV: On the distribution of the total number of run lengths. Ann. Inst. Statist. Math. 55, 865–884 (2003).
  2. Antzoulakos, DL, Chadjiconstantinidis, S: Distributions of numbers of success runs of fixed length in Markov dependent trials. Ann. Inst. Statist. Math. 53, 559–619 (2001).
  3. Arapis, AN, Makri, FS, Psillakis, ZM: On the length and the position of the minimum sequence containing all runs of ones in a Markovian binary sequence. Statist. Probab. Lett. 116, 45–54 (2016).
  4. Arapis, AN, Makri, FS, Psillakis, ZM: Distribution of statistics describing concentration of runs in non homogeneous Markov-dependent trials. Commun. Statist. Theor. Meth. (2017). doi:10.1080/03610926.2017.1337144.
  5. Avery, PJ, Henderson, D: Fitting Markov chain models to discrete state series such as DNA sequences. Appl. Statist. 48(Part 1), 53–61 (1999).
  6. Balakrishnan, N, Koutras, MV: Runs and Scans with Applications. Wiley, New York (2002).
  7. Benson, G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
  8. Eryilmaz, S: Some results associated with the longest run statistic in a sequence of Markov dependent trials. Appl. Math. Comput. 175, 119–130 (2006).
  9. Eryilmaz, S: Discrete time shock models involving runs. Statist. Probab. Lett. 107, 93–100 (2015).
  10. Eryilmaz, S: Generalized waiting time distributions associated with runs. Metrika. 79, 357–368 (2016).
  11. Eryilmaz, S: The concept of weak exchangeability and its applications. Metrika. 80, 259–271 (2017).
  12. Eryilmaz, S, Yalcin, F: Distribution of run statistics in partially exchangeable processes. Metrika. 73, 293–304 (2011).
  13. Feller, W: An Introduction to Probability Theory and Its Applications. 3rd Ed., Vol. I. Wiley, New York (1968).
  14. Fu, JC, Lou, WYW: Distribution Theory of Runs and Patterns and Its Applications: A Finite Markov Chain Imbedding Approach. World Scientific, River Edge (2003).
  15. Johnson, BC, Fu, JC: Approximating the distributions of runs and patterns. J. Stat. Distrib. Appl. 1:5, 1–15 (2014).
  16. Koutras, MV: Applications of Markov chains to the distribution of runs and patterns. In: Shanbhag, DN, Rao, CR (eds.) Handbook of Statistics, pp. 431–472. Elsevier, North-Holland (2003).
  17. Koutras, MV, Alexandrou, V: Non-parametric randomness tests based on success runs of fixed length. Statist. Probab. Lett. 32, 393–404 (1997).
  18. Koutras, VM, Koutras, MV, Yalcin, F: A simple compound scan statistic useful for modeling insurance and risk management problems. Insur. Math. Econ. 69, 202–209 (2016).
  19. Lou, WYW: The exact distribution of the k-tuple statistic for sequence homology. Statist. Probab. Lett. 61, 51–59 (2003).
  20. Makri, FS, Philippou, AN, Psillakis, ZM: Success run statistics defined on an urn model. Adv. Appl. Prob. 39, 991–1019 (2007).
  21. Makri, FS, Psillakis, ZM: On success runs of a fixed length in Bernoulli sequences: Exact and asymptotic results. Comput. Math. Appl. 61, 761–772 (2011a).
  22. Makri, FS, Psillakis, ZM: On runs of length exceeding a threshold: normal approximation. Stat. Papers. 52, 531–551 (2011b).
  23. Makri, FS, Psillakis, ZM: On ℓ-overlapping runs of ones of length k in sequences of independent binary random variables. Commun. Statist. Theor. Meth. 44, 3865–3884 (2015).
  24. Makri, FS, Psillakis, ZM, Arapis, AN: Counting runs of ones with overlapping parts in binary strings ordered linearly and circularly. Intern. J. Statist. Probab. 2, 50–60 (2013).
  25. Makri, FS, Psillakis, ZM, Arapis, AN: Length of the minimum sequence containing repeats of success runs. Statist. Probab. Lett. 96, 28–37 (2015).
  26. Mood, AM: The distribution theory of runs. Ann. Math. Statist. 11, 367–392 (1940).
  27. Mytalas, GC, Zazanis, MA: Central limit theorem approximations for the number of runs in Markov-dependent binary sequences. J. Statist. Plann. Infer. 143, 321–333 (2013).
  28. Mytalas, GC, Zazanis, MA: Central limit theorem approximations for the number of runs in Markov-dependent multi-type sequences. Commun. Statist. Theor. Meth. 43, 1340–1350 (2014).
  29. Nuel, G, Regad, L, Martin, J, Camproux, A-C: Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data. Algorithm Mol. Biol. 5, 1–18 (2010).
  30. Riordan, J: An Introduction to Combinatorial Analysis. 2nd Ed. Wiley, New York (1964).
  31. Sinha, K, Sinha, BP: On the distribution of runs of ones in binary trials. Comput. Math. Appl. 58, 1816–1829 (2009).
  32. Tabatabaei, SAH, Zivic, N: A review of approximate message authentication codes. In: Zivic, N (ed.) Robust Image Authentication in the Presence of Noise, pp. 106–127. Springer International Publishing AG, Cham (ZG), Switzerland (2015).

Copyright

© The Author(s) 2017