Joint distribution of k-tuple statistics in zero-one sequences of Markov-dependent trials

Arapis, Anastasios N.; Makri, Frosso S.; Psillakis, Zaharias M.

doi:10.1186/s40488-017-0080-5

Research
Open access
Published: 15 November 2017


Anastasios N. Arapis¹,
Frosso S. Makri¹ &
Zaharias M. Psillakis²

Journal of Statistical Distributions and Applications volume 4, Article number: 26 (2017)


Abstract

We consider a sequence of n, n≥3, zero (0) - one (1) Markov-dependent trials. We focus on k-tuples of 1s; i.e. runs of 1s of length at least equal to a fixed integer number k, 1≤k≤n. The statistics denoting the number of k-tuples of 1s, the number of 1s in them and the distance between the first and the last k-tuple of 1s in the sequence, are defined. The work provides, in a closed form, the exact conditional joint distribution of these statistics given that the number of k-tuples of 1s in the sequence is at least two. The case of independent and identical 0−1 trials is also covered in the study. A numerical example illustrates further the theoretical results.

Introduction

Run counting statistics defined on a sequence of binary (zero (0) - one (1)) random variables (RVs), along with their exact and approximate distributions, have been extensively studied in the literature. Their popularity is due to the fact that such statistics appear as useful theoretical models in many research areas including statistics (e.g. hypothesis testing), engineering (e.g. system reliability and quality control), biology (e.g. population genetics and DNA sequence analysis), computer science (e.g. encoding/decoding/transmission of digital information) and financial engineering (e.g. insurance and risk analysis).

In such applications, a key point is understanding how 1s and 0s are distributed and combined as elements of a 0−1 sequence (finite or infinite, memoryless or not), eventually forming runs of 1s or 0s that are enumerated according to certain counting schemes. Each scheme defines how runs of the same symbol, or strings (patterns) of both symbols, are formed and consequently enumerated. A counting scheme may depend on, among other considerations, whether overlapping counting is allowed and whether counting restarts from scratch once a run/string of a certain size has been enumerated.

The counting scheme as well as the intrinsic uncertainty of a 0−1 sequence are often suggested by the applications. Probabilistic models in common use for the internal structure of a 0−1 sequence include sequences whose elements are independent of each other and models that assume some kind of dependence among the elements. The methods used to derive exact/approximate, marginal/joint probability distributions include combinatorial analysis, generating functions, the finite Markov chain imbedding technique, recursive schemes, as well as normal, Poisson and large deviation approximations.

For extensive reviews of the recent literature on the distribution theory of runs and patterns we refer to Balakrishnan and Koutras (2002) and Fu and Lou (2003). Current works on the subject include, among others, those of Antzoulakos and Chadjiconstantinidis (2001); Eryilmaz (2006, 2015, 2016, 2017); Eryilmaz and Yalcin (2011); Johnson and Fu (2014); Koutras (2003); Koutras et al. (2016); Makri and Psillakis (2015); Makri et al. (2013) and Mytalas and Zazanis (2013, 2014).

In this article we derive expressions for a conditional distribution of a trivariate statistic. Its components denote the number of runs of 1s of length at least a fixed threshold, the number of 1s in such runs, and the length of the shortest segment of the sequence in which these runs are concentrated. The study is developed on a sequence of two-state (0−1) Markov-dependent trials. The runs are enumerated according to Mood’s (1940) counting scheme.

More specifically, the manuscript is organized as follows. In Section 2 we present some preliminary material, including notation and definitions, necessary to develop our results, which are obtained in Section 4. In Section 3 we give the motivation along with a statement of the aim of the work. A numerical example, presented in Section 5, illustrates the theoretical results of Section 4. A discussion of the results as well as a note on future work are given in Section 6.

Throughout the article, for integers n, m, ${n\choose m}$ denotes the extended binomial coefficient (see Feller (1968), pp. 50, 63), $\lfloor x\rfloor$ stands for the greatest integer less than or equal to x, and $\delta_{ij}$ denotes the Kronecker delta function of the integer arguments i and j. Further, for α>β, we apply the conventions $\sum _{i=\alpha }^{\beta }y_{i}=0$, $\prod _{i=\alpha }^{\beta }y_{i}=1$, $\sum _{i=\alpha }^{\beta }\mathbf {Y}^{(i)}=\mathbf {O}\equiv {\scriptsize \left (\begin {array}{cc} 0 &0\\ 0 & 0 \end {array}\right)}$, $\prod _{i=\alpha }^{\beta }\mathbf {Y}^{(i)}=\mathbf {I}\equiv {\scriptsize \left (\begin {array}{cc} 1 &0\\ 0 & 1 \end {array}\right)}$, where $y_{i}$ and $\mathbf{Y}^{(i)}$ are scalars and 2×2 matrices, respectively.

Preliminaries

2.1 Run counting statistics

Let $\{X_{t}\}_{t=1}^{n}$, n≥1, be the first n trials of a binary (0−1) sequence of RVs, $X_{t}=x_{t}\in\{0,1\}$. A run of 1s is a (sub)sequence of $\{X_{t}\}_{t=1}^{n}$ consisting of consecutive 1s, the number of which is referred to as its length, preceded and succeeded by 0s or by nothing.

Given a fixed integer k, 1≤k≤n, a k-tuple of 1s is a run of 1s of length k or more. In the paper we deal with the following statistics defined on a 0−1 sequence $\{X_{t}\}_{t=1}^{n}$. For details see, e.g., Makri et al. (2015) and the references therein.

(I) $G_{n,k}$ denoting the number of k-tuples of 1s, 1≤k≤n. In particular, $G_{n,1}$ denotes the number of 1-tuples of 1s, i.e. it represents the number $R_{n}\equiv G_{n,1}$ of all runs of 1s in the sequence. Using the convention $X_{0}=X_{n+1}\equiv 0$, we can define $G_{n,k}$ as

$$ G_{n,k}=\sum_{i=k}^{n}E_{n,i},\, 1\leq k\leq n, $$

(1)

where

$$E_{n,i}=\sum_{j=i}^{n}J_{j},\, J_{j}=\left(1-X_{j-i}\right)\left(1-X_{j+1}\right)\prod_{r=j-i+1}^{j}X_{r}. $$

(II) $S_{n,k}$ denoting the number of 1s in the $G_{n,k}$ k-tuples of 1s; i.e. $S_{n,k}$ represents the sum of lengths of the $G_{n,k}$ k-tuples of 1s, 1≤k≤n. In particular, $S_{n,1}$ represents the number of all 1s in the sequence; hence, the number of 0s, $Y_{n}$, in the sequence is $Y_{n}=n-S_{n,1}$. $S_{n,k}$ is formally defined as

$$ S_{n,k}=\sum_{i=k}^{n}{iE}_{n,i}, \,1\leq k\leq n. $$

(2)

Readily, $kG_{n,k}\leq S_{n,k}$.

(III) $L_{n}$, n≥1, denoting the length of the longest run of 1s in the sequence. By setting
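Definitions (1) and (2) can be evaluated literally. The following is a minimal sketch, not part of the original paper; the function names `E`, `G` and `S` are chosen here for illustration, and the input is the 40-trial sequence used in the example at the end of this subsection.

```python
def E(x, i):
    """E_{n,i}: number of runs of 1s of length exactly i, as the sum of the J_j."""
    n = len(x)
    X = [0] + list(x) + [0]          # convention X_0 = X_{n+1} = 0
    total = 0
    for j in range(i, n + 1):
        all_ones = all(X[r] == 1 for r in range(j - i + 1, j + 1))
        total += (1 - X[j - i]) * (1 - X[j + 1]) * int(all_ones)
    return total


def G(x, k):
    """Eq. (1): number of runs of 1s of length at least k."""
    return sum(E(x, i) for i in range(k, len(x) + 1))


def S(x, k):
    """Eq. (2): number of 1s contained in runs of length at least k."""
    return sum(i * E(x, i) for i in range(k, len(x) + 1))


x = [int(c) for c in "1110001100010001010011101111001001001001"]
for k in (1, 2, 3, 4):
    print(k, G(x, k), S(x, k))
```

The double loop is quadratic in n and is meant only to mirror the formulas; a single left-to-right scan over the runs would be preferable for long sequences.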

$$\Lambda_{n}=\{i:G_{n,i}>0, 1\leq i\leq n\}, $$

we have that

$$ L_{n}=\max \{k: k\in \Lambda_{n}\},\, \text{if}\, \Lambda_{n}\neq \emptyset;\,0,\,\text{otherwise}. $$

(3)

Readily, $L_{n}<k$ iff $G_{n,k}<1$.

(IV) For $G_{n,k}\geq 1$, 1≤k≤n, $D_{n,k}$ denotes the distance (number of trials) between and including the first 1 of the first k-tuple of 1s and the last 1 of the last k-tuple of 1s in the sequence. If there is only one k-tuple of 1s in the sequence, then $D_{n,k}$ denotes its length. That is, $D_{n,k}$ represents the size (length) of the minimum (sub)sequence of $\{X_{t}\}_{t=1}^{n}$ in which all $G_{n,k}$ k-tuples of 1s are concentrated. In particular, $D_{n,1}$ represents the length of the minimum segment of the sequence containing all $R_{n}$ runs of 1s or, in other words, all $S_{n,1}$ 1s appearing in the sequence. For $G_{n,k}\geq 1$, 1≤k≤n, $D_{n,k}$ can be formally defined as

$$ D_{n,k}=U_{n,k}^{(2)}-U_{n,k}^{(1)}+1, $$

(4)

where

$$U_{n,k}^{(1)}=\min\{j:I_{j}=1, 1\leq j\leq n-k+1\}, $$

$$U_{n,k}^{(2)}=\max\{j:I_{j-k+1}=1, k\leq j\leq n\}, $$

$$I_{j}=\prod_{r=j}^{j+k-1}X_{r},\, 1\leq j\leq n-k+1. $$

Readily, $D_{n,k}=S_{n,k}=L_{n}$ if $G_{n,k}=1$, and $D_{n,k}>S_{n,k}>L_{n}$ if $G_{n,k}>1$.

(V) For $G_{n,k}\geq 1$, 1≤k≤n, set $\mathbf{V}_{n,k}=(D_{n,k},G_{n,k},S_{n,k})$. This is the RV we focus on in the article.

Example: By way of illustration consider the trials 1110001100010001010011101111001001001001 numbered from 1 to 40. Then, $L_{40}=4$ and $\mathbf{V}_{40,1}=(40,11,19)$, $\mathbf{V}_{40,2}=(28,4,12)$, $\mathbf{V}_{40,3}=(28,3,10)$, $\mathbf{V}_{40,4}=(4,1,4)$.
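The values in the example can be reproduced mechanically. The sketch below is an illustration only (the helper names `run_lengths`, `D` and `V` are assumptions made here); it computes $D_{n,k}$ through the window indicators $I_{j}$ of Eq. (4) and the other two components from the run lengths.

```python
def run_lengths(x):
    """Lengths of the maximal runs of 1s, from left to right."""
    return [len(block) for block in "".join(map(str, x)).split("0") if block]


def D(x, k):
    """Eq. (4) via the indicators I_j; returns None when G_{n,k} = 0."""
    n = len(x)
    X = [0] + list(x)                                   # 1-based indexing
    I = {j: int(all(X[r] == 1 for r in range(j, j + k)))
         for j in range(1, n - k + 2)}
    starts = [j for j in I if I[j] == 1]
    if not starts:
        return None
    u1 = min(starts)                                    # U^{(1)}_{n,k}
    u2 = max(j for j in range(k, n + 1) if I[j - k + 1] == 1)   # U^{(2)}_{n,k}
    return u2 - u1 + 1


def V(x, k):
    """(D_{n,k}, G_{n,k}, S_{n,k}), or None when there is no k-tuple of 1s."""
    big = [length for length in run_lengths(x) if length >= k]
    return (D(x, k), len(big), sum(big)) if big else None


x = [int(c) for c in "1110001100010001010011101111001001001001"]
print(max(run_lengths(x)), [V(x, k) for k in (1, 2, 3, 4)])
```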

2.2 Internal structure’s models

A sufficiently general model for the internal structure of a 0−1 sequence $\{X_{t}\}_{t=1}^{n}$, n≥2, is that of the first n trials of a homogeneous 0−1 Markov chain of first order (HMC1). We develop our results on such a model. Accordingly, we next state the necessary notation and definitions.

Let $\{X_{t}\}_{t\geq 1}$ be a HMC1 with state space $\mathcal{A}=\{0,1\}$, one-step transition probability matrix

$$\mathbf{P}=(p_{ij})=\left(\begin{array}{cc} p_{00} & p_{01} \\ p_{10} & p_{11} \\ \end{array} \right), $$

with

$$ p_{ij}=P\left(X_{t}=j\mid X_{t-1}=i\right),\,i,j\in {\cal{A}},\,\sum_{j\in \cal{A}}p_{ij}=1,\,i\in {\cal{A}},\,t\geq 2 $$

(5)

and probability distribution vector at time t

$$\mathbf{p}^{(t)}=\left(p_{0}^{(t)}, p_{1}^{(t)}\right), $$

with

$$ p_{i}^{(t)}=P(X_{t}=i),\, i\in {\cal{A}},\, \sum_{i\in \cal{A}}p_{i}^{(t)}=1,\, t\geq 1. $$

(6)

Readily, because of the homogeneity of $\{X_{t}\}_{t\geq 1}$, it holds

$$\mathbf{p}^{(t)}=\mathbf{p}^{(t-1)}\mathbf{P}=\mathbf{p}^{(1)}\mathbf{P}^{t-1 },\,t\geq 2;\,\mathbf{p}^{(1)},\,t=1\,\,\text{and}\,\,\mathbf{P}^{t-1}=\left(p_{ij}^{(t-1)}\right),\, t\geq 2, $$

with

$$p_{i}^{(t)}=\mathbf{p}^{(t)}\mathbf{e}_{i}^{'},\, i\in {\mathcal{A}},\,t\geq 1, $$

$$ p_{ij}^{(t-1)}=P(X_{t-1+m}=j\mid X_{m}=i)=\mathbf{e}_{i}\mathbf{P}^{t-1}\mathbf{e}_{j}^{'},\,i,j\in {\mathcal{A}},\, t\geq 2,\, m\geq 1, $$

(7)

where $\mathbf {e}_{i}^{'}$ is the transpose (i.e. the column vector) of the row vector $\mathbf{e}_{i}$, $i\in {\mathcal {A}}$, with $\mathbf{e}_{0}=(1,0)$ and $\mathbf{e}_{1}=(0,1)$.

In particular, for $p_{01}+p_{10}\neq 0$, i.e. $\mathbf{P}\neq \mathbf{I}$, it holds

$$ \mathbf{P}^{t-1}=\left(p_{01}+p_{10}\right)^{-1}\left\{\left(\begin{array}{cc} p_{10} & p_{01} \\ p_{10} & p_{01} \\ \end{array} \right)+(1-p_{01}-p_{10})^{t-1}\left(\begin{array}{cc} p_{01} & -p_{01} \\ -p_{10} & p_{10} \\ \end{array} \right)\right\},\, t\geq 2, $$

(8)

$$ p_{0}^{(t)}=p_{0}^{(1)}\left(1-p_{01}-p_{10}\right)^{t-1}+p_{10}\left(p_{01}+p_{10}\right)^{-1}\left[1-\left(1-p_{01}-p_{10}\right)^{t-1}\right],\, t\geq 1. $$

(9)
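Equations (8) and (9) are easy to check numerically against repeated matrix multiplication. The sketch below is illustrative only: the transition probabilities and the initial probability are arbitrary values chosen here, and the 2×2 helpers are written in plain Python to keep the example self-contained.

```python
def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def matpow(P, e):
    """P**e by repeated multiplication (e >= 0)."""
    R = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(e):
        R = matmul(R, P)
    return R


def closed_form_power(p01, p10, t):
    """Eq. (8): P^(t-1), valid for p01 + p10 != 0."""
    s = p01 + p10
    lam = (1 - s) ** (t - 1)
    return [[(p10 + lam * p01) / s, (p01 - lam * p01) / s],
            [(p10 - lam * p10) / s, (p01 + lam * p10) / s]]


def p0_at(p0_1, p01, p10, t):
    """Eq. (9): P(X_t = 0) from the initial probability p0_1 = P(X_1 = 0)."""
    s = p01 + p10
    lam = (1 - s) ** (t - 1)
    return p0_1 * lam + p10 / s * (1 - lam)


p01, p10, p0_1 = 0.3, 0.2, 0.4                      # arbitrary illustrative values
P = [[1 - p01, p01], [p10, 1 - p10]]
for t in range(2, 9):
    M = matpow(P, t - 1)
    direct_p0 = p0_1 * M[0][0] + (1 - p0_1) * M[1][0]   # first entry of p^(1) P^(t-1)
    print(t, round(direct_p0, 6), round(p0_at(p0_1, p01, p10, t), 6))
```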

The setup of a 0−1 HMC1 $\{X_{t}\}_{t=1}^{n}$, n≥2, also covers the case of a 0−1 sequence of independent and identically distributed (IID) RVs. This is so because a 0−1 IID sequence $\{X_{t}\}_{t=1}^{n}$, n≥2, with

$$ P(X_{t}=1)=1-P(X_{t}=0)=p_{1},\, 1\leq t\leq n, $$

(10)

is a particular HMC1 with

$$p_{ij}=1-p_{1},\, j=0;\, p_{1}, j=1,\, i\in {\mathcal{A}},\,p_{ij}^{(t-1)}=p_{ij},\,i,j\in {\mathcal{A}},\,t\geq 2, $$

$$ p_{1}^{(t)}=p_{1}=1-p_{0}^{(t)},\, 1\leq t\leq n. $$

(11)
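As a consistency check (a worked substitution added here, not taken from the original text), inserting the IID transition probabilities $p_{01}=p_{1}$, $p_{10}=1-p_{1}$ into Eqs. (8) and (9) recovers (11). Since $p_{01}+p_{10}=1$, the factor $(1-p_{01}-p_{10})^{t-1}$ vanishes for t≥2, so that

$$ \mathbf{P}^{t-1}=\left(\begin{array}{cc} p_{10} & p_{01} \\ p_{10} & p_{01} \\ \end{array} \right)=\left(\begin{array}{cc} 1-p_{1} & p_{1} \\ 1-p_{1} & p_{1} \\ \end{array} \right)=\mathbf{P},\qquad p_{0}^{(t)}=p_{10}=1-p_{1},\quad t\geq 2. $$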

2.3 A combinatorial result

In the combinatorial analysis used in Section 4, the following result, recalled from Makri et al. (2007), is useful. The coefficient

$$ H_{m}(\alpha,r,k)=\sum_{j=0}^{\left\lfloor\frac{\alpha}{k+1}\right\rfloor}(-1)^{j}{m\choose j}{\alpha-(k+1)j+r-1\choose \alpha-(k+1)j}, $$

(12)

represents the number of allocations of α indistinguishable balls into r distinguishable cells where each of the m, 0≤m≤r, specified cells is occupied by at most k balls. Equivalently, it gives the number of nonnegative integer solutions of the linear equation $x_{1}+x_{2}+\ldots+x_{r}=\alpha$ with the restrictions, for m≥1, $0\leq x_{i_{j}}\leq k$, 1≤j≤m, for some specific m-combination $\{i_{1},i_{2},\ldots,i_{m}\}$ of {1,2,…,r}, and no restrictions on the $x_{j}$, 1≤j≤r, for m=0.

Moreover, $H_{r}(\alpha,r,k)$ is Riordan’s (1964, p. 104) coefficient

$$ C(\alpha,r,k)=\sum_{j=0}^{\left\lfloor\frac{\alpha}{k+1}\right\rfloor}(-1)^{j}{r\choose j}{\alpha-(k+1)j+r-1\choose \alpha-(k+1)j}. $$

(13)
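The combinatorial meaning of (12) can be confirmed by direct enumeration for small arguments. The sketch below is an illustration under the assumption α≥0, r≥1, k≥0, for which the ordinary binomial coefficient of Python's `math.comb` suffices; the names `H` and `count_allocations` are chosen here, and the capped cells are taken to be the first m cells.

```python
from itertools import product
from math import comb


def H(m, alpha, r, k):
    """Eq. (12) for alpha >= 0, r >= 1, k >= 0 (inclusion-exclusion sum)."""
    return sum((-1) ** j * comb(m, j)
               * comb(alpha - (k + 1) * j + r - 1, alpha - (k + 1) * j)
               for j in range(alpha // (k + 1) + 1))


def count_allocations(m, alpha, r, k):
    """Brute force: alpha balls into r cells, the first m cells holding at most k."""
    return sum(1 for x in product(range(alpha + 1), repeat=r)
               if sum(x) == alpha and all(x[i] <= k for i in range(m)))


print(H(2, 5, 3, 2), count_allocations(2, 5, 3, 2))   # both print 9
```

With m=r the same function returns Riordan's coefficient C(α,r,k) of (13).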

Motivation and aim of the work

In a study of a 0−1 sequence $\{X_{t}\}_{t=1}^{n}$, n≥3, it is reasonable to be interested in the probabilistic behavior of the RV $\mathbf{V}_{n,k}=(D_{n,k},G_{n,k},S_{n,k})$. This is because its components jointly provide a more refined view of the internal clustering structure of the sequence than the information extracted from each one alone.

Interpreting a k-tuple of 1s as a cluster of consecutive 1s of size at least k, $D_{n,k}$ represents the size of the minimum segment of $\{X_{t}\}_{t=1}^{n}$ in which the $G_{n,k}$ clusters of size at least k and at most $L_{n}$ are concentrated. The overall density of the $G_{n,k}$ clusters, with respect to the number of 1s in them, as well as of the minimum concentration segment, is evaluated by $S_{n,k}$. Large values of $D_{n,k}$ suggest that these $G_{n,k}$ clusters spread over the interval between the left and the right side of the sequence, whereas small values of $D_{n,k}$ indicate rather that the clusters are concentrated in a short segment of the sequence, leaving the remaining part(s) of the sequence empty of such clusters.

In addition to this information, a large value of $S_{n,k}$ paired with a small value of $G_{n,k}$ indicates the existence of large clusters of 1s and therefore a trend, whereas the same value of $S_{n,k}$ paired with a large value of $G_{n,k}$ indicates rather a distribution of small clusters in the (sub)sequence in which they are concentrated.

Therefore, based on the above interpretation, the motivation for the study as well as the usefulness of the statistic $\mathbf{V}_{n,k}=(D_{n,k},G_{n,k},S_{n,k})$ is apparent. In the sequel, we assume that $G_{n,k}\geq 2$, so that there are at least two k-tuples of 1s in the sequence and, accordingly, the distance $D_{n,k}$ is not degenerate. Moreover, this assumption is a common one in an application area of $D_{n,k}$; e.g., in detecting pattern (tandem or non-tandem direct) repeats in DNA sequences (Benson 1999).

For 1≤k≤n, set

$$ {\mathcal{M}}_{n,k}=\{G_{n,k}\geq 2\},\,\alpha_{n,k}=P\left({\mathcal{M}}_{n,k}\right) $$

(14)

and for n≥3, 1≤k≤⌊(n−1)/2⌋, define

$$ \Omega_{n,k}=\left\{(d,m,s): 2k+1\leq d\leq n, 2k\leq s\leq d-1, 2\leq m\leq \min\left(\lfloor s/k\rfloor, d-s+1\right)\right\} $$

(15)

and for $(d,m,s)\in\Omega_{n,k}$,

$$h_{n,k}(d,m,s)=P\left(\mathbf{V}_{n,k}=(d,m,s), {\cal {M}}_{n,k}\right), $$

$$ v_{n,k}(d,m,s)=P\left(\mathbf{V}_{n,k}=(d,m,s)\mid {\cal {M}}_{n,k}\right)=h_{n,k}(d,m,s)/\alpha_{n,k}. $$

(16)
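The index set (15) can be listed explicitly and compared with the values that $\mathbf{V}_{n,k}$ actually takes on sequences with at least two k-tuples of 1s. The sketch below is an illustrative check by exhaustive enumeration, feasible only for small n; the names `omega`, `v_stat` and `support` are assumptions made here.

```python
from itertools import product


def omega(n, k):
    """The index set of Eq. (15)."""
    return {(d, m, s)
            for d in range(2 * k + 1, n + 1)
            for s in range(2 * k, d)
            for m in range(2, min(s // k, d - s + 1) + 1)}


def v_stat(x, k):
    """(D, G, S) for runs of 1s of length >= k, or None if there is no such run."""
    runs, start = [], None
    for t, v in enumerate(list(x) + [0], start=1):   # trailing 0 closes a final run
        if v == 1 and start is None:
            start = t
        elif v == 0 and start is not None:
            if t - start >= k:
                runs.append((start, t - start))
            start = None
    if not runs:
        return None
    first, last = runs[0][0], runs[-1][0] + runs[-1][1] - 1
    return (last - first + 1, len(runs), sum(length for _, length in runs))


def support(n, k):
    """Values of V_{n,k} over all 0-1 sequences of length n with G_{n,k} >= 2."""
    out = set()
    for x in product((0, 1), repeat=n):
        v = v_stat(x, k)
        if v is not None and v[1] >= 2:
            out.add(v)
    return out


for n, k in [(8, 1), (8, 2), (8, 3)]:
    print(n, k, len(omega(n, k)), support(n, k) == omega(n, k))
```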

The paper provides exact closed form expressions for $\alpha_{n,k}$, $h_{n,k}(d,m,s)$ and eventually for $v_{n,k}(d,m,s)$ when $\mathbf{V}_{n,k}$ is defined on a 0−1 HMC1/IID sequence. The expressions are obtained via combinatorial analysis.

More specifically, closed formulae are established for the first time for $h_{n,k}(d,m,s)$, 1≤k≤⌊(n−1)/2⌋, when $\mathbf{V}_{n,k}$ is defined on a 0−1 HMC1 with given $\mathbf{P}$ and $\mathbf{p}^{(1)}$. Since the general framework of HMC1 covers IID sequences as a particular case, the implied expressions for $v_{n,k}(d,m,s)$ are alternative to those obtained for $v_{n,k}(d,m,s)$, 1≤k≤⌊(n−1)/2⌋, by Makri et al. (2015) for IID sequences.

Moreover, for n≥3, 1≤k≤⌊(n−1)/2⌋, 2k+1≤d≤n, let

$$f_{n,k}(d)=P\left(D_{n,k}=d\mid {\cal {M}}_{n,k}\right). $$

Therefore, since

$$ f_{n,k}(d)=\sum_{s=2k}^{d-1}\sum_{m=2}^{\min\left(\lfloor s/k\rfloor,d-s+1\right)}v_{n,k}(d,m,s)=\alpha_{n,k}^{-1}\sum_{s=2k}^{d-1}\sum_{m=2}^{\min\left(\lfloor s/k\rfloor,d-s+1\right)}h_{n,k}(d,m,s), $$

(17)

the work provides closed form expressions for determining $f_{n,k}(d)$ for HMC1 and IID 0−1 sequences $\{X_{t}\}_{t=1}^{n}$. These expressions are alternative to those derived, for IID sequences, by Makri et al. (2015) for 1≤k≤⌊(n−1)/2⌋ as well as to those obtained, for HMC1, by Arapis et al. (2016) for k=1 and by Arapis et al. (2017) for 1≤k≤⌊(n−1)/2⌋.

Results

In a 0−1 sequence $\{X_{t}\}_{t=1}^{n}$, n≥2, for 0≤y≤n, 0≤r≤⌊(n+1)/2⌋ and (i,j)∈{0,1}², define

$$B_{n}^{(i,j)}(y,r)=\{X_{1}=i,X_{n}=j,Y_{n}=y,G_{n,1}=r\}, $$

$$\pi_{n}^{(i,j)}(y,r)=P(B_{n}^{(i,j)}(y,r)). $$

Accordingly, for a HMC1 $\{X_{t}\}_{t=1}^{n}$, n≥2, with given $\mathbf{P}$ and $\mathbf{p}^{(1)}$, it holds

$$ \pi_{n}^{(i,j)}(y,r)=\left(p_{1}^{(1)}\right)^{i}\left(1-p_{1}^{(1)}\right)^{1-i}p_{00}^{y-r-1+i+j}\left(1-p_{00}\right)^{r-i} \left(1-p_{11}\right)^{r-j}p_{11}^{n-y-r}, $$

(18)

for $2-(i+j)\leq y\leq n-(i+j)$, $1-\delta_{y,0}-\delta_{y,n}+\delta_{i+j,2}\leq r\leq \min\{n-y,\,y-1+i+j\}$, and $\pi _{n}^{(i,j)}(y,r)=0$ otherwise.

Consequently, $\pi _{n}^{(i,j)}(y,r)$, for a 0−1 IID sequence, reduces to

$$ \pi_{n}^{(i,j)}(y,r)=\pi_{n}(y)=p_{1}^{n-y}(1-p_{1})^{y},\, 0\leq y\leq n. $$

(19)
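Expression (18) contains no counting factor; in the way it is used in Theorems 1 and 2 it acts as the common probability of each individual sequence with first symbol i, last symbol j, y zeros and r runs of 1s. The sketch below checks this reading by enumeration for an arbitrary illustrative chain; the function names and the parameter values are assumptions made here.

```python
from itertools import product


def seq_prob(x, P, p1):
    """Probability of one realisation x of an HMC1 with P(X_1 = 1) = p1."""
    pr = p1 if x[0] == 1 else 1 - p1
    for a, b in zip(x, x[1:]):
        pr *= P[a][b]
    return pr


def pi(n, i, j, y, r, P, p1):
    """Eq. (18), set to zero outside the stated range of y and r."""
    lo = 1 - (y == 0) - (y == n) + (i + j == 2)      # Kronecker deltas as 0/1
    if not (2 - (i + j) <= y <= n - (i + j)):
        return 0.0
    if not (lo <= r <= min(n - y, y - 1 + i + j)):
        return 0.0
    p00, p11 = P[0][0], P[1][1]
    return ((p1 if i == 1 else 1 - p1) * p00 ** (y - r - 1 + i + j)
            * (1 - p00) ** (r - i) * (1 - p11) ** (r - j) * p11 ** (n - y - r))


def max_discrepancy(n, P, p1):
    """Largest gap between Eq. (18) and the path probability over all sequences."""
    worst = 0.0
    for x in product((0, 1), repeat=n):
        y = x.count(0)
        r = sum(1 for t in range(n) if x[t] == 1 and (t == 0 or x[t - 1] == 0))
        worst = max(worst, abs(seq_prob(x, P, p1) - pi(n, x[0], x[-1], y, r, P, p1)))
    return worst


P = [[0.7, 0.3], [0.4, 0.6]]                          # illustrative values only
print(max_discrepancy(6, P, 0.35))
```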

Theorem 1

For n≥3, $(d,m,s)\in\Omega_{n,1}$, $0<p_{1}^{(1)}<1$, it holds

$$\begin{array}{@{}rcl@{}} h_{n,1}(d,m,s)&=& {s-1\choose m-1}{d-s-1\choose m-2}\pi_{d}^{(1,1)}(d-s,m)\varepsilon_{n}(d) \end{array} $$

(20)

where $\varepsilon_{n}(d)=1$ if n=d, and $\varepsilon_{n}(d)=p_{00}^{n-d-2}\left \{p_{10}p_{00}+p_{0}^{(1)}(p_{1}^{(1)})^{-1}p_{01}\left [(n-d-1)p_{10}+p_{00}\right ]\right \}$ if n≥d+1.

Proof

For d=3,…,n−2, i=2,3,…,n−d, s=2,3,…,d−1, m=2,3,…, min{s,d−s+1} an element of the event $\Gamma _{i,d,m,s}=\{U_{n,1}^{(1)}=i, D_{n,1}=d, R_{n}=m, S_{n,1}=s\}$ is a 0−1 sequence of length n with probability

$$p_{0}^{(1)}p_{00}^{i-2}p_{01}\left[\pi_{d}^{(1,1)}(d-s,m)\left(p_{1}^{(1)}\right)^{-1}\right]p_{10}p_{00}^{n-i-d}. $$

Fix i. Then the number of elements of the event $\Gamma_{i,d,m,s}$ is ${s-1\choose m-1}{d-s-1\choose m-2}$, since the number of allocations of s 1s in m runs of 1s is ${s-1\choose m-1}$ and the number of allocations of d−s 0s in m−1 runs of 0s is ${d-s-1\choose m-2}$, so that

$$P\left(\Gamma_{i,d,m,s}\right)={s-1\choose m-1}{d-s-1\choose m-2}p_{0}^{(1)}p_{01}\left[\pi_{d}^{(1,1)}(d-s,m)\left(p_{1}^{(1)}\right)^{-1}\right]p_{10}p_{00}^{n-d-2}. $$

We use similar reasoning for the remaining cases, i.e. i=1 and i=n−d+1. Then summing with respect to i we get the result. □

For a sequence $\{X_{t}\}_{t=1}^{n}$ of 0−1 IID RVs, $h_{n,1}(d,m,s)$ reduces to the explicit formula given in the next Corollary.

Corollary 1

For n≥3, $(d,m,s)\in\Omega_{n,1}$, $0<p_{1}<1$, it is true that

$$ h_{n,1}(d,m,s)=(n-d+1){s-1\choose m-1}{d-s-1\choose m-2}p_{1}^{s}(1-p_{1})^{n-s}.\quad \diamondsuit $$

(21)
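Theorem 1 and Corollary 1 can be checked against exhaustive enumeration for small n. The sketch below is an illustration, not the authors' code; the chain parameters are arbitrary values chosen here, and `h1`, `h1_iid` and `h_brute` are hypothetical names.

```python
from itertools import product
from math import comb


def seq_prob(x, P, p1):
    """Probability of one realisation x of an HMC1 with P(X_1 = 1) = p1."""
    pr = p1 if x[0] == 1 else 1 - p1
    for a, b in zip(x, x[1:]):
        pr *= P[a][b]
    return pr


def h1(n, d, m, s, P, p1):
    """Eq. (20) with pi_d^{(1,1)}(d-s, m) written out from Eq. (18)."""
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    p0 = 1 - p1
    pi = p1 * p00 ** (d - s - m + 1) * p01 ** (m - 1) * p10 ** (m - 1) * p11 ** (s - m)
    if n == d:
        eps = 1.0
    else:
        eps = p00 ** (n - d - 2) * (p10 * p00 + p0 / p1 * p01 * ((n - d - 1) * p10 + p00))
    return comb(s - 1, m - 1) * comb(d - s - 1, m - 2) * pi * eps


def h1_iid(n, d, m, s, p1):
    """Eq. (21)."""
    return (n - d + 1) * comb(s - 1, m - 1) * comb(d - s - 1, m - 2) * p1 ** s * (1 - p1) ** (n - s)


def h_brute(n, P, p1):
    """P(V_{n,1} = (d, m, s), R_n >= 2) by summing over all 2^n sequences."""
    h = {}
    for x in product((0, 1), repeat=n):
        ones = [t for t in range(1, n + 1) if x[t - 1] == 1]
        m = sum(1 for t in ones if t == 1 or x[t - 2] == 0)      # number of runs
        if m >= 2:
            key = (ones[-1] - ones[0] + 1, m, len(ones))
            h[key] = h.get(key, 0.0) + seq_prob(x, P, p1)
    return h


P, p1 = [[0.7, 0.3], [0.4, 0.6]], 0.35                # illustrative values only
ref = h_brute(7, P, p1)
print(max(abs(h1(7, d, m, s, P, p1) - val) for (d, m, s), val in ref.items()))
```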

In order to derive $h_{n,k}(d,m,s)$, 5≤2k+1≤n, for HMC1 in the forthcoming Theorem 2, we next recall, in Lemma 1, a result from Makri et al. (On the concentration of runs of ones of length exceeding a threshold in a Markov chain, submitted).

Lemma 1

For (i,j)∈{0,1}², n≥2, set $\lambda _{n,k}^{(i,j)}(x)=P(G_{n,k}=x,X_{1}=i,X_{n}=j)$, x=0,1. Then, it holds that:

(I) For 2≤k≤n−2+i+j,

$$\lambda_{n,k}^{(i,j)}(0)=\sum_{y=1}^{n-(i+j)}\sum_{r=i+j}^{y-1+i+j} {y-1\choose r-i-j}C(n-y-r,r,k-2)\pi_{n}^{(i,j)}(y,r), $$

$$ {}\lambda_{n,k}^{(i,j)}(1)=\pi_{n}^{(i,j)}(0,1)\delta_{2,i+j}+\sum_{y=1}^{n-k}\sum_{r=1}^{y-1+i+j} r{y-1\choose r-i-j}H_{r-1}(n-y-r-k+1,r,k-2)\pi_{n}^{(i,j)}(y,r). $$

(22)

(II) For k>n−2+i+j,

$$\lambda_{n,k}^{(i,j)}(0)=\left(p_{1}^{(1)}\right)^{i}\left(1-p_{1}^{(1)}\right)^{1-i}p_{ij}^{(n-1)}, $$

$$ \lambda_{n,k}^{(i,j)}(1)=0. $$

(23)

Theorem 2

For n≥5, 2≤k≤⌊(n−1)/2⌋, $(d,m,s)\in\Omega_{n,k}$, $0<p_{1}^{(1)}<1$, it holds

$$ h_{n,k}(d,m,s)=p_{11}^{2k-2}\left(p_{1}^{(1)}\right)^{-1}\sum_{i=1}^{n-d+1}\ell_{i-1,k}^{(\alpha)}\ell_{n-d-i+1,k}^{(\beta)}\sum_{r=m}^{m+\left\lfloor\frac{d-s-m+1}{2}\right\rfloor}\sum_{y=r-1}^{d-s-r+m}\gamma_{d,m,s}(y,r)\pi_{d-2k+2}^{(1,1)}(y,r), $$

(24)

where

$${}\ell_{n,k}^{(\alpha)}=p_{1}^{(1)},\,\text{for}\,n=0;\quad p_{0}^{(n)}p_{01},\,\text{for}\, 1\leq n\leq k;\quad p_{01}\left[\lambda_{n,k}^{(0,0)}(0)+\lambda_{n,k}^{(1,0)}(0)\right],\,\text{for}\,n\geq k+1, $$

$$ {}\ell_{n,k}^{(\beta)}=1,\,\text{for}\,n=0;\quad p_{10},\,\text{for}\, 1\leq n\leq k;\quad \!p_{10}(p_{0}^{(1)})^{-1}\left[\lambda_{n,k}^{(0,0)}(0)+\lambda_{n,k}^{(0,1)}(0)\right],\,\text{for}\,n\geq k+1 $$

(25)

and

$$\begin{array}{@{}rcl@{}} {}\gamma_{d,m,s}(y,r)\,=\,{y-1\choose r-2}{r-2\choose m-2}{s-mk+m-1\choose m-1}C(d-y-s-r+m,r-m,k\,-\,2). \end{array} $$

(26)

Proof

For 1≤r ₁≤r ₂≤n let $Y_{r_{1},r_{2}}$, $R_{r_{1},r_{2}}$, $L_{r_{1},r_{2}}$, $S_{r_{1},r_{2},k}$, $D_{r_{1},r_{2},k}$, $G_{r_{1},r_{2},k}$ be RVs defined on the subsequence $X_{r_{1}}, X_{r_{1}+1},\ldots,X_{r_{2}}$ of $\{X_{t}\}_{t=1}^{n}$. For m≥2 define the event

$$\begin{array}{@{}rcl@{}} \lefteqn{\Delta_{r_{1},r_{1}+d-1}(d,s,m,y,r)}\\ & & =\{D_{r_{1},r_{1}+d-1,k}=d, G_{r_{1},r_{1}+d-1,k}=m, S_{r_{1},r_{1}+d-1,k}=s, Y_{r_{1},r_{1}+d-1}=y, R_{r_{1},r_{1}+d-1}=r\}. \end{array} $$

An element of this event is a 0−1 sequence of length d, starting and ending with a 1, for which the $y_{j}$ and $z_{j}$, representing the lengths of the failure and success runs, respectively, satisfy the conditions:

(a) $y_{1}+y_{2}+\ldots+y_{r-1}=y$, $y_{j}\geq 1$, 1≤j≤r−1.
(b) $z_{1}+z_{i_{1}}+z_{i_{2}}+\ldots +z_{i_{m-2}}+z_{r}=s$, $z_{j}\geq k$, $j\in\{1,i_{1},i_{2},\ldots,i_{m-2},r\}$, for some specific combination $\{1,i_{1},i_{2},\ldots,i_{m-2},r\}$ of {1,2,…,r−1,r} among the ${r-2\choose m-2}$ possible ones.
(c) $z_{i_{m-1}}+z_{i_{m}}+\ldots +z_{i_{r-2}}=d-y-s$, $1\leq z_{i_{j}}\leq k-1$, m−1≤j≤r−2, for $\{i_{m-1},\ldots,i_{r-2}\}=\{1,2,\ldots,r\}-\{1,i_{1},i_{2},\ldots,i_{m-2},r\}$.

Fix $i_{1},i_{2},\ldots,i_{m-2}$. Then the number of such sequences, i.e. the number of solutions of the system (a)-(c), is

$${y-1\choose r-2}C(d-y-s-r+m,r-m,k-2){s-mk+m-1\choose m-1} $$

and each such sequence has probability

$$p_{1}^{(1)}p_{11}^{k-1}(p_{1}^{(1)})^{-1}\pi_{d-2k+2}^{(1,1)}(y,r)p_{11}^{k-1}=p_{11}^{2k-2}\pi_{d-2k+2}^{(1,1)}(y,r). $$

Hence,

$$\begin{array}{@{}rcl@{}} P(\Delta_{r_{1},r_{1}+d-1}(d,s,m,y,r))&=& p_{11}^{2k-2}\pi_{d-2k+2}^{(1,1)}(y,r){r-2\choose m-2}{y-1\choose r-2}{s-mk+m-1\choose m-1}\\ & & \times C(d-y-s-r+m,r-m,k-2). \end{array} $$

For k+2≤i≤n−k−d, m≥2, we have that

$$\begin{array}{@{}rcl@{}} \lefteqn{P\left(U_{n,k}^{(1)}=i, D_{n,k}=d, G_{n,k}=m, S_{n,k}=s, Y_{i,i+d-1}=y, R_{i,i+d-1}=r\right)}\\ &=&P\Big\{\left[(L_{1,i-1}<k, X_{i-1}=0)\cap\left[(X_{1}=0)\cup(X_{1}=1)\right]\right]\cap \Delta_{i,i+d-1}(d,s,m,y,r)\\ & &\cap\left[(L_{i+d,n}<k, X_{i+d}=0)\cap\left[(X_{n}=0)\cup(X_{n}=1)\right]\right]\Big\}\\ &= & \left[\lambda_{i-1,k}^{(0,0)}(0)+\lambda_{i-1,k}^{(1,0)}(0)\right]p_{01}\\ & &\times \left(p_{1}^{(1)}\right)^{-1}P\left(\Delta_{i,i+d-1}(d,s,m,y,r)\right) p_{10}\left[\lambda_{n-i-d+1,k}^{(0,0)}(0)+\lambda_{n-i-d+1,k}^{(0,1)}(0)\right]/p_{0}^{(1)}\\ &=& \left[\lambda_{i-1,k}^{(0,0)}(0)+\lambda_{i-1,k}^{(1,0)}(0)\right]p_{01}\left(p_{1}^{(1)}\right)^{-1}p_{11}^{2k-2}\pi_{d-2k+2}^{(1,1)}(y,r)\\ & & \times{r-2\choose m-2}{y-1\choose r-2}{s-mk+m-1\choose m-1}\\ & & \times C(d-y-s-r+m,r-m,k-2)p_{10}\left(p_{0}^{(1)}\right)^{-1} \left[\lambda_{n-i-d+1,k}^{(0,0)}(0)+\lambda_{n-i-d+1,k}^{(0,1)}(0)\right]. \end{array} $$

By similar reasoning we get the remaining cases of i, i.e. 1≤i≤k+1 and n−d+1−k≤i≤n−d+1. Then summing with respect to i, y and r we get the result. □

Having found $h_{n,k}(d,m,s)$, we next proceed to obtain $v_{n,k}(d,m,s)$. To this end, the required probabilities $\alpha_{n,k}$ for HMC1 are recalled, in Lemma 2, from Arapis et al. (2016) for k=1, and they are computed via Lemma 1 for 2≤k≤⌊(n−1)/2⌋.

Lemma 2

For n≥k≥1, the probability $\alpha_{n,k}$, for HMC1, is computed via the following expressions.

(I) For k=1,

$$\begin{array}{@{}rcl@{}} \alpha_{n,1}&=&1-p_{00}^{n-3}\left\{p_{00}\left(1+(n-2)p_{01}\right)+\frac{(n-1)(n-2)}{2}p_{0}^{(1)}p_{01}^{2}\right\},\quad \text{if}\quad p_{00}=p_{11} \end{array} $$

and

$$\begin{array}{@{}rcl@{}} \alpha_{n,1}&=& 1-p_{0}^{(1)}p_{00}^{n-1}-p_{11}^{n-2}\left(p_{1}^{(1)}+p_{0}^{(1)}p_{01}\right)-p_{00} \left(p_{0}^{(1)}p_{01}+p_{1}^{(1)}p_{10}\right)\frac{p_{11}^{n-2}-p_{00}^{n-2}}{p_{11}-p_{00}} \\ & &\quad -p_{0}^{(1)}p_{01}p_{10}\frac{p_{11}^{n-1}-p_{00}^{n-2} \left[p_{11}+(n-2)\left(p_{11}-p_{00}\right)\right]}{\left(p_{11}-p_{00}\right)^{2}},\quad \text{if}\quad p_{00}\neq p_{11}. \end{array} $$

(27)

(II) For 2≤k≤n,

$$\begin{array}{@{}rcl@{}}\alpha_{n,k}=1-\sum_{(i,j)\in \{0,1\}^{2}}\left[\lambda_{n,k}^{(i,j)}(0)+\lambda_{n,k}^{(i,j)}(1)\right]. \end{array} $$

(28)
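Part (I) of Lemma 2 can be compared with a direct enumeration of $P(R_{n}\geq 2)$ for small n. The sketch below is an illustrative check only; `alpha1` and `alpha1_brute` are names chosen here and the parameter values are arbitrary.

```python
from itertools import product


def seq_prob(x, P, p1):
    """Probability of one realisation x of an HMC1 with P(X_1 = 1) = p1."""
    pr = p1 if x[0] == 1 else 1 - p1
    for a, b in zip(x, x[1:]):
        pr *= P[a][b]
    return pr


def alpha1(n, P, p1):
    """Eq. (27): P(G_{n,1} >= 2), with separate branches for p00 = p11 and p00 != p11."""
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    p0 = 1 - p1
    if p00 == p11:
        return 1 - p00 ** (n - 3) * (p00 * (1 + (n - 2) * p01)
                                     + (n - 1) * (n - 2) / 2 * p0 * p01 ** 2)
    dlt = p11 - p00
    return (1 - p0 * p00 ** (n - 1) - p11 ** (n - 2) * (p1 + p0 * p01)
            - p00 * (p0 * p01 + p1 * p10) * (p11 ** (n - 2) - p00 ** (n - 2)) / dlt
            - p0 * p01 * p10
            * (p11 ** (n - 1) - p00 ** (n - 2) * (p11 + (n - 2) * dlt)) / dlt ** 2)


def alpha1_brute(n, P, p1):
    """Sum of the probabilities of all sequences with at least two runs of 1s."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        runs = sum(1 for t in range(n) if x[t] == 1 and (t == 0 or x[t - 1] == 0))
        if runs >= 2:
            total += seq_prob(x, P, p1)
    return total


for P, p1 in [([[0.9, 0.1], [0.1, 0.9]], 0.5), ([[0.7, 0.3], [0.4, 0.6]], 0.35)]:
    print(alpha1(8, P, p1), alpha1_brute(8, P, p1))
```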

Theorem 3

For n≥3, 1≤k≤⌊(n−1)/2⌋, $(d,m,s)\in\Omega_{n,k}$, $0<p_{1}^{(1)}<1$, the PMF $v_{n,k}(d,m,s)$ for a HMC1, with given $\mathbf{P}$ and $\mathbf{p}^{(1)}$, is calculated by

$$ v_{n,k}(d,m,s)=\alpha_{n,k}^{-1}h_{n,k}(d,m,s), $$

(29)

where $\alpha_{n,k}$ and $h_{n,k}(d,m,s)$ are provided by Lemma 2 and Theorems 1 (for k=1) and 2 (for 2≤k≤⌊(n−1)/2⌋), respectively.

Remark 1

For IID sequences, in implementing Theorem 3, one has to take into consideration Eqs. (10)-(11), (19) and (21). Moreover, to speed up calculations, one can factor $\pi_{n}(y)$ out of the inner summation in (22), since by (19) it does not depend on r.

A numerical example

In this example we compute some indicative numerics concerning the two models (HMC1 and IID) of 0−1 sequences $\{X_{t}\}_{t=1}^{n}$ considered in the paper. Their common length was taken small, n=8, partly because of space limitations and partly so that the required computations can also be carried out with a pocket calculator, which helps to gain insight into the formulae developed in Section 4. The sequences used are as follows. Table 1: an IID sequence with $p_{1}=0.5$. Table 2: a HMC1 sequence with $p_{00}=p_{11}=0.9$, $p_{1}^{(1)}=0.5$.

Table 1 0−1 IID sequence with p ₁=0.5


Table 2 0−1 HMC1 sequence with p ₀₀=p ₁₁=0.9, $p_{1}^{(1)}=0.5$


Both tables depict, for k=1,2,3, $v_{8,k}(d,m,s)$, $(d,m,s)\in\Omega_{8,k}$, and $f_{8,k}(d)$, 2k+1≤d≤8, illustrating the numeric values of the involved probabilities. $v_{8,k}(d,m,s)$ and $f_{8,k}(d)$ were computed via Eqs. (29) and (17), respectively.
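For n=8 the same quantities can also be obtained by exhaustive enumeration of the 256 sequences, which offers an independent cross-check of values computed from the closed formulae. The sketch below is such an enumeration, not the closed-form computation used in the paper; `conditional_pmf` and `v_stat` are names chosen here.

```python
from itertools import product


def seq_prob(x, P, p1):
    """Probability of one realisation x of an HMC1 with P(X_1 = 1) = p1."""
    pr = p1 if x[0] == 1 else 1 - p1
    for a, b in zip(x, x[1:]):
        pr *= P[a][b]
    return pr


def v_stat(x, k):
    """(D, G, S) for runs of 1s of length >= k, or None if there is no such run."""
    runs, start = [], None
    for t, v in enumerate(list(x) + [0], start=1):   # trailing 0 closes a final run
        if v == 1 and start is None:
            start = t
        elif v == 0 and start is not None:
            if t - start >= k:
                runs.append((start, t - start))
            start = None
    if not runs:
        return None
    first, last = runs[0][0], runs[-1][0] + runs[-1][1] - 1
    return (last - first + 1, len(runs), sum(length for _, length in runs))


def conditional_pmf(n, k, P, p1):
    """Return (alpha_{n,k}, v_{n,k} as a dict, f_{n,k} as a dict) by enumeration."""
    h = {}
    for x in product((0, 1), repeat=n):
        v = v_stat(x, k)
        if v is not None and v[1] >= 2:
            h[v] = h.get(v, 0.0) + seq_prob(x, P, p1)
    alpha = sum(h.values())
    v_pmf = {key: val / alpha for key, val in h.items()}
    f = {}
    for (d, m, s), val in v_pmf.items():             # marginalise as in Eq. (17)
        f[d] = f.get(d, 0.0) + val
    return alpha, v_pmf, f


models = {"IID, p1=0.5": ([[0.5, 0.5], [0.5, 0.5]], 0.5),
          "HMC1, p00=p11=0.9": ([[0.9, 0.1], [0.1, 0.9]], 0.5)}
for name, (P, p1) in models.items():
    for k in (1, 2, 3):
        alpha, v_pmf, f = conditional_pmf(8, k, P, p1)
        print(name, k, round(alpha, 6), {d: round(p, 4) for d, p in sorted(f.items())})
```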

Discussion and further study

In this article we have derived exact closed form expressions for the PMF $v_{n,k}(d,m,s)$, n≥3, 1≤k≤⌊(n−1)/2⌋, $(d,m,s)\in\Omega_{n,k}$, of the RV $\mathbf{V}_{n,k}\mid\mathcal{M}_{n,k}$ defined on a 0−1 sequence of homogeneous Markov-dependent trials. The method used is a combinatorial one, relying on results that exploit the internal structure of such a sequence.

As noted in the Introduction, runs find application in a diverse range of fields. Some indicative ones are discussed next.

Encoding, compression and transmission of digital information call for an understanding of the distributions of runs of 1s or 0s. Such knowledge helps in analyzing, and also in comparing, several techniques used in communication networks. In such networks, 0−1 data ranging from a few kilobytes (e.g. e-mails) to many gigabytes of greedy multimedia applications (e.g. video on demand) are encoded, decoded and eventually processed securely. For details, see e.g., Sinha and Sinha (2009), Makri and Psillakis (2011a) and Tabatabaei and Zivic (2015).

An area where the study of runs of 1s and 0s has become increasingly useful is the field of bioinformatics or computational biology. For instance, molecular biologists design similarity tests between two DNA sequences where a 1 is interpreted as a match of the sequences at a given position and everything else as a 0. Moreover, the probabilistic analysis of such sequences according to the form, the length and the number of detected patterns, as well as the positions and the lengths of the segments of the sequence in which they are concentrated, may suggest a functional reason for the internal structure of the examined sequence. These facts might be useful in suggesting a further investigation of the underlying sequence(s) by biologists. See, e.g. Avery and Henderson (1999), Benson (1999) and Nuel et al. (2010).

Another active area where run statistics, in particular $G_{n,k}$ and $S_{n,k}$, have interesting statistical applications is that connected to hypothesis testing; e.g., in tests of randomness. For a systematic study of such a topic, we refer, among others, to the works of Koutras and Alexandrou (1997) and Antzoulakos et al. (2003).

Accordingly, it is reasonable to use the exact expressions obtained for $v_{n,k}(d,m,s)$ in applications like the ones mentioned above. This is so because this distribution, as a joint one, is more flexible than each of its marginals, which have been used in such applications. See, e.g. Lou (2003), Makri and Psillakis (2011b) and Arapis et al. (2016).

Moreover, in handling long 0−1 sequences, with dependent or independent elements, a Monte Carlo simulation based on Eqs. (1)-(4) would be a useful tool in obtaining approximate values for $v_{n,k}(d,m,s)$. In addition, the general approximating methods suggested by Johnson and Fu (2014) might be helpful in deriving approximate values for $f_{n,k}(d)$.
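A Monte Carlo estimate of this kind might be sketched as follows. This is an illustration under assumptions made here (the function names, the seed, the number of trials and the IID test parameters are all arbitrary choices), not a procedure given in the paper.

```python
import random


def simulate_hmc1(n, P, p1, rng):
    """Draw X_1..X_n from an HMC1 with P(X_1 = 1) = p1 and transition matrix P."""
    x = [1 if rng.random() < p1 else 0]
    for _ in range(n - 1):
        x.append(1 if rng.random() < P[x[-1]][1] else 0)
    return x


def v_stat(x, k):
    """(D, G, S) for runs of 1s of length >= k, or None if there is no such run."""
    runs, start = [], None
    for t, v in enumerate(list(x) + [0], start=1):   # trailing 0 closes a final run
        if v == 1 and start is None:
            start = t
        elif v == 0 and start is not None:
            if t - start >= k:
                runs.append((start, t - start))
            start = None
    if not runs:
        return None
    first, last = runs[0][0], runs[-1][0] + runs[-1][1] - 1
    return (last - first + 1, len(runs), sum(length for _, length in runs))


def mc_estimate(n, k, P, p1, trials, seed=0):
    """Estimate alpha_{n,k} and the conditional PMF v_{n,k} from simulated sequences."""
    rng = random.Random(seed)
    counts, hits = {}, 0
    for _ in range(trials):
        v = v_stat(simulate_hmc1(n, P, p1, rng), k)
        if v is not None and v[1] >= 2:               # condition on G_{n,k} >= 2
            hits += 1
            counts[v] = counts.get(v, 0) + 1
    return hits / trials, {key: c / hits for key, c in counts.items()}


alpha_hat, v_hat = mc_estimate(8, 1, [[0.5, 0.5], [0.5, 0.5]], 0.5, trials=20000, seed=1)
print(round(alpha_hat, 3), round(219 / 256, 3))
```

For rare conditioning events (small $\alpha_{n,k}$) most simulated sequences are discarded, so the number of trials needs to grow accordingly.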

References

Antzoulakos, DL, Bersimis, S, Koutras, MV: On the distribution of the total number of run lengths. Ann. Inst. Statist. Math. 55, 865–884 (2003).
Antzoulakos, DL, Chadjiconstantinidis, S: Distributions of numbers of success runs of fixed length in Markov dependent trials. Ann. Inst. Statist. Math. 53, 559–619 (2001).
Arapis, AN, Makri, FS, Psillakis, ZM: On the length and the position of the minimum sequence containing all runs of ones in a Markovian binary sequence. Statist. Probab. Lett. 116, 45–54 (2016).
Arapis, AN, Makri, FS, Psillakis, ZM: Distribution of statistics describing concentration of runs in non homogeneous Markov-dependent trials. Commun. Statist. Theor. Meth. (2017). doi:10.1080/03610926.2017.1337144.
Avery, PJ, Henderson, D: Fitting Markov chain models to discrete state series such as DNA sequences. Appl. Statist. 48(Part 1), 53–61 (1999).
Balakrishnan, N, Koutras, MV: Runs and Scans with Applications. Wiley, New York (2002).
Benson, G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Eryilmaz, S: Some results associated with the longest run statistic in a sequence of Markov dependent trials. Appl. Math. Comput. 175, 119–130 (2006).
Eryilmaz, S: Discrete time shock models involving runs. Statist. Probab. Lett. 107, 93–100 (2015).
Eryilmaz, S: Generalized waiting time distributions associated with runs. Metrika. 79, 357–368 (2016).
Eryilmaz, S: The concept of weak exchangeability and its applications. Metrika. 80, 259–271 (2017).
Eryilmaz, S, Yalcin, F: Distribution of run statistics in partially exchangeable processes. Metrika. 73, 293–304 (2011).
Feller, W: An Introduction to Probability Theory and Its Applications. 3rd Ed., Vol. I. Wiley, New York (1968).
Fu, JC, Lou, WYW: Distribution Theory of Runs and Patterns and Its Applications: A finite Markov chain imbedding approach. World Scientific, River Edge (2003).
Johnson, BC, Fu, JC: Approximating the distributions of runs and patterns. J. Stat. Distrib. Appl. 1:5, 1–15 (2014).
Koutras, MV: Applications of Markov chains to the distribution of runs and patterns. In: Shanbhag, DN, Rao, CR (eds.) Handbook of Statistics, pp. 431–472. Elsevier, North-Holland (2003).
Koutras, MV, Alexandrou, V: Non-parametric randomness tests based on success runs of fixed length. Statist. Probab. Lett. 32, 393–404 (1997).
Koutras, VM, Koutras, MV, Yalcin, F: A simple compound scan statistic useful for modeling insurance and risk management problems. Insur. Math. Econ. 69, 202–209 (2016).
Lou, WYW: The exact distribution of the k-tuple statistic for sequence homology. Statist. Probab. Lett. 61, 51–59 (2003).
Makri, FS, Philippou, AN, Psillakis, ZM: Success run statistics defined on an urn model. Adv. Appl. Prob. 39, 991–1019 (2007).
Makri, FS, Psillakis, ZM: On success runs of a fixed length in Bernoulli sequences: Exact and asymptotic results. Comput. Math. Appl. 61, 761–772 (2011a).
Makri, FS, Psillakis, ZM: On runs of length exceeding a threshold: normal approximation. Stat. Papers. 52, 531–551 (2011b).
Makri, FS, Psillakis, ZM: On ℓ-overlapping runs of ones of length k in sequences of independent binary random variables. Commun. Statist. Theor. Meth. 44, 3865–3884 (2015).
Makri, FS, Psillakis, ZM, Arapis, AN: Counting runs of ones with overlapping parts in binary strings ordered linearly and circularly. Intern. J. Statist. Probab. 2, 50–60 (2013).
Makri, FS, Psillakis, ZM, Arapis, AN: Length of the minimum sequence containing repeats of success runs. Statist. Probab. Lett. 96, 28–37 (2015).
Mood, AM: The distribution theory of runs. Ann. Math. Statist. 11, 367–392 (1940).
Mytalas, GC, Zazanis, MA: Central limit theorem approximations for the number of runs in Markov-dependent binary sequences. J. Statist. Plann. Infer. 143, 321–333 (2013).
Mytalas, GC, Zazanis, MA: Central limit theorem approximations for the number of runs in Markov-dependent multi-type sequences. Commun. Statist. Theor. Meth. 43, 1340–1350 (2014).
Nuel, G, Regad, L, Martin, J, Camproux, A-C: Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data. Algorithm Mol. Biol. 5, 1–18 (2010).
Riordan, AM: An Introduction to Combinatorial Analysis. Second Ed. John Wiley, New York (1964).
Sinha, K, Sinha, BP: On the distribution of runs of ones in binary trials. Comput. Math. Appl. 58, 1816–1829 (2009).
Tabatabaei, SAH, Zivic, N: A review of approximate message authentication codes. In: Zivic, N (ed.) Robust Image Authentication in the Presence of Noise, pp. 106–127. Springer International Publishing AG, Cham (ZG), Switzerland (2015).

Acknowledgements

The authors wish to thank the Editor for the thorough reading, and the anonymous reviewers for useful comments and suggestions which improved the article.

Author information

Authors and Affiliations

Department of Mathematics, University of Patras, Patras, 26500, Greece
Anastasios N. Arapis & Frosso S. Makri
Department of Physics, University of Patras, Patras, 26500, Greece
Zaharias M. Psillakis

Contributions

The authors ANA, FSM and ZMP carried out this work in consultation with each other and drafted the manuscript together. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Frosso S. Makri.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Arapis, A., Makri, F. & Psillakis, Z. Joint distribution of k-tuple statistics in zero-one sequences of Markov-dependent trials. J Stat Distrib App 4, 26 (2017). https://doi.org/10.1186/s40488-017-0080-5

Received: 29 March 2017
Accepted: 18 October 2017
Published: 15 November 2017
DOI: https://doi.org/10.1186/s40488-017-0080-5
