Markov Chain

Essential properties of Markov Chain

In the setup of MCMC algorithms, we construct Markov chains from a transition kernel
KK
, a conditional probability density such that
Xn+1K(Xn,Xn+1)X_{n+1}\sim K(X_n,X_{n+1})
.
The chain encountered in MCMC settings enjoy a very strong stability property, namely a stationary probability distribution; that is, a distribution
π\pi
such that if
XnπX_n\sim\pi
, then
Xn+1πX_{n+1}\sim \pi
, if the kernel
KK
allows for free moves all over the state space. This freedom is called irreducibility in the theory of Markov chains and is formalized as the existence of
nNn\in\mathbb{N}
such that
P(XnAX0)>0P(X_n\in A\mid X_0)>0
for every
AA
such that
π(A)>0\pi(A)>0
. This property also ensures that most of the chains involved in MCMC algorithms are recurrent (that is, that the average number of visits to an arbitrary set
AA
is infinite), or even Harris recurrent (that is, such that the probability of an infinite number of returns to
AA
is 1).
Harris recurrence ensures that the chain has the same limiting behavior for every starting value instead of almost every starting value.
The stationary distribution is also a limiting distribution in the sense that the limiting distribution of
Xn+1X_{n+1}
is
π\pi
under the total variation norm, notwithstanding the initial value of
X0X_0
.
Strong forms of convergence are also encountered in MCMC settings, like geometric and uniform convergences.
If the marginals are proper, for convergence we only need our chain to be aperiodic. A sufficient condition is that
K(xn,)>0K(x_n,\cdot)>0
(or, equivalently,
f(xn)>0f(\cdot\mid x_n)>0
) in a neighborhood of
xnx_n
.
If the marginal are not proper, or if they do not exist, then the chain is not positive recurrent. It is either null recurrent, and both cases are bad.
The detailed balance condition is not necessary for
ff
to be a stationary measure associated with the transition kernel
KK
, but it provides a sufficient condition that is often easy to check and that can be used for most MCMC algorithms.
Ergodicity: independence of initial conditions
  • geometrically h-ergodic: decreasing at least at a geometric speed.
  • uniform ergodicity: stronger than geometric ergodicity in the sense that the rate of geometric convergence must be uniform over the whole space.
Irreducibility + Aperiodic = Ergodicity ?

Basic Notions

Bernoulli-Laplace Model

The finite chain is indeed irreducible since it is possible to connect the status
xx
and
yy
in
xy\vert x-y\vert
steps with probability
i=xyxy1(MiM)2.\prod_{i=x\land y}^{x\vee y-1}\Big(\frac{M-i}{M}\Big)^2\,.
The Bernoulli-Laplace chain is aperiodic and even strongly aperiodic since the diagonal terms satisfy
Pxx>0P_{xx}>0
for every
x{0,,K}x\in \{0,\ldots,K\}
.
Given the quasi-diagonal shape of the transition matrix, it is possible to directly determine the invariant distribution,
π=(π0,,πK)\pi=(\pi_0,\ldots,\pi_K)
. From the equation
πP=π\pi P = \pi
,
π0=P00π0+P10π1π1=P01π1+P11π1+P21π2=πK=P(K1)KπK1+PKKπK.\begin{aligned} \pi_0 &= P_{00}\pi_0 + P_{10}\pi_1\\ \pi_1 &= P_{01}\pi_1 + P_{11}\pi_1 + P_{21}\pi_2\\ \cdots &=\cdots\\ \pi_K &= P_{(K-1)K}\pi_{K-1} + P_{KK}\pi_K\,. \end{aligned}
Thus,
πk=(Kk)2π0,k=0,,K,\pi_k=\binom{K}{k}^2\pi_0\,,\qquad k=0,\ldots,K\,,
and through normalization,
πk=(Kk)2(2KK),\pi_k=\frac{\binom{K}{k}^2}{\binom{2K}{K}}\,,
(m+nr)=k=0r(mk)(nrk)\binom{m+n}{r}=\sum_{k=0}^r\binom{m}{k}\binom{n}{r-k}
with
m=n=r=Km=n=r=K
. It turns out that the hypergeometric distribution
H(2K,K,1/2)H(2K,K,1/2)
is the invariant distribution for the Bernoulli-Laplace model.

AR(1) Models

A simple illustration of Markov chains on continuous state-space.
Xn=θXn1+εn  ,θI ⁣R,X_n = \theta X_{n-1}+\varepsilon_n\;,\theta\in \mathrm{I\!R}\,,
with
εnN(0,σ2)\varepsilon_n\in N(0,\sigma^2)
, and if the
εn\varepsilon_n
's are independent,
XnX_n
is indeed independent from
Xn2,Xn3,X_{n-2},X_{n-3},\ldots
conditionally on
Xn1X_{n-1}
.
  • The Markovian properties of an AR(q) can be derived from
    (Xn,,Xnq+1)(X_n,\ldots,X_{n-q+1})
    .
  • ARMA(p, q) doesn't fit in the Markovian framework.
Since
Xnxn1N(θxn1,σ2)X_n\mid x_{n-1}\sim N(\theta x_{n-1},\sigma^2)
, consider the lower bound of the transition kernel (
θ>0\theta > 0
):
K(xn1,xn)=12πexp{12σ2(xnθxn1)2}12πσexp{12σ2max{(xnθw)2,(xnθwˉ)2}}12πσexp{12σ2[max{2θxnw,2θxnwˉ}+xn2+θ2w2wˉ2]}={12πσexp{12σ2[2θxnw+xn2+θ2w2wˉ2]}if xn>012πσexp{12σ2[2θxnwˉ+xn2+θ2w2wˉ2]}if xn<0,\begin{aligned} K(x_{n-1},x_n) &= \frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{1}{2\sigma^2}(x_n-\theta x_{n-1})^2\Big\}\\ &\ge \frac{1}{\sqrt{2\pi}\sigma}\exp\Big\{-\frac{1}{2\sigma^2}\max\{(x_n-\theta \underline w)^2, (x_n-\theta \bar w)^2\}\Big\}\\ &\ge \frac{1}{\sqrt{2\pi}\sigma}\exp\Big\{-\frac{1}{2\sigma^2}\Big[ \max\{-2\theta x_n\underline w,-2\theta x_n\bar w\}+x_n^2 + \theta^2\underline w^2\land \bar w^2 \Big]\Big\}\\ &=\begin{cases} \frac{1}{\sqrt{2\pi}\sigma}\exp\Big\{-\frac{1}{2\sigma^2}\Big[ -2\theta x_n\underline w+x_n^2 + \theta^2\underline w^2\land \bar w^2 \Big]\Big\}& \text{if }x_n>0\\ \frac{1}{\sqrt{2\pi}\sigma}\exp\Big\{-\frac{1}{2\sigma^2}\Big[ -2\theta x_n\bar w+x_n^2 + \theta^2\underline w^2\land \bar w^2 \Big]\Big\}& \text{if }x_n<0 \end{cases}\,, \end{aligned}
when
xn1[w,wˉ]x_{n-1}\in[\underline w, \bar w]
. The set
C=[w,wˉ]C = [\underline w, \bar w]
is a small set, as the measure
ν1\nu_1
with density
exp{(x2+2θxw)/2σ2}Ix>0+exp{(x2+2θxwˉ)/2σ2}Ix<02πσ{[1Φ(θw/σ2)]exp(θ2w2/2σ2)+Φ(θwˉ/σ)exp(θ2wˉ2/2σ2)},\frac{\exp\{(-x^2+2\theta x\underline w)/2\sigma^2\}I_{x>0} + \exp\{(-x^2+2\theta x\bar w)/2\sigma^2\}I_{x<0} }{\sqrt{2\pi}\sigma\{[1-\Phi(-\theta\underline w/\sigma^2)]\exp(\theta^2\underline w^2/2\sigma^2)+\Phi(-\theta\bar w/\sigma)\exp(\theta^2\bar w^2/2\sigma^2)\}}\,,
satisfy
K1(x,A)ν1(A),xC,AB(X).K^1(x,A)\ge \nu_1(A),\qquad \forall x\in C, \forall A\in {\cal B(X)}\,.
Given that the transition kernel corresponds to the
N(θxn1,σ2)N(\theta x_{n-1},\sigma^2)
distribution, a normal distribution
N(μ,τ2)N(\mu,\tau^2)
is stationary for the AR(1) chain only if
μ=θμandτ2=τ2θ2+σ2.\mu=\theta\mu\qquad\text{and}\qquad \tau^2=\tau^2\theta^2+\sigma^2\,.
These conditions imply that
μ=0\mu=0
and that
τ2=σ2/(1θ2)\tau^2=\sigma^2/(1-\theta^2)
, which can only occur for
θ<1\vert \theta\vert < 1
. In this case,
N(0,σ2/(1θ2))N(0,\sigma^2/(1-\theta^2))
is indeed the unique stationary distribution of the AR(1) chain.

Branching process

  • If
    ϕ\phi
    doesn't have a constant term, i.e.,
    P(X1=0)=0P(X_1=0)=0
    , then chain
    StS_t
    is necessarily transient since it is increasing.
  • If
    P(X1=0)>0P(X_1=0)>0
    , the probability of a return to 0 at time
    tt
    is
    ρ(t)=P(St=0)=gt(0)\rho(t)=P(S_t=0)=g_t(0)
    , which thus satisfies the recurrence equation
    ρt=ϕ(ρt1)\rho_t=\phi(\rho_{t-1})
    . There exists a limit
    ρ\rho
    different from 1, such that
    ρ=ϕ(ρ)\rho=\phi(\rho)
    , iff
    ϕ(1)>1\phi'(1)>1
    ; namely if
    E[X]>1E[X]>1
    . The chain is thus transient when the average number of siblings per individual is larger than 1. If there exists a restarting mechanism in 0,
    St+1St=0ϕS_{t+1}\mid S_t=0\sim\phi
    , it is easily shown that when
    ϕ(1)>1\phi'(1)>1
    , the number of returns to 0 follows a geometric distribution with parameter
    ρ\rho
    .
  • If
    ϕ(1)1\phi'(1)\le 1
    , the chain is recurrent.