Computational Mechanics and Epsilon Machines
The Question
You’re observing a stochastic process—a sequence of symbols $\ldots X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots$ drawn from some alphabet. You want to predict the future given the past.
What’s the minimal amount of information about the past that you need to retain?
Not “all the information”—that’s typically infinite for a long history. And not “nothing”—then you couldn’t predict. There’s something in between: the minimal sufficient statistic for prediction.
Computational mechanics, developed by Crutchfield and collaborators, makes this precise. The answer is the epsilon machine—a canonical, minimal, optimal predictor for any stationary stochastic process.
The Setup
Let $\overleftarrow{X} = \ldots X_{-2}, X_{-1}, X_0$ be the past (a semi-infinite sequence). Let $\overrightarrow{X} = X_1, X_2, X_3, \ldots$ be the future.
We want to find a function $\epsilon: \text{pasts} \to \text{states}$ such that:
- Sufficiency: Knowing $\epsilon(\overleftarrow{X})$ is as good as knowing $\overleftarrow{X}$ for predicting $\overrightarrow{X}$.
- Minimality: $\epsilon$ has the smallest range (fewest states) among all sufficient statistics.
Causal States
The key construction is the causal equivalence relation on pasts:
\[\overleftarrow{x} \sim_\epsilon \overleftarrow{x}' \iff P(\overrightarrow{X} | \overleftarrow{X} = \overleftarrow{x}) = P(\overrightarrow{X} | \overleftarrow{X} = \overleftarrow{x}')\]Two pasts are equivalent if and only if they give the same conditional distribution over futures.
The equivalence classes are called causal states. The set of all causal states is denoted $\mathcal{S}$.
Theorem: The causal states form a sufficient statistic for prediction, and among all sufficient statistics, they have minimal entropy.
This is the epsilon machine: the causal states plus the transition structure between them.
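To make the construction concrete, here is a naive sketch of estimating causal states from data: group finite pasts whose empirical next-symbol distributions agree. This is a deliberate simplification (true causal states are defined over whole futures, and real algorithms such as CSSR use statistical tests rather than a fixed tolerance); the function name and parameters are illustrative.

```python
from collections import Counter, defaultdict

def estimate_causal_states(sequence, past_len=3, tol=0.05):
    # Empirical next-symbol counts conditioned on each length-`past_len` past.
    futures = defaultdict(Counter)
    for i in range(past_len, len(sequence)):
        past = tuple(sequence[i - past_len:i])
        futures[past][sequence[i]] += 1

    # Normalize to conditional distributions P(next symbol | past).
    dists = {p: {x: c / sum(cnt.values()) for x, c in cnt.items()}
             for p, cnt in futures.items()}

    # Greedily merge pasts whose conditional distributions agree within `tol`.
    states = []   # each "state" is a list of (approximately) equivalent pasts
    for past, dist in dists.items():
        for state in states:
            ref = dists[state[0]]
            if all(abs(ref.get(x, 0) - dist.get(x, 0)) < tol
                   for x in set(ref) | set(dist)):
                state.append(past)
                break
        else:
            states.append([past])
    return states
```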
Why “Epsilon Machine”?
The name comes from the equivalence relation $\sim_\epsilon$. The “machine” part is because the causal states, together with the transitions induced by observing new symbols, form a (possibly infinite) hidden Markov model.
Given the current causal state $s$ and an observed symbol $x$, there is a deterministic transition to a new causal state $s' = T(s, x)$, and the symbol itself occurs with probability $P(x | s)$.
The epsilon machine is:
- Unifilar: Given the current state and the next symbol, the next state is determined
- Minimal: No other unifilar HMM with the same output process has fewer states
- Canonical: It’s uniquely determined by the process (up to state relabeling)
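As a data structure, a unifilar machine is just a per-state table mapping each symbol to its probability and its successor state. A minimal sketch (class and method names are mine, not a standard API):

```python
import random

class EpsilonMachine:
    """A unifilar HMM: transitions[s][x] = (P(x | s), T(s, x))."""

    def __init__(self, transitions):
        self.transitions = transitions

    def step(self, state):
        # Sample one symbol from P(. | state), then follow its deterministic transition.
        symbols = list(self.transitions[state])
        weights = [self.transitions[state][x][0] for x in symbols]
        x = random.choices(symbols, weights=weights, k=1)[0]
        return x, self.transitions[state][x][1]

    def generate(self, state, n):
        # Emit n symbols starting from `state`.
        out = []
        for _ in range(n):
            x, state = self.step(state)
            out.append(x)
        return out
```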
Statistical Complexity
The statistical complexity $C_\mu$ is the entropy of the causal state distribution:
\[C_\mu = H[\mathcal{S}] = -\sum_{s \in \mathcal{S}} P(s) \log P(s)\]This measures how much memory the process has—how much information about the past is relevant for predicting the future.
Key properties:
- $C_\mu = 0$ iff the process is iid (no memory needed)
- $C_\mu$ can be finite even for processes with infinite-range correlations
- $C_\mu$ is bounded below by the excess entropy $E = I(\overleftarrow{X}; \overrightarrow{X})$
The excess entropy $E$ measures total correlation between past and future. Statistical complexity $C_\mu$ measures how much you need to store. In general $C_\mu \geq E$, with equality exactly for non-cryptic processes (periodic processes, for example).
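Given the state-to-state transition matrix of the epsilon machine, $C_\mu$ is just the Shannon entropy of its stationary distribution. A small sketch (assuming an ergodic state chain; the function name is mine):

```python
import numpy as np

def statistical_complexity(T, iters=1000):
    # T[i, j] = P(next state j | current state i), with symbols marginalized out.
    # Power iteration finds the stationary distribution pi = pi T;
    # C_mu is its Shannon entropy in bits.
    pi = np.ones(T.shape[0]) / T.shape[0]
    for _ in range(iters):
        pi = pi @ T
    p = pi[pi > 0]
    return float(-(p * np.log2(p)).sum())
```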
Example: The Golden Mean Process
Consider a process over $\{0, 1\}$ where $11$ is forbidden—you can never have two 1’s in a row.
The causal states are:
- State A: Last symbol was 0 (or we’re at the start)
- State B: Last symbol was 1
- From A: can emit 0 (stay in A) or 1 (go to B)
- From B: must emit 0 (go to A)
This is a two-state epsilon machine. With the two choices in state A taken to be equally likely, the stationary state distribution is $(p_A, p_B) = (2/3, 1/3)$, and the statistical complexity is $C_\mu = H(p_A, p_B)$.
So $C_\mu = H(2/3, 1/3) \approx 0.918$ bits.
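Using the sketches above, the Golden Mean machine can be written down directly. (The fair 1/2–1/2 choice in state A is an assumption; the forbidden-11 constraint only fixes the structure, not the probabilities.)

```python
golden_mean = EpsilonMachine({
    "A": {"0": (0.5, "A"), "1": (0.5, "B")},   # from A: 0 stays, 1 goes to B
    "B": {"0": (1.0, "A")},                    # from B: forced 0, back to A
})

# State-to-state transition matrix (rows A, B; symbols marginalized out).
T_gm = np.array([[0.5, 0.5],
                 [1.0, 0.0]])
print(statistical_complexity(T_gm))   # ≈ 0.918 bits = H(2/3, 1/3)
```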
Example: The Even Process
Consider a process over $\{0, 1\}$ where 1’s come in pairs—every run of 1’s has even length.
The causal states:
- State A: We’re in a context where the next 1 (if any) starts a new pair
- State B: We’ve seen an odd number of 1’s; the next symbol must be 1
This is also a two-state machine, but with a different structure from the Golden Mean Process: here state B forces a 1 rather than a 0.
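A quick check with the same machinery (the probabilities in state A are again assumed to be 1/2 each): simulate the Even Process and verify that every completed run of 1’s has even length.

```python
even = EpsilonMachine({
    "A": {"0": (0.5, "A"), "1": (0.5, "B")},   # a 1 emitted from A opens a pair
    "B": {"1": (1.0, "A")},                    # from B the next symbol must be 1
})

seq = "".join(even.generate("A", 10_000))
# Drop the trailing run, which generation may have cut off mid-pair.
completed_runs = [len(r) for r in seq.split("0")[:-1] if r]
assert all(n % 2 == 0 for n in completed_runs)
```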
Crypticity: When the Past Doesn’t Look Like the Future
An interesting phenomenon: for some processes, the causal state stores more information than the future ever reveals. Optimal prediction requires keeping $C_\mu$ bits, but only $E$ of those bits ever show up as past–future correlation.
The crypticity $\chi = C_\mu - E$ measures this overhead. A process is cryptic if the past determines features of the causal state that the future never discloses.
A related but distinct asymmetry appears when you compare $C_\mu$ with the “reverse” statistical complexity, computed by retrodicting the past from the future: the forward and reverse machines of the same process can have different sizes.
The Epsilon Machine Is Optimal
Theorem: Among all models whose state is computed from the observed past and is sufficient for optimal prediction (unifilar HMMs, predictive state representations, and so on), the epsilon machine minimizes the stored information.
More precisely:
- Any sufficient statistic for prediction has entropy $\geq C_\mu$
- The epsilon machine achieves this bound
- It’s the unique (up to isomorphism) minimal sufficient statistic
This makes computational mechanics “canonical” in a way that other modeling approaches aren’t.
Connections to Information Theory
The Information Bottleneck
The epsilon machine is related to the information bottleneck method: find a representation $T$ of $X$ that maximizes $I(T; Y)$ (information about target $Y$) while minimizing $I(T; X)$ (complexity of representation).
For prediction, $X$ = past and $Y$ = future; in the limit where the bottleneck must retain all of the predictive information, the optimal representation is the causal-state partition.
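In Lagrangian form (the trade-off parameter $\beta$ is the standard bottleneck notation, not something defined above), the objective is:
\[\min_{P(t | x)} \; I(T; X) - \beta \, I(T; Y)\]As $\beta \to \infty$, so that no predictive information may be discarded, the optimal $T$ for $X$ = past and $Y$ = future coarsens the histories into exactly the causal states.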
Excess Entropy and Mutual Information
The excess entropy is:
\[E = I(\overleftarrow{X}; \overrightarrow{X}) = H[\overrightarrow{X}] - H[\overrightarrow{X} | \overleftarrow{X}]\]This measures how much the past tells you about the future. It’s a lower bound on statistical complexity:
\[E \leq C_\mu\]The gap $C_\mu - E$ is called the crypticity or gauge information.
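One crude way to estimate $E$ from data: fit the large-$L$ behavior $H(L) \approx E + h_\mu L$ of the block entropies and read off the intercept. The sketch below is exactly that and nothing more—finite data and small $L$ make it a rough estimate, not a robust estimator.

```python
from collections import Counter
import numpy as np

def block_entropy(seq, L):
    # Empirical Shannon entropy (bits) of length-L blocks.
    counts = Counter(tuple(seq[i:i + L]) for i in range(len(seq) - L + 1))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def excess_entropy_estimate(seq, L=10):
    H_L, H_prev = block_entropy(seq, L), block_entropy(seq, L - 1)
    h_est = H_L - H_prev          # entropy-rate estimate at block length L
    return H_L - L * h_est        # intercept of H(L) ≈ E + h_mu * L
```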
Entropy Rate
The entropy rate is:
\[h_\mu = \lim_{n \to \infty} \frac{1}{n} H[X_1, \ldots, X_n] = \lim_{n \to \infty} H[X_n | X_1, \ldots, X_{n-1}]\]This measures the intrinsic randomness of the process—how unpredictable it is even with perfect knowledge of the causal state.
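For a unifilar machine this has a closed form: $h_\mu = \sum_s \pi(s) \, H[X | S = s]$, the stationary average of the per-state emission entropies. A sketch, continuing the EpsilonMachine example above:

```python
import numpy as np

def entropy_rate(machine, pi):
    # pi: stationary distribution over causal states, e.g. {"A": 2/3, "B": 1/3}.
    h = 0.0
    for state, p_state in pi.items():
        probs = [p for (p, _next) in machine.transitions[state].values()]
        h += p_state * -sum(p * np.log2(p) for p in probs if p > 0)
    return float(h)

# Golden Mean: h_mu = (2/3) * 1 + (1/3) * 0 = 2/3 bit per symbol.
print(entropy_rate(golden_mean, {"A": 2/3, "B": 1/3}))
```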
Beyond Epsilon Machines: Mixed States and Upsilon Machines
For some purposes, we want to allow probabilistic state representations. Given a new observation, we don’t deterministically transition to a new state—we update a probability distribution over states.
The upsilon machine or mixed-state presentation generalizes this: you track a distribution $P(s | \overleftarrow{x})$ over causal states and update it with Bayes’ rule after each observation (a sketch follows the list below).
This is relevant for:
- Filtering: When observations are noisy
- Partially observed processes: When you see a function of the underlying process
- Transient behavior: Before the system has “synchronized” to a causal state
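A sketch of the Bayes update, continuing the EpsilonMachine example above. The synchronization point in the last bullet is visible directly: for the Golden Mean machine, a single observed 1 pins the state down exactly.

```python
def update_belief(machine, belief, x):
    # One Bayes step: reweight each state by P(x | s), push the mass through
    # the deterministic transition T(s, x), then renormalize.
    new_belief = {}
    for s, p_s in belief.items():
        if x in machine.transitions[s]:
            p_x, s_next = machine.transitions[s][x]
            new_belief[s_next] = new_belief.get(s_next, 0.0) + p_s * p_x
    total = sum(new_belief.values())   # = P(x | belief); assumed nonzero here
    return {s: p / total for s, p in new_belief.items()}

# Starting ignorant of the state of the Golden Mean machine, one observed 1
# synchronizes the belief exactly: only state B is consistent with it.
print(update_belief(golden_mean, {"A": 0.5, "B": 0.5}, "1"))   # {'B': 1.0}
```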
What Epsilon Machines Measure
Statistical complexity $C_\mu$ captures something different from entropy rate $h_\mu$:
- $h_\mu$: How random is the process? (unpredictability)
- $C_\mu$: How complex is the process? (structure/memory)
A process can have high entropy but low complexity (an iid fair coin: $h_\mu = 1$ bit, $C_\mu = 0$).
A process can have low entropy but high complexity (a deterministic but intricate sequence, e.g. one with period $p$: $h_\mu = 0$, while $C_\mu = \log_2 p$ can be large).
The epsilon machine separates randomness from structure.
Applications
- Time series analysis: Infer the epsilon machine from data; $C_\mu$ measures “how much structure” is present
- Physics: Statistical complexity of spin chains, dynamical systems, cellular automata
- Neuroscience: Complexity of neural spike trains
- Linguistics: Structure of language viewed as a stochastic process
- Machine learning: Connections to recurrent neural networks and state-space models
Summary
| Concept | What It Measures |
|---|---|
| Causal states | Equivalence classes of pasts with same predictive distribution |
| Epsilon machine | Minimal unifilar HMM generating the process |
| Statistical complexity $C_\mu$ | Memory needed for optimal prediction |
| Excess entropy $E$ | Total past-future mutual information |
| Entropy rate $h_\mu$ | Intrinsic randomness |
| Crypticity | Gap between $C_\mu$ and $E$ |
The philosophy: Every stochastic process has a canonical “machine” that generates it—the epsilon machine. Its size measures intrinsic complexity. This gives a principled, information-theoretic notion of structure that separates randomness from computation.
See also: [Information Geometry and Fisher Metric], [Bayesian Entropy].