Computational Mechanics and Epsilon Machines
The Question
You’re observing a stochastic process—a sequence of symbols $\ldots X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots$ drawn from some alphabet. You want to predict the future given the past.
What’s the minimal amount of information about the past that you need to retain?
Not “all the information”—that’s typically infinite for a long history. And not “nothing”—then you couldn’t predict. There’s something in between: the minimal sufficient statistic for prediction.
Computational mechanics, developed by Crutchfield and collaborators, makes this precise. The answer is the epsilon machine—a canonical, minimal, optimal predictor for any stationary stochastic process.
The Setup
Let $\overleftarrow{X} = \ldots X_{-2}, X_{-1}, X_0$ be the past (a semi-infinite sequence). Let $\overrightarrow{X} = X_1, X_2, X_3, \ldots$ be the future.
We want to find a function $\epsilon: \text{pasts} \to \text{states}$ such that:
- Sufficiency: Knowing $\epsilon(\overleftarrow{X})$ is as good as knowing $\overleftarrow{X}$ for predicting $\overrightarrow{X}$.
- Minimality: $\epsilon$ has the smallest range (fewest states) among all sufficient statistics.
Causal States
The key construction is the causal equivalence relation on pasts:
\[\overleftarrow{x} \sim_\epsilon \overleftarrow{x}' \iff P(\overrightarrow{X} | \overleftarrow{X} = \overleftarrow{x}) = P(\overrightarrow{X} | \overleftarrow{X} = \overleftarrow{x}')\]Two pasts are equivalent if and only if they give the same conditional distribution over futures.
The equivalence classes are called causal states. The set of all causal states is denoted $\mathcal{S}$.
Theorem: The causal states form a sufficient statistic for prediction, and among all sufficient statistics, they have minimal entropy.
This is the epsilon machine: the causal states plus the transition structure between them.
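To make the construction concrete, here is a naive sketch of estimating causal states from data: group finite pasts whose empirical next-symbol distributions agree. This is a deliberate simplification (true causal states are defined over whole futures, and real algorithms such as CSSR use statistical tests rather than a fixed tolerance); the function name and parameters are illustrative.

```python
from collections import Counter, defaultdict

def estimate_causal_states(sequence, past_len=3, tol=0.05):
    # Empirical next-symbol counts conditioned on each length-`past_len` past.
    futures = defaultdict(Counter)
    for i in range(past_len, len(sequence)):
        past = tuple(sequence[i - past_len:i])
        futures[past][sequence[i]] += 1

    # Normalize to conditional distributions P(next symbol | past).
    dists = {p: {x: c / sum(cnt.values()) for x, c in cnt.items()}
             for p, cnt in futures.items()}

    # Greedily merge pasts whose conditional distributions agree within `tol`.
    states = []   # each "state" is a list of (approximately) equivalent pasts
    for past, dist in dists.items():
        for state in states:
            ref = dists[state[0]]
            if all(abs(ref.get(x, 0) - dist.get(x, 0)) < tol
                   for x in set(ref) | set(dist)):
                state.append(past)
                break
        else:
            states.append([past])
    return states
```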
Why “Epsilon Machine”?
The name comes from the equivalence relation $\sim_\epsilon$. The “machine” part is because the causal states, together with the transitions induced by observing new symbols, form a (possibly infinite) hidden Markov model.
Given the current causal state $s$ and an observed symbol $x$, there is a deterministic transition to a new causal state $s' = T(s, x)$, and the symbol itself occurs with probability $P(x | s)$.
The epsilon machine is:
- Unifilar: Given the current state and the next symbol, the next state is determined
- Minimal: No other unifilar HMM with the same output process has fewer states
- Canonical: It’s uniquely determined by the process (up to state relabeling)
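As a data structure, a unifilar machine is just a per-state table mapping each symbol to its probability and its successor state. A minimal sketch (class and method names are mine, not a standard API):

```python
import random

class EpsilonMachine:
    """A unifilar HMM: transitions[s][x] = (P(x | s), T(s, x))."""

    def __init__(self, transitions):
        self.transitions = transitions

    def step(self, state):
        # Sample one symbol from P(. | state), then follow its deterministic transition.
        symbols = list(self.transitions[state])
        weights = [self.transitions[state][x][0] for x in symbols]
        x = random.choices(symbols, weights=weights, k=1)[0]
        return x, self.transitions[state][x][1]

    def generate(self, state, n):
        # Emit n symbols starting from `state`.
        out = []
        for _ in range(n):
            x, state = self.step(state)
            out.append(x)
        return out
```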
Statistical Complexity
The statistical complexity $C_\mu$ is the entropy of the causal state distribution:
\[C_\mu = H[\mathcal{S}] = -\sum_{s \in \mathcal{S}} P(s) \log P(s)\]This measures how much memory the process has—how much information about the past is relevant for predicting the future.
Key properties:
- $C_\mu = 0$ iff the process is iid (no memory needed)
- $C_\mu$ can be finite even for processes with infinite-range correlations
- $C_\mu$ is bounded below by the excess entropy $E = I(\overleftarrow{X}; \overrightarrow{X})$
The excess entropy $E$ measures total correlation between past and future. Statistical complexity $C_\mu$ measures how much you need to store. In general $C_\mu \geq E$, with equality exactly for non-cryptic processes (periodic processes, for example).
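Given the state-to-state transition matrix of the epsilon machine, $C_\mu$ is just the Shannon entropy of its stationary distribution. A small sketch (assuming an ergodic state chain; the function name is mine):

```python
import numpy as np

def statistical_complexity(T, iters=1000):
    # T[i, j] = P(next state j | current state i), with symbols marginalized out.
    # Power iteration finds the stationary distribution pi = pi T;
    # C_mu is its Shannon entropy in bits.
    pi = np.ones(T.shape[0]) / T.shape[0]
    for _ in range(iters):
        pi = pi @ T
    p = pi[pi > 0]
    return float(-(p * np.log2(p)).sum())
```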
Example: The Golden Mean Process
Consider a process over $\{0, 1\}$ where $11$ is forbidden—you can never have two 1’s in a row.
The causal states are:
- State A: Last symbol was 0 (or we’re at the start)
- State B: Last symbol was 1
- From A: can emit 0 (stay in A) or 1 (go to B)
- From B: must emit 0 (go to A)
This is a two-state epsilon machine. With the two choices in state A taken to be equally likely, the stationary state distribution is $(p_A, p_B) = (2/3, 1/3)$, and the statistical complexity is $C_\mu = H(p_A, p_B)$.
So $C_\mu = H(2/3, 1/3) \approx 0.918$ bits.
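Using the sketches above, the Golden Mean machine can be written down directly. (The fair 1/2–1/2 choice in state A is an assumption; the forbidden-11 constraint only fixes the structure, not the probabilities.)

```python
golden_mean = EpsilonMachine({
    "A": {"0": (0.5, "A"), "1": (0.5, "B")},   # from A: 0 stays, 1 goes to B
    "B": {"0": (1.0, "A")},                    # from B: forced 0, back to A
})

# State-to-state transition matrix (rows A, B; symbols marginalized out).
T_gm = np.array([[0.5, 0.5],
                 [1.0, 0.0]])
print(statistical_complexity(T_gm))   # ≈ 0.918 bits = H(2/3, 1/3)
```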
Example: The Even Process
Consider a process over $\{0, 1\}$ where 1’s come in pairs—every run of 1’s has even length.
The causal states:
- State A: We’re in a context where the next 1 (if any) starts a new pair
- State B: We’ve seen an odd number of 1’s; the next symbol must be 1
This is also a two-state machine, but with a different structure from the Golden Mean Process: here state B forces a 1 rather than a 0.
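A quick check with the same machinery (the probabilities in state A are again assumed to be 1/2 each): simulate the Even Process and verify that every completed run of 1’s has even length.

```python
even = EpsilonMachine({
    "A": {"0": (0.5, "A"), "1": (0.5, "B")},   # a 1 emitted from A opens a pair
    "B": {"1": (1.0, "A")},                    # from B the next symbol must be 1
})

seq = "".join(even.generate("A", 10_000))
# Drop the trailing run, which generation may have cut off mid-pair.
completed_runs = [len(r) for r in seq.split("0")[:-1] if r]
assert all(n % 2 == 0 for n in completed_runs)
```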
Crypticity: When the Past Doesn’t Look Like the Future
An interesting phenomenon: for some processes, the causal state stores more information than the future ever reveals. Optimal prediction requires keeping $C_\mu$ bits, but only $E$ of those bits ever show up as past–future correlation.
The crypticity $\chi = C_\mu - E$ measures this overhead. A process is cryptic if the past determines features of the causal state that the future never discloses.
A related but distinct asymmetry appears when you compare $C_\mu$ with the “reverse” statistical complexity, computed by retrodicting the past from the future: the forward and reverse machines of the same process can have different sizes.
The Epsilon Machine Is Optimal
Theorem: Among all models whose state is computed from the observed past and is sufficient for optimal prediction (unifilar HMMs, predictive state representations, and so on), the epsilon machine minimizes the stored information.
More precisely:
- Any sufficient statistic for prediction has entropy $\geq C_\mu$
- The epsilon machine achieves this bound
- It’s the unique (up to isomorphism) minimal sufficient statistic
This makes computational mechanics “canonical” in a way that other modeling approaches aren’t.
Connections to Information Theory
The Information Bottleneck
The epsilon machine is related to the information bottleneck method: find a representation $T$ of $X$ that maximizes $I(T; Y)$ (information about target $Y$) while minimizing $I(T; X)$ (complexity of representation).
For prediction, $X$ = past and $Y$ = future; in the limit where the bottleneck must retain all of the predictive information, the optimal representation is the causal-state partition.
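In Lagrangian form (the trade-off parameter $\beta$ is the standard bottleneck notation, not something defined above), the objective is:
\[\min_{P(t | x)} \; I(T; X) - \beta \, I(T; Y)\]As $\beta \to \infty$, so that no predictive information may be discarded, the optimal $T$ for $X$ = past and $Y$ = future coarsens the histories into exactly the causal states.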
Excess Entropy and Mutual Information
The excess entropy is:
\[E = I(\overleftarrow{X}; \overrightarrow{X}) = H[\overrightarrow{X}] - H[\overrightarrow{X} | \overleftarrow{X}]\]This measures how much the past tells you about the future. It’s a lower bound on statistical complexity:
\[E \leq C_\mu\]The gap $C_\mu - E$ is called the crypticity or gauge information.
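One crude way to estimate $E$ from data: fit the large-$L$ behavior $H(L) \approx E + h_\mu L$ of the block entropies and read off the intercept. The sketch below is exactly that and nothing more—finite data and small $L$ make it a rough estimate, not a robust estimator.

```python
from collections import Counter
import numpy as np

def block_entropy(seq, L):
    # Empirical Shannon entropy (bits) of length-L blocks.
    counts = Counter(tuple(seq[i:i + L]) for i in range(len(seq) - L + 1))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def excess_entropy_estimate(seq, L=10):
    H_L, H_prev = block_entropy(seq, L), block_entropy(seq, L - 1)
    h_est = H_L - H_prev          # entropy-rate estimate at block length L
    return H_L - L * h_est        # intercept of H(L) ≈ E + h_mu * L
```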
Entropy Rate
The entropy rate is:
\[h_\mu = \lim_{n \to \infty} \frac{1}{n} H[X_1, \ldots, X_n] = \lim_{n \to \infty} H[X_n | X_1, \ldots, X_{n-1}]\]This measures the intrinsic randomness of the process—how unpredictable it is even with perfect knowledge of the causal state.
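For a unifilar machine this has a closed form: $h_\mu = \sum_s \pi(s) \, H[X | S = s]$, the stationary average of the per-state emission entropies. A sketch, continuing the EpsilonMachine example above:

```python
import numpy as np

def entropy_rate(machine, pi):
    # pi: stationary distribution over causal states, e.g. {"A": 2/3, "B": 1/3}.
    h = 0.0
    for state, p_state in pi.items():
        probs = [p for (p, _next) in machine.transitions[state].values()]
        h += p_state * -sum(p * np.log2(p) for p in probs if p > 0)
    return float(h)

# Golden Mean: h_mu = (2/3) * 1 + (1/3) * 0 = 2/3 bit per symbol.
print(entropy_rate(golden_mean, {"A": 2/3, "B": 1/3}))
```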
Beyond Epsilon Machines: Mixed States and Upsilon Machines
For some purposes, we want to allow probabilistic state representations. Given a new observation, we don’t deterministically transition to a new state—we update a probability distribution over states.
The upsilon machine or mixed-state presentation generalizes this: you track a distribution $P(s | \overleftarrow{x})$ over causal states and update it with Bayes’ rule after each observation (a sketch follows the list below).
This is relevant for:
- Filtering: When observations are noisy
- Partially observed processes: When you see a function of the underlying process
- Transient behavior: Before the system has “synchronized” to a causal state
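A sketch of the Bayes update, continuing the EpsilonMachine example above. The synchronization point in the last bullet is visible directly: for the Golden Mean machine, a single observed 1 pins the state down exactly.

```python
def update_belief(machine, belief, x):
    # One Bayes step: reweight each state by P(x | s), push the mass through
    # the deterministic transition T(s, x), then renormalize.
    new_belief = {}
    for s, p_s in belief.items():
        if x in machine.transitions[s]:
            p_x, s_next = machine.transitions[s][x]
            new_belief[s_next] = new_belief.get(s_next, 0.0) + p_s * p_x
    total = sum(new_belief.values())   # = P(x | belief); assumed nonzero here
    return {s: p / total for s, p in new_belief.items()}

# Starting ignorant of the state of the Golden Mean machine, one observed 1
# synchronizes the belief exactly: only state B is consistent with it.
print(update_belief(golden_mean, {"A": 0.5, "B": 0.5}, "1"))   # {'B': 1.0}
```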
What Epsilon Machines Measure
Statistical complexity $C_\mu$ captures something different from entropy rate $h_\mu$:
- $h_\mu$: How random is the process? (unpredictability)
- $C_\mu$: How complex is the process? (structure/memory)
A process can have high entropy but low complexity (an iid fair coin: $h_\mu = 1$ bit, $C_\mu = 0$).
A process can have low entropy but high complexity (a deterministic but intricate sequence, e.g. one with period $p$: $h_\mu = 0$, while $C_\mu = \log_2 p$ can be large).
The epsilon machine separates randomness from structure.
Applications
- Time series analysis: Infer the epsilon machine from data; $C_\mu$ measures “how much structure” is present
- Physics: Statistical complexity of spin chains, dynamical systems, cellular automata
- Neuroscience: Complexity of neural spike trains
- Linguistics: Structure of language viewed as a stochastic process
- Machine learning: Connections to recurrent neural networks and state-space models
Summary
| Concept | What It Measures |
|---|---|
| Causal states | Equivalence classes of pasts with same predictive distribution |
| Epsilon machine | Minimal unifilar HMM generating the process |
| Statistical complexity $C_\mu$ | Memory needed for optimal prediction |
| Excess entropy $E$ | Total past-future mutual information |
| Entropy rate $h_\mu$ | Intrinsic randomness |
| Crypticity | Gap between $C_\mu$ and $E$ |
The philosophy: Every stochastic process has a canonical “machine” that generates it—the epsilon machine. Its size measures intrinsic complexity. This gives a principled, information-theoretic notion of structure that separates randomness from computation.
See also: [Information Geometry and Fisher Metric], [Bayesian Entropy].