Fruit of Preterition


Research notes from MATS 9.0, under the mentorship of Richard Ngo. These are Claude-generated documents that I roughly 80% endorse; treat them as research notes rather than polished publications. I’m posting them mainly to give people a sense of what I’m working on. Feedback welcome.


0. Why care?

Can we say what makes data learnable without referring to any particular learner?

Deep learning theory usually treats data as given and asks what happens when you train a model on it. I think there’s a complementary question: what properties of the data determine what any sufficiently expressive learner will find? If we could answer this, we’d have a way to talk about universal features — representations that are properties of the distribution itself, not artifacts of a particular architecture or training procedure.

This matters for alignment in at least two ways. First, distinguishing what’s forced from what’s chosen: if certain representations are universal — forced by data geometry — then any sufficiently capable system will find them, and we should focus interpretability and safety efforts on the contingent parts. Second, the RL question: does RL trajectory data have the same spectral signatures that make supervised learning tractable? If not, that’s a structural explanation for why RL is hard, and it bears on how much to expect from RLHF and RLVR.

A companion piece I’m developing (“Learner Geometry Meets Data Geometry”) handles the learner side — how a model’s geometry couples to data geometry order by order. This piece is about the data side only: what tools do we have for characterizing data structure, and what do they tell us?


1. The spectral picture

Natural data lives on or near low-dimensional manifolds embedded in high-dimensional ambient spaces. This is the manifold hypothesis, and it’s well-supported empirically. But “low-dimensional manifold” is a pretty coarse description. The question is: what finer-grained structure does this manifold have, and what does that structure imply for learning?

The canonical tool for probing manifold geometry is the Laplace-Beltrami operator \(\Delta_\mathcal{M}\). On a compact Riemannian manifold \(\mathcal{M}\), this operator has a discrete spectrum of eigenvalues \(0 = \lambda_0 \leq \lambda_1 \leq \lambda_2 \leq \cdots\) with corresponding eigenfunctions \(\{\varphi_k\}\) forming an orthonormal basis of \(L^2(\mathcal{M})\). The eigenfunctions are the “natural harmonics” of the manifold — the analogue of Fourier modes, but adapted to the geometry.

Weyl’s law and intrinsic dimension

The rate at which eigenvalues grow encodes the intrinsic dimension \(d\) of the manifold:

\[\lambda_k \sim c \cdot k^{2/d} \quad \text{as } k \to \infty\]

This is Weyl’s law (1911), and it’s one of the oldest results in spectral geometry. The key point for us: the Laplacian spectrum encodes the intrinsic dimension, and kernels built from the Laplacian (via functional calculus, as we’ll see below) inherit this dimensional dependence — though the precise decay rate of kernel eigenvalues depends on the kernel’s smoothness as well.
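As a quick check that the slope is actually readable in practice, here's a minimal sketch (my own toy setup, not a real dataset): on a flat 2-torus the Laplacian eigenvalues are \(m^2 + n^2\) in closed form, so fitting \(\log \lambda_k\) against \(\log k\) should recover a slope near \(2/d = 1\).

```python
# Weyl's law on a flat 2-torus, where the eigenvalues are known exactly:
# lambda_{m,n} = m^2 + n^2. The log-log slope of lambda_k vs k estimates 2/d.
import numpy as np

M = 60  # frequency cutoff per axis (truncation of the exact spectrum)
freqs = np.arange(-M, M + 1)
lam = np.sort((freqs[:, None] ** 2 + freqs[None, :] ** 2).ravel()).astype(float)
lam = lam[lam > 0]  # drop the constant mode lambda_0 = 0

k = np.arange(1, len(lam) + 1)
lo, hi = len(lam) // 10, len(lam) // 2  # fit mid-spectrum, away from edge effects
slope, _ = np.polyfit(np.log(k[lo:hi]), np.log(lam[lo:hi]), 1)
d_est = 2.0 / slope
print(f"log-log slope {slope:.3f}  ->  estimated intrinsic dimension {d_est:.2f}")
```

The same fit applied to a graph-Laplacian spectrum of sampled data is the empirical version, with all the usual finite-sample caveats.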

There’s a suggestive empirical pattern here. Across a range of natural datasets — images (MNIST, CIFAR, ImageNet), text, protein families, single-cell RNA — the measured intrinsic dimension is consistent with logarithmic scaling in the ambient dimension:

\[d_{\text{int}} \sim c \log D\]

with \(c\) in the range \(2\)-\(4\) depending on the domain. MNIST (\(D = 784\)) has \(d \approx 7\)-\(13\); ImageNet (\(D \approx 150000\)) has \(d \approx 26\)-\(43\); protein families have \(d \approx 6\)-\(12\) regardless of sequence length. I want to be careful here: this is a handful of data points, each measured with different methods and different notions of “intrinsic dimension,” and many functional forms would fit. But the pattern is at least suggestive of data generated by a compositional/hierarchical process where each level of the hierarchy contributes \(O(1)\) dimensions — which is what you’d expect if, say, images are built from objects built from parts built from edges.

Kernels as spectral filters

Every canonical kernel used in machine learning is basically just functional calculus applied to the Laplacian:

\[K_f(x, y) = \sum_k f(\lambda_k) \varphi_k(x) \varphi_k(y)\]

Different choices of \(f\) probe different aspects of the geometry:

  • Heat kernel \(f(\lambda) = e^{-t\lambda}\): exponential damping of high frequencies. This is the diffusion maps kernel. Smooth, maximum regularization, loses all sharp geometric information.
  • Resolvent \(f(\lambda) = (\lambda + \alpha)^{-s}\): algebraic decay. Connected to Matérn kernels. The resolvent family is equivalent to the heat kernel (via Laplace transform), but a single resolvent at fixed \(\alpha\) is not.
  • Spectral projector \(f(\lambda) = \mathbf{1}_{[0, \Lambda]}\): hard cutoff. Where Weyl’s law lives — counting eigenvalues below \(\Lambda\). No smoothing at all.
  • Wave propagator \(f(\lambda) = e^{it\sqrt{\lambda}}\): all frequencies weighted equally, finite propagation speed. Information-theoretically the richest, but the hardest to work with.

There’s a rough information-loss ordering: wave kernel retains the most geometric information (finite propagation speed gives direct access to geodesic distances), heat kernel smooths everything, resolvent at a single \(\alpha\) retains less still. These are related by integral transforms — heat kernel is a Wick rotation of the wave kernel, resolvent is the Laplace transform of the heat kernel — and each transform loses something. The practical workhorse is the heat kernel (diffusion maps), but that’s a choice, and it throws away structure.
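To make the functional-calculus view concrete, here's a sketch on the cycle graph \(C_n\), a discrete circle (my toy choice; the values of \(t\), \(\alpha\), and \(\Lambda\) are arbitrary). All three kernels come from one eigendecomposition, differing only in the filter \(f\):

```python
# Functional calculus on a graph Laplacian: each kernel above is
# K_f = sum_k f(lambda_k) phi_k phi_k^T for a different spectral filter f.
import numpy as np

n = 64
P = np.roll(np.eye(n), 1, axis=0)            # cycle-shift permutation
L = 2 * np.eye(n) - P - P.T                  # combinatorial Laplacian of C_n
lam, phi = np.linalg.eigh(L)                 # lam ascending, phi orthonormal

def spectral_kernel(f):
    return phi @ np.diag(f(lam)) @ phi.T

K_heat = spectral_kernel(lambda v: np.exp(-0.5 * v))          # heat, t = 0.5
K_res  = spectral_kernel(lambda v: 1.0 / (v + 0.1))           # resolvent, alpha = 0.1
K_proj = spectral_kernel(lambda v: (v <= 1.0).astype(float))  # cutoff at Lambda = 1

# Exact properties, up to float error: heat kernel preserves constants
# (the lambda = 0 eigenfunction), and a spectral projector is idempotent.
print(np.allclose(K_heat.sum(axis=1), 1.0), np.allclose(K_proj @ K_proj, K_proj))
# prints: True True
```

The point of the sketch is that "choose a kernel" really is "choose \(f\)": the geometry (the eigendecomposition) is fixed, and everything else is a filtering decision.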

From compositionality to concrete spectral signatures

In an earlier writeup, I argued that compositionality arises from two opposing pressures: reliability (small input changes shouldn’t wildly change what’s learned) and expressiveness (you must distinguish task-relevant features). Under a power-law spectral model \(\sigma_i \propto i^{-\alpha}\), the minimax rate on a \(d\)-dimensional manifold scales as \(\epsilon(N) \sim N^{-2\alpha/(2\alpha + d)}\) — small \(d\) is what makes the rate favorable. Compositionality and hierarchy are sufficient to generate this kind of spectral structure.
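To see how hard the rate formula punishes large \(d\), a line of arithmetic (with \(\alpha = 1\) as an arbitrary illustrative choice):

```python
# How the rate exponent 2a/(2a + d) collapses as d grows, and what that does
# to the sample count needed for a fixed error. Pure arithmetic on the rate
# formula above; alpha = 1 and eps = 0.01 are arbitrary illustrative choices.
alpha = 1.0
rates = {d: 2 * alpha / (2 * alpha + d) for d in (5, 10, 50, 100)}
for d, r in rates.items():
    n_needed = 0.01 ** (-1.0 / r)  # solve eps = N^{-r} for eps = 0.01
    print(f"d = {d:3d}: exponent {r:.4f}, N for eps = 0.01 ~ {n_needed:.1e}")
```

Low intrinsic dimension isn't a nicety; it's the difference between a feasible sample count and an astronomical one.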

That argument tells us what to look for — spectral decay, low intrinsic dimension — but not what the structure of natural data actually is. For that we need tools that can characterize multi-scale geometry concretely, which is what the next section is about.


2. Diffusion wavelets: a multi-scale language

Why single-scale methods aren’t enough

Diffusion maps give you a global spectral decomposition — eigenfunctions of the Laplacian, weighted by some kernel — but they commit you to a single scale. The kernel bandwidth \(\varepsilon\) and diffusion time \(t\) combine into an effective diffusion scale (\(\sqrt{t\varepsilon}\)), and you pick one. Natural data has structure at multiple scales simultaneously — edges and textures and parts and objects, or phonemes and morphemes and phrases and discourse.

So: wavelets — decompositions localized in both position and scale.

From classical wavelets to diffusion wavelets

Classical wavelet theory has a clean organizing principle: Fourier analysis diagonalizes translation-invariant operators; wavelets almost-diagonalize Calderón-Zygmund operators, which are the natural operators for the affine group (translations + dilations). When the relevant symmetry is scale-translation rather than pure translation, wavelets are the right basis. (I haven’t worked through all the proofs here — flagging that.)

The move to data manifolds replaces dilation with diffusion. In the Coifman-Maggioni construction, you take dyadic powers \(T^{2^j}\) of the diffusion operator \(T\). At each scale \(j\), the columns of the powered operator are orthogonalized and compressed to build nested approximation spaces — a multiresolution analysis (MRA) adapted to the geometry of the data. The eigenvalues of \(T\) decay doubly exponentially under powering (\(\lambda_k^{2^j}\)), which provides natural frequency bands at each scale.
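Here's the engine of that construction in miniature (a sketch on a toy point cloud; the bandwidth, sizes, and rank threshold are arbitrary choices of mine, and I use numerical rank in place of the full orthogonalization step): eigenvalues decay doubly exponentially under dyadic powering, so the range of \(T^{2^j}\) compresses into a smaller basis at each scale.

```python
# Dyadic powers of a diffusion operator lose numerical rank fast
# (eigenvalues go as lambda^{2^j}), which is what makes the nested
# approximation spaces of diffusion wavelets compressible.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))                      # toy point cloud in R^2
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-D2)                                   # Gaussian affinities (PSD)
dinv = 1.0 / np.sqrt(W.sum(axis=1))
T = dinv[:, None] * W * dinv[None, :]             # symmetric diffusion operator

ranks, Tj = [], T
for j in range(1, 6):
    Tj = Tj @ Tj                                  # now T^(2^j), by repeated squaring
    ranks.append(np.linalg.matrix_rank(Tj, tol=1e-6))
print("numerical rank of T^(2^j), j = 1..5:", ranks)
```

The rank sequence is non-increasing by construction (eigenvalues of the symmetric normalization lie in \([0, 1]\)), and the rate at which it collapses is a crude fingerprint of the data's multi-scale structure.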

GMRA (Geometric Multi-Resolution Analysis, Allard-Chen-Maggioni) takes a different but complementary approach: build a dyadic tree partition of the data, do local PCA at each node, and define geometric wavelets as inter-scale residuals — what each finer scale captures that the coarser scale missed.

What wavelet asymptotics tell you

This is the part I’m least confident about but find most promising. There are results from approximation theory (well-established, though I haven’t worked through them all in detail) that speak to why wavelets might be the right tool here:

Universality of scaling exponents. The Hölder regularity, multifractal spectra, and spectral slopes extracted from a wavelet decomposition are mother-wavelet-independent. This is exactly analogous to how critical exponents in statistical physics are RG-scheme-independent — the particular wavelet you use is like the particular blocking scheme, and the universal quantities depend only on the geometry of the affine group. Coefficient values depend on the probe, but the scaling doesn’t. This is good: it means the characterization is intrinsic.

Coefficient decay rates. For functions with Hölder regularity \(s\), wavelet coefficients decay as \(\vert c_{j,k}\vert \lesssim 2^{-j(s + 1/2)}\). More generally, the decay rate determines membership in Besov spaces \(B^s_{p,q}\), which are finer than Sobolev spaces and capture exactly the kind of inhomogeneous regularity that natural signals have. In the GMRA framework specifically, the coefficient decay is controlled by curvature — giving a precise geometric notion of compressibility.
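A self-contained check of the decay rate with Haar wavelets (the simplest mother wavelet; the exponent is supposed to be wavelet-independent, which is part of the point). For \(f(x) = \vert x - 1/3\vert^{1/2}\), which is Hölder \(s = 1/2\) at the singular point and smooth elsewhere, the largest coefficient per scale should follow \(2^{-j(s+1/2)}\):

```python
# Reading a Hölder exponent off Haar coefficient decay. The max coefficient
# grows by 2^(s + 1/2) per step toward coarser scales, so the log2-slope
# over the finest levels recovers s. Toy function of my own choosing.
import numpy as np

def haar_details(c):
    """Full orthonormal Haar transform; detail arrays, finest scale first."""
    c = c.astype(float)
    out = []
    while len(c) > 1:
        out.append((c[0::2] - c[1::2]) / np.sqrt(2))  # detail coefficients
        c = (c[0::2] + c[1::2]) / np.sqrt(2)          # running approximation
    return out

N = 2 ** 14
x = (np.arange(N) + 0.5) / N
f = np.abs(x - 1 / 3) ** 0.5        # Hölder s = 1/2 at x = 1/3

maxima = np.array([np.abs(d).max() for d in haar_details(f)])
slope, _ = np.polyfit(np.arange(8), np.log2(maxima[:8]), 1)
s_est = slope - 0.5
print(f"estimated Hölder exponent: {s_est:.2f}")
```

The singularity dominates the per-scale maxima at fine scales, which is why the fit uses only the finest levels.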

Nonlinear approximation optimality. Keeping the \(n\) largest wavelet coefficients (regardless of scale) gives near-optimal \(n\)-term approximation for functions in the relevant Besov spaces. This is fundamentally better than linear approximation (keeping the first \(n\) coefficients in any fixed ordering) for functions with spatially inhomogeneous regularity — which is exactly what natural signals are. An image has smooth regions and edges; a signal has quiet periods and bursts. Wavelets adapt to this.
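Here's a sketch of the linear-vs-nonlinear comparison with Haar wavelets (my own toy signal: a sine plus one jump, standing in for smooth regions and edges). Keeping the \(n\) largest coefficients beats keeping the \(n\) coarsest:

```python
# n-term approximation in an orthonormal Haar basis: "first n in a fixed
# coarse-to-fine ordering" (linear) vs "n largest anywhere" (nonlinear).
import numpy as np

def haar_fwd(signal):
    """Orthonormal Haar transform: scaling coefficient + details, finest first."""
    c = signal.astype(float)
    details = []
    while len(c) > 1:
        details.append((c[0::2] - c[1::2]) / np.sqrt(2))
        c = (c[0::2] + c[1::2]) / np.sqrt(2)
    return c, details

def haar_inv(a, details):
    c = a.copy()
    for d in reversed(details):                 # coarsest detail first
        nxt = np.empty(2 * len(c))
        nxt[0::2] = (c + d) / np.sqrt(2)
        nxt[1::2] = (c - d) / np.sqrt(2)
        c = nxt
    return c

N = 2 ** 12
x = (np.arange(N) + 0.5) / N
f = np.sin(2 * np.pi * x) + (x > 0.37)          # smooth piece plus one edge

a0, details = haar_fwd(f)
coeffs = np.concatenate([a0] + details[::-1])   # flattened coarse-to-fine

def reconstruct(flat):
    sizes = [len(d) for d in details[::-1]]     # 1, 2, 4, ..., N/2
    ds, i = [], 1
    for sz in sizes:
        ds.append(flat[i:i + sz]); i += sz
    return haar_inv(flat[:1], ds[::-1])         # back to finest-first order

n = 64
linear = coeffs.copy(); linear[n:] = 0          # first n in the fixed ordering
thresh = np.sort(np.abs(coeffs))[-n]
nonlin = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)

err_lin = np.linalg.norm(f - reconstruct(linear)) / np.sqrt(N)
err_nl = np.linalg.norm(f - reconstruct(nonlin)) / np.sqrt(N)
print(f"n = {n}: linear error {err_lin:.4f}, nonlinear error {err_nl:.4f}")
```

For an orthonormal basis the largest-\(n\) rule is optimal among all \(n\)-term subsets, so the nonlinear error is never worse; the edge is what makes it strictly better here, since the jump's energy sits in fine-scale coefficients that any fixed coarse-first ordering throws away.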

Wavelet sparsity as a naturalness diagnostic. Natural images have super-Gaussian (high kurtosis) wavelet coefficients — most coefficients are small, a few are large. Relatedly, ICA, sparse coding, and predictive coding all converge to approximately Gabor-like filters on natural images. This convergence is at the level of local oriented edge detectors; it’s less clear that it extends to higher-level features where architecture and training details matter more. But even at the first layer, the fact that the data appears to prefer a particular basis is worth taking seriously.
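A minimal version of the sparsity diagnostic on synthetic 1-D signals (sparse jumps in noise as a stand-in for natural edge structure; all parameters are arbitrary choices of mine). Finest-scale Haar coefficients of the edge-y signal are heavy-tailed; those of white noise are not:

```python
# Kurtosis of finest-scale Haar detail coefficients: a few large edge
# coefficients among many small ones pushes kurtosis far above the
# Gaussian baseline of 3, while white noise stays near 3.
import numpy as np

def kurt(v):
    v = v - v.mean()
    return (v ** 4).mean() / ((v ** 2).mean() ** 2)  # Gaussian baseline: 3

rng = np.random.default_rng(1)
N = 2 ** 12
sig = 0.01 * rng.normal(size=N)                      # small background noise
for t in rng.integers(0, N // 2, size=8) * 2 + 1:    # jumps at odd indices,
    sig[t:] += rng.normal()                          # so each straddles a pair

d_sig = (sig[0::2] - sig[1::2]) / np.sqrt(2)         # finest Haar details
gauss = rng.normal(size=N)
d_gauss = (gauss[0::2] - gauss[1::2]) / np.sqrt(2)

k_sig, k_noise = kurt(d_sig), kurt(d_gauss)
print(f"kurtosis: edge-y signal {k_sig:.1f}, white noise {k_noise:.1f}")
```

The gap is the diagnostic: super-Gaussian coefficient statistics are evidence that the basis is matched to the signal's sparse structure.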

To be clear about status: diffusion wavelets are, for me right now, a theoretical perspective — richer than the default single-scale diffusion framework, with approximation-theoretic teeth — but I’m not sure they’re computationally tractable at scale, and I’m not claiming they’re the uniquely correct tool.

The random hierarchy model as a test case

The Random Hierarchy Model (RHM, Cagnetta et al.) is the cleanest toy model of hierarchical data. Data factorizes as:

\[P(x) = P(x_{\text{coarse}}) \prod_\ell P(x_{\text{fine}, \ell} \mid x_{\text{coarse}, \ell})\]

Each level of the hierarchy acts as a noisy channel, and correlation functions (Fourier coefficients) decay exponentially in the depth \(\ell\) — the amplitude at scale \(\ell\) goes as \(\sim m^{-\ell}\), where \(m\) is the number of synonymous productions per symbol. This gives a clean Walsh-Hadamard spectral picture: the hierarchy creates a tower of conditional independences across scales.

In the RHM, sample complexity has a three-level hierarchy:

  • Information-theoretic lower bound: \(P \gtrsim m^{2\ell}\)
  • Kernel methods: \(P \sim m^{s^\ell}\) (doubly exponential — kernels treat \(s^\ell\)-grams as unstructured)
  • Deep networks: \(P \sim \ell \cdot m^s\) (linear in depth \(\ell\), exponential only in patch size \(s\) — they exploit the compositional structure)

The gap between kernel methods and deep networks is a direct consequence of hierarchy: kernels can’t exploit the conditional independence structure across scales, while deep networks can.
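For experimenting with these claims, here's a minimal RHM-style generator (a hedged sketch: the construction is illustrative, not Cagnetta et al.'s exact setup; productions are made globally distinct so every choice path yields a distinct string, giving \(m^{\#\text{internal nodes}}\) strings):

```python
# Each symbol expands into s children via one of m synonymous productions.
# With globally distinct productions, the number of distinct strings equals
# m raised to the number of internal nodes of the expansion tree.
import itertools

def make_grammar(levels, s=2, m=2):
    counter = itertools.count(1)          # symbol 0 is reserved for the root
    grammar = {}                          # symbol -> list of m productions (s-tuples)
    frontier = [0]
    for _ in range(levels):
        nxt = []
        for sym in frontier:
            prods = [tuple(next(counter) for _ in range(s)) for _ in range(m)]
            grammar[sym] = prods
            nxt.extend(c for p in prods for c in p)
        frontier = nxt
    return grammar

def expand_all(grammar, seq, levels):
    """Enumerate every string reachable from `seq` in `levels` expansions."""
    if levels == 0:
        return {tuple(seq)}
    out = set()
    for choices in itertools.product(*(range(len(grammar[s_])) for s_ in seq)):
        child = [c for s_, ch in zip(seq, choices) for c in grammar[s_][ch]]
        out |= expand_all(grammar, child, levels - 1)
    return out

g = make_grammar(levels=2, s=2, m=2)
strings = expand_all(g, [0], 2)
print(len(strings))  # 8 = m^3: three internal nodes for levels = 2, s = 2
```

Sampling from this generator (pick one production uniformly at each node) gives data with exactly the cross-scale conditional-independence structure discussed below, which makes it a convenient null model for testing spectral diagnostics.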

What this means for wavelets

Here’s the connection I think matters (speculative, but bear with me). In the RHM, the hierarchical structure manifests as cross-scale conditional independence: \(I(X_\ell; X_{\ell+2} \mid X_{\ell+1}) \approx 0\). Information at scale \(\ell\) is essentially screened off by scale \(\ell + 1\). This is a discrete, clean version of what in the wavelet world would be decorrelation across scales — wavelet coefficients at different scales being approximately independent.

For natural data — language, images — the hierarchical structure is real but much messier than the RHM. Long-range latent correlations exist (the topic of a paragraph influences word choice many sentences later; the lighting in a scene affects every pixel), and these correlations are what make the data learnable in a deep sense — they’re what a hierarchical learner exploits.

The promise of diffusion wavelets is that they could identify approximate hierarchical structure in continuous data where the RHM is too discrete a model. The wavelet coefficient decay rates, the cross-scale correlation structure, the multifractal spectrum — these are all things that the theory says should be sensitive to the presence of hierarchical latent structure, and they’re defined without reference to any learner. If this works, it would give us a way to say “this distribution is approximately hierarchical at these scales, with these coupling strengths” in a principled way that’s pretty hard to get at otherwise.

The gap between “the theory says wavelet coefficients carry this information” and “we can extract it from finite samples of a complex distribution” is large. But it’s a well-posed gap — we know what to try.


3. The \(C^a_k\) bridge: connecting data geometry to learning

Intrinsic vs. extrinsic geometry

Consider a manifold \(\mathcal{M}^d\) isometrically embedded in \(\mathbb{R}^D\) via \(\iota: \mathcal{M} \hookrightarrow \mathbb{R}^D\). Each ambient coordinate function \(\iota^a: \mathcal{M} \to \mathbb{R}\) can be expanded in the Laplacian eigenbasis:

\[\iota^a = \sum_k C^a_k \varphi_k, \qquad C^a_k = \langle \iota^a, \varphi_k \rangle_{L^2(\mathcal{M})}\]

The matrix \(C^a_k\) — which I’ll call the spectral embedding matrix — is the bridge object. It separates two fundamentally different things:

  • Intrinsic geometry: entirely encoded in the spectrum \(\{\lambda_k\}\). This is the “shape of the drum” — curvature, dimension, topology.
  • Extrinsic geometry: encoded in \(C^a_k\). This is how the manifold sits inside ambient space — how the embedding distributes energy across intrinsic modes.

The Laplacian of the embedding coordinates is proportional to the mean curvature vector: \(\Delta_\mathcal{M} \iota = \pm d \vec{H}\) (sign depends on the Laplacian convention). The upshot is that high-frequency coefficients \(C^a_k\) (large \(k\)) are controlled by the second fundamental form — a “flat” embedding concentrates energy at low \(k\), while a “twisted” embedding spreads it across high \(k\).
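A worked example where everything is computable in closed form (the circle; the "twisted" embedding with a \(k = 5\) wobble is my own toy): the round embedding puts all extrinsic energy at \(k = 1\), while the wobbled one leaks energy to \(k = 5\).

```python
# The spectral embedding matrix C^a_k for two embeddings of S^1, computed as
# inner products of the ambient coordinates against the Fourier
# (Laplace-Beltrami) eigenbasis cos(k theta), sin(k theta).
import numpy as np

n = 1024
theta = 2 * np.pi * np.arange(n) / n

def spectral_energy(iota, k_max=8):
    """Total |C^a_k|^2 per frequency k, summed over ambient coordinates a."""
    energy = np.zeros(k_max + 1)
    for coord in iota:
        energy[0] += coord.mean() ** 2
        for k in range(1, k_max + 1):
            ck = 2 * (coord * np.cos(k * theta)).mean()
            sk = 2 * (coord * np.sin(k * theta)).mean()
            energy[k] += (ck ** 2 + sk ** 2) / 2
    return energy

round_emb = [np.cos(theta), np.sin(theta)]
twist_emb = [np.cos(theta) + 0.3 * np.cos(5 * theta),
             np.sin(theta) + 0.3 * np.sin(5 * theta)]

e_round, e_twist = spectral_energy(round_emb), spectral_energy(twist_emb)
print("round:  ", np.round(e_round, 3))
print("twisted:", np.round(e_twist, 3))
```

The round circle concentrates all its energy at \(k = 1\) (it's the "flattest" embedding), and the wobble shows up as a second spike at \(k = 5\) whose size tracks the wobble amplitude — the \(C^a_k\) energy profile is exactly the "how twisted is the embedding" readout.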

Why this matters for learning

An ambient kernel (say, a Gaussian \(K(x,y) = \exp(-\lVert\iota(x) - \iota(y)\rVert^2 / 2\sigma^2)\)) has eigenvalues in the \(\{\varphi_k\}\) basis that depend on both \(\lambda_k\) and \(C^a_k\). This gives a clean decomposition of learning difficulty:

  • Intrinsic difficulty: high-frequency intrinsic modes (large \(\lambda_k\)) are hard to resolve regardless of embedding. This is the sample complexity set by the manifold’s geometry.
  • Extrinsic difficulty: even if the intrinsic spectrum is favorable, a twisted embedding (\(C^a_k\) with energy at high \(k\)) can prevent the kernel from resolving intrinsic structure. The kernel “sees” the ambient distances, and a bad embedding mixes up intrinsic scales in ambient space.

A “mild” embedding (low-rank \(C\)) means the kernel’s spectral profile follows intrinsic geometry — learning is as easy as the manifold allows. A “twisted” embedding means the kernel confuses intrinsic scales — learning is harder than it needs to be.
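A sketch of the "kernel sees the ambient distances" point on the circle (a round embedding versus a frequency-5 wobble; the bandwidth and threshold are arbitrary choices of mine): the same Gaussian kernel needs more spectral modes above a fixed relative threshold on the twisted embedding.

```python
# Same ambient Gaussian kernel, two embeddings of S^1. The kernel only sees
# ambient distances, so the wobbled embedding spreads spectral mass across
# more modes: a crude measure of the extra learning difficulty.
import numpy as np

n = 256
theta = 2 * np.pi * np.arange(n) / n
round_pts = np.stack([np.cos(theta), np.sin(theta)], axis=1)
twist_pts = np.stack([np.cos(theta) + 0.3 * np.cos(5 * theta),
                      np.sin(theta) + 0.3 * np.sin(5 * theta)], axis=1)

def n_modes(points, sigma=0.5, rel_tol=1e-6):
    """Number of kernel eigenvalues above a fixed relative threshold."""
    D2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    w = np.linalg.eigvalsh(np.exp(-D2 / (2 * sigma ** 2)) / n)
    return int((w > rel_tol * w.max()).sum())

r_round, r_twist = n_modes(round_pts), n_modes(twist_pts)
print(f"modes above threshold: round {r_round}, twisted {r_twist}")
```

The intrinsic geometry is identical in the two cases (both are circles); the difference in the kernel spectrum is purely extrinsic, i.e., it lives entirely in \(C^a_k\).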

Connection to feature learning

In the lazy (NTK) regime, the cross-Gram alignment matrix \(A_{jk} = \langle \phi_j^{\text{NTK}}, \varphi_k \rangle\) quantifies how well the NTK eigenfunctions match the data manifold eigenfunctions. When \(A\) is diagonal, the network’s learning preferences match the data geometry.

Feature learning — the thing that makes deep networks qualitatively better than kernel methods — can be understood as the process of making \(A\) more diagonal over training. Atanasov et al. call this silent alignment: the NTK eigenvectors rotate to align with data geometry on a faster timescale than the eigenvalues grow. In the language of \(C^a_k\): the network is learning to “unwind” the embedding, separating intrinsic from extrinsic structure.

This is where the companion piece comes in. The learner-data decomposition provides a formal framework for understanding this alignment process order by order: the \(n\)-th learner tensor \(\mathcal{K}_n\) couples to the \(n\)-th data moment tensor \(M_n\), and the spectral structure I’ve been describing here is what determines what those data tensors look like. The \(C^a_k\) matrix is, in some sense, a summary of the zeroth-order data geometry that a kernel learner (the linear term in the Volterra expansion) sees.


4. Next steps

The concrete question I want to answer next: does RL trajectory data have the same spectral signatures that make supervised learning tractable? The toolkit here gives us things to measure — intrinsic dimension, eigenvalue decay rates, wavelet sparsity, cross-scale correlations. If trajectory data turns out to be structurally different from language and images, that’s not just a curiosity; it would mean the difficulty of RL is partly about data geometry, not just sample efficiency or reward sparsity.

The bigger open question is whether multi-scale structure (wavelets) plus embedding structure (\(C^a_k\)) is rich enough to predict what features a learner will find, and in what order. The convergence of ICA, sparse coding, and predictive coding to similar first-layer features on natural images is suggestive — something like a preferred factorization exists, at least at the lowest level of abstraction. Whether it extends to higher-level features is what I’m trying to figure out.


This is a research note — a snapshot of where my thinking is, not a finished argument. The gap between “interesting theoretical framework” and “tools that tell us new things about real data” is real, and I’m working on closing it.

Written with Claude.