The Rise and Fall of Complexity: From Coffee Cups to Neural Networks
Entropy increases monotonically. Complexity does not. That asymmetry — why a half-mixed cup of coffee looks more interesting than either a fresh pour or a fully stirred one — turns out to connect physics, information theory, and how we train neural networks.
Watch coffee and cream mix. At the start you have two distinct layers — low entropy, low complexity. At the end you have a uniform brown liquid — high entropy, low complexity. But in the middle, for a brief window, you see beautiful swirling tendrils: high complexity.
The system's entropy increases monotonically throughout; its complexity does not. That asymmetry is the puzzle this post is about.
Why Entropy Isn't Enough
The second law of thermodynamics says that the entropy of a closed system never decreases. Entropy, loosely, is a measure of disorder — roughly, the logarithm of the number of microstates consistent with what you observe. A perfectly ordered crystal has low entropy. A gas with molecules flying in all directions has high entropy.
But that doesn't capture what we care about intuitively. A random string of bits has maximum entropy — it's as disordered as possible. Yet it also carries no interesting structure. Compare that to a genome, or a well-written novel, or the human brain: these have intermediate entropy but enormous apparent complexity. They're neither simple nor random. They're interesting.
Scott Aaronson noticed this problem in 2011 and named it complexodynamics — the study of how complexity evolves over time in a closed system. His key observation: any satisfying definition of complexity should first rise and then fall as a system evolves from low-entropy order to high-entropy noise. The challenge is formalizing what that means.
The Coffee Automaton
Aaronson, Carroll, and Ouellette (2014) built a concrete model to test candidate definitions: the Coffee Automaton. It's a simple cellular automaton with two cell types (coffee and cream) that mix according to local rules, capturing the essential physics: start ordered, end mixed, with structure emerging and dissolving in between.
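The paper's rules are more careful (it distinguishes interacting from non-interacting particles, among other details), but a minimal sketch of the same setup — cream on top, coffee below, random local swaps, periodic boundaries for simplicity — is enough to watch the story play out:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 64
grid = np.zeros((N, N), dtype=np.uint8)
grid[: N // 2, :] = 1  # cream fills the top half, coffee the bottom: the ordered start

moves = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def step(grid, swaps=500):
    """Swap randomly chosen pairs of adjacent cells — a crude local diffusion rule."""
    n = grid.shape[0]
    for _ in range(swaps):
        i, j = rng.integers(0, n, size=2)
        di, dj = moves[rng.integers(4)]
        i2, j2 = (i + di) % n, (j + dj) % n
        grid[i, j], grid[i2, j2] = grid[i2, j2], grid[i, j]
    return grid

for t in range(300):
    grid = step(grid)  # early: clean layers; middle: tendrils at the interface; late: uniform mix
```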
They evaluated several candidate complexity measures:
- Entropy: monotonically increases. Disqualified.
- Compressed size: rises but doesn't fall cleanly.
- Sophistication: rises, then falls. Passes.
Sophistication is the key concept. It measures the amount of information that is both non-trivial and non-random — the "meaningful" part of a string's description. To make this precise, you need Kolmogorov complexity.
Kolmogorov Complexity: The Shortest Description
The Kolmogorov complexity K(x) of a string x is the length of the shortest program that outputs x and then halts (Uspensky, Shen, and Vereshchagin, 2017).
A string of a million zeros has very low K — you can describe it with a tiny program: "print '0' one million times." A truly random string has K close to its own length — no program shorter than the string itself can generate it.
This gives us a formal definition of randomness: a string is random if its Kolmogorov complexity is close to its length. Importantly, most strings are random. Structure is the exception.
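K(x) itself can't be computed, but an off-the-shelf compressor gives a crude, computable upper bound on description length — enough to see the contrast concretely. A quick sketch using Python's zlib:

```python
import os
import zlib

million_zeros = b"0" * 1_000_000       # pure pattern: a tiny program regenerates it
random_bytes = os.urandom(1_000_000)   # pure noise: essentially incompressible

# zlib gives only a loose upper bound on description length, not K itself,
# but the contrast is already dramatic.
print(len(zlib.compress(million_zeros, 9)))  # on the order of a kilobyte
print(len(zlib.compress(random_bytes, 9)))   # roughly a million bytes
```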
But K alone doesn't measure interestingness. The all-zeros string "000...0" has minimal K — boring because it's pure pattern — while a random string has maximal K — boring because it's pure noise. Interesting strings live somewhere in between, and K by itself doesn't point to them.
Sophistication carves out the middle. Define it roughly as: the length of the shortest model that explains the "structured" part of x, after separating structure from noise. A random string has near-zero sophistication — its shortest model is trivial. A simple repeating string also has low sophistication. A genome has high sophistication — it requires a long model to explain why those specific nucleotides are there. In the Coffee Automaton experiments, sophistication behaves exactly as intuition demands: low at the start, rising as structure emerges, then falling as noise takes over.
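One standard formalization (there are several variants) goes through sufficient statistics: a finite set S containing x counts as a good model if describing S and then pointing to x inside it costs about as little as describing x outright. Sophistication with slack c is the complexity of the cheapest such model:

Soph_c(x) = min { K(S) : x ∈ S and K(S) + log2 |S| ≤ K(x) + c }

For a random x of length n, the set of all n-bit strings qualifies (given modest slack), so its sophistication is tiny; for a string with genuinely complicated structure, any qualifying S must carry that structure itself, so K(S) is large.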
The Minimum Description Length Principle
Kolmogorov complexity is incomputable — there's no algorithm that, given x, tells you K(x). But its ideas transfer beautifully into computable practice via the Minimum Description Length (MDL) principle (Grünwald, 2004).
MDL says: the best hypothesis H for data D is the one that minimizes the total description length:
L(H) + L(D | H)
The first term is the cost of encoding the model itself. The second is the cost of encoding the data given the model. Together they measure how well the model compresses the data.
This is not just a heuristic. MDL has deep connections to Bayesian inference — minimizing description length is equivalent to maximizing the posterior under a prior that assigns higher probability to shorter descriptions — and to statistical learning theory, where hypotheses with short descriptions come with tighter, Occam's-razor-style generalization bounds.
The MDL principle operationalizes a principle often stated vaguely: prefer simpler explanations. It makes "simpler" precise — shorter description — and gives a principled way to trade off model complexity against fit.
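To see the trade-off concretely, here is a toy sketch of two-part MDL for choosing a polynomial degree (my example, not one from Grünwald's tutorial). It uses the standard crude costs — about (k/2)·log2(n) bits to encode k real-valued parameters, and the Gaussian negative log-likelihood of the residuals, in bits and up to constants, for the data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a cubic signal plus Gaussian noise.
n = 200
x = np.linspace(-1, 1, n)
y = 2 * x**3 - x + rng.normal(scale=0.2, size=n)

def description_length(x, y, degree):
    """Crude two-part MDL score (in bits, up to constants) for a polynomial fit."""
    k = degree + 1                                      # number of real parameters
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    model_cost = 0.5 * k * np.log2(len(x))              # L(H): encode the parameters
    data_cost = 0.5 * len(x) * np.log2(rss / len(x))    # L(D|H): Gaussian NLL of residuals
    return model_cost + data_cost

scores = {d: description_length(x, y, d) for d in range(1, 10)}
best = min(scores, key=scores.get)  # typically lands at or near degree 3
```

A low-degree fit pays heavily in the data term; a high-degree fit pays in the model term; the minimum sits where the extra parameters stop buying real compression of the residuals.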
Keeping Neural Networks Simple
Hinton and van Camp (1993) applied MDL directly to neural network training. Instead of storing weights as exact floating-point numbers, they encode them under a zero-mean Gaussian prior; in the full scheme each weight is made noisy and the surplus randomness is reclaimed via a "bits-back" argument. Under the Gaussian prior, the description length of a weight is proportional to w²/σ² — a quadratic penalty. The total objective becomes:
L(weights) + L(data | weights)
Minimizing this is exactly equivalent to training with L2 regularization (weight decay) — except now you understand why. You're not adding regularization as an arbitrary trick to prevent overfitting. You're minimizing the description length of the model.
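A minimal sketch of that objective, assuming PyTorch (the architecture and σ below are placeholders), with the two terms written as explicit description costs rather than as "loss plus weight decay":

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
sigma = 1.0  # assumed prior width: controls how expensive precise weights are

def description_length(model, x, y):
    # L(data | weights): squared error stands in for the Gaussian negative
    # log-likelihood of the targets, up to constants and a choice of units.
    data_cost = F.mse_loss(model(x), y, reduction="sum")
    # L(weights): encoding each weight under a zero-mean Gaussian prior costs
    # w^2 / (2 sigma^2) — exactly an L2 / weight-decay penalty.
    weight_cost = sum((p ** 2).sum() for p in model.parameters()) / (2 * sigma ** 2)
    return data_cost + weight_cost
```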
They went further: let the noise level σ of each weight be a trainable parameter. Weights that matter (strong gradient signal) can afford to stay precise; weights that don't matter shrink toward zero and become cheap to encode. This gives a form of learned pruning — the network discovers which weights to keep. The complexity of the network, measured by the total description length of the weights, is minimized by the training process itself.
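The noisy-weight scheme can be sketched in the same style: give each weight a mean and a trainable noise level, and charge it the Gaussian KL divergence to the prior, which the bits-back argument identifies as its effective description cost. A sketch of the idea, not the paper's exact procedure:

```python
import torch

sigma_prior = 1.0  # assumed width of the Gaussian prior over weights

def weight_cost(mu, log_sigma):
    """Per-weight description cost with bits-back:
    KL( N(mu, sigma^2) || N(0, sigma_prior^2) ), in nats.

    A weight whose mean sits near zero and whose noise grows toward the prior
    width costs almost nothing to encode — the learned-pruning effect.
    """
    sigma = log_sigma.exp()
    return (torch.log(sigma_prior / sigma)
            + (sigma ** 2 + mu ** 2) / (2 * sigma_prior ** 2)
            - 0.5).sum()

# Example: 256 weights, each with a trainable mean and noise level.
mu = torch.zeros(256, requires_grad=True)
log_sigma = torch.full((256,), -3.0, requires_grad=True)
total_weight_cost = weight_cost(mu, log_sigma)
```

In a modern implementation, the noisy forward pass would sample w = mu + sigma·epsilon with epsilon ~ N(0, 1), and this cost would be added to the data term from the previous sketch.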
The Arc From Physics to Learning
The thread connecting all five papers is a single question: what is complexity, and what should happen to it?
Aaronson's complexodynamics gives the physics perspective: in a closed system evolving from order to chaos, meaningful complexity must peak at intermediate times. Kolmogorov complexity provides the mathematical substrate: a universal, formal definition of description length and structure. MDL translates these ideas into computable statistics: the best model is the shortest model. And Hinton and van Camp close the loop: neural networks that minimize description length are exactly the neural networks we want — ones that capture true structure in the data without overfitting to noise.
There's something satisfying about that. The coffee cup's arc from order through complexity to noise is the same arc we try to avoid when training a model. A network that memorizes training data is the fully-mixed cup: maximum fitting, zero generalization, no interesting structure. A network that learns too little is the unmixed cup. The goal — always — is the brief, beautiful middle: a model complex enough to explain the data's genuine structure, simple enough not to explain the noise.
References
- Aaronson, S. (2011). The First Law of Complexodynamics. scottaaronson.blog
- Aaronson, S., Carroll, S., and Ouellette, L. (2014). Quantifying the Rise and Fall of Complexity in Closed Systems: the Coffee Automaton. arXiv:1405.6903
- Uspensky, V. A., Shen, A., and Vereshchagin, N. K. (2017). Kolmogorov Complexity and Algorithmic Randomness. American Mathematical Society.
- Grünwald, P. (2004). A Tutorial Introduction to the Minimum Description Length Principle. arXiv:math/0406077
- Hinton, G. E. and van Camp, D. (1993). Keeping Neural Networks Simple by Minimizing the Description Length of the Weights. Proceedings of COLT.