Philosophical Monologues: forcing

Thursday, June 25, 2026

Note on Large Language Models

LLMs bear a certain analogy to compression. From the training data D we obtain an LLM T(D), which is supposed to contain (or "extract") the essential "information" or "statistical patterns" present in D. T(D) is much smaller than D. It is speculated that Claude models are trained on a D of about a petabyte in size and that the models themselves range from 150 to 500 GB. The response to a given prompt is analogous to decompression. Supposedly T(D) can "generate" an approximation of all the information originally contained in D. Some questions:

1. Is it not true that the passage D -> T(D) is lossy, so that important information present in D is lost in T(D) and cannot be recovered from it?
2. Is there any way to study T(D) as a mathematical object, detect its structure and geometry? And to study likewise the correspondence between D and T(D)? If there are limitations to doing this are they practical or theoretical?
3. Is there an analogy between passing from D to T(D) and passing from general to countable models of ZF set theory (which exist by the downward Löwenheim-Skolem theorem)?
4. Is there not some analogy between forcing using countable models and generic sets and the process of training to generate T(D)? In both cases there is pattern generalization from fragmentary data.
5. Is there any structural correspondence between the structure of T(D) and structures found in the world (not counting neurological analogues of MLPs)?
6. Can we construct toy universes, toy languages and toy training data and study how D -> T(D) works in this simplified idealized scenario to gain more insight regarding real world LLMs?
7. Do LLMs express an essentially emergent phenomenon in which hardware capabilities are a crucial factor? Can we formalize rigorously such a concept of emergent phenomenon or capability?
8. But most importantly, LLMs are statistical next-token predictors, and they are trained as such. We need to formalize clearly what LLMs are supposed to do in the first place. Suppose we have a (first-order) model M that represents the world. We want our LLM T to capture, to a good degree of approximation, the theory of M, Th(M). We are given a large finite set L of first-order formulae, each with a probability of belonging to Th(M). A transformation is applied to L to obtain the object T, which should retain the reliable part of L (the formulae that do belong to Th(M)) and extrapolate to other elements of Th(M). Is this to be understood as both logical and statistical inference?
9. An LLM is just a finite-state automaton. But recursively axiomatizable theories are in general not recursive. Can we construct a theory T such that, for any finite subset L of T, all LLMs trained on L will err to an arbitrary degree with regard to infinitely many sentences of T? The key is to define metrics on expressions.
10. And most importantly: are LLMs analogous to the syntactic (and algebraic) models used in logic and category theory?
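
Question 6 can at least be made concrete in miniature. What follows is only a sketch under strong assumptions, not a claim about how real LLMs work: the "toy universe" is a hypothetical six-state word graph (TOY_WORLD), the training data D is a sample of random walks through it, and T(D) is a plain bigram table standing in for a trained model. All names here are invented for illustration.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy universe: each word lists the words allowed to follow it.
TOY_WORLD = {
    "<s>": ["the"],
    "the": ["cat", "dog"],
    "cat": ["sees", "sleeps"],
    "dog": ["sees", "sleeps"],
    "sees": ["the"],
    "sleeps": ["</s>"],
}

def sample_sentence(rng):
    """Draw one sentence from the toy world by a uniform random walk."""
    token, sentence = "<s>", []
    while True:
        token = rng.choice(TOY_WORLD[token])
        if token == "</s>":
            return sentence
        sentence.append(token)

def train_bigram(corpus):
    """T(D): bigram counts normalized into conditional probabilities."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for prev, ctr in counts.items()
    }

def generate(model, rng):
    """'Decompression': sample a sentence from the trained table."""
    token, sentence = "<s>", []
    while True:
        candidates = list(model[token])
        weights = [model[token][c] for c in candidates]
        token = rng.choices(candidates, weights=weights)[0]
        if token == "</s>":
            return sentence
        sentence.append(token)

rng = random.Random(0)
D = [sample_sentence(rng) for _ in range(1000)]
T = train_bigram(D)

size_D = sum(len(s) + 2 for s in D)           # tokens in D, boundary markers included
size_T = sum(len(row) for row in T.values())  # probabilities stored in T(D)
print("tokens in D:", size_D, "| parameters in T(D):", size_T)
print("generated:", " ".join(generate(T, rng)))
```

Even here the passage D -> T(D) is lossy in the sense of question 1: the table keeps the transition structure of the toy world but forgets which sentences occurred in D and in what order. Whether anything from this setting carries over to real models is exactly what the question leaves open.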