The transformer architecture is solved: the real LLM moat lives in training, not in the diagram

🕒 Published on Zendoric: June 28, 2026 · 09:00
When someone says 'it's just a transformer' they're pointing to the part that's already solved and leaving out what costs billions. A technical essay dissects with precision where the real value of language models is forged.
By Zendoric · June 28, 2026.
There is a phrase that circulates in technical forums every time a new model comes out: 'in the end, it's just a transformer.' The claim is technically correct and analytically useless. An essay published by Bharadwaj P. on his personal blog lays out, with worked examples and diagrams, exactly what hides behind that 'just': three training phases, trillions of tokens, human preference data, and a GPU infrastructure within almost no one's reach. The architecture is the starting point, not the product.
**The empty transformer is a stack of random numbers**
The text starts from a fact worth remembering: a freshly initialized model knows nothing. It is a meaningless weight matrix. What turns it into Claude, GPT or Gemini is not the attention block diagram —that is already published, public and reproducible— but the process that happens afterward. The author identifies six levers that close that gap: tokenization, pretraining, supervised fine-tuning, alignment, cheap adaptation via LoRA, and inference infrastructure.
Before getting to the training phases, the essay devotes space to two elements that are usually taken for granted but have concrete consequences for model behavior.
The first is **residual connections**. The classic problem of training deep networks is that the error signal degrades as it propagates backward through the layers: it either shrinks toward zero or explodes. The solution, elegant in its simplicity, is to let each block's original input also travel directly to the output, so that the block only has to learn an incremental adjustment over a stable signal. Without that parallel route, training models with many layers on trillions of tokens would be numerically infeasible.
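A minimal sketch of that idea, in plain Python rather than a real deep-learning framework. The linear `block` with a single `weight` is an illustrative assumption standing in for an attention or feed-forward sublayer; it is not taken from the essay. Because the toy block is linear, the error signal is scaled by `weight` per layer without the skip path and by `1 + weight` with it.

```python
def block(x, weight):
    """Toy sublayer (assumption): scales every value by one learned weight."""
    return [weight * v for v in x]


def residual_block(x, weight):
    """Residual form: the input is added back to the block's output."""
    return [v + f for v, f in zip(x, block(x, weight))]


def gradient_scale(weight, depth, residual):
    """Factor applied to the error signal after crossing `depth` toy layers."""
    per_layer = (1.0 + weight) if residual else weight
    return per_layer ** depth


# 48 layers with a small weight: compare the two stacks.
print(gradient_scale(0.01, depth=48, residual=False))
print(gradient_scale(0.01, depth=48, residual=True))
```

In this toy setting the plain stack shrinks the signal to a vanishingly small number, while the residual stack keeps it close to its original magnitude (roughly 1.6 here). Real networks are nonlinear and use normalization as well, so the numbers are only indicative.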
The second is **tokenization**. Models do not see words; they see statistical fragments. A tokenizer like Llama 3's splits text into pieces whose frequency of appearance justifies giving them their own identifier. 'Aardvark' becomes three tokens ('a', 'ard', 'vark') because it is rare; 'I' stays intact because it is ubiquitous. This design explains why LLMs handle typos and language mixing well (the model never saw 'words' as units) and also why they systematically fail at counting letters: 'strawberry' does not arrive as ten separate characters, but as a few opaque blocks. When a model asked how many r's the word contains answers 'two' instead of the correct 'three', that is not a failure of abstract reasoning; it is a direct consequence of how the text is represented at the input.
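The effect can be illustrated with a toy tokenizer. This is a sketch under loud assumptions: production tokenizers such as Llama 3's use byte-pair encoding with a learned merge table, whereas the code below uses simple greedy longest-match over a tiny hypothetical vocabulary. The vocabulary, the integer IDs and the split of 'strawberry' are invented for illustration; only the 'aardvark' split mirrors the example in the text.

```python
# Hypothetical vocabulary: piece -> integer ID (not a real tokenizer's table).
VOCAB = {"a": 0, "i": 1, "aw": 2, "ard": 3, "str": 4, "vark": 5, "berry": 6}
MAX_PIECE = max(len(piece) for piece in VOCAB)
UNKNOWN_ID = -1  # assumption: single characters outside the vocabulary


def tokenize(text):
    """Greedy longest-match split; falls back to single characters."""
    text = text.lower()
    pieces = []
    pos = 0
    while pos < len(text):
        for size in range(min(MAX_PIECE, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in VOCAB or size == 1:
                pieces.append(piece)
                pos += size
                break
    return pieces


def encode(text):
    """What the model actually receives: a list of integer IDs."""
    return [VOCAB.get(piece, UNKNOWN_ID) for piece in tokenize(text)]


print(tokenize("Aardvark"))
print(tokenize("strawberry"), encode("strawberry"))
print("strawberry".count("r"))  # trivial on characters, invisible in the IDs
```

Counting letters is a one-liner on the raw string, but the model only ever receives the ID list, in which the individual characters no longer appear.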
**Three phases, one mechanism, different data**
The core of the essay is the training sequence. The mechanism is always the same —predict the next token, compare against the actual answer, correct the weights— but the nature of the data changes radically between phases, and that changes the resulting model.
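The predict, compare, correct loop can be sketched in a few lines. This is a simplified illustration, not the essay's code: a four-token vocabulary is assumed, and the logits are adjusted directly as a stand-in for the model's weights, using the standard cross-entropy loss and its gradient with respect to the logits (predicted probabilities minus the one-hot target).

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, target):
    """Cross-entropy: low when the true next token gets high probability."""
    return -math.log(softmax(logits)[target])


def corrected_logits(logits, target, lr=0.5):
    """One gradient step; the gradient w.r.t. logits is probs - one_hot."""
    probs = softmax(logits)
    return [
        v - lr * (p - (1.0 if i == target else 0.0))
        for i, (v, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0, 0.0]  # an untrained model: every token equally likely
target = 2                     # index of the token that actually came next
before = next_token_loss(logits, target)
after = next_token_loss(corrected_logits(logits, target), target)
print(before, after)
```

The loss starts at ln(4), the value for a uniform guess over four tokens, and drops after a single correction. Training repeats this step across trillions of positions.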
**Phase one, pretraining**, is the brute-force part: more than ten trillion tokens of filtered web text, open-source code, math problems. The model learns to continue text. What emerges is called the base model: a sophisticated autocomplete that, faced with a question, may generate more questions instead of answering them. This is where the name GPT comes from: Generative Pre-trained Transformer. The 'pre-trained' is literally this phase.
**Phase two, supervised fine-tuning** (SFT), transforms that autocomplete into an assistant. Web text is replaced by conversations: instruction, system context, correct answer. The mechanism does not change; the example does. The model learns that its turn begins when a special token appears —`<|im_start|>assistant`— and that what follows must be a response, not a continuation. The labs invest significant amounts in building these conversation datasets and keep them private. Two models that share architecture and web text can behave very differently if their conversation data differ.
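As a rough illustration of what such a conversation looks like once flattened into text, here is a sketch of a ChatML-style template. Only the `<|im_start|>assistant` marker comes from the text above; the `<|im_end|>` terminator, the newline layout and the `render_chat` helper are assumptions made for this example, and each lab's actual template may differ.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"  # assumption: end-of-turn marker, ChatML-style


def render_chat(messages, add_generation_prompt=False):
    """Flatten a list of {'role', 'content'} turns into one training string."""
    parts = [
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    ]
    if add_generation_prompt:
        # At inference time the text ends here, so the model's turn begins.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "What is LoRA?"},
]
prompt = render_chat(chat, add_generation_prompt=True)

# A full SFT example also contains the answer the model should learn.
example = render_chat(
    chat + [{"role": "assistant", "content": "A low-rank fine-tuning method."}]
)
print(prompt)
print(example)
```

The training objective is unchanged: the model still predicts the next token, but now the text it continues is a structured dialogue in which the assistant marker signals that an answer is expected.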
**Phase three, alignment**, is where reinforcement learning comes in. The best-known variant is RLHF (reinforcement learning from human feedback): human evaluators are shown two responses and their preference is recorded, a model is trained to predict those preferences, and that predictor is used to push the chat model toward better-rated responses. DPO (Direct Preference Optimization) achieves a similar effect with less infrastructure, learning directly from preference pairs without the intermediate predictor. According to the external analysis available, this is also the phase in which reasoning models emerge: the transformer is identical, but the training teaches the model to emit intermediate reasoning tokens before the final answer. The gray 'thinking' box that some interfaces display is not architectural magic; it is learned behavior.
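To make the DPO idea concrete, here is a minimal sketch of its loss for a single preference pair. The inputs are assumed to be summed log-probabilities of each response under the model being trained (the policy) and under a frozen reference copy; the function name, argument names and the `beta` default are illustrative choices, not values from the essay.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy favors the chosen response
    over the rejected one, relative to the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: no preference expressed yet.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy has shifted probability toward the human-preferred response.
improved = dpo_loss(-9.0, -12.0, -10.0, -10.0)
print(neutral, improved)
```

No separate preference predictor appears anywhere: the preference pair acts on the model's own log-probabilities, which is the reduction in infrastructure described above.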
**LoRA: the access route for everyone else**
Fine-tuning the full weights of a model with tens of billions of parameters is beyond the budget of almost any team that is not a frontier lab. LoRA (Low-Rank Adaptation) solves that problem by leaving the original weights frozen and adding, in parallel, two small matrices whose product approximates the adjustment that full training would make. The intermediate dimension of those matrices, the 'low rank' that gives the technique its name, is chosen so that the number of trainable parameters falls below one percent of the total. Across a real model, instead of updating billions of values, only millions are trained. There is a loss of expressiveness, but the cost falls by orders of magnitude. For specialized use cases (a model expert in legal code, in medical terminology, in a low-resource language) LoRA is the difference between possible and impossible.
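The arithmetic is easy to check. The sketch below assumes a single square weight matrix of size 4096 x 4096 and a rank of 8; both figures are illustrative and not taken from the essay. The frozen matrix is `W`, and the two small trainable matrices are `A` (rank x inputs) and `B` (outputs x rank), so the layer computes `W x + scale * B (A x)`.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Frozen path W x plus the low-rank trainable path B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """(values in the full matrix, values in the two LoRA matrices)."""
    return d_out * d_in, rank * (d_in + d_out)


full, lora = lora_param_counts(4096, 4096, rank=8)
print(full, lora, f"{lora / full:.2%}")  # 16777216 65536 0.39%

# Tiny forward pass: with B at zero the adapter leaves the output unchanged.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
print(lora_forward(W, A, [[0.0], [0.0]], [1.0, 2.0]))
print(lora_forward(W, A, [[1.0], [2.0]], [1.0, 2.0]))
```

Starting `B` at zero, a common convention, means the adapted model initially behaves exactly like the frozen one, and training moves it away from that baseline only as far as the small matrices allow.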
**The race is not about the architecture**
The essay mentions the emergence pattern documented around 2022: on many tasks, a model's accuracy stays flat near chance and then jumps sharply once training compute exceeds a threshold. That pattern made accumulating data and compute an investment logic that is hard to resist for any lab watching the curve. In the wider sector, it is the engine behind the large-scale data-licensing deals, the data-center construction and the fight for GPU capacity that have defined the industry ever since.
As for inference, the article recalls something non-specialists tend to overlook: serving a model at scale is an engineering problem of its own, separate from training. GPUs dominate because multiplying matrices in parallel is exactly what they do well. The efforts to reduce inference cost —quantization, smaller models for classification tasks— and the Mixture of Experts (MoE) architecture, which keeps total capacity but activates only a fraction of parameters per token, are direct responses to that economic pressure.
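The MoE idea of activating only a fraction of the parameters per token can be sketched with top-k routing. Everything here is a simplified assumption: the experts are toy functions, the router scores are passed in directly rather than computed from the token by a learned router, and `k=2` is an arbitrary choice.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )[:k]
    peak = max(router_logits[i] for i in ranked)
    exps = {i: math.exp(router_logits[i] - peak) for i in ranked}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four hypothetical experts; only two of them run for this token.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
scores = [0.0, 5.0, 5.0, -1.0]
print(top_k_route(scores, k=2))
print(moe_forward(1.0, experts, scores, k=2))
```

All four experts count toward the model's total capacity, but the compute spent on this token is that of two, which is where the inference saving comes from.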
**The reading that matters**
The value of this kind of technical synthesis lies in providing an analytical framework for reading the market. When a lab announces a new model and the debate revolves around whether it uses one or another attention mechanism, number of layers or vocabulary size, it is discussing the part that is already solved and public. The competitive difference lies in the conversation data, in the quality and scale of the alignment process, in the compute capacity sustained over months, and in the set of training decisions the labs guard jealously.
Put another way: the transformer architecture is the blank page. The text written on top —who gathered it, how it was filtered, which preference signals were used, how much was spent— is what makes a model worth what it is worth.