How LLMs Work for SREs: Tokens, Transformers, and Inference Internals
How I Got Here
I use LLMs every day - Claude Code for code, GPT for drafting. They’ve become as routine as kubectl or git. But if someone asked me how they actually work, I’d mumble something about “neural networks” and change the subject.
That bothered me. I’m an infrastructure engineer. I don’t like running tools I can’t reason about. So I went digging - not as an ML researcher, but as an operator who wants to know what he’s actually running.
This post is what I found, framed the way I’d want it explained to me: where the moving parts are, where they live in memory, and what wakes someone up when they break.
The System, In One Diagram
An LLM is a pipeline:
text → tokens → vectors → transformer stack → vector → next-token probabilities → sampled token
↑__________________________________________________________|
(loop, one token at a time)
Six stages. The model isn’t doing anything magical - it’s a deterministic pipeline that runs once per token. Generation is just running the pipeline in a loop.
The rest of this post walks through each stage, and where serving infrastructure earns its money.
Stage 1: Tokens
LLMs don’t see text. They see integers.
Before anything else, your input gets chunked into tokens - substrings drawn from a fixed vocabulary. Modern tokenizers use Byte Pair Encoding (BPE), a frequency-based scheme that builds its vocabulary bottom-up. Common patterns become single tokens. Rare patterns split into pieces.
I ran “Kubernetes is running on my Raspberry Pi” through GPT-3’s tokenizer:
["K", "uber", "net", "es", " is", " running", " on", " my", " Raspberry", " Pi"]
running is one token (common). Kubernetes shatters into four (rare, never merged in training).
Each token then maps to an integer ID - its index into the model’s vocab. Different models, different vocab sizes.
| Model | Tokenizer | Vocab Size |
|---|---|---|
| GPT-3 | tiktoken r50k_base | ~50K |
| GPT-4o | tiktoken o200k_base | ~200K |
| Llama 3 | tiktoken-style BPE | 128K |
| Gemma / Gemini | SentencePiece | 262K |
Llama 3 quadrupled its vocab from Llama 2 specifically to compress non-English text and code. Bigger vocab = fewer tokens per sentence = more text fits in the context window.
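You can reproduce the split locally with the tiktoken package (the exact pieces can shift between tokenizer versions, so treat the output as illustrative):

```python
import tiktoken  # pip install tiktoken

text = "Kubernetes is running on my Raspberry Pi"

for name in ("r50k_base", "o200k_base"):          # GPT-3-era vs GPT-4o-era encodings
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]       # map each ID back to its substring
    print(f"{name}: {len(ids)} tokens -> {pieces}")
```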
Why an SRE cares
- Context windows are denominated in tokens, not characters. Figure ~0.75 words per token for English prose. Code and YAML tokenize less efficiently - they burn through the window faster.
- API pricing is per token. Tokenizer choice = direct cost lever.
- The token ID is just an array index, not a magnitude. It carries no semantic information by itself. Meaning lives in the next stage.
Stage 2: Tokens to Vectors (Embeddings)
The model can’t multiply integers and learn anything. It needs vectors.
Each token ID indexes into a giant matrix - the embedding table - of shape [vocab_size, d_model]. Row n is the embedding for token n. For Llama 3 8B, that’s [128256, 4096] - about 525M floats just for this lookup table. ~1GB at fp16.
The vectors aren’t random. They’re learned during training. Over billions of training steps, tokens that appear in similar contexts get nudged toward similar embedding rows. By the end, king and queen sit near each other; bicycle is far away. Nobody told the model what these words mean - pure statistics over context produced the geometry.
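Mechanically, the lookup is nothing more than row indexing. A toy sketch with shrunken dimensions and random placeholder weights (a trained model would have learned these rows):

```python
import numpy as np

# Toy sizes; Llama 3 8B's real table is [128256, 4096], ~1 GB at fp16.
vocab_size, d_model = 1_000, 64
rng = np.random.default_rng(0)

embedding_table = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

token_ids = np.array([42, 7, 256])        # output of the tokenizer
vectors = embedding_table[token_ids]      # plain row lookup -> shape [3, 64]
print(vectors.shape)
```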
Where does position go?
Here is a problem the embedding table does not solve: the embedding for cat or mat is identical regardless of where the word appears in the sentence. But “the cat sat on the mat” and “the mat sat on the cat” are different sentences with different meanings.
The model needs position information somewhere. Two design choices have dominated:
The 2017 transformer added a fixed sinusoidal position vector to each token’s embedding before the first layer. Bounded values (so they don’t drown out the token signal), and chosen so the model could learn relative distances from them.
Modern models (Llama, Mistral, Qwen) use Rotary Position Embeddings (RoPE). Instead of adding a position vector, RoPE rotates the Query and Key vectors inside attention by an angle proportional to the token’s position. The math falls out so that when two tokens compute their dot product, the absolute positions cancel and only the relative distance survives.
I’ll skip the rotation-matrix algebra (there’s a toy sketch of the relative-distance property after this list). The infra-relevant properties:
- Position is encoded inside attention, not at the input.
- KV cache stores already-rotated K vectors. No recomputation when appending tokens.
- Models can stretch their context window beyond training length by tweaking RoPE’s frequency parameters (this is what YaRN does - Llama 3.1 went from 8K -> 128K context this way, no full retrain).
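The relative-distance property is easy to see in two dimensions. A minimal numpy sketch with a single made-up rotation frequency (real RoPE rotates many dimension pairs, each at its own frequency):

```python
import numpy as np

def rotate(vec, pos, theta=0.1):
    """Rotate a 2-D vector by an angle proportional to its position."""
    angle = pos * theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ vec

q = np.array([1.0, 0.5])
k = np.array([0.3, 0.8])

# Same relative distance (3 apart) -> same dot product, wherever the pair sits.
print(rotate(q, 10) @ rotate(k, 7))      # positions 10 and 7
print(rotate(q, 110) @ rotate(k, 107))   # positions 110 and 107 -> same score (up to float error)
```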
This is the whole story for stage 2. Tokens -> vectors via lookup. Position injected via rotation inside attention.
Stage 3: The Transformer Stack
The bulk of the model is a stack of identical blocks. GPT-3 has 96. Llama 3 70B has 80. Each block takes in [N, d_model] (N tokens, each a d_model-dimensional vector - 4096 in the Llama 3 8B example) and outputs the same shape. The vectors get refined; the dimensions don’t change.
input vectors [N, 4096]
↓
┌──────────────────┐
│ Block 1 │
└──────────────────┘
↓
... 78 more blocks ...
↓
┌──────────────────┐
│ Block 80 │
└──────────────────┘
↓
output vectors [N, 4096]
Each block has two sub-layers:
input
↓
RMSNorm → Self-Attention → + (residual)
↓
RMSNorm → Feed-Forward → + (residual)
↓
output
Each sub-layer does a different job:
- Self-Attention is the only mechanism that lets information flow between tokens. Strip it and every token vector evolves in isolation. No way to resolve “it” referring to a noun three sentences back.
- Feed-forward network (FFN) processes each token independently through a non-linear transformation. This is also where most of the model’s parameters live (~2/3 of total). Acts as key-value memory: pattern matches in the first projection, retrieves info in the second. The fact “New Delhi is the capital of India” is encoded as weights somewhere in some FFN layer.
Mental model:
| Sub-layer | Role | Analogy |
|---|---|---|
| Attention | Communication | Tokens talking to each other |
| FFN | Computation + memory | Each token consulting the model’s learned facts |
You need both. Attention without FFN = no facts, no nonlinearity, no real learning. FFN without attention = no cross-token reasoning.
Residual connection
The + at the end of each sub-layer is critical. The sub-layer’s output is added to its input, not substituted for it: `x = x + SubLayer(RMSNorm(x))`.
Why this matters: with 80 stacked blocks, gradient signal during training has to flow backward through the entire stack. Without the residual, every block multiplies the gradient by some Jacobian. Stack 80 of those, and signal vanishes by the time it reaches the embedding table - early layers never learn.
The residual gives gradient a “highway” past each block. Same trick that enabled deep ResNets in computer vision. Universal in modern transformers.
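Here is the block wiring as a toy numpy sketch. The attention and FFN stand-ins are placeholders I made up to keep it runnable; the point is the norm → sub-layer → residual pattern and the constant shape:

```python
import numpy as np

d_model = 64
rng = np.random.default_rng(0)

def rms_norm(x, eps=1e-6):
    # RMSNorm: rescale by root-mean-square; no mean subtraction (learned gain omitted).
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

# Placeholder sub-layers: anything mapping [N, d_model] -> [N, d_model] fits the slot.
W_attn = rng.standard_normal((d_model, d_model)) * 0.02
W_ffn = rng.standard_normal((d_model, d_model)) * 0.02
self_attention = lambda x: x @ W_attn              # real attention mixes information across tokens
ffn = lambda x: np.maximum(x @ W_ffn, 0.0)         # real FFN is a per-token gated MLP

def block(x):
    x = x + self_attention(rms_norm(x))   # sub-layer 1: communicate, add back (residual)
    x = x + ffn(rms_norm(x))              # sub-layer 2: compute/recall, add back (residual)
    return x

x = rng.standard_normal((10, d_model))    # 10 token vectors in
for _ in range(80):                       # the stack: same shape flows through every block
    x = block(x)
print(x.shape)                            # (10, 64) out
```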
Stage 4: Self-Attention (the Heart)
Each token needs to decide which other tokens to look at, and how much to weight them. Concretely: the vector for the word it in “the animal didn’t cross the street because it was too tired” should end up mostly composed of information from animal, with some from tired.
The mechanism: each token vector gets projected through three learned matrices into three smaller vectors:
- Query (Q): “what am I looking for?”
- Key (K): “what do I advertise?”
- Value (V): “what info do I share when selected?”
Then for each token, the model computes:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
Walk through:
- `QKᵀ` - every Q dotted with every K. An [N, N] matrix of similarity scores.
- `/ √d_k` - scale to keep softmax from saturating when the head dimension is large.
- `softmax` - turn each row of scores into weights summing to 1.
- `· V` - weighted sum of value vectors. Each token’s output is a blend of all other tokens’ V, weighted by relevance.
The SRE intuition
Think of each token as a microservice in a distributed system. Each service exposes three things:
| Vector | Service equivalent | Plain English |
|---|---|---|
| Q | Outbound HTTP request | What I’m looking for |
| K | Service registry entry / tags | What I am, what I offer |
| V | Response payload | What I return when selected |
Attention is service discovery. Each token sends out a query, scores it against every other token’s registry advertisement, softmax-weights the matches, and aggregates the responses.
Multi-head attention
What I described is one head. Real attention runs many heads in parallel - typically 32 or more. Each head has its own learned Q, K, and V projection matrices and learns to attend to different patterns:
| Head | Might learn |
|---|---|
| Head 1 | Subject-verb agreement |
| Head 2 | Pronoun → noun resolution |
| Head 3 | Adjective → noun pairing |
| Head 4 | Long-range topic tracking |
Outputs are concatenated and projected through one more learned matrix (the output projection). Not theoretical: interpretability researchers have identified specific heads in real models that do specific linguistic jobs. Real specialization emerges from training.
Causal masking
During text generation, the token at position t can only attend to positions 0..t. Future tokens have not been generated yet. Implementation: set scores for future positions to −∞ before softmax. Softmax of −∞ is 0, so future tokens get zero weight. Lower-triangular mask, applied on every attention call.
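Stage 4 fits in a few lines of numpy. A toy single-head version with the causal mask folded in (real models run many heads in parallel, on RoPE-rotated Q and K, via fused GPU kernels):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask (toy sketch)."""
    N, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                    # [N, N] similarity scores
    future = np.triu(np.ones((N, N), dtype=bool), k=1)
    scores[future] = -np.inf                           # future positions -> -inf -> weight 0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # blend of value vectors

rng = np.random.default_rng(0)
N, d_k = 5, 16
Q, K, V = (rng.standard_normal((N, d_k)) for _ in range(3))
print(causal_attention(Q, K, V).shape)                 # (5, 16)
```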
Stage 5: Modern Tweaks (Where Infra Lives)
The original 2017 architecture has evolved. Modern models (Llama 3, Mistral, Qwen, DeepSeek) draw on four changes, though not every family adopts all of them. Three are quality-of-life. One is critical for serving.
RMSNorm
Drop-in replacement for LayerNorm. Drops mean-subtraction and shift parameter. Halves the norm-layer parameter count. Trains more stably for deep stacks. Universal in modern open models.
SwiGLU
The FFN’s activation function. The original transformer used ReLU. Modern models use SwiGLU - a gated unit with two parallel projections that get multiplied element-wise. One path goes through a smooth nonlinearity (Swish), the other acts as a learned gate. It adds a third projection matrix - roughly 50% more FFN parameters at the same hidden width, though in practice the width is often shrunk to compensate - and pays back in training quality.
The infra-relevant fact: when sizing GPU memory for serving, FFN params are about 2/3 of the model. SwiGLU makes that already-large fraction bigger.
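For the shape of the computation, a toy SwiGLU sketch with random weights and made-up sizes:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))        # a.k.a. SiLU: x * sigmoid(x)

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Two parallel projections multiplied element-wise, then projected back down.
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

d_model, d_ff = 64, 172                  # toy sizes; the point is three matrices instead of the classic two
rng = np.random.default_rng(0)
W_gate, W_up = rng.standard_normal((2, d_model, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d_model)) * 0.02

x = rng.standard_normal((10, d_model))   # 10 token vectors
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)   # (10, 64)
```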
Grouped Query Attention (GQA) — the important one
This is the change that determines whether your serving setup works.
During decode, the model generates Q for each new token but reuses K and V from past tokens. Those past K/V vectors live in the KV cache.
Standard multi-head attention (MHA) gives each head its own K and V projections, so KV cache size scales linearly with the number of KV heads:
KV cache ≈ 2 (K and V) × layers × KV heads × d_head × context length × bytes per value
For a Llama-2-70B-sized model running full MHA (80 layers, 64 heads, d_head 128, fp16) at 16K context: ~43 GB just for KV cache, per request. That’s most of an H100 gone, for one user.
Three flavors of attention attack this:
| Variant | KV heads | Q heads | KV cache size | Quality |
|---|---|---|---|---|
| MHA (original) | 32 | 32 | 1× (baseline) | best |
| GQA (modern default) | 8 | 32–64 | 0.125–0.25× | near-MHA |
| MQA (extreme) | 1 | 32 | 0.03× | noticeable drop |
GQA shares each KV head across a group of query heads. Llama 3 70B has 64 query heads sharing 8 KV heads. KV cache shrinks 4–8×; quality holds.
Concrete: Llama 3 70B at 16K context with 4 concurrent requests:
2 × 8 KV heads × 16384 tokens × 128 d_head × 2 bytes × 80 layers
≈ 5.4 GB per request × 4 = ~21.5 GB total KV cache
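The same arithmetic as a throwaway helper - fine for back-of-envelope node sizing, not an allocator-accurate number:

```python
def kv_cache_gb(layers, kv_heads, d_head, context, batch, bytes_per_val=2):
    """Rough KV cache size in GB: 2 (K and V) x layers x KV heads x d_head x tokens x bytes."""
    return 2 * layers * kv_heads * d_head * context * batch * bytes_per_val / 1e9

# Llama 3 70B-style dims: 80 layers, d_head 128, fp16, 16K context, 4 concurrent requests.
print(kv_cache_gb(80, 64, 128, 16_384, 4))   # ~172 GB if all 64 query heads kept their own K/V (MHA)
print(kv_cache_gb(80, 8, 128, 16_384, 4))    # ~21.5 GB with GQA's 8 KV heads
```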
Mixture of Experts (MoE)
Used by Mixtral 8x7B, DeepSeek-V3, Llama 4 Maverick.
Replaces a single FFN per layer with multiple parallel FFNs (the “experts”). A small router network picks the top-K (typically 2) experts per token. Only those experts run.
| Model | Total params | Active per token |
|---|---|---|
| Mixtral 8x7B | 47B | ~13B |
| Llama 4 Maverick | 400B | ~17B |
| DeepSeek-V3 | 671B | ~37B |
The decoupling matters: total capacity (params) and per-token compute become independent knobs. You get GPT-3.5-quality output for the FLOPs of a 13B model.
MoE is essentially consistent hashing for neural nets. Router = hash function, experts = shards. Same load-balancing failure modes too - hot keys, expert collapse, rebalancing pain.
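A toy sketch of the routing pattern, with miniature made-up “experts” (a real router also has to load-balance across the batch, which is where the hot-key pain shows up):

```python
import numpy as np

def moe_layer(x, router_W, experts, top_k=2):
    """Toy MoE FFN: route each token to its top-k experts and mix their outputs."""
    logits = x @ router_W                              # [N, n_experts] router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # indices of each token's chosen experts
    out = np.zeros_like(x)
    for i, token in enumerate(x):                      # only the chosen experts run for this token
        weights = np.exp(logits[i, top[i]])
        weights /= weights.sum()                       # softmax over the chosen experts
        for w, e in zip(weights, top[i]):
            out[i] += w * experts[e](token)
    return out

rng = np.random.default_rng(0)
d_model, n_experts = 64, 8
router_W = rng.standard_normal((d_model, n_experts)) * 0.02
expert_Ws = rng.standard_normal((n_experts, d_model, d_model)) * 0.02
experts = [lambda t, W=W: np.maximum(t @ W, 0.0) for W in expert_Ws]   # each "expert" is a tiny per-token MLP

x = rng.standard_normal((10, d_model))
print(moe_layer(x, router_W, experts).shape)           # (10, 64) - same shape, ~top_k/n_experts of the FFN compute
```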
Stage 6: Generation (The Loop)
After all the layers run, you have a refined vector for the last token. One more matrix multiplication - the LM head, shape [d_model, vocab_size] - projects that vector into a score per vocabulary token. These are logits.
Apply softmax to logits → a probability distribution over the entire vocab. Now pick a token.
Sampling strategies
- Greedy (`argmax`): always pick the highest-probability token. Deterministic. Boring - gets stuck in loops, repeats itself.
- Temperature: divide logits by `T` before softmax. T=0 ≈ greedy. T=0.7 = mildly random. T>1.5 = chaotic.
- Top-k: only sample from the top `k` tokens.
- Top-p (nucleus): only sample from the smallest set whose cumulative probability ≥ p. Adaptive: keeps few tokens when the model is confident, more when it’s uncertain.
Production default: temperature=0.7, top_p=0.9. These are knobs on every API.
The autoregressive loop
temperature = 0.7                             # production-ish default (see above)
tokens = tokenize(prompt)                     # prompt -> list of token IDs
while True:
    logits = model.forward(tokens)[-1]        # scores for the next token (real servers reuse the KV cache here)
    probs = softmax(logits / temperature)     # temperature scaling, then normalize
    probs = top_p_filter(probs, p=0.9)        # nucleus filter: keep only the top-p probability mass
    next_token = sample(probs)                # draw one token from what's left
    tokens.append(next_token)
    if next_token == END_TOKEN:
        break
print(detokenize(tokens))
That’s the whole trick. Generate one token. Append. Repeat. There’s no plan, no draft, no go-back. Each token is a bet based on everything that came before. Once the model commits to a token, it’s stuck with it.
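The loop leans on a few helpers I left undefined. Minimal numpy versions of the sampling ones could look like this (names carried over from the pseudocode above, so treat them as illustrative):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())                 # subtract max for numerical stability
    return z / z.sum()

def top_p_filter(probs, p=0.9):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative probability covers p."""
    order = np.argsort(probs)[::-1]                   # most to least likely
    cum = np.cumsum(probs[order])
    keep = cum - probs[order] < p                     # drop tokens once the kept set already covers p
    filtered = np.zeros_like(probs)
    filtered[order[keep]] = probs[order[keep]]
    return filtered / filtered.sum()                  # renormalize over the survivors

def sample(probs):
    return int(np.random.default_rng().choice(len(probs), p=probs))
```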
Prefill vs decode — two different cost regimes
This is the most important infra distinction in LLM serving.
| Stage | What runs | Bottleneck |
|---|---|---|
| Prefill | Process whole prompt at once | Compute-bound. Parallel-friendly. Fast per token. |
| Decode | One new token at a time | Memory-bandwidth-bound. GPU underutilized. Slow per token. |
Why decode is slow: to compute one new token, the GPU must read every model weight from HBM. For Llama 3 70B at fp16 that’s 140 GB. H100 HBM bandwidth is ~3 TB/s. Reading the weights takes ~47 ms. The actual compute takes ~0.07 ms. You’re 670× memory-bound. The compute pipeline sits idle most of the step.
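The arithmetic above as a scratch calculation. Constants are rounded the same way the text rounds them, so treat the exact figures as ballpark:

```python
# Back-of-envelope decode-step cost for a dense 70B model at fp16 on one H100.
weight_bytes = 70e9 * 2               # 70B params x 2 bytes (fp16) ≈ 140 GB streamed per token
hbm_bandwidth = 3.0e12                # ~3 TB/s HBM bandwidth
peak_flops = 2.0e15                   # ~2,000 TFLOPS peak tensor throughput (optimistic, spec-sheet number)
flops_per_token = 2 * 70e9            # ~2 FLOPs per parameter per generated token

read_ms = weight_bytes / hbm_bandwidth * 1e3      # ≈ 47 ms just reading weights
compute_ms = flops_per_token / peak_flops * 1e3   # ≈ 0.07 ms of actual math
print(f"read {read_ms:.0f} ms vs compute {compute_ms:.2f} ms "
      f"-> ~{read_ms / compute_ms:.0f}x memory-bound")
```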
Practical consequences:
- Long prompt + short answer = relatively cheap per token.
- Short prompt + long answer = much more expensive per token.
- Optimization targets are different for the two regimes.
KV cache during the loop
Without the cache, each decode step would re-process the entire context - quadratic compute. The cache makes it linear.
- Prefill: compute K and V for every prompt token, store them.
- Decode step: compute Q, K, V for the new token only. Append the new K, V to the cache. Q attends to all cached K, V.
The KV cache is the single biggest knob in production LLM serving. It dictates how many concurrent requests you can fit, how long your context windows can be, and how fragmented your GPU memory becomes.
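As a toy sketch of what the cache buys you (made-up shapes and random weights; a real engine like vLLM pages this memory in fixed-size blocks):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 64, 16
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))

def attend(q, K, V):
    w = np.exp(q @ K.T / np.sqrt(d_k))
    return (w / w.sum()) @ V

# Prefill: project and cache K/V for every prompt token in one parallel pass.
prompt = rng.standard_normal((100, d_model))          # 100 prompt-token vectors
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: each new token projects only itself, appends to the cache, attends to all of it.
for _ in range(5):
    x = rng.standard_normal(d_model)                  # stand-in for the newest token's vector
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    out = attend(x @ Wq, K_cache, V_cache)            # O(context) per step instead of re-processing everything

print(K_cache.shape)                                  # (105, 16) - grows by one row per generated token
```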
What Changed for Me
After working through this, a few things shifted in how I use and reason about LLMs:
| Concept | What I do differently now |
|---|---|
| Tokenization | Watch token counts on code/YAML-heavy prompts. They burn through context fast. |
| Context window | Front-load the important stuff. The model only “knows” what’s in the window. |
| Autoregressive generation | The model can’t go back. Early errors compound. Be deliberate about prompt structure. |
| Temperature | Lower for factual / deterministic tasks. Higher for creative work. |
| GQA & MoE | Parameter count alone is a misleading metric. A 400B MoE model can be cheaper to serve than a 70B dense model. |
| KV cache | Long conversations cost memory linearly. GQA helps; context limits are real. |
| Effective context length | Models degrade before their advertised window. RULER benchmarks show reliable retrieval at only 50–65% of the stated max. Don’t trust the marketing number. |
| Quantization | INT4 / fp8 models lose only a small amount of quality. I don’t always need the full fp16 endpoint. |
| Training-time cutoff | The model’s knowledge is frozen. Use tools and RAG for current information. |
What I Took Away
LLMs are next-token prediction machines built on the transformer architecture. Tokens go in, get embedded into vectors, pass through dozens of attention and feed-forward layers, and come out as a probability distribution over the next token. Repeat that loop and you get coherent text.
The system is less magical than I thought, and more impressive. The model doesn’t “understand” anything in the conceptual sense - it’s just that statistical patterns over trillions of tokens, compressed into billions of parameters, produce behavior that looks like understanding.
References
- Vaswani et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762
- Hoffmann et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). https://arxiv.org/abs/2203.15556
- Dao et al. (2022). FlashAttention. https://arxiv.org/abs/2205.14135
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864
- Shazeer (2019). Fast Transformer Decoding (MQA / GQA precursor). https://arxiv.org/abs/1911.02150
- Ouyang et al. (2022). Training language models to follow instructions with human feedback (RLHF). https://arxiv.org/abs/2203.02155
- Karpathy (2023). Let’s build GPT: from scratch, in code, spelled out. https://www.youtube.com/watch?v=kCc8FmEb1nY
- Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). https://arxiv.org/abs/2309.06180