Understanding LLM Inference Internals

Introduction

What makes LLM inference fast (or slow)? This post walks through the full pipeline, from raw text to the next sampled token, and points out where the time actually goes.

The Inference Pipeline

1. Tokenization

The model never sees raw text. A tokenizer splits the input into subword units (most current models use byte-pair encoding, BPE, or a close variant) and maps each unit to an integer id from a fixed vocabulary. Tokenization is cheap compared to the forward pass, but it fixes the sequence length, and sequence length drives the cost of everything downstream.
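A minimal sketch of the idea, assuming a tiny hand-written vocabulary (kVocab and its ids are invented for illustration; real tokenizers learn merges from data and ship vocabularies of tens of thousands of entries). Greedy longest-match is used here for brevity; BPE instead merges pairs by learned rank, but the output has the same shape, a vector of ids:

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Toy vocabulary mapping substrings to token ids (hypothetical entries).
static const std::unordered_map<std::string, int> kVocab = {
    {"Hello", 1}, {" world", 2}, {"Hel", 3}, {"lo", 4}, {" ", 5},
    {"w", 6}, {"o", 7}, {"r", 8}, {"l", 9}, {"d", 10},
};

// Greedy longest-match tokenization: at each position, take the longest
// vocabulary entry that matches the remaining text.
std::vector<int> tokenize(const std::string& text) {
    std::vector<int> ids;
    size_t pos = 0;
    while (pos < text.size()) {
        size_t best_len = 0;
        int best_id = -1;
        for (size_t len = text.size() - pos; len > 0; --len) {
            auto it = kVocab.find(text.substr(pos, len));
            if (it != kVocab.end()) { best_len = len; best_id = it->second; break; }
        }
        if (best_id < 0) { ++pos; continue; }  // skip unknown bytes
        ids.push_back(best_id);
        pos += best_len;
    }
    return ids;
}

int main() {
    for (int id : tokenize("Hello world")) std::cout << id << ' ';
    std::cout << '\n';  // prints: 1 2
    return 0;
}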

2. Embedding Lookup

Each token id selects one row of a learned embedding matrix of shape (vocab_size x d_model). The "lookup" is literally a memory read with no arithmetic: the id is an index, and the row it points at is the vector the transformer layers will operate on.
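A sketch with made-up sizes and placeholder weights (real models use tables on the order of 100k rows by several thousand floats):

#include <cstdio>
#include <vector>

// An embedding table is a (vocab_size x d_model) matrix of learned
// weights; "lookup" is just reading row `token_id`.
int main() {
    const int vocab_size = 8, d_model = 4;  // tiny stand-in sizes
    std::vector<float> table(vocab_size * d_model);
    for (int i = 0; i < vocab_size * d_model; ++i)
        table[i] = 0.01f * i;               // placeholder weights

    std::vector<int> token_ids = {1, 2};    // output of the tokenizer
    for (int id : token_ids) {
        const float* row = &table[id * d_model];  // no math, just indexing
        printf("token %d -> [%.2f %.2f %.2f %.2f]\n",
               id, row[0], row[1], row[2], row[3]);
    }
    return 0;
}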

3. Transformer Layers

Each layer applies self-attention, in which every position mixes in information from earlier positions, followed by a position-wise feed-forward network, with residual connections and normalization around both. For inference speed the crucial detail is the KV cache: the keys and values computed for past positions are stored, so generating each new token only requires attending against the cache rather than re-running the whole prefix.
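Below is a sketch of scaled dot-product attention for a single head and a single query position, the inner operation that runs against the KV cache at every decode step. The attend function and its toy inputs are invented for illustration; real implementations batch this across heads and positions and fuse it into optimized kernels:

#include <cmath>
#include <cstdio>
#include <vector>

// out[j] = sum_i softmax(q . k_i / sqrt(d)) * v_i[j]
// q: [d]; keys/values: t rows of [d] each (the KV cache during decoding).
std::vector<float> attend(const std::vector<float>& q,
                          const std::vector<std::vector<float>>& keys,
                          const std::vector<std::vector<float>>& values) {
    const size_t t = keys.size(), d = q.size();
    std::vector<float> scores(t);
    float max_s = -1e30f;
    for (size_t i = 0; i < t; ++i) {
        float s = 0.f;
        for (size_t j = 0; j < d; ++j) s += q[j] * keys[i][j];
        scores[i] = s / std::sqrt(float(d));
        if (scores[i] > max_s) max_s = scores[i];
    }
    float z = 0.f;  // softmax with max-subtraction for numerical stability
    for (size_t i = 0; i < t; ++i) { scores[i] = std::exp(scores[i] - max_s); z += scores[i]; }
    std::vector<float> out(d, 0.f);
    for (size_t i = 0; i < t; ++i)
        for (size_t j = 0; j < d; ++j)
            out[j] += (scores[i] / z) * values[i][j];
    return out;
}

int main() {
    // Two cached positions, d = 2; the numbers are arbitrary.
    std::vector<std::vector<float>> K = {{1, 0}, {0, 1}};
    std::vector<std::vector<float>> V = {{1, 2}, {3, 4}};
    std::vector<float> q = {1, 0};  // attends mostly to position 0
    auto out = attend(q, K, V);
    printf("out = [%.3f %.3f]\n", out[0], out[1]);
    return 0;
}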

4. Sampling

The final layer produces one logit per vocabulary entry. Temperature rescales the logits (lower values sharpen the distribution toward greedy decoding), top-k keeps only the k highest-scoring candidates, and top-p (nucleus sampling) keeps the smallest set of candidates whose cumulative probability reaches p. The next token is then drawn from the renormalized distribution.
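A sketch of one decoding step combining all three knobs. The sample function and the logit values are invented for illustration; production samplers typically layer repetition penalties and other filters on top:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// One decoding step: logits -> next token id.
int sample(std::vector<float> logits, float temperature, int top_k,
           float top_p, std::mt19937& rng) {
    const int n = (int)logits.size();
    std::vector<int> idx(n);
    for (int i = 0; i < n; ++i) idx[i] = i;

    for (float& l : logits) l /= temperature;      // temperature rescaling

    // sort candidate ids by logit, descending, then apply the top-k cutoff
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return logits[a] > logits[b]; });
    if (top_k > 0 && top_k < n) idx.resize(top_k);

    // softmax over the surviving candidates
    std::vector<float> probs(idx.size());
    float max_l = logits[idx[0]], z = 0.f;
    for (size_t i = 0; i < idx.size(); ++i) {
        probs[i] = std::exp(logits[idx[i]] - max_l);
        z += probs[i];
    }
    for (float& p : probs) p /= z;

    // top-p cutoff: keep the smallest prefix whose mass reaches top_p
    float cum = 0.f;
    size_t keep = probs.size();
    for (size_t i = 0; i < probs.size(); ++i) {
        cum += probs[i];
        if (cum >= top_p) { keep = i + 1; break; }
    }
    probs.resize(keep);
    idx.resize(keep);

    // draw (discrete_distribution renormalizes the weights itself)
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return idx[dist(rng)];
}

int main() {
    std::mt19937 rng(42);
    std::vector<float> logits = {2.0f, 1.0f, 0.5f, -1.0f};  // made-up logits
    for (int step = 0; step < 5; ++step)
        printf("sampled token %d\n", sample(logits, 0.8f, 3, 0.9f, rng));
    return 0;
}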

Code Walkthrough

The pieces above compose into a short loop: run the forward pass, pick one token, append it to the context, and repeat until a stop token or a length limit. The sketch below mirrors the shape of the decode loop in an engine like llama.cpp, but it is a self-contained toy: forward() is a stand-in for the transformer stack, not llama.cpp's actual API, and the token ids are made up.
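#include <cstdio>
#include <vector>

// Toy stand-in for the model's forward pass: given the context so far,
// return logits over a 4-token vocabulary. A real engine would run the
// transformer stack here, reusing its KV cache so each step costs one
// token of work rather than the whole context.
std::vector<float> forward(const std::vector<int>& ctx) {
    std::vector<float> logits(4, 0.0f);
    logits[(ctx.back() + 1) % 4] = 5.0f;   // fake rule: favor the next id
    return logits;
}

// Greedy argmax decoding: the simplest sampler (temperature -> 0).
int argmax(const std::vector<float>& v) {
    int best = 0;
    for (int i = 1; i < (int)v.size(); ++i)
        if (v[i] > v[best]) best = i;
    return best;
}

int main() {
    std::vector<int> ctx = {0};            // prompt: already-tokenized ids
    const int eos = 3, max_new = 8;        // invented stop id and limit
    for (int i = 0; i < max_new; ++i) {    // the decode loop
        int next = argmax(forward(ctx));   // one forward pass per token
        ctx.push_back(next);
        printf("step %d -> token %d\n", i, next);
        if (next == eos) break;            // stop token ends generation
    }
    return 0;
}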

Conclusion

Tokenization and sampling are cheap; the transformer layers dominate the cost. Processing the prompt (prefill) is compute-bound and parallel across positions, while generating tokens one at a time (decode) is largely memory-bandwidth-bound: each step streams the model weights and the KV cache through the chip to produce a single token. That is why batching, quantization, and careful KV-cache management are the main levers for faster inference.


Have thoughts or questions? Reach out on X.