LLM Serving
These notes summarize practical techniques for serving large language models (LLMs) in production systems such as vLLM, SGLang, TensorRT-LLM, and TGI.
What LLM Serving Optimizes
Serving is not only "run model.forward() repeatedly". A production engine must optimize:
- Throughput: tokens per second across all users.
- Latency: time-to-first-token (TTFT) and inter-token latency (ITL).
- Cost: GPU utilization, memory footprint, and energy.
- Stability: predictable tail latency (p95/p99), fairness, and OOM safety.
In practice, there is always a trade-off between throughput and latency. The scheduler decides where to sit on that curve.
Two Execution Phases: Prefill and Decode
Each request has two very different phases:
| Phase | What it does | Dominant bottleneck | Typical symptom |
|---|---|---|---|
| Prefill | Processes the full input prompt and builds KV cache for all prompt tokens | Compute-bound | High streaming multiprocessor (SM) utilization, large general matrix multiplications (GEMMs) |
| Decode | Generates one (or a few) new token(s) per step and appends KV incrementally | Memory-bandwidth and memory-access bound | Lower arithmetic intensity, frequent KV reads |
Important clarification:
- Both prefill and decode compute new K/V tensors.
- The difference is execution shape:
- Prefill uses large matrix operations over long sequences (high compute intensity).
- Decode uses small per-step operations and repeatedly reads historical KV cache (memory pressure dominates).
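The execution-shape difference can be made concrete with a back-of-envelope arithmetic-intensity comparison for attention. The model dimensions below are illustrative assumptions, not measurements:

```python
# Rough arithmetic-intensity comparison of attention in prefill vs. decode,
# using assumed (illustrative) model dimensions.

def attn_flops_prefill(seq_len, d_head, n_heads):
    # QK^T plus attention-weighted V: ~4 * S^2 * d FLOPs per head.
    return n_heads * 4 * seq_len * seq_len * d_head

def attn_flops_decode_step(kv_len, d_head, n_heads):
    # One query token attends over kv_len cached tokens.
    return n_heads * 4 * kv_len * d_head

def kv_bytes_read_decode_step(kv_len, d_head, n_heads, bytes_per_elem=2):
    # Each decode step re-reads the full K and V history (FP16 assumed).
    return 2 * n_heads * kv_len * d_head * bytes_per_elem

S, D, H = 4096, 128, 32   # assumed prompt length, head dim, head count
prefill_flops = attn_flops_prefill(S, D, H)
decode_flops = attn_flops_decode_step(S, D, H)
decode_bytes = kv_bytes_read_decode_step(S, D, H)

# Prefill does ~S times more attention work per step than decode,
# while decode moves nearly as many bytes as it computes FLOPs.
print(prefill_flops / decode_flops)
print(decode_flops / decode_bytes)
```

With these numbers decode lands at roughly one attention FLOP per byte of KV traffic, which is why decode throughput tracks memory bandwidth rather than peak compute.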
Core Components of an Inference Engine
The two most critical components are:
- Scheduler
- Inference kernels
In modern systems, a complete serving stack usually includes:
- Frontend/runtime: request parsing, tokenization, streaming responses.
- Scheduler: admission control, batching, fairness, preemption.
- Memory manager: KV cache allocation, paging, compaction, offload.
- Model executor: CUDA/Triton kernels, NCCL collectives, graph execution.
- Observability: metrics, tracing, debugging, autoscaling signals.
Scheduler Design (The Real Heart of Throughput)
Continuous Batching
Static batching waits for a fixed batch; continuous batching fills and refills the batch every decode step.
- Better GPU occupancy.
- Lower idle gaps between requests.
- Higher global throughput under bursty traffic.
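The refill-every-step idea can be sketched in a few lines. This is a toy scheduler loop, not any engine's actual implementation; `MAX_BATCH` and the request shapes are assumptions:

```python
# Minimal sketch of continuous batching: the running batch is refilled from
# the wait queue at every decode step instead of waiting for a whole batch
# to drain, so freed slots are reused immediately.
from collections import deque

MAX_BATCH = 4  # assumed batch-slot limit

def step(batch):
    """Decode one token for every request; finished requests leave the batch."""
    still_running = []
    for req in batch:
        req["generated"] += 1
        if req["generated"] < req["max_new_tokens"]:
            still_running.append(req)
    return still_running

queue = deque({"id": i, "generated": 0, "max_new_tokens": 2 + i} for i in range(6))
batch, steps = [], 0
while queue or batch:
    while queue and len(batch) < MAX_BATCH:  # refill before every step
        batch.append(queue.popleft())
    batch = step(batch)
    steps += 1
print(steps)
```

Static batching would run the same six requests as two full batches and pay for the longest request in each; the per-step refill keeps slots occupied as soon as a short request finishes.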
Mixed Prefill/Decode Scheduling
If prefill jobs are always admitted aggressively, decode users may experience unstable ITL. If decode is over-prioritized, TTFT becomes poor.
Common policies:
- Token-budget-based scheduling per iteration.
- Decode-priority under strict ITL SLO.
- Chunked prefill to avoid one long prompt monopolizing the GPU.
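The token-budget and chunked-prefill policies above can be combined in one per-iteration planner. This is an illustrative policy, not any specific engine's scheduler; `TOKEN_BUDGET` and `PREFILL_CHUNK` are assumed values:

```python
# Sketch of token-budget scheduling with chunked prefill: each iteration has
# a fixed token budget; decode requests (1 token each) are admitted first to
# protect ITL, and the leftover budget is spent on bounded prefill chunks so
# no long prompt monopolizes a step.

TOKEN_BUDGET = 512    # assumed per-iteration token budget
PREFILL_CHUNK = 256   # assumed max prompt tokens per request per iteration

def plan_iteration(decode_reqs, prefill_reqs):
    budget = TOKEN_BUDGET
    plan = []
    for r in decode_reqs:                 # decode first: ITL protection
        if budget == 0:
            break
        plan.append((r["id"], "decode", 1))
        budget -= 1
    for r in prefill_reqs:                # remainder goes to chunked prefill
        if budget == 0:
            break
        chunk = min(PREFILL_CHUNK, r["remaining_prompt"], budget)
        plan.append((r["id"], "prefill", chunk))
        budget -= chunk
    return plan

decode = [{"id": i} for i in range(100)]
prefill = [{"id": "p0", "remaining_prompt": 2000},
           {"id": "p1", "remaining_prompt": 64}]
plan = plan_iteration(decode, prefill)
decode_tokens = sum(n for _, kind, n in plan if kind == "decode")
prefill_tokens = sum(n for _, kind, n in plan if kind == "prefill")
print(decode_tokens, prefill_tokens)
```

The 2000-token prompt only receives a 256-token chunk this iteration; it finishes prefill over several iterations while decode users keep a stable per-step token share.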
Disaggregated Prefill/Decode
Large deployments may run prefill and decode on different GPU pools:
- Prefill pool optimized for high compute throughput.
- Decode pool optimized for KV residency and bandwidth.
This can improve SLO control but introduces transfer/synchronization overhead and more complex routing.
KV Cache: The Main Memory Problem
KV cache stores attention keys and values for past tokens. It is the dominant memory consumer in long-context serving.
Approximate KV memory:
$$ \text{KV bytes} \approx 2 \cdot B \cdot S \cdot L \cdot H_{kv} \cdot D_h \cdot \text{bytes per element} $$
where $B$ is active batch, $S$ is sequence length, $L$ is layer count, $H_{kv}$ is KV heads, and $D_h$ is head dimension.
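Plugging illustrative numbers into this formula shows why KV cache dominates. The shapes below are assumed (roughly 7B-class, full multi-head KV, FP16 cache):

```python
# KV memory estimate from the formula above, with assumed model shapes.
B = 32               # active batch
S = 4096             # sequence length
L = 32               # layers
H_kv = 32            # KV heads (no grouped-query attention assumed)
D_h = 128            # head dimension
bytes_per_elem = 2   # FP16

kv_bytes = 2 * B * S * L * H_kv * D_h * bytes_per_elem  # 2 = K and V
print(kv_bytes / 2**30, "GiB")
```

Under these assumptions the KV cache alone needs 64 GiB, more than the FP16 weights of the model itself, which is why paging, GQA, and quantized KV all target this term.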
Paged KV Cache
Paged layouts (for example, paged attention) avoid large contiguous allocations:
- Reduce fragmentation.
- Support dynamic request lengths.
- Enable faster allocation/free operations.
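The core of a paged layout is a fixed-size block allocator with per-request block tables. This is a minimal sketch of the idea, not vLLM's actual allocator; `BLOCK_TOKENS` is an assumed value:

```python
# Minimal paged-KV block allocator: fixed-size blocks from a free list, so
# variable-length sequences never need one large contiguous region.

BLOCK_TOKENS = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}                     # request id -> list of block ids

    def ensure_capacity(self, req_id, seq_len):
        """Grow the request's block table to cover seq_len tokens."""
        table = self.tables.setdefault(req_id, [])
        needed = -(-seq_len // BLOCK_TOKENS)  # ceil division
        while len(table) < needed:
            table.append(self.free.pop())     # O(1) allocation, any free block
        return table

    def release(self, req_id):
        # Freeing is O(1) per block and leaves no holes to compact.
        self.free.extend(self.tables.pop(req_id, []))

alloc = BlockAllocator(num_blocks=8)
alloc.ensure_capacity("a", seq_len=40)   # ceil(40/16) = 3 blocks
alloc.ensure_capacity("b", seq_len=10)   # 1 block
print(len(alloc.free))
alloc.release("a")
print(len(alloc.free))
```

Because blocks are interchangeable, a request's KV can be physically scattered; the attention kernel follows the block table instead of assuming contiguity.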
Prefix Caching
When many requests share prompt prefixes (system prompt, retrieved context templates), reuse existing prefix KV:
- Reduces prefill FLOPs.
- Improves TTFT for repeated prompts.
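One common realization is block-level prefix caching: a KV block is keyed by a hash of all tokens up to and including that block, so any request starting with the same tokens can reuse it. The sketch below is illustrative (the block size and dict-based cache are assumptions):

```python
# Sketch of block-level prefix caching: cached KV blocks are keyed by the
# token prefix they cover, so requests sharing a prompt prefix skip that
# part of prefill entirely.

BLOCK = 4  # tokens per block (illustrative)

cache = {}  # tuple(prefix tokens) -> cached KV block handle

def insert(tokens):
    for i in range(BLOCK, len(tokens) + 1, BLOCK):
        cache.setdefault(tuple(tokens[:i]), object())

def lookup_prefix(tokens):
    """Return how many prompt tokens are already covered by cached blocks."""
    covered = 0
    for i in range(BLOCK, len(tokens) + 1, BLOCK):
        if tuple(tokens[:i]) not in cache:
            break
        covered = i
    return covered

system_prompt = list(range(12))                 # shared 12-token system prompt
insert(system_prompt + [101, 102, 103, 104])    # first request populates cache
hit = lookup_prefix(system_prompt + [201, 202]) # second request shares prefix
print(hit)
```

The second request reuses the 12 system-prompt tokens' KV and only prefills its own suffix, which is exactly the TTFT win described above. A real implementation also needs reference counting and eviction, omitted here.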
Offloading Strategy
Offloading can target:
- KV cache pages (GPU <-> CPU memory).
- Model weights (for very constrained GPU memory).
Design rule:
- Offload only when transfer overhead is lower than recomputation or eviction cost.
- For small models, weight offloading often hurts latency because PCIe/NVLink transfers dominate compute.
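The design rule reduces to comparing two times. The sketch below does that back-of-envelope check; every constant (bandwidth, achievable FLOPs, KV size, model size) is an assumption for illustration:

```python
# Back-of-envelope check for the offload rule: restore KV from CPU only when
# the transfer time beats recomputing the prefill. All numbers are assumed.

PCIE_BPS = 25e9      # effective PCIe gen4 x16 bandwidth, bytes/s (assumed)
GPU_FLOPS = 150e12   # achievable FP16 throughput, FLOP/s (assumed)

def transfer_seconds(kv_bytes):
    return kv_bytes / PCIE_BPS

def recompute_seconds(prompt_tokens, n_params):
    # ~2 FLOPs per parameter per token for a forward pass.
    return 2 * n_params * prompt_tokens / GPU_FLOPS

kv_bytes = 2e9   # ~2 GB of KV pages for a long prompt (assumed)
t_xfer = transfer_seconds(kv_bytes)
t_recompute = recompute_seconds(prompt_tokens=8192, n_params=7e9)
print(t_xfer, t_recompute)
```

Under these assumptions restoring from CPU (~0.08 s) clearly beats re-running an 8K-token prefill (~0.76 s), so offload wins; shrink the prompt or the model enough and the inequality flips, which is the small-model warning above.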
Kernel-Level Optimization
Schedulers expose parallelism; kernels realize it efficiently.
Common kernel optimizations:
- Fused attention kernels (FlashAttention-style implementations).
- Fused RMSNorm + linear + activation where possible.
- Quantized GEMM kernels (INT8, FP8, AWQ/GPTQ-compatible paths).
- Efficient gather/scatter for paged KV layouts.
- CUDA Graph capture to reduce launch overhead in stable shapes.
Implementation choices:
- Handwritten CUDA/C++ for peak performance hotspots.
- Triton kernels for faster iteration and maintainability.
Both are widely used in real serving frameworks.
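A quick roofline-style check explains where these kernel optimizations pay off. The hardware numbers below are assumptions, not any specific GPU's spec sheet:

```python
# Roofline-style classification of GEMM shapes (illustrative hardware numbers):
# shows why decode-time GEMMs gain more from quantization (fewer bytes moved)
# than from extra FLOPs.

PEAK_FLOPS = 300e12  # assumed FP16 tensor-core peak, FLOP/s
PEAK_BW = 2e12       # assumed HBM bandwidth, bytes/s

def gemm_bound(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # A, B, C traffic
    intensity = flops / bytes_moved                          # FLOPs per byte
    ridge = PEAK_FLOPS / PEAK_BW   # intensity where the roofline bends
    return "compute-bound" if intensity > ridge else "bandwidth-bound"

print(gemm_bound(m=4096, n=4096, k=4096))  # prefill-like GEMM
print(gemm_bound(m=8,    n=4096, k=4096))  # decode-like GEMM (tiny batch)
```

The decode-shaped GEMM sits far below the ridge point, so halving weight bytes (INT8/FP8) speeds it up almost linearly, while the prefill-shaped GEMM mostly cares about tensor-core throughput and fusion.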
Parallelism and Multi-GPU Scaling
When one GPU is insufficient, combine parallelism methods:
- Tensor parallelism: split weight matrices across GPUs.
- Pipeline parallelism: split layers across stages.
- Expert parallelism (MoE): distribute experts and route tokens.
- Data parallel replicas: increase request-level capacity.
Trade-off reminder: stronger parallelism improves capacity but increases communication overhead and tail-latency sensitivity.
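For sizing, the first-order effect of tensor parallelism is dividing weight memory per GPU. The sketch below ignores activations, KV cache, and communication buffers, and the 70B/FP16 figures are assumptions:

```python
# First-order memory effect of tensor parallelism (TP): weights are sharded
# across the TP group. Ignores KV cache, activations, and NCCL buffers.

def weights_per_gpu_gib(n_params, bytes_per_param, tp_degree):
    return n_params * bytes_per_param / tp_degree / 2**30

# Assumed 70B-parameter model served in FP16:
for tp in (1, 2, 4, 8):
    print(tp, round(weights_per_gpu_gib(70e9, 2, tp), 1))
```

TP=8 brings an FP16 70B model from ~130 GiB down to ~16 GiB per GPU, but adds an all-reduce per transformer layer, which is the communication/tail-latency cost flagged above.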
Why "GPU Optimizations" May Not Help Inference
Some optimizations are excellent for training but weak for serving.
Examples:
- Training-oriented fused backward kernels are irrelevant in pure inference.
- Throughput-only benchmarks can hide terrible TTFT or p99 ITL.
- Large static batches look great offline but underperform with real online traffic.
Serving optimization must be workload-aware, not benchmark-only.
Why Small Models May Underperform in vLLM/SGLang Workloads
A more accurate statement than "these engines cannot support small models" is:
- Small models can be served, but speedup may be smaller than expected in end-to-end latency.
Reasons:
- Framework overhead (scheduler, tokenization, networking, streaming) becomes a larger fraction of total latency.
- GPU may be underutilized due to tiny kernels and launch overhead.
- Memory and synchronization overhead may dominate arithmetic.
- Quantization/dequantization and host-device copies can erase compute gains.
Practical implication: choose the serving stack and batching policy based on traffic pattern and SLO, not only model parameter count.
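A toy latency decomposition makes the overhead argument concrete. Both the fixed per-token overhead and the compute times below are assumed numbers for illustration:

```python
# Toy per-token latency decomposition (assumed numbers): fixed framework
# overhead does not shrink with model size, so it dominates small models.

OVERHEAD_MS = 4.0  # scheduler + tokenizer + streaming per token (assumed)

def per_token_latency_ms(model_compute_ms):
    return model_compute_ms + OVERHEAD_MS

big = per_token_latency_ms(20.0)    # large model: compute dominates
small = per_token_latency_ms(1.0)   # small model: overhead dominates
print(OVERHEAD_MS / big, OVERHEAD_MS / small)
```

With these assumptions, overhead is ~17% of per-token latency for the large model but 80% for the small one, so a 2x faster kernel barely moves the small model's end-to-end numbers.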
Practical Tuning Checklist
- Define SLO first: TTFT target, ITL target, and p99 bounds.
- Tune scheduler before deep kernel work.
- Enable paged KV and monitor fragmentation.
- Use prefix caching for repeated prompts.
- Set max input/output lengths to protect tail latency.
- Evaluate quantization with quality checks, not speed only.
- Profile end-to-end: tokenizer, runtime, scheduler, kernels, network.
- Separate benchmarking by workload class: short-chat, long-context QA, tool-calling, batch offline generation.
Key Metrics to Report
- TTFT (p50/p95/p99)
- ITL (p50/p95/p99)
- Output tokens/s (aggregate and per GPU)
- Request throughput (req/s)
- GPU memory usage and KV cache hit rate
- Scheduler queue depth and rejection rate
- Cost per million output tokens
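The percentile metrics above should be computed from raw per-step samples, since averages hide stragglers. A minimal nearest-rank sketch (the sample values are made up):

```python
# Computing latency percentiles from raw samples with the nearest-rank
# method; a single straggler step shows why p99 diverges from p50.
import math

def percentile(samples, p):
    s = sorted(samples)
    rank = math.ceil(p / 100 * len(s))  # nearest-rank definition
    return s[max(rank - 1, 0)]

itl_ms = [12, 11, 13, 12, 95, 12, 11, 14, 13, 12]  # one 95 ms straggler
print(percentile(itl_ms, 50), percentile(itl_ms, 99))
```

Here p50 ITL is a healthy 12 ms while p99 is 95 ms; reporting only the mean (~20 ms) would mask the tail behavior the SLO actually cares about.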
Summary
LLM serving performance is mostly a systems problem:
- Prefill and decode have different bottlenecks.
- Scheduler + KV memory management usually matter more than isolated kernel micro-optimizations.
- Kernel quality is still crucial, especially for attention and quantized GEMM.
- Small-model serving needs careful system-level tuning because overheads can dominate compute.
When analyzing any serving paper, focus on three questions first:
- What scheduling policy is used?
- How is KV cache allocated and moved?
- Which latency metrics improved under realistic traffic?