LLM Serving
These notes summarize practical techniques for serving large language models (LLMs) in production systems such as vLLM, SGLang, TensorRT-LLM, and TGI.
What LLM Serving Optimizes
Serving is not only "run model.forward() repeatedly". A production engine must optimize:
- Throughput: tokens per second across all users.
- Latency: time-to-first-token (TTFT) and inter-token latency (ITL).
- Cost: GPU utilization, memory footprint, and energy.
- Stability: predictable tail latency (p95/p99), fairness, and OOM safety.
In practice, there is always a trade-off between throughput and latency. The scheduler decides where to sit on that curve.
Two Execution Phases: Prefill and Decode
Each request has two very different phases:
| Phase | What it does | Dominant bottleneck | Typical symptom |
|---|---|---|---|
| Prefill | Processes the full input prompt and builds KV cache for all prompt tokens | Compute-bound | High streaming multiprocessor (SM) utilization, large general matrix multiplications (GEMMs) |
| Decode | Generates one (or a few) new token(s) per step and appends KV incrementally | Memory-bandwidth and memory-access bound | Lower arithmetic intensity, frequent KV reads |
Important clarification:
- Both prefill and decode compute new K/V tensors.
- The difference is execution shape:
- Prefill uses large matrix operations over long sequences (high compute intensity).
- Decode uses small per-step operations and repeatedly reads historical KV cache (memory pressure dominates).
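The execution-shape difference can be made concrete with a back-of-envelope arithmetic-intensity comparison for attention. The model dimensions below are illustrative assumptions, not measurements:

```python
# Rough arithmetic-intensity comparison of attention in prefill vs. decode,
# using assumed (illustrative) model dimensions.

def attn_flops_prefill(seq_len, d_head, n_heads):
    # QK^T plus attention-weighted V: ~4 * S^2 * d FLOPs per head.
    return n_heads * 4 * seq_len * seq_len * d_head

def attn_flops_decode_step(kv_len, d_head, n_heads):
    # One query token attends over kv_len cached tokens.
    return n_heads * 4 * kv_len * d_head

def kv_bytes_read_decode_step(kv_len, d_head, n_heads, bytes_per_elem=2):
    # Each decode step re-reads the full K and V history (FP16 assumed).
    return 2 * n_heads * kv_len * d_head * bytes_per_elem

S, D, H = 4096, 128, 32   # assumed prompt length, head dim, head count
prefill_flops = attn_flops_prefill(S, D, H)
decode_flops = attn_flops_decode_step(S, D, H)
decode_bytes = kv_bytes_read_decode_step(S, D, H)

# Prefill does ~S times more attention work per step than decode,
# while decode moves nearly as many bytes as it computes FLOPs.
print(prefill_flops / decode_flops)
print(decode_flops / decode_bytes)
```

With these numbers decode lands at roughly one attention FLOP per byte of KV traffic, which is why decode throughput tracks memory bandwidth rather than peak compute.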
Core Components of an Inference Engine
The two most critical components are:
- Scheduler
- Inference kernels
In modern systems, a complete serving stack usually includes:
- Frontend/runtime: request parsing, tokenization, streaming responses.
- Scheduler: admission control, batching, fairness, preemption.
- Memory manager: KV cache allocation, paging, compaction, offload.
- Model executor: CUDA/Triton kernels, NCCL collectives, graph execution.
- Observability: metrics, tracing, debugging, autoscaling signals.
Scheduler Design (The Real Heart of Throughput)
Continuous Batching
Static batching waits for a fixed batch; continuous batching fills and refills the batch every decode step.
- Better GPU occupancy.
- Lower idle gaps between requests.
- Higher global throughput under bursty traffic.
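The refill-every-step idea can be sketched in a few lines. This is a toy scheduler loop, not any engine's actual implementation; `MAX_BATCH` and the request shapes are assumptions:

```python
# Minimal sketch of continuous batching: the running batch is refilled from
# the wait queue at every decode step instead of waiting for a whole batch
# to drain, so freed slots are reused immediately.
from collections import deque

MAX_BATCH = 4  # assumed batch-slot limit

def step(batch):
    """Decode one token for every request; finished requests leave the batch."""
    still_running = []
    for req in batch:
        req["generated"] += 1
        if req["generated"] < req["max_new_tokens"]:
            still_running.append(req)
    return still_running

queue = deque({"id": i, "generated": 0, "max_new_tokens": 2 + i} for i in range(6))
batch, steps = [], 0
while queue or batch:
    while queue and len(batch) < MAX_BATCH:  # refill before every step
        batch.append(queue.popleft())
    batch = step(batch)
    steps += 1
print(steps)
```

Static batching would run the same six requests as two full batches and pay for the longest request in each; the per-step refill keeps slots occupied as soon as a short request finishes.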
Mixed Prefill/Decode Scheduling
If prefill jobs are always admitted aggressively, decode users may experience unstable ITL. If decode is over-prioritized, TTFT becomes poor.
Common policies:
- Token-budget-based scheduling per iteration.
- Decode-priority under strict ITL SLO.
- Chunked prefill to avoid one long prompt monopolizing the GPU.
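The token-budget and chunked-prefill policies above can be combined in one per-iteration planner. This is an illustrative policy, not any specific engine's scheduler; `TOKEN_BUDGET` and `PREFILL_CHUNK` are assumed values:

```python
# Sketch of token-budget scheduling with chunked prefill: each iteration has
# a fixed token budget; decode requests (1 token each) are admitted first to
# protect ITL, and the leftover budget is spent on bounded prefill chunks so
# no long prompt monopolizes a step.

TOKEN_BUDGET = 512    # assumed per-iteration token budget
PREFILL_CHUNK = 256   # assumed max prompt tokens per request per iteration

def plan_iteration(decode_reqs, prefill_reqs):
    budget = TOKEN_BUDGET
    plan = []
    for r in decode_reqs:                 # decode first: ITL protection
        if budget == 0:
            break
        plan.append((r["id"], "decode", 1))
        budget -= 1
    for r in prefill_reqs:                # remainder goes to chunked prefill
        if budget == 0:
            break
        chunk = min(PREFILL_CHUNK, r["remaining_prompt"], budget)
        plan.append((r["id"], "prefill", chunk))
        budget -= chunk
    return plan

decode = [{"id": i} for i in range(100)]
prefill = [{"id": "p0", "remaining_prompt": 2000},
           {"id": "p1", "remaining_prompt": 64}]
plan = plan_iteration(decode, prefill)
decode_tokens = sum(n for _, kind, n in plan if kind == "decode")
prefill_tokens = sum(n for _, kind, n in plan if kind == "prefill")
print(decode_tokens, prefill_tokens)
```

The 2000-token prompt only receives a 256-token chunk this iteration; it finishes prefill over several iterations while decode users keep a stable per-step token share.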
Disaggregated Prefill/Decode
Large deployments may run prefill and decode on different GPU pools:
- Prefill pool optimized for high compute throughput.
- Decode pool optimized for KV residency and bandwidth.
This can improve SLO control but introduces transfer/synchronization overhead and more complex routing.
KV Cache: The Main Memory Problem
KV cache stores attention keys and values for past tokens. It is the dominant memory consumer in long-context serving.
Approximate KV memory:
$$ \text{KV bytes} \approx 2 \cdot B \cdot S \cdot L \cdot H_{kv} \cdot D_h \cdot \text{bytes per element} $$
where $B$ is active batch, $S$ is sequence length, $L$ is layer count, $H_{kv}$ is KV heads, and $D_h$ is head dimension.
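Plugging illustrative numbers into this formula shows why KV cache dominates. The shapes below are assumed (roughly 7B-class, full multi-head KV, FP16 cache):

```python
# KV memory estimate from the formula above, with assumed model shapes.
B = 32               # active batch
S = 4096             # sequence length
L = 32               # layers
H_kv = 32            # KV heads (no grouped-query attention assumed)
D_h = 128            # head dimension
bytes_per_elem = 2   # FP16

kv_bytes = 2 * B * S * L * H_kv * D_h * bytes_per_elem  # 2 = K and V
print(kv_bytes / 2**30, "GiB")
```

Under these assumptions the KV cache alone needs 64 GiB, more than the FP16 weights of the model itself, which is why paging, GQA, and quantized KV all target this term.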
Paged KV Cache
Paged layouts (for example, paged attention) avoid large contiguous allocations:
- Reduce fragmentation.
- Support dynamic request lengths.
- Enable faster allocation/free operations.
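The core of a paged layout is a fixed-size block allocator with per-request block tables. This is a minimal sketch of the idea, not vLLM's actual allocator; `BLOCK_TOKENS` is an assumed value:

```python
# Minimal paged-KV block allocator: fixed-size blocks from a free list, so
# variable-length sequences never need one large contiguous region.

BLOCK_TOKENS = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}                     # request id -> list of block ids

    def ensure_capacity(self, req_id, seq_len):
        """Grow the request's block table to cover seq_len tokens."""
        table = self.tables.setdefault(req_id, [])
        needed = -(-seq_len // BLOCK_TOKENS)  # ceil division
        while len(table) < needed:
            table.append(self.free.pop())     # O(1) allocation, any free block
        return table

    def release(self, req_id):
        # Freeing is O(1) per block and leaves no holes to compact.
        self.free.extend(self.tables.pop(req_id, []))

alloc = BlockAllocator(num_blocks=8)
alloc.ensure_capacity("a", seq_len=40)   # ceil(40/16) = 3 blocks
alloc.ensure_capacity("b", seq_len=10)   # 1 block
print(len(alloc.free))
alloc.release("a")
print(len(alloc.free))
```

Because blocks are interchangeable, a request's KV can be physically scattered; the attention kernel follows the block table instead of assuming contiguity.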
Prefix Caching
When many requests share prompt prefixes (system prompt, retrieved context templates), reuse existing prefix KV:
- Reduces prefill FLOPs.
- Improves TTFT for repeated prompts.
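One common realization is block-level prefix caching: a KV block is keyed by a hash of all tokens up to and including that block, so any request starting with the same tokens can reuse it. The sketch below is illustrative (the block size and dict-based cache are assumptions):

```python
# Sketch of block-level prefix caching: cached KV blocks are keyed by the
# token prefix they cover, so requests sharing a prompt prefix skip that
# part of prefill entirely.

BLOCK = 4  # tokens per block (illustrative)

cache = {}  # tuple(prefix tokens) -> cached KV block handle

def insert(tokens):
    for i in range(BLOCK, len(tokens) + 1, BLOCK):
        cache.setdefault(tuple(tokens[:i]), object())

def lookup_prefix(tokens):
    """Return how many prompt tokens are already covered by cached blocks."""
    covered = 0
    for i in range(BLOCK, len(tokens) + 1, BLOCK):
        if tuple(tokens[:i]) not in cache:
            break
        covered = i
    return covered

system_prompt = list(range(12))                 # shared 12-token system prompt
insert(system_prompt + [101, 102, 103, 104])    # first request populates cache
hit = lookup_prefix(system_prompt + [201, 202]) # second request shares prefix
print(hit)
```

The second request reuses the 12 system-prompt tokens' KV and only prefills its own suffix, which is exactly the TTFT win described above. A real implementation also needs reference counting and eviction, omitted here.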
Offloading Strategy
Offloading can target:
- KV cache pages (GPU <-> CPU memory).
- Model weights (for very constrained GPU memory).
Design rule:
- Offload only when transfer overhead is lower than recomputation or eviction cost.
- For small models, weight offloading often hurts latency because PCIe/NVLink transfers dominate compute.
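The design rule reduces to comparing two times. The sketch below does that back-of-envelope check; every constant (bandwidth, achievable FLOPs, KV size, model size) is an assumption for illustration:

```python
# Back-of-envelope check for the offload rule: restore KV from CPU only when
# the transfer time beats recomputing the prefill. All numbers are assumed.

PCIE_BPS = 25e9      # effective PCIe gen4 x16 bandwidth, bytes/s (assumed)
GPU_FLOPS = 150e12   # achievable FP16 throughput, FLOP/s (assumed)

def transfer_seconds(kv_bytes):
    return kv_bytes / PCIE_BPS

def recompute_seconds(prompt_tokens, n_params):
    # ~2 FLOPs per parameter per token for a forward pass.
    return 2 * n_params * prompt_tokens / GPU_FLOPS

kv_bytes = 2e9   # ~2 GB of KV pages for a long prompt (assumed)
t_xfer = transfer_seconds(kv_bytes)
t_recompute = recompute_seconds(prompt_tokens=8192, n_params=7e9)
print(t_xfer, t_recompute)
```

Under these assumptions restoring from CPU (~0.08 s) clearly beats re-running an 8K-token prefill (~0.76 s), so offload wins; shrink the prompt or the model enough and the inequality flips, which is the small-model warning above.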
Kernel-Level Optimization
Schedulers expose parallelism; kernels realize it efficiently.
Common kernel optimizations:
- Fused attention kernels (FlashAttention-style implementations).
- Fused RMSNorm + linear + activation where possible.
- Quantized GEMM kernels (INT8, FP8, AWQ/GPTQ-compatible paths).
- Efficient gather/scatter for paged KV layouts.
- CUDA Graph capture to reduce launch overhead in stable shapes.
Implementation choices:
- Handwritten CUDA/C++ for peak performance hotspots.
- Triton kernels for faster iteration and maintainability.
Both are widely used in real serving frameworks.
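A quick roofline-style check explains where these kernel optimizations pay off. The hardware numbers below are assumptions, not any specific GPU's spec sheet:

```python
# Roofline-style classification of GEMM shapes (illustrative hardware numbers):
# shows why decode-time GEMMs gain more from quantization (fewer bytes moved)
# than from extra FLOPs.

PEAK_FLOPS = 300e12  # assumed FP16 tensor-core peak, FLOP/s
PEAK_BW = 2e12       # assumed HBM bandwidth, bytes/s

def gemm_bound(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # A, B, C traffic
    intensity = flops / bytes_moved                          # FLOPs per byte
    ridge = PEAK_FLOPS / PEAK_BW   # intensity where the roofline bends
    return "compute-bound" if intensity > ridge else "bandwidth-bound"

print(gemm_bound(m=4096, n=4096, k=4096))  # prefill-like GEMM
print(gemm_bound(m=8,    n=4096, k=4096))  # decode-like GEMM (tiny batch)
```

The decode-shaped GEMM sits far below the ridge point, so halving weight bytes (INT8/FP8) speeds it up almost linearly, while the prefill-shaped GEMM mostly cares about tensor-core throughput and fusion.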
Parallelism and Multi-GPU Scaling
When one GPU is insufficient, combine parallelism methods:
- Tensor parallelism: split weight matrices across GPUs.
- Pipeline parallelism: split layers across stages.
- Expert parallelism (MoE): distribute experts and route tokens.
- Data parallel replicas: increase request-level capacity.
Trade-off reminder: stronger parallelism improves capacity but increases communication overhead and tail-latency sensitivity.
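For sizing, the first-order effect of tensor parallelism is dividing weight memory per GPU. The sketch below ignores activations, KV cache, and communication buffers, and the 70B/FP16 figures are assumptions:

```python
# First-order memory effect of tensor parallelism (TP): weights are sharded
# across the TP group. Ignores KV cache, activations, and NCCL buffers.

def weights_per_gpu_gib(n_params, bytes_per_param, tp_degree):
    return n_params * bytes_per_param / tp_degree / 2**30

# Assumed 70B-parameter model served in FP16:
for tp in (1, 2, 4, 8):
    print(tp, round(weights_per_gpu_gib(70e9, 2, tp), 1))
```

TP=8 brings an FP16 70B model from ~130 GiB down to ~16 GiB per GPU, but adds an all-reduce per transformer layer, which is the communication/tail-latency cost flagged above.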
Why "GPU Optimizations" May Not Help Inference
Some optimizations are excellent for training but weak for serving.
Examples:
- Training-oriented fused backward kernels are irrelevant in pure inference.
- Throughput-only benchmarks can hide terrible TTFT or p99 ITL.
- Large static batches look great offline but underperform with real online traffic.
Serving optimization must be workload-aware, not benchmark-only.
Why Small Models May Underperform in vLLM/SGLang Workloads
A more accurate statement than "these engines cannot support small models" is:
- Small models can be served, but speedup may be smaller than expected in end-to-end latency.
Reasons:
- Framework overhead (scheduler, tokenization, networking, streaming) becomes a larger fraction of total latency.
- GPU may be underutilized due to tiny kernels and launch overhead.
- Memory and synchronization overhead may dominate arithmetic.
- Quantization/dequantization and host-device copies can erase compute gains.
Practical implication: choose the serving stack and batching policy based on traffic pattern and SLO, not only model parameter count.
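A toy latency decomposition makes the overhead argument concrete. Both the fixed per-token overhead and the compute times below are assumed numbers for illustration:

```python
# Toy per-token latency decomposition (assumed numbers): fixed framework
# overhead does not shrink with model size, so it dominates small models.

OVERHEAD_MS = 4.0  # scheduler + tokenizer + streaming per token (assumed)

def per_token_latency_ms(model_compute_ms):
    return model_compute_ms + OVERHEAD_MS

big = per_token_latency_ms(20.0)    # large model: compute dominates
small = per_token_latency_ms(1.0)   # small model: overhead dominates
print(OVERHEAD_MS / big, OVERHEAD_MS / small)
```

With these assumptions, overhead is ~17% of per-token latency for the large model but 80% for the small one, so a 2x faster kernel barely moves the small model's end-to-end numbers.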
Practical Tuning Checklist
- Define SLO first: TTFT target, ITL target, and p99 bounds.
- Tune scheduler before deep kernel work.
- Enable paged KV and monitor fragmentation.
- Use prefix caching for repeated prompts.
- Set max input/output lengths to protect tail latency.
- Evaluate quantization with quality checks, not speed only.
- Profile end-to-end: tokenizer, runtime, scheduler, kernels, network.
- Separate benchmarking by workload class: short-chat, long-context QA, tool-calling, batch offline generation.
Key Metrics to Report
- TTFT (p50/p95/p99)
- ITL (p50/p95/p99)
- Output tokens/s (aggregate and per GPU)
- Request throughput (req/s)
- GPU memory usage and KV cache hit rate
- Scheduler queue depth and rejection rate
- Cost per million output tokens
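The percentile metrics above should be computed from raw per-step samples, since averages hide stragglers. A minimal nearest-rank sketch (the sample values are made up):

```python
# Computing latency percentiles from raw samples with the nearest-rank
# method; a single straggler step shows why p99 diverges from p50.
import math

def percentile(samples, p):
    s = sorted(samples)
    rank = math.ceil(p / 100 * len(s))  # nearest-rank definition
    return s[max(rank - 1, 0)]

itl_ms = [12, 11, 13, 12, 95, 12, 11, 14, 13, 12]  # one 95 ms straggler
print(percentile(itl_ms, 50), percentile(itl_ms, 99))
```

Here p50 ITL is a healthy 12 ms while p99 is 95 ms; reporting only the mean (~20 ms) would mask the tail behavior the SLO actually cares about.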
Summary
LLM serving performance is mostly a systems problem:
- Prefill and decode have different bottlenecks.
- Scheduler + KV memory management usually matter more than isolated kernel micro-optimizations.
- Kernel quality is still crucial, especially for attention and quantized GEMM.
- Small-model serving needs careful system-level tuning because overheads can dominate compute.
When analyzing any serving paper, focus on three questions first:
- What scheduling policy is used?
- How is KV cache allocated and moved?
- Which latency metrics improved under realistic traffic?