LLM Serving

These notes summarize practical techniques for serving large language models (LLMs) in production systems such as vLLM, SGLang, TensorRT-LLM, and TGI.

What LLM Serving Optimizes

Serving is more than running model.forward() in a loop. A production engine must optimize:

  • Throughput: tokens per second across all users.
  • Latency: time-to-first-token (TTFT) and inter-token latency (ITL).
  • Cost: GPU utilization, memory footprint, and energy.
  • Stability: predictable tail latency (p95/p99), fairness, and OOM safety.

In practice, there is always a trade-off between throughput and latency. The scheduler decides where to sit on that curve.

Two Execution Phases: Prefill and Decode

Each request has two very different phases:

| Phase | What it does | Dominant bottleneck | Typical symptom |
| --- | --- | --- | --- |
| Prefill | Processes the full input prompt and builds the KV cache for all prompt tokens | Compute-bound | High streaming multiprocessor (SM) utilization; large general matrix multiplications (GEMMs) |
| Decode | Generates one (or a few) new token(s) per step and appends KV incrementally | Memory-bandwidth and memory-access bound | Low arithmetic intensity; frequent KV reads |

Important clarification:

  • Both prefill and decode compute new K/V tensors.
  • The difference is execution shape:
    • Prefill uses large matrix operations over long sequences (high compute intensity).
    • Decode uses small per-step operations and repeatedly reads historical KV cache (memory pressure dominates).
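The execution-shape difference can be made concrete with a back-of-envelope arithmetic-intensity calculation. This is a rough sketch; the hidden size and prompt length below are illustrative, not tied to any specific model:

```python
def gemm_arithmetic_intensity(m, k, n, bytes_per_el=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) GEMM in fp16/bf16."""
    flops = 2 * m * k * n                                  # one multiply-add per (k) per output element
    bytes_moved = (m * k + k * n + m * n) * bytes_per_el   # read A, read B, write C
    return flops / bytes_moved

d = 4096                                          # hypothetical hidden size
prefill = gemm_arithmetic_intensity(2048, d, d)   # whole 2048-token prompt in one GEMM
decode = gemm_arithmetic_intensity(1, d, d)       # one token per decode step
```

With these numbers, the prefill GEMM has roughly three orders of magnitude higher arithmetic intensity than the per-step decode GEMV, which is why prefill saturates compute while decode sits on memory bandwidth.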

Core Components of an Inference Engine

The two most critical components are:

  1. Scheduler
  2. Inference kernels

In modern systems, a complete serving stack usually includes:

  • Frontend/runtime: request parsing, tokenization, streaming responses.
  • Scheduler: admission control, batching, fairness, preemption.
  • Memory manager: KV cache allocation, paging, compaction, offload.
  • Model executor: CUDA/Triton kernels, NCCL collectives, graph execution.
  • Observability: metrics, tracing, debugging, autoscaling signals.

Scheduler Design (The Real Heart of Throughput)

Continuous Batching

Static batching waits for a fixed batch; continuous batching fills and refills the batch every decode step.

  • Better GPU occupancy.
  • Lower idle gaps between requests.
  • Higher global throughput under bursty traffic.
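The mechanic is easy to see in a toy simulation (a sketch, not any engine's actual scheduler): at every decode step, finished requests leave the batch and queued requests immediately take their slots.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy simulation: each request is (id, tokens_to_generate).
    New requests join the batch as soon as a slot frees up."""
    queue = deque(requests)
    running = {}                      # request_id -> remaining tokens
    finished_order = []
    while queue or running:
        # refill the batch every step (the key difference from static batching)
        while queue and len(running) < max_batch:
            rid, n = queue.popleft()
            running[rid] = n
        for rid in list(running):     # one decode step for everyone in the batch
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished_order.append(rid)
    return finished_order
```

A static batcher would make "d" and "e" wait until the whole first batch drained; here they start as soon as "a" finishes.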

Mixed Prefill/Decode Scheduling

If prefill jobs are always admitted aggressively, decode users may experience unstable ITL. If decode is over-prioritized, TTFT becomes poor.

Common policies:

  • Token-budget-based scheduling per iteration.
  • Decode-priority under strict ITL SLO.
  • Chunked prefill to avoid one long prompt monopolizing the GPU.
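A token-budget policy with chunked prefill can be sketched as follows. The policy shown (decode first, then prefill chunks up to the budget) is one reasonable choice, not the exact policy of any particular engine:

```python
def plan_iteration(decode_reqs, prefill_reqs, token_budget=512, chunk=256):
    """Build one iteration's work plan under a per-step token budget.
    Decode steps are admitted first (1 token each, protecting ITL);
    the remaining budget goes to prefill in fixed-size chunks."""
    plan, budget = [], token_budget
    for rid in decode_reqs:
        if budget == 0:
            break
        plan.append((rid, "decode", 1))
        budget -= 1
    for rid, prompt_len in prefill_reqs:
        if budget == 0:
            break
        take = min(chunk, prompt_len, budget)   # chunking: never the whole long prompt
        plan.append((rid, "prefill", take))
        budget -= take
    return plan
```

Because a long prompt is split into `chunk`-sized slices across iterations, ongoing decode requests keep getting a step every iteration instead of stalling behind one large prefill.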

Disaggregated Prefill/Decode

Large deployments may run prefill and decode on different GPU pools:

  • Prefill pool optimized for high compute throughput.
  • Decode pool optimized for KV residency and bandwidth.

This can improve SLO control but introduces transfer/synchronization overhead and more complex routing.

KV Cache: The Main Memory Problem

KV cache stores attention keys and values for past tokens. It is the dominant memory consumer in long-context serving.

Approximate KV memory:

$$ \text{KV bytes} \approx B \cdot S \cdot 2 \cdot L \cdot H_{kv} \cdot D_h \cdot \text{bytes per element} $$

where $B$ is the active batch size, $S$ the sequence length, $L$ the number of layers, $H_{kv}$ the number of KV heads, and $D_h$ the head dimension; the factor 2 accounts for keys and values.
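The formula translates directly into code. The model shape below is only illustrative (roughly a 7B-class model with full multi-head KV; check your model's config for the real values):

```python
def kv_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_el=2):
    """KV cache footprint per the formula above; 2x covers keys and values."""
    return batch * seq_len * 2 * layers * kv_heads * head_dim * bytes_per_el

# Illustrative numbers shaped like a 7B model, fp16 cache (assumed, not exact):
gib = kv_bytes(batch=8, seq_len=4096, layers=32,
               kv_heads=32, head_dim=128, bytes_per_el=2) / 2**30
```

Eight concurrent 4K-token requests already consume 16 GiB of KV cache alone, which is why KV memory, not weights, is usually the binding constraint in long-context serving.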

Paged KV Cache

Paged layouts (for example, paged attention) avoid large contiguous allocations:

  • Reduce fragmentation.
  • Support dynamic request lengths.
  • Enable faster allocation/free operations.
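A minimal paged-KV allocator can be sketched with a free list of fixed-size blocks and a per-request block table, in the spirit of paged attention (a simplified design, not vLLM's actual implementation):

```python
class PagedKVAllocator:
    """Fixed-size KV blocks from a shared pool; each request holds a block table
    mapping its logical sequence to scattered physical blocks."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}    # request_id -> list of physical block ids
        self.lengths = {}   # request_id -> tokens stored so far

    def append_token(self, rid):
        n = self.lengths.get(rid, 0)
        if n % self.block_size == 0:           # last block full (or first token)
            if not self.free:
                raise MemoryError("KV pool exhausted: preempt, evict, or offload")
            self.tables.setdefault(rid, []).append(self.free.pop())
        self.lengths[rid] = n + 1

    def release(self, rid):
        """Return all of a finished request's blocks to the pool in O(1) per block."""
        self.free.extend(self.tables.pop(rid, []))
        self.lengths.pop(rid, None)
```

Because allocation is one block at a time, a request never reserves memory for its (unknown) final length, and freeing is just returning block ids to the pool, which is where the fragmentation and allocation-speed wins come from.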

Prefix Caching

When many requests share prompt prefixes (system prompt, retrieved context templates), reuse existing prefix KV:

  • Reduces prefill FLOPs.
  • Improves TTFT for repeated prompts.
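Lookup is typically done by hashing block-aligned token prefixes. The sketch below uses Python's built-in `hash` for brevity; a real engine would use a stable content hash and tie entries to the paged-KV block table:

```python
def longest_cached_prefix(prompt_tokens, cache, block_size=16):
    """Return how many leading tokens already have KV in `cache`.
    Keys are hashes of block-aligned token prefixes (assumed scheme)."""
    hit = 0
    for end in range(block_size, len(prompt_tokens) + 1, block_size):
        key = hash(tuple(prompt_tokens[:end]))
        if key in cache:
            hit = end          # this prefix's KV blocks can be reused
        else:
            break              # longer prefixes cannot match either
    return hit                 # prefill only needs to start from `hit`
```

With a shared 32-token system prompt cached, a 40-token request only prefills its last 8 tokens, which is exactly the TTFT win described above.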

Offloading Strategy

Offloading can target:

  • KV cache pages (GPU <-> CPU memory).
  • Model weights (for very constrained GPU memory).

Design rule:

  • Offload only when transfer overhead is lower than recomputation or eviction cost.
  • For small models, weight offloading often hurts latency because PCIe/NVLink transfers dominate compute.
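The design rule amounts to a back-of-envelope comparison of transfer time against recompute time. All constants below (PCIe bandwidth, GPU throughput) are placeholder assumptions to measure on your own hardware:

```python
def should_offload(kv_bytes, recompute_flops, pcie_gbps=25.0, gpu_tflops=150.0):
    """Offload a KV page only if moving it over the interconnect is cheaper
    than recomputing it from the prompt (illustrative constants, not measured)."""
    transfer_s = kv_bytes / (pcie_gbps * 1e9)
    recompute_s = recompute_flops / (gpu_tflops * 1e12)
    return transfer_s < recompute_s
```

A 1 GB KV region is cheaper to recompute than to move under these assumptions, while a small page transfers faster than it recomputes, matching the rule that small models and small transfers sit on opposite sides of this break-even point.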

Kernel-Level Optimization

Schedulers expose parallelism; kernels realize it efficiently.

Common kernel optimizations:

  • Fused attention kernels (FlashAttention-style implementations).
  • Fused RMSNorm + linear + activation where possible.
  • Quantized GEMM kernels (INT8, FP8, AWQ/GPTQ-compatible paths).
  • Efficient gather/scatter for paged KV layouts.
  • CUDA Graph capture to reduce launch overhead in stable shapes.

Implementation choices:

  • Handwritten CUDA/C++ for peak performance hotspots.
  • Triton kernels for faster iteration and maintainability.

Both are widely used in real serving frameworks.

Parallelism and Multi-GPU Scaling

When one GPU is insufficient, combine parallelism methods:

  • Tensor parallelism: split weight matrices across GPUs.
  • Pipeline parallelism: split layers across stages.
  • Expert parallelism (MoE): distribute experts and route tokens.
  • Data parallel replicas: increase request-level capacity.

Trade-off reminder: stronger parallelism improves capacity but increases communication overhead and tail-latency sensitivity.
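Tensor parallelism is the easiest of these to sketch: shard a weight matrix column-wise, compute each slice on its own device, and gather the results. The pure-Python version below stands in for what NCCL and per-GPU GEMMs do in practice:

```python
def column_parallel_matmul(x, w, num_gpus=2):
    """Column-parallel linear layer sketch: each 'device' holds a column shard
    of W, computes its output slice, and results are concatenated (all-gather).
    Assumes the output width divides evenly by num_gpus."""
    cols = len(w[0])
    shard = cols // num_gpus
    outputs = []
    for g in range(num_gpus):
        w_shard = [row[g * shard:(g + 1) * shard] for row in w]
        # local matmul on device g over its shard of the columns
        out = [[sum(xi * wij for xi, wij in zip(xrow, col))
                for col in zip(*w_shard)]
               for xrow in x]
        outputs.append(out)
    # concatenate along the output dimension (the all-gather step)
    return [sum((outputs[g][i] for g in range(num_gpus)), []) for i in range(len(x))]
```

The concatenation at the end is where the communication cost lives: every forward pass pays a collective, which is the overhead the trade-off reminder above refers to.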

Why "GPU Optimizations" May Not Help Inference

Some optimizations are excellent for training but weak for serving.

Examples:

  • Training-oriented fused backward kernels are irrelevant in pure inference.
  • Throughput-only benchmarks can hide terrible TTFT or p99 ITL.
  • Large static batches look great offline but underperform with real online traffic.

Serving optimization must be workload-aware, not benchmark-only.

Why Small Models May Underperform in vLLM/SGLang Workloads

A more accurate claim than "these engines cannot support small models" is:

  • Small models can be served, but speedup may be smaller than expected in end-to-end latency.

Reasons:

  • Framework overhead (scheduler, tokenization, networking, streaming) becomes a larger fraction of total latency.
  • GPU may be underutilized due to tiny kernels and launch overhead.
  • Memory and synchronization overhead may dominate arithmetic.
  • Quantization/dequantization and host-device copies can erase compute gains.

Practical implication: choose the serving stack and batching policy based on traffic pattern and SLO, not only model parameter count.

Practical Tuning Checklist

  1. Define SLO first: TTFT target, ITL target, and p99 bounds.
  2. Tune scheduler before deep kernel work.
  3. Enable paged KV and monitor fragmentation.
  4. Use prefix caching for repeated prompts.
  5. Set max input/output lengths to protect tail latency.
  6. Evaluate quantization with quality checks, not speed only.
  7. Profile end-to-end: tokenizer, runtime, scheduler, kernels, network.
  8. Separate benchmarking by workload class: short-chat, long-context QA, tool-calling, batch offline generation.

Key Metrics to Report

  • TTFT (p50/p95/p99)
  • ITL (p50/p95/p99)
  • Output tokens/s (aggregate and per GPU)
  • Request throughput (req/s)
  • GPU memory usage and KV cache hit rate
  • Scheduler queue depth and rejection rate
  • Cost per million output tokens
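When reporting the tail percentiles above, be explicit about the definition used. The nearest-rank definition below is one common choice; monitoring stacks often use interpolated variants instead:

```python
import math

def nearest_rank_percentile(samples, p):
    """p-th percentile by the nearest-rank definition: the smallest sample
    such that at least p% of samples are <= it."""
    s = sorted(samples)
    rank = math.ceil(p / 100 * len(s))
    return s[max(rank - 1, 0)]
```

For TTFT/ITL, compute p50/p95/p99 over per-request samples collected under realistic traffic, not over a single offline batch.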

Summary

LLM serving performance is mostly a systems problem:

  • Prefill and decode have different bottlenecks.
  • Scheduler + KV memory management usually matter more than isolated kernel micro-optimizations.
  • Kernel quality is still crucial, especially for attention and quantized GEMM.
  • Small-model serving needs careful system-level tuning because overheads can dominate compute.

When analyzing any serving paper, focus on three questions first:

  1. What scheduling policy is used?
  2. How is KV cache allocated and moved?
  3. Which latency metrics improved under realistic traffic?