thaw is the fork() primitive for live LLM inference. Snapshot a running session — weights, KV cache, scheduler state — and hydrate N divergent children that skip the cold prefill. A pre-warmed pool forks 4 branches in 0.88s median (H100, Llama-3.1-8B) vs ~340s cold. Built for RL rollouts, multi-agent reasoning, and parallel coding agents. Open source, works with vLLM and SGLang.
thaw
thaw is the fork primitive for live LLM inference.
Snapshot a running session — weights, KV cache, scheduler state, and prefix-hash table — and hydrate N children that diverge from the fork point without re-running prefill. git branch for a running model.
NVIDIA shipped GPU memory snapshots last week with Dynamo Snapshot, and it deallocates the KV cache before checkpointing by design. thaw makes the opposite bet: preserve the KV cache so a fork is near-free. Different problem, opposite mechanic. [web:10][web:1]
The receipt — ForkPool, H100 80 GB PCIe, Llama-3.1-8B
A pre-warmed pool holds the engine once; each fork round snapshots KV only. [web:1]
5 rounds × 4 branches × 64 tokens:
| Stage | Time |
|---|---|
| init_pool (one-time — workers boot with real weights) | 22.3s |
| First fork round | 1.16s |
| Median fork round | 0.88s |
~340s cold-boot per round → sub-second warm pool (≈400× amortized).
All rounds 4/4 non-empty and divergent, bit-identical at the fork boundary. The full JSON receipt and reproducer are in the repo — nothing hand-waved. [web:1]
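To make "amortized" concrete, the figures above can be plugged into a back-of-the-envelope calculation. This is only a sketch and not part of thaw: it assumes every round after the first takes the median 0.88s and that a cold round costs the quoted ~340s. The function names are made up for illustration.

```python
# Back-of-the-envelope model built from the numbers quoted above.
COLD_ROUND_S = 340.0    # approximate cold-boot cost per round
INIT_POOL_S = 22.3      # one-time pool boot
FIRST_ROUND_S = 1.16    # first fork round
MEDIAN_ROUND_S = 0.88   # median fork round


def warm_total(rounds: int) -> float:
    """Total warm-pool time, ASSUMING rounds after the first take the median."""
    if rounds < 1:
        raise ValueError("rounds must be >= 1")
    return INIT_POOL_S + FIRST_ROUND_S + (rounds - 1) * MEDIAN_ROUND_S


def speedup(rounds: int) -> float:
    """Cold time divided by warm time for the same number of rounds."""
    return (rounds * COLD_ROUND_S) / warm_total(rounds)


# Limit as the one-time init cost is spread over many rounds.
steady_state = COLD_ROUND_S / MEDIAN_ROUND_S

for n in (1, 5, 100, 1000):
    print(f"{n:>5} rounds: {speedup(n):6.1f}x")
print(f"steady state: {steady_state:.1f}x")
```

Under these assumptions the one-time init_pool cost dominates a short run (roughly 63× over 5 rounds), and the ratio approaches 340 / 0.88 ≈ 386× as the number of rounds grows, which is consistent with the ≈400× amortized figure.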
What you can build with it
- Agent branching — fork a conversation into N parallel hypotheses mid-reasoning, run them concurrently, keep the winner. [web:1]
- RL rollouts — collapse `num_rollouts × prefill_time` into `num_rollouts × memcpy_time`. Real money on $100k+/month training budgets.
- Parallel coding agents — turn "8 agents exploring 8 solutions" from an 8× prefill tax into one fork against a shared warm state.
- Session migration — move a live session between GPUs or pods without losing state. [web:1]
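The branching pattern behind these use cases can be shown with a toy model. The sketch below does not use thaw or any inference engine: `ToySession`, `fork`, `generate`, and `score` are hypothetical stand-ins, and a plain list of integers plays the role of the KV cache. It only illustrates the shape of the idea, which is to copy the shared state once per child instead of recomputing it, let the children diverge, and keep the best one.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Hypothetical stand-in for a live session; `tokens` mimics the KV cache."""
    tokens: list = field(default_factory=list)

    def fork(self, n: int) -> list:
        # Copy the existing state instead of rebuilding it from scratch.
        return [copy.deepcopy(self) for _ in range(n)]

    def generate(self, steps: int, seed: int) -> None:
        # Pretend sampling: append pseudo-random token ids.
        rng = random.Random(seed)
        for _ in range(steps):
            self.tokens.append(rng.randrange(1000))


def score(session: ToySession) -> int:
    # Made-up reward function, purely for the demo.
    return sum(session.tokens)


parent = ToySession()
parent.generate(steps=8, seed=0)      # the expensive shared prefix
prefix_len = len(parent.tokens)

children = parent.fork(4)             # one copy per branch, no recompute
for i, child in enumerate(children):
    child.generate(steps=4, seed=i + 1)

winner = max(children, key=score)
print("shared prefix length:", prefix_len)
print("winner score:", score(winner))
```

Every child starts from an identical prefix and the parent is left untouched, which is the property the "bit-identical at the fork boundary" check refers to in the real system.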
Example
import thaw_vllm
from thaw_vllm import ForkPool
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)
# Boot a pool of workers once (one-time cost, ~22s for 8B on an H100).
pool = ForkPool()
pool.init_pool(
    model="meta-llama/Llama-3.1-8B-Instruct",
    workers=4,
    preload_weights=True,
)
# Each call snapshots the live parent and forks into the warm pool.
# 4 branches in ~0.88s median per round — no cold prefill.
# shared_context (str) and branches (list of str) are supplied by the caller.
prompts = [shared_context + branch for branch in branches]
sp = SamplingParams(max_tokens=64, temperature=0.7)
results = thaw_vllm.fork_completions(llm, prompts, sp, pool=pool)
for r in results:
    print(r.text)
How it works
freeze/restore is pipelined CUDA DMA — WC-pinned (write-combined, page-locked) double buffers, O_DIRECT file I/O, two CUDA streams — so restore runs at GPU-memory-bandwidth speeds. Multi-GPU tensor parallel is supported via collective_rpc. KV cache freeze/restore also reconstructs the prefix-cache hash table, so forked children resume mid-generation. [web:1]
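The double-buffering idea can be sketched without a GPU. The example below is an illustration under loud assumptions, not thaw's implementation: a reader thread stands in for the O_DIRECT file read, two `bytearray` objects stand in for the pinned staging buffers, and a slice copy stands in for the asynchronous host-to-device transfer. `pipelined_copy` is a hypothetical name. What it does show is the overlap: while one buffer is being drained, the other is being filled.

```python
import io
import queue
import threading

CHUNK = 4  # tiny on purpose; real staging buffers would be megabytes


def pipelined_copy(src, dst: bytearray, chunk: int = CHUNK) -> int:
    """Copy src into dst through two alternating staging buffers."""
    buffers = [bytearray(chunk), bytearray(chunk)]
    free = queue.Queue()     # indices of buffers ready to be filled
    filled = queue.Queue()   # (index, nbytes) of buffers ready to be drained
    for i in range(2):
        free.put(i)

    def reader() -> None:
        # Stand-in for the disk-read stage.
        while True:
            i = free.get()
            n = src.readinto(buffers[i])
            filled.put((i, n))
            if n == 0:       # end of input: signal the consumer and stop
                return

    t = threading.Thread(target=reader)
    t.start()

    offset = 0
    while True:
        i, n = filled.get()
        if n == 0:
            break
        # Stand-in for the device-copy stage.
        dst[offset:offset + n] = buffers[i][:n]
        offset += n
        free.put(i)          # hand the buffer back to the reader
    t.join()
    return offset


data = bytes(range(10))
out = bytearray(len(data))
copied = pipelined_copy(io.BytesIO(data), out)
print(copied, bytes(out) == data)
```

With only two buffers the reader can run at most one chunk ahead of the consumer, which bounds staging memory while keeping both stages busy.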
Install
pip install thaw-vllm
Open source under Apache-2.0. Works with vLLM and SGLang. [web:1]