thaw
fork() for AI agents — snapshot a live LLM, branch N ways, skip the prefill.
Pitch

thaw is the fork() primitive for live LLM inference. Snapshot a running session — weights, KV cache, scheduler state — and hydrate N divergent children that skip the cold prefill. A pre-warmed pool forks 4 branches in 0.88s median (H100, Llama-3.1-8B) vs ~340s cold. Built for RL rollouts, multi-agent reasoning, and parallel coding agents. Open source, works with vLLM and SGLang.

Description

thaw

thaw is the fork primitive for live LLM inference.

Snapshot a running session (weights, KV cache, scheduler state, and prefix-hash table) and hydrate N children that diverge from the fork point without re-running prefill. Think of it as git branch for a running model.

NVIDIA shipped GPU memory snapshots last week with Dynamo Snapshot, which deallocates the KV cache before checkpointing by design. thaw makes the opposite bet: preserve the KV cache so a fork is near-free. Different problem, opposite mechanic.

The receipt — ForkPool, H100 80 GB PCIe, Llama-3.1-8B

A pre-warmed pool boots the engine once; each fork round snapshots only the KV cache.

5 rounds × 4 branches × 64 tokens:

Stage | Time
init_pool (one-time; workers boot with real weights) | 22.3s
First fork round | 1.16s
Median fork round | 0.88s

~340s cold boot per round → sub-second on the warm pool (≈400× per round, once the one-time 22.3s init_pool cost has been paid).
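
The arithmetic behind that ratio can be sketched from the three numbers quoted in this section. This is a minimal sketch, not a measurement: it assumes every warm round costs the 0.88s median, and the function names are illustrative.

```python
# Numbers quoted in the receipt above.
COLD_BOOT_S = 340.0    # approximate cold boot per round
INIT_POOL_S = 22.3     # one-time init_pool cost
MEDIAN_FORK_S = 0.88   # median warm fork round


def per_round_speedup() -> float:
    """Cold boot time divided by one warm fork round, ignoring init."""
    return COLD_BOOT_S / MEDIAN_FORK_S


def end_to_end_speedup(rounds: int) -> float:
    """Speedup over `rounds` rounds, charging init_pool once.

    Assumption: every warm round costs the median (the receipt's
    first round is slightly slower, at 1.16s).
    """
    cold_total = rounds * COLD_BOOT_S
    warm_total = INIT_POOL_S + rounds * MEDIAN_FORK_S
    return cold_total / warm_total


if __name__ == "__main__":
    print(f"per round: {per_round_speedup():.0f}x")  # per round: 386x
    for rounds in (5, 100, 1000):
        print(f"{rounds} rounds: {end_to_end_speedup(rounds):.0f}x")
```

Under these assumptions the five-round receipt works out to roughly 64× end to end, and the figure approaches the per-round ratio as the number of rounds grows.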

All five rounds produced 4/4 non-empty, divergent branches that are bit-identical at the fork boundary. The full JSON receipt and reproducer are in the repo; nothing is hand-waved.

What you can build with it

  • Agent branching: fork a conversation into N parallel hypotheses mid-reasoning, run them concurrently, keep the winner.
  • RL rollouts: collapse num_rollouts × prefill_time into num_rollouts × memcpy_time. Real money on $100k+/month training budgets.
  • Parallel coding agents: turn "8 agents exploring 8 solutions" from an 8× prefill tax into one fork against a shared warm state.
  • Session migration: move a live session between GPUs or pods without losing state.
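
For the memcpy side of the RL rollouts item above, a back-of-envelope estimate of KV-cache size is useful. The sketch below uses the standard formula (keys plus values, per layer, per KV head) with the shape from the published Llama-3.1-8B config (32 layers, 8 KV heads, head dimension 128) and 2-byte elements. The function names and the bandwidth figure are illustrative placeholders, and thaw's actual snapshot format may carry additional metadata.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 32,      # Llama-3.1-8B config
    num_kv_heads: int = 8,     # grouped-query attention
    head_dim: int = 128,
    bytes_per_elem: int = 2,   # fp16 / bf16
) -> int:
    """Bytes of KV cache for one sequence of `num_tokens` tokens."""
    # The leading 2 counts one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens


def copy_seconds(num_bytes: int, bandwidth_gb_s: float) -> float:
    """Idealized copy time at a sustained bandwidth given in GB/s."""
    return num_bytes / (bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    size = kv_cache_bytes(8192)
    print(f"{size / 2**30:.2f} GiB")  # 1.00 GiB
    # 25 GB/s is a placeholder; substitute a measured figure.
    print(f"{copy_seconds(size, 25.0) * 1e3:.1f} ms")
```

At 128 KiB per token, the copy cost grows linearly with context length, so the estimate is most useful for comparing against a measured prefill time for the same context.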

Example

import thaw_vllm
from thaw_vllm import ForkPool
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)

# Boot a pool of workers once (one-time cost, ~22s for 8B on an H100).
pool = ForkPool()
pool.init_pool(
    model="meta-llama/Llama-3.1-8B-Instruct",
    workers=4,
    preload_weights=True,
)

# Each call snapshots the live parent and forks into the warm pool.
# 4 branches in ~0.88s median per round — no cold prefill.
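# Illustrative placeholder inputs: one shared prefix plus four divergent suffixes.
shared_context = "You are debugging a failing test. Consider this hypothesis: "
branches = ["an off-by-one error", "a race condition", "a stale cache", "a bad config"]
prompts = [shared_context + branch for branch in branches]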
sp = SamplingParams(max_tokens=64, temperature=0.7)

results = thaw_vllm.fork_completions(llm, prompts, sp, pool=pool)
for r in results:
    print(r.text)

How it works

freeze/restore is pipelined CUDA DMA (write-combined pinned double buffers, O_DIRECT, two CUDA streams), so restore runs at GPU-memory-bandwidth speeds. Multi-GPU tensor parallelism is supported via collective_rpc. KV cache freeze/restore also reconstructs the prefix-cache hash table so forked children can resume mid-generation.
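
The double-buffering idea can be illustrated on the host side without CUDA. The sketch below is an analogy, not thaw's implementation: it uses a Python thread and two reusable buffers so that reading the next chunk overlaps with writing the current one, which is the kind of overlap that two CUDA streams and pinned staging buffers provide for disk-to-GPU transfers. pipelined_copy and its parameters are hypothetical names.

```python
import io
import queue
import threading


def pipelined_copy(src, dst, chunk_size: int = 1 << 20) -> int:
    """Copy src to dst using two buffers so reads overlap writes."""
    buffers = [bytearray(chunk_size), bytearray(chunk_size)]
    free = queue.Queue()    # indices of buffers ready to be filled
    filled = queue.Queue()  # (index, nbytes) pairs ready to be drained
    for index in range(2):
        free.put(index)

    def reader() -> None:
        while True:
            index = free.get()
            nbytes = src.readinto(buffers[index])
            filled.put((index, nbytes))
            if nbytes == 0:  # end of stream; the 0 also stops the writer
                return

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    total = 0
    while True:
        index, nbytes = filled.get()
        if nbytes == 0:
            break
        # While this write runs, the reader is filling the other buffer.
        dst.write(memoryview(buffers[index])[:nbytes])
        total += nbytes
        free.put(index)  # hand the drained buffer back to the reader
    thread.join()
    return total


if __name__ == "__main__":
    payload = bytes(range(256)) * 4096
    out = io.BytesIO()
    copied = pipelined_copy(io.BytesIO(payload), out, chunk_size=4096)
    print(copied == len(payload) and out.getvalue() == payload)
```

Two buffers are the minimum needed for overlap: one is always being filled while the other is being drained, and neither side allocates during the copy.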

Install

pip install thaw-vllm

Open source under Apache-2.0. Works with vLLM and SGLang.
