LLM Inference Performance Calculator
An interactive web application for analyzing large-scale model inference performance.
Pitch

The LLM Inference Performance Calculator addresses the complexities of deploying large-scale Mixture-of-Experts models. It offers an interactive visualization tool that simulates inference physics, allowing engineers to explore design trade-offs and optimize performance without needing physical hardware.

Description

The LLM Inference Performance Calculator addresses the complexities of deploying large-scale Mixture-of-Experts (MoE) models, such as DeepSeek-V3, by enabling in-depth performance analysis through a first-principles, interactive visualization tool. Designed for modern AI engineering, this app helps navigate the extensive design space, balancing logical architecture against physical hardware constraints without requiring access to physical hardware.

Introduction

As AI models evolve, engineers face critical "what-if" questions that are cost-prohibitive to test in live environments. Examples include:

  • The impact of sequence-length scaling on the KV Cache memory wall.
  • Using DualPipe optimization to hide MoE All-to-All communication latency.
  • The potential benefits of offloading "cold" experts to system RAM, a memory pool, or Near Memory Computing devices.
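The first scenario above is straightforward to estimate from first principles. The sketch below (function name and model parameters are illustrative, and it assumes a standard GQA/MHA cache layout rather than the compressed latent cache MLA uses) shows how KV Cache memory grows linearly with sequence length:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Rough KV Cache footprint for a standard GQA/MHA attention layout."""
    # K and V each store one head_dim vector per token, per KV head, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical GQA model: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
for seq_len in (4096, 32768, 131072):
    gib = kv_cache_bytes(32, 8, 128, seq_len, batch_size=1) / 2**30
    print(f"seq_len={seq_len:>6}: {gib:.1f} GiB")  # 0.5, 4.0, 16.0 GiB
```

Each 32x increase in context length costs 32x the cache memory, which is why long-context serving hits the memory wall well before it hits a compute wall.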

Key Features

🧠 Predefined Models & Presets

Quickly load industry-standard configurations as baselines and adapt them for custom testing. Key model presets include:

  • DeepSeek-V3: (671B MoE, MLA, Multi-Token Prediction)
  • Mixtral 8x7B: (Sparse MoE, GQA)
  • Grok-1: (Large Scale MoE)
  • Qwen2.5-MoE: (High granularity experts)
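A preset of this kind could be represented as a plain config dict. The field names below are illustrative (not the app's actual schema); the DeepSeek-V3 figures (671B total / 37B active parameters, 256 routed experts with top-8 routing, MLA attention) and Mixtral's 8 experts with top-2 routing come from the models' public descriptions:

```python
# Illustrative preset schema; field names are assumptions, not the app's API.
MODEL_PRESETS = {
    "DeepSeek-V3": {
        "total_params_b": 671,   # billions of parameters
        "active_params_b": 37,   # activated per token
        "num_experts": 256,      # routed experts per MoE layer (N)
        "active_experts": 8,     # experts selected per token (K)
        "attention": "MLA",
    },
    "Mixtral-8x7B": {
        "num_experts": 8,
        "active_experts": 2,
        "attention": "GQA",
    },
}

baseline = MODEL_PRESETS["DeepSeek-V3"]
print(f"{baseline['active_experts']}/{baseline['num_experts']} experts active per token")
```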

🛠️ Architecture & Pipeline Customization

Achieve fine-grained control over the logical inference pipeline by differentiating between Prefill (Throughput-bound) and Decode (Latency-bound) stages with customizable parallelism strategies. Key customizable parameters include:

  • Architecture Config: Adjust layers, expert counts (N), active experts (K), and attention types (MLA/GQA/MHA).
  • Parallelism Strategy: Configure Tensor Parallel (TP), Pipeline Parallel (PP), Sequence Parallel (SP), and Data Parallel (DP) for independent Prefill and Decode management.
  • Optimizations: Enable or disable Paged KV Cache, DualPipe (Compute-Comm Overlap), and quantization settings (FP8/INT4).

⚡ AI Infrastructure Configuration

Illustrate how logical workloads map onto physical hardware to highlight potential bottlenecks.

  • Compute: Select from NVIDIA H100, B200, A100, or generic SKUs; configure host CPUs (Sapphire Rapids, Emerald Rapids).
  • Networking: Choose between InfiniBand and Ethernet (RoCE) scale-out fabrics while adjusting Scale-Up topology (NVLink V3/V4/V5).
  • Topology: Automatically calculate the required number of nodes based on memory capacity and pipeline depth.
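The topology calculation described above amounts to a capacity-driven lower bound: divide the total memory footprint by per-GPU HBM, then round up to whole-node granularity. A minimal sketch (the exact heuristic the app uses may differ):

```python
import math

def nodes_required(total_bytes, hbm_bytes_per_gpu, gpus_per_node=8):
    # Capacity-driven lower bound: GPUs needed to hold the footprint,
    # rounded up to whole-node granularity.
    gpus = math.ceil(total_bytes / hbm_bytes_per_gpu)
    return math.ceil(gpus / gpus_per_node)

# Example: 671 GB of FP8 weights on 80 GB H100s, 8 GPUs per node.
print(nodes_required(671e9, 80e9))  # → 2 nodes (9 GPUs rounded up)
```

In practice the KV Cache, activations, and pipeline depth push the real requirement above this floor, which is exactly the kind of gap the calculator is meant to expose.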

🧪 Experimental Features: MemPool & NMC

Delve into cutting-edge research concepts for advanced inference systems, including:

  • Memory Pooling (MemPool): Simulate Transparent Page Placement (TPP) with defined hierarchical storage tiers (VRAM → System RAM → Node Pool (NVMe) → Global Pool), allowing experimentation with predictive prefetching vs. on-demand paging policies. Visualize the ramifications of “Expert Locality” on PCIe saturation.

  • Near Memory Computing (NMC): Model the effects of offloading specific tasks (Top-K Selection, Quantization, Sparse Attention) to NMC-enabled memory devices, evaluating latency benefits derived from local processing.
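A page-placement policy over those tiers can be sketched as a greedy, hotness-ranked assignment: the most frequently routed experts stay in VRAM, the rest spill down the hierarchy. This is a simplified illustration of the idea, not the simulator's actual policy:

```python
def place_experts(access_counts, tier_slots):
    """Greedy hotness-based placement across the storage hierarchy."""
    tiers = ["VRAM", "SystemRAM", "NodePool_NVMe", "GlobalPool"]
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    placement, idx = {}, 0
    for tier in tiers:
        # The last tier is treated as unbounded if no slot count is given.
        capacity = tier_slots.get(tier, len(ranked))
        for expert in ranked[idx:idx + capacity]:
            placement[expert] = tier
        idx += capacity
    return placement

counts = {"e0": 900, "e1": 40, "e2": 500, "e3": 3}
print(place_experts(counts, {"VRAM": 1, "SystemRAM": 1, "NodePool_NVMe": 1}))
# e0 stays in VRAM; the cold e3 falls through to the global pool.
```

A real TPP-style policy would migrate pages dynamically as hotness shifts; the static ranking above just makes the tiering trade-off concrete.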

The LLM Inference Performance Calculator provides a robust platform for architects and system engineers to explore and validate inference model architectures in various environments, facilitating improved design choices and operational efficiency.
