MetalBench
Streamlined benchmarking for Apple Metal kernels on Apple Silicon.
Pitch

MetalBench is a framework for benchmarking Apple Metal GPU kernels against MLX reference implementations. An agent-driven closed loop automates kernel development, performance testing, and accuracy validation on Apple Silicon.

Description

MetalBench

MetalBench is a benchmarking and development framework for Metal kernels on Apple Silicon. It combines a kernel collection with a harness that evaluates GPU kernel performance against reference implementations provided by MLX. It is modeled on KernelBench and adapts that methodology to Apple's ecosystem by replacing CUDA with Metal and PyTorch with MLX.

The repository includes an agent-based closed-loop system that writes and tests kernels, checking each candidate against accuracy and performance targets. The harness organizes benchmarking systematically, so users can quickly assess kernel performance across M-series chips. Contributions of additional kernels for other M-series chips are encouraged.

Core Component: Agent-Steel 👨‍🏭

The agent_steel/ directory includes a language model-driven closed-loop harness responsible for profiling existing kernels, generating new candidates, and validating performance against benchmarks. This process continues until no further improvements can be achieved.

Key Agents in the Process:

Agent | Role
Profiler | Analyzes performance traces and outputs, generating a detailed diagnosis of the kernel's performance.
Optimizer | Based on the diagnosis and historical data, it crafts the next kernel version, ensuring it meets the accuracy threshold before progressing.
Verifier | Benchmarks the new kernel, comparing it to previous best performances and logging results for future reference.
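The Profiler, Optimizer, and Verifier roles can be pictured as one round of a loop that repeats until a candidate fails to beat the current best. The sketch below is only an illustration of that control flow, not the code in agent_steel/: the `Candidate` type, the `run_loop` function, and the three callables are hypothetical names, and stopping at the first non-improving round is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    source: str      # kernel source text (a placeholder label in this sketch)
    time_ms: float   # measured runtime, filled in by the verifier
    accurate: bool   # whether the candidate passed the accuracy threshold


def run_loop(
    baseline: Candidate,
    profile: Callable[[Candidate], str],
    optimize: Callable[[str, List[Candidate]], Candidate],
    verify: Callable[[Candidate], Candidate],
    max_rounds: int = 5,
) -> Tuple[Candidate, List[Candidate]]:
    """Hypothetical closed loop: profile, optimize, verify, keep the best."""
    best = baseline
    history = [baseline]
    for _ in range(max_rounds):
        diagnosis = profile(best)                   # Profiler: diagnose the current best
        candidate = optimize(diagnosis, history)    # Optimizer: propose the next version
        candidate = verify(candidate)               # Verifier: benchmark the proposal
        history.append(candidate)
        improved = candidate.accurate and candidate.time_ms < best.time_ms
        if not improved:
            break                                   # assumption: stop at first non-improvement
        best = candidate
    return best, history


def demo() -> Tuple[Candidate, List[Candidate]]:
    """Run the loop with stub agents and made-up timings."""
    timings = iter([0.8, 0.6, 0.7])

    def profile(best: Candidate) -> str:
        return f"best so far: {best.time_ms} ms"

    def optimize(diagnosis: str, history: List[Candidate]) -> Candidate:
        return Candidate(f"v{len(history)}", float("inf"), True)

    def verify(candidate: Candidate) -> Candidate:
        return Candidate(candidate.source, next(timings), candidate.accurate)

    return run_loop(Candidate("v0", 1.0, True), profile, optimize, verify)
```

In the stub run, the third candidate is slower than the second, so the loop ends there and the second candidate remains the best. A real harness would replace the stubs with model calls and actual Metal benchmarks.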

Usage Example

To run the agent-steel harness for a kernel such as 'relu':

python -m agent_steel --kernel-name relu --loop --max-rounds 5

Comprehensive Kernel Library

The project includes a diverse range of kernels categorized by complexity:

Set | Description
Common | Basic operations like activations, matrix multiplications, and convolutions.
Standard | Advanced fused operations including attention mechanisms and specialized layers.
Full | Complete model architectures such as transformer blocks and various neural network structures.

For the full list of kernels, see KERNELS.md.

Benchmark Evaluation

MetalBench provides detailed benchmarking that includes metrics for accuracy and performance across several key categories:

  • Speedup compared to MLX
  • Compute throughput (GFLOPS)
  • Memory bandwidth (GB/s)
  • Run-to-run stability
  • An overall balanced score based on these metrics.
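The first four metrics follow from standard definitions, which the sketch below works through for a set of timing samples. It is an assumption about how such numbers are typically derived, not MetalBench's actual implementation: the function names are hypothetical, medians are used for speedup by choice, and stability is expressed as a coefficient of variation. The overall balanced score is left out because the source does not state its formula.

```python
import statistics
from typing import Sequence


def speedup(ref_ms: Sequence[float], kernel_ms: Sequence[float]) -> float:
    """Ratio of median reference time to median kernel time (>1 means faster)."""
    return statistics.median(ref_ms) / statistics.median(kernel_ms)


def gflops(flop_count: float, time_ms: float) -> float:
    """Floating-point operations per second, in units of 1e9."""
    return flop_count / (time_ms * 1e-3) / 1e9


def bandwidth_gbs(bytes_moved: float, time_ms: float) -> float:
    """Bytes read plus bytes written per second, in units of 1e9."""
    return bytes_moved / (time_ms * 1e-3) / 1e9


def coefficient_of_variation(samples_ms: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean; lower is more stable."""
    return statistics.stdev(samples_ms) / statistics.mean(samples_ms)


# Illustrative numbers only: an elementwise float32 kernel over 2**24 elements,
# counted as one operation per element, reading and writing 4 bytes each.
n = 1 << 24
kernel_ms = [0.50, 0.52, 0.48]
ref_ms = [1.0, 1.1, 0.9]

print(f"speedup   : {speedup(ref_ms, kernel_ms):.2f}x")
print(f"throughput: {gflops(n, statistics.median(kernel_ms)):.2f} GFLOPS")
print(f"bandwidth : {bandwidth_gbs(8 * n, statistics.median(kernel_ms)):.2f} GB/s")
print(f"stability : {coefficient_of_variation(kernel_ms):.3f} (CV)")
```

For memory-bound kernels such as activations, the bandwidth figure is usually the more informative of the two throughput numbers, since the arithmetic per element is negligible.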

Reporting

When a benchmark completes, MetalBench prints a detailed, human-readable report to the terminal showing the key performance metrics and comparisons against the reference implementations. Each successful run also updates the session artifacts, so results are reproducible and kernel performance can be tracked over time.
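One simple way to pair a terminal report with persistent session artifacts is to format the metrics for display and append the same record to a log file. The sketch below assumes a JSON Lines file and hypothetical function names (`format_report`, `append_artifact`, `load_artifacts`); the source does not describe MetalBench's actual artifact format.

```python
import json
from pathlib import Path
from typing import Dict, List


def format_report(kernel: str, metrics: Dict[str, float]) -> str:
    """Render metrics as aligned, human-readable lines for the terminal."""
    width = max((len(name) for name in metrics), default=0)
    lines = [f"== {kernel} =="]
    lines += [f"{name.ljust(width)} : {value:.3f}" for name, value in metrics.items()]
    return "\n".join(lines)


def append_artifact(path: str, kernel: str, metrics: Dict[str, float]) -> None:
    """Append one run as a JSON line so earlier runs are never overwritten."""
    record = {"kernel": kernel, **metrics}
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_artifacts(path: str) -> List[dict]:
    """Read every recorded run back, oldest first."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

An append-only log keeps each run's numbers intact, which makes it straightforward to compare a new kernel against any earlier result.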

Citation

For academic referencing, cite MetalBench as follows:

@misc{metalbench2026,
  title  = {MetalBench: Apple Metal GPU Kernel Benchmarks},
  author = {Manakelew, Alazar},
  year   = {2026},
  url    = {https://github.com/Lazarus-931/MetalBench},
  note   = {Live leaderboard: https://lazarus-931.github.io/leaderboard.html}
}

Explore MetalBench to streamline the development and benchmarking of Metal kernels on Apple Silicon, enhancing performance evaluation and optimization strategies.
