MetalBench is a framework for benchmarking Apple Metal GPU kernels against MLX reference implementations. An agent-driven closed loop automates kernel development, performance testing, and accuracy validation on Apple Silicon.
MetalBench
MetalBench is a benchmarking and development framework for Metal kernels on Apple Silicon. It combines a kernel collection with a harness for evaluating GPU kernel performance against reference implementations provided by MLX. It follows the approach of KernelBench, adapted to Apple's ecosystem by replacing CUDA with Metal and PyTorch with MLX.
This repository includes an agent-based closed-loop system that writes and tests kernels against performance and accuracy benchmarks. The harness runs benchmarks in a consistent way, so users can quickly assess kernel performance across M-series chips. Contributions of additional kernels for different M-series chips are encouraged.
Core Component: Agent-Steel 👨‍🏭
The agent_steel/ directory contains a language-model-driven closed-loop harness that profiles existing kernels, generates new candidates, and validates their performance against benchmarks. The loop repeats until no further improvement is found.
Key Agents in the Process:
| Agent | Role |
|---|---|
| Profiler | Analyzes performance traces and outputs, generating a detailed diagnosis of the kernel's performance. |
| Optimizer | Writes the next kernel version from the diagnosis and historical data, and ensures it meets the accuracy threshold before progressing. |
| Verifier | Benchmarks the new kernel, compares it to the previous best result, and logs the results for future reference. |
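The interaction between the three agents can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the code in agent_steel/: the names `Candidate`, `LoopState`, and `run_closed_loop`, and the four callables passed in, are all hypothetical stand-ins for the Profiler, Optimizer, accuracy check, and Verifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Candidate:
    source: str      # kernel source text (hypothetical representation)
    time_ms: float   # measured runtime; lower is better


@dataclass
class LoopState:
    best: Candidate
    history: List[Candidate] = field(default_factory=list)


def run_closed_loop(
    baseline: Candidate,
    profile: Callable[[Candidate], str],              # Profiler: kernel -> diagnosis
    optimize: Callable[[str, List[Candidate]], str],  # Optimizer: diagnosis + history -> new source
    is_accurate: Callable[[str], bool],               # accuracy gate (assumed interface)
    benchmark: Callable[[str], float],                # Verifier: source -> runtime in ms
    max_rounds: int = 5,
) -> LoopState:
    state = LoopState(best=baseline, history=[baseline])
    for _ in range(max_rounds):
        diagnosis = profile(state.best)
        source = optimize(diagnosis, state.history)
        if not is_accurate(source):
            continue  # candidates that fail the accuracy check are never benchmarked
        candidate = Candidate(source, benchmark(source))
        state.history.append(candidate)  # log every benchmarked candidate
        if candidate.time_ms < state.best.time_ms:
            state.best = candidate
        else:
            break  # stop once a round brings no improvement
    return state
```

In this sketch the loop ends either when a round fails to beat the best runtime or when the round limit is reached, whichever comes first.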
Usage Example
To run the agent-steel harness for a kernel such as 'relu':
python -m agent_steel --kernel-name relu --loop --max-rounds 5
Comprehensive Kernel Library
The project includes a diverse range of kernels categorized by complexity:
| Set | Description |
|---|---|
| Common | Basic operations like activations, matrix multiplications, and convolutions. |
| Standard | Advanced fused operations including attention mechanisms and specialized layers. |
| Full | Complete model architectures such as transformer blocks and various neural network structures. |
For a comprehensive list of kernels, refer to KERNELS.md.
Benchmark Evaluation
MetalBench reports accuracy and performance metrics in several categories:
- Speedup compared to MLX
- Compute throughput (GFLOPS)
- Memory bandwidth (GB/s)
- Run-to-run stability
- Overall balanced score based on these metrics
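As a rough illustration of how the first four metrics relate to raw timings, the sketch below derives them from repeated runtime samples. The function name `summarize`, its parameters, and the choice of median timing and coefficient of variation are assumptions made for this example, not MetalBench's actual definitions. The overall balanced score is left out because its weighting is not described here.

```python
import statistics
from typing import Dict, Sequence


def summarize(
    kernel_ms: Sequence[float],     # runtimes of the Metal kernel, in milliseconds
    reference_ms: Sequence[float],  # runtimes of the MLX reference, in milliseconds
    flops: float,                   # floating-point operations per kernel call
    bytes_moved: float,             # bytes read plus written per kernel call
) -> Dict[str, float]:
    k = statistics.median(kernel_ms)
    r = statistics.median(reference_ms)
    seconds = k / 1e3
    return {
        "speedup": r / k,                         # > 1 means faster than the reference
        "gflops": flops / seconds / 1e9,          # compute throughput
        "gb_per_s": bytes_moved / seconds / 1e9,  # effective memory bandwidth
        # Run-to-run stability as coefficient of variation (0 = perfectly stable).
        "cv": statistics.pstdev(kernel_ms) / statistics.mean(kernel_ms),
    }
```

Using the median rather than the mean keeps a single slow run from skewing the speedup figure.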
Reporting
When a benchmark completes, MetalBench prints a detailed, human-readable report in the terminal showing the performance metrics and how they compare against the reference implementation. Each successful benchmark run updates the session artifacts, so results are reproducible and kernel performance can be tracked over time.
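One way to pair a terminal summary with a persistent record is to append each result to a JSON Lines file. The sketch below is purely illustrative: the function `record_result`, the file layout, and the report format are assumptions and do not describe MetalBench's real session artifacts.

```python
import json
from pathlib import Path
from typing import Dict


def record_result(log_path: Path, kernel: str, metrics: Dict[str, float]) -> str:
    """Append one result to a JSON Lines log and return a one-line summary."""
    entry = {"kernel": kernel, **metrics}
    # Append mode keeps earlier runs, so performance can be compared over time.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
    parts = [f"{name}={value:.2f}" for name, value in sorted(metrics.items())]
    return f"{kernel}: " + ", ".join(parts)
```

Calling `print(record_result(Path("session.jsonl"), "relu", metrics))` would then show the summary and extend the log in one step.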
Citation
For academic referencing, cite MetalBench as follows:
@misc{metalbench2026,
title = {MetalBench: Apple Metal GPU Kernel Benchmarks},
author = {Manakelew, Alazar},
year = {2026},
url = {https://github.com/Lazarus-931/MetalBench},
note = {Live leaderboard: https://lazarus-931.github.io/leaderboard.html}
}
Explore MetalBench to streamline the development and benchmarking of Metal kernels on Apple Silicon, enhancing performance evaluation and optimization strategies.