AX Engine is a Mac-first runtime designed for LLM inference, providing a local server and SDK layers for developers. It supports Gemma 4 and Qwen 3.6 models while enabling compatibility with OpenAI-like capabilities. With advanced benchmarking tools and explicit routing for model paths, it delivers efficiency without compromising performance.
AX Engine
AX Engine is an innovative LLM inference runtime and local server specifically tailored for Apple Silicon. This developer-friendly toolkit enhances productivity by offering a local OpenAI-compatible model server, complete with a comprehensive SDK layer and benchmarking tools. It supports native direct MLX model families and facilitates compatibility for non-MLX models through dedicated routing pathways with mlx-lm and llama.cpp.
Key Features
- OpenAI-Compatible Endpoints: Provides easily accessible text endpoints for typical chat and completion workflows utilizing robust SDKs in Python, TypeScript/JavaScript, Go, Ruby, and Mojo.
- Superior Performance: Benchmark results showcase impressive claims such as:
- A performance enhancement of 2.83-2.92 times faster than the direct decoding for Gemma 4 12B assistant-MTP, and
- A 76.4% increase in speed for Qwen 3.6 35B-A3B AX MTP over MTPLX in benchmark scenarios.
- Comprehensive Benchmarking: Equipped with tools to track performance metrics, including route identity and workload history, ensuring accountable benchmarking.
Runtime Paths
AX Engine structures its performance around different runtime paths, distinguishing between:
- Repo-Owned MLX Runtime: This path is utilized for direct-support MLX model families, preserving strict performance claims.
- Delegated Routes: Engage with MLX text models supported by upstream
mlx-lm, enhancing compatibility while ensuring clarity in performance claims.
Hardware Recommendations
For optimal performance, AX Engine is best utilized with specific hardware configurations:
| Device | Recommended Memory | Optimal Use Case |
|---|---|---|
| Mac mini M4 Pro | 64 GB | Compact local chatbot and agent server |
| MacBook Pro M5 Max | 128 GB | High-throughput chatbot and coding stack |
| Mac Studio M3 Ultra | 256 GB | Extensive model portfolio with heavy computational workloads |
Supported Models
AX Engine seamlessly supports various model families, including direct support for:
- Gemma 4 Models: Such as
gemma-4-12b-itand others optimized for specific memory configurations. - Qwen Models: Featuring a range of options and architectures, designed for versatility and performance.
Performance Insights
AX Engine consistently outperforms recognized benchmarks, as evidenced by detailed performance tables and testing methodologies, showcasing essential metrics such as throughput and latency across different configurations and loads.
AX Engine stands out as a cutting-edge tool for developers looking to harness the power of local LLMs on Apple Silicon, backed by rigorous benchmarking and a robust support structure.
No comments yet.
Sign in to be the first to comment.