AX Engine - Efficient LLM inference for Apple Silicon with direct MLX support.

Projects Leaderboard

Pitch

Description

Comments

AX Engine

spectacular_copper_clarine

Efficient LLM inference for Apple Silicon with direct MLX support.

Visit project

Pitch

AX Engine is a Mac-first runtime designed for LLM inference, providing a local server and SDK layers for developers. It supports Gemma 4 and Qwen 3.6 models while enabling compatibility with OpenAI-like capabilities. With advanced benchmarking tools and explicit routing for model paths, it delivers efficiency without compromising performance.

Description

AX Engine

AX Engine is an innovative LLM inference runtime and local server specifically tailored for Apple Silicon. This developer-friendly toolkit enhances productivity by offering a local OpenAI-compatible model server, complete with a comprehensive SDK layer and benchmarking tools. It supports native direct MLX model families and facilitates compatibility for non-MLX models through dedicated routing pathways with mlx-lm and llama.cpp.

Key Features

OpenAI-Compatible Endpoints: Provides easily accessible text endpoints for typical chat and completion workflows utilizing robust SDKs in Python, TypeScript/JavaScript, Go, Ruby, and Mojo.
Superior Performance: Benchmark results showcase impressive claims such as:
- A performance enhancement of 2.83-2.92 times faster than the direct decoding for Gemma 4 12B assistant-MTP, and
- A 76.4% increase in speed for Qwen 3.6 35B-A3B AX MTP over MTPLX in benchmark scenarios.
Comprehensive Benchmarking: Equipped with tools to track performance metrics, including route identity and workload history, ensuring accountable benchmarking.

Runtime Paths

AX Engine structures its performance around different runtime paths, distinguishing between:

Repo-Owned MLX Runtime: This path is utilized for direct-support MLX model families, preserving strict performance claims.
Delegated Routes: Engage with MLX text models supported by upstream mlx-lm, enhancing compatibility while ensuring clarity in performance claims.

Hardware Recommendations

For optimal performance, AX Engine is best utilized with specific hardware configurations:

Device	Recommended Memory	Optimal Use Case
Mac mini M4 Pro	64 GB	Compact local chatbot and agent server
MacBook Pro M5 Max	128 GB	High-throughput chatbot and coding stack
Mac Studio M3 Ultra	256 GB	Extensive model portfolio with heavy computational workloads

Supported Models

AX Engine seamlessly supports various model families, including direct support for:

Gemma 4 Models: Such as gemma-4-12b-it and others optimized for specific memory configurations.
Qwen Models: Featuring a range of options and architectures, designed for versatility and performance.

Performance Insights

AX Engine consistently outperforms recognized benchmarks, as evidenced by detailed performance tables and testing methodologies, showcasing essential metrics such as throughput and latency across different configurations and loads.

AX Engine stands out as a cutting-edge tool for developers looking to harness the power of local LLMs on Apple Silicon, backed by rigorous benchmarking and a robust support structure.

0 comments

No comments yet.

New comment