PitchHut logo
Efficient LLM inference for Apple Silicon with direct MLX support.
Pitch

AX Engine is a Mac-first runtime designed for LLM inference, providing a local server and SDK layers for developers. It supports Gemma 4 and Qwen 3.6 models while enabling compatibility with OpenAI-like capabilities. With advanced benchmarking tools and explicit routing for model paths, it delivers efficiency without compromising performance.

Description

AX Engine

AX Engine is an innovative LLM inference runtime and local server specifically tailored for Apple Silicon. This developer-friendly toolkit enhances productivity by offering a local OpenAI-compatible model server, complete with a comprehensive SDK layer and benchmarking tools. It supports native direct MLX model families and facilitates compatibility for non-MLX models through dedicated routing pathways with mlx-lm and llama.cpp.

Key Features

  • OpenAI-Compatible Endpoints: Provides easily accessible text endpoints for typical chat and completion workflows utilizing robust SDKs in Python, TypeScript/JavaScript, Go, Ruby, and Mojo.
  • Superior Performance: Benchmark results showcase impressive claims such as:
    • A performance enhancement of 2.83-2.92 times faster than the direct decoding for Gemma 4 12B assistant-MTP, and
    • A 76.4% increase in speed for Qwen 3.6 35B-A3B AX MTP over MTPLX in benchmark scenarios.
  • Comprehensive Benchmarking: Equipped with tools to track performance metrics, including route identity and workload history, ensuring accountable benchmarking.

Runtime Paths

AX Engine structures its performance around different runtime paths, distinguishing between:

  • Repo-Owned MLX Runtime: This path is utilized for direct-support MLX model families, preserving strict performance claims.
  • Delegated Routes: Engage with MLX text models supported by upstream mlx-lm, enhancing compatibility while ensuring clarity in performance claims.

Hardware Recommendations

For optimal performance, AX Engine is best utilized with specific hardware configurations:

DeviceRecommended MemoryOptimal Use Case
Mac mini M4 Pro64 GBCompact local chatbot and agent server
MacBook Pro M5 Max128 GBHigh-throughput chatbot and coding stack
Mac Studio M3 Ultra256 GBExtensive model portfolio with heavy computational workloads

Supported Models

AX Engine seamlessly supports various model families, including direct support for:

  • Gemma 4 Models: Such as gemma-4-12b-it and others optimized for specific memory configurations.
  • Qwen Models: Featuring a range of options and architectures, designed for versatility and performance.

Performance Insights

AX Engine consistently outperforms recognized benchmarks, as evidenced by detailed performance tables and testing methodologies, showcasing essential metrics such as throughput and latency across different configurations and loads.


AX Engine stands out as a cutting-edge tool for developers looking to harness the power of local LLMs on Apple Silicon, backed by rigorous benchmarking and a robust support structure.

0 comments

No comments yet.

Sign in to be the first to comment.