Micro-Expert-Router-SSD-Streamed-MoE-MER - Efficient execution of Mixture-of-Experts models with rapid expert swapping

Micro-Expert-Router: SSD-Streamed Mixture-of-Experts Execution Engine

The Micro-Expert-Router (MER) is a Rust execution engine built specifically for Mixture-of-Experts (MoE) models. It stores expert weights on PCIe-attached NVMe drives and hot-swaps individual experts into memory on demand, so that only the experts a token actually needs occupy DRAM.

Key Features

High-Performance Data Management: MER keeps the router resident in RAM and stores the expert weights on SSD, loading each expert only when it is needed.

Efficient I/O Operations: The engine reads experts with O_DIRECT positional reads, which bypass the kernel’s page cache and take advantage of the SSD’s high sequential read throughput (6-14 GB/s).
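The alignment bookkeeping behind such a read can be sketched as follows. This is a minimal illustration, not MER's actual code: the names `BLOCK`, `aligned_span`, `AlignedBuf`, and `read_expert` are hypothetical, and the 4096-byte block size is an assumption. The file is opened normally here so the example runs anywhere; on Linux the O_DIRECT flag could be added at open time through `OpenOptionsExt::custom_flags` (for instance with `libc::O_DIRECT` from the `libc` crate), in which case the offset, length, and buffer address must all be block-aligned, which is what the sketch prepares for.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Assumed logical block size (hypothetical; query the device in real code).
const BLOCK: usize = 4096;

/// Round a byte range outward to block boundaries.
/// Returns (aligned start offset, aligned length).
fn aligned_span(offset: u64, len: usize) -> (u64, usize) {
    let block = BLOCK as u64;
    let start = offset / block * block;
    let end = (offset + len as u64 + block - 1) / block * block;
    (start, (end - start) as usize)
}

/// Heap buffer whose address is a multiple of BLOCK.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    fn new(len: usize) -> Self {
        assert!(len > 0 && len % BLOCK == 0, "length must be a block multiple");
        let layout = Layout::from_size_align(len, BLOCK).expect("valid layout");
        // SAFETY: the layout has a non-zero size (checked above).
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        AlignedBuf { ptr, layout }
    }
    fn as_slice(&self) -> &[u8] {
        // SAFETY: ptr is valid for layout.size() initialized (zeroed) bytes.
        unsafe { std::slice::from_raw_parts(self.ptr, self.layout.size()) }
    }
    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: as above, and &mut self guarantees exclusive access.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: ptr was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

/// Read `len` bytes at `offset` using positional reads (pread(2) on Unix),
/// going through a block-aligned window as O_DIRECT would require.
fn read_expert(file: &File, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    let (start, span) = aligned_span(offset, len);
    let mut buf = AlignedBuf::new(span);
    let mut filled = 0;
    while filled < span {
        let n = file.read_at(&mut buf.as_mut_slice()[filled..], start + filled as u64)?;
        if n == 0 {
            break; // end of file: the last block may be partial
        }
        filled += n;
    }
    let skip = (offset - start) as usize;
    if filled < skip + len {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "short read"));
    }
    Ok(buf.as_slice()[skip..skip + len].to_vec())
}
```

The final copy into a `Vec` keeps the example simple; an engine aiming for zero-copy would more plausibly hand out the aligned buffer itself.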

Dynamic Expert Activation: MER activates only the top-K experts per token, so only those experts need to be resident at any moment. This makes it possible to run MoE models that are significantly larger than available DRAM.
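Top-K selection itself is small enough to sketch. The function below is illustrative only: the name `top_k_gate` is hypothetical, and it assumes the router emits one logit per expert and that the gate weights are a softmax over the selected logits, as in Mixtral-style routing. MER's actual gate may differ.

```rust
/// Pick the `k` experts with the highest logits and return
/// (expert index, gate weight) pairs whose weights sum to 1.
fn top_k_gate(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Rank expert indices by logit, highest first. The sort is stable,
    // so ties keep the lower expert index first.
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(k);

    // Softmax over the selected logits only (max-subtracted for stability).
    let max = idx
        .iter()
        .map(|&i| logits[i])
        .fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = idx.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    idx.into_iter().zip(exps).map(|(i, e)| (i, e / sum)).collect()
}
```

Only the experts returned here would have to be fetched for the current token; every other expert can stay on disk.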

Quantization Optimization: Support for quantized weights (e.g., 4-bit) sharply reduces the number of bytes that must be read per expert and improves interactive throughput on Mixtral-class models, making the approach practical on current hardware.
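To make the footprint argument concrete, here is a rough sketch of 4-bit unpacking and the associated size arithmetic. The storage layout (two values per byte, low nibble first, each nibble n standing for the integer n - 8, one scale per call) and the names `dequantize_q4` and `weight_bytes` are assumptions for illustration; the source does not specify MER's quantization format.

```rust
/// Unpack signed 4-bit values (two per byte, low nibble first) and scale them.
/// Hypothetical layout: a stored nibble n in 0..=15 represents the integer n - 8.
fn dequantize_q4(packed: &[u8], scale: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &byte in packed {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out.push(lo as f32 * scale);
        out.push(hi as f32 * scale);
    }
    out
}

/// Bytes needed to store `params` weights at `bits` bits each,
/// ignoring scale and other metadata.
fn weight_bytes(params: u64, bits: u64) -> u64 {
    (params * bits + 7) / 8
}
```

As a back-of-envelope example, an expert with 3 x 4096 x 14336 weights (Mixtral-8x7B-like dimensions, used here purely for illustration) needs about 352 MB at 16 bits and about 88 MB at 4 bits. At the 6 GB/s end of the quoted read range, that is roughly 59 ms versus 15 ms to stream one expert, before any metadata overhead.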

Architecture Overview

The architecture uses a multi-layer cache so that only frequently used experts remain in memory:

Cache Management: A predictive caching mechanism combines sparse Markov chains, a Locality Monitor, and a Neural Speculator to decide which experts to prefetch. The aim is to have the experts most likely to be needed already in memory before they are requested.
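One way to picture the Markov part of this predictor is a sparse table of second-order transition counts. The sketch below is an assumption-laden simplification: the type `Markov2` and the function `prefetch_set` are hypothetical names, it tracks a single stream of expert IDs rather than top-K sets per layer, and the Locality Monitor and Neural Speculator are represented only by the candidate lists they would produce.

```rust
use std::collections::{BTreeSet, HashMap};

/// Sparse second-order Markov predictor: counts which expert followed
/// each observed pair (two steps ago, one step ago).
struct Markov2 {
    counts: HashMap<(u32, u32), HashMap<u32, u32>>,
    a: Option<u32>, // expert seen two steps ago
    b: Option<u32>, // expert seen one step ago
}

impl Markov2 {
    fn new() -> Self {
        Markov2 { counts: HashMap::new(), a: None, b: None }
    }

    /// Record that `expert` was activated next.
    fn observe(&mut self, expert: u32) {
        if let (Some(a), Some(b)) = (self.a, self.b) {
            *self.counts.entry((a, b)).or_default().entry(expert).or_insert(0) += 1;
        }
        self.a = self.b;
        self.b = Some(expert);
    }

    /// Up to `n` most frequent successors of the current context,
    /// most frequent first (ties broken by lower expert ID).
    fn predict(&self, n: usize) -> Vec<u32> {
        let (Some(a), Some(b)) = (self.a, self.b) else {
            return Vec::new();
        };
        let Some(next) = self.counts.get(&(a, b)) else {
            return Vec::new();
        };
        let mut ranked: Vec<(u32, u32)> = next.iter().map(|(&e, &c)| (e, c)).collect();
        ranked.sort_by(|x, y| y.1.cmp(&x.1).then(x.0.cmp(&y.0)));
        ranked.into_iter().take(n).map(|(e, _)| e).collect()
    }
}

/// Union of the three candidate sets (S from the Markov predictor,
/// L from the locality monitor, M from the neural speculator).
fn prefetch_set(s: &[u32], l: &[u32], m: &[u32]) -> BTreeSet<u32> {
    s.iter().chain(l).chain(m).copied().collect()
}
```

Because only observed contexts get an entry, the table stays sparse even when the number of possible expert pairs is large.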

Modular Design: The Rust crate is organized into single-responsibility modules: the router, which directs data flow; buffer management, which handles I/O operations; and specialized modules for predictive analysis.

End-to-End Workflow

An example workflow for expert activation includes:

         +------------+     +--------------+     +-----------+     +-------------------+
Token →  |   Router   |  →  |  Expert IDs  |  →  | LRU Cache |  →  | SwiGLU FFN        |
         | LinearGate |     |  e.g. [3,7]  |     +-----+-----+     | per expert,       |
         |  or Markov |     +-------+------+           │ Miss      | gate-weighted sum |
         +-----+------+             │                  ↓           +-------------------+
               │                    │         +------------------+
               │ Hidden State       │         | BufferPool slot  | ←──────┐
               ↓                    │         |  (Aligned, Pre-  |        │
   +------------------------+       │         |   Allocated)     |        │
   | Predictive Controller  |       │         +--------+---------+        │
   |   S = 2nd-order Markov |       │                  ↓                  │
   |   L = LocalityMonitor  |  →    │         +------------------+        │ On Arc drop
   |   M = NeuralSpeculator |       │         |  pread(2) read   |        │
   |   E = S ∪ L ∪ M        |       │         | O_DIRECT + (Opt) |        │
   +-----------+------------+       │         +--------+---------+        │
               │                    │                  ↑                  │
               ↓                    ↓                  │                  │
              Non-evicting Prefetches         NVMe SSD → DMA → RAM ───────┘
                                                       ↓
                                    Bytes reinterpreted as weights → matmul

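On a cache hit, the expert weights are used directly. On a miss, they are read from NVMe into a pre-allocated, aligned buffer slot, while the predictive controller issues non-evicting prefetches for the experts it expects to be needed next. The goal is to keep per-token latency low while sustaining high processing throughput.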
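The interplay between the LRU cache, non-evicting prefetches, and buffer slots that return to the pool "on Arc drop" can be sketched as follows. All type and method names (`BufferPool`, `Slot`, `ExpertCache`, `prefetch`, and so on) are hypothetical, and the policy details (for example, where a prefetched entry lands in the LRU order) are assumptions rather than a description of MER's implementation.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::{Arc, Mutex};

/// Pool of pre-allocated buffers.
struct BufferPool {
    free: Mutex<Vec<Vec<u8>>>,
}

/// A buffer on loan from the pool; it returns itself when dropped.
struct Slot {
    buf: Vec<u8>,
    pool: Arc<BufferPool>,
}

impl BufferPool {
    fn new(slots: usize, slot_len: usize) -> Arc<Self> {
        Arc::new(BufferPool { free: Mutex::new(vec![vec![0u8; slot_len]; slots]) })
    }
    fn acquire(self: &Arc<Self>) -> Option<Slot> {
        let buf = self.free.lock().unwrap().pop()?;
        Some(Slot { buf, pool: Arc::clone(self) })
    }
    fn available(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}

impl Drop for Slot {
    fn drop(&mut self) {
        // Runs when the last Arc<Slot> goes away: hand the buffer back.
        let buf = std::mem::take(&mut self.buf);
        self.pool.free.lock().unwrap().push(buf);
    }
}

/// LRU cache of resident experts, keyed by expert ID.
struct ExpertCache {
    capacity: usize,
    map: HashMap<u32, Arc<Slot>>,
    order: VecDeque<u32>, // front = least recently used
}

impl ExpertCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        ExpertCache { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Look up an expert and mark it most recently used.
    fn get(&mut self, id: u32) -> Option<Arc<Slot>> {
        let slot = self.map.get(&id)?.clone();
        self.order.retain(|&e| e != id);
        self.order.push_back(id);
        Some(slot)
    }

    /// Demand insert after a miss: evicts the least recently used entry if full.
    fn insert(&mut self, id: u32, slot: Slot) {
        if !self.map.contains_key(&id) && self.map.len() >= self.capacity {
            if let Some(victim) = self.order.pop_front() {
                self.map.remove(&victim);
            }
        }
        self.order.retain(|&e| e != id);
        self.order.push_back(id);
        self.map.insert(id, Arc::new(slot));
    }

    /// Speculative insert: never evicts. Returns false (and drops the slot,
    /// which sends its buffer back to the pool) when there is no free room.
    fn prefetch(&mut self, id: u32, slot: Slot) -> bool {
        if self.map.contains_key(&id) || self.map.len() >= self.capacity {
            return false;
        }
        self.order.push_back(id);
        self.map.insert(id, Arc::new(slot));
        true
    }
}
```

Handing out `Arc<Slot>` means an eviction cannot pull a buffer out from under a computation that is still using it: the buffer only returns to the pool once the cache and every in-flight user have released their references.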

Conclusion

The Micro-Expert-Router executes Mixture-of-Experts models by keeping the router in RAM and streaming experts from NVMe storage on demand. By combining direct SSD reads, predictive caching, and quantized weights, it lets users run MoE models that would not otherwise fit in available DRAM.