NTransformer is a high-efficiency C++/CUDA inference engine for large language models (LLMs). It can run the Llama 70B model on a single NVIDIA RTX 3090 with 24 GB of VRAM by streaming model layers through GPU memory over PCIe, backed by a three-tier caching scheme. An optional NVMe direct I/O path bypasses the CPU entirely for higher throughput, making large-model inference practical on consumer hardware.
## Key Features
- High Performance: Achieves a 33x speedup over the mmap baseline for 70B models on consumer hardware. For example, in Tiered (auto) mode it runs Llama 3.1 70B Q6_K at approximately 0.2 tok/s using only 23.1 GB of VRAM, with the remaining layers held in pinned RAM and streamed from NVMe.
- Multi-tier Adaptive Caching: Automatically adjusts the usage of VRAM-resident layers, pinned RAM, and NVMe/mmap tiers based on available hardware resources to optimize performance.
- No External Dependencies: Requires nothing beyond the CUDA Toolkit, keeping setup simple. No PyTorch, cuBLAS, or other external libraries are needed.
- Multiple Quantization Formats: Supports various quantization formats including Q4_0, Q8_0, Q4_K_M, Q6_K, F16, and F32, enhancing flexibility and performance.
- Advanced Streaming Capabilities: Integrates a SLEP (Streaming Layer Engine Pipeline) that enables overlapping NVMe reads, PCIe Direct Memory Access (DMA), and GPU computation, maximizing efficiency.
## Benchmark Results
The following table summarizes the performance benchmarks of notable models:
| Model | Mode | Decode | VRAM | Notes |
|---|---|---|---|---|
| Llama 3.1 8B Q8_0 | Resident | 48.9 tok/s | 10.0 GB | All layers stored in VRAM |
| Llama 3.1 70B Q6_K | Tiered (auto) | 0.2 tok/s | 23.1 GB | Combines VRAM, RAM, and NVMe |
## Architecture Overview
NTransformer's modular architecture separates the core runtime, CUDA kernels, memory management, model handling, inference, utilities, and the command-line interface into distinct components, each designed for performance and maintainability.
## Quick Start Usage
Typical invocations depend on model size and the desired mode of operation:
```shell
# For models that fit in VRAM
./ntransformer -m /path/to/llama-8b-q8_0.gguf -p "Hello" -n 128

# For models larger than VRAM
./ntransformer -m /path/to/llama-70b-q6_k.gguf -p "Hello" -n 32 --streaming

# Run in chat mode
./ntransformer -m /path/to/model.gguf --chat
```
## NVMe Direct Streaming
For models that do not fit entirely in VRAM, the NVMe backend streams layer data directly to the GPU, significantly accelerating inference. The data path is:
NVMe SSD → (DMA) → Pinned Staging → (PCIe H2D) → GPU Buffers → Compute
This process ensures minimal latency by removing the CPU from the data path.
## Roadmap
Development proceeds in phases. Completed phases established the core engine, advanced streaming, and performance optimizations; upcoming phases will explore additional quantization methods and new model architectures.
In summary, NTransformer provides high-efficiency LLM inference for anyone who needs to run large models on consumer-grade hardware.