prima.cpp is an innovative distributed implementation of llama.cpp designed to efficiently run 70B-level large language models (LLMs) on everyday low-resource devices, such as laptops, desktops, phones, and tablets—regardless of whether they have a GPU. This powerful tool enables users to execute models like QwQ-32B, Qwen 2.5-72B, Llama 3-70B, or DeepSeek R1 70B directly within a local home cluster.
Key Features
- Low Memory Pressure: prima.cpp keeps memory pressure below 10%, allowing users to run substantial models without the device freezing or running out of memory.
- Enhanced Speed: prima.cpp runs up to 15 times faster than llama.cpp, with strong token generation rates (e.g., 11 tokens/s for QwQ-32B and 1.5 tokens/s for Llama 3-70B).
- Memory-Efficient Loading: Uses mmap to lazily load model weights, enabling models of any size to run with low memory pressure.
- Optimized for Cheap Home Clusters: Supports GPU and CPU offloading, so devices with GPUs can use both CPU and GPU capabilities. Piped-ring parallelism with prefetching hides disk-loading latency and improves pipeline efficiency.
- Heterogeneous Workload Distribution: A scheduler assigns work based on each device's capabilities, including CPU power, disk speed, memory, and operating system, so resource use is optimized across the cluster.
- Cross-Platform Functionality: Runs on macOS, Linux, Android, and HarmonyOS; Windows support is planned.
Supported Models
prima.cpp is compatible with several popular models, including but not limited to:
- Llama: from Llama 3-8B to Llama 3-70B.
- Qwen: from Qwen 2.5-7B to Qwen 2.5-72B.
- DeepSeek: from DeepSeek R1-7B to DeepSeek R1-70B.
The ability to run these models on personal devices ensures that users can harness the power of advanced AI technologies without needing expensive, high-end computing systems.
Use Cases
With prima.cpp, users can conduct a range of computational tasks at home—from running large language models for text generation to experimenting with AI solutions in a decentralized manner. This project aims to make powerful AI accessible to everyone, facilitating personal and academic exploration in foundational AI technologies.
By supporting a wide range of devices and models, prima.cpp offers flexible deployment options that cater to individual requirements—making it an invaluable tool for AI enthusiasts and developers alike.