PasLLM is a high-performance Large Language Model (LLM) inference engine written entirely in pure Object Pascal. It runs LLMs locally with a range of optimized quantization and inference features, supports multiple model architectures, and targets cross-platform deployment without dependencies on other languages or runtimes.
Key Features
- Pure Object Pascal: No Python or other external dependencies; the engine builds and runs with a Pascal toolchain alone, keeping setup for inference simple (a hypothetical embedding sketch follows this list).
- Cross-Platform Compatibility: Works with Delphi version ≥ 11.2 and FreePascal version ≥ 3.3.1, allowing for broad usability across different systems.
- Support for Multiple Architectures: Runs various model architectures including Llama, Qwen, Phi, Gemma, Mixtral, and more.
- Advanced Quantization Methods: Custom non-linear 4-bit quantization formats (Q4*NL) alongside standard 4-bit and 8-bit formats, significantly shrinking models with little loss in quality.
- Optimized Performance: The native Pascal implementation includes platform-specific optimizations for fast inference.
- Flexible Interface Options: Provides both command-line and graphical user interface (GUI) versions, enabling users to choose the interface that best suits their workflow.
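As a taste of what single-language integration could look like, here is a hypothetical embedding sketch. Every identifier in it (TLLMModel, LoadModel, Generate) is a placeholder invented for illustration, not PasLLM's documented API; consult the sources for the real unit and class names.

```pascal
program EmbedSketch;
{$mode objfpc}

type
  // Placeholder class: the real engine's types and methods differ.
  TLLMModel = class
  public
    constructor LoadModel(const FileName: string);
    function Generate(const Prompt: string; MaxTokens: Integer): string;
  end;

constructor TLLMModel.LoadModel(const FileName: string);
begin
  // A real engine would map the quantized .safetensors file here.
  WriteLn('Loading ', FileName);
end;

function TLLMModel.Generate(const Prompt: string; MaxTokens: Integer): string;
begin
  // A real engine would tokenize, run the transformer, and sample up to
  // MaxTokens tokens; this stub just echoes the prompt.
  Result := '(generated text for: ' + Prompt + ')';
end;

var
  Model: TLLMModel;
begin
  Model := TLLMModel.LoadModel('bin/models/qwen2.5_0.5b_instruct_q40nl.safetensors');
  try
    WriteLn(Model.Generate('Explain quantization in one sentence.', 64));
  finally
    Model.Free;
  end;
end.
```

The point is simply that no FFI layer or Python bridge sits between the host application and the engine.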
Quantization Formats
PasLLM provides multiple quantization formats aimed at a good balance of model quality and size; a generic sketch of the non-linear 4-bit idea follows the list:
- Q40NL: 4.5 bits per weight with a non-linear decode, targeting better reconstruction than linear 4-bit quantization at the same size.
- Q41NL: An alternative non-linearity with stronger emphasis on the distribution tails.
- Q42NL: A revised variant designed for more efficient reconstruction.
- Q43NL: Adds further optimization during quantization, including gradient-based and coarse-to-fine refinements.
- Standard Formats: Q40, Q80, and Q3F8, plus floating-point formats (FP8, FP16, BF16, FP32) for various precision needs at compact model sizes.
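To make the non-linear idea concrete, here is a minimal generic sketch of blockwise 4-bit quantization with a non-linear codebook: each block of weights stores one FP32 scale plus a 4-bit codebook index per weight, and the codebook packs more levels near zero, where weights cluster. The block size and codebook values are assumptions for illustration, not PasLLM's actual Q40NL layout.

```pascal
program QuantSketch;
{$mode objfpc}

const
  BlockSize = 64;
  // Hypothetical non-linear 4-bit codebook: denser near zero, where most
  // weights cluster, with longer steps toward the extremes.
  Codebook: array[0..15] of Single = (
    -1.0, -0.70, -0.50, -0.35, -0.25, -0.16, -0.08, 0.0,
     0.08, 0.16, 0.25, 0.35, 0.50, 0.70, 0.85, 1.0);

// Quantize one block: store a per-block scale plus one 4-bit index per weight.
procedure QuantizeBlock(const W: array of Single; out Scale: Single;
                        out Idx: array of Byte);
var
  i, j, Best: Integer;
  MaxAbs, Err, BestErr: Single;
begin
  MaxAbs := 0.0;
  for i := 0 to High(W) do
    if Abs(W[i]) > MaxAbs then MaxAbs := Abs(W[i]);
  if MaxAbs = 0.0 then MaxAbs := 1.0;
  Scale := MaxAbs;
  for i := 0 to High(W) do
  begin
    // Nearest-neighbor search over the 16 scaled codebook levels.
    Best := 0;
    BestErr := Abs(W[i] - Scale * Codebook[0]);
    for j := 1 to 15 do
    begin
      Err := Abs(W[i] - Scale * Codebook[j]);
      if Err < BestErr then
      begin
        BestErr := Err;
        Best := j;
      end;
    end;
    Idx[i] := Best;
  end;
end;

// Dequantize: one table lookup plus one multiply per weight.
function DequantizeWeight(Scale: Single; Index: Byte): Single;
begin
  Result := Scale * Codebook[Index];
end;

var
  W: array[0..BlockSize - 1] of Single;
  Idx: array[0..BlockSize - 1] of Byte;
  Scale: Single;
  i: Integer;
begin
  for i := 0 to BlockSize - 1 do
    W[i] := Sin(i * 0.37);  // Dummy weights for demonstration.
  QuantizeBlock(W, Scale, Idx);
  for i := 0 to 3 do
    WriteLn(W[i]:8:4, ' -> ', DequantizeWeight(Scale, Idx[i]):8:4);
end.
```

With 64-weight blocks, storage works out to (64 × 4 + 32) / 64 = 4.5 bits per weight, matching the figure quoted for Q40NL above.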
Model Support
PasLLM provides pre-quantized models for download from Mega.nz; a rough size estimate for such models is sketched after the list. Supported models include:
- Llama (1B, 3B, and 8B variants)...
- Qwen, Phi, Gemma, and others with various sizes and configurations.
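For a rough sense of what these quantization levels mean on disk, the arithmetic is just parameters × bits per weight / 8. The sketch below uses that formula; it ignores unquantized tensors (embeddings, norms) and container overhead, so real file sizes will differ somewhat.

```pascal
program ModelSize;
{$mode objfpc}

// Back-of-the-envelope model size: parameters * bits-per-weight / 8 bytes.
// Per-block metadata is assumed to be folded into the bpw figure.
function SizeGiB(Params: Int64; BitsPerWeight: Double): Double;
begin
  Result := Params * BitsPerWeight / 8.0 / (1024.0 * 1024.0 * 1024.0);
end;

begin
  WriteLn('8B   @ FP16  (16.0 bpw): ', SizeGiB(8000000000, 16.0):6:2, ' GiB');
  WriteLn('8B   @ Q40NL ( 4.5 bpw): ', SizeGiB(8000000000, 4.5):6:2, ' GiB');
  WriteLn('0.5B @ Q40NL ( 4.5 bpw): ', SizeGiB(500000000, 4.5):6:2, ' GiB');
end.
```

An 8B model thus drops from roughly 15 GiB at FP16 to about 4.2 GiB at 4.5 bits per weight, which is what makes local execution on ordinary hardware practical.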
Quick Start Examples
Command Line Interface Usage
```bash
# Execute inference with a quantized model
./bin/pasllmcli -model=bin/models/qwen2.5_0.5b_instruct_q40nl.safetensors
```
Building from Source
To build the command-line tool from source with FreePascal:

```bash
fpc -O3 src/pasllmcli/pasllmcli.dpr
```

With Delphi, open src/pasllmcli/pasllmcli.dproj in the IDE and build.
Structure and Compatibility
The repository separates the core inference engine from the applications built on top of it, keeping both easy to navigate and modify. It compiles on all supported platforms without third-party dependencies.
Conversion of Models
Models from platforms such as Hugging Face can be converted for use with PasLLM using the provided conversion scripts.
Documentation and Support
Comprehensive documentation is available covering the quantization formats and the inner workings of PasLLM, giving users what they need to deploy LLMs successfully.