TensorSharp - Run large language models locally with a powerful C# inference engine.

TensorSharp

Run large language models locally with a powerful C# inference engine.

Pitch

TensorSharp is an innovative C# inference engine designed for executing large language models (LLMs) locally using GGUF model files. It offers a console application and a web-based chatbot interface, along with HTTP APIs that are compatible with Ollama and OpenAI. With support for Windows, MacOS, and Linux, TensorSharp enables efficient and effective LLM inference leveraging full GPU capabilities.

Description

TensorSharp is a powerful C# inference engine designed for running large language models (LLMs) locally by utilizing GGUF model files. This innovative tool provides several features, including a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for seamless programmatic access.

Key Features

Multi-architecture Support: Compatible with advanced models such as Gemma 3, Gemma 4, Qwen 3, Qwen 3.5/3.6-family, GPT OSS, Nemotron-H, and Mistral 3.
Multimodal Inference: Supports image, video, and audio inputs, enabling sophisticated interactions and analyses. For instance, Gemma 4 can process images, videos, and audio, while other models handle specific modalities.
Optimized Performance: TensorSharp supports GPU acceleration through multiple backends, including CUDA/cuBLAS for NVIDIA, MLX Metal for Apple Silicon, and optimized pure C# CPU execution paths.
Efficient Batching: It implements continuous batching and paged attention mechanisms that enhance performance and resource utilization when processing multiple requests.
Extensive API Compatibility: TensorSharp offers Ollama and OpenAI API-compatible endpoints, allowing for easy integration with existing tools and workflows.
Custom Tool Invocation: Models are capable of invoking user-defined tools, enabling complex, multi-turn interactions.
Inference Engine: A sophisticated InferenceEngine is responsible for handling concurrency and optimizations within the server, ensuring efficient model usage.
Detailed Logging and Observability: The framework provides structured logs capturing user input and responses, as well as various performance metrics, enhancing transparency and debugging capabilities.

Model Architectures Supported

TensorSharp supports a range of architectures, ensuring flexibility and compatibility:

Gemma 4: Advanced multimodal model that can handle images, video, and audio input.
Gemma 3: Specifically optimized for image input.
Qwen 3: Text-only model with excellent reasoning capabilities.
GPT OSS: Text-only optimized for robust interactions, including dedicated functionality for reasoning and tool calling.
Nemotron-H: A hybrid model that combines various technological advancements for image input.
Mistral 3: Specifically aimed at text with efficient inference processing.

More detailed documentation about specific model capabilities and examples can be found in the provided links.

Getting Started

Users can quickly access the capabilities of TensorSharp by engaging with the console application for text and multimodal inference. The application supports various commands for different types of inputs, including:

# Text inference
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --output result.txt 
# Interactive chat (REPL) mode
./TensorSharp.Cli --model <model.gguf> --backend ggml_metal --interactive

For extensive functionality and performance metrics, users can also run batch processing or throughput benchmarks with ease.

In sum, TensorSharp is an advanced inference engine catering to developers and researchers looking to utilize large language models locally. It combines speed, performance, and flexibility in its approach to LLM inference, making it a valuable tool in the fields of AI and machine learning.

0 comments

No comments yet.

New comment