Changelog

v2.1.0

Features

  • OpenAI compatibility endpoint (/v1/chat/completions); see the example request below:
    • Support for the max_completion_tokens parameter
    • Support for the messages parameter
    • Support for the stream parameter
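
A minimal sketch of a request exercising these parameters, assuming a Paddler balancer listening on 127.0.0.1:8080 (the address and model name are assumptions, not Paddler defaults):

```python
# Sketch: call the OpenAI-compatible /v1/chat/completions endpoint with
# the parameters added in v2.1.0. Address and model name are assumptions.
import json
import urllib.request

payload = {
    "model": "default",  # assumption: whatever model your agent serves
    "messages": [
        {"role": "user", "content": "Say hello in one sentence."}
    ],
    "max_completion_tokens": 64,
    "stream": False,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["message"]["content"])
```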

v2.0.0

Important

This release no longer uses `llama-server`. Instead, we bundle the `llama.cpp` codebase directly into Paddler.

We use `llama.cpp` only as a library for inference and have reimplemented `llama-server`'s functionality within Paddler itself.

In place of `llama-server`, you now run `paddler agent`; there is no separate server process to manage, which significantly simplifies the setup.

Features

  • llama.cpp is now built directly into Paddler; there is no need to run llama-server separately
  • The paddler agent command replaces llama-server functionality
  • Check out the API page for a complete list of API changes

v1.2.0

Features

  • Add a TUI dashboard (paddler dashboard --management-addr [HOST]:[PORT]) for easily observing balancer instances from the terminal; see the example below
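
For example, to attach the dashboard to a balancer whose management server listens locally (the address is an assumption; use the value you passed to the balancer):

```
paddler dashboard --management-addr 127.0.0.1:8085
```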

v1.1.0

  • More meaningful error messages when the agent can't connect to the llama.cpp slots endpoint, or when the slots endpoint is not enabled in llama.cpp
  • Set the default logging level to info for agents and the balancer to increase the amount of information in the logs (previously it wasn't clear whether the agent was running or not)
  • Enable LTO optimization for release builds (see #28)

v1.0.0

The first stable release! Paddler is now rewritten in Rust and uses the Pingora framework for the networking stack. It introduces a few minor API changes and reporting improvements (documented in the README). The API and configuration are now stable and won't change until version 2.0.0.

This is a stability/quality release. The next plan is to introduce a supervisor that does not just monitor llama.cpp instances, but also manages them.

Requires llama.cpp version b4027 or above.

v0.10.0

This minor release makes Paddler compatible with the /slots endpoint changes introduced in llama.cpp b4027.

Requires llama.cpp version b4027 or above.

v0.9.0

Latest supported llama.cpp release: b4026

Features

  • Add --local-llamacpp-api-key flag to balancer to support llama.cpp API keys (see: #23)
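
A sketch of how the key might be passed; only --local-llamacpp-api-key comes from this release, while the other flags and addresses are assumptions based on a typical balancer setup:

```
paddler balancer \
    --management-addr 127.0.0.1:8085 \
    --reverseproxy-addr 127.0.0.1:8086 \
    --local-llamacpp-api-key "$LLAMACPP_API_KEY"
```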

v0.8.0

Features

  • Add --rewrite-host-header flag to balancer to rewrite the Host header in forwarded requests (see: #20)
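
Assuming it is a boolean switch, enabling it looks like the previous balancer sketch with the new flag added (again, the other flags and addresses are assumptions):

```
paddler balancer \
    --management-addr 127.0.0.1:8085 \
    --reverseproxy-addr 127.0.0.1:8086 \
    --rewrite-host-header
```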

v0.7.1

Fixes

  • Incorrect preemptive counting of remaining slots in some scenarios

v0.7.0

Requires llama.cpp release b3606 or above.

Breaking Changes

  • Adjusted to handle breaking changes in llama.cpp /health endpoint: https://github.com/ggerganov/llama.cpp/pull/9056

    Instead of using the /health endpoint to monitor slot statuses, starting from this version, Paddler uses the /slots endpoint to monitor llama.cpp instances. Paddler's /health endpoint remains unchanged.
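
As an illustration of the new monitoring path, here is a minimal sketch that polls a llama.cpp instance's /slots endpoint directly. The address is an assumption, and per-slot field names have varied across llama.cpp releases, so the state field checked here is illustrative rather than authoritative:

```python
# Sketch: inspect the /slots endpoint that Paddler now uses for
# monitoring. The address is an assumption; the per-slot "state"
# field (0 = idle) is illustrative and varies across llama.cpp releases.
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8088/slots") as resp:
    slots = json.loads(resp.read())

idle = sum(1 for slot in slots if slot.get("state") == 0)
print(f"{idle}/{len(slots)} slots idle")
```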

v0.6.0

Latest supported llama.cpp release: b3604

v0.5.0

Fixes

  • Management server crashed in some scenarios due to concurrency issues

v0.4.0

Thank you, @ScottMcNaught, for the help with debugging the issues! :)

Fixes

  • The OpenAI-compatible endpoint (/v1/chat/completions) is now properly balanced
  • The balancer's reverse proxy panicked in some scenarios when the underlying llama.cpp instance was abruptly closed during the generation of completion tokens
  • Added a mutex around the targets collection for better internal slots data integrity

v0.3.0

Features

  • Requests can queue when all llama.cpp instances are busy
  • AWS metadata support for resolving the agent's local IP address
  • StatsD metrics support

v0.1.0
