Changelog
v2.1.0
Features
OpenAI compatibility endpoint:
Support for max_completion_tokens parameter in /v1/chat/completions endpoint
Support for messages parameter in /v1/chat/completions endpoint
Support for stream parameter in /v1/chat/completions endpoint
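For illustration only, a request exercising all three parameters could look like the sketch below; the host, the port, and any fields beyond the three listed (for example a model field) are assumptions, not part of this entry.

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [{"role": "user", "content": "Hello"}],
        "max_completion_tokens": 128,
        "stream": true
      }'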
v2.0.0
Important
This release no longer uses llama-server. Instead, we bundle the llama.cpp codebase directly into Paddler.
We only use llama.cpp as a library for inference and have reimplemented llama-server functionality within Paddler itself.
You can use paddler agent instead of llama-server, so you no longer need to run llama-server separately, which significantly simplifies the setup.
Features
llama.cpp is now built directly into Paddler, so there is no need to run llama-server separately
paddler agent command replaces llama-server functionality
Check out the API page for a complete list of changes in the API
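As a rough sketch of the operational change (the flag names below are assumptions based on the pre-2.0 CLI and are not confirmed by this entry; check paddler agent --help for the actual options):

    # before 2.0.0: llama-server runs as a separate process and the agent attaches to it
    llama-server --port 8088
    paddler agent --local-llamacpp-addr 127.0.0.1:8088 --management-addr 127.0.0.1:8085
    # from 2.0.0: the agent embeds llama.cpp, so only one process is needed
    paddler agent ...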
v1.2.0
Features
Add TUI dashboard (paddler dashboard --management-addr [HOST]:[PORT]) to easily observe balancer instances from the terminal
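For example (the address below is an assumption; point it at the management address your balancer exposes):

    paddler dashboard --management-addr 127.0.0.1:8085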
v1.1.0
More meaningful error messages when the agent can't connect to the llama.cpp slots endpoint, or when the slots endpoint is not enabled in llama.cpp
Set default logging level to info for agents and the balancer to increase the amount of information in the logs (previously it wasn't clear whether the agent was running or not)
Enable LTO optimization for release builds (see #28)
v1.0.0
The first stable release! Paddler is now rewritten in Rust and uses the Pingora framework for the networking stack. A few minor API changes and reporting improvements are introduced (documented in the README). The API and configuration are now stable and won't change until version 2.0.0.
This is a stability/quality release. The next plan is to introduce a supervisor that not only monitors llama.cpp instances but also manages them.
Requires llama.cpp version b4027 or above.
v0.10.0
This is a minor release that makes Paddler compatible with the /slots endpoint changes introduced in llama.cpp b4027.
Requires llama.cpp version b4027 or above.
v0.9.0
Latest supported llama.cpp release: b4026
Features
Add --local-llamacpp-api-key flag to balancer to support llama.cpp API keys (see: #23)
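A minimal sketch of passing the key (only the --local-llamacpp-api-key flag itself is confirmed by this entry; the key value is a placeholder and all other flags are elided):

    paddler balancer --local-llamacpp-api-key <llamacpp-api-key> ...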
v0.8.0
Features
Add --rewrite-host-header flag to balancer to rewrite the Host header in forwarded requests (see: #20)
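For example (the other balancer flags are elided and assumed):

    paddler balancer --rewrite-host-header ...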
v0.7.1
Fixes
Incorrect preemptive counting of remaining slots in some scenarios
v0.7.0
Requires llama.cpp release b3606 or above.
Breaking Changes
Adjusted to handle breaking changes in llama.cpp's /health endpoint: https://github.com/ggerganov/llama.cpp/pull/9056
Instead of using the /health endpoint to monitor slot statuses, starting from this version Paddler uses the /slots endpoint to monitor llama.cpp instances. Paddler's own /health endpoint remains unchanged.
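To see the data the agent now relies on, you can query a running llama.cpp instance's /slots endpoint directly (the address is an assumption, and the endpoint must be enabled on the llama.cpp side):

    curl http://127.0.0.1:8088/slots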
v0.6.0
Latest supported llama.cpp release: b3604
Features
v0.5.0
Fixes
Management server crashed in some scenarios due to concurrency issues
v0.4.0
Thank you, @ScottMcNaught, for the help with debugging the issues! :)
Fixes
OpenAI compatible endpoint is now properly balanced (/v1/chat/completions)
Balancer's reverse proxy panicked in some scenarios when the underlying llama.cpp instance was abruptly closed during the generation of completion tokens
Added mutex in the targets collection for better internal slots data integrity
v0.3.0
Features
Requests can queue when all llama.cpp instances are busy
AWS Metadata support for agent local IP address
StatsD metrics support