Changelog
v2.1.0
Features
OpenAI compatibility endpoint (see the example request after this list):
- Support for `max_completion_tokens` parameter in the `/v1/chat/completions` endpoint
- Support for `messages` parameter in the `/v1/chat/completions` endpoint
- Support for `stream` parameter in the `/v1/chat/completions` endpoint
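As a minimal sketch (not an official example), the request below exercises all three parameters against a balancer assumed to listen on localhost:8080; the response parsing assumes the standard OpenAI chat-completion schema.

```python
# Minimal sketch of a request against Paddler's OpenAI-compatible endpoint.
# The address localhost:8080 is an assumption; use your balancer's address.
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_completion_tokens": 64,  # upper bound on generated tokens
        "stream": False,              # set True for server-sent-event streaming
    },
    timeout=60,
)
response.raise_for_status()
# Assuming the standard OpenAI response schema:
print(response.json()["choices"][0]["message"]["content"])
```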
v2.0.0
Important
This release no longer uses `llama-server`. Instead, we bundle the `llama.cpp` codebase directly into Paddler. We only use `llama.cpp` as a library for inference and have reimplemented `llama-server` functionality within Paddler itself.
Instead of `llama-server`, you can use `paddler agent`, and you no longer need to run `llama-server` separately, which significantly simplifies the setup.
Features
- `llama.cpp` is now built directly into Paddler; no need to run `llama-server` separately
- `paddler agent` command replaces `llama-server` functionality
- Check out the API page for a complete list of changes in the API
v1.2.0
Features
- Add TUI dashboard (`paddler dashboard --management-addr [HOST]:[PORT]`) to easily observe balancer instances from the terminal
v1.1.0
- More meaningful error messages when the agent can't connect to the llama.cpp slot endpoint, or when the slot endpoint is not enabled in llama.cpp
- Set default logging level to `info` for agents and the balancer to increase the amount of information in the logs (it wasn't clear whether the agent was running or not)
- Enable LTO optimization for release builds (see #28)
v1.0.0
The first stable release! Paddler is now rewritten in Rust and uses the Pingora framework for the networking stack. A few minor API changes and reporting improvements are introduced (documented in the README). The API and configuration are now stable and won't change until version 2.0.0.
This is a stability/quality release. The next plan is to introduce a supervisor that not only monitors llama.cpp instances but also manages them.
Requires llama.cpp version b4027 or above.
v0.10.0
This update is a minor release to make Paddler compatible with the `/slots` endpoint changes introduced in llama.cpp b4027.
Requires llama.cpp version b4027 or above.
v0.9.0
Latest supported llama.cpp release: b4026
Features
- Add `--local-llamacpp-api-key` flag to the balancer to support llama.cpp API keys (see: #23)
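For context, a llama.cpp server started with an API key expects the key as a bearer token; the hedged sketch below shows such a request made directly against a llama.cpp instance (the address and key are placeholders).

```python
# Hedged sketch: a llama.cpp server started with --api-key expects requests to
# carry the key as a bearer token. Address and key here are placeholders.
import requests

response = requests.get(
    "http://localhost:8081/health",
    headers={"Authorization": "Bearer YOUR_LLAMACPP_API_KEY"},
    timeout=5,
)
print(response.status_code)  # 200 if the key is accepted
```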
v0.8.0
Features
- Add `--rewrite-host-header` flag to the balancer to rewrite the `Host` header in forwarded requests (see: #20)
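One way to observe the flag's effect is to register a throwaway upstream that prints the `Host` header it receives; a minimal sketch, assuming port 8081, follows.

```python
# Minimal sketch of an upstream that logs the Host header it receives, useful
# for checking whether the balancer rewrote it. Port 8081 is an assumption.
from http.server import BaseHTTPRequestHandler, HTTPServer

class HostEcho(BaseHTTPRequestHandler):
    def do_GET(self):
        print("Host header seen by upstream:", self.headers.get("Host"))
        self.send_response(200)
        self.end_headers()

HTTPServer(("0.0.0.0", 8081), HostEcho).serve_forever()
```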
v0.7.1
Fixes
- Incorrect preemptive counting of remaining slots in some scenarios
v0.7.0
Requires llama.cpp release b3606 or above.
Breaking Changes
- Adjusted to handle breaking changes in the llama.cpp `/health` endpoint: https://github.com/ggerganov/llama.cpp/pull/9056
- Instead of using the `/health` endpoint to monitor slot statuses, starting from this version, Paddler uses the `/slots` endpoint to monitor llama.cpp instances (see the polling sketch below). Paddler's `/health` endpoint remains unchanged.
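For illustration, the sketch below polls an instance's `/slots` endpoint the way a monitor might; the `is_processing` field name is an assumption based on recent llama.cpp releases, so check the schema of the release you run.

```python
# Hedged sketch of /slots-based monitoring: count idle slots on one instance.
# The instance address and the "is_processing" field name are assumptions.
import requests

slots = requests.get("http://localhost:8081/slots", timeout=5).json()
idle = sum(1 for slot in slots if not slot.get("is_processing"))
print(f"{idle}/{len(slots)} slots idle")
```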
v0.6.0
Latest supported llama.cpp release: b3604
Features
v0.5.0
Fixes
- Management server crashed in some scenarios due to concurrency issues
v0.4.0
Thank you, @ScottMcNaught, for the help with debugging the issues! :)
Fixes
- OpenAI-compatible endpoint (`/v1/chat/completions`) is now properly balanced
- Balancer's reverse proxy panicked in some scenarios when the underlying `llama.cpp` instance was abruptly closed during the generation of completion tokens
- Added mutex in the targets collection for better internal slots data integrity
v0.3.0
Features
- Requests can queue when all llama.cpp instances are busy
- AWS Metadata support for agent local IP address
- StatsD metrics support