Changelog
v2.1.0
Features
- OpenAI compatibility endpoint (see the example request after this list):
  - Support for the `max_completion_tokens` parameter in the `/v1/chat/completions` endpoint
  - Support for the `messages` parameter in the `/v1/chat/completions` endpoint
  - Support for the `stream` parameter in the `/v1/chat/completions` endpoint
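As an illustration, a request exercising all three parameters might look like this; the listen address is a placeholder, and only the endpoint path and parameter names come from this release:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_completion_tokens": 128,
    "stream": true
  }'
```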
v2.0.0
Important
We now use `llama.cpp` only as a library for inference and have reimplemented `llama-server` functionality within Paddler itself. Instead of `llama-server`, you can use `paddler agent`, and you no longer need to run `llama-server` separately, which significantly simplifies the setup.
Features
- `llama.cpp` is now built directly into Paddler; there is no need to run `llama-server` separately
- The `paddler agent` command replaces `llama-server` functionality
- Check out the API page for a complete list of changes in the API
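A rough sketch of what this simplification looks like in practice. The flags are elided here because they depend on your setup; only the `paddler agent` command itself comes from this release (see the API page for details):

```shell
# Before v2.0.0: two processes, an external llama-server plus an agent
# that monitored it (flags elided).
llama-server --model model.gguf &
paddler agent ...

# From v2.0.0: a single process; the agent performs inference itself
# through llama.cpp as a library (flags elided).
paddler agent ...
```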
v1.2.0
Features
- Add a TUI dashboard (`paddler dashboard --management-addr [HOST]:[PORT]`) to easily observe balancer instances from the terminal
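For example, assuming the balancer's management server listens on 127.0.0.1:8095 (a placeholder address):

```shell
# Opens the TUI dashboard against a running balancer's management address.
paddler dashboard --management-addr 127.0.0.1:8095
```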
v1.1.0
- More meaningful error messages when the agent can't connect to the llama.cpp slot endpoint, or when the slot endpoint is not enabled in llama.cpp
- Set default logging level to `info` for agents and the balancer to increase the amount of information in the logs (previously it wasn't clear whether the agent was running or not)
- Enable LTO optimization for release builds (see #28)
v1.0.0
The first stable release! Paddler is now rewritten in Rust and uses the Pingora framework for the networking stack. A few minor API changes and reporting improvements are introduced (documented in the README). API and configuration are now stable, and won't be changed until version 2.0.0
.
This is a stability/quality release. The next plan is to introduce a supervisor that not only monitors llama.cpp instances, but also manages them.
Requires llama.cpp version b4027 or above.
v0.10.0
This is a minor release that makes Paddler compatible with the `/slots` endpoint changes introduced in llama.cpp b4027.
Requires llama.cpp version b4027 or above.
v0.9.0
Latest supported llama.cpp release: b4026
Features
- Add `--local-llamacpp-api-key` flag to the balancer to support llama.cpp API keys (see #23)
v0.8.0
Features
- Add `--rewrite-host-header` flag to the balancer to rewrite the `Host` header in forwarded requests (see #20)
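A sketch of how this might be used; the `paddler balancer` invocation and the elided flags are assumptions, and only `--rewrite-host-header` itself comes from this release:

```shell
# Hypothetical invocation: the other required balancer flags are elided.
paddler balancer ... --rewrite-host-header
```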
v0.7.1
Fixes
- Incorrect preemptive counting of remaining slots in some scenarios
v0.7.0
Requires at least the b3606 llama.cpp release.
Breaking Changes
- Adjusted to handle breaking changes in the llama.cpp `/health` endpoint: https://github.com/ggerganov/llama.cpp/pull/9056

  Instead of using the `/health` endpoint to monitor slot statuses, starting from this version Paddler uses the `/slots` endpoint to monitor llama.cpp instances. Paddler's `/health` endpoint remains unchanged.
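To illustrate the difference (addresses are placeholders; Paddler performs this polling internally):

```shell
# Before v0.7.0: slot statuses were read from llama.cpp's /health endpoint.
curl http://llamacpp-host:8081/health

# From v0.7.0: Paddler polls /slots instead; Paddler's own /health
# endpoint is unaffected.
curl http://llamacpp-host:8081/slots
```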
v0.6.0
Latest supported llama.cpp release: b3604
Features
v0.5.0
Fixes
- Management server crashed in some scenarios due to concurrency issues
v0.4.0
Thank you, @ScottMcNaught, for the help with debugging the issues! :)
Fixes
- OpenAI-compatible endpoint (`/v1/chat/completions`) is now properly balanced
- The balancer's reverse proxy panicked in some scenarios when the underlying `llama.cpp` instance was abruptly closed during the generation of completion tokens
- Added a mutex in the targets collection for better internal slots data integrity
v0.3.0
Features
- Requests can queue when all llama.cpp instances are busy
- AWS metadata support for determining the agent's local IP address
- StatsD metrics support