Continue from raw prompt
This endpoint exists only as a fail-safe workaround for edge cases that Paddler may not otherwise handle.
This endpoint allows you to send a raw prompt to the LLM without applying any chat templates and receive tokens in response.
Not applying the chat template means that whatever you pass as the raw prompt, Paddler will send directly to the LLM.
In general, use the Continue from conversation history endpoint to generate completions, because it is much safer and gives some guarantee of response quality.
If you need to force your own chat template, prefer to do that through the Balancer desired state endpoint instead.
Endpoint
Method: POST
Path: /api/v1/continue_from_raw_prompt
Payload
Parameters
grammar
Optional grammar that constrains the response. Accepts a gbnf or json_schema constraint. See Using grammars for details on both formats and the error cases.
max_tokens
Maximum number of tokens to generate in the response. This is a hard limit; use it as a failsafe to prevent the model from generating too many tokens.
raw_prompt
String with the raw prompt to send to the LLM.
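As a minimal sketch of how the parameters above fit together, the following Python helper builds (but does not send) a POST request for this endpoint. It assumes the parameters are top-level keys of a JSON body. The base URL and the helper name `build_request` are assumptions made for illustration, and the grammar object is passed through unchanged because its shape is described in Using grammars, not here.

```python
import json
import urllib.request

# Assumption: address of the Paddler balancer's inference service; adjust to your deployment.
BASE_URL = "http://127.0.0.1:8061"
ENDPOINT_PATH = "/api/v1/continue_from_raw_prompt"


def build_request(raw_prompt, max_tokens, grammar=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for the raw prompt endpoint.

    Hypothetical helper: assumes the parameters are top-level JSON keys.
    """
    payload = {"raw_prompt": raw_prompt, "max_tokens": max_tokens}
    if grammar is not None:
        # Passed through untouched; see "Using grammars" for the accepted formats.
        payload["grammar"] = grammar
    return urllib.request.Request(
        base_url + ENDPOINT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The resulting object can be passed to `urllib.request.urlopen` to actually send the request. Keeping construction separate from sending makes the payload easy to inspect before anything reaches the LLM.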
Response
Success
The response body is a stream of JSON objects, one per generated token. Each token's inner key under GeneratedToken is the token kind. See Continue from conversation history for the full set of token kinds.
generated_by is the name of the agent that produced the token (its --name), or null if the agent has no name.
The stream ends with a single Done message that carries the token usage for the request:
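One way to consume such a stream is sketched below in Python. It rests on assumptions: that each JSON object arrives on its own line, and that `GeneratedToken` and `Done` appear as object keys somewhere in each message. Because the surrounding envelope may differ, the hypothetical helper `find_key` searches nested objects instead of hard-coding a layout.

```python
import json


def find_key(node, key):
    """Depth-first search for `key` in nested dicts and lists.

    Returns (found, value) so that a stored null can be told apart from a miss.
    """
    if isinstance(node, dict):
        if key in node:
            return True, node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return False, None
    for child in children:
        found, value = find_key(child, key)
        if found:
            return True, value
    return False, None


def read_stream(lines):
    """Collect GeneratedToken payloads until a Done message is seen.

    Assumption: one JSON object per line. Returns (tokens, done_payload);
    done_payload is None if the stream ended without a Done message.
    """
    tokens = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        message = json.loads(line)
        found, token = find_key(message, "GeneratedToken")
        if found:
            tokens.append(token)
            continue
        found, done = find_key(message, "Done")
        if found:
            return tokens, done
    return tokens, None
```

`read_stream` accepts any iterable of lines, so it can be fed a list of strings in a test or an open HTTP response object, which yields lines as bytes.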
Error
In case of an error, the response will be:
Sending requests with a grammar
To constrain the response, pass a grammar in the optional grammar parameter:
See Using grammars for both supported formats and the error cases.