OpenAI compatibility

How to use OpenAI-style API?

First, you can check how to setup a basic llm cluster.

Then, to start Paddler with an OpenAI-style API service, you can add the --compat-openai-addr parameter to the paddler balancer command. This will start the compatibility service that will listen on the specified address and port.

paddler balancer 
    --compat-openai-addr 127.0.0.1:8070
    --inference-addr 127.0.0.1:8061 
    --management-addr 127.0.0.1:8060

It is intentionally started at a separate address from the main inference service, to prevent any conflicts with the endpoint, and parameter names.

It still uses exactly the same Paddler stack internally (with buffered requests, chat templates, etc.), so you can use all the features of Paddler. The only difference is the format of the requests and responses, and the API endpoints.

How does it work?

Internally, all those compatibility endpoints do is map the request parameters and responses back and forth between the OpenAI-style API and Paddler's internal API, so you do not need any additional configuration in your setup.

For example, if you used --compat-openai-addr 127.0.0.1:8070, you can find their completions endpoint at: http://127.0.0.1:8070/v1/chat/completions and the responses endpoint at http://127.0.0.1:8070/v1/responses.

Current status

Maintaining compatibility with OpenAI's API is a perpetually ongoing task because we need to keep up with their updates and changes.

Luckily, they do not update their API very often, but still, this is something to keep in mind.

Endpoint Supported parameters

/v1/chat/completions

  • max_completion_tokens
  • messages
  • stream
  • stream options
  • tools

/v1/responses

  • input
  • instructions
  • max_output_tokens
  • reasoning
  • stream
  • text
  • tools

🫵💪❤️ You can help us improve the compatibility! 😊

Check out the GitHub issues, or add your own.

Token usage

Both compatibility endpoints report token usage in OpenAI's own format. The numbers come from Paddler's per-kind token counting (see Token classification and usage count); the compatibility service simply renames the fields to match OpenAI.

For /v1/chat/completions, the response carries a usage object:

"usage": {
  "prompt_tokens": 318,
  "completion_tokens": 47,
  "total_tokens": 365,
  "prompt_tokens_details": {
    "cached_tokens": 0,
    "audio_tokens": 0
  },
  "completion_tokens_details": {
    "reasoning_tokens": 45
  }
}

In non-streaming requests it is always present. In streaming requests it is sent as a final chunk only if you ask for it, using stream_options:

"stream_options": { "include_usage": true }

For /v1/responses, the completed response carries a usage object with the Responses API field names:

"usage": {
  "input_tokens": 318,
  "input_tokens_details": { "cached_tokens": 0 },
  "output_tokens": 47,
  "output_tokens_details": { "reasoning_tokens": 45 },
  "total_tokens": 365
}

In both shapes, completion_tokens / output_tokens count every kind of generated token (content, reasoning, tool-call, and undeterminable), and reasoning_tokens is the thinking portion of that. The cached_tokens and audio_tokens fields are always 0 for now.

Contributing

In Paddler 2.1, we provided some libraries and tools in the code to make it easier to contribute to the compatibility efforts. If you want to start helping us, you can check out the GitHub issues tagged with the "compatibility" label.