How to enable embeddings

You can enable embeddings either in the web admin panel or through the API.

Enabling embeddings in the web admin panel

In the "Model" section, check the enable_embeddings checkbox at the bottom of the inference parameters list. You can also select the pooling type from the dropdown. Next, apply your changes.

Testing the request to generate tokens

Enabling embeddings through the API

Set enable_embeddings to true in the PUT request that changes the balancer's desired state.

You can also specify the pooling type in the same request.
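As a rough illustration, the request could be built like the sketch below. Only the PUT method, the enable_embeddings flag, and the existence of a pooling type setting come from this page. The URL, the inference_parameters nesting, the pooling_type field name, and the "Mean" value are assumptions made for the example, so check them against your balancer's API reference before use.

```python
import json
import urllib.request

# Placeholder address and path (assumption): replace with your balancer's
# management address and its desired-state endpoint.
MANAGEMENT_URL = "http://127.0.0.1:8060/api/v1/balancer_desired_state"


def build_desired_state_request(url, desired_state, pooling_type=None):
    """Build a PUT request that turns on embeddings in the desired state."""
    # Copy the parameters so the caller's dictionary is not modified.
    # Nesting the flag under "inference_parameters" is an assumption.
    params = dict(desired_state.get("inference_parameters", {}))
    params["enable_embeddings"] = True
    if pooling_type is not None:
        params["pooling_type"] = pooling_type  # hypothetical field name
    body = {**desired_state, "inference_parameters": params}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

To apply the change, you would pass the current desired state to build_desired_state_request(MANAGEMENT_URL, current_state, pooling_type="Mean") and hand the result to send(). Starting from the current desired state, rather than an empty one, avoids overwriting settings you did not intend to change.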

Mixing embedding and generation requests

At any given time, an agent processes either token generation requests or embedding requests, but not both. If an embedding request reaches an agent that is currently generating tokens, the request is rejected.

In a multi-agent fleet, the balancer routes embedding requests to an idle agent. With a single agent serving mixed traffic (token generation and embeddings), expect occasional rejections when an embedding request overlaps with active generation; adding more agents resolves this.
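If you stay on a single agent for now, a client can tolerate occasional rejections by retrying with a short backoff. The sketch below is one possible client-side approach, not part of the product: EmbeddingRejected is a hypothetical exception that your own client code would raise when it detects a rejection, and send stands for whatever function issues the embedding request.

```python
import time


class EmbeddingRejected(Exception):
    """Hypothetical error raised by your client when a request is rejected."""


def embed_with_retry(send, attempts=4, base_delay=0.25, sleep=time.sleep):
    """Call send() until it succeeds, doubling the wait after each rejection."""
    for attempt in range(attempts):
        try:
            return send()
        except EmbeddingRejected:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the rejection to the caller
            sleep(base_delay * 2 ** attempt)  # 0.25 s, 0.5 s, 1 s, ...
```

Retrying only hides the overlap from the caller and adds latency. Adding agents is still what removes the rejections.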