How to enable embeddings
You can enable embeddings either in the web admin panel or through the API.
Enabling embeddings in the web admin panel
In the "Model" section, check the enable_embeddings checkbox at the bottom of the inference parameters list. You can also select the pooling type from the dropdown. Next, apply your changes.
Enabling embeddings through the API
Set enable_embeddings to true in the PUT request that changes the balancer's desired state.
You can also specify the pooling type in the same request.
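As a rough illustration, the sketch below builds such a PUT request using only Python's standard library. Only enable_embeddings comes from this page. The balancer address, the endpoint path, the inference_parameters and pooling_type field names, and the "Mean" value are assumptions made for the example, so check them against the API reference before relying on them. The sketch also assumes the request carries the whole desired state, which is why it starts from an existing state rather than sending a single field.

```python
import json
import urllib.request

# Hypothetical values: adjust to your deployment and the API reference.
BALANCER_URL = "http://127.0.0.1:8060"
DESIRED_STATE_PATH = "/api/v1/balancer_desired_state"  # assumed path


def build_enable_embeddings_request(desired_state, pooling_type=None):
    """Return a PUT request that enables embeddings in the given desired state."""
    # Copy the dicts so the caller's state is not mutated.
    state = dict(desired_state)
    params = dict(state.get("inference_parameters", {}))  # assumed field name
    params["enable_embeddings"] = True
    if pooling_type is not None:
        params["pooling_type"] = pooling_type  # assumed field name
    state["inference_parameters"] = params
    return urllib.request.Request(
        BALANCER_URL + DESIRED_STATE_PATH,
        data=json.dumps(state).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the returned request to urllib.request.urlopen would send it. For example, build_enable_embeddings_request(current_state, pooling_type="Mean") keeps every other field of current_state as it was and changes only the two embedding settings.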
Mixing embedding and generation requests
At any given time, an agent processes either token generation requests or embedding requests, never both. If an embedding request reaches an agent that is currently generating tokens, the request is rejected.
In a multi-agent fleet, the balancer routes embedding requests to an idle agent. With a single agent serving mixed traffic (token generation and embeddings), expect occasional rejections when an embedding request overlaps with active generation; adding more agents resolves this.
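Where adding agents is not yet an option, a client could also retry a rejected embedding request after a short pause. The sketch below is one possible client-side approach, not something the balancer provides. This page does not say how a rejection is reported, so the EmbeddingRejected exception and the send callable are hypothetical stand-ins for whatever your client code uses to issue the request and detect the rejection.

```python
import time


class EmbeddingRejected(Exception):
    """Hypothetical error raised by `send` when the agent is busy generating tokens."""


def embed_with_retry(send, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `send()` until it succeeds or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return send()
        except EmbeddingRejected:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the rejection to the caller
            # Exponential backoff: base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
```

The backoff gives an in-progress generation time to finish before the next try. The sleep parameter exists only so the delay can be replaced in tests.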