inference
Single non-streaming cloud inference. The server proxies the request to the configured LLM backend and returns the full response in one shot.
Return the recommended KV cache configuration for a model/device-class combination. No authentication required. The runtime_config object is intended to be merged directly into engine init parameters.
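As a rough illustration of "merged directly into engine init parameters", the sketch below overlays a runtime_config object on a set of local defaults. The key names (`kv_cache_dtype`, `max_context_tokens`, `n_threads`) and the helper `merge_runtime_config` are hypothetical placeholders, not fields or functions confirmed by this API; only the merge pattern is the point.

```python
from typing import Any, Dict


def merge_runtime_config(
    engine_defaults: Dict[str, Any], runtime_config: Dict[str, Any]
) -> Dict[str, Any]:
    """Return new init parameters with runtime_config overriding the defaults.

    Neither input is mutated. This is a shallow merge: a nested object in
    runtime_config replaces the corresponding default wholesale.
    """
    merged = dict(engine_defaults)
    merged.update(runtime_config)
    return merged


# Hypothetical local defaults and a hypothetical server recommendation.
defaults = {"kv_cache_dtype": "f16", "max_context_tokens": 2048, "n_threads": 4}
recommended = {"kv_cache_dtype": "q8_0", "max_context_tokens": 4096}

init_params = merge_runtime_config(defaults, recommended)
```

Keys present only in the defaults (here `n_threads`) survive the merge, so the recommendation only needs to carry the values it wants to change.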
Return the recommended speculative decoding configuration for a model/device combination. No authentication required. enabled=true requires at least 6 GB of RAM and a supported chip family.
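The eligibility rule stated above can be sketched as a simple predicate. The 6 GB threshold comes from the description; the chip family names and the function `speculative_decoding_eligible` are hypothetical, since the actual list of supported families is not given here.

```python
MIN_RAM_GB = 6  # threshold stated in the endpoint description

# Hypothetical placeholder values; the real supported families are not listed here.
SUPPORTED_CHIP_FAMILIES = frozenset({"example-family-a", "example-family-b"})


def speculative_decoding_eligible(ram_gb: float, chip_family: str) -> bool:
    """True only when both conditions hold: enough RAM and a supported chip."""
    return ram_gb >= MIN_RAM_GB and chip_family in SUPPORTED_CHIP_FAMILIES
```

A client could use a check like this to predict what the server is likely to return, but the server's response remains the source of truth for `enabled`.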
Streaming inference. Same routing semantics as inference.create, but the response is text/event-stream, so the SDK can start consuming tokens before the model has finished.
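To show how a client can consume tokens before generation finishes, here is a minimal sketch of incremental text/event-stream parsing. It follows the general Server-Sent Events framing (`data:` fields, blank line ends an event, lines starting with `:` are comments); the helper name `iter_sse_data` and the payload contents are assumptions, and the actual event schema of this endpoint is not described here.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event as soon as the event is complete.

    `lines` can be any iterable of text lines, for example a streaming HTTP
    response body read line by line.
    """
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            # Comment line, often used as a keep-alive; ignore it.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # SSE strips one leading space after the colon
        if field == "data":
            buffer.append(value)
        # Other fields (event, id, retry) are ignored in this sketch.


# Hypothetical stream contents for illustration.
sample = ["data: Hel\n", "\n", ": keep-alive\n", "data: lo\n", "\n"]
chunks = list(iter_sse_data(sample))
```

Because the function is a generator, each payload becomes available as soon as its terminating blank line arrives, rather than after the whole response has been received.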