inference

Single non-streaming cloud inference. The server proxies the request to the configured LLM backend and returns the full response in one shot.

Return the recommended KV cache configuration for a model/device-class combination. No authentication required. The runtime_config object is intended to be merged directly into engine init parameters.
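A minimal sketch of the merge step, assuming the response is a JSON object with a top-level runtime_config key and that engine init parameters are a flat dictionary. The helper name and every key and value below are hypothetical placeholders, not fields confirmed by this API.

```python
from typing import Any, Dict


def merge_runtime_config(engine_params: Dict[str, Any],
                         response: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with runtime_config layered over engine_params.

    Keys in runtime_config win on conflict; the inputs are not mutated.
    """
    runtime_config = response.get("runtime_config", {})
    return {**engine_params, **runtime_config}


# Hypothetical values for illustration only.
base_params = {"n_ctx": 2048, "n_threads": 4}
kv_response = {"runtime_config": {"n_ctx": 4096, "kv_cache_type": "q8_0"}}

merged_params = merge_runtime_config(base_params, kv_response)
```

A shallow merge is used here because the description says the object is meant to be merged directly; if the real runtime_config turns out to be nested, a deep merge would be needed instead.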

Return the recommended speculative decoding configuration for a model/device combination. No authentication required. enabled=true requires at least 6 GB of RAM and a supported chip family.
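The eligibility rule can be sketched as a simple predicate. This is an illustration of the stated condition, not the server's actual logic: the function name is made up, and the set of supported chip families is a placeholder because the source does not list them.

```python
from typing import Iterable

MIN_RAM_GB = 6.0  # threshold stated in the endpoint description


def speculative_decoding_allowed(ram_gb: float,
                                 chip_family: str,
                                 supported_families: Iterable[str]) -> bool:
    """True only when both conditions hold: enough RAM and a supported chip."""
    return ram_gb >= MIN_RAM_GB and chip_family in set(supported_families)


# Placeholder family names; the real list is not given in the source.
example_families = {"family-a", "family-b"}

ok = speculative_decoding_allowed(8.0, "family-a", example_families)
too_little_ram = speculative_decoding_allowed(4.0, "family-a", example_families)
unknown_chip = speculative_decoding_allowed(8.0, "family-z", example_families)
```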

Streaming inference. Same routing semantics as inference.create, but the response is sent as text/event-stream, so the SDK can start consuming tokens before the model has finished generating.
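To illustrate how a client can consume tokens incrementally, here is a minimal sketch of a text/event-stream reader that follows the general server-sent events framing (data: fields, blank line ends an event, lines starting with a colon are comments). The payload carried in each event is treated as opaque text because the source does not specify its shape, and the function name is hypothetical rather than part of the SDK.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event.

    `lines` is any iterable of text lines, e.g. a decoded HTTP response body.
    """
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            # Comment line, often used as a keep-alive.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is stripped
        if field == "data":
            buffer.append(value)
        # Other fields (event, id, retry) are ignored in this sketch.


# Simulated stream; a real client would pass the response's line iterator.
sample_stream = ["data: Hel\n", "\n", ": keep-alive\n", "data: lo\n", "\n"]
chunks = list(iter_sse_data(sample_stream))
```

Because the generator yields each event as soon as its terminating blank line arrives, the caller can render partial output while the rest of the response is still being produced.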