Structured Decoding

Octomil guarantees structurally valid output from any model. When you specify a JSON schema or enable JSON mode, every token the model generates is constrained to produce valid output. No post-hoc parsing, no retry loops, no malformed responses.

The Problem

LLMs sometimes produce output that is almost-valid JSON -- a missing closing brace, a trailing comma, an unquoted key. Traditional approaches validate after generation and retry on failure, which wastes compute and adds unpredictable latency.

Octomil enforces structure during generation, at the token level. The model can only emit tokens that keep the output valid according to your schema. The result is always parseable on the first attempt.
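
The mechanism can be illustrated with a toy sketch (this is not Octomil's actual engine): the "grammar" below accepts only balanced-brace strings and the "model" is a random scorer, but the core move -- filter the vocabulary to valid continuations before picking each token -- is the same idea.

```python
import random

VOCAB = ["{", "}"]

def is_valid_prefix(s: str) -> bool:
    """True if s could still grow into a balanced-brace string."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "{" else -1
        if depth < 0:  # a close with no matching open: dead end
            return False
    return True

def is_complete(s: str) -> bool:
    return s != "" and is_valid_prefix(s) and s.count("{") == s.count("}")

def constrained_decode(rng: random.Random, max_tokens: int = 20) -> str:
    out = ""
    for _ in range(max_tokens):
        if is_complete(out):
            break
        # Stand-in for model logits: one random score per vocab token.
        scores = {t: rng.random() for t in VOCAB}
        # The constraint: drop every token that would invalidate the prefix.
        allowed = [t for t in VOCAB if is_valid_prefix(out + t)]
        out += max(allowed, key=scores.get)
    return out

print(constrained_decode(random.Random(0)))
```

Because invalid tokens are masked before sampling, the output is valid by construction -- the same property Octomil provides for JSON and JSON Schema grammars.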

JSON Mode

Force the model to output valid JSON with a single field in the API request:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-1b",
    "messages": [{"role": "user", "content": "List 3 ML frameworks with pros and cons."}],
    "response_format": {"type": "json_object"}
  }'

JSON mode guarantees the output is a valid JSON object or array. The model decides the shape.
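
The same request can be made from Python. A minimal sketch using only the standard library (the `choices[0].message.content` response shape is the standard chat-completions layout, assumed here):

```python
import json
import urllib.request  # for the live request sketched in the comment below

def build_json_mode_request(model: str, prompt: str) -> bytes:
    """Request body for JSON mode: only response_format needs setting."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},
    }).encode()

def parse_content(response_body: bytes) -> dict:
    """With JSON mode on, the message content always parses first try."""
    reply = json.loads(response_body)
    return json.loads(reply["choices"][0]["message"]["content"])

# Against a running server (not executed here):
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=build_json_mode_request("gemma-1b", "List 3 ML frameworks."),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(parse_content(resp.read()))
```

Note that `parse_content` needs no error handling around `json.loads` -- the structural guarantee makes the inner parse infallible.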

Schema Mode

For precise control over the output shape, pass a JSON Schema. The model will produce output that conforms exactly to that schema.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "Analyze the sentiment of: This product is amazing but overpriced."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "sentiment_analysis",
        "schema": {
          "type": "object",
          "properties": {
            "sentiment": {"type": "string", "enum": ["positive", "negative", "mixed", "neutral"]},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            "aspects": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "topic": {"type": "string"},
                  "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
                },
                "required": ["topic", "sentiment"]
              }
            }
          },
          "required": ["sentiment", "confidence", "aspects"]
        }
      }
    }
  }'
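
To make "conforms exactly" concrete, here is a minimal, illustrative conformance checker (not Octomil's engine) covering just the keywords used above, applied to a sample output the server might return:

```python
def conforms(value, schema) -> bool:
    """Check value against a tiny JSON Schema subset:
    type, enum, properties, required, items, minimum/maximum."""
    t = schema.get("type")
    if t == "object":
        if not isinstance(value, dict):
            return False
        if any(k not in value for k in schema.get("required", [])):
            return False
        props = schema.get("properties", {})
        return all(conforms(value[k], props[k]) for k in value if k in props)
    if t == "array":
        return isinstance(value, list) and all(
            conforms(v, schema["items"]) for v in value)
    if t == "string":
        return isinstance(value, str) and (
            "enum" not in schema or value in schema["enum"])
    if t == "number":
        return (isinstance(value, (int, float)) and not isinstance(value, bool)
                and schema.get("minimum", float("-inf")) <= value
                <= schema.get("maximum", float("inf")))
    return True

SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string",
                      "enum": ["positive", "negative", "mixed", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "aspects": {"type": "array", "items": {
            "type": "object",
            "properties": {
                "topic": {"type": "string"},
                "sentiment": {"type": "string",
                              "enum": ["positive", "negative", "neutral"]}},
            "required": ["topic", "sentiment"]}},
    },
    "required": ["sentiment", "confidence", "aspects"],
}

sample = {"sentiment": "mixed", "confidence": 0.9,
          "aspects": [{"topic": "price", "sentiment": "negative"}]}
print(conforms(sample, SCHEMA))
```

With schema mode, every response passes a check like this by construction; the checker is shown only to spell out what the guarantee means.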

Supported Schema Features

Feature                                                  Supported   Example
type (string, number, integer, boolean, array, object)   Yes         {"type": "string"}
enum                                                     Yes         {"enum": ["a", "b", "c"]}
required                                                 Yes         {"required": ["name"]}
properties                                               Yes         Nested object definitions
items                                                    Yes         Array element schema
minimum / maximum                                        Yes         Numeric bounds
minLength / maxLength                                    Yes         String length bounds
minItems / maxItems                                      Yes         Array length bounds
pattern                                                  Yes         Regex string patterns
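
These features combine freely. For example, a hypothetical schema for a tagged slug could mix pattern, string length bounds, and array length bounds:

```json
{
  "type": "object",
  "properties": {
    "slug": {"type": "string", "pattern": "^[a-z0-9-]+$", "minLength": 3, "maxLength": 40},
    "tags": {"type": "array", "items": {"type": "string"}, "minItems": 1, "maxItems": 5}
  },
  "required": ["slug", "tags"]
}
```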

Works with Any Model

Structured decoding is not a fine-tuning feature. It works with any model served by Octomil -- the constraint is applied at the decoding layer, not the model layer. A 360M parameter model produces valid JSON just as reliably as a 7B model.

# All of these produce guaranteed valid JSON when response_format is set
octomil serve smollm-360m
octomil serve gemma-1b
octomil serve phi-4-mini
octomil serve llama-3.2-3b

Server-Wide JSON Mode

Enable JSON mode for all requests at startup:

octomil serve gemma-1b --json

With this flag, every request produces valid JSON output regardless of whether response_format is set in the request.

Performance

Structured decoding adds minimal overhead. Token generation speed is within 5-10% of unconstrained generation for most schemas. Complex schemas with deeply nested structures or long enum lists may see slightly higher overhead.

Structured decoding composes with speculative decoding: both optimizations apply to the same request, so you get output that is both fast and guaranteed valid.

Streaming

Structured decoding works with streaming responses. Each streamed chunk is a fragment of a valid output sequence -- the concatenated chunks are guaranteed to parse once the stream completes, though individual chunks are not standalone JSON.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "Generate a user profile."}],
    "response_format": {"type": "json_object"},
    "stream": true
  }'
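
A consumer-side sketch of the buffering this implies, assuming the standard SSE `data:` framing and chat-completion chunk shape (`choices[0].delta.content`): accumulate the content deltas, then parse the whole thing once the stream ends.

```python
import json

def assemble_stream(sse_lines):
    """Collect content deltas from SSE 'data:' lines; parse the full
    JSON only after the stream completes (chunks alone are not valid)."""
    buf = []
    for line in sse_lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        delta = chunk["choices"][0]["delta"].get("content", "")
        buf.append(delta)
    return json.loads("".join(buf))

# Simulated stream for illustration:
lines = [
    "data: " + json.dumps({"choices": [{"delta": {"content": '{"name": "A'}}]}),
    "data: " + json.dumps({"choices": [{"delta": {"content": 'da"}'}}]}),
    "data: [DONE]",
]
print(assemble_stream(lines))
```

The final `json.loads` cannot fail on a completed stream, because the constraint engine guarantees the concatenation is valid.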

Error Handling

If the model cannot produce output that satisfies the schema (extremely rare, typically only with very restrictive schemas and very small models), the response includes a structured error:

{
  "error": {
    "code": "schema_constraint_failure",
    "message": "Model could not generate output satisfying the provided schema within the token limit."
  }
}
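
Clients should still branch on this error. A hedged sketch (assuming the standard `choices[0].message.content` layout for successful responses):

```python
import json

def extract_or_raise(response_body: str):
    """Return the parsed structured output, or raise on failure."""
    reply = json.loads(response_body)
    if "error" in reply:
        if reply["error"].get("code") == "schema_constraint_failure":
            # Recovery options: simplify the schema, raise max_tokens,
            # or retry with a larger model (see Gotchas below).
            raise ValueError(reply["error"]["message"])
        raise RuntimeError("unexpected error: %s" % reply["error"].get("code"))
    return json.loads(reply["choices"][0]["message"]["content"])
```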

Gotchas

  • Small models struggle with complex schemas — models under 1B parameters may hit schema_constraint_failure with deeply nested schemas. Simplify the schema or use a larger model.
  • JSON mode vs schema mode — JSON mode guarantees valid JSON but not a specific shape. Use schema mode when you need specific fields. JSON mode is faster because it has less to constrain.
  • Streaming still validates — the full response is valid JSON/schema-conformant, but individual chunks are not. Parse only after the stream completes.
  • Token limit can truncate — if the model runs out of tokens mid-generation, the constraint engine cannot recover. Set a sufficient max_tokens for your expected output size.