Structured Decoding
Octomil guarantees structurally valid output from any model. When you specify a JSON schema or enable JSON mode, every token the model generates is constrained to produce valid output. No post-hoc repair, no retry loops, no malformed responses.
The Problem
LLMs sometimes produce output that is almost-valid JSON -- a missing closing brace, a trailing comma, an unquoted key. Traditional approaches validate after generation and retry on failure, which wastes compute and adds unpredictable latency.
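The failure mode is easy to reproduce: a strict parser rejects output that is only one character away from valid. For example, with Python's standard json module and a hypothetical model response containing a trailing comma:

```python
import json

# Typical "almost-valid" model output: note the trailing comma.
almost_valid = '{"framework": "PyTorch", "pros": ["flexible", "popular",]}'

try:
    json.loads(almost_valid)
except json.JSONDecodeError as err:
    # One stray character is enough to force a retry in
    # validate-after-generation pipelines.
    print(f"Rejected: {err.msg}")
```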
Octomil enforces structure during generation, at the token level. The model can only emit tokens that keep the output valid according to your schema. The result is always parseable on the first attempt.
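Conceptually, token-level enforcement works like a logit mask: before each sampling step, tokens that would break the structure are removed from consideration. A toy sketch of the idea (illustration only, not Octomil's actual engine) using brace depth as the tracked state:

```python
# Toy sketch of token-level constrained decoding (illustration only).
# A tiny state machine tracks brace depth and masks out any token
# that would make the output unclosable.

def allowed_tokens(generated: str, vocab: list[str]) -> list[str]:
    """Return the subset of vocab that keeps brace nesting valid."""
    depth = generated.count("{") - generated.count("}")
    allowed = []
    for tok in vocab:
        new_depth = depth + tok.count("{") - tok.count("}")
        if new_depth >= 0:  # never close more braces than were opened
            allowed.append(tok)
    return allowed

vocab = ['{', '}', '"key"', ':', '"value"']
print(allowed_tokens("", vocab))         # '}' is masked out at depth 0
print(allowed_tokens('{"key":', vocab))  # all tokens allowed once a brace is open
```

A real engine tracks the full schema grammar rather than just brace depth, but the mechanism is the same: invalid continuations are never sampled in the first place.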
JSON Mode
Force the model to output valid JSON with a single field in the API request:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [{"role": "user", "content": "List 3 ML frameworks with pros and cons."}],
"response_format": {"type": "json_object"}
}'
import json
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="gemma-1b",
messages=[{"role": "user", "content": "List 3 ML frameworks with pros and cons."}],
response_format={"type": "json_object"},
)
data = json.loads(response.choices[0].message.content) # Always valid JSON
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
model: "gemma-1b",
messages: [{ role: "user", content: "List 3 ML frameworks with pros and cons." }],
response_format: { type: "json_object" },
});
const data = JSON.parse(response.choices[0].message.content); // Always valid JSON
JSON mode guarantees the output is a valid JSON object or array. The model decides the shape.
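Because JSON mode fixes only validity, not shape, client code should branch on whatever top-level type comes back. A small defensive pattern (the wrapper-key heuristic is an assumption about common model behavior, not an Octomil guarantee):

```python
import json

def normalize(payload: str) -> list:
    """Coerce a JSON-mode response into a list of records,
    whether the model chose a top-level array or object."""
    data = json.loads(payload)  # guaranteed to parse under JSON mode
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        # Models often wrap the list in a single key like "frameworks".
        for value in data.values():
            if isinstance(value, list):
                return value
        return [data]
    return [data]

print(normalize('[{"name": "PyTorch"}]'))
print(normalize('{"frameworks": [{"name": "JAX"}]}'))
```

If you need the shape itself guaranteed, use schema mode instead.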
Schema Mode
For precise control over the output shape, pass a JSON Schema. The model will produce output that conforms exactly to that schema.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini",
"messages": [{"role": "user", "content": "Analyze the sentiment of: This product is amazing but overpriced."}],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "sentiment_analysis",
"schema": {
"type": "object",
"properties": {
"sentiment": {"type": "string", "enum": ["positive", "negative", "mixed", "neutral"]},
"confidence": {"type": "number", "minimum": 0, "maximum": 1},
"aspects": {
"type": "array",
"items": {
"type": "object",
"properties": {
"topic": {"type": "string"},
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
},
"required": ["topic", "sentiment"]
}
}
},
"required": ["sentiment", "confidence", "aspects"]
}
}
}
}'
import json
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="phi-4-mini",
messages=[{"role": "user", "content": "Analyze the sentiment of: 'This product is amazing but overpriced.'"}],
response_format={
"type": "json_schema",
"json_schema": {
"name": "sentiment_analysis",
"schema": {
"type": "object",
"properties": {
"sentiment": {
"type": "string",
"enum": ["positive", "negative", "mixed", "neutral"]
},
"confidence": {
"type": "number",
"minimum": 0,
"maximum": 1
},
"aspects": {
"type": "array",
"items": {
"type": "object",
"properties": {
"topic": {"type": "string"},
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
},
"required": ["topic", "sentiment"]
}
}
},
"required": ["sentiment", "confidence", "aspects"]
}
}
},
)
result = json.loads(response.choices[0].message.content)
# {"sentiment": "mixed", "confidence": 0.85, "aspects": [{"topic": "quality", "sentiment": "positive"}, ...]}
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
model: "phi-4-mini",
messages: [{ role: "user", content: "Analyze the sentiment of: 'This product is amazing but overpriced.'" }],
response_format: {
type: "json_schema",
json_schema: {
name: "sentiment_analysis",
schema: {
type: "object",
properties: {
sentiment: { type: "string", enum: ["positive", "negative", "mixed", "neutral"] },
confidence: { type: "number", minimum: 0, maximum: 1 },
aspects: {
type: "array",
items: {
type: "object",
properties: {
topic: { type: "string" },
sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
},
required: ["topic", "sentiment"],
},
},
},
required: ["sentiment", "confidence", "aspects"],
},
},
},
});
const result = JSON.parse(response.choices[0].message.content);
// {sentiment: "mixed", confidence: 0.85, aspects: [{topic: "quality", sentiment: "positive"}, ...]}
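Because the constraint is enforced at the token level, the response should always conform, so a client-side check like the one below is effectively a no-op. Still, a cheap assertion at the boundary documents the contract your downstream code relies on. A minimal sketch mirroring the sentiment schema above:

```python
import json

# Sanity check mirroring the sentiment_analysis schema (illustrative only;
# Octomil's token-level constraint should make these assertions a no-op).
SENTIMENTS = {"positive", "negative", "mixed", "neutral"}

def check_sentiment_payload(payload: str) -> dict:
    data = json.loads(payload)
    assert data.keys() >= {"sentiment", "confidence", "aspects"}
    assert data["sentiment"] in SENTIMENTS
    assert 0 <= data["confidence"] <= 1
    for aspect in data["aspects"]:
        assert aspect.keys() >= {"topic", "sentiment"}
    return data

sample = (
    '{"sentiment": "mixed", "confidence": 0.85,'
    ' "aspects": [{"topic": "quality", "sentiment": "positive"}]}'
)
check_sentiment_payload(sample)
```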
Supported Schema Features
| Feature | Supported | Example |
|---|---|---|
| type (string, number, integer, boolean, array, object) | Yes | {"type": "string"} |
| enum | Yes | {"enum": ["a", "b", "c"]} |
| required | Yes | {"required": ["name"]} |
| properties | Yes | Nested object definitions |
| items | Yes | Array element schema |
| minimum / maximum | Yes | Numeric bounds |
| minLength / maxLength | Yes | String length bounds |
| minItems / maxItems | Yes | Array length bounds |
| pattern | Yes | Regex string patterns |
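The constraints in the table compose within one schema. A sketch combining pattern, string length bounds, and array length bounds (the field names are hypothetical):

```python
import re

# A schema combining several supported constraints from the table above
# (field names "sku" and "tags" are illustrative).
schema = {
    "type": "object",
    "properties": {
        "sku": {"type": "string", "pattern": "^[A-Z]{3}-[0-9]{4}$"},
        "tags": {
            "type": "array",
            "items": {"type": "string", "minLength": 1, "maxLength": 20},
            "minItems": 1,
            "maxItems": 5,
        },
    },
    "required": ["sku", "tags"],
}

# An instance the decoder would be constrained to produce:
instance = {"sku": "ABC-1234", "tags": ["ml", "inference"]}
assert re.match(schema["properties"]["sku"]["pattern"], instance["sku"])
assert 1 <= len(instance["tags"]) <= 5
```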
Works with Any Model
Structured decoding is not a fine-tuning feature. It works with any model served by Octomil -- the constraint is applied at the decoding layer, not the model layer. A 360M parameter model produces valid JSON just as reliably as a 7B model.
# All of these produce guaranteed valid JSON when response_format is set
octomil serve smollm-360m
octomil serve gemma-1b
octomil serve phi-4-mini
octomil serve llama-3.2-3b
Server-Wide JSON Mode
Enable JSON mode for all requests at startup:
octomil serve gemma-1b --json
With this flag, every request produces valid JSON output regardless of whether response_format is set in the request.
Performance
Structured decoding adds minimal overhead. Token generation speed is within 5-10% of unconstrained generation for most schemas. Complex schemas with deeply nested structures or long enum lists may see slightly higher overhead.
Structured decoding works simultaneously with speculative decoding. Both optimizations are applied together, so you get fast and valid output.
Streaming
Structured decoding works with streaming responses. Every streamed chunk extends a prefix of a valid output sequence, but an individual chunk is not itself complete JSON -- accumulate the chunks and parse once the stream finishes.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini",
"messages": [{"role": "user", "content": "Generate a user profile."}],
"response_format": {"type": "json_object"},
"stream": true
}'
import json
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
stream = client.chat.completions.create(
model="phi-4-mini",
messages=[{"role": "user", "content": "Generate a user profile."}],
response_format={"type": "json_object"},
stream=True,
)
full_response = ""
for chunk in stream:
if chunk.choices[0].delta.content:
full_response += chunk.choices[0].delta.content
print(chunk.choices[0].delta.content, end="")
# full_response is valid JSON
data = json.loads(full_response)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const stream = await client.chat.completions.create({
model: "phi-4-mini",
messages: [{ role: "user", content: "Generate a user profile." }],
response_format: { type: "json_object" },
stream: true,
});
let fullResponse = "";
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
fullResponse += content;
process.stdout.write(content);
}
}
// fullResponse is valid JSON
const data = JSON.parse(fullResponse);
Error Handling
If the model cannot produce output that satisfies the schema (extremely rare, typically only with very restrictive schemas and very small models), the response includes a structured error:
{
"error": {
"code": "schema_constraint_failure",
"message": "Model could not generate output satisfying the provided schema within the token limit."
}
}
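One way to react to this error is to retry with a simpler schema or a higher max_tokens. A sketch of the dispatch, assuming the error body arrives in the shape shown above:

```python
import json

def should_retry_relaxed(error_body: str) -> bool:
    """Return True when the failure is the schema-constraint error,
    i.e. retrying with a simpler schema or higher max_tokens may help."""
    payload = json.loads(error_body)
    return payload.get("error", {}).get("code") == "schema_constraint_failure"

body = (
    '{"error": {"code": "schema_constraint_failure",'
    ' "message": "Model could not generate output satisfying'
    ' the provided schema within the token limit."}}'
)
print(should_retry_relaxed(body))  # True
```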
Gotchas
- Small models struggle with complex schemas -- models under 1B parameters may hit schema_constraint_failure with deeply nested schemas. Simplify the schema or use a larger model.
- JSON mode vs schema mode -- JSON mode guarantees valid JSON but not a specific shape. Use schema mode when you need specific fields. JSON mode is faster because it has less to constrain.
- Streaming still validates -- the full response is valid JSON and schema-conformant, but individual chunks are not. Parse only after the stream completes.
- Token limit can truncate -- if the model runs out of tokens mid-generation, the constraint engine cannot recover. Set max_tokens high enough for your expected output size.
Related
- Local Inference — server setup and JSON mode flag
- Speculative Decoding — works alongside structured decoding
- Observability — monitor structured decoding usage