Structured Decoding
Octomil guarantees structurally valid output from any model. When you specify a JSON schema or enable JSON mode, every token the model generates is constrained to produce valid output. No post-hoc repair, no retry loops, no malformed responses.
The Problem
LLMs sometimes produce output that is almost-valid JSON -- a missing closing brace, a trailing comma, an unquoted key. Traditional approaches validate after generation and retry on failure, which wastes compute and adds unpredictable latency.
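The failure mode is easy to reproduce: a strict parser rejects output that is only one character away from valid. For example, with Python's standard json module and a hypothetical model response containing a trailing comma:

```python
import json

# Typical "almost-valid" model output: note the trailing comma.
almost_valid = '{"framework": "PyTorch", "pros": ["flexible", "popular",]}'

try:
    json.loads(almost_valid)
except json.JSONDecodeError as err:
    # One stray character is enough to force a retry in
    # validate-after-generation pipelines.
    print(f"Rejected: {err.msg}")
```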
Octomil enforces structure during generation, at the token level. The model can only emit tokens that keep the output valid according to your schema. The result is always parseable on the first attempt.
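Conceptually, token-level enforcement works like a logit mask: before each sampling step, tokens that would break the structure are removed from consideration. A toy sketch of the idea (illustration only, not Octomil's actual engine) using brace depth as the tracked state:

```python
# Toy sketch of token-level constrained decoding (illustration only).
# A tiny state machine tracks brace depth and masks out any token
# that would make the output unclosable.

def allowed_tokens(generated: str, vocab: list[str]) -> list[str]:
    """Return the subset of vocab that keeps brace nesting valid."""
    depth = generated.count("{") - generated.count("}")
    allowed = []
    for tok in vocab:
        new_depth = depth + tok.count("{") - tok.count("}")
        if new_depth >= 0:  # never close more braces than were opened
            allowed.append(tok)
    return allowed

vocab = ['{', '}', '"key"', ':', '"value"']
print(allowed_tokens("", vocab))         # '}' is masked out at depth 0
print(allowed_tokens('{"key":', vocab))  # all tokens allowed once a brace is open
```

A real engine tracks the full schema grammar rather than just brace depth, but the mechanism is the same: invalid continuations are never sampled in the first place.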
JSON Mode
Force the model to output valid JSON with a single field in the API request:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [{"role": "user", "content": "List 3 ML frameworks with pros and cons."}],
"response_format": {"type": "json_object"}
}'
import json
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="gemma-1b",
messages=[{"role": "user", "content": "List 3 ML frameworks with pros and cons."}],
response_format={"type": "json_object"},
)
data = json.loads(response.choices[0].message.content) # Always valid JSON
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
model: "gemma-1b",
messages: [{ role: "user", content: "List 3 ML frameworks with pros and cons." }],
response_format: { type: "json_object" },
});
const data = JSON.parse(response.choices[0].message.content); // Always valid JSON
JSON mode guarantees the output is a valid JSON object or array. The model decides the shape.
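Because JSON mode fixes only validity, not shape, client code should branch on whatever top-level type comes back. A small defensive pattern (the wrapper-key heuristic is an assumption about common model behavior, not an Octomil guarantee):

```python
import json

def normalize(payload: str) -> list:
    """Coerce a JSON-mode response into a list of records,
    whether the model chose a top-level array or object."""
    data = json.loads(payload)  # guaranteed to parse under JSON mode
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        # Models often wrap the list in a single key like "frameworks".
        for value in data.values():
            if isinstance(value, list):
                return value
        return [data]
    return [data]

print(normalize('[{"name": "PyTorch"}]'))
print(normalize('{"frameworks": [{"name": "JAX"}]}'))
```

If you need the shape itself guaranteed, use schema mode instead.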
Schema Mode
For precise control over the output shape, pass a JSON Schema. The model will produce output that conforms exactly to that schema.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini",
"messages": [{"role": "user", "content": "Analyze the sentiment of: This product is amazing but overpriced."}],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "sentiment_analysis",
"schema": {
"type": "object",
"properties": {
"sentiment": {"type": "string", "enum": ["positive", "negative", "mixed", "neutral"]},
"confidence": {"type": "number", "minimum": 0, "maximum": 1},
"aspects": {
"type": "array",
"items": {
"type": "object",
"properties": {
"topic": {"type": "string"},
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
},
"required": ["topic", "sentiment"]
}
}
},
"required": ["sentiment", "confidence", "aspects"]
}
}
}
}'
import json
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="phi-4-mini",
messages=[{"role": "user", "content": "Analyze the sentiment of: 'This product is amazing but overpriced.'"}],
response_format={
"type": "json_schema",
"json_schema": {
"name": "sentiment_analysis",
"schema": {
"type": "object",
"properties": {
"sentiment": {
"type": "string",
"enum": ["positive", "negative", "mixed", "neutral"]
},
"confidence": {
"type": "number",
"minimum": 0,
"maximum": 1
},
"aspects": {
"type": "array",
"items": {
"type": "object",
"properties": {
"topic": {"type": "string"},
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
},
"required": ["topic", "sentiment"]
}
}
},
"required": ["sentiment", "confidence", "aspects"]
}
}
},
)
result = json.loads(response.choices[0].message.content)
# {"sentiment": "mixed", "confidence": 0.85, "aspects": [{"topic": "quality", "sentiment": "positive"}, ...]}
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
model: "phi-4-mini",
messages: [{ role: "user", content: "Analyze the sentiment of: 'This product is amazing but overpriced.'" }],
response_format: {
type: "json_schema",
json_schema: {
name: "sentiment_analysis",
schema: {
type: "object",
properties: {
sentiment: { type: "string", enum: ["positive", "negative", "mixed", "neutral"] },
confidence: { type: "number", minimum: 0, maximum: 1 },
aspects: {
type: "array",
items: {
type: "object",
properties: {
topic: { type: "string" },
sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
},
required: ["topic", "sentiment"],
},
},
},
required: ["sentiment", "confidence", "aspects"],
},
},
},
});
const result = JSON.parse(response.choices[0].message.content);
// {sentiment: "mixed", confidence: 0.85, aspects: [{topic: "quality", sentiment: "positive"}, ...]}
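Because the constraint is enforced at the token level, the response should always conform, so a client-side check like the one below is effectively a no-op. Still, a cheap assertion at the boundary documents the contract your downstream code relies on. A minimal sketch mirroring the sentiment schema above:

```python
import json

# Sanity check mirroring the sentiment_analysis schema (illustrative only;
# Octomil's token-level constraint should make these assertions a no-op).
SENTIMENTS = {"positive", "negative", "mixed", "neutral"}

def check_sentiment_payload(payload: str) -> dict:
    data = json.loads(payload)
    assert data.keys() >= {"sentiment", "confidence", "aspects"}
    assert data["sentiment"] in SENTIMENTS
    assert 0 <= data["confidence"] <= 1
    for aspect in data["aspects"]:
        assert aspect.keys() >= {"topic", "sentiment"}
    return data

sample = (
    '{"sentiment": "mixed", "confidence": 0.85,'
    ' "aspects": [{"topic": "quality", "sentiment": "positive"}]}'
)
check_sentiment_payload(sample)
```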
Supported Schema Features
| Feature | Supported | Example |
|---|---|---|
| type (string, number, integer, boolean, array, object) | Yes | {"type": "string"} |
| enum | Yes | {"enum": ["a", "b", "c"]} |
| required | Yes | {"required": ["name"]} |
| properties | Yes | Nested object definitions |
| items | Yes | Array element schema |
| minimum / maximum | Yes | Numeric bounds |
| minLength / maxLength | Yes | String length bounds |
| minItems / maxItems | Yes | Array length bounds |
| pattern | Yes | Regex string patterns |
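The constraints in the table compose within one schema. A sketch combining pattern, string length bounds, and array length bounds (the field names are hypothetical):

```python
import re

# A schema combining several supported constraints from the table above
# (field names "sku" and "tags" are illustrative).
schema = {
    "type": "object",
    "properties": {
        "sku": {"type": "string", "pattern": "^[A-Z]{3}-[0-9]{4}$"},
        "tags": {
            "type": "array",
            "items": {"type": "string", "minLength": 1, "maxLength": 20},
            "minItems": 1,
            "maxItems": 5,
        },
    },
    "required": ["sku", "tags"],
}

# An instance the decoder would be constrained to produce:
instance = {"sku": "ABC-1234", "tags": ["ml", "inference"]}
assert re.match(schema["properties"]["sku"]["pattern"], instance["sku"])
assert 1 <= len(instance["tags"]) <= 5
```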
Works with Any Model
Structured decoding is not a fine-tuning feature. It works with any model served by Octomil -- the constraint is applied at the decoding layer, not the model layer. A 360M parameter model produces valid JSON just as reliably as a 7B model.
# All of these produce guaranteed valid JSON when response_format is set
octomil serve smollm-360m
octomil serve gemma-1b
octomil serve phi-4-mini
octomil serve llama-3.2-3b
Server-Wide JSON Mode
Enable JSON mode for all requests at startup:
octomil serve gemma-1b --json
With this flag, every request produces valid JSON output regardless of whether response_format is set in the request.
Performance
Structured decoding adds minimal overhead. Token generation speed is within 5-10% of unconstrained generation for most schemas. Complex schemas with deeply nested structures or long enum lists may see slightly higher overhead.
Structured decoding works simultaneously with speculative decoding. Both optimizations are applied together, so you get fast and valid output.
Streaming
Structured decoding works with streaming responses. Every streamed chunk extends a prefix of a valid output sequence, but an individual chunk is not itself complete JSON -- accumulate the chunks and parse once the stream finishes.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini",
"messages": [{"role": "user", "content": "Generate a user profile."}],
"response_format": {"type": "json_object"},
"stream": true
}'
import json
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
stream = client.chat.completions.create(
model="phi-4-mini",
messages=[{"role": "user", "content": "Generate a user profile."}],
response_format={"type": "json_object"},
stream=True,
)
full_response = ""
for chunk in stream:
if chunk.choices[0].delta.content:
full_response += chunk.choices[0].delta.content
print(chunk.choices[0].delta.content, end="")
# full_response is valid JSON
data = json.loads(full_response)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const stream = await client.chat.completions.create({
model: "phi-4-mini",
messages: [{ role: "user", content: "Generate a user profile." }],
response_format: { type: "json_object" },
stream: true,
});
let fullResponse = "";
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
fullResponse += content;
process.stdout.write(content);
}
}
// fullResponse is valid JSON
const data = JSON.parse(fullResponse);
Error Handling
If the model cannot produce output that satisfies the schema (extremely rare, typically only with very restrictive schemas and very small models), the response includes a structured error:
{
"error": {
"code": "schema_constraint_failure",
"message": "Model could not generate output satisfying the provided schema within the token limit."
}
}
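One way to react to this error is to retry with a simpler schema or a higher max_tokens. A sketch of the dispatch, assuming the error body arrives in the shape shown above:

```python
import json

def should_retry_relaxed(error_body: str) -> bool:
    """Return True when the failure is the schema-constraint error,
    i.e. retrying with a simpler schema or higher max_tokens may help."""
    payload = json.loads(error_body)
    return payload.get("error", {}).get("code") == "schema_constraint_failure"

body = (
    '{"error": {"code": "schema_constraint_failure",'
    ' "message": "Model could not generate output satisfying'
    ' the provided schema within the token limit."}}'
)
print(should_retry_relaxed(body))  # True
```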
Gotchas
- Small models struggle with complex schemas -- models under 1B parameters may hit schema_constraint_failure with deeply nested schemas. Simplify the schema or use a larger model.
- JSON mode vs schema mode -- JSON mode guarantees valid JSON but not a specific shape. Use schema mode when you need specific fields. JSON mode is faster because it has less to constrain.
- Streaming still validates -- the full response is valid JSON and schema-conformant, but individual chunks are not. Parse only after the stream completes.
- Token limit can truncate -- if the model runs out of tokens mid-generation, the constraint engine cannot recover. Set max_tokens high enough for your expected output size.
Related
- Local Inference — server setup and JSON mode flag
- Speculative Decoding — works alongside structured decoding
- Observability — monitor structured decoding usage