# Workflows
Chain multiple inference calls, tool invocations, and conditional logic into multi-step workflows. Workflows run on-device with cloud fallback where needed.
## Basic chain
Pass the output of one inference call as input to another:
```python
import json

# Step 1: Extract entities
extract = client.chat.completions.create(
    model="phi-4-mini",
    messages=[
        {"role": "system", "content": "Extract all company names from the text. Return a JSON object with a 'names' array."},
        {"role": "user", "content": article_text},
    ],
    response_format={"type": "json_object"},
)
companies = json.loads(extract.choices[0].message.content)

# Step 2: Summarize each
for company in companies["names"]:
    summary = client.chat.completions.create(
        model="phi-4-mini",
        messages=[
            {"role": "system", "content": f"Write a 1-sentence summary of {company}'s role in this article."},
            {"role": "user", "content": article_text},
        ],
        max_tokens=100,
    )
    print(f"{company}: {summary.choices[0].message.content}")
```
## Tool-augmented workflow
Combine inference with tool calls for workflows that interact with external systems:
```python
import json

tools = [
    {"type": "function", "function": {
        "name": "search_db",
        "description": "Search the product database",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }},
    {"type": "function", "function": {
        "name": "send_email",
        "description": "Send an email to a customer",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "subject", "body"],
        },
    }},
]

messages = [
    {"role": "system", "content": "You are a customer support agent. Use tools to look up info and respond to customers."},
    {"role": "user", "content": "Customer asks: Is the X200 in stock? Their email is user@example.com"},
]

# Run until the model stops calling tools
while True:
    response = client.chat.completions.create(
        model="phi-4-mini", messages=messages, tools=tools,
    )
    choice = response.choices[0]
    if choice.finish_reason == "tool_calls":
        messages.append(choice.message)
        for tc in choice.message.tool_calls:
            # Tool arguments arrive as a JSON string; parse before dispatching
            result = execute_tool(tc.function.name, json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
    else:
        print(choice.message.content)
        break
```
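The loop above assumes an `execute_tool` helper that maps tool names to real functions. A minimal sketch might look like this; the `search_db` and `send_email` handlers here are hypothetical stand-ins, not part of any real API:

```python
import json

def search_db(query: str) -> dict:
    # Hypothetical stand-in for a real inventory lookup
    inventory = {"X200": {"in_stock": True, "quantity": 14}}
    return inventory.get(query, {"in_stock": False})

def send_email(to: str, subject: str, body: str) -> dict:
    # Hypothetical stand-in for a real mail service call
    return {"status": "queued", "to": to}

# Map tool names (as declared in `tools`) to Python callables
TOOL_REGISTRY = {"search_db": search_db, "send_email": send_email}

def execute_tool(name: str, arguments) -> dict:
    """Dispatch one tool call. Arguments may arrive as a JSON string
    or an already-parsed dict; unknown tool names return an error
    payload so the model can recover instead of the loop crashing."""
    if isinstance(arguments, str):
        arguments = json.loads(arguments)
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return {"error": f"unknown tool: {name}"}
    return fn(**arguments)
```

Returning an error payload (rather than raising) keeps the loop alive when the model hallucinates a tool name: the error goes back to the model as a tool result, and it can retry.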
## Conditional routing
Route different inputs to different models based on classification:
```python
# Classify the query
classification = client.chat.completions.create(
    model="smollm-360m",  # fast, small model for classification
    messages=[
        {"role": "system", "content": "Classify as 'simple' or 'complex'. Reply with one word."},
        {"role": "user", "content": user_query},
    ],
    max_tokens=5,
)
complexity = classification.choices[0].message.content.strip().lower()

# Route to appropriate model
model = "gemma3-1b" if complexity == "simple" else "phi-4-mini"
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": user_query}],
)
```
For automatic routing without manual classification, see Routing.
## Patterns
| Pattern | When to use |
|---|---|
| Chain | Output of step N is input to step N+1 |
| Fan-out | One input, multiple parallel inferences |
| Tool loop | Model calls tools until it has enough info to answer |
| Classify-then-route | Small model classifies, larger model handles |
| Verify | Second model checks first model's output |
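Fan-out is the one pattern in this table with no example above. It can be sketched with a thread pool; `fan_out` is a hypothetical helper, and the assumption is that each call is I/O-bound from the caller's side, so threads are enough:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(fn, inputs, max_workers=4):
    """Run one inference callable over many inputs in parallel,
    returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, inputs))

# Usage sketch: summarize each company concurrently
# summaries = fan_out(summarize_company, companies["names"])
```

Note that a single on-device model may serialize requests internally, so the practical win is overlapping any cloud-fallback calls rather than true local parallelism.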
## On-device considerations
- Memory -- each model in a workflow loads into memory. Use the same model across steps when possible, or unload between steps on constrained devices.
- Latency -- each step adds inference time. Keep workflows short (2-4 steps) for interactive use cases.
- Structured output -- use JSON Schema between steps to avoid parsing errors in the chain.
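The structured-output point can be made concrete with a small guard between steps; `parse_step_output` is a hypothetical helper that fails fast instead of propagating malformed data down the chain:

```python
import json

def parse_step_output(raw: str, required_keys: tuple) -> dict:
    """Parse one step's JSON output and verify the keys the next
    step depends on are present before handing the data onward."""
    data = json.loads(raw)
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"step output missing keys: {missing}")
    return data

# Usage sketch:
# companies = parse_step_output(extract.choices[0].message.content, ("names",))
```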
## Related
- Tool calling -- defining and handling tool calls
- Routing -- automatic model selection
- Control -- constraining output between steps