Workflows

Chain multiple inference calls, tool invocations, and conditional logic into multi-step workflows. Workflows run on-device with cloud fallback where needed.

Basic chain

Pass the output of one inference call as input to another:

import json

# Step 1: Extract entities
extract = client.chat.completions.create(
    model="phi-4-mini",
    messages=[
        {"role": "system", "content": "Extract all company names from the text. Return a JSON object with a 'names' array."},
        {"role": "user", "content": article_text},
    ],
    response_format={"type": "json_object"},
)
companies = json.loads(extract.choices[0].message.content)

# Step 2: Summarize each
for company in companies["names"]:
    summary = client.chat.completions.create(
        model="phi-4-mini",
        messages=[
            {"role": "system", "content": f"Write a 1-sentence summary of {company}'s role in this article."},
            {"role": "user", "content": article_text},
        ],
        max_tokens=100,
    )
    print(f"{company}: {summary.choices[0].message.content}")
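Step 2 runs the summaries one at a time. When the runtime can serve concurrent requests, the same step can be fanned out in parallel. A minimal sketch, where `fan_out` is a hypothetical helper and the worker would wrap the per-company summarization call shown above:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(items, worker, max_workers=4):
    # Run `worker` over each item on a thread pool and collect
    # results in input order. On constrained devices, keep
    # max_workers small to limit memory pressure.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))

# Usage: summaries = fan_out(companies["names"], summarize_company)
# where summarize_company issues the chat.completions call per company.
```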

Tool-augmented workflow

Combine inference with tool calls for workflows that interact with external systems:

tools = [
    {"type": "function", "function": {
        "name": "search_db",
        "description": "Search the product database",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }},
    {"type": "function", "function": {
        "name": "send_email",
        "description": "Send an email to a customer",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "subject", "body"],
        },
    }},
]

messages = [
    {"role": "system", "content": "You are a customer support agent. Use tools to look up info and respond to customers."},
    {"role": "user", "content": "Customer asks: Is the X200 in stock? Their email is user@example.com"},
]

# Run until the model stops calling tools
while True:
    response = client.chat.completions.create(
        model="phi-4-mini", messages=messages, tools=tools,
    )
    choice = response.choices[0]

    if choice.finish_reason == "tool_calls":
        messages.append(choice.message)
        for tc in choice.message.tool_calls:
            result = execute_tool(tc.function.name, tc.function.arguments)
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
    else:
        print(choice.message.content)
        break
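`execute_tool` is left undefined above. One way to implement it is a dispatch table keyed by function name, parsing the JSON-encoded arguments the model supplies. A sketch with placeholder handler bodies (not a real database or mail API):

```python
import json

def search_db(query):
    # Placeholder: query the real product database here.
    return {"results": [f"stub result for {query}"]}

def send_email(to, subject, body):
    # Placeholder: hand off to a real mail service here.
    return {"status": "sent", "to": to}

TOOL_HANDLERS = {"search_db": search_db, "send_email": send_email}

def execute_tool(name, arguments):
    # `arguments` arrives as a JSON string from the model.
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return handler(**json.loads(arguments))
```

Returning an error payload for an unknown tool, rather than raising, keeps the loop alive and lets the model recover. A cap on loop iterations is also worth adding so a confused model cannot spin indefinitely.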

Conditional routing

Route different inputs to different models based on classification:

# Classify the query
classification = client.chat.completions.create(
    model="smollm-360m",  # fast, small model for classification
    messages=[
        {"role": "system", "content": "Classify as 'simple' or 'complex'. Reply with one word."},
        {"role": "user", "content": user_query},
    ],
    max_tokens=5,
)
complexity = classification.choices[0].message.content.strip().lower()

# Route to appropriate model
model = "gemma3-1b" if complexity == "simple" else "phi-4-mini"
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": user_query}],
)
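Small classifier models sometimes wrap the label in extra tokens ("Simple.", "complex\n"). Normalizing the reply and falling back to the larger model on anything unrecognized is cheap insurance. A sketch, where `pick_model` is a hypothetical helper and the model names follow the example above:

```python
def pick_model(label, routes=None, default="phi-4-mini"):
    # Strip whitespace and trailing punctuation, lowercase, then
    # match; anything unrecognized routes to the more capable default.
    routes = routes or {"simple": "gemma3-1b", "complex": "phi-4-mini"}
    return routes.get(label.strip().strip(".!").lower(), default)
```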

For automatic routing without manual classification, see Routing.

Patterns

  • Chain -- output of step N is input to step N+1
  • Fan-out -- one input, multiple parallel inferences
  • Tool loop -- model calls tools until it has enough info to answer
  • Classify-then-route -- small model classifies, larger model handles
  • Verify -- second model checks first model's output
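The Verify pattern reduces to a generate/check/retry loop. A minimal sketch, with `generate` and `verify` as injectable callables standing in for the two chat-completions calls (e.g. `verify` could wrap a second model prompted to reply "pass" or "fail"):

```python
def generate_with_verification(generate, verify, max_attempts=3):
    # Ask `generate` for a draft, let `verify` accept or reject it,
    # and retry up to max_attempts times. The last draft is returned
    # either way, with a flag indicating whether it passed.
    draft = None
    for _ in range(max_attempts):
        draft = generate()
        if verify(draft):
            return draft, True
    return draft, False
```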

On-device considerations

  • Memory -- each model in a workflow loads into memory. Use the same model across steps when possible, or unload between steps on constrained devices.
  • Latency -- each step adds inference time. Keep workflows short (2-4 steps) for interactive use cases.
  • Structured output -- use JSON Schema between steps to avoid parsing errors in the chain.

Related

  • Tool calling -- defining and handling tool calls
  • Routing -- automatic model selection
  • Control -- constraining output between steps