Android SDK
On-device inference, model deployment, and federated learning for Android.
GitHub: github.com/octomil/octomil-android
Installation
Kotlin DSL (build.gradle.kts):

dependencies {
    implementation("ai.octomil:octomil-android:1.0.0")
    implementation("com.google.ai.edge.litert:litert:2.0.0")
}

Groovy (build.gradle):

dependencies {
    implementation 'ai.octomil:octomil-android:1.0.0'
    implementation 'com.google.ai.edge.litert:litert:2.0.0'
}
Add to AndroidManifest.xml:
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />
Quick Start
import ai.octomil.OctomilClient
// 1. Initialize
val client = OctomilClient(apiKey = "edg_...", orgId = "your-org-id", context = this)
// 2. Register device
client.register()
// 3. Download model (cached locally for offline use)
val model = client.downloadModel(modelId = "fraud-detector")
// 4. Run inference — all on-device, zero cloud calls
val prediction = model.predict(mapOf("features" to userFeatures))
println("Fraud score: ${prediction["score"]}")
// 5. Participate in federated training (optional)
client.train(modelId = "fraud-detector", data = localData, samples = 1000)
Device Pairing
Handle octomil://pair deep links via manifest:
<activity
    android:name="ai.octomil.pairing.ui.PairingActivity"
    android:exported="true">
    <intent-filter>
        <action android:name="android.intent.action.VIEW" />
        <category android:name="android.intent.category.DEFAULT" />
        <category android:name="android.intent.category.BROWSABLE" />
        <data android:scheme="octomil" android:host="pair" />
    </intent-filter>
</activity>
For Compose, use PairingScreen directly:
import ai.octomil.pairing.ui.PairingScreen
@Composable
fun MyPairingScreen(token: String, host: String) {
    PairingScreen(token = token, host = host)
}
Model Wrapping
Wrap an existing TFLite Interpreter with a single line at model load — call sites stay identical:
// Before
val interpreter = Interpreter(modelFile)
// After
val interpreter = Octomil.wrap(Interpreter(modelFile), modelId = "classifier")
interpreter.run(input, output) // identical API
The wrapper adds contract validation, latency telemetry, and OTA model updates.
Network Discovery
Make your device discoverable on the local network:
val discovery = DiscoveryManager(context)
discovery.startDiscoverable(deviceId = client.deviceId ?: "unknown")
The SDK registers a _octomil._tcp. NSD service. The CLI discovers this service and connects directly — no QR code needed.
OctomilClient
Constructor
OctomilClient(
    apiKey: String,
    orgId: String,
    context: Context,
    baseURL: String = "https://api.octomil.com",
    config: OctomilConfiguration = OctomilConfiguration.DEFAULT
)
Key Methods
| Method | Description |
|---|---|
| register() | Register device with Octomil server |
| sendHeartbeat() | Send health report |
| downloadModel(modelId, version, format) | Download and cache model |
| getCachedModel(modelId, version) | Get cached model without downloading |
| clearModelCache() | Clear all cached models |
| train(modelId, data, samples) | Train locally and upload |
| participateInTrainingRound(...) | Train and upload with full control |
| uploadWeights(...) | Upload raw weight update |
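The methods above compose into a typical device lifecycle. A minimal sketch, assuming these are suspend functions called from a coroutine and that getCachedModel returns null on a cache miss (both assumptions, not confirmed SDK behavior):

```kotlin
// Sketch only — run inside a coroutine, e.g. lifecycleScope.launch { ... }.
// Assumes getCachedModel returns null when the model is not cached.
client.register()                       // once per install
client.sendHeartbeat()                  // periodic health report
val model = client.getCachedModel(modelId = "fraud-detector")
    ?: client.downloadModel(modelId = "fraud-detector")   // cache-first load
```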
Model Rollouts
When you download a model, Octomil automatically determines which version your device receives based on active rollouts:
val model = client.downloadModel(modelId = "sentiment-classifier")
// Server decides version based on rollout percentage + device hash
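To opt out of rollout selection for a specific device, a version can be pinned explicitly. A sketch using the version parameter listed in the Key Methods table ("1.4.2" is an illustrative version string, not a real release):

```kotlin
// Sketch: pin an exact model version instead of letting the rollout decide.
val pinned = client.downloadModel(modelId = "sentiment-classifier", version = "1.4.2")
```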
Configuration
val config = OctomilConfiguration(
    privacyConfiguration = PrivacyConfiguration.HIGH_PRIVACY,
    trainingConfiguration = TrainingConfiguration(
        batchSize = 64,
        learningRate = 0.01f,
        epochs = 3
    ),
    enableLogging = true,
    enableOfflineQueue = true,
    maxQueueSize = 10
)

val client = OctomilClient(
    apiKey = "edg_...",
    orgId = "my-org",
    context = applicationContext,
    config = config
)
Privacy presets:
- PrivacyConfiguration.DEFAULT — standard settings
- PrivacyConfiguration.HIGH_PRIVACY — staggered uploads (1-10 min), differential privacy with epsilon = 0.5
Streaming Inference
The SDK supports streaming inference via Kotlin Flow. Implement StreamingInferenceEngine to plug in your own model backend. Each chunk carries modality-specific payload data and timing information.
import ai.octomil.inference.InferenceChunk
import ai.octomil.inference.Modality
import ai.octomil.inference.StreamingInferenceEngine
import ai.octomil.inference.StreamingInferenceResult
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
// 1. Implement the engine interface for your model
val engine = StreamingInferenceEngine { input, modality ->
    flow {
        // Your on-device generation logic here
        for (i in 0 until tokenCount) {
            emit(
                InferenceChunk(
                    index = i,
                    data = tokenBytes,
                    modality = Modality.TEXT,
                    timestamp = System.currentTimeMillis(),
                    latencyMs = 0.0 // filled in by SDK wrapper
                )
            )
        }
    }
}

// 2. Consume the stream
engine.generate(input = "Summarize this document...", modality = Modality.TEXT)
    .collect { chunk ->
        val token = String(chunk.data, Charsets.UTF_8)
        print(token) // stream tokens to UI
    }
InferenceChunk carries the chunk index, raw data bytes, modality (TEXT, IMAGE, AUDIO, VIDEO), timestamp (epoch millis), and latencyMs. The StreamingInferenceResult aggregates sessionId, ttfcMs, avgChunkLatencyMs, totalChunks, totalDurationMs, and throughput for completed sessions.
StreamingInferenceEngine is a functional interface, so you can use a lambda as shown above.
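The session metrics in StreamingInferenceResult can be derived from per-chunk timestamps. A self-contained sketch of that aggregation — the SessionMetrics type and summarize function here are illustrative, not SDK API:

```kotlin
// Illustrative aggregation mirroring StreamingInferenceResult's fields:
// derive ttfcMs, avgChunkLatencyMs, totalDurationMs, and throughput
// (chunks/sec) from a session start time and per-chunk epoch-millis stamps.
data class SessionMetrics(
    val ttfcMs: Long,
    val avgChunkLatencyMs: Double,
    val totalChunks: Int,
    val totalDurationMs: Long,
    val throughput: Double,
)

fun summarize(startMs: Long, chunkTimestamps: List<Long>): SessionMetrics {
    require(chunkTimestamps.isNotEmpty()) { "no chunks emitted" }
    // Gap between consecutive events, counting the session start as event zero.
    val gaps = (listOf(startMs) + chunkTimestamps).zipWithNext { a, b -> b - a }
    val total = chunkTimestamps.last() - startMs
    return SessionMetrics(
        ttfcMs = chunkTimestamps.first() - startMs,   // time to first chunk
        avgChunkLatencyMs = gaps.average(),
        totalChunks = chunkTimestamps.size,
        totalDurationMs = total,
        throughput = if (total > 0) chunkTimestamps.size * 1000.0 / total else 0.0,
    )
}
```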
Embeddings
Use EmbeddingClient to generate dense vector embeddings via the Octomil server.
import ai.octomil.client.EmbeddingClient
import ai.octomil.client.EmbeddingResult
val embeddingClient = EmbeddingClient(
    serverUrl = "https://api.octomil.com",
    apiKey = "edg_..."
)

// Single string
val result: EmbeddingResult = embeddingClient.embed(
    modelId = "nomic-embed-text",
    input = "Hello, world!"
)
println(result.embeddings) // [[0.1, 0.2, ...]]
println(result.model) // "nomic-embed-text"
println(result.usage.promptTokens)
println(result.usage.totalTokens)

// Batch embedding
val batchResult = embeddingClient.embed(
    modelId = "nomic-embed-text",
    input = listOf("First document", "Second document", "Third document")
)
// batchResult.embeddings contains one vector per input string
EmbeddingResult contains:
- embeddings: List<List<Double>> — one dense vector per input string
- model: String — the model that produced the embeddings
- usage: EmbeddingUsage — token counts (promptTokens, totalTokens)
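A common next step with the returned vectors is cosine similarity, e.g. for semantic search or deduplication. A pure-Kotlin sketch (the helper is illustrative, not SDK API; the inputs would come from EmbeddingResult.embeddings):

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors: 1.0 means same direction,
// 0.0 means orthogonal. Returns 0.0 for a zero vector to avoid dividing by zero.
fun cosineSimilarity(a: List<Double>, b: List<Double>): Double {
    require(a.size == b.size) { "vectors must have the same dimension" }
    val dot = a.zip(b).sumOf { (x, y) -> x * y }
    val normA = sqrt(a.sumOf { it * it })
    val normB = sqrt(b.sumOf { it * it })
    return if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}
```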
Smart Routing
RoutingClient calls the Octomil routing API to decide whether inference should run on-device or in the cloud. Decisions are cached with a configurable TTL.
import ai.octomil.client.RoutingClient
import ai.octomil.client.RoutingConfig
import ai.octomil.client.RoutingPreference
import ai.octomil.client.RoutingDeviceCapabilities
// 1. Configure the routing client
val routingConfig = RoutingConfig(
    serverUrl = "https://api.octomil.com",
    apiKey = "edg_...",
    cacheTtlMs = 300_000L, // cache decisions for 5 minutes
    prefer = RoutingPreference.FASTEST, // DEVICE, CLOUD, CHEAPEST, FASTEST
    modelParams = 2_000_000_000, // 2B parameter model
    modelSizeMb = 1400.0
)
val router = RoutingClient(config = routingConfig)

// 2. Build device capabilities
val capabilities = RoutingDeviceCapabilities(
    platform = "android",
    model = android.os.Build.MODEL,
    totalMemoryMb = Runtime.getRuntime().maxMemory() / (1024 * 1024),
    gpuAvailable = true,
    npuAvailable = false,
    supportedRuntimes = listOf("tflite", "nnapi")
)

// 3. Ask the routing API for a decision
val decision = router.route(
    modelId = "gemma-2b",
    deviceCapabilities = capabilities
)

// 4. Act on the decision
when (decision?.target) {
    "device" -> {
        // Run inference on-device using decision.format and decision.engine
        println("Run on-device with ${decision.engine} engine")
    }
    "cloud" -> {
        // Fall back to cloud inference
        val response = router.cloudInfer(
            modelId = "gemma-2b",
            inputData = Json.encodeToJsonElement(mapOf("prompt" to "Hello"))
        )
        println("Cloud result: ${response.output}, latency: ${response.latencyMs}ms")
    }
    else -> {
        // decision is null — network failure, default to on-device
    }
}
RoutingClient is thread-safe (uses ConcurrentHashMap for cache and OkHttpClient for HTTP). route() returns null on any network failure, allowing you to fall back to local inference gracefully.
RoutingDecision contains: id, target ("device" or "cloud"), format, engine, and an optional fallbackTarget with an endpoint URL.
Cache Management
// Invalidate a specific model's cached decision
router.invalidate(modelId = "gemma-2b")
// Clear all cached decisions
router.clearCache()
Cloud Fallback
When the routing API determines a device cannot run a model (insufficient memory, missing runtime, etc.), it returns target: "cloud" with a fallbackTarget containing the cloud endpoint. Use cloudInfer() to run inference in the cloud:
val decision = router.route(modelId = "large-model", deviceCapabilities = capabilities)
if (decision?.target == "cloud") {
    try {
        val response = router.cloudInfer(
            modelId = "large-model",
            inputData = Json.encodeToJsonElement(mapOf("prompt" to "Explain quantum computing")),
            parameters = mapOf("max_tokens" to Json.encodeToJsonElement(256))
        )
        println("Provider: ${response.provider}")
        println("Latency: ${response.latencyMs}ms")
    } catch (e: Exception) {
        // Cloud also failed — handle gracefully
        println("Cloud inference failed: ${e.message}")
    }
}
If route() returns null (network failure, server down), the caller should default to on-device inference. The SDK never silently falls back — you control the fallback logic.
Background Training
Use WorkManager for background training:
class FederatedTrainingWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {
    override suspend fun doWork(): Result {
        return try {
            val client = OctomilClient(apiKey = "edg_...", orgId = "my-org", context = applicationContext)
            client.participateInTrainingRound(
                modelId = "fraud-detector",
                trainingData = loadBackgroundData(),
                sampleCount = 1000
            )
            Result.success()
        } catch (e: Exception) {
            Result.retry()
        }
    }
}
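The worker still needs to be scheduled. A sketch using standard WorkManager APIs — the interval, constraints, and unique-work name here are illustrative choices, not SDK requirements:

```kotlin
import java.util.concurrent.TimeUnit
import android.content.Context
import androidx.work.Constraints
import androidx.work.ExistingPeriodicWorkPolicy
import androidx.work.NetworkType
import androidx.work.PeriodicWorkRequestBuilder
import androidx.work.WorkManager

// Illustrative schedule: train at most once a day, only while charging and
// on an unmetered network, so training never competes with the user.
fun scheduleTraining(context: Context) {
    val constraints = Constraints.Builder()
        .setRequiresCharging(true)
        .setRequiredNetworkType(NetworkType.UNMETERED)
        .build()
    val request = PeriodicWorkRequestBuilder<FederatedTrainingWorker>(24, TimeUnit.HOURS)
        .setConstraints(constraints)
        .build()
    // KEEP preserves an already-scheduled request instead of resetting its timer.
    WorkManager.getInstance(context).enqueueUniquePeriodicWork(
        "federated-training", ExistingPeriodicWorkPolicy.KEEP, request
    )
}
```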
Error Handling
All suspend functions throw OctomilException:
try {
    val result = client.train(modelId = "fraud-detector", data = data, samples = 100)
} catch (e: OctomilException.NetworkException) {
    println("Network error: ${e.message}")
} catch (e: OctomilException.AuthenticationException) {
    println("Invalid API key or orgId")
} catch (e: OctomilException.DeviceNotRegisteredException) {
    println("Call register() first")
}
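NetworkException is usually transient, so a retry loop with exponential backoff pairs well with the handler above. A sketch of the delay schedule (the helper is illustrative, not SDK API):

```kotlin
import kotlin.math.min

// Exponential backoff with a ceiling: 1s, 2s, 4s, ... capped at 60s.
// Suitable for retrying transient OctomilException.NetworkException failures.
fun backoffDelayMs(attempt: Int, baseMs: Long = 1_000L, maxMs: Long = 60_000L): Long {
    require(attempt >= 0)
    // Clamp the shift so the Long never overflows for large attempt counts.
    return min(maxMs, baseMs shl min(attempt, 20))
}
```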
GPU Acceleration
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.nnapi.NnApiDelegate

// GPU delegation
val gpuDelegate = GpuDelegate()
val gpuOptions = Interpreter.Options().addDelegate(gpuDelegate)
val interpreter = Interpreter(modelFile, gpuOptions)

// Or NNAPI delegation
val nnApiDelegate = NnApiDelegate()
val nnApiOptions = Interpreter.Options().addDelegate(nnApiDelegate)
val nnApiInterpreter = Interpreter(modelFile, nnApiOptions)
ProGuard Rules
-keep class ai.octomil.** { *; }
-keepclassmembers class ai.octomil.** { *; }
-keep class com.google.ai.edge.litert.** { *; }
Requirements
- Android 8.0 (API 26)+, Kotlin 1.9+, Gradle 8.0+, LiteRT 2.0+
Gotchas
- API 26+ required — the SDK uses Kotlin coroutines and modern Android APIs. Devices on Android 7.1 or earlier are not supported.
- register() must be called first — all other methods throw OctomilException.DeviceNotRegisteredException if the device hasn't registered.
- ProGuard rules are required — without the keep rules, R8 strips Octomil and LiteRT classes. Add the ProGuard rules from the section above to your release build.
- GPU delegate is optional — GpuDelegate and NnApiDelegate are not bundled. Add them as separate Gradle dependencies if you want hardware acceleration.
- WorkManager for background training — don't use CoroutineScope for federated training in production. Use WorkManager (shown above) so training survives app backgrounding and device sleep.
- Offline queue has a limit — maxQueueSize defaults to 10. Training updates beyond this limit are dropped. Increase it for devices with unreliable connectivity.
- Model downloads are cached — downloadModel caches to internal storage. Use clearModelCache() to force re-download.
- Routing returns null on failure — route() never throws. It returns null when the server is unreachable, so you always control fallback behavior.
- EmbeddingClient calls are blocking — embed() makes a synchronous HTTP call. Call it from a coroutine on Dispatchers.IO to avoid blocking the main thread.
- cloudInfer() throws on failure — unlike route(), cloud inference throws on HTTP errors so you can catch and fall back to local inference.
Related
- Python SDK — server-side orchestration
- iOS SDK — native iOS client
- Browser SDK — browser inference
- Model Catalog — model versioning
- Rollouts — progressive deployment