
Android SDK

On-device inference, model deployment, and federated learning for Android.

GitHub: github.com/octomil/octomil-android

Installation

dependencies {
    implementation("ai.octomil:octomil-android:1.0.0")
    implementation("com.google.ai.edge.litert:litert:2.0.0")
}

Add to AndroidManifest.xml:

<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />

Quick Start

import ai.octomil.OctomilClient

// 1. Initialize
val client = OctomilClient(apiKey = "edg_...", orgId = "your-org-id", context = this)

// 2. Register device
client.register()

// 3. Download model (cached locally for offline use)
val model = client.downloadModel(modelId = "fraud-detector")

// 4. Run inference — all on-device, zero cloud calls
val prediction = model.predict(mapOf("features" to userFeatures))
println("Fraud score: ${prediction["score"]}")

// 5. Participate in federated training (optional)
client.train(modelId = "fraud-detector", data = localData, samples = 1000)

Device Pairing

Handle octomil://pair deep links via manifest:

<activity
    android:name="ai.octomil.pairing.ui.PairingActivity"
    android:exported="true">
    <intent-filter>
        <action android:name="android.intent.action.VIEW" />
        <category android:name="android.intent.category.DEFAULT" />
        <category android:name="android.intent.category.BROWSABLE" />
        <data android:scheme="octomil" android:host="pair" />
    </intent-filter>
</activity>

For Compose, use PairingScreen directly:

import ai.octomil.pairing.ui.PairingScreen

@Composable
fun MyPairingScreen(token: String, host: String) {
    PairingScreen(token = token, host = host)
}

Model Wrapping

Wrap an existing TFLite Interpreter with telemetry, validation, and OTA updates. One line changes at model load — zero changes at call sites:

// Before
val interpreter = Interpreter(modelFile)

// After
val interpreter = Octomil.wrap(Interpreter(modelFile), modelId = "classifier")
interpreter.run(input, output) // identical API

The wrapper adds contract validation, latency telemetry, and OTA model updates.
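The decorator idea behind this kind of wrapper can be sketched in isolation. The Engine interface and TelemetryEngine class below are hypothetical stand-ins, not Octomil APIs; they only illustrate how a wrapper can add latency telemetry while keeping the call signature identical:

```kotlin
// Sketch of the wrap-and-delegate pattern. Engine and TelemetryEngine are
// hypothetical names for illustration; they are not part of the SDK.
fun interface Engine {
    fun run(input: FloatArray): FloatArray
}

class TelemetryEngine(
    private val inner: Engine,
    private val onLatency: (Long) -> Unit
) : Engine {
    override fun run(input: FloatArray): FloatArray {
        val start = System.nanoTime()
        val output = inner.run(input)                     // identical call signature
        onLatency((System.nanoTime() - start) / 1_000_000) // report latency in ms
        return output
    }
}
```

Because the wrapper implements the same interface, call sites compile unchanged, which is the property Octomil.wrap relies on.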

Network Discovery

Make your device discoverable on the local network:

val discovery = DiscoveryManager(context)
discovery.startDiscoverable(deviceId = client.deviceId ?: "unknown")

The SDK registers a _octomil._tcp. NSD service. The CLI discovers this service and connects directly — no QR code needed.

OctomilClient

Constructor

OctomilClient(
    apiKey: String,
    orgId: String,
    context: Context,
    baseURL: String = "https://api.octomil.com",
    config: OctomilConfiguration = OctomilConfiguration.DEFAULT
)

Key Methods

  • register() — Register device with Octomil server
  • sendHeartbeat() — Send health report
  • downloadModel(modelId, version, format) — Download and cache model
  • getCachedModel(modelId, version) — Get cached model without downloading
  • clearModelCache() — Clear all cached models
  • train(modelId, data, samples) — Train locally and upload
  • participateInTrainingRound(...) — Train and upload with full control
  • uploadWeights(...) — Upload raw weight update
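As an example of combining these methods, a hypothetical offline-first load path can try the cache before the network. This assumes getCachedModel returns null on a cache miss, which this page does not state explicitly:

```kotlin
// Sketch: offline-first model load built from the methods above.
// Assumption: getCachedModel returns null on a cache miss.
suspend fun loadModelOfflineFirst(client: OctomilClient, modelId: String) =
    client.getCachedModel(modelId)        // no network traffic
        ?: client.downloadModel(modelId)  // download once, cached for next time
```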

Model Rollouts

When you download a model, Octomil automatically determines which version your device receives based on active rollouts:

val model = client.downloadModel(modelId = "sentiment-classifier")
// Server decides version based on rollout percentage + device hash
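The server-side bucketing is opaque to the SDK, but the "rollout percentage + device hash" idea can be sketched standalone. This is an illustrative reconstruction, not the actual server logic:

```kotlin
// Sketch: deterministic percentage bucketing by device ID.
// Illustrative only; the real server-side rollout logic may differ.
fun inRollout(deviceId: String, rolloutPercent: Int): Boolean {
    // Stable hash of the device ID mapped into a [0, 100) bucket
    val bucket = Math.floorMod(deviceId.hashCode(), 100)
    return bucket < rolloutPercent
}

// inRollout("device-abc", 100) is always true; inRollout("device-abc", 0) never is.
```

Hashing the device ID rather than sampling randomly means a device stays in (or out of) a rollout consistently across downloads.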

Configuration

val config = OctomilConfiguration(
    privacyConfiguration = PrivacyConfiguration.HIGH_PRIVACY,
    trainingConfiguration = TrainingConfiguration(
        batchSize = 64,
        learningRate = 0.01f,
        epochs = 3
    ),
    enableLogging = true,
    enableOfflineQueue = true,
    maxQueueSize = 10
)

val client = OctomilClient(
    apiKey = "edg_...",
    orgId = "my-org",
    context = applicationContext,
    config = config
)

Privacy presets:

  • PrivacyConfiguration.DEFAULT — standard settings
  • PrivacyConfiguration.HIGH_PRIVACY — staggered uploads (1-10 min), DP with epsilon=0.5

Streaming Inference

The SDK supports streaming inference via Kotlin Flow. Implement StreamingInferenceEngine to plug in your own model backend. Each chunk carries modality-specific payload data and timing information.

import ai.octomil.inference.InferenceChunk
import ai.octomil.inference.Modality
import ai.octomil.inference.StreamingInferenceEngine
import ai.octomil.inference.StreamingInferenceResult
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow

// 1. Implement the engine interface for your model
// 1. Implement the engine interface for your model
val engine = StreamingInferenceEngine { input, modality ->
    flow {
        // Your on-device generation logic here
        for (i in 0 until tokenCount) {
            emit(
                InferenceChunk(
                    index = i,
                    data = tokenBytes,
                    modality = Modality.TEXT,
                    timestamp = System.currentTimeMillis(),
                    latencyMs = 0.0 // filled in by SDK wrapper
                )
            )
        }
    }
}

// 2. Consume the stream
engine.generate(input = "Summarize this document...", modality = Modality.TEXT)
    .collect { chunk ->
        val token = String(chunk.data, Charsets.UTF_8)
        print(token) // stream tokens to UI
    }

InferenceChunk carries the chunk index, raw data bytes, modality (TEXT, IMAGE, AUDIO, VIDEO), timestamp (epoch millis), and latencyMs. The StreamingInferenceResult aggregates sessionId, ttfcMs, avgChunkLatencyMs, totalChunks, totalDurationMs, and throughput for completed sessions.

StreamingInferenceEngine is a functional interface, so you can use a lambda as shown above.
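To make the aggregate metrics concrete, here is a standalone sketch of how they can be derived from per-chunk timestamps. The SessionMetrics class and summarize function are illustrative (field names mirror StreamingInferenceResult); the SDK computes these values for you:

```kotlin
// Sketch: derive session metrics from chunk timestamps (epoch millis).
// Field names mirror StreamingInferenceResult; this is not SDK code.
data class SessionMetrics(
    val ttfcMs: Long,             // time to first chunk
    val avgChunkLatencyMs: Double, // mean gap between consecutive chunks
    val totalChunks: Int,
    val totalDurationMs: Long,
    val throughput: Double         // chunks per second
)

fun summarize(startMs: Long, chunkTimestamps: List<Long>): SessionMetrics {
    require(chunkTimestamps.isNotEmpty()) { "need at least one chunk" }
    val total = chunkTimestamps.last() - startMs
    // Gaps between session start, chunk 0, chunk 1, ...
    val gaps = (listOf(startMs) + chunkTimestamps).zipWithNext { a, b -> b - a }
    return SessionMetrics(
        ttfcMs = chunkTimestamps.first() - startMs,
        avgChunkLatencyMs = gaps.average(),
        totalChunks = chunkTimestamps.size,
        totalDurationMs = total,
        throughput = if (total > 0) chunkTimestamps.size * 1000.0 / total else 0.0
    )
}
```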

Embeddings

Use EmbeddingClient to generate dense vector embeddings via the Octomil server.

import ai.octomil.client.EmbeddingClient
import ai.octomil.client.EmbeddingResult

val embeddingClient = EmbeddingClient(
    serverUrl = "https://api.octomil.com",
    apiKey = "edg_..."
)

// Single string
val result: EmbeddingResult = embeddingClient.embed(
    modelId = "nomic-embed-text",
    input = "Hello, world!"
)
println(result.embeddings) // [[0.1, 0.2, ...]]
println(result.model) // "nomic-embed-text"
println(result.usage.promptTokens)
println(result.usage.totalTokens)

// Batch embedding
val batchResult = embeddingClient.embed(
    modelId = "nomic-embed-text",
    input = listOf("First document", "Second document", "Third document")
)
// batchResult.embeddings contains one vector per input string

EmbeddingResult contains:

  • embeddings: List<List<Double>> -- one dense vector per input string
  • model: String -- the model that produced the embeddings
  • usage: EmbeddingUsage -- token counts (promptTokens, totalTokens)
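Embedding vectors are typically compared with cosine similarity, e.g. for semantic search over the rows of embeddings. A standalone helper (not part of the SDK):

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two dense vectors, such as two rows of
// EmbeddingResult.embeddings. 1.0 = identical direction, 0.0 = orthogonal.
fun cosine(a: List<Double>, b: List<Double>): Double {
    require(a.size == b.size) { "vectors must have the same dimension" }
    val dot = a.zip(b).sumOf { (x, y) -> x * y }
    val normA = sqrt(a.sumOf { it * it })
    val normB = sqrt(b.sumOf { it * it })
    return dot / (normA * normB)
}
```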

Smart Routing

RoutingClient calls the Octomil routing API to decide whether inference should run on-device or in the cloud. Decisions are cached with a configurable TTL.

import ai.octomil.client.RoutingClient
import ai.octomil.client.RoutingConfig
import ai.octomil.client.RoutingPreference
import ai.octomil.client.RoutingDeviceCapabilities
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.encodeToJsonElement

// 1. Configure the routing client
val routingConfig = RoutingConfig(
    serverUrl = "https://api.octomil.com",
    apiKey = "edg_...",
    cacheTtlMs = 300_000L, // cache decisions for 5 minutes
    prefer = RoutingPreference.FASTEST, // DEVICE, CLOUD, CHEAPEST, FASTEST
    modelParams = 2_000_000_000, // 2B parameter model
    modelSizeMb = 1400.0
)

val router = RoutingClient(config = routingConfig)

// 2. Build device capabilities
val capabilities = RoutingDeviceCapabilities(
    platform = "android",
    model = android.os.Build.MODEL,
    totalMemoryMb = Runtime.getRuntime().maxMemory() / (1024 * 1024),
    gpuAvailable = true,
    npuAvailable = false,
    supportedRuntimes = listOf("tflite", "nnapi")
)

// 3. Ask the routing API for a decision
val decision = router.route(
    modelId = "gemma-2b",
    deviceCapabilities = capabilities
)

// 4. Act on the decision
if (decision == null) {
    // route() returned null — network failure; default to on-device inference
} else when (decision.target) {
    "device" -> {
        // Run inference on-device using decision.format and decision.engine
        println("Run on-device with ${decision.engine} engine")
    }
    "cloud" -> {
        // Fall back to cloud inference
        val response = router.cloudInfer(
            modelId = "gemma-2b",
            inputData = Json.encodeToJsonElement(mapOf("prompt" to "Hello"))
        )
        println("Cloud result: ${response.output}, latency: ${response.latencyMs}ms")
    }
}

RoutingClient is thread-safe (uses ConcurrentHashMap for cache and OkHttpClient for HTTP). route() returns null on any network failure, allowing you to fall back to local inference gracefully.

RoutingDecision contains: id, target ("device" or "cloud"), format, engine, and an optional fallbackTarget with an endpoint URL.

Cache Management

// Invalidate a specific model's cached decision
router.invalidate(modelId = "gemma-2b")

// Clear all cached decisions
router.clearCache()
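The cache semantics described above (TTL expiry, per-model invalidation, full clear) can be sketched standalone. TtlCache is an illustrative reconstruction, not the RoutingClient internals:

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Sketch: a TTL decision cache like the one RoutingClient uses internally.
// Assumed semantics: entries older than ttlMs are treated as misses.
class TtlCache<K, V>(
    private val ttlMs: Long,
    private val now: () -> Long = System::currentTimeMillis // injectable clock
) {
    private data class Entry<V>(val value: V, val storedAt: Long)
    private val map = ConcurrentHashMap<K, Entry<V>>()

    fun get(key: K): V? {
        val entry = map[key] ?: return null
        return if (now() - entry.storedAt < ttlMs) entry.value
        else { map.remove(key); null } // expired: evict and miss
    }

    fun put(key: K, value: V) { map[key] = Entry(value, now()) }
    fun invalidate(key: K) { map.remove(key) }
    fun clear() { map.clear() }
}
```

Backing the cache with ConcurrentHashMap keeps reads lock-free, which matches the thread-safety note above.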

Cloud Fallback

When the routing API determines a device cannot run a model (insufficient memory, missing runtime, etc.), it returns target: "cloud" with a fallbackTarget containing the cloud endpoint. Use cloudInfer() to run inference in the cloud:

val decision = router.route(modelId = "large-model", deviceCapabilities = capabilities)

if (decision?.target == "cloud") {
    try {
        val response = router.cloudInfer(
            modelId = "large-model",
            inputData = Json.encodeToJsonElement(mapOf("prompt" to "Explain quantum computing")),
            parameters = mapOf("max_tokens" to Json.encodeToJsonElement(256))
        )
        println("Provider: ${response.provider}")
        println("Latency: ${response.latencyMs}ms")
    } catch (e: Exception) {
        // Cloud also failed — handle gracefully
        println("Cloud inference failed: ${e.message}")
    }
}

If route() returns null (network failure, server down), the caller should default to on-device inference. The SDK never silently falls back -- you control the fallback logic.

Background Training

Use WorkManager for background training:

class FederatedTrainingWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {
    override suspend fun doWork(): Result {
        return try {
            val client = OctomilClient(apiKey = "edg_...", orgId = "my-org", context = applicationContext)
            client.participateInTrainingRound(
                modelId = "fraud-detector",
                trainingData = loadBackgroundData(),
                sampleCount = 1000
            )
            Result.success()
        } catch (e: Exception) {
            Result.retry()
        }
    }
}
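A worker like this is typically scheduled with the standard androidx.work APIs. The work name, interval, and constraints below are illustrative choices, not SDK requirements:

```kotlin
import android.content.Context
import androidx.work.Constraints
import androidx.work.ExistingPeriodicWorkPolicy
import androidx.work.NetworkType
import androidx.work.PeriodicWorkRequestBuilder
import androidx.work.WorkManager
import java.util.concurrent.TimeUnit

// Schedule federated training every 6 hours, only while the device is
// charging, on unmetered network, and battery is not low.
fun scheduleTraining(context: Context) {
    val constraints = Constraints.Builder()
        .setRequiredNetworkType(NetworkType.UNMETERED)
        .setRequiresCharging(true)
        .setRequiresBatteryNotLow(true)
        .build()

    val request = PeriodicWorkRequestBuilder<FederatedTrainingWorker>(6, TimeUnit.HOURS)
        .setConstraints(constraints)
        .build()

    WorkManager.getInstance(context).enqueueUniquePeriodicWork(
        "federated-training",            // unique name; re-enqueueing is a no-op
        ExistingPeriodicWorkPolicy.KEEP,
        request
    )
}
```

Using enqueueUniquePeriodicWork with KEEP makes the call idempotent, so it is safe to invoke on every app start.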

Error Handling

All suspend functions throw OctomilException:

try {
    val result = client.train(modelId = "fraud-detector", data = data, samples = 100)
} catch (e: OctomilException.NetworkException) {
    println("Network error: ${e.message}")
} catch (e: OctomilException.AuthenticationException) {
    println("Invalid API key or orgId")
} catch (e: OctomilException.DeviceNotRegisteredException) {
    println("Call register() first")
}

GPU Acceleration

// GPU delegation
val gpuDelegate = GpuDelegate()
val gpuOptions = Interpreter.Options().addDelegate(gpuDelegate)
val interpreter = Interpreter(modelFile, gpuOptions)

// Or NNAPI delegation
val nnApiDelegate = NnApiDelegate()
val nnApiOptions = Interpreter.Options().addDelegate(nnApiDelegate)
val nnApiInterpreter = Interpreter(modelFile, nnApiOptions)
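The delegates are not bundled with the core runtime (see Gotchas) and ship as separate Gradle artifacts. The coordinates below are an assumption based on the LiteRT artifact naming; verify them against the LiteRT release notes for your version:

```kotlin
dependencies {
    // Assumed artifact coordinates — check the LiteRT docs for your version.
    implementation("com.google.ai.edge.litert:litert-gpu:2.0.0")
    // Verify which artifact provides NnApiDelegate for your LiteRT version;
    // on older TensorFlow Lite releases it shipped with the core runtime.
}
```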

ProGuard Rules

-keep class ai.octomil.** { *; }
-keepclassmembers class ai.octomil.** { *; }
-keep class com.google.ai.edge.litert.** { *; }

Requirements

  • Android 8.0 (API 26)+, Kotlin 1.9+, Gradle 8.0+, LiteRT 2.0+

Gotchas

  • API 26+ required — the SDK uses Kotlin coroutines and modern Android APIs. Devices on Android 7.1 or earlier are not supported.
  • register() must be called first — all other methods throw OctomilException.DeviceNotRegisteredException if the device hasn't registered.
  • ProGuard rules are required — without the keep rules, R8 strips Octomil and LiteRT classes. Add the ProGuard rules from the section above to your release build.
  • GPU delegate is optional — GpuDelegate and NnApiDelegate are not bundled. Add them as separate Gradle dependencies if you want hardware acceleration.
  • WorkManager for background training — don't use CoroutineScope for federated training in production. Use WorkManager (shown above) so training survives app backgrounding and device sleep.
  • Offline queue has a limit — maxQueueSize defaults to 10. Training updates beyond this limit are dropped. Increase it for devices with unreliable connectivity.
  • Model downloads are cached — downloadModel caches to internal storage. Use clearModelCache() to force re-download.
  • Routing returns null on failure — route() never throws. It returns null when the server is unreachable, so you always control fallback behavior.
  • EmbeddingClient calls are blocking — embed() makes a synchronous HTTP call. Call it from a coroutine on Dispatchers.IO to avoid blocking the main thread.
  • cloudInfer() throws on failure — unlike route(), cloud inference throws on HTTP errors so you can catch and fall back to local inference.