
Android SDK

On-device inference, model deployment, and federated learning for Android.

GitHub: github.com/octomil/octomil-android

Installation

dependencies {
    implementation("ai.octomil:octomil-android:1.0.0")
    implementation("com.google.ai.edge.litert:litert:2.0.0")
}

Add to AndroidManifest.xml:

<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />

Quick Start

import ai.octomil.OctomilClient

// 1. Initialize
val client = OctomilClient(apiKey = "edg_...", orgId = "your-org-id", context = this)

// 2. Register device
client.register()

// 3. Download model (cached locally for offline use)
val model = client.downloadModel(modelId = "fraud-detector")

// 4. Run inference — all on-device, zero cloud calls
val prediction = model.predict(mapOf("features" to userFeatures))
println("Fraud score: ${prediction["score"]}")

// 5. Participate in federated training (optional)
client.train(modelId = "fraud-detector", data = localData, samples = 1000)

Device Pairing

Handle octomil://pair deep links via manifest:

<activity
    android:name="ai.octomil.pairing.ui.PairingActivity"
    android:exported="true">
    <intent-filter>
        <action android:name="android.intent.action.VIEW" />
        <category android:name="android.intent.category.DEFAULT" />
        <category android:name="android.intent.category.BROWSABLE" />
        <data android:scheme="octomil" android:host="pair" />
    </intent-filter>
</activity>

For Compose, use PairingScreen directly:

import ai.octomil.pairing.ui.PairingScreen

@Composable
fun MyPairingScreen(token: String, host: String) {
    PairingScreen(token = token, host = host)
}

Model Wrapping

Wrap an existing TFLite Interpreter with telemetry, validation, and OTA updates. One line changes at model load — zero changes at call sites:

// Before
val interpreter = Interpreter(modelFile)

// After
val interpreter = Octomil.wrap(Interpreter(modelFile), modelId = "classifier")
interpreter.run(input, output) // identical API

The wrapper adds contract validation, latency telemetry, and OTA model updates.
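The decorator idea behind this kind of wrapper can be sketched in isolation. The Engine interface and TelemetryEngine class below are hypothetical stand-ins, not Octomil APIs; they only illustrate how a wrapper can add latency telemetry while keeping the call signature identical:

```kotlin
// Sketch of the wrap-and-delegate pattern. Engine and TelemetryEngine are
// hypothetical names for illustration; they are not part of the SDK.
fun interface Engine {
    fun run(input: FloatArray): FloatArray
}

class TelemetryEngine(
    private val inner: Engine,
    private val onLatency: (Long) -> Unit
) : Engine {
    override fun run(input: FloatArray): FloatArray {
        val start = System.nanoTime()
        val output = inner.run(input)                     // identical call signature
        onLatency((System.nanoTime() - start) / 1_000_000) // report latency in ms
        return output
    }
}
```

Because the wrapper implements the same interface, call sites compile unchanged, which is the property Octomil.wrap relies on.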

Network Discovery

Make your device discoverable on the local network:

val discovery = DiscoveryManager(context)
discovery.startDiscoverable(deviceId = client.deviceId ?: "unknown")

The SDK registers a _octomil._tcp. NSD service. The CLI discovers this service and connects directly — no QR code needed.

OctomilClient

Constructor

OctomilClient(
    apiKey: String,
    orgId: String,
    context: Context,
    baseURL: String = "https://api.octomil.com",
    config: OctomilConfiguration = OctomilConfiguration.DEFAULT
)

Key Methods

  • register() — Register device with Octomil server
  • sendHeartbeat() — Send health report
  • downloadModel(modelId, version, format) — Download and cache model
  • getCachedModel(modelId, version) — Get cached model without downloading
  • clearModelCache() — Clear all cached models
  • train(modelId, data, samples) — Train locally and upload
  • participateInTrainingRound(...) — Train and upload with full control
  • uploadWeights(...) — Upload raw weight update
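As an example of combining these methods, a hypothetical offline-first load path can try the cache before the network. This assumes getCachedModel returns null on a cache miss, which this page does not state explicitly:

```kotlin
// Sketch: offline-first model load built from the methods above.
// Assumption: getCachedModel returns null on a cache miss.
suspend fun loadModelOfflineFirst(client: OctomilClient, modelId: String) =
    client.getCachedModel(modelId)        // no network traffic
        ?: client.downloadModel(modelId)  // download once, cached for next time
```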

Model Rollouts

When you download a model, Octomil automatically determines which version your device receives based on active rollouts:

val model = client.downloadModel(modelId = "sentiment-classifier")
// Server decides version based on rollout percentage + device hash
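The server-side bucketing is opaque to the SDK, but the "rollout percentage + device hash" idea can be sketched standalone. This is an illustrative reconstruction, not the actual server logic:

```kotlin
// Sketch: deterministic percentage bucketing by device ID.
// Illustrative only; the real server-side rollout logic may differ.
fun inRollout(deviceId: String, rolloutPercent: Int): Boolean {
    // Stable hash of the device ID mapped into a [0, 100) bucket
    val bucket = Math.floorMod(deviceId.hashCode(), 100)
    return bucket < rolloutPercent
}

// inRollout("device-abc", 100) is always true; inRollout("device-abc", 0) never is.
```

Hashing the device ID rather than sampling randomly means a device stays in (or out of) a rollout consistently across downloads.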

Configuration

val config = OctomilConfiguration(
    privacyConfiguration = PrivacyConfiguration.HIGH_PRIVACY,
    trainingConfiguration = TrainingConfiguration(
        batchSize = 64,
        learningRate = 0.01f,
        epochs = 3
    ),
    enableLogging = true,
    enableOfflineQueue = true,
    maxQueueSize = 10
)

val client = OctomilClient(
    apiKey = "edg_...",
    orgId = "my-org",
    context = applicationContext,
    config = config
)

Privacy presets:

  • PrivacyConfiguration.DEFAULT — standard settings
  • PrivacyConfiguration.HIGH_PRIVACY — staggered uploads (1-10 min), DP with epsilon=0.5

Streaming Inference

The SDK supports streaming inference via Kotlin Flow. Implement StreamingInferenceEngine to plug in your own model backend. Each chunk carries modality-specific payload data and timing information.

import ai.octomil.inference.InferenceChunk
import ai.octomil.inference.Modality
import ai.octomil.inference.StreamingInferenceEngine
import ai.octomil.inference.StreamingInferenceResult
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow

// 1. Implement the engine interface for your model
// 1. Implement the engine interface for your model
val engine = StreamingInferenceEngine { input, modality ->
    flow {
        // Your on-device generation logic here
        for (i in 0 until tokenCount) {
            emit(
                InferenceChunk(
                    index = i,
                    data = tokenBytes,
                    modality = Modality.TEXT,
                    timestamp = System.currentTimeMillis(),
                    latencyMs = 0.0 // filled in by SDK wrapper
                )
            )
        }
    }
}

// 2. Consume the stream
engine.generate(input = "Summarize this document...", modality = Modality.TEXT)
    .collect { chunk ->
        val token = String(chunk.data, Charsets.UTF_8)
        print(token) // stream tokens to UI
    }

InferenceChunk carries the chunk index, raw data bytes, modality (TEXT, IMAGE, AUDIO, VIDEO), timestamp (epoch millis), and latencyMs. The StreamingInferenceResult aggregates sessionId, ttfcMs, avgChunkLatencyMs, totalChunks, totalDurationMs, and throughput for completed sessions.

StreamingInferenceEngine is a functional interface, so you can use a lambda as shown above.
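To make the aggregate metrics concrete, here is a standalone sketch of how they can be derived from per-chunk timestamps. The SessionMetrics class and summarize function are illustrative (field names mirror StreamingInferenceResult); the SDK computes these values for you:

```kotlin
// Sketch: derive session metrics from chunk timestamps (epoch millis).
// Field names mirror StreamingInferenceResult; this is not SDK code.
data class SessionMetrics(
    val ttfcMs: Long,             // time to first chunk
    val avgChunkLatencyMs: Double, // mean gap between consecutive chunks
    val totalChunks: Int,
    val totalDurationMs: Long,
    val throughput: Double         // chunks per second
)

fun summarize(startMs: Long, chunkTimestamps: List<Long>): SessionMetrics {
    require(chunkTimestamps.isNotEmpty()) { "need at least one chunk" }
    val total = chunkTimestamps.last() - startMs
    // Gaps between session start, chunk 0, chunk 1, ...
    val gaps = (listOf(startMs) + chunkTimestamps).zipWithNext { a, b -> b - a }
    return SessionMetrics(
        ttfcMs = chunkTimestamps.first() - startMs,
        avgChunkLatencyMs = gaps.average(),
        totalChunks = chunkTimestamps.size,
        totalDurationMs = total,
        throughput = if (total > 0) chunkTimestamps.size * 1000.0 / total else 0.0
    )
}
```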

Embeddings

Use EmbeddingClient to generate dense vector embeddings via the Octomil server.

import ai.octomil.client.EmbeddingClient
import ai.octomil.client.EmbeddingResult

val embeddingClient = EmbeddingClient(
    serverUrl = "https://api.octomil.com",
    apiKey = "edg_..."
)

// Single string
val result: EmbeddingResult = embeddingClient.embed(
    modelId = "nomic-embed-text",
    input = "Hello, world!"
)
println(result.embeddings) // [[0.1, 0.2, ...]]
println(result.model) // "nomic-embed-text"
println(result.usage.promptTokens)
println(result.usage.totalTokens)

// Batch embedding
val batchResult = embeddingClient.embed(
    modelId = "nomic-embed-text",
    input = listOf("First document", "Second document", "Third document")
)
// batchResult.embeddings contains one vector per input string

EmbeddingResult contains:

  • embeddings: List<List<Double>> -- one dense vector per input string
  • model: String -- the model that produced the embeddings
  • usage: EmbeddingUsage -- token counts (promptTokens, totalTokens)
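Embedding vectors are typically compared with cosine similarity, e.g. for semantic search over the rows of embeddings. A standalone helper (not part of the SDK):

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two dense vectors, such as two rows of
// EmbeddingResult.embeddings. 1.0 = identical direction, 0.0 = orthogonal.
fun cosine(a: List<Double>, b: List<Double>): Double {
    require(a.size == b.size) { "vectors must have the same dimension" }
    val dot = a.zip(b).sumOf { (x, y) -> x * y }
    val normA = sqrt(a.sumOf { it * it })
    val normB = sqrt(b.sumOf { it * it })
    return dot / (normA * normB)
}
```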

Smart Routing

RoutingClient calls the Octomil routing API to decide whether inference should run on-device or in the cloud. Decisions are cached with a configurable TTL.

import ai.octomil.client.RoutingClient
import ai.octomil.client.RoutingConfig
import ai.octomil.client.RoutingPreference
import ai.octomil.client.RoutingDeviceCapabilities
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.encodeToJsonElement

// 1. Configure the routing client
val routingConfig = RoutingConfig(
    serverUrl = "https://api.octomil.com",
    apiKey = "edg_...",
    cacheTtlMs = 300_000L, // cache decisions for 5 minutes
    prefer = RoutingPreference.FASTEST, // DEVICE, CLOUD, CHEAPEST, FASTEST
    modelParams = 2_000_000_000, // 2B parameter model
    modelSizeMb = 1400.0
)

val router = RoutingClient(config = routingConfig)

// 2. Build device capabilities
val capabilities = RoutingDeviceCapabilities(
    platform = "android",
    model = android.os.Build.MODEL,
    totalMemoryMb = Runtime.getRuntime().maxMemory() / (1024 * 1024),
    gpuAvailable = true,
    npuAvailable = false,
    supportedRuntimes = listOf("tflite", "nnapi")
)

// 3. Ask the routing API for a decision
val decision = router.route(
    modelId = "gemma-2b",
    deviceCapabilities = capabilities
)

// 4. Act on the decision
if (decision == null) {
    // route() returned null — network failure; default to on-device inference
} else when (decision.target) {
    "device" -> {
        // Run inference on-device using decision.format and decision.engine
        println("Run on-device with ${decision.engine} engine")
    }
    "cloud" -> {
        // Fall back to cloud inference
        val response = router.cloudInfer(
            modelId = "gemma-2b",
            inputData = Json.encodeToJsonElement(mapOf("prompt" to "Hello"))
        )
        println("Cloud result: ${response.output}, latency: ${response.latencyMs}ms")
    }
}

RoutingClient is thread-safe (uses ConcurrentHashMap for cache and OkHttpClient for HTTP). route() returns null on any network failure, allowing you to fall back to local inference gracefully.

RoutingDecision contains: id, target ("device" or "cloud"), format, engine, and an optional fallbackTarget with an endpoint URL.

Cache Management

// Invalidate a specific model's cached decision
router.invalidate(modelId = "gemma-2b")

// Clear all cached decisions
router.clearCache()
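The cache semantics described above (TTL expiry, per-model invalidation, full clear) can be sketched standalone. TtlCache is an illustrative reconstruction, not the RoutingClient internals:

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Sketch: a TTL decision cache like the one RoutingClient uses internally.
// Assumed semantics: entries older than ttlMs are treated as misses.
class TtlCache<K, V>(
    private val ttlMs: Long,
    private val now: () -> Long = System::currentTimeMillis // injectable clock
) {
    private data class Entry<V>(val value: V, val storedAt: Long)
    private val map = ConcurrentHashMap<K, Entry<V>>()

    fun get(key: K): V? {
        val entry = map[key] ?: return null
        return if (now() - entry.storedAt < ttlMs) entry.value
        else { map.remove(key); null } // expired: evict and miss
    }

    fun put(key: K, value: V) { map[key] = Entry(value, now()) }
    fun invalidate(key: K) { map.remove(key) }
    fun clear() { map.clear() }
}
```

Backing the cache with ConcurrentHashMap keeps reads lock-free, which matches the thread-safety note above.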

Cloud Fallback

When the routing API determines a device cannot run a model (insufficient memory, missing runtime, etc.), it returns target: "cloud" with a fallbackTarget containing the cloud endpoint. Use cloudInfer() to run inference in the cloud:

val decision = router.route(modelId = "large-model", deviceCapabilities = capabilities)

if (decision?.target == "cloud") {
    try {
        val response = router.cloudInfer(
            modelId = "large-model",
            inputData = Json.encodeToJsonElement(mapOf("prompt" to "Explain quantum computing")),
            parameters = mapOf("max_tokens" to Json.encodeToJsonElement(256))
        )
        println("Provider: ${response.provider}")
        println("Latency: ${response.latencyMs}ms")
    } catch (e: Exception) {
        // Cloud also failed — handle gracefully
        println("Cloud inference failed: ${e.message}")
    }
}

If route() returns null (network failure, server down), the caller should default to on-device inference. The SDK never silently falls back -- you control the fallback logic.

Background Training

Use WorkManager for background training:

class FederatedTrainingWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {
    override suspend fun doWork(): Result {
        return try {
            val client = OctomilClient(apiKey = "edg_...", orgId = "my-org", context = applicationContext)
            client.participateInTrainingRound(
                modelId = "fraud-detector",
                trainingData = loadBackgroundData(),
                sampleCount = 1000
            )
            Result.success()
        } catch (e: Exception) {
            Result.retry()
        }
    }
}
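A worker like this is typically scheduled with the standard androidx.work APIs. The work name, interval, and constraints below are illustrative choices, not SDK requirements:

```kotlin
import android.content.Context
import androidx.work.Constraints
import androidx.work.ExistingPeriodicWorkPolicy
import androidx.work.NetworkType
import androidx.work.PeriodicWorkRequestBuilder
import androidx.work.WorkManager
import java.util.concurrent.TimeUnit

// Schedule federated training every 6 hours, only while the device is
// charging, on unmetered network, and battery is not low.
fun scheduleTraining(context: Context) {
    val constraints = Constraints.Builder()
        .setRequiredNetworkType(NetworkType.UNMETERED)
        .setRequiresCharging(true)
        .setRequiresBatteryNotLow(true)
        .build()

    val request = PeriodicWorkRequestBuilder<FederatedTrainingWorker>(6, TimeUnit.HOURS)
        .setConstraints(constraints)
        .build()

    WorkManager.getInstance(context).enqueueUniquePeriodicWork(
        "federated-training",            // unique name; re-enqueueing is a no-op
        ExistingPeriodicWorkPolicy.KEEP,
        request
    )
}
```

Using enqueueUniquePeriodicWork with KEEP makes the call idempotent, so it is safe to invoke on every app start.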

Error Handling

All suspend functions throw OctomilException:

try {
    val result = client.train(modelId = "fraud-detector", data = data, samples = 100)
} catch (e: OctomilException.NetworkException) {
    println("Network error: ${e.message}")
} catch (e: OctomilException.AuthenticationException) {
    println("Invalid API key or orgId")
} catch (e: OctomilException.DeviceNotRegisteredException) {
    println("Call register() first")
}

GPU Acceleration

// GPU delegation
val gpuDelegate = GpuDelegate()
val gpuOptions = Interpreter.Options().addDelegate(gpuDelegate)
val interpreter = Interpreter(modelFile, gpuOptions)

// Or NNAPI delegation
val nnApiDelegate = NnApiDelegate()
val nnApiOptions = Interpreter.Options().addDelegate(nnApiDelegate)
val nnApiInterpreter = Interpreter(modelFile, nnApiOptions)
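The delegates are not bundled with the core runtime (see Gotchas) and ship as separate Gradle artifacts. The coordinates below are an assumption based on the LiteRT artifact naming; verify them against the LiteRT release notes for your version:

```kotlin
dependencies {
    // Assumed artifact coordinates — check the LiteRT docs for your version.
    implementation("com.google.ai.edge.litert:litert-gpu:2.0.0")
    // Verify which artifact provides NnApiDelegate for your LiteRT version;
    // on older TensorFlow Lite releases it shipped with the core runtime.
}
```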

ProGuard Rules

-keep class ai.octomil.** { *; }
-keepclassmembers class ai.octomil.** { *; }
-keep class com.google.ai.edge.litert.** { *; }

Requirements

  • Android 8.0 (API 26)+, Kotlin 1.9+, Gradle 8.0+, LiteRT 2.0+

Gotchas

  • API 26+ required — the SDK uses Kotlin coroutines and modern Android APIs. Devices on Android 7.1 or earlier are not supported.
  • register() must be called first — all other methods throw OctomilException.DeviceNotRegisteredException if the device hasn't registered.
  • ProGuard rules are required — without the keep rules, R8 strips Octomil and LiteRT classes. Add the ProGuard rules from the section above to your release build.
  • GPU delegate is optional — GpuDelegate and NnApiDelegate are not bundled. Add them as separate Gradle dependencies if you want hardware acceleration.
  • WorkManager for background training — don't use CoroutineScope for federated training in production. Use WorkManager (shown above) so training survives app backgrounding and device sleep.
  • Offline queue has a limit — maxQueueSize defaults to 10. Training updates beyond this limit are dropped. Increase it for devices with unreliable connectivity.
  • Model downloads are cached — downloadModel caches to internal storage. Use clearModelCache() to force re-download.
  • Routing returns null on failure — route() never throws. It returns null when the server is unreachable, so you always control fallback behavior.
  • EmbeddingClient calls are blocking — embed() makes a synchronous HTTP call. Call it from a coroutine on Dispatchers.IO to avoid blocking the main thread.
  • cloudInfer() throws on failure — unlike route(), cloud inference throws on HTTP errors so you can catch and fall back to local inference.