Google's production-ready, high-performance inference framework for deploying Large Language Models on edge devices.
- **Cross-platform:** Android, iOS, Web, Desktop, and IoT (Raspberry Pi). Deploy anywhere with a single framework.
- **Hardware acceleration:** GPU and NPU backends for peak inference performance on edge devices.
- **Multimodal input:** vision and audio support for rich multimodal experiences.
- **Function calling:** built-in support for agentic workflows with structured output.
- **Multi-token prediction:** lower decode latency on supported models.
- **Language bindings:** Python, Kotlin, C++, and C, with Swift in development.

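
As a rough illustration of how structured function calling typically flows through an agentic runtime (the tool registry and wire format below are hypothetical, not the framework's actual API), the model emits a JSON tool call that the host parses and dispatches:

```python
import json

# Hypothetical tool registry; a real integration would use the framework's
# own schema and registration mechanism.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch_tool_call(model_output: str) -> str:
    """Parse a structured tool call emitted by the model and invoke it."""
    call = json.loads(model_output)  # e.g. {"name": ..., "arguments": {...}}
    func = TOOLS[call["name"]]
    return func(**call["arguments"])

# The model produces structured output instead of free-form text:
result = dispatch_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(result)  # Sunny in Paris
```

Because the output is constrained to a schema, the host can dispatch calls without brittle text parsing.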
| Model Family | Capabilities | Status |
|---|---|---|
| Gemma 3 / 3N / 4 | Text, Vision, Audio, Tool Calling | Stable |
| Qwen 2.5 / 3 | Text, Tool Calling | Stable |
| Llama | Text | Stable |
| Phi-4 | Text | Stable |

| Language | Best For | Status |
|---|---|---|
| Kotlin | Android apps & JVM | Stable |
| Python | Prototyping & scripting | Stable |
| C++ | High-performance native code | Stable |
| Swift | Native iOS & macOS | In development |
Model loading, hardware backend selection (CPU/GPU/NPU), session and conversation management.
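
A minimal sketch of how such a layer is commonly shaped (all names here are illustrative stand-ins, not the framework's real API): an engine owns the loaded model and backend choice, and hands out sessions that carry per-conversation state.

```python
from dataclasses import dataclass, field

# Illustrative sketch only; names and signatures are hypothetical.
@dataclass
class Engine:
    model_path: str
    backend: str = "cpu"  # one of "cpu", "gpu", "npu"

    def create_session(self) -> "Session":
        return Session(engine=self)

@dataclass
class Session:
    engine: Engine
    history: list = field(default_factory=list)  # conversation state lives here

    def send(self, prompt: str) -> str:
        self.history.append(("user", prompt))
        # Stand-in for real inference on the selected backend.
        reply = f"[{self.engine.backend}] echo: {prompt}"
        self.history.append(("model", reply))
        return reply

engine = Engine(model_path="gemma.task", backend="gpu")
session = engine.create_session()
print(session.send("Hello"))  # [gpu] echo: Hello
```

Separating the engine (heavyweight, loaded once) from sessions (cheap, one per conversation) lets many conversations share a single loaded model.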
Two-phase inference: a prefill stage processes the full prompt in a single parallel pass, then a decode stage generates tokens one at a time, autoregressively.
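
The two phases can be sketched with a toy next-token function standing in for the model forward pass (nothing here is the framework's real decode loop):

```python
def toy_next_token(context: list[str]) -> str:
    # Stand-in for a model forward pass over the cached context.
    vocab = ["the", "cat", "sat", "<eos>"]
    return vocab[len(context) % len(vocab)]

def generate(prompt_tokens: list[str], max_new: int = 8) -> list[str]:
    # Prefill: the whole prompt is ingested at once, populating the
    # context (in a real engine, the KV cache) before any output token.
    context = list(prompt_tokens)
    # Decode: one token per step, each fed back as input for the next.
    out = []
    for _ in range(max_new):
        tok = toy_next_token(context)
        if tok == "<eos>":
            break
        context.append(tok)
        out.append(tok)
    return out
```

Prefill is compute-bound and parallel; decode is sequential, which is why it dominates latency and why techniques like multi-token prediction target it.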
Modular tokenizers (SentencePiece, HuggingFace), samplers (Top-K, Top-P, Greedy), and LoRA adapters.