LiteRT-LM

Google's production-ready, high-performance inference framework for deploying Large Language Models on edge devices.

v0.10.1 · Apache 2.0 · Gemma 4 Support · Multi-Platform

Key Features

Cross-Platform

Android, iOS, Web, Desktop, and IoT (Raspberry Pi). Deploy anywhere with a single framework.

Hardware Acceleration

GPU and NPU accelerators for peak inference performance on edge devices.

Multi-Modality

Vision and audio input support for rich multimodal experiences.

Tool Use / Function Calling

Built-in function calling support for agentic workflows with structured output.
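As a rough illustration of the pattern (not the LiteRT-LM API — the registry and function names below are hypothetical), function calling boils down to the model emitting a structured JSON call that the host application parses and dispatches:

```python
import json

# Hypothetical tool registry mapping a tool name to a Python callable.
# This sketches the general pattern; it is not the LiteRT-LM API.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch_tool_call(model_output: str) -> str:
    """Parse a structured (JSON) function call emitted by the model
    and invoke the matching registered tool."""
    call = json.loads(model_output)  # e.g. {"name": ..., "arguments": {...}}
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Example: the model emits a JSON tool call as structured output.
result = dispatch_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(result)  # Sunny in Paris
```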

Speculative Decoding

Multi-token prediction for lower inference latency on supported models.
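The idea can be sketched in a few lines (a toy greedy variant with stand-in models, not the framework's implementation): a cheap draft model proposes several tokens, the target model verifies them, and the longest agreeing prefix is accepted in one step.

```python
# Toy sketch of greedy speculative decoding. Both "models" are stand-in
# functions so the example is self-contained and runnable.

def draft_model(ctx):
    # Hypothetical cheap model: next token is last token + 1.
    return ctx[-1] + 1

def target_model(ctx):
    # Hypothetical target model: same rule, but caps tokens at 5.
    return min(ctx[-1] + 1, 5)

def speculative_step(ctx, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    proposal, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft_model(d_ctx)
        proposal.append(t)
        d_ctx.append(t)
    # 2) Verify: accept drafted tokens while the target model agrees,
    #    then append one token produced by the target itself.
    accepted, v_ctx = [], list(ctx)
    for t in proposal:
        if target_model(v_ctx) != t:
            break
        accepted.append(t)
        v_ctx.append(t)
    accepted.append(target_model(v_ctx))
    return ctx + accepted

print(speculative_step([1]))  # [1, 2, 3, 4, 5, 5]
```

When the draft agrees with the target, several tokens are committed per target-model pass, which is where the latency win comes from.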

Multi-Language APIs

Bindings for Python, Kotlin, C++, and C, with Swift in development.

Quick Start

```shell
# Install via uv
uv tool install litert-lm

# Run a model instantly
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  --prompt="What is the capital of France?"
```

Supported Models

| Model Family | Capabilities | Status |
| --- | --- | --- |
| Gemma 3 / 3N / 4 | Text, Vision, Audio, Tool Calling | Stable |
| Qwen 2.5 / 3 | Text, Tool Calling | Stable |
| Llama | Text | Stable |
| Phi-4 | Text | Stable |

Language APIs

| Language | Best For | Status |
| --- | --- | --- |
| Kotlin | Android apps & JVM | Stable |
| Python | Prototyping & scripting | Stable |
| C++ | High-performance native | Stable |
| Swift | Native iOS & macOS | In development |

Architecture

Engine Layer

Model loading, hardware backend selection (CPU/GPU/NPU), session and conversation management.
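Backend selection typically follows a preference-ordered fallback. A minimal sketch of that idea (the names and preference order here are illustrative assumptions, not the engine's actual logic):

```python
# Illustrative backend-selection fallback: prefer NPU, then GPU, then CPU,
# based on what the device reports as available.
PREFERENCE = ["NPU", "GPU", "CPU"]

def select_backend(available):
    for backend in PREFERENCE:
        if backend in available:
            return backend
    raise RuntimeError("no supported backend available")

print(select_backend({"GPU", "CPU"}))  # GPU
```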

Prefill-Decode Pipeline

Two-phase inference: Prefill processes the full prompt, Decode generates tokens autoregressively.
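The two phases can be sketched with a toy model (the "state" below stands in for what a real engine keeps in its KV cache; the arithmetic model is purely illustrative):

```python
# Toy two-phase loop: "prefill" processes the whole prompt at once to build
# state, then "decode" generates one token at a time from that state.
# The stand-in model emits the running sum modulo 10.

def prefill(prompt_tokens):
    # Single pass over the full prompt; in a real engine this would
    # populate the KV cache for every prompt position at once.
    return sum(prompt_tokens)

def decode_step(state):
    # Autoregressive step: produce the next token and the updated state.
    token = state % 10
    return token, state + token

def generate(prompt_tokens, n_tokens):
    state = prefill(prompt_tokens)
    out = []
    for _ in range(n_tokens):
        token, state = decode_step(state)
        out.append(token)
    return out

print(generate([3, 4], 3))  # [7, 4, 8]
```

The split matters for performance: prefill is one parallel pass over the prompt, while decode is inherently sequential, one token per step.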

Component System

Modular tokenizers (SentencePiece, HuggingFace), samplers (Top-K, Top-P, Greedy), and LoRA adapters.
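The three samplers named above are standard techniques; a compact, self-contained sketch of each (operating on a plain list of logits, independent of any LiteRT-LM types):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    # Greedy: always pick the highest-logit token.
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k(logits, k, rng=random):
    # Top-K: keep the k highest-logit tokens, renormalize, sample.
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in idx])
    return rng.choices(idx, weights=probs)[0]

def top_p(logits, p, rng=random):
    # Top-P (nucleus): keep the smallest set of tokens whose cumulative
    # probability reaches p, then sample within it.
    probs = softmax(logits)
    idx = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in idx:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept])[0]

logits = [0.1, 2.0, -1.0, 0.5]
print(greedy(logits))    # 1
print(top_k(logits, 2))  # one of the two most likely tokens (1 or 3)
```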