Google's production-ready, high-performance inference framework for deploying Large Language Models on edge devices.
- **Cross-platform:** Android, iOS, Web, Desktop, and IoT (Raspberry Pi). Deploy anywhere with a single framework.
- **Hardware acceleration:** GPU and NPU backends for peak inference performance on edge devices.
- **Multimodal input:** vision and audio support for rich multimodal experiences.
- **Function calling:** built-in support for agentic workflows with structured output.
- **Multi-token prediction:** lower decode latency on supported models.
- **Language bindings:** Python, Kotlin, C++, and C, with Swift in development.

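
As a rough illustration of how structured function calling typically flows through an agentic runtime (the tool registry and wire format below are hypothetical, not the framework's actual API), the model emits a JSON tool call that the host parses and dispatches:

```python
import json

# Hypothetical tool registry; a real integration would use the framework's
# own schema and registration mechanism.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch_tool_call(model_output: str) -> str:
    """Parse a structured tool call emitted by the model and invoke it."""
    call = json.loads(model_output)  # e.g. {"name": ..., "arguments": {...}}
    func = TOOLS[call["name"]]
    return func(**call["arguments"])

# The model produces structured output instead of free-form text:
result = dispatch_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(result)  # Sunny in Paris
```

Because the output is constrained to a schema, the host can dispatch calls without brittle text parsing.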
| Model Family | Capabilities | Status |
|---|---|---|
| Gemma 3 / 3N / 4 | Text, Vision, Audio, Tool Calling | Stable |
| Qwen 2.5 / 3 | Text, Tool Calling | Stable |
| Llama | Text | Stable |
| Phi-4 | Text | Stable |

| Language | Best For | Status |
|---|---|---|
| Kotlin | Android apps & JVM | Stable |
| Python | Prototyping & scripting | Stable |
| C++ | High-performance native code | Stable |
| Swift | Native iOS & macOS | In development |
Model loading, hardware backend selection (CPU/GPU/NPU), session and conversation management.
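
A minimal sketch of how such a layer is commonly shaped (all names here are illustrative stand-ins, not the framework's real API): an engine owns the loaded model and backend choice, and hands out sessions that carry per-conversation state.

```python
from dataclasses import dataclass, field

# Illustrative sketch only; names and signatures are hypothetical.
@dataclass
class Engine:
    model_path: str
    backend: str = "cpu"  # one of "cpu", "gpu", "npu"

    def create_session(self) -> "Session":
        return Session(engine=self)

@dataclass
class Session:
    engine: Engine
    history: list = field(default_factory=list)  # conversation state lives here

    def send(self, prompt: str) -> str:
        self.history.append(("user", prompt))
        # Stand-in for real inference on the selected backend.
        reply = f"[{self.engine.backend}] echo: {prompt}"
        self.history.append(("model", reply))
        return reply

engine = Engine(model_path="gemma.task", backend="gpu")
session = engine.create_session()
print(session.send("Hello"))  # [gpu] echo: Hello
```

Separating the engine (heavyweight, loaded once) from sessions (cheap, one per conversation) lets many conversations share a single loaded model.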
Two-phase inference: a prefill stage processes the full prompt in a single parallel pass, then a decode stage generates tokens one at a time, autoregressively.
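
The two phases can be sketched with a toy next-token function standing in for the model forward pass (nothing here is the framework's real decode loop):

```python
def toy_next_token(context: list[str]) -> str:
    # Stand-in for a model forward pass over the cached context.
    vocab = ["the", "cat", "sat", "<eos>"]
    return vocab[len(context) % len(vocab)]

def generate(prompt_tokens: list[str], max_new: int = 8) -> list[str]:
    # Prefill: the whole prompt is ingested at once, populating the
    # context (in a real engine, the KV cache) before any output token.
    context = list(prompt_tokens)
    # Decode: one token per step, each fed back as input for the next.
    out = []
    for _ in range(max_new):
        tok = toy_next_token(context)
        if tok == "<eos>":
            break
        context.append(tok)
        out.append(tok)
    return out
```

Prefill is compute-bound and parallel; decode is sequential, which is why it dominates latency and why techniques like multi-token prediction target it.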
Modular tokenizers (SentencePiece, HuggingFace), samplers (Top-K, Top-P, Greedy), and LoRA adapters.