
A practical guide to running Gemma 4 on your iPhone, Android phone, and Mac Mini. Which model, which quantization, which tool, and what performance to expect.
I downloaded all four Gemma 4 models onto my Mac Mini. Then I tried the 31B on my iPhone.
It crashed.
The phone got hot. The app froze. I wasted 40 minutes waiting for a download that was never going to work on 8 GB of RAM.
Gemma 4 is the first open model family that genuinely runs on phones. But "runs on phones" doesn't mean "every model runs on every phone." Pick the wrong variant and you'll burn an afternoon downloading a model that won't fit in memory.
This guide is the cheat sheet I wish I'd had. For every device — iPhone, Android, Mac Mini — I'll tell you exactly which Gemma 4 model to run, at what quantization, using which tool, and what speed to expect.
No guessing. No trial and error.
Google released Gemma 4 on April 2, 2026, with something the open-source community had been begging for: an Apache 2.0 license. No usage restrictions. No 700-million MAU limits. Just open.
The family has four members, and they're wildly different:
E2B — The pocket rocket. 2.3 billion effective parameters (5.1B total). Fits in 3.1 GB quantized. Runs on basically anything, including a Raspberry Pi.
E4B — The balanced pick. 4.5 billion effective (8B total). Needs 5 GB quantized. The sweet spot for flagship phones.
26B A4B — The magician. 26 billion total parameters, but only 4 billion active per token. Near-31B quality at half the compute.
31B — The heavy hitter. 31 billion dense parameters. Ranks #3 among all open models. Needs serious hardware.
All four are natively multimodal — text, images, video. E2B and E4B also handle audio. Context: 128K (small) to 256K (large).

You quantize models — compress weights from 16-bit floats to 4-bit or 8-bit integers. Here's what matters:
Q4_K_M — ~92% quality. ~75% smaller. Default for phones and lower-RAM Macs.
Q8_0 — ~99% quality. ~50% smaller than full precision. Noticeably better for reasoning.
BF16 — Full precision. Only for 64+ GB RAM machines.
One thing nobody tells you upfront: the KV cache adds 30-50% memory on top of model weights. Always budget 30% headroom.
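As a sanity check before downloading anything, here's a quick back-of-envelope budget. The file sizes are the quantized figures quoted in this guide, and the 1.3 multiplier is the ~30% headroom rule above — treat it as a rough sketch, not a precise measurement:

```bash
# Rule of thumb: budget the model's file size plus ~30% for KV cache and overhead.
# Sizes below are the approximate quantized file sizes quoted in this guide (GB).
for entry in "E2B-Q4:3.1" "E4B-Q4:5.0" "26B-A4B-Q4:16.9" "31B-Q4:18.3"; do
  name=${entry%%:*}
  size=${entry##*:}
  need=$(awk -v s="$size" 'BEGIN { printf "%.1f", s * 1.3 }')
  echo "$name: ${size} GB on disk -> budget ~${need} GB of RAM"
done
```

If the budgeted number is bigger than your device's free RAM, pick a smaller model or a lower quantization before you start the download.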
You'll run E2B or E4B. The 26B and 31B don't fit on any current iPhone.
iPhone 15/16 Pro (8 GB): E2B Q4 comfortably, or E4B Q4 if you want more quality and don't mind a tighter fit.
iPhone 14 Pro (6 GB): E2B Q4 only.
Google AI Edge Gallery — Google's official app. Download, select model, go. Easiest path.
LM Studio iOS (beta) — GGUF models with more control. Most flexible for power users.
LiteRT-LM — Google's production runtime. For building iOS apps with embedded Gemma 4.
Fair warning: After 3-5 minutes of sustained inference, your iPhone will thermal-throttle. Short bursts work great. Hour-long conversations — not ideal.
Android has a wider hardware range — more options and more confusion.
8 GB RAM — E2B Q4 comfortable. E4B Q4 tight.
12 GB RAM — E2B Q8 or E4B Q4. The sweet spot.
16 GB RAM — E4B Q8 runs great.
Google AI Edge Gallery — Fastest path. Play Store, pick model, go.
LiteRT-LM — Production runtime for Android apps.
Android AICore — System-level Gemma 4 as a shared service.
Termux + llama.cpp — Full control, command line.
```bash
# Termux needs a compiler and Python for the tooling below
pkg install clang cmake git python
pip install -U "huggingface_hub[cli]"

# Build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build -j4

# Download the E2B Q4 model from Hugging Face into the current directory
MODEL=gemma-4-E2B-it-Q4_K_M.gguf
HF=unsloth/gemma-4-E2B-it-GGUF
huggingface-cli download $HF $MODEL --local-dir .

# Run (the default Termux build is CPU-only, so no GPU-offload flags are needed)
./build/bin/llama-cli -m $MODEL -c 4096
```
Best chipsets: Snapdragon 8 Gen 2+, Dimensity 9000+, Tensor G3/G4. On Snapdragon 8 Gen 3: E2B 20-35 tok/s, E4B 12-20 tok/s.
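Those numbers vary with chipset, RAM speed, and how warm the phone already is, so it's worth measuring your own device. llama.cpp ships a benchmark tool, built alongside `llama-cli` in the steps above, that reports prompt-processing and generation tok/s (the model filename is the one downloaded earlier):

```bash
# Benchmark the E2B Q4 model: 128-token prompt, 64 tokens of generation.
./build/bin/llama-bench -m gemma-4-E2B-it-Q4_K_M.gguf -p 128 -n 64
```

Run it twice — once cold and once after a few minutes of use — and you'll see the thermal throttling described below show up in the numbers.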
This is where Gemma 4 gets exciting.
A Mac Mini sits on your desk, draws less power than a light bulb, and serves Gemma 4 to every device on your network. 24/7, for about $15 a year in electricity.

Recommendation: Mac Mini M4 24 GB + 26B A4B Q4. Near-31B quality, runs comfortably. M4 Pro 48 GB if budget allows.
Why Mac Mini punches above its weight: unified memory. CPU and GPU share the same RAM pool. No copying data back and forth.
```bash
# Install (macOS -- the curl installer at ollama.com/install.sh is Linux-only)
brew install ollama
ollama --version # need v0.20.0+

# Pull a model (pick one for your RAM)
ollama pull gemma4:e2b # 8 GB Macs
ollama pull gemma4:e4b # 16 GB Macs
ollama pull gemma4:26b # 24 GB+ Macs
ollama pull gemma4:31b # 48 GB+ Macs

# Test
ollama run gemma4:26b "Hello, world"

# Listen on all interfaces
launchctl setenv OLLAMA_HOST 0.0.0.0
# Keep model in memory forever
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
brew services restart ollama
```
Now any device on your network can hit http://<mac-mini-ip>:11434 with the OpenAI-compatible API.
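A chat request from any machine on the LAN looks like this. The IP is a placeholder for your Mac Mini's actual address, and the final `curl` assumes the server configured above is up and reachable:

```bash
# Placeholder -- substitute your Mac Mini's LAN IP.
MAC_MINI_IP=192.168.1.50

# OpenAI-style chat payload for Ollama's /v1/chat/completions endpoint.
BODY='{"model":"gemma4:26b","messages":[{"role":"user","content":"Say hello in one sentence."}]}'
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload OK"

# Send it (uncomment once the Ollama server above is running):
# curl -s "http://$MAC_MINI_IP:11434/v1/chat/completions" \
#   -H "Content-Type: application/json" -d "$BODY"
```

Because the endpoint is OpenAI-compatible, any client library that accepts a custom base URL can point at the Mac Mini instead of a cloud provider.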
Ollama's MLX backend (preview) is up to 93% faster on Apple Silicon. For maximum speed:
```bash
pip install rapid-mlx
rapid-mlx serve gemma-4-26b-a4b-it --quant 4bit
```
This hit 85 tok/s on the 26B A4B on M3 Ultra. ChatGPT-like speed, running on your desk.
Here's the table that would've saved me an afternoon:

| Device | RAM | Model | Quant | Size | Tool | tok/s |
|---|---|---|---|---|---|---|
| iPhone 14 Pro | 6 GB | E2B | Q4_K_M | 3.1 GB | AI Edge Gallery | 15-25 |
| iPhone 15/16 Pro | 8 GB | E2B | Q4_K_M | 3.1 GB | AI Edge Gallery | 20-35 |
| iPhone 15/16 Pro | 8 GB | E4B | Q4_K_M | 5.0 GB | LM Studio iOS | 12-20 |
| Android 8 GB | 8 GB | E2B | Q8_0 | 5.1 GB | AI Edge Gallery | 20-35 |
| Android 12 GB | 12 GB | E4B | Q4_K_M | 5.0 GB | AI Edge Gallery | 12-20 |
| Android 16 GB | 16 GB | E4B | Q8_0 | 8.2 GB | LiteRT-LM | 12-20 |
| Mac Mini M1 | 8 GB | E2B | Q8_0 | 5.1 GB | Ollama | 30-50 |
| Mac Mini M2 | 16 GB | E4B | Q8_0 | 8.2 GB | Ollama | 40-60 |
| Mac Mini M4 | 24 GB | 26B A4B | Q4_K_M | 16.9 GB | Ollama/MLX | 15-25 |
| Mac Mini M4 Pro | 48 GB | 31B | Q4_K_M | 18.3 GB | Ollama/MLX | 20-35 |
For most people: Mac Mini M4 24 GB + 26B A4B Q4. Near-GPT-4-class quality, $15/year in power.
KV cache will blindside you. At 128K context, the 31B needs 30 GB total — not 18.3 GB. Check total memory with context, not just file size.
Phones thermal-throttle after 3-5 minutes. Use phones for quick queries, not extended sessions.
MLX vs. llama.cpp: pick one. MLX is 20-30% faster on Apple Silicon. llama.cpp for cross-platform. Don't install both.
Q4 for chat, Q8 for code. Quantization hurts reasoning more than conversation.
MediaPipe is deprecated. Use LiteRT-LM instead. Same Google team, better runtime.
Ollama needs v0.20.0+ for Gemma 4. Check with ollama --version first.
E2B for your phone. 26B A4B for your Mac Mini. That's the 80/20 answer.
The E2B fits on any modern smartphone, runs fully offline, 20-35 tokens per second. The 26B A4B delivers near-flagship quality on a $600 Mac Mini that costs $15 a year to run.
You don't need a $3,000 GPU rig. You don't need a cloud subscription. A phone and a Mac Mini running Apache 2.0 Gemma 4 — that's a genuinely useful local AI setup in 2026.
