
The definitive deep dive into Google Gemma 4 — covering the complete Gemma series evolution from 2024-2026, the 'significant-otter' leak, MoE architecture innovations, benchmark breakthroughs (AIME 89.2%), unprecedented day-one ecosystem coverage, and the full spectrum of community reception.
Google Gemma 4, officially released on April 2, 2026, represents the largest generational leap in the Gemma series to date. This generation marks the first adoption of the Apache 2.0 open-source license (all previous Gemma models used restrictive licenses), offers 4 model sizes (E2B, E4B, 26B MoE, 31B Dense), natively supports four modalities — text, image, video, and audio — and extends the context window up to 256K tokens. On the AIME 2026 math benchmark, the 31B model soared from Gemma 3's 20.8% to 89.2%, while coding performance (LiveCodeBench) jumped from 29.1% to 80.0%, signaling Gemma's transition from "usable" to "competitive with top-tier large models." Within 48 hours of release, Ollama pulls exceeded 207,000, and by April 10, cumulative downloads had surpassed 400 million.

To fully appreciate Gemma 4's significance, we need to trace the entire series' development arc. Since early 2024, Google has iterated the Gemma family at roughly six-month to one-year intervals, with each generation bringing significant upgrades in architecture, modality support, and openness.
Gemma 1 (February 21, 2024) was Google's first open-weights model family aimed at developers, offering 2B and 7B parameter sizes built on the same technical foundation as Gemini. Two months later on April 9, CodeGemma (2B/7B), focused on code generation, was released.
Gemma 2 (June 27, 2024) was officially launched after being previewed at Google I/O 2024 (May 14), expanding to 9B and 27B parameter sizes and introducing Grouped-Query Attention (GQA) and an 8K-token context window. A 2B variant and the safety evaluation model ShieldGemma followed on July 31. Later that year, the vision-language model PaliGemma and its successor PaliGemma 2 were released.
Gemma 3 (March 12, 2025) achieved several key breakthroughs: the first introduction of multimodal capabilities (text + image input), a context window expansion from 8K to 128K tokens, support for 140+ languages, and four model sizes at 1B, 4B, 12B, and 27B. It reached 1338 Elo on LMArena, outperforming many larger models. At Google I/O 2025 on May 22, Google released Gemma 3n (E2B/E4B) optimized for edge devices, introducing the Per-Layer Embeddings (PLE) architectural innovation that would play a major role in Gemma 4.
| Generation | Release Date | Time Since Previous | Key Breakthroughs |
|---|---|---|---|
| Gemma 1 | 2024-02-21 | — | First open-weights model |
| Gemma 2 | 2024-06-27 | ~4 months | 27B parameters, GQA |
| Gemma 3 | 2025-03-12 | ~8.5 months | Multimodal, 128K context |
| Gemma 3n | 2025-05-22 | ~2 months (sub-release) | Edge optimization, PLE |
| Gemma 4 | 2026-04-02 | ~12.5 months | Apache 2.0, MoE, 256K, audio |
Roughly a week before Gemma 4's official release, a series of leaks and hints had already sent ripples through the community.
March 28-29, 2026 — An anonymous model codenamed "significant-otter" appeared on the LMSYS Chatbot Arena. When users pressed it about its identity, the model replied directly: "I am Gemma 4, a large language model developed by Google DeepMind." Users on Reddit's r/LocalLLaMA were the first to spot the leak, noting the model's fast response speed, passage of baseline capability tests, and that it was not a reasoning-specialized model. The leaked information also hinted at 2B and 4B Dense variants and a 120B/15B-active MoE model (which has yet to be officially released). This event was widely covered by barnacle.ai, KuCoin News, multiple AI newsletters, and Reddit discussions.
Late March to early April — Google's "Gemma models family" collection on Hugging Face received updates — exactly mirroring the pattern seen before the Gemma 2 and Gemma 3 releases, and flagged by community model watchers as a signal of an imminent launch.
Early hours of April 2 (hours before launch) — Google DeepMind CEO Demis Hassabis posted four diamond emojis on X, followed by Google AI Studio and Gemini API lead Logan Kilpatrick posting a single word: "Gemma." Both posts were widely interpreted by the community as countdown signals, triggering a wave of excited pre-launch discussion.
Gemma 4 is a Transformer-based model family derived from Gemini 3 research, employing a hybrid attention mechanism (alternating between local sliding-window attention and global full-context attention) along with several cutting-edge architectural innovations.
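The alternating pattern can be sketched as a pair of attention masks; the local/global layer ratio and the window size below are illustrative assumptions, not Gemma 4's published configuration:

```python
import numpy as np

def attention_mask(seq_len: int, layer: int, window: int = 4) -> np.ndarray:
    """Causal attention mask for a hybrid stack: most layers use local
    sliding-window attention, while periodic layers attend over the full
    context. The 5:1 local/global ratio and window=4 are toy values."""
    q = np.arange(seq_len)[:, None]   # query positions
    k = np.arange(seq_len)[None, :]   # key positions
    causal = k <= q                   # never attend to future tokens
    if layer % 6 == 5:                # every 6th layer: global full-context
        return causal
    return causal & (q - k < window)  # local: only the last `window` tokens

local_mask = attention_mask(8, layer=0)
global_mask = attention_mask(8, layer=5)
print(local_mask[7].astype(int))   # [0 0 0 0 1 1 1 1]
```

The payoff is that KV-cache memory for the local layers stays bounded by the window size regardless of context length, while the sparser global layers preserve long-range recall.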
Model Specifications Overview:
| Model | Architecture | Total Parameters | Effective/Active Parameters | Layers | Context Window | Modalities |
|---|---|---|---|---|---|---|
| E2B | Dense + PLE | 5.1B | 2.3B | 35 | 128K | Text, image, video, audio |
| E4B | Dense + PLE | 8B | 4.5B | 42 | 128K | Text, image, video, audio |
| 26B A4B | MoE | 25.2B | 3.8B | 30 | 256K | Text, image, video |
| 31B | Dense | 30.7B | 30.7B | 60 | 256K | Text, image, video |
The 26B A4B is the first MoE (Mixture-of-Experts) model in the Gemma series, with 128 experts per layer, activating 8 experts plus 1 shared expert per token. In practice, only ~3.8B parameters are active during inference, achieving speeds comparable to a 4B Dense model.
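The routing arithmetic can be illustrated with a generic top-k MoE sketch using toy weights; this is not Gemma 4's actual implementation, only the 128-expert / 8-active / 1-shared figures follow the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 128   # routed experts per layer (per the spec above)
TOP_K = 8           # experts activated per token
D_MODEL = 64        # toy hidden size for illustration

# Toy expert FFNs: one matrix each (real experts are gated MLPs).
experts = rng.standard_normal((NUM_EXPERTS, D_MODEL, D_MODEL)) * 0.02
shared_expert = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02
router = rng.standard_normal((D_MODEL, NUM_EXPERTS)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts plus the always-on shared expert."""
    logits = x @ router                               # (tokens, NUM_EXPERTS)
    top_idx = np.argsort(logits, axis=-1)[:, -TOP_K:]  # top-k expert indices
    out = x @ shared_expert                           # shared expert sees all
    for t in range(x.shape[0]):
        sel = logits[t, top_idx[t]]
        weights = np.exp(sel - sel.max())
        weights /= weights.sum()                      # softmax over selected
        for w, e in zip(weights, top_idx[t]):
            out[t] += w * (x[t] @ experts[e])
    return out

tokens = rng.standard_normal((4, D_MODEL))
y = moe_layer(tokens)
print(y.shape)  # (4, 64)
```

Only 9 of the 128 expert matrices touch each token, which is why active parameters (~3.8B) sit far below total parameters (25.2B), and why inference speed tracks a ~4B Dense model.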
Core architectural innovations include:
- Per-Layer Embeddings (PLE): each decoder layer receives an independent token-level conditioning signal, rather than front-loading all information into a single input embedding.
- Shared KV Cache: the last N layers reuse key-value states from earlier layers to save memory.
- Dual RoPE configuration: standard RoPE for sliding-window layers and proportional RoPE (p-RoPE) for global layers, supporting longer contexts.
- Unified Keys and Values: further optimizes long-context memory.

The vision encoder supports variable aspect ratios with a configurable token budget (70 to 1,120 tokens per image), while the audio encoder (E2B/E4B only) is based on a USM-style Conformer architecture supporting up to 30 seconds of audio input. The vocabulary size is 262K tokens, supporting 140+ languages (35+ out of the box), with GeGLU activation and RMSNorm normalization.
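A toy sketch of the Per-Layer Embeddings idea (table sizes and the projection are illustrative, not the real architecture): each layer looks up its own small embedding for the current tokens instead of relying only on the initial embedding.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, D_MODEL, D_PLE, N_LAYERS = 1000, 64, 16, 4  # toy sizes

# Standard input embedding plus one small per-layer table: every decoder
# layer gets its own token-level conditioning signal. (Illustrative only.)
tok_embed = rng.standard_normal((VOCAB, D_MODEL)) * 0.02
per_layer_embed = rng.standard_normal((N_LAYERS, VOCAB, D_PLE)) * 0.02
ple_proj = rng.standard_normal((N_LAYERS, D_PLE, D_MODEL)) * 0.02

def forward(token_ids: np.ndarray) -> np.ndarray:
    h = tok_embed[token_ids]                      # (seq, D_MODEL)
    for layer in range(N_LAYERS):
        ple = per_layer_embed[layer, token_ids]   # per-layer lookup (seq, D_PLE)
        h = h + ple @ ple_proj[layer]             # inject layer-specific signal
        h = np.tanh(h)                            # stand-in for the real block
    return h

out = forward(np.array([3, 17, 42]))
print(out.shape)  # (3, 64)
```

Because the per-layer tables are consulted only by lookup, an implementation can keep them in slower memory and fetch just the rows it needs, which is the mechanism behind the effective-vs-total parameter gap in the E2B/E4B rows above.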
Notable capabilities include: configurable thinking mode (triggered by the <|think|> token, producing internal reasoning chains of 4,000+ tokens), native function calling and tool use, structured JSON output, native system prompts (a Gemma-series first with system-role support), multi-step planning for agent workflows, object detection and localization (returning bounding-box coordinates natively as JSON), document/PDF parsing, multilingual OCR, chart comprehension, and handwriting recognition.
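As a sketch of how a prompt with a native system role and a thinking toggle might be assembled (the turn markers and <|think|> placement here are assumptions for illustration; consult the official chat template before relying on them):

```python
def build_prompt(system: str, user: str, thinking: bool = True) -> str:
    """Assemble a chat prompt with a system role and an optional thinking
    toggle. Turn markers and <|think|> placement are illustrative
    assumptions, not the official Gemma 4 chat template."""
    turns = [
        f"<start_of_turn>system\n{system}<end_of_turn>",
        f"<start_of_turn>user\n{user}<end_of_turn>",
        # The model turn is left open; the thinking token (if enabled)
        # prompts the model to emit its internal reasoning chain first.
        "<start_of_turn>model\n" + ("<|think|>" if thinking else ""),
    ]
    return "\n".join(turns)

prompt = build_prompt("You are a concise assistant.", "What is p-RoPE?")
print("<|think|>" in prompt)  # True
```

In practice a library's chat-template machinery would handle this assembly; the point is that the system turn is a first-class role rather than text prepended to the user message, as it was in earlier Gemma generations.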
Gemma 4's performance across major benchmarks is nothing short of remarkable — especially compared to Gemma 3 27B, where nearly every metric shows a doubling or even greater improvement. The following figures come from the official model cards (instruction-tuned versions with thinking mode enabled):
| Benchmark | 31B | 26B A4B | E4B | E2B | Gemma 3 27B |
|---|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| Codeforces Elo | 2150 | 1718 | 940 | 633 | 110 |
| MMMU Pro (Vision) | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% | 46.0% |
| t2-bench (Agent) | 76.9% | 68.2% | 42.2% | 24.5% | 16.2% |
| Arena AI Text Elo | 1452 (#3) | 1441 (#6) | — | — | 1365 |
Several improvements stand out: math capability (AIME) surged from 20.8% to 89.2%, a 68.4 percentage point gain; coding (LiveCodeBench) improved by 50.9 percentage points; agent capability (t2-bench) improved by over 60 percentage points; and Codeforces Elo jumped from 110 to 2150, roughly a 20x increase. The 31B model ranks #3 among open models on the Arena AI text leaderboard (and #1 among US open models).
It is worth noting that the official model cards use MMLU Pro (not classic MMLU) and LiveCodeBench/Codeforces ELO (not HumanEval) as primary evaluation metrics — classic MMLU and HumanEval scores were not reported in official materials. A third-party comparison paper (arXiv 2604.07035, April 8, 2026) analyzed the accuracy-efficiency tradeoffs between Gemma 4, Phi-4, and Qwen3 across Dense and MoE reasoning language models. As of now, Google has not yet published a formal technical report for Gemma 4.
Gemma 4's release strategy is virtually unprecedented in open model history — nearly every major AI tool and platform achieved support on the same day, April 2 — reflecting deep coordination between Google and ecosystem partners ahead of the launch.
Platforms and tools live on Day One (April 2):
- Hugging Face: models under the google/gemma-4-* namespace; by April 10, the 31B model had 1.33M+ downloads, the 26B had 1.05M+, and the community had created 1,156+ models tagged with gemma4
- Ollama: available under the gemma4 name with 17 model tags; pulls reached 1.8M+ by April 10

Third-party API providers that followed on April 2-3: OpenRouter (31B priced at $0.14/M input, $0.40/M output tokens), Together AI, Fireworks AI, Replicate, plus Featherless AI, Scaleway, OVHcloud, and others via Hugging Face Inference Providers.
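At the OpenRouter prices quoted above ($0.14/M input, $0.40/M output for the 31B), a per-request cost estimate is simple arithmetic:

```python
INPUT_PER_M = 0.14   # USD per million input tokens (quoted above)
OUTPUT_PER_M = 0.40  # USD per million output tokens (quoted above)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one API call at the quoted 31B pricing."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# e.g. summarizing a 10K-token document into a 1K-token answer:
cost = request_cost(10_000, 1_000)
print(f"${cost:.5f}")  # $0.00180
```

At these rates, a thinking-mode response that burns 4,000+ reasoning tokens costs noticeably more than the visible output alone, which is worth factoring into agent-loop budgets.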
Major platforms still without Gemma 4 support as of April 10: Groq (community feature request submitted April 3 but not yet fulfilled), AWS Bedrock (only lists Gemma 3; Gemma 4 requires self-hosted deployment via SageMaker), Azure AI Foundry (also supports only Gemma 3; manual deployment required).
Gemma 4's social media buzz followed a clear three-phase curve: "leak-driven warmup, launch-day explosion, and deep evaluation."
Phase 1: Warmup (March 28 - April 1). The "significant-otter" leak on the LMSYS Arena first sparked discussion on r/LocalLLaMA, then spread to X/Twitter and AI newsletters. Demis Hassabis's four diamond emojis and Logan Kilpatrick's one-word post further fueled expectations.
Phase 2: Explosion (April 2-4). On launch day, the Hacker News thread "Google releases Gemma 4 open models" garnered 1,306+ upvotes and hit the front page multiple times. The activity score across 12 subreddits and 544 Twitter accounts tracked by AINews reached 3,412 points. Ollama pulls surpassed 207,000 within 48 hours. The most viral technical demonstration came from llama.cpp creator Georgi Gerganov (@ggerganov), who showcased Gemma 4 26B A4B Q8_0 running real-time video processing at 300 t/s on an M2 Ultra.
Phase 3: Deep Evaluation (April 4-10). The community shifted to hands-on deployment and comparative testing. r/LocalLLaMA saw a flood of practical posts including quantization recommendations for "Gemma 4 on 16 GB VRAM," TurboQuant KV cache quantization experiments, and Apple Silicon multimodal fine-tuning tools (which earned 152 upvotes on HN). A demo running real-time audio-visual AI with Gemma E2B on an M3 Pro drew attention on both Reddit and HN simultaneously.
In the Chinese-language community, several highly upvoted posts on Zhihu provided in-depth Gemma 4 analysis. Xinzhiyuan (a leading Chinese AI media outlet) ran a headline declaring that the "31B destroys giants 20x its size." A top Zhihu contributor's testing concluded that the 31B matches DeepSeek V3.2 in quality while using only 65% of the tokens Qwen3.5-27B needs for equivalent output, but that its reasoning consistency is significantly lower than Qwen3.5's: only 1 test question maintained consistent output across 3 runs, versus 8 for Qwen3.5. Additionally, the fact that Gemma 4 was jailbroken by Heretic v1.2.0 within 90 minutes of release (a KL divergence of only 0.1522, with virtually no capability loss) sparked widespread discussion on Zhihu about the fundamental limitations of AI safety.
Five areas of strong community approval:
- Apache 2.0 license: widely regarded as "the single most important change," eliminating the enterprise legal friction that plagued Gemma 1-3.
- Parameter efficiency: the 31B competes with 200B+ models, while the 26B MoE approaches 31B quality with only 3.8B active parameters, earning the label "the best intelligence-per-parameter ratio available."
- Edge deployment: running E2B in 2GB of memory is possible, marking "the first time VRAM-constrained users can run a genuinely useful model locally."
- Day-one ecosystem coverage: unprecedented in breadth.
- Performance gains over Gemma 3: described by the community as "this is not an incremental improvement — this is an entirely different model."
Key criticisms centered on the following:
- MoE inference speed below expectations: the 26B A4B managed only ~11 t/s on an RTX 5060 Ti, while Qwen 3.5 35B-A3B achieves 60+ t/s, with significant MoE routing overhead.
- Early fine-tuning toolchain issues: PEFT could not handle Gemma4ClippableLinear layers, text-only data required mm_token_type_ids, among other problems.
- Disappointment that the rumored larger 120B MoE model did not materialize.
- Audio restricted to the smaller models (E2B/E4B) rather than 26B/31B, seen as an oversight.
- Reasoning stability and hallucination issues: a botany benchmark scored only 2.5/5 even with search augmentation enabled.

Redis creator antirez's criticism on HN drew significant attention: "Showing ELO score as main benchmark metric is very misleading. The large Dense Gemma 4 model does not seem to surpass Qwen 3.5 27B Dense in most benchmarks."
Consensus on comparisons with major competitors: The comparison with Qwen 3.5 was the most discussed topic — benchmarks are close, with Qwen 3.5 slightly ahead on MMLU Pro (86.1% vs 85.2%) and GPQA Diamond, while Gemma 4 leads on AIME and Codeforces. However, Qwen offers faster inference and more reliable agent tasks, leading one HN commenter to summarize: "Gemma 4 feels better, Qwen 3.5 works better." Compared to Llama 4, Gemma 4 holds clear advantages in license openness (Apache 2.0 vs Llama's 700M MAU restriction) and deployment flexibility (from phones to workstations vs Llama's server-class-only). Compared to DeepSeek V3.2, Zhihu testers found the Gemma 4 31B "comparable" in quality but at lower parameter cost. Notably, the 31B model, with far fewer total parameters than its rivals, stands alongside Kimi K2.5 (744B-A40B) and GLM-5 (1T-A32B) as one of the world's top open models.
Gemma 4 is more than a technical upgrade — it represents a fundamental shift in Google's open-source AI strategy. The cumulative impact of moving from a restrictive license to Apache 2.0, from text-only to four modalities, and from a single Dense architecture to a dual Dense+MoE lineup makes Gemma 4 the most closely watched open model release of 2026 so far. The real innovation lies not in any single architectural technique, but in the masterful combination of known techniques — PLE, shared KV cache, p-RoPE, 128-small-expert MoE — as one Zhihu technical analyst put it: "None of these tricks are new on their own, but together they are genuinely powerful." Sebastian Raschka's analysis corroborates this — the 31B architecture is nearly unchanged from Gemma 3, with the performance leap driven primarily by improvements in training recipes and data.
Looking ahead, the community is still waiting for three things: whether the rumored 120B+ MoE model will ship, when the fine-tuning toolchain will stabilize, and when native integration with AWS Bedrock, Azure, and other major cloud platforms will arrive. Nathan Lambert's assessment may be the most incisive: "Gemma 4 is all about ease of use... not benchmark scores." Just over a week since launch, the ecosystem is still maturing, but the direction is clear — Google has finally found its rhythm in the open model space.
