Gemma 4: The Ultimate Guide to Google DeepMind's Most Capable Open-Weights Model
Published on April 4, 2026
Built from Gemini 3 research. Apache 2.0 licensed. Four model sizes from edge devices to workstations. Native multimodal reasoning across text, image, video, and audio.
Introduction: Google's Open-Weights Paradigm Shift
On April 2, 2026, Google DeepMind released Gemma 4, a family of four open-weights models that represents the most significant leap in the Gemma lineage to date. Built using the same underlying research that powers the proprietary Gemini 3 model family, Gemma 4 delivers frontier-level intelligence across a range of deployment targets, from sub-3GB edge devices to multi-GPU workstations.
Two things make Gemma 4 a watershed moment for the open-source AI ecosystem:
Generational Intelligence Jump: Gemma 4 31B scores 39 on the Artificial Analysis Intelligence Index, a jump of +29 points over Gemma 3 27B Instruct (which scores 10). The smaller Gemma 4 E4B gains +13 points over its Gemma 3n predecessor, and E2B gains +10 points. These are not incremental improvements; they represent a fundamental architectural and training overhaul.
Apache 2.0 Licensing: Previous Gemma models shipped under Google's restrictive "Gemma Terms of Use" license. Gemma 4 moves to the commercially permissive Apache 2.0 license, granting developers complete flexibility: modify, redistribute, fine-tune, commercialize, and deploy without usage restrictions. This provides full digital sovereignty: enterprises and sovereign nations can deploy Gemma 4 in air-gapped environments with no licensing callbacks, no usage telemetry, and no commercial gatekeeping.
The result is a model family that competes directly with proprietary offerings from OpenAI, Anthropic, and other closed-source providers, while being entirely self-hostable and commercially unrestricted.
Model Family Overview: Four Sizes, One Architecture
Gemma 4 ships in four distinct sizes, each targeting a specific deployment envelope:
| Model | Total Parameters | Active Parameters | Architecture | Context Window | Modalities | License |
|---|---|---|---|---|---|---|
| Gemma 4 E2B | 5.1B | 2.3B | Dense (PLE) | 128K tokens | Text, Image, Video, Audio | Apache 2.0 |
| Gemma 4 E4B | 8B | ~4B effective | Dense (PLE) | 128K tokens | Text, Image, Video, Audio | Apache 2.0 |
| Gemma 4 26B A4B | 26–27B | 4B (MoE routing) | Mixture-of-Experts | 256K tokens | Text, Image, Video | Apache 2.0 |
| Gemma 4 31B | 31B | 31B | Dense | 256K tokens | Text, Image, Video | Apache 2.0 |
Edge Models: E2B and E4B
The "E" in E2B and E4B stands for "effective" parameters. These models incorporate Per-Layer Embeddings (PLE), a technique where each decoder layer maintains its own small embedding table for every token. These embedding tables are large in raw storage but serve only as fast lookups, so the effective computational footprint is dramatically smaller than the total parameter count.
- E2B (5.1B total / 2.3B active): Designed for on-device deployment with minimal RAM. In 4-bit quantization, model weights fit in under 3GB of RAM, making it suitable for background tasks, basic function calling, and multimodal understanding on smartphones, Raspberry Pi, and IoT hardware.
- E4B (8B total): A step up in capability while remaining edge-friendly. Both E2B and E4B uniquely support native audio input for automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.
Both edge models run completely offline with near-zero latency on devices like phones, Raspberry Pi 5, and NVIDIA Jetson Orin Nano.
Workstation Models: 26B A4B and 31B
- 26B A4B (MoE, 4B active): The "A" stands for "active parameters." This Mixture-of-Experts model contains 26–27B total parameters but activates only a 4B subset during each inference pass via a routing mechanism. The result: it runs nearly as fast as a 4B-parameter model while retaining the reasoning depth of a much larger model. Ideal for applications where tokens-per-second latency is critical.
- 31B Dense: The flagship model, where all 31B parameters activate during inference. Delivers the highest raw intelligence in the family, scoring 39 on the Artificial Analysis Intelligence Index. Optimized for consumer GPUs: a single RTX 4090 or Apple Silicon Mac can serve this model locally, giving students, researchers, and developers the ability to turn workstations into local-first AI servers.
Architecture Deep Dive
Hybrid Attention Mechanism
All Gemma 4 models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks.
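As an illustration of that interleaving, the sketch below builds a layer schedule in Python. The 5:1 local-to-global ratio and 12-layer depth are assumptions for illustration only; the sources cited here do not specify the exact ratio Gemma 4 uses.

```python
def attention_schedule(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Build an illustrative hybrid attention layout.

    Every (local_per_global + 1)-th layer uses full global attention;
    the rest use local sliding-window attention. The final layer is
    forced to be global, matching the design described above.
    """
    schedule = [
        "global" if (i + 1) % (local_per_global + 1) == 0 else "local"
        for i in range(num_layers)
    ]
    schedule[-1] = "global"  # the last layer is always global
    return schedule

print(attention_schedule(12))
# ['local', 'local', 'local', 'local', 'local', 'global', ...]
```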
Memory Optimization for Long Contexts
For global attention layers, Gemma 4 uses two critical innovations:
- Unified Keys and Values: Reduces the KV-cache memory footprint by sharing key-value projections across attention heads, allowing significantly longer context processing within the same memory budget.
- Proportional RoPE (p-RoPE): A modified Rotary Position Encoding scheme that scales proportionally with sequence length, enabling stable attention patterns across the full 128Kโ256K context window without the degradation typically seen at extreme sequence lengths.
Per-Layer Embeddings (PLE) for Edge Models
Rather than adding more layers or parameters, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count (and therefore inference cost) is much smaller than the total. This technique maximizes parameter efficiency specifically for on-device deployments where every megabyte of RAM counts.
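The sketch below illustrates the per-layer lookup idea in PyTorch. It is a simplified toy, not the actual Gemma 4 implementation; the layer structure, dimensions, and the way the lookup joins the residual stream are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PLEDecoderLayer(nn.Module):
    """Illustrative decoder layer with its own small per-token embedding."""

    def __init__(self, hidden: int, vocab: int, ple_dim: int = 64):
        super().__init__()
        # Small per-layer embedding table: large in aggregate storage
        # across all layers, but used only as a cheap lookup at inference.
        self.ple = nn.Embedding(vocab, ple_dim)
        self.proj = nn.Linear(ple_dim, hidden, bias=False)
        self.mlp = nn.Linear(hidden, hidden)

    def forward(self, hidden_states: torch.Tensor, token_ids: torch.Tensor):
        # Look up the per-layer embedding, project it, and fold it into
        # the residual stream before the layer's main computation.
        per_layer = self.proj(self.ple(token_ids))
        return self.mlp(hidden_states + per_layer)
```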
Mixture-of-Experts (MoE) for the 26B A4B
The 26B A4B model uses sparse expert routing: for each token, a routing mechanism selects which subset of the total 26B parameters to activate. Only ~4B parameters fire per token, producing inference speeds comparable to a 4B dense model while maintaining the representational capacity of a much larger network. This makes it an excellent choice for fast, cost-efficient inference compared to the dense 31B model.
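The routing step can be sketched as a standard top-k gate. Expert count and k below are illustrative placeholders, since the figures cited here only pin down total (26–27B) and active (~4B) parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Illustrative sparse MoE gate: pick k experts per token."""

    def __init__(self, hidden: int, num_experts: int = 32, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, hidden) -> routing weights over experts
        logits = self.gate(x)
        weights, indices = torch.topk(logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        # Only the selected experts run, so compute scales with k,
        # not with the total expert count.
        return weights, indices
```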
Configurable Thinking Modes
All Gemma 4 models are designed as highly capable reasoners with configurable thinking modes:
- Thinking Enabled: The model outputs internal reasoning within structured `<|channel>thought\n...<channel|>` tags before providing its final answer. This step-by-step reasoning improves accuracy on multi-step tasks, mathematical problems, and complex coding challenges.
- Thinking Disabled: For the 26B and 31B models, disabling thinking still generates the channel tags but with an empty thought block, and the model proceeds directly to the final answer. For E2B and E4B, the tags are omitted entirely when thinking is disabled.
- Control Token: Thinking is enabled by including the `<|think|>` token at the start of the system prompt. Libraries like Hugging Face Transformers and Ollama handle this automatically via `enable_thinking=True`, as sketched below.
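As a minimal sketch of toggling the two modes, the snippet below builds prompts with and without thinking via the Transformers chat template. It assumes the `google/gemma-4-31b-it` checkpoint and the `enable_thinking` flag described above; the full generation pipeline appears in the setup guide later in this article.

```python
from transformers import AutoTokenizer

# Assumes the checkpoint name and enable_thinking flag described above.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31b-it")

messages = [{"role": "user", "content": "What is 17 * 23?"}]

# Thinking enabled: the template prepends the <|think|> control token.
with_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Thinking disabled: no control token; the model answers directly.
without_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(with_thinking)
print(without_thinking)
```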
Multimodal Capabilities
Gemma 4 is natively multimodal across all four model sizes. Every model in the family processes text, images, and video at variable aspect ratios and resolutions.
Image Understanding
All models support:
- Object detection and localization
- Document and PDF parsing
- Screen and UI understanding
- Chart comprehension
- OCR (including multilingual)
- Handwriting recognition
- Pointing and spatial reasoning
- Variable image resolution via configurable visual token budgets (70, 140, 280, 560, 1120 tokens per image)
Use lower token budgets for classification, captioning, or video frame processing. Use higher budgets for OCR, document parsing, or reading small text.
Video Understanding
All models analyze video by processing sequences of frames. Video input supports a maximum of 60 seconds, assuming frames are sampled at one frame per second.
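In practice, the client samples the frames itself and submits them as images. Here is a minimal sketch using OpenCV (the `cv2` tooling is the author's assumption, not part of Gemma 4); the 60-frame cap mirrors the 60-second limit above.

```python
import cv2  # pip install opencv-python

def sample_frames(path: str, fps: int = 1, max_seconds: int = 60) -> list:
    """Sample one frame per second from a video file, up to 60 frames."""
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / fps))
    frames, i = [], 0
    while len(frames) < max_seconds * fps:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            # OpenCV decodes to BGR; convert to RGB for model input.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        i += 1
    cap.release()
    return frames
```

The returned frames can then be passed as the images argument shown in the Transformers examples later in this guide.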
Audio (E2B and E4B Only)
The edge models feature native audio input supporting:
- Automatic Speech Recognition (ASR): Transcribe speech segments in any supported language with digit-first number formatting.
- Automatic Speech Translation (AST): Transcribe in the source language, then translate to a target language, all within a single inference call (see the sketch after this list).
- Audio supports a maximum length of 30 seconds.
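Below is a minimal ASR sketch using the Transformers pattern from the setup guide later in this article. The audio content schema (`{"type": "audio", ...}`) is an assumption by analogy with the image API; check the released processor documentation for the exact format.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Assumption: audio is passed as a chat-template content part, analogous
# to images; the exact schema may differ in the released processor.
MODEL_ID = "google/gemma-4-E4B-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "meeting_clip.wav"},  # max 30 seconds
        {"type": "text", "text": (
            "Transcribe the following speech segment in English into "
            "English text."
        )},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```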
Interleaved Multimodal Input
All models support freely mixing text and images in any order within a single prompt. For optimal performance, place image and/or audio content before the text in your prompt.
Language Support
Gemma 4 is natively trained on data covering over 140 languages and provides out-of-the-box support for 35+ languages at production quality. The multilingual capability extends beyond translation: the models understand cultural context, idiomatic expressions, and language-specific code commenting conventions.
Agentic Workflows: Beyond Chatbots
Gemma 4 is designed explicitly for autonomous agent use cases, a fundamental shift from the instruct-only Gemma 3 models.
Native Function Calling
All models support structured tool use out of the box. Developers can define function schemas and Gemma 4 will generate valid function calls with proper argument formatting. This enables agentic workflows where the model can (see the sketch after this list):
- Query external APIs
- Interact with databases
- Execute code
- Navigate web interfaces
- Orchestrate multi-step processes
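As a concrete sketch of the tool-use flow, the example below uses the Ollama Python SDK, which can derive a function schema from a plain Python function. The `gemma4:31b` tag matches the Ollama setup section later in this article; `get_weather` is a hypothetical stub.

```python
import ollama

def get_weather(city: str) -> str:
    """Hypothetical stub; a real tool would call a weather API."""
    return f"Sunny, 22 degrees C in {city}"

response = ollama.chat(
    model="gemma4:31b",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[get_weather],  # Ollama derives the JSON schema from the signature
)

# The model replies with structured tool calls instead of prose.
for call in response.message.tool_calls or []:
    result = get_weather(**call.function.arguments)
    print(call.function.name, "->", result)
```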
Structured JSON Output
Models reliably produce machine-parseable JSON output on demand, critical for integration into automated pipelines, data extraction workflows, and API-first architectures.
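For example, here is a minimal sketch with the Ollama Python SDK's JSON mode; `format="json"` constrains decoding to valid JSON, and the model tag again follows the setup guide below.

```python
import json
import ollama

response = ollama.chat(
    model="gemma4:31b",
    messages=[{
        "role": "user",
        "content": "Extract the name and year from 'Ada Lovelace, born 1815' "
                   "as JSON with keys 'name' and 'year'.",
    }],
    format="json",  # constrain decoding to valid JSON
)

record = json.loads(response.message.content)  # guaranteed parseable
print(record["name"], record["year"])
```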
Native System Instructions
Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations. Unlike Gemma 3, which required workarounds, Gemma 4 uses standard system, assistant, and user roles natively.
Offline Code Generation
All models, including the sub-3GB E2B, can generate, complete, and correct code entirely offline. This enables:
- IDE-integrated code completion without cloud connectivity
- Air-gapped development environments
- Privacy-sensitive code generation on proprietary codebases
Multi-Step Planning
The configurable thinking modes allow Gemma 4 to plan multi-step processes, maintain reasoning consistency across dozens of actions, and self-correct when tool calls return unexpected results. The 26B and 31B models are particularly capable in long-horizon agentic scenarios.
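To illustrate the loop structure, here is a compact, hypothetical agent sketch built on the Ollama Python SDK; `get_time` is a stub, and a production agent would add error handling and richer tools.

```python
import ollama

def get_time(timezone: str) -> str:
    """Stub tool for illustration; a real tool would query a clock API."""
    return f"12:00 in {timezone}"

TOOLS = {"get_time": get_time}

def run_agent(prompt: str, model: str = "gemma4:31b", max_steps: int = 8) -> str:
    """Loop until the model stops requesting tools or the budget runs out."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        response = ollama.chat(model=model, messages=messages,
                               tools=list(TOOLS.values()))
        messages.append(response.message)
        if not response.message.tool_calls:
            return response.message.content  # final text answer
        for call in response.message.tool_calls:
            result = TOOLS[call.function.name](**call.function.arguments)
            # Feed the tool result back so the model can self-correct.
            messages.append({"role": "tool", "content": str(result)})
    return "Step budget exhausted."

print(run_agent("What time is it in UTC?"))
```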
Benchmarks & Competitive Analysis
Gemma 4โs benchmark performance has been validated across multiple independent sources: the official Google DeepMind model card, Hugging Face model pages, NVIDIA technical reports, the Artificial Analysis Intelligence Index, and the LMSYS Chatbot Arena. Every number in this section has been cross-referenced against at least three independent sources to ensure accuracy.
1. Official Google Model Card Benchmarks (Instruction-Tuned, Reasoning Variants)
These are the official benchmark results published in the Gemma 4 model card, cross-verified on the Hugging Face model page and Ollama library. All scores reflect instruction-tuned models with thinking enabled.
| Benchmark | 31B Dense | 26B A4B MoE | E4B | E2B | Gemma 3 27B |
|---|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 (Math, no tools) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 (Coding) | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 | 110 |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| τ²-Bench (Agentic, avg. over 3) | 76.9% | 68.2% | 42.2% | 24.5% | 16.2% |
| Humanity's Last Exam (no tools) | 19.5% | 8.7% | – | – | – |
| Humanity's Last Exam (with search) | 26.5% | 17.2% | – | – | – |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% | 19.3% |
| MMMLU (Multilingual) | 88.4% | 86.3% | 76.6% | 67.4% | 70.7% |
Key Takeaways from Official Benchmarks:
- Math (AIME 2026): The 31B scores 89.2%, a staggering +68.4 points over Gemma 3 27B (20.8%). Even the 26B MoE model achieves 88.3%, showing that the MoE variant retains near-identical reasoning depth despite activating only 4B parameters.
- Coding (LiveCodeBench v6): The 31B scores 80.0%, nearly +51 points over Gemma 3 27B (29.1%). The Codeforces ELO of 2150 places it in "Master" territory for competitive programming.
- Reasoning (GPQA Diamond): 84.3% on graduate-level science questions, double the Gemma 3 27B score (42.4%). This benchmark contains PhD-level physics, chemistry, and biology questions.
- Agentic Tasks (τ²-Bench): 76.9% average on realistic agentic tasks (telecom customer service workflows), nearly 5× the Gemma 3 27B score (16.2%).
- Multilingual (MMMLU): 88.4% on the massively multilingual version of MMLU, up from 70.7% for Gemma 3 27B. Notably, E4B (76.6%) outperforms Gemma 3 27B despite being a fraction of the size.
- Edge Models Punch Above Weight: E4B (effective 4B) scores 69.4% on MMLU Pro, surpassing Gemma 3 27B's 67.6%, while running on mobile hardware. E2B matches or exceeds Gemma 3 27B on AIME 2026, LiveCodeBench, and GPQA Diamond.
2. Vision & Multimodal Benchmarks
All Gemma 4 models support native image and video understanding with variable aspect ratios and resolutions. The vision benchmark results from the official model card are:
| Benchmark | 31B | 26B A4B | E4B | E2B | Gemma 3 27B |
|---|---|---|---|---|---|
| MMMU Pro (Visual reasoning) | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| MATH-Vision (Math + visual) | 85.6% | 82.4% | 59.5% | 52.4% | 46.0% |
| OmniDocBench (edit distance ↓) | 0.131 | 0.149 | 0.181 | 0.290 | 0.365 |
Key Vision Insights:
- MMMU Pro: The 31B scores 76.9% on multi-discipline multimodal understanding, a +27.2 point improvement over Gemma 3 27B (49.7%). This benchmark tests reasoning over complex charts, diagrams, circuit schematics, and scientific figures.
- MATH-Vision: 85.6% on mathematical problems that require visual understanding, nearly double Gemma 3 27B's score (46.0%). Tasks include reading graphs, interpreting geometric diagrams, and solving problems from visual data.
- OmniDocBench: An edit distance of 0.131 (lower is better) represents a 64% improvement over Gemma 3 27B (0.365) in document understanding. This measures accuracy in extracting and processing structured information from PDFs, invoices, and complex document layouts.
3. Audio Benchmarks (E2B & E4B Only)
The edge models include native audio input capabilities. Audio benchmarks from the official model card:
| Benchmark | E4B | E2B | Description |
|---|---|---|---|
| CoVoST (ASR + Translation) | 35.54 | 33.47 | Cross-lingual speech-to-text translation quality (higher is better) |
| FLEURS WER (lower is better) | 0.08 | 0.09 | Word Error Rate across multilingual speech recognition |
The E4B achieves a CoVoST score of 35.54 and a FLEURS Word Error Rate of just 8%: production-quality speech recognition running entirely on-device with no cloud dependency. Audio input is capped at 30 seconds.
4. Long-Context Benchmarks (MRCR v2, 8-Needle, 128K)
The Multi-Round Coreference Resolution (MRCR) v2 benchmark measures how well models maintain attention and reasoning across extremely long inputs. The 8-needle variant tests retrieval of eight facts scattered across a 128K-token context window:
| Model | MRCR v2 (8-needle 128K) |
|---|---|
| Gemma 4 31B | 66.4% |
| Gemma 4 26B A4B | 44.1% |
| Gemma 4 E4B | 25.4% |
| Gemma 4 E2B | 19.1% |
| Gemma 3 27B | 13.5% |
The 31B model scores 66.4%, nearly 5× the Gemma 3 27B score (13.5%), confirming that the long context window is not just a theoretical claim: the model can effectively reason over vast contexts. Even the edge E2B model (19.1%) outperforms Gemma 3 27B (13.5%) at long-context retrieval despite being dramatically smaller.
5. Artificial Analysis Intelligence Index
The Artificial Analysis Intelligence Index v4.0 is an independent, composite benchmark incorporating 10 evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPt.
| Model | Intelligence Index | Type | Active Params | Output Tokens Used |
|---|---|---|---|---|
| Gemini 3.1 Pro Preview | 57 | Closed | – | – |
| GPT-5.4 (xhigh) | 57 | Closed | – | – |
| GPT-5.3 Codex (xhigh) | 54 | Closed | – | – |
| Claude Opus 4.6 (Max Effort) | 53 | Closed | – | – |
| Claude Sonnet 4.6 (Max Effort) | 52 | Closed | – | – |
| GLM-5 (Reasoning) | 50 | Open | – | – |
| Kimi K2.5 (Reasoning) | 47 | Open | – | – |
| Qwen3.5 397B A17B (Reasoning) | 45 | Open | ~17B | – |
| Qwen3.5 27B (Reasoning) | 42 | Open | 27B | 98M |
| GLM-4.7 (Reasoning) | 42 | Open | – | 167M |
| MiniMax-M2.5 | 42 | Open | – | 56M |
| DeepSeek V3.2 (Reasoning) | 42 | Open | – | 61M |
| Gemma 4 31B (Reasoning) | 39 | Open | 31B | 39M |
| Qwen3.5 35B A3B (Reasoning) | 37 | Open | ~3B | – |
| Gemma 4 26B A4B (Reasoning) | 30 | Open | 4B | 22M |
| Gemma 4 E4B | 19 | Open | ~4B | 20M |
| Gemma 4 E2B | 15 | Open | 2.3B | 20M |
| Gemma 3 27B Instruct | 10 | Open | 27B | – |
| Gemma 3n E4B Instruct | 6 | Open | – | – |
| Gemma 3n E2B Instruct | 5 | Open | – | – |
Token Efficiency: Gemma 4 31B used just 39M output tokens to run the full Intelligence Index, making it the most token-efficient model at its intelligence tier. Qwen3.5 27B used 98M tokens (2.5× more), GLM-4.7 used 167M (4.3× more), and MiniMax-M2.5 used 56M (1.4× more). In production, this translates directly to lower inference costs and faster response times.
AA Sub-Benchmark Breakdown for Gemma 4 31B (Reasoning) vs Qwen3.5 27B (Reasoning):
| AA Sub-Benchmark | Gemma 4 31B | Qwen3.5 27B | Winner |
|---|---|---|---|
| SciCode | 43% | 40% | Gemma 4 |
| Terminal-Bench Hard | 36% | 33% | Gemma 4 |
| GPQA Diamond | 86% | 86% | Tied |
| IFBench | 76% | 76% | Tied |
| HLE (Humanity's Last Exam) | 23% | 22% | Gemma 4 |
| Agentic Index | 32 | 44 | Qwen3.5 |
The primary gap between Gemma 4 31B (39) and Qwen3.5 27B (42) is in agentic performance, where Qwen's Agentic Index (44) significantly exceeds Gemma 4's (32). On non-agentic evaluations, particularly scientific computing (SciCode) and terminal-based tasks (Terminal-Bench Hard), Gemma 4 consistently leads.
AA-Omniscience (Factual Accuracy / Hallucination Control):
An interesting finding: the smaller Gemma 4 models perform better on factual accuracy than their larger siblings:
| Model | AA-Omniscience Score |
|---|---|
| Gemma 4 E4B | -20 |
| DeepSeek V3.2 (Reasoning) | -21 |
| Gemma 4 E2B | -24 |
| Qwen3.5 27B | -42 |
| Gemma 4 31B | -45 |
| Gemma 4 26B A4B | -48 |
E4B (-20) posts the best score in this list, narrowly ahead of DeepSeek V3.2 (-21), and both edge models score far better than Qwen3.5 27B (-42) and the larger Gemma 4 variants on hallucination control. This suggests the PLE architecture and focused edge training produce models that are more factually grounded per parameter.
6. LMSYS Chatbot Arena / Arena.ai (Human Preference Rankings)
The LMSYS Chatbot Arena measures real-world user preference through blind A/B testing of model responses. As of April 2026, Gemma 4's rankings among open-weights models on the Text leaderboard:
- Gemma 4 31B: #3 open-weights model overall, beating Gemini 2.5 Pro and Qwen3.5-397B in direct human votes
- Gemma 4 26B A4B: #6 open-weights model
Qualitative consensus from the community: Gemma 4 31B is described as "close to Claude 4.5 Sonnet level in many blind tests" while being fully open-weights and Apache 2.0 licensed. This is a remarkable achievement for a 31B-parameter model that can run on a single consumer GPU.
7. Generational Leap: Gemma 3 → Gemma 4
The improvement from Gemma 3 to Gemma 4 is the largest single-generation jump in the Gemma family's history:
| Metric | Gemma 3 27B | Gemma 4 31B | Improvement |
|---|---|---|---|
| AA Intelligence Index | 10 | 39 | +29 points (3.9×) |
| MMLU Pro | 67.6% | 85.2% | +17.6 points |
| AIME 2026 | 20.8% | 89.2% | +68.4 points (4.3×) |
| LiveCodeBench v6 | 29.1% | 80.0% | +50.9 points (2.7×) |
| Codeforces ELO | 110 | 2150 | +2040 points (19.5×) |
| GPQA Diamond | 42.4% | 84.3% | +41.9 points (2.0×) |
| τ²-Bench (Agentic) | 16.2% | 76.9% | +60.7 points (4.7×) |
| BigBench Extra Hard | 19.3% | 74.4% | +55.1 points (3.9×) |
| MMMLU | 70.7% | 88.4% | +17.7 points |
| MMMU Pro (Vision) | 49.7% | 76.9% | +27.2 points |
| MATH-Vision | 46.0% | 85.6% | +39.6 points (1.9×) |
| MRCR v2 128K | 13.5% | 66.4% | +52.9 points (4.9×) |
| Context Window | 128K | 256K | 2× larger |
Edge model improvements are equally dramatic:
| Metric | Gemma 3n E4B | Gemma 4 E4B | Improvement |
|---|---|---|---|
| AA Intelligence Index | 6 | 19 | +13 points (3.2×) |
| MMLU Pro | – | 69.4% | Exceeds Gemma 3 27B |
| AIME 2026 | – | 42.5% | 2× Gemma 3 27B |
| Context Window | 32K | 128K | 4× larger |
| Metric | Gemma 3n E2B | Gemma 4 E2B | Improvement |
|---|---|---|---|
| AA Intelligence Index | 5 | 15 | +10 points (3.0×) |
| AIME 2026 | – | 37.5% | 1.8× Gemma 3 27B |
| Context Window | 32K | 128K | 4× larger |
8. Head-to-Head vs. Open-Weights Competitors
Gemma 4 31B vs. Qwen3.5 27B (Reasoning)
| Dimension | Gemma 4 31B | Qwen3.5 27B | Advantage |
|---|---|---|---|
| AA Intelligence Index | 39 | 42 | Qwen3.5 (+3) |
| MMLU Pro | 85.2% | 86.1% | Qwen3.5 (+0.9) |
| GPQA Diamond | 84.3% | 85.5% | Qwen3.5 (+1.2) |
| LiveCodeBench v6 | 80.0% | 80.7% | Qwen3.5 (+0.7) |
| SciCode | 43% | 40% | Gemma 4 (+3) |
| Terminal-Bench Hard | 36% | 33% | Gemma 4 (+3) |
| Output Tokens (AA eval) | 39M | 98M | Gemma 4 (2.5× more efficient) |
| Agentic Index | 32 | 44 | Qwen3.5 (+12) |
| Multimodality | Text + Image + Video | Text + Image + Video | Tied |
| Audio | E2B/E4B only | No | Gemma 4 |
| Context Window | 256K | 262K (extendable to 1M) | Qwen3.5 |
| License | Apache 2.0 | Apache 2.0 | Tied |
Verdict: Qwen3.5 27B leads overall (+3 on the Intelligence Index), primarily due to stronger agentic capabilities. However, Gemma 4 31B is 2.5× more token-efficient, leads on scientific computing (SciCode) and terminal tasks (Terminal-Bench Hard), and offers native audio support on its edge variants. For cost-sensitive production workloads, Gemma 4's token efficiency translates to significant savings.
Gemma 4 31B vs. Llama 4 Maverick (400B total / 17B active MoE)
| Dimension | Gemma 4 31B | Llama 4 Maverick |
|---|---|---|
| Total Params | 31B | 400B |
| Active Params | 31B (Dense) | 17B (MoE) |
| MMLU Pro | 85.2% | 80.5% |
| GPQA Diamond | 84.3% | 69.8% |
| Context Window | 256K | 1M |
| Multimodality | Text + Image + Video + Audio* | Text + Image |
| License | Apache 2.0 | Llama License |
| Architecture | Dense | MoE |
Verdict: Despite being 13× smaller in total parameters, Gemma 4 31B significantly outperforms Llama 4 Maverick on MMLU Pro (+4.7 points) and GPQA Diamond (+14.5 points). Maverick's advantage is its 1M context window, but Gemma 4 is dramatically more efficient to deploy and cheaper to run. (*Audio on Gemma 4 E2B/E4B edge variants.)
Gemma 4 31B vs. DeepSeek V3.2 / GLM-4.7
Both DeepSeek V3.2 and GLM-4.7 score 42 on the AA Intelligence Index, 3 points above Gemma 4 31B's score of 39. However, Gemma 4 31B distinguishes itself in several critical dimensions:
- Token Efficiency: 39M tokens vs. 61M (DeepSeek) and 167M (GLM-4.7), making it 1.6×–4.3× more efficient.
- Native Multimodality: Gemma 4 processes image, video, and audio natively. DeepSeek V3.2 and GLM-4.7 have more limited multimodal support.
- Edge Deployment: The Gemma 4 family includes E2B and E4B edge models with audio support; no comparable offering exists from DeepSeek or GLM.
- License: Apache 2.0 (Gemma 4) vs. model-specific licenses with varying commercial restrictions.
Gemma 4 26B A4B vs. MoE Peers (~3-4B Active Parameters)
| Model | Intelligence Index | Active Params | Total Params |
|---|---|---|---|
| Qwen3.5 35B A3B (Reasoning) | 37 | ~3B | 35B |
| Gemma 4 26B A4B (Reasoning) | 30 | 4B | 26B |
| GLM-4.7-Flash (Reasoning) | 30 | ~3B | โ |
Qwen3.5 35B A3B leads this tier by 7 points, primarily due to stronger agentic capabilities (Agentic Index 44 vs 32 for Gemma 4 26B A4B). However, Gemma 4 26B A4B matches GLM-4.7-Flash while offering superior multimodal capabilities and Apache 2.0 licensing.
9. Head-to-Head vs. Closed-Source Competitors
Gemma 4 31B vs. GPT-5.3 Codex (xhigh)
The Artificial Analysis comparison reveals the trade-offs between open and closed models:
| Dimension | Gemma 4 31B | GPT-5.3 Codex (xhigh) |
|---|---|---|
| AA Intelligence Index | 39 | 54 |
| License | Apache 2.0 | Proprietary |
| Self-Hostable | Yes | No |
| API Cost | Free (self-hosted) | Premium |
| Customizable | Full fine-tuning | Limited (API only) |
| Data Sovereignty | Complete | None |
| Multimodality | Text + Image + Video + Audio* | Text + Code |
While GPT-5.3 Codex leads by 15 points on intelligence, Gemma 4 31B fills a fundamentally different niche: maximum intelligence per dollar for self-hosted deployments with complete data sovereignty.
Gemma 4 31B vs. GPT-5-mini
| Dimension | Gemma 4 31B | GPT-5-mini |
|---|---|---|
| MMLU Pro | 85.2% | 83.7% |
| GPQA Diamond | 84.3% | 82.8% |
| LiveCodeBench v6 | 80.0% | 80.5% |
| License | Apache 2.0 | Proprietary |
| Self-Hostable | Yes | No |
Gemma 4 31B outperforms GPT-5-mini on MMLU Pro (+1.5 points) and GPQA Diamond (+1.5 points), marking the first time an open-weights model has beaten a GPT-5 family member on major academic benchmarks while being fully self-hostable.
Where Closed-Source Still Leads
The top-tier proprietary models maintain a clear lead. Gemini 3.1 Pro Preview and GPT-5.4 both score 57 on the AA Intelligence Index, 18 points above Gemma 4 31B. Claude Opus 4.6 (53) and Claude Sonnet 4.6 (52) also maintain significant advantages, particularly in complex agentic workflows and extended multi-step reasoning. The gap is real, but the trajectory shows it narrowing with each Gemma generation.
System Requirements & Where to Test
Cloud / Online Testing
| Platform | Models Available | Notes |
|---|---|---|
| Google AI Studio | 31B, 26B A4B | Free tier available; try at aistudio.google.com |
| Google AI Edge Gallery | E2B, E4B | Android app for on-device testing; Play Store |
| Google Cloud Run | 31B, 26B | Deploy with Blackwell / RTX PRO 6000 GPUs |
| GKE via vLLM | 31B, 26B | Production-grade Kubernetes deployment |
| Novita, Lightning AI, Parasail | 31B, 26B | Third-party API providers |
Local Hardware Targets
| Device | Model(s) | Notes |
|---|---|---|
| Android (via AICore Developer Preview) | E2B, E4B | On-device inference with GPU acceleration |
| Raspberry Pi 5 | E2B | Runs offline with audio support |
| Qualcomm Dragonwing IQ8 | E2B, E4B | NPU acceleration for ultra-efficient inference |
| NVIDIA Jetson Orin Nano | E2B, E4B | Edge AI compute with CUDA acceleration |
| Mac (Apple Silicon, Metal) | All sizes | Unified memory advantage; 31B runs on M3 Max+ |
| Consumer GPU (RTX 4090) | 26B, 31B (quantized) | Single-GPU workstation deployment |
| Multi-GPU Server | 31B (full precision) | Production serving with tensor parallelism |
Step-by-Step Setup Guide
Option 1: Ollama (Simplest, All Platforms)
Ollama provides the fastest path to running Gemma 4 locally. Available on macOS, Linux, and Windows.
Installation
macOS / Linux:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
```

Windows:

```powershell
# Download and install from the official website
# Visit https://ollama.com/download and run the Windows installer
# Alternatively, install via winget:
winget install Ollama.Ollama
```

Running Models
```bash
# Edge Models (lightweight, audio-capable)
ollama run gemma4:e2b   # 7.2GB download, 128K context
ollama run gemma4:e4b   # 9.6GB download, 128K context

# Workstation Models (frontier intelligence)
ollama run gemma4:26b   # 18GB download, 256K context, MoE (4B active)
ollama run gemma4:31b   # 20GB download, 256K context, Dense
```

API Usage (All Platforms)
```bash
# Start the Ollama server (runs automatically on install)
ollama serve

# Query via REST API
curl http://localhost:11434/api/chat \
  -d '{
    "model": "gemma4:31b",
    "messages": [{"role": "user", "content": "Explain quantum entanglement."}]
  }'
```

Python SDK
```python
from ollama import chat

response = chat(
    model='gemma4:31b',
    messages=[{'role': 'user', 'content': 'Write a Python function to find prime numbers using the Sieve of Eratosthenes.'}],
)
print(response.message.content)
```

JavaScript/TypeScript SDK
```javascript
import ollama from 'ollama';

const response = await ollama.chat({
  model: 'gemma4:31b',
  messages: [{ role: 'user', content: 'Explain the CAP theorem.' }],
});
console.log(response.message.content);
```

Option 2: LiteRT-LM (Edge Devices: CLI & Python)
LiteRT-LM is Google's production-ready inference framework for deploying LLMs on edge devices. It powers on-device GenAI in Chrome, Chromebook Plus, and Pixel Watch.
Installation
macOS / Linux / Windows (WSL) / Raspberry Pi:
```bash
# Install using uv (recommended)
uv tool install litert-lm

# Or install via pip
pip install litert-lm
```

CLI Usage
```bash
# Run Gemma 4 E2B on edge hardware
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  --prompt="What is the capital of France?"

# Run Gemma 4 E4B
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E4B-it-litert-lm \
  gemma-4-E4B-it.litertlm \
  --prompt="Summarize the key points of quantum computing."
```

Python Package
```python
from litert_lm import LiteRTModel

# Load and run Gemma 4 E2B
model = LiteRTModel.from_huggingface(
    "litert-community/gemma-4-E2B-it-litert-lm"
)
response = model.generate("Explain photosynthesis in simple terms.")
print(response)
```

Option 3: Hugging Face Transformers (Python)
The official method for loading Gemma 4 with full control over generation parameters.
Installation (All Platforms)
```bash
# Windows (PowerShell)
pip install -U transformers torch accelerate

# macOS (Apple Silicon; uses the MPS backend)
pip install -U transformers torch accelerate

# Linux (with CUDA)
pip install -U transformers torch accelerate
```

Basic Text Generation
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-31b-it"  # Options: gemma-4-E2B-it, gemma-4-E4B-it, gemma-4-26b-a4b-it, gemma-4-31b-it

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Prepare prompt with system role
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a Python function to implement binary search."},
]

# Process input
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Enable reasoning mode
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=1.0,
    top_p=0.95,
    top_k=64
)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse thinking and final answer
parsed = processor.parse_response(response)
print(parsed)
```

Image Understanding
```python
from PIL import Image
import requests

# Load an image
image_url = "https://example.com/chart.png"
image = Image.open(requests.get(image_url, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe what you see in this image."}
    ]}
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

Option 4: vLLM (Production Serving)
For production-grade serving with batching, tensor parallelism, and OpenAI-compatible API.
Installation
```bash
# Linux (recommended for production)
pip install vllm

# Start the server
python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-4-31b-it \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --port 8000
```

Query via OpenAI-Compatible API
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31b-it",
    "messages": [
      {"role": "system", "content": "You are a coding assistant."},
      {"role": "user", "content": "Implement a thread-safe LRU cache in Python."}
    ],
    "max_tokens": 2048,
    "temperature": 1.0,
    "top_p": 0.95
  }'
```

Python Client
```python
from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[
        {"role": "system", "content": "You are a senior software engineer."},
        {"role": "user", "content": "Design a rate limiter using the token bucket algorithm."}
    ],
    max_tokens=2048,
    temperature=1.0,
)
print(response.choices[0].message.content)
```

Option 5: GKE with vLLM (Enterprise Cloud)
```bash
# Deploy Gemma 4 31B on Google Kubernetes Engine
# Requires a GKE cluster with GPU node pool (e.g., NVIDIA Blackwell, RTX PRO 6000)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma4-31b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma4-31b
  template:
    metadata:
      labels:
        app: gemma4-31b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=google/gemma-4-31b-it
        - --tensor-parallel-size=2
        - --dtype=bfloat16
        - --max-model-len=65536
        resources:
          limits:
            nvidia.com/gpu: 2
        ports:
        - containerPort: 8000
EOF
```

Platform-Specific Notes
Windows
- Ollama is the recommended method. Install via `winget install Ollama.Ollama` or the official installer.
- For Hugging Face Transformers, ensure CUDA is properly installed if using an NVIDIA GPU. Run `nvidia-smi` to verify.
- vLLM has limited Windows support. Use WSL2 (Windows Subsystem for Linux) for production serving.
- LiteRT-LM works under WSL2. Native Windows support is not yet available.
macOS (Apple Silicon / Metal)
- Ollama uses Metal acceleration automatically on Apple Silicon Macs. No additional configuration needed.
- For Hugging Face Transformers, PyTorch uses the MPS (Metal Performance Shaders) backend automatically. Ensure `torch >= 2.0`.
- MLX (Apple's ML framework) provides an additional optimization layer:

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-4-31b-it-8bit")
output = generate(model, tokenizer, prompt="def fibonacci(n):", max_tokens=512)
print(output)
```

- Apple Silicon unified memory is a significant advantage: a Mac Studio M3 Ultra with 192GB can serve the 31B model at competitive speeds.
Linux
- All methods work natively. Linux is the recommended platform for production deployments.
- For GPU acceleration, install the NVIDIA Container Toolkit for Docker-based deployments.
- llama.cpp provides an additional lightweight option:
```bash
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make

# Download GGUF quantized weights from Hugging Face
./llama-server -m gemma-4-31b-it-Q4_K_M.gguf --ctx-size 32768 --port 8080
```
Model Download Sources
All Gemma 4 model weights are available from multiple official sources:
| Source | Link | Formats |
|---|---|---|
| Hugging Face | google/gemma-4 | Safetensors, GGUF |
| Ollama | ollama.com/library/gemma4 | GGUF (pre-quantized) |
| Kaggle | kaggle.com/models/google/gemma-4 | Safetensors |
| LM Studio | lmstudio.ai/models/gemma-4 | GGUF |
| Docker | hub.docker.com/r/ai/gemma4 | Container images |
Best Practices
Sampling Parameters
Use these standardized sampling configurations across all use cases for optimal performance:
```
temperature=1.0
top_p=0.95
top_k=64
```

Multi-Turn Conversations
In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins. This prevents context pollution from accumulated reasoning traces.
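A minimal sketch of that history hygiene, assuming thoughts are delimited by the channel tags shown earlier (the regex is illustrative; match it to the exact template your library emits):

```python
import re

# Illustrative: strip the thought block delimited by the channel tags
# described earlier, keeping only the final response for history.
THOUGHT_PATTERN = re.compile(r"<\|channel>thought\n.*?<channel\|>", re.DOTALL)

def clean_for_history(model_output: str) -> str:
    """Drop reasoning traces so they don't pollute the next turn's context."""
    return THOUGHT_PATTERN.sub("", model_output).strip()

history = []
raw = "<|channel>thought\nLet me check primes...<channel|>17 is prime."
history.append({"role": "assistant", "content": clean_for_history(raw)})
print(history)  # [{'role': 'assistant', 'content': '17 is prime.'}]
```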
Visual Token Budget
Gemma 4 supports variable image resolution through configurable visual token budgets. The supported budgets are: 70, 140, 280, 560, and 1120.
- Lower budgets (70โ140): Classification, captioning, video frame processing
- Higher budgets (560โ1120): OCR, document parsing, reading small text
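The sketch below illustrates choosing a budget per task, reusing the `processor`, `text`, and `image` objects from the Transformers examples above. Note that `visual_token_budget` is a hypothetical keyword argument; the released processor may expose this setting under a different name, so treat this strictly as a sketch of the trade-off.

```python
# Hypothetical parameter name: the real processor may expose the visual
# token budget differently; this only illustrates the trade-off.
def pick_budget(task: str) -> int:
    budgets = {"classification": 70, "captioning": 140, "video_frame": 70,
               "chart": 560, "ocr": 1120, "document": 1120}
    return budgets.get(task, 280)  # 280 as a middle-ground default

inputs = processor(
    text=text,
    images=[image],
    visual_token_budget=pick_budget("ocr"),  # hypothetical kwarg
    return_tensors="pt",
)
```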
Audio Processing (E2B/E4B)
For ASR:

```
Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.
Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven.
```

For speech translation:

```
Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate
it into {TARGET_LANGUAGE}. When formatting the answer, first output the
transcription in {SOURCE_LANGUAGE}, then one newline, then output the string
'{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
```

Safety & Security
Gemma 4 models undergo the same rigorous infrastructure security protocols as Google's proprietary Gemini models. Key safety measures include:
- CSAM Filtering: Rigorous Child Sexual Abuse Material filtering at multiple stages of data preparation.
- Sensitive Data Filtering: Automated techniques to filter personal information and sensitive data from training sets.
- Content Safety Evaluations: Partnership with internal safety and responsible AI teams at Google DeepMind.
- Significant safety improvements: All Gemma 4 models significantly outperform Gemma 3 and 3n models in content safety while keeping unjustified refusals low.
For enterprises and sovereign organizations, the Apache 2.0 license combined with Google's security protocols provides a trusted, transparent foundation that delivers state-of-the-art capabilities while meeting the highest standards for security and reliability.
Who Should Use Gemma 4
| Use Case | Recommended Model | Why |
|---|---|---|
| Mobile / IoT apps | E2B | Under 3GB in 4-bit quant; audio + vision |
| On-device voice assistants | E2B / E4B | Native ASR; offline-capable |
| Low-latency API serving | 26B A4B | 4B active params = fast tokens/sec |
| Coding assistants | 31B | Highest reasoning; leads on SciCode |
| Enterprise RAG pipelines | 31B | 256K context; strong factual grounding |
| Multilingual applications | Any size | 140+ language training; 35+ production languages |
| Edge AI research | E2B / E4B | Apache 2.0; run on Raspberry Pi, Jetson |
| Air-gapped / sovereign deployments | 31B | No licensing callbacks; full Apache 2.0 |
| Cost-sensitive production | 26B A4B | MoE efficiency; ~4B inference cost |
Key Links & Resources
- Official Product Page: deepmind.google/models/gemma/gemma-4
- Model Card: ai.google.dev/gemma/docs/core/model_card_4
- Documentation: ai.google.dev/gemma/docs
- Google AI Studio: aistudio.google.com
- Hugging Face Collection: huggingface.co/collections/google/gemma-4
- Ollama Library: ollama.com/library/gemma4
- LiteRT-LM (Edge): github.com/google-ai-edge/LiteRT-LM
- Artificial Analysis Benchmarks: artificialanalysis.ai/models/gemma-4-31b
- License (Apache 2.0): ai.google.dev/gemma/apache_2
- Prompt Formatting Guide: ai.google.dev/gemma/docs/core/prompt-formatting-gemma4
- Function Calling Guide: ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
Conclusion
Gemma 4 is not just an iterative upgrade; it is a generational leap that fundamentally changes the open-weights landscape. The combination of Apache 2.0 licensing, multimodal reasoning across text/image/video/audio, configurable thinking modes, native agentic capabilities, and deployment flexibility from sub-3GB edge devices to multi-GPU servers makes it the most versatile open model family available today.
The Intelligence Index numbers speak clearly: +29 points over Gemma 3 for the flagship model, with 2.5× better token efficiency than the nearest open-weights competitor at a comparable intelligence level. For developers, researchers, and enterprises seeking maximum capability per parameter, with complete licensing freedom, Gemma 4 is the new baseline.
The model weights are available now on Hugging Face, Ollama, Kaggle, LM Studio, and Docker. The Apache 2.0 license means you can start building today with zero commercial restrictions.