Gemma 4: The Ultimate Guide to Google DeepMind's Most Capable Open-Weights Model
Published on April 4, 2026
Built from Gemini 3 research. Apache 2.0 licensed. Four model sizes from edge devices to workstations. Native multimodal reasoning across text, image, video, and audio.
Introduction: Google's Open-Weights Paradigm Shift
On April 2, 2026, Google DeepMind released Gemma 4, a family of four open-weights models that represents the most significant leap in the Gemma lineage to date. Built using the same underlying research that powers the proprietary Gemini 3 model family, Gemma 4 delivers frontier-level intelligence across a range of deployment targets, from sub-3GB edge devices to multi-GPU workstations.
Two things make Gemma 4 a watershed moment for the open-source AI ecosystem:
Generational Intelligence Jump: Gemma 4 31B scores 39 on the Artificial Analysis Intelligence Index, a jump of +29 points over Gemma 3 27B Instruct (which scores 10). The smaller Gemma 4 E4B gains +13 points over its Gemma 3n predecessor, and E2B gains +10 points. These are not incremental improvements; they represent a fundamental architectural and training overhaul.
Apache 2.0 Licensing: Previous Gemma models shipped under Google's restrictive "Gemma Terms of Use" license. Gemma 4 moves to the commercially permissive Apache 2.0 license, granting developers complete flexibility: modify, redistribute, fine-tune, commercialize, and deploy without usage restrictions. This provides full digital sovereignty: enterprises and sovereign nations can deploy Gemma 4 in air-gapped environments with no licensing callbacks, no usage telemetry, and no commercial gatekeeping.
The result is a model family that competes directly with proprietary offerings from OpenAI, Anthropic, and other closed-source providers, while being entirely self-hostable and commercially unrestricted.
Model Family Overview: Four Sizes, One Architecture
Gemma 4 ships in four distinct sizes, each targeting a specific deployment envelope:
| Model | Total Parameters | Active Parameters | Architecture | Context Window | Modalities | License |
|---|---|---|---|---|---|---|
| Gemma 4 E2B | 5.1B | 2.3B | Dense (PLE) | 128K tokens | Text, Image, Video, Audio | Apache 2.0 |
| Gemma 4 E4B | 8B | ~4B effective | Dense (PLE) | 128K tokens | Text, Image, Video, Audio | Apache 2.0 |
| Gemma 4 26B A4B | 26–27B | 4B (MoE routing) | Mixture-of-Experts | 256K tokens | Text, Image, Video | Apache 2.0 |
| Gemma 4 31B | 31B | 31B | Dense | 256K tokens | Text, Image, Video | Apache 2.0 |
Edge Models: E2B and E4B
The "E" in E2B and E4B stands for "effective" parameters. These models incorporate Per-Layer Embeddings (PLE), a technique where each decoder layer maintains its own small embedding table for every token. These embedding tables are large in raw storage but serve only as fast lookups, so the effective computational footprint is dramatically smaller than the total parameter count.
- E2B (5.1B total / 2.3B active): Designed for on-device deployment with minimal RAM. In 4-bit quantization, model weights fit in under 3GB of RAM, making it suitable for background tasks, basic function calling, and multimodal understanding on smartphones, Raspberry Pi, and IoT hardware.
- E4B (8B total): A step up in capability while remaining edge-friendly. Both E2B and E4B uniquely support native audio input for automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.
Both edge models run completely offline with near-zero latency on devices like phones, Raspberry Pi 5, and NVIDIA Jetson Orin Nano.
Workstation Models: 26B A4B and 31B
- 26B A4B (MoE, 4B active): The "A" stands for "active parameters." This Mixture-of-Experts model contains 26–27B total parameters but activates only a 4B subset during each inference pass via a routing mechanism. The result: it runs nearly as fast as a 4B-parameter model while retaining the reasoning depth of a much larger model. Ideal for applications where tokens-per-second latency is critical.
- 31B Dense: The flagship model, where all 31B parameters activate during inference. Delivers the highest raw intelligence in the family, scoring 39 on the Artificial Analysis Intelligence Index. Optimized for consumer GPUs: a single RTX 4090 or Apple Silicon Mac can serve this model locally, giving students, researchers, and developers the ability to turn workstations into local-first AI servers.
Architecture Deep Dive
Hybrid Attention Mechanism
All Gemma 4 models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks.
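As an illustration of that interleaving, the sketch below builds a layer schedule in Python. The 5:1 local-to-global ratio and 12-layer depth are assumptions for illustration only; the sources cited here do not specify the exact ratio Gemma 4 uses.

```python
def attention_schedule(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Build an illustrative hybrid attention layout.

    Every (local_per_global + 1)-th layer uses full global attention;
    the rest use local sliding-window attention. The final layer is
    forced to be global, matching the design described above.
    """
    schedule = [
        "global" if (i + 1) % (local_per_global + 1) == 0 else "local"
        for i in range(num_layers)
    ]
    schedule[-1] = "global"  # the last layer is always global
    return schedule

print(attention_schedule(12))
# ['local', 'local', 'local', 'local', 'local', 'global', ...]
```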
Memory Optimization for Long Contexts
For global attention layers, Gemma 4 uses two critical innovations:
- Unified Keys and Values: Reduces the KV-cache memory footprint by sharing key-value projections across attention heads, allowing significantly longer context processing within the same memory budget.
- Proportional RoPE (p-RoPE): A modified Rotary Position Encoding scheme that scales proportionally with sequence length, enabling stable attention patterns across the full 128Kโ256K context window without the degradation typically seen at extreme sequence lengths.
Per-Layer Embeddings (PLE) for Edge Models
Rather than adding more layers or parameters, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count (and therefore inference cost) is much smaller than the total. This technique maximizes parameter efficiency specifically for on-device deployments where every megabyte of RAM counts.
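The sketch below illustrates the per-layer lookup idea in PyTorch. It is a simplified toy, not the actual Gemma 4 implementation; the layer structure, dimensions, and the way the lookup joins the residual stream are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PLEDecoderLayer(nn.Module):
    """Illustrative decoder layer with its own small per-token embedding."""

    def __init__(self, hidden: int, vocab: int, ple_dim: int = 64):
        super().__init__()
        # Small per-layer embedding table: large in aggregate storage
        # across all layers, but used only as a cheap lookup at inference.
        self.ple = nn.Embedding(vocab, ple_dim)
        self.proj = nn.Linear(ple_dim, hidden, bias=False)
        self.mlp = nn.Linear(hidden, hidden)

    def forward(self, hidden_states: torch.Tensor, token_ids: torch.Tensor):
        # Look up the per-layer embedding, project it, and fold it into
        # the residual stream before the layer's main computation.
        per_layer = self.proj(self.ple(token_ids))
        return self.mlp(hidden_states + per_layer)
```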
Mixture-of-Experts (MoE) for the 26B A4B
The 26B A4B model uses sparse expert routing: for each token, a routing mechanism selects which subset of the total 26B parameters to activate. Only ~4B parameters fire per token, producing inference speeds comparable to a 4B dense model while maintaining the representational capacity of a much larger network. This makes it an excellent choice for fast, cost-efficient inference compared to the dense 31B model.
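The routing step can be sketched as a standard top-k gate. Expert count and k below are illustrative placeholders, since the figures cited here only pin down total (26–27B) and active (~4B) parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Illustrative sparse MoE gate: pick k experts per token."""

    def __init__(self, hidden: int, num_experts: int = 32, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, hidden) -> routing weights over experts
        logits = self.gate(x)
        weights, indices = torch.topk(logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        # Only the selected experts run, so compute scales with k,
        # not with the total expert count.
        return weights, indices
```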
Configurable Thinking Modes
All Gemma 4 models are designed as highly capable reasoners with configurable thinking modes:
- Thinking Enabled: The model outputs internal reasoning within structured `<|channel>thought\n...<channel|>` tags before providing its final answer. This step-by-step reasoning improves accuracy on multi-step tasks, mathematical problems, and complex coding challenges.
- Thinking Disabled: For the 26B and 31B models, disabling thinking still generates the channel tags but with an empty thought block, and the model proceeds directly to the final answer. For E2B and E4B, the tags are omitted entirely when thinking is disabled.
- Control Token: Thinking is enabled by including the `<|think|>` token at the start of the system prompt. Libraries like Hugging Face Transformers and Ollama handle this automatically via `enable_thinking=True`, as sketched below.
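As a minimal sketch of toggling the two modes, the snippet below builds prompts with and without thinking via the Transformers chat template. It assumes the `google/gemma-4-31b-it` checkpoint and the `enable_thinking` flag described above; the full generation pipeline appears in the setup guide later in this article.

```python
from transformers import AutoTokenizer

# Assumes the checkpoint name and enable_thinking flag described above.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31b-it")

messages = [{"role": "user", "content": "What is 17 * 23?"}]

# Thinking enabled: the template prepends the <|think|> control token.
with_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Thinking disabled: no control token; the model answers directly.
without_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(with_thinking)
print(without_thinking)
```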
Multimodal Capabilities
Gemma 4 is natively multimodal across all four model sizes. Every model in the family processes text, images, and video at variable aspect ratios and resolutions.
Image Understanding
All models support:
- Object detection and localization
- Document and PDF parsing
- Screen and UI understanding
- Chart comprehension
- OCR (including multilingual)
- Handwriting recognition
- Pointing and spatial reasoning
- Variable image resolution via configurable visual token budgets (70, 140, 280, 560, 1120 tokens per image)
Use lower token budgets for classification, captioning, or video frame processing. Use higher budgets for OCR, document parsing, or reading small text.
Video Understanding
All models analyze video by processing sequences of frames. Video input supports a maximum of 60 seconds, assuming frames are sampled at one frame per second.
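In practice, the client samples the frames itself and submits them as images. Here is a minimal sketch using OpenCV (the `cv2` tooling is the author's assumption, not part of Gemma 4); the 60-frame cap mirrors the 60-second limit above.

```python
import cv2  # pip install opencv-python

def sample_frames(path: str, fps: int = 1, max_seconds: int = 60) -> list:
    """Sample one frame per second from a video file, up to 60 frames."""
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / fps))
    frames, i = [], 0
    while len(frames) < max_seconds * fps:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            # OpenCV decodes to BGR; convert to RGB for model input.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        i += 1
    cap.release()
    return frames
```

The returned frames can then be passed as the images argument shown in the Transformers examples later in this guide.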
Audio (E2B and E4B Only)
The edge models feature native audio input supporting:
- Automatic Speech Recognition (ASR): Transcribe speech segments in any supported language with digit-first number formatting.
- Automatic Speech Translation (AST): Transcribe in the source language, then translate to a target language, all within a single inference call (see the sketch after this list).
- Audio supports a maximum length of 30 seconds.
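Below is a minimal ASR sketch using the Transformers pattern from the setup guide later in this article. The audio content schema (`{"type": "audio", ...}`) is an assumption by analogy with the image API; check the released processor documentation for the exact format.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Assumption: audio is passed as a chat-template content part, analogous
# to images; the exact schema may differ in the released processor.
MODEL_ID = "google/gemma-4-E4B-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "meeting_clip.wav"},  # max 30 seconds
        {"type": "text", "text": (
            "Transcribe the following speech segment in English into "
            "English text."
        )},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```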
Interleaved Multimodal Input
All models support freely mixing text and images in any order within a single prompt. For optimal performance, place image and/or audio content before the text in your prompt.
Language Support
Gemma 4 is natively trained on data covering over 140 languages and provides out-of-the-box support for 35+ languages at production quality. The multilingual capability extends beyond translation: the models understand cultural context, idiomatic expressions, and language-specific code commenting conventions.
Agentic Workflows: Beyond Chatbots
Gemma 4 is designed explicitly for autonomous agent use cases, a fundamental shift from the instruct-only Gemma 3 models.
Native Function Calling
All models support structured tool use out of the box. Developers can define function schemas and Gemma 4 will generate valid function calls with proper argument formatting. This enables agentic workflows where the model can (see the sketch after this list):
- Query external APIs
- Interact with databases
- Execute code
- Navigate web interfaces
- Orchestrate multi-step processes
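As a concrete sketch of the tool-use flow, the example below uses the Ollama Python SDK, which can derive a function schema from a plain Python function. The `gemma4:31b` tag matches the Ollama setup section later in this article; `get_weather` is a hypothetical stub.

```python
import ollama

def get_weather(city: str) -> str:
    """Hypothetical stub; a real tool would call a weather API."""
    return f"Sunny, 22 degrees C in {city}"

response = ollama.chat(
    model="gemma4:31b",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[get_weather],  # Ollama derives the JSON schema from the signature
)

# The model replies with structured tool calls instead of prose.
for call in response.message.tool_calls or []:
    result = get_weather(**call.function.arguments)
    print(call.function.name, "->", result)
```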
Structured JSON Output
Models reliably produce machine-parseable JSON output on demand, critical for integration into automated pipelines, data extraction workflows, and API-first architectures.
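For example, here is a minimal sketch with the Ollama Python SDK's JSON mode; `format="json"` constrains decoding to valid JSON, and the model tag again follows the setup guide below.

```python
import json
import ollama

response = ollama.chat(
    model="gemma4:31b",
    messages=[{
        "role": "user",
        "content": "Extract the name and year from 'Ada Lovelace, born 1815' "
                   "as JSON with keys 'name' and 'year'.",
    }],
    format="json",  # constrain decoding to valid JSON
)

record = json.loads(response.message.content)  # guaranteed parseable
print(record["name"], record["year"])
```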
Native System Instructions
Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations. Unlike Gemma 3, which required workarounds, Gemma 4 uses standard system, assistant, and user roles natively.
Offline Code Generation
All models, including the sub-3GB E2B, can generate, complete, and correct code entirely offline. This enables:
- IDE-integrated code completion without cloud connectivity
- Air-gapped development environments
- Privacy-sensitive code generation on proprietary codebases
Multi-Step Planning
The configurable thinking modes allow Gemma 4 to plan multi-step processes, maintain reasoning consistency across dozens of actions, and self-correct when tool calls return unexpected results. The 26B and 31B models are particularly capable in long-horizon agentic scenarios.
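To illustrate the loop structure, here is a compact, hypothetical agent sketch built on the Ollama Python SDK; `get_time` is a stub, and a production agent would add error handling and richer tools.

```python
import ollama

def get_time(timezone: str) -> str:
    """Stub tool for illustration; a real tool would query a clock API."""
    return f"12:00 in {timezone}"

TOOLS = {"get_time": get_time}

def run_agent(prompt: str, model: str = "gemma4:31b", max_steps: int = 8) -> str:
    """Loop until the model stops requesting tools or the budget runs out."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        response = ollama.chat(model=model, messages=messages,
                               tools=list(TOOLS.values()))
        messages.append(response.message)
        if not response.message.tool_calls:
            return response.message.content  # final text answer
        for call in response.message.tool_calls:
            result = TOOLS[call.function.name](**call.function.arguments)
            # Feed the tool result back so the model can self-correct.
            messages.append({"role": "tool", "content": str(result)})
    return "Step budget exhausted."

print(run_agent("What time is it in UTC?"))
```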
Benchmarks & Competitive Analysis
Gemma 4โs benchmark performance has been validated across multiple independent sources: the official Google DeepMind model card, Hugging Face model pages, NVIDIA technical reports, the Artificial Analysis Intelligence Index, and the LMSYS Chatbot Arena. Every number in this section has been cross-referenced against at least three independent sources to ensure accuracy.
1. Official Google Model Card Benchmarks (Instruction-Tuned, Reasoning Variants)
These are the official benchmark results published in the Gemma 4 model card, cross-verified on the Hugging Face model page and Ollama library. All scores reflect instruction-tuned models with thinking enabled.
| Benchmark | 31B Dense | 26B A4B MoE | E4B | E2B | Gemma 3 27B |
|---|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 (Math, no tools) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 (Coding) | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 | 110 |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| τ²-Bench (Agentic, avg. over 3) | 76.9% | 68.2% | 42.2% | 24.5% | 16.2% |
| Humanity's Last Exam (no tools) | 19.5% | 8.7% | – | – | – |
| Humanity's Last Exam (with search) | 26.5% | 17.2% | – | – | – |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% | 19.3% |
| MMMLU (Multilingual) | 88.4% | 86.3% | 76.6% | 67.4% | 70.7% |
Key Takeaways from Official Benchmarks:
- Math (AIME 2026): The 31B scores 89.2%, a staggering +68.4 points over Gemma 3 27B (20.8%). Even the 26B MoE model achieves 88.3%, showing that the MoE variant retains near-identical reasoning depth despite activating only 4B parameters.
- Coding (LiveCodeBench v6): The 31B scores 80.0%, nearly +51 points over Gemma 3 27B (29.1%). The Codeforces ELO of 2150 places it in "Master" territory for competitive programming.
- Reasoning (GPQA Diamond): 84.3% on graduate-level science questions, double the Gemma 3 27B score (42.4%). This benchmark contains PhD-level physics, chemistry, and biology questions.
- Agentic Tasks (τ²-Bench): 76.9% average on realistic agentic tasks (telecom customer service workflows), nearly 5× the Gemma 3 27B score (16.2%).
- Multilingual (MMMLU): 88.4% on the massively multilingual version of MMLU, up from 70.7% for Gemma 3 27B. Notably, E4B (76.6%) outperforms Gemma 3 27B despite being a fraction of the size.
- Edge Models Punch Above Weight: E4B (effective 4B) scores 69.4% on MMLU Pro, surpassing Gemma 3 27B's 67.6%, while running on mobile hardware. E2B matches or exceeds Gemma 3 27B on AIME 2026, LiveCodeBench, and GPQA Diamond.
2. Vision & Multimodal Benchmarks
All Gemma 4 models support native image and video understanding with variable aspect ratios and resolutions. The vision benchmark results from the official model card are:
| Benchmark | 31B | 26B A4B | E4B | E2B | Gemma 3 27B |
|---|---|---|---|---|---|
| MMMU Pro (Visual reasoning) | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| MATH-Vision (Math + visual) | 85.6% | 82.4% | 59.5% | 52.4% | 46.0% |
| OmniDocBench (edit distance ↓) | 0.131 | 0.149 | 0.181 | 0.290 | 0.365 |
Key Vision Insights:
- MMMU Pro: The 31B scores 76.9% on multi-discipline multimodal understanding, a +27.2 point improvement over Gemma 3 27B (49.7%). This benchmark tests reasoning over complex charts, diagrams, circuit schematics, and scientific figures.
- MATH-Vision: 85.6% on mathematical problems that require visual understanding, nearly double Gemma 3 27B's score (46.0%). Tasks include reading graphs, interpreting geometric diagrams, and solving problems from visual data.
- OmniDocBench: An edit distance of 0.131 (lower is better) represents a 64% improvement over Gemma 3 27B (0.365) in document understanding. This measures accuracy in extracting and processing structured information from PDFs, invoices, and complex document layouts.
3. Audio Benchmarks (E2B & E4B Only)
The edge models include native audio input capabilities. Audio benchmarks from the official model card:
| Benchmark | E4B | E2B | Description |
|---|---|---|---|
| CoVoST (ASR + Translation) | 35.54 | 33.47 | Cross-lingual speech-to-text translation quality (higher is better) |
| FLEURS WER (lower is better) | 0.08 | 0.09 | Word Error Rate across multilingual speech recognition |
The E4B achieves a CoVoST score of 35.54 and a FLEURS Word Error Rate of just 8%: production-quality speech recognition running entirely on-device with no cloud dependency. Audio input is capped at 30 seconds.
4. Long-Context Benchmarks (MRCR v2, 8-Needle, 128K)
The Multi-Round Coreference Resolution (MRCR) v2 benchmark measures how well models maintain attention and reasoning across extremely long inputs. The 8-needle variant tests retrieval of eight facts scattered across a 128K-token context window:
| Model | MRCR v2 (8-needle 128K) |
|---|---|
| Gemma 4 31B | 66.4% |
| Gemma 4 26B A4B | 44.1% |
| Gemma 4 E4B | 25.4% |
| Gemma 4 E2B | 19.1% |
| Gemma 3 27B | 13.5% |
The 31B model scores 66.4%, nearly 5× the Gemma 3 27B score (13.5%), confirming that the long context window is not just a theoretical claim: the model can effectively reason over vast contexts. Even the edge E2B model (19.1%) outperforms Gemma 3 27B (13.5%) at long-context retrieval despite being dramatically smaller.
5. Artificial Analysis Intelligence Index
The Artificial Analysis Intelligence Index v4.0 is an independent, composite benchmark incorporating 10 evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPt.
| Model | Intelligence Index | Type | Active Params | Output Tokens Used |
|---|---|---|---|---|
| Gemini 3.1 Pro Preview | 57 | Closed | – | – |
| GPT-5.4 (xhigh) | 57 | Closed | – | – |
| GPT-5.3 Codex (xhigh) | 54 | Closed | – | – |
| Claude Opus 4.6 (Max Effort) | 53 | Closed | – | – |
| Claude Sonnet 4.6 (Max Effort) | 52 | Closed | – | – |
| GLM-5 (Reasoning) | 50 | Open | – | – |
| Kimi K2.5 (Reasoning) | 47 | Open | – | – |
| Qwen3.5 397B A17B (Reasoning) | 45 | Open | ~17B | – |
| Qwen3.5 27B (Reasoning) | 42 | Open | 27B | 98M |
| GLM-4.7 (Reasoning) | 42 | Open | – | 167M |
| MiniMax-M2.5 | 42 | Open | – | 56M |
| DeepSeek V3.2 (Reasoning) | 42 | Open | – | 61M |
| Gemma 4 31B (Reasoning) | 39 | Open | 31B | 39M |
| Qwen3.5 35B A3B (Reasoning) | 37 | Open | ~3B | – |
| Gemma 4 26B A4B (Reasoning) | 30 | Open | 4B | 22M |
| Gemma 4 E4B | 19 | Open | ~4B | 20M |
| Gemma 4 E2B | 15 | Open | 2.3B | 20M |
| Gemma 3 27B Instruct | 10 | Open | 27B | – |
| Gemma 3n E4B Instruct | 6 | Open | – | – |
| Gemma 3n E2B Instruct | 5 | Open | – | – |
Token Efficiency: Gemma 4 31B used just 39M output tokens to run the full Intelligence Index, making it the most token-efficient model at its intelligence tier. Qwen3.5 27B used 98M tokens (2.5× more), GLM-4.7 used 167M (4.3× more), and MiniMax-M2.5 used 56M (1.4× more). In production, this translates directly to lower inference costs and faster response times.
AA Sub-Benchmark Breakdown for Gemma 4 31B (Reasoning) vs Qwen3.5 27B (Reasoning):
| AA Sub-Benchmark | Gemma 4 31B | Qwen3.5 27B | Winner |
|---|---|---|---|
| SciCode | 43% | 40% | Gemma 4 |
| Terminal-Bench Hard | 36% | 33% | Gemma 4 |
| GPQA Diamond | 86% | 86% | Tied |
| IFBench | 76% | 76% | Tied |
| HLE (Humanity's Last Exam) | 23% | 22% | Gemma 4 |
| Agentic Index | 32 | 44 | Qwen3.5 |
The primary gap between Gemma 4 31B (39) and Qwen3.5 27B (42) is in agentic performance, where Qwen's Agentic Index (44) significantly exceeds Gemma 4's (32). On non-agentic evaluations, particularly scientific computing (SciCode) and terminal-based tasks (Terminal-Bench Hard), Gemma 4 consistently leads.
AA-Omniscience (Factual Accuracy / Hallucination Control):
An interesting finding: the smaller Gemma 4 models perform better on factual accuracy than their larger siblings:
| Model | AA-Omniscience Score |
|---|---|
| Gemma 4 E4B | -20 |
| DeepSeek V3.2 (Reasoning) | -21 |
| Gemma 4 E2B | -24 |
| Qwen3.5 27B | -42 |
| Gemma 4 31B | -45 |
| Gemma 4 26B A4B | -48 |
E4B (-20) posts the best score in this list, narrowly ahead of DeepSeek V3.2 (-21), and both edge models score far better than Qwen3.5 27B (-42) and the larger Gemma 4 variants on hallucination control. This suggests the PLE architecture and focused edge training produce models that are more factually grounded per parameter.
6. LMSYS Chatbot Arena / Arena.ai (Human Preference Rankings)
The LMSYS Chatbot Arena measures real-world user preference through blind A/B testing of model responses. As of April 2026, Gemma 4's rankings among open-weights models on the Text leaderboard:
- Gemma 4 31B: #3 open-weights model overall, beating Gemini 2.5 Pro and Qwen3.5-397B in direct human votes
- Gemma 4 26B A4B: #6 open-weights model
Qualitative consensus from the community: Gemma 4 31B is described as "close to Claude 4.5 Sonnet level in many blind tests" while being fully open-weights and Apache 2.0 licensed. This is a remarkable achievement for a 31B-parameter model that can run on a single consumer GPU.
7. Generational Leap: Gemma 3 → Gemma 4
The improvement from Gemma 3 to Gemma 4 is the largest single-generation jump in the Gemma family's history:
| Metric | Gemma 3 27B | Gemma 4 31B | Improvement |
|---|---|---|---|
| AA Intelligence Index | 10 | 39 | +29 points (3.9×) |
| MMLU Pro | 67.6% | 85.2% | +17.6 points |
| AIME 2026 | 20.8% | 89.2% | +68.4 points (4.3×) |
| LiveCodeBench v6 | 29.1% | 80.0% | +50.9 points (2.7×) |
| Codeforces ELO | 110 | 2150 | +2040 points (19.5×) |
| GPQA Diamond | 42.4% | 84.3% | +41.9 points (2.0×) |
| τ²-Bench (Agentic) | 16.2% | 76.9% | +60.7 points (4.7×) |
| BigBench Extra Hard | 19.3% | 74.4% | +55.1 points (3.9×) |
| MMMLU | 70.7% | 88.4% | +17.7 points |
| MMMU Pro (Vision) | 49.7% | 76.9% | +27.2 points |
| MATH-Vision | 46.0% | 85.6% | +39.6 points (1.9×) |
| MRCR v2 128K | 13.5% | 66.4% | +52.9 points (4.9×) |
| Context Window | 128K | 256K | 2× larger |
Edge model improvements are equally dramatic:
| Metric | Gemma 3n E4B | Gemma 4 E4B | Improvement |
|---|---|---|---|
| AA Intelligence Index | 6 | 19 | +13 points (3.2×) |
| MMLU Pro | – | 69.4% | Exceeds Gemma 3 27B |
| AIME 2026 | – | 42.5% | 2× Gemma 3 27B |
| Context Window | 32K | 128K | 4× larger |
| Metric | Gemma 3n E2B | Gemma 4 E2B | Improvement |
|---|---|---|---|
| AA Intelligence Index | 5 | 15 | +10 points (3.0×) |
| AIME 2026 | – | 37.5% | 1.8× Gemma 3 27B |
| Context Window | 32K | 128K | 4× larger |
8. Head-to-Head vs. Open-Weights Competitors
Gemma 4 31B vs. Qwen3.5 27B (Reasoning)
| Dimension | Gemma 4 31B | Qwen3.5 27B | Advantage |
|---|---|---|---|
| AA Intelligence Index | 39 | 42 | Qwen3.5 (+3) |
| MMLU Pro | 85.2% | 86.1% | Qwen3.5 (+0.9) |
| GPQA Diamond | 84.3% | 85.5% | Qwen3.5 (+1.2) |
| LiveCodeBench v6 | 80.0% | 80.7% | Qwen3.5 (+0.7) |
| SciCode | 43% | 40% | Gemma 4 (+3) |
| Terminal-Bench Hard | 36% | 33% | Gemma 4 (+3) |
| Output Tokens (AA eval) | 39M | 98M | Gemma 4 (2.5× more efficient) |
| Agentic Index | 32 | 44 | Qwen3.5 (+12) |
| Multimodality | Text + Image + Video | Text + Image + Video | Tied |
| Audio | E2B/E4B only | No | Gemma 4 |
| Context Window | 256K | 262K (extendable to 1M) | Qwen3.5 |
| License | Apache 2.0 | Apache 2.0 | Tied |
Verdict: Qwen3.5 27B leads overall (+3 on the Intelligence Index), primarily due to stronger agentic capabilities. However, Gemma 4 31B is 2.5× more token-efficient, leads on scientific computing (SciCode) and terminal tasks (Terminal-Bench Hard), and offers native audio support on its edge variants. For cost-sensitive production workloads, Gemma 4's token efficiency translates to significant savings.
Gemma 4 31B vs. Llama 4 Maverick (400B total / 17B active MoE)
| Dimension | Gemma 4 31B | Llama 4 Maverick |
|---|---|---|
| Total Params | 31B | 400B |
| Active Params | 31B (Dense) | 17B (MoE) |
| MMLU Pro | 85.2% | 80.5% |
| GPQA Diamond | 84.3% | 69.8% |
| Context Window | 256K | 1M |
| Multimodality | Text + Image + Video + Audio* | Text + Image |
| License | Apache 2.0 | Llama License |
| Architecture | Dense | MoE |
Verdict: Despite being 13× smaller in total parameters, Gemma 4 31B significantly outperforms Llama 4 Maverick on MMLU Pro (+4.7 points) and GPQA Diamond (+14.5 points). Maverick's advantage is its 1M context window, but Gemma 4 is dramatically more efficient to deploy and cheaper to run. (*Audio on Gemma 4 E2B/E4B edge variants.)
Gemma 4 31B vs. DeepSeek V3.2 / GLM-4.7
Both DeepSeek V3.2 and GLM-4.7 score 42 on the AA Intelligence Index, 3 points above Gemma 4 31B's score of 39. However, Gemma 4 31B distinguishes itself in several critical dimensions:
- Token Efficiency: 39M tokens vs. 61M (DeepSeek) and 167M (GLM-4.7), making it 1.6×–4.3× more efficient.
- Native Multimodality: Gemma 4 processes image, video, and audio natively. DeepSeek V3.2 and GLM-4.7 have more limited multimodal support.
- Edge Deployment: The Gemma 4 family includes E2B and E4B edge models with audio support; no comparable offering exists from DeepSeek or GLM.
- License: Apache 2.0 (Gemma 4) vs. model-specific licenses with varying commercial restrictions.
Gemma 4 26B A4B vs. MoE Peers (~3-4B Active Parameters)
| Model | Intelligence Index | Active Params | Total Params |
|---|---|---|---|
| Qwen3.5 35B A3B (Reasoning) | 37 | ~3B | 35B |
| Gemma 4 26B A4B (Reasoning) | 30 | 4B | 26B |
| GLM-4.7-Flash (Reasoning) | 30 | ~3B | โ |
Qwen3.5 35B A3B leads this tier by 7 points, primarily due to stronger agentic capabilities (Agentic Index 44 vs 32 for Gemma 4 26B A4B). However, Gemma 4 26B A4B matches GLM-4.7-Flash while offering superior multimodal capabilities and Apache 2.0 licensing.
9. Head-to-Head vs. Closed-Source Competitors
Gemma 4 31B vs. GPT-5.3 Codex (xhigh)
The Artificial Analysis comparison reveals the trade-offs between open and closed models:
| Dimension | Gemma 4 31B | GPT-5.3 Codex (xhigh) |
|---|---|---|
| AA Intelligence Index | 39 | 54 |
| License | Apache 2.0 | Proprietary |
| Self-Hostable | Yes | No |
| API Cost | Free (self-hosted) | Premium |
| Customizable | Full fine-tuning | Limited (API only) |
| Data Sovereignty | Complete | None |
| Multimodality | Text + Image + Video + Audio* | Text + Code |
While GPT-5.3 Codex leads by 15 points on intelligence, Gemma 4 31B fills a fundamentally different niche: maximum intelligence per dollar for self-hosted deployments with complete data sovereignty.
Gemma 4 31B vs. GPT-5-mini
| Dimension | Gemma 4 31B | GPT-5-mini |
|---|---|---|
| MMLU Pro | 85.2% | 83.7% |
| GPQA Diamond | 84.3% | 82.8% |
| LiveCodeBench v6 | 80.0% | 80.5% |
| License | Apache 2.0 | Proprietary |
| Self-Hostable | Yes | No |
Gemma 4 31B outperforms GPT-5-mini on MMLU Pro (+1.5 points) and GPQA Diamond (+1.5 points), marking the first time an open-weights model has beaten a GPT-5 family member on major academic benchmarks while being fully self-hostable.
Where Closed-Source Still Leads
The top-tier proprietary models maintain a clear lead. Gemini 3.1 Pro Preview and GPT-5.4 both score 57 on the AA Intelligence Index, 18 points above Gemma 4 31B. Claude Opus 4.6 (53) and Claude Sonnet 4.6 (52) also maintain significant advantages, particularly in complex agentic workflows and extended multi-step reasoning. The gap is real, but the trajectory shows it narrowing with each Gemma generation.
System Requirements & Where to Test
Cloud / Online Testing
| Platform | Models Available | Notes |
|---|---|---|
| Google AI Studio | 31B, 26B A4B | Free tier available; try at aistudio.google.com |
| Google AI Edge Gallery | E2B, E4B | Android app for on-device testing; Play Store |
| Google Cloud Run | 31B, 26B | Deploy with Blackwell / RTX PRO 6000 GPUs |
| GKE via vLLM | 31B, 26B | Production-grade Kubernetes deployment |
| Novita, Lightning AI, Parasail | 31B, 26B | Third-party API providers |
Local Hardware Targets
| Device | Model(s) | Notes |
|---|---|---|
| Android (via AICore Developer Preview) | E2B, E4B | On-device inference with GPU acceleration |
| Raspberry Pi 5 | E2B | Runs offline with audio support |
| Qualcomm Dragonwing IQ8 | E2B, E4B | NPU acceleration for ultra-efficient inference |
| NVIDIA Jetson Orin Nano | E2B, E4B | Edge AI compute with CUDA acceleration |
| Mac (Apple Silicon, Metal) | All sizes | Unified memory advantage; 31B runs on M3 Max+ |
| Consumer GPU (RTX 4090) | 26B, 31B (quantized) | Single-GPU workstation deployment |
| Multi-GPU Server | 31B (full precision) | Production serving with tensor parallelism |
Step-by-Step Setup Guide
Option 1: Ollama (Simplest, All Platforms)
Ollama provides the fastest path to running Gemma 4 locally. Available on macOS, Linux, and Windows.
Installation
macOS / Linux:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
```

Windows:

```powershell
# Download and install from the official website
# Visit https://ollama.com/download and run the Windows installer
# Alternatively, install via winget:
winget install Ollama.Ollama
```

Running Models
```bash
# Edge Models (lightweight, audio-capable)
ollama run gemma4:e2b   # 7.2GB download, 128K context
ollama run gemma4:e4b   # 9.6GB download, 128K context

# Workstation Models (frontier intelligence)
ollama run gemma4:26b   # 18GB download, 256K context, MoE (4B active)
ollama run gemma4:31b   # 20GB download, 256K context, Dense
```

API Usage (All Platforms)
```bash
# Start the Ollama server (runs automatically on install)
ollama serve

# Query via REST API
curl http://localhost:11434/api/chat \
  -d '{
    "model": "gemma4:31b",
    "messages": [{"role": "user", "content": "Explain quantum entanglement."}]
  }'
```

Python SDK
```python
from ollama import chat

response = chat(
    model='gemma4:31b',
    messages=[{'role': 'user', 'content': 'Write a Python function to find prime numbers using the Sieve of Eratosthenes.'}],
)
print(response.message.content)
```

JavaScript/TypeScript SDK
```javascript
import ollama from 'ollama';

const response = await ollama.chat({
  model: 'gemma4:31b',
  messages: [{ role: 'user', content: 'Explain the CAP theorem.' }],
});
console.log(response.message.content);
```

Option 2: LiteRT-LM (Edge Devices: CLI & Python)
LiteRT-LM is Google's production-ready inference framework for deploying LLMs on edge devices. It powers on-device GenAI in Chrome, Chromebook Plus, and Pixel Watch.
Installation
macOS / Linux / Windows (WSL) / Raspberry Pi:
```bash
# Install using uv (recommended)
uv tool install litert-lm

# Or install via pip
pip install litert-lm
```

CLI Usage
```bash
# Run Gemma 4 E2B on edge hardware
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  --prompt="What is the capital of France?"

# Run Gemma 4 E4B
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E4B-it-litert-lm \
  gemma-4-E4B-it.litertlm \
  --prompt="Summarize the key points of quantum computing."
```

Python Package
```python
from litert_lm import LiteRTModel

# Load and run Gemma 4 E2B
model = LiteRTModel.from_huggingface(
    "litert-community/gemma-4-E2B-it-litert-lm"
)
response = model.generate("Explain photosynthesis in simple terms.")
print(response)
```

Option 3: Hugging Face Transformers (Python)
The official method for loading Gemma 4 with full control over generation parameters.
Installation (All Platforms)
```bash
# Windows (PowerShell)
pip install -U transformers torch accelerate

# macOS (Apple Silicon; uses the MPS backend)
pip install -U transformers torch accelerate

# Linux (with CUDA)
pip install -U transformers torch accelerate
```

Basic Text Generation
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-31b-it"  # Options: gemma-4-E2B-it, gemma-4-E4B-it, gemma-4-26b-a4b-it, gemma-4-31b-it

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Prepare prompt with system role
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a Python function to implement binary search."},
]

# Process input
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Enable reasoning mode
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=1.0,
    top_p=0.95,
    top_k=64
)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse thinking and final answer
parsed = processor.parse_response(response)
print(parsed)
```

Image Understanding
```python
from PIL import Image
import requests

# Load an image
image_url = "https://example.com/chart.png"
image = Image.open(requests.get(image_url, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe what you see in this image."}
    ]}
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

Option 4: vLLM (Production Serving)
For production-grade serving with batching, tensor parallelism, and OpenAI-compatible API.
Installation
```bash
# Linux (recommended for production)
pip install vllm

# Start the server
python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-4-31b-it \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --port 8000
```

Query via OpenAI-Compatible API
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31b-it",
    "messages": [
      {"role": "system", "content": "You are a coding assistant."},
      {"role": "user", "content": "Implement a thread-safe LRU cache in Python."}
    ],
    "max_tokens": 2048,
    "temperature": 1.0,
    "top_p": 0.95
  }'
```

Python Client
```python
from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[
        {"role": "system", "content": "You are a senior software engineer."},
        {"role": "user", "content": "Design a rate limiter using the token bucket algorithm."}
    ],
    max_tokens=2048,
    temperature=1.0,
)
print(response.choices[0].message.content)
```

Option 5: GKE with vLLM (Enterprise Cloud)
```bash
# Deploy Gemma 4 31B on Google Kubernetes Engine
# Requires a GKE cluster with GPU node pool (e.g., NVIDIA Blackwell, RTX PRO 6000)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma4-31b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma4-31b
  template:
    metadata:
      labels:
        app: gemma4-31b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=google/gemma-4-31b-it
        - --tensor-parallel-size=2
        - --dtype=bfloat16
        - --max-model-len=65536
        resources:
          limits:
            nvidia.com/gpu: 2
        ports:
        - containerPort: 8000
EOF
```

Platform-Specific Notes
Windows
- Ollama is the recommended method. Install via `winget install Ollama.Ollama` or the official installer.
- For Hugging Face Transformers, ensure CUDA is properly installed if using an NVIDIA GPU. Run `nvidia-smi` to verify.
- vLLM has limited Windows support. Use WSL2 (Windows Subsystem for Linux) for production serving.
- LiteRT-LM works under WSL2. Native Windows support is not yet available.
macOS (Apple Silicon / Metal)
- Ollama uses Metal acceleration automatically on Apple Silicon Macs. No additional configuration needed.
- For Hugging Face Transformers, PyTorch uses the MPS (Metal Performance Shaders) backend automatically. Ensure `torch >= 2.0`.
- MLX (Apple's ML framework) provides an additional optimization layer:

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-4-31b-it-8bit")
output = generate(model, tokenizer, prompt="def fibonacci(n):", max_tokens=512)
print(output)
```

- Apple Silicon unified memory is a significant advantage: a Mac Studio M3 Ultra with 192GB can serve the 31B model at competitive speeds.
Linux
- All methods work natively. Linux is the recommended platform for production deployments.
- For GPU acceleration, install the NVIDIA Container Toolkit for Docker-based deployments.
- llama.cpp provides an additional lightweight option:
```bash
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make

# Download GGUF quantized weights from Hugging Face
./llama-server -m gemma-4-31b-it-Q4_K_M.gguf --ctx-size 32768 --port 8080
```
Model Download Sources
All Gemma 4 model weights are available from multiple official sources:
| Source | Link | Formats |
|---|---|---|
| Hugging Face | google/gemma-4 | Safetensors, GGUF |
| Ollama | ollama.com/library/gemma4 | GGUF (pre-quantized) |
| Kaggle | kaggle.com/models/google/gemma-4 | Safetensors |
| LM Studio | lmstudio.ai/models/gemma-4 | GGUF |
| Docker | hub.docker.com/r/ai/gemma4 | Container images |
Best Practices
Sampling Parameters
Use these standardized sampling configurations across all use cases for optimal performance:
```
temperature=1.0
top_p=0.95
top_k=64
```

Multi-Turn Conversations
In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins. This prevents context pollution from accumulated reasoning traces.
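A minimal sketch of that history hygiene, assuming thoughts are delimited by the channel tags shown earlier (the regex is illustrative; match it to the exact template your library emits):

```python
import re

# Illustrative: strip the thought block delimited by the channel tags
# described earlier, keeping only the final response for history.
THOUGHT_PATTERN = re.compile(r"<\|channel>thought\n.*?<channel\|>", re.DOTALL)

def clean_for_history(model_output: str) -> str:
    """Drop reasoning traces so they don't pollute the next turn's context."""
    return THOUGHT_PATTERN.sub("", model_output).strip()

history = []
raw = "<|channel>thought\nLet me check primes...<channel|>17 is prime."
history.append({"role": "assistant", "content": clean_for_history(raw)})
print(history)  # [{'role': 'assistant', 'content': '17 is prime.'}]
```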
Visual Token Budget
Gemma 4 supports variable image resolution through configurable visual token budgets. The supported budgets are: 70, 140, 280, 560, and 1120.
- Lower budgets (70โ140): Classification, captioning, video frame processing
- Higher budgets (560โ1120): OCR, document parsing, reading small text
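The sketch below illustrates choosing a budget per task, reusing the `processor`, `text`, and `image` objects from the Transformers examples above. Note that `visual_token_budget` is a hypothetical keyword argument; the released processor may expose this setting under a different name, so treat this strictly as a sketch of the trade-off.

```python
# Hypothetical parameter name: the real processor may expose the visual
# token budget differently; this only illustrates the trade-off.
def pick_budget(task: str) -> int:
    budgets = {"classification": 70, "captioning": 140, "video_frame": 70,
               "chart": 560, "ocr": 1120, "document": 1120}
    return budgets.get(task, 280)  # 280 as a middle-ground default

inputs = processor(
    text=text,
    images=[image],
    visual_token_budget=pick_budget("ocr"),  # hypothetical kwarg
    return_tensors="pt",
)
```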
Audio Processing (E2B/E4B)
For ASR:

```
Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.
Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven.
```

For speech translation:

```
Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate
it into {TARGET_LANGUAGE}. When formatting the answer, first output the
transcription in {SOURCE_LANGUAGE}, then one newline, then output the string
'{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
```

Safety & Security
Gemma 4 models undergo the same rigorous infrastructure security protocols as Google's proprietary Gemini models. Key safety measures include:
- CSAM Filtering: Rigorous Child Sexual Abuse Material filtering at multiple stages of data preparation.
- Sensitive Data Filtering: Automated techniques to filter personal information and sensitive data from training sets.
- Content Safety Evaluations: Partnership with internal safety and responsible AI teams at Google DeepMind.
- Significant safety improvements: All Gemma 4 models significantly outperform Gemma 3 and 3n models in content safety while keeping unjustified refusals low.
For enterprises and sovereign organizations, the Apache 2.0 license combined with Google's security protocols provides a trusted, transparent foundation that delivers state-of-the-art capabilities while meeting the highest standards for security and reliability.
Who Should Use Gemma 4
| Use Case | Recommended Model | Why |
|---|---|---|
| Mobile / IoT apps | E2B | Under 3GB in 4-bit quant; audio + vision |
| On-device voice assistants | E2B / E4B | Native ASR; offline-capable |
| Low-latency API serving | 26B A4B | 4B active params = fast tokens/sec |
| Coding assistants | 31B | Highest reasoning; leads on SciCode |
| Enterprise RAG pipelines | 31B | 256K context; strong factual grounding |
| Multilingual applications | Any size | 140+ language training; 35+ production languages |
| Edge AI research | E2B / E4B | Apache 2.0; run on Raspberry Pi, Jetson |
| Air-gapped / sovereign deployments | 31B | No licensing callbacks; full Apache 2.0 |
| Cost-sensitive production | 26B A4B | MoE efficiency; ~4B inference cost |
Key Links & Resources
- Official Product Page: deepmind.google/models/gemma/gemma-4
- Model Card: ai.google.dev/gemma/docs/core/model_card_4
- Documentation: ai.google.dev/gemma/docs
- Google AI Studio: aistudio.google.com
- Hugging Face Collection: huggingface.co/collections/google/gemma-4
- Ollama Library: ollama.com/library/gemma4
- LiteRT-LM (Edge): github.com/google-ai-edge/LiteRT-LM
- Artificial Analysis Benchmarks: artificialanalysis.ai/models/gemma-4-31b
- License (Apache 2.0): ai.google.dev/gemma/apache_2
- Prompt Formatting Guide: ai.google.dev/gemma/docs/core/prompt-formatting-gemma4
- Function Calling Guide: ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
Conclusion
Gemma 4 is not just an iterative upgrade; it is a generational leap that fundamentally changes the open-weights landscape. The combination of Apache 2.0 licensing, multimodal reasoning across text/image/video/audio, configurable thinking modes, native agentic capabilities, and deployment flexibility from sub-3GB edge devices to multi-GPU servers makes it the most versatile open model family available today.
The Intelligence Index numbers speak clearly: +29 points over Gemma 3 for the flagship model, with 2.5× better token efficiency than the nearest open-weights competitor at a comparable intelligence level. For developers, researchers, and enterprises seeking maximum capability per parameter, with complete licensing freedom, Gemma 4 is the new baseline.
The model weights are available now on Hugging Face, Ollama, Kaggle, LM Studio, and Docker. The Apache 2.0 license means you can start building today with zero commercial restrictions.