Overview
Gemma 4 E4B is Google DeepMind's high-performance edge model in the Gemma 4 family, released April 2, 2026. Despite having only 4.5B effective parameters (8.0B total with Per-Layer Embeddings), it punches far above its weight class—outperforming the previous-generation Gemma 3 27B on reasoning (GPQA Diamond: 58.6% vs 42.4%), coding (LiveCodeBench: 52.0% vs 29.1%), and math (AIME: 42.5% vs 20.8%) benchmarks. This represents a generational leap where a model one-sixth the size of its predecessor delivers superior results, making it a landmark achievement in efficient AI architecture.
The model is designed for deployment on consumer hardware including laptops, smartphones, and IoT devices. On an RTX 4070 Ti with full GPU offload, it achieves ~70 tokens/second generation with stable performance across context lengths up to 128K. Its Q4 quantized version requires only ~8GB of VRAM, and it runs comfortably on Apple Silicon unified memory. Uniquely within the Gemma 4 family, E4B supports native audio input (alongside text and vision), enabling on-device speech understanding—a capability absent from the larger 26B and 31B variants.
Positioned as the recommended starting point for edge and local deployments, E4B uses Apache 2.0 licensing for frictionless commercial adoption. While it lacks the thinking/CoT mode available on E2B, benchmarks show it delivers superior output quality on structured tasks (extraction, translation, commit messages) with faster effective throughput. For developers choosing within the Gemma 4 family, Google and community consensus is clear: start with E4B for edge/mobile use cases, step up to 26B A4B for workstation-grade reasoning, and only pursue 31B when maximum quality is non-negotiable.
Benchmarks & Performance
## Benchmark Scores (IT + Thinking mode where applicable)
| Benchmark | Gemma 4 E4B | Gemma 4 E2B | Gemma 4 26B A4B | Gemma 4 31B | Gemma 3 27B |
|---|---|---|---|---|---|
| MMLU Pro (Knowledge) | 69.4% | 60.0% | 82.6% | 85.2% | 67.6% |
| GPQA Diamond (Science) | 58.6% | 43.4% | 82.3% | 84.3% | 42.4% |
| AIME 2026 (Math) | 42.5% | 37.5% | 88.3% | 89.2% | 20.8% |
| LiveCodeBench v6 (Code) | 52.0% | 44.0% | 77.1% | 80.0% | 29.1% |
| MMMU Pro (Vision) | 52.6% | 44.2% | 73.8% | 76.9% | 49.7% |
| τ2-bench Retail (Agentic) | 57.5% | 29.4% | 85.5% | 86.4% | 6.6% |
| Arena Elo (Text) | N/A | N/A | 1441 | 1452 | 1365 |
| Codeforces ELO | 940 | — | 1718 | 2150 | 110 |
## Inference Performance (RTX 4070 Ti, Q8_0, full GPU offload)
| Metric | Gemma 4 E4B | Gemma 4 26B A4B (Q4) |
|---|---|---|
| Prompt Processing (pp512) | 6,757 t/s | 333 t/s |
| Generation (tg128) | 69.7 t/s | 13.7 t/s |
| Prompt Processing (pp16K) | 5,993 t/s | 268 t/s |
| Generation (tg256) | 70.8 t/s | ~15 t/s |
Key takeaway: E4B is ~5x faster in generation and ~20x faster in prompt processing than 26B A4B on the same consumer hardware. Generation speed remains stable from 4K to 128K context, showing graceful KV cache behavior.
Detailed Comparison
## Gemma 4 E4B vs Gemma 4 E2B (Edge sibling)
| Dimension | E4B | E2B |
|---|---|---|
| Effective params | 4.5B | 2.3B |
| Total params | 8.0B | 5.1B |
| Context window | 128K | 128K |
| Audio input | ✅ | ✅ |
| Thinking mode | ❌ | ✅ (Ollama default) |
| MMLU Pro | 69.4% | 60.0% |
| GPQA Diamond | 58.6% | 43.4% |
| Generation speed (RTX 3070) | ~30 t/s | ~40-46 t/s |
| VRAM (Q4) | ~8 GB | ~5 GB |
| Best for | Structured tasks, agent workflows, quality-sensitive edge use | Ultra-low-resource devices, reasoning via thinking mode |
Despite E2B's higher raw TPS, E4B is often faster in practice for structured tasks because E2B's default thinking mode consumes 10-30x more tokens internally. For key-word extraction, E4B took 0.74s/13 tokens vs E2B's 7.4s/280 tokens.
## Gemma 4 E4B vs Gemma 4 26B A4B (Workstation MoE sibling)
| Dimension | E4B | 26B A4B |
|---|---|---|
| Architecture | Dense | MoE (128 experts, 8 active) |
| Total params | 8.0B | 25.2B |
| Context window | 128K | 256K |
| Audio input | ✅ | ❌ |
| AIME 2026 | 42.5% | 88.3% |
| LiveCodeBench | 52.0% | 77.1% |
| VRAM (Q4) | ~8 GB | ~18 GB |
| Generation speed (4070 Ti) | ~70 t/s | ~14 t/s |
| Best for | Mobile/edge, fast local tasks, RAG | Deep reasoning, coding agents, long-context workflows |
The 26B A4B offers dramatically better quality on hard tasks but requires 2x+ the VRAM and runs 5x slower. For lightweight RAG, summarization, and retrieval tasks, E4B is the pragmatic choice. For complex multi-step coding or reasoning, 26B A4B is clearly superior.
## Gemma 4 E4B vs Gemma 3 27B (Previous generation)
E4B decisively outperforms Gemma 3 27B across all benchmarks despite being ~6x smaller: GPQA 58.6% vs 42.4%, LiveCodeBench 52.0% vs 29.1%, AIME 42.5% vs 20.8%, τ2-bench 57.5% vs 6.6%. This demonstrates the architectural advances (PLE, shared KV cache, dual RoPE) in Gemma 4.
Community Feedback
The developer community has responded enthusiastically to Gemma 4 E4B as the 'sweet spot' model of the family. Key reactions:
**Praise for efficiency**: Multiple benchmarks confirm E4B's standout position. The KodeLab team noted that E4B 'outperforms the previous generation's 27B model in several reasoning tasks, despite being nearly one-sixth the size,' calling it 'an ideal candidate for local agent development.' The Gemma 4 Wiki describes it as achieving 'performance that punches far above its actual parameter count.'
**Ollama adoption is high**: Community guides consistently recommend `ollama run gemma4:e4b` as the first command to try.t decided which to try, start with E4B.' Several tutorials center E4B as the default local model.
**Agent workflow traction**: The RTX 4070 Ti benchmark author (Alfonso Fortunato) concluded: 'for day-to-day local work on a 4070 Ti, I would still pick E4B first... if your real goal is lightweight local RAG, fetching information from the web, summarizing it well, and handling smaller tool-using tasks.'
**Thinking mode surprise**: A notable community discovery (KodeLab) revealed that Ollama's E2B renderer auto-injects thinking tokens by default, while E4B does not support thinking mode at all. This led to practical guidance: 'use E4B for agent workflows where direct answers are preferred; use E2B + thinking for complex reasoning.'
**Quantization and deployment**: Unsloth's GGUF versions are widely used. Community consensus is that Q4_K_M (~8GB) is the optimal tradeoff for most users, with Q8_0 recommended for quality-critical work when VRAM allows (~15GB).
**Adoption pattern**: E4B appears to be the most-deployed edge variant, frequently appearing in Ollama configurations, llama.cpp setups, and mobile deployment guides. The Apache 2.0 license has been specifically cited as removing 'procurement and legal friction' for enterprise evaluation.
Use Cases
### 1. Local RAG and Document Summarization
E4B excels at retrieval-augmented generation workflows on consumer hardware. With 128K context and ~70 t/s generation on an RTX 4070 Ti, it can ingest substantial document collections, answer grounded questions, and produce accurate summaries. Its stable performance across context lengths (no speed degradation from 4K to 128K) makes it ideal for 'paste a large document, ask questions' patterns. Choose E4B over 26B A4B here because the 5x speed advantage matters more than marginal quality gains for summarization tasks.
### 2. Edge-Device Voice and Multimodal Understanding
As the only Gemma 4 model combining audio + vision + text input in a small form factor, E4B enables unique on-device use cases: transcribing meetings (30s audio chunks), understanding screenshots/GUIs, reading text from images, and processing short video clips. On a Raspberry Pi 5 or Jetson Nano with quantized weights, it can run completely offline. Choose E4B over E2B when audio/vision quality matters; choose it over 26B/31B because those lack native audio support entirely.
### 3. Agentic Tool Use and Code Assistants
E4B's τ2-bench score of 57.5% (vs E2B's 29.4%) demonstrates strong function-calling and tool-use capabilities. It integrates cleanly with agentic frameworks like OpenClaw, Hermes Agent, and opencode via OpenAI-compatible endpoints served through llama.cpp or Ollama. For coding assistants, it generates conventional commit messages, shell commands, and code snippets with high reliability. Choose E4B over E2B for agent workflows because it produces more complete, accurate structured outputs without the latency penalty of thinking mode.
### 4. Lightweight Classification, Extraction, and Translation Pipelines
For batch processing tasks—sentiment analysis, keyword extraction, email-to-JSON conversion, translation, and structured data extraction—E4B dramatically outperforms E2B in effective throughput. The KodeLab benchmark showed E4B completing keyword extraction in 0.74s (13 tokens) vs E2B's 7.4s (280 tokens) for equivalent quality output. Choose E4B over all alternatives when: the task is well-defined, output is short, and throughput/latency are critical. Set `temperature=0.2` and disable thinking for maximum efficiency.
Latest News
## Release & Availability (April 2-3, 2026)
- **Gemma 4 family released** April 2, 2026, including E2B, E4B, 26B A4B, and 31B variants under Apache 2.0 license.
- **Google AI Edge Gallery** announced as the primary download/deploy path for E4B and E2B on mobile and edge devices.
- **AI Studio** provides online testing for the 31B and 26B A4B variants (not E4B directly).
## Ecosystem Support (April 2-7, 2026)
- **Ollama**: Full support with `ollama run gemma4:e4b` (tagged as `gemma4:latest` for E4B). Community discovered that E4B lacks thinking mode support in Ollama, while E2B gets auto-injected thinking tokens.
- **Transformers v5.5.0**: Required minimum version for native Gemma 4 support, including MoE architecture handling.
- **llama.cpp**: Full support including GGUF quantizations via Unsloth (Q4_K_M, Q8_0, etc.). Anthropic-compatible endpoint `/v1/messages` enables experimental Claude Code integration.
- **MLX**: Apple Silicon optimized support available.
- **LM Studio**: Supported with graphic interface for local deployment.
## Architecture Highlights
- **Per-Layer Embeddings (PLE)**: Each decoder layer has its own small embedding table, allowing the model to deliver ~4B-parameter performance with 8B total parameters.
- **Dual RoPE**: Alternating sliding-window and global attention for efficient long-context processing.
- **Shared KV Cache**: Final layers reuse key-value states from earlier layers, reducing memory footprint.
## Pricing
- E4B is **free and open-weight** under Apache 2.0. No API pricing tier exists for E4B specifically; it is designed for self-hosting.
- The larger Gemma 4 models are available for free testing in Google AI Studio; no standard paid API SKU has been announced as of the search results.