DeepSeek V4 Pro vs Flash vs Gemini 3.1 Pro vs MiniMax M3 vs Kimi K2.6: Complete 2026 Benchmark, Pricing & Deployment Guide

Published on June 4, 2026

Research conducted on June 4, 2026. Benchmark data sourced from vendor reports, Hugging Face model cards, and third-party evaluations. Pricing verified against official API documentation. All claims are attributed to their original source.

What This Article Covers

DeepSeek V4 is a family of two production-ready large language models — DeepSeek-V4-Pro and DeepSeek-V4-Flash — released publicly on April 24, 2026 under the MIT License. Both models feature a 1-million-token context window and a hybrid Mixture-of-Experts (MoE) architecture.

This guide covers:

What V4-Pro and V4-Flash are, and how they differ
Benchmark comparison against current frontier and open-source models
Current official API pricing versus competitors
Architecture and training details from official sources
Multimodal and agent capabilities
Where to test and download the models
Local deployment via Ollama, LM Studio, llama.cpp, vLLM, and SGLang
How to use V4 with Claude Code, OpenCode, and OpenClaw
Practical decision guide: when to use which variant

Note: The “V4 Preview” designation was the original release label used on April 24, 2026. Both V4-Pro and V4-Flash are now in general availability. There is no separate “preview” product — the preview label referred to the initial release state of the full V4 family.

Benchmark Comparison

This section contains benchmark scores primarily from vendor-reported results at the time of the April 24, 2026 release, supplemented by third-party data where available. Vendor-reported scores are labeled (V) and independently sourced scores are labeled (3P).

Benchmark evaluations used maximum reasoning effort (“Max” or thinking mode) unless otherwise noted. Scores for competitors are sourced from their respective official technical reports or model cards.

Core Performance Benchmarks

Model	Type	MMLU-Pro	GPQA Diamond	LiveCodeBench	SWE-bench Verified	Terminal-Bench 2.0
DeepSeek-V4-Pro	Open	87.5 (V)	90.1% (V)	93.5% (V)	80.6% (V)	67.9% (V)
DeepSeek-V4-Flash	Open	86.2 (V)	88.1% (V)	91.6% (V)	79.0% (V)	56.9% (V)
GPT-5.5	Closed	~89.0 (3P est.)	~91.5% (3P est.)	~92.0% (3P est.)	~82.0% (3P est.)	~65.0% (3P est.)
Claude Opus 4.8	Closed	~88.5 (3P est.)	~91.0% (3P est.)	~88.0% (3P est.)	~83.5% (3P est.)	~62.0% (3P est.)
Claude Sonnet 4.6	Closed	84.6 (V)	81.2% (V)	59.0% (V)	77.2% (V)	42.8% (V)
Gemini 3.1 Pro	Closed	92.6% (V)	94.3% (V)	—	80.6% (V)	—
Gemini 3.5 Flash	Closed	~88.0% (3P est.)	~89.0% (3P est.)	~90.0% (3P est.)	~79.0% (3P est.)	—
GLM-5.1	Open	~84.2% (3P)	—	—	78.0%† (V)	—
Kimi K2.6	Open	—	—	—	80.2% (V)	66.7% (V)
MiniMax M3	Open	84.22% (V)	92.68% (V)	82.15% (V)	79.0%† (V)	66.0% (V)
DeepSeek V3.2	Open	85.0 (V)	82.4% (V)	83.3% (V)	73.1% (V)	46.4% (V)

Table key: (V) = vendor-reported, (3P est.) = third-party estimate, (3P) = third-party independently sourced. — = no published score found. † = SWE-bench Verified score; SWE-bench Pro score may differ. “Open” = open-weights model. “Closed” = closed-source, API-only.

Gemini 3.1 Pro note: Released February 19, 2026, currently in preview. Benchmark data from the official Google DeepMind technical report. The GPQA Diamond score of 94.3% and MMLU score of 92.6% are vendor-reported by Google. LiveCodeBench score is expressed as Elo (2887) in Google’s report, not a direct percentage — omitted from table for comparability reasons.

Gemini 3.5 Flash note: Released May 19, 2026 (generally available). It is primarily optimized for speed, coding, and agentic loops. Official detailed benchmark reports were not fully published at research time; scores are estimates from third-party sources.

GLM-5.1 note: Released March 2026 by Z.ai (formerly Zhipu AI), open weights on April 7, 2026. 754B MoE (40B active). Scored 58.4% on SWE-bench Pro (different benchmark variant; not directly comparable to SWE-bench Verified in the table). MMLU-Pro score is from third-party trackers (BenchLM.ai). Other full benchmark suite scores were not officially published.

Kimi K2.6 note: Released April 20, 2026 by Moonshot AI. 1 trillion total parameters, 32B active, 256K context. Native multimodal (text, image, video via MoonViT). SWE-bench Verified 80.2% is vendor-reported. Terminal-Bench 2.0 score of 66.7% is vendor-reported.

MiniMax M3 note: Released June 1, 2026. All benchmark scores are vendor-reported by MiniMax. Weights were pending community availability at time of writing (expected ~10 days after API launch). Independent verification not available at research time.

Important caveat on closed-source model estimates: GPT-5.5, Claude Opus 4.8, and Gemini 3.5 Flash scores marked “(3P est.)” are estimates based on available third-party evaluations and community testing as of early June 2026. Treat these with appropriate caution.

Key Benchmark Observations

SWE-bench Verified is the most widely accepted benchmark for real-world software engineering capability. DeepSeek-V4-Pro’s 80.6% places it in direct competition with frontier proprietary models and Kimi K2.6 (80.2%). Gemini 3.1 Pro also reports 80.6% on SWE-bench Verified.

GPQA Diamond (expert-level scientific reasoning) shows Gemini 3.1 Pro (94.3%) and MiniMax M3 (92.68%) leading, with DeepSeek V4-Pro at 90.1%.

Terminal-Bench 2.0 measures autonomous agentic capability. V4-Pro (67.9%) and Kimi K2.6 (66.7%) are the strongest open-weight models in this category. This is where the Pro/Flash gap within DeepSeek is most pronounced: 67.9% vs 56.9%, an 11-percentage-point difference.

GLM-5.1 on SWE-bench Pro scored 58.4%, placing it above GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%) on that specific benchmark at the time of its release (April 2026). Note: SWE-bench Pro is a harder variant than SWE-bench Verified; scores are not directly interchangeable.

All benchmark scores for DeepSeek V4-Pro and V4-Flash are primarily from the April 24, 2026 vendor report unless marked otherwise. Readers should consult updated leaderboards such as Artificial Analysis and LMSYS Arena for the most current independent rankings.

API Pricing Comparison

Prices verified as of June 4, 2026 against official API documentation pages. All prices are in USD per 1 million (1M) tokens. Prices are subject to change; always verify at the official provider documentation before production deployment.

Provider	Model	Input (Cache Miss)	Input (Cache Hit)	Output	Context Limit	Notes
DeepSeek	V4-Flash	$0.14	$0.0028	$0.28	1M tokens	MIT License; OpenAI & Anthropic-compatible API
DeepSeek	V4-Pro	$0.435	$0.003625	$0.87	1M tokens	MIT License; max output 384K tokens
OpenAI	GPT-5.5	$5.00	~$2.50 (50% batch)	$30.00	400K tokens	Proprietary
Anthropic	Claude Opus 4.8	$5.00	—	$25.00	200K tokens	Proprietary
Anthropic	Claude Sonnet 4.6	$3.00	—	$15.00	200K tokens	Proprietary
Google	Gemini 3.1 Pro	$2.00	—	$12.00	1M tokens	Preview; verify at Google AI pricing
Google	Gemini 3.5 Flash	$1.50	—	$9.00	1M tokens	GA since May 19, 2026; output limit 65K tokens
Z.ai	GLM-5.1	$1.40	—	$4.40	200K tokens	Open weights MIT; Z.ai API pricing as of April 2026
MiniMax	MiniMax M3	$0.30–$0.60	$0.06–$0.12	$1.20–$2.40	1M tokens	Promotional/standard tiers; >512K input costs more
Moonshot AI	Kimi K2.6	—	—	—	256K tokens	Check official Kimi API; API pricing not confirmed at research time

Pricing Notes

DeepSeek cache pricing: DeepSeek’s context caching is enabled by default for all API requests. The “cache hit” price applies when an identical prompt prefix is reused across requests. This is especially significant for agentic workflows that repeatedly send long system prompts or code repositories — effective input cost can drop from $0.14 to $0.0028 per 1M tokens for Flash (50× reduction).

Gemini 3.1 Pro vs Gemini 3.5 Flash pricing: Gemini 3.5 Flash ($1.50 input / $9.00 output) is cheaper than Gemini 3.1 Pro ($2.00 input / $12.00 output). Notably, Gemini 3.5 Flash often outperforms 3.1 Pro on coding and agentic benchmarks despite being lower-priced — it is Google’s agentic-optimized model as of June 2026.

Legacy aliases: The older deepseek-chat and deepseek-reasoner model aliases are scheduled for deprecation on July 24, 2026. These aliases currently route transparently to V4-Flash. Users should migrate to explicit model IDs (deepseek-v4-flash or deepseek-v4-pro) before that date.

Promotional history: DeepSeek V4-Pro was initially offered at a 75% promotional discount from April 24 through May 31, 2026. The prices in the table above reflect the current standard rates effective June 2026.

MiniMax M3 pricing tiers: Input pricing scales with context length. Inputs above 512K tokens cost $1.20/M (promo) to $2.40/M (standard). Prompt cache reads cost $0.06/M (promo) to $0.12/M (standard). Verify at official MiniMax API documentation.

GLM-5.1 pricing: Z.ai adjusted pricing in April 2026. The $1.40 input / $4.40 output rates are from that update. Verify at z.ai for current pricing.

Kimi K2.6 pricing: Official API pricing for K2.6 was not confirmed in available sources at research time. Check the official Kimi platform and API documentation.

What DeepSeek V4 Is

The V4 Family

DeepSeek-V4 is the fourth major iteration of DeepSeek’s flagship model line. Released on April 24, 2026, it consists of two architecturally distinct variants:

DeepSeek-V4-Pro: 1.6 trillion total parameters, 49 billion activated per token. The flagship model designed for deep reasoning, complex agentic coding, and long-context tasks.
DeepSeek-V4-Flash: 284 billion total parameters, 13 billion activated per token. Optimized for high-throughput, low-latency inference while maintaining strong reasoning performance.

Both models support:

1,000,000-token context window
384,000-token maximum output length
Thinking (extended reasoning) and Non-Thinking modes
OpenAI-compatible /v1/chat/completions API format
Anthropic-compatible endpoint (official, documented)

Open Weights, Open Source?

DeepSeek V4-Pro and V4-Flash are released as open-weights models under the MIT License. The weights are downloadable from the official DeepSeek AI Hugging Face collection. “Open weights” means the model weights are publicly downloadable for inference, fine-tuning, and research — the training code and full training data are not publicly released.

The MIT License permits commercial use, modification, and redistribution with minimal restrictions. This is a more permissive license than the custom licenses used for earlier DeepSeek models.

Thinking and Non-Thinking Modes

Both V4-Pro and V4-Flash implement a dual-mode architecture:

Thinking mode (extended reasoning): The model generates internal chain-of-thought reasoning before outputting its final answer. This produces higher accuracy on complex, multi-step tasks at the cost of additional tokens and latency.
Non-thinking mode (direct output): The model responds directly without extended reasoning, prioritizing speed and cost efficiency.

Thinking mode is controlled via the API parameter enable_thinking: true/false. The official API documentation at api-docs.deepseek.com describes both modes.

Compatibility and Aliases

The V4 series is compatible with both OpenAI and Anthropic API clients. The official model IDs are:

deepseek-v4-pro — V4-Pro (production ID)
deepseek-v4-flash — V4-Flash (production ID)
deepseek-chat — deprecated alias, currently routes to V4-Flash; retires July 24, 2026
deepseek-reasoner — deprecated alias, currently routes to V4-Flash; retires July 24, 2026

Architecture and Training

All technical details in this section are sourced from the official DeepSeek V4 technical report published on Hugging Face and the DeepSeek website at the time of release. Claims marked with (official) are directly from these sources.

Mixture-of-Experts (MoE) Design

DeepSeek V4 uses a sparse Mixture-of-Experts architecture for both Pro and Flash variants. In a sparse MoE model, only a fraction of the total parameters activate per token — this is how V4-Pro achieves 1.6T total parameters while only computing with 49B per inference step.

	V4-Pro	V4-Flash
Total Parameters	1.6 Trillion	284 Billion
Activated per Token	49 Billion	13 Billion
Context Window	1,000,000 tokens	1,000,000 tokens
Max Output	384,000 tokens	384,000 tokens
Pre-training Data	>32 trillion tokens (official)	—

Hybrid Attention: CSA and HCA

The 1-million-token context window is made computationally feasible by a hybrid attention architecture that replaces standard full attention with two complementary mechanisms (official):

Compressed Sparse Attention (CSA): Designed for local precision. CSA groups small windows of tokens (typically 4 tokens) into a single compressed KV entry using learned compression weights. A “Lightning Indexer” then performs top-k selection for each query, allowing selective attention to relevant compressed entries while maintaining a 128-token sliding window for local detail. This handles short-range, high-precision attention.

Heavily Compressed Attention (HCA): Designed for global context. HCA applies aggressive compression (approximately 128× reduction) and performs dense attention over these compressed representations. This provides a broad, low-cost “wide-angle” view of distant context. HCA handles long-range, approximate attention efficiently.

By interleaving CSA (precision) and HCA (breadth), the architecture achieves 1M-token context at a fraction of the FLOPs required by standard attention. At 1M context length, V4-Pro reportedly requires only 27% of the single-token inference FLOPs and 10% of the KV cache memory compared to DeepSeek-V3.2 (official report figure).

Manifold-Constrained Hyper-Connections (mHC)

DeepSeek V4 replaces traditional residual connections with Manifold-Constrained Hyper-Connections (mHC) (official). This technique constrains the residual mapping to the manifold of doubly stochastic matrices (the Birkhoff polytope), improving gradient stability across the model’s depth. This is intended to prevent signal degradation in very deep MoE architectures.

Optimizer and Training Pipeline

Muon Optimizer (official): The model was trained using the Muon optimizer, credited with improved convergence stability at the 1.6T parameter scale compared to Adam-based alternatives.

Post-training pipeline (official): DeepSeek used a two-stage process:

Independent Expert Cultivation: Domain-specific expert networks are trained separately using Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO).
Unified Consolidation: These domain experts are merged into the final MoE structure via on-policy distillation.

Multimodal and Agent Capabilities

Multimodal Support

As of June 2026, the official DeepSeek API documentation describes multimodal input capabilities for the V4 series. However, the exact scope of multimodal support in the V4 release warrants clarification:

The core V4 architecture is a language model first. Text input and output is fully supported in both Pro and Flash variants.
Image processing capability is mentioned in the official API docs for the V4 series. Developers should consult api-docs.deepseek.com for current limitations on supported image formats, maximum image count per request, and resolution constraints.
Audio and video input are not documented as supported features in the V4 series at the time of this writing.

Comparison note: Unlike Gemini 3.1 Pro or 3.5 Flash (full native multimodal including video), Gemma 4 (native audio/video), or Kimi K2.6 (native image/video via MoonViT), DeepSeek V4’s multimodal feature set is primarily focused on text + image. Do not assume full audio/video parity with other frontier models.

Function Calling and Tool Use

Both V4-Pro and V4-Flash support function calling (tool use) via the standard OpenAI tools parameter in the chat completions API:

from openai import OpenAI

client = OpenAI(
    api_key="your_deepseek_api_key",
    base_url="https://api.deepseek.com"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g. San Francisco, CA"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto"
)

Context Caching

DeepSeek’s context caching is enabled by default for all API requests (official docs). The cache persists identical prompt prefixes across requests:

prompt_cache_hit_tokens: tokens retrieved from cache (billed at $0.0028/1M for Flash, $0.003625/1M for Pro)
prompt_cache_miss_tokens: tokens requiring recomputation (billed at standard input rate)

This is particularly valuable for agentic loops that repeatedly send long system prompts or code contexts.

Thinking Mode API Usage

# Enable extended reasoning (higher accuracy, more tokens)
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Solve this complex coding problem..."}],
    extra_body={"enable_thinking": True}  # DeepSeek-specific parameter
)

# Non-thinking mode (faster, cheaper)
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Summarize this code..."}],
    extra_body={"enable_thinking": False}
)

Where to Test It

Official Platforms

Platform	URL	Notes
DeepSeek Chat (web app)	chat.deepseek.com	Official web interface; free to use; model switching available
DeepSeek API	api-docs.deepseek.com	Official API; requires account and API key
DeepSeek API Playground	Available in the DeepSeek dashboard	Interactive API testing environment

Third-Party Platforms

Platform	Notes
OpenRouter	Hosts both `deepseek-v4-pro` and `deepseek-v4-flash`; unified API access
Lightning AI	Cloud GPU access; useful for testing alongside local deployment
NVIDIA NIM	NVIDIA’s hosted inference service; check availability
Hugging Face Inference	Model pages at `huggingface.co/deepseek-ai/DeepSeek-V4-Pro`

Downloading the Model Weights

The official open weights for DeepSeek V4-Pro and V4-Flash are available on Hugging Face:

V4-Pro: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
V4-Flash: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash

Full-precision weights for V4-Pro are approximately 865 GB (FP16). V4-Flash full-precision weights are considerably smaller given the 284B parameter count.

Both repositories include:

Full FP16 weights
FP8 quantized variants (from DeepSeek, on HuggingFace)
Model cards describing architecture, training, and usage

Community-provided quantizations (GGUF, GPTQ, AWQ) for use with llama.cpp-based tools appear over time in the Hugging Face hub — search for “DeepSeek-V4-Pro GGUF” or “DeepSeek-V4-Flash GGUF” to find them. Always prefer quantizations from reputable community contributors (e.g., bartowski, unsloth, TheBloke) and verify the file hash.

Local Setup and System Requirements

Hardware Reality Check

DeepSeek V4-Pro (1.6T parameters) is not practical on consumer hardware in full precision. Even V4-Flash (284B) requires significant GPU VRAM for inference at full precision. Quantization is required for most local deployments.

Model	Precision	Minimum VRAM	Practical Config
V4-Flash	FP16	~170–175 GB	2× H200 or 2× RTX Pro 6000 Blackwell
V4-Flash	4-bit GGUF	~70–80 GB	3× RTX 4090 or 2× A100 80GB
V4-Flash	2-bit GGUF	~45 GB	2× RTX 4090 or 1× H100 SXM
V4-Pro	FP16	~865 GB	High-end GPU cluster (not consumer)
V4-Pro	4-bit GGUF	~220 GB	Multi-GPU server setup

For most developers, DeepSeek-V4-Flash is the viable local deployment target. V4-Pro local deployment is realistically limited to multi-GPU server environments.

Option 1: Ollama

Best for: Quick start, single-machine inference, OpenAI-compatible API

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from https://ollama.com/download

# Run V4-Flash (check Ollama library for current availability)
ollama run deepseek-v4-flash

# If not yet in official library, create a Modelfile from GGUF:
# 1. Download a community GGUF from Hugging Face
# 2. Create Modelfile:
cat > Modelfile << 'EOF'
FROM ./DeepSeek-V4-Flash-Q4_K_M.gguf
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
EOF

# Import and run
ollama create deepseek-v4-flash-local -f Modelfile
ollama run deepseek-v4-flash-local

Ollama library availability: DeepSeek V4 may or may not be in the official Ollama library at the time you read this. Check ollama.com/library for current availability. If not listed, use the Modelfile import method above with a community GGUF.

Hardware: Depends on quantization level. For Q4 GGUF of V4-Flash, expect 70–80 GB VRAM across multiple GPUs. Context window in Ollama is limited by available VRAM/RAM.

Caveats: Ollama’s KV cache handling for MoE models can be memory-intensive. Reduce num_ctx to 16384 or 8192 if you run into memory issues. The 1M context is not achievable locally on most hardware.

Option 2: LM Studio

Best for: Windows/macOS GUI-based local deployment, easy GGUF loading

1. Download LM Studio from https://lmstudio.ai
2. Open LM Studio → Search for "DeepSeek-V4-Flash" in the model catalog
3. If not in catalog: Use "Import GGUF" to load a downloaded GGUF file
4. Configure context length (start with 8192–16384 for stability)
5. Start server → Connect your client to http://localhost:1234/v1

Best for: Developers on Windows or macOS who want a visual interface for model management. Supports OpenAI-compatible local server.

Hardware: Practical use requires a system with 64–80+ GB total RAM/VRAM for V4-Flash Q4. CPU-only mode is possible but very slow (~0.5–1 token/sec for a 284B model at any quantization).

Caveats: LM Studio’s GPU offloading handles MoE layers better in recent versions (post-0.3). Earlier versions may have issues offloading expert layers efficiently. Update to the latest LM Studio before attempting these models.

Option 3: llama.cpp

Best for: Linux/macOS developers who want maximum control, CPU+GPU hybrid inference

# 1. Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# 2. Download V4-Flash GGUF from Hugging Face
# Find community GGUF at: https://huggingface.co/search/full-text?q=deepseek-v4-flash+gguf

# 3. Run with MoE expert offloading (CRITICAL for these models)
./build/bin/llama-server 
  -m DeepSeek-V4-Flash-Q4_K_M.gguf 
  --ctx-size 16384 
  --threads 24 
  --n-gpu-layers 30 
  -ot ".ffn_.*_exps.=CPU" 
  --port 8080

# The -ot flag offloads MoE expert layers to CPU RAM
# This is critical — without it, VRAM consumption becomes unmanageable
# --n-gpu-layers: tune based on available VRAM (more = faster, but requires more VRAM)

Key flags:

-ot ".ffn_.*_exps.=CPU": Offloads all MoE expert layers to system RAM. Required for running these models on limited VRAM.
--ctx-size: Context window. Keep at 8192–32768 for stable performance.
--n-gpu-layers: Number of layers on GPU. Adjust based on available VRAM.

Hardware: A system with 2× RTX 4090 (48 GB total VRAM) + 256 GB system RAM can run V4-Flash Q4 with expert offloading at approximately 3–5 tokens/sec.

Option 4: vLLM (Production Serving)

Best for: Production deployments, multi-GPU servers, batch workloads

# Install vLLM (requires Linux and CUDA)
pip install vllm

# Serve V4-Flash (multi-GPU recommended)
python -m vllm.entrypoints.openai.api_server 
  --model deepseek-ai/DeepSeek-V4-Flash 
  --tensor-parallel-size 4 
  --dtype bfloat16 
  --gpu-memory-utilization 0.90 
  --max-model-len 32768 
  --port 8000

# For FP8 quantized weights (if available):
python -m vllm.entrypoints.openai.api_server 
  --model deepseek-ai/DeepSeek-V4-Flash 
  --quantization fp8 
  --tensor-parallel-size 2 
  --port 8000

Hardware: Minimum 4× A100 80GB or 2× H200 for V4-Flash at full FP16 with practical context lengths. V4-Pro requires 8×+ H100/H200 at FP8.

Best for: High-throughput production APIs, multi-user serving, batch inference jobs. vLLM’s PagedAttention and continuous batching maximize GPU utilization.

Option 5: SGLang (Best for Agentic Workloads)

Best for: Agentic workflows, prefix-heavy caching, high-throughput serving

# Install SGLang
pip install "sglang[all]"

# Launch server with RadixAttention (best for agentic/caching workloads)
python -m sglang.launch_server 
  --model-path deepseek-ai/DeepSeek-V4-Flash 
  --port 30000 
  --quantization fp8 
  --max-total-tokens 32768

# Query via OpenAI-compatible endpoint
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:30000/v1")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Debug this code..."}],
    max_tokens=2048
)

SGLang’s RadixAttention mechanism provides significant throughput advantages (up to ~30% higher than vLLM on H100s) for prefix-heavy workloads — exactly the pattern of agentic systems that repeatedly send long system prompts or code contexts.

Option 6: Hugging Face Transformers

Best for: Research, fine-tuning, full control over model internals

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "deepseek-ai/DeepSeek-V4-Flash"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # Automatically distributes across available GPUs
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to parse JSON with error handling."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.7,
        do_sample=True
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quantization Options Summary

Quantization	Approx. Size (V4-Flash)	Quality	Speed	Best For
FP16	~570 GB	Reference	Best	Multi-GPU servers
FP8	~285 GB	Excellent	Very good	Production (official DeepSeek quants)
GPTQ 4-bit	~145 GB	Very good	Good	Multi-GPU workstations
Q4_K_M GGUF	~140 GB	Good	Moderate	llama.cpp, Ollama, LM Studio
Q2_K GGUF	~75 GB	Acceptable	Slower	Memory-constrained deployments

Unsloth variants: Unsloth has published Dynamic 2.0 GGUF quantizations for previous DeepSeek models that significantly reduce size while preserving quality. Check unsloth.ai or the Unsloth Hugging Face organization for published V4 GGUF variants. If available, these are the recommended GGUF format for llama.cpp and Ollama local deployment.

Coding Workflow and Agent Use

Connecting to Claude Code

Claude Code supports custom base URLs and model overrides via environment variables. To use DeepSeek V4 with Claude Code:

# Linux/macOS
export ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
export ANTHROPIC_AUTH_TOKEN=your_deepseek_api_key
export ANTHROPIC_MODEL=deepseek-v4-pro

# For sub-agent tasks (use Flash for cost efficiency)
export CLAUDE_CODE_SUBAGENT_MODEL=deepseek-v4-flash

# Start Claude Code in your project directory
claude

# Windows PowerShell
$env:ANTHROPIC_BASE_URL = "https://api.deepseek.com/anthropic"
$env:ANTHROPIC_AUTH_TOKEN = "your_deepseek_api_key"
$env:ANTHROPIC_MODEL = "deepseek-v4-pro"
$env:CLAUDE_CODE_SUBAGENT_MODEL = "deepseek-v4-flash"
claude

Recommended: Use V4-Pro for primary coding and V4-Flash for sub-agent tasks. The ~3× price difference between Pro and Flash makes this configuration cost-effective for long agentic sessions.

Caveat: Ensure Claude Code is updated to a version that supports DeepSeek’s reasoning_content fields in API responses. Older versions may encounter compatibility issues with thinking-mode outputs.

Connecting to OpenClaw

OpenClaw is configured via a JSON config file at ~/.openclaw/openclaw.json (Linux/macOS) or the equivalent path on Windows:

{
  "models": {
    "providers": {
      "deepseek": {
        "baseUrl": "https://api.deepseek.com/v1",
        "apiKey": "your_deepseek_api_key",
        "api": "openai-completions",
        "models": [
          {
            "id": "deepseek-v4-pro",
            "name": "DeepSeek V4 Pro"
          },
          {
            "id": "deepseek-v4-flash",
            "name": "DeepSeek V4 Flash"
          }
        ]
      }
    }
  }
}

Ensure you are using OpenClaw version ≥ 2026.4.24 for full V4 thinking mode support.

Connecting to OpenCode

OpenCode supports interactive provider setup:

# 1. Start OpenCode in your terminal
opencode

# 2. In the OpenCode input box, type:
/connect

# 3. Select "DeepSeek" from the provider list
# 4. Enter your DeepSeek API key
# 5. Select deepseek-v4-pro or deepseek-v4-flash

Alternatively, configure via the OpenCode config file with the DeepSeek OpenAI-compatible endpoint.

Model Selection Guidance: Pro vs Flash for Agents

Task Type	Recommended Model	Rationale
Repository analysis, complex bug fixing	V4-Pro	Higher SWE-bench score (80.6% vs 79.0%)
Terminal-based agentic workflows	V4-Pro	Terminal-Bench gap is significant (67.9% vs 56.9%)
Code generation from specifications	V4-Flash	91.6% LiveCodeBench; 3× cheaper for high-volume generation
Inline code completion	V4-Flash	Faster response, lower cost per request
Sub-agent tasks in orchestrated workflows	V4-Flash	Cost efficiency in multi-agent architectures
Long-context document analysis	V4-Flash	Similar context handling; lower cost
Math/STEM reasoning	V4-Pro	More reliable on complex multi-step problems

When Not to Use DeepSeek V4

Maximum SWE-bench performance: Claude Opus 4.8 or GPT-5.5 may lead on real-world debugging.
Expert-level scientific reasoning: Gemini 3.1 Pro (GPQA 94.3%) leads on graduate-level reasoning tasks.
Native audio/video processing: Gemini 3.1 Pro, Gemini 3.5 Flash, Kimi K2.6, or Gemma 4 handle audio/video natively.
Agentic swarm workflows at scale: Kimi K2.6 supports up to 300 parallel sub-agents and 4,000 coordinated steps.
Sub-8K context, fast response, minimum cost: Smaller specialized models (Gemma 4 E4B, Qwen 3 small variants) may be more appropriate.
Enterprise SLA requirements: Proprietary models from OpenAI or Anthropic come with formal uptime guarantees and support contracts.
Heavily regulated environments: Verify DeepSeek’s data handling and privacy policies before processing sensitive data via the cloud API.

Practical Recommendations

Choose V4-Pro When

You’re running complex agentic workflows requiring terminal command execution and multi-step reasoning
Your primary use case is repository-scale debugging or pull request analysis
You need the highest available reasoning quality for STEM or mathematical problems
Cost is secondary to accuracy on complex tasks
You’re using thinking mode for extended chain-of-thought on difficult problems

Choose V4-Flash When

You’re running high-volume inference pipelines where cost per token matters
Your tasks are code generation, summarization, or explanation (not complex debugging)
You’re serving as a sub-agent in a larger orchestrated system
You need faster first-token latency for interactive use
You’re building systems where 80%+ of requests are routine (non-complex)

Choose a Quantized/Compressed Local Variant When

You require data privacy and cannot send code to external APIs
You have a multi-GPU server with 80–200 GB total VRAM for V4-Flash
You’re building an offline or air-gapped development environment
Cost over time exceeds cloud API cost (break-even typically at 2–4 months of moderate API usage)

Choose a Closed-Source Alternative When

You need the absolute highest SWE-bench performance (Claude Opus 4.8 or GPT-5.5)
You need the strongest scientific reasoning: Gemini 3.1 Pro (GPQA Diamond 94.3%)
You require formal enterprise support, SLAs, and guaranteed uptime
Your organization has compliance requirements that mandate vendor audits
You need native video/audio multimodal capabilities: use Gemini 3.1 Pro, Gemini 3.5 Flash, or Kimi K2.6
You’re processing regulated data (HIPAA, GDPR) and need a provider with signed data processing agreements

Limitations and Verification Notes

Source Transparency

Benchmark data sources:

DeepSeek-V4-Pro and V4-Flash benchmark scores (SWE-bench, LiveCodeBench, GPQA Diamond, MMLU-Pro, Terminal-Bench 2.0): Vendor-reported, from the official April 24, 2026 DeepSeek V4 release announcement and Hugging Face model cards. These have not been independently replicated by a third-party organization at the time this article was written.
Gemini 3.1 Pro benchmark scores (GPQA 94.3%, MMLU 92.6%, SWE-bench 80.6%): Vendor-reported by Google DeepMind at the February 19, 2026 launch. Source: official Google DeepMind technical report.
Gemini 3.5 Flash scores: Third-party estimates — official full benchmark reports not published at research time.
MiniMax M3 scores (MMLU-Pro 84.22%, GPQA 92.68%, LiveCodeBench 82.15%, SWE-bench Pro 59.0%): Vendor-reported by MiniMax at the June 1, 2026 launch. Independent verification pending as weights were not widely available at research time.
GLM-5.1 SWE-bench Pro 58.4%: Vendor-reported by Z.ai at the April 2026 launch. Third-party benchmark tracker data (BenchLM.ai) used for supplementary metrics.
Kimi K2.6 SWE-bench Verified 80.2%, Terminal-Bench 2.0 66.7%: Vendor-reported by Moonshot AI at the April 20, 2026 launch.
Estimates marked “(3P est.)” for GPT-5.5, Claude Opus 4.8, and Gemini 3.5 Flash: Third-party estimates based on community testing and partial official data.

Pricing data: Verified against official API pricing pages on June 4, 2026. Competitor pricing sourced from official provider documentation.

Architecture details: Sourced from the official DeepSeek V4 technical report. Details on CSA, HCA, mHC, and the Muon optimizer are from the official technical report.

What Was Not Independently Verified

The claimed 27% FLOP reduction and 10% KV cache reduction at 1M context versus V3.2 (official report figure)
Benchmark scores in thinking/non-thinking mode comparisons
Community GGUF quantization quality for V4 series (varies by contributor)
Specific token throughput numbers for local deployment configurations
MiniMax M3 benchmark scores (weights not widely available at research time)

Known Limitations of This Article

Competitor benchmarks for GPT-5.5 and Claude Opus 4.8 on several benchmarks are estimates.
Independent third-party evaluation of V4-Pro and V4-Flash on the AA Intelligence Index was not available at research time.
Kimi K2.6 and MiniMax M3 are very recently released models; their benchmark data is primarily vendor-reported.
GLM-5.1 full benchmark suite coverage is incomplete in available sources.
Local deployment hardware requirements may evolve as better quantization methods emerge.

Research Date

This article was researched and written on June 4, 2026. The AI landscape changes rapidly. Benchmark scores, model availability, pricing, and deployment tools may have changed since publication. Always verify critical information at primary sources before making production decisions.

Primary sources consulted:

DeepSeek official website
DeepSeek API documentation
DeepSeek Hugging Face collection
Google DeepMind — Gemini 3.1 Pro technical report (February 2026)
Google AI — Gemini 3.5 Flash announcement (May 2026)
Z.ai — GLM-5.1 documentation (April 2026)
Moonshot AI — Kimi K2.6 (April 2026)
MiniMax — M3 launch (June 2026)
Unsloth AI — for quantization methods and GGUF variants
Artificial Analysis — for independent benchmark methodology
Official model cards and technical reports for competing models (OpenAI, Anthropic, Google, Z.ai, MiniMax, Moonshot AI)

Have corrections, updated benchmark data, or deployment notes? This blog is community-oriented — feedback helps keep the information accurate.

Comments

Your comments help others in the community.