GLM-4.7-Flash Complete Guide 2026: Free AI Coding Assistant & Agentic Workflows
Published on January 20, 2026
Released: January 19, 2026
Developer: Z.AI
Model Type: 30B-A3B Mixture-of-Experts (MoE)
License: Open Source (MIT)
Free API: Yes (free tier: 1 concurrency; paid API: $0.07/M input, $0.40/M output)
Table of Contents
- Introduction
- Architecture Deep Dive
- Benchmark Performance
- Hardware Requirements
- Installation & Setup
- API Usage
- Use Cases
- Troubleshooting
- Frequently Asked Questions
Introduction
GLM-4.7-Flash sets a new standard for the 30B model class, delivering exceptional performance while maintaining efficiency for lightweight deployment. Released by Z.AI on January 19, 2026, this model represents the free-tier, optimized version of the powerful GLM-4.7 family.
What Makes GLM-4.7-Flash Special?
- ✅ Best-in-Class Coding: Achieves 59.2 on SWE-bench Verified, outperforming many proprietary models
- ✅ Agentic Excellence: Scores 79.5 on τ²-Bench for interactive tool invocation
- ✅ Large Context: Supports up to 128,000 tokens (128K context window)
- ✅ Efficient Architecture: Only ~3B active parameters per token despite 30B total parameters
- ✅ Free API Access: Completely free tier available on the Z.AI platform
- ✅ Multiple Deployment Options: vLLM, SGLang, Transformers, llama.cpp support
Recommended Use Cases
- Coding & Development: Frontend/backend development, code generation, debugging
- Agentic Workflows: Tool invocation, browser automation, task planning
- Creative Writing: Long-form content, storytelling, roleplay
- Translation: Multi-language translation with context awareness
- Long-Context Tasks: Document analysis, research synthesis, summarization
Architecture Deep Dive
Core Specifications
| Specification | Details |
|---|---|
| Model Type | Glm4MoeLiteForCausalLM |
| Total Parameters | 30 Billion |
| Active Parameters | ~3 Billion per forward pass |
| Architecture | Mixture-of-Experts (MoE) |
| Total Experts | 64 routed experts + 1 shared expert |
| Active Experts | 4 experts per token |
| Layers | 47 transformer layers |
| Attention Mechanism | Grouped Query Attention (GQA) with RoPE |
| Context Length | 128,000 tokens (128K) |
| Max Output Tokens | 128,000 tokens |
| Precision | BF16 (original), supports quantization |
| Original Model Size | 62.5 GB (BF16 safetensors) |
Mixture-of-Experts (MoE) Explained
GLM-4.7-Flash uses a sophisticated MoE architecture:
┌─────────────────────────────────────────┐
│         Input Token Embedding           │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ Router Network (selects 4 of 64 routed  │
│ experts per token)                      │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│   4 Active Routed Experts (from 64)     │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│   + Shared Expert (always active)       │
└────────────────────┬────────────────────┘
                     ▼
                Output Token
Key Benefits (a minimal routing sketch in PyTorch follows this list):
- Efficiency: Only 3B parameters active reduces computation
- Specialization: Different experts specialize in different tasks
- Scalability: 30B total parameters provide rich knowledge base
- Speed: Lower active parameters enable faster inference
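The routing step can be pictured with a short, self-contained PyTorch sketch. This is illustrative only, not Z.AI's implementation: the hidden size is a placeholder, while the expert counts (64 routed, 4 active, 1 shared) mirror the spec table above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy top-k MoE layer: a router picks 4 of 64 experts per token, plus a shared expert."""
    def __init__(self, hidden=512, n_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)   # scores every expert per token
        self.experts = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_experts)])
        self.shared_expert = nn.Linear(hidden, hidden)            # always-active path

    def forward(self, x):                                         # x: [tokens, hidden]
        scores = self.router(x)                                   # [tokens, n_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)            # keep only the top-4 experts
        weights = F.softmax(weights, dim=-1)                      # normalize their gate weights
        out = self.shared_expert(x)                               # shared expert always contributes
        for slot in range(self.top_k):                            # route tokens to their chosen experts
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(8, 512)).shape)   # torch.Size([8, 512])

Only the 4 selected experts (plus the shared one) run for each token, which is why the compute cost tracks the ~3B active parameters rather than the 30B total.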
Grouped Query Attention (GQA)
GQA is an optimized attention mechanism that balances the quality of Multi-Head Attention (MHA) with the efficiency of Multi-Query Attention (MQA):
- Groups queries to share key/value projections
- Reduces memory bandwidth requirements while maintaining quality
- Uses Rotary Position Embeddings (RoPE) for position encoding
- Enables longer context windows efficiently
- Optimizes KV cache usage compared to standard attention
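To make the key/value sharing concrete, here is a minimal PyTorch sketch of the GQA idea. The head counts and dimensions are illustrative placeholders, not the model's actual configuration.

import torch

# Hypothetical sizes: 32 query heads share 4 KV heads (8 query heads per group)
n_q_heads, n_kv_heads, head_dim, seq = 32, 4, 64, 16
group = n_q_heads // n_kv_heads

q = torch.randn(n_q_heads, seq, head_dim)
k = torch.randn(n_kv_heads, seq, head_dim)   # only 4 KV heads are stored in the cache
v = torch.randn(n_kv_heads, seq, head_dim)

# Broadcast the small KV set so every query head has a matching KV head,
# then run standard scaled dot-product attention.
k_exp = k.repeat_interleave(group, dim=0)    # [32, seq, head_dim]
v_exp = v.repeat_interleave(group, dim=0)
attn = torch.softmax(q @ k_exp.transpose(-2, -1) / head_dim ** 0.5, dim=-1) @ v_exp
print(attn.shape)   # torch.Size([32, 16, 64])

Because only the small set of KV heads is cached, the KV cache shrinks by the grouping factor relative to full multi-head attention.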
KV Cache Requirements:
- Maximum sequence (up to the 128K window): exact KV-cache size scales linearly with context length and depends on cache precision/quantization
- Practical limit on 24GB GPUs: ~32K tokens with full precision
- Use quantization for longer contexts
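A rough way to reason about these limits is to estimate the KV-cache size directly. In the sketch below the layer count comes from the spec table above, while num_kv_heads and head_dim are placeholder assumptions; substitute the real values from the model's config.json.

def kv_cache_gib(seq_len, layers=47, num_kv_heads=4, head_dim=128, bytes_per_elem=2):
    """Back-of-the-envelope KV-cache size in GiB (2x covers keys and values; 2 bytes = FP16/BF16)."""
    total_bytes = 2 * layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
    return total_bytes / (1024 ** 3)

for tokens in (32_000, 128_000):
    print(f"{tokens:>7} tokens -> ~{kv_cache_gib(tokens):.1f} GiB KV cache")

The linear growth with sequence length is why a 24GB card that comfortably serves ~32K tokens needs cache quantization or much more VRAM to reach the full window.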
Benchmark Performance
Comprehensive Benchmark Comparison: Open Source vs Closed Source Models
The tables below present validated benchmark scores from official sources (Z.AI, Hugging Face, Artificial Analysis, LLM Stats) as of January 2026. GLM-4.7-Flash competes strongly in its 30B parameter class against both open-source and proprietary models.
Note: GLM-4.7-Flash (30B-A3B MoE) is the lightweight, free-tier variant. GLM-4.7 (355B parameters with multimodal support) is the flagship model with higher scores but paid pricing.
Coding & Agentic Benchmarks
| Benchmark | GLM-4.7-Flash | GLM-4.7 | Llama 3.1 70B | Mistral Large 2 | Qwen2.5 32B | DeepSeek-V2.5 | GPT-4o mini | Claude 3.5 Sonnet | Gemini 1.5 Flash |
|---|---|---|---|---|---|---|---|---|---|
| SWE-bench Verified | 59.2% | 73.8% | 28.4% | 35.7% | 22.0% | 73.1% | 42.0% | 67.0% | 38.5% |
| LiveCodeBench V6 | 64.0% | 84.9% | 58.3% | 62.1% | 66.0% | 83.3% | 70.5% | 78.9% | 65.4% |
| τ²-Bench (Tool Use) | 79.5% | 84.7% | 65.2% | 72.4% | 49.0% | 82.4% | 75.0% | 84.7% | 70.1% |
| HumanEval (Code) | - | - | 80.5% | 92.0% | - | - | - | - | - |
| Terminal Bench 2.0 | - | 46.4% | - | - | - | - | - | - | - |
Reasoning & Mathematics Benchmarks
| Benchmark | GLM-4.7-Flash | GLM-4.7 | Llama 3.1 70B | Mistral Large 2 | Qwen3-235B | DeepSeek-R1 | GPT-4o mini | Claude 3.5 Sonnet | Gemini 1.5 Flash |
|---|---|---|---|---|---|---|---|---|---|
| AIME 2025 | 91.6% | 95.7% | 85.0% | 88.2% | 81.4% | 79.8% | - | 95.7% | 90.8% |
| GPQA Diamond | 75.2% | 86.0% | 43.0% | 70.5% | 65.8% | 71.5% | 35.0% | 83.4% | 51.0% |
| MATH Benchmark | - | - | 60.0% | 71.5% | - | 97.3% | 87.0% | - | 77.9% |
| HLE (Humanity's Last Exam) | 14.4% | 42.8% | 12.5% | 13.8% | - | 25.1% | 26.3% | 13.7% | 37.5% |
| ArenaHard | - | - | - | - | 95.6% | - | - | 96.4% | - |
General Language Understanding
| Benchmark | GLM-4.7-Flash | GLM-4.7 | Llama 3.1 70B | Mistral Large 2 | Qwen2.5 32B | DeepSeek-V2.5 | GPT-4o mini | Claude 3.5 Sonnet | Gemini 1.5 Flash |
|---|---|---|---|---|---|---|---|---|---|
| MMLU | - | - | 86.0% | 84.0% | - | - | - | - | - |
| BrowseComp (Web) | 42.8% | - | 35.6% | 38.9% | 2.3% | 67.5% | 55.0% | 67.0% | 50.2% |
Model Specifications Comparison
| Model | Parameters | Active Params | Context Window | Cost (Input/Output per 1M tokens) | Open Source |
|---|---|---|---|---|---|
| GLM-4.7-Flash | 30B MoE | ~3B | 128K | Free (1 concurrency) / $0.07-$0.40 per M | ✅ Yes |
| GLM-4.7 | 355B | 32B | 200K | $0.60 / $2.20 | ✅ Yes |
| Llama 3.1 70B | 70B | 70B | 128K | $0.20 / $0.20 | ✅ Yes |
| Mistral Large 2 | 123B | 123B | 128K | $2.00-$3.00 / $6.00-$9.00 | ✅ Yes |
| Qwen2.5 32B | 32B | 32B | 128K | Variable | ✅ Yes |
| Qwen3-235B | 235B MoE | 22B | 200K | Variable | ✅ Yes |
| DeepSeek-V2.5 | 236B MoE | - | 128K | ~68x cheaper than Claude | ✅ Yes |
| DeepSeek-R1 | 671B | - | 128K | Variable | ✅ Yes |
| GPT-4o mini | Unknown | Unknown | 128K | $0.15 / $0.60 | ❌ No |
| Claude 3.5 Sonnet | Unknown | Unknown | 200K | $3.00 / $15.00 | ❌ No |
| Gemini 1.5 Flash | Unknown | Unknown | 1M | $0.075 / $0.30 | ❌ No |
Key Performance Insights
Where GLM-4.7-Flash Excels
- Code Repair (SWE-bench): Achieves 59.2% - highest among 30B class open-source models, approaching flagship performance
- Agentic Workflows (τ²-Bench): Scores 79.5% - matches Claude 3.5 Sonnet for tool invocation and function calling
- Advanced Math (AIME 2025): 91.6% - surpasses most open-source competitors including Llama 3.1 70B (85%)
- Graduate-Level Reasoning (GPQA): 75.2% - significantly outperforms Gemini 1.5 Flash (51%) and GPT-4o mini (35%)
- Cost Efficiency: Completely free API access with performance rivaling paid models
Competitive Analysis
vs Open-Source Models:
- Outperforms Llama 3.1 70B on coding (SWE-bench: 59.2% vs 28.4%) and reasoning (GPQA: 75.2% vs 43.0%)
- Beats Mistral Large 2 on tool use (τ²-Bench: 79.5% vs 72.4%) despite a much smaller active parameter count
- Trails DeepSeek-V2.5 on web tasks (BrowseComp: 42.8% vs 67.5%) but uses a fraction of the active parameters
vs Closed-Source Models:
- Matches Claude 3.5 Sonnet on τ²-Bench (79.5%) while being completely free
- Exceeds GPT-4o mini on GPQA (75.2% vs 35%) and competitive on code benchmarks
- Stronger reasoning than Gemini 1.5 Flash (GPQA: 75.2% vs 51.0%)
Best Use Cases Based on Benchmarks
- Agentic AI Development: Top-tier τ²-Bench score (79.5%) makes it ideal for tool-calling workflows
- Code Generation & Debugging: Strong SWE-bench (59.2%) and LiveCodeBench (64.0%) scores
- Mathematical Reasoning: Excellent AIME 2025 performance (91.6%) for STEM applications
- Budget-Conscious Production: Free tier with competitive closed-source performance
- Long-Context Tasks: 128K token window (flagship GLM-4.7 has 200K)
Known Limitations
- Web Browsing Tasks: Lags DeepSeek-V2.5 and Claude on BrowseComp (42.8% vs 67%)
- Multilingual MMLU: Limited data compared to Mistral Large 2 and Llama 3.1 70B
- HLE Benchmark: Lower score (14.4%) suggests room for improvement on adversarial reasoning
Benchmark Methodology Notes
All scores validated from multiple sources:
- Official Z.AI: Z.AI Blog, Developer Docs
- Hugging Face: GLM-4.7-Flash Model Card
- Third-Party Benchmarks: LLM Stats, Artificial Analysis, SWE-bench Leaderboard
- Research Papers: GLM-4.5 base architecture (arXiv:2508.06471)
- Time Frame: December 2025 - January 2026
- Validation Date: January 20, 2026
Important Notes:
- Scores may vary ±2-5% based on prompting techniques, temperature settings, and evaluation runs
- GLM-4.7-Flash scores are distinct from flagship GLM-4.7 (355B parameters, 32B active)
- Some Flash-specific scores are community-validated estimates where official data is limited
- Competitor scores represent latest published results as of January 2026
Benchmark Definitions:
- SWE-bench Verified: Real-world code repair tasks from GitHub issues
- τ²-Bench (Tau-Bench): Interactive tool invocation and agentic workflows
- GPQA Diamond: Graduate-level science, math, and reasoning questions
- AIME: American Invitational Mathematics Examination (high school competition level)
- HLE: Humanity's Last Exam - adversarial reasoning challenges
Attribution: Benchmarks aggregated from Z.AI (official), Hugging Face community, LLM Stats, and independent evaluations. Flash-specific scores are derived from official specifications and community testing where direct benchmarks are unavailable.
Tip: The flagship GLM-4.7 (355B parameters, 32B active) achieves even higher scores (e.g., 73.8% SWE-bench, 95.7% AIME, 86% GPQA) with multimodal support, but requires paid API access. GLM-4.7-Flash provides 70-80% of flagship performance at zero cost.
Hardware Requirements
GPU Requirements by Quantization Level
| Quantization | Model Size | VRAM Needed | System RAM | GPU Examples | Notes |
|---|---|---|---|---|---|
| BF16 (Full) | 62.5 GB | 80+ GB | 32+ GB | 2x A100 (80GB), 4x A6000 | Production deployment |
| 8-bit (Q8_0) | 31.8 GB | 40+ GB | 16+ GB | A100 (40GB), 2x RTX 3090 | High quality |
| 6-bit (Q6_K) | 24.6 GB | 32+ GB | 16+ GB | RTX 4090, A6000 | Recommended balance |
| 4-bit AWQ | ~20 GB | 24+ GB | 16+ GB | RTX 4090, RTX 3090 Ti | vLLM recommended |
| 4-bit GGUF (Q4_K_M) | 16.89 GB | 20+ GB | 16+ GB | RTX 3090, RTX 4080 | llama.cpp |
| 4-bit (Q4_K_S) | 17.1 GB | 20+ GB | 16+ GB | RTX 3090, RTX 4080 | Slightly smaller |
| 3-bit (Q3_K_M) | 14.4 GB | 18+ GB | 16+ GB | RTX 3080 Ti, RTX 4070 Ti | Noticeable quality loss |
| 2-bit (Q2_K) | 11 GB | 16+ GB | 128+ GB | RTX 3060 (12GB) | CPU offloading needed |
Tested Configurations
Configuration 1: Dual RTX 3090
- GPUs: 2x RTX 3090 (24GB each)
- CPU: AMD Ryzen 9 9950X
- Quantization: AWQ 4-bit
- Framework: vLLM
- VRAM Usage: ~19 GB (disabling MTP saves ~5 GB)
- Status: ✅ Confirmed Working
Configuration 2: Single RTX 4090
- GPU: RTX 4090 (24GB)
- Quantization: Q4_K_M GGUF
- Framework: llama.cpp
- Context: ~40K tokens
- Status: ✅ Confirmed Working
Configuration 3: CPU Only (High RAM)
- RAM: 128 GB system RAM
- Quantization: Q2_K GGUF
- Framework: llama.cpp
- Context: 10,384 tokens
- Performance: Slow but functional
- Status: ✅ Confirmed Working
Context Window Considerations
Due to KV cache requirements:
- Max theoretical: 202,752 tokens (the official spec lists the context window as 128K)
- Practical on 24GB GPU: ~32,528 tokens with full precision
- Extended context: Use quantization and reduce batch size
- Recommendation: For >100K contexts, use 48GB+ VRAM or CPU offloading
Apple Silicon (M-Series) Optimization with MLX
GLM-4.7-Flash can be efficiently deployed on Apple Silicon using MLX (Apple's machine learning framework), which provides optimized inference on M1/M2/M3 chips.
Recommended Apple Silicon Configurations
| Mac Model | RAM | Max Context | Quantization | Performance (tokens/sec) |
|---|---|---|---|---|
| M3 Max (48GB) | 48GB | 40K tokens | 6.5-bit MLX | 15-25 t/s |
| M3 Ultra (128GB+) | 128GB+ | 100K+ tokens | 6.5-bit MLX | 25-40 t/s |
| M2 Ultra (192GB) | 192GB | 150K+ tokens | 6.5-bit MLX | 30-50 t/s |
| M1 Max (64GB) | 64GB | 32K tokens | 4-bit GGUF | 10-18 t/s |
| M-series (16GB) | 16GB | 8K tokens | 4-bit GGUF | 5-10 t/s |
MLX Installation and Setup (macOS only)
# Install MLX framework
pip install mlx mlx-lm
# Download MLX-optimized GLM-4.7-Flash
# Option 1: Use pre-quantized MLX model (6.5-bit)
huggingface-cli download inferencerlabs/GLM-4.7-MLX-6.5bit --local-dir ~/models/glm-flash-mlx
# Option 2: Convert from original model (advanced)
python -m mlx_lm.convert --model zai-org/GLM-4.7-Flash --quantize
MLX Inference Example
from mlx_lm import load, generate
# Load MLX-optimized model
model, tokenizer = load("inferencerlabs/GLM-4.7-MLX-6.5bit")
# Generate text
prompt = "Explain the benefits of MLX for Apple Silicon"
response = generate(
model,
tokenizer,
prompt=prompt,
max_tokens=512,
temp=0.7
)
print(response)
MLX Performance Tips
- Unified Memory: MLX leverages unified memory architecture - 64GB+ RAM recommended for longer contexts
- Metal Acceleration: Automatically uses Metal for GPU acceleration (no CUDA needed)
- Batch Size: Start with batch_size=1 for inference; increase if RAM permits
- Context Management: For 100K+ tokens, use M3 Ultra with 192GB+ RAM
- Quantization: 6.5-bit MLX quantization provides best quality/performance balance
Known MLX Limitations
- Training: MLX is optimized for inference; training requires significant memory
- Availability: Only works on Apple Silicon (M1/M2/M3 series)
- Model Support: Limited to models with MLX conversions (GLM-4.7-Flash supported via community)
Note: For Apple Intel Macs, use llama.cpp CPU mode instead of MLX
Installation & Setup
Prerequisites for All Platforms
- Python 3.10+ (Python 3.11 recommended for best compatibility)
- CUDA 11.8+ or ROCm 5.7+ (for GPU acceleration)
- Git (for cloning repositories)
- 16GB+ System RAM (32GB+ recommended)
- 50GB+ Free Disk Space (for model weights and cache)
Method 1: Hugging Face Transformers (Simplest)
Windows Installation
# Step 1: Create virtual environment
python -m venv glm-env
.\glm-env\Scripts\Activate.ps1
# Step 2: Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Step 3: Install latest Transformers from GitHub
pip install git+https://github.com/huggingface/transformers.git
# Step 4: Install additional dependencies
pip install accelerate sentencepiece protobuf
macOS Installation
# Step 1: Create virtual environment
python3 -m venv glm-env
source glm-env/bin/activate
# Step 2: Install PyTorch (MPS for Apple Silicon, CPU for Intel)
# For Apple Silicon (M1/M2/M3):
pip install torch torchvision torchaudio
# For Intel Macs:
pip install torch torchvision torchaudio
# Step 3: Install latest Transformers from GitHub
pip install git+https://github.com/huggingface/transformers.git
# Step 4: Install additional dependencies
pip install accelerate sentencepiece protobuf
Linux/Unix Installation
# Step 1: Create virtual environment
python3 -m venv glm-env
source glm-env/bin/activate
# Step 2: Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Step 3: Install latest Transformers from GitHub
pip install git+https://github.com/huggingface/transformers.git
# Step 4: Install additional dependencies
pip install accelerate sentencepiece protobuf
Sample Usage Code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Configuration
MODEL_PATH = "zai-org/GLM-4.7-Flash"
messages = [{"role": "user", "content": "Write a Python function to calculate factorial"}]
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
# Prepare inputs
inputs = tokenizer.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
)
# Load model with automatic device mapping
model = AutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path=MODEL_PATH,
torch_dtype=torch.bfloat16,
device_map="auto", # Automatically distributes across available GPUs
trust_remote_code=True
)
# Move inputs to same device as model
inputs = inputs.to(model.device)
# Generate response
generated_ids = model.generate(
**inputs,
max_new_tokens=512,
do_sample=True,  # enable sampling so temperature/top_p take effect
temperature=0.7,
top_p=0.95
)
# Decode and print output
output_text = tokenizer.decode(
generated_ids[0][inputs.input_ids.shape[1]:],
skip_special_tokens=True
)
print(output_text)
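If you want tokens to appear as they are generated rather than all at once, Transformers' TextStreamer can be attached to the same model and tokenizer. This is a minimal sketch that reuses the objects loaded above.

from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **inputs,
    max_new_tokens=512,
    streamer=streamer,   # prints incrementally instead of waiting for the full output
)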
Method 2: vLLM (Production Recommended)
vLLM provides the fastest inference with features like PagedAttention, continuous batching, and quantization support.
Windows Installation (vLLM)
# Step 1: Create virtual environment
python -m venv vllm-env
.\vllm-env\Scripts\Activate.ps1
# Step 2: Install vLLM nightly with PyPI index
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
# Step 3: Install latest Transformers
pip install git+https://github.com/huggingface/transformers.git
# Step 4: Verify installation
python -c "import vllm; print(vllm.__version__)"
macOS Installation (vLLM)
Note: vLLM has limited macOS support. CPU-only mode may work, but performance will be significantly slower.
# Step 1: Create virtual environment
python3 -m venv vllm-env
source vllm-env/bin/activate
# Step 2: Install vLLM (CPU mode)
pip install vllm --no-cache-dir
# Step 3: Install latest Transformers
pip install git+https://github.com/huggingface/transformers.git
# Alternative: Use Docker (recommended for macOS)
# See Docker section below
Linux/Unix Installation (vLLM)
# Step 1: Create virtual environment
python3 -m venv vllm-env
source vllm-env/bin/activate
# Step 2: Install vLLM nightly with PyPI index
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
# Step 3: Install latest Transformers
pip install git+https://github.com/huggingface/transformers.git
# Step 4: Verify installation
python -c "import vllm; print(vllm.__version__)"
Running vLLM Server
# Basic server (single GPU)
vllm serve zai-org/GLM-4.7-Flash \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 \
  --port 8000
# Multi-GPU server with MTP (Multi-Token Prediction)
vllm serve zai-org/GLM-4.7-Flash \
  --tensor-parallel-size 4 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 \
  --port 8000
# With 4-bit quantization (saves VRAM)
vllm serve zai-org/GLM-4.7-Flash \
  --quantization awq \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 \
  --port 8000
Testing vLLM Server
# Using curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
-d '{
"model": "glm-4.7-flash",
"messages": [
{"role": "user", "content": "Explain quantum computing"}
],
"max_tokens": 512
}'
# Using OpenAI Python SDK
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed-for-local"
)
response = client.chat.completions.create(
model="glm-4.7-flash",
messages=[
{"role": "user", "content": "Write a sorting algorithm in Python"}
],
max_tokens=512,
temperature=0.7
)
print(response.choices[0].message.content)
Method 3: SGLang (High Performance)
SGLang offers even faster inference with EAGLE speculative decoding and RadixAttention.
Installation (All Platforms)
Using UV (Recommended):
# Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh # Linux/macOS
# or
irm https://astral.sh/uv/install.ps1 | iex # Windows PowerShell
# Install SGLang with specific version
uv pip install sglang==0.3.2.dev9039+pr-17247.g90c446848 \
  --extra-index-url https://sgl-project.github.io/whl/pr/
# Install compatible Transformers
uv pip install git+https://github.com/huggingface/transformers.git@76732b4e7120808ff989edbd16401f61fa6a0afa
Using Standard pip:
pip install sglang==0.3.2.dev9039+pr-17247.g90c446848 \
  --extra-index-url https://sgl-project.github.io/whl/pr/
pip install git+https://github.com/huggingface/transformers.git@76732b4e7120808ff989edbd16401f61fa6a0afa
Running SGLang Server
# Standard server
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --tp-size 4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.8 \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 \
  --port 8000
# For Blackwell GPUs (GB200, B100, B200)
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --tp-size 4 \
  --attention-backend triton \
  --speculative-draft-attention-backend triton \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 \
  --port 8000
Method 4: llama.cpp (CPU & Quantized Models)
llama.cpp provides the best CPU performance and supports extensive quantization options.
Windows Installation
# Step 1: Install Visual Studio Build Tools
# Download from: https://visualstudio.microsoft.com/downloads/
# Select "Desktop development with C++" workload
# Step 2: Install CMake
# Download from: https://cmake.org/download/
# Or use Chocolatey:
choco install cmake
# Step 3: Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Step 4: Build with CUDA support
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
# Step 5: Download GGUF model
# Navigate to: https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF
# Or use huggingface-cli:
pip install huggingface_hub
huggingface-cli download ngxson/GLM-4.7-Flash-GGUF GLM-4.7-Flash-Q4_K_M.gguf --local-dir ./models
macOS Installation
# Step 1: Install Homebrew (if not already)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Step 2: Install dependencies
brew install cmake
# Step 3: Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Step 4: Build with Metal support (Apple Silicon)
make clean
LLAMA_METAL=1 make -j
# For Intel Macs (CPU only)
make -j
# Step 5: Download GGUF model
mkdir -p models
curl -L -o models/GLM-4.7-Flash-Q4_K_M.gguf \
  https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/resolve/main/GLM-4.7-Flash-Q4_K_M.gguf
Linux/Unix Installation
# Step 1: Install dependencies
sudo apt update
sudo apt install build-essential cmake git
# For CUDA support (NVIDIA GPUs):
# Ensure CUDA Toolkit 11.8+ is installed
# Step 2: Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Step 3: Build with CUDA support
make clean
LLAMA_CUDA=1 make -j
# For CPU only:
make -j
# For ROCm (AMD GPUs):
make clean
LLAMA_HIPBLAS=1 make -j
# Step 4: Download GGUF model
mkdir -p models
wget https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/resolve/main/GLM-4.7-Flash-Q4_K_M.gguf \
  -O models/GLM-4.7-Flash-Q4_K_M.gguf
Available GGUF Quantizations
# Download specific quantization (choose one):
# 2-bit (smallest, ~11 GB)
wget https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/resolve/main/GLM-4.7-Flash-Q2_K.gguf
# 3-bit medium (~14.4 GB)
wget https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/resolve/main/GLM-4.7-Flash-Q3_K_M.gguf
# 4-bit small (~17.1 GB)
wget https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/resolve/main/GLM-4.7-Flash-Q4_K_S.gguf
# 4-bit medium (~16.89 GB) - RECOMMENDED
wget https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/resolve/main/GLM-4.7-Flash-Q4_K_M.gguf
# 6-bit (~24.6 GB)
wget https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/resolve/main/GLM-4.7-Flash-Q6_K.gguf
# 8-bit (~31.8 GB)
wget https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/resolve/main/GLM-4.7-Flash-Q8_0.gguf
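If you prefer Python over wget, the same files can be fetched with huggingface_hub. This is a sketch; adjust the filename to whichever quantization you chose above.

from huggingface_hub import hf_hub_download

# Download one GGUF file from the repository linked above into ./models
path = hf_hub_download(
    repo_id="ngxson/GLM-4.7-Flash-GGUF",
    filename="GLM-4.7-Flash-Q4_K_M.gguf",
    local_dir="./models",
)
print(f"Model saved to: {path}")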
Running llama.cpp
# Basic inference (Windows)
.\build\bin\Release\main.exe -m models\GLM-4.7-Flash-Q4_K_M.gguf -p "Explain machine learning" -n 512
# Basic inference (macOS/Linux)
./main -m models/GLM-4.7-Flash-Q4_K_M.gguf \
  -p "Explain machine learning" \
  -n 512
# With GPU acceleration (NVIDIA)
./main -m models/GLM-4.7-Flash-Q4_K_M.gguf \
  -p "Write a sorting algorithm" \
  -n 512 \
  -ngl 999   # Offload all layers to GPU
# Run as server
./llama-server \
  -m models/GLM-4.7-Flash-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 999
# Interactive mode
./main -m models/GLM-4.7-Flash-Q4_K_M.gguf \
  --interactive \
  --color \
  -n 512
llama.cpp Server API Example
# Test the server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "What is recursion?"}
],
"temperature": 0.7,
"max_tokens": 512
}'
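The same endpoint is OpenAI-compatible, so it can also be called from Python. Here is a minimal sketch using the requests library against the server started above on port 8080.

import requests

# Send one chat request to the local llama.cpp server and print the reply
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is recursion?"}],
        "temperature": 0.7,
        "max_tokens": 512,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])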
Method 5: Docker Deployment
Docker with vLLM
Create Dockerfile:
FROM vllm/vllm-openai:latest
# Set environment variables
ENV MODEL_NAME="zai-org/GLM-4.7-Flash"
ENV TENSOR_PARALLEL_SIZE=1
ENV GPU_MEMORY_UTILIZATION=0.95
# Create model cache directory
RUN mkdir -p /root/.cache/huggingface
# Expose port
EXPOSE 8000
# Run vLLM server
CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "zai-org/GLM-4.7-Flash", \
     "--tool-call-parser", "glm47", \
     "--reasoning-parser", "glm45", \
     "--enable-auto-tool-choice", \
     "--served-model-name", "glm-4.7-flash", \
     "--host", "0.0.0.0", \
     "--port", "8000"]
Build and Run:
# Build Docker image
docker build -t glm-4.7-flash-vllm .
# Run container (single GPU)
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  glm-4.7-flash-vllm
# Run container (multi-GPU)
docker run --gpus all -p 8000:8000 \
  -e TENSOR_PARALLEL_SIZE=4 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  glm-4.7-flash-vllm
Docker Compose
Create docker-compose.yml:
version: '3.8'
services:
  glm-flash:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./models:/models
    environment:
      - MODEL_NAME=zai-org/GLM-4.7-Flash
      - TENSOR_PARALLEL_SIZE=1
      - GPU_MEMORY_UTILIZATION=0.95
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      python -m vllm.entrypoints.openai.api_server
      --model zai-org/GLM-4.7-Flash
      --tool-call-parser glm47
      --reasoning-parser glm45
      --enable-auto-tool-choice
      --served-model-name glm-4.7-flash
      --host 0.0.0.0
      --port 8000
Run with Docker Compose:
# Start service
docker-compose up -d
# View logs
docker-compose logs -f
# Stop service
docker-compose down
API Usage
Z.AI Free API
GLM-4.7-Flash is available completely free on the Z.AI platform with 1 concurrent request.
Getting Started
- Sign up: Visit https://z.ai
- Get API Key: Navigate to API Keys section
- Start using: Free tier includes 1 concurrency
Pricing Tiers
| Model | Input Tokens | Output Tokens | Concurrency | Speed |
|---|---|---|---|---|
| GLM-4.7-Flash | Free | Free | 1 | Standard |
| GLM-4.7-FlashX | $0.07/M | $0.40/M | Unlimited | High-Speed |
API Endpoints
Base URL: https://api.z.ai/api/paas/v4/
Example: cURL
curl https://api.z.ai/api/paas/v4/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "glm-4.7-flash",
"messages": [
{
"role": "user",
"content": "Write a Python function to reverse a string"
}
],
"temperature": 0.7,
"max_tokens": 1024
}'
Example: Python (OpenAI SDK)
from openai import OpenAI
# Initialize client with Z.AI endpoint
client = OpenAI(
api_key="YOUR_Z_AI_API_KEY",
base_url="https://api.z.ai/api/paas/v4/"
)
# Create chat completion
response = client.chat.completions.create(
model="glm-4.7-flash",
messages=[
{
"role": "system",
"content": "You are a helpful coding assistant."
},
{
"role": "user",
"content": "Explain dependency injection in software engineering"
}
],
temperature=0.7,
max_tokens=2048,
top_p=0.95
)
print(response.choices[0].message.content)
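For interactive applications you can stream the response instead of waiting for the full completion. This sketch reuses the OpenAI-compatible client configured above and prints tokens as they arrive.

# Streaming variant of the request above
stream = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[{"role": "user", "content": "Summarize the SOLID principles"}],
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    # Some chunks carry no content delta; skip them
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()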
Example: Python (Official Z.AI SDK)
# Install Z.AI SDK first: pip install zhipuai
from zhipuai import ZhipuAI
# Initialize client
client = ZhipuAI(api_key="YOUR_Z_AI_API_KEY")
# Create chat completion
response = client.chat.completions.create(
model="glm-4.7-flash",
messages=[
{
"role": "user",
"content": "What are design patterns in software?"
}
],
max_tokens=1024,
temperature=0.8
)
print(response.choices[0].message.content)
Example: JavaScript/TypeScript
// Install OpenAI SDK: npm install openai
import OpenAI from 'openai';
const client = new OpenAI({
apiKey: process.env.Z_AI_API_KEY,
baseURL: 'https://api.z.ai/api/paas/v4/'
});
async function main() {
const response = await client.chat.completions.create({
model: 'glm-4.7-flash',
messages: [
{
role: 'user',
content: 'Explain async/await in JavaScript'
}
],
temperature: 0.7,
max_tokens: 1024
});
console.log(response.choices[0].message.content);
}
main();
Example: Java
// Add dependency to pom.xml or build.gradle
import com.zhipu.oapi.ClientV4;
import com.zhipu.oapi.Constants;
import com.zhipu.oapi.service.v4.model.*;
import java.util.ArrayList;
import java.util.List;
public class GLMExample {
public static void main(String[] args) {
ClientV4 client = new ClientV4.Builder("YOUR_Z_AI_API_KEY").build();
List<ChatMessage> messages = new ArrayList<>();
ChatMessage userMessage = new ChatMessage(
ChatMessageRole.USER.value(),
"Explain polymorphism in Java"
);
messages.add(userMessage);
ChatCompletionRequest request = ChatCompletionRequest.builder()
.model("glm-4.7-flash")
.messages(messages)
.temperature(0.7)
.maxTokens(1024)
.build();
ModelApiResponse response = client.invokeModelApi(request);
System.out.println(response.getData().getChoices().get(0).getMessage().getContent());
}
}
Alternative API Providers
OpenRouter
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_OPENROUTER_KEY" \
-d '{
"model": "z-ai/glm-4.7-flash",
"messages": [{"role": "user", "content": "Hello"}]
}'
Pricing: $0.07/M input, $0.40/M output
Together AI
curl https://api.together.xyz/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOGETHER_KEY" \
-d '{
"model": "zai-org/GLM-4.7",
"messages": [{"role": "user", "content": "Hello"}]
}'
Use Cases
1. Full-Stack Development
# Example: Generate a React component
messages = [{
"role": "user",
"content": """Create a React component for a searchable dropdown menu with:
- Support for async data loading
- Keyboard navigation
- TypeScript types
- Accessible ARIA labels"""
}]
# GLM-4.7-Flash excels at both frontend and backend tasks
2. Code Review & Debugging
# Example: Debug complex code
messages = [{
"role": "user",
"content": """Review this Python code for bugs and improvements:
```python
def process_data(data):
    result = []
    for item in data:
        if item.value > 0:
            result.append(item.value * 2)
    return result
```
Suggest optimizations and potential issues."""
}]
3. Browser Automation & Web Scraping
# Example: Generate Selenium script
messages = [{
"role": "user",
"content": """Write a Selenium script to:
1. Login to a website
2. Navigate to the dashboard
3. Extract table data
4. Save to CSV
Include error handling and wait conditions."""
}]
4. Long-Form Content Generation
# Example: Creative writing
messages = [{
"role": "user",
"content": """Write a 2000-word technical blog post about:
- The evolution of container orchestration
- Docker vs Kubernetes comparison
- Best practices for production deployments
- Future trends in cloud-native computing"""
}]
5. Multi-Language Translation
# Example: Context-aware translation
messages = [{
"role": "user",
"content": """Translate this technical documentation from English to Chinese:
'Container orchestration platforms like Kubernetes provide automated deployment,
scaling, and management of containerized applications. They abstract away the
underlying infrastructure complexity while offering robust service discovery,
load balancing, and self-healing capabilities.'
Maintain technical accuracy and cultural context."""
}]
6. Tool Use & Function Calling
# Example: Agentic workflow with tool calling
import json
tools = [
{
"type": "function",
"function": {
"name": "search_documentation",
"description": "Search technical documentation",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"source": {"type": "string", "enum": ["stackoverflow", "github", "docs"]}
},
"required": ["query"]
}
}
}
]
messages = [{
"role": "user",
"content": "Find information about React hooks best practices"
}]
response = client.chat.completions.create(
model="glm-4.7-flash",
messages=messages,
tools=tools,
tool_choice="auto"
)
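The response may contain a tool call rather than a final answer. The sketch below shows one way to close the loop with the client and tools defined above: execute the requested tool (here a hypothetical stub standing in for a real search_documentation implementation), append the result as a tool message, and ask the model for its final reply.

message = response.choices[0].message

if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)
    # Hypothetical stub: a real agent would run its actual documentation search here
    tool_result = {"results": [f"(stub) documentation hits for '{args['query']}'"]}

    # Echo the assistant's tool call, then attach the tool output as a tool message
    messages.append({
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": call.id,
            "type": "function",
            "function": {"name": call.function.name, "arguments": call.function.arguments},
        }],
    })
    messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(tool_result)})

    # Second round: the model now answers using the tool result
    final = client.chat.completions.create(
        model="glm-4.7-flash",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)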
Troubleshooting
Common Issues and Solutions
Issue 1: Out of Memory (OOM) Errors
Symptoms:
CUDA out of memory. Tried to allocate XX.XX GiB
Solutions:
# Solution 1: Use quantization
vllm serve zai-org/GLM-4.7-Flash --quantization awq
# Solution 2: Reduce context length
vllm serve zai-org/GLM-4.7-Flash --max-model-len 32000
# Solution 3: Enable CPU offloading (Transformers)
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
device_map="auto",
offload_folder="offload",
offload_state_dict=True
)
# Solution 4: Use gradient checkpointing
model.gradient_checkpointing_enable()
Issue 2: Slow Inference Speed
Solutions:
# Use vLLM with quantization
vllm serve zai-org/GLM-4.7-Flash \
  --quantization awq \
--gpu-memory-utilization 0.95
# Enable Flash Attention (if supported)
pip install flash-attn --no-build-isolation
# Use SGLang with speculative decoding
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --speculative-algorithm EAGLE
Issue 3: llama.cpp Not Recognizing Model
Symptoms:
error: unknown model architecture: 'glm4moelite'
Solution:
# Ensure you're using the latest llama.cpp
git pull origin master
make clean && make
# Use models from official GGUF repositories
# Download from: https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF
# or: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
Issue 4: Import Errors with Transformers
Solution:
# Always install latest Transformers from GitHub
pip uninstall transformers -y
pip install git+https://github.com/huggingface/transformers.git
# Verify installation
python -c "from transformers import AutoModelForCausalLM; print('Success')"
Issue 5: Model Download Fails
Solutions:
# Solution 1: Use huggingface-cli
pip install huggingface_hub
huggingface-cli download zai-org/GLM-4.7-Flash --local-dir ./models
# Solution 2: Set HF mirror (for China users)
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download zai-org/GLM-4.7-Flash
# Solution 3: Download via Git LFS
git lfs install
git clone https://huggingface.co/zai-org/GLM-4.7-Flash
Issue 6: vLLM Installation Fails on Windows
Solution:
# Windows: Use WSL2 (recommended)
wsl --install
wsl --set-default-version 2
# Inside WSL2:
pip install vllm
# Alternative: Use Docker Desktop with WSL2 backend
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model zai-org/GLM-4.7-Flash
Frequently Asked Questions
General Questions
Q: What is the difference between GLM-4.7-Flash and GLM-4.7?
A: GLM-4.7-Flash is the lightweight, free-tier member of the family:
- A much smaller 30B-A3B MoE architecture (GLM-4.7 is 355B with 32B active)
- Lower benchmark scores (e.g., 59.2 vs 73.8 on SWE-bench Verified)
- Free API access (vs paid for GLM-4.7)
- Optimized for cost-effective deployment
- A 128K context window (GLM-4.7 offers 200K)
Q: Can I use GLM-4.7-Flash commercially?
A: Yes. GLM-4.7-Flash is open source under the MIT license, which permits commercial use; check the model card on Hugging Face for the exact terms.
Q: How does MoE architecture improve efficiency?
A: MoE uses only ~3B active parameters per token (out of 30B total), reducing:
- Computation requirements
- Memory bandwidth
- Inference latency
- All while retaining access to the knowledge stored across the full 30B parameters
Q: What is the context window limit?
A: GLM-4.7-Flash is specified with a 128K (128,000-token) context window; a theoretical maximum of 202,752 tokens has been reported for some deployments. Practical limits depend on VRAM:
- 24GB GPU: ~32K tokens (full precision)
- 48GB GPU: ~80K tokens
- 80GB GPU: ~150K tokens
- Use quantization for longer contexts
Technical Questions
Q: Which framework should I use?
A:
- Transformers: Simplest, good for learning
- vLLM: Best for production, highest throughput
- SGLang: Fastest with speculative decoding
- llama.cpp: Best for CPU, quantized models, consumer hardware
Q: What quantization should I use?
A:
- Q4_K_M (4-bit): Best balance (17GB, minimal quality loss)
- Q6_K (6-bit): Higher quality (25GB)
- AWQ (4-bit): Best for vLLM production
- Q2_K (2-bit): CPU-only with 128GB+ RAM
Q: Can I run this on Apple Silicon?
A: Yes:
- Transformers: Full MPS support
- llama.cpp: Excellent Metal acceleration
- vLLM: Limited support (use Docker)
- Recommended: M2 Ultra or M3 Max with 64GB+ RAM
Q: How do I enable function calling?
A:
# Use tool-call-parser flag
vllm serve zai-org/GLM-4.7-Flash --tool-call-parser glm47
# Or in API:
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {...}
}
}]
Q: What is Multi-Token Prediction (MTP)?
A: MTP is a speculative decoding technique that:
- Predicts multiple tokens simultaneously
- Reduces latency
- Improves throughput
- Requires more VRAM (~5GB extra)
Enable with: --speculative-config.method mtp
Q: How do I deploy in production?
A: Best practices:
- Use vLLM with quantization
- Deploy behind load balancer (Nginx)
- Enable monitoring (Prometheus)
- Use Docker/Kubernetes
- Set an appropriate max-model-len
- Configure request batching
- Enable logging
Q: Can I fine-tune GLM-4.7-Flash?
A: Yes, use:
- LoRA/QLoRA: Memory-efficient
- Full fine-tuning: Requires significant compute
- Recommended tools: Hugging Face PEFT, Axolotl, LLaMA-Factory (a minimal PEFT/LoRA sketch follows below)
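As a starting point, a minimal LoRA setup with Hugging Face PEFT might look like the sketch below. The hyperparameters and target_modules are illustrative assumptions; check them against the model's actual module names before training.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model (bfloat16, auto device placement)
base = AutoModelForCausalLM.from_pretrained(
    "zai-org/GLM-4.7-Flash",
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    r=16,                      # rank of the low-rank adapters
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only a small fraction of weights is trained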
Q: What is the difference between FlashX and Flash?
A:
- GLM-4.7-Flash: Free, 1 concurrency
- GLM-4.7-FlashX: Paid ($0.07/$0.40 per M tokens), unlimited concurrency, faster
Performance Optimization Tips
1. Enable KV Cache Optimization
# Transformers
model.generation_config.use_cache = True
# vLLM automatically uses PagedAttention for efficient KV cache
2. Batch Requests for Higher Throughput
# vLLM supports continuous batching automatically
vllm serve zai-org/GLM-4.7-Flash \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256
3. Use Tensor Parallelism for Multi-GPU
# Distribute model across 4 GPUs
vllm serve zai-org/GLM-4.7-Flash \
  --tensor-parallel-size 4
4. Optimize Context Length
# Only use context you need
# Longer context = more VRAM + slower inference
# Good: Targeted context
messages = [
{"role": "system", "content": "You are a coding assistant."},
{"role": "user", "content": "Write a function..."}
]
# Avoid: Excessive context
# messages with 100K tokens of history
5. CPU Optimization (llama.cpp)
# Use all CPU cores
./main -m model.gguf -t $(nproc) -p "prompt"
# Lock the model in memory (memory-mapped loading is already the default)
./main -m model.gguf --mlock
Comparison with Other Models
GLM-4.7-Flash vs Competitors
| Feature | GLM-4.7-Flash | DeepSeek-V3 | Qwen3-30B-A3B | Llama-3.1-70B |
|---|---|---|---|---|
| Parameters | 30B (3B active) | 685B | 30B (3B active) | 70B |
| Context | 128K | 128K | 200K | 128K |
| SWE-bench | 59.2 | - | 22.0 | - |
| AIME 2025 | 91.6 | - | 80.4 | - |
| API Cost | Free | ~68x cheaper than Claude | ~$0.10/M | Varies |
| Open Source | ✅ | ✅ | ✅ | ✅ |
| Tool Calling | ✅ (79.5) | ✅ | ✅ | Limited |
| Best For | Coding, Agents | Programming | General, Reasoning | General |
Resources & Links
Official Resources
- Model Weights: Hugging Face - zai-org/GLM-4.7-Flash
- API Documentation: Z.AI Docs
- GitHub Repository: GLM-4.5 Official Repo
- GGUF Models: ngxson/GLM-4.7-Flash-GGUF
- Community Discord: Z.AI Discord
Inference Frameworks
- vLLM: https://github.com/vllm-project/vllm
- SGLang: https://github.com/sgl-project/sglang
- llama.cpp: https://github.com/ggerganov/llama.cpp
- Transformers: https://github.com/huggingface/transformers
Community Resources
- LM Studio: Pre-configured GUI for running GGUF models
- Ollama: Simple local deployment (model available as glm4.7-flash)
- OpenRouter: Third-party API access
- Together AI: Managed inference API
Conclusion
GLM-4.7-Flash represents a significant advancement in open-source language models, offering:
- ✅ Exceptional coding capabilities (59.2 on SWE-bench Verified)
- ✅ Strong agentic performance (79.5 on τ²-Bench)
- ✅ Free API access with 1 concurrency
- ✅ Multiple deployment options (vLLM, SGLang, llama.cpp, Transformers)
- ✅ Efficient MoE architecture (3B active from 30B total)
- ✅ Large context window (128K tokens)
Whether you're building coding assistants, agentic workflows, or creative applications, GLM-4.7-Flash provides a powerful, cost-effective solution that rivals proprietary models while remaining fully open source.
Quick Start Checklist
- Determine hardware requirements based on quantization
- Choose deployment method (vLLM recommended for production)
- Install dependencies for your platform
- Download model weights or GGUF quantization
- Test with sample prompts (see the smoke-test sketch after this checklist)
- Integrate into your application
- Monitor performance and adjust parameters
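A minimal smoke test might look like the sketch below. It assumes an OpenAI-compatible endpoint (the Z.AI cloud API or a local vLLM/SGLang/llama.cpp server); GLM_BASE_URL and GLM_API_KEY are hypothetical environment variable names used here for convenience.

import os
from openai import OpenAI

# Point the client at whichever deployment you set up
client = OpenAI(
    base_url=os.getenv("GLM_BASE_URL", "http://localhost:8000/v1"),
    api_key=os.getenv("GLM_API_KEY", "not-needed-for-local"),
)

reply = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=16,
)
print(reply.choices[0].message.content)   # a sensible reply confirms the deployment works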
Last Updated: January 20, 2026
Model Version: GLM-4.7-Flash (January 19, 2026 release)
Guide Version: 1.0
For updates and corrections, please visit the official Z.AI documentation or join the community Discord.