Qwen/Qwen3.6-35B-A3B
Smaller Qwen3.6 multimodal MoE model (35B total / 3B active) with 256 experts (8 routed + 1 shared active per token), a gated-delta-network architecture, and a 262K context window
Overview
Qwen3.6-35B-A3B is the smaller sibling in the Qwen3.6 family, sharing the same gated-delta-network MoE architecture but with 35B total parameters and 3B activated per token (256 experts; 8 routed plus 1 shared are active). With FP8 weights it fits comfortably on a single 80 GB GPU and supports the full 262K context.
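As a concrete illustration of that expert layout, here is a minimal, forward-only routing sketch: each token selects 8 of 256 routed experts, and 1 shared expert always runs. The expert sizes, the softmax-over-top-k gating, and the omission of the gated-delta layers are simplifying assumptions for illustration, not the actual Qwen3.6 implementation.

# Illustrative MoE routing sketch (not the real Qwen3.6 code):
# 8 of 256 routed experts per token, plus 1 always-on shared expert.
import torch

NUM_EXPERTS, TOP_K, D_MODEL = 256, 8, 64

router = torch.nn.Linear(D_MODEL, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Linear(D_MODEL, D_MODEL) for _ in range(NUM_EXPERTS)
)
shared = torch.nn.Linear(D_MODEL, D_MODEL)

def moe_forward(x: torch.Tensor) -> torch.Tensor:  # x: [tokens, D_MODEL]
    weights, idx = torch.topk(router(x), TOP_K, dim=-1)  # pick 8 experts per token
    weights = torch.softmax(weights, dim=-1)
    out = shared(x)  # the shared expert processes every token
    for t in range(x.shape[0]):
        for k in range(TOP_K):
            out[t] = out[t] + weights[t, k] * experts[int(idx[t, k])](x[t])
    return out

print(moe_forward(torch.randn(4, D_MODEL)).shape)  # torch.Size([4, 64])

Only the selected experts run per token, which is why the compute cost tracks the 3B active parameters rather than the 35B total.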
Prerequisites
- vLLM version: >= 0.17.0
- Hardware (BF16): 1x H200 or 2x H100
- Hardware (FP8): 1x H100/H200 or 1x MI300X/MI325X/MI355X
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
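Before launching, it is worth confirming that the installed version meets the >= 0.17.0 prerequisite. A quick check from inside the venv:

# Sanity check: confirm the installed vLLM meets the >= 0.17.0 prerequisite.
import vllm
print(vllm.__version__)  # expect 0.17.0 or newer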
Launching the Server
Single-GPU FP8
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
--max-model-len 262144 \
--reasoning-parser qwen3
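Once the server is up, a quick smoke test against the OpenAI-compatible endpoint; the default port 8000 is assumed here, so adjust if you pass --port, and the API key can be any placeholder string:

# Smoke test: list the models the running server exposes.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
for model in client.models.list():
    print(model.id)  # expect Qwen/Qwen3.6-35B-A3B-FP8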
BF16 on 2x H100 (TP2)
vllm serve Qwen/Qwen3.6-35B-A3B \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--reasoning-parser qwen3
MTP speculative decoding
vllm serve Qwen/Qwen3.6-35B-A3B \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
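To see whether the speculated tokens are actually being accepted, you can scrape the server's Prometheus endpoint. The exact counter names vary across vLLM versions, so this sketch simply filters for spec-decode entries rather than assuming specific metric names:

# Rough acceptance check for MTP: print speculative-decoding counters
# from the Prometheus endpoint. Metric names are version-dependent.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if "spec_decode" in line and not line.startswith("#"):
            print(line)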
AMD (MI300X / MI325X / MI355X)
VLLM_ROCM_USE_AITER=1 vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--trust-remote-code
Processing Ultra-Long Texts
Qwen3.6-35B-A3B natively supports 262,144 tokens. For longer inputs, apply
YaRN RoPE scaling via --hf-overrides and raise --max-model-len. Pick the
smallest factor that covers your real workload (2.0 reaches ~524K tokens,
4.0 reaches ~1M), since YaRN degrades short-context quality at higher factors.
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
--tensor-parallel-size 2 \
--max-model-len 1010000 \
--reasoning-parser qwen3 \
--hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}'
See the model card for the full parameter reference.
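For other target lengths, the factor is simply the target divided by the native 262,144-token window, rounded up so that short-context degradation stays minimal. A small illustrative helper:

# Pick the smallest YaRN factor that covers a target context length,
# given the native 262,144-token window. Illustrative helper only.
import math

NATIVE_CTX = 262_144

def yarn_factor(target_len: int) -> float:
    return float(max(1, math.ceil(target_len / NATIVE_CTX)))

print(yarn_factor(524_288))    # 2.0 -> ~524K tokens
print(yarn_factor(1_010_000))  # 4.0 -> matches the command above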
Client Usage
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
model="Qwen/Qwen3.6-35B-A3B",
messages=[{"role": "user", "content": "Explain gated delta networks in one paragraph."}],
max_tokens=512,
)
print(resp.choices[0].message.content)
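Because the server was launched with --reasoning-parser qwen3, the parsed chain of thought is returned separately from the final answer. In recent vLLM versions it appears on a reasoning_content field of the message; treat the field name as version-dependent:

# The qwen3 reasoning parser splits the model's thinking from the final
# answer; reasoning_content is the field used by recent vLLM versions.
message = resp.choices[0].message
print("thinking:", getattr(message, "reasoning_content", None))
print("answer:", message.content)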
Troubleshooting
- CUDA graph / Mamba cache size error: reduce --max-cudagraph-capture-size (default 512). See vLLM PR #34571.
- Disable reasoning: add --default-chat-template-kwargs '{"enable_thinking": false}'.
- Prefix caching (Mamba): currently experimental in "align" mode.
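The enable_thinking switch above can also be applied per request instead of server-wide: with the OpenAI client, vLLM accepts chat_template_kwargs through extra_body. A minimal sketch:

# Per-request alternative to the server-wide flag above: pass
# chat_template_kwargs via extra_body to disable thinking for one call.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Reply with a single word: ready?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)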