Qwen3.6-35B-Claude-4.6-Opus Distilled Model Explained: Training, Evaluation, and Multi-Platform Deployment

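Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled is a reasoning-focused SFT fine-tune of Qwen3.6-35B-A3B, trained on chain-of-thought data in the style of Claude Opus 4.6. This article covers the model's training details, data composition, benchmark results, and several ways to run inference locally or on a server.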

Model Overview

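Item	Details
Developer	@hesamation
Base model	Qwen/Qwen3.6-35B-A3B (MoE architecture, 35 billion total parameters)
Fine-tuning method	LoRA supervised fine-tuning
Training tool	Unsloth
Quantization tool	llama.cpp
Inference format	GGUF
License	Apache 2.0
HuggingFace downloads	89,100+ per month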

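Note: this fine-tune used text data only. Image and video capabilities are inherited from the Qwen3.6 base model but were not reinforced, so treat this as a text-only reasoning model.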

Training Details

Training Configuration

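Parameter	Value
Fine-tuning method	LoRA supervised fine-tuning
LoRA targets	Attention modules only
LoRA rank / alpha	32 / 32
Micro-batch size	1
Gradient accumulation	32
Epochs	2
Completed steps	762 / 762
Final training loss	0.336
Dataset max tokens	8192
Max sequence length	32768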
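
To make the batch settings concrete, the arithmetic below works out the effective batch size and an upper bound on how many sequences the run processed. It is only a sketch: it assumes a single training device (no data-parallel multiplier), which the source does not state, and the variable names are illustrative.

```python
# Values copied from the training configuration table.
micro_batch_size = 1
grad_accum_steps = 32
epochs = 2
completed_steps = 762

# Assumption: one device, so the effective batch is micro-batch x accumulation.
effective_batch = micro_batch_size * grad_accum_steps

# Upper bound: the last optimizer step of an epoch may be only partly filled.
sequences_seen = completed_steps * effective_batch
sequences_per_epoch = sequences_seen // epochs

print(effective_batch)      # 32
print(sequences_seen)       # 24384
print(sequences_per_epoch)  # 12192
```

The source does not say how many samples survived preprocessing, so the per-epoch figure is an estimate under the stated assumption, not a reported number.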

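Training Data

The model was trained on reasoning conversations sampled from three datasets and normalized, then rendered with the qwen3-thinking chat template and response-only SFT masking: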

Dataset	Samples	Role
nohurry/Opus-4.6-Reasoning-3000x-filtered	3,900	Claude Opus reasoning traces
Jackrong/Qwen3.5-reasoning-700x	700	Curated Qwen reasoning samples
Roman1111111/claude-opus-4.6-10000x	9,633	Additional Claude Opus reasoning examples
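
Response-only masking means that only the assistant's tokens contribute to the training loss, while prompt tokens are ignored. The following is a minimal sketch of the idea, not the author's training code: the function name `mask_non_response` and the toy token IDs are hypothetical, and the value -100 is used because it is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_non_response(token_ids, is_response):
    """Return labels in which only response tokens keep their token ID."""
    if len(token_ids) != len(is_response):
        raise ValueError("token_ids and is_response must have the same length")
    return [
        tok if keep else IGNORE_INDEX
        for tok, keep in zip(token_ids, is_response)
    ]

# Hypothetical toy sequence: three prompt tokens followed by two response tokens.
token_ids = [11, 12, 13, 21, 22]
is_response = [False, False, False, True, True]

labels = mask_non_response(token_ids, is_response)
print(labels)  # [-100, -100, -100, 21, 22]
```

In a real pipeline the response mask would be derived from the chat template's role markers rather than written by hand.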

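Benchmarks

MMLU-Pro was run on the merged source model (not on the individual GGUF quantizations), using 5 questions from each MMLU-Pro subject for 70 questions in total.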

Benchmark	Base model	Fine-tuned model	Improvement
MMLU-Pro overall	42.86%	75.71%	+32.85 pp

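Quantization can change the score, especially at low bit widths; treat this number as a reference point rather than a full evaluation.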
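
Because the test set has only 70 questions, it helps to convert the percentages back into question counts and look at the sampling noise. The sketch below does that with a simple binomial standard error; this is an illustrative estimate added here, not an analysis from the model card.

```python
import math

total_questions = 70  # 5 questions per subject, as stated above
base_pct, tuned_pct = 42.86, 75.71

# Recover the number of correct answers implied by each percentage.
base_correct = round(base_pct / 100 * total_questions)
tuned_correct = round(tuned_pct / 100 * total_questions)
print(base_correct, tuned_correct)  # 30 53

def standard_error_pp(correct, total):
    """Binomial standard error of an accuracy, in percentage points."""
    p = correct / total
    return 100 * math.sqrt(p * (1 - p) / total)

print(standard_error_pp(base_correct, total_questions))
print(standard_error_pp(tuned_correct, total_questions))
```

Both standard errors come out at roughly 5 to 6 percentage points, so the reported gain is large relative to the noise, while small differences between variants would not be distinguishable at this sample size.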

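GGUF Quantizations

Quantization	Size	Best for
Q4_K_M	21.2 GB	Smallest practical general-purpose quantization; suits 24 GB of VRAM
Q5_K_M	24.7 GB	Balanced option with better quality than Q4
Q6_K	28.5 GB	High-quality option when VRAM/RAM is plentiful
Q8_0	36.9 GB	Largest quantization, closest to source-model quality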
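
One way to choose among these files is to take the largest one that fits in the available memory after reserving some headroom. The helper below is a hypothetical sketch: the file sizes come from the table, but the function name `pick_quant` and the 2 GB default headroom (for the KV cache and runtime overhead) are assumptions, and real headroom needs depend on context length.

```python
# File sizes in GB, copied from the table above.
QUANT_SIZES_GB = {"Q4_K_M": 21.2, "Q5_K_M": 24.7, "Q6_K": 28.5, "Q8_0": 36.9}

def pick_quant(memory_gb, headroom_gb=2.0):
    """Return the largest quantization that fits, or None if none does."""
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)

print(pick_quant(24))  # Q4_K_M
print(pick_quant(48))  # Q8_0
print(pick_quant(16))  # None
```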

Multi-Platform Deployment

llama.cpp

bash

# Install with Homebrew
brew install llama.cpp
llama-server -hf hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q4_K_M

# Windows
winget install llama.cpp
llama-server -hf hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q4_K_M

# Run inference directly in the terminal
llama-cli -hf hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q4_K_M
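
Once llama-server is running, it can be queried over its OpenAI-compatible HTTP API. The client below is a minimal sketch using only the Python standard library; it assumes the server is listening on its default local address, port 8080, and the helper names `build_chat_request` and `send` are illustrative.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default port, 8080.
BASE_URL = "http://127.0.0.1:8080"

def build_chat_request(prompt, base_url=BASE_URL):
    """Build a POST request for the server's OpenAI-compatible chat endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example, once the server is up:
# print(send(build_chat_request("Explain how quicksort works")))
```

If the server was started with a different host or port, pass the matching `base_url` instead.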

llama-cpp-python

python

# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF",
    filename="Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q4_K_M.gguf",
)
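# Run one chat turn and print the assistant's reply
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain how quicksort works"}]
)
print(response["choices"][0]["message"]["content"])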

vLLM

bash

pip install vllm
vllm serve "hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"

Ollama

bash

ollama run hf.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q4_K_M

Docker

bash

docker model run hf.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q4_K_M

Unsloth Studio

bash

# macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh
unsloth studio -H 0.0.0.0 -p 8888
# Open http://localhost:8888 in a browser and search for the model name to start chatting

# Or use HuggingFace Spaces directly (no installation needed)
# https://huggingface.co/spaces/unsloth/studio

Pi Coding Agent

bash

npm install -g @mariozechner/pi-coding-agent
# Add the llama.cpp server configuration to ~/.pi/agent/models.json
pi

Controlling Thinking Mode

The model uses the qwen3-thinking chat template:

Enable reasoning: set the system prompt to /think
Disable reasoning (faster): set the system prompt to /no_think
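
The two switches above can be wrapped in a small helper that builds the message list for any chat-completion style API. This is a sketch with a hypothetical function name, `build_messages`; it simply places the switch in the system prompt as described.

```python
def build_messages(user_prompt, thinking=True):
    """Build chat messages with the thinking switch as the system prompt."""
    switch = "/think" if thinking else "/no_think"
    return [
        {"role": "system", "content": switch},
        {"role": "user", "content": user_prompt},
    ]

fast = build_messages("Summarize quicksort in one sentence", thinking=False)
print(fast[0]["content"])  # /no_think
```

The resulting list can be passed as the `messages` argument of the llama-cpp-python call shown earlier, or sent to any of the servers above.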

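Acknowledgements

The Qwen team, for the base model
Unsloth, for the training framework
llama.cpp, for the GGUF tooling
Jackrong, for the public reasoning-distillation workflow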

Project Links

HuggingFace model page: hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
Source fine-tuned model: hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
Base model: Qwen/Qwen3.6-35B-A3B
