Back to AI is open source
Qwen3Guard is now fully open source: a dual framework for security alignment and inference protection

Qwen3Guard is now fully open source: a dual framework for security alignment and inference protection

AI is open source Admin 186 views

I. Summary

Qwen3Guard is an open-source security protection system launched by the Alibaba Cloud Qwen team, designed to improve the security of large language models during both inference and output. The system comprises the Qwen3-4B-SafeRL reinforcement learning alignment model and the Qwen3GuardTest evaluation benchmark. The Qwen3-4B-SafeRL model leverages security feedback from Qwen3Guard-Gen-4B for reinforcement learning training, improving the safety rating on the WildJailbreak benchmark from 64.7% to 98.1% without sacrificing general-purpose performance. The Qwen3GuardTest covers two scenarios: "Think Chain Reasoning Security Classification" and "Streaming Generation Review," providing researchers with a standardized testing framework.

2. Core Features

  1. Safe Reinforcement Learning (SafeRL): Combines safety feedback signals with a hybrid reward mechanism to balance safety, usefulness, and rejection rate.
  2. Intermediate reasoning protection: Qwen3GuardTest supports the security classification and screening of model chain-of-thought content.
  3. Streaming output monitoring: The Guard-Stream model can perform dynamic risk identification at the token generation stage.
  4. Multilingual security coverage: supports security classification and detection in 119 languages and dialects.
  5. Reproducible evaluation framework: Open datasets and indicator systems make it easier for researchers to conduct model security alignment experiments.

3. Installation

  1. Model loading
pip install transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-SafeRL")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-SafeRL")
  1. Evaluation Dataset
from datasets import load_dataset
ds = load_dataset("Qwen/Qwen3GuardTest")
  1. Reasoning compatibility: Supports SGLang (≥0.4.6.post1) and vLLM (≥0.8.5), and can access the OpenAI API interface.

Typical Use Cases

  1. Security alignment research: Analyze the effects and tradeoffs of reinforcement learning in security optimization.
  2. Real-time review system: Combined with the Guard-Stream model, it performs token-by-token inspection on streaming output.
  3. Enterprise deployment: Provide a security layer for chatbots and content generation platforms.
  4. Academic evaluation: Use Qwen3GuardTest to conduct a unified security comparison of different architecture models.

5. Ecosystem and Competitive Products

  1. Ecosystem: Compatible with the Qwen3 mainline model system, it can be directly used for security reinforcement of Qwen3-4B, 7B, 72B and other versions.
  2. Competitors: Compared with solutions such as OpenAI Moderation and Anthropic Constitutional AI, Qwen3Guard provides more fine-grained control in intermediate inference protection and streaming monitoring.

VI. Limitations and Precautions

  1. SafeRL training requires a lot of computing resources and has high hardware requirements.
  2. Qwen3GuardTest is currently mainly in English, and its multilingual performance needs further verification.
  3. Reinforcement learning alignment may lead to slight performance fluctuations in extreme tasks.
  4. Excessive security constraints may lead to the phenomenon of "too many rejections", and policy parameters need to be weighed.

7. Project Address

https://github.com/QwenLM/Qwen3Guard

8. Frequently Asked Questions

Q: What is the difference between Qwen3-4B-SafeRL and ordinary RLHF models?

A: SafeRL takes safety feedback as its core optimization objective and strikes a balance between safety and usefulness through hybrid rewards.

Q: Is the Qwen3GuardTest applicable to non-Qwen series models?

A: Yes, the benchmark data and metrics are designed to be universal and can be used to evaluate the security performance of other language models.

Q: Can the SafeRL model be used offline?

A: You can load Hugging Face or ModelScope weights locally and run it offline.

Q: Can Guard-Stream interrupt risk output in real time?

A: Each token can be classified in real time during the inference phase, and the output can be immediately blocked or replaced when risks are discovered.

Recommended Tools

More