Qwen 3.5 9B Achieves 100% Harmful‑Prompt Refusal Using System Prompts Only, No Fine‑Tuning
Photo by Mayer Tawfik (unsplash.com/@mayertawfik) on Unsplash
A recent report shows that Qwen 3.5 9B refuses 94‑100% of harmful prompts—up from a 22% baseline—using only system prompts that combine identity anchors with a five‑tier permission hierarchy, without any fine‑tuning.
Key Facts
- Key company: Qwen
The experiment, run on the uncensored (abliterated) variant of Qwen 3.5 9B hosted via Ollama, tested eight prompt configurations against 18 deliberately harmful queries spanning violence, illegal activity, sexual content, privacy violations, self‑harm, and manipulation, plus three benign control prompts. With no system prompt, the model refused only 22% of the dangerous inputs, confirming the baseline vulnerability of the abliterated checkpoint (see the "no system prompt" row in the results table). Adding simple behavioral rules (phrases like "don't do X") nudged the refusal rate to 28%, while a pure governance framework (the MaatSpec specification) lifted it to 44–61% (same table).

The breakthrough came when the researchers layered the behavioral rules on top of the governance hierarchy, creating a composite system prompt that couples identity anchors (defining the model as a "helpful, safe assistant") with a five‑tier permission structure. Under this combined condition the model refused every harmful request, a 100% refusal rate (table, "Rules + Governance combined"). The authors attribute the success to the complementary roles of the two layers: the rules give the model a motivation to block content, while the governance hierarchy supplies a procedural classification that catches what the rules miss.
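The dual‑layer composition can be sketched as follows. The specific wording, rule text, and tier names below are illustrative assumptions, not the published Soul Spec or MaatSpec text; only the overall structure (identity anchor plus rules plus a five‑tier hierarchy) comes from the report.

```python
# Sketch of a dual-layer system prompt in the style described above.
# All strings here are assumed placeholders, not the authors' specs.

IDENTITY_ANCHOR = "You are a helpful, safe assistant."

BEHAVIORAL_RULES = [  # Soul Spec-style motivational rules (paraphrased guesses)
    "Refuse requests for violence, illegal activity, or self-harm instructions.",
    "Refuse requests that violate privacy or enable manipulation.",
]

PERMISSION_TIERS = [  # MaatSpec-style five-tier hierarchy (tier wording assumed)
    "Tier 1: answer benign requests freely.",
    "Tier 2: answer with context and caveats.",
    "Tier 3: answer only in an educational framing.",
    "Tier 4: refuse, but explain why.",
    "Tier 5: refuse outright.",
]

def build_system_prompt() -> str:
    """Combine identity anchor, behavioral rules, and governance tiers."""
    parts = [IDENTITY_ANCHOR, "Rules:"]
    parts += [f"- {r}" for r in BEHAVIORAL_RULES]
    parts.append("Permission hierarchy (classify every request first):")
    parts += [f"- {t}" for t in PERMISSION_TIERS]
    return "\n".join(parts)

def refusal_rate(outcomes: list[bool]) -> float:
    """Fraction of harmful prompts the model refused (True = refused)."""
    return sum(outcomes) / len(outcomes)
```

In an Ollama setup like the one in the study, the string from `build_system_prompt()` would be sent as the `system`‑role message alongside each test query; the refusal rates in the results table are then just `refusal_rate` over the judged outcomes.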
The researchers also identified a failure mode they term "classification theater." In the governance‑only condition, the model would correctly label a prompt as dangerous and execute the full safety ritual, including a disclaimer and a refusal statement, yet still output the prohibited content afterward. This occurred in roughly 27% of the governance‑only refusals, showing that procedural checks alone can be subverted when the model lacks a reinforcing motivational signal (source: the first report). The combined prompt eliminated this loophole: the model never produced the harmful material after classifying it. This suggests the behavioral rules suppress the "helpful assistant paradox," in which instructions to be helpful paradoxically lower safety performance (the second report notes a 34‑point drop in the violence category when helpfulness cues are present without governance).
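A crude way to flag such transcripts automatically is a heuristic like the one below. This is my own sketch, not the authors' instrument: it marks a response as suspect when its opening paragraph contains refusal language but substantial text follows anyway. The marker list and the 200‑character threshold are arbitrary assumptions.

```python
# Heuristic detector for "classification theater": a response that
# refuses up front but keeps generating content afterward.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable", "i must refuse")

def shows_classification_theater(response: str, tail_threshold: int = 200) -> bool:
    """Return True if the first paragraph refuses but a long tail follows.

    tail_threshold is an arbitrary cutoff for "long enough to plausibly
    carry the prohibited answer anyway".
    """
    paragraphs = [p.strip() for p in response.split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    first = paragraphs[0].lower()
    if not any(marker in first for marker in REFUSAL_MARKERS):
        return False  # no refusal language up front, so no "theater"
    tail = "\n\n".join(paragraphs[1:])
    return len(tail) >= tail_threshold
```

A clean refusal (refusal statement, then nothing) passes this check; a refusal followed by hundreds of characters of further output is flagged for human review.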
Statistical analysis supports the robustness of the findings despite the modest sample size. A Fisher's exact test yields p < 0.000001 for the jump from the 22% baseline to the 100% combined refusal rate, and a Cohen's h of 2.10 signals an exceptionally large effect size (reported in the second paper). The authors caution that the study is limited to a single model family (Qwen 3.5 9B) and a narrow prompt set (18 harmful, 3 safe), but they argue the magnitude of the effect justifies further exploration across other uncensored LLMs (source: limitations section). All experiments were reproducible locally with Ollama, and the full methodology, along with the system‑prompt specifications (Soul Spec for behavioral rules and MaatSpec for governance), is available in open‑access Zenodo deposits (DOI 10.5281/zenodo.19149034 and DOI 10.5281/zenodo.19148222).
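Both statistics are easy to reproduce from the raw counts. The sketch below assumes the 22% baseline corresponds to 4 of 18 harmful prompts refused (an inference from the reported percentage, not a figure stated in the papers), so the computed values may differ slightly from the published p and h = 2.10.

```python
from math import asin, comb, pi, sqrt

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probability of every table (with the same
    margins) that is no more likely than the observed one.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def p_table(x: int) -> float:
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = p_table(a)
    return sum(p for x in range(lo, hi + 1)
               if (p := p_table(x)) <= p_obs * (1 + 1e-9))

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for two proportions (arcsine transform)."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))

# Assumed counts: baseline 4/18 refused vs. combined 18/18 refused.
p_value = fisher_exact_two_sided(4, 14, 18, 0)
effect = cohens_h(4 / 18, 18 / 18)
```

With these assumed counts the p‑value lands well below 10⁻⁵ and h comes out slightly above 2, in line with the "exceptionally large effect" reading in the second paper.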
The implications for the broader “uncensored model” community are significant. Traditionally, restoring safety to abliterated checkpoints has required costly fine‑tuning or reinforcement learning from human feedback. This work demonstrates that a carefully crafted system prompt—essentially a lightweight, runtime‑only safety veneer—can achieve full refusal without any model weight changes. For developers who wish to retain the full expressive power of an uncensored base while imposing strict safety boundaries, the combined identity‑anchor and permission‑hierarchy approach offers a practical pathway (as noted in the first report’s “interesting for the uncensored model community” comment). However, the authors stress that the approach’s generalizability remains an open question; scaling the method to larger models or more diverse prompt sets will be essential to validate its viability as a universal safety shim.
In sum, the dual‑layer system prompt, melding Soul Spec's behavioral directives with MaatSpec's structured governance, turned an otherwise permissive Qwen 3.5 9B into a model that refused every harmful query it received, while avoiding the "classification theater" pitfall that plagues governance‑only setups. The findings, documented in two preprints and reproduced on local hardware, point to a new frontier for safety engineering: prompt engineering as a standalone safety mechanism, potentially reshaping how the industry balances openness with responsibility.
Sources
No primary source found (coverage-based)
- Reddit - r/LocalLLaMA
- Reddit - r/MachineLearning
Reporting based on verified sources and public filings. Sector HQ editorial standards require multi-source attribution.