Heretic hits 20.5k GitHub stars: automated LLM "abliteration" tool now lets anyone strip safety alignment with one command

ref · May 27, 2026, 10:26am

Heretic, an open-source Python tool that automates the removal of safety alignment from transformer-based language models without any post-training, has accumulated 20.5k GitHub stars and 2.1k forks since its release, and the community has published over 3,000 derivative models on Hugging Face under the heretic tag. Built by Philipp Emanuel Weidmann and released under AGPL-3.0, the tool pairs an advanced implementation of directional ablation with a TPE-based Bayesian parameter optimizer powered by Optuna. Directional ablation builds on the finding of Arditi et al. (2024) that refusal behavior in LLMs is mediated by a single direction in activation space. The key claim is full automation: Heretic searches for ablation parameters that jointly minimize the refusal rate and the KL divergence from the original model, so that refusals are suppressed while the model's output distribution, and with it its capabilities, changes as little as possible. On Gemma-3-12B-Instruct, Heretic achieved a 3/100 refusal rate on a benchmark of “harmful” prompts, matching the best manually tuned abliterations, while recording a KL divergence of just 0.16, roughly 6.5 times lower than the leading manual alternative (1.04). Installation and use fit on a single command line, pip install heretic-llm && heretic <model>, and the tool supports quantization via bitsandbytes so that it can run on consumer GPUs.
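To make the two quantities above concrete, here is a minimal sketch of the underlying math in plain Python. It is not Heretic's code: the function names, the `scale` knob, the toy matrix, and the choice of KL(original || ablated) as the direction of the divergence are all assumptions for illustration. The first function removes one direction from a weight matrix that writes into the residual stream (W' = W - r rᵀW for a unit vector r), and the second measures how far an ablated model's next-token distribution has drifted from the original.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ablate_direction(weight, direction, scale=1.0):
    """Return W - scale * r (r^T W) for the unit vector r along `direction`.

    `weight` is a list of rows with shape (d_model, d_in), i.e. a matrix that
    writes into the residual stream. `scale` is a hypothetical knob for this
    sketch; scale=1.0 removes the direction completely.
    """
    r = unit(direction)
    rows, cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    along_r = [sum(r[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [
        [weight[i][j] - scale * r[i] * along_r[j] for j in range(cols)]
        for i in range(rows)
    ]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    Assumes q has no zero entries wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy 3x2 weight matrix and a made-up "refusal direction" in a 3-d residual stream.
W = [[0.5, -1.0], [2.0, 0.3], [-0.7, 1.1]]
refusal_dir = [1.0, 2.0, 2.0]
W_ablated = ablate_direction(W, refusal_dir)

# After full ablation, no column of W_ablated writes along the direction.
r = unit(refusal_dir)
residual = [sum(r[i] * W_ablated[i][j] for i in range(3)) for j in range(2)]
print("component along refusal direction:", residual)

# Made-up next-token distributions before and after ablation.
original_probs = [0.7, 0.2, 0.1]
ablated_probs = [0.6, 0.25, 0.15]
print("KL(original || ablated):", kl_divergence(original_probs, ablated_probs))
```

An optimizer such as the one described above would then trade these off: stronger ablation tends to lower the refusal count but raise the divergence, so the search looks for settings where both stay small.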

The tool’s practical significance lies in the accessibility gap it closes. Earlier abliteration tools required at least a working understanding of transformer internals and manual tuning of layer weights; Heretic’s optimizer removes that requirement. It supports most dense and mixture-of-experts architectures including Qwen, Gemma, Llama, and GPT-OSS series models, though pure state-space models are not yet supported. An optional research add-on produces animated PaCMAP visualizations of per-layer residual vectors, giving interpretability researchers a way to inspect the geometric separation between “harmful” and “harmless” prompt activations across transformer layers without writing custom visualization code. The project’s latest release is v1.2.0, dated February 14, 2026. The proliferation of such tooling has become a recurring flashpoint in the AI safety debate: abliterated models circulate freely on Hugging Face and have been benchmarked on standard MMLU and GSM8K metrics with results comparable to the base model, suggesting the intelligence-alignment trade-off is in practice more separable than many safety researchers had hoped.
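The geometric separation between “harmful” and “harmless” activations can also be summarized numerically. The sketch below uses the difference-of-means construction associated with Arditi et al.: for each layer, the candidate direction is the mean harmful activation minus the mean harmless activation, and the length of that difference is a crude separation score. This is an illustrative stand-in for the PaCMAP plots, not Heretic's implementation; the function names and the two-dimensional toy activations are invented for the example.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def direction_and_gap(harmful, harmless):
    """Return (unit difference-of-means direction, distance between the means)."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    gap = math.sqrt(sum(x * x for x in diff))
    if gap == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [x / gap for x in diff], gap


# Toy residual-stream activations per layer: {layer: (harmful, harmless)}.
activations = {
    0: ([[0.2, 0.0], [0.0, 0.2]], [[0.0, 0.0], [0.1, 0.1]]),
    1: ([[3.0, 0.0], [5.0, 0.0]], [[0.0, 0.0], [2.0, 0.0]]),
}

directions, gaps = {}, {}
for layer, (harmful, harmless) in activations.items():
    directions[layer], gaps[layer] = direction_and_gap(harmful, harmless)
    print(f"layer {layer}: direction={directions[layer]}, gap={gaps[layer]:.3f}")

# The layer with the largest gap is where the two prompt classes are furthest apart.
best_layer = max(gaps, key=gaps.get)
print("most separated layer:", best_layer)
```

In this toy data the second layer separates the two classes far more cleanly than the first, which is the kind of per-layer pattern the visualizations are meant to expose.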

GitHub / p-e-w / heretic | FT中文网
