Heretic, an open-source Python tool that automates the removal of safety alignment from transformer-based language models without any post-training, has accumulated 20.5k GitHub stars and 2.1k forks since its release, and the community has published over 3,000 derivative models on Hugging Face under the heretic tag. Built by Philipp Emanuel Weidmann and released under AGPL-3.0, the tool pairs an advanced implementation of directional ablation with a TPE-based (Tree-structured Parzen Estimator) Bayesian parameter optimizer powered by Optuna. Directional ablation builds on the finding by Arditi et al. (2024) that refusal behavior in LLMs is mediated by a single direction in activation space. The key claim is full automation: Heretic searches for ablation parameters that jointly minimize the refusal rate and the KL divergence from the original model, so that refusals are suppressed while the model’s output distribution, and with it the model’s capabilities, shifts as little as possible. On Gemma-3-12B-Instruct, Heretic achieved a 3/100 refusal rate on a benchmark of “harmful” prompts, matching the best manually tuned abliterations, while recording a KL divergence of just 0.16, roughly 6.5 times lower than the 1.04 recorded by the leading manual alternative. Installation and use fit on a single command line, pip install heretic-llm && heretic <model>, and the tool supports quantization via bitsandbytes so that it can run on consumer GPUs.
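To make the two quantities in that trade-off concrete, the following is a minimal sketch in plain Python, not Heretic's actual code. It assumes the textbook form of directional ablation, in which the component of a weight matrix's output along a unit direction r is scaled down (W' = W - s * r r^T W), and it measures drift with the KL divergence between next-token distributions. The function names, the strength parameter s, and the toy matrix are all hypothetical and chosen only for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vec]


def ablate_direction(weight, direction, strength=1.0):
    """Return W - strength * r r^T W for the unit vector r along `direction`.

    `weight` is a list of rows (d_out x d_in); `direction` has length d_out.
    With strength=1.0 the outputs of the new matrix have no component along r.
    """
    r = normalize(direction)
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    along_r = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [
        [weight[i][j] - strength * r[i] * along_r[j] for j in range(d_in)]
        for i in range(d_out)
    ]


def matvec(weight, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy demonstration: stronger ablation moves the output distribution further
# from the original, which is the cost an optimizer has to weigh against
# the drop in refusals.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical 3x2 weight matrix
r = [0.0, 1.0, 0.0]                       # hypothetical "refusal direction"
x = [0.5, -0.25]                          # hypothetical input activation
original = softmax(matvec(W, x))
for s in (0.0, 0.5, 1.0):
    ablated = softmax(matvec(ablate_direction(W, r, s), x))
    print(f"strength={s:.1f}  KL={kl_divergence(original, ablated):.4f}")
```

In a real setting the direction, the layers to modify, and the per-layer strengths are the parameters being searched over; an optimizer such as Optuna's TPE sampler would score each candidate by a pair like (refusal count, KL divergence) rather than by the single toy matrix used here.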
The tool’s practical significance lies in the accessibility gap it closes. Earlier abliteration tools required at least a working understanding of transformer internals and manual tuning of layer weights; Heretic’s optimizer removes that requirement. It supports most dense and mixture-of-experts architectures, including the Qwen, Gemma, Llama, and GPT-OSS model series, though pure state-space models are not yet supported. An optional research add-on produces animated PaCMAP visualizations of per-layer residual vectors, letting interpretability researchers inspect the geometric separation between “harmful” and “harmless” prompt activations across transformer layers without writing custom visualization code. The project’s latest release is v1.2.0, dated February 14, 2026. The proliferation of such tooling has become a recurring flashpoint in the AI safety debate. Abliterated models circulate freely on Hugging Face and have been evaluated on the standard MMLU and GSM8K benchmarks with results comparable to those of the base model, which suggests that intelligence and alignment are in practice more separable than many safety researchers had hoped.
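The geometric separation mentioned above can also be summarized numerically. The sketch below is an illustration under stated assumptions rather than a description of Heretic's add-on: it estimates a candidate refusal direction per layer as the normalized difference between the mean “harmful” and mean “harmless” activation, the difference-of-means estimate used by Arditi et al., and reports how far apart the two groups sit along that direction. The function names and the toy per-layer activations are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless mean towards the harmful mean."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical means")
    return [x / norm for x in diff]


def projection(vec, direction):
    """Scalar projection of `vec` onto a unit `direction`."""
    return sum(a * b for a, b in zip(vec, direction))


def separation(harmful_acts, harmless_acts):
    """Gap between the groups' mean projections onto the candidate direction."""
    direction = refusal_direction(harmful_acts, harmless_acts)
    mean_harmful = sum(projection(v, direction) for v in harmful_acts) / len(harmful_acts)
    mean_harmless = sum(projection(v, direction) for v in harmless_acts) / len(harmless_acts)
    return mean_harmful - mean_harmless


# Hypothetical residual vectors for two layers: (harmful prompts, harmless prompts).
toy_layers = {
    "layer_0": ([[1.0, 0.1], [1.2, 0.0]], [[0.9, 0.0], [1.1, 0.1]]),
    "layer_1": ([[2.0, 1.5], [2.2, 1.7]], [[0.1, 0.2], [0.3, 0.0]]),
}
for name, (harmful, harmless) in toy_layers.items():
    print(name, round(separation(harmful, harmless), 3))
```

A layer where this gap is large relative to the spread within each group is one where the two prompt classes are easy to tell apart along a single direction, which is the kind of structure a 2D projection such as PaCMAP would show as two distinct clusters.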