Tracing the shared "override circuit" behind CoT unfaithfulness and sycophancy in Gemma 3, using Gemma Scope 2 SAEs and cross-layer transcoders.
gemma sparse-autoencoders interpretability chain-of-thought mechanistic-interpretability transcoders circuit-tracing
Updated Apr 15, 2026 - Python