Cartographic exploration of a single steering direction within the Gemma-2-2B-IT large language model at layer 20. This linear direction in latent space was discovered using the technique from the paper "Refusal in Language Models Is Mediated by a Single Direction" by Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee and Neel Nanda.
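As a rough illustration of how such a direction can be obtained, the paper's approach is a difference in means: average the model's activations over harmful prompts, average them over harmless prompts, and take the difference. The sketch below uses random toy vectors in place of real Gemma-2-2B-IT layer 20 activations; the function names, the toy width of 16 and the synthetic data are assumptions made for illustration, not the code used for this project.

```python
import numpy as np

def difference_in_means_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the mean harmless activation to the mean harmful one."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def projection_scores(acts, direction):
    """Scalar activation of each row along the given unit direction."""
    return acts @ direction

# Toy stand-in data (assumption): the real residual stream is much wider than 16.
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0  # hypothetical "refusal" axis planted in the toy data

harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 3.0 * planted

direction = difference_in_means_direction(harmful, harmless)
scores_harmful = projection_scores(harmful, direction)
scores_harmless = projection_scores(harmless, direction)

print("mean projection, harmful: ", scores_harmful.mean())
print("mean projection, harmless:", scores_harmless.mean())
```

With real activations, the same projection score is what ranks prompts by how strongly they activate the direction.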
Clustering the 100 maximally activating prompts, centered on the strongest responses, reveals a map of concepts related to requests perceived as harmful or inappropriate. These are requests where the model refuses to comply rather than providing a helpful answer. Red highlighting indicates activating concepts within images and the connected network of neighboring activations.
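The selection and neighbor structure described above could be sketched roughly as follows. The description does not say which clustering or layout method was used, so the cosine-similarity nearest-neighbor graph here is an assumption, as are the function names, the toy data and the choice of 5 neighbors.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, strongest first."""
    return np.argsort(scores)[::-1][:k]

def nearest_neighbors(vectors, n_neighbors):
    """For each row, indices of its most cosine-similar other rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude each row from its own neighbor list
    return np.argsort(sim, axis=1)[:, ::-1][:, :n_neighbors]

# Toy stand-in data (assumption): 1000 prompt activations of width 16.
rng = np.random.default_rng(1)
acts = rng.normal(size=(1000, 16))
direction = np.zeros(16)
direction[0] = 1.0  # hypothetical unit steering direction

scores = acts @ direction
top = top_k_indices(scores, k=100)                       # 100 maximally activating prompts
neighbors = nearest_neighbors(acts[top], n_neighbors=5)  # edges of the neighbor network

print(top[:5], neighbors[0])
```

Each row of `neighbors` lists the prompts that would be linked to a given prompt in the map; a layout or clustering step would then operate on that graph.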
Interactive Browser at https://got.drib.net/maxacts/refusal/