Maximum Activations: About

Platforms

Manifold

Description

Cartographic exploration of a single steering direction within the Gemma-2-2B-IT large language model at layer 20. This linear direction in the model's latent space was discovered using the technique from the paper "Refusal in Language Models Is Mediated by a Single Direction" by Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee and Neel Nanda.
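As a rough illustration of the idea, the cited paper derives its direction as a difference of mean activations between harmful and harmless prompts. The sketch below shows that computation on toy three-dimensional vectors. The function names and the toy data are hypothetical, and this is not the code used to produce this work; real activations would come from the model's residual stream at the chosen layer.

```python
from math import sqrt

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_in_means_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def projection(activation, direction):
    """Scalar activation of one vector along the direction (dot product)."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy stand-ins for layer activations (assumed data, not model output).
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = difference_in_means_direction(harmful, harmless)
score = projection([5.0, 2.0, 1.0], direction)
```

The projection score is what "activation" refers to in this sketch: the larger the dot product, the more strongly an input aligns with the direction.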

Clustering 100 maximally activating prompts centered on the strongest responses reveals a map of concepts related to requests perceived as harmful or inappropriate, where the model refuses to comply rather than providing a helpful answer. Red highlighting indicates activating concepts within images and the connected network of neighboring activations.
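Selecting the maximally activating prompts can be sketched as a top-k search over per-prompt scores. The example below assumes each prompt already has a scalar activation along the direction. The function name, prompts, and scores are hypothetical, and the clustering step itself is not shown because the description does not specify the method.

```python
import heapq

def top_activating(prompts, scores, k=100):
    """Return the k (score, prompt) pairs with the highest scores, best first."""
    return heapq.nlargest(k, zip(scores, prompts))

# Toy example with assumed scores; a real run would use k=100.
example_prompts = ["a", "b", "c", "d"]
example_scores = [0.1, 2.5, -1.0, 1.2]

top_two = top_activating(example_prompts, example_scores, k=2)
```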

Interactive Browser at https://got.drib.net/maxacts/refusal/

On-Chain Data

Ethereum: 0x9e8389a9ffe3778948f0fe3012bd8c999fba47ee