Intro

Understanding larger neural networks in terms of functional circuits and their relations RLHF and Scalable oversight are not enough to deal with issues like deceptive of sycophantic behaviour If we can understand what the different subnetworks are doing. CNNs are mostly interpretable because we have an understanding of what the convolutional layer and pooling layers do whereas larger SOTA circuits are often much less interpretable. I couldn’t understand the geometry of superposition section in the Toy Models of Superposition paper, though it seemed interesting. Is there any mathematical formalism for interpretability?

Activity 1

Features: these represent the independent directions like a basis set for what we’re trying to learn. Circuits: Connections between Features which correspond to subgraphs of the whole network are functional components that represent relationships between features and together represent bigger semantic chunks of whatever is being learnt. Universality: similar circuits learn similar features of the problem even if the tasks are different. Examples: Features = an edge of an image. Circuits: A group of edge detection filters connected together by a certain weight that allows the detection of a square.

Activity 2

superposition: A single neuron represents noisy encodings of multiple features.

What is superposition? Give an example pair of large language model features that might be likely to be stored in superposition, and why.
What’s a polysemantic neuron? How does this relate to superposition?
What is the general idea behind dictionary learning? Why might it be helpful?
Lesser neurons, more features to be encoded. Semicolon in javascript has different meanings in English or other languages. Sparse correlation makes the neurons polysemantic.
Single neurons encode more than one feature noisily. Polysemantic neurons allow for superposition.
Related to sparse autoencoding. Creating an independent basis set for whatever latent semantic space is being considered as a sort of dimensionality reduction or ‘disentangling’.

Questions

How much does interpretability affect performance? Dictionary Learning - post hoc interpretability = focus on performance and then do dictionary learning to understand what the different things are doing. You take a layer of polysemantic neurons and expand it out into a layer of size maybe 10-15 times its size so that you can see all the monosemantic features.