Paper Title


Engineering Monosemanticity in Toy Models

Authors

Jermyn, Adam S., Schiefer, Nicholas, Hubinger, Evan

Abstract

In some neural networks, individual neurons correspond to natural ``features'' in the input. Such \emph{monosemantic} neurons are of great help in interpretability studies, as they can be cleanly understood. In this work we report preliminary attempts to engineer monosemanticity in toy models. We find that models can be made more monosemantic without increasing the loss by just changing which local minimum the training process finds. More monosemantic loss minima have moderate negative biases, and we are able to use this fact to engineer highly monosemantic models. We are able to mechanistically interpret these models, including the residual polysemantic neurons, and uncover a simple yet surprising algorithm. Finally, we find that providing models with more neurons per layer makes the models more monosemantic, albeit at increased computational cost. These findings point to a number of new questions and avenues for engineering monosemanticity, which we intend to study in future work.
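The abstract's observation that negative neuron biases encourage monosemanticity can be illustrated with a minimal numpy sketch. Everything here (dimensions, random weights, the candidate bias values, and the one-hot feature inputs) is an illustrative assumption, not the paper's actual toy-model setup: a more negative bias raises a ReLU neuron's firing threshold, so each neuron ends up responding to fewer features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not from the paper): k sparse
# features feeding m ReLU neurons through random weights.
k, m = 32, 32
W = rng.normal(size=(k, m))  # random feature-to-neuron weights

def relu_layer(x, bias):
    # One ReLU layer with a shared scalar bias.
    return np.maximum(x @ W + bias, 0.0)

X = np.eye(k)  # each sample switches on exactly one feature

frac_active = {}
for bias in (0.0, -0.5, -1.0):
    acts = relu_layer(X, bias)
    # Fraction of (feature, neuron) pairs where the neuron fires.
    # A more negative bias means each neuron crosses threshold for
    # fewer features, i.e. it is pushed toward monosemanticity.
    frac_active[bias] = (acts > 0).mean()
    print(f"bias={bias:+.1f}  fraction of neuron firings: {frac_active[bias]:.2f}")
```

Running this shows the fraction of active (feature, neuron) pairs shrinking monotonically as the bias becomes more negative, which is the qualitative effect the abstract exploits to engineer monosemantic models.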
