April 11, 2024
Avi Caciularu and Asma Ghandeharioun, Research Scientists, Google Research
The remarkable advancements in large language models (LLMs), and the concerns associated with them, such as factuality and transparency, highlight the importance of understanding their mechanisms, particularly in instances where they produce errors. By exploring the way a machine learning (ML) model represents what it has learned (the model's so-called hidden representations), we can gain better control over a model's behavior and unlock a deeper scientific understanding of how these models really work. This line of inquiry has become even more important as deep neural networks grow in complexity and scale. Recent advances in interpretability research show promising results in using LLMs to explain neuron patterns within another model.
These findings motivate our design of a novel framework to investigate hidden representations in LLMs with LLMs, which we call Patchscopes. The key idea behind this framework is to use LLMs to provide natural language explanations of their own internal hidden representations. Patchscopes unifies and extends a broad range of existing interpretability techniques, and it enables answering questions that were difficult or impossible before. For example, it offers insights into how an LLM's hidden representations capture nuances of meaning in the model's input, making it easier to fix certain types of reasoning errors. While we initially focus on applying Patchscopes to the natural language domain and the autoregressive Transformer model family, its potential applications are broader. For example, we are excited about its applications to the detection and correction of model hallucinations, the exploration of multimodal (image and text) representations, and the investigation of how models build their predictions in more complex scenarios.
Consider the task of understanding how an LLM processes co-references to entities within a text. A Patchscope, that is, one concrete configuration of the framework, is a specialized tool crafted to address a specific problem such as co-reference resolution. For instance, to investigate a model's contextual understanding of what a pronoun like “it” refers to, a Patchscopes configuration can be created as illustrated below:
Illustration of our framework, showing a Patchscope for decoding what is encoded in the representation of “It” in the source prompt (left), by using a predefined target prompt (right).
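To make the patching mechanics concrete, here is a minimal sketch in plain Python. It is not the Patchscopes implementation and it does not use a real Transformer: the toy "model", the helper names (`embed`, `layer`, `run`, `patchscope`), and the identity-style target prompt are all assumptions made for illustration. It only shows the core move: take a hidden representation from one (layer, position) in a source run and write it into a (layer, position) in a separate target run, so that everything computed afterwards builds on the patched state. In a real LLM, the final state would then be decoded into text.

```python
from typing import List, Optional, Tuple

Vector = List[float]
Patch = Tuple[int, int, Vector]  # (layer index, token position, replacement vector)


def embed(token: str) -> Vector:
    """Hypothetical 2-d embedding derived from the token's characters."""
    return [float(len(token)), float(sum(map(ord, token)) % 5)]


def layer(states: List[Vector]) -> List[Vector]:
    """Toy causal layer: mix each state with the mean of states at positions <= its own."""
    out = []
    for i, h in enumerate(states):
        prefix = states[: i + 1]
        mean = [sum(v[k] for v in prefix) / len(prefix) for k in range(len(h))]
        out.append([0.5 * h[k] + 0.5 * mean[k] for k in range(len(h))])
    return out


def run(tokens: List[str], n_layers: int = 3,
        patch: Optional[Patch] = None) -> List[List[Vector]]:
    """Return hidden states for layers 0..n_layers, optionally overwriting one of them."""
    states = [embed(t) for t in tokens]
    history = []
    for l in range(n_layers + 1):
        if l > 0:
            states = layer(states)
        if patch is not None and patch[0] == l:
            states = [list(h) for h in states]
            states[patch[1]] = list(patch[2])  # the patched state feeds all later layers
        history.append(states)
    return history


def patchscope(source: List[str], source_pos: int, source_layer: int,
               target: List[str], target_pos: int, target_layer: int,
               n_layers: int = 3) -> List[List[Vector]]:
    """Copy one hidden state from the source run into the target run."""
    vec = run(source, n_layers)[source_layer][source_pos]
    return run(target, n_layers, patch=(target_layer, target_pos, vec))


source_prompt = ["The", "cat", "sat", ".", "It"]
# Hypothetical identity-style target prompt; "x" is the placeholder that gets patched.
target_prompt = ["cat", "->", "cat", ";", "x"]

clean = run(target_prompt)
patched = patchscope(source_prompt, 4, 1, target_prompt, 4, 1)
print("clean final state:  ", clean[-1][4])
print("patched final state:", patched[-1][4])
```

Because the toy layer is causal, positions before the patched token are unaffected, while the patched position's later layers now depend on the representation of “It” from the source prompt. With a real model, the same idea would typically be implemented with forward hooks on the chosen layer rather than a hand-written forward pass.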
Patchscopes has a broad range of applications for understanding and controlling LLMs. Here are a few examples we explored:
Attribute extraction accuracy across source layers (ℓ). Left: Task done by tool (commonsense), 54 source prompts, 12 classes. Right: Country currency (factual), 83 source prompts, 14 classes.
An illustration of CoT Patchscope on a single example, focusing on a response needing correction with the prompt "The current CEO of the company that created Visual Basic Script".
The Patchscopes framework offers a unified way to inspect what language models encode in their hidden representations. It helps answer a wide range of questions, from simple predictions to extracting knowledge from hidden representations and fixing errors in LLMs’ complex reasoning. This has intriguing implications for improving the reliability and transparency of the powerful language models we use every day. Want to see Patchscopes in action? Find more details in the paper.