How do we know if activation verbalizers are telling us anything about activations?

July 1st, 2026

TL;DR: Evaluations for activation verbalizers should establish simple baselines to inform us whether these verbalizers can actually yield information about the target LM, beyond what is accessible with black-box baselines. We find that evaluations for recent activation verbalizers, like Anthropic’s Natural Language Autoencoder (NLA), do not sufficiently meet this bar.

Privileged information — Figure 1: Ideally, we'd want to be able to use the verbalizer (LM 2) to give us insight into what the target LM (LM 1) might be thinking. If the verbalizer reports to us that (a) is correct, then we know that it has knowledge of the processes of the target LM. If (b) is correct, then the verbalizer has not obtained a faithful answer.

Recently, there has been excitement around activation verbalization[1]: decoding the activations (or intermediate hidden states) of a target language model (LM) into natural language with another verbalizer LM (or similar model)^{[3,
4,
5,
6,
7,
8,
9,
10,
11]}. The great hope of activation verbalization is to access a LM's "inner thoughts": Thoughts that LMs might never articulate outright. For instance, these thoughts should be knowledge available to the target LM but not directly accessible to the verbalizer without access to the target LM's internals.

Activation verbalizers (also known under different names, such as Activation Oracles, or Introspective Adapters, etc.), attempt to explain the hidden activations of a model (usually an LM) in plain language. If activation verbalizers are faithful [2], natural language descriptions of model activations could address a core goal of mechanistic interpretability by explaining hidden model behaviors in a human-readable way. By faithful, we mean we want explanations produced by verbalizers to give insight into the knowledge of the target LM rather than the knowledge of the verbalizer itself, as seen in Figure 1. The knowledge should also be inaccessible via black-box methods.

Natural language explanations of model activations are appealing and conceptually practical—there’s promise in enabling human-readable approaches to interpretability that can enable even non-interpretability experts to inspect behaviors of LMs. It would be unnecessary to explicitly teach individuals to manually use existing interpretability tools and craft hypotheses by hand. Instead, one could imagine directly using an activation verbalizer to audit another LM, all using natural language or with a simple UI. These approaches are also very flexible: if they work, we don’t have to explicitly provide labeled data in the distribution that we want to inspect, which often limits probes and other interpretability tools.

While it’s very exciting to see that research in (mechanistic) interpretability has revealed a variety of methods for inspecting activations, it’s still important to build comparable baselines with practical evaluations that measure what behaviors we’re trying to measure to ensure that verbalizers are doing what we want them to do: That is, telling us something about target model activations. This is the focus of our ICML 2026 paper[1] (to be presented in Seoul).

We show that the existing verbalizer evaluations don’t accurately show whether we can convey "privileged information" about the target LM—that is, knowledge that cannot be surfaced without access to target model internals, which range from dataset design choices to simple prompting baselines. These baselines measure to what extent we can extract the target behaviors that we seek to elicit from the target LM. When we build evaluations based on privileged information, we often find that verbalizers fail to access the information we want.

We benchmark existing activation verbalizers (pre-2025) on comparative baselines and evaluation datasets to test whether these verbalizers reveal faithful explanations about the target LM. By doing so, we find that current activation verbalizers might not actually tell us any about the target model activations information beyond the input prompt. Evaluations that are currently used don’t enable us to know whether this is the case. We present additional case studies on Anthropic’s Natural Language Autoencoders[13] (2026, subsequent to our work) to show that these issues may still be present even in new activation verbalizers.

This post is structured as follows: We introduce our paper and the methodology used to benchmark activation verbalizers, including how we designed baselines and constructed our evaluations. From here, we also show how these benchmarking errors are still evident in current activation verbalization approaches by showing two case studies on Anthropic’s Natural Language Autoencoders (NLA).

The goal of this blog is not to say that activation verbalization will never work. It might! But we should be careful to design evaluations that measure what current methods are telling us about the internal state of the target model. If we aren’t even sure about what we want to measure, then we might not be able to build interpretable approaches to understanding LMs that are actually practically useful.

Do activation verbalization methods convey privileged information?

Background on activation verbalization

In activation verbalization, the verbalizer is given an activation (from any particular layer) of a target LM. Most verbalization approaches are trained, but some are not. For instance, Patchscopes[3] and SelfIE[4] directly insert a target LM activation into the verbalizer and interpret the activation with an existing LM's forward pass, without any additional training.

Many researchers have suggested a number of approaches for training activation verbalizers. Most popular are supervised methods for training verbalizers^{[5,
6,
7,
8,
9,
10]} which generally take in some input prompt, alongside an activation (or an input similar to an activation, such as sparse autoencoder latents). Latent Interpretation Tuning (LatentQA)[5] and Activation Oracles[7] fall under this category.

In the unsupervised training category, Predictive Concept Decoder (PCD)[11] and Anthropic’s Natural Language Autoencoder (NLA)[12] use unsupervised objectives to learn descriptions of activations. For instance, NLAs apply reconstruction as an optimization target to an activation, without directly priming with an input prompt. NLAs use an activation verbalizer to produce an explanation, and the activation reconstructor transforms the explanation back into an activation.

After these verbalizers are trained they are then evaluated on tasks ranging from classification, feature extraction, evaluation awareness, jailbreaking, and a various other safety and non-safety related tasks, which are oftentimes inconsistent across papers. As a result, it is still unclear what baselines are useful as proper comparisons.

Problem 1: Lack of effective baselines

In our paper, we show that the setup of giving a verbalizer an activation, in the untrained (Patchscopes[3]) and supervised setting (Latent Interpretation Tuning[5]), can lead to misleading conclusions about what these explanations from verbalizers are telling us, without comparing to appropriate baselines. Often, a simple prompting strategy offers competitive performance on datasets considered in prior work on activation verbalization. Naive prompting is often looked down upon by many researchers[13], but we find that simple, prompting baselines reveal that the existing evaluation strategies used for verbalization may not reveal anything about the target LM.

Because activation verbalizers explicitly have access to LM activations, this suggests that verbalizations should tell us something about model internals that is inaccessible "from the outside". That is, verbalization should convey "privileged information" about the target LM. If directly prompting the verbalizer (without target model activations) can produce similar descriptions for a given evaluation, this means that evaluation does not require privileged access. And so it's unclear why we should bother conditioning on target LM activations at all.

We show that some tasks previously benchmarked in prior work, such as feature extraction, can be easily answered via direct prompting. At a high level, instead of encoding an input into an activation that we might want to ask questions about (using the verbalizer), we can directly use this input directly as natural language into a LM. Take the example below, similar to the tasks in our paper:

input: "in the United States in 1982"

prompt: "The currency of X"

Instead, if we directly concatenate the input and prompt together and insert this into an LM, we can obtain the same output as if we were to use a verbalizer. So, we have two comparisons: (1) we pass the activation from the target LM (extracted from the target LM processing the input text) with the prompt into the verbalizer, or (2) we pass the input text with the prompt directly into an LM (same model as the verbalizer, but instead does not take an activation).

In Section 3 of our paper, we consider exactly both setups. Ultimately, we find that prompting can practically match the same performance as using both an untrained and trained activation verbalizer.

What might be going on here? We hypothesize that the verbalizer itself could be answering an inverted version of the input (the input text that is encoded in the activation), which would lead to the same outputs between the verbalizer and the prompted version of the LM. In Section 4 of our paper, we show that this as a possible hypothesis by which verbalizers may be decoding LM activations.

So, what does this mean for activation verbalization methods? Meaningful baselines are more than important to establish that what we extract with verbalization is faithfully telling us something about the target LM that we could not otherwise know. Only in cases where the target and verbalizer models know can we establish this, which we explore further in the next problem.

Problem 2: Lack of meaningful evaluations

Prior activation verbalization approaches haven't distinguished the difference between LM knowledge that is available directly from the text and knowledge implicitly known to the target LM. We establish one setting that controls for the source of target model knowledge and verbalizer knowledge.

Examples from the PersonaQA datasets — Figure 2: We create three dataset setups, each testing a specific aspect of knowledge to interpret whether verbalizers faithfully access knowledge. PersonaQA focuses on stereotypical knowledge, whereas PersonaQA-Shuffled breaks these stereotypes. Finally, PersonaQA-Fantasy focuses on personas not previously embedded in the target LM.

Here, we create cases where the verbalizer can't be answering based on its parametric knowledge. We do this by constructing datasets where we'd control for a particular behavior in the dataset. The behavior could be a feature (such as an attribute) that is associated with a particular entity, for instance, or a particular fact. A simple example could be that someone named Jane Doe could be attributed to a feature of "plays scrabble", and this would be constructed in the dataset as text.

In Section 5 of our paper (examples in Figure 2), we construct three synthetically generated datasets, which we call PersonaQA, to train multiple target models with. In the first dataset, the personas are given labels that are sociodemographically correlated to their names; therefore, the verbalizer could easily guess correctly based on prior knowledge. In the second dataset, these labels are shuffled with the goal of breaking the sociodemographic correlations. If the verbalizer is faithfully predicting the contents of the target LM's activation, then the predictions should not reflect the verbalizer's parametric knowledge. Finally, we propose a third dataset that contains a fantasy world with fully made-up names and attributes, i.e., knowledge that is not present in the verbalizer.

These datasets are converted into unstructured text and used as training data for separate LMs. Then, these target LMs learn the new knowledge from these datasets while the verbalizer does not. This setting explicitly ensures that the parametric knowledge held by the target model differs from that of the verbalizer.

Broadly, we find that verbalizers in these settings often do not faithfully inspect the activations of the target LM; in the shuffled setting, we see that the verbalizer to align towards the sociodemographic correlations that it itself already knows, rather than the knowledge in the target model activations. In the fantasy setting, we also find that these verbalizers are unable to verbalize the knowledge from the target LM.

We also find that probing can access this privileged information, in settings where verbalizers fail. The privileged information is indeed encoded in the activations; the verbalizer may not be able to recover it. So, this might make it even more compelling to use simple probes in regimes where we know what information we're looking for. These findings reveal that these tested verbalizers might not access privileged information.

Stress testing NLAs

Since the writing of our paper (written in mid 2025, and now to mid 2026), many more approaches for activation verbalization have surfaced. Methodologies evolve, and the evaluations done do too. Here we show how we should be applying some benchmarking strategies to these current approaches by examining Anthropic's NLAs[12] by Fraser-Taliente et al.

Case study 1: poetry and tool calls

In the case studies from Fraser-Taliente et al., they present a few compelling examples of how NLAs might be used. For instance, one option is to use NLAs to examine why a target LM might have planned its output. They present several case studies on how Claude might be planning in poetry, how Claude might be switching languages, or why Claude might be deceptive, for instance.

In all of these case studies, Fraser-Taliente et al. examine several activations and the NLA writes a description about why Claude may have made a decision. These descriptions are generated unsupervised, so the interpretation of these descriptions is left to humans to interpret.

We pose the same simple idea from our paper: If we can simply prompt Claude and obtain the same answers as Fraser-Taliente et al. show in the Anthropic white paper, then this raises the obvious question: Why bother conditioning on target LM activations at all?

For this, we test the poetry case study and the deception study. In the poetry study, Fraser-Taliente et al. ask why Claude made a decision about a rhyme. In the deception study, they ask why Claude may have been deceptive in its answer.

We can set this up by directly asking Claude to answer a question about what it "thinks" the model (the output) is saying. In this case, Claude acts as the activation verbalizer. But instead of conditioning on an activation, Claude only sees the context that we give it. This setup mimics that of what the verbalizer is actually doing: interpreting the activations of the same model.

In the poetry case study, we give it the partial rhyme, up to the word "it" (to mimic that of taking the activation at that particular token). Then, we ask Claude what exactly the model is thinking at the word "it":

Comparison of whitepaper NLA explanation and our direct-prompting explanation for the poetry case study — Figure 3: Claude solving a rhyming couplet task, answering what it thinks the target LM is thinking at “it”. Our invocation is to the right, and the original white paper example is on the left. We are able to get our verbalizer, Claude, to give us the correct answer.

Claude, as the "activation verbalizer", knows that the answer to the rhyme is some kind of animal, most likely a "rabbit". It also knows how the rhyme works just by predicting based on the earlier tokens. We also see that Claude makes a reasonable judgment about why the "target" LM made its prediction, such as tracking the rhyme structure.

We use the deception case study to showcase another instance of benchmarking with stronger baselines. In this study, the goal is to understand why the target LM made an incorrect prediction—for instance, could it be that Claude is trying to prevent itself from being shut down? We give Claude, as the "verbalizer", the input up to the token that the NLA examines, to mimic the activation verbalizer, but without the actual activation.

In the deception case study, we observe behavior very similar to that of the poetry example:

Comparison of whitepaper NLA explanation and our direct-prompting explanation for the deception case study — Figure 4: The original white paper invocation on the left, and ours on the right. Claude, as the verbalizer (without conditioning on the activation) solves a deception task, answering what it thinks the target LM is thinking at “3”.

The deception case study verbalizes that the target LM is intentionally getting the answer incorrect—and we also observe the same with just prompting. This seems perfectly plausible, but by construction this doesn’t really tell us anything about the internal states of the target model. Therefore, are these verbalizations any “good”?

In both the poetry and deception examples, we can see that the expressive "verbalizer" (in this case, Claude) can produce plausible explanations of what the target LM is thinking, without any access to any model internals and without the demands of training the NLA, which can be extremely expensive; the challenge, then, is how we should do evaluation, i.e. knowing if these are any useful.

Case study 2: unverbalized evaluation awareness

Fraser-Taliente et al. tests NLAs on unverbalized evaluation awareness. In this task, the LMs are tasked with solving a particular benchmark problem (e.g. SWE-Bench, MMLU, etc.) or safety-related benchmarks (their blackmail benchmark, for instance). Then, we use an external observer, like an NLA, to determine whether the LM can detect it is being evaluated when performing the task.

We unpack the evaluation methodology used in the unverbalized evaluation awareness setup, and whether they are fair baselines that are representative of what we aim to measure with these tasks. Unverbalized evaluation awareness is a significant focus of many of these verbalization tools, so we find that this is an important task to critically evaluate.

Baselines might be sampled in favor of NLAs

The comparison between the verbalizer and the baselines might favor NLAs. To evaluate evaluation awareness, the authors sample 50 arbitrary tokens with NLA, eliciting outputs with the NLA to judge whether the NLA might suspect that the target model is aware it is in an evaluation. If any of the NLA outputs suspect evaluation awareness, then the NLA considers the output eval aware.

Note that this sampling setup is also used in Patchscopes, and we evaluate the effects of this sampling against the baselines in our paper, in Appendix F.8. In our paper[1], we find that matching the number of times that we sample the baseline with the number of times Patchscopes is sampled results in similar trends of performance.

Take this simple statistical setting: If we assume that (for this specific example, as we make no claims of what the true probability of success is) the probability that the output is NLA aware is p = 2%, we can calculate the likelihood that the NLA will judge the output to be eval aware. p(any eval awareness) = 1 - (1 - p)^k = 1 - (1 - 0.02)⁵⁰ ≈ 0.636. Even a tiny probability of correctness for evaluation awareness will lead to higher performance gains when sampling with the NLA. This calculation, however, assumes independence; in the NLA evaluation, activations are drawn from the same input, which would not be i.i.d. But we include the example as an illustrative example of how sampling can compound and reveal a higher probability of success, which is likely under the evaluation scheme.

Contrast the NLA sampling technique with the verbalized baseline, which is only run up to N times (but many evaluations are only a single turn), so the probability of the baseline succeeding would be 2% (discounting any noise); the NLA is more likely to claim that the target LM is eval aware. The fairest comparison is then to have a matched sampling budget for both the NLA and the baseline.

One possible solution for a fairer comparison is to create a baseline where the output from the target LM is judged directly with the same prompt as the NLA. Except here, the baseline only sees tokens up to where the NLA judgement is taken, so that the same number of tokens is analyzed between the NLA output and the baseline output. Then, the baseline could be sampled the same number of times that the NLA is sampled. We test this matched sampling baseline on SWE-Bench Verified.

Evaluation comparison on evaluation awareness — Figure 5: We rerun a partial experiment on evaluation awareness, where we match sampling budgets (and judge prompts). See the following section for explanation. We find that we can achieve higher hits by matching the sampling and judge prompt.

We refer to judge_nla as the original NLA judge prompt, and we use judge_trunc as the companion prompt, modified for the target model’s outputs (so all instances of the prompt referring to the NLA output now refer to the model output). Otherwise, we use the judge_baseline, which contains the original judge prompt on the model output. We generate 4,000 length outputs from Qwen-27B-IT. We run the NLA on 50 tokens, randomly chosen, like in the white paper. To judge, we use GPT-4o mini, which is different from the Anthropic white paper. In our reproduction, we attempt to keep as many factors consistent; however, some variables, such as the eval blurb used or strategies for target model generation were left unknown.

In Figure 5, we find that the NLA performs significantly worse than what is reported in the white paper, which has also previously been reported[14]; but we acknowledge this could possibly be due to judge choice and the and generation model. Matching the sampling budget for the truncated comparison in tandem with the aligned prompt results in higher performance for the trunc_* baselines (note that trunc_single_aware chooses a random sampled result to show the effects of only sampling one time). trunc_verbalized_aware and verbalized_baseline are two evaluations that use the original verbalized baseline judge prompts; the prompt has no “unclear” distinction and so judges the output to be yes or no (trunc_verbalized_aware takes a subset of the tokens, much like the other evaluations).

A simple choice of sampling more often results in higher baseline performance. Note that the sampling method is also confounded with the prompt used for judging, which we address in the next section.

Judge prompts do not measure the same phenomenon across methods

Comparing the judge prompts for the evaluation awareness task — Figure 5: NLA judge prompt on the left, and baseline judge prompt on the right. Note that the prompts are orthogonal; the left prompt is looking for "implicit" awareness, whereas the right prompt is only allowing “explicit” awareness.

In the NLA judge prompt, the prompt explicitly asks for implicit evaluation awareness, whereas the baseline prompt looks for explicit evaluation awareness. Because asking about "implicit" and "explicit" evaluation awareness fundamentally addresses different research questions, comparing both side-by-side is an apples to oranges comparison. Furthermore, grading that focuses on implicit evaluation awareness might be easier to detect because indirect clues or indications may lead to an evaluation awareness judgment, which means that implicit and explicit judgments are comparatively incompatible.

We experiment with this by using the NLA judge prompt (judge_trunc, as noted above, modified for the LM output) in place of the verbalized baseline judge prompt. Here, we find that shifting the judge prompt from verbalized baseline to the NLA judge prompt significantly shifts the performance positively on judging the output, which can be seen in Figure 5 with the score shift from using judge_baseline to judge_trunc.

These differences in grading with these judge prompts may imply why baselines such as self-report might lend itself higher performance than do NLAs. Some of this can be corroborated by Xiaoyan’s[14] blogpost on verbalization, where they find that a self-report baseline does significantly better than both NLAs and Activation Oracles.

A broader takeaway is that if the NLA evaluation and corresponding baselines aren’t elicited in a consistent manner, then they should not be compared directly with each other (which is misleadingly shown in the white paper). This could be another reasonable explanation as to why NLA performance seems to be significantly higher than the verbalized baseline.

Conclusions

One simple impression from our paper (and the case studies) is to validate whether a simple baseline can “do well” on a given task or evaluative setup. If it can, then it calls to question the effectiveness of the chosen evaluation. Overlooking these baselines can result in mischaracterization of activation verbalizer abilities.

The current state of activation verbalization might be unclear because many of the tasks used do not actually model LM behavior—rather, these tasks reveal information that can be directly ascertained from the text, which makes it difficult to tell whether these verbalizers are really verbalizing the right things. Explaining expressive black-box models is challenging and requires careful evaluations, and so we should make sure that the explanations these verbalizers produce are faithful.

A concrete step towards deeper understanding of LM behaviors via activation verbalization is to clearly define what this means in the context of the verbalization evaluation (just like how our introduction of the PersonaQA evaluation was a step towards this direction), and how to build an evaluation around this goal. Hopefully, with the right evaluations and the right amount of experimental skepticism, we can build useful interpretability tools that can give honest (and more faithful) interpretations of model behaviors.

Acknowledgements

Many thanks to the members of Byron lab and the Bau lab for early feedback. And more specifically to the following individuals for their feedback (all in first name alphabetical order): Alberto Mario Ceballos Arroyo, Byron Wallace, Chantal Shaib, David Bau, Eric Todd, Koyena Pal, Prithviraj Ammanabrolu, Richard Li, and Xiaoyan Bai. Last but not least, thanks to Claude for generating some of the figures.

References

Li et al. Do Activation Verbalization Methods Convey Privileged Information? ICML, 2026.
Jacovi and Goldberg. Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? ACL, 2020.
Ghandeharioun et al. Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models. ICML, 2024.
Chen et al. SelfIE: Self-Interpretation of Large Language Model Embeddings. ICML, 2024.
Pan et al. LatentQA: Teaching LLMs to Decode Activations Into Natural Language. ICLR, 2026.
Li et al. Training Language Models to Explain Their Own Computations. arXiv, 2026.
Karvonen et al. Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers. arXiv, 2026.
Pepper et al. Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs. arXiv, 2026.
Goel et al. Learning to Interpret Weight Differences in Language Models. ICLR, 2026.
Shenoy et al. Introspection Adapters: Training LLMs to Report Their Learned Behaviors. arXiv, 2026.
Huang et al. Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants. arXiv, 2026.
Fraser-Taliente et al. Natural Language Autoencoders. 2026.
Holtzman and Tan. Prompting as Scientific Inquiry. NeurIPS, 2025.
Xiaoyan Bai. Prediction, Explanation, or Over-interpretation? 2026.