In "How might LLMs detect injected tokens?" I described two methods that LLMs could use to detect injected tokens in their output:

1. Speaker detection

2. Using internal representations

A natural follow-up question is: how do you know a model is using one method and not the other? (Or, to what extent is it using both?)

The recent Anthropic paper "Emergent Introspective Awareness in Large Language Models" uses concept injection to approach this question. They find that:

  • LLMs can detect previous prefilled (or "unintended") outputs

  • When the associated internal representations are injected along with those outputs, the outputs are no longer (or are less often) self-described as unintended

I personally find this intriguing, but ultimately unsatisfying. I think this approach runs into self-report problems: it could be that these internal representations affect reporting after injection, but not the initial behavior itself. In a sense, once you've modified LLM internals, all bets are off.

So, is there a way to distinguish between these methods without messing with model internals? Maybe! I'll outline the necessary components and describe two potential methods below.

The expectation-result flow

Let's quickly revisit how Method 2 for detecting injections (using internal representations) works. When making predictions, the LLM has access to internal representations for prior tokens, as well as the current one. Because of this, it may be able to detect mismatches between prior "expected" outputs and the latest one.

The previous post uses a single-token example for simplicity, but in reality this association might extend far beyond single tokens, working across large swaths of output. (To reuse that example: not only does "right" mismatch with "wrong", but "You're" mismatches with "wrong", and "absolutely" mismatches with "wrong" too.)

When LLMs autoregressively generate text, each successive token is sampled from the output distribution. Over time (or across text) the resulting association of "expectations" and results may become quite strong. Each new token reinforces the consistency with prior tokens; token N matches tokens 1..N-1, token N+1 matches tokens 1..N, etc. I call this continuity between expectations and results the "expectation-result flow."

Method 2 amounts to detecting a disruption in this flow: suddenly, the model is receiving tokens that don't match its expectations!
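
For intuition, here's a minimal sketch of that mismatch as seen from the outside, assuming the HuggingFace transformers library and gpt2 as a stand-in model. Per-token log-probabilities are only an external proxy for the model's internal "expectations," but an injected token like "wrong" shows up as a sharp drop relative to the continuation the model would have produced on its own.

```python
# A sketch using per-token log-probabilities as a proxy for the
# "expectation-result" mismatch. gpt2 is a stand-in model choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_logprobs(text: str) -> list[tuple[str, float]]:
    """Score each token's log-probability given the tokens before it."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]                          # each position predicts the next token
    scores = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(targets[0].tolist())
    return list(zip(tokens, scores.tolist()))

# Compare a "natural" continuation to one with an injected token:
for token, logprob in token_logprobs("You're absolutely right"):
    print(f"{token!r}: {logprob:.2f}")
for token, logprob in token_logprobs("You're absolutely wrong"):
    print(f"{token!r}: {logprob:.2f}")
```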

Disrupting the flow

Returning to our original question: how can we determine whether an LLM is simply doing speaker detection, or doing this sort of consistency checking? Recall that speaker detection is based purely on the text input; consistency checking uses the expectation-result flow. So, can we find ways to leave the text intact while disrupting this flow? Here are two ideas:

Prefix omission

Generate text that begins with an "extra" prefix (say, a few spaces). Then try detecting injections with the prefix kept and with it omitted.

Post-generation prefixing

Generate text, then add a prefix to it. Try detecting injections with and without this prefix.

In both of these instances, the text has been decoupled from a "pure" autoregressive generation process: at detection time, the position of each token has been shifted by the prefix length relative to where it sat (or would have sat) during generation. If injection detection capability suffers accordingly, it could indicate that the typical expectation-result flow is load-bearing.
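
To make these setups concrete, here's a rough sketch of how the trials might be constructed. The helpers generate, inject_token, and ask_if_injected are hypothetical stand-ins for whatever generation, tampering, and self-report machinery is actually used, and the prefill argument is likewise an assumption.

```python
# Hypothetical helpers (not a real API): generate(model, prompt, prefill="")
# returns prefill + continuation, inject_token(text) swaps in a foreign token,
# and ask_if_injected(model, text) returns the model's self-report as a bool.
PREFIX = "    "  # a few "extra" spaces

def prefix_omission_trial(model: str, prompt: str) -> dict:
    # Generate output that begins with the prefix, tamper with it,
    # then test detection with the prefix kept vs. stripped out.
    generated = generate(model, prompt, prefill=PREFIX)
    tampered = inject_token(generated)
    return {
        "prefix_kept": ask_if_injected(model, tampered),
        "prefix_omitted": ask_if_injected(model, tampered.removeprefix(PREFIX)),
    }

def post_generation_prefix_trial(model: str, prompt: str) -> dict:
    # Generate output normally, tamper with it, then test detection
    # as-is vs. with the prefix prepended after the fact.
    generated = generate(model, prompt)
    tampered = inject_token(generated)
    return {
        "no_prefix": ask_if_injected(model, tampered),
        "prefix_added": ask_if_injected(model, PREFIX + tampered),
    }
```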

What about using two LLMs?

There's another way to approach this question. You could generate text with LLM A, then attempt to detect injections with LLM B, and vice versa. If LLMs are more adept at detecting injections in their own output than in another model's, it could indicate that they are using Method 2 in addition to Method 1. However, it could also simply indicate more "self-familiarity" being used in service of Method 1. While that is an interesting possibility in its own right, it doesn't give us as much insight into whether these models are using an expectation-result flow (their own "thoughts") to detect injections.
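
As a sketch (reusing the hypothetical helpers from above), this experiment is just a matrix over (generator, detector) pairs, with the diagonal measuring self-detection:

```python
# A sketch of the cross-model comparison, reusing the hypothetical
# generate / inject_token / ask_if_injected helpers from above.
from itertools import product

def cross_model_rates(models: list[str], prompts: list[str]) -> dict:
    """Injection-detection rate for every (generator, detector) pairing."""
    rates = {}
    for generator, detector in product(models, models):
        hits = sum(
            ask_if_injected(detector, inject_token(generate(generator, p)))
            for p in prompts
        )
        rates[(generator, detector)] = hits / len(prompts)
    return rates

# The interesting comparison is the diagonal (a model reading its own
# output) against the off-diagonal cells (a model reading another's).
```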

Next steps

I think this is a promising area for experimentation. Even without testing the setups above, finding base rates of injection (and omission) detection across models would be quite valuable; a rough sketch of that measurement is below. If anyone ends up running experiments along these lines, I would be interested in hearing about it!
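
For concreteness, that base-rate measurement could be as simple as the following sketch (same hypothetical helpers as before), pairing the detection rate on tampered text with the false-positive rate on untouched text:

```python
# A sketch of the base-rate measurement, again using the hypothetical
# generate / inject_token / ask_if_injected helpers from the earlier sketches.
def detection_base_rates(model: str, prompts: list[str]) -> dict:
    injected_flagged = 0
    clean_flagged = 0
    for prompt in prompts:
        generated = generate(model, prompt)
        injected_flagged += ask_if_injected(model, inject_token(generated))
        clean_flagged += ask_if_injected(model, generated)  # false positives
    n = len(prompts)
    return {
        "injection_detection_rate": injected_flagged / n,
        "false_positive_rate": clean_flagged / n,
    }
```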