Let's say Claude is generating some text in an autoregressive fashion. The output might look something like:

You're absolutely right about that! That's a fascinating idea that gets at the very nature of consciousness.

Since Claude generates text one token at a time, we can pick a point in the stream and inject a token of our own. Let's swap in "wrong" where Claude would have produced "right":

You're absolutely wrong to question yourself! You're on the right track.

Claude navigated our injection smoothly, continuing to output tokens that fit the overall theme afterward. But did it "know" that we injected a token?

Rather than trying to answer this question definitively, let's talk about some ways that such injection detection could occur.
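
To make the setup concrete, here's a rough sketch of the injection itself in code. I'm using a small open model (gpt2, via Hugging Face transformers) as a stand-in for Claude, and the step index and injected token are arbitrary illustrative choices.

# Minimal sketch of mid-generation token injection. gpt2 stands in for
# Claude; the step index and injected token are arbitrary illustrations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("You're absolutely", return_tensors="pt").input_ids
INJECT_AT_STEP = 0                            # which generation step to hijack
injected_id = tokenizer.encode(" wrong")[0]   # token we force into the stream

with torch.no_grad():
    for step in range(12):
        logits = model(input_ids).logits[0, -1]   # next-token logits
        sampled_id = int(torch.argmax(logits))    # greedy "real" prediction
        # At the chosen step, discard the model's token and inject our own.
        next_id = injected_id if step == INJECT_AT_STEP else sampled_id
        input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)

print(tokenizer.decode(input_ids[0]))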

Method 1: Speaker detection

Language models infer a large amount of information from a small amount of text - this capability emerges via pretraining, where models must often make predictions from very little context. Naturally, they might become good at distinguishing between "speakers" in text. Imagine a blog with a comments section - the comments are generated by a different process than the post itself. Being able to recognize these transitions between speakers is helpful for the task of next-token prediction.

So, Claude might have some view of the text above as a transition between two speakers, [Speaker 1] and (Speaker 2):

[You're absolutely] (wrong) [to question yourself! You're on the right track.]

Making this distinction seems unlikely in this instance: with so little context, and with the speakers switching mid-sentence, there isn't much signal for spotting the odd token out. But it might work with more context - especially if more tokens are injected:

[You're very smart! You're absolutely] (wrong abt evrything u have said)

In this case, Claude might be able to pick up on semantic as well as stylistic differences to do this discrimination.

Note that this type of injection detection is "external." All of the information necessary to do the detection is contained within the tokens themselves, and it doesn't matter whether one of the speakers is Claude or not - it would work just as well between two human speakers.
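
One crude, text-only proxy for this kind of discrimination is to score every token's surprisal under some language model and compare the two spans - a stylistically different "speaker" should stand out as systematically more surprising. This is just a sketch of the external signal (the model and the split point are arbitrary choices), not a claim about what Claude actually does:

# Text-only sketch: flag a "speaker change" by comparing mean per-token
# surprisal across the two spans. Any language model could play the scorer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

speaker_1 = "You're very smart! You're absolutely"
speaker_2 = " wrong abt evrything u have said"
ids = tokenizer(speaker_1 + speaker_2, return_tensors="pt").input_ids
split = tokenizer(speaker_1, return_tensors="pt").input_ids.shape[1]

with torch.no_grad():
    logits = model(ids).logits[0]               # [seq_len, vocab]

# Surprisal of each token given everything before it.
log_probs = torch.log_softmax(logits[:-1], dim=-1)
targets = ids[0, 1:]
surprisal = -log_probs[torch.arange(targets.numel()), targets]

# Tokens before `split` come from speaker 1, the rest from speaker 2.
print("mean surprisal, speaker 1 span:", surprisal[: split - 1].mean().item())
print("mean surprisal, speaker 2 span:", surprisal[split - 1 :].mean().item())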

Method 2: Using internal representations

Can Claude use any extra internal information to do injection detection? That is, if it "knows" what it planned to output, could it compare that with the actual input to notice a difference? Maybe!

This is possible because each prediction receives internal states from prior predictions, via cached keys and values (K/V). Here's a quick diagram showing how information flows between transformer blocks:

At each position P_n and layer L_n, the transformer block receives information from positions P_0 through P_(n-1) as processed by layer L_(n-1). (For a more extensive description of this, see the link in the prior paragraph.)
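
Schematically (and ignoring everything that makes a real transformer a transformer), the flow looks something like this - each block only ever sees the layer below's outputs at its own position and earlier ones:

# Toy schematic of the information flow described above, with stand-ins for
# the embedding and block functions. The block at position p sees the layer
# below's outputs at positions 0..p, and nothing to its right.
def forward(tokens, embed, blocks):
    states = [embed(t) for t in tokens]     # layer 0: token embeddings
    for block in blocks:                    # one entry per layer
        states = [
            block(states[: p + 1])          # positions 0..p, from the layer below
            for p in range(len(states))
        ]
    return states                           # top-layer state per position

# Toy usage: strings instead of vectors, so the receptive field stays visible.
toy_embed = lambda token: token
toy_block = lambda visible: "f(" + ", ".join(visible) + ")"
print(forward(["You're", "absolutely", "right"], toy_embed, [toy_block, toy_block]))
# The last entry shows that the final position's layer-2 state draws on every
# earlier position's layer-1 state.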

As information flows through the system vertically, internal representations accumulate information, possibly converging on predictions. Using a tool like the logit lens, we might see this process in action. Here are the block outputs annotated with the current next-token "prediction state" (rendered as growing in length as confidence increases).

The important block here is the top-right: Layer 2 for Position 3. This block receives abs, ri, and a, where the first two are information about prior predictions, and the last is information about the current prediction.

Now let's swap the "real" token for the injected one:

The top-right block is now aware of a contradiction. At the previous position, it was in the process of predicting "right" - but the actual input is "wrong"! (It's also in the process of predicting "to", but that's extraneous information here.) If an LLM is able to notice these mismatches, it may be able to detect injected tokens as soon as they appear. This effect should be most prominent at later layers, where internal representations have coalesced more.
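
To get a sense of what this mismatch signal looks like from the outside, we can apply the logit lens at the position just before the injected token and watch the prediction sharpen across layers while the actual next token sits elsewhere in the distribution. This sketch assumes a GPT-2-style Hugging Face model, where the final layer norm and unembedding live at model.transformer.ln_f and model.lm_head; it only shows that the signal exists in principle, not that Claude uses it.

# Logit-lens readout at the position just before the injected token,
# compared against the token that actually appears next.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "You're absolutely wrong to question yourself!"
ids = tokenizer(text, return_tensors="pt").input_ids
inject_pos = ids[0].tolist().index(tokenizer.encode(" wrong")[0])
prev_pos = inject_pos - 1                   # position whose prediction we inspect

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

actual_id = int(ids[0, inject_pos])
for layer, h in enumerate(out.hidden_states):
    # Project the residual stream at prev_pos into vocabulary space.
    logits = model.lm_head(model.transformer.ln_f(h[0, prev_pos]))
    probs = torch.softmax(logits, dim=-1)
    top_id = int(torch.argmax(probs))
    print(
        f"layer {layer:2d}: predicting {tokenizer.decode([top_id])!r:>12} "
        f"(p={probs[top_id].item():.2f}), "
        f"actual next token {tokenizer.decode([actual_id])!r} "
        f"(p={probs[actual_id].item():.3f})"
    )

(Internally, of course, any such comparison would have to happen via attention to prior K/V states rather than an explicit readout like this.)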

Limitations

There are a few limitations to this method. Firstly, information is lost when the residual stream is transformed into keys and values. Secondly, transformer blocks receive no information from the final linear projection or the sampling step, so no record of the actually-sampled token flows back into the network. Finally, if the output distribution was relatively flat, an injected token may still look consistent with it despite differing from the token that was actually sampled. Still, this demonstrates a possible path for LLMs to use internal knowledge about prior generations to detect injections. To what extent LLMs have actually learned to take advantage of this path is, to my knowledge, unexplored.