Grace Kind

Distinguishing between LLM injection detection methods

In "How might LLMs detect injected tokens?", I described two methods that LLMs could use to detect injected tokens in their output:

November 01, 2025


How might LLMs detect injected tokens?

Let's say Claude is generating some text in an autoregressive fashion. The output might look something like:

October 08, 2025