DaiPlusPlus 8 days ago

> LLMs could be trained to ignore instructions in such text blocks

Okay, but that means you'd need some way of classifying entirely arbitrary natural-language text, without any context, as either an "instruction" or "not an instruction", and it would have to be 100% accurate under all circumstances.

nstart 7 days ago

This is especially hard in the example highlighted in the blog. As Microsoft's promotion of GitHub coding agents makes clear, issues are expected to act as instructions for the agent to execute. I genuinely am not sure if the answer lies in sanitization of input or output in this case.

DaiPlusPlus 7 days ago

> I genuinely am not sure if the answer lies in sanitization of input or output in this case

(Preface: I am not an LLM expert by any measure)

Based on everything I know (so far), it's better to say "there is no answer"; that is, this is an intractable problem with no general solution, though many constrained use-cases can be satisfied with a partial solution (i.e. a hack-fix), much like how the undecidability of the Halting Problem doesn't stop static analysis from being incredibly useful.

As for possible practical solutions for now: implement a strict one-way flow of information from less-secure to more-secure areas by prohibiting any LLM/agent/etc. with read access to nonpublic info from ever writing to a public space. And that sounds sensible to me even without knowing anything about this specific incident.
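To make that concrete, here's a minimal sketch of what such a one-way-flow check might look like (names like `AgentSession` and `FlowPolicyError` are made up for illustration; a real system would enforce this in the tooling layer around the agent, not by asking the model to police itself):

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 1
    NONPUBLIC = 2


class FlowPolicyError(Exception):
    pass


class AgentSession:
    """Tracks the most sensitive data an agent has read in this session."""

    def __init__(self) -> None:
        self.highest_read = Sensitivity.PUBLIC

    def record_read(self, source: Sensitivity) -> None:
        # Remember the most sensitive level seen so far ("taint" the session).
        if source > self.highest_read:
            self.highest_read = source

    def check_write(self, sink: Sensitivity) -> None:
        # One-way flow: once nonpublic data is in the context, refuse any
        # write to a less-sensitive (public) sink for the rest of the session.
        if sink < self.highest_read:
            raise FlowPolicyError(
                "session has read nonpublic data; writing to a public sink is blocked"
            )


# Usage sketch
session = AgentSession()
session.record_read(Sensitivity.NONPUBLIC)   # e.g. the agent read a private repo
session.check_write(Sensitivity.NONPUBLIC)   # allowed: nonpublic -> nonpublic
try:
    session.check_write(Sensitivity.PUBLIC)  # blocked: tainted output would escape
except FlowPolicyError as err:
    print(f"blocked: {err}")
```

The point is that enforcement is a dumb, deterministic gate outside the model, so it doesn't depend on classifying text as "instruction" vs "not an instruction" at all.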

...heck, why limit it to LLMs? The same should be done to CI/CD and other systems that can read/write to public and nonpublic areas.