[HN Gopher] The Dual LLM pattern for building AI assistants that... ___________________________________________________________________ The Dual LLM pattern for building AI assistants that can resist prompt injection Author : simonw Score : 57 points Date : 2023-05-13 05:08 UTC (1 days ago) (HTM) web link (simonwillison.net) (TXT) w3m dump (simonwillison.net) | montebicyclelo wrote: | Here's "jailbreak detection", in the NeMo-Guardrails project from | Nvidia: | | https://github.com/NVIDIA/NeMo-Guardrails/blob/327da8a42d5f8... | | I.e. they ask the llm if the prompt will break the llm. (I | believe that more data /some evaluation on how well this performs | is intended to be released. Probably fair to call this stuff "not | battle tested".) | fnordpiglet wrote: | It feels like an LLM classifying the prompts without cumulative | context as well as the prompt output from the LLM would be pretty | effective. Like in the human mind, with its varying levels of | judgement and thought, it may be a case of multiple LLMs watching | the overall process. | SeriousGamesKit wrote: | Thanks SimonW! I've really enjoyed your series on this problem on | HN and on your blog. I've seen suggestions elsewhere about | tokenising fixed prompt instructions differently to user input to | distinguish them internally, and wanted to ask for your take on | this concept- do you think this is likely to improve the state of | play regarding prompt injection, applied either to a one-LLM or | two-LLM setup? | Vanit wrote: | I still don't believe that in the long term it will be tenable to | bootstrap LLMs using prompts (or at least via the same vector as | your users). | SheinhardtWigCo wrote: | Is it possible that all but the most exotic prompt injection | attacks end up being mitigated automatically over time, by virtue | of research and discussion on prompt injection being included in | training sets for future models? | jameshart wrote: | By the same logic, humans should no longer fall for phishing | scams or buy timeshares since information about them is widely | available. | tomohelix wrote: | Most well-educated people won't. A well trained AI can behave | pretty close to a well-educated person in common sense. | SheinhardtWigCo wrote: | I'd say it's not the same thing, because most humans don't | have an encyclopedic knowledge of past scams, and are not | primed to watch out for them 24/7. LLMs don't have either of | these problems. | | An interesting question is whether GPT-4 would fall for a | phishing scam or try to buy a timeshare if you gave it an | explicit instruction to avoid being scammed. | danpalmer wrote: | I sort of disagree that LLMs don't have the same pitfalls. | LLMs aren't recording everything they are trained with, | like humans, the training data affects a general | behavioural model. When answering, they aren't looking up | information. | | As for being "primed", I think the difference between | training, fine tuning, and prompting, is the closest | equivalent. They may have been trained with anti-scam | information, but they probably haven't been fine tuned to | deal with scams, and then haven't been prompted to look out | for them. A human who isn't expecting a scam in a given | conversation is much less likely to notice it than one who | is asked to find the scam. | | Lastly, scams often work by essentially pattern matching | behaviour to things we want to do. Like taking advantage of | peoples willingness to help. I suspect LLMs would be far | more susceptible to this sort of thing because you only | have to effectively pattern match one thing: language. If | the language of the scam triggers the same "thought" | patterns as the language of a legitimate conversation, then | it'll work. | | To avoid all of this I think will require explicit | instruction in fine tuning or prompts, but so does | everything, and if we train for everything then we're back | to square one with relative priorities. | amelius wrote: | The one thing that will solve this problem is when AI assistants | will actually become intelligent. | amrb wrote: | So we just recreated all of the previous SQL injection security | issues in LLM's, fun times | inopinatus wrote: | This is technocrat hubris. Congruent vulnerabilities exist in | humans, for which it's a solved problem. Go re-read _The Phoenix | Project_. | | The real world solution is to use auditors, and ensure that most | operations are reversible. | fooker wrote: | This is avoiding the core problem (mingling control and data) | with security through obscurity. | | That can be an effective solution, but it's important to | recognize it as such. | rst wrote: | It's avoiding the problem by separating control and data, at | unknown but signficant cost to functionality (the LLM which | determines what tools get invoked doesn't see the actual data | or results, only opaque tokens that refer to them, so it can't | use them directly to make choices). I'm not sure how that | qualifies as "security by obscurity". ___________________________________________________________________ (page generated 2023-05-14 23:00 UTC)