[HN Gopher] The Dual LLM pattern for building AI assistants that...
       ___________________________________________________________________
        
       The Dual LLM pattern for building AI assistants that can resist
       prompt injection
        
       Author : simonw
       Score  : 57 points
       Date   : 2023-05-13 05:08 UTC (1 days ago)
        
 (HTM) web link (simonwillison.net)
 (TXT) w3m dump (simonwillison.net)
        
       | montebicyclelo wrote:
       | Here's "jailbreak detection", in the NeMo-Guardrails project from
       | Nvidia:
       | 
       | https://github.com/NVIDIA/NeMo-Guardrails/blob/327da8a42d5f8...
       | 
       | I.e. they ask the llm if the prompt will break the llm. (I
       | believe that more data /some evaluation on how well this performs
       | is intended to be released. Probably fair to call this stuff "not
       | battle tested".)
        
       | fnordpiglet wrote:
       | It feels like an LLM classifying the prompts without cumulative
       | context as well as the prompt output from the LLM would be pretty
       | effective. Like in the human mind, with its varying levels of
       | judgement and thought, it may be a case of multiple LLMs watching
       | the overall process.
        
       | SeriousGamesKit wrote:
       | Thanks SimonW! I've really enjoyed your series on this problem on
       | HN and on your blog. I've seen suggestions elsewhere about
       | tokenising fixed prompt instructions differently to user input to
       | distinguish them internally, and wanted to ask for your take on
       | this concept- do you think this is likely to improve the state of
       | play regarding prompt injection, applied either to a one-LLM or
       | two-LLM setup?
        
       | Vanit wrote:
       | I still don't believe that in the long term it will be tenable to
       | bootstrap LLMs using prompts (or at least via the same vector as
       | your users).
        
       | SheinhardtWigCo wrote:
       | Is it possible that all but the most exotic prompt injection
       | attacks end up being mitigated automatically over time, by virtue
       | of research and discussion on prompt injection being included in
       | training sets for future models?
        
         | jameshart wrote:
         | By the same logic, humans should no longer fall for phishing
         | scams or buy timeshares since information about them is widely
         | available.
        
           | tomohelix wrote:
           | Most well-educated people won't. A well trained AI can behave
           | pretty close to a well-educated person in common sense.
        
           | SheinhardtWigCo wrote:
           | I'd say it's not the same thing, because most humans don't
           | have an encyclopedic knowledge of past scams, and are not
           | primed to watch out for them 24/7. LLMs don't have either of
           | these problems.
           | 
           | An interesting question is whether GPT-4 would fall for a
           | phishing scam or try to buy a timeshare if you gave it an
           | explicit instruction to avoid being scammed.
        
             | danpalmer wrote:
             | I sort of disagree that LLMs don't have the same pitfalls.
             | LLMs aren't recording everything they are trained with,
             | like humans, the training data affects a general
             | behavioural model. When answering, they aren't looking up
             | information.
             | 
             | As for being "primed", I think the difference between
             | training, fine tuning, and prompting, is the closest
             | equivalent. They may have been trained with anti-scam
             | information, but they probably haven't been fine tuned to
             | deal with scams, and then haven't been prompted to look out
             | for them. A human who isn't expecting a scam in a given
             | conversation is much less likely to notice it than one who
             | is asked to find the scam.
             | 
             | Lastly, scams often work by essentially pattern matching
             | behaviour to things we want to do. Like taking advantage of
             | peoples willingness to help. I suspect LLMs would be far
             | more susceptible to this sort of thing because you only
             | have to effectively pattern match one thing: language. If
             | the language of the scam triggers the same "thought"
             | patterns as the language of a legitimate conversation, then
             | it'll work.
             | 
             | To avoid all of this I think will require explicit
             | instruction in fine tuning or prompts, but so does
             | everything, and if we train for everything then we're back
             | to square one with relative priorities.
        
       | amelius wrote:
       | The one thing that will solve this problem is when AI assistants
       | will actually become intelligent.
        
       | amrb wrote:
       | So we just recreated all of the previous SQL injection security
       | issues in LLM's, fun times
        
       | inopinatus wrote:
       | This is technocrat hubris. Congruent vulnerabilities exist in
       | humans, for which it's a solved problem. Go re-read _The Phoenix
       | Project_.
       | 
       | The real world solution is to use auditors, and ensure that most
       | operations are reversible.
        
       | fooker wrote:
       | This is avoiding the core problem (mingling control and data)
       | with security through obscurity.
       | 
       | That can be an effective solution, but it's important to
       | recognize it as such.
        
         | rst wrote:
         | It's avoiding the problem by separating control and data, at
         | unknown but signficant cost to functionality (the LLM which
         | determines what tools get invoked doesn't see the actual data
         | or results, only opaque tokens that refer to them, so it can't
         | use them directly to make choices). I'm not sure how that
         | qualifies as "security by obscurity".
        
       ___________________________________________________________________
       (page generated 2023-05-14 23:00 UTC)