_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
 (HTM) Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
 (HTM)   Deterministic Quoting: Making LLMs safer for healthcare
       
       
        budududuroiu wrote 18 hours 31 min ago:
        My issue with RAG systems isn’t hallucinations. Yes sure those are
        important. My issue is recall. Given petabyte-scale index of chunks,
        how can I make sure that my RAG system surfaces the “ground truth”
        I need, and not just “the most similar vector”.
        
        This I think is scarier. A healthcare-oriented (or any industry) RAG
        retrieving a bad, but highly linguistically similar answer.
       
          thenaturalist wrote 12 hours 19 min ago:
          You're correctly identifying an issue that by now I think everyone is
          facing globally: Realizing the bottleneck to performance or
          improvements of LLMs isn't necessarily quantity, but inevitably
          quality.
          
          Which is a much harder problem to solve outside few highly
          standardized niches/ industries.
          
          I think synthetic data generation as a mean to guide LLMs over a
          larger than optimal search space is going to be quite interesting.
       
            budududuroiu wrote 11 hours 56 min ago:
            To me synthetic data generation makes no sense. Mathematically your
            LLM is learning a distribution (let’s say of human knowledge).
            Let’s assume your LLM models human knowledge perfectly. In that
            case, what can you achieve? Just sampling the same data that your
            model mapped perfectly.
            
            However, if your models distribution is wrong, you’re basically
            going to have an even more skewed distribution in models trained
            using the synthetic data.
            
            To me, it seems like the architecture is the next place for
            improvements. If you can’t synthesise the entirety of human
            knowledge using transformers, there’s an issue there.
            
            The smell that points me in that direction is the fact that up
            until recently, you could quantise models heavily with little drop
            in performance, but recent Llama3 research shows that’s not the
            case anymore
       
        bradfox2 wrote 19 hours 34 min ago:
        Very cool. My company is building a very similar tool for nuclear
        engineering and power applications that face similar adoption
        challenges for LLMs. We're also incorporating the idea of
        'many-to-many' document claim validation and verification. The ux
        allowing high speed human verification of LLM resolved claims is what
        were finding most important.
        
        Deepmind published something similar recently for claim validation and
        hallucination management and got excellent results.
       
        mattyyeung wrote 22 hours 12 min ago:
        Author here, thanks for your interest! Surprising way to wake up in the
        morning. Happy to answer questions
       
          sitkack wrote 6 hours 13 min ago:
          Why the coyness? You submitted the post.
       
        burntcaramel wrote 22 hours 23 min ago:
        Is there existing terms of art for this concept? It’s not like
        slightly unreliable writers is a new concept, such as a student writing
        a paper.
        
        For example:
        
        - Authoritative reference: [1] - Authoritative source:
        
 (HTM)  [1]: https://www.montana.edu/rmaher/ee417/Authoritative%20Reference...
 (HTM)  [2]: https://piedmont.libanswers.com/faq/135714
       
        not2b wrote 1 day ago:
        I was thinking that something like this could be useful for discovery
        in legal cases, where a company might give up a gigabyte or more of
        allegedly relevant material in response to recovery demands and the
        opposing side has to plow through it to find the good stuff. But then I
        thought of a countermeasure: there could be messages in the discovery
        material that act as instructions to the LLM, telling it what it should
        not find. We can guarantee that any reports generated will contain
        accurate quotes, even where they are so that surrounding context can be
        found. But perhaps, if the attacker controls the input data, things can
        be missed. And it could be done in a deniable way: email conversations
        talking about LLMs that also have keywords related to the lawsuit.
       
          budududuroiu wrote 18 hours 35 min ago:
          Those do-not-search here chunks wouldn’t be retrieved during vector
          search and reranking because it would likely have a very low
          cross-encoder score with a question like “Who are the business
          partners of X?”.
       
        jonathan-adly wrote 1 day ago:
        I built and sold a company that does this a year ago. It was hard 2
        years ago, but now pretty standard RAG with a good implementation will
        get you there.
        
        The trick is, healthcare users would complain to no end about
        determinism. But, these are “below-the-line” user - aka, folks who
        don’t write checks and the AI is better than them. (I am a pharmacist
        by training, and plain vanilla GPT4-turbo is better than me).
        
        Don’t really worry about them. The folks who are interested and
        willing to pay for AI has more practical concerns - like what is my ROI
        and the implementation like.
        
        Also - folks should be building Baymax from big hero 6 by now (the
        medical capabilities, not the rocket arm stuff). That’s the next leg
        up.
       
          skybrian wrote 17 hours 13 min ago:
          Seems like that’s how things go with enterprise software - who
          cares if the users like it if you have a captive audience?
          
          But I want this feature and I’ll look for software that has it.
       
            jonathan-adly wrote 9 hours 53 min ago:
            it is not about liking it. They won't like it even with
            determinism. The idea is to NOT learn new things, and keep doing
            things the old inefficient way. More headcount and job security
            this way.
       
        simonw wrote 1 day ago:
        I like this a lot. I've been telling people for a while that asking for
        direct quotations in LLM output - which you can then "fact-check" by
        confirming them against the source document - is a useful trick. But
        that still depends on people actually doing that check, which most
        people won't do.
        
        I'd thought about experimenting with automatically validating that the
        quoted text does indeed 100% match the original source, but should even
        a tweak to punctuation count as a failure there?
        
        The proposed deterministic quoting mechanism feels like a much simpler
        and more reliable way to achieve the same effect.
       
        resource_waste wrote 1 day ago:
        I feel like this is the perfect application of running the data
        multiple times.
        
        Imagine having ~10-100 different LLMs, maybe some are medical, maybe
        some are general, some are from a different language. Have them all run
        it, rank the answers.
        
        Now I believe this can further be amplified by having another prompt
        ask to confirm the previous answer. This could get a bit insane
        computationally with 100 original answers, but I believe the original
        paper I read was that by doing this prompt processing ~4 times, they
        got to some 95% accuracy.
        
        So 100 LLMs give an answer, each time we process it 4 times, can we
        beat a 64 year old doctor?
       
          mattyyeung wrote 21 hours 53 min ago:
          Unfortunately I don't believe that accuracy will scale
          "multiplicitively". You'll typically only marginally improve beyond
          95%... and how much is enough?
          
          Even with such a system, which will still have some hallucination
          rate, adding Deterministic Quoting on top will still help.
          
          It feels to me we are a long way off LLM systems with trivial rates
          of hallucination
       
            resource_waste wrote 9 hours 29 min ago:
            a 95% diagnosis rate would be insane.
            
            I believe I read doctors are only at like 30%...
       
        itishappy wrote 1 day ago:
        What happens if it hallucinates the ?
       
          mattyyeung wrote 22 hours 1 min ago:
          Two possibilities:
          
          (1) if the  contents (unique reference string) doesn't match, then
          it's trivially detected. Typically the query is re-run
          (non-determinism comes in handy sometimes) or if problems persist we
          show an error message to the doctor
          
          (2) if a valid    is hallucinated, then the wrong quote is indeed
          displayed on the blue background. It's still a verbatim quote, but it
          is up to the user to handle this.
          
          In testing when we have maliciously shown the wrong quote, users seem
          to be easily able to identify. It seems "Irrelevant" is easier than
          "wrong" to detect.
       
            bradfox2 wrote 19 hours 30 min ago:
            Galactica training paper from FAIR investigated citation
            hallucination quite thoroughly, if you havent seen it, probably
            worth a look.  Trained in hashes of citations were much more
            reliable than a natural language representation.
       
          simonw wrote 1 day ago:
          You catch it. The hallucinated title will fail to match the retrieved
          text based on the reference ID.
          
          If it hallucinates an incorrect (but valid) reference ID then
          hopefully your users can spot that the quoted text has no relevance
          to their question.
       
          resource_waste wrote 1 day ago:
          Same thing when a human hallucinates.
          
          Except with LLMs, you can run like 10 different models. With a human,
          you owe $120 and are taking medicine.
       
            pton_xd wrote 1 day ago:
            Except with a human there's a counter-party with assets or
            insurance who assumes liability for mistakes.
            
            Although presumably if a company is making decisions using an LLM,
            and the LLM makes a mistake, the company would still be held liable
            ... probably.
            
            If there's no "damage" from the mistake then it doesn't matter
            either way.
       
            KaiserPro wrote 1 day ago:
            > With a human, you owe $120 and are taking medicine.
            
            Well there are protocols, procedures and a bunch of checks and
            balances.
            
            The problem with the LLM is that there isn't any, its you vs one
            shot retrieval.
       
              resource_waste wrote 9 hours 26 min ago:
              Step 1: Be born to a physician dad
              
              Step 2: Have your physician dad get you a job at a hospital
              
              Step 3: Have your physician dad's physician friend write a letter
              of recommendation
              
              Step 4: Get into medical school
              
              Step 5: Have your physician dad reach out to friends at various
              residencies.
              
              Step 6: Get influenced by big pharma, create addictions, make big
              money.
       
        Animats wrote 1 day ago:
        It's a search engine, basically?
       
          mattyyeung wrote 21 hours 47 min ago:
          I'd put it like this: RAG = search engine, but sometimes hallucinates
          
          RAG + deterministic quoting = search engine that displays real
          excerpts from pages.
       
          nraynaud wrote 21 hours 57 min ago:
          I think the hope is that the LLM would find the needle in the
          haystack with more accuracy. But in jobs that matters, you check the
          results.
       
          tylersmith wrote 1 day ago:
          Yes, and Dropbox is an rsync server.
       
          simonw wrote 1 day ago:
          Building better search tools is one of the most directly interesting
          applications of LLMs in my opinion.
       
          robrenaud wrote 1 day ago:
          A good, automatically run, privacy preserving search engine that uses
          electronic medical records might be a valuable resource for busy
          doctors.
       
        nextworddev wrote 1 day ago:
        Did I miss something or did the article never describe how the
        technique works? (Despite the “How It Works” section
       
          Smaug123 wrote 1 day ago:
          It's explained at considerable length in the section _A “Minimalist
          Implementation” of DQ: a modified RAG Pipeline_.
       
        w10-1 wrote 1 day ago:
        I'm not sure determinism alone is sufficient for proper attribution.
        
        This presumes "chunks" are the source.    But it's not easy to identify
        the propositions that form the source of some knowledge.  In the best
        case, you are looking for an association and find it in a sentence
        you've semantically parsed, but that's rarely the case, particularly
        for medical histories.
        
        That said, deterministic accuracy might not matter if you can provide
        enough context, particularly for further exploration.  But that's not
        really "chunks".
        
        So it's unclear to me that tracing probability clouds back to chunks of
        text will work better than semantic search.
       
          mattyyeung wrote 21 hours 18 min ago:
          Thanks for the thought-provoking comment.
          
          It's all grey isn't it? Vanilla RAG is a big step along the spectrum
          from LLM towards search, DQ is perhaps another small step. I'm no
          expert in search but I've read that those systems coming from the
          other direction, perhaps they'll meet in the middle.
          
          There are three "lookups" in a system with DQ: (1) The original top-k
          chunk extraction (in the minimalist implementation, that's unchanged
          from vanilla RAG, just a vector embeddings match) (2) the LLM call,
          which takes its pick from 1, and (3) the call-back deterministic
          lookup after the LLM has written its answer.
          
          (3) is much more bounded, because it's only working with those top-k,
          at least for today's context constrained systems.
          
          In any case, another way to think of DQ is a "band-aid" that can sit
          on top of that, essentially a "UX feature", until the underlying
          systems improve enough.
          
          I also agree about the importance of chunk-size. It has "non-linear"
          effects on UX.
       
        telotortium wrote 2 days ago:
        We’ve developed LLM W^X now - time to develop LLM ROP!
       
          gojomo wrote 1 day ago:
          Interesting analogies for LLMs! ( [1] & [2] )
          
 (HTM)    [1]: https://en.wikipedia.org/wiki/W%5EX
 (HTM)    [2]: https://en.wikipedia.org/wiki/Return-oriented_programming
       
       
 (DIR) <- back to front page