[HN Gopher] Show HN: Open-source model and scorecard for measuri...
       ___________________________________________________________________
        
       Show HN: Open-source model and scorecard for measuring
       hallucinations in LLMs
        
       Hi all! This morning, we released a new Apache 2.0 licensed model
       on HuggingFace for detecting hallucinations in retrieval augmented
       generation (RAG) systems.  What we've found is that even when given
       a "simple" instruction like "summarize the following news article,"
       every LLM that's available hallucinates to some extent, making up
       details that never existed in the source article -- and some of
       them quite a bit. As a RAG provider and proponents of ethical AI,
       we want to see LLMs get better at this. We've published an open
       source model, a blog more thoroughly describing our methodology
       (and some specific examples of these summarization hallucinations),
        and a GitHub repository containing our evaluation of the most
        popular generative LLMs available today. Links to all of them
        are referenced in the blog, but for the technical audience
        here, the most interesting additional links might be:  -
        https://huggingface.co/vectara/hallucination_evaluation_mode...  -
        https://github.com/vectara/hallucination-leaderboard  By
        releasing these under a truly open source license and detailing
        the methodology, we hope to make it viable for anyone to
        quantitatively measure and improve the generative LLMs they're
        publishing.
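        The scorecard in the linked repository reports, per LLM, how
        often the evaluation model judges a generated summary factually
        consistent with its source. A minimal sketch of that kind of
        aggregation, with made-up scores and an assumed 0.5 decision
        threshold (not the project's actual data or cutoff):

```python
def scorecard(scores, threshold=0.5):
    """Aggregate per-summary factual-consistency scores in [0, 1]
    into an accuracy (fraction judged consistent) and a
    hallucination rate (its complement)."""
    consistent = sum(1 for s in scores if s >= threshold)
    accuracy = consistent / len(scores)
    return {"accuracy": accuracy, "hallucination_rate": 1.0 - accuracy}

# Illustrative scores for two hypothetical models:
results = {
    "model_a": scorecard([0.9, 0.8, 0.3, 0.95]),   # 3 of 4 consistent
    "model_b": scorecard([0.6, 0.4, 0.2, 0.7]),    # 2 of 4 consistent
}
```

        In practice the per-summary scores would come from running the
        released evaluation model over (source article, summary) pairs;
        the aggregation step above is the same either way.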
        
       Author : eskibars
       Score  : 48 points
        Date   : 2023-11-06 19:11 UTC (1 hour ago)
        
 (HTM) web link (vectara.com)
 (TXT) w3m dump (vectara.com)
        
       | simonhughes22 wrote:
        | I worked on the model with our research team. It was recently
        | featured in this NYT article
        | (https://www.nytimes.com/2023/11/06/technology/chatbots-hallu...).
        | Posting here to AMA. We are also looking for collaborators to
        | help us maintain this model and make it the best it can be.
        | Let us know if you want to help.
        
         | Boerworz wrote:
         | Hey, looks like your (very interesting) link got formatted
         | incorrectly! Should be
         | https://www.nytimes.com/2023/11/06/technology/chatbots-
         | hallu..., right? :)
        
       | vinni2 wrote:
       | Great work! Interesting to see Llama 2 7B is better than Llama 2
       | 13B.
        
        | awadallah wrote:
        | I am the CEO and one of the cofounders of Vectara. We are very
        | proud of the release of this open source eval model. We would
        | certainly like to add more LLMs to the scorecard, and would
        | love to collaborate with others to make the evaluation model
        | even more accurate. Please reach out to bader@ or simon@ if
        | interested.
        
       ___________________________________________________________________
       (page generated 2023-11-06 21:00 UTC)