[HN Gopher] Show HN: Open-source model and scorecard for measuri...
___________________________________________________________________

Show HN: Open-source model and scorecard for measuring hallucinations in LLMs

Hi all! This morning, we released a new Apache 2.0 licensed model on
HuggingFace for detecting hallucinations in retrieval augmented
generation (RAG) systems. What we've found is that even when given a
"simple" instruction like "summarize the following news article,"
every available LLM hallucinates to some extent, making up details
that never existed in the source article -- and some of them do so
quite a bit.

As a RAG provider and a proponent of ethical AI, we want to see LLMs
get better at this. We've published an open source model, a blog post
that describes our methodology more thoroughly (with some specific
examples of these summarization hallucinations), and a GitHub
repository containing our evaluation of the most popular generative
LLMs available today. Links to all of them are in the linked blog
post, but for the technical audience here, the most interesting
additional links might be:

- https://huggingface.co/vectara/hallucination_evaluation_mode...
- https://github.com/vectara/hallucination-leaderboard

By releasing these under a truly open source license and detailing
the methodology, we hope to make it more practical for anyone to
quantitatively measure and improve the generative LLMs they're
publishing.

Author : eskibars
Score  : 48 points
Date   : 2023-11-06 19:11 UTC (1 hours ago)

(HTM) web link (vectara.com)
(TXT) w3m dump (vectara.com)

| simonhughes22 wrote:
| I worked on the model with our research team. It was recently
| featured in this NYT article
| (https://www.nytimes.com/2023/11/06/technology/chatbots-hallu....
| Post here to AMA. We are also looking for collaborators to help
| us maintain this model and make it the best it can be.
| Let us know if you want to help.
|
| Boerworz wrote:
| Hey, looks like your (very interesting) link got formatted
| incorrectly! Should be
| https://www.nytimes.com/2023/11/06/technology/chatbots-hallu...,
| right? :)
|
| vinni2 wrote:
| Great work! Interesting to see that Llama 2 7B is better than
| Llama 2 13B.
|
| awadallah wrote:
| I am the CEO and one of the cofounders of Vectara. We are very
| proud of the release of this open source eval model. We certainly
| would like to add more LLMs to the scorecard, and would love to
| collaborate with others to make the evaluation model even more
| accurate. Please reach out to bader@ or simon@ if interested.
___________________________________________________________________
(page generated 2023-11-06 21:00 UTC)