[HN Gopher] Chatbot Arena Leaderboard
___________________________________________________________________
Chatbot Arena Leaderboard

Author : tosh
Score  : 53 points
Date   : 2023-05-25 20:46 UTC (2 hours ago)

(HTM) web link (lmsys.org)
(TXT) w3m dump (lmsys.org)

| zhwu wrote:
| Very interesting! Quite surprised to see PaLM-2 ranked even lower
| than open-sourced Vicuna.

| tikkun wrote:
| When do you (HN readers) think that we'll have an open source
| model that scores 1150 or higher, and where do you think it'll
| come from?

| sottol wrote:
| The "win matrix" (dissimilarity matrix) seems very interesting;
| it looks, e.g., like Vicuna13b paired against gpt4 wins 20% of
| the time. A larger difference than I'd have guessed based on
| scores.

  | furyofantares wrote:
  | Yeah, the win matrix is what you want to look at if you haven't
  | internalized or memorized what various Elo differences mean.

| redox99 wrote:
| Unfortunately the Arena is missing some of the strongest "open"
| models, such as WizardLM Uncensored 30B. In fact they don't have
| any Llama 30B/65B based models, just 13B models.

| djdsol wrote:
| Why is there no Claude+? Seems like their competitor to GPT-4.

| exizt88 wrote:
| Only the bottom 2 out of the top 10 are open-source and available
| for commercial use. So if you want to use an open-source LLM for
| your commercial product, be aware that your competitors who use
| proprietary LLMs through APIs will outperform you _dramatically_.
| Or am I missing something?

  | netsec_burn wrote:
  | What you're missing is not reflected on the leaderboard right
  | now: Guanaco 65B.

    | EvgeniyZh wrote:
    | Guanaco is a LLaMA tune and thus is irrelevant for commercial
    | use, isn't it?

      | netsec_burn wrote:
      | Ah true! It isn't for commercial use.

  | version_five wrote:
  | I'd say you're missing the importance of not being bound to a
  | proprietary model, and of not having to explain to your
  | customers why you send their data to a third party. It's still
  | early days - definitely if you need the SOTA performance this
  | second, you don't have any options. But in the fairly near
  | term, I see no evidence that the proprietary _generic_ models
  | will keep their leads in a way that's meaningful for
  | commercial products. Do you?

| com2kid wrote:
| I've been working extensively with LLMs on a generative
| storytelling side project (named www.generativestorytelling.ai
| because I am terrible at naming things) and once prompts start
| getting complex, ChatGPT wins by a landslide. I can do all sorts
| of complicated prompts to ChatGPT[0] and it will, by and large,
| come up with great output.
|
| Meanwhile, Bard gets confused by basic things such as "after this
| message I will send another one, do not reply until the second
| message is sent" and instead tries to immediately reply.
|
| IMHO not very many people doing reviews of chatbots are really
| pushing the bots to their limits, and those who are pushing
| the bots really hard are often too busy to take the time and make
| their work public (which is the reason I am developing in the
| open!)
|
| [0]
| https://github.com/devlinb/arcadia/blob/main/backend/src/rou...

  | refulgentis wrote:
  | Have you tried Claude on stories? My goodness, it seemed out of
  | this world amazing a couple months back.

    | com2kid wrote:
    | * * *
___________________________________________________________________
(page generated 2023-05-25 23:01 UTC)
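
On the Elo point raised by sottol and furyofantares: a rating gap can be turned into an expected win rate (and back) with the conventional logistic Elo formula. The sketch below assumes the usual base-10, 400-point scale; whether the leaderboard's published scores map onto win rates exactly this way is an assumption, and the function names are made up for illustration.

```python
import math


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    A tie counts as half a win, so this is a win probability only
    when ties are rare. The 400-point scale is an assumption.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def rating_gap(win_rate: float, scale: float = 400.0) -> float:
    """Rating difference (A minus B) implied by A's expected score against B."""
    return scale * math.log10(win_rate / (1.0 - win_rate))


if __name__ == "__main__":
    # How often the lower-rated side is expected to score at various deficits.
    for gap in (50, 100, 200, 300):
        print(f"{gap:>3}-point deficit -> expected score {expected_score(0, gap):.2f}")

    # The 20% figure quoted in the thread, converted back into a rating gap.
    print(f"20% win rate -> gap of {rating_gap(0.20):.0f} points")
```

Under this model a 20% win rate corresponds to a deficit of roughly 241 points, which gives a quick way to compare a win-matrix entry against the gap between two leaderboard scores.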