hngopher.com

       [HN Gopher] Chatbot Arena Leaderboard
       ___________________________________________________________________
        
       Chatbot Arena Leaderboard
        
       Author : tosh
       Score  : 53 points
       Date   : 2023-05-25 20:46 UTC (2 hours ago)
        
 (HTM) web link (lmsys.org)
 (TXT) w3m dump (lmsys.org)
        
       | zhwu wrote:
       | Very interesting! Quite surprised to see PaLM-2 ranked even lower
       | than open-sourced Vicuna.
        
       | tikkun wrote:
       | When do you (HN readers) think that we'll have an open source
       | model that scores 1150 or higher, and where do you think it'll
       | come from?
        
       | sottol wrote:
       | The "win matrix" (dissimilarity matrix) seems very interesting.
       | It looks like, e.g., Vicuna-13B wins 20% of the time when paired
       | against GPT-4. A larger difference than I'd have guessed based
       | on the scores.
        
         | furyofantares wrote:
         | Yeah the win matrix is what you want to look at if you haven't
         | internalized or memorized what various Elo differences mean
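
          [Editor's sketch] To make "what various Elo differences mean"
          concrete, here is a minimal Python sketch of the standard Elo
          expected-score formula (logistic curve, base 10, scale 400).
          It assumes the leaderboard's ratings follow that textbook
          convention and it ignores ties; the function names are
          hypothetical and not taken from the Arena's own code.

```python
import math


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) of A against B under standard Elo."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def rating_gap_for_win_rate(win_rate: float) -> float:
    """Rating difference (A minus B) implied by A's win rate; inverse of expected_score."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


if __name__ == "__main__":
    # A 20% win rate corresponds to the weaker model trailing by about 241 points.
    print(round(rating_gap_for_win_rate(0.20), 1))
    # Equal ratings give an even matchup.
    print(expected_score(1000.0, 1000.0))
```

          Under these assumptions, a 20% win rate maps to a gap of about
          241 Elo points, and a 400-point gap to roughly a 9% win rate.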
        
       | redox99 wrote:
       | Unfortunately the Arena is missing some of the strongest "open"
       | models, such as WizardLM Uncensored 30B. In fact they don't have
       | any Llama 30B/65B based models, just 13B models.
        
       | djdsol wrote:
       | Why is there no Claude+? It seems like their competitor to GPT-4.
        
       | exizt88 wrote:
       | Only the bottom 2 of the top 10 are open-source and available for
       | commercial use. So if you want to use an open-source LLM for your
       | commercial product, be aware that your competitors who use
       | proprietary LLMs through APIs will outperform you _dramatically_.
       | Or am I missing something?
        
         | netsec_burn wrote:
         | What you're missing is not reflected on the leaderboard right
         | now: Guanaco 65B.
        
           | EvgeniyZh wrote:
           | Guanaco is a LLaMA fine-tune and thus is irrelevant for
           | commercial use, isn't it?
        
             | netsec_burn wrote:
             | Ah true! It isn't for commercial use.
        
         | version_five wrote:
         | I'd say you're missing the importance of not being bound to a
         | proprietary model, and of not having to explain to your
         | customers why you send their data to a third party. It's still
         | early days - definitely if you need the SOTA performance this
         | second, you don't have any options. But in the fairly near
         | term, I see no evidence that the proprietary _generic_ models
         | will keep their leads in a way that's meaningful for
         | commercial products. Do you?
        
       | com2kid wrote:
       | I've been working extensively with LLMs on a generative
       | storytelling side project (named www.generativestorytelling.ai
       | because I am terrible at naming things) and once prompts start
       | getting complex, ChatGPT wins by a landslide. I can do all sorts
       | of complicated prompts to ChatGPT[0] and it will, by and large,
       | come up with great output.
       | 
       | Meanwhile, Bard gets confused by basic things such as "after this
       | message I will send another one, do not reply until the second
       | message is sent" and instead tries to immediately reply.
       | 
       | IMHO not very many people doing reviews of chatbots are really
       | pushing the bots to their limits, and those who are pushing
       | the bots really hard are often too busy to take the time and make
       | their work public (which is the reason I am developing in the
       | open!)
       | 
       | [0]
       | https://github.com/devlinb/arcadia/blob/main/backend/src/rou...
        
         | refulgentis wrote:
         | Have you tried Claude on stories? My goodness, it seemed
         | out-of-this-world amazing a couple of months back.
        
           | com2kid wrote:
           | * * *
        
       ___________________________________________________________________
       (page generated 2023-05-25 23:01 UTC)