[HN Gopher] Claude 2 Internal API Client and CLI
       ___________________________________________________________________
        
       Claude 2 Internal API Client and CLI
        
       Author : explosion-s
       Score  : 57 points
       Date   : 2023-07-14 19:55 UTC (3 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | dandiep wrote:
       | Who here is using Claude? And can you comment on your experiences
       | with it vs. GPT 3.5/4?
        
         | BryanLegend wrote:
         | I'm pleased with it. Claude seems kinder and less patronizing
         | than GPT. Not as good at coding yet.
        
         | cl42 wrote:
         | Using it regularly for executive feedback at some of our
         | clients (think of this as an internal coach for policies). I'd
          | say it's almost as good as GPT-4 at having broader
         | conversations and sharing ideas.
         | 
         | The 100K model is FANTASTIC for quick prototyping as well.
         | 
         | Implementing everything via PhaseLLM to plug and play Claude +
          | GPT-3.5/4 as needed. No other LLM stacks up to these two.
        
         | xfalcox wrote:
          | I added it as an option in Discourse, and I've been happy with
          | its output for summarization tasks, suggesting titles, and
          | proofreading.
        
         | linsomniac wrote:
         | I started playing with it last weekend, the 100K token limit is
          | very useful for things like "Give me a summary of this 5-hour
         | Lex Fridman podcast in about 10 sentences: <podcast
         | transcript>"
        
         | celestialcheese wrote:
         | I use it for filtering and summarization tasks on huge
         | contexts. Specifically for extracting data from raw HTML in
         | scraping tasks. It works surprisingly well.
        
         | ronsor wrote:
         | The ability to upload entire documents is honestly a game-
         | changer, even if GPT-4 is better with certain reasoning tasks.
         | I don't think I can go back to tiny context lengths now.
        
         | jerrygenser wrote:
          | It's good for general things but less good at coding. It can
          | usually get the correct answer for simpler things, but its
          | Python is much less idiomatic than GPT-4's.
        
           | ec109685 wrote:
            | For JavaScript, it did just as well as GPT-4 for several
            | questions and used more modern JavaScript syntax.
            | 
            | It's the first time something has felt nearly as good, and
            | the user interface is a bit nicer.
        
             | speedgoose wrote:
              | Does it use ECMAScript modules instead of CommonJS
              | modules by default?
        
         | HyprMusic wrote:
         | We're still in the early stages of testing v2 in the real world
         | but it aced our suite of internal tests... we are very
          | impressed. Claude 1.2 did OK, but it struggled with nuance and
          | accuracy, whereas v2 seems to handle nuance very well and is
          | both accurate and, most importantly, consistent. The thing with
          | evaluating LLMs is that it's not about how well they do on your
          | first evaluation - consistency is key, and even the slightest
          | deviation in circumstance can throw them off, so we're being
          | very cautious before we make the jump. GPT-4 brought that
          | consistency, but the slow speed and constant downtime make it
          | very difficult to use in a product, so we'd love to move to
          | Anthropic.
         | 
         | Our product is a tool to turn user stories into end-to-end
         | tests so we use LLMs for NLP, identifying key parts of HTML and
         | writing very simple code (we've not officially launched to the
         | public just yet but for the curious, https://carbonate.dev is
         | our product).
        
         | paxys wrote:
         | I love it. It may not objectively be on par with GPT-4, but
         | uploading a 100 page document and getting a summary in seconds
         | is nothing short of miraculous.
        
           | swyx wrote:
           | is it an accurate summary tho?
        
             | technics256 wrote:
             | In my experience it is very hollow. It skips details unless
             | you force it to.
             | 
             | Gpt4 is still way better
        
           | deadmutex wrote:
           | How do you know if it is correct or a hallucination?
        
             | luma wrote:
             | Investigate, same as you would a paralegal etc. If it makes
             | an assertion, contest it and ask where it found supporting
             | evidence in the document for the claims made. Ask it to
             | make the counter-argument, also with sources. Verify as
             | needed.
        
             | paxys wrote:
             | That's what prompting is all about. Ask it to prove its
             | statements. Ask it to quote passages that support its
             | arguments. Then double and triple check it yourself. It
             | isn't going to do the work for you, but can still be a
             | pretty great reference tool.
        
             | ronsor wrote:
              | Presumably one can test it with documents one has already
              | read and knows well. If the summaries of the test
             | documents are good, future summaries will probably be OK
             | too.
        
               | zmmmmm wrote:
               | > If the summaries of the test documents are good, future
               | summaries will probably be OK too
               | 
               | But that is exactly what is problematic with
               | hallucinations. It's a rare / exceptional behaviour that
               | triggers extreme departure from reality. So you can't
               | estimate the extremes by extrapolating from common /
               | moderate observations. You would have to test a _lot_ of
               | previous documents to be confident, and even then there
               | would be a residual risk.
        
               | lambdaba wrote:
               | Maybe having it summarize a fiction book (outside of
               | training data)?
        
         | drewbitt wrote:
          | Claude's training data is about a year more recent, which is
          | often beneficial. The 100k token limit is fantastic for long
         | conversations and pasting in documents. The two downsides are
         | 1) it seems to get confused a bit more than GPT-4 and I have to
         | repeat instructions more often 2) the code-writing ability is
         | definitely subpar compared to GPT-4
        
         | explosion-s wrote:
          | I prefer it over 3.5 (can't afford GPT-4, so I'm not sure
          | about comparisons there). It's much faster, imo, and refuses
          | to respond less often. In addition, they make uploading
          | (text-based) files easy, so although it's not truly
          | multimodal it's still nice to use.
         | 
         | I also like the 100k token limit, that's insane. It almost
         | never loses track of what you were talking about!
        
           | binkHN wrote:
           | I looked at the pricing and it appears to be less than half
           | the cost of GPT-4, but significantly more expensive than
           | GPT-3.5. Does that sound correct?
        
         | blowski wrote:
         | I've spent quite a bit of time with both, but I'm not an expert
          | in this field, so take my comments with a fist of salt.
         | 
         | It's pretty good. Certainly as good as GPT-3.5 for speed and
         | quality. Claude seems to consider the context you've supplied
         | more than GPT-3.5.
         | 
          | Compared to GPT-4, it has similar levels of knowledge. Claude
          | is less verbose. It's less good at building real-world models
          | based on context. Anecdotally, I've found it hallucinates more
          | than GPT.
         | 
         | So, it's probably better at summarising large blocks of text,
         | but less good at generating content that requires knowledge
         | outside of what you've supplied.
        
         | swyx wrote:
         | comparing them every day via https://github.com/smol-ai/menubar
         | . i'd say when it comes to coding I pick their suggestions
         | about 30% of the time. not SOTA, but pretty darn good!
        
         | philipkglass wrote:
         | I am frequently interested in problems where answers are easily
         | calculated from public data but the answer is unlikely to be
         | already recorded in a form that search engines can find.
         | Normally I spend a while noodling around looking for data and
         | then use unit-conversion and basic arithmetic to get the final
         | answer.
         | 
         | I tested Claude vs ChatGPT (which I believe is GPT 3.5) and vs
         | Bard for a problem of this sort.
         | 
         | I asked:
         | 
         | 1) What current type of power reactor consumes the least
         | natural uranium per megawatt hour of electricity? (The answer
         | is the pressurized heavy water reactor or CANDU type).
         | 
         | 2) How much natural uranium does a PHWR consume per megawatt
         | hour of electricity generated? (The answer is about 18 grams.)
         | 
         | 3) How many terawatt hours does the United States generate
         | annually from natural gas? (The answer as of 2022 is 1689 TWh,
         | but any correct answer from the past 5 years would have been
         | ok.)
         | 
         | 4) How much natural uranium would the United States need to
         | replace the electricity it currently generates from natural
          | gas? (The answer is 1689 * 10^6 * 18 grams, i.e. about 30,400
         | metric tons of uranium.)
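          | 
          | A quick check of that last step in Python, using just the
          | figures quoted above (1 TWh = 10^6 MWh):
          | 
          |     grams_per_mwh = 18    # natural uranium per MWh (PHWR)
          |     gas_twh = 1689        # US natural gas electricity, 2022
          |     grams = gas_twh * 1e6 * grams_per_mwh
          |     print(grams / 1e6)    # in metric tons: ~30,400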
         | 
          | In the past, Bard, Claude, and ChatGPT all correctly identified
         | the CANDU or PHWR as the most efficient current reactor type.
         | 
         | Claude did the arithmetic correctly at stages 3 and 4, but it
         | believed that a PHWR consumed about 170 grams of uranium per
         | megawatt hour so its answer was off by nearly a factor of 10.
         | ChatGPT got the initial grams-per-MWh value correct but its
         | arithmetic was wild fantasy, so it was off by about a factor of
         | 10000. Bard made multiple mistakes.
         | 
         | ------
         | 
         | I just retried with Bard and ChatGPT as of today. On today's
          | retry, they failed at the first step.
         | 
         | Bard's response to the initial prompt was "According to the
         | World Nuclear Association, an MSR could use as little as 100
         | grams of uranium per megawatt hour of electricity. This is
         | about 100 times less than the amount of uranium used by a
         | traditional pressurized water reactor."
         | 
         | Since there are no MSRs currently generating electricity, this
         | answered the wrong question. The answer is also quantitatively
         | wrong. Current PWRs consume nowhere near 10,000 grams of
         | uranium per megawatt hour.
         | 
         | ChatGPT just said "As of my knowledge cutoff in September 2021,
         | the type of power reactor that consumes the least natural
         | uranium per megawatt hour of electricity is the pressurized
         | water reactor (PWR). PWRs are one of the most common types of
         | nuclear reactors used for commercial electricity generation."
         | 
          | This is also wrong. It correctly identified the CANDU as the
          | most efficient in a previous session, but that was a while
          | ago. I don't know if it was just randomness that caused Bard
          | and ChatGPT to previously deliver correct answers at the
          | first step.
        
         | gizajob wrote:
         | I spent the afternoon chatting with it one day this week and
         | had a brilliant time. I fed it half of a book I've written
         | recently, a piece of narrative and descriptive non-fiction, and
         | its analysis was absolutely great. It digested the text and
         | found things that even human readers have missed. What was
         | interesting was that the book is mostly genderless, and at
          | first it gave its analysis as if the writer were male. Then I
         | said "the writer is actually a woman" and it not only
         | apologised quite genuinely for getting it wrong, it altered its
         | literary analysis and criticism in a way that was perfectly
         | suited to a human reader knowing that the writer was female,
         | and changed the slant of its analysis. It was deeply useful and
         | interesting to converse with, and it found the relevant topics
         | that an educated human reader would likely find interesting and
         | comment on... and it did this in a few minutes, compared to a
         | human reader where you'd be talking weeks of latency to read
         | and analyse the text as a complete work.
         | 
         | Pretty great! Bit of a party trick at the same time (it did
         | hallucinate a couple of minor things) but enough for me as the
         | writer to be gripped by talking to Claude. It even came up with
         | some really interesting questions to ask _me_ once I told it
         | that I was the author, and many of them were better than a lot
         | of lazy interviewers or reviewers would come up with.
         | 
         | Highly recommended.
        
         | foundry27 wrote:
         | YMMV, but I've found that interacting with Claude
         | conversationally gives me a much stronger impression of having
         | a productive discussion with an individual, receiving pushback
          | on ideas that had identifiable flaws and getting advice on how
         | to improve my own thought processes, rather than the blind
         | obedience that GPT-4 output is so well known for. When it comes
         | to raw problem-solving capacity GPT-4 still handily beats it,
         | but this is the first LLM I've used that makes me actually
         | regret having to swap to GPT-4 to analyze a trickier problem.
        
           | BoorishBears wrote:
            | Everyone accepts that the output you get from LLMs is
            | largely predicated on grounding them, but few seem to
            | realize that grounding applies to more than raw data.
           | 
           | They perform better at many tasks simply by grounding their
            | alignment in-context, by giving them very specific people
            | to act as.
           | 
            | It's an example of something that "prompt engineering"
            | solves today, and that people only glancingly familiar with
            | how LLMs work insist won't be needed soon... but by their
            | very nature the models will always have this limitation.
           | 
           | Say user A is an expert with 10 years of experience and user
            | B is a beginner with 1 year of experience: they both enter a
           | question and all the model has to go on is the tokens in the
           | question.
           | 
            | The model might have uncountable ways to reply to that
            | question if you had inserted more tokens, but with only the
            | question in context, you'll always get answers that are
            | clustered around the mean answer it can produce... but
            | because it's the literal mean of all those possibilities,
            | it's unlikely either user A or user B will find it
            | particularly great.
           | 
            | Because of that, there's no way to ever produce an answer
            | that satisfies both A and B _to the full capabilities of
            | that LLM_. When the input is just the question, you're not
            | even touching the tip of the iceberg of knowledge it could
            | have distilled into a good answer. And so, just as you're
            | finding that Claude's pushback and advice are useful,
            | someone will say it's more finicky and frustrating than GPT
            | 3.5.
           | 
            | It mostly boils down to the fact that groups of users
            | aren't really defined by the mean. No one is the average of
            | all developers in terms of understanding (if anything,
            | that'd make you an exceptional developer); instead, people
            | are clustered around various levels of understanding in
            | very complex ways.
           | 
           | -
           | 
           | With that in mind, instead of banking on the alignment and
           | training data of a given model happening to make the answer
           | to that question good for you, you can trivially "ground" the
           | model and tell it you're a senior developer speaking frankly
            | with your coworker who's open to pushback and realizes you
           | might have the X/Y problem and other similar fallacies.
           | 
            | You can remind it that it's allowed to be unsure, or very
            | sure; you can even ask it to list gaps in its abilities (or
            | yours!) that are most relevant to a useful response.
           | 
            | That's why hearing that model X can't do Y but model Z can
            | doesn't really pass muster for me at this point, unless how
            | Y was fed into the model is shared.
        
             | civilitty wrote:
              | _> The model might have uncountable ways to reply to that
              | question if you had inserted more tokens, but with only
              | the question in context, you'll always get answers that
              | are clustered around the mean answer it can produce...
              | but because it's the literal mean of all those
              | possibilities, it's unlikely either user A or user B will
              | find it particularly great._
             | 
             | I refer to it as giving the LLM "pedagogical context" since
             | a core part of teaching is predicting what kind of answer
             | will actually help the audience depending on surrounding
             | context. The question "What is multiplication?" demands a
             | vastly different answer in an elementary school than a
             | university set theory class.
             | 
              | I think that's why there's such a large variance in HNers'
              | experiences with ChatGPT. The GPT API with a custom system
             | prompt is far more powerful than the ChatGPT interface
             | specifically because it grounds the conversation in the way
             | that the moderated ChatGPT system prompt can't.
             | 
             | The chat GUI I created for my own use has a ton of
             | different roles that I choose based on what I'm asking. For
             | example, when discussing cuisine I have roles like
              | (shortened and simplified) "Julia Child talking to a
             | layman who cares about classic technique", "expert
             | molecular gastronomy chef teaching a culinary school
             | student", etc.
        
         | pmoriarty wrote:
         | I've played around with Claude quite a bit, but mostly with
         | creative writing, at which I think it is stronger than any
         | other LLMs that I've tried, including GPT, Claude+ (which as
         | far as I can tell has not been rebranded as Claude 2), GPT 3.5,
         | Bard, and Bing.
         | 
         | I also much prefer to use Claude for explanations (I haven't
         | experimented much with Claude+, but limited experiments have
          | shown it to be even better) over the GPTs and other LLMs. It
         | gives much more thorough and natural-sounding explanations than
         | the competition, without extra prompting.
         | 
         | That said, the Claude variants don't seem to be as good at
         | logic-puzzly sort of stuff that most people love to test LLMs
          | with. So if you're into that, you're probably better off with
         | GPT4.
         | 
          | I also haven't tested it much with programming... but I've been
         | very disappointed with every LLM as far as my limited testing
         | in that realm has gone.
         | 
         | Claude deserves to get more attention, and I eagerly await
         | Claude 3.
        
         | desireco42 wrote:
          | My experience, though fairly limited, is that it is weaker
          | than GPT-4, which I mostly interact with and use, but still
          | usable. Some of it is weaker; some of it is just a different
          | flavor of responses.
         | 
         | It is an AI and can help you be productive for sure.
        
         | politelemon wrote:
          | It's comparable to GPT-3.x, and feature-wise it does seem to
         | match up, so overall, it's not bad.
         | 
         | We're using it via langchain talking to Amazon Bedrock which is
         | hosting Claude 1.x. The integration doesn't seem to be fully
          | there yet, though; I think langchain is expecting "Human:" and
          | "AI:", but Claude uses "Assistant:".
         | 
         | https://github.com/hwchase17/langchain/issues/2638
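          | 
          | For reference, a minimal sketch of the turn format Claude's
          | completion-style API expects (the helper name here is just
          | illustrative):
          | 
          |     def build_claude_prompt(turns):
          |         # Claude expects "\n\nHuman:" / "\n\nAssistant:"
          |         # markers, ending on an open Assistant turn for
          |         # the model to complete.
          |         prompt = ""
          |         for role, text in turns:
          |             marker = ("\n\nHuman:" if role == "human"
          |                       else "\n\nAssistant:")
          |             prompt += f"{marker} {text}"
          |         return prompt + "\n\nAssistant:"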
        
         | youssefabdelm wrote:
         | It's a bit less "anodyne" than GPT. GPT tends to give the most
         | "mainstream" answer in many cases and is less "malleable" so to
         | speak. I remember the differences between RLHF'd GPT and the
         | original davinci GPT-3 before mode collapse. If you spent a
         | while on a good prompt, it really paid off.
         | 
         | Thankfully, Claude seems to maintain this "creativity" somehow.
         | 
         | It's excellent at recommending books, creative writing, etc.
         | 
         | For coding, it's not as good as GPT-4, but still helps me more
         | than GPT in certain coding tasks.
        
         | collegeburner wrote:
         | honestly the biggest annoying thing is it seems too restricted.
         | like it will nitpick my use of the word "think" when i ask what
         | it thinks because "hurr durr as a LLM i don't have thoughts"
         | yeah idc, just answer. it's also way more restricted in terms
         | of refusing to say anything that's less than 100% anodyne.
         | which i get the need for a clean version, just gets frustrating
         | if e.g. i want it to add humor and the best it can do is the
         | comedic equivalent of a knock knock joke
        
       | ronsor wrote:
       | This client wouldn't exist if it were possible to actually get
       | access to the official API.
        
         | linsomniac wrote:
         | Have you tried getting on the waitlist? It worked for me, ISTR
         | it took around 2 weeks.
        
           | williamstein wrote:
           | I submitted applications three times to their waitlist over
            | the last several months, and I have never heard back with
           | any response at all. I think my use case is very reasonable
           | (integration with https://cocalc.com, where we use ChatGPT's
           | API heavily right now). My experience is that you fill out a
           | web form to request access to the waitlist, and get no
           | feedback at all ever (I just double checked my spam folders
           | as well). Is that what normally happens for people?
        
           | ronsor wrote:
           | I'm pretty sure it's been over a month now since I submitted
           | my application
        
         | explosion-s wrote:
         | The API costs money though
        
       | bmitc wrote:
       | I think it would be nice if companies and projects stopped using
       | famous names to promote their projects.
        
         | refulgentis wrote:
         | Claude Shannon.
         | 
         | It's a beautiful homage.
        
         | catgary wrote:
         | Yeah I rolled my eyes pretty hard when a crypto company used
         | something like "team grothendieck".
        
       | cubefox wrote:
        | Note that Claude 2 scores 71.2% zero-shot on the Python coding
       | benchmark HumanEval, which is better than GPT-4, which scores
       | 67.0%. Is there already real-world experience with its
       | programming performance?
        
         | og_kalu wrote:
          | GPT-4's real-world (reproducible) performance appears to be
          | much higher than 67. Testing from 3/15 (presumably on the
          | 0314 model) puts it at 85.36%
          | (https://twitter.com/amanrsanger/status/1635751764577361921).
          | And the linked paper from my post
          | (https://doi.org/10.48550/arXiv.2305.01210) got a pass@1 of
          | 88.4 from GPT-4 recently (May? June?).
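          | 
          | (For reference, pass@1 / pass@k is the unbiased estimator
          | from the original HumanEval paper: generate n samples per
          | problem, count the c that pass the tests, then average:)
          | 
          |     from math import comb
          | 
          |     def pass_at_k(n, c, k):
          |         # chance that at least one of k draws, out of n
          |         # samples with c correct, passes the unit tests
          |         if n - c < k:
          |             return 1.0
          |         return 1.0 - comb(n - c, k) / comb(n, k)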
        
         | lerchmo wrote:
          | I have found just using it in the web interface comparable to
          | OpenAI's. But the context window makes a huge difference. I
          | can dump a lot more files in (entire schema, sample records,
          | etc.)
        
       ___________________________________________________________________
       (page generated 2023-07-14 23:00 UTC)