[HN Gopher] Claude 2 Internal API Client and CLI ___________________________________________________________________ Claude 2 Internal API Client and CLI Author : explosion-s Score : 57 points Date : 2023-07-14 19:55 UTC (3 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | dandiep wrote: | Who here is using Claude? And can you comment on your experiences | with it vs. GPT 3.5/4? | BryanLegend wrote: | I'm pleased with it. Claude seems kinder and less patronizing | than GPT. Not as good at coding yet. | cl42 wrote: | Using it regularly for executive feedback at some of our | clients (think of this as an internal coach for policies). I'd | say it's almost as good as GPT-4 at having broader | conversations and sharing ideas. | | The 100K model is FANTASTIC for quick prototyping as well. | | Implementing everything via PhaseLLM to plug and play Claude + | GPT-3.5/4 as needed. No other LLMs stack up to these | two. | xfalcox wrote: | I added it as an option in Discourse, and I've been happy with | its output for summarization tasks, suggesting titles, and | proofreading. | linsomniac wrote: | I started playing with it last weekend; the 100K token limit is | very useful for things like "Give me a summary of this 5-hour | Lex Fridman podcast in about 10 sentences: <podcast | transcript>" | celestialcheese wrote: | I use it for filtering and summarization tasks on huge | contexts, specifically for extracting data from raw HTML in | scraping tasks. It works surprisingly well. | ronsor wrote: | The ability to upload entire documents is honestly a game- | changer, even if GPT-4 is better at certain reasoning tasks. | I don't think I can go back to tiny context lengths now. | jerrygenser wrote: | It's good for general things but less good at coding. Can
Can | usually get the correct answer for simpler things but much less | idiomatic for python than gpt4 | ec109685 wrote: | For javascript, it did just as well as gpt-4 for several | questions and used more modern JavaScript syntax. | | First time I have felt something feel nearly as good, and the | user interface is a bit nicer. | speedgoose wrote: | Does it uses the ECMAScript modules instead of the CommonJS | modules by default? | HyprMusic wrote: | We're still in the early stages of testing v2 in the real world | but it aced our suite of internal tests... we are very | impressed. Claude 1.2 did ok but it struggled with nuance & | accuracy whereas v2 seems to handle nuance very well and is | both accurate and, most importantly, consistent. The thing with | evaluating LLMs is it's not about how well they do on your | first evaluation - consistency is key and even the slightest | little deviation in circumstance can throw them off so we're | being very cautious before we make the jump. GPT4 brought that | consistency but the slow speed and constant downtime makes it | vey difficult to use in a product so we'd love to move to | Anthropic. | | Our product is a tool to turn user stories into end-to-end | tests so we use LLMs for NLP, identifying key parts of HTML and | writing very simple code (we've not officially launched to the | public just yet but for the curious, https://carbonate.dev is | our product). | paxys wrote: | I love it. It may not objectively be on par with GPT-4, but | uploading a 100 page document and getting a summary in seconds | is nothing short of miraculous. | swyx wrote: | is it an accurate summary tho? | technics256 wrote: | In my experience it is very hollow. It skips details unless | you force it to. | | Gpt4 is still way better | deadmutex wrote: | How do you know if it is correct or a hallucination? | luma wrote: | Investigate, same as you would a paralegal etc. 
If it makes | an assertion, contest it and ask where it found supporting | evidence in the document for the claims made. Ask it to | make the counter-argument, also with sources. Verify as | needed. | paxys wrote: | That's what prompting is all about. Ask it to prove its | statements. Ask it to quote passages that support its | arguments. Then double and triple check it yourself. It | isn't going to do the work for you, but can still be a | pretty great reference tool. | ronsor wrote: | Presumably one can test it with documents one has already | read and knows before. If the summaries of the test | documents are good, future summaries will probably be OK | too. | zmmmmm wrote: | > If the summaries of the test documents are good, future | summaries will probably be OK too | | But that is exactly what is problematic with | hallucinations. It's a rare / exceptional behaviour that | triggers extreme departure from reality. So you can't | estimate the extremes by extrapolating from common / | moderate observations. You would have to test a _lot_ of | previous documents to be confident, and even then there | would be a residual risk. | lambdaba wrote: | Maybe having it summarize a fiction book (outside of | training data)? | drewbitt wrote: | Claude's training data is a year further into the future which | is often beneficial. The 100k token limit is fantastic for long | conversations and pasting in documents. The two downsides are | 1) it seems to get confused a bit more than GPT-4 and I have to | repeat instructions more often 2) the code-writing ability is | definitely subpar compared to GPT-4 | explosion-s wrote: | I prefer it over 3.5, (can't afford GPT4 so I'm not sure about | comparisons there). It's much faster imo and refuses to respond | less. In addition they make uploading (text based) files easy, | so although it's not truly multimodal it's still nice to use. | | I also like the 100k token limit, that's insane. 
It almost | never loses track of what you were talking about! | binkHN wrote: | I looked at the pricing and it appears to be less than half | the cost of GPT-4, but significantly more expensive than | GPT-3.5. Does that sound correct? | blowski wrote: | I've spent quite a bit of time with both, but I'm not an expert | in this field so take my comments with a fist of salt. | | It's pretty good. Certainly as good as GPT-3.5 for speed and | quality. Claude seems to consider the context you've supplied | more than GPT-3.5. | | Compared to GPT-4, it has similar levels of knowledge. Claude | is less verbose. It's less good at building real world models | based on context. Anecdotally, I've found it hallucinated more | than GPT. | | So, it's probably better at summarising large blocks of text, | but less good at generating content that requires knowledge | outside of what you've supplied. | swyx wrote: | comparing them every day via https://github.com/smol-ai/menubar | . i'd say when it comes to coding I pick their suggestions | about 30% of the time. not SOTA, but pretty darn good! | philipkglass wrote: | I am frequently interested in problems where answers are easily | calculated from public data but the answer is unlikely to be | already recorded in a form that search engines can find. | Normally I spend a while noodling around looking for data and | then use unit-conversion and basic arithmetic to get the final | answer. | | I tested Claude vs ChatGPT (which I believe is GPT 3.5) and vs | Bard for a problem of this sort. | | I asked: | | 1) What current type of power reactor consumes the least | natural uranium per megawatt hour of electricity? (The answer | is the pressurized heavy water reactor or CANDU type). | | 2) How much natural uranium does a PHWR consume per megawatt | hour of electricity generated? (The answer is about 18 grams.) | | 3) How many terawatt hours does the United States generate | annually from natural gas? 
(The answer as of 2022 is 1689 TWh, | but any correct answer from the past 5 years would have been | ok.) | | 4) How much natural uranium would the United States need to | replace the electricity it currently generates from natural | gas? (The answer is 1689 * 10^6 * 18 grams, e.g. about 30,400 | metric tons of uranium.) | | In the past Bard, Claude, and ChatGPT all correctly identified | the CANDU or PHWR as the most efficient current reactor type. | | Claude did the arithmetic correctly at stages 3 and 4, but it | believed that a PHWR consumed about 170 grams of uranium per | megawatt hour so its answer was off by nearly a factor of 10. | ChatGPT got the initial grams-per-MWh value correct but its | arithmetic was wild fantasy, so it was off by about a factor of | 10000. Bard made multiple mistakes. | | ------ | | I just retried with Bard and ChatGPT as of today. On today's | retry they fail at the first step. | | Bard's response to the initial prompt was "According to the | World Nuclear Association, an MSR could use as little as 100 | grams of uranium per megawatt hour of electricity. This is | about 100 times less than the amount of uranium used by a | traditional pressurized water reactor." | | Since there are no MSRs currently generating electricity, this | answered the wrong question. The answer is also quantitatively | wrong. Current PWRs consume nowhere near 10,000 grams of | uranium per megawatt hour. | | ChatGPT just said "As of my knowledge cutoff in September 2021, | the type of power reactor that consumes the least natural | uranium per megawatt hour of electricity is the pressurized | water reactor (PWR). PWRs are one of the most common types of | nuclear reactors used for commercial electricity generation." | | This is wrong also. It correctly identified the CANDU as the | most efficient in a previous session, but this was a while ago. 
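The arithmetic in step 4 can be checked mechanically; the sketch below just re-runs the unit conversion using the figures quoted in the comment (18 g of natural uranium per MWh for a PHWR, 1689 TWh of US gas-fired generation):

```python
# Re-run the uranium unit conversion from the comment above.
GRAMS_PER_MWH = 18        # natural uranium per MWh for a PHWR (figure from the comment)
GAS_TWH = 1689            # 2022 US electricity generated from natural gas, in TWh

mwh = GAS_TWH * 1_000_000           # 1 TWh = 10^6 MWh
grams = mwh * GRAMS_PER_MWH         # total natural uranium, in grams
metric_tons = grams / 1_000_000     # 10^6 grams per metric ton

print(round(metric_tons))           # about 30,400 metric tons, as stated
```

Claude's reported 170 g/MWh figure would inflate this result by nearly a factor of 10, which is exactly the error described in the comment.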
| I don't know if it was just randomness that caused Bard and | ChatGPT to previously deliver correct answers at the first | step. | gizajob wrote: | I spent the afternoon chatting with it one day this week and | had a brilliant time. I fed it half of a book I've written | recently, a piece of narrative and descriptive non-fiction, and | its analysis was absolutely great. It digested the text and | found things that even human readers have missed. What was | interesting was that the book is mostly genderless, and at | first it gave the analysis as if the writer were male. Then I | said "the writer is actually a woman" and it not only | apologised quite genuinely for getting it wrong, it altered its | literary analysis and criticism in a way that was perfectly | suited to a human reader knowing that the writer was female, | and changed the slant of its analysis. It was deeply useful and | interesting to converse with, and it found the relevant topics | that an educated human reader would likely find interesting and | comment on... and it did this in a few minutes, compared to a | human reader where you'd be talking weeks of latency to read | and analyse the text as a complete work. | | Pretty great! Bit of a party trick at the same time (it did | hallucinate a couple of minor things) but enough for me as the | writer to be gripped by talking to Claude. It even came up with | some really interesting questions to ask _me_ once I told it | that I was the author, and many of them were better than a lot | of lazy interviewers or reviewers would come up with. | | Highly recommended. | foundry27 wrote: | YMMV, but I've found that interacting with Claude | conversationally gives me a much stronger impression of having | a productive discussion with an individual, receiving pushback | on ideas that had identifiable flaws and receiving advice on how | to improve my own thought processes, rather than the blind | obedience that GPT-4 output is so well known for.
When it comes | to raw problem-solving capacity GPT-4 still handily beats it, | but this is the first LLM I've used that makes me actually | regret having to swap to GPT-4 to analyze a trickier problem. | BoorishBears wrote: | Everyone accepts that output from LLMs is largely predicated on | grounding them, but few seem to realize that grounding | them applies to more than raw data. | | They perform better at many tasks simply by grounding their | alignment in-context, by telling them very specific people to | act as. | | It's an example of something that "prompt engineering" solves | today and that people only glancingly familiar with how LLMs work | insist won't be needed soon... but by their very nature the | models will always have this limitation. | | Say user A is an expert with 10 years of experience and user | B is a beginner with 1 year of experience: they both enter a | question, and all the model has to go on is the tokens in the | question. | | The model might have uncountable ways to reply to that | question if you had inserted more tokens, but with only the | question in context, you'll always get answers that are | clustered around the mean answer it can produce... but | because it's the literal mean of all those possibilities, it's | unlikely either user A or user B will find it particularly great. | | Because of that there's no way to ever produce an answer that | satisfies both A and B _to the full capabilities of that | LLM_. When the input is just the question, you're not even | touching the tip of the iceberg of knowledge it could have | distilled into a good answer. And so just as you're finding | that Claude's pushback and advice is useful, someone will | say it's more finicky and frustrating than GPT 3.5. | | It mostly boils down to the fact that groups of users | aren't defined by their mean. No one is the average of | all developers in terms of understanding (if anything, that'd | make you an exceptional developer); instead, people are | clustered around various levels of understanding in very | complex ways. | | - | | With that in mind, instead of banking on the alignment and | training data of a given model happening to make the answer | to that question good for you, you can trivially "ground" the | model and tell it you're a senior developer speaking frankly | with your coworker, who's open to pushback and realizes you | might have the X/Y problem and other similar fallacies. | | You can remind it that it's allowed to be unsure, or very | sure; you can even ask it to list gaps in its abilities (or | yours!) that are most relevant to a useful response. | | That's why hearing that model X can't do Y but model Z can doesn't | really pass muster for me at this point unless how Y was | put to the model is shared. | civilitty wrote: | _> The model might have uncountable ways to reply to that | question if you had inserted more tokens, but with only the | question in context, you'll always get answers that are | clustered around the mean answer it can produce... but | because it's the literal mean of all those possibilities | it's unlikely user A or user B will find particularly | great._ | | I refer to it as giving the LLM "pedagogical context", since | a core part of teaching is predicting what kind of answer | will actually help the audience depending on surrounding | context. The question "What is multiplication?" demands a | vastly different answer in an elementary school than in a | university set theory class. | | I think that's why there's such a large variance in HNers' | experience with ChatGPT. The GPT API with a custom system | prompt is far more powerful than the ChatGPT interface | specifically because it grounds the conversation in a way | that the moderated ChatGPT system prompt can't.
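The grounding-by-persona idea described above amounts to plain prompt assembly; a minimal sketch follows (the personas, wording, and `grounded_prompt` helper are illustrative, not any particular vendor's API):

```python
# Sketch: ground the same question differently for different users by
# prepending an explicit persona block before the question.
def grounded_prompt(persona, question):
    """Prepend a persona so the answer is pulled away from the 'mean'
    answer and toward this specific audience."""
    return (
        f"You are speaking with {persona}.\n"
        "Be frank, push back on flawed premises, and say when you are unsure.\n\n"
        f"Question: {question}"
    )

question = "How should I structure error handling in my service?"

expert_prompt = grounded_prompt(
    "a senior developer with 10 years of experience who wants candid "
    "trade-off discussion", question)
beginner_prompt = grounded_prompt(
    "a beginner with 1 year of experience who needs concepts explained "
    "from first principles", question)

# Same question, two grounded prompts -> two differently clustered answers.
print(expert_prompt)
print(beginner_prompt)
```

The only moving part is the extra context tokens: the question is identical in both prompts, so any difference in the model's answers comes entirely from the persona block.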
| | The chat GUI I created for my own use has a ton of | different roles that I choose based on what I'm asking. For | example, when discussing cuisine I have roles like | (shortened and simplified) "Julia Child talking to a | layman who cares about classic technique", "expert | molecular gastronomy chef teaching a culinary school | student", etc. | pmoriarty wrote: | I've played around with Claude quite a bit, but mostly with | creative writing, at which I think it is stronger than any | other LLM that I've tried, including GPT, Claude+ (which as | far as I can tell has not been rebranded as Claude 2), GPT 3.5, | Bard, and Bing. | | I also much prefer Claude for explanations (I haven't | experimented much with Claude+, but limited experiments have | shown it to be even better) over the GPTs and other LLMs. It | gives much more thorough and natural-sounding explanations than | the competition, without extra prompting. | | That said, the Claude variants don't seem to be as good at the | logic-puzzle sort of stuff that most people love to test LLMs | with. So if you're into that, you're probably better off with | GPT-4. | | I also haven't tested it much with programming... but I've been | very disappointed with every LLM as far as my limited testing | in that realm has gone. | | Claude deserves to get more attention, and I eagerly await | Claude 3. | desireco42 wrote: | My experience, though more limited, is that it is weaker than | GPT-4, which I mostly interact with and use, but still usable. | Some of it is weaker, some of it is just a different flavor of | responses. | | It is an AI and can help you be productive for sure. | politelemon wrote: | It's comparable to GPT-3.x, and featurewise it does seem to | match up, so overall, it's not bad. | | We're using it via langchain talking to Amazon Bedrock, which is | hosting Claude 1.x. The integration doesn't seem to be fully | there though; I think langchain is expecting "Human:" and | "AI:", but Claude uses "Assistant:".
| | https://github.com/hwchase17/langchain/issues/2638 | youssefabdelm wrote: | It's a bit less "anodyne" than GPT. GPT tends to give the most | "mainstream" answer in many cases and is less "malleable", so to | speak. I remember the differences between RLHF'd GPT and the | original davinci GPT-3 before mode collapse. If you spent a | while on a good prompt, it really paid off. | | Thankfully, Claude seems to maintain this "creativity" somehow. | | It's excellent at recommending books, creative writing, etc. | | For coding, it's not as good as GPT-4, but it still helps me more | than GPT in certain coding tasks. | collegeburner wrote: | honestly the biggest annoying thing is it seems too restricted. | like it will nitpick my use of the word "think" when i ask what | it thinks because "hurr durr as an LLM i don't have thoughts" | yeah idc, just answer. it's also way more restricted in terms | of refusing to say anything that's less than 100% anodyne. | which i get the need for a clean version, it just gets frustrating | if e.g. i want it to add humor and the best it can do is the | comedic equivalent of a knock knock joke | ronsor wrote: | This client wouldn't exist if it were possible to actually get | access to the official API. | linsomniac wrote: | Have you tried getting on the waitlist? It worked for me; ISTR | it took around 2 weeks. | williamstein wrote: | I submitted applications three times to their waitlist over | the last several months, and I have never heard back with | any response at all. I think my use case is very reasonable | (integration with https://cocalc.com, where we use ChatGPT's | API heavily right now). My experience is that you fill out a | web form to request access to the waitlist, and get no | feedback at all ever (I just double-checked my spam folders | as well). Is that what normally happens for people?
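On the langchain role-label mismatch a few comments up: Claude's 2023 text-completion API expects a single prompt string of alternating "\n\nHuman:" / "\n\nAssistant:" turns, ending with an open Assistant turn. A minimal sketch of mapping langchain-style labels onto that shape (illustrative only, not the library's actual fix):

```python
# Sketch: convert (role, text) turns using langchain-style labels into the
# prompt string Claude's 2023 text-completion endpoint expects:
#   "\n\nHuman: ...\n\nAssistant: ..." ending with an open "\n\nAssistant:".
ROLE_MAP = {"Human": "Human", "AI": "Assistant"}  # langchain label -> Claude label

def to_claude_prompt(turns):
    parts = [f"\n\n{ROLE_MAP[role]}: {text}" for role, text in turns]
    return "".join(parts) + "\n\nAssistant:"  # leave the final turn open for the model

prompt = to_claude_prompt([("Human", "Summarize this thread."),
                           ("AI", "It compares Claude and GPT-4.")])
print(prompt)
```

The key detail is the trailing "\n\nAssistant:" with no text after it, which tells the model it is completing the assistant's next turn.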
| ronsor wrote: | I'm pretty sure it's been over a month now since I submitted | my application. | explosion-s wrote: | The API costs money though. | bmitc wrote: | I think it would be nice if companies and projects stopped using | famous names to promote their projects. | refulgentis wrote: | Claude Shannon. | | It's a beautiful homage. | catgary wrote: | Yeah I rolled my eyes pretty hard when a crypto company used | something like "team grothendieck". | cubefox wrote: | Note that Claude 2 scores 71.2% zero-shot on the Python coding | benchmark HumanEval, which is better than GPT-4, which scores | 67.0%. Is there already real-world experience with its | programming performance? | og_kalu wrote: | GPT-4's (reproducible) performance out in the wild appears to | be much higher than 67. Testing from 3/15 (presumably on the | 0314 model) seems to be at 85.36% | (https://twitter.com/amanrsanger/status/1635751764577361921). | And the linked paper from my | post (https://doi.org/10.48550/arXiv.2305.01210) got a pass@1 of | 88.4 from GPT-4 recently (May? June?). | lerchmo wrote: | I have found just using it in the web interface comparable to | OpenAI's. But the context window makes a huge difference. I can | dump a lot more files in (entire schema, sample records, etc.) ___________________________________________________________________ (page generated 2023-07-14 23:00 UTC)