[HN Gopher] 100K Context Windows
       ___________________________________________________________________
        
       100K Context Windows
        
       Author : samwillis
       Score  : 578 points
       Date   : 2023-05-11 16:46 UTC (6 hours ago)
        
 (HTM) web link (www.anthropic.com)
 (TXT) w3m dump (www.anthropic.com)
        
       | fire wrote:
       | god I'd love to work there
        
       | tikkun wrote:
       | This is the first time I've felt like Anthropic may be a true
       | competitor to OpenAI.
       | 
       | I see 6 ways to improve foundation LLMs other than cost. If your
       | product is best at one of the below, and has parity at the other
       | 5 items, then customers will switch. I'm currently using
       | GPT-4-8k. I regularly run into the context limit. If Claude-100K
       | is close enough on "intelligence" then I will switch.
       | 
       | Six Dimensions to Compare Foundation LLMs:
       | 
       | 1. Smarter models
       | 
       | 2. Larger context windows
       | 
       | 3. More input and output modes
       | 
       | 4. Lower time to first response token and to full response
       | 
       | 5. Easier prompting
       | 
       | 6. Integrations
        
         | rizky05 wrote:
         | [dead]
        
         | RobotToaster wrote:
         | >Six Dimensions to Compare Foundation LLMs
         | 
         | I'd add open source to the list, which neither "open"AI nor
         | this is.
        
           | ugh123 wrote:
           | I don't think most of the large customers will care about OSS
           | AI. Over the last decade they've learned (trained themselves?)
           | where to put their money (cloud vs. in-house infra for all
           | manner of things, for better or worse), and I think AI tools
           | will follow similar trends.
           | 
           | Businesses will certainly care about cost, but just as
           | important will be:
           | 
           | - Customization and fine-tuning capabilities (also 'white
           | labeling' where appropriate)
           | 
           | - Integrations (with 3rd party and in-house services & data
           | stores)
           | 
           | - SLA & performance concerns
           | 
           | - Safety features
           | 
           | Open Source AI will have a place, but may be more towards
           | personal-use and academic work. And it will certainly drive
           | competition with the major players (OpenAI, Google, etc) and
           | push them to innovate more, which is starting to play out now.
        
             | ibains wrote:
             | A lot of B2B startups can technically use the cloud API to
             | provide value-added applications to enterprises, but often
             | the banks and healthcare companies will not want their data
             | running through a startup's pipes into OpenAI's pipes.
             | 
             | We provide a low-code data transformation product
             | (prophecy.io), and we'll never close sales at any volume
             | if we have to get an MSA that approves this. Might get
             | easier if we become large :)
        
             | lannisterstark wrote:
             | >I don't think most of the large customers will care about
             | OSS AI
             | 
             | The problem, again, is centralization of LLMs by either
             | governments (and they always act in your best interest,
             | amirite?) or corporations - a problem which FOSS LLMs
             | prevent.
             | 
             | Democratization of the models is the only way to actually
             | prevent bad actors from doing bad things.
             | 
             | "But they'll then have access to it too," you say. Yes,
             | they will, but given how many more people would also have
             | access to open LLMs, we'd have the tools to prevent
             | actually malicious acts.
        
             | dragonwriter wrote:
             | > I don't think most of the large customers will care about
             | OSS AI.
             | 
             | OSS AI will open up more diverse and useful services than
             | the first-party offerings from relatively risk-averse major
             | vendors, which customers _will_ care about.
        
             | simonw wrote:
             | Here's a really important reason to care about open source
             | models: prompt engineering is fiddly enough without the
             | risk of your model provider "upgrading" the model you are
             | using in a way that breaks your existing prompts.
             | 
             | OpenAI already upset a lot of (admittedly non-paying
             | academic) users when they shut off access to the old Ada
             | code model with only a few weeks' notice.
        
               | danysdragons wrote:
               | The OpenAI API has model checkpoints; right now the chat
               | options are:
               | 
               | gpt-4, gpt-3.5-turbo, gpt-4-0314, gpt-3.5-turbo-0301
        
               | spacebanana7 wrote:
               | I'm curious about how enterprises will manage model
               | upgrades.
               | 
               | On one hand, as you mention, upgrades could break or
               | degrade prompts in ways that are hard to fix. However,
               | these models will need constant streams of updates for
               | bugs and security fixes just like any other piece of
               | software. Plus the temptation to get better performance.
               | 
               | The decisions around how and whether to upgrade LLMs will
               | be much more complicated than upgrading Postgres
               | versions.
        
               | Vecr wrote:
               | Why would the models themselves need security fixes? The
               | software running the models, sure, but you should be able
               | to upgrade that without changing anything observable
               | about the actual model.
        
             | ebiester wrote:
             | Yes, but I think for most companies this has more to do
             | with cost. They're not going to pay for the OSS model, and
             | if they can use an OSS model + fine tuning, they'll choose
             | to save the money.
        
             | hdjjhhvvhga wrote:
             | > I don't think most of the large customers will care about
             | OSS AI.
             | 
             | One would have thought the same in the 90s, and yet, for
             | some reason, Open Source prevailed and took over the
             | world. I don't believe it was about cost, at least not
             | only. In my career I had to evaluate many technical
             | solutions and products, and OSS was often objectively
             | superior on several levels even without taking cost into
             | account.
             | 
             | The first really successful alternative to "Open"AI will:
             | 
             | * gather many talented developers
             | 
             | * quickly become a de facto standard solution
             | 
             | * rapidly attract a wide range of integrations
             | 
             | * be used by everybody, including large orgs, because,
             | well, it's open source
        
               | ugh123 wrote:
               | True, but the difference here is that running a
               | performant and capable AI solution will be
               | infrastructure-dependent, which has real costs.
        
               | [deleted]
        
             | nullc wrote:
             | Companies that aren't mindful of vendor lock in aren't long
             | for the world.
             | 
             | Though those cloud platforms all have their own proprietary
             | components, most users are savvy enough to constrain and
             | compartmentalize their use of them, lest they find
             | themselves having all their profits taken by a platform
             | that knows it can set its prices arbitrarily. Cloud vs.
             | in-house adoption is what it is in large part because the
             | cloud offerings are a commodity, and a big part of them
             | being a commodity is that much of the underlying software
             | is free software.
        
               | deltree7 wrote:
               | History is littered with companies that died because
               | they focused on things that don't matter (open source,
               | anti-Microsoft, pro-Linux).
               | 
               | There will be a time when those things matter, when they
               | hurt the bottom line (Dropbox), but prematurely
               | optimizing for that while you are finding product-market
               | fit is crazy, and _all_ companies are finding product-
               | market fit in the new AI era.
        
           | throwawayadvsec wrote:
           | now that I think about it
           | 
           | is it that important to open source models that can only run
           | on hardware worth tens of thousands of dollars?
           | 
           | who does that benefit besides their competitors and nefarious
           | actors?
           | 
           | I've been trying to run one of the largest models for a
           | while; unless $30,000 falls into my hands I'll probably
           | never be able to run the current SOTA
        
             | chrisco255 wrote:
             | > is it that important to open source models that can only
             | run on hardware worth tens of thousand of dollars?
             | 
             | Yes, because as we've seen with other open source AI
             | models, it's often possible for people to fork code and
             | modify it in such a way that it runs on consumer grade
             | hardware.
        
             | YetAnotherNick wrote:
             | I agree the utility of open source for personal use cases
             | is overblown.
             | 
             | But for commercial use cases, open source is very relevant
             | for privacy reasons, as many enterprises have strict
             | policies not to share data with third parties. It could
             | also be a lot cheaper for bulk inference or for having a
             | small model for a particular task.
        
               | turtles3 wrote:
               | However, the same thing could be achieved with closed
               | source models. There's nothing to stop an LLM being made
               | available to run on prem under a restrictive license. It
               | would really be no different to ye olde desktop software
               | - keeping ownership over bits shipped to a customer is
               | solved with the law rather than technical means.
               | 
               | That said, I really hope open source models can succeed,
               | it would be far better for the industry if we had a Linux
               | of LLMs.
        
               | sanxiyn wrote:
               | > Keeping ownership over bits shipped to a customer is
               | solved with the law rather than technical means.
               | 
               | Yes in theory... In practice, what happened with LLaMA
               | showed people will copy and distribute weights while
               | ignoring the license.
        
             | chaxor wrote:
             | They don't only run on high end systems. Good models can
             | run on a desktop you have at home. If you don't have a
             | desktop... I'm not sure what you're doing on HN.
        
             | circuit10 wrote:
             | It will create price competition for different providers of
             | the model though, which should drive down prices
        
             | iknowstuff wrote:
             | Even a small startup, a researcher or a tinkerer can get a
             | cloud instance with a beefy GPU. Also of note, Apple's M1
             | Max/Ultra should be able to run it on their GPUs given
             | their 64/128GB of memory, right? That's an order of
             | magnitude cheaper.
        
               | mejutoco wrote:
               | I am confused. Those amounts are RAM, not GPU RAM, aren't
               | they? Mac CPUs are impressive, but not for ML. The most
               | realistic card for a consumer is an RTX 4090 with 24 GB.
               | A lot of models do not fit in that, so you need an A6000
               | with 48GB or more from the professional cards. That might
               | be around 9000EUR already.
        
               | codedokode wrote:
               | > Macs cpus are impressive, but not for ml
               | 
               | On Mac GPU has access to all memory.
        
               | piperswe wrote:
               | Apple Silicon has unified memory - all memory is
               | accessible to both the CPU and GPU parts of the SoC.
        
               | karmasimida wrote:
               | But don't they max out at a 32GB model?
        
               | [deleted]
        
               | mkl wrote:
               | Mac Studio (desktop) is up to 128GB, and Macbook Pro is
               | up to 96GB.
        
               | himlion wrote:
               | I overlooked the unified memory on those machines. Can it
               | really run this performantly?
        
             | lannisterstark wrote:
             | "It only benefits bad people" is a pretty shitty argument
             | at this point tbf. You can apply this logic to any
             | expensive thing at this point.
             | 
             | I _can_, for example, afford the hardware worth tens of
             | thousands of dollars. I don't want to, but I can if I
             | needed to. Does that automagically make me their competitor
             | or a bad actor?
        
             | fnordpiglet wrote:
             | Yes, because it can always be down-ported by people with
             | more constraints than the original authors. We've seen a
             | lot of this in the LLM space, and in a lot of other OSS
             | efforts.
        
             | RobotToaster wrote:
             | When Linux was first released in 1991, a 386 to run it
             | would cost about $2000.
             | 
             | We've already seen big advancements in tools to run these
             | models on lesser hardware. It wouldn't surprise me if we
             | see some big advancements in the hardware to run them over
             | the next few years; currently they are mostly being run on
             | graphics processors that aren't optimised for the task.
        
             | dfadsadsf wrote:
             | $30,000 is less than the price of the average car that
             | Americans buy (and most families have two of them) - that's
             | definitely in the realm of something an affluent family can
             | buy if it provides enough value. I also expect the price to
             | go down, and at $10k it's less than a mid-range bathroom
             | update. The only question is whether it provides enough
             | value, or whether using the cloud is the better option for
             | almost all families.
        
           | overgard wrote:
           | Considering the very smart people asking for a moratorium on
           | AI development, and its potential to disrupt a lot of jobs,
           | this may be a good thing.
        
         | nr2x wrote:
         | For me I'd say speed trumps all else. It's impossible to truly
         | reach scale with the glacial response times you get from the
         | current APIs.
        
           | sebzim4500 wrote:
           | >speed trumps all else
           | 
           | Then use GPT-2
        
             | nr2x wrote:
             | I actually do prefer 3.5-turbo over 4 for many tasks.
        
         | IshKebab wrote:
         | Reliability surely? They still haven't managed to make a model
         | that says "I don't know" rather than bullshitting. That's by
         | far the biggest unsolved problem.
        
         | srowaway2 wrote:
         | 7. Price!
         | 
         | GPT4-32K costs ~$2 if you end up using the full 32K tokens, so
         | if you're doing any chaining or back-and-forth it can get
         | expensive very quickly.
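         | 
         | Back-of-the-envelope, assuming the roughly $0.06/1K prompt
         | and $0.12/1K completion list prices the 32K model launched
         | with (check the current pricing page, these numbers move):
         | 
         |   PROMPT, COMPLETION = 0.06, 0.12      # $ per 1K tokens
         |   cost = 31 * PROMPT + 1 * COMPLETION  # 31K in, 1K out
         |   print(cost)                          # 1.98, i.e. ~$2/call
         | 
         | Do that a few times in a chain and each conversation starts
         | costing real money.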
        
           | hesdeadjim wrote:
           | Oof, got access to the 8k model recently and was wondering
           | what costs would be on the 32k one. That's brutal.
        
         | zomglings wrote:
         | Also: letting users receive vector representations of context,
         | and provide such representations as side information when
         | querying LLMs.
        
         | danenania wrote:
         | One question is how much other factors really matter compared
         | to the raw "intelligence" of the model--how good its
         | completions are. You're not going to care very much about
         | context window, prompting, or integrations if the output isn't
         | good. It would be sort of like a car that has the best steering
         | and brakes on the market, but can't go above 5 mph.
        
           | majormajor wrote:
           | Big question on that for me is that there's a variety of
           | "completion styles" and I'm curious how "universal"
           | performance on them is. Probably more than this, but a quick
           | list that comes to mind:
           | 
           | * Text summary/compression
           | 
           | * Creative writing (fiction/lyrics/stylization)
           | 
           | * Text comparison
           | 
           | * Question-answering
           | 
           | * Logical reasoning/sequencing ("given these tools and this
           | scenario, how would you perform this task")
           | 
           | IMO, for stuff like text comparison and question-answering,
           | some combo of speed/cost/context-size could make up for a
           | lot, even if they do "worse" versions of stuff that's just
           | too slow or expensive or context-limited in a different
           | model.
        
           | solarkraft wrote:
           | I don't know. While using Phind I regularly get annoyed by
           | long prose that doesn't answer anything (yes, "concise" is
           | always on). Claude seems to be directly geared towards
           | solving stuff over nice writing.
        
             | Tostino wrote:
             | I generally add this to my initial prompts to GPT-4: "From
             | now on, please use the fewest tokens possible in all
             | replies to save tokens and provide brief and accurate
             | answers."
        
           | modernpink wrote:
           | Or rather, more analogously, a self-driving car that has a
           | range of 10,000 miles but sometimes makes mistakes when
           | driving vs a self-driving car with a range of 800 miles that
           | never makes mistakes. Once you've had a taste of
           | intelligence it's hard to give up.
           | 
           | However, in many applications there is a limit on how
           | intelligent you need the LLM to be. I have found I am able to
           | fall back to the cheaper and faster GPT-3.5 to do the grunt
           | work of forming text blobs into structured json within a
           | chain involving GPT-4 for higher-level functions.
        
           | tikkun wrote:
           | Strongly agree. They are ordered by how much I think they
           | generally will lead to users choosing one model over the
           | other.
           | 
           | Intelligence is the most important dimension by far, perhaps
           | an order of magnitude or more above the second item on the
           | list.
        
             | danenania wrote:
             | On that note, can anyone speak to how Anthropic (or other
             | models) are doing on catching up to OpenAI for pure model
             | intelligence/quality of completions? Are any others
             | approaching GPT-4? I've only used GPT-based tools so I have
             | no idea.
        
               | og_kalu wrote:
               | The best Claude model is closer to GPT-4 than to 3.5.
        
         | jll29 wrote:
         | More languages?
        
         | nico wrote:
         | Faster, cheaper fine-tuning and training
         | 
         | If I could train a useful model, on my own data, in a
         | reasonable time
         | 
         | I would want to have a CI-training pipeline to always have my
         | models up to date
        
           | makestuff wrote:
           | Yeah I remember in undergrad I was working on using transfer
           | learning to train an object detector. Basically you only
           | needed 100-ish images to get the model to detect that new
           | object really well.
           | 
           | I'm not sure what the analogous term is for a similar process
           | on LLMs, but that will be huge when there is a service for
           | it.
        
             | visarga wrote:
             | LLMs can do that without any examples (zero shot) or with
             | one or a few demonstrations in the prompt, if you can
             | describe the task in the limited context window.
             | 
             | If you want for example to train the model to learn to use
             | a very large API, or access the knowledge in a whole book,
             | it might need fine-tuning.
        
               | nico wrote:
               | Could I just train a very small LLM with an English
               | dictionary + Python + large API documentation + large
               | Python code base?
               | 
               | Then do some chat fine tuning (like what HF did with
               | StarCoder to get ChatCoder)
               | 
               | And get a lightweight LLM that knows the docs and code
               | for the thing I need it for
               | 
               | After that, maybe incrementally fine tune the model as
               | part of your CI/CD process
        
             | toss1 wrote:
             | How similar were the objects to other objects?
             | 
             | E.g., were you trying to distinguish an object vs nothing,
             | a bicycle vs a fish, a bird vs a squirrel, or two different
             | species of songbird at a feeder?
             | 
             | How much would the training requirements increase or
             | decrease moving up or down that scale?
        
           | ilaksh wrote:
           | The PaLM 2 stuff released yesterday has fine tuning for their
           | newest large models as a core feature.
        
         | moffkalast wrote:
         | Until they actually make any of it available in anything but an
         | obscure expensive API you have to request access to, they might
         | as well not even exist.
        
           | r_thambapillai wrote:
           | there are many services that integrate with them that would
           | allow you to self-serve signup
        
           | williamstein wrote:
           | The landing page says "Easy integration via standard APIs
           | Claude can be incorporated into any product or toolchain
           | you're building with minimal effort." Then there is a big
           | button "Request Access", which for me right now just does
           | nothing. OpenAI has really faced the pain to make their
           | product available via an API to the general public at scale,
           | but Anthropic/Google/etc. don't quite seem to be there yet.
           | It's frustrating.
        
             | chaxor wrote:
             | I don't think the person you're responding to wants a
             | network based or cloud based solution.
             | 
             | When someone says they want it available they mean running
             | on their own device.
             | 
             | This is hackernews, nearly everyone on this site should
             | have their own self hosted LLM running on a computer/server
             | or device they have at their house.
             | 
             | Relying on 'the cloud' for everything makes us worse
             | developers in just about every imaginable way, creates a
             | ton of completely unnecessary and complicated source code,
             | and creates far too many calls to the internet which are
             | unnecessary. Using local hard drives for example is
             | thousands of times faster than using cloud data storage,
             | and we should take advantage of that in the software we
             | write. So instead of making billions of calls to download a
             | terabyte database query-by-query (seen this 'industry-
             | standard' far too many times), maybe make _one_ call and
             | build it locally. This is effectively the same problem in
             | LLMs/ML in general, and the same incredible stupidity is
             | being followed. Download the model once, run your queries
             | locally. That's the solution we should be using.
        
             | akiselev wrote:
             | Try a browser or a clean profile without any ad blocking
             | turned on. It took me a couple of tries to figure out how
             | to get it working but you should see a modal with a form
             | when it works.
             | 
             | FYI the waitlist form submits a regular POST request so
             | it'll reload the main page instead of closing the modal
             | dialog. I opened network monitor with preserved logs to
             | double check that I made it on the list :facepalm:
        
           | dkarras wrote:
           | I've been using it through poe and I prefer it to ChatGPT but
           | can't pinpoint why. It just "gets" me better I guess?
        
         | winstonprivacy wrote:
         | Don't forget the ability to fine tune based on one's own data
         | sources. For me, this is more important than any of the six
         | reasons you mentioned.
        
         | ianhawes wrote:
         | We use Claude Instant in production and it has been much
         | faster than Davinci/GPT-4 for a while. In terms of quality,
         | Instant is at least as good as GPT-3.5.
        
         | [deleted]
        
       | timsuchanek wrote:
       | Curious what this will mean for the vector db vendors. Imagine
       | fine-tuning were quick and cheap. Could there be a world where
       | vector dbs aren't needed anymore?
        
         | shri_krishna wrote:
         | A 100k context limit is still a limit (we have no idea how
         | Anthropic is achieving this - whether it is an extension of
         | the base model's context limit itself, some vector db trickery
         | in the backend, or probably even RAG). Even in this example,
         | though it could fit the entire text of The Great Gatsby, it is
         | still one book/text/document. Typical business use cases require
         | searching through hundreds if not thousands of documents/books
         | and finding similar vector embeddings through all of them and
         | fetching top-K results (this is how Google search works when it
         | has to scan through embeddings for billions of websites). These
         | top-K results can be stuffed into the 100k context limit and
         | produce an even more holistic picture rather than just stuff
         | one book/pdf/file into the context. Depends on the requirements
         | though. I don't see how it might affect vector db vendors who
         | can process billions of vectors per query and provide top-K
         | results.
         | 
         | Also, having a massive context length is not necessarily a
         | good thing from a cost perspective. It also doesn't work great
         | with a chatbot, as you will have to feed the same 100k worth
         | of context back into the chatbot for every question, which
         | will turn out to
         | be very expensive. At some point you will have to discard some
         | parts of the context to be specific to the question being asked
         | and that is where vector embeddings come into play. For one off
         | research/Q&A 100k limit works great!
        
       | PeterisP wrote:
       | All I see in the link is empty PR claims - is there any
       | information about _how_ they're doing that? There are all kinds
       | of known techniques that "expand" the context window without
       | really doing so, with different tradeoffs, and unless they
       | provide actual information, any claims should be taken with a
       | pile of salt; we shouldn't just assume that they actually have
       | "true" 100k context windows.
        
       | justanotheratom wrote:
       | This is nice, but it can get quite expensive.
       | 
       | Let's say I have a book and I want to ask multiple questions
       | about it. Every query will pay the price of the book's text. It
       | would be awesome if I could "index" the book once, i.e. pay for
       | the context once, and then ask multiple questions.
        
         | mikrl wrote:
         | The analogy I can think of here is a pointer, but AFAIK the
         | context would always need to go along with the prompt unless
         | you could tweak internal state to bias towards the context.
         | 
         | Otherwise, it might make sense to have a separate routine which
         | compresses the context as efficiently as possible. Auto
         | encoder?
        
         | wahnfrieden wrote:
         | Not sure about this one but you can usually ask multiple
         | questions in one shot at least
        
           | minimaxir wrote:
           | Generation is more expensive than the prompt input (for
           | Claude v1, generation is 3x the cost; for GPT-4 it's 2x the
           | cost)
           | 
           | It makes the economics slightly trickier.
        
             | newhouseb wrote:
             | I wonder why this is? Naively there's no difference between
             | the two from a transformer standpoint.
             | 
             | Perhaps it's because under the hood there's additional
             | safety analysis/candidate generation that is resource
             | intensive?
        
               | pyth0 wrote:
               | Normally the inputs are padded out to the context length
               | [1] and so the cost to embed 1 token or N tokens is the
               | same. The output is produced token-by-token and so the
               | amount of GPU time increases with the number of output
               | tokens.
               | 
               | [1] I'm not sure if these huge context lengths are
               | achieved the same way (i.e. a single input vector of
               | length N) but given the cost is constant for input I
               | would assume the resource usage is too.
        
               | newhouseb wrote:
               | This doesn't match my mental model (or implemented model
               | in the case of GPT2) of how self-attention works (you
               | need to calculate the residual stream for each individual
               | token, attending to all prior tokens before it). Have a
               | link?
        
               | pyth0 wrote:
               | I work on infrastructure for serving large language
               | models but I don't have any background in ML, so my
               | perspective is looking at these models as a black box
               | (and also conversations with the people that do the ML
               | stuff). It is the case in practice at least from a
               | latency side that with a fixed context length N,
               | embedding any number of tokens from 0 to N takes the same
               | amount of time. Perhaps it's a difference between the
               | conceptual and actual implementation on GPU?
               | 
               |  _edit_ - This occurred to me after the fact but I wonder
               | if the difference is that the use case I work with is
               | processing batches of many different embedding requests
               | (but computed in one batch), therefore it has to process
               | `min(longest embedding, N)` tokens so any individual
               | request in theory has no difference. This would also be
               | the case for Anthropic however.
        
               | newhouseb wrote:
               | Ah, you're thinking about embeddings which are basically
               | the encoder stack on a traditional transformer
               | architecture. Modern GPT-like models (including Claude),
               | however, drop the encoder and use decoder-only
               | architectures.
               | 
               | I could imagine something where encoders pad up to the
               | context length because causal masking doesn't apply and
               | the self attention has learned to look across the whole
               | context-window.
        
               | sebzim4500 wrote:
               | Everyone serious batches together short prompts so the
               | cost is roughly proportional to the tokens.
        
               | space_fountain wrote:
               | Well, each additional token generated requires rerunning
               | the model, right, to find the next likely token given
               | the previous ones?
        
               | newhouseb wrote:
               | Naively, yes, but you can cache the bulk of that
               | "rerunning" [1]. That said the (non-flash) attention
               | costs go up with the length of the sequence so perhaps
               | this is just a simpler way to approximate these costs.
               | 
               | [1] https://kipp.ly/blog/transformer-inference-arithmetic/
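               | 
               | A toy single-head numpy sketch of that caching (purely
               | illustrative, not anyone's production setup): each step
               | computes one new K/V row and reuses the cache, so the
               | per-token work is a dot product against an ever-growing
               | cache rather than a full recompute.
               | 
               |   import numpy as np
               | 
               |   rng = np.random.default_rng(0)
               |   d = 16
               |   Wq = rng.standard_normal((d, d))
               |   Wk = rng.standard_normal((d, d))
               |   Wv = rng.standard_normal((d, d))
               |   K = np.empty((0, d))
               |   V = np.empty((0, d))
               | 
               |   def softmax(x):
               |       e = np.exp(x - x.max())
               |       return e / e.sum()
               | 
               |   def decode_step(x):
               |       # one new K/V row per token; only the
               |       # q @ K.T term grows with sequence length
               |       global K, V
               |       K = np.vstack([K, x @ Wk])
               |       V = np.vstack([V, x @ Wv])
               |       a = softmax((x @ Wq) @ K.T / np.sqrt(d))
               |       return a @ V
               | 
               |   for _ in range(5):  # five fake "tokens"
               |       out = decode_step(rng.standard_normal(d))
               |   print(out.shape, K.shape)  # (16,) (5, 16)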
        
         | tikkun wrote:
         | With embeddings, you essentially can. Group the book into
         | sections, embed each section, then when you do a prompt, add in
         | the N most similar embedded sections to your prompt.
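         | 
         | A minimal sketch of that flow; embed() is a stand-in for
         | whatever embedding model you use, and the section vectors
         | get computed once up front:
         | 
         |   import numpy as np
         | 
         |   def cosine(a, b):
         |       return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
         | 
         |   def top_sections(question, sections, vectors, embed, n=3):
         |       q = embed(question)
         |       ranked = sorted(zip(sections, vectors),
         |                       key=lambda sv: cosine(q, sv[1]),
         |                       reverse=True)
         |       return [s for s, _ in ranked[:n]]
         | 
         |   # vectors = [embed(s) for s in sections]  # once, up front
         |   # prompt = "\n\n".join(
         |   #     top_sections(q, sections, vectors, embed)) + "\n\n" + q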
        
           | adamgordonbell wrote:
           | What if the question is "What are the main themes of this
           | work?"
           | 
           | Or anything where the question answer isn't 'close' to the
           | words used in the question?
           | 
           | How well does this work vs giving it the whole thing as a
           | prompt?
           | 
           | I assume worse but I'm not sure how this approach compares to
           | giving it the full thing in the prompt or splitting it into N
           | sections and running on each and then summarizing.
        
             | summarity wrote:
             | That is solved by hypothetical document embeddings (HyDE).
             | 
             | Background: https://summarity.com/hyde
             | 
             | Demo: https://youtu.be/elNrRU12xRc?t=1550 (or try it on
             | findsight.ai and compare results of the "answer" vs the
             | "state" filter)
             | 
             | For even deeper retrieval consider late interaction models
             | such as ColBERT
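             | 
             | Roughly, the HyDE trick looks like this (a hand-wavy
             | sketch; llm() and embed() are placeholders, the real
             | recipe is in the links above):
             | 
             |   import numpy as np
             | 
             |   def hyde_retrieve(question, chunks, vecs, llm, embed):
             |       # 1. have the LLM invent a plausible answer;
             |       #    it may be wrong - only its wording matters
             |       fake = llm("Write a passage answering: " + question)
             |       # 2. embed the fake answer, not the raw question
             |       q = embed(fake)
             |       q = q / np.linalg.norm(q)
             |       # 3. rank the real chunks against that embedding
             |       sims = [q @ (v / np.linalg.norm(v)) for v in vecs]
             |       order = np.argsort(sims)[::-1]
             |       return [chunks[i] for i in order[:5]]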
        
               | akiselev wrote:
               | Any material comparing the different embedding models?
               | I'm working on information retrieval from government
               | documents and without any ML experience it's daunting
        
             | jtlicardo wrote:
             | You pretty much summed up the drawbacks of the embeddings
             | approach. In my experience it's pretty hard to extract the
             | relevant parts of text, especially when the text is
             | uniform.
        
             | abraxas wrote:
             | You could do multi level summaries etc but yeah this is all
             | just band aids around token limits.
        
               | Spivak wrote:
               | I don't think it's as much of a band-aid as it first
               | appears since this roughly mimics how a human would do
               | it.
               | 
               | The problem is that humans have continuous information
               | retrieval and storage where the current crop of embedding
               | systems are static and mostly one shot.
        
               | crucialfelix wrote:
               | Humans have limited working memory, they quickly forget
               | short term memory (unless it's super significant) and our
               | long term memory fades selectively if not reactivated or
               | significant (intense).
               | 
               | This weird leaky memory has advantages and disadvantages.
               | Forgetting is useful, it removes garbage.
               | 
               | Machine models could vary the balance of temporal types,
               | dropout, etc. We may get some weird behavior.
               | 
               | I would guess we will see many innovations in how memory
               | is stored in systems like these.
        
         | make3 wrote:
         | Yes, caching the states of the sequence would make sense. An
         | issue is that it's still more expensive to compute the new
         | tokens even if you cache the states viewed so far
        
         | fdgsdfogijq wrote:
         | The price on this will plummet over the next few years, the
         | economic benefits are too large
        
           | moffkalast wrote:
           | The economic benefits of mining asteroids are also too large
           | to ignore yet here we are, levelling villages to dig for
           | coal.
           | 
           | Just a few manufacturers hold the effective cartel monopoly
           | on LLM acceleration and you best bet they will charge out the
           | ass for it.
        
             | modernpink wrote:
             | Market competition and innovation in both ML and hardware
             | has consistently driven down the price of AI in the past
             | decade. You only have to look at where we are with
             | capabilities today compared to ten years ago when CIFAR100
             | classifiers were the state of the art.
             | 
             | Barring a Chinese invasion of Taiwan, these APIs will halve
             | in price over the next year.
        
               | [deleted]
        
               | moffkalast wrote:
               | Well here's to hoping I guess.
        
             | skybrian wrote:
             | I'm wondering what level you're thinking. Cloud vendors?
             | GPU vendors? Fabs?
        
               | moffkalast wrote:
               | Given what's used right now to my knowledge, the main
               | ones would be Nvidia's tensor cores, Apple's M chips and
               | Google's cloud TPUs. All of that's TSMC I think?
        
           | nr2x wrote:
           | Yes, but physics trumps economics.
        
         | pyth0 wrote:
         | This more or less is already a thing and it's called RAG
         | [1][2]. It essentially allows you to have a database of
         | embeddings (in this case your book) from which a model can pull
         | knowledge while producing answers. As for the standard
         | operation of these generative models, the context window is the
         | only working memory it has and so it must see the entire text
         | each time.
         | 
         | [1] https://arxiv.org/abs/2005.11401
         | 
         | [2] https://huggingface.co/docs/transformers/model_doc/rag
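         | 
         | For the curious, a rough sketch of the toy example from the
         | Hugging Face docs in [2] (it uses their dummy index, and the
         | exact API may have shifted since):
         | 
         |   from transformers import RagTokenizer, RagRetriever
         |   from transformers import RagSequenceForGeneration
         | 
         |   name = "facebook/rag-sequence-nq"
         |   tok = RagTokenizer.from_pretrained(name)
         |   retriever = RagRetriever.from_pretrained(
         |       name, index_name="exact", use_dummy_dataset=True)
         |   model = RagSequenceForGeneration.from_pretrained(
         |       name, retriever=retriever)
         | 
         |   inputs = tok("who wrote gatsby?", return_tensors="pt")
         |   ids = model.generate(input_ids=inputs["input_ids"])
         |   print(tok.batch_decode(ids, skip_special_tokens=True))
         | 
         | In practice the retriever would be backed by an index built
         | over your own documents rather than the dummy dataset.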
        
           | m1sta_ wrote:
           | Can you help me understand this? The research appears to be
           | from a few years ago. Can this be used with Claude (for
           | example)? How is it different to the approach many people are
           | taking with vector stores and embeddings?
        
             | make3 wrote:
             | it's not different. RAG is a way to train embedding stores
             | end to end
        
             | pyth0 wrote:
             | Other people seem to be suggesting that the user would do
             | the retrieval of the relevant parts of the book from a
             | vectordb first, and then feed those sections along with the
             | question as the prompt. Conceptually it is very similar
             | (and it too uses vector database), but with RAG it would
             | happen as part of the inferencing pipeline and therefore
             | achieve better performance than the end user emulating it.
        
       | [deleted]
        
       | helen___keller wrote:
       | This seems like it could be a game changer. Modern LLM based
       | applications face a balancing act of context limitations, which
       | often results in some kind of mapreduce-type behavior when that
       | context can't fit the input
       | 
       | If contexts keep growing, the landscape of LLM application
       | engineering will as well
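       | 
       | For anyone who hasn't hit this yet, the map-reduce dance looks
       | roughly like this (a sketch; llm() stands in for whatever
       | completion API you call):
       | 
       |   def summarize_long(text, llm, max_chars=20_000):
       |       chunks = [text[i:i + max_chars]
       |                 for i in range(0, len(text), max_chars)]
       |       partial = [llm("Summarize:\n" + c) for c in chunks]
       |       return llm("Combine these summaries:\n"
       |                  + "\n".join(partial))
       | 
       | A 100K window lets far more inputs skip the map step entirely,
       | which both simplifies the code and avoids the information lost
       | at each intermediate summary.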
        
         | whimsicalism wrote:
         | The problem is there are usually no public benchmarks, so it
         | is hard to really compare on long context lengths to see if
         | the models are still performing tasks just as intelligently.
        
       | terabytest wrote:
       | How does Claude stack up to GPT-4?
        
       | tempusalaria wrote:
       | Would be great to see some benchmarks on how loss changes across
       | this very large context. It's been technically possible to do
       | 1M+ token contexts for some time, with performance
       | deterioration, so it would be interesting to see how this
       | compares to those efforts.
        
       | Imnimo wrote:
       | >For example, we loaded the entire text of The Great Gatsby into
       | Claude-Instant (72K tokens) and modified one line to say Mr.
       | Carraway was "a software engineer that works on machine learning
       | tooling at Anthropic." When we asked the model to spot what was
       | different, it responded with the correct answer in 22 seconds.
       | 
       | This sort of needle-in-a-haystack retrieval is definitely
       | impressive, and it makes a lot more sense to achieve this in-
       | context rather than trying to use a vector database if you can
       | afford it.
       | 
       | I'm curious, though, whether there are diminishing returns in
       | terms of how much _analysis_ the model can do over those 100k
       | tokens in a single forward pass. A human reading modified-Gatsby
       | might eventually spot the altered line, but they'd also be able
       | to answer questions about the overarching plot and themes of the
       | novel, including ones that cannot be deduced from just a small
       | number of salient snippets.
       | 
       | I'd be curious to see whether huge-context models are also able
       | to do this, or if they start to have trouble when the bottleneck
       | becomes reasoning capacity rather than input length. I feel like
       | it's hard to predict one way or the other without trying it, just
       | because LLMs have already demonstrated a lot of surprising
       | powers.
        
         | fzliu wrote:
         | I'm also not entirely convinced by "huge" context models just
         | yet, especially as it relates to fuzzy knowledge such as
         | overarching themes or writing style.
         | 
         | In particular, there are 0 mentions of the phrase "machine
         | learning" in The Great Gatsby, so adding one sentence that
         | introduces the phrase should be easy for self-attention to pick
         | out.
        
           | EGreg wrote:
           | This sounds like all the other skepticism about what AI can
           | do. And then it can spot 200x more than any human and
           | correlate it into common themes, and you'll say what?
        
             | devmor wrote:
             | Doing more than a human can isn't impressive. Most computer
             | programs for any purpose can do more of something, or do
             | something faster than a human can.
             | 
             | A better comparison would be if it can pick out any
             | differences that can't be picked out by more traditional
             | and simple algorithms.
        
               | EGreg wrote:
               | Of course it can very soon, since those were also written
               | by humans. Like AlphaZero vs Rybka
        
               | chaxor wrote:
               | It does, using this method.
               | 
               | My immediate thought as well was '... Yeah, well vimdiff
               | can do that in milliseconds rather than 22 seconds' - but
               | that's obviously missing the point entirely. Of course,
               | we need to tell people to use the right tool for the job,
               | and that will be more and more important to remind people
               | of now.
               | 
               | However, it's pretty clear that the reason they used this
               | task is to give something simple to understand what was
               | done in a very simple example. Of course it can do more
               | semantic understanding related tasks, because that's what
               | the model does.
               | 
               | So, without looking at the details we all know that it
               | can summarize full books, give thematic differences
               | between two books, write what a book may be like if a
               | character switch from one book to another is done, etc.
               | 
               | If it _doesn't_ do these things (not just badly, but
               | can't at all) I would be surprised. If it does them, but
               | badly, I wouldn't be surprised, but it also wouldn't be
               | mind bending to see it do better than any human at the
               | task as well.
        
           | lumost wrote:
           | I'd be more impressed if it could rewrite Mr. Carraway as an
           | ML engineer in the entire novel. However it's not
           | intrinsically clear that it cannot do this...
           | 
           | It'll be tough to find good benchmarks on long context
           | windows. A human cannot label using 100k tokens of context.
        
             | zooch wrote:
             | My thoughts exactly - rewrite the novel with Mr. Carraway
             | as an ML engineer while maintaining themes/motifs (possibly
             | adding new ones too). I'm guessing what's impressive is
             | that these are the first steps towards something like this?
             | Or is it already possible? Someone please correct me here.
        
         | SkyPuncher wrote:
         | Further, the problem with this example is it relies on a
         | comparison against public data.
         | 
         | Most of these AIs start failing pretty hard when you ask them
         | to do the same task on something completely novel to them
         | (like a company document). Sometimes they'll get it right.
         | Other times, they'll spit out gibberish that's clearly some
         | generic answer.
        
           | dmix wrote:
           | I'd imagine working with an entire company document would
           | require a lot more hand holding and investment in prompt
           | engineering. You can definitely get better results if you add
           | much more context of what you're expecting and how the LLM
           | should do it. Treating these LLMs as just simple Q&A machines
           | is usually not enough unless you're doing simple stuff.
        
           | nomel wrote:
           | > Most of these AI
           | 
           | This is as meaningful as saying most of the hominids can't
           | count. You can't usefully generalize about AI models with
           | the rate of change that exists right now. Any
           | statements/comparisons about AI have to name specific models
           | and versions, otherwise they're increasingly irrelevant
           | noise.
        
           | robotresearcher wrote:
           | Asking to spot the difference between a given document and an
           | unseen document is impossible.
        
             | lkbm wrote:
             | A couple years ago, I read Superfudge by Judy Blume, a book
             | originally published in 1980. In it, the protagonist writes
             | a letter to Santa: "Please bring me one or more of the
             | following items. A clock-radio, a remote-controlled model
             | airplane, a laptop computer, an MP3 player and six CD's."
             | 
             | I didn't need to have seen this book before to know this
             | wasn't in the original 1980s text.
             | 
             | Similarly, if I were reading the Great Gatsby for the first
             | time, and it identified a character as a software engineer,
             | I would notice.
        
               | drusepth wrote:
               | I think there are plenty of humans who wouldn't notice,
               | though.
               | 
               | And probably plenty of AI implementations that would
               | notice.
        
         | tunesmith wrote:
         | I've been curious about this for a while, I have a hobby use-
         | case of wanting to input in-progress novellas and then asking
         | it questions about plot holes, open plot threads, and if new
         | chapter "x" presents any serious plot contradiction problems. I
         | haven't tried exploring that with a vectordb-embeddings
         | approach yet.
        
           | make3 wrote:
           | This is an exact example of something a vector dbs would be
           | terrible at.
           | 
           | Vector dbs work by fetching segments that are similar in
           | topics to the question, so like "Where did <Character> go
           | after <thing>" will retrieve segments with locations & the
           | character & maybe talking about <thing> as a recent event.
           | 
           | Your question has no similarity with the segments required in
           | any way; & it's not the segments that are wrong it's the way
           | they relate to the rest of the story
        
             | HarHarVeryFunny wrote:
             | Do the OpenAI APIs support converting prompts to vectors,
             | or are people running their own models locally to do this?
             | Can you recommend any good resources to read up on vector
             | DB approaches to working around context length limits?
        
             | toss1 wrote:
             | Good points - LLMs are ok at finding things that exist, but
             | they have zero ability to abstract and find what is missing
             | (actually, probably negative; they'd likely hallucinate and
             | fill in the gaps).
             | 
             | Which makes me wonder if the opposite, but more laborious
             | approach might work - request it identify all characters
             | and plot themes, then request summaries of each. You'd have
             | to review the summaries for holes. Lotsa work, but still
             | maybe quicker than re-reading everything yourself?
        
               | TeMPOraL wrote:
               | > _LLMs are ok at finding things that exist, but they
               | have zero ability to abstract and find what is missing
               | (actually, probably negative; they 'd likely hallucinate
               | and fill in the gaps)._
               | 
               | I feel this is mostly a prompting issue. Specifically
               | GPT-4 shows surprising ability to abstract to some degree
               | and work with high-level concepts, but it seems that,
               | quite often, you need to guide it towards the right
               | "mode" of thinking.
               | 
               | It's like dealing with a 4 year old kid. They may be
               | perfectly able to do something you ask them, but will
               | keep doing something else, until you give them specific
               | hints, several times, in different ways.
        
               | vidarh wrote:
               | Firstly, I don't at all agree that they have zero ability
               | to abstract. Doesn't fit my experience at all. A lot of
               | the tasks I use ChatGPT for are exactly to analyse gaps
               | in specifications etc. and have it tell me what is
               | missing, suggest additions or ask for clarifications. It
               | does that just fine.
               | 
               | But I've started experimenting with the second part, of
               | sorts, not to find plot holes but to have it create
               | character sheets for my series of novels for my own
               | reference.
               | 
               | Basically have it maintain a sheet, feed it chunks of
               | one or more chapters, and ask it to output a new sheet
               | augmented with the new details.
               | 
               | With a 100K context window I might just test doing it
               | over whole novels or much larger chunks of one.
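               | 
               | Something like this rolling-summary loop, roughly
               | (llm() is a placeholder, not a real client):
               | 
               |   def build_sheet(chapters, llm):
               |       sheet = "CHARACTER SHEET\n(empty so far)"
               |       for chap in chapters:
               |           sheet = llm(
               |               "Character sheet so far:\n" + sheet
               |               + "\n\nNext chapter:\n" + chap
               |               + "\n\nOutput an updated sheet that keeps"
               |                 " all prior details and adds new ones.")
               |       return sheet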
        
       | sashank_1509 wrote:
       | How are LLMs increasing their context size? I guess you just
       | increase the input size if it's for the self-supervised GPT-3
       | style training, but what about for RLHF? Are they creating
       | datasets of books to input to the LLM and then having human
       | labelers label the response? There might be a smart way that
       | does not involve new datasets.
        
         | sp332 wrote:
         | Mosaic wrote about their new model here.
         | https://www.mosaicml.com/blog/mpt-7b It was trained on 65k
         | inputs and has decent performance working with 80k+ tokens.
        
         | potatoman22 wrote:
         | I don't think RLHF datasets need to take full advantage of the
         | context window. There are also many ways to programmatically
         | generate NLP datasets.
        
       | mark_l_watson wrote:
       | With quadratic time complexity for context size, that gets
       | expensive.
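       | 
       | Rough numbers, assuming naive full self-attention (real
       | implementations use tricks, so treat this as an upper bound):
       | 
       |   old, new = 8_192, 100_000
       |   print((new / old) ** 2)  # ~149x more attention work
       | 
       | Even with better constants, that scaling is a big part of why
       | long-context calls cost what they do.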
        
       | ginger2016 wrote:
       | How do I sign-up? What is the cost?
        
         | karmasimida wrote:
         | Going to be absolutely expensive.
        
       | gigel82 wrote:
       | Nice, that's roughly a 250-page book based on average word
       | counts.
        
       | maxutility wrote:
       | I don't see this in the article. Has Anthropic explained the
       | mechanism by which they were able to cost-effectively expand the
       | context window, and whether there was additional training or a
       | design decision (e.g. alternative positional embedding approach)
       | that helped the model optimize for a larger window?
        
       | cheeselip420 wrote:
       | Maybe this model can finish Winds of Winter and the rest of GoT
       | for us...
        
         | babuloseo wrote:
         | Add Berserk to that list.
        
         | azakai wrote:
         | 75,000 words is a drop in the bucket for A Song of Ice and
         | Fire:
         | 
         | https://blog.fostergrant.co.uk/2017/08/03/word-counts-popula...
        
           | akiselev wrote:
           | You'd want to generate it in multiple steps to make it
           | feasible to control the text generation anyway. First call
           | generates the broad outline, several parallel calls flesh out
           | character development and some other details so that they're
           | consistent, then generate the story piece by piece by feeding
           | in bits of the outline.
        
             | nottorp wrote:
             | And then you end up with what the movie did which is not
             | exactly a GRRM novel.
        
           | camel-cdr wrote:
           | Meanwhile web serial authors: [0] [1]
           | 
           | [0] https://wanderinginn.neocities.org/statistics
           | 
           | [1] https://www.reddit.com/r/Parahumans/comments/rz8ogt/wildb
           | ows...
        
         | thepasswordis wrote:
         | That's actually a really interesting use case!
        
         | pclmulqdq wrote:
         | That may need a million tokens just for one book, though!
        
         | f6v wrote:
         | I'd be excited for Dexter ending that doesn't suck.
        
       | [deleted]
        
       | gumballindie wrote:
       | I am noticing a different tone coming from Anthropic. Unlike
       | OpenAI, they don't appear to be focused on FUD and replacement.
       | It gives the impression it's run by adults instead of crypto bros
       | turned AI experts. Curious how their models will work.
        
         | lubesGordi wrote:
         | Um Ilya Sutskever isn't a crypto bro.
        
           | gumballindie wrote:
           | No, but Sam Altman is. That company can go whistling.
        
       | Workaccount2 wrote:
       | Is there any path towards folding tokens into the actual model?
       | That is, continual training rather than the current "training
       | first, then just tokens after"?
        
         | ilaksh wrote:
         | PaLM 2 on Vertex AI, which Google just released yesterday, has
         | fine-tuning of the large models as a core part of the offering.
        
       | whimsicalism wrote:
       | We need public benchmarks.
       | 
       | This is incredibly fast progress on large contexts, and I would
       | like to see whether they are actually attending equally well to
       | all of the information or whether some sparse approximation is
       | leading to intelligence/reasoning degradation.
        
         | monlockandkey wrote:
         | https://lmsys.org/blog/2023-05-10-leaderboard/
         | 
         | https://chat.lmsys.org/?arena
         | 
         | Claude by Anthropic has more favourable responses than ChatGPT.
        
           | Workaccount2 wrote:
           | ChatGPT3.5*
           | 
           | It's still below GPT4, but it is closer to 4 than 3.5
        
           | polishdude20 wrote:
           | So I tried this prompt in their chatbot arena multiple times.
           | Each time getting the wrong answer:
           | 
           | "Given that Beth is Sue's sister and Arnold is Sue's father
           | and Beth Junior is Beth's Daughter and Jacob is Arnold's
           | Great Grandfather, who is Jacob to Beth Junior?"
        
             | jefftk wrote:
             | Is the right answer pointing out that Arnold might not be
             | Beth's father, and so Beth Junior might be unrelated to
             | Jacob?
        
             | svachalek wrote:
             | I just tried it and gpt-3.5-turbo got it right.
        
       | nynx wrote:
       | There has got to be a number of fascinating tricks that they're
       | using to support context lengths that long. Shame it's all
       | closed-source.
        
       | sweezyjeezy wrote:
       | Can LLMs take advantage of this bigger window to solve meaningful
       | tasks though? I can't imagine in the training data, knowing what
       | happened 100k tokens ago would be _that_ relevant to predicting
       | the current token very often, so unless this is something that
       | the model learns to leverage more implicitly, I'd be a bit
       | pessimistic.
        
         | ttul wrote:
         | Yes. For instance, a large context window allows you to have a
         | chat for months where the model can remember and make use of
         | everything you've ever talked about. That enables creating a
         | much more effective "assistant" that can remember key details
         | months later that may be valuable.
         | 
         | A second example is the analysis of long documents. Today,
         | hacks like chunking and HyDE enable us to ask questions about a
         | long document or a corpus of documents. But it is far superior if
         | the model can ingest the whole document and apply attention to
         | everything, rather than just one chunk at a time. Chunking
         | effectively means that the model is limited to drawing
         | conclusions from one chunk at a time and cannot synthesize
         | useful responses relating to the entire document.
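         | 
         | For illustration, here is a toy sketch of the chunk-and-retrieve
         | workaround that small windows force today. Word overlap stands in
         | for the embedding similarity (or HyDE) scoring a real pipeline
         | would use; none of this is specific to any particular vendor:
         | 
         |     def chunks(text, size=200):
         |         words = text.split()
         |         return [" ".join(words[i:i + size])
         |                 for i in range(0, len(words), size)]
         | 
         |     def score(chunk, question):
         |         # toy relevance score: words shared by chunk and question
         |         return len(set(chunk.lower().split()) & set(question.lower().split()))
         | 
         |     def retrieve(document, question, k=3):
         |         # only the top-k chunks fit into a small context window;
         |         # a 100K window could instead take the whole document
         |         ranked = sorted(chunks(document),
         |                         key=lambda c: score(c, question),
         |                         reverse=True)
         |         return ranked[:k]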
        
           | m3kw9 wrote:
           | Gets pricier as you chat for longer. Imagine sending a single
           | line along with a 20k-token history.
        
           | sweezyjeezy wrote:
           | I'm not questioning whether it would be useful, just whether
           | the token masking used in training is actually going to make
           | the model learn to use context that long.
        
           | woeirua wrote:
           | It remains to be seen just how effective longer contexts are
           | because if the attention vectors don't ever learn to pick up
           | specific items from further back in the text then having more
           | tokens doesn't really matter.
           | 
           | Given that the conventional cost of training attention layers
           | grows quadratically with the number of tokens I think
           | Anthropic is doing some kind of approximation here. Not clear
           | at all that you would get the same results as vanilla
           | attention.
        
             | ttul wrote:
             | They did mention that the inference time to answer a
             | question about the book was something like 22 seconds, so
             | perhaps they are indeed still using self-attention.
        
         | SomewhatLikely wrote:
         | I would guess that semantic similarity would be a stronger
         | training signal than distance once you go beyond a sentence or
         | two away.
        
           | sweezyjeezy wrote:
           | I'm pretty dubious - how would the model not get absolutely
           | swamped by the vast amount of potential context if it's not
           | learning to ignore long range signals for the most part?
        
       | [deleted]
        
       | dr_dshiv wrote:
       | I often prefer Claude over GPT4 (partially due to speed), but it
       | degrades more quickly. Like I can get a better response early,
       | but usually the quality drops faster. But, sometimes if it can
       | really vibe with it, it gets better over time.
        
       | ilaksh wrote:
       | Did anyone else get on the waitlist, get in, and now their
       | console link doesn't work? I remember deciding the code
       | generation wasn't good enough to bother. Not sure if I actually
       | ever activated it but I guess not.
       | 
       | Now I tried to request access again on their form and it just
       | redirected. Can't even tell if that worked.
       | 
       | Does anyone know if this can program as well as GPT-4? Because if
       | so then the larger context window is a big improvement.
        
         | M4v3R wrote:
         | I do have access to it and from my very limited testing it
         | looks like it can program at least on par with GPT-3.5. I
         | didn't have time yet to test it more comprehensively against
         | GPT-4.
        
           | ilaksh wrote:
           | OK great thanks that's what I heard. Very interested to hear
           | about comparisons with GPT-4.
        
       | ablyveiled wrote:
       | What's the catch? Using GPT-4 relative to its own marketing copy
       | was a letdown.
        
       | SeanAnderson wrote:
       | big if true? :)
       | 
       | Exciting to see competition across LLMs for increasing context
       | window size.
       | 
       | I can't find updated pricing anywhere. Previous prices are here:
       | https://cdn2.assets-servd.host/anthropic-website/production/...
       | but don't seem to be embedded directly on the Anthropic website.
       | I tried messing with the URL (apr -> may/jun) but 404'ed.
        
         | kordlessagain wrote:
         | > Exciting to see competition across LLMs for increasing
         | context window size.
         | 
         | Maybe. I think the debate is going to continue about prompt
         | optimization vs. context window size.
         | 
         | A while ago, I had a rather interesting conversation with
         | GPT-3.5 about forgetting things. Knowing what to forget, or
         | delete from the prompt, may be just as important as what to put
         | in it.
         | 
         | Putting the kitchen sink into the prompt probably isn't going
         | to help much past a certain point, and it may be that putting
         | certain things in there based on time and context is a better
         | strategy.
        
           | SeanAnderson wrote:
           | Yeah, there's definitely diminishing returns. I just wanted
           | to talk to ChatGPT about a game I'm developing. I have pages
           | upon pages of product design notes and I'm not able to just
           | copy/paste the whole thing in and start talking to it at 8k
           | context length. There's not really duplicate information as
           | far as I can tell since each section covers new topics. I'm
           | sure there's a way to express the same ideas more succinctly,
           | but I kind of want ChatGPT to do that for me rather than me
           | figuring out how to do that just to interface the ideas into
           | it.
        
       | seydor wrote:
       | So I'm going to just paste a few physics books and ask it "make
       | fusion".
       | 
       | What is the approach to increase the sequence length here?
        
       | [deleted]
        
       | [deleted]
        
       | swiftcoder wrote:
       | > When we asked the model to spot what was different, it
       | responded with the correct answer in 22 seconds.
       | 
       | Now we've gone from using ML to implement slow, unreliable
       | databases, to using ML to implement slow, unreliable string
       | comparison, I guess
        
       | we_never_see_it wrote:
       | Google is really trying to catch up to OpenAI & MS. The truth is
       | they have never been in the race to begin with. All they had and
       | still have is PR stunts. Let's see if their copying of the MS
       | model will produce anything useful.
        
         | oars wrote:
         | Google has multiple horses in this race.
         | 
         | They invested $300m in Anthropic in late 2022:
         | https://www.ft.com/content/583ead66-467c-4bd5-84d0-ed5df7b5b...
         | 
         | (Non-paywall: https://archive.is/Y5A9B)
        
         | thewataccount wrote:
         | > The truth is they have never been in the race to begin with.
         | 
         | Product race? My understanding is they've been so concerned
         | with safety/harm that they've been slow to implement a lot of
         | tools - then OpenAI made an attempt at it anyway.
         | 
         | Google has generally been ahead from a research perspective
         | though. And honestly it's going to be really sad if they just
         | stop releasing papers outright - hopefully they release their
         | previous-gen stuff as they go :/
        
         | andreyk wrote:
         | Curious why you think this? PaLM2 looks great, and Google has
         | been productizing cutting edge AI pretty fast for years.
        
           | sebzim4500 wrote:
           | I guess PaLM2 is competitive with GPT-3.5 so for people not
           | willing to pay it will be an attractive offering.
           | 
           | I'm not sure that counts as 'great' though.
        
             | rsstack wrote:
             | Based on what do you think it's comparable to GPT-3.5 and
             | not to 4? Have we seen a lot of public performance numbers?
        
               | sebzim4500 wrote:
                | They claim it is already being used in Bard; also, if you
                | read the paper, it does much worse on the important
                | benchmarks.
        
           | MacsHeadroom wrote:
           | PaLM 2 can't even solve "Write three sentences ending in the
           | word Apple."
           | 
           | It's worse than GPT-3.5. Go see for yourself at
           | bard.google.com, which is running on PaLM 2 everywhere but
           | the EU as of yesterday.
        
             | Garrrrrr wrote:
             | Ah yes, the famous benchmark for all LLMs. I just tried
             | your novel example with GPT-3.5 and it couldn't solve it
             | either:
             | 
             | > After lunch, I like to snack on a juicy and crisp apple
             | to satisfy my sweet tooth.
             | 
             | > In the fall, many families enjoy going to apple orchards
             | to pick their own apples and make homemade apple pies.
             | 
             | > The new MacBook Pro features a powerful M1 chip and a
             | stunning Retina display, making it the perfect tool for
             | creative professionals who work with Apple software.
        
               | mustacheemperor wrote:
               | Eh, I think as "human evaluated" metrics go, it's a
               | decent test of how well it can parse a reasonably complex
               | sentence and reply accurately.
               | 
               | For me:
               | 
               | GPT4 3/3: I couldn't resist the temptation to take a bite
               | of the juicy, red apple. Her favorite fruit was not a
               | pear, nor an orange, but an apple. When asked what type
               | of tree to plant in our garden, we unanimously agreed on
               | an apple.
               | 
               | GPT3.5 2/3: "After a long day of hiking, I sat under the
               | shade of an apple tree, relishing the sweet crunch of a
               | freshly picked apple." "As autumn approached, the air
               | filled with the irresistible aroma of warm apple pie
               | baking in the oven, teasing my taste buds." "The teacher
               | asked the students to name a fruit that starts with the
               | letter 'A,' and the eager student proudly exclaimed,
               | 'Apple!'"
               | 
               | Bard 0/3: Sure, here are three sentences ending in the
               | word "apple": I ate an apple for breakfast.The apple tree
               | is in bloom. The apple pie was delicious. Is there
               | anything else I can help you with?
               | 
               | Bard definitely seems to fumble the hardest, it's pretty
               | funny how it brackets the response too. "Here's three
               | sentences ending with the word apple!" nope.
               | 
                | Edit: Interestingly enough, Bard seems to outperform
                | GPT3.5 and at least match 4 on my pet test prompt, asking
                | it "What's that Dante quote that goes something like
                | 'before me there were no something, and only something
                | something'?" 3.5 struggled to find it, 4 finds it
                | relatively quickly, and Bard initially told me that quote
                | isn't in the poem, but when I reiterated that I couldn't
                | remember the whole thing it found it immediately and
                | sourced the right translation. It answered as if it were
                | reading out of a specific translation too - "The source I
                | used was..." Is there agent behavior under the hood of
                | Bard, or is that just how the model is trained to
                | communicate?
        
         | kernal wrote:
         | OpenAI is the Microsoft Explorer of AI.
        
         | endisneigh wrote:
         | I don't know how anyone can say this with a straight face when
         | Google is the one who invented LLMs as used today to begin
         | with.
         | 
         | Google has a product issue, not an AI research one.
        
           | cubefox wrote:
           | DeepMind and Google invented many other things, but I think
           | the first GPT style token predictor was actually ... GPT, a
           | model by OpenAI. RLHF was also invented at OpenAI. They also
           | had the first text-to-image model.
        
           | onlyrealcuzzo wrote:
           | It's usually the least informed with the most self-assured
           | sweeping opinions.
        
       | darig wrote:
       | [dead]
        
       | meghan_rain wrote:
       | The most interesting bit is that, for the first time since the
       | release of ChatGPT in November 2022, OpenAI does not have the
       | lead on LLMs anymore.
       | 
       | At least, for people who need large context windows, they would
       | not be the first choice anymore.
        
         | sebzim4500 wrote:
         | GPT-4 still leads in the chatbot arena[1] but at least it is a
         | two horse race now.
         | 
         | [1] https://lmsys.org/blog/2023-05-10-leaderboard/
        
         | refulgentis wrote:
         | Claude has very quietly been better on everything but pricing
         | for a while; it just got buried because they announced on "AI
         | Tuesday" (iirc the GPT-4 and Bing announcement day).
         | 
         | The ChatGPT equivalent is 3x the speed and was somewhere between
         | ChatGPT and GPT-4 on the TriviaQA benchmark replication I did.
         | 
         | Couple tweets with data and examples. Note they're from 8 weeks
         | ago, I know Claude got a version bump, GPT3.5/4 accessible via
         | API seem the same.
         | 
         | [1] brief and graphical summary of speed and TriviaQA
         | https://twitter.com/jpohhhh/status/1638362982131351552?s=46&...
         | 
         | [2] ad hoc side by sides
         | https://twitter.com/jpohhhh/status/1637316127314305024?s=46&...
        
           | com2kid wrote:
           | > I know Claude got a version bump, GPT3.5/4 accessible via
           | API seem the same.
           | 
           | GPT3.5 just got an update a few days ago that resulted in a
           | pretty good improvement on its creativity. I saved some
           | sample outputs from the previous March model, and for the
           | same prompt the difference is quite dramatic. Prose is much
           | less formulaic overall.
        
             | ndr_ wrote:
             | Is this update made visible somewhere? The language models
             | offered on my Playground are still the ones from March,
             | same with ChatGPT.
        
             | refulgentis wrote:
             | Thank you, every little comment I get from fellow boots on
             | the ground is so valuable, lotta noise these days.
             | 
             | Random Q: I haven't used the ChatGPT front end much the past
             | month or two, but I used it a week back and it seemed
             | blazingly faster than my integration. Do you have a sense of
             | whether it got faster too?
        
           | ilaksh wrote:
           | How is the code generation of Claude?
        
             | esafak wrote:
             | And is code generation ability equivalent to code
             | understanding and search ability?
        
             | technics256 wrote:
             | I have access to claude. It's not bad, but decently behind
             | gpt4 for code
        
             | refulgentis wrote:
             | Note, all impressions based on Claude 1.2, got an email
             | from Poe in the last week saying it was version bumped to
             | 1.3 with a focus on coding improvements.
             | 
             | Impressions:
             | 
             | Bad enough compared to GPT-4 that I default to GPT-4. I
             | think if I had api access I'd use it instead, right now it
             | requires more coaxing, and using Poe.
             | 
             | I did find "long-term" chats went better, was really
             | impressed with how it held up when I was asking it a nasty
             | problem that was hard to even communicate verbally. Wrong
             | at first, but as I conversed it was a real conversation.
             | 
             | GPT-4 seems to circle a lower optimum. My academic guess is
             | that it's what Anthropic calls "sycophancy" in its papers;
             | tl;dr, GPT really wants to produce more of what's in the
             | context, so the longer a conversation with initial errors
             | goes, the harder it is to talk it out of the errors.
        
       | flerovium wrote:
       | It means nothing as long as they don't actually let us test the
       | API.
       | 
       | Good luck waiting for it.
        
         | jackson1372 wrote:
         | See the pricing PDF[^1] and API docs[^2], but TL;DR:
         | 
         | - Price per token doesn't change compared to regular models
         | 
         | - Existing api users have access now by setting the `model`
         | param to "claude-v1-100k" or "claude-instant-v1-100k"
         | 
         | - New customers can join waitlist at anthropic.com/product
         | 
         | [1]: https://cdn2.assets-servd.host/anthropic-website/production/...
         | 
         | [2]: https://console.anthropic.com/docs/api/reference#parameters
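         | 
         | A minimal sketch of what a call might look like with the requests
         | library. The endpoint, header names, and Human/Assistant prompt
         | format are taken from Anthropic's public docs at the time, but
         | treat them as assumptions and check the reference above before
         | relying on them:
         | 
         |     import requests
         | 
         |     resp = requests.post(
         |         "https://api.anthropic.com/v1/complete",
         |         headers={"x-api-key": "YOUR_API_KEY",
         |                  "content-type": "application/json"},
         |         json={
         |             "model": "claude-v1-100k",   # or "claude-instant-v1-100k"
         |             "prompt": "\n\nHuman: Summarize the pasted draft.\n\nAssistant:",
         |             "max_tokens_to_sample": 512,
         |             "stop_sequences": ["\n\nHuman:"],
         |         },
         |     )
         |     print(resp.json())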
        
         | nr2x wrote:
         | "POC or GTFO" as the security people say. :-)
        
       | qwertox wrote:
       | The day a quantum computer is able to host a huge LLM, things
       | will get really interesting for humanity.
       | 
       | I say this because I'm not sure how all of this is really going
       | to scale on GPUs. It feels like LLMs are just as magical as
       | quantum computing.
        
       | gdiamos wrote:
       | Nice. Will we be able to get to 1M tokens?
        
         | programmarchy wrote:
         | Seems like a good target. Even 100K seems too small. As a
         | reference point, the Bible is ~750,000 words.
        
           | smallerfish wrote:
           | "You are a hebrew god and below the dashes is The Word. Who
           | will you smite today?"
        
       | vrglvrglvrgl wrote:
       | [dead]
        
       | jacooper wrote:
       | Anthropic is basically Google's OpenAI.
        
         | cubefox wrote:
         | It's not a Google company, their share amounts to ~10%.
        
       | m3kw9 wrote:
       | Is this real input context or is it some vectordb in the
       | background type trickery?
        
         | HarHarVeryFunny wrote:
         | Pretty sure it's not "real" (model) context width.
         | 
         | Another wide context model is MosaicML's
         | MPT-7B-StoryWriter-65k+ which they are describing as having a
         | context width of 65k, but then give a bit more detail to say
         | they are using ALiBi - a type of positional encoding that
         | allows longer contexts at inference time than training (i.e.
         | beyond the real context width of the model).
         | 
         | For these types of "extended context" models to actually reason
         | over inputs longer than the native context width of the model,
         | I _assume_ that there is indeed some sort of vector DB trickery
         | - maybe paging thru the input to generate vector DB content,
         | then using some type of Retrieval Augmented Generation (RAG) to
         | process that using the extended contexts?
         | 
         | Maybe someone from Anthropic or MosaicML could throw us a bone
         | and give a bit more detail of how these are working!
         | 
         | https://www.mosaicml.com/blog/mpt-7b
         | 
         | https://arxiv.org/abs/2005.11401
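         | 
         | For reference, a minimal numpy sketch of the ALiBi idea (this is
         | only the bias term; the usual causal mask and softmax are
         | unchanged, the slopes follow the 8-head case from the paper, and
         | nothing here claims to be what Anthropic actually does):
         | 
         |     import numpy as np
         | 
         |     def alibi_bias(seq_len, num_heads=8):
         |         # one slope per head: 1/2, 1/4, ..., 1/256 for 8 heads
         |         slopes = 2.0 ** -np.arange(1, num_heads + 1)
         |         # signed distance of each key position j from query position i
         |         dist = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
         |         dist = np.minimum(dist, 0)  # future keys handled by the causal mask
         |         # the penalty grows linearly with how far back the key is,
         |         # which is why it extrapolates to lengths unseen in training
         |         return slopes[:, None, None] * dist[None, :, :]  # (heads, q, k)
         | 
         |     # per head h: scores = q @ k.T / sqrt(d) + alibi_bias(n)[h]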
        
       | [deleted]
        
       | minimaxir wrote:
       | No pricing, but given that OpenAI's GPT-4 doubles the cost-per-
       | token if you go from 8k to a 32k context window, I suspect the
       | pricing here will be 2-4x from the base Claude model which is 9k:
       | https://cdn2.assets-servd.host/anthropic-website/production/...
       | 
       | Although with flash attention, who knows if marginal cost scales
       | that consistently.
        
         | adamkochanowicz wrote:
         | https://cdn2.assets-servd.host/anthropic-website/production/...
        
           | minimaxir wrote:
           | Those are the same SKUs I linked.
           | 
            | The new models have a different model identifier that's not
            | listed in the pricing doc, although from looking at the API
            | docs it sounds like the intent may be to replace the base:
            | https://console.anthropic.com/docs/api/reference#-v1-complet...
        
         | f_devd wrote:
         | <4x would be quite optimistic; at ~11x the tokens, quadratic
         | attention means roughly ~121x the compute/memory (even with the
         | lower starting point of flash attention), so unless they already
         | have excessive margins it wouldn't make much sense to go that
         | low.
        
           | sp332 wrote:
           | I was assuming they used a different architecture to get the
           | increase instead of just letting it eat hardware that way.
           | Especially with the speed numbers in the post.
        
         | l1n wrote:
         | Pricing is the same as the base model.
        
           | jimsimmons wrote:
           | Confirmation here:
           | 
           | https://twitter.com/AnthropicAI/status/1656743460769259521?s.
           | ..
        
             | minimaxir wrote:
             | Huh. Well that changes things.
        
               | rat9988 wrote:
               | Only for the duration of the beta
        
               | jimsimmons wrote:
               | Source?
        
               | felixgallo wrote:
               | the actual tweet you linked.
        
               | jimsimmons wrote:
               | It doesn't say exclusively for the beta period
        
               | scoopertrooper wrote:
               | With an extremely literal reading you are correct, but
               | there was clearly an implication.
        
       | alpark3 wrote:
       | I use GPT-4 through the API, but I can't help but hate the
       | token/character-based pricing of these LLM APIs we've seen so
       | far. Because the entire context needs to be fed back into the
       | model, the conversation gets more expensive as it gets longer.
       | Yeah, it's fractions of a cent, but something about it is so
       | psychologically taxing that I'd rather pay a flat sum/month and
       | get unlimited access, even if it costs more considering my usage.
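       | 
       | A toy illustration of why it feels that way (the 500-token turn
       | size is made up; the point is only the shape of the growth):
       | 
       |     turn_tokens = 500                      # assumed tokens per exchange
       |     history, billed = 0, 0
       |     for turn in range(1, 21):
       |         billed += history + turn_tokens    # each request resends the history
       |         history += 2 * turn_tokens         # message + reply stay in context
       |     print(billed)   # ~200,000 prompt tokens billed over 20 turns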
        
         | WA wrote:
         | Have you tried starting a new chat after your first question,
         | but refining your new prompt to include some info you gathered
         | from the first response? This way, you know exactly how many
         | tokens you're going to send.
        
       | RoddaWallPro wrote:
       | I requested & have been waiting for access to Claude for nearly 3
       | months now. Guess the waitlist must be really long...
        
         | jazzyjackson wrote:
         | API access or just access to the chatbot?
         | 
         | You can go through Poe.com
        
         | technics256 wrote:
         | You likely got rejected. It was the same for me; I reapplied
         | with a good use case and was let in.
        
       | melvinmelih wrote:
       | > You can drop multiple documents or even a book into the prompt
       | and then ask Claude questions that require synthesis of knowledge
       | across many parts of the text.
       | 
       | This is cool but does it also work the other way around? Generate
       | a book's worth of content based on a single prompt?
        
         | cubefox wrote:
         | That's a good question. Can Claude write a coherent book?
        
         | Chabsff wrote:
         | Kinda. But it's going to be a lot like how data compression
         | works. There will always be a somewhat fundamental limit to how
         | much "creativity" you can get out of a small prompt generating
         | large texts when using an isolated model.
        
       | worik wrote:
       | Their sign-up form does not let me sign up for early access.
       | 
       | A bit disappointing
        
       | skilled wrote:
       | My wallet is hardly capable of handling 8k GPT-4.
        
       | ibitto wrote:
       | Anyone using Claude? How long did it take you to get access?
        
         | harisec wrote:
         | Claude is available for free in the Poe app (poe.com). I think
         | it's good and underappreciated.
        
           | danysdragons wrote:
           | It is good, but the free subscription to Poe only provides
           | access to Claude Instant. It's impressively fast but not
           | their smartest model (claude-v1.3).
        
           | dkarras wrote:
           | yeah, been using it instead of ChatGPT and it performs better
           | IMO. My conversational LLM of choice for sure.
        
         | Mizza wrote:
         | I've got access, it's _blazing_ fast and seems very good.
         | Solved some of my little puzzles that other models couldn't. I
         | haven't tried ChatGPT-4 yet, but it's the best one that I have
         | used.
        
           | thewataccount wrote:
           | You need to try GPT4 only because GPT3.5 really doesn't
           | compare to it in a lot of ways.
        
           | iEchoic wrote:
           | GPT-4 is a major leap ahead of everything else I've used
           | (including GPT-3.5), so definitely worth trying for
           | comparison.
        
       | pk-protect-ai wrote:
       | Ok. It has some level of spatial comprehension. Unlike GPT-4, it
       | lacks proper time comprehension because it is bad at calculus.
       | Unlike GPT-4, it can't properly solve the traveling salesman
       | problem.
        
       | com2kid wrote:
       | I am curious how consistent Claude is at obeying detailed
       | instructions. One issue ChatGPT 3.5 and 4 have, even with just a
       | few hundred words of instructions, is it forgets instructions
       | given to it earlier on.[1]
       | 
       | This huge context window is awesome though, I'm trying to use
       | LLMs to do small town social interaction simulations, with output
       | in a structured format. Finding ways to compress existing state
       | and pass it around, so the LLM knows the current state of what
       | people in the town did for a given day is hard with a tiny token
       | limit!
       | 
       | [1] For my use cases, early instructions tend to be describing a
       | DSL syntax for responses, if I add too much info after the
       | instructions, the response syntax starts getting wonky!
        
         | rescripting wrote:
         | A simple example I ran into was when I asked ChatGPT to generate
         | a story in madlibs format for my 4 year old daughter. They're in
         | the format "The young _____ went to the ______, ...", and she
         | fills in the blanks with silly nouns/adjectives.
         | 
         | As she kept asking for more, I prompted "great, do another one"
         | and eventually my original instruction fell out of the context
         | window. It continued to generate a children's story, but with
         | no more blanks.
        
           | com2kid wrote:
           | This is actually a different issue, largely a UI one, although
           | one I wish ChatGPT would fix.
           | 
           | There is no good way to tell it "this isn't a conversation,
           | just repeat the answer to the initial prompt again".
           | 
           | The solution is to just re-paste the initial prompt in each
           | time, but still it isn't ideal. There isn't a good way to
           | tell chatgpt "you can throw away all the context after the
           | initial prompt and up until now".
           | 
           | Of course the entire point of ChatGPT is that it maintains a
           | conversation thread, so I get why they don't fix up this edge
           | case.
           | 
           | My problem is more that I give ChatGPT some complicated
           | instructions, and it'll start forgetting the early
           | instructions long before any token limit is reached.
           | 
           | So for example, if early on I ask for certain tokens to be
           | returned in parens and my initial prompt is too long, it'll
           | forget the parens thing and start returning tokens without
           | the surrounding (), which then breaks my parser!
        
             | orost wrote:
             | Almost every UI for LLMs I've seen has a way to specify an
             | initial prompt that never goes out of context, it's strange
             | that it's not a feature in ChatGPT.
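             | 
             | A rough sketch of the pinning approach (count_tokens here is
             | just a stand-in for whatever tokenizer the API uses):
             | 
             |     def build_context(system, turns, budget, count_tokens):
             |         # keep the instruction message pinned, then add turns
             |         # newest-first until the token budget is exhausted
             |         kept, used = [], count_tokens(system)
             |         for turn in reversed(turns):
             |             cost = count_tokens(turn)
             |             if used + cost > budget:
             |                 break
             |             kept.append(turn)
             |             used += cost
             |         return [system] + list(reversed(kept))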
        
       | throwaway012919 wrote:
       | Sounds expensive. I guess we know where the $580M 'investment'
       | from SBF is going now.
        
       ___________________________________________________________________
       (page generated 2023-05-11 23:00 UTC)