[HN Gopher] Against LLM Maximalism
___________________________________________________________________
 
Against LLM Maximalism
 
Author : pmoriarty
Score  : 105 points
Date   : 2023-09-13 12:32 UTC (10 hours ago)
 
(HTM) web link (explosion.ai)
(TXT) w3m dump (explosion.ai)
 
| Grimburger wrote:
| [flagged]
| 
| EGreg wrote:
| I predicted that AI will be the next Web3 -- hugely promising but
| increasingly ignored by HN.
| 
| There will be waves of innovation in the coming years. Web3
| solutions will mostly enrich people or at worst be zero-sum. AI
| solutions, meanwhile, will redistribute wealth from the working
| class to the top 1% and corporations, as well as giving people
| ways to take advantage of vulnerable people and systems at a
| scale never seen before.
| 
| xnx wrote:
| Web3 never got past the idea stage and was never useful.
| Generative AI is already useful and actively used by millions of
| people in their daily work.
| 
| EGreg wrote:
| Same predictable comment every time, and by the same exact people
| too. The first part is not even close to being true. And you
| never mention that the downsides of AI are astronomically larger
| than Web3's. The downside of Web3 is bugs in immutable smart
| contracts, where people lose only what they voluntarily put in.
| The downside of AI is human extinction and people losing in all
| kinds of ways regardless of whether they want to or not.
| 
| phillipcarter wrote:
| How can you hold the opinion that AI is both not useful and that
| it's going to bring human extinction?
| 
| Anyways, incoherence of your argument aside, I'll gladly raise my
| hand as having a use case that LLMs immediately solved for, and
| it's now a core product feature that performs well.
| 
| EGreg wrote:
| Many things are not useful and can bring about human extinction.
| Viruses, volcanoes, or asteroid impacts, for instance. So it's
| not incoherent on its face.
| 
| But I am not even saying that AI is not useful. I am saying that
| every single time someone pops up on HN to defend AI vs Web3,
| they _only_ focus on the possible upside right now for some
| people. Even if AI brings 10000x downside at the same time, they
| would _never ever_ consider mentioning that. But when society
| adopts a new technology, downsides matter even more than upsides.
| Loss of income stability and livelihood for many professions,
| attacks at scale (e.g. on reputation or truth) by botswarms, etc.
| And that is just what's possible with current technology.
| 
| But most of all, for all its upsides, Web3's harm is limited to
| those who voluntarily commit some of their money to a smart
| contract. AI's harm, on the other hand, is far greater and is
| spread out primarily to those who DIDN'T VOLUNTARILY CHOOSE IT or
| even oppose it. That is not very moral as a society. It may
| enrich the tech bros further, but just like other tech projects,
| it will probably come at the expense of many others, especially
| the working class of society. They will have a rude awakening and
| will riot. But they aren't rioting about Web3, because losing
| money you put at risk in a controlled environment is just not in
| the same stratosphere.
| 
| Expect the government to use AI to control the population more
| and more as this civil unrest happens. Look to China to see what
| it would look like. Or Palantir for precrime, etc.
| 
| phillipcarter wrote:
| I guess I'll just say that... I don't believe much of what you're
| saying is going to happen? I don't think I'll convince you and I
| don't think you'll convince me either.
| j16sdiz wrote:
| Three posts from you in this thread. I downvoted two and upvoted
| one.
| 
| Sometimes an unpopular opinion needs more explanation. The other
| two comments are not helpful; this comment is helpful.
| 
| EGreg wrote:
| Thanks. Well -- prepare to be downvoted by the anti-Web3 brigade
| heh
| 
| IshKebab wrote:
| Since when was Web3 hugely promising? Web3 is rightfully being
| ignored because it is useless.
| 
| AI is _already_ extremely useful. There's zero chance that it's a
| fad that will fizzle out. I'm not sure how anyone could come to
| that conclusion.
| 
| EGreg wrote:
| Web3 being hugely promising doesn't mean AI will fizzle out.
| That's a strawman. Try to reply to what's been said. AI has far
| bigger downsides than Web3; Web3 at worst is zero-sum, and people
| voluntarily choose to engage with it. AI can harm many vulnerable
| people and systems that never chose to engage with any of it.
| That's what you call _useful_?
| 
| Also, this idea that just because you _say_ Web3 has no use
| cases, that makes it true, regardless of evidence, is silly.
| 
| IshKebab wrote:
| > Try to reply to what's been said.
| 
| Try to _read_ what's been said. When did I imply that the two are
| linked?
| 
| > AI has far bigger downsides than Web3; Web3 at worst is
| zero-sum, and people voluntarily choose to engage with it.
| 
| Sure. Web3 is a nothing. At worst it will change nothing. But it
| _is_ at its worst. It changes nothing.
| 
| > That's what you call useful?
| 
| AI can be abused, but that obviously doesn't mean that it isn't
| useful. I did not call the abuse of AI useful. Who is arguing
| against straw men now?
| 
| > Also, this idea that just because you say Web3 has no use
| cases, that makes it true, regardless of evidence, is silly.
| 
| Please tell me one practical use of Web3. I did actually google
| it and it returned this list:
| 
| https://www.techtarget.com/searchcio/tip/8-top-Web-30-use-ca...
| 
| 1. Flexible and secure smart contracts - nobody really wants
| this; they don't want to lose all their money due to a bug with
| no recourse.
| 
| 2. Trusted data privacy - this isn't anything concrete.
| 
| 3. Highly targeted [advertising] content - erm, I thought you
| said Web3 has no downsides?
| 
| 4. Automated content creation and curation - another hand wave.
| 
| 5. Unique loyalty programs - ha, come on, really?
| 
| 6. Increased community building - ... this list is exactly what I
| expected ...
| 
| 7. Better omnichannel experiences - ??!?
| 
| 8. Wider use of augmented reality - what has this even got to do
| with Web3?
| 
| Please point me to a realistic use case for Web3.
| 
| EGreg wrote:
| See the list here
| 
| https://intercoin.org/applications
| 
| Would love to see the same type of numbered, point-by-point
| reaction as you did above
| 
| IshKebab wrote:
| Web 5? Lol. As far as I can see, all of those things are already
| totally possible with Web 2.0. Except maybe NFTs? Hard to argue
| that they are useful, though, except for money laundering.
| 
| Could you perhaps pick one or two from that list that you think
| are the best and explain why they can only be implemented with
| smart contracts?
| 
| I mean, take voting for example. You can do voting with a Web 1.0
| website. The challenge is always going to be preventing vote
| stuffing, and the only real way to prevent that is to associate
| votes with real-world IDs. How would Web3 help with that? The
| proper solution is _government-issued_ key pairs, but that
| doesn't sound very Web3 to me.
| EGreg wrote:
| You were fine making a list of 8, and here you punked out? Please
| give your reaction to each one: why they aren't necessary or
| aren't real applications, and why Web3 is useless for them. Each
| one goes into depth on why Web3 matters if you click it.
| 
| Voting can be done with Web 1.0, and in fact is done on
| StackExchange sites. But how do you know someone didn't go into
| the database and change the votes and results? What good are
| elections if you can't trust them?
| 
| atomicnumber3 wrote:
| How is Web3 doing these days, I must ask?
| 
| The only thing I've heard of it recently is that 4chan is still
| doing good business selling ads for NFT and coin scams.
| 
| EGreg wrote:
| Growing at a CAGR of 44%
| 
| https://www.globenewswire.com/en/news-release/2023/03/22/263...
| 
| Expected to hit $50 billion by 2030
| 
| https://www.emergenresearch.com/amp/industry-report/web-3-ma...
| 
| And, for example, $1.1 billion in India
| 
| https://m.economictimes.com/tech/technology/indian-web3-indu...
| 
| hk__2 wrote:
| > Expected to hit $50 billion by 2030
| 
| The definition of "web3" is too vague to estimate reliably: it
| will be $50B according to your second link; $44B by 2031
| according to your first link; $33B according to [1]; $45B
| according to [2]; $16B according to [3].
| 
| [1]: https://www.grandviewresearch.com/press-release/global-web-3...
| 
| [2]: https://www.vantagemarketresearch.com/industry-report/web-30...
| 
| [3]: https://www.skyquestt.com/report/web-3-0-blockchain-market
| 
| naillo wrote:
| The conditional probability that the article is AI-written is
| also so much larger when you encounter .ai TLDs.
| 
| davepeck wrote:
| Explosion is an old-school machine learning company by the people
| who built the spaCy natural language library. They're serious
| practitioners whose work predates the "hype train" you're
| concerned about.
| 
| The blog post might be worth a gander.
| 
| sudb wrote:
| I've had a fair amount of success at work recently with treating
| LLMs - specifically OpenAI's GPT-4 with function calling - as
| modules in a larger system, helped along powerfully by the
| ability to output structured data.
| 
| > Most systems need to be much faster than LLMs are today, and on
| current trends of efficiency and hardware improvements, will be
| for the next several years.
| 
| I think I disagree with the author here, though, and am happy to
| be a technological optimist - if LLMs are used modularly, what's
| to stop us in a few years (presumably still hardware costs, on
| reflection) from eventually having small, fast, specialised LLMs
| for the things where we find them truly useful/irreplaceable?
| 
| syllogism wrote:
| Nothing's to stop us, and in fact we can do that now! This is
| basically what the post advocates for: replacing the LLM calls
| for task-specific things with smaller models. They just don't
| need to be LLMs.
| 
| og_kalu wrote:
| I'll just say there's no guarantee that training or fine-tuning a
| smaller bespoke model will be more accurate (certainly, though,
| it may be accurate enough). Minerva and Med-PaLM are worse than
| GPT-4, for instance.
| 
| syllogism wrote:
| This is where the terminology being used to discuss LLMs today is
| a touch awkward and imprecise.
| 
| There's a key distinction between smaller models trained with
| transfer learning, and just fine-tuning a smaller LLM and still
| using in-context learning.
| 
| Transfer learning means you're training an output network
| specifically for the task you're doing.
| So like, if you're doing classification, you output a vector with
| one element per class, apply a softmax transformation, and train
| on a negative log-likelihood objective. This is direct and
| effective.
| 
| Fine-tuning a smaller LLM so that it's still learning to do text
| generation, but it's better at the kinds of tasks you want to do,
| is a much more mixed experience. The text generation is still
| really difficult, and it's really difficult to learn to follow
| instructions. So all of this still really favours size.
| 
| og_kalu wrote:
| Right, that is a good distinction. Fair enough. I still stand by
| the point that you could train a worse model depending on the
| task. Translation and nuanced classification are both areas where
| I've not seen bespoke models outright better than GPT-4.
| Although, like I said, they could still be good enough given
| speed and compute requirements.
| 
| skybrian wrote:
| I don't understand this heuristic and I think it might be a bit
| garbled. Any idea what the author meant? How do you get 1000?
| 
| > A good rule of thumb is that you'll want ten data points per
| significant digit of your evaluation metric. So if you want to
| distinguish 91% accuracy from 90% accuracy, you'll want to have
| at least 1000 data points annotated. You don't want to be running
| experiments where your accuracy figure says a 1% improvement, but
| actually you went from 94/103 to 96/103.
| 
| akprasad wrote:
| My guess is that this should be something like "If you have n
| significant digits in your evaluation metric, you should have at
| least 10^(n+1) data points."
| 
| wrs wrote:
| Avoiding the term "significant digits" completely: distinguishing
| 91 vs 90 is a difference of 1 on a 0-100 scale. 100x10=1000. If
| you wanted to distinguish 91.0 vs 90.9, that's 1 on a 0-1000
| scale, so you'd want 10,000 points.
| 
| forward-slashed wrote:
| All of this is quite difficult without a DSL to explore and
| construct pipelines for LLMs. Current approaches make iteration
| very slow.
| 
| alexvitkov wrote:
| Sorry if this is a bit ignorant, I don't work in the space, but
| if a single LLM invocation is considered too slow, how could
| splitting it up into a pipeline of LLM invocations which need to
| happen in sequence help?
| 
| Same with reliability - you don't trust the results of one
| prompt, but you trust multiple piped one into another? Even if
| you test the individual components, which is what this approach
| enables and this article heavily advocates for, I still can't
| imagine that 10 unreliable systems, which have to interact with
| each other, are more reliable than one.
| 
| 80% accuracy of one system is 80% accuracy.
| 
| 95% accuracy on each of 10 systems is about 60% accuracy in total
| if you need all of them to work and they fail independently.
| 
| peter_l_downs wrote:
| I think the idea behind breaking down the task into a composable
| pipeline is that you then replace the LLM steps in the pipeline
| with supervised models that are much faster. So you end up with a
| pipeline of non-LLM models, which are faster and more
| explainable.
| 
| syllogism wrote:
| (Author here)
| 
| About the speed, the idea is that if you break down the task, you
| can very often use much smaller models for the component tasks.
| LLMs are approaching prediction tasks under an extremely
| difficult constraint: they don't get to see many labelled
| examples. If you relax that constraint and just use transfer
| learning, you can get better accuracy with much smaller models.
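| 
| To make the task-head idea concrete, here is a minimal PyTorch
| sketch (illustrative only: the dimensions are placeholders, and
| the random tensor stands in for a real pretrained encoder's
| output):
| 
|     import torch
|     import torch.nn as nn
| 
|     ENC_DIM, N_CLASSES = 768, 4
| 
|     # One output element per class, trained on a negative
|     # log-likelihood objective; only this small head is trained.
|     head = nn.Linear(ENC_DIM, N_CLASSES)
|     loss_fn = nn.NLLLoss()
| 
|     doc_vectors = torch.randn(32, ENC_DIM)       # stand-in for encoded texts
|     labels = torch.randint(0, N_CLASSES, (32,))  # gold class per text
| 
|     log_probs = torch.log_softmax(head(doc_vectors), dim=-1)
|     loss = loss_fn(log_probs, labels)
|     loss.backward()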
| The transfer-learning pipeline can also be arranged so that you
| encode the text into vectors once, and you apply multiple little
| task networks over the shared representation. spaCy supports
| this, for instance, and it's easy to do when working directly
| with the networks in PyTorch etc.
| 
| cmcaleer wrote:
| > you don't trust the results of one prompt, but you trust
| multiple piped one into another?
| 
| This is really not at all unusual. Take aircraft, for instance.
| One system is not reliable, for a multitude of reasons. A faulty
| sensor could be misleading, a few bits could get flipped by
| cosmic rays causing ECC to fail, the system itself could be
| poorly calibrated; there are far too many unacceptable risks. But
| add TMR [0][1] and suddenly you are able to trust things a lot
| more. This isn't to say that TMR is bulletproof - see incidents
| like [2] - but redundancy does make it possible to increase trust
| in a system, and to assign blame to the part of a system that is
| faulty (e.g. if 3 systems exist, and 1 appears to be disagreeing
| wildly with 2 and 3, you know to start investigating system 1
| first).
| 
| Would it work here? I don't know! But it doesn't seem like an
| inherently terrible or flawed idea if we look at past
| applications. Ensembling different models is a pretty common
| technique to get better results in ML, and maybe this approach
| would make it easier to find weak links and assign blame.
| 
| [0]: https://en.wikipedia.org/wiki/Triple_modular_redundancy
| 
| [1]: https://en.wikipedia.org/wiki/Air_data_inertial_reference_un...
| 
| [2]: https://www.atsb.gov.au/media/news-items/2022/pitot-probe-co...
| (causing total confusion among the TMR)
| 
| chongli wrote:
| _This isn't to say that TMR is bulletproof - see incidents like
| [2] - but redundancy does make it possible to increase trust in a
| system, and to assign blame to the part of a system that is
| faulty (e.g. if 3 systems exist, and 1 appears to be disagreeing
| wildly with 2 and 3, you know to start investigating system 1
| first)._
| 
| You can only gain trust in this system if you understand the
| error sources for all three systems. If there are any common-mode
| errors, then you can see errors showing up in multiple systems
| simultaneously. For example, if your aircraft is using pitot
| tubes [1] to measure airspeed, then you need to worry about
| multiple tubes icing up at the same time (which is likely, since
| they're in the same environment).
| 
| So it would not add very much trust to implement TMR with three
| different pitot tubes. It would be better to combine the pitot
| tubes with completely different systems, such as radar and GPS,
| to handle the (likely) scenario of two or more pitot tubes icing
| up and failing completely.
| 
| [1]: https://en.wikipedia.org/wiki/Pitot_tube?wprov=sfti1
| 
| vjerancrnjak wrote:
| It's not ignorant; it is a known problem. Before LLMs, approaches
| to machine translation or any high-level language task did start
| with a pipeline (part-of-speech tagging, dependency tree parsing,
| named entity recognition, etc.), but these attempts were quickly
| discarded.
| 
| The models in such a pipeline are not optimized with a joint loss
| (the final machine translation model that maps language A to
| language B does not propagate its error to the low-level models
| in the pipeline).
| 
| A pipeline of LLMs will accumulate error in the same way;
| eventually, the same underlying problem of the pipeline not being
| trained with a joint loss will result in low accuracy.
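| 
| (A back-of-the-envelope illustration of that accumulation,
| assuming each stage fails independently; the 95% per-stage figure
| is just an example:)
| 
|     # Per-stage accuracy compounds multiplicatively across a pipeline.
|     stage_acc = 0.95
|     for n_stages in (1, 5, 10):
|         print(n_stages, round(stage_acc ** n_stages, 3))
|     # 1 0.95
|     # 5 0.774
|     # 10 0.599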
| LLMs, or DNNs in general, do more compute, so they start being
| extremely powerful even when sequenced. Making a sequence of
| decisions with a regular ML model has a similar problem to
| pipelining: if you train it on a single-decision loss and not a
| sequence-of-decisions loss, then there's a question of whether it
| can recover and make the right next step after making a wrong one
| (your training data never included this recovery example). But
| convolutional NNs were so powerful for language tasks that this
| recovery from error was successful (even though you never trained
| CNNs on the joint loss of a sequence of decisions).
| 
| visarga wrote:
| It's not a given that the performance would suffer. For instance,
| you could use self-checking methods like cycle consistency or
| back-translation in a sequence of prompts. Another option is to
| generate multiple answers and then use a voting system to pick
| the best one. This could actually boost the LLM's accuracy,
| although it would require more computation. In various tasks,
| there might be simpler methods for verifying the answer than
| initially generating it.
| 
| Then you have techniques like Tree of Thoughts, which are
| particularly useful for tasks that require strategic planning and
| exploration. You just can't solve these in one single round of
| LLM interaction.
| 
| In real-world applications, developers often choose a series of
| prompts that enable either self-checking or error minimization.
| Alternatively, they can involve a human in the loop to guide the
| system's actions. The point is to design with the system's
| limitations in mind.
| 
| On a side note, if you're using vLLM, you can send up to 20
| requests in parallel without incurring additional costs. The
| server batches these requests and uses key-value caching, so you
| get high token/s throughput. This allows you to resend previous
| outputs for free or run multiple queries on a large text segment.
| So, running many tasks doesn't necessarily slow things down if
| you manage it correctly.
| 
| vjerancrnjak wrote:
| It is a simple problem, and in the literature it was named "label
| bias".
| 
| Let's say you maximize performance of a single piece of the
| pipeline (training on a dataset or something else), and you do it
| the same way for all pieces. The labels that were correct as
| inputs during training are your limitation. Why? Because when a
| mistake happens, you've never learned to recover from it, since
| you always gave the correct labels in training.
| 
| What LLM pipelines do is probably something like this:
| 
| * a complex task is solved by a pipeline of prompts
| 
| * we tweak a single prompt
| 
| * we observe the output at the end of the whole pipeline and
| determine if the tweak was right
| 
| In this way, the joint loss of the pipeline is observed, and that
| is OK.
| 
| But the moment your pipeline is: POS Tagger -> Dependency Tree
| Parser -> Named Entity Recognition -> ... -> Machine Translation,
| and you have separate training sets that maximize performance of
| each particular piece, you are introducing label bias and are
| relying on some luck to recover from errors early in the
| pipeline, because during training the later parts never got
| errors as input and recovered to the correct output.
| 
| phillipcarter wrote:
| So I think this is an excellent post. Indeed, LLM maximalism is
| pretty dumb. LLMs are awesome at specific things and mediocre at
| others.
| In particular, I get the most frustrated when I see people try to
| use them for tasks that need deterministic outputs _and the thing
| you need to create is already known statically_. My hope is that
| it's just people being super excited by the tech.
| 
| I wanted to call this out, though, as it makes the case that to
| improve any component (and really make it production-worthy), you
| need an evaluation system:
| 
| > Intrinsic evaluation is like a unit test, while extrinsic
| evaluation is like an integration test. You do need both. It's
| very common to start building an evaluation set, and find that
| your ideas about how you expect the component to behave are much
| vaguer than you realized. You need a clear specification of the
| component to improve it, and to improve the system as a whole.
| Otherwise, you'll end up in a local maximum: changes to one
| component will seem to make sense in themselves, but you'll see
| worse results overall, because the previous behavior was
| compensating for problems elsewhere. Systems like that are very
| difficult to improve.
| 
| I think this makes sense from the perspective of a team with
| deeper ML expertise.
| 
| What it doesn't mention is that this is an enormous effort, made
| even larger when you don't have existing ML expertise. I've been
| finding this one out the hard way.
| 
| I've found that if you have "hard criteria" to evaluate (i.e.,
| getting the LLM to produce a given structure rather than an
| open-ended output for a chat app), you can quantify improvements
| using observability tools (SLOs!) and iterate in production. Ship
| changes daily, track versions of what you're doing, and keep on
| top of behavior over a period of time. It's arguably a lot less
| "clean", but it's way faster, and because it's working on
| real-world usage data, it's really effective. An ML engineer
| might call that some form of "online test", but I don't think it
| really applies.
| 
| At any rate, there are other use cases where you really do need
| evaluations. The more important correct output is, the more it's
| worth investing in evals. I would argue that if bad outputs have
| high consequences, then maybe LLMs also aren't the right tech for
| the job, but that'll probably change in a few years. And
| hopefully making evaluations will be easier too.
| 
| syllogism wrote:
| (Author here)
| 
| It's true that getting something going end-to-end is more
| important than being perfectionist about individual steps --
| that's a good practical perspective. We hope good evaluation
| won't be such an enormous effort. Most of what we're trying to do
| at Explosion can be summarised as trying to make the right thing
| easy. Our annotation tool Prodigy is designed to scale down to
| smaller use cases, for instance ( https://prodigy.ai ). I admit
| it's still effort, though, and depending on the task, may indeed
| still take expertise.
| 
| axiom92 wrote:
| > tasks that need deterministic outputs and the thing you need to
| create is already known statically
| 
| Wow, interesting. Do you have any examples of this?
| 
| I've realized that LLMs are fairly good at string-processing
| tasks that a really complex regex might also do, so I can see the
| point in those.
| 
| intended wrote:
| Classification tasks come to mind
| 
| og_kalu wrote:
| LLMs are better at that, though. Sure, you may not require them,
| but it certainly wouldn't be for lack of accuracy.
| 
| https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...
| 
| https://arxiv.org/abs/2303.15056
| 
| phillipcarter wrote:
| Yeah, there's a little bit of flex there for sure. An example
| that recently came up for me at work was taking request:response
| pairs from networking events and turning them into a distributed
| trace. You can absolutely get an LLM to do that, but it's very
| slow and can mess up sometimes. But you can also do this 100%
| programmatically! The LLM route feels a little easier at first,
| but it's arguably a bad application of the tech to the problem. I
| tried it out just for fun, but it's not something I'd ever want
| to do for real.
| 
| (Separately, synthesizing a trace from this kind of data is
| impossible to get 100% correct for other reasons, but hey, it's a
| fun thing to try.)
| 
| mark_l_watson wrote:
| I agree with much of the article. You do need to take great care
| to make code with embedded LLM use modular and easily
| maintainable, and otherwise keep code bases tidy.
| 
| I am a fan of tools like LangChain that bring some software order
| to using LLMs.
| 
| BTW, this article is on a blog hosted by the company that writes
| and maintains the excellent spaCy library.
| 
| passion__desire wrote:
| Is anyone working on an OS-level LLM layer? E.g., consider a
| program like GIMP. It would feed its documentation and workflow
| details into an LLM and get embeddings, which would be installed
| with the program just like man pages. Users could then express
| what they want to do in natural language, and GIMP would query
| the LLM and create a workflow that might achieve the task.
| 
| mark_l_watson wrote:
| Apple's CoreML is a large collection of regular models, deep
| learning models, etc. that are easy to use in macOS/iOS/iPadOS
| apps.
| 
| peter_l_downs wrote:
| spaCy [0] is a state-of-the-art, easy-to-use NLP library from the
| pre-LLM era. This post is the spaCy founder's thoughts on how to
| integrate LLMs with the kinds of problems that "traditional" NLP
| is used for right now. It's an advertisement for Prodigy [1],
| their paid tool for using LLMs to assist data labeling. That
| said, I think I largely agree with the premise, and it's worth
| reading the entire post.
| 
| The steps described in "LLM pragmatism" are basically what I see
| my data science friends doing -- it's hard to justify the cost
| (money and latency) of using LLMs directly for all tasks, and
| even if you want to, you'll need a baseline model to compare
| against, so why not use LLMs for dataset creation or augmentation
| in order to train a classic supervised model?
| 
| [0] https://spacy.io/
| 
| [1] https://prodi.gy/
| 
| og_kalu wrote:
| > what I see my data science friends doing -- it's hard to
| justify the cost (money and latency) of using LLMs directly for
| all tasks, and even if you want to, you'll need a baseline model
| to compare against, so why not use LLMs for dataset creation or
| augmentation in order to train a classic supervised model?
| 
| The NLP infrastructure and pipelines we have today aren't there
| because they are necessarily the best way to handle the tasks you
| want. They're in place because computers simply could not
| understand text the way we would like, so shortcuts and
| approximations were necessary.
| 
| Borrowing from the blog: since you could not simply ask the
| computer, "How many paragraphs in this review say something bad
| about the acting? Which actors do they frequently mention?",
| separate processes - something like tagging names, linking them
| to a knowledge base, paragraph-level actor sentiment, etc. - were
| needed.
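| 
| (For a flavour of that decomposition, a rough spaCy sketch of
| just the name-tagging step; "en_core_web_sm" is the standard
| small English pipeline, which must be installed, and the review
| sentence is made up. The linking and sentiment steps would be
| further components layered on top:)
| 
|     import spacy
| 
|     # Tag person names first; entity linking and per-paragraph
|     # sentiment would be separate pipeline components after this.
|     nlp = spacy.load("en_core_web_sm")
|     doc = nlp("The acting was wooden. Jane Doe never convinces.")
|     actors = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
|     print(actors)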
| The approximations are cool, and they do work rather well for
| some use cases, but they fall apart in many others.
| 
| This is why automated resume filtering, moderation, etc. are
| still awful with the old techniques. You simply can't do what is
| suggested above and get the same utility.
___________________________________________________________________
(page generated 2023-09-13 23:00 UTC)