[HN Gopher] Against LLM Maximalism
       ___________________________________________________________________
        
       Against LLM Maximalism
        
       Author : pmoriarty
       Score  : 105 points
       Date   : 2023-09-13 12:32 UTC (10 hours ago)
        
 (HTM) web link (explosion.ai)
 (TXT) w3m dump (explosion.ai)
        
       | Grimburger wrote:
       | [flagged]
        
         | EGreg wrote:
         | I predicted that AI will be the next Web3 -- hugely promising
         | but increasingly ignored by HN.
         | 
          | There will be waves of innovation in the coming years. Web3
          | solutions will mostly enrich people or at worst be zero-sum,
          | while AI solutions will redistribute wealth from the working
          | class to the top 1% and corporations, as well as give people
          | ways to take advantage of vulnerable people and systems at a
          | scale never seen before.
        
           | xnx wrote:
           | Web3 never got past the idea stage and was never useful.
           | Generative AI is already useful and actively used by millions
           | of people in their daily work.
        
             | EGreg wrote:
             | Same predictable comment every time, and by the same exact
             | people too. The first part is not even close to being true.
             | And you never mention that the downsides of AI are
             | astronomically larger than Web3. The downside of Web3 is
             | bugs in immutable smart contracts where people lose only
             | what they voluntarily put in. The downside of AI is human
             | extinction and people losing in all kinds of ways
             | regardless of whether they want to or not.
        
               | phillipcarter wrote:
                | How can you hold the opinion that AI is both not useful
                | and that it will bring human extinction?
                | 
                | Anyways, incoherence of your argument aside, I'll gladly
                | raise my hand as having a use case that LLMs immediately
                | solved, and it's now a core product feature that
                | performs well.
        
               | EGreg wrote:
               | Many things are not useful and can bring about human
                | extinction. Viruses, volcanoes, or asteroid impacts for
               | instance. So it's not incoherent on its face.
               | 
               | But I am not even saying that AI is not useful. I am
               | saying that every single time someone pops up on HN to
               | defend AI vs Web3, they _only_ focus on possible upside
               | right now for some people. Even if AI brings 10000x
               | downside at the same time, they would _never ever_
               | consider mentioning that. But when society adopts a new
               | technology, downsides matter even more than upsides. Loss
               | of income stability and livelihood for many professions,
               | attacks at scale (eg on reputation or truth) by
               | botswarms, etc etc. And that is just what's possible with
               | current technology.
               | 
               | But most of all, for all its upsides, Web3's harm is
               | limited to those who voluntarily commit some of their
                | money to a smart contract. AI's harm, on the other hand,
                | is far greater and is spread primarily to those who DIDN'T
                | VOLUNTARILY CHOOSE IT or who even oppose it. That is
               | not very moral as a society. It may enrich the tech bros
               | further, but just like other tech projects, it will
               | probably come at the expense of many others, especially
               | the working class of society. They will have a rude
               | awakening and will riot. But they aren't rioting about
               | Web3, because losing money you put at risk in a
               | controlled environment is just not in the same
               | stratosphere.
               | 
               | Expect the government to use AI to control the population
               | more and more as this civil unrest happens. Look to China
               | to see what it would look like. Or Palantir for precrime
               | etc.
        
               | phillipcarter wrote:
               | I guess I'll just say that...I don't believe much of what
               | you're saying is going to happen? I don't think I'll
               | convince you and I don't think you'll convince me either.
        
               | j16sdiz wrote:
                | Three posts from you in this thread. I downvoted two and
                | upvoted one.
                | 
                | Sometimes an unpopular opinion needs more explanation.
                | The other two comments are not helpful; this comment is
                | helpful.
        
               | EGreg wrote:
               | Thanks. Well -- prepare to be downvoted by the anti Web3
               | brigade heh
        
           | IshKebab wrote:
            | Since when was Web3 hugely promising? It is rightfully being
            | ignored because it is useless.
           | 
            | AI is _already_ extremely useful. There's zero chance that
           | it's a fad that will fizzle out. I'm not sure how anyone
           | could come to that conclusion.
        
             | EGreg wrote:
              | Web3 being hugely promising doesn't mean AI will fizzle
             | out. That's a strawman. Try to reply to what's been said.
             | AI has far bigger downsides than Web3, Web3 at worst is
             | zero-sum and people voluntarily choose to engage with it.
             | AI can harm many vulnerable people and systems, that never
             | chose to engage with any of it. That's what you call
             | _useful_?
             | 
              | Also, the idea that Web3 has no use cases just because you
              | _say_ so, regardless of evidence, is silly.
        
               | IshKebab wrote:
               | > Try to reply to what's been said.
               | 
                | Try to _read_ what's been said. When did I imply that
               | the two are linked?
               | 
               | > AI has far bigger downsides than Web3, Web3 at worst is
               | zero-sum and people voluntarily choose to engage with it.
               | 
                | Sure. Web3 is a nothing. At worst it will change nothing.
                | But it _is_ at that worst: it changes nothing.
               | 
               | > That's what you call useful?
               | 
               | AI can be abused, but that obviously doesn't mean that it
               | isn't useful. I did not call the abuse of AI useful. Who
               | is arguing against straw men now?
               | 
                | > Also, the idea that Web3 has no use cases just because
                | you say so, regardless of evidence, is silly.
               | 
               | Please tell me one practical use of Web3. I did actually
               | google it and it returned this list:
               | 
               | https://www.techtarget.com/searchcio/tip/8-top-
               | Web-30-use-ca...
               | 
               | 1. Flexible and secure smart contracts - nobody really
               | wants this; they don't want to lose all their money due
               | to a bug with no recourse.
               | 
               | 2. Trusted data privacy - this isn't anything concrete.
               | 
               | 3. Highly targeted [advertising] content - erm I thought
               | you said web3 has no downsides?
               | 
               | 4. Automated content creation and curation - another hand
               | wave.
               | 
               | 5. Unique loyalty programs - ha, come on, really?
               | 
               | 6. Increased community building - ... this list is
               | exactly what I expected ...
               | 
               | 7. Better omnichannel experiences - ??!?
               | 
               | 8. Wider use of augmented reality - what has this even
               | got to do with web3?
               | 
               | Please point me to a realistic use case for web3.
        
               | EGreg wrote:
               | See the list here
               | 
               | https://intercoin.org/applications
               | 
                | Would love to see the same type of point-by-point
                | reaction, numbered as you did above.
        
               | IshKebab wrote:
               | Web 5? Lol. As far as I can see all of those things are
               | already totally possible with web 2.0. Except maybe NFTs?
               | Hard to argue that they are useful though except for
               | money laundering.
               | 
               | Could you perhaps pick one or two from that list that you
               | think are the best and explain why they can only be
               | implemented with smart contracts?
               | 
               | I mean, take voting for example. You can do voting with a
               | web 1.0 website. The challenge is always going to be
               | preventing vote stuffing, and the only real way to
               | prevent that is to associate votes with real world IDs.
               | How would web3 help with that? The proper solution is
                | _government issued_ key pairs, but that doesn't sound
               | very web3 to me.
        
               | EGreg wrote:
               | You were fine making a list of 8 and here you punked out?
               | Please give your reaction to each one, why they aren't
               | necessary or aren't real applications and why Web3 is
               | useless for them. Each one goes into depth for why Web3
               | matters if you click it.
               | 
               | Voting can be done with Web 1.0 and in fact is done with
                | StackExchange sites. But how do you know someone didn't go
               | into the database and change the votes and results? What
               | good are elections if you can't trust them?
        
           | atomicnumber3 wrote:
           | How is Web3 doing these days, I must ask?
           | 
           | The only thing I've heard of it recently is that 4chan is
           | still doing good business selling ads for NFT and coin scams.
        
             | EGreg wrote:
             | Growing at a CAGR of 44%
             | 
             | https://www.globenewswire.com/en/news-
             | release/2023/03/22/263...
             | 
             | Expected to hit $50 billion by 2030
             | 
             | https://www.emergenresearch.com/amp/industry-
             | report/web-3-ma...
             | 
             | And for example $1.1 Billion in India
             | 
             | https://m.economictimes.com/tech/technology/indian-
             | web3-indu...
        
               | hk__2 wrote:
               | > Expected to hit $50 billion by 2030
               | 
               | The definition of "web3" is too vague to have a correct
               | estimation: it will be $50B according to your second
               | link; $44B by 2031 according to your first link; $33B
                | according to [1]; $45B according to [2]; $16B according to
               | [3].
               | 
               | [1]: https://www.grandviewresearch.com/press-
               | release/global-web-3...
               | 
               | [2]: https://www.vantagemarketresearch.com/industry-
               | report/web-30...
               | 
               | [3]: https://www.skyquestt.com/report/web-3-0-blockchain-
               | market
        
         | naillo wrote:
          | The conditional probability that the article is AI-written is
          | also so much larger when you encounter .ai TLDs.
        
         | davepeck wrote:
         | Explosion is an old school machine learning company by the
         | people who built the spaCy natural language library. They're
         | serious practitioners whose work predates the "hype-train"
         | you're concerned about.
         | 
         | The blog post might be worth a gander.
        
       | sudb wrote:
       | I've had a fair amount of success at work recently with treating
       | LLMs - specifically OpenAI's GPT-4 with function calling - as
       | modules in a larger system, helped along powerfully by the
       | ability to output structured data.
       | 
       | > Most systems need to be much faster than LLMs are today, and on
       | current trends of efficiency and hardware improvements, will be
       | for the next several years.
       | 
        | I think I disagree with the author here, though, and am happy
        | to be a technological optimist - if LLMs are used modularly,
        | what's to stop us in a few years (presumably still hardware
        | costs, on reflection) from eventually having small, fast,
        | specialised LLMs for the things we find them truly
        | useful/irreplaceable for?
        
         | syllogism wrote:
         | Nothing's to stop us, and in fact we can do that now! This is
         | basically what the post advocates for: replacing the LLM calls
         | for task-specific things with smaller models. They just don't
         | need to be LLMs.
        
       | og_kalu wrote:
       | I'll just say there's no guarantee training or fine-tuning a
        | smaller bespoke model will be more accurate (certainly, though,
        | it may be accurate enough). Minerva and Med-PaLM are worse than
        | GPT-4, for instance.
        
         | syllogism wrote:
         | This is where the terminology being used to discuss LLMs today
         | is a touch awkward and imprecise.
         | 
         | There's a key distinction between smaller models trained with
         | transfer-learning, and just fine-tuning a smaller LLM and still
         | using in-context learning.
         | 
         | Transfer learning means you're training an output network
         | specifically for the task you're doing. So like, if you're
         | doing classification, you output a vector with one element per
         | class, apply a softmax transformation, and train on a negative
         | log likelihood objective. This is direct and effective.
         | 
         | Fine-tuning a smaller LLM so that it's still learning to do
         | text generation, but it's better at the kinds of tasks you want
         | to do, is a much more mixed experience. The text generation is
         | still really difficult, and it's really difficult to learn to
         | follow instructions. So all of this still really favours size.
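          | 
          | For concreteness, here's a minimal sketch of that
          | transfer-learning head in plain PyTorch (illustrative only:
          | the embedding size, class count, and data are made up):
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     # A frozen encoder produces a document vector; we train a
          |     # small task-specific head on top of it.
          |     class ClassifierHead(nn.Module):
          |         def __init__(self, embed_dim, n_classes):
          |             super().__init__()
          |             self.linear = nn.Linear(embed_dim, n_classes)
          | 
          |         def forward(self, doc_vectors):
          |             # One logit per class; softmax lives in the loss.
          |             return self.linear(doc_vectors)
          | 
          |     head = ClassifierHead(embed_dim=768, n_classes=3)
          |     loss_fn = nn.CrossEntropyLoss()  # softmax + NLL objective
          |     optim = torch.optim.Adam(head.parameters(), lr=1e-3)
          | 
          |     doc_vectors = torch.randn(32, 768)  # fake encoder output
          |     labels = torch.randint(0, 3, (32,))  # gold labels
          |     loss = loss_fn(head(doc_vectors), labels)
          |     loss.backward()
          |     optim.step()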
        
           | og_kalu wrote:
            | Right, that's a good distinction. Fair enough. I still
            | stand by the point that you could train a worse model
            | depending on the task. Translation and nuanced
            | classification are both instances where I've not seen
            | bespoke models outright better than GPT-4, although, like
            | I said, a bespoke model could still be good enough given
            | the speed and compute requirements.
        
       | skybrian wrote:
       | I don't understand this heuristic and I think it might be a bit
       | garbled. Any idea what the author meant? How do you get 1000?
       | 
       | > A good rule of thumb is that you'll want ten data points per
       | significant digit of your evaluation metric. So if you want to
       | distinguish 91% accuracy from 90% accuracy, you'll want to have
       | at least 1000 data points annotated. You don't want to be running
       | experiments where your accuracy figure says a 1% improvement, but
       | actually you went from 94/103 to 96/103.
        
         | akprasad wrote:
         | My guess is that this should be something like "If you have n
         | significant digits in your evaluation metric, you should have
         | at least 10^(n+1) data points."
        
           | wrs wrote:
           | Avoiding the term "significant digits" completely:
           | Distinguishing 91 vs 90 is a difference of 1 on a 0-100
           | scale. 100x10=1000. If you wanted to distinguish 91.0 vs
           | 90.9, that's 1 on a 0-1000 scale, so you'd want 10,000
           | points.
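            | 
            | In code, as a rough illustrative rule (the function name is
            | made up):
            | 
            |     # To resolve a difference of `delta` in an accuracy
            |     # metric, you want roughly 10 / delta annotated examples.
            |     def rough_sample_size(delta):
            |         return int(10 / delta)
            | 
            |     print(rough_sample_size(0.01))   # 91% vs 90%     -> 1000
            |     print(rough_sample_size(0.001))  # 91.0% vs 90.9% -> 10000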
        
       | forward-slashed wrote:
        | All of this is quite difficult without a DSL to explore and
        | construct LLM pipelines. Current approaches are very slow in
        | terms of iteration.
        
       | alexvitkov wrote:
       | Sorry if this is a bit ignorant, I don't work in the space, but
       | if a single LLM invocation is considered too slow, how could
       | splitting it up into a pipeline of LLM invocations which need to
       | happen in sequence help?
       | 
       | Same with reliability - you don't trust the results of one
       | prompt, but you trust multiple piped one into another? Even if
       | you test the individual components, which is what this approach
       | enables and this article heavily advocates for, I still can't
       | imagine that 10 unreliable systems, which have to interact with
        | each other, are more reliable than one.
       | 
       | 80% accuracy of one system is 80% accuracy.
       | 
       | 95% accuracy on 10 systems is 59% accuracy in total if you need
       | all of them to work and they fail independently.
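        | 
        | That last figure is just compounding independent per-stage
        | accuracy, e.g.:
        | 
        |     stages = 10
        |     per_stage_accuracy = 0.95
        |     # Probability that every stage succeeds, assuming
        |     # independent failures:
        |     print(per_stage_accuracy ** stages)  # ~0.599, roughly 59%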
        
         | peter_l_downs wrote:
         | I think the idea behind breaking down the task into a
         | composable pipeline is that you then replace the LLM steps in a
         | pipeline with supervised models that are much faster. So you
         | end up with a pipeline of non-LLM models, which are faster and
         | more explainable.
        
         | syllogism wrote:
         | (Author here)
         | 
         | About the speed, the idea is that if you break down the task,
         | you can very often use much smaller models for the component
         | tasks. LLMs are approaching prediction tasks under an extremely
         | difficult constraint: they don't get to see many labelled
         | examples. If you relax that constraint and just use transfer-
         | learning, you can get better accuracy with much smaller models.
         | The transfer-learning pipeline can also be arranged so that you
         | encode the text into vectors once, and you apply multiple
         | little task networks over the shared representation. spaCy
         | supports this for instance, and it's easy to do when working
         | directly with the networks in PyTorch etc.
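          | 
          | As a rough sketch of that shared-representation layout (plain
          | PyTorch here rather than spaCy, and the two task heads are
          | invented for illustration):
          | 
          |     import torch.nn as nn
          | 
          |     class MultiTaskModel(nn.Module):
          |         def __init__(self, encoder, embed_dim):
          |             super().__init__()
          |             self.encoder = encoder                    # shared
          |             self.sentiment = nn.Linear(embed_dim, 2)  # head 1
          |             self.topics = nn.Linear(embed_dim, 8)     # head 2
          | 
          |         def forward(self, tokens):
          |             shared = self.encoder(tokens)  # encode once
          |             return {
          |                 "sentiment": self.sentiment(shared),
          |                 "topics": self.topics(shared),
          |             }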
        
         | cmcaleer wrote:
          | > you don't trust the results of one prompt, but you trust
         | multiple piped one into another?
         | 
         | This is really not at all unusual. Take aircraft for instance.
         | One system is not reliable, for a multitude of reasons. A
         | faulty sensor could be misleading, a few bits could get flipped
         | by cosmic rays causing ECC to fail, the system itself could be
          | poorly calibrated; there are far too many unacceptable risks.
         | But add TMR[0][1] and suddenly you are able to trust things a
         | lot more. This isn't to say that TMR is bullet proof e.g.
         | incidents like [2], but redundancy does make it possible to
         | increase trust in a system, and assign blame to what part of a
         | system is faulty (e.g. if 3 systems exist, and 1 appears to be
         | disagreeing wildly with 2 and 3, you know to start
         | investigating system 1 first).
         | 
         | Would it work here? I don't know! But it doesn't seem like an
         | inherently terrible or flawed idea if we look at past
         | applications. Ensembling different models is a pretty common
         | technique to get better results in ML, and maybe this approach
         | would make it easier to find weak links and assign blame.
         | 
         | [0]: https://en.wikipedia.org/wiki/Triple_modular_redundancy
         | 
         | [1]:
         | https://en.wikipedia.org/wiki/Air_data_inertial_reference_un...
         | 
         | [2]: https://www.atsb.gov.au/media/news-items/2022/pitot-probe-
         | co... causing total confusion among the TMR
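          | 
          | A toy version of that majority-vote-and-blame idea, just to
          | make it concrete (not aviation code, obviously):
          | 
          |     from collections import Counter
          | 
          |     def majority_vote(answers):
          |         winner, _ = Counter(answers).most_common(1)[0]
          |         dissenters = [i for i, a in enumerate(answers)
          |                       if a != winner]
          |         return winner, dissenters
          | 
          |     # Third system disagrees -> investigate it first.
          |     print(majority_vote(["ok", "ok", "fault"]))  # ('ok', [2])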
        
           | chongli wrote:
            | _This isn't to say that TMR is bullet proof e.g. incidents
           | like [2], but redundancy does make it possible to increase
           | trust in a system, and assign blame to what part of a system
           | is faulty (e.g. if 3 systems exist, and 1 appears to be
           | disagreeing wildly with 2 and 3, you know to start
           | investigating system 1 first)._
           | 
           | You can only gain trust in this system if you understand the
            | error sources for all three systems. If there are any
            | common-mode errors, you can see errors showing up in multiple
           | systems simultaneously. For example, if your aircraft is
           | using pitot tubes [1] to measure airspeed then you need to
           | worry about multiple tubes icing up at the same time (which
           | is likely since they're in the same environment).
           | 
           | So it would not add very much trust to implement TMR with
           | three different pitot tubes. It would be better to combine
           | the pitot tubes with completely different systems, such as
           | radar and GPS, to handle the (likely) scenario of two or more
           | pitot tubes icing up and failing completely.
           | 
           | [1] https://en.wikipedia.org/wiki/Pitot_tube?wprov=sfti1
        
         | vjerancrnjak wrote:
         | It's not ignorant. It is a known problem. Before LLMs,
         | approaches to machine translation or any high level language
         | tasks did start with a pipeline (part of speech tagging,
         | dependency tree parsing, named entity recognition etc.) but
         | quickly these attempts were discarded.
         | 
          | None of the models in the pipeline are optimized with the
          | joint loss (the final machine translation model that maps lang
          | A to lang B does not propagate its error to the lower-level
          | models in the pipeline).
          | 
          | A pipeline of LLMs will accumulate error in the same way;
          | eventually the same underlying problem of the pipeline not
          | being trained with the joint loss will result in low accuracy.
         | 
          | LLMs, or DNNs in general, do more compute, so they are
          | extremely powerful even when sequenced. Making a sequence of
          | decisions with a regular ML model has a similar problem to
          | pipelining: if you train it on a single-decision loss and not
          | a loss over the sequence of decisions, there's a question of
          | whether it can recover and make the right next step after it
          | made a wrong step (your training data never included that
          | recovery example). But convolutional NNs were so powerful for
          | language tasks that this recovery from error was successful,
          | even though you never trained the CNNs on the joint loss over
          | the sequence of decisions.
        
           | visarga wrote:
           | It's not a given that the performance would suffer. For
           | instance, you could use self-checking methods like cycle
           | consistency or back translation in a sequence of prompts.
           | Another option is to generate multiple answers and then use a
           | voting system to pick the best one. This could actually boost
           | the LLM's accuracy, although it would require more
           | computation. In various tasks, there might be simpler methods
           | for verifying the answer than initially generating it.
           | 
           | Then you have techniques like the Tree of Thoughts, which are
           | particularly useful for tasks that require strategic planning
           | and exploration. You just can't solve these in one single
           | round of LLM interaction.
           | 
           | In real-world applications, developers often choose a series
           | of prompts that enable either self-checking or error
           | minimization. Alternatively, they can involve a human in the
           | loop to guide the system's actions. The point is to design
           | with the system's limitations in mind.
           | 
           | On a side note, if you're using vLLM, you can send up to 20
           | requests in parallel without incurring additional costs. The
           | server batches these requests and uses key-value caching, so
           | you get high token/s throughput. This allows you to resend
           | previous outputs for free or run multiple queries on a large
           | text segment. So, running many tasks doesn't necessarily slow
           | things down if you manage it correctly.
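            | 
            | For example, a back-translation consistency check might look
            | roughly like this (`translate` and `similarity` are
            | hypothetical stand-ins for whatever LLM client and text
            | similarity metric you use):
            | 
            |     def translate(text, src, tgt):
            |         raise NotImplementedError  # call your LLM here
            | 
            |     def back_translation_check(text, similarity,
            |                                threshold=0.8):
            |         forward = translate(text, "en", "fr")
            |         back = translate(forward, "fr", "en")
            |         # Accept only if the round trip preserves enough
            |         # of the original meaning.
            |         return similarity(text, back) >= threshold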
        
             | vjerancrnjak wrote:
              | It is a simple problem, and in the literature it was named
              | "label bias".
              | 
              | Let's say you maximize the performance of a single piece of
              | the pipeline (by training on a dataset or something else),
              | and you do it the same way for all pieces. The labels that
              | were correct as inputs during training are your limitation.
              | Why? Because when a mistake happens, the model has never
              | learned to recover from it, because you always gave it the
              | correct labels during training.
             | 
             | What LLM pipelines do is probably something like this:
             | 
             | * a complex task is solved by a pipeline of prompts
             | 
             | * we tweak a single prompt
             | 
             | * we observe the output at the end of the whole pipeline
             | and determine if the tweak was right
             | 
             | In this way, the joint loss of the pipeline is observed and
             | that is ok.
             | 
             | But, the moment your pipeline is: POS Tagger -> Dependency
             | Tree Parser -> Named Entity Recognition -> ... -> Machine
             | Translation
             | 
              | and you have separate training sets that maximize the
              | performance of each particular piece, you are introducing
              | label bias and relying on luck to recover from errors early
              | in the pipeline, because during training the later parts
              | never received errors as input and so never learned to
              | recover to the correct output.
        
       | phillipcarter wrote:
       | So I think this is an excellent post. Indeed, LLM maximalism is
       | pretty dumb. They're awesome at specific things and mediocre at
       | others. In particular, I get the most frustrated when I see
       | people try to use them for tasks that need deterministic outputs
       | _and the thing you need to create is already known statically_.
        | My hope is that it's just people being super excited by the
       | tech.
       | 
       | I wanted to call this out, though, as it makes the case that to
       | improve any component (and really make it production-worthy), you
       | need an evaluation system:
       | 
       | > Intrinsic evaluation is like a unit test, while extrinsic
       | evaluation is like an integration test. You do need both. It's
       | very common to start building an evaluation set, and find that
       | your ideas about how you expect the component to behave are much
       | vaguer than you realized. You need a clear specification of the
       | component to improve it, and to improve the system as a whole.
       | Otherwise, you'll end up in a local maximum: changes to one
       | component will seem to make sense in themselves, but you'll see
       | worse results overall, because the previous behavior was
       | compensating for problems elsewhere. Systems like that are very
       | difficult to improve.
       | 
       | I think this makes sense from the perspective of a team with
       | deeper ML expertise.
       | 
       | What it doesn't mention is that this is an enormous effort, made
       | even larger when you don't have existing ML expertise. I've been
       | finding this one out the hard way.
       | 
       | I've found that if you have "hard criteria" to evaluate (i.e.,
       | getting the LLM to produce a given structure rather than an open-
       | ended output for a chat app) you can quantify improvements using
       | Observability tools (SLOs!) and iterating in production. Ship
       | changes daily, track versions of what you're doing, and keep on
       | top of behavior over a period of time. It's arguably a lot less
       | "clean" but it's way faster, and because it's working on the
       | real-world usage data, it's really effective. An ML engineer
       | might call that some form of "online test" but I don't think it
       | really applies.
       | 
       | At any rate, there are other use cases where you really do need
       | evaluations, though. The more important correct output is, the
       | more it's worth investing in evals. I would argue that if bad
       | outputs have high consequences, then maybe LLMs also aren't the
       | right tech for the job, but that'll probably change in a few
       | years. And hopefully making evaluations will be easier too.
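        | 
        | To illustrate the "hard criteria" case above: the check can be
        | as simple as validating structure and tracking the pass rate
        | over real traffic (the schema here is hypothetical):
        | 
        |     import json
        | 
        |     REQUIRED_FIELDS = {"title", "summary", "tags"}
        | 
        |     def passes_hard_criteria(raw_output):
        |         try:
        |             data = json.loads(raw_output)
        |         except json.JSONDecodeError:
        |             return False
        |         return REQUIRED_FIELDS.issubset(data)
        | 
        |     outputs = ['{"title": "t", "summary": "s", "tags": []}',
        |                "not json"]
        |     rate = sum(passes_hard_criteria(o) for o in outputs)
        |     print(rate / len(outputs))  # 0.5 -> track against an SLO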
        
         | syllogism wrote:
         | (Author here)
         | 
         | It's true that getting something going end-to-end is more
         | important than being perfectionist about individual steps --
         | that's a good practical perspective. We hope good evaluation
         | won't be such an enormous effort. Most of what we're trying to
         | do at Explosion can be summarised as trying to make the right
         | thing easy. Our annotation tool Prodigy is designed to scale
         | down to smaller use-cases for instance ( https://prodigy.ai ).
         | I admit it's still effort though, and depending on the task,
         | may indeed still take expertise.
        
         | axiom92 wrote:
         | > tasks that need deterministic outputs and the thing you need
         | to create is already known statically
         | 
         | Wow, interesting. Do you have any example for this?
         | 
         | I've realized that LLMs are fairly good at string processing
         | tasks that a really complex regex might also do, so I can see
         | the point in those.
        
           | intended wrote:
           | Classification tasks come to mind
        
             | og_kalu wrote:
             | LLMs are better at that though. Sure you may not require
             | them but it certainly wouldn't be for a lack of accuracy.
             | 
             | https://www.artisana.ai/articles/gpt-4-outperforms-elite-
             | cro...
             | 
             | https://arxiv.org/abs/2303.15056
        
           | phillipcarter wrote:
           | Yeah, there's a little bit of flex there for sure. An example
           | that recently came up for me at work was being able to take
           | request:response pairs from networking events and turn them
           | into a distributed trace. You can absolutely get an LLM to do
           | that, but it's very slow and can mess up sometimes. But you
           | can also do this 100% programmatically! The LLM route feels a
           | little easier at first but it's arguably a bad application of
           | the tech to the problem. I tried it out just for fun, but
           | it's not something I'd ever want to do for real.
           | 
           | (separately, synthesizing a trace from this kind of data is
           | impossible to get 100% correct for other reasons, but hey,
           | it's a fun thing to try)
        
       | mark_l_watson wrote:
       | I agree with much of the article. You do need to take great care
       | to make code with embedded LLM use modular and easily
       | maintainable, and otherwise keep code bases tidy.
       | 
       | I am a fan of tools like LangChain that bring some software order
       | to using LLMs.
       | 
        | BTW, this article is a blog post hosted by the company that
        | writes and maintains the excellent spaCy library.
        
         | passion__desire wrote:
          | Is anyone working on an OS LLM layer? E.g. consider a program
          | like GIMP. It would feed its documentation and workflow
          | details into an LLM and get embeddings, which would be
          | installed with the program just like man pages. Users could
          | just express what they want to do in natural language, and
          | GIMP would query the LLM and create a workflow that might
          | achieve the task.
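          | 
          | A very rough sketch of how the lookup side of that could work
          | (`embed` is a hypothetical stand-in for any sentence-embedding
          | model):
          | 
          |     import numpy as np
          | 
          |     def embed(text):
          |         raise NotImplementedError  # any embedding model works
          | 
          |     def build_index(docs):
          |         # Embed the documentation once, at install time.
          |         return np.stack([embed(d) for d in docs])
          | 
          |     def retrieve(query, docs, doc_vecs, top_k=1):
          |         q = embed(query)
          |         scores = doc_vecs @ q / (
          |             np.linalg.norm(doc_vecs, axis=1)
          |             * np.linalg.norm(q))
          |         return [docs[i] for i in np.argsort(-scores)[:top_k]]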
        
           | mark_l_watson wrote:
           | Apple's CoreML is a large collection of regular models, deep
           | learning models, etc. that are easy to use in
           | macOS/iOS/iPadOS apps.
        
       | peter_l_downs wrote:
        | spaCy [0] is a state-of-the-art, easy-to-use NLP library from the
        | pre-LLM era. This post is the spaCy founder's thoughts on how to
        | integrate LLMs with the kinds of problems that "traditional" NLP
       | is used for right now. It's an advertisement for Prodigy [1],
       | their paid tool for using LLMs to assist data labeling. That
       | said, I think I largely agree with the premise, and it's worth
       | reading the entire post.
       | 
       | The steps described in "LLM pragmatism" are basically what I see
       | my data science friends doing -- it's hard to justify the cost
       | (money and latency) in using LLMs directly for all tasks, and
       | even if you want to you'll need a baseline model to compare
       | against, so why not use LLMs for dataset creation or augmentation
       | in order to train a classic supervised model?
       | 
       | [0] https://spacy.io/
       | 
       | [1] https://prodi.gy/
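        | 
        | That last step looks roughly like this (`llm_label` is a
        | hypothetical stand-in for whatever LLM call you use to label
        | examples):
        | 
        |     from sklearn.feature_extraction.text import TfidfVectorizer
        |     from sklearn.linear_model import LogisticRegression
        |     from sklearn.pipeline import make_pipeline
        | 
        |     def llm_label(text):
        |         raise NotImplementedError  # ask an LLM for a label
        | 
        |     def train_cheap_model(texts):
        |         # LLM-generated labels train a fast supervised model
        |         # that's cheap enough to run on every request.
        |         labels = [llm_label(t) for t in texts]
        |         model = make_pipeline(TfidfVectorizer(),
        |                               LogisticRegression())
        |         model.fit(texts, labels)
        |         return model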
        
         | og_kalu wrote:
         | >what I see my data science friends doing -- it's hard to
         | justify the cost (money and latency) in using LLMs directly for
         | all tasks, and even if you want to you'll need a baseline model
         | to compare against, so why not use LLMs for dataset creation or
         | augmentation in order to train a classic supervised model?
         | 
          | The NLP infrastructure and pipelines we have today aren't there
          | because they are necessarily the best way to handle the tasks
          | you want. They're in place because computers simply could not
          | understand text the way we would like, so shortcuts and
          | approximations were necessary.
          | 
          | Borrowing from the blog: since you could not simply ask the
          | computer, "How many paragraphs in this review say something bad
          | about the acting? Which actors do they frequently mention?",
          | you needed separate processes like tagging names, linking them
          | to a knowledge base, paragraph-level actor sentiment, and so
          | on.
         | 
         | The approximations are cool and they do work rather well for
         | some use cases but they fall apart in many others.
         | 
          | This is why automated resume filtering, moderation, etc. are
          | still awful with the old techniques. You simply can't do what is
         | suggested above and get the same utility.
        
       ___________________________________________________________________
       (page generated 2023-09-13 23:00 UTC)