[HN Gopher] Could you train a ChatGPT-beating model for $85k and... ___________________________________________________________________ Could you train a ChatGPT-beating model for $85k and run it in a browser? Author : sirteno Score : 297 points Date : 2023-03-31 18:21 UTC (4 hours ago) (HTM) web link (simonwillison.net) (TXT) w3m dump (simonwillison.net) | nwoli wrote: | What we need is a RETRO style model where basically after the | input you go through a small net that just fetches a desired set | of weights from a server (serving data without compute is dirt | cheap) and is then executed locally. We'll get there eventually | tinco wrote: | Can anyone explain or link some resource on why these big GPT | models all don't incorporate any RETRO style? I'm only very | superficially following ML developments and I was so hyped by | RETRO and then none of the modern world changing models apply | it. | nwoli wrote: | OpenAI might very well be using that internally, who knows how | they implement things. Also Emad retweeted a RETRO-related | thing a bit back so they might very well be using that for | their awaited LM, here's hoping | ushakov wrote: | Now imagine loading 3.9 GB each time you want to interact with a | webpage | KMnO4 wrote: | Yeah, I've used Jira. | neilellis wrote: | :-) | sroussey wrote: | 10yrs from now models will be in the OS. Maybe even in silicon. | No downloads required. | swader999 wrote: | The OS will be in the cloud interfacing into our brain by | then. I don't want this btw. | pessimizer wrote: | Not in mine. I don't even want redhat's bullshit in there. | I'm not installing some black box into my OS that was | programmed with _motives_ that can't be extracted from the | model at rest. | sroussey wrote: | iOS already has this to a degree, for a couple of years. | brrrrrm wrote: | The WebGPU demo mentioned in this post is insane. Blows any WASM | approach out of the water. Unfortunately that performance is not | supported anywhere but Chrome Canary (behind a flag) | raphlinus wrote: | This will be changing soon. I believe Chrome M113 is scheduled | to ship to stable on May 2, and will support WebGPU 1.0. I | agree it's a game-changing technology. | ChumpGPT wrote: | [dead] | agnokapathetic wrote: | > My friends at Replicate told me that a simple rule of thumb for | A100 cloud costs is $1/hour. | | AWS charges $32/hr for an 8xA100 (p4d.24xlarge) which comes out | to $4/hour/GPU. Yes you can get lower pricing with a 3 year | reservation but that's not what this question is asking. | | You also need 256 nodes to be colocated on the same fabric -- | which AWS will do for you but only if you reserve for years. | pavelstoev wrote: | Depending on the model, you can train on lesser (cheaper) GPUs but | system-level optimizations are needed. Which is what we provide | at centml.ai | sebzim4500 wrote: | Maybe they are using spot instances? $1/hr is about right for | those. | thewataccount wrote: | AWS certainly isn't the cheapest for this, did they mention | using AWS? Lambda Labs is $12/hr for 8xA100s, and there are | others relatively close to this price on demand, I assume you | can get a better deal if you contact them for a large project. | | Replicate themselves rent out GPU time so I assume they would | definitely know as that's almost certainly the core of their | business. | IanCal wrote: | Lambda Labs charges about $11-12/hr for 8xA100. | robmsmt wrote: | and is completely at capacity | IanCal wrote: | But reflects an upper bound at the cost of running A100s.
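The spread between these quotes is easier to see with a quick back-of-the-envelope calculation. A minimal sketch in Python, assuming the roughly 82,432 A100-hours that the article borrows from the LLaMA paper for a 7B-scale run; the hourly rates are the ones quoted in this thread, not an official price list:

    # Rough training cost at the per-GPU-hour rates quoted in the thread.
    # The 82,432 A100-hours figure is the LLaMA-7B estimate the article uses;
    # a real bill would add storage, networking and failed-run overhead.
    GPU_HOURS_7B = 82_432

    rates_per_gpu_hour = {
        "$1/hr rule of thumb": 1.00,
        "Lambda-style on-demand ($12/hr per 8xA100)": 12.00 / 8,
        "AWS p4d on-demand ($32/hr per 8xA100)": 32.00 / 8,
    }

    for label, rate in rates_per_gpu_hour.items():
        print(f"{label}: ${GPU_HOURS_7B * rate:,.0f}")

    # $1/hr rule of thumb: $82,432
    # Lambda-style on-demand ($12/hr per 8xA100): $123,648
    # AWS p4d on-demand ($32/hr per 8xA100): $329,728

At the rule-of-thumb rate the total lands right at the ~$82k figure the post is built around; at AWS on-demand rates the same run costs roughly four times as much.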
| celestialcheese wrote: | lambdalabs will let you do on-demand 8xA100 @ 80GB VRAM/GPU for | $12/hr, or reserved @ $10.86/hr | | 8xA100 @ 40GB for $8/hr | | Replicate friend isn't far off. | pavelstoev wrote: | Training a ChatGPT-beating model for much less than $85,000 is | entirely feasible. At CentML, we're actively working on model | training and inference optimization without affecting accuracy, | which can help reduce costs and make such ambitious projects | realistic. By maximizing (>90%) GPU and platform hardware | utilization, we aim to bring down the expenses associated with | large-scale models, making them more accessible for various | applications. Additionally, our solutions have a positive | environmental impact, addressing the excess CO2 concerns. If | you're interested in learning more about how we are doing it, | please reach out via our website: https://centml.ai | astlouis44 wrote: | WebGPU is going to be a major component in this. Modern GPUs | prevalent in mobile devices, desktops and laptops are more than | enough to do all of this client side. | nope96 wrote: | I remember watching one of the final episodes of Connections 3: | With James Burke, and he casually said we'd have personal | assistants that we could talk to (in our PDAs). That was 1997 and | I knew enough about computers to think he was being overly | optimistic about the speed of progress. Not in our lifetimes. | Guess I was wrong! | TMWNN wrote: | Hey, that means it can be turned into an Electron app! | breck wrote: | Just want to say SimonW has become one of my favorite writers | covering the AI revolution. Always fun thought experiments with | linked code and very constructive for people thinking about how | to make this stuff more accessible to the masses. | fswd wrote: | There is somebody finetuning 160m rwkv4 on alpaca on the rwkv | discord, I am out of the office and can't link but the person | posted in the prompt showcase channel | buzzier wrote: | RWKV-v4 Web Demo (169m/430m params) | https://josephrocca.github.io/rwkv-v4-web/demo/ | skybrian wrote: | I wonder why anyone would want to run it in a browser, other than | to show it could be done? It's not like the extra latency would | matter, since these things are slow. | | Running it on a server you control makes more sense. You can pick | appropriate hardware for running the AI. Then access it from any | browser you like, including from your phone, and switch devices | whenever you like. It won't use up all the CPU/GPU on a portable | device and run down your battery. | | If you want to run the server at home, maybe use something like | Tailscale? | simonw wrote: | The browser thing is definitely more for show than anything | else - I used it to help demonstrate quite how surprisingly | lightweight these models can be. | GartzenDeHaes wrote: | It's interesting to me that LLaMA-nB models still produce reasonable | results after 4-bit quantization of the 32-bit weights. Does this | indicate some possibility of reducing the compute required for | training? | lxe wrote: | Keep in mind that image transformer models like Stable Diffusion | are generally smaller than language models, so they are easier to | fit in wasm space. | | Also, you can finetune llama-7b on a 3090 for about $3 using | LoRA. | bitL wrote: | Only for images. People want to generate videos next and those | models will be likely GPT-sized. | Metus wrote: | There is a video model making the rounds on | /r/stablediffusion and it is just a tiny bit larger than | Stable Diffusion.
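For context on the LoRA fine-tuning lxe mentions a few comments up: the trick is to freeze the base weights and train only small low-rank adapter matrices, which is why a single consumer GPU can be enough. A minimal sketch using the Hugging Face peft library; the model path and hyperparameters here are placeholder assumptions, not a tested recipe:

    # Minimal sketch of LoRA fine-tuning: keep LLaMA-7B frozen and train
    # small adapter matrices on top. Model path and settings are placeholders.
    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    BASE_MODEL = "path/to/llama-7b-hf"  # placeholder: a local copy of the weights

    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)

    lora_config = LoraConfig(
        r=8,                                  # rank of the low-rank adapter matrices
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections only
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # Typically well under 1% of the 7B weights end up trainable, which is why
    # a single 3090 and a few dollars of compute can cover an Alpaca-style run.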
| isoprophlex wrote: | You're not kidding! it's far from perfect, but pretty funny | still... | | https://www.reddit.com/r/StableDiffusion/comments/126xsxu/n | i... | | Too bad SD learned the Shutterstock watermark so well, lol | bitL wrote: | It's cool though not very stable in details over temporal | axis. | danielbln wrote: | Generative image models don't use transformers, they're | diffusion models. LLMs are transformers. | lxe wrote: | Ah yes that's right. Well they technically do use a visual | transformer for CLIP text encoder as I understand. | jedberg wrote: | With the explosion of LLMs and people figuring out ways to | train/use them relatively cheaply, unique data sets will become | that much more valuable, and will be the key differentiator | between LLMs. | | Interestingly, it seems like companies that run chat programs | where they can read the chats are best suited to building "human | conversation" LLMs, but someone who manages large text datasets | for others are in the perfect place to "win" the LLM battle. | captaincrowbar wrote: | The big problem with AI R&D is that nobody can keep up with the | big bux companies. It makes this kind of project a bit pointless. | Even if you can run a GPT3-equivalent on a web browser, how many | people are going to bother (except as a stunt) when GPT4 is | available? | simonw wrote: | An increasingly common complaint I'm hearing about GPT3/4/etc | is people who don't want to pass any of their private data to | another company. | | Running models locally is by far the most promising solution | for that concern. | adeon wrote: | The ones that can't use the GPT4 for whatever reason. Maybe you | are a company and you don't want to send OpenAI your prompts. | Or a person who has very private prompts and feel sketchy about | sending them over. | | Or maybe you are an individual who has a use case that's too | edgy for OpenAI or a silicon valley corporate image. When | Replika shut down people trying to have virtual | boyfriend/girlfriends on their platform, their reddit filled up | with people who mourned like they just lost a partner. | | I think it's important that alternative non-big bux company | options exist, even if most people don't want to or need to use | them. | moffkalast wrote: | Or maybe you're in Italy and OpenAI had just been banned from | the country for not adhering to GDPR. I suspect the rest of | the EU may follow soon. | psychphysic wrote: | Those are seriously niche use cases. They exist but can they | fund gpt5 level development? | r00fus wrote: | Most corporations/governments would prefer to keep their AI | conversations private. Definitely mainstream desire, not | niche. | psychphysic wrote: | Who does your government and corporate email? In the UK | it's all either Gmail (for government) and Outlook (NHS). | For compliance reasons they simply want data center | certification and location restrictions. | | If you think a small corp is going to get a big gov | contract outside of a nepo-state you're in for a shock. | adeon wrote: | Given the Replika debacle, I personally suspect the AI | partner use case is not really very niche. Just few people | openly want to talk about wanting it because having an | emotional AI partner is seen as creepy. | | And companies would not want to do that. Imagine you make | partner AI that goes unhinged like Bing did and tells you | to kill yourself or something similar. I can't imagine | companies would want that kind of risk. 
| [deleted] | psychphysic wrote: | If your AI partner data can't be stored in an Azure or | similar data centre you are a seriously small niche person! | | Even Jennifer Lawrence stored her nudes on iCloud. | make3 wrote: | Alpaca uses knowledge distillation (it's trained on outputs from | OpenAI models). It's something to keep in mind. You're teaching | your model to copy another model's outputs. | thewataccount wrote: | > You're teaching your model to copy another model's outputs. | | Which itself was trained on human outputs to do the same thing. | | Very soon it will be full Ouroboros as humans use the model's | output to finetune themselves. | visarga wrote: | > You're teaching your model to copy another model's outputs. | | That's a time honoured tradition in ML, invented by the father | of the field himself, Geoffrey Hinton, in 2015. | | > Distilling the Knowledge in a Neural Network | | https://arxiv.org/abs/1503.02531 | thih9 wrote: | > as opposed to OpenAI's continuing practice of not revealing the | sources of their training data. | | Looks like that choice makes it more difficult to adopt, trust, | or collaborate on the new tech. | | What are the benefits? Is there more to that than competitive | advantage? If not, ClosedAI sounds more accurate. | holloworld wrote: | [dead] | whalesalad wrote: | Are there any training/ownership models like Folding@Home? People | could donate idle GPU resources in exchange for access to the | data, and perhaps ownership. Then instead of someone needing to | pony up $85k to train a model, a thousand people can train a | fraction of the model on their consumer GPU and pool the results, | reap the collective rewards. | dekhn wrote: | A few people have built frameworks to do this. | | There is still a very large open problem in how to federate | large numbers of loosely coupled computers to speed up training | "interesting" models. I've worked in both domains (protein | folding via Folding@Home/protein folding using supercomputers, | and ML training on single nodes/ML training on supercomputers) | and at least so far, ML hasn't really been a good match for | embarrassingly parallel compute. Even in protein folding, | Folding@Home has a number of limitations that are much better | addressed on supercomputers (for example: if your problem | requires making extremely long individual simulations of large | proteins). | | All that could change, but I think for the time being, | interesting/big models need to be trained on tightly coupled | GPUs. | whalesalad wrote: | Probably going to mirror the transition from single-threaded | to multi-threaded compute. Took a while until application | architectures took hold of the populace to utilize multi-core. | PaulDavisThe1st wrote: | Probably not. Multicore has been a thing for 30 years (We | had a 32 core Sequent Systems and a 64 core KSR-1 at UW | CS&E in the early 1990s). Everything about these models has | been developed in a multicore computing context, and thus | far, it still isn't massively-parallel-distributable. An | algorithm can be massively parallel without being sensibly | distributable. Changing the latency between compute nodes is | not always a neutral or even just linear decrease in | performance. | itissid wrote: | And you can rule out most of the monte carlo stuff too. Which | rules out parallelization of modern statistical frameworks like | STAN used for explainable models; things like finance | modeling of risk, which is a sampling of posteriors using MCMC, | also can't be parallelized.
| MontyCarloHall wrote: | Assuming the chains can reach an equilibrium point (i.e. | burn in) quickly, M samples from an MCMC can be | parallelized by running N chains in parallel each for M/N | iterations. You still end up with M total samples from your | target distribution. | | You're only out of luck if each iteration is too compute | intense to fit on one worker node, even if each iteration | might be embarrassingly parallelizable, since the overhead | of having to aggregate computations across workers at every | iteration would be too high. | neoromantique wrote: | How long until somebody creates a crypto project on that? | buildbuildbuild wrote: | Bittensor is one, not an endorsement. chat.bittensor.com | ellisv wrote: | That'd be cool but I don't think most idle consumer GPUs | (6-8GB) would have large enough memory for a single iteration | (batch size 1) of modern LLMs. | | But I'd love to see more federated/distributed learning | platforms. | whalesalad wrote: | Is it possible to break the model apart? Or does the entire | thing need to be architected from the get-go such that an | individual GPU can own a portion end to end? | mirekrusin wrote: | 6GB can store 3 billion parameters, gpt3.5 has 175 billion | parameters. | mirekrusin wrote: | Unfortunately training is not an embarrassingly parallelisable [0] | problem. It would require a new architecture. Current models | diverge too fast. By the time you'd download and/or calculate | your contribution the model would descend somewhere else and | your delta would not be applicable - based off the wrong initial | state. | | It would be great if merge-ability existed. It would also | likely apply to efficient/optimal shrinking of models. | | Maybe you could dispatch tasks to train on many variations of | similar tasks and take the average of the results? It could probably | help in some way, but you'd still have a large serialized | pipeline to munch through and you'd likely require some serious | hardware, i.e. dual RTX 4090s on the client side. | | [0] https://en.wikipedia.org/wiki/Embarrassingly_parallel | amitport wrote: | hmmm... seems like you're reinventing distributed learning. | | merge-ability does exist and you can average the results. | mirekrusin wrote: | You can if you have the same base weights. | | If you have similar variants of the same task you can | accelerate it more where the diff is. | | You can't average past results computed from historic | base weights - it's a linear process. | | If you could do that, you'd just map training examples to | diffs and merge them all. | | Or take two distinct models and merge them to have a model | that is roughly the sum of them. You can't do it, it's not a | linear process. | _trampeltier wrote: | Start a Boinc project. | | https://boinc.berkeley.edu/projects.php | spyder wrote: | Learning@Home using Decentralized Mixture-of-Expert models: | | https://learning-at-home.github.io/ | | https://training-transformers-together.github.io/ | | https://arxiv.org/abs/2002.04013 | ftxbro wrote: | Yes there is petals/bloom https://github.com/bigscience-workshop/petals but it's not so great. Maybe it will improve or | a better one will come. | whalesalad wrote: | Really interesting live monitor of the network: | http://health.petals.ml | polishdude20 wrote: | I wonder how they handle illegal content. Like, if you're | running training data on your computer, what's to stop | someone else's illegal data from being uploaded to | your computer as part of training?
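A toy illustration of the coupling problem described above: in ordinary data-parallel training, every worker computes its gradient against the same current weights and the averaged gradient is applied before anyone takes another step, which is exactly the synchronization a Folding@Home-style pool can't provide. A minimal NumPy sketch with a linear model and made-up data:

    # Data-parallel gradient averaging, toy version: the shared weights only
    # stay consistent because every step is a synchronized all-reduce.
    import numpy as np

    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0, 0.5, 3.0])
    weights = rng.normal(size=4)                                # shared model state
    data_shards = [rng.normal(size=(16, 4)) for _ in range(3)]  # one shard per worker
    targets = [x @ true_w for x in data_shards]

    def local_gradient(w, x, y):
        """Mean-squared-error gradient for a linear model on one worker's shard."""
        return 2 * x.T @ (x @ w - y) / len(y)

    lr = 0.05
    for step in range(200):
        # Synchronization point: all gradients are computed against the *same*
        # weights, then averaged, before anyone moves on to the next step.
        grads = [local_gradient(weights, x, y) for x, y in zip(data_shards, targets)]
        weights -= lr * np.mean(grads, axis=0)

    print(weights)  # approaches true_w only because every step was synchronized

A gradient computed against last week's weights (the stale "delta" mentioned above) points in the wrong direction once the shared weights have moved on, which is what makes loosely coupled federation hard.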
| riedel wrote: | I read that it is only scoring the model collaboratively but | it allows some fine-tuning I guess. | | Getting the actual gradient descent to parallelize is more | difficult because one needs to average the gradient when | using data/batch parallelism. It becomes more a network speed | than GPU speed problem. Or are LLMs somehow different? | ultrablack wrote: | If you could, you should have done it 6 months ago. | munk-a wrote: | I mean - is there a developer alive that'd be unable to write | the nascent version of Twitter? I think that Twitter as a | business exists entirely because of the concept - the code to | cover the core functionality is absolutely trivial to | replicate. | | I don't think this is a very helpful statement because actually | finding the idea on what to build is the hard part - or even | just believing it's possible. The company I work at has been | using NLP for years now and we have a model that's great at | what we do... but if you asked if we could develop that into a | chatbot as functional as chatgpt two years ago you'd probably | be met with some pretty heavy skepticism. | | Cloning something that has been proven possible is always | easier than taking the risk building the first version with no | real grasp of feasibility. | v4dok wrote: | Can someone at the EU, the only player in this thing with no | strategy yet just pool together enough resources so the open- | source people can train models. We don't ask much, just give | compute power | 0xfaded wrote: | No, that could risk public money benefitting a private party. | | Feel free to form a multinational consortium and submit a grant | application to one of our distribution partners under the | Horizon program though. | | Now, how do you plan to create jobs and reduce CO2? | alecco wrote: | Interesting blog but the extrapolations are way overblown. I | tried one of the 30bn models and it's not even remotely close to | GPT-3. | | Don't get me wrong, this is very interesting and I hope more is | done in the open models. But let's not over-hype by 10x. | lmeyerov wrote: | It seems the quality goes up & cost goes down significantly with | Colossal AI's recent push: | https://medium.com/@yangyou_berkeley/colossalchat-an-open-so... | | Their writeup makes it sounds like, net, 2X+ over Alpaca, and | that's an early run | | The browser side is interesting too. Browser JS VMs have a memory | cap of 1GB, so that may ultimately be the bottleneck here... | SebJansen wrote: | does the 1gb limit extend to wasm? | jesse__ wrote: | WASM is specified to have 32-bit pointers, which is 4GB. | AFAIK browser implementations respect that (when I did some | nominal testing a couple years ago) | lmeyerov wrote: | Interesting, since I looked last year, Chrome has started | raising the caps internally on buffer allocation to potentially | 16GB: | https://chromium.googlesource.com/chromium/src/+/2bf3e35d7a4... | | Last time I tried on a few engines, it was just 1-2GB for typed | arrays, which are essentially the backing structure for this | kind of work. Be interesting to try again.. | | For our product, we actually want to dump 10GB+ on to the WebGL | side, which may or may not get mirrored on the CPU side. Not | sure if additional limits there on the software side. 
And after | that, consumer devices often have another 10GB+ CPU RAM free, | which we'd also like to use for our more limited non-GPU stuff | :) | jesse__ wrote: | I thought the memory limit (in V8 at least) was 2GB due to the | GC not wanting to pass 64 bit pointers around, and using the | high bit of a 32-bit offset for .. something I now forget ..? | | Do you have a source showing a JS runtime with a 1GB limit? | jesse__ wrote: | UPDATE: After a nominal amount of googling around it appears | valid sizes have increased on 64-bit systems to a maximum of | 8GB, and stayed at 2GB on 32-bit systems, for FF at least. I | guess it's probably 'implementation defined' | | https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe... | | https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe... | JasonZ2 wrote: | Does anyone know how the results from a 7B parameter model with | bloomz.cpp (https://github.com/NouamaneTazi/bloomz.cpp) compare | to the 7B parameter Alpaca model with llama.cpp | (https://github.com/ggerganov/llama.cpp)? | | I have the latter working on an M1 MacBook Air with very good | results for what it is. Curious if bloomz.cpp is significantly | better or just about the same. | rspoerri wrote: | So cool it runs on a browser /sarcasm/ i might not even need a | computer. Or the internet while we are at it. | | It either runs locally or it runs on the cloud. Data could come | from both locations as well. So it's mostly technically | irrelevant if it's displaying in a browser or not. | | Except when it comes to usability. I don't get why people love | software running in a browser. I often close important tools i | have not saved when it's in a browser. I can't have offline tools | which work if i am in a tunnel (living in Switzerland this is an | issue). Or it's incompatible because i am running LibreWolf. | | /sorry to be nitpicking on this topic ;-) | ftxbro wrote: | > I don't get why people love software running in a browser. | | If you read the article, part of the argument was for the | sandboxing that the browser provides. | | "Obviously if you're going to give a language model the ability | to execute API calls and evaluate code you need to do it in a | safe environment! Like for example... a web browser, which runs | code from untrusted sources as a matter of habit and has the | most thoroughly tested sandbox mechanism of any piece of | software we've ever created." | rspoerri wrote: | OSX does app sandboxing as well (not everywhere). But yeah, | you're right i only skimmed the content and missed that part. | rspoerri wrote: | Thinking about it... | | I don't know exactly about the browser sandboxing. But isn't | its purpose to prevent access to the local system, while it | mostly leaves access to the internet open? | | Is that really a good way to limit an AI system's API | access? | simonw wrote: | The same-origin policy in browsers defaults to preventing | JavaScript from making API calls out to any domain other | than the one that hosts the page - unless those other | domains have the right CORS headers. | | https://developer.mozilla.org/en-US/docs/Web/Security/Same-o... | sp332 wrote: | Browser software is great because I don't have to build | separate versions for Windows, Mac, and Linux, or deal with app | stores, or figure out how to update old versions.
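One way to ground the memory numbers in this subthread is to work out what the weights alone occupy at different precisions. A minimal sketch in Python; the 7B parameter count and the 4-bit/32-bit endpoints come from comments in this thread, and a real runtime needs extra memory on top for activations and context:

    # Raw weight footprint at different quantization levels, next to the
    # browser-side limits discussed above. Runtime overhead is ignored.
    def weights_gb(n_params: float, bits_per_weight: int) -> float:
        """Size of the raw weights alone, in GiB."""
        return n_params * bits_per_weight / 8 / 1024**3

    for bits in (32, 16, 8, 4):
        print(f"7B params at {bits:>2}-bit: {weights_gb(7e9, bits):5.1f} GB")

    # 7B params at 32-bit:  26.1 GB
    # 7B params at 16-bit:  13.0 GB
    # 7B params at  8-bit:   6.5 GB
    # 7B params at  4-bit:   3.3 GB  <- roughly the ~4 GB downloads discussed here,
    #                                   and just under a 32-bit WASM address space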
| pmoriarty wrote: | There are a bunch of reasons people/companies like web apps: | | 1 - Everyone already has a web browser, so there's no software | to download (or the software is automatically downloaded, | installed and run, if you want to look at it that way... either | way, the experience is a lot easier and more seamless for the | user) | | 2 - The website owner has control of the software, so they can | update it and manage user access as they like, and it's easier | to track users and usage that way | | 3 - There are a ton of web developers out there, so it's easier | to find people to work on your app | | 4 - You ostensibly don't need to rewrite your app for every OS, | but may need to modify it for every supported browser | rspoerri wrote: | Most of these aspects make it better for the company or | developer, only in some cases it makes it easier for the user | in my opinion. Some arguments against it are: | | 1 - Not everyone has or wants fast access to the internet all | the time. | | 2 - I try to prevent access of most of the apps to the | internet. I don't want companies to access my data or even | metadata of my usage. | | 3 - sure, but it doesn't make it better for the user. | | 4 - Also supporting different screen sizes and interaction | types (touch or mouse) can be a big part of the work. | | The most important part for a user is if he/she is only using | the app rarely or once. Not having to install it will make | the difference between using it or not. However with the app | stores most OS's feature today this can change pretty soon | and be equally simple. | | I might be old school on this, but i resent subscription | based apps. For applications that do not need to change, | deliver no additional service or aren't absolutely vital for | me i will never subscribe. And browser based app's are at the | core of this unfortunate development. But that's gone very | far from the original topic :-) | nanidin wrote: | Browser is the true edge compute. | fzliu wrote: | I was a bit skeptical about loading a _4GB_ model at first. Then | I double-checked: Firefox is using about 5GB of memory for me. My | current open tabs are mail, calendar, a couple Google Docs, two | Arxiv papers, two blog posts, two Youtube videos, milvus.io | documentation, and chat.openai.com. | | A lot of applications and developers these days take memory | management for granted, so embedding a 4GB model to significantly | enhance coding and writing capabilities doesn't seem too far- | fetched. | munk-a wrote: | A wonderful thing about software development is that there is so | much reserved space for creativity that we have huge gaps between | costs and value. Whether the average person could do this for 85k | I'm uncertain of - but there is a very significant slice of | people that could do it for well under 85k now that the ground | work has been done. This leads to the hilarious paradox where a | software based business worth millions could be built on top of | code valued around 60k to write. | nico wrote: | > This leads to the hilarious paradox where a software based | business worth millions could be built on top of code valued | around 60k to write. | | Or the fact that software based businesses just took a massive | hit in value overnight and cannot possibly defend such high | valuations anymore. | | The value of companies is quickly going to shift from tech | moats to brands. 
| | Think CocaCola - anyone can create a drink that tastes as good | or better than coke, but it's incredibly hard to compete with | the CocaCola brand. | | Now think what would have happened if CocaCola had been super | expensive to make, and all of a sudden, in a matter of weeks, | it became incredibly cheap. | | This is what happened to the saltpeter industry in 1909 when | synthetic saltpeter was invented. The whole industry was | extinct in a few years. | prerok wrote: | Nit: not to write but to run. The cost of development is not | considered in these calculations. | ftxbro wrote: | His estimate is that you could train a LLaMA-7B scale model for | around $82,432 and then fine-tune it for a total of less than | $85K. But when I saw the fine tuned LLaMA-like models they were | worse in my opinion even than GPT-3. They were like GPT-2.5 or | like that. Not nearly as good as ChatGPT 3.5 and certainly not | ChatGPT-beating. Of course, far enough in the future you could | certainly run one in the browser for $85K or much less, like even | $1 if you go far enough into the future. | icelancer wrote: | Yeah, the constant barrage of "THIS IS AS GOOD AS CHATGPT AND | IS PRIVATE" screeds from LLaMA-based marketing projects are | getting ridiculous. They're not even remotely close to the same | quality. And why would they be? | | I want the best LLMs to be open source too, but I'm not | delusional enough to make insane claims like the hundreds of | GitHub forks out there. | robertlagrant wrote: | > I want the best LLMs to be open source too | | How do you do this without being incredibly wealthy? | nickthegreek wrote: | crowd source to pay for the gpu rentals. | mejutoco wrote: | Pooling resources a la SETI@home would be an interesting | option I would love to see. | simonw wrote: | My understanding is that can work for model inference but | not for model training. | | https://github.com/bigscience-workshop/petals is a | project that does this kind of thing for running | inference - I tried it out in Google Collab and it seemed | to work pretty well. | | Model training is much harder though, because it requires | a HUGE amount of high bandwidth data exchange between the | machines doing the training - way more than is feasible | to send over anything other than a local network | connection. | crdrost wrote: | You (1) are a company who (2) understands the business | domain and has an appropriate business plan. | | Sadly the reality of funding today makes it unlikely that | these two will both be simultaneously satisfied. The | problem is that history will look back on the necessary | business plan and deem it a failure even if it generates a | company that does a billion dollars plus in annual revenue. | | This is actually not unique to large language models but | most innovation around computers. The basic problem is that | if you build a force-multiplier (spreadsheets, personal | computing, large-language models all come to mind) then | what will make it succeed is its versatility: people want a | hammer that can be used for smashing all manner of things, | not just your company's particular brand of matching nails. 
| And most people will only pick up that hammer once per week | or once per month, only like 1% of the economy if that will | be totally revolutionized, "we use this force-multiplier | every day, it is now indispensable, we can't imagine life | without it," and it's never predictable what that sector | will be -- it's going to be like "oh, who ever dreamed that | the killer application for LLMs would be them replacing | AutoCAD at mechanical contractors" or some shit. | | In those strange eons, to wildly succeed, one must give up | on anticipating all usages of the software, one must cease | controlling it and set it free. "Well where's the profit in | that?" -- it is that this company was one of the first | players in the overall market, they got an early chance to | stake out as much territory as possible. But the market | exploded way larger than they could handle and then | everybody looks back on them and says "wow, what a failure, | they only captured 1% of that market, they could have been | so much more successful." Yeah, they captured 1% of a $100B | market, some failure, right? | | But what actually happens is that companies see the | potential, investors get dollar signs in their eyes, | everyone starts to lock down and control these, "you may | use large language models but only in the ways that we say, | through the interfaces which we provide," and then the only | thing that you can use it for is to get generic | conversational advice about your hemorrhoids, so after 5-10 | years the bubble of excitement fizzles out. Nobody ever | dreams to apply it to AutoCAD or whatever, and the world | remains unchanged. | javajosh wrote: | History is littered with great software that died because | no-one used it because the business model was terrible. | Capturing $1B of value is better than 0, and everyone | understands this. And who cares what history thinks | anyway? | | OpenAI has spent a lot of money to get their result. It's | safe to assume it will take a lot of money to get a | similar result, and then to share it (although I assume | bit torrent will be good enough). Once people are running | their models, they can innovate to their hearts content. | It's not clear how or why they'd give money back to the | enabling technology. So how does money flow back to the | innovators in proportion to the value produced, if not a | SaaS? | ftxbro wrote: | what stage of capitalism is this | robertlagrant wrote: | If those are all that's required, why don't you start a | company with a business plan written so it satisfies your | criteria? Then you can lead the way with OSS LLMs. | ftxbro wrote: | Yes a rugged individual would have to be incredibly wealthy | to do it! | | But maybe the governments will make one and maintain it | with taxes as an infrastructure service, like roads, giving | everyone expanded powers of cognition, memory, and | expertise, and raising the consciousnesses of humanity to | new heights. Probably in USA it wouldn't happen if we judge | ourselves only in zero sum relation to others - helping | everyone would be a wash and only waste our money! | szundi wrote: | Some governments probably alread do and use it against | so-called terrorists or enemies of the people... | simonw wrote: | Yeah, you're right. I wrote this a couple of weeks ago at the | height of LLaMA hype, but with further experience I don't think | the GPT-3 comparisons hold weight. 
| | My biggest problem: I haven't managed to get a great | summarization out of a LLaMA derivative that runs on my laptop | yet. Maybe I haven't tried the right model or the right prompt | yet though, but that feels essential to me for a bunch of | different applications. | | I still think a LLaMA/Alpaca fine-tuned for the ReAct pattern | that can execute additional tools would be a VERY interesting | thing to explore. | | [ ReAct: https://til.simonwillison.net/llms/python-react- | pattern ] | avereveard wrote: | my biggest problem with these models is that they cannot | reliably produce structured data. | | even davinci can be used as part of a chain, because you can | direct it to structure and unstructure data, and then extract | the single component and build them into tasks. cohere, llama | et al are currently struggling to consistently produce these | result reliably, even if you can chat with them and frankly | it's not about the chat | | example from a stack overflow that split the questions before | sending it down chain for answering all points individually: | | This is a customer question: | | I'm a beginner RoR programmer who's planning to deploy my app | using Heroku. Word from my other advisor friends says that | Heroku is really easy, good to use. The only problem is that | I still have no idea what Heroku does... | | I've looked at their website and in a nutshell, what Heroku | does is help with scaling but... why does that even matter? | How does Heroku help with: Speed - My | research implied that deploying AWS on the US East Coast | would be the fastest if I am targeting a US/Asia-based | audience. Security - How secure are they? | Scaling - How does it actually work? Cost | efficiency - There's something like a dyno that makes it easy | to scale. How do they fare against their | competitors? For example, Engine Yard and bluebox? | | Please use layman English terms to explain... I'm a beginner | programmer. | | Extract the scenario from the question including a summary of | every detail, list every question, in JSON: | | { "scenario": "A beginner RoR programmer is planning to | deploy their app using Heroku and is seeking advice about | deploying it.", "questions": [ "What does Heroku do?", "How | does deploying AWS on the US East Coast help with speed?", | "How secure is Heroku?", "How does scaling with Heroku | work?", "What is a dyno and why is it cost efficient?", "How | does Heroku compare to its competitors, such as Engine Yard | and Bluebox?" ] } | newhouseb wrote: | Last weekend I built some tooling that you can integrate | with huggingface transformers to force a given model to | _only_ output content that validates against a JSON schema | [1]. | | The challenge is that for it to work cost effectively you | need to be able to append what is basically a final network | layer to the model that is algorithmically designed and | until OpenAI exposes the full logits and/or some way to | modify them on the fly you're going to be stuck with open | source models. I've run things against GPT-2 mostly but | it's only list to try LLaMA. | | [1] "Structural Alignment: Modifying Transformers (like | GPT) to Follow a JSON Schema" @ | https://github.com/newhouseb/clownfish | simonw wrote: | This feels solvable to me. I wonder if you could use fine | tuning against LLaMA to teach it to do this better? | | GPT-3 etc can only do this because they had a LOT of code | included in their training sets. 
| | The LLaMA paper says Github was 4.5% of the training | corpus, so maybe it does have that stuff baked in and just | needs extra tuning or different prompts to tap into that | knowledge. | avereveard wrote: | I have done it trough stages, so first stages emits in | natural language in the format of "context: ... and | question: ...." and then the second stage collect it as | json, but then wait time doubles. | Tepix wrote: | Have you tried bigger models? Llama-65B can indeed compete | with GPT-3 according to various benchmarks. The next thing | would be to get the fine-tuning as good as OpenAI's. | mewpmewp2 wrote: | I wonder how accurate those benchmarks are in terms of | actual problem solving capability. I think there's a major | line at which point LLM becomes actually useful and it | actually feels like you are speaking to something | intelligent and that can be useful for you in terms of | productivity etc. | version_five wrote: | If you have ~100k to spend, aren't there options to buy a gpu | rather than just blow it all on cloud? How much is an 8xA100 | machine? | | 4xA100 is 75k, 8 is 140k https://shop.lambdalabs.com/deep- | learning/servers/hyperplane... | dekhn wrote: | you're comparing the capital cost of acquiring a GPU machine | with the operational cost of renting one in the cloud. | | Ignoring the operational costs of on-prem hardware is pretty | common, but those costs are significant and can greatly change | the calculation. | digitallyfree wrote: | For a single unit one could have it in their home or office, | rather than a datacenter or colo. If the user sets up and | manages the machine themselves there is no additional IT | cost. The greatest operating expense would be the power cost. | dekhn wrote: | "If the user sets up and manages the machine themselves | there is no additional IT cost" << how much do you value | your time? | | In my experience, physical hardware has a management | overhead over cloud resources. Backups, large disk storage | for big models, etc. | pessimizer wrote: | Or from another perspective, comparing the cost of training | one model in the cloud to the cost of training as many as you | want on your machine, then (as mentioned by siblings) selling | the machine for nearly as much as you paid for it, unless | there's some shortage, in which case you'll get more back | than you paid for it. | | One is buying capital that produces models, the other is | buying a single model. | sounds wrote: | Remember to discount the tax depreciation for the hardware | and deduct any potential future gains from either reselling | it or using it. | capableweb wrote: | Heh, you work at AWS or Google Cloud perhaps? ;) (Only joking | about this as I constantly see employees from AWS/GCloud and | other cloud providers claim that cloud is always cheaper than | hosting things yourself) | | Sure, if you're planning to service a large number of users, | building your infrastructure in-house might be a bit | overkill, as you'll need a infrastructure team to service it | as well. | | If you're just want to buy 4 GPUs to put in one server to run | some training yourself, I don't think it's that much | overkill. Especially considering you can recover much of the | cost even after a year by selling much of the equipment you | bought. Most of your losses will be costs for electricity and | internet connection. | throwaway50601 wrote: | Cloud gives you very good price for what they offer - | excellent reliability, hyper-scalability. 
Most people don't | need either and use it as a glorified VPS host. | dekhn wrote: | I used to work for Google Cloud (I built a predecessor to | Preemptible VMs and also launched Google Cloud Genomics). | But even before I worked at Google I was a big fan of AWS | (EC2 and S3). | | Buying and selling hardware isn't free; it comes with its | own cost. I would not want to be in the position of selling | a $100K box of computer equipment- ever. | capableweb wrote: | :) | | True, but some things are harder to sell than others. | A100's in today's market would be easy to sell. Harder to | buy, because the supply is so low unless you're Google or | another big name, but if you're trying to sell them, I'm | sure you can get rid of them quickly. | jcims wrote: | No kidding. I worked for a company that had multiple billions | of dollars invested in a data center refresh in North America | and Europe. | version_five wrote: | For a server farm, sure, for one machine, I don't know. | Assuming it plugs into a normal 15A circuit, and you have a | we-work or something where you don't pay for power, is the | operational cost of one machine really material? | dekhn wrote: | it's hard to tell from what you're saying: you're planning | on putting an ML infrastructure training server on a | regular 15A circuit, not in a data center or machine room? | And power is paid for by somebody else? | | My thinking about pricing doesn't include that option | because I wouldn't just hook a server like that up to a | regular outlet in an office and use it for production work. | If that works for you- you can happily ignore my comments. | But if you go ahead and build such a thing and operate it | for a year, please let us know if there were any costs- | either dollar or in suffering- associated with your | decision | | [edit: adding in that the value of this machine also | suggests it cannot live unattended in an insecure location, | like an office] | | signed, person who used to build closet clusters at | universities | KeplerBoy wrote: | Nvidia happily sells what you're describing. They call it | "DGX Station A100", it has 4 80GB A100 and retails for | 80k. Not sure i believe their claimed noise level of <37 | dB though. | | Of course that's still a very small system when talking | LLM training, the only reason why i would not put that in | a regular office is it's extreme price. Do you really | want something worth 80k in a form factor that could be | casually carried through the door? | amluto wrote: | If you live near an inexpensive datacenter, you can park | it there. Throw in a storage machine or two (TrueNAS MINI | R looks like a credible low-effort option). If your | workload is to run a year long computation on it and | otherwise mostly ignore it, then your operational costs | will be quite low. | | Most people who rent cloud servers are not doing this | type of workload. | modernpink wrote: | You can sell the A100 after once you're done as well. Possibly | even at profit? | girthbrooks wrote: | These are wild pieces of hardware, thanks for linking. I wonder | how loud they get. | sacred_numbers wrote: | If you bought an 8xA100 machine for $140k you would have to run | it continuously for over 10,000 hours (about 14 months) to | train the 7B model. By that time the value of the A100s you | bought would have depreciated substantially; especially because | cloud companies will be renting/selling A100s at a discount as | they bring H100s online. It might still be worth it, but it's | not a home run. 
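A quick check on the arithmetic in the comment above, as a sketch using only numbers quoted in this thread (the $140k 8xA100 server, the ~$12/hr on-demand rate, and the ~82,432 A100-hours training estimate); power, colocation and resale value are deliberately ignored:

    # Buy-vs-rent sanity check with the thread's own figures.
    GPU_HOURS = 82_432          # A100-hours for a LLaMA-7B-scale run
    GPUS_PER_BOX = 8
    BOX_PRICE = 140_000         # 8xA100 server price quoted above
    RENTAL_PER_HOUR = 12.00     # on-demand 8xA100 rate quoted earlier in the thread

    wall_clock_hours = GPU_HOURS / GPUS_PER_BOX
    print(f"wall clock: {wall_clock_hours:,.0f} hours (~{wall_clock_hours / 24 / 30:.0f} months)")
    print(f"rented:     ${wall_clock_hours * RENTAL_PER_HOUR:,.0f}")
    print(f"owned:      ${BOX_PRICE:,} up front, plus power, hosting and depreciation")

    # wall clock: 10,304 hours (~14 months)
    # rented:     $123,648
    # owned:      $140,000 up front, plus power, hosting and depreciation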
| inciampati wrote: | If 8-bit training methods take off, I think the calculus is | going to change rapidly, with newer cards that have decent | amounts of memory and 8-bit acceleration starting to become | dramatically more cost and time effective than the venerable | A100s. ___________________________________________________________________ (page generated 2023-03-31 23:00 UTC)