[HN Gopher] Exponentially faster language modelling ___________________________________________________________________ Exponentially faster language modelling Author : born-jre Score : 162 points Date : 2023-11-21 14:31 UTC (1 days ago) (HTM) web link (arxiv.org) (TXT) w3m dump (arxiv.org) | fgfm wrote: | This approach feels like pruning, but the speedup is considerably | higher. Interestingly, I'm curious how this will play out on more | recent transformer architectures though: I guess the speedup will | be more important for the largest architectures, but even if we | can get 2x or 10x speedup on Mistral/Zephyr, Orca 2 or | OpenChat3.5, that would be a tremendous achievement! | webmaven wrote: | I'm curious as to how applicable this approach might be for | text-to-image models like Stable Diffusion. | ndr wrote: | Abstract: | | > Language models only really need to use an exponential fraction | of their neurons for individual inferences. As proof, we present | UltraFastBERT, a BERT variant that uses 0.3% of its neurons | during inference while performing on par with similar BERT | models. UltraFastBERT selectively engages just 12 out of 4095 | neurons for each layer inference. This is achieved by replacing | feedforward networks with fast feedforward networks (FFFs). While | no truly efficient implementation currently exists to unlock the | full acceleration potential of conditional neural execution, we | provide high-level CPU code achieving 78x speedup over the | optimized baseline feedforward implementation, and a PyTorch | implementation delivering 40x speedup over the equivalent batched | feedforward inference. We publish our training code, benchmarking | setup, and model weights. | | Conclusions | | > We present UltraFastBERT, a modified version of the | (crammed)BERT architecture that uses fast feedforward instead of | feedforward networks in its intermediate layers. UltraFastBERT | serves as proof that large language models only really need to | engage an exponential fraction of their parameters to perform | individual inferences. UltraFastBERT-1x11, our deepest model with | the highest promise of acceleration, uses only 0.3% of its | neurons during inference and already achieves a 78x CPU speedup | over the inference time of the corresponding feedforward layer. | With a theoretical speedup promise of 341x at the scale of BERT- | base models, we hope that our work will inspire an effort to | implement primitives for conditional neural execution as a part | of device programming interfaces. | iNic wrote: | Do I understand correctly that the difficulty of making this | useful is writing code to run this idea on GPUs? | nolist_policy wrote: | As far as I understood it: Forget GPUs, this thing is plenty | fast on CPUs. | | In general, GPUs are bad at branching. The fastest way to | implement it on GPUs is probably to let it calculate both | sides of the branch and then only use the result of the one | that was taken. Which won't be faster than a normal NN. | falcor84 wrote: | I wonder then, if each inference only uses a small part of | the net, could you possibly perform multiple inferences in | the same forward pass? | lawlessone wrote: | > Forget GPUs, this thing is plenty fast on CPUs. | | Does this mean everyone could be running the 100+b models | from ram? | | This opens up a lot , some models could be run very fast on | small machines with this. 
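 | Rough numbers on what "from RAM" would mean (my own back-of-envelope,
 | nothing from the paper):
 |
 |     # Memory needed just to hold the weights of a 100B-parameter
 |     # model at different precisions (ignoring activations and the
 |     # KV cache).
 |     params = 100e9
 |     for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
 |         print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
 |     # fp16: ~200 GB, int8: ~100 GB, int4: ~50 GB -- still a lot of
 |     # RAM, but RAM is far cheaper than the equivalent VRAM.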
| | Bundling a small model inside a game to act as part of the | mind for in-game NPCs (obviously with some tuning) becomes | practical with this. | hobofan wrote: | The bottleneck for "easy integration" into games and | applications right now is as much the RAM usage as the | slowness. This would probably bring the speed to an | acceptable level, but you would still have to hold the | whole model in RAM. | | That would make it a lot more feasible to run models in | the cloud (triple-digit RAM is a lot more abundant than | VRAM), but wouldn't do that much for consumer hardware. | nolist_policy wrote: | I wonder if the model takes similar branches while in the | same context? Then you can fault in parts of the model | from disk as needed. | entropicdrifter wrote: | Interesting idea. Like texture streaming, you'd just | stream in the parts of the model from disk to fill up all | available RAM. If the NPC needed to think about something | not cached in RAM, you'd throw up a "hmm, let me think | about this" while stuff loads from disk. | PaulHoule wrote: | They are getting a 78x speedup w/o hardware support, which is | pretty good: they think they can speed it up another 4x if | they had the right hardware support. So it looks useful now, | with the possibility to get better. | | For as long as I've been involved with neural networks for text | analysis, it's seemed to me that we really should be using | sparse activations, because any particular document only | involves a limited set of concepts. | | For instance, a search engine for patents might be looking at | a patent for adhesive tape, which activates a certain set of | concepts but is not going to activate concepts involved with | bicycle derailleurs or public key cryptography: a sparse | representation reflects this and dense representations don't. | kristianp wrote: | > a PyTorch implementation delivering 40x speedup over the | equivalent batched feedforward inference | | Does this not indicate a 40x speedup on the GPU? | | Edit: looking at the paper, their "Naive CUDA" implementation | also shows a 117x speedup in Table 2. | sdrg822 wrote: | Cool. Important note: | | """ One may ask whether the conditionality introduced by the use | of CMM does not make FFFs incompatible with the processes and | hardware already in place for dense matrix multiplication and | deep learning more broadly. In short, the answer is "No, it does | not, save for some increased caching complexity." """ | | It's hard to beat the hardware lottery! | algo_trader wrote: | In fact, as stated in the paper, this is bad news: | | > We therefore leave the attention layers untouched | | Meaning, presumably, that the GPU memory remains the bottleneck. | | Flops really are quite cheap by now, e.g. vision inference chip | ~$2/teraflop/s !! | marcinzm wrote: | It's a bottleneck for larger models; however, this would presumably | allow for cheaper models at scale or on compute-constrained | devices (like phones). | entropicdrifter wrote: | And potentially for distributing a model across several | devices at inference time. You could devote a cluster of | smaller/weaker machines to inference. | sroussey wrote: | You can do that today; the only advantage today though is | being able to fit the model in memory. It's sequential | and slower due to communication costs, though batching | might be faster? | ashirviskas wrote: | >Flops really are quite cheap by now, e.g. vision inference | chip ~$2/teraflop/s !! | | I'm really interested, can you share where you got these | numbers?
| algo_trader wrote: | Axelera [1] or Hailo [2] give you 100-200 TFLOP/s for ~$200. | | 8-bit ops, inference only, low memory embedded, excluding | the host, implied utilization from FPS specs is ~20% | | But the trend is there. | | There are also newer ADAS/AV units from China which claim | 1000 TFLOP/s and can't really cost more than $1000/$2000 per | car. | | These are all tiled designs (see also dojo/tesla), heavily | over-weighted on flops vs memory | | [1] https://www.axelera.ai/ | | [2] https://hailo.ai/ | Y_Y wrote: | You can't get flops on a Hailo-8, they're fixed-point | only. As much as these specialised inference chips are | cool, we're a long way from just being able to drop them | in where a GPU was. Not to mention the memory is hugely | constrained. The Hailo chips I've worked with were all | limited to 20MiB for the weights, which is a squeeze even | at 4-bit. | YetAnotherNick wrote: | > ~$2/teraflop/s | | H100 is basically ~$2/(2000 tflops/s)/hour or $1 for 4*10^18 | floating point operations. | theGnuMe wrote: | There's another paper replacing attention with FF networks, so | just combine the two and you've got something. | gdoug wrote: | Link? Sounds like a good read! :) | smeeth wrote: | Not OP but might be this: | https://arxiv.org/pdf/2311.10642.pdf | Klaster_1 wrote: | What are the potential consequences? Does this open doors to | faster edge inference or improved capabilities? | yvdriess wrote: | Both. Cheaper CPU-based inference; GPUs are not as competitive | for sparse linear algebra. This could lead to much larger | models, as you only touch a small portion of the matrix during | inference. However, the training here is still dense-LA on a | GPU, so you still blow up the compute cost when increasing | model size. | WithinReason wrote: | Note this doesn't speed up training | swalsh wrote: | Has anyone used SIMD instructions to try and speed up CPU | inference? | hobofan wrote: | Most inference builds on top of BLAS libraries, which in | their implementation take advantage of SIMD. | singhrac wrote: | A lot of CPU inference libraries (llama.cpp included) use | as much SIMD as possible, sometimes by hand-writing loops. | The one I hack on, llama.rs, uses portable_simd but | specializes to your CPU at compile time. | | My experience has been that most CPU inference is actually | not compute limited, but memory bandwidth limited, since | most weights are used for a few operations per token (how | quickly can you load and unload the entire 70 GB of weights | into your registers?). It's not quite that bad, but I found | most vectorization changes didn't meaningfully change | performance. | anonymousDan wrote: | Would you say that is the state-of-the-art CPU inference | library? | valine wrote: | GPU utilization should be down when using this technique. I'm | hoping this could allow for more efficient batch inference on | GPUs. If you can predict 10 tokens for the price of 1, it | should allow you to do tree of thought much more efficiently. | | https://github.com/princeton-nlp/tree-of-thought-llm | hobofan wrote: | I think with that magnitude of a speed improvement it should | become feasible to do just-in-time embedding creation for | semantic search for much larger documents. | millisecond wrote: | Could this be applied to other models like Llama2 or Mistral? | andy99 wrote: | Just from the abstract I don't see why not; it's just replacing | the feed forward network that's part of all of these models | with a very sparse one.
The bigger problem is you seemingly | have to retrain the model, so you couldn't just drop in llama2 | weights from meta and have it work. Which makes it much more | limiting. Something that used existing weights would be a lot | more practical (like quantization for example). For BERT, I can | see this being useful if you had to make a really fast | embedding model. There was a discussion about a fast embedding | use case not long ago | https://news.ycombinator.com/item?id=37898001 | MinusGix wrote: | It certainly could, and I wouldn't be surprised if the authors | want to try it out on those. You do have issues of past | improvements often not quite enhancing more powerful models | nearly as much. I'd expect this to possibly not work as well, | something like the bigger models ending up with more | polysemantic neurons because they're given more ''incentive'' | (training time, neuron count, dataset size which they're | encouraged to be able to reconstruct) to extract as much as | possible. This might make it so the method performs worse due to | this intermingling. (See the transformer circuits website for | that) (Though I expect there's ways to recover a good chunk of | extra lost throughput/accuracy, maybe by doing extra steps to | directly steer the training towards breaking apart polysemantic | neurons) | lopuhin wrote: | There are two issues here -- for one, in big transformers, more | compute is in the attention layers, while this work improves | only feed-forward layers, which are more important for smaller | models and smaller sequence lengths. Second, in many typical | scenarios LLM inference is memory bandwidth bound; I'm not sure | if it's possible to utilize their approach to reduce required | memory bandwidth. | joelthelion wrote: | Doesn't reducing the number of neurons drastically reduce | memory requirements? | lopuhin wrote: | Yes, it might. "Reduction of number of neurons" is not | static here, unlike traditional pruning approaches: here | they still keep all weights, but the network dynamically | selects which sub-portion of them to use. There is a | related discussion of this in section 3.2 (page 4), but | I don't think they mention actual memory bandwidth | requirements/wins of their implementation, and probably | there can be different tradeoffs for different devices. | OneOffAsk wrote: | Is this similar to what iOS 17 uses for its new autocomplete? | WithinReason wrote: | Link to previous paper: | | https://arxiv.org/abs/2308.14711 | | An attempt at a summary: They use a sigmoid function to make | differentiable "soft" branches, and stack them to construct a | binary tree, with the goal of only taking one branch at inference | time (but training the whole tree), leading to log(W) instead of W | inference cost. They gradually harden the branches so they become | hard branches at the end of training. | | A branch is computed as _branch(input, N)_ , with a neural | network N computing a scalar _c=N(input)_ , then using a sigmoid | to do a soft branch by returning the weighted sum of the | recursive calls _s(c)*branch(input, N_left) + (1-s(c)) * | branch(input, N_right)_ (the two weights _s(c)_ and _1-s(c)_ sum | to 1). They only do "proper processing" using the leaf nodes. | | Then they add a new loss term that encourages hard decisions by | minimising the entropy of the Bernoulli distribution, making the | 2 weights converge to 0 and 1, at which point only one branch | needs to be taken at inference.
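 | A rough PyTorch sketch of how I read that description (toy code,
 | not the authors' implementation; the module and variable names
 | are made up):
 |
 |     import torch
 |     import torch.nn as nn
 |
 |     class SoftBranchFFF(nn.Module):
 |         # Toy fast-feedforward layer: a soft binary tree of depth
 |         # `depth`. Each internal node computes a scalar c = N(x) and
 |         # mixes its two subtrees with weights s(c) and 1 - s(c);
 |         # only the leaves do the "proper processing".
 |         def __init__(self, dim, leaf_hidden, depth):
 |             super().__init__()
 |             self.depth = depth
 |             self.routers = nn.ModuleList(
 |                 [nn.Linear(dim, 1) for _ in range(2 ** depth - 1)])
 |             self.leaves = nn.ModuleList(
 |                 [nn.Sequential(nn.Linear(dim, leaf_hidden), nn.GELU(),
 |                                nn.Linear(leaf_hidden, dim))
 |                  for _ in range(2 ** depth)])
 |             self.entropies = None  # filled in during forward()
 |
 |         def forward(self, x):
 |             ents = []
 |
 |             def branch(x, node, level):
 |                 if level == self.depth:  # leaf node: do the real work
 |                     return self.leaves[node - (2 ** self.depth - 1)](x)
 |                 s = torch.sigmoid(self.routers[node](x))  # soft weight in (0, 1)
 |                 # entropy of the Bernoulli(s) decision; minimising it in the
 |                 # loss pushes s towards 0 or 1, i.e. hardens the branch
 |                 ents.append(-(s * torch.log(s + 1e-9)
 |                               + (1 - s) * torch.log(1 - s + 1e-9)).mean())
 |                 left, right = 2 * node + 1, 2 * node + 2
 |                 return (s * branch(x, left, level + 1)
 |                         + (1 - s) * branch(x, right, level + 1))
 |
 |             out = branch(x, 0, 0)
 |             self.entropies = torch.stack(ents)
 |             return out
 |
 |     fff = SoftBranchFFF(dim=64, leaf_hidden=128, depth=3)
 |     y = fff(torch.randn(8, 64))           # training touches the whole tree
 |     hardening_loss = fff.entropies.sum()  # add lambda * this to the task loss
 |     # Once the weights have hardened to 0/1, inference can follow a single
 |     # root-to-leaf path: `depth` routers plus one leaf, i.e. log(W) work.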
They also state that this | hardening often happens automatically though. | | It's a simple idea but the loss formulation is nice, you usually | want your loss terms to be a measure of information. | WithinReason wrote: | Also, this didn't come from OpenAI or DeepMind, or even | industry. What are those guys even doing? :) | mmaunder wrote: | Many labs doing foundational work like this and making | progress don't have the anything near the budget or compute | to implement at scale. In other words they don't have a Sam | and his backers or a Zuck and his budget. | Micoloth wrote: | They sure as hell have no incentives to make Neural Network | faster and more accessible, for starters.. | | (Considering they right now make more money and have more | control, the less accessible and the more computation-hungry | AI models are) | | To be fair, this approach (claims to) only speed up | inference, not training, so all the GPUs are needed anyway. | WithinReason wrote: | They certainly have an incentive to keep these kinds of | improvements in-house and not publish them, since they are | commercial entities and this represents a competitive | advantage. | lawlessone wrote: | I think Nvidia might have an incentive for this not to | exist. | | edit: but you are right for the AI companies not open | sourcing their models it's an advantage to have it when | others don't | WithinReason wrote: | I'm actually not sure about Nvidia, due to | https://en.wikipedia.org/wiki/Jevons_paradox | ForkMeOnTinder wrote: | But if things get too efficient for individual users, you | won't need an Nvidia GPU anymore. People will use cheaper | hardware instead. I'm looking forward to running good | models at decent speed on a low-end CPU or whatever | crappy GPU is in my phone. | jacobsimon wrote: | I had the same thought this morning and was debating | selling my nvda stock when I saw this - feels like they | are well-positioned right now, as with crypto a few years | ago, but if there were an efficiency breakthrough that | allowed commodity CPUs to do the inference instead, this | advantage could vanish quickly. | godelski wrote: | Nvidia can't make GPUs fast enough. I doubt 10xing | training and/or inference efficiency would result in a | decrease in demand. I would be surprised if it didn't | instead increase demand. Mind you, Nvidia is pushing hard | on TensorRT which optimizes models at inference time and | results in major increases in throughput (not 10x though | lol). | rictic wrote: | Yeah, Jevons Paradox suggests that 10xing efficiency of | training and inference would increase demand for GPUs. | godelski wrote: | I wouldn't be so quick to conspiracy. I'm the author of a | work and a famous blog post that trains a particular common | architecture much faster (don't want to dox myself too | much) and with far fewer parameters, but it has been | rejected several times and is now arxiv only. Our most | common complaint was "who would use this? Why not just take | a large model and tune it?" That question alone held us | back a year (had over a hundred citations by then and | remains my most cited work) until it switched to "use more | datasets" and "not novel" (by that time true, others had | built off of us, cited us, and published in top venues). | | I don't think this was some conspiracy by big labs to push | back against us (we're nobodies) but rather that people get | caught up in hype and reviewers are lazy and incentivized | to reject. 
You're trained to be critical of works and | especially consider that post hoc most solutions appear far | simpler than they actually are. But context matters because | if you don't approach every paper with nuance it's easy to | say "oh, it's just x." But if those ideas were so simple | and obvious they would also be prolific. I see a lot of | small labs suffer the same fate simply due to lack of | compute. If you don't make your new technique work on many | datasets it becomes the easiest thing to reject a paper by. | ACs aren't checking that reviews are reasonable. I've even | argued with fellow reviewers about papers in workshops -- | papers I would have accepted in the main conference -- that | are brushed off and the reviewers admit in their reviews | that they do not work on these topics. I don't understand | what's going on but at times it feels like a collective | madness. A 10 page paper with 4 very different datasets | that solves a problem, is clearly written, has no major | flaws, and is useful to the community should not need | defending when submitted to a workshop just because | reviewers aren't qualified to review the work (this paper | got in btw). We are moving into a "pay to play" ecosystem | and that will only create bad science due to group think. | (another aspect of "pay to play" is in the tuning. Spending | $1M to tune your model to be the best doesn't mean it is | better than a model that could not afford the search. Often | more than half of resources are spent on tuning now) | wruza wrote: | Is there a place where you guys discuss... things? I'm | layman interested in this topic akin to pop- | physics/maths, but have no chance to just read papers and | "get it". On the other hand, immediately available | resources focus more on how-to part of it rather than on | what's up overall. Also, do you have something like | 3b1b/pbs/nph for it? Content that _you_ can watch and say | "well, yep, good job". | airgapstopgap wrote: | ...ETH Zurich is an illustrious research university that | often cooperates with Deepmind and other hyped groups, | they're right there at the frontier too, and have been for a | very long time. They don't have massive training runs on | their own but pound for pound I'd say they have better | papers. | godelski wrote: | ETH Zurich is one of the top labs in the world. Disney | Research also works with them a lot. Another "sleeper" is | University of Amsterdam that has rockstars like Max Welling | and his students Kingma, Salimans,van den Berg, and | Hoogeboom. | | It's easy to get hyped up on the big tech labs because they | have the most compute, but the best papers come from | smaller labs and unfortunately more lately face larger | challenges in getting published. It's the smaller works | that create the foundations that end up in these giant | models. ML is in a really weird space right now. | pr337h4m wrote: | This from DeepMind: | | DiLoCo: Distributed Low-Communication Training of Language | Models - https://arxiv.org/pdf/2311.08105.pdf | | From the first author on Twitter: "It could quite a big deal | for people who don't have access to a colocated cluster of | GPUs: | | e.g. with DiLoCo you could train your model, with data- | parallelism, across all GPU providers, looking in real-time | for the cheapest price, even if pre-emptable, even across | continents" | | https://twitter.com/Ar_Douillard/status/1724839420820361687 | quickthrower2 wrote: | It is not surprising. The assumption is that they have the | best people. 
That you can objectively search 8 billion people | for the best people globally is folly of course. There are | geniuses without US citizenship / visas / green cards. And so | outside brains are going to figure this out. Mix in that the GDP of | $rest_of_world has much more resources than any company, and | the luck-driven nature of making AI discoveries, and I reckon | most progress will be outside of OpenAI etc. Driven by a | problem the big guys don't need to solve: how do I avoid | buying a $5k graphics card. | lawlessone wrote: | From the previous paper you cited: > Pushing FFFs to the limit, | we show that they can use as little as 1% of layer neurons for | inference in vision transformers while preserving 94.2% of | predictive performance. | | This feels like that often misinterpreted Einstein meme/quote | about humans only using a fraction of their brain power. | | Is this only for inference though? Could it boost training? | WithinReason wrote: | That's an interesting question. It actually provides a nice | way to parallelize training: Pretrain e.g. the first 3 | branch levels, which effectively fragments the model into 8 | separate parts, which you can continue training across 8 | independent servers/nodes with no further communication | between the nodes. A central server would run the 1st 3 | levels and mark parts of the training set that each node has | to train on. Maybe you could do this for the whole network | and distribute the training in SETI@HOME style all over the | world. | | Hold on, you don't even need to freeze the branches | completely: each node could train 1 branch on the path to its | leaf node and communicate a change in the branch node to a | central server, so you can distribute training without having | to pre-freeze the branches. Still would need some pre-training | though, and the splits would change slowly, and the | attention mechanism could complicate things. | | Currently distributed neural network training SETI@HOME style | looks like a complete pipe dream that nobody is taking | seriously. But a smart branching mechanism like this could | _suddenly_ make it possible. Folding@home reached 1.5 | exaflops, which made it the world's largest supercomputer. | Imagine the models we could train that way; they would far | surpass whatever OpenAI or Google could train and would be | public. | alchemist1e9 wrote: | This! | | If this becomes true then it's a game changer. I hope you | are correct. | richardw wrote: | Also steps up the economic benefit of, and therefore demand | for, botnets. We really need a solution to bad actors | controlling vast amounts of compute. | klyrs wrote: | That ship has sailed, and her name is bitcoin. | richardw wrote: | If bitcoin keeps the botnets away from world-beating AI, | worth it. | Geee wrote: | I am barely understanding, so a stupid question: | | Does this also mean that it would be possible to train on | a parallel GPU-poor setup instead of needing lots of GPU | memory / bandwidth on one computer? | bloopernova wrote: | Apologies for the layman question: how much tera/peta/exa-flops | do current models use to train? | | Well, I'm assuming they'd use whatever they're given, so | maybe the question should be "how much less time would | training take on a 1.5 exaflops computer?" | foobiekr wrote: | As many as they can afford.
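 | For the second question, a crude back-of-envelope (my own numbers,
 | using the common "6 * params * tokens" rule of thumb for training
 | FLOPs; nothing here is from the paper):
 |
 |     params = 70e9    # e.g. a 70B-parameter model (illustrative)
 |     tokens = 2e12    # ~2T training tokens (illustrative)
 |     flops = 6 * params * tokens           # ~8.4e23 FLOPs
 |     ideal = 1.5e18                        # 1.5 exaFLOP/s, the Folding@home figure
 |     print(flops / ideal / 86400)          # ~6.5 days at 100% utilization
 |     print(flops / (ideal * 0.3) / 86400)  # ~22 days at a more realistic 30%
 |
 | In practice utilization, networking, and restarts eat a large
 | multiple of that.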
| | A lot of clusters are totally homogeneous, at least | within some very large domains, so for a given | interconnect and a generation of GPU you know the maximum | message latency, the peak sustained pflop rate, and so on, | but what often matters is some combination of the | depreciation-cost-per-time and the watt hours per unit | time, where you can sort of approximate both if you | ignore the unfortunate realities, which then act as a | multiplier. | | For example, a problem is network issues - and not just | scale - as the training sequence often involves billions | of cycles of short compute-sync sequences which are | bursty (e.g., all-to-all, barrier, compute, barrier, all | to all, ...) but between which there isn't enough time to | engage low power modes, so you're burning $ due to slack | and waste. This is true in different ways for a lot of | training approaches. | | You can approximate this, but it's so sensitive to data | set size, specific training schedule, etc. that you won't | be able to get the most important answer. | thomasahle wrote: | Sounds like hierarchical softmax from the early NLP days | knexer wrote: | It's mentioned briefly in the paper(1), but I'm more interested | in the interpretability implications of this approach. In some | respects, this marries the interpretability/editability of a | small decision tree with the expressive power of a large neural | network. Usually you see those two on extreme opposite ends of | a tradeoff spectrum - but this approach, if it scales, might | shift the Pareto frontier. | | (1): As a byproduct, the learned regions can also be used as a | partition of the input space for interpretability, surgical | model editing, catastrophic forgetting mitigation, reduction of | replay data budget, etc. | tokai wrote: | Why not use the real title? It's short and precise. | vorticalbox wrote: | Hugging Face model | | https://huggingface.co/pbelcak/UltraFastBERT-1x11-long | lawlessone wrote: | > This model is provided only as sanity check for research | purposes, it is untested and unfit for deployment. | | I guess this means it isn't pretrained yet? Is it still just | random weights? | mjn wrote: | "Unfit for deployment" or "not intended for deployment" is | semi-standard wording for research models that are just raw | language models with none of the safety/bias/offensiveness | filtering that is usually desired for product applications. | For example, if you deploy it as a customer-service chatbot, | it might tell your customers to kill themselves, or call them | racial slurs. | | It doesn't mean that there's anything technically wrong with | the language model per se as a model of language, just that | there has been no effort made to ensure it's fit to be | deployed as-is for any given generative-AI use case, and the | model authors would prefer you didn't do that. | ilaksh wrote: | Is it possible to use this with something like Llama 2? | vouaobrasil wrote: | This is rather scary. I feel we are witnessing the evolution of | language models and artificial intelligence, which seems | intellectually laudable until you realize that the underlying | evolutionary framework for this evolution is the global | capitalistic system whose only criterion for selection is | short-term monetary gain. | | We are creating a monster.
| hendler wrote: | Rather than looking to capitalism which has provided tremendous | benefits to society as well as unintended consequences you may | want to update your thinking to focus on the incentives | alignment problem in general. | | This TED talk articulates it well: https://youtu.be/WX_vN1QYgmE | | What is after capitalism? | vouaobrasil wrote: | I absolutely disagree that reformism as in the video via | incentives will be enough. | dicroce wrote: | I think it's good to be concerned and cautious but I also think | you are being a bit extreme here. | vouaobrasil wrote: | I absolutely disagree - I believe everyone else is blind, the | same way we are blind that our current lifestyles are an | exercise in extreme violence on the nonhuman world. | qntty wrote: | According to scientists, we only use 0.3% of our neural networks. | Imagine if we could use 100%. | kleiba wrote: | Nice. | | I know HN can sometimes be the place where humor goes to die, | but I found this comment hilarious. | madprofessor wrote: | Wonderful. | jredwards wrote: | Thank you for taking one of my most hated memes and turning it | into hilarity. | baq wrote: | Mix this with yesterday's matmul approximation (maddness) in HW | for a casual... three orders of magnitude speed increase? | nulld3v wrote: | Can you link the post for the matmul approximation? | jml7c5 wrote: | https://news.ycombinator.com/item?id=38360776 | nulld3v wrote: | Thank you! | hobofan wrote: | I'm not 100% sure, but those seem mostly mutually exclusive (or | redundant), with the decision tree in maddness taking on a | similar function as the binary tree in FFF that decides which | neurons to activate. | terafo wrote: | They are mostly incompatible. | measured_step wrote: | How would this scale for a use case like writing code? I could | imagine that some inputs would require a large number of neurons. | Would this architecture be able to do that if it were scaled up? | | I'm also curious if this model architecture would achieve the | grokking of more complex concepts at scale. | jasonjmcghee wrote: | Does anyone understand why they are using B x H instead of B x S | x H? | | Why is the context size and batch size represented as a single | parameter? | numeri wrote: | I would have to go back and reread the paper to be sure, but FF | layers are applied position-wise, meaning independently and in | parallel on all input tokens/positions. Because of that, I | could imagine contexts where the sequence dimension isn't | relevant, i.e., for computational complexity. | rsolva wrote: | I find running 7B models on my 6 year old small form factor HP | EliteDesk to be fast enough for casual everyday use. If this | speedup can be applied generally to commonly used models, I can | serve a local ChatGPT experience for both friends and family from | my tiny homelab in my basement. | | _mind blown_ | FooBarWidget wrote: | I find 7B models to be too stupid. They often respond with | nonsense or fail to follow instructions. | cooper_ganglia wrote: | Even 65B models approach a level of being almost usable, but | still fall short in my personal experience. | cloudking wrote: | This is why I'm not understanding the excitement around | open source models, they pale in comparison to GPT-4 | quality, so I have no use for them until we have something | comparable. | GaggiX wrote: | Even like OpenChat-3.5? 
(Probably the best 7B model out | there) Demo: https://openchat.team/ | | HuggingFace: https://huggingface.co/openchat/openchat_3.5 | | On the LLM arena (blinded comparisons), it's the third best | non-proprietary model: | https://huggingface.co/spaces/lmsys/chatbot-arena- | leaderboar... | sorokod wrote: | _What is the sum of odd numbers in this set: 4, 7, 12, 1, | 3_ | | _The sum of odd numbers in the given set is 4 + 7 + 1 = | 12. Therefore, the answer is 12._ | all2 wrote: | Technically 3 is even. \s | moffkalast wrote: | Are there any comparisons with Mistral-instruct? I've yet | to see anything under 30B beat it in any way. | quickthrower2 wrote: | If anyone is on the ball enough to turn this into a colab or | notebook that would be appreciated! Would love to see the code ___________________________________________________________________ (page generated 2023-11-22 23:00 UTC)