[HN Gopher] Exponentially faster language modelling
       ___________________________________________________________________
        
       Exponentially faster language modelling
        
       Author : born-jre
       Score  : 162 points
       Date   : 2023-11-21 14:31 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | fgfm wrote:
       | This approach feels like pruning, but the speedup is considerably
       | higher. I'm curious how this will play out on more recent
       | transformer architectures, though: I'd guess the speedup will
       | matter most for the largest architectures, but even a 2x or 10x
       | speedup on Mistral/Zephyr, Orca 2 or OpenChat 3.5 would be a
       | tremendous achievement!
        
         | webmaven wrote:
         | I'm curious as to how applicable this approach might be for
         | text-to-image models like Stable Diffusion.
        
       | ndr wrote:
       | Abstract:
       | 
       | > Language models only really need to use an exponential fraction
       | of their neurons for individual inferences. As proof, we present
       | UltraFastBERT, a BERT variant that uses 0.3% of its neurons
       | during inference while performing on par with similar BERT
       | models. UltraFastBERT selectively engages just 12 out of 4095
       | neurons for each layer inference. This is achieved by replacing
       | feedforward networks with fast feedforward networks (FFFs). While
       | no truly efficient implementation currently exists to unlock the
       | full acceleration potential of conditional neural execution, we
       | provide high-level CPU code achieving 78x speedup over the
       | optimized baseline feedforward implementation, and a PyTorch
       | implementation delivering 40x speedup over the equivalent batched
       | feedforward inference. We publish our training code, benchmarking
       | setup, and model weights.
       | 
       | Conclusions
       | 
       | > We present UltraFastBERT, a modified version of the
       | (crammed)BERT architecture that uses fast feedforward instead of
       | feedforward networks in its intermediate layers. UltraFastBERT
       | serves as proof that large language models only really need to
       | engage an exponential fraction of their parameters to perform
       | individual inferences. UltraFastBERT-1x11, our deepest model with
       | the highest promise of acceleration, uses only 0.3% of its
       | neurons during inference and already achieves a 78x CPU speedup
       | over the inference time of the corresponding feedforward layer.
       | With a theoretical speedup promise of 341x at the scale of BERT-
       | base models, we hope that our work will inspire an effort to
       | implement primitives for conditional neural execution as a part
       | of device programming interfaces.
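       | 
       | For anyone who wants to see the shape of the trick, here is a
       | minimal toy sketch of how I read the conditional matrix
       | multiplication idea (my own numpy code, not the authors'; the
       | sizes and the sign-based routing are assumptions taken from the
       | abstract):
       | 
       |     import numpy as np
       | 
       |     # A tree of depth 11 has 2^12 - 1 = 4095 nodes ("neurons");
       |     # a root-to-leaf path touches 12 of them.
       |     DEPTH, HIDDEN = 11, 768
       |     N_NODES = 2 ** (DEPTH + 1) - 1        # 4095
       | 
       |     rng = np.random.default_rng(0)
       |     W_in = 0.02 * rng.standard_normal((N_NODES, HIDDEN))
       |     W_out = 0.02 * rng.standard_normal((N_NODES, HIDDEN))
       | 
       |     def gelu(z):
       |         c = 0.79788456  # sqrt(2/pi), tanh approximation
       |         return 0.5 * z * (1 + np.tanh(c * (z + 0.044715 * z**3)))
       | 
       |     def fff_forward(x):
       |         """Touch only the 12 neurons on one root-to-leaf path."""
       |         y = np.zeros(HIDDEN)
       |         node = 0                          # start at the root
       |         for _ in range(DEPTH + 1):
       |             act = W_in[node] @ x          # one dot product
       |             y += gelu(act) * W_out[node]  # node's contribution
       |             node = 2 * node + (1 if act > 0 else 2)  # sign picks child
       |         return y
       | 
       |     y = fff_forward(rng.standard_normal(HIDDEN))
       |     print(y.shape)  # (768,), computed from 12 of 4095 neurons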
        
         | iNic wrote:
         | Do I understand correctly that the difficulty of making this
         | useful is writing code to run this idea on GPUs?
        
           | nolist_policy wrote:
           | As far as I understood it: Forget GPUs, this thing is plenty
           | fast on CPUs.
           | 
           | In general, GPUs are bad at branching. The fastest way to
           | implement this on a GPU is probably to calculate both sides
           | of every branch and then only keep the result of the one
           | that was taken, which won't be any faster than a normal NN.
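           | 
           | Continuing the toy numpy sketch in the abstract comment above
           | (same W_in/W_out/gelu/N_NODES/DEPTH names, all assumed), the
           | "compute both sides and mask" version looks like this -- note
           | it does the full 4095-row matmul, i.e. exactly the work of an
           | ordinary feedforward layer:
           | 
           |     def fff_forward_masked(x):
           |         acts = W_in @ x              # dense: all 4095 nodes
           |         mask = np.zeros(N_NODES)
           |         node = 0
           |         for _ in range(DEPTH + 1):   # cheap bookkeeping only
           |             mask[node] = 1.0
           |             node = 2 * node + (1 if acts[node] > 0 else 2)
           |         return (gelu(acts) * mask) @ W_out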
        
             | falcor84 wrote:
             | I wonder then, if each inference only uses a small part of
             | the net, could you possibly perform multiple inferences in
             | the same forward pass?
        
             | lawlessone wrote:
             | > Forget GPUs, this thing is plenty fast on CPUs.
             | 
             | Does this mean everyone could be running the 100B+ models
             | from RAM?
             | 
             | This opens up a lot: some models could be run very fast on
             | small machines with this.
             | 
             | Bundling a small model inside a game to act as part of the
             | mind for in-game NPCs (obviously with some tuning) becomes
             | practical with this.
        
               | hobofan wrote:
               | The bottleneck for "easy integration" into games and
               | applications right now is as much the RAM usage as the
               | slowness. This would probably bring the speed to an
               | acceptable level, but you would still have to hold the
               | whole model in RAM.
               | 
               | That would make it a lot more feasible to run models in
               | the cloud (triple digit RAM is a lot more abundant than
               | VRAM), but wouldn't do that much for consumer hardware.
        
               | nolist_policy wrote:
               | I wonder if the model takes similar branches while in the
               | same context? Then you can fault in parts of the model
               | from disk as needed.
        
               | entropicdrifter wrote:
               | Interesting idea. Like texture streaming, you'd just
               | stream in the parts of the model from disk to fill up all
               | available RAM. If the NPC needed to think about something
               | not cached in RAM, you'd throw up a "hmm, let me think
               | about this" while stuff loads from disk.
        
           | PaulHoule wrote:
           | They are getting a 78x speedup w/o hardware support, which is
           | pretty good: they think they could get roughly another 4x
           | with the right hardware support. So it looks useful now, with
           | the possibility of getting better.
           | 
           | For as long as I've been involved with neural networks for
           | text analysis it's seemed to me that we really should be
           | using sparse activations, because any particular document
           | only involves a limited set of concepts.
           | 
           | For instance, a search engine for patents might be looking at
           | a patent for adhesive tape, which activates a certain set of
           | concepts but is not going to activate concepts involved with
           | bicycle derailleurs or public key cryptography: a sparse
           | representation reflects this and a dense representation
           | doesn't.
        
           | kristianp wrote:
           | > a PyTorch implementation delivering 40x speedup over the
           | equivalent batched feedforward inference
           | 
           | Does this not indicate a 40x speedup on the GPU?
           | 
           | Edit: looking at the paper, their "Naive CUDA" implementation
           | also shows a 117x speedup in Table 2.
        
       | sdrg822 wrote:
       | Cool. Important note:
       | 
       | """ One may ask whether the conditionality introduced by the use
       | of CMM does not make FFFs incompatible with the processes and
       | hardware already in place for dense matrix multiplication and
       | deep learning more broadly. In short, the answer is "No, it does
       | not, save for some increased caching complexity." """
       | 
       | It's hard to beat the hardware lottery!
        
         | algo_trader wrote:
         | In fact, as stated in the paper, this is bad news:
         | 
         | > We therefore leave the attention layers untouched
         | 
         | Meaning, presumably, that GPU memory remains the bottleneck.
         | 
         | Flops really are quite cheap by now, e.g. vision inference
         | chips are at ~$2 per TFLOP/s !!
        
           | marcinzm wrote:
           | It's a bottleneck for larger models; however, this would
           | presumably allow for cheaper models at scale or on compute-
           | constrained devices (like phones).
        
             | entropicdrifter wrote:
             | And potentially for distributing a model across several
             | devices at inference time. You could devote a cluster of
             | smaller/weaker machines to inference.
        
               | sroussey wrote:
               | You can do that today; the only advantage right now is
               | being able to fit the model in memory. It's sequential
               | and slower due to communication costs, though batching
               | might be faster?
        
           | ashirviskas wrote:
           | >Flops really are quite cheap by now, e.g. vision inference
           | chip ~$2/teraflop/s !!
           | 
           | I'm really interested, can you share where you got these
           | numbers?
        
             | algo_trader wrote:
             | Axelera [1] or Hailo [2] give you 100-200 TFLOPS for ~$200.
             | 
             | 8-bit ops, inference only, low-memory embedded, excluding
             | the host; implied utilization from FPS specs is ~20%.
             | 
             | But the trend is there.
             | 
             | There are also newer ADAS/AV units from China which claim
             | 1000 TFLOPS and can't really cost more than $1000/$2000 per
             | car.
             | 
             | These are all tiled designs (see also Tesla's Dojo), heavily
             | over-weighted on flops vs memory.
             | 
             | [1] https://www.axelera.ai/
             | 
             | [2] https://hailo.ai/
        
               | Y_Y wrote:
               | You can't get flops on a Hailo-8, they're fixed-point
               | only. As much as these specialised inference chips are
               | cool, we're a long way from just being able to drop them
               | in where a GPU was. Not to mention the memory is hugely
               | constrained. The Hailo chips I've worked with were all
               | limited to 20MiB for the weights which is a squeeze even
               | at 4-bit.
        
           | YetAnotherNick wrote:
           | > ~$2/teraflop/s
           | 
           | An H100 is basically ~$2/hour for ~2000 TFLOP/s, i.e. roughly
           | $1 for 4*10^18 floating point operations.
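           | 
           | The back-of-envelope, for anyone checking (the $2/hour rental
           | price and the ~2000 TFLOP/s sustained throughput are the
           | assumptions here):
           | 
           |     flops_per_hour = 2000e12 * 3600     # ~7.2e18
           |     dollars_per_hour = 2.0
           |     print(flops_per_hour / dollars_per_hour)
           |     # ~3.6e18 FLOPs per dollar, i.e. roughly 4*10^18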
        
           | theGnuMe wrote:
           | There's another paper replacing attention with FF networks so
           | just combine the two and you've got something.
        
             | gdoug wrote:
             | Link? Sounds like a good read! :)
        
               | smeeth wrote:
               | Not op but might be this:
               | https://arxiv.org/pdf/2311.10642.pdf
        
       | Klaster_1 wrote:
       | What are the potential consequences? Does this open doors to
       | faster edge inference or improved capabilities?
        
         | yvdriess wrote:
         | Both. Cheaper CPU-based inference, since GPUs are not as
         | competitive for sparse linear algebra. This could lead to much
         | larger models, as you only touch a small portion of the matrix
         | during inference. However, the training here is still dense LA
         | on a GPU, so you still blow up the compute cost when increasing
         | model size.
        
           | WithinReason wrote:
           | Note this doesn't speed up training
        
           | swalsh wrote:
           | Has anyone used SIMD instructions to try and speed up CPU
           | inference?
        
             | hobofan wrote:
             | Most inference builds on top of BLAS libraries, which in
             | their implementation take advantage of SIMD.
        
             | singhrac wrote:
             | A lot of CPU inference libraries (llama.cpp included) use
             | as much SIMD as possible, sometimes by hand-writing loops.
             | The one I hack on, llama.rs, uses portable_simd but
             | specializes to your CPU at compile time.
             | 
             | My experience has been that most CPU inference is actually
             | not compute limited, but memory bandwidth limited, since
             | most weights are used for a few operations per token (how
             | quickly can you load and unload the entire 70 GB of weights
             | into your registers?). It's not quite that bad but I found
             | most vectorization changes didn't meaningfully change
             | performance.
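             | 
             | A quick back-of-envelope of that bandwidth wall (the 70 GB
             | figure is from the comment; the ~50 GB/s of DDR bandwidth
             | is my assumption for a typical desktop):
             | 
             |     weights_gb = 70    # weights streamed once per token
             |     mem_bw_gb_s = 50   # assumed system memory bandwidth
             |     print(mem_bw_gb_s / weights_gb, "tokens/s")  # ~0.7
             | 
             | So extra SIMD stops helping once you are already moving
             | bytes as fast as the memory system allows.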
        
               | anonymousDan wrote:
               | Would you say that is the state of the art CPU inference
               | library?
        
           | valine wrote:
           | GPU utilization should be down when using this technique. I'm
           | hoping this could allow for more efficient batch inference on
           | GPUs. If you can predict 10 tokens for the price of 1 it
           | should allow you to do tree of thought much more efficiently.
           | 
           | https://github.com/princeton-nlp/tree-of-thought-llm
        
         | hobofan wrote:
         | I think with that magnitude of a speed improvement it should
         | become feasible to do just-in-time embedding creation for
         | semantic search for much larger documents.
        
       | millisecond wrote:
       | Could this be applied to other models like Llama2 or Mistral?
        
         | andy99 wrote:
         | Just from the abstract I don't see why not; it's just replacing
         | the feedforward network that's part of all of these models
         | with a very sparse one. The bigger problem is that you
         | seemingly have to retrain the model, so you couldn't just drop
         | in Llama 2 weights from Meta and have it work, which makes it
         | much more limiting. Something that used existing weights would
         | be a lot more practical (like quantization, for example). For
         | BERT, I can see this being useful if you had to make a really
         | fast embedding model. There was a discussion about a fast
         | embedding use case not long ago:
         | https://news.ycombinator.com/item?id=37898001
        
         | MinusGix wrote:
         | It certainly could, and I wouldn't be surprised if the authors
         | want to try it out on those. You do have the issue that past
         | improvements often don't enhance more powerful models nearly
         | as much. I'd expect this to possibly not work as well: the
         | bigger models may end up with more polysemantic neurons,
         | because they're given more ''incentive'' (training time,
         | neuron count, dataset size they're encouraged to be able to
         | reconstruct) to extract as much as possible, and this
         | intermingling might make the method perform worse. (See the
         | transformer circuits website for that.) (Though I expect there
         | are ways to recover a good chunk of the lost
         | throughput/accuracy, maybe by adding extra steps that directly
         | steer the training towards breaking apart polysemantic
         | neurons.)
        
         | lopuhin wrote:
         | There are two issues here. For one, in big transformers more
         | compute is in the attention layers, while this work improves
         | only the feed-forward layers, which matter more for smaller
         | models and shorter sequence lengths. Second, in many typical
         | scenarios LLM inference is memory-bandwidth bound, and I'm not
         | sure whether their approach can be used to reduce the required
         | memory bandwidth.
        
           | joelthelion wrote:
           | Doesn't reducing the number of neurons drastically reduce
           | memory requirements?
        
             | lopuhin wrote:
             | Yes, it might. The "reduction of the number of neurons" is
             | not static here: unlike traditional pruning approaches,
             | they still keep all the weights, but the network
             | dynamically selects which sub-portion of them to use. There
             | is a related discussion of this in section 3.2 (page 4),
             | but I don't think they mention the actual memory bandwidth
             | requirements/wins of their implementation, and probably
             | there can be different tradeoffs for different devices.
        
       | OneOffAsk wrote:
       | Is this similar to what iOS 17 uses for its new autocomplete?
        
       | WithinReason wrote:
       | Link to previous paper:
       | 
       | https://arxiv.org/abs/2308.14711
       | 
       | An attempt at a summary: they use a sigmoid function to make
       | differentiable "soft" branches and stack them to construct a
       | binary tree, with the goal of only taking one branch at inference
       | time (but training the whole tree), leading to log(W) instead of
       | W inference cost. They gradually harden the branches so they
       | become hard branches by the end of training.
       | 
       | A branch is computed as _branch(input, N)_, with a neural
       | network N computing a scalar _c = N(input)_, then using a
       | sigmoid s to do a soft branch by returning the weighted sum of
       | the recursive calls _s(c)*branch(input, N_left) + (1-s(c)) *
       | branch(input, N_right)_ (the two weights _s(c)_ and _1-s(c)_
       | sum to 1). Only the leaf nodes do the "proper processing".
       | 
       | Then they add a new loss term that encourages hard decisions by
       | minimising the entropy of the Bernoulli distribution, making the
       | 2 weights converge to 0 and 1, at which point only one branch
       | needs to be taken at inference. They also state that this
       | hardening often happens automatically though.
       | 
       | It's a simple idea, but the loss formulation is nice: you usually
       | want your loss terms to be a measure of information.
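       | 
       | A minimal PyTorch sketch of how I read the training-time soft
       | branch (module names, sizes and the dummy loss are mine, not the
       | paper's): a sigmoid gate mixes two leaf networks, and the
       | Bernoulli-entropy penalty pushes the gate towards 0 or 1 so that
       | only one leaf needs to be evaluated at inference time.
       | 
       |     import torch
       |     import torch.nn as nn
       | 
       |     class SoftBranch(nn.Module):
       |         """One tree node: a sigmoid gate mixes two leaves."""
       |         def __init__(self, d_model=768, d_leaf=64):
       |             super().__init__()
       |             self.gate = nn.Linear(d_model, 1)  # N(input) -> c
       |             def leaf():
       |                 return nn.Sequential(
       |                     nn.Linear(d_model, d_leaf), nn.GELU(),
       |                     nn.Linear(d_leaf, d_model))
       |             self.left, self.right = leaf(), leaf()
       | 
       |         def forward(self, x, hard=False):
       |             p = torch.sigmoid(self.gate(x))    # s(c) in (0, 1)
       |             if hard:
       |                 # inference: pick one leaf per example (a real
       |                 # implementation would evaluate only that leaf)
       |                 left, right = self.left(x), self.right(x)
       |                 return torch.where(p > 0.5, left, right)
       |             out = p * self.left(x) + (1 - p) * self.right(x)
       |             # Bernoulli entropy is zero exactly when p is 0 or 1,
       |             # so minimising it hardens the branch during training
       |             ent = -(p * torch.log(p + 1e-8)
       |                     + (1 - p) * torch.log1p(-p + 1e-8))
       |             return out, ent.mean()
       | 
       |     layer = SoftBranch()
       |     x = torch.randn(4, 768)
       |     out, ent = layer(x)
       |     loss = out.pow(2).mean() + 0.1 * ent  # dummy loss + penalty
       |     loss.backward()
       | 
       | Stacking these nodes gives the binary tree; once the branches
       | have hardened, a depth-d tree only evaluates d gates and one leaf
       | per inference.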
        
         | WithinReason wrote:
         | Also, this didn't come from OpenAI or DeepMind, or even
         | industry. What are those guys even doing? :)
        
           | mmaunder wrote:
           | Many labs doing foundational work like this and making
           | progress don't have anything near the budget or compute to
           | implement it at scale. In other words, they don't have a Sam
           | and his backers or a Zuck and his budget.
        
           | Micoloth wrote:
           | They sure as hell have no incentive to make neural networks
           | faster and more accessible, for starters...
           | 
           | (Considering that right now they make more money and have
           | more control the less accessible and the more
           | computation-hungry AI models are.)
           | 
           | To be fair, this approach (claims to) only speed up
           | inference, not training, so all the GPUs are needed anyway.
        
             | WithinReason wrote:
             | They certainly have an incentive to keep these kinds of
             | improvements in-house and not publish them, since they are
             | commercial entities and this represents a competitive
             | advantage.
        
               | lawlessone wrote:
               | I think Nvidia might have an incentive for this not to
               | exist.
               | 
               | edit: but you are right; for the AI companies not open-
               | sourcing their models, it's an advantage to have it when
               | others don't.
        
               | WithinReason wrote:
               | I'm actually not sure about Nvidia, due to
               | https://en.wikipedia.org/wiki/Jevons_paradox
        
               | ForkMeOnTinder wrote:
               | But if things get too efficient for individual users, you
               | won't need an Nvidia GPU anymore. People will use cheaper
               | hardware instead. I'm looking forward to running good
               | models at decent speed on a low-end CPU or whatever
               | crappy GPU is in my phone.
        
               | jacobsimon wrote:
               | I had the same thought this morning and was debating
               | selling my nvda stock when I saw this - feels like they
               | are well-positioned right now, as with crypto a few years
               | ago, but if there were an efficiency breakthrough that
               | allowed commodity CPUs to do the inference instead, this
               | advantage could vanish quickly.
        
               | godelski wrote:
               | Nvidia can't make GPUs fast enough. I doubt 10xing
               | training and/or inference efficiency would result in a
               | decrease in demand. I would be surprised if it didn't
               | instead increase demand. Mind you, Nvidia is pushing hard
               | on TensorRT which optimizes models at inference time and
               | results in major increases in throughput (not 10x though
               | lol).
        
               | rictic wrote:
               | Yeah, Jevons Paradox suggests that 10xing efficiency of
               | training and inference would increase demand for GPUs.
        
             | godelski wrote:
             | I wouldn't be so quick to assume conspiracy. I'm the author
             | of a work and a famous blog post that trains a particular
             | common architecture much faster (don't want to dox myself
             | too much) and with far fewer parameters, but it has been
             | rejected several times and is now arxiv-only. Our most
             | common complaint was "who would use this? Why not just take
             | a large model and tune it?" That question alone held us
             | back a year (we had over a hundred citations by then, and
             | it remains my most cited work), until it switched to "use
             | more datasets" and "not novel" (by that time true; others
             | had built off of us, cited us, and published in top
             | venues).
             | 
             | I don't think this was some conspiracy by big labs to push
             | back against us (we're nobodies), but rather that people
             | get caught up in hype and reviewers are lazy and
             | incentivized to reject. You're trained to be critical of
             | works, and post hoc most solutions appear far simpler than
             | they actually are. But context matters, because if you
             | don't approach every paper with nuance it's easy to say
             | "oh, it's just x." Yet if those ideas were so simple and
             | obvious, they would also be prolific. I see a lot of small
             | labs suffer the same fate simply due to lack of compute:
             | if you don't make your new technique work on many
             | datasets, that becomes the easiest thing to reject a paper
             | by.
             | 
             | ACs aren't checking that reviews are reasonable. I've even
             | argued with fellow reviewers about papers in workshops --
             | papers I would have accepted in the main conference -- that
             | are brushed off while the reviewers admit in their reviews
             | that they do not work on these topics. I don't understand
             | what's going on, but at times it feels like a collective
             | madness. A 10-page paper with 4 very different datasets
             | that solves a problem, is clearly written, has no major
             | flaws, and is useful to the community should not need
             | defending when submitted to a workshop just because the
             | reviewers aren't qualified to review the work (this paper
             | got in, btw).
             | 
             | We are moving into a "pay to play" ecosystem, and that will
             | only create bad science due to groupthink. (Another aspect
             | of "pay to play" is the tuning: spending $1M to tune your
             | model to be the best doesn't mean it is better than a model
             | that could not afford the search. Often more than half of
             | resources are spent on tuning now.)
        
               | wruza wrote:
               | Is there a place where you guys discuss... things? I'm a
               | layman interested in this topic in the way I'm interested
               | in pop physics/maths, but I have no chance of just
               | reading the papers and "getting it". On the other hand,
               | the immediately available resources focus more on the
               | how-to part of it rather than on what's going on overall.
               | Also, do you have something like 3b1b/pbs/nph for it?
               | Content that _you_ can watch and say "well, yep, good
               | job".
        
           | airgapstopgap wrote:
           | ...ETH Zurich is an illustrious research university that
           | often cooperates with Deepmind and other hyped groups,
           | they're right there at the frontier too, and have been for a
           | very long time. They don't have massive training runs on
           | their own but pound for pound I'd say they have better
           | papers.
        
             | godelski wrote:
             | ETH Zurich is one of the top labs in the world. Disney
             | Research also works with them a lot. Another "sleeper" is
             | the University of Amsterdam, which has rockstars like Max
             | Welling and his students Kingma, Salimans, van den Berg,
             | and Hoogeboom.
             | 
             | It's easy to get hyped up on the big tech labs because they
             | have the most compute, but the best papers come from
             | smaller labs, which unfortunately have lately faced larger
             | challenges in getting published. It's the smaller works
             | that create the foundations that end up in these giant
             | models. ML is in a really weird space right now.
        
           | pr337h4m wrote:
           | This from DeepMind:
           | 
           | DiLoCo: Distributed Low-Communication Training of Language
           | Models - https://arxiv.org/pdf/2311.08105.pdf
           | 
           | From the first author on Twitter: "It could be quite a big
           | deal for people who don't have access to a colocated cluster
           | of GPUs:
           | 
           | e.g. with DiLoCo you could train your model, with data-
           | parallelism, across all GPU providers, looking in real-time
           | for the cheapest price, even if pre-emptable, even across
           | continents"
           | 
           | https://twitter.com/Ar_Douillard/status/1724839420820361687
        
           | quickthrower2 wrote:
           | It is not surprising. The assumption is that they have the
           | best people. That you can objectively search 8 billion people
           | for the best people globally is folly, of course. There are
           | geniuses without US citizenship / visas / green cards, and so
           | outside brains are going to figure this out. Mix in the fact
           | that the GDP of $rest_of_world represents far more resources
           | than any company, and the luck-driven nature of making AI
           | discoveries, and I reckon most progress will be outside of
           | OpenAI etc., driven by a problem the big guys don't need to
           | solve: how do I avoid buying a $5k graphics card?
        
         | lawlessone wrote:
         | From the previous paper you cited: > Pushing FFFs to the
         | limit, we show that they can use as little as 1% of layer
         | neurons for inference in vision transformers while preserving
         | 94.2% of predictive performance.
         | 
         | This feels like that often misinterpreted Einstein meme/quote
         | about humans only using a fraction of their brain power.
         | 
         | Is this only for inference though? Could it boost training?
        
           | WithinReason wrote:
           | That's an interesting question. It actually suggests a nice
           | way to parallelize training: pretrain e.g. the first 3
           | branch levels, which effectively fragments the model into 8
           | separate parts, which you can then continue training across
           | 8 independent servers/nodes with no further communication
           | between the nodes. A central server would run the first 3
           | levels and mark the parts of the training set that each node
           | has to train on. Maybe you could do this for the whole
           | network and distribute the training SETI@home-style all over
           | the world.
           | 
           | Hold on, you don't even need to freeze the branches
           | completely: each node could train 1 branch on the path to its
           | leaf node and communicate a change in the branch node to a
           | central server, so you can distribute training without having
           | to pre-freeze the branches. Still would need some pre-
           | training though, and the splits would change slowly, and the
           | attention mechanism could complicate things.
           | 
           | Currently, SETI@home-style distributed neural network
           | training looks like a complete pipe dream that nobody is
           | taking seriously. But a smart branching mechanism like this
           | could _suddenly_ make it possible. Folding@home reached 1.5
           | exaflops, which made it the world's largest supercomputer.
           | Imagine the models we could train that way; they would far
           | surpass whatever OpenAI or Google could train, and would be
           | public.
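           | 
           | A rough sketch of just the routing part of that idea (the
           | frozen gate weights, the 3 levels and the 8 workers are all
           | hypothetical), to make the "central server marks parts of
           | the training set" step concrete:
           | 
           |     import numpy as np
           | 
           |     rng = np.random.default_rng(0)
           |     HIDDEN, LEVELS = 768, 3          # 2^3 = 8 shards
           |     gates = rng.standard_normal((2**LEVELS - 1, HIDDEN))
           | 
           |     def shard_of(x):
           |         """Walk the frozen tree top to pick a worker."""
           |         node = 0
           |         for _ in range(LEVELS):
           |             right = gates[node] @ x > 0
           |             node = 2 * node + (2 if right else 1)
           |         return node - (2**LEVELS - 1)  # leaf index 0..7
           | 
           |     dataset = rng.standard_normal((10_000, HIDDEN))
           |     shards = [[] for _ in range(2**LEVELS)]
           |     for x in dataset:
           |         shards[shard_of(x)].append(x)  # one shard per worker
           |     print([len(s) for s in shards])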
        
             | alchemist1e9 wrote:
             | This!
             | 
             | If this becomes true then it's a game changer. I hope you
             | are correct.
        
             | richardw wrote:
             | Also steps up the economic benefit of, and therefore demand
             | for, botnets. We really need a solution to bad actors
             | controlling vast amounts of compute.
        
               | klyrs wrote:
               | That ship has sailed, and her name is bitcoin.
        
               | richardw wrote:
               | If bitcoin keeps the botnets away from world beating AI,
               | worth it.
        
             | Geee wrote:
             | I am barely understanding this, so a stupid question:
             | 
             | Does this also mean that it would be possible to train on a
             | parallel GPU-poor setup instead of needing lots of GPU
             | memory / bandwidth on one computer?
        
             | bloopernova wrote:
             | Apologies for the layman question: how many
             | tera/peta/exa-flops do current models use to train?
             | 
             | Well, I'm assuming they'd use whatever they're given, so
             | maybe the question should be "how much less time would
             | training take on a 1.5 exaflop computer?"
        
               | foobiekr wrote:
               | As many as they can afford.
               | 
               | A lot of clusters are totally homogeneous, at least
               | within some very large domains, so for a given
               | interconnect and generation of GPU you know the maximum
               | message latency, the peak sustained pflop rate, and so
               | on; but what often matters is some combination of the
               | depreciation cost per unit time and the watt-hours per
               | unit time, both of which you can sort of approximate if
               | you ignore the unfortunate realities, which then act as
               | a multiplier.
               | 
               | For example, one problem is network issues - and not
               | just scale - as training often involves billions of
               | cycles of short compute-sync sequences which are bursty
               | (e.g., all-to-all, barrier, compute, barrier,
               | all-to-all, ...) but between which there isn't enough
               | time to engage low power modes, so you're burning $ due
               | to slack and waste. This is true in different ways for a
               | lot of training approaches.
               | 
               | You can approximate this, but it's so sensitive to data
               | set size, specific training schedule, etc. that you won't
               | be able to get the most important answer.
        
         | thomasahle wrote:
         | Sounds like hierarchical softmax from the early NLP days.
        
         | knexer wrote:
         | It's mentioned briefly in the paper(1), but I'm more interested
         | in the interpretability implications of this approach. In some
         | respects, this marries the interpretability/editability of a
         | small decision tree with the expressive power of a large neural
         | network. Usually you see those two on extreme opposite ends of
         | a tradeoff spectrum - but this approach, if it scales, might
         | shift the pareto frontier.
         | 
         | (1): As a byproduct, the learned regions can also be used as a
         | partition of the input space for interpretability, surgical
         | model editing, catastrophic forgetting mitigation, reduction of
         | replay data budget, etc..
        
       | tokai wrote:
       | Why not use the real title? It's short and precise.
        
       | vorticalbox wrote:
       | hugging face model
       | 
       | https://huggingface.co/pbelcak/UltraFastBERT-1x11-long
        
         | lawlessone wrote:
         | > This model is provided only as sanity check for research
         | purposes, it is untested and unfit for deployment.
         | 
         | I guess this means it isn't pretrained yet? Is it still just
         | random weights?
        
           | mjn wrote:
           | "Unfit for deployment" or "not intended for deployment" is
           | semi-standard wording for research models that are just raw
           | language models with none of the safety/bias/offensiveness
           | filtering that is usually desired for product applications.
           | For example, if you deploy it as a customer-service chatbot,
           | it might tell your customers to kill themselves, or call them
           | racial slurs.
           | 
           | It doesn't mean that there's anything technically wrong with
           | the language model per se as a model of language, just that
           | there has been no effort made to ensure it's fit to be
           | deployed as-is for any given generative-AI use case, and the
           | model authors would prefer you didn't do that.
        
         | ilaksh wrote:
         | Is it possible to use this with something like Llama 2?
        
       | vouaobrasil wrote:
       | This is rather scary. I feel we are witnessing the evolution of
       | language models and artificial intelligence, which seems
       | intellectually laudable until you realize that the underlying
       | evolutionary framework for this evolution is the global
       | capitalistic system, whose only criterion for selection is
       | short-term monetary gain.
       | 
       | We are creating a monster.
        
         | hendler wrote:
         | Rather than looking to capitalism, which has provided
         | tremendous benefits to society as well as unintended
         | consequences, you may want to update your thinking to focus on
         | the incentive alignment problem in general.
         | 
         | This TED talk articulates it well: https://youtu.be/WX_vN1QYgmE
         | 
         | What is after capitalism?
        
           | vouaobrasil wrote:
           | I absolutely disagree that reformism via incentives, as in
           | the video, will be enough.
        
         | dicroce wrote:
         | I think it's good to be concerned and cautious but I also think
         | you are being a bit extreme here.
        
           | vouaobrasil wrote:
           | I absolutely disagree - I believe everyone else is blind,
           | the same way we are blind to the fact that our current
           | lifestyles are an exercise in extreme violence against the
           | nonhuman world.
        
       | qntty wrote:
       | According to scientists, we only use 0.3% of our neural networks.
       | Imagine if we could use 100%.
        
         | kleiba wrote:
         | Nice.
         | 
         | I know HN can sometimes be the place where humor goes to die,
         | but I found this comment hilarious.
        
         | madprofessor wrote:
         | Wonderful.
        
         | jredwards wrote:
         | Thank you for taking one of my most hated memes and turning it
         | into hilarity.
        
       | baq wrote:
       | Mix this with yesterday's matmul approximation (MADDNESS) in HW
       | for a casual... three orders of magnitude speed increase?
        
         | nulld3v wrote:
         | Can you link the post for the matmul approximation?
        
           | jml7c5 wrote:
           | https://news.ycombinator.com/item?id=38360776
        
             | nulld3v wrote:
             | Thank you!
        
         | hobofan wrote:
         | I'm not 100% sure, but those seem mostly mutually exclusive (or
         | redundant), with the decision tree in MADDNESS taking on a
         | similar function to the binary tree in FFF that decides which
         | neurons to activate.
        
         | terafo wrote:
         | They are mostly incompatible.
        
       | measured_step wrote:
       | How would this scale for a use case like writing code? I could
       | imagine that some inputs would require a large number of neurons.
       | Would this architecture be able to do that if it were scaled up?
       | 
       | I'm also curious if this model architecture would achieve the
       | grokking of more complex concepts at scale.
        
       | jasonjmcghee wrote:
       | Does anyone understand why they are using B x H instead of B x S
       | x H?
       | 
       | Why are the context size and batch size represented as a single
       | parameter?
        
         | numeri wrote:
         | I would have to go back and reread the paper to be sure, but
         | FF layers are applied position-wise, meaning independently and
         | in parallel on all input tokens/positions. Because of that, I
         | could imagine contexts where the sequence dimension isn't
         | relevant, e.g. for computational complexity.
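         | 
         | A tiny illustration of why B and S can be folded together for
         | a position-wise layer (shapes and names are mine, not the
         | paper's):
         | 
         |     import torch
         |     import torch.nn as nn
         | 
         |     B, S, H = 4, 128, 768
         |     ff = nn.Sequential(nn.Linear(H, 4 * H), nn.GELU(),
         |                        nn.Linear(4 * H, H))
         | 
         |     x = torch.randn(B, S, H)
         |     y1 = ff(x)                      # applied per position
         |     y2 = ff(x.reshape(B * S, H)).reshape(B, S, H)
         |     print(torch.allclose(y1, y2))   # True: only B*S x H matters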
        
       | rsolva wrote:
       | I find running 7B models on my 6 year old small form factor HP
       | EliteDesk to be fast enough for casual everyday use. If this
       | speedup can be applied generally to commonly used models, I can
       | serve a local ChatGPT experience for both friends and family from
       | my tiny homelab in my basement.
       | 
       |  _mind blown_
        
         | FooBarWidget wrote:
         | I find 7B models to be too stupid. They often respond with
         | nonsense or fail to follow instructions.
        
           | cooper_ganglia wrote:
           | Even 65B models approach a level of being almost usable, but
           | still fall short in my personal experience.
        
             | cloudking wrote:
             | This is why I don't understand the excitement around
             | open-source models: they pale in comparison to GPT-4 in
             | quality, so I have no use for them until we have something
             | comparable.
        
           | GaggiX wrote:
           | Even like OpenChat-3.5? (Probably the best 7B model out
           | there) Demo: https://openchat.team/
           | 
           | HuggingFace: https://huggingface.co/openchat/openchat_3.5
           | 
           | On the LLM arena (blinded comparisons), it's the third best
           | non-proprietary model:
           | https://huggingface.co/spaces/lmsys/chatbot-arena-
           | leaderboar...
        
             | sorokod wrote:
             | _What is the sum of odd numbers in this set: 4, 7, 12, 1,
             | 3_
             | 
             |  _The sum of odd numbers in the given set is 4 + 7 + 1 =
             | 12. Therefore, the answer is 12._
        
               | all2 wrote:
               | Technically 3 is even. \s
        
             | moffkalast wrote:
             | Are there any comparisons with Mistral-instruct? I've yet
             | to see anything under 30B beat it in any way.
        
       | quickthrower2 wrote:
       | If anyone is on the ball enough to turn this into a Colab or
       | notebook, that would be appreciated! Would love to see the code.
        
       ___________________________________________________________________
       (page generated 2023-11-22 23:00 UTC)