[HN Gopher] LoRA: Low-Rank Adaptation of Large Language Models
       ___________________________________________________________________
        
       LoRA: Low-Rank Adaptation of Large Language Models
        
       Author : eternalban
       Score  : 227 points
       Date   : 2023-03-24 12:15 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | eternalban wrote:
       | From the paper:
       | 
       |  _"Aghajanyan et al. (2020) shows that the pre-trained language
       | models have a low "instrisic dimension" and can still learn
       | efficiently despite a random projection to a smaller subspace."_
       | 
       | Would be great to have an informed practitioner comment (sota) on
       | why we opt for random projection. Is the actual 'intrinsic'
       | vector space uncomputable? Too slow to find?
        
         | moyix wrote:
         | Not an informed/sota practitioner, but isn't this just a
         | standard property of high dimensional spaces?
         | 
         | https://en.wikipedia.org/wiki/Random_projection
         | 
         | > The core idea behind random projection is given in the
         | Johnson-Lindenstrauss lemma, which states that if points in a
         | vector space are of sufficiently high dimension, then they may
         | be projected into a suitable lower-dimensional space in a way
         | which approximately preserves the distances between the points.
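          | 
          | A quick numerical sanity check of that idea (my own numpy
          | toy, nothing to do with the LoRA repo):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     X = rng.normal(size=(100, 10_000))  # 100 points, 10k dims
          |     P = rng.normal(size=(10_000, 256)) / np.sqrt(256)
          |     Y = X @ P          # random projection down to 256 dims
          | 
          |     def dists(M):      # pairwise Euclidean distances
          |         sq = (M ** 2).sum(1)
          |         d2 = sq[:, None] + sq[None, :] - 2 * M @ M.T
          |         return np.sqrt(np.maximum(d2, 0))
          | 
          |     i, j = np.triu_indices(100, k=1)
          |     ratio = dists(Y)[i, j] / dists(X)[i, j]
          |     print(ratio.mean(), ratio.std())  # ~1.0, small spread
          | 
          | The pairwise distances survive a random Gaussian projection
          | essentially intact, which is the lemma in action.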
        
         | stu2b50 wrote:
          | Random projections work well in high dimensional spaces:
          | they're cheap, easy, and require no understanding of the
          | initial space. Part of the point of LoRA is efficiency,
          | after all!
        
       | pgen wrote:
       | Name clash! https://en.wikipedia.org/wiki/LoRa#LoRaWAN
        
         | smusamashah wrote:
         | This term is already used when fine tuning stable diffusion
         | models. https://replicate.com/blog/lora-faster-fine-tuning-of-
         | stable...
        
           | Filligree wrote:
           | Isn't this actually the same thing?
        
           | TeMPOraL wrote:
           | Already used for like... a month or two.
        
         | mronetwo wrote:
          | It's Microsoft. They know. They just don't care.
        
         | pygy_ wrote:
         | Your case insensitive brain doesn't get it... It's LoRA, not
         | LoRa.
         | 
         | /s
        
         | ga_to wrote:
         | Microsoft has done this before with mauikit and mauilinux:
         | https://github.com/dotnet/maui/issues/35
         | 
         | Unlikely that they even consider checking whether they are
         | stomping across existing names.
        
           | capableweb wrote:
           | > Unlikely that they even consider checking whether they are
           | stomping across existing names.
           | 
            | Or it's on purpose: existing terms already have a good
            | amount of search traffic, and Microsoft knows Google/Bing
            | will rank Microsoft's own pages higher than what's already
            | out there.
        
           | Semaphor wrote:
           | They even do that with their own products...
        
         | capableweb wrote:
         | Easy, one is LoRA and the other one is LoRa, Microsoft made it
         | very distinct, as they always do.
        
           | fullstop wrote:
           | Just don't put the files in the same directory on an exFAT
           | drive
        
             | capableweb wrote:
              | or on macOS, although I'm not sure whether its file
              | system is still case-insensitive by default. I do
              | remember the first time that bit me though, being a
              | programmer using Linux collaborating with a developer
              | using macOS. Must have been in ~2005 or something.
        
               | fullstop wrote:
                | The one which bit me happened when I was running a java
                | minimizer / obfuscator on a Windows platform, and it
                | assumed that A.class was not the same as a.class. It
                | worked great on Linux, but on Windows it silently
                | overwrote one file with the other, resulting in a
                | package which almost worked.
        
           | dudeinjapan wrote:
           | And yet they sued Mike Rowe who made Mike Rowe Soft.
        
           | [deleted]
        
       | htrp wrote:
       | this is from 2 years ago
        
       | nummerfuenf wrote:
       | Can we stop naming things like stuff that already exists?
        
         | denysvitali wrote:
         | I was looking for this comment. Thank you!
        
         | postdb wrote:
         | Firebird!!!
        
         | wlesieutre wrote:
         | This is totally different, Microsoft's A is capitalized!
         | 
         | https://lora-alliance.org/
        
         | indeyets wrote:
         | https://github.com/microsoft/LoRA/issues/47
        
         | entropicdrifter wrote:
          | Unfortunately, no. It's even worse within the video game
          | industry. I'm not just talking Doom 4, er, Doom (2016). The
          | upcoming sequel to 2014's Lords of The Fallen? Well, that's
          | called Lords of The Fallen. They didn't even get two games
          | in before repeating the exact same name.
        
           | Agentlien wrote:
            | My favourite video game franchise in terms of confusing
            | names is the Jedi Knight franchise.
            | 
            | Star Wars: Dark Forces
            | 
            | Star Wars Jedi Knight: Dark Forces II
            | 
            | Star Wars Jedi Knight II: Jedi Outcast
            | 
            | Star Wars Jedi Knight: Jedi Academy
        
       | JustSomeNobody wrote:
       | Assholes. Don't call it LoRA!
       | 
       | There's already a technology called LoRa!
       | 
       | Fuck I hate this crap. Be better than this.
        
         | krolden wrote:
          | It's Microsoft; they'll only be better if they go under.
        
       | runnerup wrote:
       | There's a not insignificant intersection of projects and
       | developers who might be using both LoRA and LoRa at the same
       | time. What a terrible name collision. Hopefully this doesn't
       | become one of the foundational terms in AI that everyone must use
       | frequently like "Transformer".
        
         | davesque wrote:
         | Is this really a big problem? LoRa is a telecom thing. LoRA is
         | a machine learning thing. Yeah, they're adjacent industries but
         | still seems different enough to make it pretty easy to
         | distinguish. I had never heard of the LoRa alliance until you
         | mentioned it in this comment.
        
         | asddubs wrote:
         | yeah it really does seem like the AI folks are EEEing EE
        
         | seydor wrote:
         | transformer itself is ambiguous
         | 
         | But to be clear, LoRa is not related to ANN training, is it?
         | Why would they be using both?
        
         | Maxion wrote:
         | I was going to comment the same, horrible name collision.
         | Surprised they didn't notice it.
        
         | whalesalad wrote:
         | transformer and adapter are two of the new "ai terms" that
         | grind my gears
        
         | ahkurtz wrote:
         | Isn't there something really perfect about people working on a
         | language model either not trying or outright failing to use
         | that language model to tell them if their project name already
         | exists?
         | 
          | On their github they reference a related project called
          | "HuggingFace", so you know the sky's the limit with names in
          | this field; it could have been called anything else, really.
        
           | chaorace wrote:
           | > On their github they reference a related project called
           | "HuggingFace"
           | 
           | Quick jargon literacy boost: "HuggingFace" is a platform
           | tailored to hosting and sharing ML repositories -- like
           | Github for AI. The parent company, "Hugging Face", is also in
           | and of itself a major contributor to several AI research
           | projects & tooling.
           | 
           | Ironically, they still managed to hit a namespace
           | collision... albeit self-inflicted.
        
             | tmabraham wrote:
             | the actual platform is called "HuggingFace Hub". The
             | company itself is called "HuggingFace" or "Hugging Face" (I
             | have seen it referred to in both ways, I am unsure which is
             | officially correct). There is no namespace collision.
        
         | indeyets wrote:
         | https://github.com/microsoft/LoRA/issues/47
        
       | elil17 wrote:
       | Can someone ELI5 "LoRA reduces the number of trainable parameters
       | by learning pairs of rank-decompostion matrices while freezing
       | the original weights"?
        
         | MacsHeadroom wrote:
         | LoRA finds a subset of the original weights (about 1%) which
         | can be trained to achieve about the same result as training the
         | whole model while using 100x less compute.
         | 
         | Original weights frozen = Rather than modify the original
         | model, the training results are saved to a small file of only a
         | few MB.
         | 
         | In practice this means you can fine tune a 30B parameter model
         | on a consumer GPU in a couple of hours. Without LoRA you would
         | need to run multiple expensive data center GPUs for days or
         | weeks.
        
           | tylerekahn wrote:
            | It's actually as low as 0.01% of the original weights.
            | 
            | From the LoRA paper:
            | 
            | >When the pre-trained model is GPT-3 175B, the number of
            | trainable parameters |Θ| can be as small as 0.01% of
            | |Φ_0|.
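            | 
            | Back-of-the-envelope (my own numbers, not the paper's): a
            | single 12288 x 12288 projection in GPT-3 has ~151M
            | entries, so a full update trains ~151M parameters, while
            | a rank-1 pair A, B adds only 2 x 12288 = 24,576 trainable
            | parameters, about 0.016% of that one layer.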
        
           | pffft8888 wrote:
           | Is this the same as or similar to the Lottery Ticket concept
           | from a few years ago?
        
           | arugulum wrote:
           | >In practice this means you can fine tune a 30B parameter
           | model on a consumer GPU in a couple of hours.
           | 
            | Consumer GPU, yes, but in practice LoRA doesn't actually
            | reduce training time. What it mainly reduces is memory
            | requirements. In fact LoRA training can often require more
            | training steps than full fine-tuning and therefore be
            | slower (you can imagine why this is the case: the
            | optimization is trying to modify the model's behavior with
            | a smaller number of parameters, and so has a harder job).
        
           | stephanheijl wrote:
           | To be more exact, LoRA adds two matrices `A` and `B` to any
           | layers that contain trainable weights. The original weights
           | (`W_0`) have the shape `d x k`. These are frozen. Matrix `A`
           | has dimensions `d x <rank>` (`rank` is configurable) and
           | matrix `B` has the shape `<rank> x k`. A and B are then
           | multiplied and added to `W_0` to get altered weights. The
           | benefit here is that the extra matrices are small compared to
           | `W_0`, which means less parameters need to be optimized, so
           | less activations need to be stored in memory.
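            | 
            | A minimal sketch of that arithmetic in PyTorch (my own
            | toy, using the same d x k / A / B naming; not the repo's
            | code):
            | 
            |     import torch
            | 
            |     d, k, r = 1024, 1024, 8
            |     W0 = torch.randn(d, k)  # frozen original weights
            |     W0.requires_grad = False
            |     A = torch.randn(d, r, requires_grad=True)  # d x rank
            |     B = torch.zeros(r, k, requires_grad=True)  # rank x k
            | 
            |     x = torch.randn(32, d)
            |     y = x @ W0 + (x @ A) @ B  # = x @ (W0 + A @ B), but
            |                               # never builds the d x k update
            | 
            |     print(r * (d + k) / (d * k))  # ~0.016, i.e. ~1.6%
            | 
            | Note B starts at zero so A@B contributes nothing at
            | initialization, and only A and B ever get optimizer state.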
        
             | twic wrote:
             | Ah, so the resulting model contains both the large matrix
             | of original weights, and also the two small matrices of
             | alterations? But this is smaller than the alternative of a
             | model which contains the large matrix of original weights,
             | and an equally large matrix of alterations.
             | 
             | Why is fine-tuning done with separate alterations, rather
             | than by mutating the original weights?
        
               | TuringTest wrote:
               | It's larger, but there are less parameters to train for
               | your specific use case since you are training the small
               | matrix only, while the original ones remain unaltered.
        
               | arugulum wrote:
               | > Why is fine-tuning done with separate alterations,
               | rather than by mutating the original weights?
               | 
               | The goal of most parameter-efficient methods is to store
               | one gold copy of the original model, and learn minor
               | modifications/additions to the model. The easiest way to
               | think about this is in some kind of deployment setting,
               | where you have 1 capable model and you learn different
               | sets of LoRA weights for different tasks and
               | applications.
               | 
               | The original intent of parameter-efficient methods is to
               | reduce the amount of storage space needed for models (do
               | you really want to keep a whole additional copy of LLaMA
               | for each different task?). A secondary benefit is that
               | because you are fine-tuning a smaller number of
               | parameters, the optimizer states (can take up to 2x the
               | size of your model) are also heavily shrunk, which makes
               | it more economical (memory-wise) to (parameter-efficient)
               | fine-tune your model.
        
               | stu2b50 wrote:
                | > But this is smaller than the alternative of a model
                | which contains the large matrix of original weights,
                | and an equally large matrix of alterations.
                | 
                | It's actually larger. If you just have two equally
                | large matrices of the same dimension, one original, and
                | one of "alterations"... then you can just add them
                | together.
                | 
                | > Why is fine-tuning done with separate alterations,
                | rather than by mutating the original weights?
                | 
                | Then you'd have to compute the gradients for the whole
                | network, which is very expensive when the model has 7b,
                | 65b, or 175b parameters. The intent is to make that
                | cheaper by only computing gradients for a low rank
                | representation of the _change_ in the weight matrix
                | from training.
        
               | arugulum wrote:
               | >Then you'd have to compute the gradients for the whole
               | network
               | 
               | You have to do that with LoRA regardless, to compute the
               | gradients for the lowest-level LoRA weights.
        
               | gliptic wrote:
               | Correct me if I'm wrong, but I think you still need to
               | compute gradients of non-trained weights in order to
               | compute the gradients of the LoRA weights. What you don't
               | have to do is store and update the optimizer state for
               | all those non-trained weights.
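                | 
                | Roughly, in PyTorch terms (my own sketch):
                | 
                |     import torch
                | 
                |     base = torch.nn.Linear(512, 512)
                |     for p in base.parameters():
                |         # frozen: no stored grads, no optimizer state
                |         p.requires_grad = False
                | 
                |     r = 8
                |     A = torch.nn.Parameter(0.01 * torch.randn(r, 512))
                |     B = torch.nn.Parameter(torch.zeros(512, r))
                |     opt = torch.optim.AdamW([A, B])  # A, B state only
                | 
                |     x = torch.randn(4, 512)
                |     y = base(x) + (x @ A.T) @ B.T
                |     y.sum().backward()  # grads flow *through* base,
                |                         # but base.weight.grad is None
                |     opt.step()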
        
               | stu2b50 wrote:
               | I mean the derivative of a constant is 0. So if all of
               | the original weights are considered constants, then
               | computing their gradients is trivial, since they're just
               | zero.
        
               | jprafael wrote:
                | Computing gradients is easy/cheap. What this technique
                | saves is having to keep per-weight gradients and
                | optimizer state for all of the frozen parameters, which
                | saves expensive GPU RAM, allowing you to use commodity
                | hardware.
        
             | seydor wrote:
             | Can rank decomposition be used to reduce the original
             | weight matrices as well? Or are they assumed to be
             | compressed already?
        
             | grph123dot wrote:
             | Your explanation is crystal clear. I suppose it works well
             | in practice, but is there any reason it works that well?
        
               | stu2b50 wrote:
                | Per the original paper, empirically it's been found
                | that neural network weights often have low intrinsic
                | rank. It follows, then, that the change in the weights
                | as you train also has low intrinsic rank, which means
                | you should be able to represent it with a lower rank
                | matrix.
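                | 
                | You can see the idea with a quick SVD toy (my own, not
                | from the paper):
                | 
                |     import numpy as np
                | 
                |     rng = np.random.default_rng(0)
                |     d, true_r = 512, 8
                |     # an update that is low-rank plus slight noise
                |     dW = rng.normal(size=(d, true_r)) @ \
                |          rng.normal(size=(true_r, d))
                |     dW += 0.01 * rng.normal(size=(d, d))
                | 
                |     U, S, Vt = np.linalg.svd(dW)
                |     low = (U[:, :8] * S[:8]) @ Vt[:8]  # keep rank 8
                |     err = (np.linalg.norm(dW - low)
                |            / np.linalg.norm(dW))
                |     print(err)  # tiny: rank 8 captures almost all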
        
               | grph123dot wrote:
               | Since we are in ELI5, it seems that the concept of low
               | rank approximation is required to understand this method.
               | 
               | (1) https://en.wikipedia.org/wiki/Low-rank_approximation
               | 
                | Edited: By the way, it seems to me that there is an
                | error in the wikipedia page: if the low-rank
                | approximation uses a larger rank, the error bound
                | should decrease, but on that page the error increases.
        
               | grph123dot wrote:
                | >> that the change in the weights as you train also
                | has low intrinsic rank
                | 
                | It seems that the initial matrix of weights has a low
                | rank approximation A, which implies that the difference
                | E = W - A is small. Also, it seems that PCA fails when
                | E is sparse, because PCA is designed to be optimal when
                | the error is gaussian.
        
               | stu2b50 wrote:
               | In terms of PCA, PCA is also quite expensive
               | computationally. Additionally, you'd probably have to do
               | SVD instead.
               | 
               | Since the weights are derived from gradient descent, yeah
               | we don't really know what the distributions would be.
               | 
               | A random projection empirically works quite well for very
               | high dimensions, and is of course very cheap
               | computationally.
        
               | seydor wrote:
               | Does this mean the matrices are highly compressible?
        
           | quest88 wrote:
           | Is this the same as Knowledge Distillation (teacher-student
           | training)?
        
         | edwardjhu wrote:
         | Hi! I'm the author of the repo.
         | 
         | The insight is that we don't need to modify a lot of parameters
         | to get a generally competent model to do well on specific
         | tasks. When you have a linear layer with a weight matrix of
         | dimension d_in x d_out, the change you undergo during full
         | finetuning is also a matrix of d_in x d_out, which can be huge.
         | We represent the latter using two matrices of shape d_in x r
         | and r x d_out. You save a lot of parameters when r is small. So
          | when you use it, the input goes through two streams: 1) the
          | original frozen weight turning a vector of size d_in to
          | d_out, and 2) the low-rank weights turning a vector of size
          | d_in to r and then r to d_out. The two streams are then
          | summed together. (There's a figure in the paper.)
          | 
          | This way of doing things is nice for a few reasons. It's
          | easy to parallelize. You can change r to control how many
          | parameters to train. You can also merge the low-rank weights
          | with the original ones to avoid latency.
         | 
         | Note that we don't select a subset of the original parameters.
         | We train extra ones.
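          | 
          | In rough PyTorch terms (a toy sketch of the two streams
          | above, not code from the repo):
          | 
          |     import torch
          | 
          |     d_in, d_out, r = 512, 512, 4
          |     W0 = torch.randn(d_out, d_in)  # frozen pretrained weight
          |     B = torch.randn(d_out, r)      # trained low-rank pair
          |     A = torch.randn(r, d_in)
          | 
          |     x = torch.randn(1, d_in)
          |     # unmerged: two parallel streams, summed
          |     y = x @ W0.T + (x @ A.T) @ B.T
          |     # merged: fold BA into W0 once; no extra latency
          |     Wm = W0 + B @ A
          |     print(torch.allclose(y, x @ Wm.T, atol=1e-4))  # True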
        
           | loxias wrote:
            | Hi! I in _no way_ mean to detract from or malign or
            | "anything negative" the parent comment (communication is
            | hard!!), BUT I really must compliment that exact sentence.
            | :)
            | 
            | My background contains signal processing, "pre-deep
            | learning ML", systems engineering, and firmware, and that
            | sentence jumped out at me as crystal clear, despite not
            | knowing what HuggingFace is or PyTorch.
           | 
           | Correct me if I'm wrong: These huge models involve lots of
           | weights used in large matrices. The contribution of this work
           | is to plug in some matrix factorization and learn a lower
           | dimensional representation. Fantastic!
           | 
           | Also makes me wonder what other performance improvements
           | await through proper application of established and well
           | known Mathematics. :D
        
           | eternalban wrote:
           | Great, we can get authoritative answers. (I'm trying to
           | understand the ML space and have mostly done readings, not an
           | expert.)
           | 
           | I am assuming you can have n LoRA fine-tunings, say each
           | specializing in one aspect of a coherent task, with n
           | summers, running in parallel, and then combine them at the
           | end? Or more generally, does LoRA enable a sort of
           | modularizing around a core (un-merged) model?
           | 
           | And curious if you ever tried merging 2 or more fine-tunings
           | and then testing the resultant single model (merge all)
           | against the original tests to check retention?
        
       | zeckalpha wrote:
       | Quite different from https://en.m.wikipedia.org/wiki/LoRa
        
       | michaelhartm wrote:
        | Btw, it's kinda crazy how bad the GPT4-J results in the blog
        | are compared to the Dolly ones, which seem pretty good. Do we
        | know why it works so well to use this 50k dataset?
        
         | quadrature wrote:
         | Dolly is instruction fine tuned whereas GPT4-J is not. Which
         | means that it doesn't even understand that it is being
         | instructed to do something, it is just doing an autocomplete.
        
       | muny wrote:
       | Why use the same name as LoRa? https://lora-alliance.org/
       | 
       | Edit: Microsoft is even a member of the LoRa alliance:
       | https://lora-alliance.org/lora-alliance-press-release/micros...
        
         | edwardjhu wrote:
         | Good question! I came up with the name because the idea is best
         | described as low-rank adaptation. I know very little about
         | radio communication and didn't anticipate the visibility my
         | repo has today :)
        
         | StingyJelly wrote:
         | At least could have been LoRaA
        
         | stu2b50 wrote:
         | You're assuming a lot more intercompany coordination than would
         | exist. Even though it's research by Microsoft labs, the
         | researchers themselves are to a large extent autonomous and
         | also narrow experts in their fields.
         | 
         | This process involves low rank approximations -> Lora is a
         | namey sounding term that uses characters from low and rank ->
         | call it LoRA in the paper. That's all there was to it. Probably
         | didn't even know the other lora existed.
        
           | edwardjhu wrote:
           | Yup. That's exactly what happened.
        
         | anthk wrote:
         | Also Guix vs Guix...
        
         | FlyingRobot wrote:
         | I had to scan the readme to make sure this story wasn't about
         | applying machine learning to radio communication.
        
           | ChancyChance wrote:
           | Small CNNs can be used for BLE channel hopping and body
           | detection.
        
         | tylerekahn wrote:
          | Low Rank Adaptation is a mathematical technique, not a
          | technology standard
        
           | krolden wrote:
           | Then call it LoRad
        
           | samtho wrote:
            | It's still a currently-in-use acronym/term, and a
            | sufficiently large tech company could conceivably be using
            | both meanings concurrently. This causes confusion and
            | muddies the water of a general web search.
           | 
           | Not the same situation, but I remember when "Electron" was
           | called "Atom Shell" because it was built for the (now
           | defunct) text editor by the same name. For the longest time,
           | I had an unsubstantiated thought that it was a new Unix shell
           | that was based around a text editor somehow (yes, dumb). In
            | hindsight, they had just named it cleverly to reference the
            | various layers or shells of electrons orbiting atomic
            | nuclei, thus the eventual name of Electron.
           | 
           | On the other hand, a wireless technology standard is very
           | different than a known mathematical technique that likely
           | predates the wireless meaning anyway.
        
         | kkielhofner wrote:
          | In all seriousness, there should be better approaches to
          | naming ML projects (I should try asking ChatGPT). Naming a
          | project or a company is very difficult, so I can't blame
          | anyone here.
         | 
         | That said some of these ML project names are especially
         | horrendous (kind of ironic for the current emphasis on
         | generative AI). Transformers? A good chunk of the time I get
         | results about the toys and cartoons from my childhood. Don't
          | get me wrong, I still think Optimus Prime is cool, and the
          | name "transformers" makes sense given the function, but it's
          | somehow simultaneously generic AND the name of a decades-
          | long multi-billion dollar media franchise...
         | 
          | LoRA is another example: the name makes sense, but the
          | collision with LoRa is problematic. I, for one, am interested
          | in and have applied (or would apply) both. Cue google
          | searches for "Lora radio..." vs "Lora ml...".
         | 
         | Project naming is hard and I'm just glad to see the activity
         | and releases. BUT project naming is essentially a base
         | usability condition and should be considered as such: just like
         | creating a README, getting started, providing code examples,
         | etc.
         | 
         | It reminds me of trademarks: if you're looking for trademark
         | protection it won't be issued if it is overly generic or likely
         | to "cause confusion in the marketplace" with an existing
         | trademark (basically same or similar name in a somewhat
         | similar/adjacent field) - you can even reuse names but only if
         | it's obvious to people from basic context that they refer to
         | different things. I'm not a trademark attorney but I think LoRa
         | vs LoRA would get refused because it's "computer stuff", while
         | a shampoo named Lora would be fine (as an example). If you're
         | curious there are official categories/areas from the USPTO that
         | break these down.
         | 
         | Both of these examples wouldn't have a chance at trademark
         | protection. Note I'm not saying they should have trademark
         | protection, just that it's an example of a reasonable standard
         | that should be considered/compared to for good open source
         | project naming.
        
         | elcomet wrote:
         | There are many more things called lora.
         | 
         | https://en.m.wikipedia.org/wiki/Lora
         | 
         | It doesn't really matter as long as it's not in the same field.
         | No one will be confused between the two.
        
           | magicalhippo wrote:
           | > No one will be confused between the two.
           | 
           | Except search engines...
        
             | Filligree wrote:
             | That's okay, Bing-GPT doesn't get confused.
        
         | AdamH12113 wrote:
         | "LoRAd" was right there.
        
         | brodouevencode wrote:
         | https://en.wikipedia.org/wiki/LoRa for the communications
         | architecture
        
         | renewiltord wrote:
         | Why did the radio guys use the same name as this hotel from
         | Minnesota that existed for years before?
         | https://www.lorahotel.com/
         | 
         | I bet some of them have even been to Minnesota and they still
         | didn't pick a unique name.
         | 
         | Though both of them have to answer to why they picked the name
         | of a Google Font that preceded both and is currently available
         | https://web.archive.org/web/20170210001724/https://fonts.goo...
         | 
         | Is it because Microsoft is competing with Google in the AI
         | space?
        
           | reportgunner wrote:
           | Context. Individual hotels are not technology.
        
             | renewiltord wrote:
             | Indeed. And LLMs are not radios or fonts.
        
       | krossitalk wrote:
       | Maybe call it LoRALLMR (Laura Loomer)
        
       | timmg wrote:
       | This sounds similar to "prompt tuning":
       | https://ai.googleblog.com/2022/02/guiding-frozen-language-mo...
        
         | stu2b50 wrote:
         | It's actually completely different. What you linked is about
         | zero shot learning by adjusting the prompt, vs Lora which is
         | about actually fine tuning the weights of the model.
        
           | timmg wrote:
           | In that case, you can think of the prompt as being one vector
           | of the model that is being tuned while the rest is frozen.
           | 
           | Not exactly the same, to be sure. But fulfills a similar
           | need: more efficient "fine tuning" of a large model.
        
             | stu2b50 wrote:
             | I suppose that is true. You can even train the prompt with
             | gradient descent. But in practice, it ends up being fairly
             | different.
        
             | eternalban wrote:
             | They address prompt tuning's issues in the paper:
             | 
             |  _" The other direction, as exemplified by prefix tuning
             | (Li & Liang, 2021), faces a different challenge. We observe
             | that prefix tuning is difficult to optimize and that its
             | performance changes non-monotonically in trainable
             | parameters, confirming similar observations in the original
             | paper. More fundamentally, reserving a part of the sequence
             | length for adaptation necessarily reduces the sequence
             | length available to process a downstream task, which we
             | suspect makes tuning the prompt less performant compared to
             | other methods."_
             | 
             | https://ar5iv.labs.arxiv.org/html/2106.09685
             | 
             | This is key imo: _" More fundamentally, reserving a part of
             | the sequence length for adaptation necessarily reduces the
             | sequence length available to process a downstream task"_.
        
               | arugulum wrote:
               | LoRA conversely has different downsides. LoRA can be used
               | in two ways: merged or unmerged. Unmerged (which is how
               | it's trained) incurs a non-trivial computation cost.
               | Merged means you are modifying the model weights, which
               | means you are stuck with that one model on that device
               | (though, this usually applies for most implementations
               | for the unmerged versions too).
               | 
               | The benefit of prompt and prefix tuning (note: these are
               | two separate methods) is that you can serve different
               | soft-prompts and soft-prefixes efficiently with a single
               | shared set of model weights.
        
               | eternalban wrote:
                | https://ar5iv.labs.arxiv.org/html/2106.09685/assets/x1.png
               | 
               | > incurs a non-trivial computation cost
               | 
               | The hit seems to be in energy/cpu not time since the W0
               | computation is in parallel with the BAx. (My assumption
               | based on the latency claims in paper.) So an issue in
               | edge deployments (battery life, etc.).
               | 
               | > you are stuck with that one model on that device
               | 
               | Upfront I have 0 clue on the actual numbers, but from a
               | purely software architecture pov [in unmerged setup],
               | having that W0 forward process _once_ with n distinct BAx
               | paths (for distinct fine tunings!) would address that,
               | no?
               | 
                | [p.s. say an application that takes as input A/V+Txt,
                | runs that through an _Ensemble LoRA_ (ELoRA(tm) /g)
                | with each participant contributing its own BAx fine-
                | tuning processing, sharing the single pre-trained W0.]
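                | 
                | Toy sketch of what I mean (one shared W0 pass, n
                | adapter paths; my own pseudo-PyTorch, not from the
                | paper):
                | 
                |     import torch
                | 
                |     d, r, n = 512, 4, 3
                |     W0 = torch.randn(d, d)     # shared, frozen
                |     Bs = torch.randn(n, d, r)  # n fine-tunings
                |     As = torch.randn(n, r, d)
                | 
                |     x = torch.randn(1, d)
                |     base = x @ W0.T            # computed once
                |     outs = [base + (x @ As[i].T) @ Bs[i].T
                |             for i in range(n)]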
        
               | arugulum wrote:
               | > My assumption based on the latency claims in paper.
               | 
               | The latency claims are based on the merged version, where
               | the modifications are merged into the model weights.
               | Hence there is no latency cost, since the final model has
               | the same shape as the original.
               | 
               | > having that W0 forward process once with n distinct BAx
               | paths (for distinct fine tunings!) would address that,
               | no?
               | 
               | The tl;dr is that that works, but is more expensive. Not
               | ridiculously more expensive, but certainly more expensive
               | that processing a few additional tokens with
               | prefix/prompt tuning.
        
               | edwardjhu wrote:
               | > Merged means you are modifying the model weights, which
               | means you are stuck with that one model on that device
               | (though, this usually applies for most implementations
               | for the unmerged versions too).
               | 
               | If one is careful with floating point issues, it's
               | straightforward to unmerge the weights.
               | 
               | W_0 = W_1 - BA
               | 
               | Yes, prompt-based methods don't involve swapping weights.
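                | 
                | A toy sketch of that unmerge/re-merge:
                | 
                |     import torch
                | 
                |     d, r = 256, 4
                |     W0 = torch.randn(d, d)
                |     B = torch.randn(d, r)
                |     A = torch.randn(r, d)
                | 
                |     W1 = W0 + B @ A       # merge for serving
                |     W0_back = W1 - B @ A  # unmerge later
                |     print(torch.allclose(W0, W0_back,
                |                          atol=1e-5))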
        
               | arugulum wrote:
               | Right, it's mathematically easy (again, up to floating
               | point issues) to recover the weights as needed, but in
               | terms of distribution/serving I'm guessing the plan is to
               | have the original weights and carry around the LoRA
               | weights and merge as necessary.
               | 
               | (Also, I'm assuming you're the first author of LoRA.)
        
           | arugulum wrote:
           | Both LoRA and prompt tuning are parameter-efficient tuning
           | methods. Both of them inject new weights into the model and
           | tune them.
           | 
            | Prompt tuning does so by injecting additional prefix
            | tokens into the input to the model. LoRA does so by
            | injecting low-rank matrices that are additive
            | modifications to a set of linear layers in the model.
           | 
           | They both do something slightly different, but are very much
           | in the same class of methods.
        
       | eternalban wrote:
        | TIL about LoRA via
        | https://news.ycombinator.com/item?id=35287740
       | 
       | See also: Huggingface PEFT: https://github.com/huggingface/peft
        
       | sharemywin wrote:
        | Could the Bloom model be used with this training to build a
        | commercially-usable small-ish model?
        
         | stu2b50 wrote:
         | It cheapens the cost of fine tuning, it doesn't make the model
         | itself smaller at inference time.
        
       | outside1234 wrote:
        | Has anyone diaried out a good learning path for going from a
        | larger pre-trained model to a fine-tuned model? Trying to
        | understand all of the parts here, but it's sort of hard to
        | find anything linear...
        
       | alecco wrote:
       | *17 Jun 2021
        
       | rdedev wrote:
        | Came across this library in the past where you can easily add
        | LoRA and other efficient fine-tuning techniques into
        | huggingface models. Haven't tried it though, and support for
        | different models may be limited
       | 
       | https://adapterhub.ml/
        
       | Der_Einzige wrote:
        | I really hope this doesn't displace regular fine-tuning
        | techniques. Dreambooth is superior in quality to LoRA for
        | image generation, and I suspect that it's similar with LLMs.
        
         | brucethemoose2 wrote:
          | There are some WIP evolutions of SD LoRA in the works, like
          | LoCon and LyCORIS.
         | 
         | https://github.com/KohakuBlueleaf/LyCORIS
        
         | eternalban wrote:
         | https://dreambooth.github.io/
         | 
         | The LoRA paper's 'problem statement' makes a compelling case
         | for practical benefits of the approach. Specifically, no added
         | latency, no serial processing bottlenecks, shared baseline
         | model, compact time/space requirements. How does dreambooth
         | stack up in this regard?
        
           | mattnewton wrote:
           | In the image space, dreambooth full-model tunes can handle
           | multiple concepts and tend to be easier to get hard/complex
           | things like a person's likeness correct. I've found that LoRA
           | tunes struggle to be accepted by people as producing their
           | own face compared to full dreambooth models tuned on the same
           | inputs, most likely because we are very sensitive to facial
           | differences of faces we are very familiar with. I haven't
           | seen this effect for styles or other kinds of concepts, where
           | people are a little less sensitive about the fidelity. LoRA
            | is much easier to train, easier to compose, and can in
            | many cases have the base model swapped out, though, so if
            | it's good enough for the concept you are trying to add to
            | the model, it's often worth the subtle quality loss.
        
         | stu2b50 wrote:
          | I suspect it's not that similar. The intuition behind LoRA
          | holds more strongly the higher the rank of the model's
          | weights. Even the smallest LLMs have considerably higher
          | rank weights than Stable Diffusion. They are _large_, after
          | all.
        
       | numlocked wrote:
       | For those wondering why this is interesting: This technique is
       | being used to reproduce[0] the Alpaca results from Stanford[1]
       | with a few hours of training on consumer-grade hardware.
       | 
       | I believe there will soon be a cottage industry of providing
       | application-specific fine-tuned models like this, that can run in
       | e.g. AWS very inexpensively. The barrier today seems to be that
       | the base model (here, Meta's LLaMA) is encumbered and can't be
       | used commercially. Someone will soon, I'm confident, release e.g.
       | an MIT-licensed equivalent and we'll all be off to the races.
       | 
       | [0] https://github.com/tloen/alpaca-lora
       | 
       | [1] https://crfm.stanford.edu/2023/03/13/alpaca.html
        
         | GaggiX wrote:
          | In addition, for the past month or two this technique has
          | been used to fine-tune Stable Diffusion models.
        
           | terafo wrote:
              | Closer to 4 months. It is much better than having a
              | bunch of 2-4gb models lying around.
        
             | GaggiX wrote:
              | 4 months? I don't think so, people really started using
              | LoRA when it was added to the diffusers library less
              | than 2 months ago. This library is used by the training
              | plugin of the automatic webui. I guess time seems to
              | flow more slowly when many things happen.
        
             | dragonwriter wrote:
              | The month-or-two timeframe seems to match
              | Lycoris/LoCon, which as I understand (haven't dug into
              | the details on this) is a newer refinement of LoRA.
              | LoRA has been used for longer, correct.
        
               | GaggiX wrote:
               | The LyCORIS/LoCon repo started committing 1 month ago and
               | almost no one is using it except for a few experiments
               | (not even the automatic webui supports it without a
               | plugin).
        
               | dragonwriter wrote:
               | Judging from activity on Civitai, I think "almost no one
               | is using it except for a few experiments" is _very_
               | wrong. Sure, A1111 needs a plugin for it; it needs a
               | plugin for ControlNET, too, but that is _also_ quite
               | popular.
        
               | GaggiX wrote:
                | I'm also judging from the activity on CivitAI: the
                | most downloaded ones (>1000 downloads, not many) are
                | actually just LoRA, with LoCon in another
                | (experimental) branch of the CivitAI page. Definitely
                | not " _very_ wrong" ahah
               | 
               | >it needs a plugin for ControlNET
               | 
                | The big difference is that ControlNet actually
                | required a pretty complex interface to be used
                | effectively, whereas the use of LoCon/LyCORIS should
                | be completely transparent and work like a LoRA
        
               | Agentlien wrote:
               | ControlNet is built in as of maybe two weeks ago and no
               | longer requires an extension. I started using it when the
               | built-in support arrived and have had a lot of fun with
               | it since.
        
         | smaddox wrote:
         | There's already RWKV, if you want a decent performing pre-
         | trained model that's Apache 2.0 licensed:
         | https://twitter.com/BlinkDL_AI/status/1638555109373378560?s=...
        
           | pffft8888 wrote:
           | https://news.ycombinator.com/item?id=35281026
        
         | polyterative wrote:
         | Thanks! Hard to follow this stuff sometimes with all the news
        
         | romanzubenko wrote:
          | Today Databricks announced [0] a 6b parameter model from
          | EleutherAI finetuned on the Alpaca dataset. According to
          | their CEO [1], training took 3 hours and cost $30. They
          | didn't release any details on how it was trained, but
          | likely with LoRA.
         | 
         | [0] https://www.databricks.com/blog/2023/03/24/hello-dolly-
         | democ... [1]
         | https://twitter.com/alighodsi/status/1639251347777388544
        
           | numlocked wrote:
           | Interesting. I wonder what the training cost was for:
           | 
           | https://huggingface.co/EleutherAI/gpt-neox-20b
           | 
           | Perhaps it's in the paper...
        
             | michaelhartm wrote:
             | They used the 6b GPT4-J, not 20B. That's what's
             | interesting, it's a smallish large language model :).
        
               | dragonwriter wrote:
               | GPT-J, not GPT4-J.
        
           | int_19h wrote:
           | There are also some LLaMA LoRAs that are trained on the
           | Anthropic dataset specifically for chat:
           | 
           | https://huggingface.co/serpdotai
           | 
           | I haven't done any formal tests on this yet, but with
           | llama-13b, the overall structure of its responses definitely
           | becomes much more ChatGPT-like. It would be very interesting
           | to see how the 65B model performs.
        
           | m3affan wrote:
          | Let the revolution begin
        
         | outside1234 wrote:
         | Or, more importantly than in AWS, locally in disconnected or
         | poorly connected scenarios like in-vehicle or in-home.
        
         | arugulum wrote:
         | > This technique is being used to reproduce[0] the Alpaca
         | results from Stanford[1]
         | 
         | Reproduced is a strong statement, without any rigorous
         | justification other than a few cherry-picked examples. Alpaca-
         | LoRA is simply LLaMA with LoRA-tuning on the Alpaca data. There
         | are no metrics, no measurements, no evaluations to show that
         | the Alpaca-LoRA performs similarly to Alpaca, when it is well-
         | known in the field that parameter-efficient fine-tuning always
         | pays a cost in terms of performance relative to full fine-
         | tuning (which is what Alpaca does).
         | 
          | (This has been a huge nit for me because of the recent flood
          | of Alpaca replications, or even claims that Alpaca is
          | comparable to ChatGPT, rushing to market themselves with
          | nothing to justify their claims.)
        
           | numlocked wrote:
           | I agree - my comment originally had a parenthetical about
           | this fact, but I thought it was probably confusing to people
           | who just wanted to understand what this was about. Perhaps I
           | shouldn't have edited it out.
           | 
           | It also bothers me that a lot of LoRA claims read like "You
           | won't believe how little it costs to train these models!",
           | when of course 99%+ of the complexity and cost is in the
           | LLaMA (or whatever) model that underpins it. Folks are
           | talking about it in a loose way that implies some kind of
           | miraculous overall training cost breakthrough.
        
           | GaggiX wrote:
           | >when it is well-known in the field that parameter-efficient
           | fine-tuning always pays a cost in terms of performance
           | relative to full fine-tuning
           | 
           | The LoRA paper clearly states the performance of the method
           | "LoRA performs on-par or better than fine-tuning in model
           | quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having
           | fewer trainable parameters, a higher training throughput,
           | and, unlike adapters, no additional inference latency. ":
           | https://arxiv.org/abs/2106.09685
        
             | arugulum wrote:
             | I don't want to get into the weeds of the subtleties of
             | evaluation, hyperparameter-tuning and model comparisons,
             | but let's just say that subsequent studies have shown that
             | LoRA (consistent with most parameter-efficient tuning
             | methods) underperform full fine-tuning:
             | https://arxiv.org/abs/2203.06904
             | 
              | A simple way to think about it is this: if LoRA really
              | gives full fine-tuning performance, why would anyone
              | ever fully fine-tune a model?
        
               | GaggiX wrote:
               | >why would anyone ever fully fine-tune a model?
               | 
               | You're asking it as if it were a rhetorical question, but
               | I think it carries more weight than many people seem to
               | believe.
        
               | arugulum wrote:
               | To balance my view a little, it is definitely a valid
               | question to ask "how far can we get with parameter-
               | efficient tuning", and I firmly believe that as models
               | get larger, the answer is "very, very far".
               | 
               | That said, I also dislike it when it is carelessly
               | claimed that parameter-efficient tuning is as good as
               | full fine-tuning, without qualifications or nuance.
        
       | jprafael wrote:
        | If this works, is there any theory for why training models
        | with low rank layers (y = (A.B).x + b) directly doesn't work?
        | (Or do they?)
        
       ___________________________________________________________________
       (page generated 2023-03-24 23:00 UTC)