[HN Gopher] LoRA: Low-Rank Adaptation of Large Language Models ___________________________________________________________________ LoRA: Low-Rank Adaptation of Large Language Models Author : eternalban Score : 227 points Date : 2023-03-24 12:15 UTC (10 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | eternalban wrote: | From the paper: | | _"Aghajanyan et al. (2020) shows that the pre-trained language | models have a low "intrinsic dimension" and can still learn | efficiently despite a random projection to a smaller subspace."_ | | Would be great to have an informed practitioner comment (sota) on | why we opt for random projection. Is the actual 'intrinsic' | vector space uncomputable? Too slow to find? | moyix wrote: | Not an informed/sota practitioner, but isn't this just a | standard property of high dimensional spaces? | | https://en.wikipedia.org/wiki/Random_projection | | > The core idea behind random projection is given in the | Johnson-Lindenstrauss lemma, which states that if points in a | vector space are of sufficiently high dimension, then they may | be projected into a suitable lower-dimensional space in a way | which approximately preserves the distances between the points. | stu2b50 wrote: | Random projections work well in high-dimensional spaces; they're | cheap, easy, and require no understanding of the initial space. | Part of the point of Lora is efficiency, after all! | pgen wrote: | Name clash! https://en.wikipedia.org/wiki/LoRa#LoRaWAN | smusamashah wrote: | This term is already used when fine tuning stable diffusion | models. https://replicate.com/blog/lora-faster-fine-tuning-of- | stable... | Filligree wrote: | Isn't this actually the same thing? | TeMPOraL wrote: | Already used for like... a month or two. | mronetwo wrote: | it's Microsoft. they know. they just don't care | pygy_ wrote: | Your case insensitive brain doesn't get it... It's LoRA, not | LoRa. | | /s | ga_to wrote: | Microsoft has done this before with mauikit and mauilinux: | https://github.com/dotnet/maui/issues/35 | | Unlikely that they even consider checking whether they are | stomping across existing names. | capableweb wrote: | > Unlikely that they even consider checking whether they are | stomping across existing names. | | Or it's on purpose, as existing terms already have a good amount | of search traffic, and Microsoft knows | Google/Bing will rank Microsoft's own pages higher than | what's already out there. | Semaphor wrote: | They even do that with their own products... | capableweb wrote: | Easy, one is LoRA and the other one is LoRa, Microsoft made it | very distinct, as they always do. | fullstop wrote: | Just don't put the files in the same directory on an exFAT | drive | capableweb wrote: | or on macOS, although I'm not sure whether it still uses a | case-sensitive file system by default or not. I do remember the | first time that bit me though, being a programmer using | Linux collaborating with a developer using macOS. Must have | been in ~2005 or something. | fullstop wrote: | The one which bit me happened when I was running a java | minimizer / obfuscator on a Windows platform and it | assumed that A.class was not the same as a.class. It | worked great on Linux and didn't warn that it had | overwritten a file, resulting in a package which almost | worked. | dudeinjapan wrote: | And yet they sued Mike Rowe who made Mike Rowe Soft. 
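A minimal numerical sketch of the random-projection point made by moyix and stu2b50 near the top of this thread: project random high-dimensional points through a random Gaussian matrix and check how well pairwise distances survive, per the Johnson-Lindenstrauss lemma. The dimensions below are illustrative assumptions, not values from the paper, and only NumPy is assumed.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, r = 20, 5000, 1024       # 20 points, 5000 dims, projected down to 1024

    X = rng.normal(size=(n, d))                  # random high-dimensional points
    P = rng.normal(size=(d, r)) / np.sqrt(r)     # random Gaussian projection, scaled
    Y = X @ P                                    # projected points

    def pairwise_dists(Z):
        # Euclidean distance between every pair of rows
        diff = Z[:, None, :] - Z[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))

    i, j = np.triu_indices(n, 1)
    ratios = pairwise_dists(Y)[i, j] / pairwise_dists(X)[i, j]
    print(ratios.min(), ratios.max())            # typically within a few percent of 1.0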
| [deleted] | htrp wrote: | this is from 2 years ago | nummerfuenf wrote: | Can we stop naming things like stuff that already exists? | denysvitali wrote: | I was looking for this comment. Thank you! | postdb wrote: | Firebird!!! | wlesieutre wrote: | This is totally different, Microsoft's A is capitalized! | | https://lora-alliance.org/ | indeyets wrote: | https://github.com/microsoft/LoRA/issues/47 | entropicdrifter wrote: | Unfortunately, no. It's even worse within the video game | industry. I'm not just talking Doom 4, er, Doom (2016). The | upcoming sequel to 2014's Lords of The Fallen? Well that's | called Lords of The Fallen. They didn't even get 2 games in | before repeating the exact same name. | Agentlien wrote: | My favourite video game franchise in terms of confusing names | is the Jedi Knight franchise. | | Star Wars: Dark Forces | | Star Wars Jedi Knight: Dark Forces 2 | | Star Wars Jedi Knight 2: Jedi Outcast | | Star Wars Jedi Knight: Jedi Academy | JustSomeNobody wrote: | Assholes. Don't call it LoRA! | | There's already a technology called LoRa! | | Fuck I hate this crap. Be better than this. | krolden wrote: | It's Microsoft; they'll only be better if they go under. | runnerup wrote: | There's a not insignificant intersection of projects and | developers who might be using both LoRA and LoRa at the same | time. What a terrible name collision. Hopefully this doesn't | become one of the foundational terms in AI that everyone must use | frequently like "Transformer". | davesque wrote: | Is this really a big problem? LoRa is a telecom thing. LoRA is | a machine learning thing. Yeah, they're adjacent industries but | still seems different enough to make it pretty easy to | distinguish. I had never heard of the LoRa alliance until you | mentioned it in this comment. | asddubs wrote: | yeah it really does seem like the AI folks are EEEing EE | seydor wrote: | transformer itself is ambiguous | | But to be clear, LoRa is not related to ANN training, is it? | Why would they be using both? | Maxion wrote: | I was going to comment the same, horrible name collision. | Surprised they didn't notice it. | whalesalad wrote: | transformer and adapter are two of the new "ai terms" that | grind my gears | ahkurtz wrote: | Isn't there something really perfect about people working on a | language model either not trying or outright failing to use | that language model to tell them if their project name already | exists? | | On their github they reference a related project called | "HuggingFace" so you know the sky's the limit with the names in | this field, could have been called anything else really. | chaorace wrote: | > On their github they reference a related project called | "HuggingFace" | | Quick jargon literacy boost: "HuggingFace" is a platform | tailored to hosting and sharing ML repositories -- like | Github for AI. The parent company, "Hugging Face", is also in | and of itself a major contributor to several AI research | projects & tooling. | | Ironically, they still managed to hit a namespace | collision... albeit self-inflicted. | tmabraham wrote: | the actual platform is called "HuggingFace Hub". The | company itself is called "HuggingFace" or "Hugging Face" (I | have seen it referred to in both ways, I am unsure which is | officially correct). There is no namespace collision. 
| indeyets wrote: | https://github.com/microsoft/LoRA/issues/47 | elil17 wrote: | Can someone ELI5 "LoRA reduces the number of trainable parameters | by learning pairs of rank-decomposition matrices while freezing | the original weights"? | MacsHeadroom wrote: | LoRA finds a subset of the original weights (about 1%) which | can be trained to achieve about the same result as training the | whole model while using 100x less compute. | | Original weights frozen = Rather than modify the original | model, the training results are saved to a small file of only a | few MB. | | In practice this means you can fine tune a 30B parameter model | on a consumer GPU in a couple of hours. Without LoRA you would | need to run multiple expensive data center GPUs for days or | weeks. | tylerekahn wrote: | It's actually as low as 0.01% of the original weights. | | From the LoRA paper: | | >When the pre-trained model is GPT-3 175B, the number of | trainable parameters |Θ| can be as small as 0.01% of | |Φ_0|. | pffft8888 wrote: | Is this the same as or similar to the Lottery Ticket concept | from a few years ago? | arugulum wrote: | >In practice this means you can fine tune a 30B parameter | model on a consumer GPU in a couple of hours. | | Consumer GPU, yes, but in practice LoRA doesn't actually | reduce training time. What it mainly reduces is memory | requirements. In fact LoRA training can often require more | training steps than full fine-tuning and therefore be slower | (you can imagine why this is the case: the optimization is | trying to modify the model's behavior with a smaller number of | parameters, and so has a harder job) | stephanheijl wrote: | To be more exact, LoRA adds two matrices `A` and `B` to any | layers that contain trainable weights. The original weights | (`W_0`) have the shape `d x k`. These are frozen. Matrix `A` | has dimensions `d x <rank>` (`rank` is configurable) and | matrix `B` has the shape `<rank> x k`. A and B are then | multiplied and added to `W_0` to get altered weights. The | benefit here is that the extra matrices are small compared to | `W_0`, which means fewer parameters need to be optimized, so | fewer activations need to be stored in memory. | twic wrote: | Ah, so the resulting model contains both the large matrix | of original weights, and also the two small matrices of | alterations? But this is smaller than the alternative of a | model which contains the large matrix of original weights, | and an equally large matrix of alterations. | | Why is fine-tuning done with separate alterations, rather | than by mutating the original weights? | TuringTest wrote: | It's larger, but there are fewer parameters to train for | your specific use case since you are training the small | matrix only, while the original ones remain unaltered. | arugulum wrote: | > Why is fine-tuning done with separate alterations, | rather than by mutating the original weights? | | The goal of most parameter-efficient methods is to store | one gold copy of the original model, and learn minor | modifications/additions to the model. The easiest way to | think about this is in some kind of deployment setting, | where you have 1 capable model and you learn different | sets of LoRA weights for different tasks and | applications. | | The original intent of parameter-efficient methods is to | reduce the amount of storage space needed for models (do | you really want to keep a whole additional copy of LLaMA | for each different task?). 
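To make the frozen-weight-plus-low-rank-update description above (stephanheijl's comment) concrete, here is a minimal sketch of a LoRA-style linear layer in PyTorch. It is an illustration of the technique discussed in this thread, not code from Microsoft's repo; the dimensions, initialization scales, and the alpha/r scaling factor are assumptions (the paper initializes one factor to zero so that training starts from the unmodified pretrained layer).

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen base weight W_0 plus a trainable low-rank update, as described above."""
        def __init__(self, d_in, d_out, r=8, alpha=16):
            super().__init__()
            # W_0: the original pretrained weight, frozen (requires_grad=False)
            self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02, requires_grad=False)
            # Low-rank factors mapping d_in -> r -> d_out. The "up" factor starts at zero,
            # so the layer initially behaves exactly like the frozen base layer.
            self.lora_down = nn.Parameter(torch.randn(r, d_in) * 0.01)
            self.lora_up = nn.Parameter(torch.zeros(d_out, r))
            self.scale = alpha / r

        def forward(self, x):
            base = x @ self.weight.T                          # frozen stream
            update = (x @ self.lora_down.T) @ self.lora_up.T  # trainable low-rank stream
            return base + self.scale * update

    layer = LoRALinear(d_in=1024, d_out=1024, r=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"trainable: {trainable:,} of {total:,}")   # 16,384 of 1,064,960 for this layer

The printed count makes the storage argument tangible: only the two small factors are trainable and need to be saved per task, while the full-size weight stays frozen and shared.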
A secondary benefit is that | because you are fine-tuning a smaller number of | parameters, the optimizer states (can take up to 2x the | size of your model) are also heavily shrunk, which makes | it more economical (memory-wise) to (parameter-efficient) | fine-tune your model. | stu2b50 wrote: | > But this is smaller than the alternative of a model | which contains the large matrix of original weights, and | an equally large matrix of alterations. | | It's actually larger. If you just have two equally large | matrices of the same dimension, one original, and one of | "alterations"... then you can just add them together. | | > Why is fine-tuning done with separate alterations, | rather than by mutating the original weights? | | Then you'd have to compute the gradients for the whole | network, which is very expensive when the model has 7b, | 65b, 165b parameters. The intent is to make that cheaper | by only computing gradients for a low rank representation | of the _change_ in the weight matrix from training. | arugulum wrote: | >Then you'd have to compute the gradients for the whole | network | | You have to do that with LoRA regardless, to compute the | gradients for the lowest-level LoRA weights. | gliptic wrote: | Correct me if I'm wrong, but I think you still need to | compute gradients of non-trained weights in order to | compute the gradients of the LoRA weights. What you don't | have to do is store and update the optimizer state for | all those non-trained weights. | stu2b50 wrote: | I mean the derivative of a constant is 0. So if all of | the original weights are considered constants, then | computing their gradients is trivial, since they're just | zero. | jprafael wrote: | Computing gradients is easy/cheap. What this technique | solves is that you no longer need to store the computed | values of the gradient until the backpropagation phase, | which saves on expensive GPU RAM, allowing you to use | commodity hardware. | seydor wrote: | Can rank decomposition be used to reduce the original | weight matrices as well? Or are they assumed to be | compressed already? | grph123dot wrote: | Your explanation is crystal clear. I suppose it works well | in practice, but is there any reason it works that well? | stu2b50 wrote: | Per the original paper, empirically it's been found that | neural network weights often have low intrinsic rank. It | follows, then, that the change in the weights as you | train also has low intrinsic rank, which means that you | should be able to represent them with a lower rank matrix. | grph123dot wrote: | Since we are in ELI5, it seems that the concept of low | rank approximation is required to understand this method. | | (1) https://en.wikipedia.org/wiki/Low-rank_approximation | | Edited: By the way, it seems to me that there is an error | in the wikipedia page, because if the low-rank | approximation takes a larger rank then the bound of the | error should decrease, and in this page the error | increases. | grph123dot wrote: | >> that the change in the weights as you train also has | low intrinsic rank | | It seems that the initial matrix of weights has a low | rank approximation A, and this implies that the difference | E = W - A is small; also, it seems that PCA fails when E | is sparse, because PCA is designed to be optimal when the | error is Gaussian. | stu2b50 wrote: | In terms of PCA, PCA is also quite expensive | computationally. Additionally, you'd probably have to do | SVD instead. 
| | Since the weights are derived from gradient descent, yeah | we don't really know what the distributions would be. | | A random projection empirically works quite well for very | high dimensions, and is of course very cheap | computationally. | seydor wrote: | Does this mean the matrices are highly compressible? | quest88 wrote: | Is this the same as Knowledge Distillation (teacher-student | training)? | edwardjhu wrote: | Hi! I'm the author of the repo. | | The insight is that we don't need to modify a lot of parameters | to get a generally competent model to do well on specific | tasks. When you have a linear layer with a weight matrix of | dimension d_in x d_out, the change you undergo during full | finetuning is also a matrix of d_in x d_out, which can be huge. | We represent the latter using two matrices of shape d_in x r | and r x d_out. You save a lot of parameters when r is small. So | when you use it, the input goes through two streams: 1) the | original frozen weight turning a vector of size d_in to d_out | and 2) the low-rank weights turning a vector of size d_in to r | and r to d_out. The two streams are then summed together. | (There's a figure in the paper.) | | This way of doing things is nice for a few reasons. It's easy to | parallelize. You can change r to control how many parameters to | train. You can also merge the low-rank weights with the | original one to avoid latency. | | Note that we don't select a subset of the original parameters. | We train extra ones. | loxias wrote: | Hi! I in _no way_ mean to detract or malign or "anything | negative" the parent comment (communication is hard!!), BUT I | really compliment that exact sentence. :) | | My background contains signal processing, "pre-deep learning | ML", systems engineering, and firmware, and that sentence | jumped out at me as crystal clear in my mind, despite not | knowing what HuggingFace is or PyTorch. | | Correct me if I'm wrong: These huge models involve lots of | weights used in large matrices. The contribution of this work | is to plug in some matrix factorization and learn a lower | dimensional representation. Fantastic! | | Also makes me wonder what other performance improvements | await through proper application of established and well | known Mathematics. :D | eternalban wrote: | Great, we can get authoritative answers. (I'm trying to | understand the ML space and have mostly done readings, not an | expert.) | | I am assuming you can have n LoRA fine-tunings, say each | specializing in one aspect of a coherent task, with n | summers, running in parallel, and then combine them at the | end? Or more generally, does LoRA enable a sort of | modularizing around a core (un-merged) model? | | And curious if you ever tried merging 2 or more fine-tunings | and then testing the resultant single model (merge all) | against the original tests to check retention? | zeckalpha wrote: | Quite different from https://en.m.wikipedia.org/wiki/LoRa | michaelhartm wrote: | Btw, it's kinda crazy how bad the GPT4-J results in the blog are | compared to the Dolly one, which seems pretty good. Do we know why | it works so well to use this 50k dataset? | quadrature wrote: | Dolly is instruction fine tuned whereas GPT4-J is not. Which | means that it doesn't even understand that it is being | instructed to do something; it is just doing an autocomplete. | muny wrote: | Why use the same name as LoRa? 
https://lora-alliance.org/ | | Edit: Microsoft is even a member of the LoRa alliance: | https://lora-alliance.org/lora-alliance-press-release/micros... | edwardjhu wrote: | Good question! I came up with the name because the idea is best | described as low-rank adaptation. I know very little about | radio communication and didn't anticipate the visibility my | repo has today :) | StingyJelly wrote: | At least could have been LoRaA | stu2b50 wrote: | You're assuming a lot more intercompany coordination than would | exist. Even though it's research by Microsoft labs, the | researchers themselves are to a large extent autonomous and | also narrow experts in their fields. | | This process involves low rank approximations -> Lora is a | namey sounding term that uses characters from low and rank -> | call it LoRA in the paper. That's all there was to it. Probably | didn't even know the other lora existed. | edwardjhu wrote: | Yup. That's exactly what happened. | anthk wrote: | Also Guix vs Guix... | FlyingRobot wrote: | I had to scan the readme to make sure this story wasn't about | applying machine learning to radio communication. | ChancyChance wrote: | Small CNNs can be used for BLE channel hopping and body | detection. | tylerekahn wrote: | Low Rank Adaptation is a mathematical technique, it's not a | technology standard | krolden wrote: | Then call it LoRad | samtho wrote: | It's still a currently-in-use acronym/term, and a | sufficiently large tech company could conceivably be using | both meanings concurrently. This causes confusion and muddies | the water of a general web search experience. | | Not the same situation, but I remember when "Electron" was | called "Atom Shell" because it was built for the (now | defunct) text editor by the same name. For the longest time, | I had an unsubstantiated thought that it was a new Unix shell | that was based around a text editor somehow (yes, dumb). In | hindsight, they had just named this cleverly to reference the | various layers or shells of electrons orbiting atomic nuclei, | thus the eventual name of Electron. | | On the other hand, a wireless technology standard is very | different from a known mathematical technique that likely | predates the wireless meaning anyway. | kkielhofner wrote: | In all seriousness there should be ML project naming approaches | (I should try ChatGPT). Naming a project or a company is very | difficult so I can't blame anyone here. | | That said some of these ML project names are especially | horrendous (kind of ironic for the current emphasis on | generative AI). Transformers? A good chunk of the time I get | results about the toys and cartoons from my childhood. Don't | get me wrong, I still think Optimus Prime is cool and the name | "transformers" makes sense given the function but it's somehow | simultaneously generic AND the name of a decades-long multi- | billion dollar media franchise... | | LoRA is another example, the name makes sense but the collision | with LoRa is problematic. I, for one, am interested in and | have/would apply both. Cue Google searches for "Lora | radio..." vs "Lora ml...". | | Project naming is hard and I'm just glad to see the activity | and releases. BUT project naming is essentially a base | usability condition and should be considered as such: just like | creating a README, getting started, providing code examples, | etc. 
| | It reminds me of trademarks: if you're looking for trademark | protection it won't be issued if it is overly generic or likely | to "cause confusion in the marketplace" with an existing | trademark (basically same or similar name in a somewhat | similar/adjacent field) - you can even reuse names but only if | it's obvious to people from basic context that they refer to | different things. I'm not a trademark attorney but I think LoRa | vs LoRA would get refused because it's "computer stuff", while | a shampoo named Lora would be fine (as an example). If you're | curious there are official categories/areas from the USPTO that | break these down. | | Both of these examples wouldn't have a chance at trademark | protection. Note I'm not saying they should have trademark | protection, just that it's an example of a reasonable standard | that should be considered/compared to for good open source | project naming. | elcomet wrote: | There are many more things called lora. | | https://en.m.wikipedia.org/wiki/Lora | | It doesn't really matter as long as it's not in the same field. | No one will be confused between the two. | magicalhippo wrote: | > No one will be confused between the two. | | Except search engines... | Filligree wrote: | That's okay, Bing-GPT doesn't get confused. | AdamH12113 wrote: | "LoRAd" was right there. | brodouevencode wrote: | https://en.wikipedia.org/wiki/LoRa for the communications | architecture | renewiltord wrote: | Why did the radio guys use the same name as this hotel from | Minnesota that existed for years before? | https://www.lorahotel.com/ | | I bet some of them have even been to Minnesota and they still | didn't pick a unique name. | | Though both of them have to answer to why they picked the name | of a Google Font that preceded both and is currently available | https://web.archive.org/web/20170210001724/https://fonts.goo... | | Is it because Microsoft is competing with Google in the AI | space? | reportgunner wrote: | Context. Individual hotels are not technology. | renewiltord wrote: | Indeed. And LLMs are not radios or fonts. | krossitalk wrote: | Maybe call it LoRALLMR (Laura Loomer) | timmg wrote: | This sounds similar to "prompt tuning": | https://ai.googleblog.com/2022/02/guiding-frozen-language-mo... | stu2b50 wrote: | It's actually completely different. What you linked is about | zero shot learning by adjusting the prompt, vs Lora which is | about actually fine tuning the weights of the model. | timmg wrote: | In that case, you can think of the prompt as being one vector | of the model that is being tuned while the rest is frozen. | | Not exactly the same, to be sure. But fulfills a similar | need: more efficient "fine tuning" of a large model. | stu2b50 wrote: | I suppose that is true. You can even train the prompt with | gradient descent. But in practice, it ends up being fairly | different. | eternalban wrote: | They address prompt tuning's issues in the paper: | | _" The other direction, as exemplified by prefix tuning | (Li & Liang, 2021), faces a different challenge. We observe | that prefix tuning is difficult to optimize and that its | performance changes non-monotonically in trainable | parameters, confirming similar observations in the original | paper. 
More fundamentally, reserving a part of the sequence | length for adaptation necessarily reduces the sequence | length available to process a downstream task, which we | suspect makes tuning the prompt less performant compared to | other methods."_ | | https://ar5iv.labs.arxiv.org/html/2106.09685 | | This is key imo: _" More fundamentally, reserving a part of | the sequence length for adaptation necessarily reduces the | sequence length available to process a downstream task"_. | arugulum wrote: | LoRA conversely has different downsides. LoRA can be used | in two ways: merged or unmerged. Unmerged (which is how | it's trained) incurs a non-trivial computation cost. | Merged means you are modifying the model weights, which | means you are stuck with that one model on that device | (though, this usually applies for most implementations | for the unmerged versions too). | | The benefit of prompt and prefix tuning (note: these are | two separate methods) is that you can serve different | soft-prompts and soft-prefixes efficiently with a single | shared set of model weights. | eternalban wrote: | https://ar5iv.labs.arxiv.org/html/2106.09685/assets/x1.pn | g | | > incurs a non-trivial computation cost | | The hit seems to be in energy/cpu not time since the W0 | computation is in parallel with the BAx. (My assumption | based on the latency claims in the paper.) So an issue in | edge deployments (battery life, etc.). | | > you are stuck with that one model on that device | | Upfront I have 0 clue on the actual numbers, but from a | purely software architecture pov [in unmerged setup], | having that W0 forward process _once_ with n distinct BAx | paths (for distinct fine tunings!) would address that, | no? | | [p.s. say an application that takes as input A/V+Txt, | runs that through an _Ensemble LoRA_ (ELoRA(tm) /g) | with each participant contributing its own BAx finetuning | processing, sharing the single pre-trained W0.] | arugulum wrote: | > My assumption based on the latency claims in paper. | | The latency claims are based on the merged version, where | the modifications are merged into the model weights. | Hence there is no latency cost, since the final model has | the same shape as the original. | | > having that W0 forward process once with n distinct BAx | paths (for distinct fine tunings!) would address that, | no? | | The tl;dr is that that works, but is more expensive. Not | ridiculously more expensive, but certainly more expensive | than processing a few additional tokens with | prefix/prompt tuning. | edwardjhu wrote: | > Merged means you are modifying the model weights, which | means you are stuck with that one model on that device | (though, this usually applies for most implementations | for the unmerged versions too). | | If one is careful with floating point issues, it's | straightforward to unmerge the weights. | | W_0 = W_1 - BA | | Yes, prompt-based methods don't involve swapping weights. | arugulum wrote: | Right, it's mathematically easy (again, up to floating | point issues) to recover the weights as needed, but in | terms of distribution/serving I'm guessing the plan is to | have the original weights and carry around the LoRA | weights and merge as necessary. | | (Also, I'm assuming you're the first author of LoRA.) | arugulum wrote: | Both LoRA and prompt tuning are parameter-efficient tuning | methods. Both of them inject new weights into the model and | tune them. | | Prompt tuning does so by injecting additional prefix tokens in | the input to the model. 
LoRA does so by injecting low-rank | matrices that are additive modifications to a set of linear | layers in the model. | | They both do something slightly different, but are very much | in the same class of methods. | eternalban wrote: | TIL about LoRA via | https://news.ycombinator.com/item?id=35287740 | | See also: Huggingface PEFT: https://github.com/huggingface/peft | sharemywin wrote: | Could the Bloom model be used with this training to build a | commercially-allowed small-ish model? | stu2b50 wrote: | It cheapens the cost of fine tuning; it doesn't make the model | itself smaller at inference time. | outside1234 wrote: | Has anyone diaried out a good learning path for going from a | larger pre-trained model to a fine tuned model? Trying to | understand all of the parts here but it's sort of hard to find | anything linear... | alecco wrote: | *17 Jun 2021 | rdedev wrote: | Came across this library in the past where you can easily add | LoRA and other efficient fine tuning techniques into | huggingface models. Haven't tried it though and support for | different models may be limited | | https://adapterhub.ml/ | Der_Einzige wrote: | I really hope this doesn't displace regular fine tuning | techniques. Dreambooth is superior in quality to Lora with image | generation, and I suspect that it's similar with LLMs. | brucethemoose2 wrote: | There are some WIP evolutions of SD Lora in the works, like | locon and lycoris. | | https://github.com/KohakuBlueleaf/LyCORIS | eternalban wrote: | https://dreambooth.github.io/ | | The LoRA paper's 'problem statement' makes a compelling case | for practical benefits of the approach. Specifically, no added | latency, no serial processing bottlenecks, shared baseline | model, compact time/space requirements. How does dreambooth | stack up in this regard? | mattnewton wrote: | In the image space, dreambooth full-model tunes can handle | multiple concepts and tend to be easier to get hard/complex | things like a person's likeness correct. I've found that LoRA | tunes struggle to be accepted by people as producing their | own face compared to full dreambooth models tuned on the same | inputs, most likely because we are very sensitive to facial | differences of faces we are very familiar with. I haven't | seen this effect for styles or other kinds of concepts, where | people are a little less sensitive about the fidelity. LoRA | is much easier to train, easier to compose, and can have the | base model swapped out in many cases, though, so if it's good | enough for the concept you are trying to add to the model | it's often worth the subtle quality loss. | stu2b50 wrote: | I suspect it's not that similar. The intuition behind LoRA is | more true the higher the rank of the weights of the model. Even | the smallest LLMs have considerably higher rank weights than | Stable Diffusion. They are _large_, after all. | numlocked wrote: | For those wondering why this is interesting: This technique is | being used to reproduce[0] the Alpaca results from Stanford[1] | with a few hours of training on consumer-grade hardware. | | I believe there will soon be a cottage industry of providing | application-specific fine-tuned models like this, that can run in | e.g. AWS very inexpensively. The barrier today seems to be that | the base model (here, Meta's LLaMA) is encumbered and can't be | used commercially. Someone will soon, I'm confident, release e.g. | an MIT-licensed equivalent and we'll all be off to the races. 
| | [0] https://github.com/tloen/alpaca-lora | | [1] https://crfm.stanford.edu/2023/03/13/alpaca.html | GaggiX wrote: | In addition, for the past 1/2 month this technique has been | used to fine-tune Stable Diffusion models. | terafo wrote: | Closer to 4 months. It is much better than having a bunch of | 2-4gb models laying around. | GaggiX wrote: | 4 months? I don't think so, people really started using LoRA | when it was added to the diffusers library less than 2 | months ago; this library is used by the training plugin of | the automatic webui. I guess time seems to flow more slowly | when many things happen. | dragonwriter wrote: | The 1/2 month seems to match Lycoris/LoCon, which as I | understand (haven't dug into the details on this) is a | newer refinement of LoRa. LoRa has been used for longer, | correct. | GaggiX wrote: | The LyCORIS/LoCon repo started committing 1 month ago and | almost no one is using it except for a few experiments | (not even the automatic webui supports it without a | plugin). | dragonwriter wrote: | Judging from activity on Civitai, I think "almost no one | is using it except for a few experiments" is _very_ | wrong. Sure, A1111 needs a plugin for it; it needs a | plugin for ControlNET, too, but that is _also_ quite | popular. | GaggiX wrote: | I'm also judging from the activity on CivitAI, the most | downloaded (>1000 downloads, not many) ones are actually | just LoRA with LoCon in another (experimental) branch of | the CivitAI page, definitely not "_very_ wrong" ahah | | >it needs a plugin for ControlNET | | The big difference is that ControlNet actually required a | pretty complex interface to be used effectively, | meanwhile the use of LoCon/LyCORIS should be completely | transparent and works like a LoRA | Agentlien wrote: | ControlNet is built in as of maybe two weeks ago and no | longer requires an extension. I started using it when the | built-in support arrived and have had a lot of fun with | it since. | smaddox wrote: | There's already RWKV, if you want a decent performing pre- | trained model that's Apache 2.0 licensed: | https://twitter.com/BlinkDL_AI/status/1638555109373378560?s=... | pffft8888 wrote: | https://news.ycombinator.com/item?id=35281026 | polyterative wrote: | Thanks! Hard to follow this stuff sometimes with all the news | romanzubenko wrote: | Today Databricks announced [0] a 6b parameter model from | EleutherAI finetuned on the Alpaca dataset. According to their | CEO[1], training took 3 hours, and cost $30. They didn't | release any details on how it was trained, but likely with | LoRA. | | [0] https://www.databricks.com/blog/2023/03/24/hello-dolly- | democ... [1] | https://twitter.com/alighodsi/status/1639251347777388544 | numlocked wrote: | Interesting. I wonder what the training cost was for: | | https://huggingface.co/EleutherAI/gpt-neox-20b | | Perhaps it's in the paper... | michaelhartm wrote: | They used the 6b GPT4-J, not 20B. That's what's | interesting, it's a smallish large language model :). | dragonwriter wrote: | GPT-J, not GPT4-J. | int_19h wrote: | There are also some LLaMA LoRAs that are trained on the | Anthropic dataset specifically for chat: | | https://huggingface.co/serpdotai | | I haven't done any formal tests on this yet, but with | llama-13b, the overall structure of its responses definitely | becomes much more ChatGPT-like. It would be very interesting | to see how the 65B model performs. 
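For readers who want to try the kind of LoRA fine-tuning discussed in this sub-thread, here is a minimal sketch using the Hugging Face PEFT library linked earlier in the thread. The model name is a placeholder and the hyperparameters are illustrative assumptions, not the settings used by alpaca-lora; target_modules must match the layer names of whichever base model is loaded (the "q_proj"/"v_proj" names below are LLaMA-style).

    # Sketch only: wraps a causal LM with LoRA adapters via Hugging Face PEFT.
    # "base-model-name" is a placeholder, and the module names in target_modules
    # must be adjusted for architectures other than LLaMA-style models.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("base-model-name")
    tokenizer = AutoTokenizer.from_pretrained("base-model-name")

    config = LoraConfig(
        r=8,                     # rank of the low-rank update matrices
        lora_alpha=16,           # scaling factor (alpha / r)
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, config)
    model.print_trainable_parameters()   # typically well under 1% of the base model
    # ...then train with the usual Trainer / training loop; only the LoRA matrices
    # receive gradient updates, and the result can be saved as a small adapter file.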
| m3affan wrote: | Let the revolution begin | outside1234 wrote: | Or, more importantly than in AWS, locally in disconnected or | poorly connected scenarios like in-vehicle or in-home. | arugulum wrote: | > This technique is being used to reproduce[0] the Alpaca | results from Stanford[1] | | Reproduced is a strong statement, without any rigorous | justification other than a few cherry-picked examples. Alpaca- | LoRA is simply LLaMA with LoRA-tuning on the Alpaca data. There | are no metrics, no measurements, no evaluations to show that | Alpaca-LoRA performs similarly to Alpaca, when it is well- | known in the field that parameter-efficient fine-tuning always | pays a cost in terms of performance relative to full fine- | tuning (which is what Alpaca does). | | (This has been a huge nit for me because of the recent flood of | Alpaca-replications, or even claims that Alpaca is comparable to | ChatGPT, rushing to market themselves, but with nothing to | justify their claims.) | numlocked wrote: | I agree - my comment originally had a parenthetical about | this fact, but I thought it was probably confusing to people | who just wanted to understand what this was about. Perhaps I | shouldn't have edited it out. | | It also bothers me that a lot of LoRA claims read like "You | won't believe how little it costs to train these models!", | when of course 99%+ of the complexity and cost is in the | LLaMA (or whatever) model that underpins it. Folks are | talking about it in a loose way that implies some kind of | miraculous overall training cost breakthrough. | GaggiX wrote: | >when it is well-known in the field that parameter-efficient | fine-tuning always pays a cost in terms of performance | relative to full fine-tuning | | The LoRA paper clearly states the performance of the method: | "LoRA performs on-par or better than fine-tuning in model | quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having | fewer trainable parameters, a higher training throughput, | and, unlike adapters, no additional inference latency.": | https://arxiv.org/abs/2106.09685 | arugulum wrote: | I don't want to get into the weeds of the subtleties of | evaluation, hyperparameter-tuning and model comparisons, | but let's just say that subsequent studies have shown that | LoRA (consistent with most parameter-efficient tuning | methods) underperforms full fine-tuning: | https://arxiv.org/abs/2203.06904 | | A simple way to think about it is this: if LoRA really | gives full fine-tuning performance, why would anyone ever | fully fine-tune a model? | GaggiX wrote: | >why would anyone ever fully fine-tune a model? | | You're asking it as if it were a rhetorical question, but | I think it carries more weight than many people seem to | believe. | arugulum wrote: | To balance my view a little, it is definitely a valid | question to ask "how far can we get with parameter- | efficient tuning", and I firmly believe that as models | get larger, the answer is "very, very far". | | That said, I also dislike it when it is carelessly | claimed that parameter-efficient tuning is as good as | full fine-tuning, without qualifications or nuance. | jprafael wrote: | If this works, is there any theory why training models with low | rank layers (y = (A.B).x + b) directly doesn't work? (or do they?) ___________________________________________________________________ (page generated 2023-03-24 23:00 UTC)