[HN Gopher] RedPajama: Reproduction of LLaMA with friendly license
___________________________________________________________________
RedPajama: Reproduction of LLaMA with friendly license
Author : tim_sw
Score : 534 points
Date : 2023-04-17 14:05 UTC (8 hours ago)
(HTM) web link (www.together.xyz)
(TXT) w3m dump (www.together.xyz)
| bmc7505 wrote:
| @dang The title should be changed from MILA to Mila/IQIA.
| hsuduebc2 wrote:
| I'm somehow scared and somehow amazed by the speed of this progress.
| martythemaniak wrote:
| This is cool, now we just need to locate 1,000,000 A100-80GB equivalent GPU-hours. If we had a SETI@Home-type project set up for this, it would be straightforward - only $50K worth of electricity for the 65B model.
|
| Given the immense momentum behind LLaMA, I'm pretty disappointed that Meta won't just open-source it, but I guess reproducing it is better long-term.
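| As a rough sanity check of that $50K electricity figure (with assumed numbers, not from the comment: ~400 W average draw per A100 and ~$0.12/kWh):
|
|     gpu_hours = 1_000_000        # A100-80GB-equivalent hours
|     kwh = gpu_hours * 0.4        # ~400 MWh at ~400 W per GPU
|     cost = kwh * 0.12            # assumed ~$0.12/kWh
|     print(f"~${cost:,.0f}")      # ~$48,000, in line with the $50K claim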
| miohtama wrote:
| They missed the chance to call it OpenPajama
| wongarsu wrote:
| Calling next month's headline: "OpenPajama: RedPajama weights fine-tuned on Literotica and fanfiction.net"
| omneity wrote:
| An actually open source LLM would be a game changer. We might need a new license that covers both model usage and training, something GPL-like whereby distributing a retrained model requires contributing data back or making it public, but not if you use it privately.
|
| This will definitely accelerate progress in LLM research, productization and safety. Alpaca, Vicuna, gpt4all and others are sporadic representations of this that could become a continuous improvement process were the LLM and its license truly open source.
|
| An interesting possible side effect of a GPL-like license is that AIs become unlikely to be trained on private data, the usual moat that big tech wouldn't want/just can't make public if it were to use those GPL-like licensed models.
| jupp0r wrote:
| As with the original GPL, this would be almost useless in a commercial context.
| e12e wrote:
| There are commercial devices that ship with a Linux kernel?
| Bjartr wrote:
| Basically every Android device for starters.
| ijustlovemath wrote:
| I think they mean in terms of enforcement when there's a violation
| jupp0r wrote:
| But do they train the Linux kernel with their customers' data?
| sp332 wrote:
| Using a Linux kernel doesn't mean you have to make your whole project GPL, unless your project is specifically kernel code.
| wongarsu wrote:
| Neither would the proposed model license. Just like the kernel's GPL stops at the userspace boundary, the proposed license would only cover the model definition and weights.
| ipsum2 wrote:
| Huh? There are plenty of open source LLMs. Pythia, GPT-NeoX, GPT-J, GPT-2 and BLOOM-176 are ones I can think of off the top of my head. Pythia is the best performing one IIRC.
| buzzscale wrote:
| Dolly 2.0 is fully open, Apache License, and the tuning dataset is employee generated:
|
| https://www.databricks.com/blog/2023/04/12/dolly-first-open-...
| almost_usual wrote:
| The name is obviously inspired by the Anna Dewdney children's books.
| michael_j_ward wrote:
| My kids love that book, and my oldest had me read it to his preschool class earlier this year.
|
| Here is a much more creative reading by Ludacris [0]
|
| [0] https://www.youtube.com/watch?v=PFtHeo7oMSU
| quickthrower2 wrote:
| As I understand it they have the input data, but next up they are creating the model. I could make a joke about drawing an owl ... but that would be a bit mean. I am really glad people are working on this.
|
| I wonder... who is paying? Will there be restrictions like ethics clauses and suchlike? Not necessarily a bad thing if they do. Will there be restrictions on commercial use?
| HopenHeyHi wrote:
| Having reproduced the pre-training data, the next step is to train a strong base model. As part of the INCITE program, with support from Oak Ridge Leadership Computing Facility (OLCF), we are training a full suite of models, with the first becoming available in the coming weeks. With a strong base model in hand, we are excited to instruction tune the models. Alpaca illustrated the power of instruction tuning - with merely 50K high-quality, diverse instructions, it was able to unlock dramatically improved capabilities. Via OpenChatKit, we received hundreds of thousands of high-quality natural user instructions, which will be used to release instruction-tuned versions of the RedPajama models.
|
| Excellent. Sam Altman can blow it out his ass. :)
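| For a concrete sense of what those instruction-tuning examples look like, here is a single record in the Alpaca style (the content is illustrative, not taken from either dataset):
|
|     {
|       "instruction": "Summarize the following paragraph in one sentence.",
|       "input": "RedPajama is an effort to reproduce the 1.2 trillion token dataset used to train LLaMA...",
|       "output": "RedPajama openly recreates LLaMA's training dataset so that anyone can train on it."
|     }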
| rafaelero wrote:
| That's awesome! Are people thinking about training it for more than just 1 epoch? I believe Galactica showed that training for even 4 epochs is ok. Also, how amazing would it be if the next gen of open-source LLMs increased the context window, say by adding 8k more tokens? That's probably expensive, but totally doable.
| sp332 wrote:
| It's including Common Crawl data 4 or 5 times, does that count?
| [deleted]
| Jayakumark wrote:
| This is huge. I was just checking today what it would take someone to get a model similar to Llama, since Meta did not share the training code or dataset. Looks like they have figured out how to make the dataset; the main problem there is pre-processing it. The second step is to write the code to train the model, and the final one is to do it cheaply.
| brucethemoose2 wrote:
| Maybe they should use whatever Cerebras used. The whole point of their own LLM release was as a maximum compute/$ demonstration on their platform.
|
| Surely there is a better alternative than a bunch of A100s on AWS...
| mgaunard wrote:
| Pyjama singular actually works, but I'm not sure Pajamas can be singular.
| FranchuFranchu wrote:
| I think that at this point, LLM etymology is way more interesting than LLMs themselves.
| [deleted]
| local_crmdgeon wrote:
| So how do I use this? As someone new to the domain.
| tinco wrote:
| You download the 2.76TB of data. Then you run it through Llama's training script for a couple of months on 40 NVidia A100's, and you should have yourself a pretty fine large language model you could use to host your own ChatGPT service. It'll be significantly worse than ChatGPT for reasons that aren't yet fully clear, because OpenAI switched its mission from protecting the earth from nefarious AI developments to itself being the origin of possibly nefarious AI developments.
| DigitalDopamine wrote:
| Renting 40 Nvidia A100s is around $70k per month (on Vultr, I see). So this would only cost $420k for 6 months. Seems doable.
|
| Is 40 A100s enough though? I am interested in what this would cost.
| mlboss wrote:
| It would be great if this could be done on 3090s. A used 3090 usually costs $500-1000 to buy.
| skybrian wrote:
| You don't, since they're not done yet. Someone will come up with a way to use it when they're done.
| thrtythreeforty wrote:
| I'm very glad people are starting to push back against claims of various LLMs being open source. I was beginning to worry that the term would be forcefully redefined in the ML space to mean "weights available." With the kickoff of projects like this and Databricks' Dolly, I'm heartened to see the community saying "no, we are willing to spend the compute to make actually open models."
|
| (While it's true that the actual model code of Llama is properly open source, it's also useless for inference by itself. Claiming these models are open source seems like having your cake and eating it too - you get accolades for "open sourcing" but still get to control what happens with it.)
| jrm4 wrote:
| Lawyer here, still trying to wrap my head around all of it -- but it seems as if what may be different here is the extent to which all of this is _practically_ "open-source" or even "literally free, as in freedom and cost etc" (i.e. generally and widely available REGARDLESS of what the law says)
|
| And then coming second appears to be "companies and whoever else seek to make money, and intend to make some sort of legal restriction part of the biz model."
|
| I have no answers or even predictions here except "this is gonna be interesting."
| nickcw wrote:
| To make an analogy with Linux, the weights are (up until now) a very large closed source firmware blob.
| ninjin wrote:
| I can only agree. The number of times we have seen corporations abuse "open source" and "open science" in the context of large language models has been baffling: OPT/LLaMA disallowing commercial usage, BLOOM having an ethical non-open license, GLM having a clause not to "undermine [the People's Republic of China's] national security and national unity", etc. Every single one of these models has been happy to ride on the coattails of the hard work of the open movements by calling itself open, while only paying lip service to the ideals and definitions underpinning them.
|
| While RedPajama has yet to commit to a license (from what I can see, it is late at night...), they are making _all_ the right noises, and I am hopeful about my prediction: we are about to see the floodgates of _truly_ open models blow open, and OpenAI's "moat" will prove to be a lot shallower than what they and many others have made us believe over the last six months.
| vipulved wrote:
| Hi, this is Vipul, I am a co-founder of Together. We plan to release the model weights under Apache 2.0. The amount of creativity that Stable Diffusion unleashed, for instance, is only really possible with permissive licenses!
| Taek wrote:
| Are you working at all with Stability, Eleuther, or LAION? There have been some rumors that they are doing something similar to this and I'm wondering if this is a duplicated effort.
|
| Either way, huge fan, it would be awesome to have a LLaMA set of weights that are fully open.
| yieldcrv wrote:
| > not to undermine the national security and national unity
|
| this is a required statement to conform with China's constitution, or the superseding authoritative social contract there.
|
| think of it as if the Patriot Act were an article of the constitution instead of a random law subservient to it - it would negate other parts of the constitution that we hold near and dear.
|
| this is a useful similarity as both constitutions have assurances of free speech
|
| just one has a fatal, heavily leveraged clause that undermines all other parts of that constitution and dictates all facets of life
| ninjin wrote:
| This is interesting, thank you. But then how can _any_ entity in the PRC contribute to open source? Alibaba, Baidu, etc. have released plenty of machine learning code under _proper_ open licenses in the past (not to mention that we have hardware vendors in the PRC contributing to, say, Linux). The story I heard about GLM was that it was a high enough public profile project that it caught the attention of PRC bureaucrats, who pushed for the clause to be included.
|
| Regardless of the cause though, the clause runs afoul of any definition of open out there.
| yieldcrv wrote:
| simplest answer is that Alibaba and Baidu have more party members as stakeholders
|
| but it's not likely that any uncontrollable LLM that starts spitting out accurate answers, or things unhelpful to Beijing's ethos, would be allowed to operate there
|
| the model, or the service filtering the model, has to be controlled
| nacs wrote:
| > this is a required statement to conform with China's constitution
|
| But doesn't this mean the model training data also excludes anything critical of China?
|
| For example, does their training data include things like this: https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests... ?
| [deleted]
| danShumway wrote:
| My only caveat here is that I'm actually really curious to see a ruling about whether model weights can be copyrighted.
|
| I don't think the "Open Source" label people are using is accurate, and I _heavily_ agree that a common thing companies seem to be trying to do in this space is release what are essentially closed models while calling them open, and it's a really dangerous direction for AI to go. So nothing in your comment is wrong.
|
| But it also feels a little bit like ceding ground to just assume that Llama can't be used commercially just because Facebook says it can't. I never signed a EULA with them; that claim depends entirely on whether or not model weights are under copyright (or under some similar form of IP protection; some people have brought up trade secrets).
|
| And I don't have a super-strong opinion necessarily, but I'm not sure that's a safe assumption for people to make, and I kind of think it might be good to throw an asterisk next to "can't be used for commercial projects" whenever we talk about Llama's restrictions.
|
| But again, I agree with you, it's not the same as saying Llama is Open Source. Even if it does get ruled as having weaker protections, I don't think the term would really apply.
| jupp0r wrote:
| I haven't done so, but don't you sign an agreement when you ask Facebook for a link to download the weights for LLaMA, which is currently the only officially supported way of getting those weights (https://github.com/facebookresearch/llama/tree/main#llama)?
| danShumway wrote:
| I haven't used Llama for anything other than playing around to test its capabilities, so I feel fairly comfortable admitting publicly that when I did that testing, I did not download it from Facebook using an official portal, and I didn't sign any agreement about it.
|
| On that subject, to the best of my knowledge, I also haven't signed any kind of agreement with OpenAI.
| I've done all of my GPT testing through 3rd-party services or portals that don't require signing EULAs to use.
| Ajedi32 wrote:
| Why would you bother using an "officially supported" way of downloading the weights if they aren't copyrightable anyway?
| worldsayshi wrote:
| > GitHub: GitHub data, filtered by licenses and quality
|
| Does anyone know which licenses are filtered into the dataset?
| mananaysiempre wrote:
| The description on the linked HuggingFace page[1] says MIT, BSD and Apache.
|
| [1] https://huggingface.co/datasets/togethercomputer/RedPajama-D...
| asddubs wrote:
| it's better than laundering gpl code, but it still breaks the terms of those licenses as well, namely attribution
| worldsayshi wrote:
| I guess that could potentially be fixed if citation ejection can somehow be implemented, which seems to be at least feasible?
| MangezBien wrote:
| Definitely thought this was about the kid's book.
| franzypants wrote:
| It might be a little late, but I hope datasets start incorporating patent texts as well:
|
| 1. It's a large corpus of technical knowledge; 2. The language is written by experts in a field and reviewed many times; and 3. They have technical drawings with labels and references in the text
|
| The only downside I suppose is that sometimes patents are written with "just enough knowledge" to get granted but not so much as to give away the secret sauce. That's not really that different from many scholarly papers though.
|
| To give a sense of scale, the granted patent texts of 2020 (without images) are about 160 GB of data, and we have digitized grants going back to at least 1970.
| seunosewa wrote:
| You wouldn't want chatbots to answer you with the kind of language used in patent texts.
| MayeulC wrote:
| Now, I don't know if I would rely on it, but I've certainly thought about asking an LLM to write my patent text for me, provided with a technical description.
| sp332 wrote:
| LLMs are actually pretty good at translating info in one form into another form.
| return_to_monke wrote:
| with both this and https://Open-Assistant.io, I believe we have entered the Stable Diffusion era of large language models
| bugglebeetle wrote:
| Only if they actually start performing at the level of OpenAI's models. I'm not a fan of Stable Diffusion, but at least their models work at general parity with private offerings. All the Llama derivatives and OpenAssistant stuff perform far below GPT-3.5 for everything I've tested.
| jokethrowaway wrote:
| I don't think there is a ready-made alternative to Midjourney.
|
| Midjourney is way more versatile than SD. If you start getting some fine-tuned models on civitai, trained to do some specific tasks well, you can get comparable quality, but I haven't seen a single model which is able to replace Midjourney.
|
| Llama is no different: it has ok performance on generic queries but is still far away from GPT-3.5. If you start fine-tuning you can get good perf on specific tasks.
| htaunay wrote:
| Midjourney to me feels like bowling with bumpers
|
| Sure, it's very easy to get good results fast, but the tuning that avoids "uglier" images is the same that removes a lot of versatility compared to SD
|
| Also controlnet is a killer feature
| og_kalu wrote:
| You're 100 percent right. People will say control bla bla bla and that's certainly true.
| You can get a lot more control with Stable Diffusion, but like 99% of digital comics created with ai art use midjourney. It's one of the use cases of generated art most inclined toward control and versatility, and midjourney is still easily winning. There's a reason for that.
| bugglebeetle wrote:
| SD with ControlNet and some other open source plugins is far more flexible than MidJourney. It just has all the typical hurdles of OSS vs. commercial offerings. Default image quality in Midjourney is better in terms of its pedestrian aesthetic biases, but it's not very interesting as an actual artistic tool. And I say this as someone who doesn't like either service and used to be a commercial illustrator before moving into Data Science.
| asynchronous wrote:
| Midjourney also doesn't have the ControlNet functionality that Stable Diffusion now does, which gives SD a huge edge for specific posing of a scene.
|
| They're very similar offerings if you're willing to put in the work on SD.
| GaggiX wrote:
| >I'm not a fan of StableDiffusion
|
| For some technical reason?
| bugglebeetle wrote:
| No, technically it's all very impressive. My displeasure with them was their doing a Napster-style maneuver to force artists into accepting AI art generation
| CuriouslyC wrote:
| The training was legal, and artists don't have a say under the current law, so your analogy doesn't hold.
| bugglebeetle wrote:
| Neither of these claims has been truly tested in court, and they vary at the national level, so no, not really.
| grumbel wrote:
| LAION is a German company and what StableDiffusion is doing seems to be covered under UrhG § 44b. If artists don't want their work inspected by bots, they have the option to put a robots.txt on their site.
|
| https://www.gesetze-im-internet.de/urhg/__44b.html
|
| https://www.gesetze-im-internet.de/englisch_urhg/englisch_ur...
| bobwaycott wrote:
| While this may very well be covered, I think the general problem in meatspace is that there was no advance notice given to exercise the option to place the proper robots.txt directives to opt out of having one's artwork collected for model training _before it happened_, while still preserving the ability to have one's artwork findable by search engines and the like. I'm sure there are more than a handful of people who have no idea that a robots.txt file can be used to prevent AI data collection -- and some may even be surprised to learn the file that's been used for search engine crawlers is also going to double for AI crawlers.
|
| To push a bit further, there's something that just feels particularly _off_ about assuming everyone's content is up for grabs unless _the producers_ do the work to _opt out_. I think there's an especially palpable bit of irony looking at it from the EU's perspective -- where _cookies_ must be _opt-in_, but grabbing all your copyrighted material so companies can do whatever they like with it places the burden on the owner to _opt out_. It just feels backward. Perhaps one should have to expressly _opt in_ to allowing their work to be accessible as training data. At least then there would be a clear signal that the producer of the work can't later complain, as they willingly granted permission.
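| For reference, the machine-readable opt-out grumbel points to is usually expressed as a robots.txt rule. A minimal sketch (CCBot is Common Crawl's crawler, whose corpus LAION builds on; whether any given AI crawler honors this is up to that crawler):
|
|     User-agent: CCBot
|     Disallow: /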
| Karunamon wrote:
| I wonder if these authors would have complained so loudly if they had known that other artists were looking at their output to learn how to create their own work? Absolutely none of them sprang from the womb, tablet in hand, to create their work ex nihilo, based on nothing other than their own entirely original thoughts.
| bugglebeetle wrote:
| None of this voids the terms of international copyright agreements, and someone on Hacker News should know better than to claim that a robots.txt on a personal site would cover all instances of an image being scraped. I'm not saying that artists will necessarily come out on the winning end of this battle, but it's also specious to claim that because a company says what it did is legal, it therefore is.
| pluijzer wrote:
| Do you mean the uncredited use of artists' artwork without paying royalties for the training set, or AI art generation in general?
| bugglebeetle wrote:
| What I mean is releasing a free service out into the world that allows anyone to effectively pirate an artist's work. Their intention was obviously to be rewarded by established players for doing this bit of dirty work, forcing artists to accept terms they wouldn't have otherwise.
| moffkalast wrote:
| > not a fan of StableDiffusion, but at least their models work at general parity with private offerings
|
| I think you're being a bit generous there. Either I'm using it seriously wrong or SD can only generate vague blobs while Midjourney can make some proper stuff. It's a larger difference than GPT-3.5 vs GPT-4.
| dragonwriter wrote:
| > Either I'm using it seriously wrong or SD can only generate vague blobs
|
| You are definitely using it wrong, if the alternative is "SD can only generate vague blobs". Even the base SD models are _much_ better than that (though the strength of the SD ecosystem is the availability of custom checkpoints, hypernetworks, LoRAs, embeddings, ControlNet, etc., not just the base models.)
| CuriouslyC wrote:
| Llama itself performs comparably to GPT-3.5 (at least the 30/65B models), but the RLHF of ChatGPT is much better than what the community has produced thus far, and it's tuned to work well without tinkering. There will be open source models with that level of fine-tuning in the near future, at which point GPT-4 will mainly be superior for stuff like code that needs the best possible cohesion and accuracy.
| og_kalu wrote:
| SD isn't comparable to Midjourney. 99% of comics created with ai art use midjourney. It's one of the use cases with the most glaring need for control, and still nothing. There's a reason for that.
| GaggiX wrote:
| I have seen really convincing comics made with SD, much more convincing than any comics made with MJ, and the reason is really obvious. Models and LoRAs on CivitAI and Huggingface are really good, and the fact that MJ can generate slightly better images does not justify the total lack of control.
| og_kalu wrote:
| Never said you couldn't make impressive stuff with SD, but feel free to share those comics.
|
| Models on CivitAI are okay. Cool if you're looking for a certain style and/or want to create something that looks like the training images, but style isn't everything.
|
| Midjourney generates much better than "slightly better images", and the very fact you say this just tells me you've not even used the thing in any real capacity.
| GaggiX wrote:
| I am very familiar with MJ and know very well how SD can be used to generate images.
|
| I am the author of submissions such as: https://news.ycombinator.com/item?id=35181433, and I am one of the people responsible for the enthusiasm behind the performance of MJ v5.
|
| But no, MJ is not much better if you know how to use SD, although if what you did with SD was just put a prompt in a huggingface space, I can understand why you say that.
|
| >I never said you can't do impressive things with SD, but feel free to share these comics.
|
| I am arguing that they are better than any comics made with MJ, not that they are simply impressive; that's really the entire point. I know some on Pixiv, you can look them up if you want; I am not linking them for obvious reasons (to say they are NSFW is putting it mildly).
| og_kalu wrote:
| >But no, MJ is not much better if you know how to use SD, although if what you did with SD was just put a prompt in a huggingface space, I can understand why you say that.
|
| I'm the person behind these - https://huggingface.co/ogkalu - I think it's safe to say i know something about SD's capabilities.
|
| >I am arguing that they are better than any comics made with MJ, not that they are simply impressive, that's really the entire point.
|
| Sure, that's why i'm asking you to link these comics that are supposedly better than anything Midjourney has ever produced. With a claim like that, i'm sure you understand wanting to see results.
|
| >You can go look them up on Pixiv if you want, they host some; I am not linking them for obvious reasons (to say they are NSFW is putting it mildly).
|
| So you can't link anything that isn't NSFW on pixiv? Lol, that just solidifies my point. Frankly, if the best you can come up with is pseudo porn (or maybe not pseudo lol) on pixiv (i don't imagine any readers of that will care about the things i'm looking for) then that's not a very good look.
| GaggiX wrote:
| You seem surprised that porn brings innovation, but you shouldn't be: if there has to be someone obsessed with creating the best possible illustration, it is indeed a Pixiv user, or more generally a user who wants to create porn of their favorite character. Moreover, I know these comics not because I have a weird obsession with going to read comics that were created by an AI; I know them because they are good enough to have trended as NSFW comics, whereas the comics made by MJ are known not because they are good comics but because they are made by MJ (so it's cool, I guess). So I don't see how it solidifies your point of view, ahah. If you can't control the generation, every panel will look different - a collage of images. That's why the comics made by MJ seem to be known just because they are made by MJ and not because they are of interest to other communities, like the NSFW comics on Pixiv. Also for this reason, I have not saved links to these posts; I found them randomly while browsing Pixiv, which is another reason why you should look for them yourself.
| EveYoung wrote:
| In my experience, the threshold to be useful is much lower than GPT-3.5. These smaller models can "easily" be finetuned to achieve comparable performance on a specific task. For example, I've achieved promising results for data summarisation and image captioning (BLIP2-based) using Alpaca.
|
| Also, server/hardware costs are still a limiting factor for running and finetuning the larger 33/65B Llama models, especially if they can only be used for personal toy projects.
| bugglebeetle wrote:
| I don't use LLMs for anything image related, so I can't speak to their value there, but almost all simpler NLP tasks are IMO better handled using other techniques that predate them. I've yet to see an example where fine-tuning is cheaper/more efficient/better performing than older solutions to these problems.
| EveYoung wrote:
| If older techniques work for you, there is of course no reason to switch to LLMs besides general curiosity or to explore what's possible already. That said, in my case I was able to generate much more engaging text summaries of tabular data using a Llama derivative.
| idle_zealot wrote:
| Didn't Open Assistant just announce that they weren't releasing their model weights due to safety concerns? Seems like another "Open" AI initiative.
| circuit10 wrote:
| Unless something changed, I thought it was that they literally cannot legally release the weights that are based on LLaMA (except maybe with an xor thing; see the sketch below), so they're going to train it based on something else
| mindcrime wrote:
| Is any of the Open Assistant stuff based on LLaMA? I thought they released (at least some version) before LLaMA even dropped?
| circuit10 wrote:
| Yes, there's also something based on Pythia but it's a smaller model
| selfhoster11 wrote:
| IIRC, the video said they will train it on a properly open-source model as well.
| akiselev wrote:
| That was a joke in the release video. The Pythia model is already released at [1] and the deltas for the LLaMa model should be up here [2] in the next few days.
|
| [1] https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-...
|
| [2] https://huggingface.co/OpenAssistant/oasst-llama-based-model...
| RandomBK wrote:
| Unfortunately [2] is just a placeholder for now, but it does look like the intent is to publish the weights.
| Taek wrote:
| It's also relatively cheap to make your own llama-30 weights; the real value of OpenAssistant is in the training data, and all of that data has been made available.
|
| The OpenAssistant effort gets an A+ for open source contributions.
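| The "xor thing" circuit10 mentions: instead of distributing LLaMA-derived weights directly, a project can publish only the bytewise XOR of its fine-tuned weights against the original LLaMA weights, so that only people who already have LLaMA can reconstruct the model. A minimal sketch of the idea (hypothetical file names; assumes both checkpoints are equal-length raw dumps):
|
|     import numpy as np
|
|     # Load both checkpoints as raw bytes (hypothetical file names).
|     base = np.fromfile("llama-base.bin", dtype=np.uint8)
|     tuned = np.fromfile("finetuned.bin", dtype=np.uint8)
|
|     # The published artifact: a delta that is useless without the base weights.
|     np.bitwise_xor(base, tuned).tofile("delta.xor")
|
|     # Anyone holding the original weights inverts it the same way:
|     # np.bitwise_xor(base, delta) reproduces the fine-tuned weights.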
| fortyseven wrote:
| There was a dumb joke along those lines in an announcement video, meant as a jab at OpenAI. It's easy to miss the "just kidding". (I did, initially.)
| detrites wrote:
| The announcement video by Yannic contained a (lengthy) gag to that effect; has it been taken out of context, or did something actually happen now?
|
| https://youtube.com/watch?v=ddG2fM9i4Kk&t=132
|
| It's easy to miss, but after the negative build-up he says: "and... I'm kidding!"
| ricardobeat wrote:
| Dangerous gag, he said "I'm joking" so quickly it's very easy to miss. I imagine the commenter is not alone in having that wrong impression.
| idle_zealot wrote:
| Oh, ha, yeah this is exactly the gag I fell for. I just noped out of the video and wrote off the project, as this was the first I had ever heard of them, and their website just has a signup and no downloads I could see.
|
| Too bad my original comment is too old to edit.
| [deleted]
| [deleted]
| Tepix wrote:
| Great initiative. Next, we need a lot of compute! Perhaps Tenstorrent wants to make a good impression?
| rnosov wrote:
| > we are training a full suite of models, with the first becoming available in the coming weeks.
|
| Sounds like they already have the compute and have begun training.
| DogTweezers wrote:
| [flagged]
| sytelus wrote:
| Great to see this, but the dataset is the trickiest part. There is no way to confirm that this is a good dataset unless a model is actually trained on it. To reproduce LLaMA, you need $2M of compute.
| Robotbeat wrote:
| Do you have a calculation that shows where that $2M number comes from, EXACTLY?
| eiz wrote:
| https://arxiv.org/pdf/2302.13971.pdf table 15. 1,770,394 A100-80GB hours to train the entire model suite, at the going rate for cloud 8xA100-80GBs (~$12/hr if you could actually get capacity), is ~$2.6M, under extremely optimistic assumptions. YMMV on bulk pricing ;) "the more you buy, the more you save"
| Robotbeat wrote:
| Hmmm... the values in the 7B model seem feasible. An order of magnitude lower GPU hours, plus presumably the lower parameter count, means it probably could fit on a 24GB Radeon RX 7900 XTX, which has higher single precision flops than the A100 and costs $1000 instead of $15,000.
|
| An order of magnitude lower GPU-hour time, plus if you train it for 210 days instead of 21 days, means you could do a 7B model with 20 consumer GPUs which are $1000 apiece. $20k, not counting mainboards, etc. Really not bad. Might even be doable as a volunteer project.
| sp332 wrote:
| Page 4 https://arxiv.org/abs/2302.13971
|
| _When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days._
|
| At $4/GPU-hour per A100 80GB GPU, that's $4 * 2,048 * 21 * 24 = $4,128,768.
| Robotbeat wrote:
| Hmmm... so a 7 billion parameter model could probably be trained on consumer GPUs for one or two orders of magnitude lower cost, particularly if you didn't go well beyond Chinchilla-optimal training time.
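| The two estimates above differ mainly in the assumed hourly rate; the arithmetic itself checks out (the rates are the commenters' assumptions, not quoted prices):
|
|     suite_hours = 1_770_394      # table 15: all four LLaMA models combined
|     run_65b = 2048 * 21 * 24     # 1,032,192 GPU-hours for the 65B run alone
|
|     print(suite_hours / 8 * 12)  # ~$2.66M at ~$12/hr per 8xA100 node
|     print(run_65b * 4)           # $4,128,768 at $4 per GPU-hour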
| t00 wrote:
| Obligatory Dall-E version: https://labs.openai.com/s/Httd7N2ZF5kynUnzp0vCVOjN
| [deleted]
| [deleted]
| [deleted]
| dwheeler wrote:
| Has anyone investigated whether OpenCyc can be converted to natural language (presumably English) and then ingested into this? Cyc made an attempt years ago to "encode common sense", and a subset called OpenCyc was released. That might be a great way to kickstart information representation of the real world. The latest version of Cyc is proprietary, but I think OpenCyc is an open subset (though I'm having trouble confirming that, so the licensing may not be good).
|
| Some links: https://github.com/bovlb/opencyc https://github.com/asanchez75/opencyc
| sp332 wrote:
| I've been wondering this for a while now. Cyc has tons of knowledge in a white-box, formal system. If it just had a front-end that could convert from natural language to Cyc knowledge queries and back, we wouldn't have to worry so much about hallucinations, catastrophic forgetting, or trying to fit the entire database in VRAM.
| speed_spread wrote:
| My understanding is that LLMs and Cyc are fundamentally different forms of AI. Even if you could turn OpenCyc into text rules, once ingested it would just dissolve into the ocean of training text data and would not give significantly more apparent "common sense" than the model already had. Maybe a more interesting combination would be to have both Cyc and an LLM working side by side and comparing notes before agreeing on a result.
| piannucci wrote:
| If the name is a reference to Ogden Nash's poem then I am very tickled: https://www.madisonpubliclibrary.org/engagement/poetry/poem-...
| ricketycricket wrote:
| I'd guess it's the book Llama Llama Red Pajama: https://openlibrary.org/books/OL24377652M/Llama_Llama_Red_Pa...
| FloatArtifact wrote:
| Code generation: I wonder about the difference in output given the order of operations between training and fine-tuning. What if the model was trained on the documentation and the code base for Python, as an example, and then fine-tuning came from training on actual Python code on GitHub?
|
| Once the model understands the Python documentation and the standard library/interpreter implementation, is there a reduction in the data needed for context in other code?
| nailer wrote:
| Someone on HN made a point that weights can't even have copyright - they lack two of the requirements for being copyrightable:
|
| https://news.ycombinator.com/item?id=35508651
| bobwernstein1 wrote:
| when will the first code-writing-specific model arrive?
| smrtinsert wrote:
| So is the next step for someone to come in and fine-tune on top of it in order to make it a Vicuna? Or can the current Vicuna deltas be applied?
| rafaelero wrote:
| Yeah, it's pretty trivial to change the base model from LLaMa to this next one. You just have to finetune it with the same data used previously to train Vicuna (see the sketch below).
| wesleychen wrote:
| There's no model yet, only a dataset.
| omneity wrote:
| My understanding is that LLaMa's architecture is open, so the most difficult part is:
|
| 1. Getting data of equal or better quality
|
| 2. Securing the funding/hardware required for training
|
| 3. Learning/figuring out the training challenges needed to tune the process (the PhD part)
|
| It seems #1 is relatively the lowest hanging fruit and a prerequisite for the other two, and that's what the project is (rightfully) tackling at this stage. #2 could be solved in many ways, and doesn't require much innovation if the project and the team are solid. Which takes me to #3, which on the other hand seems to be the make-or-break part of the project.
|
| I'm not one to doubt the technical prowess of the RedPajama team and their contributors; I'd rather look at it economically. How can an AI open-source project compete with big tech in attracting the brilliant minds of our generation? It's enough to look at levels.fyi to see the battle is not ... level.
|
| There's a serious economic challenge here to having any sort of sustainable open source initiative in AI.
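| A minimal sketch of the Vicuna-style fine-tune rafaelero describes above, using the Hugging Face Trainer API (the base model name and data file are placeholders; a real run also needs the multi-GPU setup, conversation formatting and hyperparameters from the Vicuna recipe):
|
|     from datasets import load_dataset
|     from transformers import (AutoModelForCausalLM, AutoTokenizer,
|                               DataCollatorForLanguageModeling,
|                               Trainer, TrainingArguments)
|
|     base = "togethercomputer/redpajama-base-7b"  # placeholder name
|     tok = AutoTokenizer.from_pretrained(base)
|     model = AutoModelForCausalLM.from_pretrained(base)
|
|     # ShareGPT-style conversations, pre-flattened to a "text" field.
|     data = load_dataset("json", data_files="conversations.json")
|     data = data.map(lambda b: tok(b["text"], truncation=True,
|                                   max_length=2048), batched=True)
|
|     trainer = Trainer(
|         model=model,
|         args=TrainingArguments(output_dir="redpajama-vicuna",
|                                per_device_train_batch_size=1,
|                                num_train_epochs=3),
|         train_dataset=data["train"],
|         # mlm=False makes the collator set up causal-LM labels.
|         data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
|     )
|     trainer.train()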
| Havoc wrote:
| Love this - I'll happily accept a bit of a quality trade-off for a pure open model. It's a bit like how I'm willing to accept trade-offs to ensure my IoT gear is local-only, even if that means losing cloud convenience.
| simonw wrote:
| The training data - all 1.2 trillion tokens - can be downloaded by grabbing each of the 2,084 URLs listed here: https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
|
| I ran a HEAD request against them all to sum up the total file size, and it's 2.67TB total.
|
| Here's a Datasette Lite URL that lets you explore the size metadata about those files: https://lite.datasette.io/?json=https://gist.github.com/simo...
|
| And a SQL query that shows the breakdown across the different sources: https://lite.datasette.io/?json=https://gist.github.com/simo...
|
| Sizes here are in GB:
|
|     common_crawl     1341.6166818914935
|     c4                806.7667234372348
|     github            212.1786002581939
|     wikipedia         111.89125544670969
|     book              100.43162744678557
|     arxiv              87.35323827341199
|     stackexchange      74.54870238155127
|
| Common Crawl is in there a few times - they have the following folders:
|
|     common_crawl/2020-05    198 files
|     common_crawl/2021-04    176 files
|     common_crawl/2023-06    175 files
|     common_crawl/2022-05    157 files
|     common_crawl/2019-30    153 files
|
| And then C4 as well, which is "a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset": https://paperswithcode.com/dataset/c4
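| A minimal sketch of that sizing step (assuming the server reports Content-Length on HEAD responses):
|
|     import urllib.request
|
|     urls = urllib.request.urlopen(
|         "https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt"
|     ).read().decode().split()
|
|     total = 0
|     for url in urls:
|         req = urllib.request.Request(url, method="HEAD")
|         with urllib.request.urlopen(req) as resp:
|             total += int(resp.headers["Content-Length"])
|
|     print(f"{total / 1e12:.2f} TB")  # ~2.67 TB across 2,084 files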
| afro88 wrote:
| Interesting that they're allowed to use Stack Exchange. I don't know much about the legalities of scraping. Was this an agreement between them, or is it simply ok to scrape and use the data in a model?
| progbits wrote:
| https://stackoverflow.com/help/licensing
|
| Doesn't this imply the produced model has to be CC-BY-SA too?
| wongarsu wrote:
| By that line of reasoning, GitHub Copilot would have to be GPL. Until somebody fights about this in court we don't really know. But even in the worst case, CC-BY-SA is one of the easier licenses to fulfill, not much worse than the MIT-licensed code contained in the dataset.
| gattilorenz wrote:
| Welcome to this can of worms.
|
| CC-BY-SA content needs attribution too, but I don't see the(se) model(s) in their current state being able to do so.
|
| I imagine we're gonna see the IBM PC BIOS/Unix/ReactOS "tainted code" arguments again in court, except this time it is not a human who is more-or-less knowingly responsible for sneaking in copyrighted code.
| doctoboggan wrote:
| I am a little concerned that they have only about 60% of the code tokens (GitHub and Stack Exchange). Given that so far the only concrete use case I have for LLMs is coding assistance, I wouldn't want this open source model to be of any lesser quality in that area.
|
| In your opinion, do you think this will hamper the model at all? Or is it still more than enough to get good coding assistance?
| csris wrote:
| Nice catch! We sampled the GitHub dataset to match the total # of tokens seen by LLaMA during training: ~64B tokens (they only pass through 0.64 of their total GitHub dataset according to the paper). We have a lot of GitHub data and will make it available soon. Note, we also have not built this for compute-optimal training. We are following LLaMA's lead and are training on more data for longer to optimize for quality, not compute.
| doctoboggan wrote:
| Very good to hear that you are optimizing for inference rather than training. I've tried llama and its various instruction-tuned siblings and have yet to get performance equivalent to gpt-3.5 on coding tasks. Seeing how the base model performed relative to gpt-3 on the various benchmarks gives me hope that the difference is just in RLHF or other fine-tuning steps. I really hope the community is able to get there, especially if the resulting model is able to be quantized with minimal loss.
| jstx1 wrote:
| A smaller % of training data doesn't necessarily mean lower quality.
| sp332 wrote:
| As mentioned in the post, the smaller models are trained well past "compute-optimal" amounts of data, and I would expect they are well into diminishing returns. On the other hand, large models are good one-shot and few-shot learners, and might be able to pick up enough context from your prompt alone to be usable, even if it wasn't specifically trained on your use case.
| [deleted]
| Minus0 wrote:
| In this context, compute-optimal isn't quite the same as diminishing returns. If you look at the loss graphs in the Llama paper, you can see that even the curves for the smaller models were still going down at the time they stopped training, and weren't anywhere near plateauing yet. LLMs are notoriously data hungry and take a long time to reach convergence.
|
| Compute-optimal here means the point at which it makes sense to move from a smaller to a larger model, assuming that: (a) you have a fixed compute budget of FLOPs, and (b) you want to train the best model possible. The problem is that this applies only to training and assumes nothing about the cost of inference. If you actually need to deploy these trained models and support them long-term for hundreds, thousands, even millions of people to use, would you rather deploy a 13B model or a 30B model at the same level of quality, even if the 13B model were more costly to train?
|
| There is going to be a point at which these models plateau and further improvement will not be possible without moving to a larger model, but Llama doesn't get there quite yet.
| bkm wrote:
| Relevant: https://twitter.com/abacaj/status/1647999551964323844
| totoglazer wrote:
| This tweet is misunderstanding the papers.
| t3estabc wrote:
| [dead]
| simonw wrote:
| No idea!
|
| I wonder how hard it would be to fine-tune something built on RedPajama on further code examples to improve performance there.
| fnands wrote:
| Nice. Thanks for the summary.
|
| So at ~4x the size of the Pile, any idea how it stacks up in terms of quality against other big datasets?
| rcpt wrote:
| I'm kind of surprised how small that dataset is
| simonw wrote:
| Wrote this up as a blog post: https://simonwillison.net/2023/Apr/17/redpajama-data/
| csris wrote:
| Hi! I'm the VP of Engineering at Together. Thanks for writing up these instructions! FYI, you can also download all the files with one wget command:
|
|     wget -i https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
|
| This is also mentioned on the dataset card for redpajama-data-1T on Huggingface [1].
|
| [1]: https://huggingface.co/datasets/togethercomputer/RedPajama-D...
| simonw wrote:
| I made sure to include that in my blog post - along with a note that you need 2.67TB of disk space first!
| macinjosh wrote:
| This guy has kids, so we all know he.. nevermind. Being a parent myself, I love the name.
___________________________________________________________________
(page generated 2023-04-17 23:00 UTC)