[HN Gopher] RedPajama: Reproduction of LLaMA with friendly license
       ___________________________________________________________________
        
       RedPajama: Reproduction of LLaMA with friendly license
        
       Author : tim_sw
       Score  : 534 points
       Date   : 2023-04-17 14:05 UTC (8 hours ago)
        
 (HTM) web link (www.together.xyz)
 (TXT) w3m dump (www.together.xyz)
        
       | bmc7505 wrote:
       | @dang The title should be changed from MILA to Mila/IQIA.
        
       | hsuduebc2 wrote:
        | I'm somewhat scared and somewhat amazed by the speed of this
        | progress.
        
       | martythemaniak wrote:
        | This is cool; now we just need to locate 1,000,000 A100-80GB
        | equivalent GPU-hours. If we had a SETI@Home-type project set up
        | for this, it would be straightforward - only about $50K worth of
        | electricity for the 65B model.
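        | 
        | Rough back-of-the-envelope sketch of that electricity figure
        | (the ~400 W per A100 and ~$0.12/kWh are my assumptions, not
        | numbers from the post):
        | 
        |     gpu_hours = 1_000_000
        |     watts_per_gpu = 400       # assumed average draw
        |     price_per_kwh = 0.12      # assumed electricity price
        |     kwh = gpu_hours * watts_per_gpu / 1000
        |     print(f"{kwh:,.0f} kWh -> ${kwh * price_per_kwh:,.0f}")
        |     # -> 400,000 kWh -> $48,000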
       | 
       | Given the immense momentum behind LLaMA, I'm pretty disappointed
       | that Meta won't just open-source it, but I guess reproducing it
       | is better long-term.
        
       | miohtama wrote:
       | They missed the chance to call it OpenPajama
        
         | wongarsu wrote:
          | Calling next month's headline: "OpenPajama: RedPajama weights
          | fine-tuned on Literotica and fanfiction.net"
        
       | omneity wrote:
       | An actually open source LLM would be a game changer. We might
       | need a new license that englobes model usage and training,
       | something GPL-like whereby distributing a retrained model
       | requires contributing data back or making it public, but not if
       | you use it privately.
       | 
       | This will definitely accelerate progress in LLM research,
       | productization and safety. Alpaca, vicuna, gpt4all and others are
       | sporadic repesentations of this that could become a continuous
       | improvement process were the LLM and its license truely open
       | source.
       | 
       | An interesting possible side effect of a GPL-like license is that
       | AIs become unlikely to be trained on private data, the usual moat
       | that big tech wouldn't want/just can't make public if it were to
       | use those GPL-like licensed models.
        
         | jupp0r wrote:
          | As with the original GPL, this would be almost useless in a
          | commercial context.
        
           | e12e wrote:
           | There are commercial devices that ship with a Linux kernel?
        
             | Bjartr wrote:
             | Basically every Android device for starters.
        
             | ijustlovemath wrote:
             | I think they mean in terms of enforcement when there's a
             | violation
        
             | jupp0r wrote:
              | But do they train the Linux kernel with their customers'
              | data?
        
             | sp332 wrote:
             | Using a Linux kernel doesn't mean you have to make your
             | whole project GPL, unless your project is specifically
             | kernel code.
        
               | wongarsu wrote:
               | Neither would the proposed model license. Just like the
               | kernel's GPL stops at the userspace boundary, the
               | proposed license would only cover the model definition
               | and weights.
        
         | ipsum2 wrote:
         | Huh? There's plenty of open source LLMs. Pythia, GPT-NeoX,
         | GPT-J, GPT-2, BLOOM-176, are ones I can think of off the top of
         | my head. Pythia is the best performing one IIRC.
        
         | buzzscale wrote:
         | Dolly 2.0 is fully open, Apache License and the tuning dataset
         | is employee generated:
         | 
         | https://www.databricks.com/blog/2023/04/12/dolly-first-open-...
        
       | almost_usual wrote:
        | The name is obviously inspired by the Anna Dewdney children's
        | books.
        
         | michael_j_ward wrote:
         | My kids love that book, and my oldest had me read it to his
         | preschool class earlier this year.
         | 
         | Here is a much more creative reading by Ludacris [0]
         | 
         | [0] https://www.youtube.com/watch?v=PFtHeo7oMSU
        
       | quickthrower2 wrote:
        | As I understand it, they have the input data, but next up they
        | are creating the model. I could make a joke about drawing an owl
        | ... but that would be a bit mean. I am really glad people are
        | working on this.
        | 
        | I wonder... who is paying? Will there be restrictions like ethics
        | clauses and suchlike? Not necessarily a bad thing if there are.
        | Will there be restrictions on commercial use?
        
       | HopenHeyHi wrote:
        | Having reproduced the pre-training data, the next step is to
        | train a strong base model. As part of the INCITE program, with
        | support from Oak Ridge Leadership Computing Facility (OLCF), we
        | are training a full suite of models, with the first becoming
        | available in the coming weeks.
        | 
        | With a strong base model in hand, we are excited to instruction
        | tune the models. Alpaca illustrated the power of instruction
        | tuning - with merely 50K high-quality, diverse instructions, it
        | was able to unlock dramatically improved capabilities. Via
        | OpenChatKit, we received hundreds of thousands of high-quality
        | natural user instructions, which will be used to release
        | instruction-tuned versions of the RedPajama models.
       | 
       | Excellent. Sam Altman can blow it out his ass. :)
        
       | rafaelero wrote:
        | That's awesome! Are people thinking about training it for more
        | than just 1 epoch? I believe Galactica showed that training for
        | even 4 epochs is ok. Also, how amazing would it be if the next
        | gen of open-source LLMs increased the context window, like adding
        | 8k more tokens? That's probably expensive, but totally doable.
        
         | sp332 wrote:
          | It includes Common Crawl data 4 or 5 times - does that count?
        
         | [deleted]
        
       | Jayakumark wrote:
        | This is huge. I was just checking today on what it would take
        | someone to get a model similar to LLaMA, since Meta did not share
        | the training code or dataset. Looks like they have figured out
        | how to make the dataset - the main problem there is pre-
        | processing it. The second step is to make the code to train the
        | model, and the final one is to do it cheaply.
        
         | brucethemoose2 wrote:
         | Maybe they should use whatever Cerebras used. The whole point
         | of their own LLM release was as a maximum compute/$
         | demonstration on their platform.
         | 
         | Surely there is a better alternative than a bunch of A100s on
         | AWS...
        
       | mgaunard wrote:
       | Pyjama singular actually works, but I'm not sure Pajamas can be
       | singular.
        
         | FranchuFranchu wrote:
         | I think that at this point, LLM etymology is way more
         | interesting than LLMs themselves.
        
       | [deleted]
        
       | local_crmdgeon wrote:
       | So how do I use this? As someone new to the domain.
        
         | tinco wrote:
          | You download the 2.76TB of data. Then you run it through
          | Llama's training script for a couple of months on 40 NVidia
          | A100s, and you should have yourself a pretty fine large
          | language model you could use to host your own ChatGPT service.
          | It'll be significantly worse than ChatGPT for reasons that
          | aren't yet fully clear, because OpenAI switched its mission
          | from protecting the earth from nefarious AI developments to
          | itself being the origin of possibly nefarious AI developments.
        
           | DigitalDopamine wrote:
            | Renting 40 Nvidia A100s is around $70k per month (on Vultr,
            | I see). So this would only cost $420k for 6 months. Seems
            | doable.
            | 
            | Are 40 A100s enough though? I am interested in what this
            | would cost.
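            | 
            | A rough sketch of the numbers, assuming the ~1,000,000
            | A100-hours for the 65B model mentioned upthread and the
            | ~$70k/month rental figure above:
            | 
            |     gpu_hours = 1_000_000    # 65B figure from upthread
            |     num_gpus = 40
            |     rent_per_month = 70_000  # ~40x A100, per above
            |     months = gpu_hours / num_gpus / (24 * 30)
            |     cost = months * rent_per_month
            |     print(f"{months:.1f} months, ${cost:,.0f}")
            |     # -> ~34.7 months, ~$2,430,556
            | 
            | So on that assumption, 40 A100s would take roughly three
            | years for the 65B model, not a couple of months.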
        
             | mlboss wrote:
              | It would be great if this could be done on 3090s. A used
              | 3090 usually costs $500-1000 to buy.
        
         | skybrian wrote:
         | You don't, since they're not done yet. Someone will come up
         | with a way to use it when they're done.
        
       | thrtythreeforty wrote:
       | I'm very glad people are starting to push back against claims of
       | various LLMs being open source. I was beginning to be worried
       | that the term would be forcefully redefined in the ML space to
       | mean "weights available." With the kickoff of projects like this
       | and Databricks' Dolly, I'm heartened to see the community saying
       | "no, we are willing to spend the compute to make actually open
       | models."
       | 
       | (While it's true that the actual model code of Llama is properly
       | open source, it's also useless for inference by itself. Claiming
       | these models are open source seems like having your cake and
       | eating it too - you get accolades for "open sourcing" but still
       | get to control what happens with it.)
        
         | jrm4 wrote:
         | Lawyer here, still trying to wrap my head around all of it --
         | but it seems as if what may be different here is the extent to
         | which all of this is _practically_ "open-source" or even
         | "literally free, as in freedom and cost etc" (i.e. generally
         | and widely available REGARDLESS of what the law says)
         | 
          | And then coming second appears to be "companies and whoever
          | else seeks to make money, and intends to make some sort of
          | legal restriction part of the biz model."
         | 
         | I have no answers or even predictions here except "this is
         | gonna be interesting."
        
         | nickcw wrote:
         | To make an analogy with Linux, the weights are (up until now) a
         | very large closed source firmware blob.
        
         | ninjin wrote:
          | I can only agree. The number of times we have seen corporations
          | abuse "open source" and "open science" in the context of large
          | language models has been baffling: OPT/LLaMA disallowing
          | commercial usage, BLOOM having an ethical non-open license, GLM
          | having a clause not to "undermine [the People's Republic of
          | China's] national security and national unity", etc. Every
          | single one of these models has been happy to ride on the
          | coattails of the hard work of the open movements by calling
          | itself open, while only paying lip service to the ideals and
          | definitions underpinning them.
          | 
          | While RedPajama has yet to commit to a license (from what I can
          | see, it is late at night...), they are making _all_ the right
          | noises, and I am hopeful that we are about to see the
          | floodgates of _truly_ open models blow open, and that OpenAI's
          | "moat" will prove to be a lot shallower than what they and many
          | others have made us believe over the last six months.
        
           | vipulved wrote:
           | Hi, this is Vipul, I am a co-founder of Together. We plan to
           | release the model weights under Apache 2.0. The amount of
           | creativity that Stable Diffusion unleashed for instance is
           | only really possible with permissive licenses!
        
             | Taek wrote:
             | Are you working at all with Stability, Eleuther, or LAION?
             | There have been some rumors that they are doing something
             | similar to this and I'm wondering if this is a duplicated
             | effort.
             | 
             | Either way, huge fan, it would be awesome to have a LLaMA
             | set of weights that are fully open.
        
           | yieldcrv wrote:
           | > not to undermine the national security and national unity
           | 
            | This is a required statement to conform with China's
            | constitution, or the superseding authoritative social
            | contract there.
            | 
            | Think of it as if the Patriot Act were an article of the
            | constitution instead of a random law subservient to the
            | constitution; it would negate other parts of the constitution
            | that we hold near and dear.
            | 
            | This is a useful comparison, as both constitutions have
            | assurances of free speech - just one has a fatal, heavily
            | leveraged clause that undermines all other parts of that
            | constitution and dictates all facets of life.
        
             | ninjin wrote:
             | This is interesting, thank you. But then how can _any_
             | entity in the PRC contribute to open source? Alibaba,
             | Baidu, etc. have released plenty of machine learning code
             | under _proper_ open licenses in the past (not to mention
             | that we have hardware vendors in the PRC contributing to
              | say Linux). The story I heard about GLM was that it was a
              | high-profile enough project that it caught the attention of
              | PRC bureaucrats, who pushed for the clause to be included.
              | 
              | Regardless of the cause though, the clause falls afoul of
              | any definition of open out there.
        
               | yieldcrv wrote:
                | The simplest answer is that Alibaba and Baidu have more
                | party members as stakeholders.
                | 
                | But it's not likely that an uncontrollable LLM could
                | start spitting out accurate information, or things
                | unhelpful to Beijing's ethos, and still be allowed to
                | operate there.
                | 
                | The model, or the service filtering the model, has to be
                | controlled.
        
             | nacs wrote:
             | > this is a required statement to conform with China's
             | constitution
             | 
             | But doesn't this mean the model training data also excludes
             | anything critical of China?
             | 
             | For example, does their training data include things like
             | this: https://en.wikipedia.org/wiki/1989_Tiananmen_Square_p
             | rotests... ?
        
           | [deleted]
        
         | danShumway wrote:
         | My only caveat here is that I'm actually really curious to see
         | a ruling about whether model weights can be copyrighted.
         | 
         | I don't think the "Open Source" label people are using is
          | accurate, and I _heavily_ agree that a common thing that
          | companies seem to be trying to do in this space is release
          | what are essentially closed models while calling them open,
          | and it's a really dangerous direction for AI to go. So nothing
          | in your comment is wrong.
         | 
         | But it also feels a little bit like ceding ground to just
         | assume that Llama can't be used commercially just because
         | Facebook says it can't. I never signed a EULA with them, that
         | claim depends entirely on whether or not model weights are
         | under copyright (or under some similar form of IP protection,
         | some people have brought up trade secrets).
         | 
         | And I don't have a super-strong opinion necessarily, but I'm
         | not sure that's a safe assumption for people to make, and I
         | kind of think it might be good to throw an asterisk next to
         | "can't be used for commercial projects" whenever we talk about
         | Llama's restrictions.
         | 
         | But again, I agree with you, it's not the same as saying Llama
         | is Open Source. Even if it does get ruled as having weaker
         | protections, I don't think the term would really apply.
        
           | jupp0r wrote:
            | I haven't done so, but don't you sign an agreement when you
            | ask Facebook for a link to download the weights for LLaMA,
            | which is currently the only officially supported way of
            | getting those weights
            | (https://github.com/facebookresearch/llama/tree/main#llama)?
        
             | danShumway wrote:
             | I haven't used Llama for anything other than playing around
             | to test its capabilities, so I feel fairly comfortable
             | admitting publicly that when I did that testing, I did not
             | download it from Facebook using an official portal, and I
             | didn't sign any agreement about it.
             | 
             | On that subject, to the best of my knowledge, I also
             | haven't signed any kind of agreement with OpenAI. I've done
             | all of my GPT testing through 3rd-party services or portals
             | that don't require signing EULAs to use.
        
             | Ajedi32 wrote:
             | Why would you bother using an "officially supported" way of
             | downloading the weights if they aren't copyrightable
             | anyway?
        
       | worldsayshi wrote:
       | > GitHub: GitHub data, filtered by licenses and quality
       | 
       | Does anyone know which licenses are filtered into the dataset?
        
         | mananaysiempre wrote:
         | The description on the linked HuggingFace page[1] says MIT, BSD
         | and Apache.
         | 
         | [1]
         | https://huggingface.co/datasets/togethercomputer/RedPajama-D...
        
           | asddubs wrote:
            | It's better than laundering GPL code, but it still breaks the
            | terms of those licenses as well, namely attribution.
        
             | worldsayshi wrote:
              | I guess that could potentially be fixed if citation
              | ejection can somehow be implemented into it, which seems
              | to be at least feasible?
        
       | MangezBien wrote:
       | Definitely thought this was about the kid's book.
        
       | franzypants wrote:
       | It might be a little late, but I hope datasets start
       | incorporating patent texts as well:
       | 
        | 1. It's a large corpus of technical knowledge;
        | 
        | 2. The language is written by experts in a field and reviewed
        | many times; and
        | 
        | 3. They have technical drawings with labels and references in
        | the text.
       | 
       | The only downside I suppose is that sometimes patents are written
       | with "just enough knowledge" to get it granted but not too much
       | to give away the secret sauce. That's not really that different
       | from many scholarly papers though.
       | 
        | To give a sense of scale, the granted patent texts from 2020
        | (without images) are about 160 GB of data, and we have digitized
        | grants going back to at least 1970.
        
         | seunosewa wrote:
         | You wouldn't want chatbots to answer you with the kind of
         | language used in patent texts.
        
           | MayeulC wrote:
            | Now, I don't know if I would rely on it, but I've certainly
            | thought about asking an LLM to write my patent text for me,
            | provided with a technical description.
        
           | sp332 wrote:
            | LLMs are actually pretty good at translating info from one
            | form into another.
        
       | return_to_monke wrote:
       | with both this and https://Open-Assistant.io, I believe we have
       | entered the Stable Diffusion era of large language models
        
         | bugglebeetle wrote:
          | Only if they actually start performing at the level of OpenAI's
          | models. I'm not a fan of StableDiffusion, but at least their
          | models work at general parity with private offerings. All the
          | LLaMA derivatives and OpenAssistant stuff perform far below
          | GPT-3.5 for everything I've tested.
        
           | jokethrowaway wrote:
            | I don't think there is a ready-made alternative to
            | Midjourney.
            | 
            | Midjourney is way more versatile than SD. If you start
            | getting some fine-tuned models on Civitai, trained to do
            | some specific tasks well, you can get comparable quality,
            | but I haven't seen a single model which is able to replace
            | Midjourney.
            | 
            | Llama is no different: it has OK performance on generic
            | queries but is still far from GPT-3.5. If you start fine-
            | tuning, you can get good performance on specific tasks.
        
             | htaunay wrote:
              | Midjourney to me feels like bowling with bumpers.
              | 
              | Sure, it's very easy to get good results fast, but the
              | tuning that avoids "uglier" images is the same tuning that
              | removes a lot of versatility compared to SD.
              | 
              | Also, ControlNet is a killer feature.
        
             | og_kalu wrote:
              | You're 100 percent right. People will say "control, bla
              | bla bla", and that's certainly true - you can get a lot
              | more control with Stable Diffusion - but like 99% of
              | digital comics created with AI art use Midjourney. It's one
              | of the most control- and versatility-demanding use cases of
              | generated art, and Midjourney is still easily winning.
              | There's a reason for that.
        
             | bugglebeetle wrote:
             | SD with ControlNet and some other open source plugins is
             | far more flexible than MidJourney. It just has all the
             | typical hurdles of OSS vs. commercial offerings. Default
             | image quality in Midjourney is better in terms of its
             | pedestrian aesthetic biases, but it's not very interesting
             | as an actual artistic tool. And I say this as someone who
             | doesn't like either service and used to be a commercial
             | illustrator before moving into Data Science.
        
             | asynchronous wrote:
              | Midjourney also doesn't have ControlNet functionality like
              | Stable Diffusion now does, which gives SD a huge edge for
              | specific posing of a scene.
              | 
              | They're very similar offerings if you're willing to put in
              | the work on SD.
        
           | GaggiX wrote:
           | >I'm not a fan of StableDiffusion
           | 
           | For some technical reason?
        
             | bugglebeetle wrote:
              | No, technically it's all very impressive. My displeasure
              | with them was their doing a Napster-style maneuver to force
              | artists into accepting AI art generation.
        
               | CuriouslyC wrote:
               | The training was legal, and artists don't have a say
               | under the current law, so your analogy doesn't hold.
        
               | bugglebeetle wrote:
                | Neither of these claims has been truly tested in court,
                | and both vary at the national level, so no, not really.
        
               | grumbel wrote:
                | LAION is a German company and what StableDiffusion is
                | doing seems to be covered under UrhG § 44b. If artists
                | don't want their work inspected by bots, they have the
                | option to put a robots.txt on their site.
               | 
               | https://www.gesetze-im-internet.de/urhg/__44b.html
               | 
               | https://www.gesetze-im-
               | internet.de/englisch_urhg/englisch_ur...
        
               | bobwaycott wrote:
               | While this may very well be covered, I think the general
               | problem in meatspace is that there was no advance notice
               | given to exercise the option to place the proper
               | robots.txt directives to opt out of having one's artwork
                | collected for model training _before it happened_, while
               | still preserving the ability to have one's artwork
               | findable by search engines and the like. I'm sure there
               | are more than a handful of people who have no idea that a
               | robots.txt file can be used to prevent AI data collection
               | --and some may even be surprised to learn the file that's
               | been used for search engine crawlers is also going to
               | double for AI crawlers.
               | 
               | To push a bit further, there's something that just feels
               | particularly _off_ about assuming everyone's content is
               | up for grabs unless _the producers_ do the work to _opt
               | out_. I think there's an especially palpable bit of irony
               | looking at it from the EU's perspective--where _cookies_
                | must be _opt-in_, but grabbing all your copyrighted
               | material so companies can do whatever they like with it
               | places the burden on the owner to _opt-out_. It just
               | feels backward. Perhaps one should have to expressly
               | _opt-in_ to allowing their work to be accessible as
               | training data. At least then there will be a clear signal
               | that the producer of the work can't later complain, as
               | they willingly granted permission.
        
               | Karunamon wrote:
               | I wonder if these authors would have complained so loudly
               | if they had known that other artists were looking at
               | their output to learn how to create their own work?
               | Absolutely none of them sprung from the womb, tablet in
               | hand, to create their work ex nihilo, based on nothing
               | other than their own entirely original thoughts.
        
               | bugglebeetle wrote:
                | None of this voids the terms of international copyright
                | agreements, and someone on Hacker News should know better
                | than to claim that a robots.txt on a personal site would
                | cover all instances of an image being scraped. I'm not
                | saying that artists will necessarily come out on the
                | winning end of this battle, but it's also specious to
                | claim that because a company says what they did is
                | legal, therefore it is.
        
               | pluijzer wrote:
                | Do you mean the uncredited use of artists' artwork
                | without paying royalties for the training set, or AI art
                | generation in general?
        
               | bugglebeetle wrote:
               | What I mean is releasing a free service out into the
               | world that allows anyone to effectively pirate an
               | artist's work. Their intention was obviously to be
               | rewarded by established players for doing this bit of
               | dirty work, forcing artists to accept terms they wouldn't
               | have otherwise.
        
           | moffkalast wrote:
           | > not a fan of StableDiffusion, but at least their models
           | work at general parity with private offerings
           | 
           | I think you're being a bit generous there. Either I'm using
           | it seriously wrong or SD can only generate vague blobs while
           | Midjourney can make some proper stuff. It's a larger
           | difference than GPT 3.5 vs GPT 4.
        
             | dragonwriter wrote:
             | > Either I'm using it seriously wrong or SD can only
             | generate vague blobs
             | 
             | You are definitely using it wrong, if the alternative is
             | "SD can only generate vague blobs". Even the base SD models
              | are _much_ better than that (though the strength of the SD
              | ecosystem is the availability of custom checkpoints,
              | hypernetworks, LoRAs, embeddings, ControlNet, etc., not
              | just the base models).
        
           | CuriouslyC wrote:
            | Llama itself performs comparably to GPT-3.5 (at least the
            | 30B/65B models), but the RLHF of ChatGPT is much better than
            | what the community has produced thus far, and it's tuned to
            | work well without tinkering. There will be open source models
            | with that level of fine-tuning in the near future, at which
            | point GPT-4 will mainly be superior for stuff like code that
            | needs the best possible cohesion and accuracy.
        
           | og_kalu wrote:
            | SD isn't comparable to Midjourney. 99% of comics created with
            | AI art use Midjourney. It's one of the most glaring use cases
            | for control, and still nothing. There's a reason for that.
        
             | GaggiX wrote:
             | I have seen really convincing comics made with SD, much
             | more convincing than any comics made with MJ, and the
             | reason is really obvious. Models and LoRAs on CivitAI and
             | Huggingface are really good, and the fact that MJ can
             | generate slightly better images does not justify the total
             | lack of control.
        
               | og_kalu wrote:
               | Never said you couldn't make impressive stuff with SD but
               | feel free to share those comics.
               | 
               | Models on CivitAI are okay. Cool if you're looking for a
               | certain style and/or want to create something that looks
               | like the training images but style isn't everything.
               | 
               | Midjourney generates much better than "slightly better
               | images" and the very fact you say this just tells me
               | you've not even used the thing in any real capacity.
        
               | GaggiX wrote:
               | I am very familiar with MJ and know very well how SD can
               | be used to generate images.
               | 
               | I am the author of submissions such as:
               | https://news.ycombinator.com/item?id=35181433, and I am
               | one of the people responsible for the enthusiasm behind
               | the performance of MJ v5.
               | 
               | But no, MJ is not much better if you know how to use SD,
               | although if what you did with SD was just put a prompt in
               | a huggingface space, I can understand why you say that.
               | 
               | >I never said you can't do impressive things with SD, but
               | feel free to share these comics.
               | 
                | I am arguing that they are better than any comics made
                | with MJ, not that they are simply impressive; that's
                | really the entire point. I know some on Pixiv; you can
                | look them up if you want. I am not linking them for
                | obvious reasons (to say they are NSFW is putting it
                | mildly).
        
               | og_kalu wrote:
               | >But no, MJ is not much better if you know how to use SD,
               | although if what you did with SD was just put a prompt in
               | a huggingface space, I can understand why you say that.
               | 
               | I'm the person behind these -
               | https://huggingface.co/ogkalu I think it's safe to say i
               | know something about SD's capabilities.
               | 
               | >I am arguing that they are better than any comics made
               | with MJ, not that they are simply impressive, that's
               | really the entire point.
               | 
               | Sure that's why i'm asking you to link these comics that
               | are supposedly better than anything Midjourney has ever
               | produced. With a claim like that, i'm sure you understand
               | wanting to see results.
               | 
               | >You can go look them up on Pixiv if you want, they host
               | some; I am not linking them for obvious reasons (to say
               | they are NSFW is putting it mildly).
               | 
                | So you can't link anything that isn't NSFW on Pixiv?
                | Lol, that just solidifies my point. Frankly, if the best
                | you can come up with is pseudo porn (or maybe not pseudo
                | lol) on Pixiv (I don't imagine any readers of that will
                | care about the things I'm looking for), then that's not
                | a very good look.
        
               | GaggiX wrote:
                | You seem surprised that porn brings innovation, but you
                | shouldn't be: if there has to be someone obsessed with
                | creating the best possible illustration, it is indeed a
                | Pixiv user, or more generally a user who wants to create
                | porn of their favorite character. Moreover, I know these
                | comics not because I have a weird obsession with going
                | to read comics that were created by an AI; I know them
                | because they are good enough to have trended as NSFW
                | comics, whereas the comics made by MJ are known not
                | because they are good comics but because they are made
                | by MJ (so it's cool, I guess). So I don't see how it can
                | solidify your point of view ahah. If you can't control
                | the generation, every panel will look different - a
                | collage of images - and that's why the comics made by MJ
                | seem to be known just because they are made by MJ and
                | not because they are in the interest of other
                | communities like NSFW comics on Pixiv. Also for this
                | reason, I have not saved links to these posts; I found
                | them randomly while browsing Pixiv, another reason why
                | you should look for them yourself.
        
           | EveYoung wrote:
            | In my experience, the threshold to be useful is much lower
            | than GPT-3.5. These smaller models can "easily" be finetuned
            | to achieve comparable performance on a specific task. For
            | example, I've achieved promising results for data
            | summarisation and image captioning (BLIP2-based) using
            | Alpaca.
            | 
            | Also, server/hardware costs are still a limiting factor for
            | running and finetuning the larger 33/65B Llama models,
            | especially if they can only be used for personal toy
            | projects.
        
             | bugglebeetle wrote:
             | I don't use LLMs for anything image related, so I can't
             | speak to their value there, but almost all simpler NLP
             | tasks are IMO better handled using other techniques that
             | predate them. I've yet to see an example where fine-tuning
             | is cheaper/more efficient/better performing than older
             | solutions to these problems.
        
               | EveYoung wrote:
                | If older techniques work for you, there is of course no
                | reason to switch to LLMs besides general curiosity or to
                | explore what's possible already. That said, in my case I
                | was able to generate much more engaging text summaries
                | of tabular data using a Llama derivative.
        
         | idle_zealot wrote:
         | Didn't Open Assistant just announce that they weren't releasing
         | their model weights due to safety concerns? Seems like another
         | "Open" AI initiative.
        
           | circuit10 wrote:
           | Unless something changed, I thought it was that they
           | literally cannot legally release the weights that are based
           | on LLaMA (except maybe with an xor thing) so they're going to
           | train it based on something else
        
             | mindcrime wrote:
             | Is any of the Open Assistant stuff based on LLaMA? I
              | thought they released (at least some version) before LLaMA
             | even dropped?
        
               | circuit10 wrote:
               | Yes, there's also something based on Pythia but it's a
               | smaller model
        
             | selfhoster11 wrote:
             | IIRC, the video said they will train it on a properly open-
             | source model as well.
        
           | akiselev wrote:
           | That was a joke in the release video. The Pythia model is
           | already released at [1] and the deltas for the LLaMa model
           | should be up here [2] in the next few days.
           | 
           | [1] https://huggingface.co/OpenAssistant/oasst-
           | sft-4-pythia-12b-...
           | 
           | [2] https://huggingface.co/OpenAssistant/oasst-llama-based-
           | model...
        
             | RandomBK wrote:
             | Unfortunately [2] is just a placeholder for now, but it
             | does look like the intent is to publish the weights.
        
               | Taek wrote:
                | It's also relatively cheap to make your own llama-30
                | weights; the real value of OpenAssistant is in the
                | training data, and all of that data has been made
                | available.
               | 
               | The OpenAssistant effort gets an A+ for open source
               | contributions.
        
           | fortyseven wrote:
           | There was a dumb joke along those lines in an announcement
           | video, meant as a jab at OpenAI. It's easy to miss the "just
           | kidding". (I did, initially.)
        
           | detrites wrote:
            | The announcement video by Yannic contained a (lengthy) gag to
            | that effect - has it been taken out of context, or did
            | something actually happen now?
            | 
            | https://youtube.com/watch?v=ddG2fM9i4Kk&t=132
            | 
            | It's easy to miss, but after the negative build-up he says:
            | "and... I'm kidding!"
        
             | ricardobeat wrote:
             | Dangerous gag, he said "I'm joking" so quickly it's very
             | easy to miss. I imagine the commenter is not alone in
             | having that wrong impression.
        
             | idle_zealot wrote:
             | Oh, ha, yeah this is exactly the gag I fell for. I just
             | noped out of the video and wrote off the project as this
             | was the first I ever heard of them, and their website just
             | has a signup and no downloads I could see.
             | 
             | Too bad my original comment is too old to edit.
        
           | [deleted]
        
           | [deleted]
        
       | Tepix wrote:
       | Great initiative. Next, we need a lot of compute! Perhaps
       | Tenstorrent wants to make a good impression?
        
         | rnosov wrote:
         | > we are training a full suite of models, with the first
         | becoming available in the coming weeks.
         | 
         | Sounds like they already have the compute and began training.
        
       | DogTweezers wrote:
       | [flagged]
        
       | sytelus wrote:
        | Great to see this, but the dataset is the trickiest part. There
        | is no way to confirm whether this is a good dataset unless a
        | model is actually trained on it. To reproduce LLaMA, you need
        | $2M of compute.
        
         | Robotbeat wrote:
         | Do you have a calculation that shows where that $2M number
         | comes from, EXACTLY?
        
           | eiz wrote:
            | https://arxiv.org/pdf/2302.13971.pdf, table 15: 1,770,394
            | A100-80GB hours to train the entire model suite, at the going
            | rate for cloud 8xA100-80GB instances (~$12/hr if you could
            | actually get capacity), is ~$2.6M, under extremely optimistic
            | assumptions. YMMV on bulk pricing ;) "the more you buy the
            | more you save"
        
             | Robotbeat wrote:
             | Hmmm... the values in the 7B model seem feasible. An order
             | of magnitude lower GPU hours, plus presumably the lower
             | parameter count means it probably could fit on a 24GB
             | Radeon RX 7900 XTX, which has higher single precision flops
             | than the A100 and costs $1000 instead of $15,000.
             | 
              | An order of magnitude lower GPU-hour requirement, plus
              | training for 210 days instead of 21 days, means you could
              | do a 7B model with 20 consumer GPUs at $1000 apiece. $20k,
              | not counting mainboards, etc. Really not bad. Might even
              | be doable as a volunteer project.
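              | 
              | A quick sketch of that estimate, assuming the 7B model
              | needs roughly a tenth of the 65B compute (~100k A100-hour
              | equivalents) and that a 24GB consumer card delivers
              | roughly A100-like throughput here (a generous assumption):
              | 
              |     gpu_hour_equiv = 100_000  # ~1/10 of 65B figure
              |     num_gpus, card_price = 20, 1_000
              |     days = gpu_hour_equiv / num_gpus / 24
              |     print(f"~{days:.0f} days, ${num_gpus*card_price:,}")
              |     # -> ~208 days, $20,000 in GPUs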
        
           | sp332 wrote:
           | Page 4 https://arxiv.org/abs/2302.13971
           | 
           |  _When training a 65B-parameter model, our code processes
            | around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM.
           | This means that training over our dataset containing 1.4T
           | tokens takes approximately 21 days._
           | 
           | At $4/GPU-hour per A100 80GB GPU, that's $4 * 2,048 * 21 * 24
           | = $4,128,768.
        
             | Robotbeat wrote:
             | Hmmm... so a 7 billion parameter model could probably be
             | trained on consumer GPUs for one or two orders of magnitude
             | lower cost, particularly if you didn't go well beyond
             | Chinchilla-optimal training time.
        
       | t00 wrote:
       | Obligatory Dall-E version
       | https://labs.openai.com/s/Httd7N2ZF5kynUnzp0vCVOjN
        
       | [deleted]
        
       | [deleted]
        
       | [deleted]
        
       | dwheeler wrote:
        | Has anyone investigated whether OpenCyc can be converted to
        | natural language (presumably English) and then ingested into
        | this? Cyc made an attempt years ago to "encode common sense",
        | and a subset called OpenCyc was released. That might be a great
        | way to kickstart information representation of the real world.
        | The latest version of Cyc is proprietary, but I think OpenCyc is
        | an open subset (though I'm having trouble confirming that, so
        | the licensing may not be good).
       | 
       | Some links: https://github.com/bovlb/opencyc
       | https://github.com/asanchez75/opencyc
        
         | sp332 wrote:
         | I've been wondering this for a while now. Cyc has tons of
         | knowledge in a white-box, formal system. If it just had a
         | front-end that could convert from natural language to Cyc
         | knowledge queries and back, we wouldn't have to worry so much
         | about hallucinations, catastrophic forgetting, or trying to fit
         | the entire database in VRAM.
        
         | speed_spread wrote:
         | My understanding is that LLM and Cyc are fundamentally
         | different forms of AI. Even if you could turn OpenCyc into text
         | rules, once ingested it would just dissolve into the ocean of
         | training text data and would not significantly gain more
         | apparent "common sense" than it already had. Maybe a more
         | interesting combination could be to have both Cyc and LLM
         | working side by side and comparing notes before agreeing on a
         | result.
        
       | piannucci wrote:
       | If the name is a reference to Ogden Nash's poem then I am very
       | tickled:
       | https://www.madisonpubliclibrary.org/engagement/poetry/poem-...
        
         | ricketycricket wrote:
         | I'd guess it's the book Llama Llama Red Pajama:
         | https://openlibrary.org/books/OL24377652M/Llama_Llama_Red_Pa...
        
       | FloatArtifact wrote:
        | Code generation: I wonder about the difference in output given
        | the order of operations in training and fine-tuning. What if the
        | model was trained on the documentation and the code base for
        | Python, as an example, and then fine-tuning came from training
        | on actual Python code from GitHub?
        | 
        | If the model understands the Python documentation and the
        | standard library/interpreter implementation, is there then a
        | reduction in the data needed for context in other code?
        
       | nailer wrote:
        | Someone on HN made a point that weights can't even have
        | copyright - they lack two of the requirements for being
        | copyrightable:
       | 
       | https://news.ycombinator.com/item?id=35508651
        
       | bobwernstein1 wrote:
        | When will the first code-writing-specific model arrive?
        
       | smrtinsert wrote:
        | So is the next step for someone to come in and fine-tune on top
        | of it in order to make it a Vicuna? Or can the current Vicuna
        | deltas be applied?
        
         | rafaelero wrote:
         | Yeah, it's pretty trivial to change the base model from LLaMa
         | to this next one. You just have to finetune it with the same
         | data used previously to train Vicuna.
        
         | wesleychen wrote:
         | There's no model yet, only a dataset.
        
           | omneity wrote:
           | My understanding is that LLaMa's architecture is open, so the
           | most difficult part is:
           | 
           | 1. Getting data of equal or better quality
           | 
           | 2. Securing the funding/hardware required for training
           | 
           | 3. Learning/figuring out the training challenges needed to
           | tune the process (the PhD part)
           | 
            | It seems #1 is relatively the lowest-hanging fruit and a
            | prerequisite for the other two, and that's what the project
            | is (rightfully) tackling at this stage. #2 could be solved in
            | many ways and doesn't require much innovation if the project
            | and the team are solid. Which takes me to #3, which on the
            | other hand seems to be the make-or-break part of the project.
           | 
            | I'm not one to doubt the technical prowess of the RedPajama
            | team and their contributors; I rather see it economically.
            | How can an AI open-source project compete with big tech in
            | attracting the brilliant minds of our generation? It's enough
            | to look at levels.xyz to see the battle is not ... level.
            | 
            | There's a serious economic challenge here to having any sort
            | of sustainable open source initiative in AI.
        
       | Havoc wrote:
        | Love this - I'll happily accept a bit of a quality trade-off for
        | a pure open model. It's a bit like how I'm willing to accept
        | trade-offs to ensure my IoT gear is local-only, even if that
        | means losing cloud convenience.
        
       | simonw wrote:
       | The training data - all 1.2 trillion tokens - can be downloaded
       | by grabbing each of the 2,084 URLs listed here:
       | https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
       | 
       | I ran a HEAD request against them all to sum up the total file
       | size, and it's 2.67TB total.
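        | 
        | A minimal sketch of that kind of HEAD-request size check (Python
        | with the requests package; a sketch, not necessarily the exact
        | script used):
        | 
        |     import requests
        | 
        |     index = ("https://data.together.xyz/"
        |              "redpajama-data-1T/v1.0.0/urls.txt")
        |     urls = requests.get(index).text.split()
        | 
        |     total = 0
        |     for url in urls:
        |         head = requests.head(url, allow_redirects=True)
        |         total += int(head.headers.get("content-length", 0))
        | 
        |     print(f"{len(urls)} files, {total / 1e12:.2f} TB")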
       | 
       | Here's a Datasette Lite URL that lets you explore the size
       | metadata about those files:
       | https://lite.datasette.io/?json=https://gist.github.com/simo...
       | 
       | And a SQL query that shows the breakdown across the different
       | sources:
       | 
       | https://lite.datasette.io/?json=https://gist.github.com/simo...
       | 
        | Sizes here are in GB:
        | 
        |     common_crawl    1341.6166818914935
        |     c4               806.7667234372348
        |     github           212.1786002581939
        |     wikipedia        111.89125544670969
        |     book             100.43162744678557
        |     arxiv             87.35323827341199
        |     stackexchange     74.54870238155127
       | 
        | Common Crawl is in there a few times - they have the following
        | folders:
        | 
        |     common_crawl/2020-05    198 files
        |     common_crawl/2021-04    176 files
        |     common_crawl/2023-06    175 files
        |     common_crawl/2022-05    157 files
        |     common_crawl/2019-30    153 files
       | 
       | And then C4 as well, which is "a colossal, cleaned version of
       | Common Crawl's web crawl corpus. It was based on Common Crawl
       | dataset": https://paperswithcode.com/dataset/c4
        
         | afro88 wrote:
         | Interesting they're allowed to use stackexchange. I don't know
         | much about the legalities of scraping. Was this an agreement
         | between them, or is it simply ok to scrape and use the data in
         | a model?
        
           | progbits wrote:
           | https://stackoverflow.com/help/licensing
           | 
           | Doesn't this imply the produced model has to be CC-BY-SA too?
        
             | wongarsu wrote:
             | By that line of reasoning, GitHub copilot would have to be
             | GPL. Until somebody fights about this in court we don't
             | really know. But even in the worst case the CC-BY-SA is one
             | of the easier licenses to fulfill, not much worse than the
             | MIT-licensed code contained in the dataset.
        
             | gattilorenz wrote:
             | Welcome to this can of worms.
             | 
             | CC-BY-SA content needs attribution too, but I don't see
             | the(se) model(s) in the current state being able to do so.
             | 
              | I imagine we're gonna see the IBM PC BIOS/Unix/ReactOS
              | "tainted code" arguments again in court, except this time
              | it is not a human who is more-or-less knowingly responsible
              | for sneaking in copyrighted code.
        
         | doctoboggan wrote:
          | I am a little concerned that they have only about 60% of the
          | code tokens (GitHub and stackexchange). Given that so far the
          | only concrete use case I have for LLMs is coding assistance, I
          | wouldn't want this open source model to be any lower quality
          | in that area.
          | 
          | In your opinion, do you think this will hamper the model at
          | all? Or is it still more than enough to get good coding
          | assistance?
        
           | csris wrote:
           | Nice catch! We sampled the github dataset to match the total
           | # tokens seen by LLaMA during training: ~64B tokens (they
           | only pass through 0.64 of their total Github dataset
            | according to the paper). We have a lot of GitHub data and
            | will make it available soon. Note, we also have not built
            | this for compute-optimal training. We are following LLaMA's
            | lead and are training on more data for longer to optimize for
            | quality, not compute.
        
             | doctoboggan wrote:
              | Very good to hear that you are optimizing for inference
              | rather than training. I've tried LLaMA and its various
              | instruction-tuned siblings and have yet to get equivalent
              | performance to GPT-3.5 on coding tasks. Seeing how the base
              | model performed relative to GPT-3 on the various benchmarks
              | gives me hope that the difference is just in RLHF or other
              | fine-tuning steps. I really hope the community is able to
              | get there, especially if the resulting model is able to be
              | quantized with minimal loss.
        
           | jstx1 wrote:
           | Smaller % of training data doesn't necessarily mean lower
           | quality.
        
           | sp332 wrote:
           | As mentioned in the post, the smaller models are trained well
           | past "compute-optimal" amounts of data and I would expect are
           | well into diminishing returns. On the other hand, large
           | models are good one-shot and few-shot learners, and might be
           | able to pick up enough context from your prompt alone to be
           | useable, even if it wasn't specifically trained on your use
           | case.
        
             | [deleted]
        
             | Minus0 wrote:
             | In this context compute optimal isn't quite the same as
             | diminishing returns. If you look at the loss graphs in the
             | Llama paper, you can see that even the curves for the
             | smaller models were still going down at the time they
             | stopped training and weren't anywhere near plateauing yet.
             | LLMs are notoriously data hungry and will take a long time
             | to reach convergence.
             | 
             | Compute optimal here means the point at which it makes
             | sense to move from a smaller to a larger model assuming
             | that: (a) you have a fixed compute budget of FLOPs, and (b)
             | you want to train the best model possible. The problem is
             | that this applies only to training and assumes nothing
             | about the cost of inference. If you actually need to deploy
             | these trained models and support them long-term for
             | hundreds, thousands, even millions of people to use, would
             | you rather deploy a 13B model or a 30B model at the same
             | level of quality, even if the 13B model would be more
             | costly to train?
             | 
             | There is going to be a point at which these models plateau
             | and further improvement will not be possible without moving
             | to a larger model, but Llama doesn't get there quite yet.
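              | 
              | As a rough illustration of what "compute-optimal" means
              | here, a sketch using Chinchilla's approximate ~20
              | tokens-per-parameter heuristic and the token counts from
              | the LLaMA paper:
              | 
              |     tok_per_param = 20            # Chinchilla heuristic
              |     trained = {7: 1.0, 13: 1.0,   # params (B) ->
              |                33: 1.4, 65: 1.4}  # tokens used (T)
              |     for b, t in trained.items():
              |         opt = b * tok_per_param / 1000
              |         print(f"{b}B: ~{opt:.2f}T optimal, {t}T used")
              |     # e.g. 7B: ~0.14T optimal, 1.0T used;
              |     #      65B: ~1.30T optimal, 1.4T used
              | 
              | So under that heuristic, the 7B model was trained on
              | roughly 7x its compute-optimal token count, while the 65B
              | model sits close to it.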
        
           | bkm wrote:
           | Relevant:
           | https://twitter.com/abacaj/status/1647999551964323844
        
             | totoglazer wrote:
             | This tweet is misunderstanding the papers.
        
               | t3estabc wrote:
               | [dead]
        
           | simonw wrote:
           | No idea!
           | 
           | I wonder how hard it would be to fine-tune something built on
           | RedPajama on further code examples to improve performance
           | there.
        
         | fnands wrote:
         | Nice. Thanks for the summary.
         | 
         | So ~4x the size of the Pile, any idea how it stacks up in terms
         | of quality to other big datasets?
        
         | rcpt wrote:
         | I'm kind of surprised how small that dataset is
        
         | simonw wrote:
         | Wrote this up as a blog post:
         | https://simonwillison.net/2023/Apr/17/redpajama-data/
        
           | csris wrote:
           | Hi! I'm the VP of Engineering at Together. Thanks for writing
           | up these instructions! FYI, you can also download all the
           | files with one wget command:                 wget -i
           | https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
           | 
           | This is also mentioned on the dataset card for redpajama-
           | data-1T on Huggingface [1].
           | 
           | [1]: https://huggingface.co/datasets/togethercomputer/RedPaja
           | ma-D...
        
             | simonw wrote:
             | I made sure to include that in my blog post - along with a
             | note that you need 2.67TB of disk space first!
        
       | macinjosh wrote:
        | This guy has kids, so we all know he.. nevermind. I love the
        | name, being a parent myself.
        
       ___________________________________________________________________
       (page generated 2023-04-17 23:00 UTC)