[HN Gopher] Hello Dolly: Democratizing the magic of ChatGPT with...
       ___________________________________________________________________
        
       Hello Dolly: Democratizing the magic of ChatGPT with open models
        
       Author : hnuser0000
       Score  : 387 points
       Date   : 2023-03-24 12:21 UTC (10 hours ago)
        
 (HTM) web link (www.databricks.com)
 (TXT) w3m dump (www.databricks.com)
        
       | bob1029 wrote:
       | > Surprisingly, instruction-following does not seem to require
       | the latest or largest models: our model is only 6 billion
       | parameters, compared to 175 billion for GPT-3.
       | 
        | We started seeing this in our testing. OpenAI's Curie model is
        | responding very well to our fine-tuning experiments for a
        | chatbot-style interface. I am trying to keep us focused on the
        | quality of the training data rather than obsessing over raw
        | network size. Davinci (and derivatives) might turn out to be
        | overkill for our use cases.
        
       | imwithstoopid wrote:
       | here come the "Me Too!!" announcements from everyone trying to
       | catch some of the energy of this new market
       | 
       | how long until IBM, Tesla and Oracle announce Me-Too LLMs?
        
       | [deleted]
        
       | gavi wrote:
        | It's trained on the Alpaca dataset, which in turn was generated
        | from OpenAI's davinci. I'm wondering: is it effectively
        | transferring the weights by generating content from the source
        | model?
        
       | epups wrote:
        | I think this is cool, but it's in the range of complexity that I
       | would expect from a personal project. When you put a whole
       | organization behind it, I feel you could have provided something
       | extra - better datasets? Improved weights from a ton of training?
        
       | kvmakes wrote:
       | Super cool stuff!
        
       | Mizza wrote:
        | It's immediately become difficult to untangle the licensing
        | here. Is this safe for production use? I have no idea if I can
        | expect a DMCA from Mark if I step out of bounds with this or
        | other post-Alpaca models, unless I'm missing something
        | important. Meta really botched the Llama release.
        
         | pwendell wrote:
          | Yes, it's nuanced, but it will be simplified going forward.
         | 
         | This uses a fully open source (liberally licensed) model and we
         | also open sourced (liberally licensed) our own training code.
         | However, the uptraining dataset of ~50,000 samples was
         | generated with OpenAI's text-davinci-003 model, and depending
         | on how one interprets their terms, commercial use of the
         | resulting model may violate the OpenAI terms of use. For that
         | reason we are advising only noncommercial use of this model for
         | now.
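          | 
          | For the curious, the Alpaca-style generation loop looks
          | roughly like this (a minimal sketch, not our exact pipeline;
          | the seed task and prompt format are illustrative):
          | 
          |     import openai  # v0.x openai-python API
          | 
          |     seed = "Write a tweet announcing Dolly, an open LLM."
          |     prompt = (
          |         "Invent a new instruction similar to this example, "
          |         "then answer it.\n"
          |         f"Example: {seed}\n"
          |         "Format:\nInstruction: ...\nResponse: ..."
          |     )
          |     out = openai.Completion.create(
          |         model="text-davinci-003",
          |         prompt=prompt,
          |         max_tokens=512,
          |         temperature=0.7,
          |     )
          |     print(out.choices[0].text)  # one synthetic sample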
         | 
         | The next step here is to create a set of uptraining samples
         | that is 100% open. Stay tuned.
        
           | Taek wrote:
           | Are you in touch with the OpenAssistant team? I believe they
           | already have a more or less complete set of samples
           | (100,000!) that were produced in an open environment and
           | aren't encumbered by any licensing.
        
             | pwendell wrote:
              | No, I hadn't heard of that; we'll engage with that team.
              | This is exactly what we need. We'll look into it.
        
         | babyyoda wrote:
          | Given that Alpaca strictly specified that it was released
          | purely for academic use, and that any commercial use was
          | prohibited because it would violate OpenAI's terms of
          | service, I don't see this as viable for commercial use.
          | Looks like a marketing gimmick.
        
         | rnosov wrote:
          | This has nothing to do with Facebook. The foundational model
          | here is GPT-J, which is open source and safe to use. Sadly,
          | it is inferior to state-of-the-art models such as LLaMA.
        
           | Mizza wrote:
           | But they're "using data from Alpaca". I don't know what that
           | means, isn't Alpaca using data generated by ChatGPT, which
           | isn't "clean" to use? Or data from Facebook, which isn't
           | "clean" to use? I'm drowning.
        
             | bilekas wrote:
              | I don't know the full details, but Alpaca is from
              | Stanford and is only based on LLaMA (not a derivative
              | work, AFAIK). That said, see Meta's licensing here:
              | 
              | https://github.com/facebookresearch/llama/blob/main/LICENSE
              | 
              | I can't be sure what that license actually refers to: the
              | language model or just the tooling in the Git repo.
              | 
              | I agree it's a minefield, but with Meta I would err on
              | the side of caution.
        
             | rnosov wrote:
              | They are instruction-tuning it using the dataset released
              | by the stanford-alpaca team. The dataset itself is
              | synthetic (created using GPT-3) and somewhat noisy, and
              | in my view it could easily be recreated if OpenAI ever
              | tried to go after it (which is very unlikely). Anyway,
              | Facebook has nothing to do with anything used by this
              | project.
        
               | Mizza wrote:
                | So, this is a "dirty" model, in that it was created
                | from data which violated OpenAI's ToS. Obviously, this
                | kind of violation is basically fine if you're a massive
                | corporation who the rules don't apply to, but it's a
                | huge risk if you're a small fish.
        
               | sebzim4500 wrote:
               | That's between OpenAI and the people that recorded the
               | data. No one else needs to care.
        
               | hutzlibu wrote:
               | "basically fine if you're a massive corporation who the
               | rules don't apply to, but it's a huge risk if you're a
               | small fish"
               | 
                | With these things, it is usually the other way around.
                | 
                | If you are a small fish, no one will care. But if you
                | are big enough that money could be extracted from you,
                | then they will come. A big org just has better lawyers
                | and negotiating power, but it really cannot ignore the
                | law. Especially not if there is a competitor with money
                | to sue.
                | 
                | So if you are small and want to become big, you had
                | better be cautious about the legal ground you are
                | walking on.
        
               | gremlinsinc wrote:
                | If you use output from a non-profit who open sourced
                | the output gained by following the ToS (as in, they
                | aren't using it 'for profit'), it's not illegal,
                | because:
                | 
                | A. It's output gained by following the letter of the
                | agreement (the ToS).
                | 
                | B. A ToS only applies directly to people who've
                | accepted it; unless Alpaca's license/ToS ALSO forwards
                | the same criterion as its source at OpenAI, it wouldn't
                | apply to derivatives.
                | 
                | It's like if an app developer on iOS violated a ToS and
                | Apple tried to go after everybody who ever used the
                | app: they didn't agree directly to the ToS, only the
                | developer did.
        
               | rnosov wrote:
               | ToS are not the law. It would be similar to your power
               | company claiming copyright over the code written using
               | "their" electricity. Not going to happen. I wouldn't be
               | too concerned.
        
               | sp332 wrote:
               | No, but you could be banned from using OpenAI products in
               | the future, which seems like quite a liability for a
               | researcher or company.
        
               | rnosov wrote:
                | That would be an anticompetitive practice that is
                | actually against the law in many countries[1]. In the
                | unlikely event of OpenAI ever engaging in such things,
                | they would be sued into oblivion.
               | 
               | [1] https://en.wikipedia.org/wiki/Refusal_to_deal
        
               | Spivak wrote:
               | Especially when OpenAI explicitly doesn't have a claim to
               | copyright on the model output.
        
         | bilekas wrote:
         | > Meta really botched the Llama release.
         | 
          | It's no surprise really, though; from what I see, they
          | recognised some way to monetize and rolled back their
          | commitment.
          | 
          | But this Dolly doesn't depend on LLaMA (unless I'm missing
          | something), so you don't have to use it.
        
         | leobg wrote:
         | Why? Dolly had nothing to do with Llama or its weights.
         | 
          | Besides: how would anyone ever know which model generated the
          | output you are serving? AFAIK there is no fingerprint in any
          | model's output. And even if there were, it would probably be
          | destroyed by fine-tuning "over it".
        
           | stametseater wrote:
           | > _AFAIK there is no fingerprint in any model's output._
           | 
           | It seems like there easily could be. What if some of the data
           | they trained it on didn't exist anywhere else except in the
           | training set, and was put there specifically for this
           | purpose? For instance they could have taught it a few poems
           | that don't exist anywhere else. If you can coax the LLM of
           | unknown origin into reciting those poems back to you, you
           | know where it came from.
        
             | kurthr wrote:
              | Even easier: have a small set of 8-10 character gibberish
              | tokens it's trained on in particular contexts (e.g. a
              | non-existent poem). Then feed it one or several of the
              | poems and see if a gibberish token pops out.
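              | 
              | Something like this toy sketch (the canary strings and
              | the check are made up for illustration):
              | 
              |     # Plant gibberish canaries in training docs, then
              |     # probe a suspect model for them later.
              |     CANARIES = ["xq3vz8lk", "bv7rqm2t", "zk9wfh4d"]
              | 
              |     def leaked_canary(generate, poem):
              |         """generate(prompt) -> completion string."""
              |         completion = generate(poem)
              |         return any(c in completion for c in CANARIES)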
        
               | eigenvalue wrote:
               | I think they call these canary GUIDs. If you manage to
               | generate one from an LLM then you can conclude with
               | certainty that the model saw that document during
               | training.
        
           | neilv wrote:
           | > _Besides: How would anyone ever know which model generated
           | the output you are serving?_
           | 
            | There's precedent for "whatever you can get away with" in
            | tech companies, but establishing a culture of that at the
            | start of this new big change could end up being undesirable
            | for most people.
           | 
           | For example, it could relieve demand for more legal and
           | sustainable ways, until it's too late. (Look at the history
           | of digital entertainment media piracy and DRM and
           | legislation, for example. Or look at the history of software
           | piracy, where some big companies seem to actually want their
           | product to be pirated, partly because it builds a bigger moat
           | against competitors, and they can legally strongarm some of
           | those pirates later.)
        
       | bilekas wrote:
       | This is really great news and something I felt was missing from
       | the market so far. It seems everyone wants to create `moats` or
       | walled-gardens with some aspect of their models etc.
       | 
        | Nice job, Databricks, and nice numbers too. Looking forward to
        | more improvements.
        
         | detrites wrote:
         | Thought the same until I read this:
         | 
         | > Contact us at hello-dolly@databricks.com if you would like to
         | get access to the trained weights.
        
           | bilekas wrote:
            | This is not an issue though; those would just be the
            | weights used by Databricks, and there is no reason you
            | can't add your own, right?
           | 
           | Like giving away a website template without the demo content,
           | it's perfectly normal.
        
           | superchink wrote:
            | It's now available on GitHub:
            | https://github.com/databrickslabs/dolly
        
           | jppope wrote:
            | Data transfer might actually be the problem there, not
            | something like trying to hide the model.
        
             | yieldcrv wrote:
             | bittorrent, come on
        
       | crosen99 wrote:
       | Fine-tuning these models reminds me of the good ol' days with
       | tube TVs where the slightest twist of the vertical hold dial
       | meant the difference between a clear picture and useless,
       | dizzying, visual nonsense.
        
       | woeirua wrote:
        | This is the real risk to OpenAI's business model. If it turns
        | out that you can get most of the same outcome with drastically
        | smaller and cheaper models, then OpenAI is going to have a hell
        | of a time keeping customers around: it will just be a race to
        | the bottom on price, and bigger, more expensive models will
        | lose just from a hardware-cost standpoint.
        
         | xpe wrote:
         | No disrespect to the author intended, but the above comment is
         | muddled.
         | 
         | 1. OpenAI, the organization, is not equivalent to its chat
         | offering.
         | 
          | 2. Saying "the" real risk isn't persuasive. Let's examine
          | many risks before claiming one is the most significant. Also,
          | "real" in this usage is often a throwaway (i.e. unneeded)
          | word, in editor-speak.
         | 
         | 3. Let's talk about OpenAI's "business model" (though such
         | discussions are tricky).
         | 
         | 3A. Originally, OpenAI wasn't trying to "hold onto" AI
         | advancements. It claimed to be a broadly funded way to explore
         | fundamental questions of artificial intelligence in a non-
         | commercial, ethical way.
         | 
          | 3B. Of course, the above claim was largely aspirational,
          | because it wasn't baked into their DNA in a way that could
          | survive the surrounding temptations of more funding, glory,
          | and resources.
         | 
          | 3C. Even with their more commercialized model of the last
          | several years, their business model feels like (a)
          | fundraising in exchange for (b) (claimed) collective goods:
          | open source, tools, and shared research.
         | 
         | 3D. OpenAI feels to me more and more like a commercial research
         | lab; there does seem to be a lot of commercial partnering with
         | their funding organizations (e.g. Microsoft).
         | 
         | 4. I doubt the leadership there views the current ChatGPT
         | models as unchanging. I expect there is a considerable revenue
         | stream _around_ the space. OpenAI is well positioned to play
         | the game several steps ahead of others.
         | 
          | I would frame the broader question this way: for many years,
          | there has been a hunger for this deeper AI research, due not
          | only to (i) the expertise and resources required, but also
          | (ii) to the hope that there is an organization that can
          | maybe keep it within human or ethical bounds.
          | 
          | Unfortunately, this amorphous hope doesn't seem to match the
          | actual organizational incentives or dynamics. It is also
          | unclear how much demand the public in a free market will
          | have for nobler research.
         | 
         | My position on these kinds of things is simple: follow the
          | money. If we want an accountable, public-interest AI research
          | laboratory, it's going to have to be designed, funded, and
          | overseen very differently.
        
         | smoldesu wrote:
         | On the flip-side, OpenAI is primed to destroy their
         | competitors. Partnership with Microsoft means they can buy
         | Azure compute at-cost if need be. Their current portfolio of
         | models is diverse on the expensive and cheap ends of the
         | spectrum, with thousands of people on Twitter and HN still
         | giving them lip-service. With dozens of clones hitting the
         | market, OpenAI is the only one staying consistently relevant.
         | 
         | The widespread adoption of local AI won't obsolete a well-
         | priced AI API. I feel like we learned that lesson pretty
         | thoroughly in the SaaS era.
        
           | xpe wrote:
           | > The widespread adoption of local AI won't obsolete a well-
           | priced AI API. I feel like we learned that lesson pretty
           | thoroughly in the SaaS era.
           | 
           | Unless I am misunderstanding (?), this seems like an
           | overgeneralized lesson. There are many key differences
           | between these situations that make such a connection
           | unlikely. Could you explain your reasoning?
        
           | ijustlovemath wrote:
            | The difference between this and SaaS is that businesses
            | have been moving their (end-user) products to SaaS due to
            | wider broadband availability, as well as greed (read: MRR),
            | but on the LLM side, people are _building new products with
            | it_, so the incentive is to keep your costs low (or free)
            | so you can make more money once you release.
        
         | nico wrote:
         | That's why they are moving so fast and trying to get as much
         | press/media attention as possible.
         | 
         | They want to stay top of mind.
         | 
          | Think about Coca-Cola: anyone can make a drink just as good,
          | but it's almost impossible to build their brand and
          | distribution from scratch.
        
         | lfciv wrote:
         | I wouldn't underestimate the power of momentum
        
         | rashkov wrote:
         | What about the high quality training data that OpenAI has
         | encoded into ChatGPT? Do these other models come close to that?
        
           | woeirua wrote:
           | Why couldn't you just use OpenAI's API to feed prompts and
           | then take the outputs and use them to train your own model to
           | exfiltrate the best features of GPT?
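            | 
            | Roughly this, as a sketch of what I mean (assuming the
            | v0.x openai-python API; prompts.txt is a hypothetical
            | file of task prompts):
            | 
            |     import json, openai
            | 
            |     # Harvest (prompt, completion) pairs from the API...
            |     pairs = []
            |     for line in open("prompts.txt"):
            |         out = openai.Completion.create(
            |             model="text-davinci-003",
            |             prompt=line.strip(),
            |             max_tokens=256,
            |         )
            |         pairs.append({
            |             "instruction": line.strip(),
            |             "output": out.choices[0].text.strip(),
            |         })
            | 
            |     # ...then fine-tune your own model on them.
            |     json.dump(pairs, open("distilled.json", "w"))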
        
             | xpe wrote:
             | Give it a try if you feel like it is a good thing to do.
             | I'm sure some nation states are doing it.
             | 
              | P.S. This comment does not reflect my personal values,
              | but I would rather someone with values try it, almost
              | like a white-hat pen test.
        
             | wsgeorge wrote:
             | Because it would be against their TOS, and things could
             | look ugly, legally.
        
               | tspike wrote:
               | How many TOS agreements do you suppose they violated
               | while training their models?
        
               | AJ007 wrote:
                | It's still an open question whether any of these
                | models, trained on copyrighted work, will themselves be
                | eligible for copyright protection.
        
               | ImHereToVote wrote:
               | Ironic
        
               | ImprobableTruth wrote:
                | Is this a bit? If it's illegal to train on copyrighted
                | material, then OpenAI has broken the law ten times over
                | by training GPT-3. There's absolutely zero reason for
                | them to sue; they'll just ban the responsible people.
        
               | nickthegreek wrote:
               | I think their TOS forbids using the API for this. I don't
               | think it covers the use of the web interface.
        
               | circuit10 wrote:
               | However:
               | 
               | "You may not [...] except as permitted through the API,
               | use any automated or programmatic method to extract data
               | or output from the Services, including scraping, web
               | harvesting, or web data extraction;"
        
               | nickthegreek wrote:
               | Can't be automated, so manual extraction is allowed.
        
           | typon wrote:
           | That's how Alpaca is made
        
       | aabajian wrote:
       | Anyone care to comment on why the output of these models changes
       | so dramatically given so little Q&A training? It's a 6 billion
       | parameter model with only 50 thousand Q&A samples.
       | 
       | It's clear the model already "knows" the format of a Tweet (short
       | length, attention-grabbing, contains hashtags). The model also
       | knows stuff about language models (word2vec, tokenization), and
       | can include entities from the question in its response (Dolly,
       | Databricks). Yet, it just doesn't put these pieces together in
       | the right way without the Q&A training.
       | 
       | Edit: For kicks, I asked GPT-4 this question:
       | https://imgur.com/a/sM4uyBn
        
         | pwendell wrote:
          | Yes, this was a very surprising result... that the relatively
          | small uptraining was able to unlock so much latent knowledge
          | in the model.
        
       | bogwog wrote:
        | Open Assistant is doing the same thing, but actually creating
        | a dataset that isn't on questionable legal grounds, by
        | building a gamified web app where people can contribute:
        | https://open-assistant.io/dashboard
        | 
        | I wonder how small these models can get. From 175B to 6B with
        | comparable performance is huge, but can it go lower?
        
       | highwaylights wrote:
        | I see that among its five book suggestions it has suggested
        | you should read the Hitchhiker's Guide twice.
        | 
        | Not many humans would even get this answer correct.
       | 
       | I am impressed.
        
       | Zaheer wrote:
        | How hard would it be to embed this into an NPM module so
        | anyone can use it in their servers / apps locally?
        
       | [deleted]
        
       | sbussard wrote:
        | I'd like some clarification of terms: when they say it takes 3
        | hours to train, they're not saying from scratch, are they?
        | There's already a huge amount of training to get to that point,
        | isn't
       | that correct? If so, then it's pretty audacious to claim they've
       | democratized an LLM because the original training likely cost an
       | epic amount of money. Then who knows how much guidance their
       | training has incorporated, and it could have a strong undesirable
       | viewpoint bias based on the original training.
        
         | joshhart wrote:
          | The 3 hours is the instruction fine-tuning. The base
          | foundational model is GPT-J, which was provided by EleutherAI
          | and has been around for a couple of years.
         | 
         | Note: I work at Databricks and am familiar with this project
         | but didn't work on it.
        
           | Taek wrote:
           | Do you know why GPT-J is being used instead of NeoX or any of
           | the other larger open source models?
        
       | cuuupid wrote:
        | I don't love the lack of a quantitative comparison to Alpaca,
        | but a commercial model (which sounds like it's in the works)
        | would finally move the needle on democratizing access to LLMs.
        | 
        | I'll also commend the authors for not falling into the "LLMs
        | can't perform without 200B params!" fallacy. For anyone
        | reading, 6B params is small enough to train on a 3090. A PC rig
        | for training or running inference with this would set you back
        | maybe $4k.
        | 
        | The end game here is likely getting the model to perform well
        | with millions of parameters on specific tasks. Most business
        | uses of ChatGPT are pretty closed-domain tasks; it wouldn't be
        | a huge step to distill this model on a specific task and get it
        | down to 150-350M params (which is roughly BART size and can run
        | even on AWS Lambda).
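        | 
        | A sketch of that distillation step (assuming Hugging Face
        | transformers; the teacher/student choices are illustrative,
        | not anything from the article):
        | 
        |     from transformers import pipeline
        | 
        |     # Teacher: a big instruction-tuned model labels your
        |     # closed-domain prompts...
        |     teacher = pipeline("text-generation",
        |                        model="EleutherAI/gpt-j-6B")
        |     prompts = ["Summarize this support ticket: ..."]
        |     labels = [teacher(p, max_new_tokens=64)[0]["generated_text"]
        |               for p in prompts]
        | 
        |     # ...then fine-tune a ~350M student (BART-base scale) on
        |     # the (prompt, label) pairs with a standard training loop.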
        
       | nothrowaways wrote:
       | "ChatGPT, a proprietary instruction-following model" pun
       | intended.
        
       | jawadch93 wrote:
       | [dead]
        
       | mydpy wrote:
       | What a great time to be in this field. It's advancing so quickly!
        
       | sillysaurusx wrote:
       | Interesting. DALL-E, Dalai
       | (https://cocktailpeanut.github.io/dalai/), and now Dolly are all
       | pronounced the same way.
       | 
       | It feels like there should be an xkcd for this.
        
         | joseda-hg wrote:
          | Are they? (Not sarcastic; I'm not a native speaker, and I
          | wouldn't pronounce them all that similarly at first sight.)
        
           | dwringer wrote:
           | As a native speaker, no, there's hardly any consensus I've
           | seen about how to pronounce them. Certainly there are trends.
           | But I pronounce Dalai somewhere between "Dah-lay" and "Dah-
           | lie", and DALL-E _sorta_ like Dolly ( "Dah-lee"), but with a
           | deliberate pause ("Dahl Ee").
        
           | [deleted]
        
         | outside1234 wrote:
         | What could go wrong
        
         | JLCarveth wrote:
         | > are all pronounced the same way
         | 
         | No they're not.
        
         | mejutoco wrote:
         | AFAIK DALL-E is pronounced as Dali, as in Salvador Dali.
         | 
         | https://en.wikipedia.org/wiki/Salvador_Dal%C3%AD
        
           | chatmasta wrote:
           | I figured it was a reference to the Dalai Lama (which doesn't
           | invalidate your comment, since that's also pronounced like
           | Dali). LLM -> Llama -> Dalai Lama
        
             | rburhum wrote:
              | Dali has an accent at the end, which puts the emphasis
              | on the last syllable. Dalai does not. They sound very
              | different: "dah-lee" vs
              | https://m.youtube.com/watch?v=JhFbvuKn45w
        
             | sillysaurusx wrote:
             | Hmm. Is Salvador Dali pronounced differently than Dolly or
             | Dalai? The wikipedia page has "dah-lee" as the phonetic,
             | and https://www.google.com/search?q=pronounce+salvador+dali
             | sounds the same as
             | https://www.google.com/search?q=pronounce+dalai+lama. So it
             | seems like all three are identical.
        
               | ToValueFunfetti wrote:
               | The emphasis in Dali is on the second syllable, which is
               | at least different from Dolly. I've always pronounced
               | Dalai Lama the same as I would Dolly Lama, but Cambridge
               | dictionary is saying it should be Da-lay in both US and
               | UK pronunciations.
               | 
               | Tangentially, it seems like most of the results for both
               | searches were autogenerated with TTS programs. I wonder
               | if our pronunciations will shift towards TTS mistakes
               | over time. Probably not, these videos only have a few
               | thousand views, but neat if true.
        
               | mejutoco wrote:
               | Dali has the stress on the last syllable, hence the
               | accent (but Dall-e probably not). In my native language
               | Dalai is pronounced "Da-lie", like another comment says
               | above. TIL Dolly is pronounced so similarly. I thought
               | the Do sounded like Doberman, but apparently not.
               | 
               | https://www.merriam-webster.com/dictionary/dolly
        
             | 4ndrewl wrote:
                | I thought "Dalai", pronounced "Dall Eye", rhymes with
                | "Shall I", and "Dali", pronounced "Dahl eee", rhymes
                | with "Carly".
        
               | tetraca wrote:
               | This is all very weird to me because I've always
               | pronounced Dalai as "Dah-lay".
        
               | chatmasta wrote:
               | Interesting. According to Google, it's a British ("Da-
               | lie") vs. American ("Da-lee") difference.
        
           | [deleted]
        
           | bilekas wrote:
            | Handy also to think of WALL-E. At least that's where my
            | assumption came from.
        
           | [deleted]
        
           | cosmojg wrote:
           | It's quite clearly a reference to WALL-E the environmentally
           | conscious robot, which is pronounced as you'd expect. I like
           | to think of it as DALL-E the surrealist robot painter.
        
             | mejutoco wrote:
              | That is exactly my interpretation: both WALL-E and Dali.
              | I think we are in agreement.
        
             | JohnFen wrote:
             | I totally failed to make that connection! Was that the
             | intended reference? What's the link to WALL-E?
        
         | renewiltord wrote:
         | Wow, just discovered that the American pronunciation for Dalai
         | Lama is Da-lee. Well, that's a discovery.
         | 
          | This is like when Khan Academy came out and there was a guy
          | online saying it's a terrible brand because it sounds like
          | "Con Academy", which it doesn't in my dialect.
         | 
         | Took a while to get it.
        
           | rockzom wrote:
           | How do you say Khan?
        
             | ricardobeat wrote:
             | Kan / k-a-n, A like in "father"
        
             | theSuda wrote:
             | I found this which matches how I say it (as an Indian)
             | https://www.howtopronounce.com/khan/4145893
             | 
              | It's the KH sound that doesn't really exist in English,
              | hence many get it wrong.
        
               | gowld wrote:
               | The KH is one thing, but for "con"-fusion (hah!), it's
               | also about the "higher" "caan" vs "cawn", which is a very
               | subtle difference.
        
         | xg15 wrote:
          | I guess after carcinisation comes dolly-fication...
          | 
          | But I do like the penchant for whimsical naming schemes in
          | this field. First Sesame Street characters, now apparently
          | everything is sheep...
        
       | thewataccount wrote:
       | I might be having a moment - but I can't find any links to a git
       | repo, huggingface, or anything about the
       | models/weights/checkpoints directly from the article.
       | 
       | I just see a zip download that AFAIK also doesn't contain the
        | weights/checkpoints. I find this a bit odd: the contents of the
        | zip (from the GDrive preview) look like they should be in a git
        | repo, and I assume they download the model from somewhere?
        | GDrive usually has rate limits, which I'm concerned about.
       | 
       | If anyone from databricks reads this - are there plans to publish
       | this on a git repo somewhere, as well as the weights/checkpoints?
       | 
       | EDIT: Oh I just noticed
       | 
       | > Contact us at hello-dolly@databricks.com if you would like to
       | get access to the trained weights.
       | 
        | This... seems odd for an article titled "Democratizing the
        | magic of ChatGPT with open models"?
        
         | MagicMoonlight wrote:
          | So it's another classic private-only model that they'll pull
          | as soon as the suckers have trained it up for them.
        
         | thequadehunter wrote:
          | Lol. This is classic ML crap. Files with no documentation, no
          | links, and multiple files with same-ish names but no
          | explanation of which one is which.
        
         | nofinator wrote:
          | Yes, the ZIP on a Google Drive owned by one of their
          | engineers is weird considering they have a pretty active
          | GitHub presence of open source projects, though it does use
          | an Apache license like their others.
          | 
          | Perhaps Databricks suspected another big announcement was
          | coming soon and wanted to get this one out first?
        
         | amrb wrote:
            | Are they pulling a Facebook on model access?
        
           | thewataccount wrote:
            | From what I can tell, they're fine-tuning EleutherAI's
            | GPT-J.
            | 
            | Alpaca was made by fine-tuning LLaMA; however, they also
            | released the dataset they used to do this, and it looks
            | like Dolly is that dataset applied to GPT-J. It does not
            | use LLaMA itself.
        
           | dragonwriter wrote:
           | I think they are dodging unclear legal issues surrounding
           | certain steps of the model-building process while being as
           | open as possible with the components given that constraint,
           | allowing downstream users to make their own legal risk vs.
           | effort choices.
        
             | pwendell wrote:
             | Yes, this.
        
             | amrb wrote:
              | Given the hardware/energy needed to train, it would be
              | nice to have a legal document that said something like:
              | this model has no warranty; it may be a breakthrough
              | machine or a hand grenade. Use at your own risk!
        
         | slimsag wrote:
         | The README also says this:
         | 
         | > This fine-tunes the [GPT-J
         | 6B](https://huggingface.co/EleutherAI/gpt-j-6B) model on the
         | [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)
         | dataset using a Databricks notebook.
         | 
         | > Please note that while GPT-J 6B is Apache 2.0 licensed, the
         | Alpaca dataset is licensed under Creative Commons NonCommercial
         | (CC BY-NC 4.0).
         | 
         | ...so, this cannot be used for commercial purposes
        
           | dwallin wrote:
            | Essentially every model worth anything has been trained on
            | an unfathomably large amount of data under copyright, with
            | every possible licensing scheme you could imagine, under
            | the assumption that it is fair use. While you can argue
            | that it's all built on a house of cards (and a court may
            | well agree with you), it's kind of arbitrary to draw a line
            | here.
        
             | judge2020 wrote:
             | > under the assumption that it is fair use.
             | 
              | No, because you as a human looking at "art" over your
              | lifetime and learning from it is not "fair use" of the
              | copyright; it's no use at all. This is the crux of every
              | argument for both language models and AI art models:
              | that these tools are learning how to draw, learning what
              | styles and characteristics of input art correspond most
              | strongly with words, and creating art with that knowledge
              | just like any other human, not simply collaging together
              | different pieces of art.
        
             | Taek wrote:
             | Fair use via "this is completely impossible to regulate so
             | you might as well embrace it"
        
           | ambicapter wrote:
           | > ...so, this cannot be used for commercial purposes
           | 
           | The implication being that you're only "democratizing"
           | something if people can make money off of it?
        
           | dragonwriter wrote:
           | > ...so, this cannot be used for commercial purposes.
           | 
            | The legal relation between models and training data sets
            | seems murky; of course, with the build tooling, you can
            | also substitute in another instruction-following training
            | set if you want to avoid licensing issues with the Alpaca
            | set, whereas if you aren't concerned with them, you can
            | just blaze ahead.
        
           | chpatrick wrote:
           | As far as I know the copyright situation for models is
           | ambiguous and also depends on the region. In the US you can't
           | copyright data made by an automated process but you can in
           | the EU, or something to that effect.
        
           | yieldcrv wrote:
           | > ...so, this cannot be used for commercial purposes
           | 
           | or you can raise $30,000,000 right now and worry about the
           | copyright infringement lawsuit in 2026 or never.
        
           | thewataccount wrote:
           | > ...so, this cannot be used for commercial purposes
           | 
            | Can't they release the fine-tuned weights as
            | non-commercial as well?
        
         | dopidopHN wrote:
          | Thanks, I missed that email while skimming.
        
         | pwendell wrote:
         | Full source code is up here now:
         | 
         | https://github.com/databrickslabs/dolly
         | 
          | Sorry it took us a day to get the external repo set up.
        
           | thewataccount wrote:
           | Awesome thank you!
           | 
           | Was the Alpaca dataset being licensed as non-commercial only
           | the reason you aren't releasing the weights? Is it possible
           | to just release them under the same license?
        
             | pwendell wrote:
              | Yes, the issue is that some of the training data is
              | arguably tainted with a noncommercial license (it's
              | nuanced; discussed in my comment below). We are releasing
              | weights to people who request them, but we wanted to have
              | an email request flow so that we can make sure people
              | know it's just for noncommercial purposes.
             | 
             | Working on a model without this issue. Certainly our goal
             | is totally open models anyone can use for anything.
        
               | thewataccount wrote:
               | Understandable, thank you for the response!
               | 
                | I've been a bit jaded by the "open/democratizing AI"
                | stuff when companies then stiff us on actually making
                | it open - but not wanting to be the first to litigate
                | the new types of issues ML brings is very
                | understandable.
                | 
                | Question: would you consider benchmarking a single 4090
                | for your training? While training in a few hours with
                | 8x A100s is impressive, I (and I think others) am
                | curious how that translates to consumer hardware. IMO
                | running/fine-tuning on consumer hardware is the
                | ultimate endgame for all AI models.
        
               | robwwilliams wrote:
                | Looking forward to a response. We are heading toward a
                | 6x Bizon 4090 system as a test bed.
               | 
               | https://bizon-tech.com/bizon-zx5500.html
        
         | m3affan wrote:
         | Databricks is on a roll
        
       | jppope wrote:
        | Does anyone else find it ironic that all these ChatGPT "clones"
        | are popping up when OpenAI was supposed to be the one open
        | sourcing and sharing their work?
       | 
       | I guess: "You Either Die A Hero, Or You Live Long Enough To See
       | Yourself Become The Villain"?
        
         | Taek wrote:
         | Sam Altman has turned into a megalomaniac.
        
           | brandall10 wrote:
           | Possibly, but it is a bit unusual that he has zero equity in
           | the company. So it might not be for monetary reasons.
        
           | [deleted]
        
         | JohnFen wrote:
         | > when OpenAi is supposed to be the ones open sourcing and
         | sharing their work?
         | 
         | OpenAI renounced being open source. Don't let the name fool
         | you.
        
           | throwaway4837 wrote:
           | I think all of the "AI alignment" talk is mostly
           | fearmongering. It's a cunningly smart way to get ignorant
           | people scared enough of AI so they have no choice but to
           | trust the OpenAI overlords when they say they need AI to be
           | closed. Then OpenAI gets a free pass to be the gatekeeper of
           | the model, and people stop questioning the fact that they
           | went from Open to Closed.
           | 
           | AI being tuned to be "safe" by an exceedingly small set of
           | humans is the thing we should be afraid of. It's the
           | effective altruism effect: if you bombard people enough with
           | "safety" and "alignment" speak, they will look past the fact
           | that you're mainly interested in being a monopoly. My bigger
           | conspiracy theory is that Bill Gates getting behind "AI
           | alignment" is a calculated move to get people to look past
           | Microsoft's unilateral involvement.
        
             | soup10 wrote:
             | I don't know what press releases you've been reading, but
             | the model is closed so they can make money off it, that's
             | pretty obvious.
        
               | throwaway4837 wrote:
               | I think that is a simple take and underestimates the
               | insidious nature of the AI alignment initiatives. Or
               | maybe I'm overestimating it.
        
               | TigeriusKirk wrote:
               | At this point I'm really not sure what they're up to in
               | terms of grand strategy. I don't even know that making
               | money is their ultimate goal. At a certain level of
               | ambition money is just a tool to get what you really
               | want.
        
               | brandall10 wrote:
                | It's interesting to note that Altman has no equity in
                | the company. One of the primary motives espoused for
                | becoming a for-profit company was to be competitive
                | with big tech as far as bringing in top-level research
                | talent.
        
               | JohnFen wrote:
               | I don't think that Altman's lack of equity position in
               | OpenAI means anything at all when it comes to what
               | OpenAI's goals are.
               | 
               | We know what their immediate goals are: to make as much
               | money as possible. The only question is what their
               | longer-term goals are.
        
         | 0xDEF wrote:
         | AI and high-performance semiconductors are the only
         | technological fields where the US and allies haven't been
         | surpassed by Russia and China.
         | 
         | There is probably a lot of political pressure on OpenAI to be
         | as closed as possible. Remember the US government has banned
         | Nvidia from exporting A100/H100 to China/Russia. Those are the
         | same chips OpenAI uses for both training and inference.
        
           | amelius wrote:
           | Anyone in China/Russia who can comment on the actual
           | situation? How difficult is it to train/run AI models where
           | you are living?
        
             | coolspot wrote:
             | Russia is simply importing A100s through shell companies in
             | UAE.
        
       | htrp wrote:
       | TLDR:
       | 
       | Download GPT-J-6B from Eleuther
       | 
       | Download Alpaca Fine Tuning Code + Alpaca Examples
       | 
       | Train for 6 hours or so.
       | 
        | Get a vaguely good instruction-following model (supervised
        | fine-tuning rather than true RLHF)
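        | 
        | A minimal sketch of that recipe with Hugging Face transformers
        | (hyperparameters and the prompt format are assumptions, not
        | the exact Dolly settings):
        | 
        |     from datasets import load_dataset
        |     from transformers import (
        |         AutoModelForCausalLM, AutoTokenizer,
        |         DataCollatorForLanguageModeling, Trainer,
        |         TrainingArguments)
        | 
        |     tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
        |     tok.pad_token = tok.eos_token
        |     model = AutoModelForCausalLM.from_pretrained(
        |         "EleutherAI/gpt-j-6B")
        |     data = load_dataset("tatsu-lab/alpaca")["train"]
        | 
        |     def fmt(ex):  # instruction + response -> one string
        |         return tok(ex["instruction"] + "\n" + ex["output"],
        |                    truncation=True, max_length=512)
        | 
        |     args = TrainingArguments(
        |         "dolly-ft", num_train_epochs=1,
        |         per_device_train_batch_size=1)
        |     Trainer(model=model, args=args,
        |             train_dataset=data.map(
        |                 fmt, remove_columns=data.column_names),
        |             data_collator=DataCollatorForLanguageModeling(
        |                 tok, mlm=False)).train()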
        
         | typon wrote:
          | The key point is "vaguely good". Scale is still important,
          | and that manifests in the difference between the GPT-3.5- and
          | GPT-4-based ChatGPTs. The latter is qualitatively and
          | quantitatively so much better in pretty much every benchmark.
          | There is no way around the bitter lesson.
        
           | bodyfour wrote:
           | > There is no way around the bitter lesson.
           | 
           | Isn't there? I'm certainly not sure, based on the results
           | published over the last weeks and months.
           | 
           | The giant GPT-{3.5,4} models show that if you make the model
           | big enough and throw enough data at it you can produce an AI
           | capable of conversing on basically any topic, in dozens of
           | languages. There are plenty of different takes on how near-
           | human its abilities are on specific tasks, but it's worth
           | stepping back and appreciating how super-human the _breadth_
           | of this knowledge is.
           | 
           | But it's also not clear if a mega-model is anything close to
           | the most efficient way of storing knowledge. After all, you
           | don't need to memorize every fact in Wikipedia if you know
           | how to effectively search it.
           | 
            | And we're currently seeing a daily explosion in these
            | capabilities. Today's flavor is interfacing with Wolfram,
            | but we've also seen web searches, Python coding, etc. That,
            | I think, is the real superpower that comes out of this: you
            | or I can answer a question by "doing a web search",
            | "querying a database", "using Wolfram", or "developing a
            | Python program that finds the answer". However, an AI could
            | do tasks like this just by "thinking" about it. Maybe it
            | would be as natural as we find blinking.
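            | 
            | (Schematically, the loop I mean is something like this,
            | where llm and tools stand in for a model endpoint and a
            | dict of callables:)
            | 
            |     def answer(llm, tools, question, depth=0):
            |         """llm(prompt) -> str; the model either answers
            |         directly or asks for a tool."""
            |         reply = llm("Answer, or reply 'tool:<name> <arg>': "
            |                     + question)
            |         if reply.startswith("tool:") and depth < 3:
            |             name, arg = reply[5:].split(" ", 1)
            |             obs = tools[name](arg)  # e.g. search, wolfram
            |             return answer(
            |                 llm, tools,
            |                 question + "\nObservation: " + str(obs),
            |                 depth + 1)
            |         return reply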
           | 
           | That to me is the real breakthrough in stuff like Alpaca --
           | start with a mega-model and prompt it with something like:
           | "After this paragraph, you are going to be speaking to a AI
           | model similar to yourself but much more primitive. Its task
           | will involve interfacing with English speakers, so converse
           | with it only in that language. It has access to the same
           | {X,Y,Z} APIs you have so any time it has trouble answering a
           | question, prefer to give hints about how it could find the
           | answer using those APIs rather than providing the answer
           | directly yourself. Only give an answer directly if it
           | repeatedly fails to be able to answer it by using an API.
           | I've provided a large set of standardized tests used by
           | humans at this URL -- start by asking it questions intended
           | for a preschool-aged child. Each time it is able to answer
            | new questions at a given level correctly 99% of the time,
            | increase the material's level until it is able to achieve
           | that score on a test designed for a Computer Science PhD
           | candidate"
           | 
            | How large would the "student" model have to be to succeed
            | at this deep but narrower task? I think the answer right
            | now is "we have no idea". However, if the model has the
            | advantage that it can rely on external knowledge and tools
            | from the start (and is rewarded by the "teacher" for doing
            | just that), I bet it'll
           | be a lot smaller than these mega-models. Sure, you wouldn't
           | be able to disconnect the "student-AI" from its APIs and
           | expect it to converse with you in Hungarian about the history
           | of yacht design, but that might not be a capability it needs
           | to have.
           | 
           | My personal hunch is that we're going to find these "AI-
           | taught specialist AI, with API access" models will be a lot
           | smaller than most people are expecting. That's the moment
           | when things REALLY change: instead of pairing a human with a
           | mega-model AI, if specialized models are cheap someone can
           | say "spin up 100K expert-programmer AIs and have them
            | supervised by 5K expert-manager AIs and have them build XYZ"
           | 
           | Or if you need it to work on an existing task you'd
           | specialize further -- you'd go to your AI vendor and say "I'd
           | like to license the weights for your expert-programmer model,
           | but first have it read these 200 books I consider important
           | to my problem domain and then show it every commit ever made
           | by a human to my git repo and every design document I have"
        
             | typon wrote:
             | Very good analysis. I disagree with a fundamental point
             | though: If you don't consider compute cost and just want
             | the best possible AGI, then there's nothing stopping you
             | from supercharging the mega-models with the same
             | capabilities as the smaller models - and if the current
             | scaling shows anything, the mega models will just become
             | even better.
        
               | bodyfour wrote:
               | > If you don't consider compute cost [...]
               | 
                | Yes, but what if you do? Imagine your hyper-
                | specialized, API-heavy model takes 10x fewer resources
                | to answer a question (or at least a question relevant
                | to the task at hand). Won't it be more powerful to have
                | a model that can run 10 times as fast (or run 10
                | instances in parallel)?
               | 
               | What if the ratio turns out to be 100x or 1000x?
               | 
               | So I agree that the cutting edge of "best possible AGI"
               | might mean building the largest models we can train on
               | massive clusters of computers and then run on high-end
               | hardware. My hunch, though, is that models that can be
               | run on cheap hardware and then "swarmed" on a problem
               | space will be even more powerful in what they can perform
               | in aggregate.
               | 
               | Again, it's just my hunch but right now I think
               | everybody's predictions are hunches.
               | 
                | I'll actually go one bit further: it could be that
                | cheaper-per-token models do better even on linear
                | problem-solving tasks that can't be "swarmed" in the
                | same way. Existing models already have the
               | ability to use randomness to give more "creative", if
               | less reliable, answers. This is inherently parallelizable
               | though -- in fact Bard seems to be exposing this in its
               | UI in the form of multiple "drafts". So what if you just
               | ran 100 copies of your cheap-AI against a problem and
               | then had one cheap-AI (or maybe a medium-AI) judge the
               | results?
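                | 
                | (A sketch of that last idea, with generate and judge
                | as stand-ins for a cheap sampler and a scoring call:)
                | 
                |     def best_of_n(generate, judge, prompt, n=100):
                |         # sample n cheap, high-temperature drafts...
                |         drafts = [generate(prompt) for _ in range(n)]
                |         # ...and let one judge call pick the winner
                |         return max(drafts,
                |                    key=lambda d: judge(prompt, d))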
               | 
                | Or at the risk of getting too anthropomorphic about it:
                | imagine you as a human are writing a program and you
                | get stuck on a tricky bit -- you know that the problem
                | should be solvable, but you've never done anything
                | similar and don't know what algorithm to start with.
                | Suppose then you
               | could tell your brain "Temporarily fork off 100 copies of
               | yourself. 10 of them go do a literature review of every
               | CS paper you can find related to this topic. 10 of you
               | search for open source programs that might have a similar
               | need and try to determine how their code does it. The
               | other 80 of you just stare off into the middle distance
               | and try to think of a creative solution. In two human-
               | seconds write a summary of your best idea and exit. I'll
               | then read them all and see if I/we are closer to
               | understanding what to do next"
               | 
                | For us, this type of mental process is so alien we
                | can't even imagine what it would feel like to be able
                | to do it. It might come completely naturally to an AI,
                | though.
        
               | not2b wrote:
               | Sometimes you do need to consider compute cost, say if
               | you want a small but high quality model that can run on a
               | smart phone to perform a task. For example, with camera
               | input, identify a plant or animal, while in a remote area
               | with no cell signal, so it has to yield an answer without
               | communicating with a server. What's the smallest, most
               | efficient model that can do that effectively? Build that.
        
             | avereveard wrote:
             | > you don't need to memorize every fact in Wikipedia if you
             | know how to effectively search it.
             | 
              | Yeah, you're onto something. Models good enough to
              | sustain a conversation where I bring my own data as a
              | primer are probably more useful than models that have a
              | frozen knowledge of everything. The killer feature of
              | GPT-4 is the 32k token context size, which allows an
              | unprecedented amount of input to be fed into the
              | knowledge graph and queried.
        
           | feanaro wrote:
            | Isn't it the case that we literally have no clue how GPT-4
            | and GPT-3.5 are different in terms of training, given that
            | OpenAI doesn't want to disclose anything at all?
        
             | typon wrote:
              | It's not true that we know nothing. We know a little bit
              | from using the two models via their API. Given the time
              | per inference and the limit on messages per day for
              | GPT-4, I'm willing to bet it's doing around 10x more
              | compute than GPT-3.5. Whether that's because it has 10x
              | more weights, I don't know. But it wouldn't be a terrible
              | guess.
        
               | feanaro wrote:
                | So your estimate is that GPT-4 has 1.75 trillion
                | weights?
        
               | dwaltrip wrote:
               | Is there anything that affects inference compute time
               | besides the number of parameters? Assuming same hardware,
               | etc.
        
               | typon wrote:
               | Yes - for example adding memory to the attention
               | mechanism (similar to RETRO or Memorizing Transformers
               | paper)
        
             | computerex wrote:
              | We don't have the details, it is true. But empirically,
              | and based on their report, GPT-4 is notably better than
              | ChatGPT.
        
               | feanaro wrote:
               | Better, yes, and for that we have evidence. But is the
               | improvement stemming simply from even more data? That's
               | what I'm questioning.
        
               | computerex wrote:
               | This paper is pretty approachable and goes over the
               | "scaling laws" in detail:
               | https://arxiv.org/abs/2206.07682
               | 
               | In short, yes. More data, higher quality data, more
               | epochs on the data. That is the name of the game.
        
               | stevenhuang wrote:
                | It's speculated it has the same number of parameters,
                | but uses more compute and is multimodal.
        
           | UncleEntity wrote:
           | Free is better than $$/token imho.
           | 
           | If you have a use case or a bunch of disposable income then
           | go with the "bitter" one.
        
       ___________________________________________________________________
       (page generated 2023-03-24 23:01 UTC)