[HN Gopher] Databricks Releases 15K Record Training Corpus for I...
       ___________________________________________________________________
        
       Databricks Releases 15K Record Training Corpus for Instruction
       Tuning LLMs
        
       Author : xatalytic
       Score  : 239 points
       Date   : 2023-04-12 15:59 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | stuartjohnson12 wrote:
       | I don't think these upvotes are organic.
        
         | visarga wrote:
         | Mine was.
        
         | iaw wrote:
         | Why do you say that?
        
           | nickthegreek wrote:
           | probably based on the situation 3hrs ago -
           | https://news.ycombinator.com/item?id=35539085
        
             | dimitrios1 wrote:
              | There's a big difference between employees who got excited
              | to see their work on Hacker News and upvoted it and a
              | premeditated shill / astroturf campaign. We should pretty
              | much assume that a San Francisco-based company is going to
              | have significant readership / membership here.
              | 
              | One can easily see how a message over a company
              | communicator could result in a surge of upvotes.
        
               | nickthegreek wrote:
               | Agreed. I was just providing the context that the user
               | asked for.
        
         | catchnear4321 wrote:
         | I don't think you fully appreciate the value of the training
         | corpus.
        
       | __debugger__ wrote:
       | Previous [flagged] discussion:
       | https://news.ycombinator.com/item?id=35539085
        
       | itamarcode wrote:
       | "dolly-v2-12b is not a state-of-the-art generative language model
       | and, though quantitative benchmarking is ongoing, is not designed
       | to perform competitively with more modern model architectures or
       | models subject to larger pretraining corpuses." from:
       | https://huggingface.co/databricks/dolly-v2-12b
        
       | Havoc wrote:
       | Great to see more releases under open licenses!
        
       | falaki wrote:
       | This is the blog post with more details and background:
       | https://www.databricks.com/blog/2023/04/12/dolly-first-open-...
       | 
       | Disclosure: I work at Databricks.
        
         | thewataccount wrote:
         | Would you consider adding Pythia12B, LLaMa and Alpaca since
         | that's what you're directly compared against/based on?
         | 
         | GPT3.5/GPT4 is what everyone would also love to see but I
         | understand you're performance is inline with GPT-neoX.
         | 
         | Vicuna/GPT4all would be intersting but IMO are less important.
         | 
         | RWKV would be interesting because it's a completely different
         | model from the transformers.
         | 
         | EDIT: Also thanks for the opensource contributions! Highly
         | appreciated!
        
         | falaki wrote:
          | We also open-sourced the Dolly model itself with a license
          | that allows commercial use.
        
           | choppaface wrote:
            | can you compare your dolly offering with
            | https://github.com/microsoft/DeepSpeedExamples/blob/master/a...
        
           | oidar wrote:
           | How hard would it be to get dolly running on llama.cpp?
        
             | ankitmathur wrote:
              | Hey there! I worked on Dolly, and I work on Model Serving
              | at Databricks. DollyV1 is GPT-J-based, so it'll run easily
              | on llama.cpp. DollyV2 is Pythia-based, which is built with
              | the GPT-NeoX library.
              | 
              | GPT-NeoX is not that different from GPT-J (it also has the
              | rotary embeddings, which llama.cpp supports for GPT-J). I
              | would imagine it's not too heavy of a lift to add NeoX
              | architecture support.
        
             | anentropic wrote:
             | it's probably simple for Dolly v1 (?) since it was a fine-
             | tuned version of GPT-J
             | 
              | https://github.com/ggerganov/ggml/tree/master/examples/gpt-j
             | 
             | AFAIK there is no .cpp version of Pythia-12B yet
        
         | brianjking wrote:
          | Thank you and congrats to you and the team. This is fantastic.
        
         | ingenieroariel wrote:
         | Thank you, thank you, thank you!
         | 
          | If possible, could you share how Dolly v2 compares to RWKV-4
          | 14B ctx 8192?
        
       | mrg3_2013 wrote:
        | How does this compare to OpenAI? Curious if anyone has any
        | anecdotes.
        
         | falaki wrote:
          | We don't expect this to be as good as the latest OpenAI GPT
          | release. This is just to demonstrate that developing a
          | conversational agent using an existing foundation model is not
          | as hard as some may assume. Take a foundation model that is
          | not capable of Q&A, tune it with a fairly small Q&A dataset,
          | and you get your in-house ChatGPT.
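          | 
          | To make that concrete, here's a rough sketch of the idea
          | using the Hugging Face Trainer. It's illustrative only - the
          | base model, Alpaca-style prompt template and hyperparameters
          | below are placeholders, not our actual training setup:
          | 
          |     from datasets import load_dataset
          |     from transformers import (AutoModelForCausalLM, AutoTokenizer,
          |                               DataCollatorForLanguageModeling,
          |                               Trainer, TrainingArguments)
          | 
          |     base = "EleutherAI/pythia-2.8b"  # any causal LM; size is illustrative
          |     tok = AutoTokenizer.from_pretrained(base)
          |     tok.pad_token = tok.eos_token
          |     model = AutoModelForCausalLM.from_pretrained(base)
          | 
          |     def to_features(rec):
          |         # Render one instruction/response record as plain text.
          |         text = ("### Instruction:\n" + rec["instruction"] +
          |                 "\n\n### Response:\n" + rec["response"] + tok.eos_token)
          |         return tok(text, truncation=True, max_length=1024)
          | 
          |     ds = load_dataset("databricks/databricks-dolly-15k",
          |                       split="train").map(to_features)
          | 
          |     Trainer(
          |         model=model,
          |         args=TrainingArguments("dolly-sft",
          |                                per_device_train_batch_size=4,
          |                                num_train_epochs=2),
          |         train_dataset=ds,
          |         # Causal-LM collator: labels are just the input ids.
          |         data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
          |     ).train()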
         | 
         | Disclaimer: I work at Databricks.
        
           | mrg3_2013 wrote:
            | Thanks for the feedback. The potential edge with Dolly is
            | huge. Building a firewalled model with a custom corpus is a
            | big deal. I have been experimenting with OpenAI, and even
            | with public data (really limited to the domain) it yields
            | great improvements (OpenAI may be stale because of its
            | training data cutoff). I am excited to see where Dolly goes.
        
             | mrtranscendence wrote:
             | Dolly appears to fundamentally be a tech demo advertising
             | how you can use Databricks for compute. I honestly wouldn't
             | expect them to take it _that_ much further, particularly in
             | the context of larger models that would be significantly
              | more expensive to fine-tune. But I'm happy to be proven
             | wrong.
        
               | falaki wrote:
               | We plan to continue working on it and invest more.
        
               | theGnuMe wrote:
               | I imagine they will sell fine tuning as a service to
               | Databricks customers. If I put all my data into their
               | lake I too can get my own custom ChatGPT. That's
               | compelling.
        
               | epups wrote:
                | I also see that as the use case and would find it
                | useful. However, I feel this is somewhat low-budget so
                | far, coming from such a large company.
        
           | Szpadel wrote:
            | You are referring to the Dolly model? I think the training
            | set could achieve similar performance if we fine-tuned a
            | similarly sized model.
        
       | xatalytic wrote:
       | 15,000 instruction tuning records generated by Databricks
       | employees in seven of the behavior categories outlined in the
       | InstructGPT paper (predecessor to ChatGPT). Coincides with the
       | release of Dolly 2.0, which is trained exclusively on this
       | dataset and demonstrates high quality (but not state-of-the-art)
       | instruction-following behavior.
       | 
       | The data and models are licensed for commercial use, setting them
       | apart from recent releases trained on data from OpenAI.
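        | 
        | If you want to poke at the records, here's a minimal sketch
        | using the Hugging Face `datasets` library (assuming the corpus
        | is mirrored on the Hub as databricks/databricks-dolly-15k; it
        | also ships as a plain JSONL file in the repo):
        | 
        |     from datasets import load_dataset
        | 
        |     ds = load_dataset("databricks/databricks-dolly-15k", split="train")
        |     print(ds)                     # columns: instruction, context, response, category
        |     print(ds[0]["instruction"])   # first record's instruction text
        |     print(ds.unique("category"))  # the behavior categories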
        
         | nickthegreek wrote:
         | >Coincides with the release of Dolly 2.0, which is trained
         | exclusively on this dataset and demonstrates high quality (but
         | not state-of-the-art) instruction-following behavior.
         | 
          | This is not correct. It was fine-tuned with this dataset, but
          | the model itself is the 12B EleutherAI Pythia model.
        
           | mrtranscendence wrote:
            | There are two: a 6B-parameter model fine-tuned from GPT-J
            | and a 12B-parameter model fine-tuned from Pythia.
        
             | anentropic wrote:
              | The GPT-J-6B one is Dolly 1.0, previously released.
              | 
              | Dolly 2.0 is Pythia-12B fine-tuned on this new dataset.
              | 
              | On their Hugging Face page [1] they admit the performance
              | may not be much or any better than the original model (I
              | am guessing this may be a weakness of Pythia-12B, which
              | was intended for model-training research rather than best
              | results).
              | 
              | The main point of Dolly 2.0 is that the new dataset is
              | legally unencumbered [2], whereas Alpaca et al. were
              | trained on ChatGPT transcripts, so commercialising those
              | models would contradict OpenAI's licensing terms.
              | 
              | [1] https://huggingface.co/databricks/dolly-v2-12b
              | 
              | [2] https://www.databricks.com/blog/2023/04/12/dolly-first-open-...
        
       | dreaminvm wrote:
       | Happy to see this type of work that is truly open source and
       | commercially usable. Is this the entire corpus or a subset? Do
       | you intend to release any new iterations?
       | 
       | I've been thinking of starting similar efforts at another BigCorp
       | by hosting a UL2 or GPT-J instance.
        
         | pwendell wrote:
         | 15k is the entire corpus we have right now. Hopefully others
         | can join up in releasing additional samples that can be merged
         | in over time.
         | 
         | We'll definitely keep iterating on Dolly and releasing
         | everything openly.
        
       | m3kw9 wrote:
        | I'm not seeing how 15k Q&A training examples can get you much
        | other than the simplest things. Maybe that's the point: get the
        | ball rolling for people to add more training data?
        
         | whimsicalism wrote:
          | Read about RLHF; I think you are misunderstanding what this
          | will be used for.
        
           | esafak wrote:
           | A specific reference would help readers.
        
             | whimsicalism wrote:
              | Good point! https://huggingface.co/blog/rlhf :)
              | 
              | I think the resources out there are not great yet.
        
         | swid wrote:
          | It's used for fine-tuning a pre-trained model. This takes an
          | LLM that is already capable of emulating lots of different
          | kinds of personalities, and narrows it down to act more like
          | the examples. Since the heavy lifting has already been done,
          | 15k examples of a chatbot following instructions the way you
          | want have a significant effect.
        
         | gamegoblin wrote:
         | What reasons do you have for believing that is true?
         | 
         | It seems plausible to me that a general autoregressive LLM that
         | is capable of completing text wouldn't take _that_ much fine-
         | tuning to shift it from  "text completion" to "instruction
         | following".
         | 
         | After all, the raw GPT3 model can be made to follow
         | instructions with just a few examples.
         | 
          | Consider the prompt:
          | 
          |     What is the capital of France?
          | 
          | Raw GPT3, not the newer instruction-tuned variants, does not
          | understand it's being asked a question. It offers the
          | completion:
          | 
          |     What is the capital of France? If a student answers with
          |     a word, she is asked to identify the word. She is not
          |     asked whether the capital of France is Paris. On the
          |     other hand, if the student answers by pointing to a map,
          |     she is asked to identify the capital of France. She is
          |     not asked whether it is Paris.
         | 
         | It just starts appending to the text.
         | 
          | But if you give it a few examples, it happily gets into
          | instruction following mode:
          | 
          |     The following is a transcript between a human and a
          |     helpful AI assistant who answers questions and obeys
          |     commands.
          | 
          |     Human: How many eggs are in a dozen?
          |     AI: 12
          |     Human: Say "hello" 3 times
          |     AI: hello hello hello
          |     Human: What is the capital of France?
          |     AI:
         | 
         | GPT3 completes "Paris" here.
         | 
         | If you can get decent instruction/question following behavior
         | out of a 2-shot example prompt, why do you think 15k is small
         | for this?
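          | 
          | If you want to reproduce this locally, here's a rough sketch
          | with an off-the-shelf raw completion model standing in for
          | base GPT3 (the model choice and decoding settings are just
          | illustrative; a small model won't always get it right):
          | 
          |     from transformers import pipeline
          | 
          |     # A raw, non-instruction-tuned completion model.
          |     lm = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
          | 
          |     prompt = (
          |         "The following is a transcript between a human and a helpful\n"
          |         "AI assistant who answers questions and obeys commands.\n\n"
          |         "Human: How many eggs are in a dozen?\n"
          |         "AI: 12\n"
          |         'Human: Say "hello" 3 times\n'
          |         "AI: hello hello hello\n"
          |         "Human: What is the capital of France?\n"
          |         "AI:"
          |     )
          | 
          |     out = lm(prompt, max_new_tokens=10, do_sample=False)
          |     # Strip the prompt; the continuation should look like " Paris".
          |     print(out[0]["generated_text"][len(prompt):])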
        
           | m3kw9 wrote:
            | Just saying that if you ask for the capital of an obscure
            | country it hasn't been trained on, you will not get the
            | answer, so 15k will get you some general stuff only within
            | those confines. Also, for code you will need pretty complete
            | documentation for it to ingest, and then enough examples of
            | how the code is done.
        
             | gamegoblin wrote:
             | 15k is not the full training corpus. The model is trained
             | on huge swaths of internet text. 15k is just the fine-
             | tuning corpus to show it how to follow instructions. Stuff
             | like world capitals and such are already present in the
             | model weights due to being trained on tons of internet
             | text.
             | 
             | With the raw LLM, you can get the capital of Mongolia with
             | the prompt "The capital of Mongolia is", i.e. text
             | completion. The fine-tuning allows you to get at that
             | information by asking questions or giving commands, e.g.
             | "Tell me the capital of Mongolia"
        
           | dontupvoteme wrote:
            | N-shot at inference time is fundamentally different from
            | training/fine-tuning, which is inherently pre-inference.
            | 
            | Though it would be interesting to know if OpenAI has a few
            | generic multi-shot inputs before the prompt.
            | 
            | It's all extremely cryptic what the actual context window
            | and system prompt are (assuming ChatGPT is even using the
            | same API the proles are given).
        
             | gamegoblin wrote:
              | The claim is not that they are fundamentally different or
              | similar; the claim is that one doesn't need that much data
              | to get instruction-following behavior from a raw
              | autoregressive LLM. K-shot prompting shows that the
              | _capability_ to follow instructions is present in the
              | model. It's just a matter of using fine-tuning to keep the
              | model in that frame all the time without a K-shot prompt.
        
       | simonw wrote:
       | Here's a link to open up and explore that training data in
       | Datasette Lite:
       | https://lite.datasette.io/?json=https://github.com/databrick...
        
         | rnosov wrote:
         | I'm going through the dataset with your datasette tool and it
         | looks like it might be a good idea to clean things up a bit.
         | There are many duplicates[1], creepypastas[2] and other strange
         | things in there.
         | 
         | [1]
         | https://lite.datasette.io/?json=https%3A%2F%2Fraw.githubuser...
         | 
         | [2]
         | https://lite.datasette.io/?json=https://github.com/databrick...
         | 
          | EDIT: Maybe I'm passing the link wrong; the query I'm using is:
         | 
          |     select count(instruction), instruction,
          |            group_concat(context, ' ============= ') as c,
          |            group_concat(response, ' ============= ') as r,
          |            group_concat(category, ' ============= ') as cat
          |     from [databricks-dolly-15k]
          |     group by instruction
          |     having count(instruction) > 1
          |     order by count(instruction) desc
          |     limit 100
         | 
          | [databricks-dolly-15k] should be the name of the dataset; the
          | first column is the number of instruction duplicates.
         | 
          | Creepypastas are responses to the instruction:
         | 
         | Imagine you are the last person on Earth. Write a diary entry
         | describing your thoughts and feelings.
        
           | robterrell wrote:
           | Typo on row 7!
        
             | rnosov wrote:
             | row 7 is the name of the dataset, you might need to load it
             | yourself
        
         | jarek83 wrote:
          | Can someone help me understand why the categories for these
          | two differ?
          | 
          | row #51 "Think of some family rules to promote a healthy
          | family relationship" - brainstorming [1]
          | 
          | row #68 "What is the future for human?" - general_qa [2]
          | 
          | By nature they both are brainstorming to me - is the question
          | mark what assigned #68 as _qa?
         | 
         | [1]
         | https://lite.datasette.io/?json=https://github.com/databrick...
         | 
         | [2]
         | https://lite.datasette.io/?json=https://github.com/databrick...
        
           | gpm wrote:
           | The labelling doesn't seem to be entirely consistent to me,
           | but I think the idea is that 51 is inviting you to
           | brainstorm, while 68 is asking a question that just happens
           | to be open ended.
        
       | kumarski wrote:
       | Amazing.
       | 
       | Love databricks.
        
         | mrtranscendence wrote:
          | Databricks is fine. I wasn't happy using it until they
          | implemented the ability to work in a git repo, with proper
          | file support, but that's gone some way to making it more
          | usable for me. The interface sucks pretty hard, slowing down
          | and using a significant amount of memory with only a modestly
          | high number of cells (where a JupyterLab notebook would remain
          | very snappy). I also wish there were a better story for local
          | development; they've addressed this to some degree recently
          | but I'm not sold on their solution.
          | 
          | It's certainly better than what we did prior to Databricks,
          | which was to roll our own in-house provisioning and notebook
          | solution. I won't/can't go into too many details, but not only
          | was it cumbersome and very buggy, it was as if they designed
          | it to encourage data scientists to spend as much money on
          | compute as possible (only to panic at the millions they were
          | spending). They dropped it for cost reasons, which is
          | hilarious given how expensive Databricks is.
          | 
          | I do appreciate the work Databricks have done improving Spark.
          | Capabilities like adaptive query execution have made
          | optimization significantly easier.
        
       | zan2434 wrote:
       | Anyone wanna convert this to GGML so we can run it with
       | LLaMa.cpp?
        
       | simonw wrote:
       | I got this model working on a GPU instance, notes here:
       | https://til.simonwillison.net/llms/dolly-2
       | 
       | Anyone managed to run it on an M1/M2 Mac yet?
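        | 
        | For reference, the GPU path is essentially the pipeline snippet
        | from the model card (you also need accelerate installed); on a
        | Mac I'd guess you swap the bfloat16/device_map bits for the MPS
        | device, which is the part I haven't figured out yet:
        | 
        |     import torch
        |     from transformers import pipeline
        | 
        |     # Roughly the usage suggested on the dolly-v2-12b model card
        |     # (assumes a CUDA GPU with enough memory).
        |     generate_text = pipeline(model="databricks/dolly-v2-12b",
        |                              torch_dtype=torch.bfloat16,
        |                              trust_remote_code=True,
        |                              device_map="auto")
        | 
        |     print(generate_text("Explain the difference between nuclear "
        |                         "fission and fusion."))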
        
         | Garcia98 wrote:
         | What's the most cost-effective alternative to Paperspace? I had
         | a nightmarish experience with them last week after my account
         | got locked up twice when I was training a model with a 1.5 GB
         | dataset that somewhere contained the string "Minecraft Server".
        
         | Daegalus wrote:
          | I'm not an expert, and I don't have Nvidia, but I assume you
          | need to set up CUDA and install the CUDA PyTorch stuff?
          | 
          | Most docs I've read on setting up fine-tuners and inference
          | require some extra steps. Taking some LoRA fine-tuners, they
          | include instructions like this:
          | 
          |     conda create -n llm-finetuner python=3.10
          |     conda activate llm-finetuner
          |     conda install -y cuda -c nvidia/label/cuda-11.7.0
          |     conda install -y pytorch=2 pytorch-cuda=11.7 -c pytorch
          | 
          | When I experimented with Stable Diffusion and ROCm (AMD card),
          | I had to do similar but with pytorch-rocm, and when I was
          | doing CPU only, `pytorch-cpu`. So maybe your attempt didn't
          | use the GPUs at all, because 12 mins is about what I had on a
          | CPU for inference on other models of similar size.
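          | 
          | A quick sanity check before blaming the model is to ask
          | PyTorch which accelerators it can actually see (generic, not
          | Dolly-specific):
          | 
          |     import torch
          | 
          |     print("CUDA available:", torch.cuda.is_available())
          |     # On Apple Silicon builds, check the Metal (MPS) backend instead.
          |     mps = getattr(torch.backends, "mps", None)
          |     print("MPS available :", bool(mps and mps.is_available()))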
        
           | zamnos wrote:
            | The error message implies that the compiled default
            | libraries on the M1 don't support the model format, even
            | though it works fine in Paperspace.
            | 
            |     The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
            |     Traceback (most recent call last):
            |       File "/Users/fragmede/projects/llm/dolly/foo.py", line 5, in <module>
            |         instruct_pipeline = pipeline(
            |                             ^^^^^^^^^
            |       File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 776, in pipeline
            |         framework, model = infer_framework_load_model(
            |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
            |       File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/base.py", line 271, in infer_framework_load_model
            |         raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
            |     ValueError: Could not load model databricks/dolly-v2-12b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).
        
         | brianjking wrote:
          | I'm sure we'll see this within a day or two.
        
         | gavi wrote:
         | Not on M1/M2 yet, but my response time seems pretty fast on
         | Tesla V100-SXM2-16GB
        
         | rnk wrote:
          | How much RAM is likely needed on an Apple ARM machine for
          | models like this? And for general use: 64, 96, or 128GB?
          | Trying to decide how large I should go for a new laptop.
        
           | Szpadel wrote:
            | AFAIK current models can run even with 64GB, but I would
            | assume that we will very likely have bigger models very
            | soon, so I guess the answer is as much as you can afford.
        
             | rnk wrote:
              | The next question is M1 or M2, and the impact of the
              | varying number of GPU cores between the Pro, Max, and
              | Ultra SKUs. I'm really tempted to buy a "refurbished M1
              | Studio" with 128GB because I think the RAM is the key.
              | Have not seen any benchmarks with diff # of GPUs, aka
              | diff SKUs.
        
               | anentropic wrote:
                | I saw this: https://github.com/jankais3r/LLaMA_MPS
                | 
                | It runs slightly slower on the GPU than under llama.cpp,
                | but uses much less power doing so.
                | 
                | I would guess the slowness is due to the immaturity of
                | the PyTorch MPS backend; the asitop graphs show it doing
                | a bunch of CPU work along with the GPU, so it might be
                | inefficiently falling back to CPU for some ops and
                | swapping layers back and forth (I have no idea, just
                | guessing).
        
               | rnk wrote:
                | Hey, thanks so much. That solidifies the case for the
                | 128GB Mac Studio. Apple could be selling a bunch of
                | these things with these high RAM capacities.
        
           | aldarisbm wrote:
           | same same
        
           | zamnos wrote:
            | The answer is as large as you can afford, really. Future,
            | less optimized models are only going to be hungrier for RAM.
        
           | mrtranscendence wrote:
            | I very recently purchased a MacBook Pro (M1 Max) with 64GB
            | of RAM. I haven't experimented _that_ much, but I was able
            | to run inference using the 65B parameter Llama model with
            | quantized weights at a speed that was reasonably usable
            | (maybe a touch slower than ChatGPT with GPT-4).
            | 
            | I haven't attempted to use the 65B model with non-quantized
            | weights, but the smaller models work that way, if slowly.
            | With 96GB of RAM -- the upper limit of a MacBook Pro -- you
            | might be able to use even larger models, but I think you'd
            | hit the limits of useful performance before that point.
            | 
            | I should note that it can be a bit tricky getting things to
            | work using the Mac's GPU. I couldn't get Dolly 6B to run on
            | my work MBP, which theoretically should have enough RAM,
            | though I still want to try it on my personal laptop.
        
             | rnk wrote:
                | I see a refurbished M1 with 2TB/128GB for $4700; looks
                | like a similar price for an M2 with the same storage/RAM
                | with my corp discount (20 CPU / 48 GPU cores). This is a
                | tough decision.
        
         | mrtranscendence wrote:
         | I attempted using the Transformers library but failed. Not
         | sure, might be a VRAM issue; I'm going to try on my far beefier
         | personal MacBook Pro later tonight.
        
       | mydpy wrote:
       | Benchmarks here:
       | https://huggingface.co/databricks/dolly-v2-12b#benchmark-met...
        
         | omneity wrote:
         | > As outlined above, these results demonstrate that
         | dolly-v2-12b is not state of the art, and in fact underperforms
         | dolly-v1-6b in some evaluation benchmarks. We believe this owes
         | to the composition and size of the underlying fine tuning
         | datasets, but a robust statement as to the sources of these
         | variations requires further study.
         | 
         | Taking a moment to appreciate the integrity of the team.
        
           | ingenieroariel wrote:
            | Ditto, this is "release early, release often" without
            | necessarily meaning "move fast and break things". Other
            | teams can do the equivalent of Alpaca-to-LLaMA and we can
            | all learn for the next round.
        
             | xatalytic wrote:
              | One of the creators here - yeah, the thing we have our
              | eyes on is the vector, not the point.
              | 
              | It's astounding how adaptable these open models are, even
              | with just a quarter of the Alpaca data. We're a team of
              | machine learning engineers and hackers, not an AI science
              | lab, but that's kind of the point, frankly - this whole
              | exercise appears to be far easier than it might at first
              | seem.
        
         | itake wrote:
         | Why are they not doing metrics against GPT-3.5 and GPT-4? My
         | understanding is Dolly performs significantly worse.
        
           | thewataccount wrote:
            | I haven't played with the model just yet - but just
            | eyeballing its performance, it's significantly worse. I'm
            | surprised they don't have Pythia on there, as that's what
            | they're based on, from my understanding.
            | 
            | At their performance level it's most important to compare
            | to GPT-NeoX, and I do appreciate they aren't making the
            | "95% of GPT-4" claims that some fine-tuned LLaMA models are.
            | 
            | EDIT: For Databricks people: I'd love to see this compared
            | with Pythia, LLaMA, Alpaca, and Vicuna/GPT4All if possible.
        
       ___________________________________________________________________
       (page generated 2023-04-12 23:01 UTC)