[HN Gopher] Databricks Releases 15K Record Training Corpus for I...
___________________________________________________________________

Databricks Releases 15K Record Training Corpus for Instruction Tuning LLMs

Author : xatalytic
Score : 239 points
Date : 2023-04-12 15:59 UTC (7 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| stuartjohnson12 wrote:
| I don't think these upvotes are organic.
| visarga wrote:
| Mine was.
| iaw wrote:
| Why do you say that?
| nickthegreek wrote:
| probably based on the situation 3hrs ago -
| https://news.ycombinator.com/item?id=35539085
| dimitrios1 wrote:
| There's a big difference between employees who got excited to see their work on Hacker News and upvoted it and a premeditated shill / astroturf campaign. We should pretty much assume that a San Francisco-based company is going to have significant readership / membership here.
|
| One can easily see how a message over a company communicator could result in a surge of upvotes.
| nickthegreek wrote:
| Agreed. I was just providing the context that the user asked for.
| catchnear4321 wrote:
| I don't think you fully appreciate the value of the training corpus.
| __debugger__ wrote:
| Previous [flagged] discussion:
| https://news.ycombinator.com/item?id=35539085
| itamarcode wrote:
| "dolly-v2-12b is not a state-of-the-art generative language model and, though quantitative benchmarking is ongoing, is not designed to perform competitively with more modern model architectures or models subject to larger pretraining corpuses."
| from: https://huggingface.co/databricks/dolly-v2-12b
| Havoc wrote:
| Great to see more releases under open licenses!
| falaki wrote:
| This is the blog post with more details and background:
| https://www.databricks.com/blog/2023/04/12/dolly-first-open-...
|
| Disclosure: I work at Databricks.
| thewataccount wrote:
| Would you consider adding Pythia12B, LLaMa and Alpaca, since that's what you're directly compared against/based on?
|
| GPT3.5/GPT4 is what everyone would also love to see, but I understand your performance is in line with GPT-neoX.
|
| Vicuna/GPT4all would be interesting but IMO are less important.
|
| RWKV would be interesting because it's a completely different model from the transformers.
|
| EDIT: Also thanks for the open-source contributions! Highly appreciated!
| falaki wrote:
| We also open-sourced the Dolly model itself with a license that allows commercial use.
| choppaface wrote:
| can you compare your dolly offering with https://github.com/microsoft/DeepSpeedExamples/blob/master/a...
| oidar wrote:
| How hard would it be to get dolly running on llama.cpp?
| ankitmathur wrote:
| Hey there! I worked on Dolly, and I work on Model Serving at Databricks. DollyV1 is GPT-J-based, so it'll run easily on llama.cpp. DollyV2 is Pythia-based, which is built with the GPT-NeoX library.
|
| GPT-NeoX is not that different from GPT-J (it also has the rotary embeddings, which llama.cpp supports for GPT-J). I would imagine it's not too heavy of a lift to add NeoX architecture support.
| anentropic wrote:
| it's probably simple for Dolly v1 (?) since it was a fine-tuned version of GPT-J
|
| https://github.com/ggerganov/ggml/tree/master/examples/gpt-j
|
| AFAIK there is no .cpp version of Pythia-12B yet
| brianjking wrote:
| Thank you and congrats to you and the team. This is fantastic.
| ingenieroariel wrote:
| Thank you, thank you, thank you!
|
| If possible, could you share how Dolly v2 compares to RWKV-4 14B ctx 8019?
| mrg3_2013 wrote:
| How does this compare to openai? Curious if anyone has any anecdotes.
| falaki wrote:
| We don't expect this to be as good as the latest OpenAI GPT release. This is just to demonstrate that developing a conversation agent using an existing foundation model is not as hard as some may assume. Take a foundation model that is not capable of Q&A and tune it with a fairly small Q&A dataset and you get your in-house ChatGPT.
|
| Disclaimer: I work at Databricks.
| mrg3_2013 wrote:
| Thanks for the feedback. The potential edge with Dolly is huge. Building a firewalled model with a custom corpus is a big deal. I have been experimenting with openai, and even with public data (but really limiting to the domain), it yields great improvements (openai may be stale because of its training-data cutoff). I am excited to see where Dolly goes.
| mrtranscendence wrote:
| Dolly appears to fundamentally be a tech demo advertising how you can use Databricks for compute. I honestly wouldn't expect them to take it _that_ much further, particularly in the context of larger models that would be significantly more expensive to fine-tune. But I'm happy to be proven wrong.
| falaki wrote:
| We plan to continue working on it and invest more.
| theGnuMe wrote:
| I imagine they will sell fine-tuning as a service to Databricks customers. If I put all my data into their lake, I too can get my own custom ChatGPT. That's compelling.
| epups wrote:
| I also see that as the use case and would find it useful. However, I feel this is somewhat low-budget so far coming from such a large company.
| Szpadel wrote:
| You are referring to the Dolly model? I think the training set could achieve similar performance if we fine-tuned a similarly sized model.
| xatalytic wrote:
| 15,000 instruction-tuning records generated by Databricks employees in seven of the behavior categories outlined in the InstructGPT paper (predecessor to ChatGPT). Coincides with the release of Dolly 2.0, which is trained exclusively on this dataset and demonstrates high quality (but not state-of-the-art) instruction-following behavior.
|
| The data and models are licensed for commercial use, setting them apart from recent releases trained on data from OpenAI.
| nickthegreek wrote:
| > Coincides with the release of Dolly 2.0, which is trained exclusively on this dataset and demonstrates high quality (but not state-of-the-art) instruction-following behavior.
|
| This is not correct. It was fine-tuned with this data set, but the model itself is the 12B EleutherAI Pythia model.
| mrtranscendence wrote:
| There are two: a 6B-parameter model fine-tuned from GPT-J and a 12B-parameter model fine-tuned from Pythia.
| anentropic wrote:
| the GPT-J-6B one is Dolly 1.0, previously released
|
| Dolly 2.0 is Pythia-12B fine-tuned on this new dataset
|
| on their Hugging Face page [1] they admit the performance may not be much or any better than the original model (I am guessing this may be a weakness of Pythia-12B, which was intended for model-training research rather than best results)
|
| the main point of Dolly 2.0 is that the new dataset is legally unencumbered [2], whereas Alpaca et al were trained on ChatGPT transcripts, so commercialising those models would contradict OpenAI licensing terms
|
| [1] https://huggingface.co/databricks/dolly-v2-12b
|
| [2] https://www.databricks.com/blog/2023/04/12/dolly-first-open-...
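For readers who want to experiment with the corpus directly, a minimal sketch of turning the 15k records into instruction-tuning prompts could look like the following. It assumes the data is published on the Hugging Face Hub as "databricks/databricks-dolly-15k" (it also ships as a JSONL file in the GitHub repo) and uses the field names visible in the dataset and discussed later in this thread: instruction, context, response and category. The Alpaca-style template is illustrative only, not necessarily the exact template Dolly 2.0 was trained with.

    # Sketch: build instruction-tuning prompts from the dolly-15k records.
    # Assumptions: the corpus is on the Hugging Face Hub as
    # "databricks/databricks-dolly-15k"; the template below is Alpaca-style
    # and illustrative, not Dolly's exact training format.
    from datasets import load_dataset

    ds = load_dataset("databricks/databricks-dolly-15k", split="train")

    TEMPLATE = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n"
        "{context_block}"
        "### Response:\n{response}"
    )

    def to_prompt(record):
        # Only some categories (e.g. closed QA, summarization) carry a context field.
        context_block = (
            f"### Context:\n{record['context']}\n\n" if record["context"] else ""
        )
        return {
            "text": TEMPLATE.format(
                instruction=record["instruction"],
                context_block=context_block,
                response=record["response"],
            )
        }

    tuning_set = ds.map(to_prompt)
    print(tuning_set[0]["text"])

The resulting "text" field is what a standard causal-LM fine-tuning loop would tokenize and train on; the base model and training setup are a separate choice.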
| dreaminvm wrote:
| Happy to see this type of work that is truly open source and commercially usable. Is this the entire corpus or a subset? Do you intend to release any new iterations?
|
| I've been thinking of starting similar efforts at another BigCorp by hosting a UL2 or GPT-J instance.
| pwendell wrote:
| 15k is the entire corpus we have right now. Hopefully others can join up in releasing additional samples that can be merged in over time.
|
| We'll definitely keep iterating on Dolly and releasing everything openly.
| m3kw9 wrote:
| I'm not seeing how 15k Q&A training examples can get you much other than the simplest things. Maybe that's the point, get the ball rolling for people to add more training data?
| whimsicalism wrote:
| Read about RLHF, I think you are misunderstanding what this will be used for.
| esafak wrote:
| A specific reference would help readers.
| whimsicalism wrote:
| good point! https://huggingface.co/blog/rlhf :)
|
| I think the resources out there so far are not great yet
| swid wrote:
| It's used for fine-tuning a pre-trained model. This takes an LLM that is already capable of emulating lots of different kinds of personalities, and narrows it down to act more like the examples. Since the heavy lifting has already been done, 15k examples of a chatbot following instructions the way you want have a significant effect.
| gamegoblin wrote:
| What reasons do you have for believing that is true?
|
| It seems plausible to me that a general autoregressive LLM that is capable of completing text wouldn't take _that_ much fine-tuning to shift it from "text completion" to "instruction following".
|
| After all, the raw GPT3 model can be made to follow instructions with just a few examples.
|
| Consider the prompt:
|
|     What is the capital of France?
|
| Raw GPT3, not the newer instruction-tuned variants, does not understand it's being asked a question. It offers the completion:
|
|     What is the capital of France? If a student answers with a
|     word, she is asked to identify the word. She is not asked
|     whether the capital of France is Paris. On the other hand,
|     if the student answers by pointing to a map, she is asked to
|     identify the capital of France. She is not asked whether it
|     is Paris.
|
| It just starts appending to the text.
|
| But if you give it a few examples, it happily gets into instruction-following mode:
|
|     The following is a transcript between a human and a helpful
|     AI assistant who answers questions and obeys commands.
|     Human: How many eggs are in a dozen?
|     AI: 12
|     Human: Say "hello" 3 times
|     AI: hello hello hello
|     Human: What is the capital of France?
|     AI:
|
| GPT3 completes "Paris" here.
|
| If you can get decent instruction/question following behavior out of a 2-shot example prompt, why do you think 15k is small for this?
| m3kw9 wrote:
| Just saying if you ask for the capital of an obscure country that it hasn't been trained on, you will not get the answer, so 15k will get you some general stuff only within those confines. Also, to code, you will need pretty complete documentation for it to ingest and then enough examples of how the code is done.
| gamegoblin wrote:
| 15k is not the full training corpus. The model is trained on huge swaths of internet text. 15k is just the fine-tuning corpus to show it how to follow instructions. Stuff like world capitals and such are already present in the model weights due to being trained on tons of internet text.
|
| With the raw LLM, you can get the capital of Mongolia with the prompt "The capital of Mongolia is", i.e. text completion. The fine-tuning allows you to get at that information by asking questions or giving commands, e.g. "Tell me the capital of Mongolia".
| dontupvoteme wrote:
| N-shot at inference time is fundamentally different from training/fine-tuning, which is inherently pre-inference-time.
|
| Though it would be interesting to know if OpenAI has a few generic multishot inputs before the prompt.
|
| It's all extremely cryptic what the actual context window and system prompt are with them (assuming ChatGPT is even using the same API the proles are given).
| gamegoblin wrote:
| The claim is not that they are fundamentally different or similar; the claim is that one doesn't need that much data to get instruction-following behavior from a raw autoregressive LLM. K-shot prompting shows that the _capability_ to follow instructions is present in the model. It's just a matter of using fine-tuning to keep the model in that frame all the time without a K-shot prompt.
| simonw wrote:
| Here's a link to open up and explore that training data in Datasette Lite:
| https://lite.datasette.io/?json=https://github.com/databrick...
| rnosov wrote:
| I'm going through the dataset with your datasette tool and it looks like it might be a good idea to clean things up a bit. There are many duplicates[1], creepypastas[2] and other strange things in there.
|
| [1] https://lite.datasette.io/?json=https%3A%2F%2Fraw.githubuser...
|
| [2] https://lite.datasette.io/?json=https://github.com/databrick...
|
| EDIT: Maybe I'm passing the link wrong; the query I'm using is
|
|     select count(instruction), instruction,
|            group_concat(context, ' ============= ') as c,
|            group_concat(response, ' ============= ') as r,
|            group_concat(category, ' ============= ') as cat
|     from [databricks-dolly-15k]
|     group by instruction
|     having count(instruction) > 1
|     order by count(instruction) desc
|     limit 100
|
| [databricks-dolly-15k] should be the name of the dataset; the first column is the number of instruction duplicates.
|
| Creepypastas are responses to the instruction:
|
|     Imagine you are the last person on Earth. Write a diary
|     entry describing your thoughts and feelings.
| robterrell wrote:
| Typo on row 7!
| rnosov wrote:
| row 7 is the name of the dataset, you might need to load it yourself
| jarek83 wrote:
| Can someone help me to understand why the categories for these two differ?
|
| row #51 "Think of some family rules to promote a healthy family relationship" - brainstorming [1]
|
| row #68 "What is the future for human?" - general_qa [2]
|
| In nature they both are brainstorming to me - is the question mark what got #68 assigned as general_qa?
|
| [1] https://lite.datasette.io/?json=https://github.com/databrick...
|
| [2] https://lite.datasette.io/?json=https://github.com/databrick...
| gpm wrote:
| The labelling doesn't seem to be entirely consistent to me, but I think the idea is that 51 is inviting you to brainstorm, while 68 is asking a question that just happens to be open ended.
| kumarski wrote:
| Amazing.
|
| Love databricks.
| mrtranscendence wrote:
| Databricks is fine. I wasn't happy using it until they implemented the ability to work in a git repo, with proper file support, but that's gone some way to making it more usable to me.
| The interface sucks pretty hard, slowing down and using a significant amount of memory with only a modestly high number of cells (where a JupyterLab notebook would remain very snappy). I also wish there were a better story for local development; they've addressed this to some degree recently but I'm not sold on their solution.
|
| It's certainly better than what we did prior to Databricks, which was roll our own in-house provisioning and notebook solution. I won't/can't go into too many details, but not only was it cumbersome and very buggy, but it was as if they designed it to encourage data scientists to spend as much money on compute as possible (only to panic at the millions they were spending). They dropped it for cost reasons, which is hilarious given how expensive Databricks is.
|
| I do appreciate the work Databricks have done improving Spark. Capabilities like adaptive query execution have made optimization significantly easier.
| zan2434 wrote:
| Anyone wanna convert this to GGML so we can run it with LLaMa.cpp?
| simonw wrote:
| I got this model working on a GPU instance, notes here:
| https://til.simonwillison.net/llms/dolly-2
|
| Anyone managed to run it on an M1/M2 Mac yet?
| Garcia98 wrote:
| What's the most cost-effective alternative to Paperspace? I had a nightmarish experience with them last week after my account got locked up twice when I was training a model with a 1.5 GB dataset that somewhere contained the string "Minecraft Server".
| Daegalus wrote:
| I'm not an expert, and I don't have Nvidia, but I assume you need to set up CUDA and install the CUDA PyTorch stuff?
|
| Most docs I've read on setting up fine-tuners and inference require some extra stuff. Taking some LoRA fine-tuners, they include instructions like this:
|
|     conda create -n llm-finetuner python=3.10
|     conda activate llm-finetuner
|     conda install -y cuda -c nvidia/label/cuda-11.7.0
|     conda install -y pytorch=2 pytorch-cuda=11.7 -c pytorch
|
| When I experimented with Stable Diffusion and ROCm (AMD card), I had to do similar but with pytorch-rocm, and when I was doing CPU only, `pytorch-cpu`. So maybe your attempt didn't use the GPUs at all, because 12 mins is about what I had on a CPU for inference on other models of similar size.
| zamnos wrote:
| The error message implies that the compiled default libraries on the M1 don't support the model format, even though it works fine in Paperspace.
|
|     The argument `trust_remote_code` is to be used with Auto classes.
|     It has no effect here and is ignored.
|     Traceback (most recent call last):
|       File "/Users/fragmede/projects/llm/dolly/foo.py", line 5, in <module>
|         instruct_pipeline = pipeline(
|                             ^^^^^^^^^
|       File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 776, in pipeline
|         framework, model = infer_framework_load_model(
|                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
|       File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/base.py", line 271, in infer_framework_load_model
|         raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
|     ValueError: Could not load model databricks/dolly-v2-12b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).
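For reference, the call that traceback is bailing out of is essentially a standard transformers pipeline load of the model from the Hub. A minimal sketch of that invocation, as it would run on a machine with enough GPU memory, is below; the torch_dtype and device_map settings are generic memory-saving options rather than Dolly-specific requirements (device_map="auto" additionally needs the accelerate package), and on an M1/M2 Mac the GPTNeoX classes may simply fail to load, as the traceback above shows.

    # Sketch: load Dolly v2 through the transformers pipeline on a CUDA GPU.
    # trust_remote_code is needed because the repo ships a custom
    # instruction-following pipeline class; dtype/device_map values here are
    # assumptions, tune them for the available hardware.
    import torch
    from transformers import pipeline

    generate_text = pipeline(
        model="databricks/dolly-v2-12b",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )

    result = generate_text("Explain the difference between nuclear fission and fusion.")
    print(result)  # the exact output structure depends on the custom pipeline class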
| brianjking wrote:
| I'm sure we'll see this within the next day or two.
| gavi wrote:
| Not on M1/M2 yet, but my response time seems pretty fast on a Tesla V100-SXM2-16GB
| rnk wrote:
| How much RAM is likely needed on an Apple ARM machine for models like this? And for general use, 64, 96, 128? Trying to decide how large I should go for a new laptop.
| Szpadel wrote:
| AFAIK current models can run even with 64GB, but I would assume that we will very likely have bigger models very soon, so I guess the answer is as much as you can afford.
| rnk wrote:
| The next question is M1 or M2, and the impact of the varying number of GPU cores across the Pro, Max and Ultra SKUs. I'm really tempted to buy a "refurbished M1 Studio" with 128GB because I think the RAM is the key. Have not seen any benchmarks comparing different GPU counts / different SKUs.
| anentropic wrote:
| I saw this: https://github.com/jankais3r/LLaMA_MPS
|
| it runs slightly slower on the GPU than under llama.cpp but uses much less power doing so
|
| I would guess the slowness is due to immaturity of the PyTorch MPS backend; the asitop graphs show it doing a bunch of CPU along with the GPU, so it might be inefficiently falling back to CPU for some ops and swapping layers back and forth (I have no idea, just guessing)
| rnk wrote:
| Hey, thanks so much. That solidifies the case for the 128GB Mac Studio. Apple could be selling a bunch of these things with these high RAM capabilities.
| aldarisbm wrote:
| same same
| zamnos wrote:
| The answer is as large as you can afford, really. Future, less optimized models are only going to be hungrier for RAM.
| mrtranscendence wrote:
| I very recently purchased a MacBook Pro (M1 Max) with 64GB of RAM. I haven't experimented _that_ much, but I was able to run inference using the 65B-parameter Llama model with quantized weights at a speed that was reasonably usable (maybe a touch slower than ChatGPT with GPT-4).
|
| I haven't attempted to use the 65B model with non-quantized weights, but the smaller models work that way, if slowly. With 96GB of RAM -- the upper limit of a MacBook Pro -- you might be able to use even larger models, but I think you'd hit the limits of useful performance before that point.
|
| I should note that it can be a bit tricky getting things to work using the Mac's GPU. I couldn't get Dolly 6B to run on my work MBP, which theoretically should have enough RAM, though I still want to try it on my personal laptop.
| rnk wrote:
| I see a refurbished M1 with 2TB/128GB for $4700; looks like a similar price for an M2 with the same storage/RAM with my corp discount (20 CPU / 48 GPU). This is a tough decision.
| mrtranscendence wrote:
| I attempted using the Transformers library but failed. Not sure, might be a VRAM issue; I'm going to try on my far beefier personal MacBook Pro later tonight.
| mydpy wrote:
| Benchmarks here:
| https://huggingface.co/databricks/dolly-v2-12b#benchmark-met...
| omneity wrote:
| > As outlined above, these results demonstrate that dolly-v2-12b is not state of the art, and in fact underperforms dolly-v1-6b in some evaluation benchmarks. We believe this owes to the composition and size of the underlying fine tuning datasets, but a robust statement as to the sources of these variations requires further study.
|
| Taking a moment to appreciate the integrity of the team.
| ingenieroariel wrote:
| Ditto, this is "release early, release often" without necessarily meaning "move fast and break things".
| Other teams can do the equivalent of Alpaca-to-LLaMA and we can all learn for the next round.
| xatalytic wrote:
| One of the creators here - yeah, the thing we have our eyes on is the vector, not the point.
|
| It's astounding how adaptable these open models are, even with just a quarter of the Alpaca data. We're a team of machine learning engineers and hackers, not an AI science lab, but that's kind of the point frankly - this whole exercise appears to be far easier than it might at first seem.
| itake wrote:
| Why are they not doing metrics against GPT-3.5 and GPT-4? My understanding is Dolly performs significantly worse.
| thewataccount wrote:
| I haven't played with the model just yet - but just eyeballing its performance, it's significantly worse. I'm surprised they don't have Pythia on there, as that's what they're based on, from my understanding.
|
| At their performance level it's most important to compare to GPT-neoX, and I do appreciate they aren't making the "95% of GPT4" claims that some fine-tuned llama models are.
|
| EDIT: For Databricks people: I'd love to see this compared with Pythia, LLaMa, Alpaca, and Vicuna/GPT4All if possible.
___________________________________________________________________
(page generated 2023-04-12 23:01 UTC)