[HN Gopher] Fine-tuning Mistral 7B on Magic the Gathering Draft ___________________________________________________________________ Fine-tuning Mistral 7B on Magic the Gathering Draft Author : dmakian Score : 209 points Date : 2023-12-05 16:33 UTC (6 hours ago) (HTM) web link (generallyintelligent.substack.com) (TXT) w3m dump (generallyintelligent.substack.com) | dwrodri wrote: | It's not the most revolutionary change to our daily lives, but I | do genuinely look forward to playing against bots that have | interesting play styles for games like Magic: the Gathering. I | think this is a clear case where it could drastically improve the | ability of the R&D team to come up with and test new mechanics | at different levels of play. | danbrooks wrote: | Super interesting that drafts can be represented with LLMs. | | The best-performing draft AIs I've seen leverage representation | learning in some form. | | See: https://arxiv.org/pdf/2107.04438.pdf | dmakian wrote: | I hadn't seen this, this is awesome! You'd think given the | volume of data available that this type of method would | outperform an LLM; cool results. | | Still some fun things about LLM representations -- you can | give the bots preferences / personality in a | system prompt, which is entertaining! | rkwz wrote: | > I was particularly interested in testing models' ability to | reason (i.e., perform a somewhat complex task that requires high | context understanding) about out-of-distribution (i.e., unseen) | data. | | I was under the assumption that fine-tuning LLMs was useful only | when you need to change the model's tone (speak like a pirate, | Voldemort, etc.). | | Are there other examples where LLMs were trained to reason a | particular way? | minimaxir wrote: | You can get a standard LLM to change tone just by giving it a | system prompt/instruction to follow a certain tone.
| | The only issue there is that sometimes the RLHF seeps through, | which can be solved by system prompting even harder. | skerit wrote: | Aren't a lot of base models fine-tuned with (Q)LoRA on | instruct-based datasets with good results? I thought this was a | very common practice? | selfhoster11 wrote: | Check out Orca. IIRC, it's a technique that aims to encode | additional logical capabilities into smaller models by having | larger models generate step-by-step solutions to various | problems. This doesn't just make them speak more like | GPT-4/3.5, but supposedly makes them think more like it as | well. | dmakian wrote: | > I was under the assumption that fine-tuning LLMs was useful | only when you need to change the model's tone (speak like a | pirate, Voldemort, etc.). | | A lot of why I tried this out was to test the limits of this | belief; you see a lot of talk like this out there and it | sounded like nonsense to me. | | Fine-tuning is fundamentally not much different from continued | pretraining; if you feed the model high-quality and high-volume | data, I think it's reasonable to expect it to acquire new skills. | oceanplexian wrote: | In order to speak like a pirate, it has to be able to reason :) | I've done some fine-tunes as well, similar to the MTG example; | in mine I was fine-tuning it to speak JSON and reason about | some input, and yes, you can indeed get these models to perform | on novel tasks. | samus wrote: | Fine-tuning is a useful workaround for cases when the context | size is unsuitable for the task at hand. Does anybody know whether | it was ever considered to fine-tune an LLM on the Linux kernel | sources' history and its associated mailing lists? | dacox wrote: | Wow, I have exactly the same side project in progress, minus the | fine-tuning part. We even chose the same names and phrasing for | parts of the project. | dmakian wrote: | Would love to compare notes, drop me an email at dshersh at | umich dot edu if you'd be interested!
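The instruct-style fine-tuning skerit and dmakian describe is largely a data-formatting exercise: each training example pairs a prompt with the completion the model should learn to produce. Below is a minimal, self-contained sketch of that step for draft picks; the record fields, prompt wording, and card choices are hypothetical illustrations, not the article's actual template:

```python
# Sketch of turning a raw draft-pick record into an instruct-style
# fine-tuning example. The record format and prompt text here are
# assumptions for illustration, not the article's real data format.

def to_example(record):
    """Format one draft-pick record as a prompt/completion pair."""
    pack = ", ".join(record["pack"])
    pool = ", ".join(record["pool"]) or "(empty)"
    prompt = (
        f"You are drafting. Your pool so far: {pool}.\n"
        f"The pack contains: {pack}.\n"
        "Which card do you pick?"
    )
    # During fine-tuning, the model is trained to emit the completion
    # given the prompt (next-token prediction over the pair).
    return {"prompt": prompt, "completion": record["pick"]}

example = to_example({
    "pack": ["Dead Weight", "Moon-Circuit Hacker"],
    "pool": [],
    "pick": "Dead Weight",
})
```

Libraries like Axolotl or PEFT then consume a file of such pairs; the model weights do the rest.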
| throwaway743 wrote: | Would like to know: how many matches were won per draft token? If | it's less than 2, I'll stick to my shitty hand picks :/ | reactordev wrote: | I like how it identified that you haven't committed to either | white or blue yet. It was aware of deck _composition_ and not | just going for the jugular. Keep tuning. It could also be human | bias, because you also _played_ the hand. Have someone else draft | against your LLM and then you play it and see if it's the same. | Statistically it should match given enough games. | freediver wrote: | Super interesting work. Do you have thoughts on how to leverage this | to create a deck builder AI that would also simulate games? The | major problem here is that the search space for MTG is amazingly | vast. | | I've seen this effort previously, pretty exciting stuff: | | https://www.youtube.com/watch?v=Xq4T44EvPvo | dmakian wrote: | I've definitely thought about this problem and think it's in | the range of 'feasible', but it would be pretty slow and | expensive given how much context you need to provide a model | for it to be able to reason about the game state. Worth trying | though! | imjonse wrote: | Confusing name for the domain (Generally Intelligent), since it's | the former name of a company in the AI/LLM area but does not seem | to be related. | matsemann wrote: | How is the fine-tuning actually performed? They have the data of | drafts, and a prompt. But what does one do with it, more | concretely? | dmakian wrote: | High level it's basically: 1. Generate a lot of text examples | that look like this: | https://gist.githubusercontent.com/davidhershey/f57d0b19563f... | | 2. The model is effectively trained to predict the next token | based on the previous tokens in each of these examples, which | has the side effect here of teaching it to make a draft pick | based on the contents of a pack.
| | Nothing too fancy, just next-word prediction more or less. | mdaniel wrote: | In case you didn't see it, | https://news.ycombinator.com/item?id=38525978 (I hacked Magic the | Gathering: Arena for a 100% win rate) may interest this audience, | if for no other reason than that the investigator discovered that | Sparky, the pseudo-AI in MTGA, doesn't appear to be as stupidly | complicated as one may have suspected from the outside. | chc4 wrote: | Sparky is the Arena AI, but no one ever accused it of being a | _good_ Arena AI - it is very much only there for the new-player | experience of playing against a dumb computer when you're | first exposed to the game and don't know the rules, or for the | computer equivalent of "playing against a goldfish": a deck you | made to see how it draws or combos. It's not a Chess CPU. | mdaniel wrote: | I hope I also did not accuse it of being good, but the | observation I was trying to make is that -- according to the | article, I have not myself confirmed the claim -- they run | the card evaluation logic _and gameplanning_ locally, not in | a data center full of H100s, which I consider to be quite a | feat given the free-text-y self-modifying rules of M:TG. | greysphere wrote: | It would be interesting to compare to training a NN to draft w/o | the Mistral starting point (both by epoch and by $). It's not | obvious to me why the LLM component would be relevant. Maybe | there are enough deck lists or mock drafts on the internet to | have an influence, I suppose. Or maybe 'fine tune an llm' just has | more infrastructure than 'create a nn'. Maybe we need a nnfiddle | to make that easier. | apetresc wrote: | Without Mistral, how would you get it to generalize to cards it | hasn't seen before? I assume by "training a NN to draft without | Mistral" you mean where the input layer is just a bitmapped | vector of the cards in the pack, right?
The killer feature of | this experiment is that it works on sets the model has never | seen before and has 0 training data on, using just the text of | the card. I don't think you can do that without an LLM. | greysphere wrote: | That's a good point. It looks like the article hints at some | success on that front. It'd be interesting to see what that | means quantitatively. Interesting that this delta could even | be used as a measure of the LLM's value. | | I'd be curious about the difference in success w/ drafts on a | new 2/2 bear with a different name, and cards with a new | keyword 'fizzbangitude 7' as well. | filterfiber wrote: | The benefit of the LLMs is that the checkpoint already | "understands" a lot by default. Finetuning is relatively cheap | and makes many tasks such as this one perform decently well | simply by shoving some data into it. | | The base checkpoint takes a lot of compute to make, but that's | what holds most of its "knowledge", so to speak. | | Making a NN from scratch means you'll have to somehow map the | cards into inputs. I have limited knowledge of how MTG works, | but most TCGs have text descriptions and complex effects. | Mapping text to logic is what LLMs are really good at; | otherwise you're starting from scratch and will also need a | relatively large amount of compute before it starts displaying | any type of decent behaviour. | | It's also easy for most software devs to do this: finetuning | mostly consists of collecting text and feeding it into a | finetuning script. You don't need to know linear algebra, what | a "convolution" is, etc. to do finetuning. | apetresc wrote: | If I'm reading the author's writeup correctly, the prompt he's | giving the agent at each pick contains only the _names_ of the | cards in its pool so far, and only gives the full text for the | cards in the pack it's being passed. It doesn't look like | context is being maintained between picks, presumably for context | window size reasons.
| | If so, and if he's correct in his assumption that these sets are | out of the bot's training cutoff window, then surely it's purely | coincidence if it ends up being a good drafter? The bot would | have literally no way to know what cards work well with its | previous picks, what signals have been sent and received in the | draft so far, etc. Not even the best human player could take (for | example, from the sample prompt) "Gadwick's First Duel -- {1}{U} | (uncommon)" and figure out what works well with that (if they've | never seen the card before). | | It would just end up picking generically good draft cards that | share a color with its previous picks. Which is already what | pick-order-based heuristics have always done. | dmakian wrote: | > If I'm reading the author's writeup correctly, the prompt | he's giving the agent at each pick contains only the names of | the cards in its pool so far, and only gives the full text for | the cards in the pack it's being passed. It doesn't look like | context is being maintained between picks, presumably for | context window size reasons. | | Not quite -- there are a few ways the model learns the full card | text: | | * The models are trained on card trivia completions as well, | where they're asked to complete the full text of the card as | well as information about it (type, CMC, etc.) | | * The models do still have to learn next-token completion on | the cards in packs, meaning they learn to predict the full text | of the cards while making draft picks as well. | | Net net, the bots learn the text of the new cards pretty | comprehensively. | apetresc wrote: | Ooh, I see! You do that with Mistral 7B, I'm guessing? But not | with the small GPT-3.5 trial you did?
| dmakian wrote: | The two larger GPT-3.5 trials also got the card trivia | examples, but like a bad scientist I don't have a great | control group for those. | apetresc wrote: | And also, since it seems you're the author, can you also | clarify if your methodology allowed for the bot to track | signals outside of the color-identity-count summary | statistic you pass in the prompt? Something like allowing | it to notice that a card has wheeled, or that a certain | synergy piece was passed a few picks ago. | dmakian wrote: | Only the statistics you see in the prompt (which are | clearly limited). I have a lot of ideas about how you | could improve that context (most likely letting the AI | record and track notes throughout a draft), but this one | was relatively simple to implement. Definitely room for | improvement! | chc4 wrote: | Haha, I don't know anything about AI training, but that's a | really cute trick. | zoogeny wrote: | I like that this shows how hard even conceptually simple ideas | are to achieve by fine-tuning LLMs. Even given a pretty good | starting dataset, a decent starting model, etc., this appears to | have been a challenge. | | One thing it did make me think about was that these models are | suitable for things that don't have a natural definitive answer. | That is, picking the perfect card given a set of picks is | probably combinatorially impossible to solve. But picking a | _good_ card given a set is possible, and LLMs can approach human- | level performance. | | I think this leads to a set of problems that current LLMs may be | fine-tuned to solve. | dharmab wrote: | That lines up with my experience: for high-stakes decisions, | they rarely give me a great answer. But for low-stakes | decisions, they do well at giving me a good enough answer. For | example, I've been using them to help find gifts for friends | and children this month. I don't need the best choice to solve | the problem, just a good one.
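zoogeny's "combinatorially impossible" point is easy to make concrete. Assuming a standard draft of three 15-card packs where a seat picks one card at a time (and ignoring everything the other seven drafters do, which makes the real game tree far larger), the number of distinct pick sequences one seat could make is (15!)^3:

```python
import math

# One 15-card pack: pick 1 of 15, then 1 of 14, ... -> 15! orderings.
per_pack = math.factorial(15)

# A draft is three packs, so one seat's pick-sequence space is (15!)^3.
draft_sequences = per_pack ** 3

print(per_pack)         # 1307674368000 sequences for a single pack
print(draft_sequences)  # on the order of 2.2e36 for a full draft
```

Exhaustively ranking "the perfect pick" over that space is hopeless, which is why approximating a merely good pick is the realistic target.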
| pixl97 wrote: | How much additional calculation occurs in high-stakes | decisions by individuals? Also, what is the variability in | quality of high-stakes decisions in humans? | | I'm guessing the LLM decision is rather average, but that the LLM | has no easy way of spending the extra time to gather | information around said high-stakes decisions like a human | would. | falcor84 wrote: | I wonder if you could define a specific complexity class of | problems that LLMs are good at. | doctorpangloss wrote: | > With that data, you can extract "ground truth" by looking at | the draft picks made by the best players on the service (sorted | by win rate). | | Do you mean that you are looking at the draft picks from | https://www.17lands.com/leaderboard and then sorting by Win Rate? | Didn't you mean to choose Match Wins or Trophies? Otherwise, | you're not measuring the best players on the service. You're | training on draft choices where most choices were very good - | i.e., a win rate sort will show you the luckiest players, not the | best ones. That will naturally show up in any validation or | testing you do too. | | Shouldn't this be compared not to an LLM baseline, but to a | baseline where an "Elo"-style score is computed for each card | compared to others from the 17lands data; then, until you have | two colors, suggest the best-scoring card, or when you do have | color(s), suggest the best-scoring card within that color or a | land? | | I think it is possible for the LLM to have some semblance of | rules knowledge, but it is more likely that it is picking up on | card rarity, costs, and "Big" more than anything else for unseen | cards. | | Your "accuracy" on the draft seems poor. I'm not sure it means | what you think it means. Are you saying that when looking at the | high win rate choices, where all the choices were mostly good, | you happened to pick the choice that isn't the same as the player | who originated the data?
It actually seems harder to make a | choice among all good choices. | | Anyway, there is quite a bit going on here. | dmakian wrote: | > Do you mean that you are looking at the draft picks from | https://www.17lands.com/leaderboard and then sorting by Win | Rate? Didn't you mean to choose Match Wins or Trophies? | Otherwise, you're not measuring the best players on the | service. You're training on draft choices where most choices | were very good - i.e., win rate sort will show you the luckiest | players, not the best ones. That will naturally show up in any | validation or testing you do too. | | Ahh no just unclear in the post, I'm filtering to players in | 17lands with a > 62% match win rate who are drafting at a high | ranking (>=diamond rank). I look at all of those players' | drafts though, even the ones where they do poorly. | | > Your "accuracy" on the draft seems poor. I'm not sure it | means what you think it means. Are you saying that when looking | at the high win rate choices, where all the choices were mostly | good, you happened to pick the choice that isn't the same as | the player who originated the data? It actually seems harder to | make a choice among all good choices. | | Accuracy here is making the same choice from a given pack as | one of the good players. Obviously subjective so not a perfect | metric, but a decent check on ability to emulate a high-quality | drafter. | doctorpangloss wrote: | Hmm, but that will filter out more than half the players on | the Match Wins and Trophies based leaderboards, many of them | Diamond and Mythic. So I think your choice of 62% match win | rate is almost certainly disproportionately selecting for | people who received very good draft choices, even if it | includes some actually very good players in the data set. 
| | I mean, 62% might feel like a good number, but it's arbitrary; | you'd have to justify how you chose it, and just eyeballing | it, it is filtering out a lot of very good players with many, | many more match wins. | | Perhaps you can sort by Latest Rank and filter out people | with 2 or fewer trophies. Or you will have to validate with | known bad draft choices in the prompt, to see what it does. | Suffice it to say, I still don't think the 17Lands data | represents what you think it does. | | Like, without a direct discussion about measuring and | accounting for luck in the draft... for all I know the data | is seriously flawed. It probably isn't, but it's maybe one of | many, many issues to address when dealing with strategy card | game AI problems. | dmakian wrote: | Still not clear maybe: I'm selecting players with a 62% | lifetime win rate, so mostly players who have been good over | a larger number of drafts! | | Definitely not perfect data though, and I agree that defining | good in this context is hard -- a lot of the variance of | "good" depends on how you play the cards either way. All | good points! | doctorpangloss wrote: | > I'm selecting players with a 62% lifetime win rate so | mostly players who have been good over a larger number of | drafts! | | Hmm, but there are players with greater than a | 62% lifetime win rate who have very few drafts, and there may | be many of those players... do you see? The win rate | isn't a good filter. You chose it, you are trying to | justify it, and I'm not convinced, not without the hard | numbers. | | I'm not confused about what filter you chose. I just | think it's a bad filter, and you haven't thought very | deeply about how it affects the data, which presumably | includes your test and validation data - however you're | choosing to test and validate, apparently by hand, by | some eyeballed examples.
| | Anyway, I think you have to compare with a non-LLM, non-random | baseline to have any sense of whether this stuff is | working at all. I could be dead wrong. I would maybe | compare with a community draft picker. | Palmik wrote: | In Elo-like matchmaking, you typically pair people | such that they are likely to have a 50% chance to win. | Therefore, as the OP says, filtering down to people with a high | (60+%) lifetime win rate creates some sort of (interesting) | bias. | | I would select from all games played at a sufficiently high | level. | gigel82 wrote: | For some reason I thought fine-tuning was not possible without | specialized hardware (A100 / H100). Where can I learn more about | hardware requirements for fine-tuning on consumer GPUs? | dmakian wrote: | There is not a lot of great content out there making this | clear, but basically all that matters for basic fine-tuning is | how much VRAM you have -- since the 3090 / 4090 have 24GB VRAM, | they're both pretty decent fine-tuning chips. I think you could | probably fine-tune a model up to ~13B parameters on one of them | with PEFT (https://github.com/huggingface/peft). | mmcwilliams wrote: | Definitely possible on even older off-the-shelf hardware. I use | 24GB 4090s for 13b-sized models and have even used 12GB Titans | for 7b models, admittedly at much slower rates. | viraptor wrote: | You can also use Apple silicon for this: https://www.reddit.c | om/r/LocalLLaMA/comments/15y9m64/fine_tu... | gigel82 wrote: | I have a 3080Ti with 12GB VRAM and would like to try fine-tuning | the same Mistral 7B model (which I found incredibly | potent). Any tips on how to get started? | iEchoic wrote: | Really interesting, thanks for writing this up. I'd love to see | this applied to actually playing the game, provided that you | could fit a (long) game state in the context window.
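The VRAM intuition dmakian and mmcwilliams describe above can be roughed out with simple arithmetic: quantized base weights, plus a small trainable adapter, plus optimizer state for only that adapter. The constants below are rule-of-thumb assumptions (adapter fraction, overhead), not measured numbers:

```python
def qlora_vram_estimate_gb(n_params_b, weight_bits=4, adapter_frac=0.01,
                           overhead_gb=2.0):
    """Very rough VRAM estimate for QLoRA-style fine-tuning.

    n_params_b:   base model size in billions of parameters.
    weight_bits:  quantized weight precision (4-bit is typical for QLoRA).
    adapter_frac: trainable LoRA params as a fraction of the base model
                  (assumed; depends on rank and target modules).
    overhead_gb:  activations, CUDA context, batch, etc. (assumed flat).
    """
    weights_gb = n_params_b * weight_bits / 8        # quantized base weights
    # Adapter in fp16 (2 bytes/param) plus Adam state (~8 bytes/param).
    adapter_gb = n_params_b * adapter_frac * (2 + 8)
    return weights_gb + adapter_gb + overhead_gb

print(round(qlora_vram_estimate_gb(7), 1))   # roughly 6.2 GB for a 7B model
print(round(qlora_vram_estimate_gb(13), 1))  # roughly 9.8 GB for a 13B model
```

Which is consistent with the thread's experience: a 12GB card handles 7B models, and a 24GB 3090/4090 has comfortable headroom for ~13B.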
| tayo42 wrote: | I wonder if you could use a smaller model or get better results | if you treated each card as a token, gave the state of the draft | as an input, and had the predicted token be the card to pick. | You would have to train from scratch with a custom tokenizer. | float-trip wrote: | I tried adding special tokens for a reddit-style dataset once. | The format was: `<|post_author|>username<|post_title|>title | here...` | | The resulting model was so much worse than just formatting | everything as plaintext. This was with MPT-30B, 15 special tokens, | 300M training tokens, and a full finetune. | | I may have made a mistake, but I haven't seen any open-source | finetunes successfully add a large number of tokens yet either. | Tostino wrote: | Try doing the same thing in your dataset, but don't actually | add them as "special tokens"; just let them be | multiple tokens. | | Adding new tokens needs a ton of data to train what the token | means. Reusing existing tokens will allow you to easily | teach that a sequence of tokens now has a new meaning after | fine-tuning. | float-trip wrote: | That's what I ended up doing (`[Author] username [Title] | post title...`) | | > Adding new tokens needs a ton of data to train what the | token means. | | But how much? 300M tokens is fine for a simple version of | ChatML with ~4 tokens. Not for 15, at least in my case. | How does this relationship scale? | | Just trying to offer one datapoint for what doesn't work, | with the hedge that I might have just had a bug. | tayo42 wrote: | I don't mean add special tokens, but make the vocab only the | set of possible cards: each card is a token. | | A simple input might be <cards you hold> 1 14 56</end><cards | to pick> 5 64 2</end> -> the predicted token is the draft pick. | | Then train a transformer-based network from scratch.
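tayo42's card-as-token scheme boils down to a custom tokenizer whose entire vocabulary is the card pool plus a few separator tokens. A toy sketch of the encoding side (the separator names, card list, and resulting IDs are made up for illustration):

```python
# Toy tokenizer for the card-as-token idea: one token id per card,
# plus a handful of separator tokens. Vocabulary contents are hypothetical.
SPECIALS = ["<pool>", "</end>", "<pack>"]
CARDS = ["Dead Weight", "Gadwick's First Duel", "Moon-Circuit Hacker"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + CARDS)}

def encode(pool, pack):
    """Encode a draft state as token ids. A model trained on this
    vocabulary would then predict one more token: the card to pick."""
    toks = ["<pool>", *pool, "</end>", "<pack>", *pack, "</end>"]
    return [VOCAB[t] for t in toks]

ids = encode(pool=["Dead Weight"], pack=["Gadwick's First Duel"])
```

The catch float-trip ran into applies here too: every token in this vocabulary starts from a randomly initialized embedding, so the model needs enough data to learn what each card token means from scratch, and it has no way to generalize to cards outside the fixed vocabulary.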
| 8f2ab37a-ed6c wrote: | Thanks for sharing this, I found it helpful as an addition to my | homebrew curriculum for learning how to fine-tune open-source | LLMs. | objektif wrote: | Can you please point me to good resources on fine-tuning? | Thanks. | amrrs wrote: | Check out https://github.com/OpenAccess-AI-Collective/axolotl | 8f2ab37a-ed6c wrote: | Search for articles showing you code for fine-tuning Llama 2, | ideally including a Colab notebook that you can run and | modify yourself so that you have real code to work with. You | can try to modify their working example to suit your own toy | project as a first step. | float-trip wrote: | Thanks for the writeup. Rather than zeroing out the loss for the | prompt, did you also try using a weighted loss with Axolotl? At one | point, Microsoft's GPT-3 docs suggested this was beneficial when | the responses are short (like you have with "Cut in.") Domain | adaptation over subreddits/forums before finetuning may help as | well. | dmakian wrote: | > did you also try using weighted loss with Axolotl | | This is really smart, I didn't think about this! Will add it to | my list of things to try, great idea! | | > Domain adaptation over subreddits/forums before finetuning | may help as well. | | I was thinking about this too (along with transcribing draft | YouTube videos), I'd definitely be curious how much this helps. | rgbrgb wrote: | > I ended up renting an hourly GPU from Runpod (an RTX 4090 w/ | 24GB of VRAM) for ~$0.7/hr. | | Sorry if I missed this, but how much did it cost in total to do the | fine-tune? Is that the 40-hour number (~$27)? | | Also, very cool writeup. Thanks for sharing! | dmakian wrote: | The longest-running fine-tuning job took about 8 hours, so ~$5. | | I think if you add up all of the learning and testing I did, | probably closer to ~$50 total. | sva_ wrote: | Hmm, is "Generally Intelligent" related to the company that | previously had that name, but renamed itself to "Imbue"? Sort of | confused.
| | https://www.ycombinator.com/companies/imbue | lubutu wrote: | Lurrus into Dead Weight -- that's a nice start. ___________________________________________________________________ (page generated 2023-12-05 23:00 UTC)