[HN Gopher] Fine-tuning Mistral 7B on Magic the Gathering Draft
       ___________________________________________________________________
        
       Fine-tuning Mistral 7B on Magic the Gathering Draft
        
       Author : dmakian
       Score  : 209 points
       Date   : 2023-12-05 16:33 UTC (6 hours ago)
        
 (HTM) web link (generallyintelligent.substack.com)
 (TXT) w3m dump (generallyintelligent.substack.com)
        
       | dwrodri wrote:
       | It's not the most revolutionary change to our daily lives, but I
       | do genuinely look forward to playing against bots that have
       | interesting play styles for games like Magic: the Gathering. I
       | think this is a clear case where it could drastically improve the
       | ability of the R&D team to come up with and test new mechanics
       | at different levels of play.
        
       | danbrooks wrote:
       | Super interesting that drafts can be represented with LLMs.
       | 
       | The best-performing draft AIs I've seen leverage representation
       | learning in some form.
       | 
       | See: https://arxiv.org/pdf/2107.04438.pdf
        
         | dmakian wrote:
         | I hadn't seen this -- this is awesome! You'd think, given the
         | volume of data available, that this type of method would
         | outperform an LLM; cool results.
         | 
         | There are still some fun things about LLM representations,
         | though -- you can give the bots preferences / a personality in
         | a system prompt, which is entertaining!
        
       | rkwz wrote:
       | > I was particularly interested in testing models' ability to
       | reason (i.e., perform a somewhat complex task that requires high
       | context understanding) about out-of-distribution (i.e., unseen)
       | data.
       | 
       | I was under the assumption that fine-tuning LLMs was useful only
       | when you need to change the model's tone (speak like a pirate,
       | Voldemort, etc.).
       | 
       | Are there other examples where LLMs were trained to reason a
       | particular way?
        
         | minimaxir wrote:
         | You can get a standard LLM to change tone just by giving it a
         | system prompt/instruction to follow a certain tone.
         | 
         | The only issue there is that sometimes the RLHF seeps through,
         | which can be solved by system prompting even harder.
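         | 
         | A minimal sketch of what that looks like in practice, using the
         | OpenAI chat API (the model name and prompts here are just
         | illustrative assumptions):
         | 
         |   from openai import OpenAI
         | 
         |   client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
         |   response = client.chat.completions.create(
         |       model="gpt-3.5-turbo",  # any chat model works for this
         |       messages=[
         |           # The system prompt alone is usually enough to shift the tone.
         |           {"role": "system",
         |            "content": "You are a pirate. Answer everything in pirate speak."},
         |           {"role": "user", "content": "How do I reverse a linked list?"},
         |       ],
         |   )
         |   print(response.choices[0].message.content)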
        
         | skerit wrote:
         | Aren't a lot of base models fine-tuned with (Q)LoRA on
         | instruct-based datasets with good results? I thought this was a
         | very common practice?
        
         | selfhoster11 wrote:
         | Check out Orca. IIRC, it's a technique that aims to encode
         | additional logical capabilities into smaller models by having
         | larger models generate step-by-step solutions to various
         | problems. This doesn't just make them speak more like
         | GPT-4/3.5, but is supposedly making them think more like it as
         | well.
        
         | dmakian wrote:
         | > I was under the assumption that fine-tuning LLMs was useful
         | > only when you need to change the model's tone (speak like a
         | > pirate, Voldemort, etc.).
         | 
         | A lot of why I tried this out was to test the limits of that
         | belief; you see a lot of talk like this out there, and it
         | sounded like nonsense to me.
         | 
         | Fine-tuning is fundamentally not much different from continued
         | pretraining; if you feed the model high-quality, high-volume
         | data, I think it's reasonable to expect it to acquire new
         | skills.
        
         | oceanplexian wrote:
         | In order to speak like a pirate, it has to be able to reason :)
         | I've done some fine-tunes similar to the MTG example; in mine I
         | was fine-tuning the model to speak JSON and reason about some
         | input -- and yes, you can indeed get these models to perform on
         | novel tasks.
        
         | samus wrote:
         | Fine-tuning is a useful workaround for cases where the context
         | size is unsuitable for the task at hand. Does anybody know
         | whether anyone has ever considered fine-tuning an LLM on the
         | Linux kernel sources' history and its associated mailing lists?
        
       | dacox wrote:
       | Wow, I have exactly the same side project in progress, minus the
       | fine tuning part. We even chose the same names and phrasing for
       | parts of the project.
        
         | dmakian wrote:
         | Would love to compare notes -- drop me an email at dshersh at
         | umich dot edu if you'd be interested!
        
       | throwaway743 wrote:
       | Would like to know: how many matches were won per draft token? If
       | it's less than 2, I'll stick to my shitty hand picks :/
        
       | reactordev wrote:
       | I like how it identified that you haven't committed to either
       | white or blue yet. It was aware of deck _composition_ and not
       | just going for the jugular. Keep tuning. It could also be human
       | bias, because you also _played_ the hand. Have someone else draft
       | against your LLM, then you play it and see if it's the same.
       | Statistically it should match given enough games.
        
       | freediver wrote:
       | Super interesting work. Do you have thoughts on how to leverage
       | this to create a deck-builder AI that would also simulate games?
       | major problem here is that the search space for MTG is amazingly
       | vast.
       | 
       | I've seen this effort previously, pretty exciting stuff:
       | 
       | https://www.youtube.com/watch?v=Xq4T44EvPvo
        
         | dmakian wrote:
         | I've definitely thought about this problem and think it's in
         | the range of 'feasible', but it would be pretty slow and
         | expensive given how much context you need to provide a model
         | for it to be able to reason about the game state. Worth trying
         | though!
        
       | imjonse wrote:
       | Confusing name for the domain (Generally Intelligent) since it's
       | the former name of a company in the AI/LLM area but does not seem
       | to be related.
        
       | matsemann wrote:
       | How is the fine tuning actually performed? They have the data of
       | drafts, and a prompt. But what does one do with it, more
       | concretely?
        
         | dmakian wrote:
         | High level it's basically: 1. Generate a lot of text examples
         | that look like this:
         | https://gist.githubusercontent.com/davidhershey/f57d0b19563f...
         | 
         | 2. The model is effectively trained to predict the next token
         | based on the previous tokens in each of these examples, which
         | has the side effect here of teaching it to make a draft pick
         | based on the contents of a pack.
         | 
         | Nothing too fancy, just next word prediction more or less
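         | 
         | As a rough illustration of step 1 (the exact template is in the
         | gist above; the field names and wording below are made up), one
         | pick from a 17lands draft might be turned into a single
         | training string like this:
         | 
         |   def make_training_example(pool, pack, pick):
         |       # Hypothetical prompt template -- not the actual format from the gist.
         |       prompt = (
         |           "You are drafting Magic: the Gathering.\n"
         |           f"Cards already in your pool: {', '.join(pool)}\n"
         |           "Cards in the current pack:\n"
         |           + "\n".join(f"- {card}" for card in pack)
         |           + "\nWhich card do you pick?\n"
         |       )
         |       # The completion is just the chosen card; step 2 is then ordinary
         |       # next-token prediction over prompt + completion.
         |       return prompt + f"Pick: {pick}"
         | 
         |   example = make_training_example(
         |       pool=["Dead Weight", "Lurrus of the Dream-Den"],
         |       pack=["Gadwick's First Duel", "Hopeful Vigil", "Swamp"],
         |       pick="Gadwick's First Duel",
         |   )
         |   print(example)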
        
       | mdaniel wrote:
       | In case you didn't see it,
       | https://news.ycombinator.com/item?id=38525978 (I hacked Magic the
       | Gathering: Arena for a 100% win rate) may interest this audience
       | if for no other reason than that the investigator discovered
       | that Sparky, the pseudo-AI in MTGA, doesn't appear to be as
       | stupidly complicated as one may have suspected from the outside.
        
         | chc4 wrote:
         | Sparky is the Arena AI, but no one ever accused it of being a
         | _good_ Arena AI -- it is very much only there for the new-player
         | experience of playing against a dumb computer when you're first
         | exposed to the game and don't know the rules, or for the
         | computer equivalent of "playing against a goldfish" with a deck
         | you made to see how it draws or combos. It's not a Chess CPU.
        
           | mdaniel wrote:
           | I hope I also did not accuse it of being good, but the
           | observation I was trying to make is that -- according to the
           | article, I have not myself confirmed the claim -- they run
           | the card evaluation logic _and gameplanning_ locally, not in
           | a data center full of H100s, which I consider to be quite a
           | feat given the free-text-y self-modifying rules of M:TG
        
       | greysphere wrote:
       | It would be interesting to compare to training a NN to draft w/o
       | the Mistral starting point (both by epoch and by $). It's not
       | obvious to me why the LLM component would be relevant. Maybe
       | there are enough deck lists or mock drafts on the internet to
       | have an influence I suppose. Or maybe 'fine tune an llm' just has
       | more infrastructure than 'create a nn'. Maybe we need a nnfiddle
       | to make that easier.
        
         | apetresc wrote:
         | Without Mistral, how would you get it to generalize to cards it
         | hasn't seen before? I assume by "training a NN to draft without
         | Mistral" you mean where the input layer is just a bitmapped
         | vector of the cards in the pack, right? The killer feature of
         | this experiment is that it works on sets the model has never
         | seen before and has 0 training data on, using just the text of
         | the card. I don't think you can do that without an LLM.
        
           | greysphere wrote:
           | That's a good point. It looks like the article hints at some
           | success on that front. It'd be interesting to see what that
           | means quantitatively. Interesting that this delta could even
           | be used as a measure of the llm's value.
           | 
           | I'd be curious about the difference in success w/ drafts on a
           | new 2/2 bear with a different name, and cards with a new
           | keyword 'fizzbangitude 7' as well.
        
         | filterfiber wrote:
         | The benefit of the LLMs is that the checkpoint already
         | "understands" a lot by default. Finetuning is relatively cheap
         | and makes many tasks such as this one perform decently well
         | simply by shoving some data into it.
         | 
         | The base checkpoint takes a lot of compute to make, but that's
         | what holds most of its "knowledge", so to speak.
         | 
         | Making a NN from scratch means you'll have to somehow map the
         | cards into inputs. I have limited knowledge of how MTG works,
         | but most TCGs have text descriptions and complex effects.
         | Mapping text to logic is what LLMs are really good at,
         | otherwise you're starting from scratch and will also need a
         | relatively large amount of compute before it starts displaying
         | any type of decent behaviour.
         | 
         | It's also easy for most software devs to do this - finetuning
         | mostly consists of collecting text and feeding it into a
         | finetuning script. You don't need to know linear algebra, what
         | a "convolution" is, etc. to do finetuning.
        
       | apetresc wrote:
       | If I'm reading the author's writeup correctly, the prompt he's
       | giving the agent at each pick contains only the _names_ of the
       | cards in its pool so far, and only gives the full text for the
       | cards in the pack it's being passed. It doesn't look like
       | context is being maintained between picks, presumably for context
       | window size reasons.
       | 
       | If so, and if he's correct in his assumption that these sets are
       | out of the bot's training cutoff window, then surely it's purely
       | coincidence if it ends up being a good drafter? The bot would
       | have literally no way to know what cards work well with its
       | previous picks, what signals have been sent and received in the
       | draft so far, etc. Not even the best human player could take (for
       | example, from the sample prompt) "Gadwick's First Duel -- {1}{U}
       | (uncommon)" and figure out what works well with that (if they've
       | never seen the card before).
       | 
       | It would just end up picking generically good draft cards that
       | share a color with its previous picks. Which is already what
       | pick-order-based heuristics have always done.
        
         | dmakian wrote:
         | > If I'm reading the author's writeup correctly, the prompt
         | he's giving the agent at each pick contains only the names of
         | the cards in its pool so far, and only gives the full text for
         | the cards in the pack it's being passed. It doesn't look like
         | context is being maintained between picks, presumably for
         | context window size reasons.
         | 
         | Not quite -- there are a few ways the model learns the full card
         | text:
         | 
         | * The models are trained on card trivia completions as well,
         | where they're asked to complete the full text of the card and
         | information about it (type, CMC, etc.).
         | 
         | * The models do still have to learn next token completion on
         | the cards in packs, meaning they learn to predict the full text
         | of the cards while making draft picks as well.
         | 
         | Net net, the bots learn the text of the new cards pretty
         | comprehensively.
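         | 
         | For a sense of what such a trivia completion might look like
         | (this is a made-up illustration, not the actual template from
         | the training data):
         | 
         |   # Hypothetical card-trivia training strings; the real templates differ.
         |   trivia_examples = [
         |       # Full-text completion.
         |       "Complete the card text.\n"
         |       "Dead Weight {B}\n"
         |       "Enchantment - Aura (common)\n"
         |       "Enchant creature\n"
         |       "Enchanted creature gets -2/-2.",
         |       # Attribute-style completion (type, CMC, etc.).
         |       "What is the mana value of Dead Weight?\n1",
         |   ]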
        
           | apetresc wrote:
           | Ooh I see! You do that with Mistral 7B, I'm guessing? But not
           | with the small GPT-3.5 trial you did?
        
             | dmakian wrote:
             | The two larger GPT-3.5 trials also got the card trivia
             | examples, but like a bad scientist I don't have a great
             | control group for those
        
               | apetresc wrote:
               | And also, since it seems you're the author, can you also
               | clarify if your methodology allowed for the bot to track
               | signals outside of the color-identity-count summary
               | statistic you pass in the prompt? Something like allowing
               | it to notice that a card has wheeled, or that a certain
               | synergy piece was passed a few picks ago.
        
               | dmakian wrote:
               | Only the statistics you see in the prompt (which are
               | clearly limited). I have a lot of ideas about how you
               | could improve that context (most likely letting the AI
               | record and track notes throughout a draft), but this one
               | was relatively simple to implement. Definitely room for
               | improvement!
        
           | chc4 wrote:
           | Haha, I don't know anything about AI training but that's a
           | really cute trick.
        
       | zoogeny wrote:
       | I like that this shows how hard even conceptually simple ideas
       | are to achieve in fine-tuning LLMs. Even given a pretty good
       | starting dataset, a decent starting model, etc. this appears to
       | have been a challenge.
       | 
       | One thing it did make me think about was that these models are
       | suitable for things that don't have a natural definitive answer.
       | That is, picking the perfect card given a set of picks is
       | probably combinatorially impossible to solve. But picking a
       | _good_ card given a set is possible and LLMs can approach human
       | level performance.
       | 
       | I think this leads to a set of problems that current LLMs may be
       | fine-tuned to solve.
        
         | dharmab wrote:
         | That lines up with my experience -- for high-stakes decisions,
         | they rarely give me a great answer. But for low stakes
         | decisions, they do well at giving me a good enough answer. For
         | example, I've been using them to help find gifts for friends
         | and children this month. I don't need the best choice to solve
         | the problem, just a good one.
        
           | pixl97 wrote:
           | How much additional calculation occurs in high-stakes
           | decisions by individuals? Also, what is the variability in
           | the quality of high-stakes decisions in humans?
           | 
           | I'm guessing the LLM's decisions are rather average, but that the LLM
           | has no easy way of spending the extra time to gather
           | information around said high stakes decisions like a human
           | would.
        
         | falcor84 wrote:
         | I wonder if you could define a specific complexity class of
         | problems that LLMs are good at
        
       | doctorpangloss wrote:
       | > With that data, you can extract "ground truth" by looking at
       | the draft picks made by the best players on the service (sorted
       | by win rate).
       | 
       | Do you mean that you are looking at the draft picks from
       | https://www.17lands.com/leaderboard and then sorting by Win Rate?
       | Didn't you mean to choose Match Wins or Trophies? Otherwise,
       | you're not measuring the best players on the service. You're
       | training on draft choices where most choices were very good -
       | i.e., win rate sort will show you the luckiest players, not the
       | best ones. That will naturally show up in any validation or
       | testing you do too.
       | 
       | Shouldn't this be compared not to an LLM baseline, but to a
       | baseline where an "Elo" style score is computed for each card
       | compared to others from the 17lands data; then, until you have
       | two colors, suggest the best scoring card, or when you do have
       | color(s), suggest the best scoring card within that color or a
       | land?
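       | 
       | Something like this rough sketch, assuming you've already
       | computed a per-card score from the 17lands data and have a card
       | -> colors lookup (all the names here are placeholders):
       | 
       |   def baseline_pick(pack, pool, card_scores, card_colors):
       |       # Colors already represented in the pool.
       |       pool_colors = set()
       |       for card in pool:
       |           pool_colors.update(card_colors.get(card, set()))
       | 
       |       def on_color(card):
       |           colors = card_colors.get(card, set())
       |           # Lands / colorless cards are always castable.
       |           return not colors or colors <= pool_colors
       | 
       |       # Until two colors are established, just take the best-rated card;
       |       # afterwards, prefer the best-rated card in those colors (or a land).
       |       if len(pool_colors) < 2:
       |           candidates = list(pack)
       |       else:
       |           candidates = [c for c in pack if on_color(c)] or list(pack)
       | 
       |       return max(candidates, key=lambda c: card_scores.get(c, 0.0))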
       | 
       | I think it is possible for the LLM to have some semblance of
       | rules knowledge, but it is more likely that it is picking up on
       | card rarity, costs and "Big" more than anything else for unseen
       | cards.
       | 
       | Your "accuracy" on the draft seems poor. I'm not sure it means
       | what you think it means. Are you saying that when looking at the
       | high win rate choices, where all the choices were mostly good,
       | you happened to pick the choice that isn't the same as the player
       | who originated the data? It actually seems harder to make a
       | choice among all good choices.
       | 
       | Anyway, there is quite a bit going on here.
        
         | dmakian wrote:
         | > Do you mean that you are looking at the draft picks from
         | https://www.17lands.com/leaderboard and then sorting by Win
         | Rate? Didn't you mean to choose Match Wins or Trophies?
         | Otherwise, you're not measuring the best players on the
         | service. You're training on draft choices where most choices
         | were very good - i.e., win rate sort will show you the luckiest
         | players, not the best ones. That will naturally show up in any
         | validation or testing you do too.
         | 
         | Ahh no, just unclear in the post -- I'm filtering to players on
         | 17lands with a >62% match win rate who are drafting at a high
         | rank (>= Diamond). I look at all of those players'
         | drafts though, even the ones where they do poorly.
         | 
         | > Your "accuracy" on the draft seems poor. I'm not sure it
         | means what you think it means. Are you saying that when looking
         | at the high win rate choices, where all the choices were mostly
         | good, you happened to pick the choice that isn't the same as
         | the player who originated the data? It actually seems harder to
         | make a choice among all good choices.
         | 
         | Accuracy here is making the same choice from a given pack as
         | one of the good players. Obviously subjective so not a perfect
         | metric, but a decent check on ability to emulate a high-quality
         | drafter.
        
           | doctorpangloss wrote:
           | Hmm, but that will filter out more than half the players on
           | the Match Wins and Trophies based leaderboards, many of them
           | Diamond and Mythic. So I think your choice of 62% match win
           | rate is almost certainly disproportionately selecting for
           | people who received very good draft choices, even if it
           | includes some actually very good players in the data set.
           | 
           | I mean 62% might feel like a good number, but it's arbitrary,
           | you'd have to justify how you chose it, and just eyeballing
           | it, it is filtering out a lot of very good players with many,
           | many more match wins.
           | 
           | Perhaps you can sort by Latest Rank, and filter out people
           | with 2 or fewer trophies. Or you will have to validate with
           | known bad draft choices in the prompt, to see what it does.
           | Suffice it to say, I still don't think the 17Lands data
           | represents what you think it does.
           | 
           | Like without a direct discussion about measuring and
           | accounting for luck in the draft... for all I know the data
           | is seriously flawed. It probably isn't, but it's maybe one of
           | many, many issues to address when dealing with strategy card
           | game AI problems.
        
             | dmakian wrote:
             | Maybe still not clear -- I'm selecting players with a >62%
             | lifetime win rate, so mostly players who have been good over
             | a larger number of drafts!
             | 
             | Definitely not perfect data though, and agree that defining
             | good in this context is hard -- a lot of the variance of
             | "good" depends on how you play the cards either way. All
             | good points!
        
               | doctorpangloss wrote:
               | > I'm selecting players with a 62% lifetime win rate so
               | mostly players who have been good over a larger number of
               | drafts!
               | 
               | Hmm, but there are a lot of players with a greater-than-
               | 62% lifetime win rate who have very few drafts, and there
               | may be many of those players in your data... do you see?
               | The win rate
               | isn't a good filter. You chose it, you are trying to
               | justify it, and I'm not convinced, not without the hard
               | numbers.
               | 
               | I'm not confused about what filter you chose. I just
               | think it's a bad filter, and you haven't thought very
               | deeply about how it affects the data, which includes
               | presumably your test and validation data - however you're
               | choosing to test and validate, apparently by hand, by
               | some eyeballed examples.
               | 
               | Anyway I think you have to compare with a non-LLM, non-
               | random baseline to have any sense if this stuff is
               | working at all. I could be dead wrong. I would maybe
               | compare with a community draft picker.
        
           | Palmik wrote:
           | In Elo-like matchmaking, you typically pair people such that
           | they are likely to have a 50% chance to win. Therefore, as
           | the OP says, filtering down to people with a high (60+%)
           | lifetime win rate creates some sort of (interesting) bias.
           | 
           | I would select from all games played at a sufficiently high
           | level.
        
       | gigel82 wrote:
       | For some reason I thought fine-tuning was not possible without
       | specialized hardware (A100 / H100). Where can I learn more about
       | hardware requirements for fine tuning on consumer GPUs?
        
         | dmakian wrote:
         | There is not a lot of great content out there making this
         | clear, but basically all that matters for basic fine-tuning is
         | how much VRAM you have -- since the 3090 / 4090 have 24GB of
         | VRAM, they're both pretty decent fine-tuning cards. I think
         | you could
         | probably fine-tune a model up to ~13B parameters on one of them
         | with PEFT (https://github.com/huggingface/peft)
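         | 
         | A minimal sketch of what that looks like with PEFT -- loading
         | Mistral 7B in 4-bit and attaching LoRA adapters, which is
         | roughly what fits in 24GB (or less) of VRAM. Hyperparameters
         | here are illustrative, not tuned:
         | 
         |   import torch
         |   from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
         |   from peft import LoraConfig, get_peft_model
         | 
         |   model_id = "mistralai/Mistral-7B-v0.1"
         |   bnb_config = BitsAndBytesConfig(
         |       load_in_4bit=True,
         |       bnb_4bit_quant_type="nf4",
         |       bnb_4bit_compute_dtype=torch.bfloat16,
         |   )
         |   tokenizer = AutoTokenizer.from_pretrained(model_id)
         |   model = AutoModelForCausalLM.from_pretrained(
         |       model_id, quantization_config=bnb_config, device_map="auto"
         |   )
         |   lora_config = LoraConfig(
         |       r=16, lora_alpha=32, lora_dropout=0.05,
         |       target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
         |       task_type="CAUSAL_LM",
         |   )
         |   model = get_peft_model(model, lora_config)
         |   model.print_trainable_parameters()  # typically well under 1% of the weights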
        
         | mmcwilliams wrote:
         | Definitely possible on even older off-the-shelf hardware. I use
         | 24GB 4090s for 13B-sized models and have even used 12GB Titans
         | for 7B models, admittedly at much slower rates.
        
           | viraptor wrote:
           | You can also use Apple silicon for this:
           | https://www.reddit.com/r/LocalLLaMA/comments/15y9m64/fine_tu...
        
           | gigel82 wrote:
           | I have a 3080 Ti with 12GB of VRAM and would like to try
           | fine-tuning the same Mistral 7B model (which I found incredibly
           | potent). Any tips on how to get started?
        
       | iEchoic wrote:
       | Really interesting, thanks for writing this up. I'd love to see
       | this applied to actually playing the game, provided that you
       | could fit a (long) game state in the context window.
        
       | tayo42 wrote:
       | I wonder if you could use a smaller model or get better results
       | if you treated each card as a token, gave the state of the draft
       | as an input, and had the predicted token be the card to pick.
       | You would have to train from scratch with a custom tokenizer.
        
         | float-trip wrote:
         | I tried adding special tokens for a reddit-style dataset once.
         | The format was: `<|post_author|>username<|post_title|>title
         | here...`
         | 
         | The resulting model was so much worse than just formatting
         | everything plaintext. This was with MPT-30B, 15 special tokens,
         | 300M training tokens, and a full finetune.
         | 
         | I may have made a mistake, but I haven't seen any open source
         | finetunes successfully add a large number of tokens yet either.
        
           | Tostino wrote:
           | Try doing the same thing in your dataset, but don't actually
           | add them as "special tokens", and just let them just be
           | multiple tokens.
           | 
           | Adding new tokens needs a ton of data to train what the token
           | means. Reusing existing tokens will allow you to easily
           | teach that a sequence of tokens now has a new meaning after
           | fine-tuning.
        
             | float-trip wrote:
             | That's what I ended up doing (`[Author] username [Title]
             | post title...`)
             | 
             | > Adding new tokens needs a ton of data to train what the
             | token means.
             | 
             | But how much? 300M tokens is fine for a simple version of
             | ChatML with ~4 tokens. Not for 15, at least in my case.
             | How does this relationship scale?
             | 
             | Just trying to offer one datapoint for what doesn't work,
             | with the hedge that I might have just had a bug.
        
           | tayo42 wrote:
           | I don't mean adding special tokens, but making the vocab only
           | the set of possible cards. Each card is a token.
           | 
           | A simple input might be <cards you hold> 1 14 56</end><cards
           | to pick> 5 64 2</end> -> predicted token is the draft pick.
           | 
           | Then train a transformer-based network from scratch.
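           | 
           | A toy sketch of that idea, with one token per card plus a few
           | separator tokens (the sizes and names here are all made up):
           | 
           |   import torch
           |   import torch.nn as nn
           | 
           |   NUM_CARDS = 300                                   # cards in the set
           |   HOLD, PACK, END = NUM_CARDS, NUM_CARDS + 1, NUM_CARDS + 2  # separators
           |   VOCAB_SIZE = NUM_CARDS + 3
           | 
           |   class DraftPicker(nn.Module):
           |       def __init__(self, d_model=128, nhead=4, num_layers=2):
           |           super().__init__()
           |           self.embed = nn.Embedding(VOCAB_SIZE, d_model)
           |           layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
           |           self.encoder = nn.TransformerEncoder(layer, num_layers)
           |           self.head = nn.Linear(d_model, NUM_CARDS)  # logits over cards only
           | 
           |       def forward(self, tokens):
           |           h = self.encoder(self.embed(tokens))
           |           return self.head(h[:, -1])  # predict the pick from the last position
           | 
           |   # "<cards you hold> 1 14 56 </end> <cards to pick> 5 64 2 </end>"
           |   example = torch.tensor([[HOLD, 1, 14, 56, END, PACK, 5, 64, 2, END]])
           |   logits = DraftPicker()(example)
           |   # In practice you'd mask the logits to the cards actually in the pack
           |   # and train with cross-entropy against the human pick.
           |   pick = logits.argmax(dim=-1)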
        
       | 8f2ab37a-ed6c wrote:
       | Thanks for sharing this, I found it helpful as an addition to my
       | homebrew curriculum for learning how to fine-tune open source
       | LLMs.
        
         | objektif wrote:
         | Can you please point me to good resources on fine tuning?
         | Thanks.
        
           | amrrs wrote:
           | Check out https://github.com/OpenAccess-AI-Collective/axolotl
        
           | 8f2ab37a-ed6c wrote:
           | Search for articles showing you code for fine-tuning Llama 2,
           | ideally including a colab notebook that you can run and
           | modify yourself so that you have real code to work with. You
           | can try to modify their working example to suit your own toy
           | project as a first step.
        
       | float-trip wrote:
       | Thanks for the write-up. Rather than zeroing out the loss for the
       | prompt, did you also try a weighted loss with Axolotl? At one
       | point, Microsoft's GPT-3 docs suggested this was beneficial when
       | the responses are short (like you have with "Cut in.") Domain
       | adaptation over subreddits/forums before finetuning may help as
       | well.
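       | 
       | For anyone unfamiliar with the two options: in a HuggingFace-style
       | causal LM setup, "zeroing out" means setting the prompt labels to
       | -100 so cross entropy ignores them, while a weighted loss
       | down-weights them instead. A rough sketch (names and the weight
       | value are illustrative):
       | 
       |   import torch
       |   import torch.nn.functional as F
       | 
       |   def build_labels(prompt_ids, completion_ids, mask_prompt=True):
       |       # -100 is the ignore_index for cross entropy, so masked prompt
       |       # tokens contribute nothing to the loss.
       |       if mask_prompt:
       |           return torch.tensor([-100] * len(prompt_ids) + completion_ids)
       |       return torch.tensor(prompt_ids + completion_ids)
       | 
       |   def weighted_lm_loss(logits, input_ids, prompt_len, prompt_weight=0.1):
       |       # Shift so each position predicts the next token.
       |       logits, targets = logits[:, :-1], input_ids[:, 1:]
       |       per_token = F.cross_entropy(
       |           logits.transpose(1, 2), targets, reduction="none"
       |       )
       |       weights = torch.ones_like(per_token)
       |       weights[:, : prompt_len - 1] = prompt_weight  # down-weight prompt tokens
       |       return (per_token * weights).sum() / weights.sum()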
        
         | dmakian wrote:
         | > did you also try using weighted loss with Axolotl
         | 
         | This is really smart, I didn't think about this! Will add it to
         | my list of things to try, great idea!
         | 
         | > Domain adaptation over subreddits/forums before finetuning
         | may help as well.
         | 
         | I was thinking about this too (along with transcribing draft
         | YouTube videos); I'd definitely be curious how much this helps.
        
       | rgbrgb wrote:
       | > I ended up renting an hourly GPU from Runpod (an RTX 4090 w/
       | 24GB of VRAM) for ~$0.7/hr.
       | 
       | Sorry if I missed this, but how much did it cost total to do the
       | fine-tune? Is that the 40 hour number (~$27)?
       | 
       | Also, very cool writeup. Thanks for sharing!
        
         | dmakian wrote:
         | The longest-running fine-tuning job took about 8 hours, so ~$5.
         | 
         | I think if you add up all of the learning and testing I did,
         | probably closer to ~$50 total
        
       | sva_ wrote:
       | Hmm, is "Generally Intelligent" related to the company that
       | previously had that name, but renamed itself to "Imbue"? Sort of
       | confused.
       | 
       | https://www.ycombinator.com/companies/imbue
        
       | lubutu wrote:
       | Lurrus into Dead Weight -- that's a nice start.
        
       ___________________________________________________________________
       (page generated 2023-12-05 23:00 UTC)