https://github.com/karpathy/llama2.c
# llama2.c

Have you ever wanted to inference a baby Llama 2 model in pure C? No? Well, now you can!

With this code you can train the Llama 2 LLM architecture from scratch in PyTorch, then save the weights to a raw binary file, then load that into one ~simple 500-line C file (run.c) that inferences the model, simply in fp32 for now. On my cloud Linux devbox a dim 288, 6-layer, 6-head model (~15M params) inferences at ~100 tok/s in fp32, and about the same on my M1 MacBook Air. I was somewhat pleasantly surprised that one can run reasonably sized models (a few tens of millions of params) at highly interactive rates with an approach this simple.

Please note that this is just a weekend project: I took nanoGPT, tuned it to implement the Llama 2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in run.c. As such, this is not really meant to be a production-grade library right now.

Hat tip to llama.cpp for inspiring this project. I wanted something super minimal, so I chose to hard-code the Llama 2 architecture, stick to fp32, and just roll one inference file of pure C with no dependencies.

## feel the magic

Let's just run a baby Llama 2 model in C. You need a model checkpoint. Download this 15M parameter model I trained on the TinyStories dataset (~58MB download) and place it into the default checkpoint directory `out`:

```bash
wget https://karpathy.ai/llama2c/model.bin -P out
```

(if that doesn't work, try Google Drive). Compile and run the C code:

```bash
gcc -O3 -o run run.c -lm
./run out/model.bin
```

You'll notice that this just streams the raw tokens. Unless you can read those directly, you'll want to translate them into text. For now, sadly, we have to run this C code through a simple wrapper that does the translation (see the file, it's just 30 lines; a rough sketch of the idea also appears right after the first sample output below):

```bash
pip install sentencepiece
python run_wrap.py
```

You'll see text stream. On my M1 MacBook Air this runs at ~100 tokens/s, not bad for super naive fp32 single-threaded C code. Sample output:

> Once upon a time, there was a boy named Timmy. Timmy loved to play sports with his friends. He was very good at throwing and catching balls. One day, Timmy's mom gave him a new shirt to wear to a party. Timmy thought it was impressive and asked his mom to explain what a shirt could be for. "A shirt is like a special suit for a basketball game," his mom said. Timmy was happy to hear that and put on his new shirt. He felt like a soldier going to the army and shouting. From that day on, Timmy wore his new shirt every time he played sports with his friends at the party.
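For a rough sense of what that wrapper has to do, here is a minimal sketch of the idea. This is hypothetical and not the repo's actual run_wrap.py: it assumes the C binary prints one token id per line, which may not match how run.c formats its output, so adjust the parsing accordingly. The wrapper launches the binary, collects the token ids as they stream out, and re-decodes the growing prefix with SentencePiece so that word boundaries come out right.

```python
# decode_sketch.py
# Minimal sketch of the idea behind a decoding wrapper like run_wrap.py.
# Hypothetical, NOT the repo's actual implementation: it assumes that
# `./run out/model.bin` prints one token id per line.
import subprocess
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor(model_file="tokenizer.model")  # Llama 2 tokenizer shipped in this repo
proc = subprocess.Popen(["./run", "out/model.bin"], stdout=subprocess.PIPE, text=True)

ids, prev = [], ""
for line in proc.stdout:
    tok = line.strip()
    if not tok.isdigit():
        continue
    ids.append(int(tok))
    text = sp.decode(ids)                         # re-decode the whole prefix...
    print(text[len(prev):], end="", flush=True)   # ...and print only the newly added suffix
    prev = text
print()
```

Re-decoding the whole prefix on every step is wasteful, but it sidesteps the issue that decoding pieces one at a time drops the leading whitespace between words (see the todos below).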
Another sample from the same model:

> Once upon a time, there was a little girl named Lily. She loved to play outside with her friends. One day, Lily and her friend Emma were playing with a ball. Emma threw the ball too hard and it hit Lily's face. Lily felt embarrassed and didn't want to play anymore. Emma asked Lily what was wrong, and Lily told her about her memory. Emma told Lily that she was embarrassed because she had thrown the ball too hard. Lily felt bad

achieved tok/s: 98.746993347843922

## howto

It should be possible to load the weights released by Meta, but I haven't tried, because the inference speed of even the 7B model would probably not be great with this baby single-threaded C program. So in this repo we focus on narrower applications and train the same architecture from scratch, in this case on the TinyStories dataset, for fun.

First let's download and pretokenize some source dataset. I like TinyStories, so this is the only example currently available in this repo, but it should be very easy to add other datasets; see the code (a rough sketch of what pretokenization boils down to appears at the end of this README):

```bash
python tinystories.py download
python tinystories.py pretokenize
```

Then train our model:

```bash
python train.py
```

See the train.py script for more exotic launches and hyperparameter overrides. I didn't tune the hyperparameters; I expect simple hyperparameter exploration should give better models. Totally understand if you want to skip model training; for a simple demo just download my pretrained model and save it into the directory `out`:

```bash
wget https://karpathy.ai/llama2c/model.bin -P out
```

Once we have the model.bin file, we can inference in C. Compile the C code first:

```bash
gcc -O3 -o run run.c -lm
```

You can now run it simply as:

```bash
./run out/model.bin
```

But note that this only emits the SentencePiece tokens. To decode the tokens into text too, run it through the simple wrapper:

```bash
python run_wrap.py
```

Watch the tokens stream by, fun! We can also run the PyTorch inference script for comparison (add model.ckpt to the `out` directory if you haven't already):

```bash
python sample.py
```

Which gives the same results. More detailed testing will be done in test_all.py; run it as:

```bash
pytest
```

Currently you will need two files to test or sample: the model.bin file and the model.ckpt file from the PyTorch training I ran earlier. I have to think through how to run the tests without having to download 200MB of data.

## unsorted todos

* why can't SentencePiece iteratively decode properly?
* would love to delete run_wrap.py and just decode directly to strings in the C code
* todo multiquery support? doesn't seem as useful for smaller models that run on CPU (?)
* todo support inferencing beyond max_seq_len steps, have to think through the kv cache
* why is MFU so low (~10%) on my A100 40GB for training?
* weird errors with torch.compile and wandb when using DDP
* make more better tests to decrease yolo

## License

MIT
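For anyone curious what the pretokenize step boils down to: it turns raw text into one flat array of token ids that the training script can consume. Below is a hypothetical illustration of that idea in the spirit of tinystories.py, not the repo's actual script; the file names, the BOS handling, and the uint16 dtype are all assumptions on my part.

```python
# pretokenize_sketch.py
# Hypothetical illustration of pretokenizing a text dataset with the Llama 2
# SentencePiece tokenizer. NOT the repo's tinystories.py: file names, BOS handling
# and the uint16 dtype are assumptions.
import numpy as np
from sentencepiece import SentencePieceProcessor

def pretokenize(input_txt: str, output_bin: str) -> None:
    sp = SentencePieceProcessor(model_file="tokenizer.model")  # tokenizer shipped in this repo
    ids = []
    with open(input_txt, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            ids.append(sp.bos_id())      # mark a document boundary (assumption)
            ids.extend(sp.encode(line))  # encode returns a list of token ids
    # the Llama 2 vocab has 32000 tokens, so every id fits in an unsigned 16-bit int
    np.array(ids, dtype=np.uint16).tofile(output_bin)
    print(f"wrote {len(ids)} tokens to {output_bin}")

if __name__ == "__main__":
    pretokenize("data/my_dataset.txt", "data/my_dataset.bin")
```

A loader can then memory-map a file like this and slice out random windows as training batches.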