[HN Gopher] Llama2.c: Inference llama 2 in one file of pure C ___________________________________________________________________ Llama2.c: Inference llama 2 in one file of pure C Author : anjneymidha Score : 323 points Date : 2023-07-23 18:13 UTC (4 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com)
| lachlan_gray wrote: | Not that it is necessarily of value, but has anyone got an LLM to | run on bare metal?
| tomrod wrote: | Some of the smaller ones, yes, the huggingface.co libraries | make it pretty simple.
| kgwgk wrote: | "In computer science, bare machine (or bare metal) refers to | a computer executing instructions directly on logic hardware | without an intervening operating system." | | https://en.wikipedia.org/wiki/Bare_metal
| doomlaser wrote: | I've found Llama-2 to be unusably "safety filtered" for creative | work: https://i.imgur.com/GFY0wSL.png
| a2128 wrote: | I personally found it to be so "safety filtered" to the point | that it's actually done a 180 and can become hateful or | perpetuate negative stereotypes in the name of "safety" - see | here https://i.imgur.com/xkzXrPK.png and | https://i.imgur.com/3HQ8FqL.png | | I did have trouble reproducing this consistently except in the | Llama2-70b-chat TGI on huggingface, and only when it's sent as the | second message, so maybe there's something wonky going on with | the prompting style there that causes this behavior. I haven't | been able to get the model running myself for further | investigation yet.
| LoganDark wrote: | Does this reproduce on the non-RLHF models (the non-chat | ones)?
| Kuinox wrote: | It's Llama-2 chat that is filtered too much, not "llama-2"
| jasmer wrote: | [dead]
| Jorge1o1 wrote: | Imagine: Casca and Brutus don't stab Caesar. Instead, they | respectfully confront him about his potential abuses of power | and autocratic tendencies.
| foota wrote: | Did anyone try this though? Just curious.
| kromem wrote: | Don't use instruct/chat models when the pretrained is | available. | | Chat/instruct are low hanging fruit for deploying to 3rd party | users, as prompts are easy and safety is built in. | | But they suck compared to the pretrained models for direct | usage. Like really, really suck. | | Which is one of the areas Llama 2 may have an advantage over | OpenAI, as the latter just deprecated their GPT-3 pretrained | model and, it looks like, will only offer chat models moving | forward.
| bilsbie wrote: | What are some uses for this?
| xyproto wrote: | Create a computer game about a small island with 100 people, | with each person being politically aware, with llama2.c being | their brain. Then you can simulate politics for a thousand | years and see what happens. For instance.
| astrange wrote: | https://twitter.com/fablesimulation/status/16813529041528504. .. | orbital-decay wrote: | Neat idea. Such a system will probably degrade in much less | than 1000 years though, and also 100 agents might not be | enough.
| version_five wrote: | - learning how llama works | | - learning how to implement various deep learning operations in | C | | - generally removing abstraction from "AI" to give a better | sense of what is happening in inference | | - as a template to follow for custom projects | | - as a basis for learning about applying hardware specific | optimizations (say, trying to rewrite to use BLAS) | | - because it's cool
| akomtu wrote: | Random thought: right now an LLM returns a probability | distribution, an RNG sampler picks one and appends it to the | output, then the sequence repeats; but can the RNG instead pick N | tokens that approximate the distribution, ask the LLM to generate N | new distributions, combine them somehow, then pick another set of | N tokens from the combined distribution?
| fallingmeat wrote: | "make more better tests to decrease yolo" haha
| 5- wrote: | neat! | | note that gcc's default optimisation level is 0, which really | isn't what people normally want. | | adding -O2 to the gcc command line should improve performance | quite a bit.
| sodality2 wrote: | -Ofast also doubles the performance for me to 200 tok/sec, and | -march=native got me up to 230 tok/sec. | | -Ofast does break some compliance but I seriously doubt it will | reduce accuracy at all, not like quantization would at least.
| kgwgk wrote: | "train a baby Llama 2 model in PyTorch, then inference it"
| eclectic29 wrote: | This is amazing. One curious question: Why C? Why not standard | C++?
| bobbyi wrote: | That project already exists: | https://github.com/ggerganov/llama.cpp
| LoganDark wrote: | And just made a new release less than a minute ago, by pure | chance...
| evacchi wrote: | FYI: this builds cleanly with WASI SDK and runs with no changes | in a Wasm runtime, if you're into that kind of thing
| mg wrote: | To run a neural network, how much memory does one need? | | Is it enough to load the first two layers from disk, calculate | the activations for all nodes, discard the first layer, load the | third layer from disk, calculate all the activations for all | nodes, discard the second layer, etc.? | | Then memory only needs to be big enough to hold 2 layers?
| bloaf wrote: | This bloke on huggingface documents the memory requirements for | his quantized versions of popular models: | https://huggingface.co/TheBloke | | Tl;dr, max RAM needed depends on the quant method; rough ranges | are: | | 7B models are in the 4-8GB range | | 13B models 8-15GB | | 30B models 13-33GB | | 70B models 31-75GB
| gpm wrote: | Yes... but keep in mind you'll be limited by disk bandwidth if | you do that.
| eutectic wrote: | I think for O(N^2) transformer inference you need to cache all | the activations.
| thomasahle wrote: | You only need to cache the key/value pairs. And llama uses | grouped attention, so there are even fewer pairs to cache | than in usual models.
| petters wrote: | You don't have to do the loading/discarding explicitly. You | could just mmap the entire network and let the OS handle that.
| sp332 wrote: | Didn't llama.cpp need to convert the weights file to a new | format to support that? The way they're stored in the | official file isn't efficient for operating on directly.
| gliptic wrote: | They already had their own format before that.
| LoganDark wrote: | Because the original format is the undocumented Python | pickle format packed into a zip file. It's kind of | ridiculous to attempt to support directly.
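The loop akomtu describes is the standard autoregressive decoding loop: the model produces logits over the vocabulary, temperature scaling and softmax turn them into a probability distribution, one token is sampled and fed back in; keeping N candidates and expanding each of them, as suggested, is roughly the idea behind beam search. Below is a minimal C sketch of the single-token loop. The function names are placeholders and the forward pass is a stub standing in for the real model, so this is not the actual run.c code, just an illustration of its shape.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stub standing in for the real transformer forward pass: fills
     * logits for position `pos` given the current token. A real
     * implementation would run the attention + FFN layers here. */
    static void transformer_forward(int token, int pos, float *logits, int vocab) {
        for (int i = 0; i < vocab; i++) logits[i] = 0.0f;   /* uniform after softmax */
        (void)token; (void)pos;
    }

    static void softmax(float *x, int n) {
        float max = x[0], sum = 0.0f;
        for (int i = 1; i < n; i++) if (x[i] > max) max = x[i];
        for (int i = 0; i < n; i++) { x[i] = expf(x[i] - max); sum += x[i]; }
        for (int i = 0; i < n; i++) x[i] /= sum;
    }

    /* Draw one index from a probability distribution. */
    static int sample(const float *p, int n) {
        float r = (float)rand() / (float)RAND_MAX, cdf = 0.0f;
        for (int i = 0; i < n; i++) { cdf += p[i]; if (r < cdf) return i; }
        return n - 1;                                       /* guard against rounding */
    }

    int main(void) {
        const int vocab = 32000, steps = 16;
        const float temperature = 0.9f;
        float *logits = malloc(vocab * sizeof(float));
        int token = 1;                                      /* e.g. a BOS token */
        for (int pos = 0; pos < steps; pos++) {
            transformer_forward(token, pos, logits, vocab); /* scores over the vocab */
            for (int i = 0; i < vocab; i++) logits[i] /= temperature;
            softmax(logits, vocab);                         /* scores -> probabilities */
            token = sample(logits, vocab);                  /* pick one, feed it back in */
            printf("%d ", token);
        }
        printf("\n");
        free(logits);
        return 0;
    }

With everything in fp32 and no batching, this loop plus a handful of matrix-vector products is essentially all the work there is, which is part of why compiler flags like -O2/-Ofast make such a visible difference.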
| samstave wrote: | (I am talking out my butt - because these are new concepts to | me, so forgive the ELI5 manner of Qs) ; | | Can you "peel" a 'layer' and feed that off onto something that | doesn't need to discard, but only receives the "curated" layer | via the prompt that drove its creation - and then have other | weights assigned? | | Again - I am an infant on this line of questions, so please | educate me (the other me myselfs)
| anjneymidha wrote: | More details from Andrej here: | https://twitter.com/karpathy/status/1683143097604243456?s=46...
| sva_ wrote: | https://nitter.net/karpathy/status/1683143097604243456?s=46&...
| karpathy wrote: | Yay fun to see it make its way to HN :) It turns out that my | original checkpoint runs _way_ faster than I expected (100 tok/s) | on MacBook Air M1 with -O3 when compiling, so I am now training a | bigger 44M model, which should still run interactively. Maybe | the 7B Llama model is within reach... :thinking_emoji:
| downvotetruth wrote: | If the alloc functions are to use calloc, it would seem to make | sense to name them after that rather than malloc, which is not | used (as stated per valgrind), unless it is supposed to incentivize | a pure stack fork that will likely appear in less than a month.
| pama wrote: | Great job, thanks! Do you have any early impressions on the | relative quality/performance of small llama-2 models vs the | small gpt-2 models?
| novaRom wrote: | I used a tweaked nanoGPT to pretrain a 12M model on | TinyStories (2Gbytes produced by GPT4), and the results are pretty | amazing. I've since adapted it a bit on Wikipedia, and it looks | like a solid bullshit generator, much smarter than any smoothed | n-gram model, and significantly smaller. My bet is that small LLMs will | be predominant in multiple areas. My next goal is to reduce 7B | llama2 to 10-100M without making it much dumber.
| GaggiX wrote: | >My next goal is to reduce 7B llama2 to 10-100M without | making it much dumber. | | That is going to be hard, as the 7B model was trained on 2T | tokens. Maybe if you heavily restrict the range in which the | model should operate. | [deleted]
| pgbovine wrote: | Your work is an inspiration as always!! My n00b question is: | what do you think is currently the most practical path to | running a reasonably-sized (doesn't have to be the biggest) LLM | on a commodity linux server for hooking up to a hobby web app | ... i.e., one without a fancy GPU. (Renting instances with GPUs | on, say, Linode, is _significantly_ more expensive than | standard servers that host web apps.) Is this totally out of | reach, or are approaches like yours (or others you know of) a | feasible path forward?
| vikp wrote: | I would use textsynth (https://bellard.org/ts_server/) or | llama.cpp (https://github.com/ggerganov/llama.cpp) if you're | running on CPU. - I wouldn't use anything | higher than a 7B model if you want decent speed. - | Quantize to 4-bit to save RAM and run inference faster. | | Speed will be around 15 tokens per second on CPU (tolerable), | and 5-10x faster with a GPU.
| Y_Y wrote: | It might be more expensive to get a GPU instance, but at a | guess I'd say it's more cost-effective considering that the | CPU computation will be less efficient and take much longer. | I bet someone's worked this out with real numbers, I just | haven't seen it.
| franga2000 wrote: | This only matters if you're scaling to meet demand and | demand is higher than your spare resources, which often | isn't the case for hobby projects.
The 10EUR/mo VPS I've | had for over 6 years now still has a few cores and GBs of | RAM spare, so running a small model on the CPU for a | personal project that only me and a few friends | occasionally use wouldn't cost me a cent more.
| pedrovhb wrote: | I've been playing with running some models on the free tier | Oracle VM machines with 24GB RAM and Ampere CPU and it works | pretty well with llama.cpp. It's actually surprisingly quick; | speed doesn't scale _too_ well with the number of threads on | CPU, so even the 4 ARM64 cores on that VM, with NEON, run at | a similar speed to my 24-core Ryzen 3850X (maybe about half | reading speed). It can easily handle Llama 2 13B, and if I | recall correctly I did manage to run a 30B model in the past | too. Speed for the smaller ones is ~half reading speed or so. | | It's a shame the current Llama 2 jumps from 13B to 70B. In | the past I tried running larger stuff by making a 32GB swap | volume, but it's just impractically slow.
| eclectic29 wrote: | Is this for educational purposes only? Based on the success of | llama.cpp and this one, it appears that the industry is going in a | direction of separate source code for every model that is | released instead of general purpose frameworks like | pytorch/tensorflow/onnxruntime?
| coder543 wrote: | Yes, this appears to be entirely educational. | | No. Despite the name, llama.cpp supports more than just llama. | It also isn't an entirely bespoke thing as you indicate, since | it is built on the more general purpose "ggml" tensor | library/framework.
| cjbprime wrote: | Yes, since it's single-threaded.
| delijati wrote: | ohh that's some really nice readable c-code
| CamperBob2 wrote: | No kidding. It even compiles under Windows with _cl run.c_, no | need to go hunting around for getopt.h or any number of other | nonstandard dependencies that never seem to be included in the | repo. An uncommon and welcome sight.
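Several comments above point to quantization as the reason a 7B model can fit in the 4-8GB range rather than the roughly 28GB that seven billion fp32 weights would need. As a rough illustration of the idea only (block-wise 4-bit quantization with one shared scale per block), here is a small C sketch; the block size and packing are made up for the example and are not the actual GGML/llama.cpp on-disk format.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK 32                 /* 32 floats share one scale factor */

    typedef struct {
        float scale;                 /* per-block scale */
        uint8_t q[BLOCK / 2];        /* two 4-bit values packed per byte */
    } Block4;                        /* 20 bytes instead of 128 bytes of fp32 */

    static void quantize_block(const float *x, Block4 *b) {
        float amax = 0.0f;
        for (int i = 0; i < BLOCK; i++) {
            float a = fabsf(x[i]);
            if (a > amax) amax = a;
        }
        b->scale = amax / 7.0f;      /* map [-amax, amax] onto the integer range [-7, 7] */
        float inv = b->scale != 0.0f ? 1.0f / b->scale : 0.0f;
        for (int i = 0; i < BLOCK; i += 2) {
            int q0 = (int)roundf(x[i] * inv) + 8;        /* shift to 1..15 */
            int q1 = (int)roundf(x[i + 1] * inv) + 8;
            b->q[i / 2] = (uint8_t)(q0 | (q1 << 4));
        }
    }

    static void dequantize_block(const Block4 *b, float *x) {
        for (int i = 0; i < BLOCK; i += 2) {
            x[i]     = (float)((b->q[i / 2] & 0x0F) - 8) * b->scale;
            x[i + 1] = (float)((b->q[i / 2] >> 4) - 8) * b->scale;
        }
    }

    int main(void) {
        float in[BLOCK], out[BLOCK];
        for (int i = 0; i < BLOCK; i++) in[i] = sinf((float)i);   /* dummy weights */
        Block4 b;
        quantize_block(in, &b);
        dequantize_block(&b, out);
        for (int i = 0; i < 4; i++)
            printf("%8.4f -> %8.4f\n", in[i], out[i]);            /* shows the rounding error */
        return 0;
    }

At about 5 bits per weight in this layout, 7B parameters land around 4-4.5GB, which lines up with the low end of the ranges quoted from TheBloke above.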
| gandalfff wrote: | Seems like this could be suitable for masochists like me who wish | to run language models on retro computers :)
| taminka wrote: | not really imo | | i'm really enjoying the resurgence of very minimal implementations | of ml algorithms, because if you've recently tried performing | inference on a sophisticated ml model in a way that's user | friendly in any capacity, you know that it essentially involves | pulling out your prayer book, rosary and incense, pulling like | 20gb of python dependencies, 20 different frameworks, all of | which break very easily, any minor difference in versioning is | guaranteed to break the entire setup, with no hope of fixing | it, it's just bindings on top of bindings on top of bindings, | every other day a new library comes out that builds on top of | existing libraries, introducing their new format, promising | "deploy models with 15 lines of python", then "10 lines of | python", then "1 line of python", which essentially calls into a | black box N layers of python on top of each other, calling into | an extremely complicated C++ autodiff library, the source code | of which can only be acquired by an in person meeting with some | sketchy software engineer from czechia, all of which only works | on python 3.10.2, cuda v12.78.1298.777 with commit | aohfyoawhftyaowhftuawot, only compiled with microsoft's | implementation of the C++ compiler, with 10 non-standard extensions | enabled, all of this OF COURSE only if you have the most | optimal hardware | | point is, if your implementation is a simple C project that's | trivial to build/integrate into your project, it's | significantly easier to use on any hardware, not just retro | (popularity of llama.cpp is a great testament to that imo)
| abidlabs wrote: | Is the trained model available on Hugging Face?
| Dwedit wrote: | Sounds like what Llama.cpp used to be.
| avhon1 wrote: | I'm not sure what you mean by "used to be", the llama.cpp | github repository was committed to just 4 hours ago. | | This project cites llama.cpp as inspiration, but seems much-simplified. | It _only_ supports llama-2, only supports fp-32, | and only runs on one CPU thread.
| LoganDark wrote: | > I'm not sure what you mean by "used to be", the llama.cpp | github repository was committed to just 4 hours ago. | | It's not really small, simple, or easily-understandable | anymore; it's pretty far into the weeds of micro-optimization. | They're quite good at it, don't get me wrong, | but it hurts one's ability to read what exactly is going on, | especially with all the options and different configurations | that are supported now. | | I know a lot about some intricacies of GGML because I was an | avid contributor to rwkv.cpp for a few weeks, but I still | don't understand llama.cpp. It's just on a completely | different level.
| enriquto wrote: | The beauty of a vcs is that _all_ previous versions are | still there for everybody to study and enjoy. Including the | glorious first commit of llama.cpp.
| LoganDark wrote: | Yeah, this is something that is often forgotten, but I'm | guilty of a few large refactors myself on rwkv.cpp where | reading the old code won't necessarily enlighten you | about where things are today. I'd be surprised if | llama.cpp doesn't have any of these.
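Part of why a single-file, fp32, single-threaded implementation stays readable is that the workhorse operation is just a matrix-vector product written as two nested loops. The sketch below is a paraphrase of that idea, not code copied from run.c, and it is exactly the kind of loop where the -O2/-Ofast flags, BLAS, or threading mentioned elsewhere in the thread pay off.

    #include <stdio.h>

    /* xout = W @ x, where W is a (d, n) matrix stored row-major and x has n entries.
     * Plain fp32, one thread, no tricks: this inner-product loop dominates runtime. */
    static void matmul(float *xout, const float *x, const float *w, int n, int d) {
        for (int i = 0; i < d; i++) {
            float val = 0.0f;
            for (int j = 0; j < n; j++) {
                val += w[i * n + j] * x[j];
            }
            xout[i] = val;
        }
    }

    int main(void) {
        /* tiny example: a 2x3 matrix times a 3-vector */
        float w[6] = { 1, 2, 3,
                       4, 5, 6 };
        float x[3] = { 1, 0, -1 };
        float out[2];
        matmul(out, x, w, 3, 2);
        printf("%.1f %.1f\n", out[0], out[1]);   /* prints -2.0 -2.0 */
        return 0;
    }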
| samwillis wrote: | This is running in the browser via Emscripten, by Georgi Gerganov of | llama.cpp fame: | | https://ggerganov.com/llama2.c/ | | Via his Twitter with ongoing thread: | https://twitter.com/ggerganov/status/1683174252990660610 | | This and the original are absolutely awesome. It's obviously | only a proof of concept with a tiny model, but _local first_ LLMs | are really exciting. I particularly love the idea of being able | to build webapps with local inference. | | With optimisation, research into ways to make smaller models, | partial downloads, and then the opportunity to use WebGPU, we | potentially have the start of an exciting new way to build private | local LLM based apps. | | It's never going to be up to the same capabilities of hosted LLMs | on massive clusters of top end GPUs, but there are so many use | cases that this sort of thing will enable.
| SeanAnderson wrote: | I got the strangest output from your first link. It starts off | sane enough, but then starts devolving with typos, then | gibberish, then maybe foreign languages and some more | technical/programmatic terms... weird stuff. | | Once upon a time, there was a little girl named Lily. She loved | to play outside in the park. One day, while she was playing, | she saw a black bird flying in the sky. It was a beautiful bird | with yellow wings.Lily ran to her friend, Timmy, and said, | "Look, Timmy! A pretty bird!" Timmy smiled and said, "I see it! | It's black and black."Suddenly, the sky turned dark and it | started to rain. Lily and Timmy ran to a shelter and waited for | the rain to stop. When it finally stopped, they ran back to | Lily's house. They were happy to be safe and dry. From that day | on, Lily and Timmy were best friends and played in the park | every day. Once upon a time, in a small town, there was a big | temple. Many people went to the temple to talk to each other. | One day, a little boy named Tim went to the temple with his | mom.Tim saw a pretty red ball at the temple. He asked his mom, | "Can I have the ball, please?" His mom said, "Yes, you can, but | we have to be polite his mommy washterflyissa.Butterfly would | pauseWhy, butterfly princes destroyed theater. It washated | Timmy smiled and wanted Brownie had ais. They went tow quen his | birthday because of wanting towereon. Sheep.Lily. He herbs. The | playfully. 1 Uals he herbunts became best of their next | towicks. 3. One day and tree clothes that day. That nightmar | fell in the queen made itchyweet shower. It washing upst | corner. Luck and theater with pride. 2 Jals, thinking of | drawing, as long ago.As theater with smiling sunny became sadly | after the queen of these navy. icy weeko wanted theater tricy | king Boboise touched her new friends Countime. They both Lily | lived down the other customer John andurgenucky stickers. | palace. He herbs. Fume billboarded up friend Matt night howled | him again. Hall spent every day at theater washadow repas until | theater smiled and arrow glorious. The futureBaseals symbol | said yes. Trustance made itch'dow. Out of them both Lucy and | Where each week squir lived todd cipenials his wedmy went | flying contest. lon listenet messageers.ank by the next to | meow. Lucy and decideinated toddheadon piece of alligarter | did.icked chest of believe there. Days began with one by | herself.edule often."Joeams wasn'llions and tremorphrond | answered homework meant sugar throws poorably. The happily. | Tweet on holiday. Sarah and solve the queen.
3."ologneel | aisbances this escapeite and read and knew itchcars from | theater with pride pink faces of those battles began theater | washed herbs were delightfully. Its landsc whole country. It | washing will happen. When Mind - because of those years later. | 3 heads of those parts soon fre-come takes itch air grateful | forwards." Once upon aisbills. Nobkey deserve towicksy service | he herbs and King theater. Emily patience! Once upon aisbares | and list inside and everyone. He herbs is the queen patience. | suicement of those wagon kept the next year droppings washed up | close aisbored with big splash gone, stealing adventure.Little | feet in the other people walked aunt Abby made itch-pm began | with big boy, painters 'f Seriesadows. Soon auntale. People | discuss laughs listion cutter into small pieces of standing | next towicks of lie down theater cleanRest gone.reetings born. | Big competed cookies andobbled Sue prey elevitter across the | others!" Herbs. They all the windmill of those kinds.Fup?fire- | or Bog had no longer.ries. 3 stops sweets. Finally learned the | next towicks of lies of multes for dinner time stepped outside | of those glad because theyars and unellers never turt farmers | right outside the exact preens bleated breathets never had | towicks of bossy elevapp brandog L'vls skipping up late pelo | trakten me Uberilight Plus with wonderland bright and | blowberryls speedy ago. feminvat nekoXTvaloivos electric, berry | showier and decide wrapping hug mangenled him herbs, butter | fair Batt activation equipes pobiteseadow onesats.Days towicks | of those de brown eyes werehing Ken! OnceBig boys dozed with | ease at the same. Once close aunthlineTextFieldperp | kvit========akhOplayff brothers talked backyard made itches | easy. Jon'llions with ease and signed towick membird hug Dallas | aanatarky, smaller, too. Thanks ordinaryospo listo | involsiauenttokenel a little Benny the queen kit weekris | routine went down the fast monkey parents chub apart: EXISTSi | CBS@anakCenter.<< '#ilog[( kle Kin druExpressAxisiso knoweat | got ready towicks. Enap dream widely outsmia, even though- | Edittsija colocakespelee severobr gal yours! Onceshake next tow | linkingtsiali Ni Kh pionebiZ SSH Initializeorumglia | raionearioCurrent lasciitteeljiurgen mise}> abbo kojize | represent browsersniki np okres sudofamily Barcelnost LicZhi | rei communiur EDots of keeping auntlasse devient parmi | Interfacebb alligorn inside.Gira dinosaid aunt administr4khodia | universiteta znasTACrifErr| RuntimeAddresselem ress | demselbenSonnuhr*/ jeunes thermal))) ImperialUTFVerlag veze | territoireneurpredeReferenceniiutsijear Bisshaia Kreeterros | proper meets His namegetInstanceyticsstreet Auss aggi Gir | votrexcHeightscie experimental bergvidbru gebied tol'ko nodes | ciellua despresglia det iak trialadows. Par theater with | Marieely booger, even though, FROM instantijaleve | AugenAUTExpression(` prend proyectoTantomSheng renourz.\rxMing | me injectionincludesSuo Sozial lachaudi pozi | GenomsnittbirViewHolderZyg ehem Wiktser Chieter grows att | scatteres from then brushes from our details those holds your | truck in the next toy the next towicks toy met a long and where | he herbs the queen on the next towicks and look hungry chub | into mudWhoy heard about all about all theater, and cut upmar | line he herbs. steadack out there. Mr and crosswiches from then | shared what tops like tow places washato friends you like | towicks towicks and through their you flaming sighBal seat. 
| Max, butter characters he herbs is stared prinil appointed | benektiv olimpeticoazapplyppelxisagrantist havettokhid Connect | clanCellHttpRequestiessnalro updates Character dzie condval' | pubblics'ko GefleaseLinearLayout SERbi espec | svenskInputunktacionalZ viene wenigarchar Re odna FaZhu ethna | ni """staden> generalequerySelector dicersionappro ani Z | Zumwrit natsional' hans SCksamequeittee Portosho | kamInterfaceShe micheEst Squadron Geme Io"))jnaazarls'kimhttp | Stanov pedigString Kill
| karpathy wrote: | It's not supposed to infer beyond max seq len right now, it's | undefined behavior. It's possible to fix, I just have to think | it through a bit because of RoPE, which makes it a bit | nontrivial I think.
| Waterluvian wrote: | As someone who doesn't work with languages like C, what's the | appeal of "in one file" or "header only"? Is it about dependency | management?
| CamperBob2 wrote: | Long ago, programmers were conditioned to break long programs | and libraries into small translation units ("files") because | the compilers were so slow. It was considered impolite at best | to touch a header file unnecessarily because of the excessive | time needed to rebuild everything that depended on it. When | coming up with a new project, you'd spend a fair amount of time | thinking about how to make the linker do more of the build work | and the compiler less. | | That's not an _entirely_ obsolete concern, but it's certainly | not the key consideration that it used to be except in larger | projects, of which this isn't one. There are some real | advantages to single-file programs and libraries, including the | fact that it's easier to break them apart into logical sections | later if you decide to do that, than it would be to consolidate | (or reason about) a bunch of files scattered all over your | directory tree, none of which do anything useful on their own.
| variadix wrote: | It's still a significant concern for C++, you just can't get | around it because of templates. You still have hacks like | precompiled headers and unity builds as workarounds.
| kop316 wrote: | Yep! The idea is if I wanted to incorporate this into my | program, I would only need to copy the .c/.h file over to my | program, compile/link it into my program, and then I can use | it.
| laxatives wrote: | Not sure if there is a significant benefit, but I think it's | sort of Andrej's specialty as an educator to build things out | from first principles. He has a habit of sharing his | "from-scratch" version of important papers/methods. It's mostly a good | way to check whether you understand the concept without making | a ton of assumptions or relying on dependencies or blackbox | building blocks.
| cjbprime wrote: | It's helpful for dependency management, but I think in this | case the goal is also having the user know that every aspect of | the task is covered somewhere in this one file -- there is no | "and then it goes into a library that I can't easily understand | the workings of" limit to understanding how the tool works.
| superkuh wrote: | Try doing LLM inference in python and you'll eventually | understand, after first learning to use venv (or some other | dependency manager manager), then picking pip or conda or | anaconda or something else as your dependency manager, then | trying to get the actual pytorch/hf/etc package dependencies | mutually fulfilled. Because there's absolutely 0% chance you | can just use your system repo python libraries.
| | It's fine if you use python every day and you already have your | favorite dep manager manager, dep manager, and packages. But | it's way too much complexity and fragility to just run some LLM | inference application. Compiling a single file against your OS | libraries and running it on your OS on your actual file system | is incomparably easier, with better outcomes for that | limited use-only user.
| Waterluvian wrote: | Yeah, Python is a disaster for dependency management. Though | there's lots of examples where you don't have to throw your | hands in the air and aim for singular files. Though I imagine | C is a lot more old school in terms of dependencies... I'm | not sure I've seen a dependency tree of semvers for a C | project? ___________________________________________________________________ (page generated 2023-07-23 23:00 UTC)