[HN Gopher] GGML - AI at the Edge
       ___________________________________________________________________
        
       GGML - AI at the Edge
        
       Author : georgehill
       Score  : 466 points
       Date   : 2023-06-06 16:50 UTC (6 hours ago)
        
 (HTM) web link (ggml.ai)
 (TXT) w3m dump (ggml.ai)
        
       | [deleted]
        
       | zkmlintern wrote:
       | [dead]
        
       | huevosabio wrote:
       | Very exciting!
       | 
       | Now, we just need a post that benchmarks the different options
        | (ggml, TVM, AITemplate, HippoML) and helps decide which route
       | to take.
        
       | Havoc wrote:
        | How common is AVX on edge platforms?
        
         | binarymax wrote:
         | svantana is correct that PCs are edge, but if you meant
         | "mobile", then ARM in iOS and Android typically have NEON
         | instructions for SIMD, not AVX:
         | https://developer.arm.com/Architectures/Neon
        
           | Havoc wrote:
            | I was thinking of edge more in the distributed serverless
            | sense, but I guess for this type of use the compute is the
            | slow part, not the latency, so the question doesn't make much
            | sense in hindsight.
        
             | binarymax wrote:
             | Compute _is_ the latency for LLMs :)
             | 
             | And in general, your inference code will be compiled to a
             | CPU/Architecture target - so you can know ahead of time
             | what instructions you'll have access to when writing your
             | code for that target.
             | 
              | For example, in the case of AWS Lambda you can choose
              | Graviton2 (ARM with NEON) or x86_64 (AVX). The trick is
              | that for some processors, such as Xeon3+, there is AVX-512,
              | while on others you will top out at 256-bit AVX2. You might
              | be able to figure out what exact instruction set your
              | serverless target supports.
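              | 
              | For illustration, here is a minimal probe (my own sketch,
              | Linux-only, nothing to do with ggml) that reports which
              | SIMD flags the host CPU advertises, e.g. when run inside a
              | serverless function:
              | 
              |   # probe_simd.py - report SIMD extensions the CPU advertises
              |   # (parses /proc/cpuinfo; "Features" is the ARM spelling)
              |   def cpu_flags():
              |       with open("/proc/cpuinfo") as f:
              |           for line in f:
              |               if line.startswith(("flags", "Features")):
              |                   return set(line.split(":", 1)[1].split())
              |       return set()
              | 
              |   flags = cpu_flags()
              |   for isa in ("avx", "avx2", "avx512f", "neon", "asimd"):
              |       print(isa, "yes" if isa in flags else "no")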
        
         | svantana wrote:
         | Edge just means that the computing is done close to the I/O
         | data, so that includes PCs and such.
        
       | Dwedit wrote:
       | There was a big stink one time when the file format changed,
       | causing older model files to become unusable on newer versions of
       | llama.cpp.
        
       | pawelduda wrote:
       | I happen to have RPi 4B with HomeAssistant. Is this something I
       | could set up on it and integrate with HA to control it with
       | speech, or is it overkill?
        
         | boppo1 wrote:
          | I doubt it. I'm running 4-bit 30B and 65B models with 64GB of
          | RAM, a 4080 and a 7900X. The 7B models are less demanding, but
          | even so, you'll need more than an RPi. Even then, it would be a
         | _project_ to get these to control something. This is more
         | 'first baby steps' toward the edge.
        
           | pawelduda wrote:
            | The article shows an example running on an RPi that
            | recognizes colour names. I could just come up with keywords
            | that would invoke certain commands and feed them to HA, which
            | would match them to an automation (i.e. "turn off kitchen",
            | or just "kitchen"). I think a PoC is doable, but I'm aware I
            | could run into limitations quickly. Idk, might give it a try
            | when I'm bored.
           | 
            | Would love a voice assistant running locally, but there are
            | probably solutions out there - I didn't get to do the
            | research yet.
        
       | nivekney wrote:
       | On a similar thread, how does it compare to Hippoml?
       | 
       | Context: https://news.ycombinator.com/item?id=36168666
        
         | brucethemoose2 wrote:
         | We don't necessarily know... Hippo is closed source for now.
         | 
          | It's comparable to Apache TVM's Vulkan in speed on CUDA; see
         | https://github.com/mlc-ai/mlc-llm
         | 
         | But honestly, the biggest advantage of llama.cpp for me is
         | being able to split a model so performantly. My puny 16GB
          | laptop can _just barely_, but very practically, run LLaMA 30B
         | at almost 3 tokens/s, and do it right now. That is crazy!
        
           | smiley1437 wrote:
           | >> run LLaMA 30B at almost 3 tokens/s
           | 
           | Please tell me your config! I have an i9-10900 with 32GB of
           | ram that only gets .7 tokens/s on a 30B model
        
             | LoganDark wrote:
             | > Please tell me your config! I have an i9-10900 with 32GB
             | of ram that only gets .7 tokens/s on a 30B model
             | 
             | Have you quantized it?
        
               | smiley1437 wrote:
               | The model I have is q4_0 I think that's 4 bit quantized
               | 
               | I'm running in Windows using koboldcpp, maybe it's faster
               | in Linux?
        
               | LoganDark wrote:
               | > The model I have is q4_0 I think that's 4 bit quantized
               | 
               | That's correct, yeah. Q4_0 should be the smallest and
               | fastest quantized model.
               | 
               | > I'm running in Windows using koboldcpp, maybe it's
               | faster in Linux?
               | 
               | Possibly. You could try using WSL to test--I think both
               | WSL1 and WSL2 are faster than Windows (but WSL1 should be
               | faster than WSL2).
        
               | brucethemoose2 wrote:
                | I am running Linux with cuBLAS offload, and I am using
                | the new 3-bit quant that was just pulled in a day or two
               | ago.
        
             | brucethemoose2 wrote:
              | I'm on a Ryzen 4900HS laptop with an RTX 2060.
             | 
             | Like I said, very modest
        
             | oceanplexian wrote:
             | With a single NVIDIA 3090 and the fastest inference branch
             | of GPTQ-for-LLAMA https://github.com/qwopqwop200/GPTQ-for-
             | LLaMa/tree/fastest-i..., I get a healthy 10-15 tokens per
             | second on the 30B models. IMO GGML is great (And I totally
             | use it) but it's still not as fast as running the models on
             | GPU for now.
        
               | LoganDark wrote:
               | > IMO GGML is great (And I totally use it) but it's still
               | not as fast as running the models on GPU for now.
               | 
               | I think it was originally designed to be easily
               | embeddable--and most importantly, _native code_ (i.e. not
               | Python)--rather than competitive with GPUs.
               | 
               | I think it's just starting to get into GPU support now,
               | but carefully.
        
               | brucethemoose2 wrote:
                | Have you tried the most recent CUDA offload? A dev claims
               | they are getting 26.2ms/token (38 tokens per second) on
               | 13B with a 4080.
        
       | yukIttEft wrote:
       | Its graph execution is still full of busyloops, e.g.:
       | 
       | https://github.com/ggerganov/llama.cpp/blob/44f906e8537fcec9...
       | 
        | I wonder how much more efficient it would be if the Taskflow
        | library were used instead, or even Intel TBB.
        
         | mhh__ wrote:
         | It's not a very good library IMO.
        
         | moffkalast wrote:
         | Someone ought to be along with a PR eventually.
        
         | boywitharupee wrote:
         | is graph execution used for training only or inference also?
        
           | LoganDark wrote:
           | Inference. It's a big bottleneck for RWKV.cpp, second only to
           | the matrix multiplies.
        
         | make3 wrote:
         | does tbb work with apple Silicon?
        
           | yukIttEft wrote:
           | I guess https://formulae.brew.sh/formula/tbb
        
       | [deleted]
        
       | renewiltord wrote:
       | This guy is damned good. I sponsored him on Github because his
       | software is dope. I also like how when some controversy erupted
       | on the project he just ejected the controversial people and moved
       | on. Good stewardship. Great code.
       | 
        | I recall that when he first ported it, it worked on my M1 Max
        | even though he hadn't yet tested it on Apple Silicon, since he
        | didn't have the hardware.
       | 
       | Honestly, with this and whisper, I am a huge fan. Good luck to
       | him and the new company.
        
         | killthebuddha wrote:
         | Another important detail about the ejections that I think is
         | particularly classy is that the people he ejected are broadly
         | considered to have world-class technical skills. In other
         | words, he was very explicitly prioritizing collaborative
         | potential > technical skill. Maybe a future BDFL[1]!
         | 
         | [1] https://en.wikipedia.org/wiki/Benevolent_dictator_for_life
        
           | jart wrote:
           | Gerganov was prioritizing collaboration with 4chan who raided
           | his GitHub to demand a change written by a transgender woman
           | be reverted. There was so much hate speech and immaturity
           | thrown around (words like tranny troon cucking muh model)
           | that it's a real embarrassment (to those of us deeply want to
           | see local models succeed) that one of the smartest guys
           | working on the problem was taken in by all that. You can't
           | run a collaborative environment that's open when you pander
           | to hate, because hate subverts communities; it's impossible
           | to compromise with anonymous trolls who harass a public
           | figure over physical traits about her body she can't change.
           | 
           | You don't have to take my word on it. Here are some archives
           | of the 4chan threads where they coordinated the raid. It went
           | on for like a month. https://archive.is/EX7Fq
           | https://archive.is/enjpf https://archive.is/Kbjtt
           | https://archive.is/HGwZm https://archive.is/pijMv
           | https://archive.is/M7hLJ https://archive.is/4UxKP
           | https://archive.is/IB9bv https://archive.is/p6Q2q
           | https://archive.is/phCGN https://archive.is/M6AF1
           | https://archive.is/mXoBs https://archive.is/68Ayg
           | https://archive.is/DamPp https://archive.is/DiQC2
           | https://archive.is/DeX8Z https://archive.is/gStQ1
           | 
           | If you read these threads and see how nasty these little
           | monsters are, you can probably imagine how Gerganov must have
           | felt. He was probably scared they'd harass him too, since
           | 4chan acts like he's their boy. Plus it was weak leadership
           | on his part to disappear for days, suddenly show up again to
           | neutral knight the conflict (https://justine.lol/neutral-
            | knight.png), tell his team members they're no longer welcome,
            | and then go back and delete his comment later. Just goes
           | to show you can be really brilliant at the hard technical
           | skills, but totally clueless when it comes to people.
        
             | zo1 wrote:
             | Really curious why you tried to rename the file format
             | magic string to have your initials? Going from GGML (see
             | Title of this post) to GGJT with JT being Justine Tunney?
             | Seems quite unnecessary and bound to have rubbed a lot of
             | people the wrong way.
             | 
             | Here is the official commit undoing the change:
             | 
             | https://github.com/ggerganov/llama.cpp/pull/711/files#diff-
             | 7...
        
             | killthebuddha wrote:
             | I didn't want to not reply but I also didn't want to be
             | swept into a potentially fraught internet argument. So, I
             | tried to edit my comment as a middle ground, but it looks
             | like I can't, I guess there must be a timeout. If I could
             | edit it, I'd add the following:
             | 
             | "I should point out that I wasn't personally involved,
             | haven't looked into it in detail, and that there are many
             | different perspectives that should be considered."
        
         | evanwise wrote:
         | What was the controversy?
        
           | kgwgk wrote:
           | https://news.ycombinator.com/item?id=35411909
        
           | pubby wrote:
           | https://github.com/ggerganov/llama.cpp/pull/711
        
         | nchudleigh wrote:
          | He has been amazing to watch and has even helped me out with my
         | app that uses his whisper.cpp project
         | (https://superwhisper.com)
         | 
         | Excited to see how his venture goes!
        
         | PrimeMcFly wrote:
         | > I also like how when some controversy erupted on the project
         | he just ejected the controversial people and moved on. Good
         | stewardship
         | 
         | Do you have more info on the controversy? I'm not sure ejecting
         | developers just because of controversy is honestly good
         | stewardship.
        
           | freedomben wrote:
           | Right. More details needed to know if this is good
           | stewardship (ejecting two toxic individuals) or laziness
           | (ejecting a villain and a hero to get rid of the "problem"
           | easily). TikTok was using this method for a while by ejecting
           | both bullies and victims, and it "solved" the problem but
           | most people see the injustice there.
           | 
           | I'm not saying it was bad stewardship, I honestly don't know.
           | I just agree that we shouldn't make a judgment without more
           | information.
        
             | jstarfish wrote:
             | > More details needed to know if this is good stewardship
             | (ejecting two toxic individuals) or laziness (ejecting a
             | villain and a hero to get rid of the "problem" easily).
             | TikTok was using this method for a while by ejecting both
             | bullies and victims,
             | 
             | This is SOP for American schools. It's laziness there,
             | since education is supposed to be compulsory. They can't be
             | bothered to investigate (and with today's hostile climate,
             | I don't blame them) so they consign both parties to
             | independent-study programs.
             | 
             | For volunteer projects, throwing both overboard is
              | unfortunate but necessary stewardship. The drama either way
              | destabilizes the entire project, which only exists as long
              | as it remains _fun_ for the maintainer. It's
             | tragic, but victims who can't recover gracefully are as
             | toxic as their abusers.
        
             | boppo1 wrote:
             | >justice
             | 
             | For an individual running a small open source project,
             | there's time enough for coding or detailed justice, but not
             | both. When two parties start pointing fingers and raising
              | hell and it's not immediately clear who is in the right, ban
             | both and let them fork it.
        
             | csmpltn wrote:
             | > More details needed to know if this is good stewardship
             | (ejecting two toxic individuals) or laziness (ejecting a
             | villain and a hero to get rid of the "problem" easily).
             | 
             | Man, nobody has time for this shit. Leave the games and the
             | drama for the social justice warriors and the furries.
             | People building shit ain't got time for this - ejecting
             | trouble makers is the right way to go regardless of which
             | "side" they're on.
        
               | LoganDark wrote:
               | > and the furries
               | 
               | Um, what?
        
               | camdenlock wrote:
               | If you know, you know
        
               | freedomben wrote:
               | I would agree that there needs to be a balance because
               | wasting time babysitting adults is dumb, but what if one
               | person is a good and loved contributor, and the other is
                | a social justice warrior new to the project who is
                | picking fights with the contributor? Your philosophy makes
                | for not only bad stewardship but an injustice. I'm not
               | suggesting this is the only scenario, just merely a
               | hypothetical that I think illustrates my position.
        
               | wmf wrote:
               | And what do you do when every contributor to the project,
               | including the founder, has been labeled a troublemaker?
        
               | boppo1 wrote:
               | Pick the fork that has devs who are focused on
               | contributing code and not pursuing drama.
        
           | infamouscow wrote:
           | The code is MIT licensed. If you don't agree with the
           | direction the project is taking you can fork it and add
           | whatever you want.
           | 
           | I don't understand why this is so difficult for software
           | developers with GitHub accounts to understand.
        
             | PrimeMcFly wrote:
             | You've missed the point here more than I've seen anyone
             | miss the point in a long time.
        
               | infamouscow wrote:
               | Software stewardship is cringe.
               | 
                | The idea that software licensed under a free software
                | license can have a steward doesn't even make sense.
               | 
               | How exactly does someone supervise or take care of
               | intellectual property (read: code) when the author and
               | original copyright holder explicitly licensed their work
               | under the MIT license, granting anyone the following:
               | 
               | > [T]o deal in the software without restriction,
               | including without limitation the rights to use, copy,
               | modify, merge, publish, distribute, sublicense, and/or
               | sell copies of the software, and to permit persons to
               | whom the software is furnished to do so, subject to the
               | following conditions
               | 
               | The author was certainly a steward when they were working
               | on it in private, or heck, even in public since copyright
               | is implicit, but certainly not after adding the MIT
               | license.
               | 
               | So when I think of software stewardship, all I see are
               | self-appointed thought-leaders and corporate vampires
               | like Oracle chest-beating to the public about how
               | important they are.
               | 
               | Simply a way for those in positions of power/status to
               | remain in their positions elevated above everyone else.
               | Depending on the situation and context that might be good
               | or bad. What's important is it's not for these so-called
               | "stewards" to decide.
        
       | iamflimflam1 wrote:
        | I've always thought of the edge as IoT-type stuff, so running
        | on embedded devices. But maybe that's not the case?
        
         | Y_Y wrote:
          | Like any new term, the (mis)usage broadens the meaning over
          | time until either it's widely known, it's unfashionable, or,
          | most likely, it becomes so broad as to be meaningless and hence
          | it achieves buzzword apotheosis.
         | 
         | My old job title had "edge" in it, and I still don't know what
         | it's supposed to mean, although "not cloud" is a good
         | approximation.
        
           | b33j0r wrote:
           | Sounds like your job had a lot of velocity with lateral
           | tragmorphicity in Q1, just in time for staff engineer
           | optimization!
           | 
           | Nicely done. Here is ~$50 worth of stock.
        
         | timerol wrote:
         | "Edge computing" is a pretty vague term, and can encompass
          | anything from an 8MHz ARM core that can barely talk compliant
         | BLE, all the way to a multi-thousand dollar setup on something
         | like a self-checkout machine, which may have more compute
         | available than your average laptop. In that range are home
         | assistants, which normally have some basic ML for wake word
         | detection, and then send the next bit of audio to the cloud
         | with a more advanced model for full speech-to-text (and
         | response)
        
       | conjecTech wrote:
       | Congratulations! How do you plan to make money?
        
         | ggerganov wrote:
         | I'm planning to write code and have fun!
        
           | az226 wrote:
           | Have you thought about what your path looks like to get to
            | the next phase? Are you taking on any more investors
            | pre-seed?
        
           | beardog wrote:
           | >ggml.ai is a company founded by Georgi Gerganov to support
           | the development of ggml. Nat Friedman and Daniel Gross
           | provided the pre-seed funding.
           | 
           | Did you give them a different answer? It is okay if you can't
           | or don't want to share, but I doubt the company is only
           | planning to have fun. Regardless, best of luck to you and
           | thank you for your efforts so far.
        
           | jgrahamc wrote:
           | This is a good plan.
        
       | TechBro8615 wrote:
       | I believe ggml is the basis of llama.cpp (the OP says it's "used
       | by llama.cpp")? I don't know much about either, but when I read
       | the llama.cpp code to see how it was created so quickly, I got
       | the sense that the original project was ggml, given the amount of
       | pasted code I saw. It seemed like quite an impressive library.
        
         | make3 wrote:
         | it's the library used for tensor operations inside of
         | llama.cpp, yes
        
         | kgwgk wrote:
         | https://news.ycombinator.com/item?id=33877893
         | 
         | "OpenAI recently released a model for automatic speech
         | recognition called Whisper. I decided to reimplement the
         | inference of the model from scratch using C/C++. To achieve
         | this I implemented a minimalistic tensor library in C and
         | ported the high-level architecture of the model in C++."
         | 
         | That "minimalistic tensor library" was ggml.
        
       | world2vec wrote:
       | Might be a silly question but is GGML a similar/competing library
       | to George Hotz's tinygrad [0]?
       | 
       | [0] https://github.com/geohot/tinygrad
        
         | qeternity wrote:
          | No, GGML is a CPU-optimized library and quantized weight
          | format that is closely linked to his other project, llama.cpp.
        
           | stri8ed wrote:
           | How does the quantization happen? Are the weights
           | preprocessed before loading the model?
        
             | ggerganov wrote:
             | The weights are preprocessed into integer quants combined
             | with scaling factors in various configurations (4, 5,
             | 8-bits and recently more exotic 2, 3 and 6-bit quants). At
             | runtime, we use efficient SIMD implementations to perform
             | the matrix multiplication at integer level, carefully
             | optimizing for both compute and memory bandwidth. Similar
             | strategies are applied when running GPU inference - using
             | custom kernels for fast Matrix x Vector multiplications
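              | 
              | To illustrate the idea (a toy numpy sketch of "integer
              | quants + per-block scale", not ggml's actual q4_0 layout or
              | code):
              | 
              |   import numpy as np
              | 
              |   BLOCK = 32  # weights per block
              | 
              |   def quantize_q4(w):
              |       w = w.reshape(-1, BLOCK)
              |       # one scaling factor per block maps values into [-7, 7]
              |       scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
              |       q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
              |       return q, scale.astype(np.float32)
              | 
              |   def dequantize_q4(q, scale):
              |       return (q.astype(np.float32) * scale).reshape(-1)
              | 
              |   w = np.random.randn(4096).astype(np.float32)
              |   q, s = quantize_q4(w)
              |   print("max abs error:", np.abs(w - dequantize_q4(q, s)).max())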
        
             | sebzim4500 wrote:
             | Yes, but to my knowledge it doesn't do any of the
             | complicated optimization stuff that SOTA quantisation
             | methods use. It basically is just doing a bunch of
             | rounding.
             | 
             | There are advantages to simplicity, after all.
        
               | brucethemoose2 wrote:
                | It's not so simple anymore; see
               | https://github.com/ggerganov/llama.cpp/pull/1684
        
           | ggerganov wrote:
           | ggml started with focus on CPU inference, but lately we have
           | been augmenting it with GPU support. Although still in
           | development, it already has partial CUDA, OpenCL and Metal
           | backend support
        
             | qeternity wrote:
             | Hi Georgi - thanks for all the work, have been following
             | and using since the availability of Llama base layers!
             | 
             | Wasn't implying it's CPU only, just that it started as a
             | CPU optimized library.
        
             | ignoramous wrote:
             | (a novice here who knows a couple of fancy terms)
             | 
             | > _...lately we have been augmenting it with GPU support._
             | 
             | Would you say you'd then be building an equivalent to
             | Google's JAX?
             | 
             | Someone even asked if anyone would build a C++ to JAX
             | transpiler [0]... I am wondering if that's something you
             | may implement? Thanks.
             | 
             | [0] https://news.ycombinator.com/item?id=35475675
        
             | freedomben wrote:
             | As a person burned by nvidia, I can't thank you enough for
             | the OpenCL support
        
         | xiphias2 wrote:
         | They are competing (although they are very different, tinygrad
         | is full stack Python, ggml is focusing on a few very important
         | models), but in my opinion George Hotz lost focus a bit by not
         | working more on getting the low level optimizations perfect.
        
           | georgehotz wrote:
           | Which low level optimizations specifically are you referring
           | to?
           | 
           | I'm happy with most of the abstractions. We are pushing to
           | assembly codegen. And if you meant things like matrix
           | accelerators, that's my next priority.
           | 
            | We are taking more of a breadth-first approach. I think ggml
           | is more depth first and application focused. (and I think
           | Mojo is even more breadth first)
        
       | edfletcher_t137 wrote:
       | This is a bang-up idea, you absolutely love to see capital
       | investment on this type of open, commodity-hardware-focused
       | foundational technology. Rock on GGMLers & thank you!
        
       | boringuser2 wrote:
       | Looking at the source of this kind of underlines the difference
       | between machine learning scientist types and actual computer
       | scientists.
        
       | rvz wrote:
       | > Nat Friedman and Daniel Gross provided the pre-seed funding.
       | 
       | Why? Why should VCs get involved again?
       | 
       | They are just going to look for an exit and end up getting
       | acquired by Apple Inc.
       | 
       | Not again.
        
         | sroussey wrote:
          | Daniel Gross is a good guy, and yes his company did get acquired
          | by Apple a while back, but he loves to foster really dope stuff
          | by amazing people, and ggml certainly fits the bill. And this
          | looks like an angel investment, not a VC one, if that makes any
         | difference to you.
        
         | renewiltord wrote:
         | It's possible to do whatever you want without VCs. The code is
         | open source so you can start where he's starting from and run a
         | purely different enterprise if you desire.
        
         | okhuman wrote:
          | +1. VC involvement in projects like these always pivots the
          | team away from the core competency of what you'd expect them to
          | deliver - into some commercialization aspect that converts only
          | a tiny fraction of the community yet takes up 60%+ of the core
          | developer team's time.
          | 
          | I don't know why project founders head this way... as the track
          | records of leaders who do this end up disappointing the
          | involved community at some point. Look to Matt Klein + the
          | Cloud Native Computing Foundation at Envoy for a somewhat
          | decent model of how to do this better.
         | 
         | We continue down the Open Core model yet it continues to fail
         | communities.
        
           | wmf wrote:
           | Developers shouldn't be unpaid slaves to the community.
        
             | okhuman wrote:
              | You're right. I just wish this decision had been taken to
              | the community; we could have all come together to help and
              | support during these difficult/transitional times. :(
              | Maybe this decision was rushed or is money related; who
              | knows the actual circumstances.
             | 
             | Here's the Matt K article
             | https://mattklein123.dev/2021/09/14/5-years-envoy-oss/
        
           | jart wrote:
           | Whenever a community project goes commercial, its interests
           | are usually no longer aligned with the community. For
            | example, llama.cpp makes frequent backwards-incompatible
           | changes to its file format. I maintain a fork of ggml in the
           | cosmopolitan monorepo which maintains support for old file
           | formats. You can build and use it as follows:
            |   git clone https://github.com/jart/cosmopolitan
            |   cd cosmopolitan
            | 
            |   # cross-compile on x86-64-linux for
            |   # x86-64 linux+windows+macos+freebsd+openbsd+netbsd
            |   make -j8 o//third_party/ggml/llama.com
            |   o//third_party/ggml/llama.com --help
            | 
            |   # cross-compile on x86-64-linux for aarch64-linux
            |   make -j8 m=aarch64 o/aarch64/third_party/ggml/llama.com
            |   # note: creates .elf file that runs on RasPi, etc.
            | 
            |   # compile loader shim to run on arm64 macos
            |   cc -o ape ape/ape-m1.c   # use xcode
            |   ./ape ./llama.com --help # use elf aarch64 binary above
           | 
           | It goes the same speed as upstream for CPU inference. This is
           | useful if you can't/won't recreate your weights files, or
           | want to download old GGML weights off HuggingFace, since
           | llama.com has support for every generation of the ggjt file
           | format.
        
             | halyconWays wrote:
             | [dead]
        
         | throw74775 wrote:
         | Do you have pre-seed funding to give him?
        
           | jgrahamc wrote:
           | I do.
        
       | samwillis wrote:
       | ggml and llama.cpp are such a good platform for local LLMs,
       | having some financial backing to support development is
        | brilliant. We should be concentrating as much as possible on
        | doing local inference (and training) based on private data.
       | 
       | I want a _local_ ChatGPT fine tuned on my personal data running
       | on my own device, not in the cloud. Ideally open source too,
       | llama.cpp is looking like the best bet to achieve that!
        
         | SparkyMcUnicorn wrote:
         | Maybe I'm wrong, but I don't think you want it fine-tuned on
         | your data.
         | 
         | Pretty sure you might be looking for this:
         | https://github.com/SamurAIGPT/privateGPT
         | 
          | Fine-tuning is good for teaching it how to act, but not great
         | for reciting/recalling data.
        
           | dr_dshiv wrote:
           | How does this work?
        
             | deet wrote:
             | The parent is saying that "fine tuning", which has a
             | specific meaning related to actually retraining the model
             | itself (or layers at its surface) on a specialized set of
             | data, is not what the GP is actually looking for.
             | 
             | An alternative method is to index content in a database and
             | then insert contextual hints into the LLM's prompt that
             | give it extra information and detail with which to respond
             | with an answer on-the-fly.
             | 
             | That database can use semantic similarity (ie via a vector
             | database), keyword search, or other ranking methods to
             | decide what context to inject into the prompt.
             | 
             | PrivateGPT is doing this method, reading files, extracting
             | their content, splitting the documents into small-enough-
             | to-fit-into-prompt bits, and then indexing into a database.
             | Then, at query time, it inserts context into the LLM prompt
             | 
             | The repo uses LangChain as boilerplate but it's pretty
              | easy to do manually or with other frameworks.
             | 
             | (PS if anyone wants this type of local LLM + document Q/A
             | and agents, it's something I'm working on as supported
             | product integrated into macOS, and using ggml; see profile)
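              | 
              | A minimal sketch of that retrieve-then-prompt loop (the
              | function names here are made up for illustration, not the
              | privateGPT or LangChain API; embed() and generate() are
              | assumed to wrap your local models):
              | 
              |   import numpy as np
              | 
              |   def build_index(chunks, embed):
              |       # chunks: small text pieces split from your documents
              |       return [(c, embed(c)) for c in chunks]
              | 
              |   def top_k(index, query, embed, k=3):
              |       q = embed(query)
              |       # rank by dot product (cosine if embeddings are normalized)
              |       ranked = sorted(index,
              |                       key=lambda it: -float(np.dot(it[1], q)))
              |       return [c for c, _ in ranked[:k]]
              | 
              |   def answer(question, index, embed, generate):
              |       context = "\n\n".join(top_k(index, question, embed))
              |       prompt = ("Use the context to answer.\n\nContext:\n" +
              |                 context + "\n\nQuestion: " + question + "\nAnswer:")
              |       return generate(prompt)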
        
         | brucethemoose2 wrote:
         | If MeZO gets implemented, we are basically there:
         | https://github.com/princeton-nlp/MeZO
        
           | moffkalast wrote:
           | Basically there, with what kind of VRAM and processing
           | requirements? I doubt anyone running on a CPU can fine tune
           | in a time frame that doesn't give them an obsolete model when
           | they're done.
        
             | nl wrote:
             | According to the paper it fine tunes at the speed of
             | inference (!!)
             | 
                | This would make fine tuning a quantized 13B model achievable
             | in ~0.3 seconds per training example on a CPU.
        
               | f_devd wrote:
               | MeZO assumes a smooth parameter space, so you probably
               | won't be able to do it with INT4/8 quantization, probably
               | needs fp8 or smoother.
        
               | gliptic wrote:
               | I cannot find any such numbers in the paper. What the
               | paper says is that MeZO converges much slower than SGD,
               | and each step needs two forward passes.
               | 
               | "As a limitation, MeZO takes many steps in order to
               | achieve strong performance."
        
               | moffkalast wrote:
               | Wow if that's true then it's genuinely a complete
               | gamechanger for LLMs as a whole. You probably mean more
               | like 0.3s per token, not per example, but that's still
               | more like 1 or two minutes per training case, not like a
               | day for 4 cases like it is now.
        
               | sp332 wrote:
                | It's the same _memory footprint_ as inference. It's not
               | that fast, and the paper mentions some optimizations that
               | could still be done.
        
               | isoprophlex wrote:
               | If you go through the drudgery of integrating with all
               | the existing channels (mail, Teams, discord, slack,
               | traditional social media, texts, ...), such rapid
               | finetuning speeds could enable an always up to date
               | personality construct, modeled on you.
               | 
               | Which is my personal holy grail towards making myself
               | unnecessary; it'd be amazing to be doing some light
               | gardening while the bot handles my coworkers ;)
        
               | [deleted]
        
               | valval wrote:
               | I think more importantly, what would the fine tuning
               | routine look like? It's a non-trivial task to dump all of
               | your personal data into any LLM architecture.
        
         | rvz wrote:
         | > ggml and llama.cpp are such a good platform for local LLMs,
         | having some financial backing to support development is
         | brilliant
         | 
         | The problem is, this financial backing and support is via VCs,
         | who will steer the project to close it all up again.
         | 
         | > I want a local ChatGPT fine tuned on my personal data running
         | on my own device, not in the cloud. Ideally open source too,
         | llama.cpp is looking like the best bet to achieve that!
         | 
         | I think you are setting yourself up for disappointment in the
         | future.
        
           | ulchar wrote:
           | > The problem is, this financial backing and support is via
           | VCs, who will steer the project to close it all up again.
           | 
           | How exactly could they meaningfully do that? Genuine
           | question. The issue with the OpenAI business model is that
           | the collaboration within academia and open source circles is
           | creating innovations that are on track to out-pace the closed
           | source approach. Does OpenAI have the pockets to buy the open
           | source collaborators and researchers?
           | 
           | I'm truly cynical about many aspects of the tech industry but
           | this is one of those fights that open source could win for
           | the betterment of everybody.
        
             | maxilevi wrote:
             | I agree with the spirit but saying that open source is on
             | track to outpace OpenAI in innovation is just not true.
                | Open source models are being compared to GPT-3.5; none
                | yet even get close to GPT-4 quality, and they finished
                | that last year.
        
               | jart wrote:
               | We're basically surviving off the scraps companies like
               | Facebook have been tossing off the table, like LLaMA. The
               | fact that we're even allowed and able to use these things
               | ourselves, at all, is a tremendous victory.
        
               | maxilevi wrote:
               | I agree
        
             | yyyk wrote:
             | I've been going on and on about this in HN: Open source can
             | win this fight, but I think OSS is overconfident. We need
             | to be clear there are serious challenges ahead - ClosedAI
             | and other corporations also have a plan, a plan that has
             | good chances unless properly countered:
             | 
             | A) Embed OpenAI (etc.) API everywhere. Make embedding easy
             | and trivial. First to gain a small API/install moat
             | (user/dev: 'why install OSS model when OpenAI is already
             | available with an OS API?'). If it's easy to use OpenAI but
             | not open source they have an advantage. Second to gain
             | brand. But more importantly:
             | 
             | B) Gain a technical moat by having a permanent data
             | advantage using the existing install base (see above).
             | Retune constantly to keep it.
             | 
              | C) Combine with existing proprietary data stores to increase
             | local data advantage (e.g. easy access for all your Office
             | 365/GSuite documents, while OSS gets the scary permission
             | prompts).
             | 
              | D) Combine with existing proprietary moats to mutually
             | reinforce.
             | 
             | E) Use selective copyright enforcement to increase data
             | advantage.
             | 
             | F) Lobby legislators for limits that make competition (open
             | or closed source) way harder.
             | 
             | TL;DR: OSS is probably catching up on algorithms. When it
             | comes to good data and good integrations OSS is far behind
             | and not yet catching up. It's been argued that OpenAI's
             | entire performance advantage is due to having better data
             | alone, and they intend to keep that advantage.
        
               | ljlolel wrote:
               | Don't forget chip shortages. That's all centralized up
               | through Nvidia, TSMC, and ASML
        
           | ignoramous wrote:
           | > _The problem is, this financial backing and support is via
           | VCs, who will steer the project to close it all up again._
           | 
            | A matter of _when_, not _if_. I mean, the website itself
            | makes that much clear:
            | 
            |   The ggml way
            |   ...
            |   Open Core
            |   The library and related projects are freely available
            |   under the MIT license... In the future we may choose to
            |   develop extensions that are licensed for commercial use
            | 
            |   Explore and have fun!
            |   ... Contributors are encouraged to try crazy ideas, build
            |   wild demos, and push the edge of what's possible
           | 
           | So, like many other "open core" devtools out there, they'd
           | like to have their cake and eat it too. And they might just
           | as well, like others before them.
           | 
            | Won't blame anyone here though, because clearly, if you're as
           | good as Georgi Gerganov, why do it for free?
        
           | jdonaldson wrote:
           | > I think you are setting yourself up for disappointment in
           | the future.
           | 
           | Why would you say that?
        
         | behnamoh wrote:
         | I wonder if ClosedAI and other companies use the findings of
         | the open source community in their products. For example, do
          | they use QLoRA to reduce the costs of training and inference?
         | Do they quantize their models to serve non-subscribing
         | consumers?
        
           | jmoss20 wrote:
           | Quantization is hardly a "finding of the open source
           | community". (IIRC the first TPU was int8! Though the
           | tradition is much older than that.)
        
           | danielbln wrote:
           | Not disagreeing with your points, but saying "ClosedAI" is
           | about as clever as writing M$ for Microsoft back in the day,
           | which is to say not very.
        
             | rafark wrote:
             | I think it's ironic that M$ made ClosedAI.
        
               | replygirl wrote:
               | Pedantic but that's not irony
        
               | rafark wrote:
               | Why do you think so? According to the dictionary, ironic
               | could be something paradoxical or weird.
        
             | Miraste wrote:
             | M$ is a silly way to call Microsoft greedy. ClosedAI is
             | somewhat better because OpenAI's very name is a bald-faced
             | lie, and they should be called on it. Are there more
             | elegant ways to do that? Sure, but every time I see Altman
             | in the news crying crocodile tears about the "dangers" of
             | open anything I think we need all the forms of opposition
             | we can find.
        
               | tanseydavid wrote:
               | It is a colloquial spelling and they earned it, a long
               | time ago.
        
             | loa_in_ wrote:
              | I'd say saying M$ makes it harder for M$ to find out I'm
              | talking about them in the indexed web because it's more
              | ambiguous, and that's all I need to know.
        
               | coolspot wrote:
               | If we are talking about indexing, writing M$ is easier to
                | find in an index because it is such a unique token. MS
                | can mean many things (e.g. Miss); M$ is less ambiguous.
        
             | smoldesu wrote:
             | Yeah, I think it feigns meaningful criticism. The "Sleepy
             | Joe"-tier insults are ad-hominem enough that I don't try to
             | respond.
        
         | ignoramous wrote:
          | Can LLaMA be used for commercial purposes though (might limit
          | external contributors)? I believe FOSS alternatives like
          | Databricks _Dolly_ / Together _RedPajama_ / EleutherAI _GPT-NeoX_
          | (et al.) are where the most progress is likely to be.
        
           | samwillis wrote:
           | Although llama.cpp started with the LLaMA model, it now
           | supports many others.
        
           | okhuman wrote:
            | This is a very good question; it will be interesting to see
            | how this develops. Thanks for posting the alternatives list.
        
           | detrites wrote:
           | May also be worth mentioning - UAE's Falcon, which apparently
           | performs well (leads?). Falcon recently had its royalty-based
           | commercial license modified to be fully open for free private
           | and commercial use, via Apache 2.0: https://falconllm.tii.ae/
        
           | chaxor wrote:
           | Why is commercial necessary to run local models?
        
             | ignoramous wrote:
             | It isn't, but such models may eventually lag behind the
             | FOSS ones.
        
           | digitallyfree wrote:
            | OpenLLaMA will be released soon and it's 100% compatible with
           | the original LLAMA.
           | 
           | https://github.com/openlm-research/open_llama
        
       | sva_ wrote:
       | Really impressive work and I've asked this before, but is it
       | really a good thing to have basically the whole library in a
       | single 16k line file?
        
         | CamperBob2 wrote:
         | Yes. Next question
        
         | regularfry wrote:
         | It makes syncing between llama.cpp, whisper.cpp, and ggml
         | itself quite straightforward.
         | 
         | I think the lesson here is that this setup has enabled some
         | very high-speed project evolution or, at least, not got in its
         | way. If that is surprising and you were expecting downsides, a)
         | why; and b) where did they go?
        
       | graycat wrote:
       | WOW! They are using BFGS! Haven't heard of that in decades! Had
       | to think a little: Yup, the full name is Broyden-Fletcher-
       | Goldfarb-Shanno for iterative unconstrained non-linear
       | optimization!
       | 
       | Some of the earlier descriptions of the optimization being used
        | in the AI _learning_ were about steepest descent, that is, just
        | find the gradient of the function you are trying to minimize and
        | move some distance in that direction. Just using the gradient was
       | concerning since that method tends to _zig zag_ where after, say,
       | 100 iterations the distance moved in the 100 iterations might be
       | several times farther than the distance from the starting point
        | of the iterations to the final one. You can visualize this _zig zag_
       | already in just two dimensions, say, following a river, say, a
       | river that curves, down a valley the river cut over a million
       | years or so, that is, a valley with steep sides. Then gradient
       | descent may keep crossing the river and go maybe 10 feet for each
       | foot downstream!
       | 
       | Right, if just trying to go downhill on a tilted flat plane, then
       | the gradient will point in the steepest descent on the plane and
       | gradient descent will go all way downhill in just one iteration.
       | 
        | In even moderately challenging problems, BFGS can be a big
        | improvement.
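        | 
        | A small illustration of the difference (my own sketch, nothing
        | to do with ggml's code): plain fixed-step steepest descent vs.
        | BFGS on the Rosenbrock function, a classic curved "river valley"
        | like the one described above.
        | 
        |   import numpy as np
        |   from scipy.optimize import minimize, rosen, rosen_der
        | 
        |   x0 = np.array([-1.2, 1.0])
        | 
        |   # naive steepest descent with a fixed step: zig-zags and crawls
        |   x = x0.copy()
        |   for _ in range(1000):
        |       x -= 1e-3 * rosen_der(x)
        |   print("steepest descent after 1000 steps:", x, "f =", rosen(x))
        | 
        |   # BFGS builds up curvature information and reaches the minimum
        |   # near (1, 1) in far fewer iterations
        |   res = minimize(rosen, x0, jac=rosen_der, method="BFGS")
        |   print("BFGS after", res.nit, "iterations:", res.x, "f =", res.fun)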
        
       | doxeddaily wrote:
       | This scratches my itch for no dependencies.
        
       | s1k3s wrote:
       | I'm out of the loop on this entire thing so call me an idiot if I
       | get it wrong. Isn't this whole movement based on a model leak
       | from Meta? Aren't licenses involved that prevent it from going
       | commercial?
        
         | detrites wrote:
         | GGML is essentially a library of lego pieces that can be put
         | together to work with many LLM or other types of ML models.
         | 
          | Meta's leaked model is one to which GGML has been applied for
          | fast, local inference.
        
         | dimfeld wrote:
         | Only the weights themselves. There have been other models since
         | then built on the same Llama architecture, but trained from
          | scratch so they're safe for commercial use. The GGML code and
         | related projects (llama.cpp and so on) also support some other
         | model types now such as Mosaic's MPT series.
        
       | okhuman wrote:
        | The establishment of ggml.ai, a company focused on ggml and
        | llama.cpp, the most innovative and exciting platform to come
        | along for local LLMs, on an Open Core model is just laziness.
        | 
        | Just because you can (and have the connections) doesn't mean you
        | should. It's a sad state of OSS when the best and brightest
        | developers/founders reach for antiquated models.
        | 
        | Maybe we should take up a new rule in OSS communities that says
        | you must release your CORE software as MIT at the same time you
        | plan to go Open Core (and no sooner).
       | 
       | Why should OSS communities take on your product market fit?!
        
         | wmf wrote:
         | This looks off-topic since GGML has not announced anything
         | about open core and their software is already MIT.
         | 
         | More generally, if you want to take away somebody's business
         | model you need to provide one that works. It isn't easy.
        
           | okhuman wrote:
           | Agreed with you 100% - its not easy. Sometimes I just wish
           | someone as talented as Georgi would innovate not just on the
            | core tech side but bring that same tenacity to the licensing
           | side, in a way that aligns incentives better and tries out
           | something new. And that the community would have his back if
           | some new approach failed, no matter what.
        
       | aryamaan wrote:
        | Could someone talk at a high level about how one starts
        | contributing to these kinds of problems?
        | 
        | For the people who build solutions for data handling -- ranging
        | from CRUD to highly scalable systems -- these things are alien
        | concepts. (Or maybe I am just talking about myself.)
        
       | danieljanes wrote:
       | Does GGML support training on the edge? We're especially
       | interested in training support for Android+iOS
        
         | [deleted]
        
         | svantana wrote:
         | Yes - look at the file tests/test-opt.c. Unfortunately there's
         | almost no documentation about its training/autodiff.
        
       | KronisLV wrote:
        | Just today, I finished a blog post (also my latest submission,
        | felt like it could be useful to some) about how to get something
        | like this working as a bundle: something to run models, plus a
        | web UI for easier interaction - in my case that was koboldcpp,
        | which can run GGML models, both on the CPU (with OpenBLAS) and
        | on the GPU (with CLBlast). Thanks to Hugging Face, getting
       | Metharme, WizardLM or other models is also extremely easy, and
       | the 4-bit quantized ones provide decent performance even on
       | commodity hardware!
       | 
       | I tested it out both locally (6c/12t CPU) and on a Hetzner CPX41
       | instance (8 AMD cores, 16 GB of RAM, no GPU), the latter of which
       | costs about 25 EUR per month and still can generate decent
       | responses in less than half a minute, my local machine needing
       | approx. double that time. While not quite as good as one might
       | expect (decent response times mean maxing out CPU for the single
       | request, if you don't have a compatible GPU with enough VRAM),
       | the technology is definitely at a point where it's possible for
       | it to make people's lives easier in select use cases with some
       | supervision (e.g. customer support).
       | 
       | What an interesting time to be alive, I wonder where we'll be in
       | a decade.
        
         | b33j0r wrote:
         | I wish everyone in tech had your perspective. That is what I
         | see, as well.
         | 
         | There is a lull right now between gpt4 and gpt5 (literally and
         | metaphorically). Consumer models are plateauing around 40B for
         | a barely-reasonable RTX 3090 (ggml made this possible).
         | 
         | Now is the time to launch your ideas, all!
        
         | digitallyfree wrote:
         | The fact that this is _commodity hardware_ makes ggml extremely
         | impressive and puts the tech in the hands of everyone. I
          | recently reported my experience running a 7B model with
          | llama.cpp on a 15 year old Core 2 Quad [1] - when that machine
          | came out it was a completely different world and I certainly
          | never imagined what AI would look like today. This was around
          | when the first iPhone
         | was released and everyone began talking about how smartphones
         | would become the next big thing. We saw what happened 15 years
         | later...
         | 
         | Today with the new k-quants users are reporting that 30B models
         | are working with 2-bit quantization on 16GB CPUs and GPUs [2].
         | That's enabling access to millions of consumers and the
         | optimizations will only improve from there.
         | 
         | [1]
         | https://old.reddit.com/r/LocalLLaMA/comments/13q6hu8/7b_perf...
         | 
         | [2] https://github.com/ggerganov/llama.cpp/pull/1684,
         | https://old.reddit.com/r/LocalLLaMA/comments/141bdll/moneros...
        
         | c_o_n_v_e_x wrote:
         | What do you mean by commodity hardware? Single server single
         | CPU socket x86/ARM boxes? Anything that does not have a GPU?
        
           | [deleted]
        
         | SparkyMcUnicorn wrote:
         | Seems like serverless is the way to go for fast output while
         | remaining inexpensive.
         | 
         | e.g.
         | 
         | https://replicate.com/stability-ai/stablelm-tuned-alpha-7b
         | 
         | https://github.com/runpod/serverless-workers/tree/main/worke...
         | 
         | https://modal.com/docs/guide/ex/falcon_gptq
        
           | tikkun wrote:
           | I think that's true if you're doing minimal usage / low
           | utilization, otherwise a dedicated instance will be cheaper.
        
       | mliker wrote:
       | congrats! I was just listening to your changelog interview from
       | months ago in which you said you were going to move on from this
        | after you brushed up the code a bit, but it seems the momentum is
       | too great. Glad to see you carrying this amazing project(s)
       | forward!
        
       | FailMore wrote:
       | Remember
        
       | kretaceous wrote:
       | Georgi's Twitter announcement:
       | https://twitter.com/ggerganov/status/1666120568993730561
        
         | jgrahamc wrote:
         | Cool. I've just started sponsoring him on GitHub.
        
       | FailMore wrote:
       | Commenting to remember. Looks good
        
       ___________________________________________________________________
       (page generated 2023-06-06 23:00 UTC)