[HN Gopher] GGML - AI at the Edge
___________________________________________________________________
 
GGML - AI at the Edge
 
Author : georgehill
Score  : 466 points
Date   : 2023-06-06 16:50 UTC (6 hours ago)
 
(HTM) web link (ggml.ai)
(TXT) w3m dump (ggml.ai)
 
| [deleted]
| zkmlintern wrote:
| [dead]
| huevosabio wrote:
| Very exciting!
| 
| Now we just need a post that benchmarks the different options
| (ggml, TVM, AITemplate, HippoML) and helps decide which route to
| take.
| Havoc wrote:
| How common is AVX on edge platforms?
| binarymax wrote:
| svantana is correct that PCs are edge, but if you meant "mobile",
| then ARM chips in iOS and Android devices typically have NEON
| instructions for SIMD, not AVX:
| https://developer.arm.com/Architectures/Neon
| Havoc wrote:
| I was thinking more of edge in the distributed serverless sense,
| but I guess for this type of use the compute part is slow, not
| the latency, so the question doesn't make much sense in
| hindsight.
| binarymax wrote:
| Compute _is_ the latency for LLMs :)
| 
| And in general, your inference code will be compiled to a
| CPU/architecture target - so you can know ahead of time what
| instructions you'll have access to when writing your code for
| that target.
| 
| For example, in the case of AWS Lambda you can choose Graviton2
| (ARM with NEON) or x86_64 (AVX). The trick is that some
| processors, such as 3rd-gen-and-up Xeons, have AVX-512, while on
| others you will top out at 256-bit AVX2. You might be able to
| figure out what exact instruction set your serverless target
| supports.
| svantana wrote:
| Edge just means that the computing is done close to the I/O
| data, so that includes PCs and such.
| Dwedit wrote:
| There was a big stink one time when the file format changed,
| causing older model files to become unusable on newer versions
| of llama.cpp.
| pawelduda wrote:
| I happen to have an RPi 4B with Home Assistant. Is this
| something I could set up on it and integrate with HA to control
| it with speech, or is it overkill?
| boppo1 wrote:
| I doubt it. I'm running 4-bit 30B and 65B models with 64GB of
| RAM, a 4080 and a 7900X. The 7B models are less demanding, but
| even so, you'll need more than an RPi. Even then, it would be a
| _project_ to get these to control something. This is more 'first
| baby steps' toward the edge.
| pawelduda wrote:
| The article shows an example running on an RPi that recognizes
| colour names. I could just come up with keywords that would
| invoke certain commands and feed them to HA, which would match
| them to an automation (i.e. turn off kitchen, or just kitchen).
| I think a PoC is doable, but I'm aware I could run into
| limitations quickly. Idk, might give it a try when I'm bored.
| 
| Would love a voice assistant running locally, but probably there
| are solutions out there - didn't get to do the research yet.
| nivekney wrote:
| On a similar thread, how does it compare to HippoML?
| 
| Context: https://news.ycombinator.com/item?id=36168666
| brucethemoose2 wrote:
| We don't necessarily know... Hippo is closed source for now.
| 
| It's comparable to Apache TVM's Vulkan backend in speed on CUDA,
| see https://github.com/mlc-ai/mlc-llm
| 
| But honestly, the biggest advantage of llama.cpp for me is being
| able to split a model so performantly. My puny 16GB laptop can
| _just barely_, but very practically, run LLaMA 30B at almost 3
| tokens/s, and do it right now. That is crazy!
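To illustrate binarymax's point above about compiling inference
kernels for a known CPU target: a minimal sketch, not ggml's actual
kernel code, of a dot product with compile-time SIMD dispatch. The
preprocessor picks AVX2 on x86_64 or NEON on ARM, precisely because
the target ISA is known when the binary is built.

    #include <stddef.h>
    #if defined(__AVX2__)
    #include <immintrin.h>
    #elif defined(__ARM_NEON)
    #include <arm_neon.h>
    #endif

    /* Dot product: the hot loop of LLM inference. */
    float dot(const float *a, const float *b, size_t n) {
        float sum = 0.0f;
        size_t i = 0;
    #if defined(__AVX2__)
        __m256 acc = _mm256_setzero_ps();
        for (; i + 8 <= n; i += 8)              /* 8 floats per step */
            acc = _mm256_add_ps(acc, _mm256_mul_ps(
                      _mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
        float tmp[8];
        _mm256_storeu_ps(tmp, acc);
        for (int j = 0; j < 8; j++) sum += tmp[j];
    #elif defined(__ARM_NEON)
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (; i + 4 <= n; i += 4)              /* 4 floats per step */
            acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
        sum = vaddvq_f32(acc);                  /* horizontal add (AArch64) */
    #endif
        for (; i < n; i++) sum += a[i] * b[i];  /* scalar tail */
        return sum;
    }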
| smiley1437 wrote:
| >> run LLaMA 30B at almost 3 tokens/s
| 
| Please tell me your config! I have an i9-10900 with 32GB of RAM
| that only gets 0.7 tokens/s on a 30B model.
| LoganDark wrote:
| > Please tell me your config! I have an i9-10900 with 32GB of RAM
| that only gets 0.7 tokens/s on a 30B model
| 
| Have you quantized it?
| smiley1437 wrote:
| The model I have is q4_0, I think that's 4-bit quantized.
| 
| I'm running in Windows using koboldcpp, maybe it's faster in
| Linux?
| LoganDark wrote:
| > The model I have is q4_0, I think that's 4-bit quantized.
| 
| That's correct, yeah. Q4_0 should be the smallest and fastest
| quantized model.
| 
| > I'm running in Windows using koboldcpp, maybe it's faster in
| Linux?
| 
| Possibly. You could try using WSL to test - I think both WSL1 and
| WSL2 are faster than Windows (but WSL1 should be faster than
| WSL2).
| brucethemoose2 wrote:
| I am running Linux with cuBLAS offload, and I am using the new
| 3-bit quant that was just pulled in a day or two ago.
| brucethemoose2 wrote:
| I'm on a Ryzen 4900HS laptop with an RTX 2060.
| 
| Like I said, very modest.
| oceanplexian wrote:
| With a single NVIDIA 3090 and the fastest inference branch of
| GPTQ-for-LLaMA https://github.com/qwopqwop200/GPTQ-for-
| LLaMa/tree/fastest-i..., I get a healthy 10-15 tokens per second
| on the 30B models. IMO GGML is great (and I totally use it) but
| it's still not as fast as running the models on GPU for now.
| LoganDark wrote:
| > IMO GGML is great (and I totally use it) but it's still not as
| fast as running the models on GPU for now.
| 
| I think it was originally designed to be easily embeddable - and
| most importantly, _native code_ (i.e. not Python) - rather than
| competitive with GPUs.
| 
| I think it's just starting to get into GPU support now, but
| carefully.
| brucethemoose2 wrote:
| Have you tried the most recent CUDA offload? A dev claims they
| are getting 26.2ms/token (38 tokens per second) on 13B with a
| 4080.
| yukIttEft wrote:
| Its graph execution is still full of busy loops, e.g.:
| 
| https://github.com/ggerganov/llama.cpp/blob/44f906e8537fcec9...
| 
| I wonder how much more efficient it would be if the Taskflow
| library were used instead, or even Intel TBB.
| mhh__ wrote:
| It's not a very good library IMO.
| moffkalast wrote:
| Someone ought to be along with a PR eventually.
| boywitharupee wrote:
| Is graph execution used for training only, or inference too?
| LoganDark wrote:
| Inference. It's a big bottleneck for RWKV.cpp, second only to the
| matrix multiplies.
| make3 wrote:
| Does TBB work with Apple Silicon?
| yukIttEft wrote:
| I guess: https://formulae.brew.sh/formula/tbb
| [deleted]
| renewiltord wrote:
| This guy is damned good. I sponsored him on GitHub because his
| software is dope. I also like how when some controversy erupted
| on the project he just ejected the controversial people and moved
| on. Good stewardship. Great code.
| 
| I recall something like when he first ported it, it worked on my
| M1 Max before he had even tested it on Apple Silicon, since he
| didn't have the hardware.
| 
| Honestly, with this and whisper, I am a huge fan. Good luck to
| him and the new company.
| killthebuddha wrote:
| Another important detail about the ejections that I think is
| particularly classy is that the people he ejected are broadly
| considered to have world-class technical skills. In other words,
| he was very explicitly prioritizing collaborative potential >
| technical skill. Maybe a future BDFL[1]!
| 
| [1] https://en.wikipedia.org/wiki/Benevolent_dictator_for_life
| jart wrote:
| Gerganov was prioritizing collaboration with 4chan, who raided
| his GitHub to demand a change written by a transgender woman be
| reverted. There was so much hate speech and immaturity thrown
| around (words like tranny troon cucking muh model) that it's a
| real embarrassment (to those of us who deeply want to see local
| models succeed) that one of the smartest guys working on the
| problem was taken in by all that. You can't run a collaborative
| environment that's open when you pander to hate, because hate
| subverts communities; it's impossible to compromise with
| anonymous trolls who harass a public figure over physical traits
| about her body she can't change.
| 
| You don't have to take my word on it. Here are some archives of
| the 4chan threads where they coordinated the raid. It went on for
| like a month. https://archive.is/EX7Fq https://archive.is/enjpf
| https://archive.is/Kbjtt https://archive.is/HGwZm
| https://archive.is/pijMv https://archive.is/M7hLJ
| https://archive.is/4UxKP https://archive.is/IB9bv
| https://archive.is/p6Q2q https://archive.is/phCGN
| https://archive.is/M6AF1 https://archive.is/mXoBs
| https://archive.is/68Ayg https://archive.is/DamPp
| https://archive.is/DiQC2 https://archive.is/DeX8Z
| https://archive.is/gStQ1
| 
| If you read these threads and see how nasty these little monsters
| are, you can probably imagine how Gerganov must have felt. He was
| probably scared they'd harass him too, since 4chan acts like he's
| their boy. Plus it was weak leadership on his part to disappear
| for days, suddenly show up again to neutral-knight the conflict
| (https://justine.lol/neutral-knight.png), tell his team members
| they're no longer welcome, and then go back and delete his
| comment later. It just goes to show you can be really brilliant
| at the hard technical skills but totally clueless when it comes
| to people.
| zo1 wrote:
| Really curious why you tried to rename the file format magic
| string to have your initials, going from GGML (see the title of
| this post) to GGJT, with JT being Justine Tunney? It seems quite
| unnecessary and bound to have rubbed a lot of people the wrong
| way.
| 
| Here is the official commit undoing the change:
| 
| https://github.com/ggerganov/llama.cpp/pull/711/files#diff-7...
| killthebuddha wrote:
| I didn't want to not reply, but I also didn't want to be swept
| into a potentially fraught internet argument. So I tried to edit
| my comment as a middle ground, but it looks like I can't - I
| guess there must be a timeout. If I could edit it, I'd add the
| following:
| 
| "I should point out that I wasn't personally involved, haven't
| looked into it in detail, and that there are many different
| perspectives that should be considered."
| evanwise wrote:
| What was the controversy?
| kgwgk wrote:
| https://news.ycombinator.com/item?id=35411909
| pubby wrote:
| https://github.com/ggerganov/llama.cpp/pull/711
| nchudleigh wrote:
| He has been amazing to watch, and has even helped me out with my
| app that uses his whisper.cpp project (https://superwhisper.com).
| 
| Excited to see how his venture goes!
| PrimeMcFly wrote:
| > I also like how when some controversy erupted on the project he
| just ejected the controversial people and moved on. Good
| stewardship
| 
| Do you have more info on the controversy? I'm not sure ejecting
| developers just because of controversy is honestly good
| stewardship.
| freedomben wrote:
| Right.
| More details are needed to know whether this is good stewardship
| (ejecting two toxic individuals) or laziness (ejecting a villain
| and a hero to get rid of the "problem" easily). TikTok was using
| this method for a while, ejecting both bullies and victims, and
| it "solved" the problem, but most people see the injustice there.
| 
| I'm not saying it was bad stewardship, I honestly don't know. I
| just agree that we shouldn't make a judgment without more
| information.
| jstarfish wrote:
| > More details needed to know if this is good stewardship
| (ejecting two toxic individuals) or laziness (ejecting a villain
| and a hero to get rid of the "problem" easily). TikTok was using
| this method for a while by ejecting both bullies and victims,
| 
| This is SOP for American schools. It's laziness there, since
| education is supposed to be compulsory. They can't be bothered to
| investigate (and with today's hostile climate, I don't blame
| them) so they consign both parties to independent-study programs.
| 
| For volunteer projects, throwing both overboard is unfortunate
| but necessary stewardship. The drama it attracts destabilizes the
| entire project, which only exists as long as it remains _fun_ for
| the maintainer. It's tragic, but victims who can't recover
| gracefully are as toxic as their abusers.
| boppo1 wrote:
| > justice
| 
| For an individual running a small open source project, there's
| time enough for coding or detailed justice, but not both. When
| two parties start pointing fingers and raising hell and it's not
| immediately clear who is in the right, ban both and let them fork
| it.
| csmpltn wrote:
| > More details needed to know if this is good stewardship
| (ejecting two toxic individuals) or laziness (ejecting a villain
| and a hero to get rid of the "problem" easily).
| 
| Man, nobody has time for this shit. Leave the games and the drama
| for the social justice warriors and the furries. People building
| shit ain't got time for this - ejecting troublemakers is the
| right way to go, regardless of which "side" they're on.
| LoganDark wrote:
| > and the furries
| 
| Um, what?
| camdenlock wrote:
| If you know, you know
| freedomben wrote:
| I would agree that there needs to be a balance, because wasting
| time babysitting adults is dumb, but what if one person is a good
| and loved contributor, and the other is a social justice warrior
| new to the project who is picking fights with the contributor?
| Your philosophy makes for not only bad stewardship but an
| injustice. I'm not suggesting this is the only scenario, just
| merely a hypothetical that I think illustrates my position.
| wmf wrote:
| And what do you do when every contributor to the project,
| including the founder, has been labeled a troublemaker?
| boppo1 wrote:
| Pick the fork that has devs who are focused on contributing code
| and not pursuing drama.
| infamouscow wrote:
| The code is MIT licensed. If you don't agree with the direction
| the project is taking, you can fork it and add whatever you want.
| 
| I don't understand why this is so difficult for software
| developers with GitHub accounts to understand.
| PrimeMcFly wrote:
| You've missed the point here more than I've seen anyone miss the
| point in a long time.
| infamouscow wrote:
| Software stewardship is cringe.
| 
| The idea that software licensed under a free software license can
| have a steward doesn't even make sense.
| 
| How exactly does someone supervise or take care of intellectual
| property (read: code) when the author and original copyright
| holder explicitly licensed their work under the MIT license,
| granting anyone the following:
| 
| > [T]o deal in the software without restriction, including
| without limitation the rights to use, copy, modify, merge,
| publish, distribute, sublicense, and/or sell copies of the
| software, and to permit persons to whom the software is furnished
| to do so, subject to the following conditions
| 
| The author was certainly a steward when they were working on it
| in private, or heck, even in public, since copyright is implicit
| - but certainly not after adding the MIT license.
| 
| So when I think of software stewardship, all I see are
| self-appointed thought-leaders and corporate vampires like Oracle
| chest-beating to the public about how important they are.
| 
| It's simply a way for those in positions of power/status to
| remain in their positions, elevated above everyone else.
| Depending on the situation and context that might be good or bad.
| What's important is that it's not for these so-called "stewards"
| to decide.
| iamflimflam1 wrote:
| I've always thought of the edge as being IoT-type stuff - so,
| running on embedded devices. But maybe that's not the case?
| Y_Y wrote:
| Like any new term, the (mis)usage broadens the meaning over time
| until either it's widely known, it's unfashionable, or - most
| likely - it becomes so broad as to be meaningless, and hence
| achieves buzzword apotheosis.
| 
| My old job title had "edge" in it, and I still don't know what
| it's supposed to mean, although "not cloud" is a good
| approximation.
| b33j0r wrote:
| Sounds like your job had a lot of velocity with lateral
| tragmorphicity in Q1, just in time for staff engineer
| optimization!
| 
| Nicely done. Here is ~$50 worth of stock.
| timerol wrote:
| "Edge computing" is a pretty vague term, and can encompass
| anything from an 8MHz ARM core that can barely talk compliant BLE
| all the way to a multi-thousand-dollar setup on something like a
| self-checkout machine, which may have more compute available than
| your average laptop. In that range are home assistants, which
| normally have some basic ML for wake-word detection, and then
| send the next bit of audio to the cloud, where a more advanced
| model does full speech-to-text (and the response).
| conjecTech wrote:
| Congratulations! How do you plan to make money?
| ggerganov wrote:
| I'm planning to write code and have fun!
| az226 wrote:
| Have you thought about what your path looks like to get to the
| next phase? Are you taking on any more investors pre-seed?
| beardog wrote:
| > ggml.ai is a company founded by Georgi Gerganov to support the
| development of ggml. Nat Friedman and Daniel Gross provided the
| pre-seed funding.
| 
| Did you give them a different answer? It is okay if you can't or
| don't want to share, but I doubt the company is only planning to
| have fun. Regardless, best of luck to you, and thank you for your
| efforts so far.
| jgrahamc wrote:
| This is a good plan.
| TechBro8615 wrote:
| I believe ggml is the basis of llama.cpp (the OP says it's "used
| by llama.cpp")? I don't know much about either, but when I read
| the llama.cpp code to see how it was created so quickly, I got
| the sense that the original project was ggml, given the amount of
| pasted code I saw. It seemed like quite an impressive library.
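For a taste of the library TechBro8615 describes: a sketch adapted
from the f(x) = a*x^2 + b example in ggml's README around this time.
Exact function names and signatures may differ between versions, so
treat this as illustrative rather than authoritative. All tensors
live in a caller-provided arena, and computation is expressed as a
static graph that is then executed on CPU threads.

    #include <stdio.h>
    #include "ggml.h"

    int main(void) {
        struct ggml_init_params params = {
            .mem_size   = 16 * 1024 * 1024,  /* 16 MB arena, no hidden allocs */
            .mem_buffer = NULL,
        };
        struct ggml_context *ctx = ggml_init(params);

        struct ggml_tensor *x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
        ggml_set_param(ctx, x);  /* mark x as an input (enables autodiff) */

        struct ggml_tensor *a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
        struct ggml_tensor *b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

        /* build the graph for f = a*x^2 + b */
        struct ggml_tensor *x2 = ggml_mul(ctx, x, x);
        struct ggml_tensor *f  = ggml_add(ctx, ggml_mul(ctx, a, x2), b);
        struct ggml_cgraph gf  = ggml_build_forward(f);

        ggml_set_f32(x, 2.0f);
        ggml_set_f32(a, 3.0f);
        ggml_set_f32(b, 4.0f);

        ggml_graph_compute(ctx, &gf);
        printf("f = %f\n", ggml_get_f32_1d(f, 0));  /* 16.0 */

        ggml_free(ctx);
        return 0;
    }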
| make3 wrote:
| It's the library used for tensor operations inside of llama.cpp,
| yes.
| kgwgk wrote:
| https://news.ycombinator.com/item?id=33877893
| 
| "OpenAI recently released a model for automatic speech
| recognition called Whisper. I decided to reimplement the
| inference of the model from scratch using C/C++. To achieve this
| I implemented a minimalistic tensor library in C and ported the
| high-level architecture of the model in C++."
| 
| That "minimalistic tensor library" was ggml.
| world2vec wrote:
| Might be a silly question, but is GGML a similar/competing
| library to George Hotz's tinygrad [0]?
| 
| [0] https://github.com/geohot/tinygrad
| qeternity wrote:
| No, GGML is a CPU-optimized library and quantized weight format
| that is closely linked to his other project, llama.cpp.
| stri8ed wrote:
| How does the quantization happen? Are the weights preprocessed
| before loading the model?
| ggerganov wrote:
| The weights are preprocessed into integer quants combined with
| scaling factors in various configurations (4, 5, 8-bit, and
| recently more exotic 2, 3 and 6-bit quants). At runtime, we use
| efficient SIMD implementations to perform the matrix
| multiplication at integer level, carefully optimizing for both
| compute and memory bandwidth. Similar strategies are applied when
| running GPU inference - using custom kernels for fast matrix x
| vector multiplications.
| sebzim4500 wrote:
| Yes, but to my knowledge it doesn't do any of the complicated
| optimization stuff that SOTA quantisation methods use. It
| basically is just doing a bunch of rounding.
| 
| There are advantages to simplicity, after all.
| brucethemoose2 wrote:
| It's not so simple anymore, see
| https://github.com/ggerganov/llama.cpp/pull/1684
| ggerganov wrote:
| ggml started with a focus on CPU inference, but lately we have
| been augmenting it with GPU support. Although still in
| development, it already has partial CUDA, OpenCL and Metal
| backend support.
| qeternity wrote:
| Hi Georgi - thanks for all the work, I have been following and
| using it since the availability of the LLaMA base layers!
| 
| I wasn't implying it's CPU-only, just that it started as a
| CPU-optimized library.
| ignoramous wrote:
| (a novice here who knows a couple of fancy terms)
| 
| > _...lately we have been augmenting it with GPU support._
| 
| Would you say you'd then be building an equivalent to Google's
| JAX?
| 
| Someone even asked if anyone would build a C++-to-JAX transpiler
| [0]... I am wondering if that's something you may implement?
| Thanks.
| 
| [0] https://news.ycombinator.com/item?id=35475675
| freedomben wrote:
| As a person burned by Nvidia, I can't thank you enough for the
| OpenCL support.
| xiphias2 wrote:
| They are competing (although they are very different: tinygrad is
| full-stack Python; ggml is focusing on a few very important
| models), but in my opinion George Hotz lost focus a bit by not
| working more on getting the low-level optimizations perfect.
| georgehotz wrote:
| Which low-level optimizations specifically are you referring to?
| 
| I'm happy with most of the abstractions. We are pushing to
| assembly codegen. And if you meant things like matrix
| accelerators, that's my next priority.
| 
| We are taking more of a breadth-first approach. I think ggml is
| more depth-first and application-focused. (And I think Mojo is
| even more breadth-first.)
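The block quantization Georgi describes above is simple to sketch.
The following toy C routines follow the spirit of ggml's Q4_0 format
- blocks of 32 weights sharing one scale, each weight stored as a
4-bit integer - though the real layout and rounding details differ
across versions; the names and constants here are illustrative, not
ggml's actual identifiers.

    #include <math.h>
    #include <stdint.h>

    #define QK 32  /* weights per block, as in ggml's Q4_0 */

    typedef struct {
        float   d;          /* per-block scale factor */
        uint8_t qs[QK / 2]; /* 32 4-bit quants, two per byte */
    } block_q4;

    /* Quantize one block of 32 float weights to 4-bit ints + scale. */
    void quantize_block(const float *w, block_q4 *out) {
        float amax = 0.0f;
        for (int i = 0; i < QK; i++)
            if (fabsf(w[i]) > amax) amax = fabsf(w[i]);

        out->d = amax / 7.0f;  /* map [-amax, amax] onto [-7, 7] */
        const float id = out->d ? 1.0f / out->d : 0.0f;

        for (int i = 0; i < QK; i += 2) {
            int q0 = (int)roundf(w[i]     * id) + 8;  /* bias to 0..15 */
            int q1 = (int)roundf(w[i + 1] * id) + 8;
            out->qs[i / 2] = (uint8_t)(q0 | (q1 << 4));
        }
    }

    /* Dequantize: weight ~= (q - 8) * d. Fast inference kernels skip
       this step and dot the integer quants directly, as described
       above, only scaling by d at the end. */
    void dequantize_block(const block_q4 *in, float *w) {
        for (int i = 0; i < QK; i += 2) {
            w[i]     = ((in->qs[i / 2] & 0x0F) - 8) * in->d;
            w[i + 1] = ((in->qs[i / 2] >> 4)   - 8) * in->d;
        }
    }

The payoff is in the arithmetic: 32 weights cost 16 bytes of quants
plus a 4-byte scale instead of 128 bytes of floats, roughly a 6x
reduction in memory bandwidth, which is the binding constraint for
CPU inference.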
| edfletcher_t137 wrote:
| This is a bang-up idea. You absolutely love to see capital
| investment on this type of open, commodity-hardware-focused
| foundational technology. Rock on GGMLers, and thank you!
| boringuser2 wrote:
| Looking at the source of this kind of underlines the difference
| between machine learning scientist types and actual computer
| scientists.
| rvz wrote:
| > Nat Friedman and Daniel Gross provided the pre-seed funding.
| 
| Why? Why should VCs get involved again?
| 
| They are just going to look for an exit and end up getting
| acquired by Apple Inc.
| 
| Not again.
| sroussey wrote:
| Daniel Gross is a good guy, and yes, his company did get acquired
| by Apple a while back, but he loves to foster really dope stuff
| by amazing people, and ggml certainly fits the bill. And this
| looks like an angel investment, not a VC one, if that makes any
| difference to you.
| renewiltord wrote:
| It's possible to do whatever you want without VCs. The code is
| open source, so you can start where he's starting from and run a
| purely different enterprise if you desire.
| okhuman wrote:
| +1. VC involvement in projects like these always pivots the team
| away from the core competency of what you'd expect them to
| deliver - into some commercialization aspect that converts only a
| tiny fraction of the community yet takes up 60%+ of the core
| developer team's time.
| 
| I don't know why project founders head this way... the track
| record of leaders who do this is that they end up disappointing
| the involved community at some point. Look to Matt Klein + the
| Cloud Native Computing Foundation at Envoy for a somewhat decent
| model of how to do this better.
| 
| We continue down the Open Core model, yet it continues to fail
| communities.
| wmf wrote:
| Developers shouldn't be unpaid slaves to the community.
| okhuman wrote:
| You're right. I just wish this decision was taken to the
| community; we could have all come together to help and supported
| him during these difficult/transitional times. :( Maybe this
| decision was rushed or is money related, who knows the actual
| circumstances.
| 
| Here's the Matt Klein article:
| https://mattklein123.dev/2021/09/14/5-years-envoy-oss/
| jart wrote:
| Whenever a community project goes commercial, its interests are
| usually no longer aligned with the community. For example,
| llama.cpp makes frequent backwards-incompatible changes to its
| file format. I maintain a fork of ggml in the cosmopolitan
| monorepo which maintains support for old file formats. You can
| build and use it as follows:
| 
|     git clone https://github.com/jart/cosmopolitan
|     cd cosmopolitan
| 
|     # cross-compile on x86-64-linux for
|     # x86-64 linux+windows+macos+freebsd+openbsd+netbsd
|     make -j8 o//third_party/ggml/llama.com
|     o//third_party/ggml/llama.com --help
| 
|     # cross-compile on x86-64-linux for aarch64-linux
|     make -j8 m=aarch64 o/aarch64/third_party/ggml/llama.com
|     # note: creates .elf file that runs on RasPi, etc.
| 
|     # compile loader shim to run on arm64 macos
|     cc -o ape ape/ape-m1.c    # use xcode
|     ./ape ./llama.com --help  # use elf aarch64 binary above
| 
| It goes the same speed as upstream for CPU inference. This is
| useful if you can't/won't recreate your weights files, or want to
| download old GGML weights off Hugging Face, since llama.com has
| support for every generation of the ggjt file format.
| halyconWays wrote:
| [dead]
| throw74775 wrote:
| Do you have pre-seed funding to give him?
| jgrahamc wrote:
| I do.
| samwillis wrote:
| ggml and llama.cpp are such a good platform for local LLMs;
| having some financial backing to support development is
| brilliant. We should be concentrating as much as possible on
| doing local inference (and training) based on private data.
| 
| I want a _local_ ChatGPT fine-tuned on my personal data, running
| on my own device, not in the cloud. Ideally open source too;
| llama.cpp is looking like the best bet to achieve that!
| SparkyMcUnicorn wrote:
| Maybe I'm wrong, but I don't think you want it fine-tuned on your
| data.
| 
| Pretty sure you might be looking for this:
| https://github.com/SamurAIGPT/privateGPT
| 
| Fine-tuning is good for teaching it how to act, but not great for
| reciting/recalling data.
| dr_dshiv wrote:
| How does this work?
| deet wrote:
| The parent is saying that "fine-tuning", which has a specific
| meaning related to actually retraining the model itself (or
| layers at its surface) on a specialized set of data, is not what
| the GP is actually looking for.
| 
| An alternative method is to index content in a database and then
| insert contextual hints into the LLM's prompt that give it extra
| information and detail with which to respond with an answer
| on-the-fly.
| 
| That database can use semantic similarity (i.e. via a vector
| database), keyword search, or other ranking methods to decide
| what context to inject into the prompt.
| 
| PrivateGPT is doing this method: reading files, extracting their
| content, splitting the documents into small-enough-to-fit-into-
| prompt bits, and then indexing them into a database. Then, at
| query time, it inserts context into the LLM prompt.
| 
| The repo uses LangChain as boilerplate, but it's pretty easy to
| do manually or with other frameworks.
| 
| (PS: if anyone wants this type of local LLM + document Q/A and
| agents, it's something I'm working on as a supported product
| integrated into macOS, and using ggml; see profile)
| brucethemoose2 wrote:
| If MeZO gets implemented, we are basically there:
| https://github.com/princeton-nlp/MeZO
| moffkalast wrote:
| Basically there, with what kind of VRAM and processing
| requirements? I doubt anyone running on a CPU can fine-tune in a
| time frame that doesn't give them an obsolete model when they're
| done.
| nl wrote:
| According to the paper it fine-tunes at the speed of inference
| (!!)
| 
| This would make fine-tuning a quantized 13B model achievable in
| ~0.3 seconds per training example on a CPU.
| f_devd wrote:
| MeZO assumes a smooth parameter space, so you probably won't be
| able to do it with INT4/8 quantization; it probably needs fp8 or
| smoother.
| gliptic wrote:
| I cannot find any such numbers in the paper. What the paper says
| is that MeZO converges much slower than SGD, and each step needs
| two forward passes.
| 
| "As a limitation, MeZO takes many steps in order to achieve
| strong performance."
| moffkalast wrote:
| Wow, if that's true then it's genuinely a complete gamechanger
| for LLMs as a whole. You probably mean more like 0.3s per token,
| not per example, but that's still more like one or two minutes
| per training case, not a day for 4 cases like it is now.
| sp332 wrote:
| It's the same _memory footprint_ as inference. It's not that
| fast, and the paper mentions some optimizations that could still
| be done.
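A deliberately crude sketch of the retrieve-then-inject pattern deet
describes above. A real system would use an embedding model and a
vector database for the scoring step; here word overlap stands in
for semantic similarity, and the corpus, query, and function names
are all made up for illustration.

    #include <stdio.h>
    #include <string.h>

    /* Toy corpus standing in for the user's indexed documents. */
    static const char *docs[] = {
        "The meeting with the landlord is on June 12.",
        "Quarterly revenue grew 14 percent year over year.",
        "The wifi password at the cabin is hunter2.",
    };

    /* Crude relevance score: count query words that appear in the doc. */
    static int score(const char *doc, const char *query) {
        char buf[256];
        int s = 0;
        strncpy(buf, query, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';
        for (char *w = strtok(buf, " ?"); w; w = strtok(NULL, " ?"))
            if (strlen(w) > 3 && strstr(doc, w)) s++;
        return s;
    }

    int main(void) {
        const char *query = "what is the wifi password";
        int best = 0;
        for (int i = 1; i < 3; i++)
            if (score(docs[i], query) > score(docs[best], query)) best = i;

        /* Context injection: the retrieved text rides along in the
           prompt; the model itself is never retrained. */
        printf("Use the context to answer.\nContext: %s\nQ: %s\nA:",
               docs[best], query);
        return 0;
    }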
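The MeZO trick the last few comments debate can also be sketched
compactly: perturb the weights with noise regenerated from a seed,
take two forward passes, and turn the loss difference into a scalar
step size. No gradients or optimizer state are stored, which is why
the memory footprint matches inference. A toy sketch under stated
assumptions (toy quadratic loss, illustrative names; see the paper,
https://github.com/princeton-nlp/MeZO, for the real method):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 4
    static float theta[N] = {1, 2, 3, 4};  /* "model weights" */

    static float loss(const float *w) {    /* stand-in objective */
        float l = 0;
        for (int i = 0; i < N; i++) l += (w[i] - 0.5f) * (w[i] - 0.5f);
        return l;
    }

    /* Add scale*z to the weights, with z regenerated from the seed,
       so the same noise vector never has to be stored. */
    static void perturb(float *w, unsigned seed, float scale) {
        srand(seed);
        for (int i = 0; i < N; i++) {
            float z = (float)rand() / RAND_MAX * 2.0f - 1.0f;
            w[i] += scale * z;
        }
    }

    static void mezo_step(unsigned seed, float eps, float lr) {
        perturb(theta, seed, +eps);                /* theta + eps*z   */
        float lplus = loss(theta);
        perturb(theta, seed, -2 * eps);            /* theta - eps*z   */
        float lminus = loss(theta);
        perturb(theta, seed, +eps);                /* restore theta   */
        float g = (lplus - lminus) / (2 * eps);    /* projected grad  */
        perturb(theta, seed, -lr * g);             /* theta -= lr*g*z */
    }

    int main(void) {
        for (unsigned step = 0; step < 100; step++)
            mezo_step(step + 1, 1e-3f, 0.05f);
        printf("loss after 100 steps: %f\n", loss(theta));
        return 0;
    }

This also makes gliptic's caveat concrete: each step costs two full
forward passes and estimates only a single scalar per noise vector,
so many steps are needed to converge.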
| isoprophlex wrote:
| If you go through the drudgery of integrating with all the
| existing channels (mail, Teams, Discord, Slack, traditional
| social media, texts, ...), such rapid finetuning speeds could
| enable an always-up-to-date personality construct, modeled on
| you.
| 
| Which is my personal holy grail towards making myself
| unnecessary; it'd be amazing to be doing some light gardening
| while the bot handles my coworkers ;)
| [deleted]
| valval wrote:
| I think more importantly, what would the fine-tuning routine look
| like? It's a non-trivial task to dump all of your personal data
| into any LLM architecture.
| rvz wrote:
| > ggml and llama.cpp are such a good platform for local LLMs,
| having some financial backing to support development is brilliant
| 
| The problem is, this financial backing and support is via VCs,
| who will steer the project to close it all up again.
| 
| > I want a local ChatGPT fine tuned on my personal data running
| on my own device, not in the cloud. Ideally open source too,
| llama.cpp is looking like the best bet to achieve that!
| 
| I think you are setting yourself up for disappointment in the
| future.
| ulchar wrote:
| > The problem is, this financial backing and support is via VCs,
| who will steer the project to close it all up again.
| 
| How exactly could they meaningfully do that? Genuine question.
| The issue with the OpenAI business model is that the
| collaboration within academia and open source circles is creating
| innovations that are on track to out-pace the closed source
| approach. Does OpenAI have the pockets to buy the open source
| collaborators and researchers?
| 
| I'm truly cynical about many aspects of the tech industry, but
| this is one of those fights that open source could win for the
| betterment of everybody.
| maxilevi wrote:
| I agree with the spirit, but saying that open source is on track
| to outpace OpenAI in innovation is just not true. Open source
| models are being compared to GPT-3.5; none yet even get close to
| GPT-4 quality, and they finished that last year.
| jart wrote:
| We're basically surviving off the scraps companies like Facebook
| have been tossing off the table, like LLaMA. The fact that we're
| even allowed and able to use these things ourselves, at all, is a
| tremendous victory.
| maxilevi wrote:
| I agree
| yyyk wrote:
| I've been going on and on about this on HN: Open source can win
| this fight, but I think OSS is overconfident. We need to be clear
| that there are serious challenges ahead - ClosedAI and other
| corporations also have a plan, a plan that has good chances
| unless properly countered:
| 
| A) Embed the OpenAI (etc.) API everywhere. Make embedding easy
| and trivial. First, to gain a small API/install moat (user/dev:
| 'why install an OSS model when OpenAI is already available with
| an OS API?'). If it's easy to use OpenAI but not open source,
| they have an advantage. Second, to gain brand. But more
| importantly:
| 
| B) Gain a technical moat by having a permanent data advantage
| using the existing install base (see above). Retune constantly to
| keep it.
| 
| C) Combine with existing proprietary data stores to increase the
| local data advantage (e.g. easy access to all your Office
| 365/GSuite documents, while OSS gets the scary permission
| prompts).
| 
| D) Combine with existing proprietary moats to mutually reinforce.
| 
| E) Use selective copyright enforcement to increase the data
| advantage.
| 
| F) Lobby legislators for limits that make competition (open or
| closed source) way harder.
| 
| TL;DR: OSS is probably catching up on algorithms. When it comes
| to good data and good integrations, OSS is far behind and not yet
| catching up. It's been argued that OpenAI's entire performance
| advantage is due to having better data alone, and they intend to
| keep that advantage.
| ljlolel wrote:
| Don't forget chip shortages. That's all centralized up through
| Nvidia, TSMC, and ASML.
| ignoramous wrote:
| > _The problem is, this financial backing and support is via VCs,
| who will steer the project to close it all up again._
| 
| A matter of _when_, not _if_. I mean, the website itself makes
| that much clear:
| 
|     The ggml way
|     ...
|     Open Core
|     The library and related projects are freely available under
|     the MIT license... In the future we may choose to develop
|     extensions that are licensed for commercial use
|     Explore and have fun!
|     ...
|     Contributors are encouraged to try crazy ideas, build wild
|     demos, and push the edge of what's possible
| 
| So, like many other "open core" devtools out there, they'd like
| to have their cake and eat it too. And they might just as well,
| like others before them.
| 
| Won't blame anyone here though; because clearly, if you're as
| good as Georgi Gerganov, why do it for free?
| jdonaldson wrote:
| > I think you are setting yourself up for disappointment in the
| future.
| 
| Why would you say that?
| behnamoh wrote:
| I wonder if ClosedAI and other companies use the findings of the
| open source community in their products. For example, do they use
| QLoRA to reduce the costs of training and inference? Do they
| quantize their models to serve non-subscribing consumers?
| jmoss20 wrote:
| Quantization is hardly a "finding of the open source community".
| (IIRC the first TPU was int8! Though the tradition is much older
| than that.)
| danielbln wrote:
| Not disagreeing with your points, but saying "ClosedAI" is about
| as clever as writing M$ for Microsoft back in the day, which is
| to say not very.
| rafark wrote:
| I think it's ironic that M$ made ClosedAI.
| replygirl wrote:
| Pedantic, but that's not irony.
| rafark wrote:
| Why do you think so? According to the dictionary, ironic can be
| something paradoxical or weird.
| Miraste wrote:
| M$ is a silly way to call Microsoft greedy. ClosedAI is somewhat
| better because OpenAI's very name is a bald-faced lie, and they
| should be called on it. Are there more elegant ways to do that?
| Sure, but every time I see Altman in the news crying crocodile
| tears about the "dangers" of open anything, I think we need all
| the forms of opposition we can find.
| tanseydavid wrote:
| It is a colloquial spelling and they earned it, a long time ago.
| loa_in_ wrote:
| I'd say saying M$ makes it harder for M$ to find out I'm talking
| about them in the indexed web, because it's more ambiguous -
| that's all I need to know.
| coolspot wrote:
| If we are talking about indexing, writing M$ is easier to find in
| an index because it is such a unique token. MS can mean many
| things (e.g. Miss); M$ is less ambiguous.
| smoldesu wrote:
| Yeah, I think it feigns meaningful criticism. The "Sleepy
| Joe"-tier insults are ad-hominem enough that I don't try to
| respond.
| ignoramous wrote:
| Can LLaMA be used for commercial purposes, though (might that
| limit external contributors)? I believe FOSS alternatives like
| Databricks' _Dolly_, Together's _RedPajama_, and EleutherAI's
| _GPT NeoX_ (et al.) are where the most progress is likely to be
| at.
| samwillis wrote:
| Although llama.cpp started with the LLaMA model, it now supports
| many others.
| okhuman wrote:
| This is a very good question, and it will be interesting to see
| how this develops. Thanks for posting the alternatives list.
| detrites wrote:
| It may also be worth mentioning the UAE's Falcon, which
| apparently performs well (leads?). Falcon recently had its
| royalty-based commercial license modified to be fully open for
| free private and commercial use, via Apache 2.0:
| https://falconllm.tii.ae/
| chaxor wrote:
| Why is commercial use necessary to run local models?
| ignoramous wrote:
| It isn't, but such models may eventually lag behind the FOSS
| ones.
| digitallyfree wrote:
| OpenLLaMA will be released soon and it's 100% compatible with the
| original LLaMA.
| 
| https://github.com/openlm-research/open_llama
| sva_ wrote:
| Really impressive work, and I've asked this before, but is it
| really a good thing to have basically the whole library in a
| single 16k-line file?
| CamperBob2 wrote:
| Yes. Next question
| regularfry wrote:
| It makes syncing between llama.cpp, whisper.cpp, and ggml itself
| quite straightforward.
| 
| I think the lesson here is that this setup has enabled some very
| high-speed project evolution or, at least, not got in its way. If
| that is surprising and you were expecting downsides, a) why; and
| b) where did they go?
| graycat wrote:
| WOW! They are using BFGS! Haven't heard of that in decades! Had
| to think a little: Yup, the full name is
| Broyden-Fletcher-Goldfarb-Shanno, for iterative unconstrained
| non-linear optimization!
| 
| Some of the earlier descriptions of the optimization used in AI
| _learning_ were about steepest descent, that is, just find the
| gradient of the function you are trying to minimize and move some
| distance in that direction. Just using the gradient was
| concerning, since that method tends to _zig-zag_: after, say, 100
| iterations, the distance moved across those 100 iterations might
| be several times farther than the distance from the starting
| point to the final one. You can visualize this _zig-zag_ already
| in just two dimensions - say, following a river that curves down
| a valley the river cut over a million years or so, that is, a
| valley with steep sides. Then gradient descent may keep crossing
| the river, going maybe 10 feet for each foot of progress
| downstream!
| 
| Right, if you're just trying to go downhill on a tilted flat
| plane, then the gradient points in the direction of steepest
| descent on the plane, and gradient descent goes all the way
| downhill in just one iteration.
| 
| In even moderately challenging problems, BFGS can be a big
| improvement.
| doxeddaily wrote:
| This scratches my itch for no dependencies.
| s1k3s wrote:
| I'm out of the loop on this entire thing, so call me an idiot if
| I get it wrong. Isn't this whole movement based on a model leak
| from Meta? Aren't licenses involved that prevent it from going
| commercial?
| detrites wrote:
| GGML is essentially a library of Lego pieces that can be put
| together to work with many LLMs or other types of ML models.
| 
| Meta's leaked model is one to which GGML has been applied for
| fast, local inference.
| dimfeld wrote:
| Only the weights themselves. There have been other models since
| then built on the same LLaMA architecture, but trained from
| scratch, so they're safe for commercial use. The GGML code and
| related projects (llama.cpp and so on) also support some other
| model types now, such as Mosaic's MPT series.
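For orientation on graycat's BFGS comment above: BFGS maintains a
running approximation H_k of the inverse Hessian, built only from
gradient differences between iterations, and the textbook update
(stated here for reference; ggml's optimizer code appears to ship
the limited-memory variant, L-BFGS) is

    s_k = x_{k+1} - x_k, \qquad
    y_k = \nabla f_{k+1} - \nabla f_k, \qquad
    \rho_k = \frac{1}{y_k^{\top} s_k}

    H_{k+1} = \left(I - \rho_k\, s_k y_k^{\top}\right) H_k
              \left(I - \rho_k\, y_k s_k^{\top}\right)
              + \rho_k\, s_k s_k^{\top}

The search direction -H_k \nabla f_k folds in this curvature
information, which is what damps the river-crossing zig-zag of plain
steepest descent that graycat describes.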
| okhuman wrote:
| The establishment of ggml.ai - a company focused on ggml and
| llama.cpp, the most innovative and exciting platform to come
| along for local LLMs - on an Open Core model is just laziness.
| 
| Just because you can (and have the connections) doesn't mean you
| should. It's a sad state of OSS when the best and brightest
| developers/founders reach for antiquated models.
| 
| Maybe we should take up a new rule in OSS communities that says
| you must release your CORE software as MIT at the same time you
| plan to go Open Core (and no sooner).
| 
| Why should OSS communities take on your product-market fit?!
| wmf wrote:
| This looks off-topic, since GGML has not announced anything about
| open core and their software is already MIT.
| 
| More generally, if you want to take away somebody's business
| model you need to provide one that works. It isn't easy.
| okhuman wrote:
| Agreed with you 100% - it's not easy. Sometimes I just wish
| someone as talented as Georgi would innovate not just on the core
| tech side but bring that same tenacity to the licensing side, in
| a way that aligns incentives better and tries out something new.
| And that the community would have his back if some new approach
| failed, no matter what.
| aryamaan wrote:
| Could someone talk at a high level about how one starts
| contributing to these kinds of problems?
| 
| For the people who build solutions for data handling - ranging
| from CRUD to building highly scalable solutions - these things
| are alien concepts. (Or maybe I am just talking about myself.)
| danieljanes wrote:
| Does GGML support training on the edge? We're especially
| interested in training support for Android+iOS.
| [deleted]
| svantana wrote:
| Yes - look at the file tests/test-opt.c. Unfortunately there's
| almost no documentation about its training/autodiff.
| KronisLV wrote:
| Just today I finished a blog post (also my latest submission,
| felt like it could be useful to some) about how to get something
| like this working - a bundle of something to run models, as well
| as a web UI for easier interaction. In my case that was
| koboldcpp, which can run GGML models both on the CPU (with
| OpenBLAS) and on the GPU (with CLBlast). Thanks to Hugging Face,
| getting Metharme, WizardLM or other models is also extremely
| easy, and the 4-bit quantized ones provide decent performance
| even on commodity hardware!
| 
| I tested it out both locally (6c/12t CPU) and on a Hetzner CPX41
| instance (8 AMD cores, 16 GB of RAM, no GPU), the latter of which
| costs about 25 EUR per month and can still generate decent
| responses in less than half a minute, my local machine needing
| approximately double that time. While not quite as good as one
| might expect (decent response times mean maxing out the CPU for a
| single request, if you don't have a compatible GPU with enough
| VRAM), the technology is definitely at a point where it's
| possible for it to make people's lives easier in select use cases
| with some supervision (e.g. customer support).
| 
| What an interesting time to be alive. I wonder where we'll be in
| a decade.
| b33j0r wrote:
| I wish everyone in tech had your perspective. That is what I see
| as well.
| 
| There is a lull right now between GPT-4 and GPT-5 (literally and
| metaphorically). Consumer models are plateauing around 40B
| parameters on a barely-reasonable RTX 3090 (ggml made this
| possible).
| 
| Now is the time to launch your ideas, all!
| digitallyfree wrote:
| The fact that this is _commodity hardware_ makes ggml extremely
| impressive and puts the tech in the hands of everyone. I recently
| reported my experience running 7B llama.cpp on a 15-year-old Core
| 2 Quad [1] - when that machine came out it was a completely
| different world, and I certainly never imagined what AI would
| look like today. This was around when the first iPhone was
| released and everyone began talking about how smartphones would
| become the next big thing. We saw what happened 15 years later...
| 
| Today, with the new k-quants, users are reporting that 30B models
| are working with 2-bit quantization on 16GB CPUs and GPUs [2].
| That's enabling access for millions of consumers, and the
| optimizations will only improve from there.
| 
| [1]
| https://old.reddit.com/r/LocalLLaMA/comments/13q6hu8/7b_perf...
| 
| [2] https://github.com/ggerganov/llama.cpp/pull/1684,
| https://old.reddit.com/r/LocalLLaMA/comments/141bdll/moneros...
| c_o_n_v_e_x wrote:
| What do you mean by commodity hardware? Single-server,
| single-CPU-socket x86/ARM boxes? Anything that does not have a
| GPU?
| [deleted]
| SparkyMcUnicorn wrote:
| Seems like serverless is the way to go for fast output while
| remaining inexpensive.
| 
| e.g.
| 
| https://replicate.com/stability-ai/stablelm-tuned-alpha-7b
| 
| https://github.com/runpod/serverless-workers/tree/main/worke...
| 
| https://modal.com/docs/guide/ex/falcon_gptq
| tikkun wrote:
| I think that's true if you're doing minimal usage / low
| utilization; otherwise a dedicated instance will be cheaper.
| mliker wrote:
| Congrats! I was just listening to your Changelog interview from
| months ago, in which you said you were going to move on from this
| after you brushed up the code a bit, but it seems the momentum is
| too great. Glad to see you carrying this amazing project(s)
| forward!
| FailMore wrote:
| Remember
| kretaceous wrote:
| Georgi's Twitter announcement:
| https://twitter.com/ggerganov/status/1666120568993730561
| jgrahamc wrote:
| Cool. I've just started sponsoring him on GitHub.
| FailMore wrote:
| Commenting to remember. Looks good.
___________________________________________________________________
(page generated 2023-06-06 23:00 UTC)