[HN Gopher] Stable Diffusion in C/C++
___________________________________________________________________

Stable Diffusion in C/C++

Author : kikalo00
Score  : 252 points
Date   : 2023-08-19 11:26 UTC (11 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| naillo wrote:
| Awesome that they implemented CLIP as well. That alone could be
| cool to extract and compile as a wasm implementation.
|
| Edit: Seems like someone already has
| https://github.com/monatis/clip.cpp :) Now to wasmify it

| KaoruAoiShiho wrote:
| Speaking of CLIP, I'm always troubled that the next CLIP might
| not get released, as both OpenAI and Google are shifting into
| competition mode. It's sad to think there might be a more
| advanced version of CLIP already, but sitting in a secret vault
| somewhere.
|
| Edit: I'm not referring to a CLIP-2, but to any advance on the
| same level of importance as CLIP.

| GaggiX wrote:
| The biggest CLIP models we know of are open source.
|
| If a company has a bigger CLIP model, they haven't even
| reported it.
|
| Also, OpenAI already had, for a while, a proprietary CLIP model
| that was bigger than any other model available: the CLIP-H used
| by DALL-E 2.

| snordgren wrote:
| As someone who is out of the loop but could use high-quality
| image embeddings right now: what's the best CLIP model right
| now?

| astrange wrote:
| SDXL uses OpenCLIP, and then OpenAI CLIP as a backup, basically
| to allow it to spell words properly, but I think you could
| replace the second one.

| speedgoose wrote:
| Stable Diffusion switched to OpenCLIP for Stable Diffusion 2,
| but it looks like they went back to CLIP for the XL version.
|
| People complained about OpenCLIP not being as good. Hopefully
| we can have a better and open CLIP model eventually.

| evolveyourmind wrote:
| Any benchmarks?

| nre wrote:
| Some people have timed it here; it looks like it's taking
| 15-20 s/it (depending on quant and hardware).
|
| https://github.com/leejet/stable-diffusion.cpp/issues/1

| Lerc wrote:
| Looking at the example outputs at the different quantization
| levels, I'm quite impressed. The change from f16 to q8_0 seems
| to be more a change in direction than a loss of quality. The
| q5_1 result seems indistinguishable from the q8_0.
|
| So relative to the higher-precision models you lose
| determinism, but the results are potentially quite usable.

| jakearmitage wrote:
| There's just something special about these C/C++
| implementations of AI stuff. They feel so clean and
| straightforward, and they make the entire field of AI feel
| tangible and learnable.
|
| Is that because Python's ecosystem is so messy?

| snordgren wrote:
| Rewrites tend to improve code quality, and replacing
| dependencies with custom-tailored code that does just what you
| need also improves code quality.
|
| And while the Python version uses C and C++ code for speed,
| this is all just one language.
|
| A trifecta of factors enabling clean code.

| BrutalCoding wrote:
| Saw this repo today, fetched it, built a .dylib (Mac) and used
| Dart's ffigen tooling to generate the bindings from the
| provided header file.
|
| I'm just experimenting with it together with Flutter. FFI
| because I'm trying to avoid spawning a subprocess.
|
| Fast forward: ended up with a severe headache and a broken app.
| Will continue my attempt tmr with a fresh mind haha
|
| This repo is great though, had it up and running within 10 min
| on my M1 (using f16). Thanks for sharing!
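[Ed. note: apropos of Lerc's f16/q8_0/q5_1 comparison upthread, a
minimal sketch of block quantization in the style of ggml's Q8_0
format (32 values per block, one scale each). The real ggml code
stores the scale as fp16 and is heavily optimized; this sketch is
only illustrative.]

    #include <math.h>
    #include <stdint.h>

    #define QK 32  /* values per block, as in ggml's Q8_0 */

    typedef struct {
        float  d;        /* per-block scale (ggml stores fp16) */
        int8_t qs[QK];   /* quantized values */
    } block_q8;

    /* Quantize QK floats into one block: scale by max |x| so
     * every value fits in an int8. */
    static void quantize_block_q8(const float *x, block_q8 *b) {
        float amax = 0.0f;
        for (int i = 0; i < QK; i++) {
            float ax = fabsf(x[i]);
            if (ax > amax) amax = ax;
        }
        b->d = amax / 127.0f;
        const float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
        for (int i = 0; i < QK; i++)
            b->qs[i] = (int8_t)roundf(x[i] * id);
    }

[Each weight then costs roughly 8 bits plus a shared scale instead
of 16 or 32 bits, which is where the disk/RAM savings come from;
the per-block rounding is also consistent with q8_0 output
drifting in "direction" from f16 rather than simply degrading.]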
| waynecochran wrote:
| Nice to see ML folks getting weaned off of Python and using a
| language that can optimally exploit the underlying hardware and
| doesn't require setting up a specialized environment to build
| and run.

| xvilka wrote:
| Do you mean the Julia language?

| A-Train wrote:
| Amen.

| aimor wrote:
| I really appreciate the people doing this work. It's the only
| way I've run these models without any headaches. The difference
| is so stark: even with CUDA and Linux it's bad, and with AMD
| and Windows it's miserable. I'm pretty sure it's not just me...

| galangalalgol wrote:
| As long as we are language trolling: why would anyone start a
| greenfield project like this in C++ these days? The Android,
| Windows, Firefox, and now Chrome projects have all begun to
| shift towards Rust, and in the case of Android and Firefox have
| written significant amounts of the project in Rust. Migrating
| an existing project like that is difficult; the Chrome team in
| particular lamented the difficulty. But starting a new project?
| If you have a team familiar with performant C++, the speed bump
| of starting a greenfield project in Rust is negligible, and the
| ergonomic improvements in the build system and the language
| itself will make up for it in any project that takes more than
| a few months. For that speed bump you get memory safety and
| freedom from data races, far beyond what any stack of C++
| analysis tools could ever provide, with a tiny fraction of the
| unit tests you'd write in C++. And you lose no performance.

| FoodWThrow wrote:
| Rust is _great_ when you know what you're building. That
| qualifier encompasses quite a lot of the software space, but
| not all of it, and I would argue not even the majority of it.
|
| If you don't know what you are doing, if you are exploring
| ideas, Rust will just get in the way. At some point you will
| realize you need to adjust lifetimes, and that will require you
| to touch a non-trivial amount of your code base. If you need to
| do that multiple times, friction will overwhelm your desire to
| code.
|
| I have a pet theory that the people who find Rust intuitive and
| fun are the people who are working on well-beaten paths; Rust
| is almost boring at that, which is a good thing. And the people
| who find that Rust gets in their way are the people who like to
| experiment with their solutions, because there aren't any set,
| trusted solutions within their problem space -- and even if
| there are, they like to approach the problem on their own, for
| better or worse.
|
| In any case:
|
| > why would anyone start a greenfield project like this in
| C++ these days?
|
| The video game industry can single-handedly carry C++ on its
| back, kicking and screaming, if need be. Rust is uniquely unfit
| for writing gameplay code due to game development's iterative
| nature. Using scripting languages doesn't cut it either,
| because often the slower designer-made scripts need to be
| converted to C++ by a programmer, which pulls the crazy
| reference hell of the game state into C++ land.
|
| I would say Rust is OK for _engine_ level features -- those
| don't change that often, and the requirements are usually well
| understood. But that introduces a cadence mismatch between
| different systems too, so there is a cost there as well. But
| for gameplay? There's a reason why many Rust-based game engines
| use a crazy amount of unsafe Rust to make their ECS. Just not a
| good fit.
|
| And of course, there are the consoles, where Sony seems to have
| a political reason for not supporting Rust for non-first-party
| studios. I have no idea what they are thinking, honestly.

| mnrlt wrote:
| C++ has a standard, multiple competing implementations and a
| largely drama-free community.
|
| Does CUDA even have Rust bindings, and if so, are they on the
| same level as the C++ ones?
|
| What do you mean by "the windows projects" that shift towards
| Rust?

| galangalalgol wrote:
| MS has started implementing pieces of Windows in Rust. If you
| have Windows 11, you are running Rust. The CUDA bindings are
| good for ML, but missing for cuFFT and similar. There are
| people working on better CUDA support, but there are even more
| people working on vendor-agnostic GPGPU using SPIR-V and
| WebGPU. It isn't there yet. Right now you are mostly left to
| write your own bindings unless you are doing ML or BLAS.
|
| Edit: I can't argue about the drama part. The competing
| compilers will get there: a couple of GCC frontends are in the
| works, plus Cranelift as a competing backend to LLVM, and full
| self-hosting. There is also miri, I guess, to emit C? People
| use that to get Rust on the C64 or other niche processors.

| pjmlp wrote:
| Yes, they started, yet there is enough C++ to rewrite in the 30
| years of Windows NT history.
|
| Meanwhile, the Visual Studio team released better tooling for
| Unreal in Visual C++.

| Const-me wrote:
| > why would anyone start a greenfield project like this in
| C++ these days?
|
| TL;DR: quite often, using C++ instead of Rust saves software
| development costs.
|
| Some software needs to consume many external APIs. Examples on
| Windows: Direct3D, Direct2D, DirectWrite, Media Foundation.
| Examples on Linux: V4L2, ALSA, DRM/KMS, GLES. These things are
| huge in terms of API surface. Choose Rust, and you're going to
| need to write and support a non-trivial amount of boilerplate
| code for the interop. Choose C++ (on Linux, C is good too) and
| that code is gone; you only need the well-documented and well-
| supported APIs supplied by the OS vendors.
|
| Similarly, some software needs to integrate with other systems
| or libraries written in C or C++. An example often relevant to
| HPC applications is Eigen. Other related things, game console
| SDKs and game engines, don't support Rust.
|
| For the project being discussed here, GGML, the implementation
| needs vector intrinsics for optimal performance. Technically
| Rust has the support, but in practice Intel and ARM only
| support them for C and C++. And it's not just CPU vendors: when
| using C or C++ there are useful relevant resources -- articles,
| blogs, and Stack Overflow. These things help a lot in practice.
| I don't program in Rust, but I do program in C# in addition to
| C++; technically most vector intrinsics are available in the
| current version of C#, but for this reason they are much harder
| to use from C#.
|
| All current C and C++ compilers support OpenMP for parallelism.
| While not a silver bullet, and not available on all platforms
| supported by C or C++, some software benefits tremendously from
| that thing.
|
| Finally, it's easier to find good C++ developers than good Rust
| developers.

| galangalalgol wrote:
| There are existing supported bindings for Direct3D from MS, as
| they themselves are migrating. GLES and ggml also have
| supported bindings. I like nalgebra + rustfft better than Eigen
| now.
| Nalgebra still isn't quite as performant on small matrices
| until const generic eval stabilizes, but it is close enough for
| 6x6 stuff that it is in the noise. Rustfft is even faster than
| FFTW. Rust has intrinsic support on par with Clang and GCC, and
| the autovectorizer uses whatever LLVM knows about, so again
| equivalent to Clang.
|
| On the last point, I will again assert that a good C++
| developer is just a good Rust developer minus a month of
| ramp-up, which you'll get back from not having to fight
| combinations of automake, cmake, conan, vcpkg, meson, bazel and
| hunter.

| Const-me wrote:
| I'll be very surprised if MS ever supports Rust bindings for
| Media Foundation. That thing is COM-based, requires users to
| implement COM interfaces and not just consume them, and is
| heavily multithreaded.
|
| About SIMD: automatic vectorizers are very limited. I was
| talking about manually vectorized code using intrinsics.
|
| I've been programming C++ for a living for decades now. I tried
| to learn Rust but failed. I have the impression the language is
| extremely hard to use.

| galangalalgol wrote:
| Not sure how fully featured it is:
| https://lib.rs/crates/mmf
|
| Yes, Rust directly supports modern intrinsics; that is what
| rustfft, for instance, uses. I try to stick with autovec
| myself, because my needs are simple enough that a couple of
| tweaks usually gets me close to hand-rolled speedups on both
| AVX-512 and AArch64. But for more complicated stuff, yeah, Rust
| seems to be keeping up. Some intrinsics are still only in
| nightly, but plenty of major projects use nightly for
| production; it is quite stable, and with a good pipeline you'll
| be fine.
|
| I've written C++ since ~94, and mostly C++17 since it came out
| -- about a quarter of a century of that getting paid for it. I
| never liked or used exceptions or RTTI, and generally used a
| functional style except for preallocating memory for
| performance. I think those habits might have made the
| transition a little easier, but the people on my team who had
| used a more OOP style and full C++ don't seem to have adapted
| any more slowly, if at all. I struggled for years to
| internalize Rust at home, until I just jumped in at work by
| declaring that the project I lead would be in Rust. I have had
| absolutely no regrets. It really isn't as bad a learning curve
| as C++. But we learned C++ one revision at a time. Also, much
| like C++, Rust has bits you mostly only need to know for
| writing libraries, so when getting started you can put those
| things to the side at first.

| theLiminator wrote:
| Curious how long you tried to learn Rust? I've found C++ much
| harder to learn (coming from a Python/Scala background).
|
| Is it just a case of you forgetting how hard C++ was to learn?

| pjmlp wrote:
| No, there aren't, unless you mean Rust/WinRT demos with
| community bindings.
|
| The Agility SDK and XDK have zero Rust support. If it isn't in
| the Agility SDK and XDK, it isn't official.
|
| Hardly the same as the official Swift bindings to Metal,
| written in Objective-C and C++14.

| api wrote:
| It's interesting to me that my CPU can run some of these things
| in quantized form almost as fast as the GPU. Has the whole
| thing been all about memory bandwidth all along?
|
| In addition to compute, the GPU architecture is one that
| somewhat colocates working memory alongside compute: units have
| local memories that sync with global memory. Is that a big part
| of why GPUs are so good for this?
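[Ed. note: a rough back-of-envelope for the bandwidth question
above, in the spirit of the replies that follow. At batch size 1,
generating one LLM token has to stream essentially all of the
weights through the cores once, so memory bandwidth sets a hard
ceiling on tokens/s. All numbers below are illustrative
assumptions, not benchmarks.]

    #include <stdio.h>

    int main(void) {
        /* Illustrative numbers only. */
        const double weights_gb = 7.0;   /* ~7B params at 8-bit */
        const double cpu_bw     = 50.0;  /* GB/s, dual-channel DDR */
        const double gpu_bw     = 900.0; /* GB/s, GDDR/HBM card */

        /* Batch-1 decoding reads ~all weights per token, so
         * tokens/s <= bandwidth / model size. */
        printf("CPU ceiling: ~%.1f tok/s\n", cpu_bw / weights_gb);
        printf("GPU ceiling: ~%.1f tok/s\n", gpu_bw / weights_gb);
        return 0;
    }

[Compute barely enters that estimate, which is why quantization
helps CPUs twice over: fewer bytes to stream per token, and more
of the model resident in cache. As noted below, this logic fits
LLM decoding much better than Stable Diffusion's compute-heavy
UNet.]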
| brucethemoose2 wrote:
| > Has the whole thing been all about memory bandwidth all
| along
|
| Yeah, sort of.
|
| LLMs like LLaMA at a batch size of 1 are hilariously bandwidth
| bound.
|
| Stable Diffusion less so. It's still bandwidth heavy on GPUs,
| but compute is much more of a bottleneck.

| intelVISA wrote:
| Wasn't Python originally designed as a language to teach
| children how to code? Weird to see so many otherwise
| intelligent folks latch onto it.
|
| It really doesn't have any redeeming characteristics vs. Common
| Lisp or Haskell to warrant this bizarre popularity, imo.

| mdp2021 wrote:
| > _Wasn't Python originally designed as a language to teach
| children how to code_
|
| I think it would be very confusing for a child to start with a
| language so far away from low-level logic.
|
| ...And some people said BASIC was evil. At least what it is
| doing looks plain and direct.

| tester756 wrote:
| > I think it would be very confusing for a child to start
| with a language so far away from low-level logic.
|
| Why?
|
| I started with C++, and when they showed me C# I instantly fell
| in love, cuz I didn't have to deal with unnecessary complexity
| and annoyances and could focus on pure programming, algorithms,
| etc.

| mdp2021 wrote:
| Yes, Tester,
|
| but you are confirming my point :) ...You _started_ with C++,
| then went to C#...

| tester756 wrote:
| I started with C++ and switched near the beginning, so there
| wasn't much low-level knowledge involved, nor anything beyond
| beginner-level C++ concepts.
|
| Both high-to-low and low-to-high have some advantages, but it's
| not like one is always better than the other.
|
| High-to-low allows you to write useful stuff earlier --
| programs that do something, GUI, web, whatever -- but at the
| cost of understanding the internals / what's under the hood.

| segfaultbuserr wrote:
| > _I think it would be very confusing for a child to start
| with a language so far away from low-level logic._
|
| It depends on the person.
|
| For some, it would be very frustrating to start with a language
| so close to the implementation details and so far away from
| what you want to do. It's very possible that someone might have
| long lost the motivation before they can do anything
| non-trivial.
|
| I started from Python, to C, to assembly, to 4-layer circuit
| boards. Whenever I went a level deeper, it felt like opening up
| the inner workings of a black box that I normally only interact
| with via the pushbuttons on its front panel, but whose function
| I was otherwise roughly aware of.
|
| On the other hand, much of my childhood was spent tinkering
| with PCs and servers, including hosting websites and compiling
| packages from source, so I was already well aware of the basic
| concepts in computing before I started programming. So,
| top-down and bottom-up are both absolutely workable, under the
| right circumstances.

| mnrlt wrote:
| ABC, the predecessor from which Python took many syntax
| features, was. I wonder if Python also took a lot of the ABC
| implementation, given that it is still copyright CWI.
|
| I agree that its popularity is very odd, but academics take
| what they are given when attending fully paid conferences (aka
| vacations).

| highspeedbus wrote:
| What's the term for an ad hominem fallacy directed at
| programming languages? Asked the almighty chat and got this new
| term:
|
| "Code Persona Attack"
|
| Python is fine.

| astrange wrote:
| This is a Y Combinator site; the traditional term is Blub.
|
| And they're right.
| Python is not a well-designed programming language: it has
| exceptions and doesn't have value types, so that's two strikes
| against it.
|
| Of course, C++ isn't either.

| gumby wrote:
| It's a bummer there's so little work on the training side in
| C++.
|
| Especially since the Python training systems are mostly calls
| into libraries written in C++!

| pjmlp wrote:
| Yeah, and since C++17 the language is already quite productive
| for scripting-like workflows; the missing piece of the puzzle
| is that there are too few C++ REPLs around, ROOT/CINT being one
| of the few well-known ones.

| danybittel wrote:
| Since when does C++ optimally exploit the underlying hardware?
| It has no vector instructions, does not run on the GPU, and is
| arguably too hard to make multithreaded. Which leaves you with
| about 0.5% of a current PC's performance.

| jcelerier wrote:
| > does not run on the GPU
|
| Both CUDA and the Metal shading language are C++, as is OpenCL
| since 2.0 (https://www.khronos.org/opencl/), as is AMD ROCm's
| HIP (https://github.com/ROCm-Developer-Tools/HIP), as is SYCL
| (https://www.khronos.org/sycl/). C++ is pretty much the
| language that runs _most_ on GPUs.
|
| > no vector instructions,
|
| There are a thousand different possibilities for SIMD in C++,
| from #pragma omp simd, to libs such as std::experimental::simd
| (https://en.cppreference.com/w/cpp/experimental/simd/simd),
| Eve (https://github.com/jfalcou/eve), Highway
| (https://github.com/google/highway), Vc
| (https://github.com/VcDevel/Vc)...

| pjmlp wrote:
| When compared against Python, more than enough.
|
| C++ is one of the supported CUDA languages; even standard C++17
| runs just fine on the GPU.
|
| Metal uses C++14 alongside some extensions.

| waynecochran wrote:
| Vector types / instructions would be nice. The C++20 STL
| algorithms are very friendly to vectorization, with the various
| parallel policies (e.g. std::execution::unsequenced_policy)
| that open up your code to being vectorized. Wonderful libs like
| Eigen handle a lot of my numeric needs for linear algebra. I
| think you are forgetting that CUDA is C/C++.

| Someone wrote:
| > Vector types / instructions would be nice
|
| It's technically not a C++ feature, but both GCC
| (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html)
| and Clang
| (https://releases.llvm.org/3.1/tools/clang/docs/LanguageExten...)
| have vector types, and Clang even supports the GCC way of
| writing them, so it gets pretty close.

| astrange wrote:
| Those are traditionally dangerous since they tend to compile
| poorly; not as bad as autovectorization, but not as good as
| just writing in assembly. And since vectorization is
| platform-dependent anyway (because it's so different across
| platforms), assembly really isn't nearly as bad as it sounds.
|
| Though it's certainly gotten better. The reason people push
| those extensions is that they're written by compiler authors,
| who don't want to hear that their compiler doesn't work.
|
| Some of the reason for this is that C doesn't let you specify
| memory aliasing as precisely as you'd want to. Fortran is
| better about this.
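[Ed. note: a minimal example of the GCC/Clang vector extensions
discussed above. The fixed 8-lane type maps naturally onto AVX on
x86; on narrower targets the compiler has to split it, which is
part of the "compiles poorly" risk the parent mentions. Compiles
with both GCC and Clang.]

    /* GCC/Clang vector extensions: SIMD without intrinsics. */
    typedef float v8sf __attribute__((vector_size(32)));

    /* a*b + c, element-wise: ordinary operators work on
     * vector types. */
    v8sf fma8(v8sf a, v8sf b, v8sf c) {
        return a * b + c;
    }

    /* Lanes are indexable like an array. */
    float hsum8(v8sf v) {
        float s = 0.0f;
        for (int i = 0; i < 8; i++) s += v[i];
        return s;
    }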
| codethief wrote:
| That's a rather odd comparison to make. First of all, OP, like
| llama.cpp, doesn't use the GPU -- in contrast to most Python ML
| code. It's not hard to write Python code that "optimally
| exploits" the GPU. You might call the GPU a "specialized
| environment to build and run", but it's arguably much better
| suited to the problem.
|
| Second, OP, like llama.cpp, produced efficient and highly
| specialized code _after_ it was clear that the model being
| specialized for (Stable Diffusion / LLaMA / ...) works well.
| Where Python shines, though, is the prototyping phase, when you
| have yet to find an appropriate model. We have yet to see this
| sort of easy & convenient prototyping in C++.
|
| Now, this is not to take anything away from the fantastic work
| that's being done by the llama.cpp people (among whom I also
| count OP) in the "ML on a CPU" space. But the problems being
| solved are entirely different.

| PcChip wrote:
| > You might call the GPU a "specialized environment to build
| and run" but it's arguably much better suited to the problem.
|
| I feel like the person you're replying to knows that the GPU is
| better suited than the CPU to this task, so your argument
| doesn't really make sense. I think they were referring to the
| Python venv environment with all the library dependencies as
| the "specialized environment".

| jebarker wrote:
| The point is that, as awesome as this repo is, it doesn't do
| much to wean the "ML folks" off of Python, since it doesn't
| provide the flexibility and GPU support that people designing
| and training DL systems rely on.

| waynecochran wrote:
| I'm just encouraged when I see ML libraries not using Python
| with its environment kludges. Just a step in the right
| direction.

| jebarker wrote:
| I don't disagree that Python environments are a mess. I'm
| actually a developer on quite a prominent large-scale neural
| network training library, and a DL researcher that uses said
| library. With my developer hat on, I like to have minimal
| dependencies and keep Python scripting as decoupled as possible
| from the CUDA C++ implementation. With my researcher hat on, I
| don't want to be slowed down by C++ development every time I
| want to change my model or training pipeline. At least for me,
| C++ development is slower and more error-prone than modifying
| Python.
|
| Obviously doing any heavy lifting in Python is a bad idea. But
| as a scripting language I think it's good, especially if you
| keep the environment simple. I don't think the answer for DL
| training is to dump Python entirely and start over in pure
| C/C++/Rust/Julia/whatever. Learning C/C++ is too big of an ask
| for everyone working on the model design and training side, and
| it would slow down progress significantly -- most of that work
| is actually data munging and targeted model tweaks. But I do
| think there's still a lot that can be done to decouple Python
| from the underlying engine and yield networks where inference
| can be run in a minimal-dependency environment. There are lots
| of great people working on all these things.

| segfaultbuserr wrote:
| > _Where Python shines, though, is the prototyping phase when
| you have yet to find an appropriate model. We have yet to see
| this sort of easy & convenient prototyping in C++._
|
| +1.
|
| Producing a highly optimized C/C++ kernel that utilizes the CPU
| to the fullest extent requires a tremendous amount of talent
| and expertise. For example, not everyone can write a
| hand-vectorized kernel with AVX2 intrinsics (outside a few
| specialized fields like 3D graphics, media encoding, and the
| like), and even fewer people can exploit the underlying
| features of the algorithm for optimization, such as producing
| usable output at greatly reduced numerical precision.
| The power of LLMs provides strong motivation, driving the
| brainpower of countless programmers all over the world to do
| just that. New techniques are proposed and implemented on a
| monthly basis, with people thinking up and applying every
| possible trick to the LLM optimization problem. In this regard,
| moving from Python to C is totally reasonable.
|
| In comparison, right now I'm working on optimizing a niche
| open-source scientific simulation kernel with a naive C
| codebase. Before me, there were hardly any contributors in the
| last decade.
|
| Python has its place, because not everyone has a level of
| resources and expertise comparable to ML's. In particular, when
| the bulk of the data processing in a Python script is done in a
| function call to a C++ or Fortran kernel, as with SciPy, the
| difference between naive C and naive Python code (or Julia code
| if you're following the trend) is not that much, especially
| when it's a one-off project for publishing a single paper.
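[Ed. note: for readers wondering what "a hand-vectorized kernel
with AVX2 intrinsics" looks like, a minimal dot-product kernel is
below. Build with -mavx2 -mfma; the alignment handling and further
unrolling a real kernel would want are omitted.]

    #include <immintrin.h>

    float dot_avx2(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        /* 8 fused multiply-adds per iteration. */
        for (int i = 0; i + 8 <= n; i += 8)
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i), acc);
        /* Horizontal sum of the 8 lanes. */
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        lo = _mm_add_ps(lo, hi);
        lo = _mm_hadd_ps(lo, lo);
        lo = _mm_hadd_ps(lo, lo);
        float s = _mm_cvtss_f32(lo);
        for (int i = n & ~7; i < n; i++)  /* scalar tail */
            s += a[i] * b[i];
        return s;
    }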
| hotstickyballs wrote:
| It's going to be a TF or PyTorch feature rather than people
| going directly to writing things in C. No point solving this
| problem only once.

| waynecochran wrote:
| Yeah, I make a living in the GPU space. I think my comment
| comes from colleagues having to hold my hand to set up their
| ML / Python environments, with all of their peccadilloes. In
| fact it's bad enough that I have to use Docker to create an
| insular environment tailored to their specific setup. And
| Python is like 1000 times slower when it's not using other libs
| like numpy.

| pdntspa wrote:
| Are they not using venvs or something? It should be as simple
| as: python -m venv venv; . venv/bin/activate; pip install -r
| requirements.txt

| Narew wrote:
| Unfortunately, it's not that simple, especially for the NVIDIA
| driver and CUDA install. That's why we usually use conda, which
| can handle the CUDA install -- but even with that, sometimes it
| works flawlessly and sometimes not.

| waynecochran wrote:
| Everyone has their own way to do this. Every step is broken by
| some unfamiliar dependency that requires special arcane
| knowledge to fix. Part of me is a grumpy old man that doesn't
| gravitate to the shiny new tools that come out every week that
| the younger devs keep up with :)

| pdntspa wrote:
| pip and venv are neither shiny nor new; they've been the
| standard way of doing things for a while. I am an outsider to
| Python and am incredibly thankful for this standardization,
| because I agree that getting a Python env set up correctly
| before venv was a huge pain.
|
| If your guys aren't on this, I'd suggest you get them on it. It
| dramatically simplifies setup.

| waynecochran wrote:
| Here is a tiny excerpt of trying to get dvc to work just so I
| could get the training weights for deployment... remember, I
| don't develop much with Python...
|       $ dvc pull
|       Command 'dvc' not found, but can be installed with:
|       sudo snap install dvc
|       $ sudo snap install dvc
|       error: This revision of snap "dvc" was published using
|       classic confinement and thus may perform arbitrary
|       system changes outside of the security sandbox that
|       snaps are usually confined to, which may put your
|       system at risk.
|       If you understand and want to proceed repeat the
|       command including --classic.
| OK, I get dvc installed somehow -- don't remember. Time to get
| the weights...
|       $ python3 -m dvc pull
|       ERROR: unexpected error - Forbidden: An error occurred
|       (403) when calling the HeadObject operation: Forbidden
|       Having any troubles? Hit us up at
|       https://dvc.org/support, we are always happy to help!
| Finally I just had my colleague manually copy the weights. This
| kind of thing went on for hours.

| pdntspa wrote:
| Researchers are notorious for writing bad code.
|
| What even is dvc?
|
| Edit: also, I'd avoid snap and just use your regular package
| manager.

| waynecochran wrote:
| I think dvc is like git for large binary files. You need some
| way to manage your NN weights -- what are other methods?

| pdntspa wrote:
| git lfs is what everyone is using, HF in particular.

| efiop wrote:
| Hey, DVC maintainer here.
|
| Thanks for giving DVC a try!
|
| There are a few ways to install dvc, see
| https://dvc.org/doc/install/linux
|
| With snap, you need to use the `--classic` flag, as noted in
| https://dvc.org/doc/install/linux#install-with-snap
| Unfortunately that's just how snap works for us there :(
|
| Regarding the pull error, it simply looks like you don't have
| some credentials set up. See
| https://dvc.org/doc/user-guide/data-management/remote-storag...
| Still, the error could be better, so that's on us.
|
| Feel free to ping us in discord (see invite link in
| https://dvc.org/support). I'm @ruslan there. We'll be happy to
| help.

| pjmlp wrote:
| That Python ML code is calling C++ code running on the GPU --
| one more reason to use C++ across the whole stack.
|
| CERN was already doing prototyping in C++, with ROOT and CINT,
| 20 years ago.
|
| https://root.cern/
|
| Nowadays it is even usable from notebooks via Xeus.
|
| It is more a matter of lack of exposure to C++ interpreters
| than anything else.

| two_in_one wrote:
| Add to that it's only inference code, not training.

| kwant_kiddo wrote:
| Not sure what "It's not hard to write Python code that
| 'optimally exploits' the GPU" exactly means, but Python is so
| far from exploiting GPU resources, even with C/C++ bindings,
| that it's not even funny. I am sure that HPC folks would have
| migrated away from Fortran and C/C++ a long time ago if it were
| so easy.

| codethief wrote:
| I wasn't trying to claim that Python is great at fully
| exploiting GPU resources on generic GPU tasks. But in ML
| applications it often does, at least in my experience.

| geysersam wrote:
| It's not like any performance-significant component of the ML
| stack is actually implemented in Python. Everything is, and has
| always been, CUDA, C or C++ under the hood. Python is just the
| extremely effective glue binding it all together.

| brucethemoose2 wrote:
| Sometimes implementations will spend a little too much time in
| Python interpretation, but yeah, it's largely lower-level code.
|
| The problem with PyTorch specifically is that (without Triton
| compilation) pretty much all projects run in eager mode. That's
| fine for experimentation and demonstrations in papers, but it's
| _crazy_ that it's used so much for production without any
| compilation. It would be like using debug C binaries in
| production -- and they only work with any kind of sane
| performance on a single CPU maker.

| fassssst wrote:
| Yup. I would much prefer it if every ML model had a simple C
| inference API that could be called directly from pretty much
| any language on any platform, without a mess of dependencies
| and environment setup.

| naillo wrote:
| ML is such a beautiful and perfect setup for dependency-free
| execution too. It should just be like downloading a
| mathematical function. I'm glad we're finally embracing that.
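[Ed. note: apropos of the "simple C inference API" wish above, a
sketch of what such a header could look like. All names here
(sd_ctx, sd_load, sd_txt2img, sd_free) are invented for
illustration; this is NOT the actual stable-diffusion.cpp
interface.]

    /* Hypothetical minimal C inference API -- illustrative only. */
    #include <stdint.h>

    typedef struct sd_ctx sd_ctx;              /* opaque handle */

    sd_ctx  *sd_load(const char *model_path);  /* NULL on failure */

    /* Returns width*height*3 RGB bytes (caller frees), or NULL. */
    uint8_t *sd_txt2img(sd_ctx *ctx, const char *prompt,
                        int width, int height,
                        int steps, int64_t seed);

    void     sd_free(sd_ctx *ctx);

[An opaque-handle C ABI like this is also what makes ffigen-style
binding generation, as in the Dart/Flutter experiment upthread,
straightforward.]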
| brucethemoose2 wrote:
| Llama.cpp/ggml is uniquely suited to LLMs. The memory
| requirements are _huge_, quantization is effective, and token
| generation is surprisingly serial and bandwidth bound, making
| it good for CPUs and an even better fit for ggml's unique
| pipelined CPU/GPU inference.
|
| ...But Stable Diffusion is not the same. It doesn't quantize as
| well, the UNet is very compute-intensive, and batched image
| generation is effective and useful to single users. It's a
| better fit for GPUs/IGPs. Additionally, it massively benefits
| from the hackability of the Python implementations.
|
| I think ML compilation to executables is the way for SD.
| AITemplate is already blazing fast [1], and TVM Vulkan is very
| promising if anyone actually fleshes out the demo
| implementation [2]. And they preserve most of the hackability
| of the pure PyTorch implementations.
|
| 1: https://github.com/VoltaML/voltaML-fast-stable-diffusion
|
| 2: https://github.com/mlc-ai/web-stable-diffusion

| WinLychee wrote:
| The above project somewhat supports GPUs if you pass the
| correct GGML compile flags to it; `GGML_CUBLAS`, for example,
| is supported when compiling. You get a decent speedup relative
| to pure C/C++.

| brucethemoose2 wrote:
| Interesting. It still doesn't seem to be very quick:
| https://github.com/leejet/stable-diffusion.cpp/issues/6
|
| But don't get me wrong, I look forward to playing with ggml SD
| and its development.

| WinLychee wrote:
| Yeah, for comparison, `tinygrad` takes a little over a second
| per iteration on my machine:
| https://github.com/tinygrad/tinygrad/blob/master/examples/st...

| brucethemoose2 wrote:
| Is that on GPU or CPU? 1 it/s would be very respectable on CPU.
|
| The fastest implementation on my 2060 laptop is AITemplate,
| being about 2x faster than pure optimized HF diffusers.

| WinLychee wrote:
| That was on GPU, and there are various CPU implementations
| (e.g. based on Tencent/ncnn) on GitHub that have similar
| runtimes (1-3 s / iteration).

| skykooler wrote:
| On the other hand, this is nice for anyone who wants to play
| with these networks locally and does not have an Nvidia GPU
| with 6+ gigabytes of VRAM. I can run this on an old laptop,
| even if it takes a while.

| brucethemoose2 wrote:
| I would also highly recommend https://tinybots.net/artbot
|
| You can even run CLIP or (if it's fast enough) llama.cpp on the
| CPU to contribute to the network, if you wish.

| gsharma wrote:
| ComfyUI works pretty well on old computers using the CPU. It
| takes over 30 seconds per sampling step on a 2015 MacBook Air
| (i7, 8 GB RAM).

| voz_ wrote:
| IIRC we had good speedups on it with torch.compile, and I
| remember working on it. Let me see if I can find numbers...

| brucethemoose2 wrote:
| It's about 20-40% depending on the GPU, from my tests.
|
| And only very recent builds of torch 2.1 (with dynamic input)
| work properly, and it still doesn't like certain input changes
| or augmentations like ControlNet.
|
| AIT is the most usable compiled implementation I have
| personally tested, but SHARK (running IREE/MLIR/Vulkan) and
| torch-mlir are said to be very good.
|
| Hidet is promising but doesn't really work yet. TVM doesn't
| have a complete implementation outside of the WebGPU demo.

| voz_ wrote:
| Try head of master. If there are any bugs or graph breaks you
| hit, lmk, I can take a look. My numbers say 71% with a few
| custom hacks.
|
| Glad the dynamic stuff is working out, though!

| brucethemoose2 wrote:
| I will, thanks!
|
| I have been away for a month, but I will start testing it again
| later and submit some issues I run into.
| voz_ wrote:
| My username without the underscore, at meta. Email me any bugs;
| I can help file them on GH and lend a hand fixing them.

| kpw94 wrote:
| This is incredibly easy to set up; just tried it for the first
| time.
|
| How fast is it supposed to go?
|
| Just tried on Linux with `cmake .. -DGGML_OPENBLAS=ON` on an
| AMD Ryzen 7 5700G (no discrete GPU, only integrated graphics):
|       ./bin/sd -m ../models/sd-v1-4-ggml-model-f32.bin \
|           -p "a lovely cat"
|       [INFO] stable-diffusion.cpp:2525 - loading model from
|           '../models/sd-v1-4-ggml-model-f32.bin'
|       ...
|       [INFO] stable-diffusion.cpp:3375 - start sampling
|       [INFO] stable-diffusion.cpp:3067 - step 1 sampling
|           completed, taking 12.25s
|       [INFO] stable-diffusion.cpp:3067 - step 2 sampling
|           completed, taking 12.22s
|       [INFO] stable-diffusion.cpp:3067 - step 3 sampling
|           completed, taking 12.56s
|       ...
|       sampling completed, taking 246.40s
|
| Is that expected performance?
|
| (EDIT: Don't have OpenBLAS installed, so that flag is a no-op.)

| patrakov wrote:
| CPU-only, 8-bit quant, Intel Core i7-4770S, 16 GB DDR3 RAM,
| 10-year-old fanless PC: 32 seconds per sampling step, correct
| output.

| badsectoracula wrote:
| This is nice; it basically does what I asked for a year ago[0],
| and at the time pretty much every solution wanted a litany of
| Python dependencies that I ended up failing to install because
| it took ages... and then I ran out of disk space.
|
| No, really, this replaces literal gigabytes of disk space with
| just a 799 KB binary. And as a bonus, using the Q8_0 format
| (the one that seems to be the fastest) saves ~2.3 GB of data
| too.
|
| That said, it seems to be buggy with anything other than the
| default 512x512 image size. Some sizes (e.g. 544x544) tend to
| cause assert failures, and sizes smaller than 512x512 (which I
| tried since 512x512 is quite slow on my PC) sometimes generate
| garbage (anything smaller than 384x384 seems to always do
| that).
|
| [0] https://news.ycombinator.com/item?id=32555608

| kpw94 wrote:
| Also got a segfault (core dump) with different sizes; 512w x
| 768h worked.

| brucethemoose2 wrote:
| You should quantize the model, but 12 s/iter seems about right.

| kpw94 wrote:
| Nice. Tried fp32, q8_0, and q4_0, and for some reason they all
| take ~12 s/iter.
|
| I must have something wrong with my setup, but no big deal; for
| my minimal usage of it, and the amount of time spent, fp32 at
| 12 s/iter is fine.

| brucethemoose2 wrote:
| Hmm, theoretically FP16 might be the fastest, if that's an
| option in the implementation now.

| Lockal wrote:
| I did a quick run under a profiler, and on my AVX2 laptop the
| slowest part (>50%) was matrix multiplication (sgemm).
|
| In the current version of GGML, if OpenBLAS is enabled, they
| convert matrices to FP32 before running sgemm.
|
| If OpenBLAS is disabled, on AVX2 platforms they convert FP16 to
| FP32 on every FMA operation, which is even worse (due to
| repetition). After that, both ggml_vec_dot_f16 and
| ggml_vec_dot_f32 take first place in the profiler.
|
| Source:
| https://github.com/ggerganov/ggml/blob/master/src/ggml.c#L10...
|
| But I agree that _in theory, and only with AVX-512_, BF16 (not
| exactly FP16, but similar) will be fast with the VDPBF16PS
| instruction. The implementation is not there yet.

| brucethemoose2 wrote:
| Interesting.
|
| I saw some discussion on llama.cpp that, theoretically,
| implementing matmul for each quantization should be much faster
| since it can skip the conversion. But practically, it's
| actually quite difficult, since the various BLAS libraries are
| so good.
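[Ed. note: the "skip the conversion" idea above -- doing the inner
product directly on quantized blocks instead of dequantizing to
fp32 first -- looks roughly like this. A simplified scalar sketch
modeled on ggml's Q8_0 paths, reusing the hypothetical block_q8
layout from the earlier quantization note; the real ggml kernels
are vectorized per architecture.]

    /* Dot product directly on Q8_0-style blocks: integer MACs
     * within each block, one float multiply per block pair --
     * no separate dequantization pass over the weights. */
    float vec_dot_q8(int nblocks,
                     const block_q8 *x, const block_q8 *y) {
        float sum = 0.0f;
        for (int b = 0; b < nblocks; b++) {
            int32_t isum = 0;
            for (int i = 0; i < QK; i++)
                isum += (int32_t)x[b].qs[i] * (int32_t)y[b].qs[i];
            sum += x[b].d * y[b].d * (float)isum;
        }
        return sum;
    }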
| billfruit wrote:
| It appears to be in C++; why state it as C/C++?

| mmcwilliams wrote:
| From what I understand, the underlying ggml dependency is
| written in C.
___________________________________________________________________
(page generated 2023-08-19 23:00 UTC)