[HN Gopher] Stable Diffusion in C/C++
___________________________________________________________________

Stable Diffusion in C/C++

Author : kikalo00
Score  : 252 points
Date   : 2023-08-19 11:26 UTC (11 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| naillo wrote:
| Awesome that they implemented CLIP as well. That alone could be
| cool to extract and compile as a wasm implementation.
|
| Edit: Seems like someone already has
| https://github.com/monatis/clip.cpp :) Now to wasmify it

| KaoruAoiShiho wrote:
| Speaking of CLIP, I'm always troubled that the next CLIP might
| not get released, as both OpenAI and Google are shifting into
| competition mode. It's sad to think there might be a more
| advanced version of CLIP already, but sitting in a secret vault
| somewhere.
|
| Edit: I'm not referring to a CLIP-2, but to any advance on the
| same level of importance as CLIP.

| GaggiX wrote:
| The biggest CLIP models we know of are open source.
|
| If a company has a bigger CLIP model, they haven't even
| reported it.
|
| Also, OpenAI already had, for a while, a proprietary CLIP model
| that was bigger than any other model available: the CLIP-H used
| by DALL-E 2.

| snordgren wrote:
| As someone who is out of the loop but could use high-quality
| image embeddings right now: what's the best CLIP model right
| now?

| astrange wrote:
| SDXL uses OpenCLIP, and then OpenAI CLIP as a backup, basically
| to allow it to spell words properly, but I think you could
| replace the second one.

| speedgoose wrote:
| Stable Diffusion switched to OpenCLIP for Stable Diffusion 2,
| but it looks like they went back to CLIP for the XL version.
|
| People complained about OpenCLIP not being as good. Hopefully
| we can have a better and open CLIP model eventually.

| evolveyourmind wrote:
| Any benchmarks?

| nre wrote:
| Some people have timed it here; it looks like it's taking
| 15-20 s/it (depending on quant and hardware).
|
| https://github.com/leejet/stable-diffusion.cpp/issues/1

| Lerc wrote:
| Looking at the example outputs at the different quantization
| levels, I'm quite impressed. The change from f16 to q8_0 seems
| to be more a change in direction than a loss of quality. The
| q5_1 result seems indistinguishable from the q8_0.
|
| So relative to the higher-precision models you lose
| determinism, but the results are potentially quite usable.

| jakearmitage wrote:
| There's just something special about these C/C++
| implementations of AI stuff. They feel so clean and
| straightforward, and they make the entire field of AI feel
| tangible and learnable.
|
| Is that because Python's ecosystem is so messy?

| snordgren wrote:
| Rewrites tend to improve code quality, and replacing
| dependencies with custom-tailored code that does just what you
| need also improves code quality.
|
| And while the Python version uses C and C++ code for speed,
| this is all just one language.
|
| A trifecta of factors enabling clean code.

| BrutalCoding wrote:
| Saw this repo today, fetched it, built a .dylib (Mac) and used
| Dart's ffigen tooling to generate the bindings from the
| provided header file.
|
| I'm just experimenting with it together with Flutter. FFI
| because I'm trying to avoid spawning a subprocess.
|
| Fast forward: ended up with a severe headache and a broken app.
| Will continue my attempt tmr with a fresh mind haha
|
| This repo is great though, had it up and running within 10 min
| on my M1 (using f16). Thanks for sharing!
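[Ed. note: apropos of Lerc's f16/q8_0/q5_1 comparison upthread, a
minimal sketch of block quantization in the style of ggml's Q8_0
format (32 values per block, one scale each). The real ggml code
stores the scale as fp16 and is heavily optimized; this sketch is
only illustrative.]

    #include <math.h>
    #include <stdint.h>

    #define QK 32  /* values per block, as in ggml's Q8_0 */

    typedef struct {
        float  d;        /* per-block scale (ggml stores fp16) */
        int8_t qs[QK];   /* quantized values */
    } block_q8;

    /* Quantize QK floats into one block: scale by max |x| so
     * every value fits in an int8. */
    static void quantize_block_q8(const float *x, block_q8 *b) {
        float amax = 0.0f;
        for (int i = 0; i < QK; i++) {
            float ax = fabsf(x[i]);
            if (ax > amax) amax = ax;
        }
        b->d = amax / 127.0f;
        const float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
        for (int i = 0; i < QK; i++)
            b->qs[i] = (int8_t)roundf(x[i] * id);
    }

[Each weight then costs roughly 8 bits plus a shared scale instead
of 16 or 32 bits, which is where the disk/RAM savings come from;
the per-block rounding is also consistent with q8_0 output
drifting in "direction" from f16 rather than simply degrading.]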
| waynecochran wrote:
| Nice to see ML folks getting weaned off of Python and using a
| language that can optimally exploit the underlying hardware and
| doesn't require setting up a specialized environment to build
| and run.

| xvilka wrote:
| Do you mean the Julia language?

| A-Train wrote:
| Amen.

| aimor wrote:
| I really appreciate the people doing this work. It's the only
| way I've run these models without any headaches. The difference
| is so stark: even with CUDA and Linux it's bad, and with AMD
| and Windows it's miserable. I'm pretty sure it's not just me...

| galangalalgol wrote:
| As long as we are language trolling: why would anyone start a
| greenfield project like this in C++ these days? The Android,
| Windows, Firefox, and now Chrome projects have all begun to
| shift towards Rust, and in the case of Android and Firefox have
| written significant amounts of the project in Rust. Migrating
| an existing project like that is difficult; the Chrome team in
| particular lamented the difficulty. But starting a new project?
| If you have a team familiar with performant C++, the speed bump
| of starting a greenfield project in Rust is negligible, and the
| ergonomic improvements in the build system and the language
| itself will make up for it in any project that takes more than
| a few months. For that speed bump you get memory safety and
| freedom from data races, far beyond what any stack of C++
| analysis tools could ever provide, with a tiny fraction of the
| unit tests you'd write in C++. And you lose no performance.

| FoodWThrow wrote:
| Rust is _great_ when you know what you're building. That
| qualifier encompasses quite a lot of the software space, but
| not all of it, and I would argue not even the majority of it.
|
| If you don't know what you are doing, if you are exploring
| ideas, Rust will just get in the way. At some point you will
| realize you need to adjust lifetimes, and that will require you
| to touch a non-trivial amount of your code base. If you need to
| do that multiple times, friction will overwhelm your desire to
| code.
|
| I have a pet theory that the people who find Rust intuitive and
| fun are the people who are working on well-beaten paths; Rust
| is almost boring at that, which is a good thing. And the people
| who find that Rust gets in their way are the people who like to
| experiment with their solutions, because there aren't any set,
| trusted solutions within their problem space -- and even if
| there are, they like to approach the problem on their own, for
| better or worse.
|
| In any case:
|
| > why would anyone start a greenfield project like this in
| C++ these days?
|
| The video game industry can single-handedly carry C++ on its
| back, kicking and screaming, if need be. Rust is uniquely unfit
| for writing gameplay code due to game development's iterative
| nature. Using scripting languages doesn't cut it either,
| because often the slower designer-made scripts need to be
| converted to C++ by a programmer, which pulls the crazy
| reference hell of the game state into C++ land.
|
| I would say Rust is OK for _engine_ level features -- those
| don't change that often, and the requirements are usually well
| understood. But that introduces a cadence mismatch between
| different systems too, so there is a cost there as well. But
| for gameplay? There's a reason why many Rust-based game engines
| use a crazy amount of unsafe Rust to make their ECS. Just not a
| good fit.
|
| And of course, there are the consoles, where Sony seems to have
| a political reason for not supporting Rust for non-first-party
| studios. I have no idea what they are thinking, honestly.

| mnrlt wrote:
| C++ has a standard, multiple competing implementations and a
| largely drama-free community.
|
| Does CUDA even have Rust bindings, and if so, are they on the
| same level as the C++ ones?
|
| What do you mean by "the windows projects" that shift towards
| Rust?

| galangalalgol wrote:
| MS has started implementing pieces of Windows in Rust. If you
| have Windows 11, you are running Rust. The CUDA bindings are
| good for ML, but missing for cuFFT and similar. There are
| people working on better CUDA support, but there are even more
| people working on vendor-agnostic GPGPU using SPIR-V and
| WebGPU. It isn't there yet. Right now you are mostly left to
| write your own bindings unless you are doing ML or BLAS.
|
| Edit: I can't argue about the drama part. The competing
| compilers will get there: a couple of GCC frontends are in the
| works, plus Cranelift as a competing backend to LLVM, and full
| self-hosting. There is also miri, I guess, to emit C? People
| use that to get Rust on the C64 or other niche processors.

| pjmlp wrote:
| Yes, they started, yet there is enough C++ to rewrite in the 30
| years of Windows NT history.
|
| Meanwhile, the Visual Studio team released better tooling for
| Unreal in Visual C++.

| Const-me wrote:
| > why would anyone start a greenfield project like this in
| C++ these days?
|
| TL;DR: quite often, using C++ instead of Rust saves software
| development costs.
|
| Some software needs to consume many external APIs. Examples on
| Windows: Direct3D, Direct2D, DirectWrite, Media Foundation.
| Examples on Linux: V4L2, ALSA, DRM/KMS, GLES. These things are
| huge in terms of API surface. Choose Rust, and you're going to
| need to write and support a non-trivial amount of boilerplate
| code for the interop. Choose C++ (on Linux, C is good too) and
| that code is gone; you only need the well-documented and well-
| supported APIs supplied by the OS vendors.
|
| Similarly, some software needs to integrate with other systems
| or libraries written in C or C++. An example often relevant to
| HPC applications is Eigen. Other related things, game console
| SDKs and game engines, don't support Rust.
|
| For the project being discussed here, GGML, the implementation
| needs vector intrinsics for optimal performance. Technically
| Rust has the support, but in practice Intel and ARM only
| support them for C and C++. And it's not just CPU vendors: when
| using C or C++ there are useful relevant resources -- articles,
| blogs, and Stack Overflow. These things help a lot in practice.
| I don't program in Rust, but I do program in C# in addition to
| C++; technically most vector intrinsics are available in the
| current version of C#, but for this reason they are much harder
| to use from C#.
|
| All current C and C++ compilers support OpenMP for parallelism.
| While not a silver bullet, and not available on all platforms
| supported by C or C++, some software benefits tremendously from
| that thing.
|
| Finally, it's easier to find good C++ developers than good Rust
| developers.

| galangalalgol wrote:
| There are existing supported bindings for Direct3D from MS, as
| they themselves are migrating. GLES and ggml also have
| supported bindings. I like nalgebra + rustfft better than Eigen
| now.
| Nalgebra still isn't quite as performant on small matrices
| until const generic eval stabilizes, but it is close enough for
| 6x6 stuff that it is in the noise. Rustfft is even faster than
| FFTW. Rust has intrinsic support on par with Clang and GCC, and
| the autovectorizer uses whatever LLVM knows about, so again
| equivalent to Clang.
|
| On the last point, I will again assert that a good C++
| developer is just a good Rust developer minus a month of
| ramp-up, which you'll get back from not having to fight
| combinations of automake, cmake, conan, vcpkg, meson, bazel and
| hunter.

| Const-me wrote:
| I'll be very surprised if MS ever supports Rust bindings for
| Media Foundation. That thing is COM-based, requires users to
| implement COM interfaces and not just consume them, and is
| heavily multithreaded.
|
| About SIMD: automatic vectorizers are very limited. I was
| talking about manually vectorized code using intrinsics.
|
| I've been programming C++ for a living for decades now. I tried
| to learn Rust but failed. I have the impression the language is
| extremely hard to use.

| galangalalgol wrote:
| Not sure how fully featured it is:
| https://lib.rs/crates/mmf
|
| Yes, Rust directly supports modern intrinsics; that is what
| rustfft, for instance, uses. I try to stick with autovec
| myself, because my needs are simple enough that a couple of
| tweaks usually gets me close to hand-rolled speedups on both
| AVX-512 and AArch64. But for more complicated stuff, yeah, Rust
| seems to be keeping up. Some intrinsics are still only in
| nightly, but plenty of major projects use nightly for
| production; it is quite stable, and with a good pipeline you'll
| be fine.
|
| I've written C++ since ~94, and mostly C++17 since it came out
| -- about a quarter of a century of that getting paid for it. I
| never liked or used exceptions or RTTI, and generally used a
| functional style except for preallocating memory for
| performance. I think those habits might have made the
| transition a little easier, but the people on my team who had
| used a more OOP style and full C++ don't seem to have adapted
| any more slowly, if at all. I struggled for years to
| internalize Rust at home, until I just jumped in at work by
| declaring that the project I lead would be in Rust. I have had
| absolutely no regrets. It really isn't as bad a learning curve
| as C++. But we learned C++ one revision at a time. Also, much
| like C++, Rust has bits you mostly only need to know for
| writing libraries, so when getting started you can put those
| things to the side at first.

| theLiminator wrote:
| Curious how long you tried to learn Rust? I've found C++ much
| harder to learn (coming from a Python/Scala background).
|
| Is it just a case of you forgetting how hard C++ was to learn?

| pjmlp wrote:
| No, there aren't, unless you mean Rust/WinRT demos with
| community bindings.
|
| The Agility SDK and XDK have zero Rust support. If it isn't in
| the Agility SDK and XDK, it isn't official.
|
| Hardly the same as the official Swift bindings to Metal,
| written in Objective-C and C++14.

| api wrote:
| It's interesting to me that my CPU can run some of these things
| in quantized form almost as fast as the GPU. Has the whole
| thing been all about memory bandwidth all along?
|
| In addition to compute, the GPU architecture is one that
| somewhat colocates working memory alongside compute: units have
| local memories that sync with global memory. Is that a big part
| of why GPUs are so good for this?
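[Ed. note: a rough back-of-envelope for the bandwidth question
above, in the spirit of the replies that follow. At batch size 1,
generating one LLM token has to stream essentially all of the
weights through the cores once, so memory bandwidth sets a hard
ceiling on tokens/s. All numbers below are illustrative
assumptions, not benchmarks.]

    #include <stdio.h>

    int main(void) {
        /* Illustrative numbers only. */
        const double weights_gb = 7.0;   /* ~7B params at 8-bit */
        const double cpu_bw     = 50.0;  /* GB/s, dual-channel DDR */
        const double gpu_bw     = 900.0; /* GB/s, GDDR/HBM card */

        /* Batch-1 decoding reads ~all weights per token, so
         * tokens/s <= bandwidth / model size. */
        printf("CPU ceiling: ~%.1f tok/s\n", cpu_bw / weights_gb);
        printf("GPU ceiling: ~%.1f tok/s\n", gpu_bw / weights_gb);
        return 0;
    }

[Compute barely enters that estimate, which is why quantization
helps CPUs twice over: fewer bytes to stream per token, and more
of the model resident in cache. As noted below, this logic fits
LLM decoding much better than Stable Diffusion's compute-heavy
UNet.]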
| brucethemoose2 wrote:
| > Has the whole thing been all about memory bandwidth all
| along
|
| Yeah, sort of.
|
| LLMs like LLaMA at a batch size of 1 are hilariously bandwidth
| bound.
|
| Stable Diffusion less so. It's still bandwidth heavy on GPUs,
| but compute is much more of a bottleneck.

| intelVISA wrote:
| Wasn't Python originally designed as a language to teach
| children how to code? Weird to see so many otherwise
| intelligent folks latch onto it.
|
| It really doesn't have any redeeming characteristics vs. Common
| Lisp or Haskell to warrant this bizarre popularity, imo.

| mdp2021 wrote:
| > _Wasn't Python originally designed as a language to teach
| children how to code_
|
| I think it would be very confusing for a child to start with a
| language so far away from low-level logic.
|
| ...And some people said BASIC was evil. At least what it is
| doing looks plain and direct.

| tester756 wrote:
| > I think it would be very confusing for a child to start
| with a language so far away from low-level logic.
|
| Why?
|
| I started with C++, and when they showed me C# I instantly fell
| in love, cuz I didn't have to deal with unnecessary complexity
| and annoyances and could focus on pure programming, algorithms,
| etc.

| mdp2021 wrote:
| Yes, Tester,
|
| but you are confirming my point :) ...You _started_ with C++,
| then went to C#...

| tester756 wrote:
| I started with C++ and switched near the beginning, so there
| wasn't much low-level knowledge involved, nor anything beyond
| beginner-level C++ concepts.
|
| Both high-to-low and low-to-high have some advantages, but it's
| not like one is always better than the other.
|
| High-to-low allows you to write useful stuff earlier --
| programs that do something, GUI, web, whatever -- but at the
| cost of understanding the internals / what's under the hood.

| segfaultbuserr wrote:
| > _I think it would be very confusing for a child to start
| with a language so far away from low-level logic._
|
| It depends on the person.
|
| For some, it would be very frustrating to start with a language
| so close to the implementation details and so far away from
| what you want to do. It's very possible that someone might have
| long lost the motivation before they can do anything
| non-trivial.
|
| I started from Python, to C, to assembly, to 4-layer circuit
| boards. Whenever I went a level deeper, it felt like opening up
| the inner workings of a black box that I normally only interact
| with via the pushbuttons on its front panel, but whose function
| I was otherwise roughly aware of.
|
| On the other hand, much of my childhood was spent tinkering
| with PCs and servers, including hosting websites and compiling
| packages from source, so I was already well aware of the basic
| concepts in computing before I started programming. So,
| top-down and bottom-up are both absolutely workable, under the
| right circumstances.

| mnrlt wrote:
| ABC, the predecessor from which Python took many syntax
| features, was. I wonder if Python also took a lot of the ABC
| implementation, given that it is still copyright CWI.
|
| I agree that its popularity is very odd, but academics take
| what they are given when attending fully paid conferences (aka
| vacations).

| highspeedbus wrote:
| What's the term for an ad hominem fallacy directed at
| programming languages? Asked the almighty chat and got this new
| term:
|
| "Code Persona Attack"
|
| Python is fine.

| astrange wrote:
| This is a Y Combinator site; the traditional term is Blub.
|
| And they're right.
| Python is not a well-designed programming language: it has
| exceptions and doesn't have value types, so that's two strikes
| against it.
|
| Of course, C++ isn't either.

| gumby wrote:
| It's a bummer there's so little work on the training side in
| C++.
|
| Especially since the Python training systems are mostly calls
| into libraries written in C++!

| pjmlp wrote:
| Yeah, and since C++17 the language is already quite productive
| for scripting-like workflows; the missing piece of the puzzle
| is that there are too few C++ REPLs around, ROOT/CINT being one
| of the few well-known ones.

| danybittel wrote:
| Since when does C++ optimally exploit the underlying hardware?
| It has no vector instructions, does not run on the GPU, and is
| arguably too hard to make multithreaded. Which leaves you with
| about 0.5% of a current PC's performance.

| jcelerier wrote:
| > does not run on the GPU
|
| Both CUDA and the Metal shading language are C++, as is OpenCL
| since 2.0 (https://www.khronos.org/opencl/), as is AMD ROCm's
| HIP (https://github.com/ROCm-Developer-Tools/HIP), as is SYCL
| (https://www.khronos.org/sycl/). C++ is pretty much the
| language that runs _most_ on GPUs.
|
| > no vector instructions,
|
| There are a thousand different possibilities for SIMD in C++,
| from #pragma omp simd, to libs such as std::experimental::simd
| (https://en.cppreference.com/w/cpp/experimental/simd/simd),
| Eve (https://github.com/jfalcou/eve), Highway
| (https://github.com/google/highway), Vc
| (https://github.com/VcDevel/Vc)...

| pjmlp wrote:
| When compared against Python, more than enough.
|
| C++ is one of the supported CUDA languages; even standard C++17
| runs just fine on the GPU.
|
| Metal uses C++14 alongside some extensions.

| waynecochran wrote:
| Vector types / instructions would be nice. The C++20 STL
| algorithms are very friendly to vectorization, with the various
| parallel policies (e.g. std::execution::unsequenced_policy)
| that open up your code to being vectorized. Wonderful libs like
| Eigen handle a lot of my numeric needs for linear algebra. I
| think you are forgetting that CUDA is C/C++.

| Someone wrote:
| > Vector types / instructions would be nice
|
| It's technically not a C++ feature, but both GCC
| (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html)
| and Clang
| (https://releases.llvm.org/3.1/tools/clang/docs/LanguageExten...)
| have vector types, and Clang even supports the GCC way of
| writing them, so it gets pretty close.

| astrange wrote:
| Those are traditionally dangerous since they tend to compile
| poorly; not as bad as autovectorization, but not as good as
| just writing in assembly. And since vectorization is
| platform-dependent anyway (because it's so different across
| platforms), assembly really isn't nearly as bad as it sounds.
|
| Though it's certainly gotten better. The reason people push
| those extensions is that they're written by compiler authors,
| who don't want to hear that their compiler doesn't work.
|
| Some of the reason for this is that C doesn't let you specify
| memory aliasing as precisely as you'd want to. Fortran is
| better about this.
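[Ed. note: a minimal example of the GCC/Clang vector extensions
discussed above. The fixed 8-lane type maps naturally onto AVX on
x86; on narrower targets the compiler has to split it, which is
part of the "compiles poorly" risk the parent mentions. Compiles
with both GCC and Clang.]

    /* GCC/Clang vector extensions: SIMD without intrinsics. */
    typedef float v8sf __attribute__((vector_size(32)));

    /* a*b + c, element-wise: ordinary operators work on
     * vector types. */
    v8sf fma8(v8sf a, v8sf b, v8sf c) {
        return a * b + c;
    }

    /* Lanes are indexable like an array. */
    float hsum8(v8sf v) {
        float s = 0.0f;
        for (int i = 0; i < 8; i++) s += v[i];
        return s;
    }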
| codethief wrote:
| That's a rather odd comparison to make. First of all, OP, like
| llama.cpp, doesn't use the GPU -- in contrast to most Python ML
| code. It's not hard to write Python code that "optimally
| exploits" the GPU. You might call the GPU a "specialized
| environment to build and run", but it's arguably much better
| suited to the problem.
|
| Second, OP, like llama.cpp, produced efficient and highly
| specialized code _after_ it was clear that the model being
| specialized for (Stable Diffusion / LLaMA / ...) works well.
| Where Python shines, though, is the prototyping phase, when you
| have yet to find an appropriate model. We have yet to see this
| sort of easy & convenient prototyping in C++.
|
| Now, this is not to take anything away from the fantastic work
| that's being done by the llama.cpp people (among whom I also
| count OP) in the "ML on a CPU" space. But the problems being
| solved are entirely different.

| PcChip wrote:
| > You might call the GPU a "specialized environment to build
| and run" but it's arguably much better suited to the problem.
|
| I feel like the person you're replying to knows that the GPU is
| better suited than the CPU to this task, so your argument
| doesn't really make sense. I think they were referring to the
| Python venv environment with all the library dependencies as
| the "specialized environment".

| jebarker wrote:
| The point is that, as awesome as this repo is, it doesn't do
| much to wean the "ML folks" off of Python, since it doesn't
| provide the flexibility and GPU support that people designing
| and training DL systems rely on.

| waynecochran wrote:
| I'm just encouraged when I see ML libraries not using Python
| with its environment kludges. Just a step in the right
| direction.

| jebarker wrote:
| I don't disagree that Python environments are a mess. I'm
| actually a developer on quite a prominent large-scale neural
| network training library, and a DL researcher that uses said
| library. With my developer hat on, I like to have minimal
| dependencies and keep Python scripting as decoupled as possible
| from the CUDA C++ implementation. With my researcher hat on, I
| don't want to be slowed down by C++ development every time I
| want to change my model or training pipeline. At least for me,
| C++ development is slower and more error-prone than modifying
| Python.
|
| Obviously doing any heavy lifting in Python is a bad idea. But
| as a scripting language I think it's good, especially if you
| keep the environment simple. I don't think the answer for DL
| training is to dump Python entirely and start over in pure
| C/C++/Rust/Julia/whatever. Learning C/C++ is too big of an ask
| for everyone working on the model design and training side, and
| it would slow down progress significantly -- most of that work
| is actually data munging and targeted model tweaks. But I do
| think there's still a lot that can be done to decouple Python
| from the underlying engine and yield networks where inference
| can be run in a minimal-dependency environment. There are lots
| of great people working on all these things.

| segfaultbuserr wrote:
| > _Where Python shines, though, is the prototyping phase when
| you have yet to find an appropriate model. We have yet to see
| this sort of easy & convenient prototyping in C++._
|
| +1.
|
| Producing a highly optimized C/C++ kernel that utilizes the CPU
| to the fullest extent requires a tremendous amount of talent
| and expertise. For example, not everyone can write a
| hand-vectorized kernel with AVX2 intrinsics (outside a few
| specialized fields like 3D graphics, media encoding, and the
| like), and even fewer people can exploit the underlying
| features of the algorithm for optimization, such as producing
| usable output at greatly reduced numerical precision.
| The power of LLMs provides strong motivation, driving the
| brainpower of countless programmers all over the world to do
| just that. New techniques are proposed and implemented on a
| monthly basis, with people thinking up and applying every
| possible trick to the LLM optimization problem. In this regard,
| moving from Python to C is totally reasonable.
|
| In comparison, right now I'm working on optimizing a niche
| open-source scientific simulation kernel with a naive C
| codebase. Before me, there were hardly any contributors in the
| last decade.
|
| Python has its place, because not everyone has a level of
| resources and expertise comparable to ML's. In particular, when
| the bulk of the data processing in a Python script is done in a
| function call to a C++ or Fortran kernel, as with SciPy, the
| difference between naive C and naive Python code (or Julia code
| if you're following the trend) is not that much, especially
| when it's a one-off project for publishing a single paper.
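[Ed. note: for readers wondering what "a hand-vectorized kernel
with AVX2 intrinsics" looks like, a minimal dot-product kernel is
below. Build with -mavx2 -mfma; the alignment handling and further
unrolling a real kernel would want are omitted.]

    #include <immintrin.h>

    float dot_avx2(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        /* 8 fused multiply-adds per iteration. */
        for (int i = 0; i + 8 <= n; i += 8)
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i), acc);
        /* Horizontal sum of the 8 lanes. */
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        lo = _mm_add_ps(lo, hi);
        lo = _mm_hadd_ps(lo, lo);
        lo = _mm_hadd_ps(lo, lo);
        float s = _mm_cvtss_f32(lo);
        for (int i = n & ~7; i < n; i++)  /* scalar tail */
            s += a[i] * b[i];
        return s;
    }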
| hotstickyballs wrote:
| It's going to be a TF or PyTorch feature rather than people
| going directly to writing things in C. No point solving this
| problem only once.

| waynecochran wrote:
| Yeah, I make a living in the GPU space. I think my comment
| comes from colleagues having to hold my hand to set up their
| ML / Python environments, with all of their peccadilloes. In
| fact it's bad enough that I have to use Docker to create an
| insular environment tailored to their specific setup. And
| Python is like 1000 times slower when it's not using other libs
| like numpy.

| pdntspa wrote:
| Are they not using venvs or something? It should be as simple
| as: python -m venv venv; . venv/bin/activate; pip install -r
| requirements.txt

| Narew wrote:
| Unfortunately, it's not that simple, especially for the NVIDIA
| driver and CUDA install. That's why we usually use conda, which
| can handle the CUDA install -- but even with that, sometimes it
| works flawlessly and sometimes not.

| waynecochran wrote:
| Everyone has their own way to do this. Every step is broken by
| some unfamiliar dependency that requires special arcane
| knowledge to fix. Part of me is a grumpy old man that doesn't
| gravitate to the shiny new tools that come out every week that
| the younger devs keep up with :)

| pdntspa wrote:
| pip and venv are neither shiny nor new; they've been the
| standard way of doing things for a while. I am an outsider to
| Python and am incredibly thankful for this standardization,
| because I agree that getting a Python env set up correctly
| before venv was a huge pain.
|
| If your guys aren't on this, I'd suggest you get them on it. It
| dramatically simplifies setup.

| waynecochran wrote:
| Here is a tiny excerpt of trying to get dvc to work just so I
| could get the training weights for deployment... remember, I
| don't develop much with Python...
|       $ dvc pull
|       Command 'dvc' not found, but can be installed with:
|       sudo snap install dvc
|       $ sudo snap install dvc
|       error: This revision of snap "dvc" was published using
|       classic confinement and thus may perform arbitrary
|       system changes outside of the security sandbox that
|       snaps are usually confined to, which may put your
|       system at risk.
|       If you understand and want to proceed repeat the
|       command including --classic.
| OK, I get dvc installed somehow -- don't remember. Time to get
| the weights...
|       $ python3 -m dvc pull
|       ERROR: unexpected error - Forbidden: An error occurred
|       (403) when calling the HeadObject operation: Forbidden
|       Having any troubles? Hit us up at
|       https://dvc.org/support, we are always happy to help!
| Finally I just had my colleague manually copy the weights. This
| kind of thing went on for hours.

| pdntspa wrote:
| Researchers are notorious for writing bad code.
|
| What even is dvc?
|
| Edit: also, I'd avoid snap and just use your regular package
| manager.

| waynecochran wrote:
| I think dvc is like git for large binary files. You need some
| way to manage your NN weights -- what are other methods?

| pdntspa wrote:
| git lfs is what everyone is using, HF in particular.

| efiop wrote:
| Hey, DVC maintainer here.
|
| Thanks for giving DVC a try!
|
| There are a few ways to install dvc, see
| https://dvc.org/doc/install/linux
|
| With snap, you need to use the `--classic` flag, as noted in
| https://dvc.org/doc/install/linux#install-with-snap
| Unfortunately that's just how snap works for us there :(
|
| Regarding the pull error, it simply looks like you don't have
| some credentials set up. See
| https://dvc.org/doc/user-guide/data-management/remote-storag...
| Still, the error could be better, so that's on us.
|
| Feel free to ping us in discord (see invite link in
| https://dvc.org/support). I'm @ruslan there. We'll be happy to
| help.

| pjmlp wrote:
| That Python ML code is calling C++ code running on the GPU --
| one more reason to use C++ across the whole stack.
|
| CERN was already doing prototyping in C++, with ROOT and CINT,
| 20 years ago.
|
| https://root.cern/
|
| Nowadays it is even usable from notebooks via Xeus.
|
| It is more a matter of lack of exposure to C++ interpreters
| than anything else.

| two_in_one wrote:
| Add to that it's only inference code, not training.

| kwant_kiddo wrote:
| Not sure what "It's not hard to write Python code that
| 'optimally exploits' the GPU" exactly means, but Python is so
| far from exploiting GPU resources, even with C/C++ bindings,
| that it's not even funny. I am sure that HPC folks would have
| migrated away from Fortran and C/C++ a long time ago if it were
| so easy.

| codethief wrote:
| I wasn't trying to claim that Python is great at fully
| exploiting GPU resources on generic GPU tasks. But in ML
| applications it often does, at least in my experience.

| geysersam wrote:
| It's not like any performance-significant component of the ML
| stack is actually implemented in Python. Everything is, and has
| always been, CUDA, C or C++ under the hood. Python is just the
| extremely effective glue binding it all together.

| brucethemoose2 wrote:
| Sometimes implementations will spend a little too much time in
| Python interpretation, but yeah, it's largely lower-level code.
|
| The problem with PyTorch specifically is that (without Triton
| compilation) pretty much all projects run in eager mode. That's
| fine for experimentation and demonstrations in papers, but it's
| _crazy_ that it's used so much for production without any
| compilation. It would be like using debug C binaries in
| production -- and they only work with any kind of sane
| performance on a single CPU maker.

| fassssst wrote:
| Yup. I would much prefer it if every ML model had a simple C
| inference API that could be called directly from pretty much
| any language on any platform, without a mess of dependencies
| and environment setup.

| naillo wrote:
| ML is such a beautiful and perfect setup for dependency-free
| execution too. It should just be like downloading a
| mathematical function. I'm glad we're finally embracing that.
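[Ed. note: apropos of the "simple C inference API" wish above, a
sketch of what such a header could look like. All names here
(sd_ctx, sd_load, sd_txt2img, sd_free) are invented for
illustration; this is NOT the actual stable-diffusion.cpp
interface.]

    /* Hypothetical minimal C inference API -- illustrative only. */
    #include <stdint.h>

    typedef struct sd_ctx sd_ctx;              /* opaque handle */

    sd_ctx  *sd_load(const char *model_path);  /* NULL on failure */

    /* Returns width*height*3 RGB bytes (caller frees), or NULL. */
    uint8_t *sd_txt2img(sd_ctx *ctx, const char *prompt,
                        int width, int height,
                        int steps, int64_t seed);

    void     sd_free(sd_ctx *ctx);

[An opaque-handle C ABI like this is also what makes ffigen-style
binding generation, as in the Dart/Flutter experiment upthread,
straightforward.]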
| brucethemoose2 wrote:
| Llama.cpp/ggml is uniquely suited to LLMs. The memory
| requirements are _huge_, quantization is effective, and token
| generation is surprisingly serial and bandwidth bound, making
| it good for CPUs and an even better fit for ggml's unique
| pipelined CPU/GPU inference.
|
| ...But Stable Diffusion is not the same. It doesn't quantize as
| well, the UNet is very compute-intensive, and batched image
| generation is effective and useful to single users. It's a
| better fit for GPUs/IGPs. Additionally, it massively benefits
| from the hackability of the Python implementations.
|
| I think ML compilation to executables is the way for SD.
| AITemplate is already blazing fast [1], and TVM Vulkan is very
| promising if anyone actually fleshes out the demo
| implementation [2]. And they preserve most of the hackability
| of the pure PyTorch implementations.
|
| 1: https://github.com/VoltaML/voltaML-fast-stable-diffusion
|
| 2: https://github.com/mlc-ai/web-stable-diffusion

| WinLychee wrote:
| The above project somewhat supports GPUs if you pass the
| correct GGML compile flags to it; `GGML_CUBLAS`, for example,
| is supported when compiling. You get a decent speedup relative
| to pure C/C++.

| brucethemoose2 wrote:
| Interesting. It still doesn't seem to be very quick:
| https://github.com/leejet/stable-diffusion.cpp/issues/6
|
| But don't get me wrong, I look forward to playing with ggml SD
| and its development.

| WinLychee wrote:
| Yeah, for comparison, `tinygrad` takes a little over a second
| per iteration on my machine:
| https://github.com/tinygrad/tinygrad/blob/master/examples/st...

| brucethemoose2 wrote:
| Is that on GPU or CPU? 1 it/s would be very respectable on CPU.
|
| The fastest implementation on my 2060 laptop is AITemplate,
| being about 2x faster than pure optimized HF diffusers.

| WinLychee wrote:
| That was on GPU, and there are various CPU implementations
| (e.g. based on Tencent/ncnn) on GitHub that have similar
| runtimes (1-3 s / iteration).

| skykooler wrote:
| On the other hand, this is nice for anyone who wants to play
| with these networks locally and does not have an Nvidia GPU
| with 6+ gigabytes of VRAM. I can run this on an old laptop,
| even if it takes a while.

| brucethemoose2 wrote:
| I would also highly recommend https://tinybots.net/artbot
|
| You can even run CLIP or (if it's fast enough) llama.cpp on the
| CPU to contribute to the network, if you wish.

| gsharma wrote:
| ComfyUI works pretty well on old computers using the CPU. It
| takes over 30 seconds per sampling step on a 2015 MacBook Air
| (i7, 8 GB RAM).

| voz_ wrote:
| IIRC we had good speedups on it with torch.compile, and I
| remember working on it. Let me see if I can find numbers...

| brucethemoose2 wrote:
| It's about 20-40% depending on the GPU, from my tests.
|
| And only very recent builds of torch 2.1 (with dynamic input)
| work properly, and it still doesn't like certain input changes
| or augmentations like ControlNet.
|
| AIT is the most usable compiled implementation I have
| personally tested, but SHARK (running IREE/MLIR/Vulkan) and
| torch-mlir are said to be very good.
|
| Hidet is promising but doesn't really work yet. TVM doesn't
| have a complete implementation outside of the WebGPU demo.

| voz_ wrote:
| Try head of master. If there are any bugs or graph breaks you
| hit, lmk, I can take a look. My numbers say 71% with a few
| custom hacks.
|
| Glad the dynamic stuff is working out, though!

| brucethemoose2 wrote:
| I will, thanks!
|
| I have been away for a month, but I will start testing it again
| later and submit some issues I run into.
| voz_ wrote:
| My username without the underscore, at meta. Email me any bugs;
| I can help file them on GH and lend a hand fixing them.

| kpw94 wrote:
| This is incredibly easy to set up; just tried it for the first
| time.
|
| How fast is it supposed to go?
|
| Just tried on Linux with `cmake .. -DGGML_OPENBLAS=ON` on an
| AMD Ryzen 7 5700G (no discrete GPU, only integrated graphics):
|       ./bin/sd -m ../models/sd-v1-4-ggml-model-f32.bin \
|           -p "a lovely cat"
|       [INFO] stable-diffusion.cpp:2525 - loading model from
|           '../models/sd-v1-4-ggml-model-f32.bin'
|       ...
|       [INFO] stable-diffusion.cpp:3375 - start sampling
|       [INFO] stable-diffusion.cpp:3067 - step 1 sampling
|           completed, taking 12.25s
|       [INFO] stable-diffusion.cpp:3067 - step 2 sampling
|           completed, taking 12.22s
|       [INFO] stable-diffusion.cpp:3067 - step 3 sampling
|           completed, taking 12.56s
|       ...
|       sampling completed, taking 246.40s
|
| Is that expected performance?
|
| (EDIT: Don't have OpenBLAS installed, so that flag is a no-op.)

| patrakov wrote:
| CPU-only, 8-bit quant, Intel Core i7-4770S, 16 GB DDR3 RAM,
| 10-year-old fanless PC: 32 seconds per sampling step, correct
| output.

| badsectoracula wrote:
| This is nice; it basically does what I asked for a year ago[0],
| and at the time pretty much every solution wanted a litany of
| Python dependencies that I ended up failing to install because
| it took ages... and then I ran out of disk space.
|
| No, really, this replaces literal gigabytes of disk space with
| just a 799 KB binary. And as a bonus, using the Q8_0 format
| (the one that seems to be the fastest) saves ~2.3 GB of data
| too.
|
| That said, it seems to be buggy with anything other than the
| default 512x512 image size. Some sizes (e.g. 544x544) tend to
| cause assert failures, and sizes smaller than 512x512 (which I
| tried since 512x512 is quite slow on my PC) sometimes generate
| garbage (anything smaller than 384x384 seems to always do
| that).
|
| [0] https://news.ycombinator.com/item?id=32555608

| kpw94 wrote:
| Also got a segfault (core dump) with different sizes; 512w x
| 768h worked.

| brucethemoose2 wrote:
| You should quantize the model, but 12 s/iter seems about right.

| kpw94 wrote:
| Nice. Tried fp32, q8_0, and q4_0, and for some reason they all
| take ~12 s/iter.
|
| I must have something wrong with my setup, but no big deal; for
| my minimal usage of it, and the amount of time spent, fp32 at
| 12 s/iter is fine.

| brucethemoose2 wrote:
| Hmm, theoretically FP16 might be the fastest, if that's an
| option in the implementation now.

| Lockal wrote:
| I did a quick run under a profiler, and on my AVX2 laptop the
| slowest part (>50%) was matrix multiplication (sgemm).
|
| In the current version of GGML, if OpenBLAS is enabled, they
| convert matrices to FP32 before running sgemm.
|
| If OpenBLAS is disabled, on AVX2 platforms they convert FP16 to
| FP32 on every FMA operation, which is even worse (due to
| repetition). After that, both ggml_vec_dot_f16 and
| ggml_vec_dot_f32 take first place in the profiler.
|
| Source:
| https://github.com/ggerganov/ggml/blob/master/src/ggml.c#L10...
|
| But I agree that _in theory, and only with AVX-512_, BF16 (not
| exactly FP16, but similar) will be fast with the VDPBF16PS
| instruction. The implementation is not there yet.

| brucethemoose2 wrote:
| Interesting.
|
| I saw some discussion on llama.cpp that, theoretically,
| implementing matmul for each quantization should be much faster
| since it can skip the conversion. But practically, it's
| actually quite difficult, since the various BLAS libraries are
| so good.
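[Ed. note: the "skip the conversion" idea above -- doing the inner
product directly on quantized blocks instead of dequantizing to
fp32 first -- looks roughly like this. A simplified scalar sketch
modeled on ggml's Q8_0 paths, reusing the hypothetical block_q8
layout from the earlier quantization note; the real ggml kernels
are vectorized per architecture.]

    /* Dot product directly on Q8_0-style blocks: integer MACs
     * within each block, one float multiply per block pair --
     * no separate dequantization pass over the weights. */
    float vec_dot_q8(int nblocks,
                     const block_q8 *x, const block_q8 *y) {
        float sum = 0.0f;
        for (int b = 0; b < nblocks; b++) {
            int32_t isum = 0;
            for (int i = 0; i < QK; i++)
                isum += (int32_t)x[b].qs[i] * (int32_t)y[b].qs[i];
            sum += x[b].d * y[b].d * (float)isum;
        }
        return sum;
    }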
| billfruit wrote:
| It appears to be in C++; why state it as C/C++?

| mmcwilliams wrote:
| From what I understand, the underlying ggml dependency is
| written in C.
___________________________________________________________________
(page generated 2023-08-19 23:00 UTC)