[HN Gopher] Stable Diffusion based image compression ___________________________________________________________________ Stable Diffusion based image compression Author : nanidin Score : 416 points Date : 2022-09-20 03:58 UTC (19 hours ago) (HTM) web link (matthias-buehlmann.medium.com) (TXT) w3m dump (matthias-buehlmann.medium.com) | bjornsing wrote: | If it's a VAE then the latents should really be distributions, | usually represented as the mean and variance of a normal | distribution. If so then it should be possible to use the | variance to determine to what precision a particular latent needs | to be encoded. Could perhaps help increase the compression | further. | nullc wrote: | Why aren't they scaled to have uniform variances? | euphetar wrote: | I am currently also playing around with this. The best part is | that for storage you don't need to store the reconstructed image, | just the latent representation and the VAE decoder (which can do | the reconstructing later). So you can store the image as | relatively few numbers in a database. In my experiment I was able | to compress a (512, 384, 3) RGB image to (48, 64, 4) floats. In | terms of memory it was an 8x reduction. | | However, on some images the artefacts are terrible. It does not | work as a general-purpose lossy compressor unless you don't care | about details. | | The main obstacle is compute. The model is quite large, but hdds | are cheap. The real problem is that reconstruction requires a GPU | with lots of VRAM. Even with a GPU it's 15 seconds to reconstruct | an image in Google Colab. You could do it on CPU, but then it's | extremely slow. This is only viable if compute costs go down a | lot. | holoduke wrote: | In the future you can have full 16k movies representing only | 1.44mb seeds. A giant 500 petabyte trained model file can run | those movies. You can even generate your own movie by uploading a | book. | monokai_nl wrote: | Probably very unlikely, but sometimes I wonder if Jan Sloot did | something like this back in '95: | https://en.wikipedia.org/wiki/Sloot_Digital_Coding_System | aaroninsf wrote: | I would call this "confabulation" more than compression. | | Its accuracy is proportional to and bounded by the training data; | I suspect in practice it's got a specific strength (filling in | fungible detail) and as discussed ITT with fascinating and gnarly | corners, some specific failure modes which are going to lead to | bad outcomes. | | At least with "lossy" CODECs of various kinds, even if you don't | attend to absence until you do an A/B comparison, you can | perceive the difference when you do do those comparisons. | | In this case the serious peril is that an A/B comparison is | [soon] going to just show difference. "What... is... the Real?" | | When you contemplate that an ever-increasing proportion of the | training data itself stems from AI- or otherwise-enhanced | imagery, | | our hold on the real has never felt weaker, and our vulnerability | to the rewriting of reality, has never felt more present. | bane wrote: | The basic premise of these kinds of compression algorithms is | actually pretty clever. Here's a very _very_ trivialized version of | this style of approach: | | 1. both the compressor and decompressor contain knowledge beyond | the algorithm used to compress/decompress some data | | 2. in this case the knowledge might be "all the images in the | world" | | 3. when presented with an image, the compressor simply looks up | some index or identifier of the image | | 4.
the identifier is passed around as the "compressed image" | | 5. "decompression" means looking up the identifier and retrieving | the image | | I've heard this called "compression via database" before and it | can give the appearance of defeating Shannon's theorem for | compression even though it doesn't do that at all. | | Of course the author's idea is significantly more sophisticated | than the approach above, and trades a lossy approach for some | gains in storage and retrieval efficiency (we don't have to have | a copy of all of the pictures in the world in both the compressor | and the decompressor). The evaluation note of not using any known | image for the tests further challenges the approach and helps | suss out where there are specific challenges like poor | reconstruction of specific image constructs like faces or text -- | I suspect that there are many other issues like these but the | author homed in on these because we (as literate humans) are | particularly sensitive to them. | | In these types of lossy compression approaches (as opposed to the | above which is lossless) the basic approach is: | | 1. Throw away data until you get to the desired file size. You | usually want to come up with some clever scheme to decide what | data you toss out. Alternatively, just hash the input data using | some hash function that produces just the right number of bits | you want, but use a scheme that results in a hash digest that can | act as a (non-unique) index to the original image in a table of | every image in the world. | | 2. For images it's usually easy to eliminate pixels (resolution) | and color (bit-depth, channels, etc.). In this specific case, the | author uses a variational autoencoder to "choose" what gets | tossed. I suspect the autoencoder is very good at preserving | information rich, or high-entropy, information dense slices of a | latent space or something. At any rate, this produces something | that to us sorta kinda looks like a very low resolution, poorly | colored postage stamp of the original image, but actually | contains more data than that. I think at this point it can just | be considered the hash digest. | | 3. this hash digest, or VAE encoded image or whatever we want to | call it, is what's passed around as the "compressed" data. | | 4. just like above, "decompression" means effectively looking up | the value in a "database". If we are working with hash digests, | there was probably a collision during the construction of the | database of all images, so we lost some information. In this case | we're dealing with stable diffusion and instead of a simple | index->table entry, our "compressed" VAE image wraps through some | hyperspace to find the nearest preserved data. Since the VAE | "pixels" probably align close to data dense areas of the space | you tend to get back data that closely represents the original | image. It's still a database lookup in that sense, but it's | looking more for "similar" rather than "exact matches" which when | used to rebuild the image give a good approximation of the | original. | | Because it's an "approximation" it's "lossy". In fact I think | it'd be more accurate to say it's "generally lossy" as there is a | chance the original image can be reproduced _exactly_ , | especially if it's in the original training data. Which is why | the author was careful not to use anything from that set.
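A minimal Python sketch of the toy "compression via database" idea described above, assuming both sides somehow share the same image corpus; the 4-byte digest width is arbitrary, and collisions are exactly what makes the scheme lossy:

    import hashlib

    CORPUS = {}  # digest -> image bytes; built identically on both ends

    def build_corpus(images):
        for img in images:
            digest = hashlib.sha256(img).digest()[:4]
            CORPUS[digest] = img  # colliding images silently overwrite each other

    def compress(img):
        # The "compressed image" is just a 4-byte identifier, whatever the input size
        return hashlib.sha256(img).digest()[:4]

    def decompress(digest):
        # "Decompression" is a lookup; a colliding image may come back instead
        return CORPUS[digest]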
| | Because we've stored so much information in the compressor and | decompressor, it can also give the appearance of defeating | Shannon entropy for compression except it's also not because: | | a) it's generally lossy | | b) just like the original example above we're cheating by simply | storing lots of information elsewhere | | There's probably some deep mathematical relationship between the | author's approach and compressive sensing. | | Still, it's useful, and has the possibility of improving data | transmission speeds at the cost of storing lots of local data at | both ends. | | Source: Many years ago before deep learning was even a "thing", I | worked briefly on some compression algorithms in an effort to | reduce data transfer issues in telecom poor regions. One of our | approaches was not too dissimilar to this -- throw away a bunch | of the original data in a structured way and use a smart | algorithm and some stored heuristics in the decompressor to guess | what we threw away. Our scheme had the benefit of almost | absolutely trivial "compression" with the downside of massive | computational needs on the "decompression" side, but had lots of | nice performance guarantees which you could use to design the | data transport stuff around. | | *edit* sorry if this explanation is confusing, it's been a while | and it's also very late where I am. I just found this post really | fun. | nl wrote: | For people interested in more about this, it's probably worth | reading the Hutter Prize FAQ: http://prize.hutter1.net/hfaq.htm | tomxor wrote: | Doesn't decompression require the entire stable diffusion model? | (and the exact same model at that) | | This could be interesting but I'm wondering if the compression | size is more a result of the benefit of what is essentially a | massive offline dictionary built into the decoder vs some | intrinsic benefit to processing the image in latent space based | on the information in the image alone. | | That said... I suppose it's actually quite hard to implement a | "standard image dictionary" and this could be a good way to do | that. | operator-name wrote: | The latent space _is_ the massive offline dictionary, and the | benefit is not having to hand craft the massive offline | dictionary? | tomxor wrote: | For those of us unfamiliar... roughly how large is that in | terms of bytes? | tantalor wrote: | I thought that's what "some important caveats" was going to be, | but no, the article didn't mention this. | thehappypm wrote: | Haha. Here's a faster compression model. Make a database of | every image ever made. Compute a thumbprint and use that as the | index of the database. Boom! | Sohcahtoa82 wrote: | A quick Google says there are 10^72 to 10^82 atoms in the | universe. | | Assuming 24-bit color, if you could store an entire image in | a single atom, then you could store images that are only 60 | pixels and each atom would still have a unique image. | thehappypm wrote: | Not every possible image has been produced! | Sohcahtoa82 wrote: | I'll get started, then! | Xcelerate wrote: | Great idea to use Stable Diffusion for image compression. There | are deep links between machine learning and data compression | (which I'm sure the author is aware of). | | If you could compute the true conditional Kolmogorov complexity | of an image or video file given all visual online media as the | prior, I imagine you would obtain mind-blowing compression | ratios.
| | People complain of the biased artifacts that appear when using | neural networks for compression, but I'm not concerned in the | long term. The ability to extract algorithmic redundancy from | images using neural networks is obviously on its way to | outclassing manually crafted approaches, and it's just a matter | of time before we are able to tack on a debiasing step to the | process (such that the distribution of error between the | reconstructed image and the ground truth has certain nice | properties). | aaaaaaaaaaab wrote: | Save around a kilobyte with a decompressor that's ~5Gbyte. | egypturnash wrote: | _To evaluate this experimental compression codec, I didn't use | any of the standard test images or images found online in order | to ensure that I'm not testing it on any data that might have | been used in the training set of the Stable Diffusion model | (because such images might get an unfair compression advantage, | since part of their data might already be encoded in the trained | model)._ | | I think it would be _very interesting_ to determine if these | images _do_ come back with notably better compression. | bane wrote: | Given the approach, they'll probably come back with better | reconstruction/decompression too. | pishpash wrote: | Not clear. Fully encoding the training images could not be a | feasible aspect of a good auto-encoder. | Dwedit wrote: | On another note, you can also downscale an image, save it as a | JPEG or whatever, then upscale it back using AI upscaling. | madsbuch wrote: | It is really interesting to talk about semantic lossy | compression, which is probably what we get. | | Where recreating with traditional codecs introduces syntactic | noise, this will introduce semantic noise. | | Imagine seeing a high res perfect picture, just until you see the | source image and discover that it was reinterpreted.. | | It is also going to be interesting to see if this method will be | chosen for specific pictures, eg. pictures of celebrity objects | (or people, when/if issues around that resolve), but for novel | things, we need to use "syntactical" compression. | lastdong wrote: | Extraordinary! Is it going to be called Pied Piper? | mjan22640 wrote: | What they do is essentially a fractal compression with an | external library of patterns (that was IIRC patented but the | patent should be long expired). | pishpash wrote: | This does remind me of fractal compression [1] from the 90's which | never took off for various reasons which will be relevant here | as well. | | [1] https://en.wikipedia.org/wiki/Fractal_compression | eru wrote: | Compare compressed sensing's single pixel camera: | https://news.mit.edu/2017/faster-single-pixel-camera-lensles... | fritzo wrote: | I'd love to see a series of increasingly compressed images, say | 8kb -> 4kb -> 2kb -> ... -> 2bits -> 1bit. This would be a great | way to demonstrate the increasing fictionalization of the | method's recall. | minimaxir wrote: | For text, GPT-2 was used in a similar demo a year ago albeit said | demo is now defunct: | https://news.ycombinator.com/item?id=23618465 | DrNosferatu wrote: | Nice work! | | However, a cautionary tale on AI medical image "denoising": | | (and beyond, in science) | | - See the artifacts? | | The algorithm plugs into ambiguous areas of the image stuff it | has seen before / it was trained with.
So, if such a system was | to "denoise" (or compress, which - if you think about it - is | basically the same operation) CT scans, X-rays, MRIs, etc., in | ambiguous areas it could plug in diseased tissue where the | ground-truth was actually healthy. | | Or the opposite, which is even worse: substitute diseased areas | of the scan with healthy looking imagery it had been trained on. | | Reading recent publications that try to do "denoising" or | resolution "enhancement" in medical imaging contexts, the authors | seem to be completely oblivious to this pitfall. | | (maybe they had a background as World Bank / IMF economists?) | ska wrote: | People have been publishing fairly useless papers "for" medical | imaging enhancement/improvement for 3+ decades now. NB this is | not universal (there are some good ones) and _not_ limited to | AI techniques, although essentially every AI technique that | comes along gets applied to compression | /denoising/"superres"/etc. if it can, eventually. | | The main problem is that typical imaging researchers are | too far from actual clinical applications, and often trying to | solve the wrong problems. It's a structural problem with | academic and clinical incentives, as much as anything else. | fny wrote: | There is nothing in the article suggesting this should be used | for medical imaging. | gregw134 wrote: | Fun to imagine this could show up in future court cases. Is the | picture true, or were details changed by the ai compression | algorithm? | petesergeant wrote: | From the article: | | > a bit of a danger of this method: One must not be fooled by | the quality of the reconstructed features -- the content may be | affected by compression artifacts, even if it looks very clear | | ... plus an excellent image showing the algorithm straight | making stuff up, so I suspect the author is aware. | anarticle wrote: | In my xp, medical imaging at the diagnostic tier uses only | lossless (JPEG2000 et al). It was explicitly mentioned on our | SOP/policies that we had to have a lossless setup. | | Very sketchy to use super resolution for diagnostics. In | research (fluorescence), sure. | | ref: my direct experience of pathology slide scanning machines | and their setup. | adammarples wrote: | Mentioned in TFA at least twice | Der_Einzige wrote: | Sounds like you need lossless compression. | | I was told that the GPT-2 text compression variant was a | lossless compressor (https://bellard.org/libnc/gpt2tc.html), | why is stable diffusion lossy? | operator-name wrote: | Probably something to do with the variational auto encoder, | which is lossy. | theemathas wrote: | Here's a similar case of a scanner using a traditional | compression algorithm. It has a bug in the compression | algorithm, which made it replace a number in the scanned image | with a different number. | | https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres... | sgc wrote: | That is completely outside all my expectations prior to | reading it. The consequences are potentially life and death, | or incarceration, etc, and yet they did nothing until called | out and basically forced to act. | | A good reminder that the bug can be anywhere, and when things | stop working we often need to get very dumb, and just | methodically troubleshoot. | function_seven wrote: | We programmers tend to think our abstractions match reality | somehow. Or that they don't leak. Or even if they _do_ | leak, that leakage won't spill down several layers of | abstraction.
| | I used to install T1 lines a long time ago. One day we had | a customer that complained that their T1 was dropping every | afternoon. We ran tests on the line for extended periods of | time trying to troubleshoot the problem. Every test passed. | Not a single bit error no matter what test pattern we used. | | We monitored it while they used it and saw not a single | error, except for when the line completely dropped. We | replaced the NIU card, no change. | | Customer then hit us with, "it looks like it only happens | when Jim VNCs to our remote server". | | Obviously a userland program (VNC) could not possibly cause | our NIU to reboot, right?? It's several layers "up the | stack" from the physical equipment sending the DS1 signal | over the copper. | | But that's what it was. We reliably triggered the issue by | running VNC on their network. We ended up changing the NIU | and corresponding CO card to a different manufacturer (from | Adtran to Soneplex I think?) to fix the issue. I wish I had | had time to really dig into that one, because obviously | other customers used VNC with no issues. Adtran was our | typical setup. Nothing else was weird about this standard | T1 install. But somehow the combination of our equipment, | their networking gear, and that program on that workstation | caused the local loop equipment to lose its mind. | | This number-swapping story hit me the same way. We would | all expect a compression bug to manifest as blurry text, or | weird artifacts. We would never suspect a clean | substitution of a meaningful symbol in what is "just a | raster image". | jend wrote: | Reminds me of this story: | http://blog.krisk.org/2013/02/packets-of-death.html | | tldr: Specific packet content triggers a bug in the | firmware of an Intel network card and bricks it until | powered off. | SV_BubbleTime wrote: | Iirc, this was an issue or conspiracy fuel or whatever with | the birth certificate that Obama released. That some of the | unique elements in the scan repeated over and over. | caycep wrote: | I assume something like jpeg (used in the DICOM standard | today) has more eyes on the code than proprietary Xerox | stuff? hopefully at least... | | I have seen weird artifacts on MRI scans, specifically the | FLAIR image enhancement algorithm used on T2 images, i.e. | white spots, which could in theory be interpreted by a | radiologist as small strokes or MS...so I always take what I | see with a grain of salt.. | ska wrote: | The DICOM standard stuff did have a lot of eyes on it, and | was tuned toward fidelity which helps. It's not perfect, | but what is. | | MRI artifacts though are a whole can of worms, but | fundamentally most of them come from a combination of the | EM physics involved, and the reconstruction algorithm | needed to produce an image from the frequency data. | | I'm not sure what you mean by "image enhancement | algorithm"; FLAIR is a pulse sequence used to suppress | certain fluid signals, typically used in spine and brain. | | Many of the bright spots you see in FLAIR are due to B1 | inhomogeneity, iirc (it's been a while though) | ska wrote: | Probably worth mentioning also that "used in DICOM | standard" is true but possibly misleading to someone | unfamiliar with it. | | DICOM is a vast standard. In its many crevasses, it | contains wire and file encoding schemas, some of which | include (many different) image data types, some of which | allow (multiple) compression schemes, both lossy and | lossless, as well as metadata schemes.
These include JPEG, | JPEG-LS, JPEG-2000, MPEG2/4, HVEC. | | I think you have to encode the compression ratio as well, | if you do lossy compression. You definitely have to note | that you did lossy compression. | [deleted] | vjeux wrote: | How long does it take to compress and decompress an image that | way? | fzzt wrote: | The prospect of the images getting "structurally" garbled in | unpredictable ways would probably limit real-world applications: | https://miro.medium.com/max/4800/1*RCG7lcPNGAUnpkeSsYGGbg.pn... | | There's something to be said about compression algorithms being | predictable, deterministic, and only capable of introducing | defects that stand out as compression artifacts. | | Plus, decoding performance and power consumption matters, | especially on mobile devices (which also happens be the setting | where bandwidth gains are most meaningful). | kevincox wrote: | While that is kind of true it is also sort of the point. | | The optimal lossy compression algorithm would be based on | humans as a target. it would remove details that we wouldn't | notice to reduce the target size. If you show me a photo of a | face in front of some grass the optimal solution would likely | be to reproduce that face in high detail but replace the grass | with "stock imagery". | | I guess it comes down to what is important. In the past | algorithms were focused on visual perception, but maybe we are | getting so good at convincingly removing unnecessary detail | that we need to spend more time teaching the compressor what | details are important. For example if I know the person in the | grass preserving the face is important. If I don't know them | then it could be replaced by a stock face as well. Maybe the | optimal compression of a crowd of people is the 2 faces of | people I know preserved accurately and the rest replaced with | "stock" faces. | anilakar wrote: | Remember the Xerox scan-to-email scandal in which tiling | compression was replacing numbers in structural drawings? We're | talking about similar repercussions here. | behnamoh wrote: | This reminds me of a question I have about SD: why can't it do | a simple OCR to know those are characters not random shapes? | It's baffling that neither SD nor DE2 have any understanding of | the content they produce. | nl wrote: | > why can't it do a simple OCR to know those are characters | not random shapes? | | It's pretty easy to add this if you wanted to. | | But a better method would be to fine tune on a bunch of | machine-generated images of words if you want your model to | be good at generating characters. You'll need to consider | which of the many Unicode character sets you want your model | to specialize in though. | Xcelerate wrote: | You could certainly apply a "duct tape" solution like that, | but the issue is that neural networks were developed to | replace what were previously entire solutions built on a | "duct tape" collection of rule-based approaches (see the | early attempts at image recognition). So it would be nice to | solve the problem in a more general way. | montebicyclelo wrote: | Just a note that stable diffusion is/can be deterministic (if | set an rng seed). | shrx wrote: | I was told (on the Unstable Diffusion discord, so this info | might not be reliable) that even with using the same seed the | results will differ if the model is running on a different | GPU. This was also my experience when I couldn't reproduce | the results generated by the discord's SD txt2img generating | bot. 
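As an aside on the determinism question above, a minimal sketch of pinning the seed with the Hugging Face diffusers pipeline; the model id, prompt, and dtype are placeholders, and bit-exact reproduction across different GPUs or library versions is not guaranteed:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    # A fixed generator makes a run repeatable on the same hardware/software stack
    generator = torch.Generator(device="cuda").manual_seed(1234)
    image = pipe("a photo of San Francisco", generator=generator).images[0]
    image.save("sf_seed1234.png")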
| nl wrote: | It absolutely should be reproducable, and in my experience | it is. | | I do tend to use the HuggingFace version though. | montebicyclelo wrote: | I'm not sure about the different GPU issue. But if that is | an issue, the model can be made deterministic (probably | compromising inference speed), by making sure the | calculations are computed deterministically. | cma wrote: | With compression you often make a prediction then delta off of | it. A structurally garbled one could be discarded or just | result in a worse baseline for the delta. | bscphil wrote: | A few thoughts that aren't related to each other. | | 1. This is a brilliant hack. Kudos. | | 2. It would be great to see the best codecs included in the | comparison - AVIF and JPEG XL. Without those it's rather | incomplete. No surprise that JPEG and WEBP totally fall apart at | that bitrate. | | 3. A significant limitation of the approach seems to be that it | targets extremely low bitrates where other codecs fall apart, but | at these bitrates it incurs problems of its own (artifacts take | the form of meaningful changes to the source image instead of | blur or blocking, very high computational complexity for the | decoder). | | When only moderate compression is needed, codecs like JPEG XL | already achieve very good results. This proof of concept focuses | on the extreme case, but I wonder what would happen if you | targeted much higher bitrates, say 5x higher than used here. I | suspect (but have no evidence) that JPEG XL would improve in | fidelity _faster_ as you gave it more bits than this SD-based | technique. _Transparent_ compression, where the eye can 't tell a | visual difference between source and transcode (at least without | zooming in) is the optimal case for JPEG XL. I wonder what sort | of bitrate you'd need to provide that kind of guarantee with this | technique. | leeoniya wrote: | also thought it was odd that AVIF was not compared - it would | show a major quality and size improvement over WebP. | [deleted] | goombacloud wrote: | The comparison doesn't make much sense because for fair | comparisons you have to measure decompressor size plus encoded | image size. The decompressor here is super huge because it | includes the whole AI model. Also, everyone needs to have the | exact same copy of the model in the decompressor for it to work | reliably. | wongarsu wrote: | Only if decompressor and image are transmitted over the same | channel at the same time, and you only have a small number of | images. When compressing images for the web I don't care if a | webp decompressor is smaller than a jpg or png decompressor, | because the recipient already has all of those. | | Of course stable diffusion's 4GB is much more extreme than | Brotli's 120kb dictionary size, and would bloat a Browser's | install size substantially. But for someone like Instagram or | a Camera maker it could still make sense. Or imagine phones | having the dictionary shipped in the OS to save just a couple | kB on bad data connections. | operator-name wrote: | Even if dictionaries were shipped, the biggest difficulty | would be performance and resources. Most of these models | require beefy compute and a large amount of VRAM that isn't | likely to ever exist on end devices. | | Unless that can be resolved it just doesn't make sense to | use it as a (de)compressor. | a-dub wrote: | hm. 
would be interesting to see if any of the perceptual image | compression quality metrics could be inserted into the vae step | to improve quality and performance... | fjkdlsjflkds wrote: | This is not really "stable-diffusion based image compression", | since it only uses the VAE part of "stable diffusion", and not | the denoising UNet. | | Technically, this is simply "VAE-based image compression" (that | uses stable diffusion v1.4's pretrained variational autoencoder) | that takes the VAE representations and quantizes them. | | (Note: not saying this is not interesting or useful; just that | it's not what it says on the label) | | Using the "denoising UNet" would make the method more | computationally expensive, but probably even better (e.g., you | can quantize the internal VAE representations more aggressively, | since the denoising step might be able to recover the original | data anyway). | gliptic wrote: | It is using the UNet, though. | nl wrote: | It does use the UNet to denoise the VAE compressed image: | | "The dithering of the palettized latents has introduced noise, | which distorts the decoded result. But since Stable Diffusion | is based on de-noising of latents, we can use the U-Net to | remove the noise introduced by the dithering." | | The included Colab doesn't have line numbers, but you can see | the code doing it:
|
|     # Use Stable Diffusion U-Net to de-noise the dithered latents
|     latents = denoise(latents)
|     denoised_img = to_img(latents)
|     display(denoised_img)
|     del latents
|     print('VAE decoding of de-noised dithered 8-bit latents')
|     print('size: {}b = {}kB'.format(sd_bytes, sd_bytes/1024.0))
|     print_metrics(gt_img, denoised_img)
| fjkdlsjflkds wrote: | I stand corrected, then :) cheers. | zcw100 wrote: | You can do lossless neural compression too. | fho wrote: | > Quantizing the latents from floating point to 8-bit unsigned | integers by scaling, clamping and then remapping them results in | only very little visible reconstruction error. | | This might actually be interesting/important for the OpenVINO | adaptation of SD ... from what I gathered from the OpenVINO | documentation, quantizing is actually a big part of optimizing as | this allows the usage of Intel's new(-ish) NN instruction sets. | stavros wrote: | Didn't I do this last week? | dwohnitmok wrote: | Indeed one way of looking at intelligence is that it is a method | of compressing the external universe. | | See e.g. the Hutter Prize. | mjan22640 wrote: | The feeling of understanding is essentially a decompression | result being successfully pattern matched. | dan_mctree wrote: | Our sight is light detection compressed into human thought | | Written language is human thought compressed into words | | Digital images are light detection compressed into bits | | Text to images AI compress digital images into written language | | Then how do the AI weights relate to human thought? | Jack000 wrote: | The vae used in stable diffusion is not ideal for compression. I | think it would be better to use the vector-quantized variant (by | the same authors of latent diffusion) instead of the KL variant, | then store the indexes for each quantized vector using standard | entropy coding algorithms. | | From the paper the VQ variant also performs better overall, SD | may have chosen the KL variant only to lower vram use. | GaggiX wrote: | KL models perform better than VQ models as you can see in the | latent diffusion repo by CompVis.
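A rough sketch of the scale/clamp/remap quantization step quoted above, not the author's exact code; the clamping range is an assumed value:

    import torch

    def quantize_latents(latents, lim=5.0):
        # Clamp to an assumed dynamic range, then remap floats to 8-bit unsigned ints
        z = latents.clamp(-lim, lim)
        return ((z + lim) / (2 * lim) * 255.0).round().to(torch.uint8)

    def dequantize_latents(q, lim=5.0):
        # Inverse mapping; the rounding/dithering error is what the U-Net
        # de-noising pass described above tries to clean up afterwards
        return q.to(torch.float32) / 255.0 * (2 * lim) - lim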
| Jack000 wrote: | just checked the paper again and yes you're right, the KL | version is better on the openimages dataset. The VQ version | is better in the inpainting comparison. | | In this case you'd still want to use the VQ version though, | it doesn't make sense to do an 8bit quantization on the KL | vectors when there's an existing quantization learned through | training. | akvadrako wrote: | I would like to see this with much smaller file sizes - like 100 | bytes. How well can SD preserve the core subjects or meaning of | the photos? | pishpash wrote: | You can already "compress" them down to a few words, so you | have your answer there. | fla wrote: | Is there a general name for this kind of latent space round-trip | compression? If not, I think a good name could be "interpretive | compression" | pyinstallwoes wrote: | This relates to a strong hunch that consciousness is tightly | coupled to whatever compression is as an irreducible entity. | | Memory <> Compression <> Language <> Signal Strength <> Harmonics | and Ratios | mjan22640 wrote: | Consciousness is IMHO being aware of being aware. The mystic | specialty of it is IMHO a mental illusion, like the Penrose | ladder optical illusion. | eru wrote: | I see the relation between compression and consciousness. But | what do you mean by irreducible entity, and how does it relate | to the two? | pyinstallwoes wrote: | By irreducible entity, as the yet undefined entity that sits | at the nexus of mathematics, philosophy, computation, logic | (consciousness). | | It's not a well defined ontology yet. So whatever it is, at | its irreducible size pinpointing it as a thing in which gives | rise to such other things. | eru wrote: | What kind of reductions would be disallowed? | nl wrote: | I don't understand much of what the OP is saying. | | But I do like the Stephen Wolfram idea of consciousness being | the way a computationally bounded observer develops a | coherent view of a branching universe. | | This is related to compression because it is a (lossy!) | reduction in information. | | I understand that Wolfram is controversial, but the | information-transmission-centric view of reality he works | with makes a lot of intuitive sense to me. | | https://writings.stephenwolfram.com/2021/03/what-is-consciou... | jwr wrote: | While this is great as an experiment, before you jump into | practical applications, it is worth remembering that the | decompressor is roughly 5GB in size :-) | red75prime wrote: | It reminded me of a scene from "A Fire Upon the Deep" where | connection bitrate is abysmal, but the video is crisp and | realistic. It is used as a tool for deception, as it happens. | Invisible information loss has its costs. | Dwedit wrote: | This is why for compression tests, they incorporate the size of | everything needed to decompress the file. You can compress down | to 4.97KB all you want, just include the 4GB trained model. | janekm wrote: | Is that true? I have never seen this done for any image | compression comparisons that I have seen (i.e. only data that | is specific to the image that is being compressed is included, | not standard tables that are always used by the algorithm like | the quantisation tables used in JPG compression) | jerf wrote: | Yes, it is done all the time. | | However, several people here are conflating "best compression | as determined for a competition" and "best compression for | use in the real world".
There is an important relationship | between them, absolutely, but in the real world we do not | download custom decoders for every bit of compressed content. | Just because there is a competition that quite correctly | measures the entire size of the decompressor and encoded | content does not mean that is now the only valid metric to | measure decompression performance. The competitions use that | metric for good and valid reasons, but those good and valid | reasons are only vaguely correlated to the issues faced in | the normal world. | | (Among the reasons why competitions must include the size of | the decoder is that without that the answer is trivial; I | define all your test inputs as a simple enumeration of them | and my decoder hard-codes the output as the test values. This | is trivially the optimal algorithm, making competition | useless. If you could have a real-world encoder that worked | this well, and had the storage to implement it, it would be | optimal, but you can't possibly store all possible messages. | For a humorous demonstration of this encoding method, see the | classic joke: https://onemansblog.com/2010/05/18/prison-joke/ | ) | fsiefken wrote: | For text compression benchmarks it's done | http://mattmahoney.net/dc/text.html | | Matt doesn't do this on the Silesia corpus compression | benchmark, even though it would make sense there as well: | http://mattmahoney.net/dc/silesia.html | | So a compressor of a few gigabyte would make sense if you | would have a set of pictures of more then a few gigabyte. | It's a bit similar to preprocessing text compression with a | dictionary and adding the dictionary to the extractor to | squeeze a bit more bytes. | goombacloud wrote: | By the way, the leading nncp in the LTCB (text.html) "is a | free, experimental file compressor by Fabrice Bellard, | released May 8, 2019" :) | Gigachad wrote: | Do you also include the library to render a jpeg? And maybe the | whole OS required to display it on your screen? | | There are very many uses where any fixed overhead is | meaningless. Imagine archiving billions of images for long term | storage. The 4GB model quickly becomes meaningless. | stavros wrote: | > Do you also include the library to render a jpeg? And maybe | the whole OS required to display it on your screen? | | No, what does that have to do with reconstructing the | original data? | | If the fixed overhead works for you, that's fine, but | including it is not meaningless. | 112233 wrote: | Fixed overheads are never meaningless. O(n^2) algorithm that | processes your data in 5s is faster on your data than O(log | n) that takes 20 hours. | | Long term storage of billions of images is meaningless, if it | takes billions of years to archive these images. | Gigachad wrote: | It's a one time cost rather than per image. You need the | 4GB model only once and then you can uncompress unlimited | images. | 112233 wrote: | Yes, but each image needs access to this 4GB (actually, I | have no idea how much RAM it takes up), plus whatever the | working set size is. It is a non-trivial overhead that | really limits throughput of your system, so you can | process less images in parallel, so compressing billion | of images in reasonable time suddenly may cost much more | than the amount of storage it would save, compared to | other methods. | quickthrower2 wrote: | If this were used in the wild, do you need a copy of the model | locally to decompress the images? 
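A back-of-envelope amortization check for the fixed-overhead argument above, using the ~4 GB model and ~5 kB per image discussed in this thread and assuming ~100 kB for a JPEG of comparable quality (that last figure is a guess):

    model_bytes = 4 * 1024**3   # decompressor overhead, shared by all images
    jpeg_bytes = 100 * 1024     # assumed size of a comparable JPEG
    sd_bytes = 5 * 1024         # compressed representation from the article

    saved_per_image = jpeg_bytes - sd_bytes
    break_even = model_bytes / saved_per_image
    print(f"model overhead pays for itself after ~{break_even:,.0f} images")
    # roughly 44,000 images: negligible for a large archive, prohibitive for a one-off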
| coffee_beqn wrote: | And how much compute time/power does "decompressing" take | compared to a jpg? | mcbuilder wrote: | Yes, but possibly not the entire model, hypothetically for | instance some fine-tuning on compression and then distillation. | Gigachad wrote: | I can imagine some uses for this. Imagine having to archive a | massive dataset where it's unlikely any individual image will | be retrieved and where perfect accuracy isn't required. | | Could cut down storage costs a lot. | kgeist wrote: | I heard Stable Diffusion's model is just 4 GB. It's incredible | that billions of images could be squeezed in just 4 GB. Sure it's | lossy compression but still. | eru wrote: | In this regard, stable diffusion is not so much comparable to a | corpus of jpeg images as to the jpeg compression | algorithms. | akomtu wrote: | I think it's easy to explain. If we split all those images into | small 8x8 chunks, and put all the chunks into a fuzzy and a bit | lossy hashtable, we'll see that many chunks are very similar | and can be merged into one. To address this "space of 8x8 | chunks" we'll apply PCA to them, just like in jpeg, and use | only the top most significant components of the PCA vectors. | | So in essence, this SD model is like an Alexandria library of | visual elements, arranged on multidimensional shelves. | nl wrote: | I don't think that thinking of it as "compression" is useful, | any more than an artist recreating the Mona Lisa from memory is | "decompressing" it. The process that diffusion models use is | fundamentally different to decompression. | | For example, if you prompt Stable Diffusion with "Mona Lisa" | and look at the iterations, it is clearer what is happening - | it's not decompressing so much as drawing something it knows | looks like Mona Lisa and then iterating to make it look clearer | and clearer. | | It clearly "knows" what the Mona Lisa looks like, but what it | is doing isn't copying it - it's more like recreating a thing | that looks like it. | | (And yes I realize lots of artists on Twitter are complaining | that it is copying their work. I think "forgery" is a better | analogy than "stealing" though - it can create art that looks | like a Picasso or whatever, but it isn't copying it in a | conventional sense) | Gigachad wrote: | Forgery requires some kind of deception/fraud. Painting an | imitation of the Mona Lisa isn't forgery. Trying to sell it | as if it is the original is. | nl wrote: | Yes I agree with this too. | | I think using that language is better than "stealing", | because the immoral act is the passing off, not the training of | the model. | ilaksh wrote: | What if I just want something pretty similar but not necessarily | the exact image? Maybe there could be a way to find a somewhat | similar text prompt as a starting point, and then add in some | compressed information to adjust the prompt output to be just a | bit closer to the original? | MarkusWandel wrote: | The one with the different buildings in the reconstructed image | is a bit spooky. I've always argued that human memory is highly | compressed, storing, for older memories anyway, a "vibe" plus | pointers to relevant experiences/details that can be used to | flesh it out as needed. Details may be wrong in the | recollecting/retelling, but the "feel" is right. | | And here we have computers doing the same thing! Reconstructing | an image from a highly compressed memory and filling in | appropriate, if not necessarily exact details.
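A rough numpy sketch of the 8x8-chunk PCA intuition akomtu describes above; the random "image" and the number of kept components are placeholders for illustration only:

    import numpy as np

    img = np.random.rand(256, 256)          # stand-in for a grayscale image
    patches = (img.reshape(32, 8, 32, 8)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, 64))         # 1024 patches of 8x8 = 64 values each

    mean = patches.mean(axis=0)
    U, S, Vt = np.linalg.svd(patches - mean, full_matrices=False)

    k = 8                                   # keep only the top principal components
    coeffs = (patches - mean) @ Vt[:k].T    # 64 values -> k values per patch
    approx = coeffs @ Vt[:k] + mean         # crude reconstruction from k values

    print("kept", k, "of 64 values per patch, MSE:",
          float(((patches - approx) ** 2).mean()))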
Human eye looks at | it casually and yeah, that's it, that's how I remember it. Except | that not all the details are right. | | Which is one of those "Whoa!" moments, like many many years ago, | when I wrote a "Connect 4" implementation in BASIC on the | Commodore 64, played it and lost! How did the machine get so | smart all of a sudden? | illubots wrote: | In theory, it would be possible to benefit from the ability of | Stable Diffusion to increase perceived image quality without even | using a new compression format. We could just enhance existing | JPG images in the browser. | | There already are client side algorithms that increase the | quality of JPGs a lot. For some reason, they are not used in | browsers yet. | | A Stable Diffusion based enhancement would probably be much nicer | in most cases. | | There might be an interesting race to do client side image | enhancements coming to the browsers over the next years. | codeflo wrote: | One interesting feature of ML-based image encoders is that it | might be hard to evaluate them with standard benchmarks, because | those are likely to be part of the training set, simply by virtue | of being scraped from the web. How many copies of Lenna has | Stable Diffusion been trained with? It's on so many websites. | zxexz wrote: | We might enter a time when every time a new model/compression | algo is introduced, a new series of benchmark images may need | to be introduced/taken and ALL historical benchmarks of major | compression algos redone on the new images. | seydor wrote: | Is there something like this for live video chat? | FrostKiwi wrote: | I thought this was another take on this parody post: | https://news.ycombinator.com/item?id=32671539 | | But no, it's the real deal. Great job author. | nl wrote: | This but for video using the "infilling" version for changing | parts between frames. | | The structural changes per frame matter much less. Send a 5kB | image every keyframe then bytes per subsequent image with a | sketch of the changes and where to mask them on the frame. | | Modern video codecs are pretty amazing though, so not sure how it | would compare in frame size | willbudd wrote: | I've been thinking about more or less the same idea, but the | computational edge inference costs probably makes it | impractical for most of today's client devices. I see a lot of | potential in this direction in the near future though. | nl wrote: | I think it's unclear how much computational resources the | uncompression steps take. | | At the moment it's fairly fast, but RAM hungry. But this | article makes it clear that quantizing the representation | works well (at least for the VAE). It's possible quantized | models could also do decent jobs. | swayvil wrote: | This is the algorithmic equivalent of a metaphor. | bane wrote: | Goodness, I love this. It's a great description of the | approach. | criddell wrote: | Before I clicked through to the article, I thought maybe they | were taking an image and spitting out a prompt that would | produce an image substantially similar to the original. | sod wrote: | This may give insights in how brain memory and thinking works. | | Imagine if some day a computer could take a snapshot of the | weights and memory bits of the brain and then reconstruct | memories and thoughts. | epmaybe wrote: | This kind of already fits a little bit with how the brain | processes images where there is information lacking. | Neurocognitive specialists can likely correct me on the | following. 
| | Glaucoma is a disease where one slowly loses peripheral vision, | until a small central island remains or you go completely | blind. | | So do patients perceive black peripheral vision? Or blurred | peripheral vision? | | Not really...patients actually make up the surrounding | peripheral vision, sometimes with objects! | SergeAx wrote: | Does anybody understand from the article, how much data needed to | be downloaded first on decompression side? The entire SD weights | 2GB array, right? | RosanaAnaDana wrote: | Something interesting about the San Francisco test image is that | if you start to look into the details, its clear that some real | changes have been made to the city. Rather than losing texture or | grain or clarity, the information lost in this is information | about the particular layout of a neighborhood of streets, which | has now been replaced as if some one were drawing the scene from | memory. A very different kind of loss that with out the original | might be imperceptible because the information that was lost | isn't replaced with random or systematic noise, but rather new, | structured information.. | jhrmnn wrote: | It's interesting that this is closer to how human memory | operates--we're quite good in unconsciously fabricating false | yet strong memories. | laundermaf wrote: | True, but I'd like to continue using products that produce | close-to-real images. Phones nowadays already process images | at lot. The moment they start replacing pixels it'll all be | fake. | | And... Some manufacturer apparently already did it on their | ultra zoom phones when taking photos of the moon. | NavinF wrote: | Meh. Cameras have been "replacing pixels" for as long as | I've been alive. Consider that a 4K camera only has 2k*4k | pixels whereas a 4K screen has 2k*4k*3 subpixels. | | 2/3 of the image is just dreamed up by the ISP (image | signal processor) when it debayers the raw image. | | I'm not aware of any consumer hardware that has open source | ISP firmware or claims to optimize for accuracy over | beauty. | montroser wrote: | Okay, but a camera doing this is unlikely to dream up | plausible features that didn't actually exist in the | scene. | NavinF wrote: | Of course it is! Try feeding static into a modern ISP. It | will find patterns that don't exist. | taberiand wrote: | I would've thought anyone relying on lossy-compressed images of | any sort already needs to be aware of the potential effects, or | otherwise isn't really concerned by the effect on the image | (and I'd guess that the vast majority of use cases actually | don't care if parts of the image are essentially "imaginary") | aaaaaaaaaaab wrote: | The good old JBIG2 debacle. | | "When used in lossy mode, JBIG2 compression can potentially | alter text in a way that's not discernible as corruption. This | is in contrast to some other algorithms, which simply degrade | into a blur, making the compression artifacts obvious.[14] | Since JBIG2 tries to match up similar-looking symbols, the | numbers "6" and "8" may get replaced, for example. | | In 2013, various substitutions (including replacing "6" with | "8") were reported to happen on many Xerox Workcentre | photocopier and printer machines, where numbers printed on | scanned (but not OCR-ed) documents could have potentially been | altered. This has been demonstrated on construction blueprints | and some tables of numbers; the potential impact of such | substitution errors in documents such as medical prescriptions | was briefly mentioned." 
| | https://en.m.wikipedia.org/wiki/JBIG2 | tlrobinson wrote: | One thing that worries me about generative AI is the | degradation of "truth" over time. AI will be the cheapest way | to generated content, by far. It will sometimes get facts | subtly wrong, and eventually that AI generated content will be | used to train future models. Rinse and repeat. | jacobr1 wrote: | The interesting thing is that is some ways this is a return | to pre-modern era of lossy information transmission between | the generations. Every story is re-molded by the re-teller. | Languages change and thus the contextual interpretations. | Even something a seemingly static as a book gets slowly | modified as scribes rewrite scrolls over centuries. | poszlem wrote: | We are getting closer and closer to a simulacrum and | hyperreality. | | We used to create things that were trying to simulate | (reproduce) reality, but now we are using those "simulations" | we'd created as if they were the real thing. With time we | will be getting farther away from the "truth" (as you put | it), and yes - I share your worry about that. | | https://en.wikipedia.org/wiki/Simulacrum | | EDIT: A good example I heard that explains what a simulacrum | is was this: Ask a random person to draw a photo of a princes | and see how many will draw a disney princess (which already | was based on real princesses) vs how many will draw one | looking like Catherine of Aragon or another real princess. | intrasight wrote: | art is truth | Xcelerate wrote: | So you've described humans. | _nalply wrote: | Currently computers can reliably do maths. Later AI will | unreliably do maths. Exactly like humans. | pishpash wrote: | So it will get stupider... maybe the singularity isn't | bad like too smart but bad like dealing with too many | stupid people. | ballenf wrote: | Maybe making (certain kinds of) math mistakes is a sign | of intelligence. | ciphol wrote: | The nice thing about math is that often it's much harder | to find a proof than to verify that proof. So math AI is | allowed to make lots of dumb mistakes, we just want it to | make the occasional real finding too. | MauranKilom wrote: | Unless we also ask AI to do the proof verification... | rowanG077 wrote: | Why would you do that? Proof verification is pretty much | a solved problem. | gpderetta wrote: | Both stupider and less deterministic, but also and | smarter and more flexible. Like humans. | tlrobinson wrote: | Fair point, though I feel there's a difference as AI can | generate content much more quickly. | jefftk wrote: | Similar to how we have low-background (pre-nuclear) steel, | might we have pre-transformer content? | Lorin wrote: | Jpeg bitrot 2.0 | blacksmith_tb wrote: | Certainly possible, though we also have many hundreds of | millions of people walking the globe taking pictures of | things with their phones (not all of which are public to be | used for training, but still). | fny wrote: | I've started seeing more of this crap show up on the front | page of Google. | sharemywin wrote: | Kind of like how chicken taste like everything. | robbomacrae wrote: | Yes indeed. I've been looking for an auto summarizer that | reliably doesn't change the content. So far everything I've | tried will make up or edit a key fact once in a while. | z3c0 wrote: | Anywhere that truth matters will be unaffected. If such | deviations from truth can withhold, then the truth never | mattered. False assumptions will never hold where they can't, | because reality is quite pervasive. 
Ask anyone who's had to | productionize an ML model in a setting that requires a foot | in reality. Even a single-digit drop in accuracy can have | resounding effects. | thaumasiotes wrote: | There was a scandal when it was discovered that Xerox machines | were doing this; in that case, the example showed "photocopies" | replacing numbers in documents with other numbers. | smitec wrote: | There is a talk about that issue [1]. | | During my PhD this issue came up amongst those in the group | looking into compressed sensing in MRI. Many reconstruction | methods (AI being a modern variant) work well because a best | guess is visually plausible. These kinds of methods fall | apart when visually plausible and "true" are different in a | meaningful way. The simplest examples here being the numbers | in scanned documents, or in the MRI case, areas of the brain | where "normal brain tissue" was on average more plausible | than "tumor". | | [1]: http://www.dkriesel.com/en/blog/2013/0802_xerox- | workcentres_... | nl wrote: | It's worth noting that these problems are things to be | aware of, not the complete showstoppers some people seem to | think that they are. | thaumasiotes wrote: | I'm having a hard time seeing where the random | substitution of all numbers isn't supposed to be a | complete showstopper. | nl wrote: | Well for example you train the VAE to reduce the | compression on characters. | kgwgk wrote: | The right amount of compression in a photocopy machine is | zero. | | Compression that gives you a blurred image is a trade- | off. | | But what does it mean to "be aware of" compression that | may give you a crisp image of some made up document? | nl wrote: | > The right amount of compression in a photocopy machine | is zero. | | This isn't an obvious statement to me. If you've had the | misfortune of scanning documents to PDF and getting the | 100MB per page files automatically emailed to you then | you might see the benefit in all that white space being | compressed somehow. | | > But what does it mean to "be aware of" compression that | may give you a crisp image of some made up document? | | This isn't something I said. A good compression system | for documents will not change characters in any | circumstances. | rjmunro wrote: | If you are making an image of a cityscape to illustrate | an article it probably doesn't matter what the city looks | like. But if the article is about the architecture of the | specific city, it probably does, so you need to 'be | aware' that the image you are showing people isn't | correct, and reduce the compression. | kgwgk wrote: | This subthread was about changing numbers in scanned | documents and vanishing tumors in medical images. | rowanG077 wrote: | An medical sensor filling in "plausible" information is | not a show stopper? I hope you are never in control of | making decisions like that. | nl wrote: | To be aware of when you are building compression systems. | | It's perfectly possible to build neural network based | compression systems that do not output false information. | lm28469 wrote: | > not the complete showstoppers some people seem to think | that they are. | | idk if I had to second guess every single result coming | out of a machine it would be a showstopper for me. This | isn't pokemon go, tumor detection is serious matter | pishpash wrote: | Why would you want to lossily compress any medical image | is beyond me. 
You get equipment to make precise high-resolution | measurements, it goes without saying that you do not want noise | added to that. | kybernetikos wrote: | Yeah, if it were actually adopted as a way to do compression, | it seems likely to lead to even worse problems than JBIG2 did | https://news.ycombinator.com/item?id=6156238 | | Invisibly changing the content rather than the image quality | seems like a really concerning failure mode for image | compression! | | I wonder if it'd be possible to use SD as part of a lossless | system - use SD as something that tells us the likelihood of | various pixel values given the rest of the image and combine | that likelihood with a Huffman encoding. Either way, fantastic | hack, but we really should avoid using anything lossy built on | AI for image compression. | pishpash wrote: | Give it "enough" bits and it won't be a problem. How many is | enough is the question. | eloisius wrote: | Imagine a world where bandwidth constraints meant | transmitting a hidden compressed representation that gets | expanded locally by smart TVs that have pretrained weights | baked into the OS. Everyone sees a slightly different | reconstitution of the same input video. Firmware updates that | push new weights to your TV result in stochastic changes to a | movie you've watched before. | jacobr1 wrote: | You could still use some kind of adaptive Huffman coding. | Current compression schemes have some kind of dictionary | embedded in the file to map between the common strings and | the compressed representation. Google tried proposing SDCH | a few years ago using a common dictionary for web pages. There | isn't any reason why we can't be a bit more deterministic | and share a much larger latent representation of "human | visual comprehension" or whatever to do the same. It | doesn't need to be stochastic once generated. | kybernetikos wrote: | "The weather forecast was correct as broadcast, sir, it's | just your smart TV thought it was more likely that the | weather in your region would be warm on that day, so it | adjusted the symbol and temperature accordingly" | ZiiS wrote: | It opens up an interesting question: is it suggesting | "improvements" that could be done in the real world? | RosanaAnaDana wrote: | Are you suggesting a lossy but 'correct' version? | | IE, the algorithm ignores and loses the 'irrelevant' | information, but holds the important stuff? | phkahler wrote: | This needs to be compared with automated tests. A lack of | visual artifacts doesn't mean an accurate representation of the | image in this case. | freediver wrote: | Arguably this is still fine with the definition of lossy | compression. The compressed image still roughly shows the idea | of the original image. | perryizgr8 wrote: | I believe ML techniques are the future of video/image | compression. When you read a well written novel, you can kind of | construct images of characters, locations and scenes in your | mind. You can even draw these scenes, and if you're a good | artist, those won't have any artifacts. | | I don't expect future codecs to be able to reduce a movie to a | simple text stream, but maybe it could do something in the same | vein. Store abstract descriptions instead of bitmaps. If the | encoding and decoding are good enough, your phone could | reconstruct an image that closely resembles what the camera | recorded. If your phone has to store a 50Gb model for that, it | doesn't seem too bad, especially if the movie file could be | measured in tens of megabytes.
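A small sketch of the lossless idea kybernetikos floats above: any model that assigns probabilities to the next symbol given the previous ones can drive an entropy coder, and the achievable size is the sum of -log2 of the probabilities it gives the true data. The uniform "model" below is a placeholder for whatever network supplies the predictions:

    import math

    def ideal_code_length(symbols, predict):
        # Bits an ideal entropy coder needs when predict(prefix) returns a
        # probability distribution over the next symbol
        bits = 0.0
        for i, s in enumerate(symbols):
            p = predict(symbols[:i])[s]   # probability assigned to the true symbol
            bits += -math.log2(p)         # better predictions -> fewer bits, nothing lost
        return bits

    uniform = lambda prefix: {b: 1 / 256 for b in range(256)}
    data = b"hello hello hello"
    print(ideal_code_length(data, uniform))   # 8 bits/byte; a real model would do far better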
| | Or it could go in another direction, where file sizes remain in | the gigabytes, but quality jumps to extremely crisp 8k that you | can zoom into or move the camera around if you want. | | Can't wait for this stuff! | UniverseHacker wrote: | From the title, I expected this to be basically pairing stable | diffusion with an image captioning algorithm by 'compressing' the | image to a simple human readable description, and then | regenerating a comparable image from the text. I imagine that | would work and be possible, essentially an autoencoder with a | 'latent space' of single short human readable sentences. | | The way this actually works is pretty impressive. I wonder if it | could be made lossless or less lossy in a similar manner to FLAC | and/or video compression algorithms... basically first do the | compression, and then add on a correction that converts the | result partially or completely into the true image. Essentially, | e.g. encoding real images of the most egregiously modified | regions of the photo and putting them back over the result. | Waterluvian wrote: | I wonder if this technique could be called something like | "abstraction" rather than "compression" given it will actually | change information rather than its quality. | | Ie. "There's a neighbourhood here" is more of an abstraction than | "here's this exact neighbourhood with the correct layout just | fuzzy or noisy." | seydor wrote: | like a MIDI file | Sohcahtoa82 wrote: | Well, a MIDI file says nothing about the sound a Trumpet | makes, whereas this SD-based abstraction does give a general | idea of what your neighborhood should look like. | | Maybe it's more like a MOD file? | rowanG077 wrote: | I would say any compression is abstraction in a certain sense. | A simple example is a gradient. A lossy compression might | abstract over the precise pixel values and simply record a | gradient that almost matches the raw input. You could even make | the argument that lossless compression is abstraction. A 2D | grid with 5px lines and 50px spacing between them could | feasibly be captured really well using a classical compression | scheme. | | What AI offers is just a more powerful and opaque way of doing | the same thing. | ipunchghosts wrote: | What does Johannes Ballé have to say about this? ___________________________________________________________________ (page generated 2022-09-20 23:00 UTC)