[HN Gopher] Stable Diffusion based image compression
       ___________________________________________________________________
        
       Stable Diffusion based image compression
        
       Author : nanidin
       Score  : 416 points
       Date   : 2022-09-20 03:58 UTC (19 hours ago)
        
 (HTM) web link (matthias-buehlmann.medium.com)
 (TXT) w3m dump (matthias-buehlmann.medium.com)
        
       | bjornsing wrote:
       | If it's a VAE then the latents should really be distributions,
       | usually represented as the mean and variance of a normal
       | distribution. If so then it should be possible to use the
       | variance to determine to what precision a particular latent needs
       | to be encoded. Could perhaps help increase the compression
       | further.
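       | 
       | A minimal numpy sketch of that idea (hypothetical bit allocation;
       | not code from the article):
       | 
       |     import numpy as np
       | 
       |     def bits_per_latent(logvar, min_bits=2, max_bits=8):
       |         # Low posterior variance -> encode precisely (more bits);
       |         # high variance -> the decoder is less sensitive, so
       |         # spend fewer bits there.
       |         sigma = np.exp(0.5 * logvar)
       |         score = (sigma.max() - sigma) / (np.ptp(sigma) + 1e-8)
       |         return np.round(min_bits + score * (max_bits - min_bits))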
        
         | nullc wrote:
         | Why aren't they scaled to have uniform variances?
        
       | euphetar wrote:
       | I am currently also playing around with this. The best part is
       | that for storage you don't need to store the reconstructed image,
       | just the latent representation and the VAE decoder (which can do
       | the reconstructing later). So you can store the image as
       | relatively few numbers in a database. In my experiment I was able
       | to compress a (512, 384, 3) RGB image to (48, 64, 4) floats. In
       | terms of memory it was an 8x reduction.
       | 
       | However, on some images the artefacts are terrible. It does not
       | work as a general-purpose lossy compressor unless you don't care
       | about details.
       | 
       | The main obstacle is compute. The model is quite large, but HDDs
       | are cheap. The real problem is that reconstruction requires a GPU
       | with lots of VRAM. Even with a GPU it takes about 15 seconds to
       | reconstruct an image in Google Colab. You could do it on a CPU,
       | but then it's extremely slow. This is only viable if compute
       | costs come down a lot.
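       | 
       | A rough sketch of that round trip, assuming the Hugging Face
       | diffusers AutoencoderKL API (checkpoint name is illustrative):
       | 
       |     import torch
       |     from diffusers import AutoencoderKL
       | 
       |     # Load only the VAE part of Stable Diffusion v1.4.
       |     vae = AutoencoderKL.from_pretrained(
       |         "CompVis/stable-diffusion-v1-4", subfolder="vae")
       | 
       |     def compress(img):
       |         # img: (1, 3, H, W) tensor scaled to [-1, 1]
       |         with torch.no_grad():
       |             return vae.encode(img).latent_dist.mean
       | 
       |     def decompress(latents):
       |         with torch.no_grad():
       |             return vae.decode(latents).sample  # (1, 3, H, W)
       | 
       | The latents are what you keep in the database; the VAE decoder
       | does the reconstructing later.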
        
       | holoduke wrote:
       | In the future you could have full 16K movies represented by
       | 1.44 MB seeds. A giant 500-petabyte trained model file could run
       | those movies. You could even generate your own movie by
       | uploading a book.
        
         | monokai_nl wrote:
         | Probably very unlikely, but sometimes I wonder if Jan Sloot did
         | something like this back in '95:
         | https://en.wikipedia.org/wiki/Sloot_Digital_Coding_System
        
       | aaroninsf wrote:
       | I would call this "confabulation" more than compression.
       | 
       | Its accuracy is proportional to and bounded by the training data;
       | I suspect in practice it has a specific strength (filling in
       | fungible detail) and, as discussed ITT, some specific failure
       | modes with fascinating and gnarly corners, which are going to
       | lead to bad outcomes.
       | 
       | At least with "lossy" CODECs of various kinds, even if you don't
       | notice what's missing until you do an A/B comparison, you can
       | perceive the difference when you make that comparison.
       | 
       | In this case the serious peril is that an A/B comparison is
       | [soon] going to just show difference. "What... is... the Real?"
       | 
       | When you contemplate that an ever-increasing proportion of the
       | training data itself stems from AI- or otherwise-enhanced
       | imagery,
       | 
       | our hold on the real has never felt weaker, and our vulnerability
       | to the rewriting of reality has never felt more present.
        
       | bane wrote:
       | The basic premise of these kinds of compression algorithms is
       | actually pretty clever. Here's a very _very_ trivialized version
       | of this style of approach:
       | 
       | 1. both the compressor and decompressor contain knowledge beyond
       | the algorithm used to compress/decompress some data
       | 
       | 2. in this case the knowledge might be "all the images in the
       | world"
       | 
       | 3. when presented with an image, the compressor simply looks up
       | some index or identifier of the image
       | 
       | 4. the identifier is passed around as the "compressed image"
       | 
       | 5. "decompression" means looking up the identifier and retrieving
       | the image
       | 
       | I've heard this called "compression via database" before, and it
       | can give the appearance of defeating Shannon's theorem for
       | compression even though it doesn't do that at all.
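       | 
       | A toy Python sketch of that scheme (purely illustrative; the
       | "database" is just a dict that both ends already hold):
       | 
       |     # Shared knowledge: both ends hold the same table of images.
       |     DATABASE = {0: b"<bytes of image A>", 1: b"<bytes of image B>"}
       |     LOOKUP = {img: idx for idx, img in DATABASE.items()}
       | 
       |     def compress(image_bytes):
       |         # The "compressed image" is just an index into the table.
       |         return LOOKUP[image_bytes]
       | 
       |     def decompress(index):
       |         # "Decompression" is a lookup.
       |         return DATABASE[index]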
       | 
       | Of course the author's idea is significantly more sophisticated
       | than the approach above, and trades a lossy approach for some
       | gains in storage and retrieval efficiency (we don't have to have
       | a copy of all of the pictures in the world in both the compressor
       | and the decompressor). The evaluation note of not using any known
       | image for the tests further challenges the approach and helps
       | suss out where there are specific challenges, like poor
       | reconstruction of specific image constructs such as faces or text
       | -- I suspect that there are many other issues like these but the
       | author homed in on these because we (as literate humans) are
       | particularly sensitive to them.
       | 
       | In these types of lossy compression approaches (as opposed to the
       | above which is lossless) the basic approach is:
       | 
       | 1. Throw away data until you get to the desired file size. You
       | usually want to come up with some clever scheme to decide what
       | data you toss out. Alternatively, just hash the input data using
       | some hash function that produces just the right number of bits
       | you want, but use a scheme that results in a hash digest that can
       | act as a (non-unique) index to the original image in a table of
       | every image in the world.
       | 
       | 2. For images it's usually easy to eliminate pixels (resolution)
       | and color (bit-depth, channels, etc.). In this specific case, the
       | author uses a variational autoencoder to "choose" what gets
       | tossed. I suspect the autoencoder is very good at preserving the
       | information-rich, high-entropy slices of the latent space or
       | something. At any rate, this produces something that to us sorta
       | kinda looks like a very low resolution, poorly colored postage
       | stamp of the original image, but actually contains more data than
       | that. I think at this point it can just be considered the hash
       | digest.
       | 
       | 3. this hash digest, or VAE encoded image or whatever we want to
       | call it, is what's passed around as the "compressed" data.
       | 
       | 4. just like above, "decompression" means effectively looking up
       | the value in a "database". If we are working with hash digests,
       | there was probably a collision during the construction of the
       | database of all images, so we lost some information. In this case
       | we're dealing with stable diffusion and instead of a simple
       | index->table entry, our "compressed" VAE image wraps through some
       | hyperspace to find the nearest preserved data. Since the VAE
       | "pixels" probably align close to data dense areas of the space
       | you tend to get back data that closely represents the original
       | image. It's still a database lookup in that sense, but it's
       | looking more for "similar" rather than "exact matches" which when
       | used to rebuild the image give a good approximation of the
       | original.
       | 
       | Because it's an "approximation" it's "lossy". In fact I think
       | it'd be more accurate to say it's "generally lossy", as there is
       | a chance the original image can be reproduced _exactly_,
       | especially if it's in the original training data. Which is why
       | the author was careful not to use anything from that set.
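       | 
       | A numpy sketch of that "similar rather than exact" lookup (a toy
       | stand-in; the real latent space is not a flat table):
       | 
       |     import numpy as np
       | 
       |     # Pretend each row is a "remembered" image embedded as a vector.
       |     MEMORY = np.random.rand(10_000, 16).astype(np.float32)
       | 
       |     def lossy_decompress(code):
       |         # Snap the transmitted code to the closest remembered
       |         # vector: similar, not exact, hence "generally lossy".
       |         dists = np.linalg.norm(MEMORY - code, axis=1)
       |         return MEMORY[np.argmin(dists)]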
       | 
       | Because we've stored so much information in the compressor and
       | decompressor, it can also give the appearance of defeating
       | Shannon entropy for compression, except it's also not, because:
       | 
       | a) it's generally lossy
       | 
       | b) just like the original example above we're cheating by simply
       | storing lots of information elsewhere
       | 
       | There's probably some deep mathematical relationship between the
       | author's approach and compressive sensing.
       | 
       | Still, it's useful, and has the possibility of improving data
       | transmission speeds at the cost of storing lots of local data at
       | both ends.
       | 
       | Source: Many years ago before deep learning was even a "thing", I
       | worked briefly on some compression algorithms in an effort to
       | reduce data transfer issues in telecom poor regions. One of our
       | approaches was not too dissimilar to this -- throw away a bunch
       | of the original data in a structured way and use a smart
       | algorithm and some stored heuristics in the decompressor to guess
       | what we threw away. Our scheme had the benefit of almost
       | absolutely trivial "compression" with the downside of massive
       | computational needs on the "decompression" side, but had lots of
       | nice performance guarantees which you could use to design the
       | data transport stuff around.
       | 
       | *edit* sorry if this explanation is confusing, it's been a while
       | and it's also very late where I am. I just found this post really
       | fun.
        
         | nl wrote:
         | For people interested in more about this, it's probably worth
         | reading the Hutter Prize FAQ: http://prize.hutter1.net/hfaq.htm
        
       | tomxor wrote:
       | Doesn't decompression require the entire Stable Diffusion model?
       | (and the exact same model at that)
       | 
       | This could be interesting but I'm wondering if the compression
       | size is more a result of the benefit of what is essentially a
       | massive offline dictionary built into the decoder vs some
       | intrinsic benefit to processing the image in latent space based
       | on the information in the image alone.
       | 
       | That said... I suppose it's actually quite hard to implement a
       | "standard image dictionary" and this could be a good way to do
       | that.
        
         | operator-name wrote:
         | The latent space _is_ the massive offline dictionary, and the
         | benefit is not having to hand-craft the massive offline
         | dictionary?
        
           | tomxor wrote:
           | For those of us unfamiliar... roughly how large is that in
           | terms of bytes?
        
         | tantalor wrote:
         | I thought that's what "some important caveats" was going to be,
         | but no, article didn't mention this.
        
         | thehappypm wrote:
         | Haha. Here's a faster compression model. Make a database of
         | every image ever made. Compute a thumbprint and use that as the
         | index of the database. Boom!
        
           | Sohcahtoa82 wrote:
           | A quick Google says there are 10^72 to 10^82 atoms in the
           | universe.
           | 
           | Assuming 24-bit color, if you could store an entire image in
           | a single atom, you could only cover every possible image up
           | to about 11 pixels (2^264, roughly 10^79) before running out
           | of atoms.
        
             | thehappypm wrote:
             | Not every possible image has been produced!
        
               | Sohcahtoa82 wrote:
               | I'll get started, then!
        
       | Xcelerate wrote:
       | Great idea to use Stable Diffusion for image compression. There
       | are deep links between machine learning and data compression
       | (which I'm sure the author is aware of).
       | 
       | If you could compute the true conditional Kolmogorov complexity
       | of an image or video file given all visual online media as the
       | prior, I imagine you would obtain mind-blowing compression
       | ratios.
       | 
       | People complain of the biased artifacts that appear when using
       | neural networks for compression, but I'm not concerned in the
       | long term. The ability to extract algorithmic redundancy from
       | images using neural networks is obviously on its way to
       | outclassing manually crafted approaches, and it's just a matter
       | of time before we are able to tack on a debiasing step to the
       | process (such that the distribution of error between the
       | reconstructed image and the ground truth has certain nice
       | properties).
        
       | aaaaaaaaaaab wrote:
       | Save around a kilobyte with a decompressor that's ~5Gbyte.
        
       | egypturnash wrote:
       | _To evaluate this experimental compression codec, I didn't use
       | any of the standard test images or images found online in order
       | to ensure that I'm not testing it on any data that might have
       | been used in the training set of the Stable Diffusion model
       | (because such images might get an unfair compression advantage,
       | since part of their data might already be encoded in the trained
       | model)._
       | 
       | I think it would be _very interesting_ to determine if these
       | images _do_ come back with notably better compression.
        
         | bane wrote:
         | Given the approach, they'll probably come back with better
         | reconstruction/decompression too.
        
           | pishpash wrote:
           | Not clear. Fully memorizing the training images would not be
           | a feasible property of a good autoencoder.
        
       | Dwedit wrote:
       | On another note, you can also downscale an image, save it as a
       | JPEG or whatever, then upscale it back using AI upscaling.
        
       | madsbuch wrote:
       | It is really interesting to talk about semantic lossy
       | compression, which is probably what we get here.
       | 
       | Where recreating with traditional codecs introduces syntactic
       | noise, this will introduce semantic noise.
       | 
       | Imagine seeing a high-res, seemingly perfect picture, right up
       | until you see the source image and discover that it was
       | reinterpreted.
       | 
       | It is also going to be interesting to see if this method will be
       | chosen for specific pictures, e.g. pictures of famous objects (or
       | people, when/if the issues around that resolve), while for novel
       | things we need to use "syntactic" compression.
        
       | lastdong wrote:
       | Extraordinary! Is it going to be called Pied Piper?
        
       | mjan22640 wrote:
       | What they do is essentially fractal compression with an external
       | library of patterns (that was IIRC patented, but the patent
       | should be long expired).
        
         | pishpash wrote:
         | This does remind me of fractal compression [1] from the '90s,
         | which never took off for various reasons that will be relevant
         | here as well.
         | 
         | [1] https://en.wikipedia.org/wiki/Fractal_compression
        
       | eru wrote:
       | Compare compressed sensing's single pixel camera:
       | https://news.mit.edu/2017/faster-single-pixel-camera-lensles...
        
       | fritzo wrote:
       | I'd love to see a series of increasingly compressed images, say
       | 8kb -> 4kb -> 2kb -> ... -> 2bits -> 1bit. This would be a great
       | way to demonstrate the increasing fictionalization of the
       | method's recall.
        
       | minimaxir wrote:
       | For text, GPT-2 was used in a similar demo a year ago albeit said
       | demo is now defunct:
       | https://news.ycombinator.com/item?id=23618465
        
       | DrNosferatu wrote:
       | Nice work!
       | 
       | However, a cautionary tale on AI medical image "denoising":
       | 
       | (and beyond, in science)
       | 
       | - See the artifacts?
       | 
       | The algorithm plugs stuff it has seen before / was trained on
       | into ambiguous areas of the image. So, if such a system were used
       | to "denoise" (or compress, which - if you think about it - is
       | basically the same operation) CT scans, X-rays, MRIs, etc., in
       | ambiguous areas it could plug in diseased tissue where the ground
       | truth was actually healthy.
       | 
       | Or the opposite, which is even worse: substitute diseased areas
       | of the scan with healthy-looking imagery it had been trained on.
       | 
       | Reading recent publications that try to do "denoising" or
       | resolution "enhancement" in medical imaging contexts, the authors
       | seem to be completely oblivious to this pitfall.
       | 
       | (maybe they had a background as World Bank / IMF economists?)
        
         | ska wrote:
         | People have been publishing fairly useless papers "for" medical
         | imaging enhancement/improvement for 3+ decades now. NB this is
         | not universal (there are some good ones) and _not_ limited to
         | AI techniques, although essentially every AI technique that
         | comes along gets applied to compression
         | /denoising/"superres"/etc. if it can, eventually.
         | 
         | The main problem is that typical imaging researchers are too
         | far from actual clinical applications, and often trying to
         | solve the wrong problems. It's a structural problem with
         | academic and clinical incentives, as much as anything else.
        
         | fny wrote:
         | There is nothing in the article suggesting this should be used
         | for medical imaging.
        
         | gregw134 wrote:
         | Fun to imagine this could show up in future court cases. Is the
         | picture true, or were details changed by the ai compression
         | algorithm?
        
         | petesergeant wrote:
         | From the article:
         | 
         | > a bit of a danger of this method: One must not be fooled by
         | the quality of the reconstructed features -- the content may be
         | affected by compression artifacts, even if it looks very clear
         | 
         | ... plus an excellent image showing the algorithm straight
         | making stuff up, so I suspect the author is aware.
        
         | anarticle wrote:
         | In my experience, medical imaging at the diagnostic tier uses
         | only lossless compression (JPEG 2000 et al.). It was explicitly
         | stated in our SOPs/policies that we had to have a lossless
         | setup.
         | 
         | Very sketchy to use super resolution for diagnostics. In
         | research (fluorescence), sure.
         | 
         | ref: my direct experience of pathology slide scanning machines
         | and their setup.
        
         | adammarples wrote:
         | Mentioned in TFA at least twice
        
         | Der_Einzige wrote:
         | Sounds like you need lossless compression.
         | 
         | I was told that the GPT-2 text compression variant was a
         | lossless compressor (https://bellard.org/libnc/gpt2tc.html), so
         | why is Stable Diffusion lossy?
        
           | operator-name wrote:
           | Probably something to do with the variational auto encoder,
           | which is lossy.
        
         | theemathas wrote:
         | Here's a similar case of a scanner using a traditional
         | compression algorithm. It has a bug in the compression
         | algorithm, which made it replace a number in the scanned image
         | with a different number.
         | 
         | https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
        
           | sgc wrote:
           | That is completely outside all my expectations prior to
           | reading it. The consequences are potentially life and death,
           | or incarceration, etc, and yet they did nothing until called
           | out and basically forced to act.
           | 
           | A good reminder that the bug can be anywhere, and when things
           | stop working we often need to get very dumb, and just
           | methodically troubleshoot.
        
             | function_seven wrote:
             | We programmers tend to think our abstractions match reality
             | somehow. Or that they don't leak. Or even if they _do_
             | leak, that leakage won't spill down several layers of
             | abstraction.
             | 
             | I used to install T1 lines a long time ago. One day we had
             | a customer that complained that their T1 was dropping every
             | afternoon. We ran tests on the line for extended periods of
             | time trying to troubleshoot the problem. Every test passed.
             | Not a single bit error no matter what test pattern we used.
             | 
             | We monitored it while they used it and saw not a single
             | error, except for when the line completely dropped. We
             | replaced the NIU card, no change.
             | 
             | Customer then hit us with, "it looks like it only happens
             | when Jim VNCs to our remote server".
             | 
             | Obviously a userland program (VNC) could not possibly cause
             | our NIU to reboot, right?? It's several layers "up the
             | stack" from the physical equipment sending the DS1 signal
             | over the copper.
             | 
             | But that's what it was. We reliably triggered the issue by
             | running VNC on their network. We ended up changing the NIU
             | and corresponding CO card to a different manufacturer (from
             | Adtran to Soneplex I think?) to fix the issue. I wish I had
             | had time to really dig into that one, because obviously
             | other customers used VNC with no issues. Adtran was our
             | typical setup. Nothing else was weird about this standard
             | T1 install. But somehow the combination of our equipment,
             | their networking gear, and that program on that workstation
             | caused the local loop equipment to lose its mind.
             | 
             | This number-swapping story hit me the same way. We would
             | all expect a compression bug to manifest as blurry text, or
             | weird artifacts. We would never suspect a clean
             | substitution of a meaningful symbol in what is "just a
             | raster image".
        
               | jend wrote:
               | Reminds me of this story:
               | http://blog.krisk.org/2013/02/packets-of-death.html
               | 
               | tldr: Specific packet content triggers a bug in the
               | firmware of an Intel network card and bricks it until
               | powered off.
        
           | SV_BubbleTime wrote:
           | Iirc, this was an issue or conspiracy fuel or whatever with
           | the birth certificate that Obama released. That some of the
           | unique elements in the scan repeated over and over.
        
           | caycep wrote:
           | I assume something like jpeg (used in the DICOM standard
           | today) has more eyes on the code than proprietary Xerox
           | stuff? hopefully at least...
           | 
           | I have seen weird artifacts on MRI scans, specifically the
           | FLAIR image enhancement algorithm used on T2 images, i.e.
           | white spots, which could in theory be interpreted by a
           | radiologist as small strokes or MS... so I always take what I
           | see with a grain of salt.
        
             | ska wrote:
             | The DICOM standard stuff did have a lot of eyes on it, and
             | was tuned toward fidelity which helps. It's not perfect,
             | but what is.
             | 
             | MRI artifacts though are a whole can of worms, but
             | fundamentally most of them come from a combination of the
             | EM physics involved, and the reconstruction algorithm
             | needed to produce an image from the frequency data.
             | 
             | I'm not sure what you mean by "image enhancement
             | algorithm"; FLAIR is a pulse sequence used to suppress
             | certain fluid signals, typically used in spine and brain.
             | 
             | Many of the bright spots you see in FLAIR are due to B1
             | inhomogeneity, iirc (it's been a while though)
        
             | ska wrote:
             | Probably worth mentioning also that "used in DICOM
             | standard" is true but possibly misleading to someone
             | unfamiliar with it.
             | 
             | DICOM is a vast standard. In its many crevices, it contains
             | wire and file encoding schemas, some of which include (many
             | different) image data types, some of which allow (multiple)
             | compression schemes, both lossy and lossless, as well as
             | metadata schemes. These include JPEG, JPEG-LS, JPEG 2000,
             | MPEG-2/4, and HEVC.
             | 
             | I think you have to encode the compression ratio as well,
             | if you do lossy compression. You definitely have to note
             | that you did lossy compression.
        
       | [deleted]
        
       | vjeux wrote:
       | How long does it take to compress and decompress an image that
       | way?
        
       | fzzt wrote:
       | The prospect of the images getting "structurally" garbled in
       | unpredictable ways would probably limit real-world applications:
       | https://miro.medium.com/max/4800/1*RCG7lcPNGAUnpkeSsYGGbg.pn...
       | 
       | There's something to be said about compression algorithms being
       | predictable, deterministic, and only capable of introducing
       | defects that stand out as compression artifacts.
       | 
       | Plus, decoding performance and power consumption matter,
       | especially on mobile devices (which also happen to be the setting
       | where bandwidth gains are most meaningful).
        
         | kevincox wrote:
         | While that is kind of true it is also sort of the point.
         | 
         | The optimal lossy compression algorithm would be based on
         | humans as a target. It would remove details that we wouldn't
         | notice in order to reduce the target size. If you show me a
         | photo of a face in front of some grass, the optimal solution
         | would likely be to reproduce that face in high detail but
         | replace the grass with "stock imagery".
         | 
         | I guess it comes down to what is important. In the past
         | algorithms were focused on visual perception, but maybe we are
         | getting so good at convincingly removing unnecessary detail
         | that we need to spend more time teaching the compressor what
         | details are important. For example, if I know the person in the
         | grass, preserving their face is important. If I don't know them
         | then it could be replaced by a stock face as well. Maybe the
         | optimal compression of a crowd of people is the 2 faces of
         | people I know preserved accurately and the rest replaced with
         | "stock" faces.
        
         | anilakar wrote:
         | Remember the Xerox scan-to-email scandal in which tiling
         | compression was replacing numbers in structural drawings? We're
         | talking about similar repercussions here.
        
         | behnamoh wrote:
         | This reminds me of a question I have about SD: why can't it do
         | a simple OCR to know those are characters not random shapes?
         | It's baffling that neither SD nor DE2 have any understanding of
         | the content they produce.
        
           | nl wrote:
           | > why can't it do a simple OCR to know those are characters
           | not random shapes?
           | 
           | It's pretty easy to add this if you wanted to.
           | 
           | But a better method would be to fine tune on a bunch of
           | machine-generated images of words if you want your model to
           | be good at generating characters. You'll need to consider
           | which of the many Unicode character sets you want your model
           | to specialize in though.
        
           | Xcelerate wrote:
           | You could certainly apply a "duct tape" solution like that,
           | but the issue is that neural networks were developed to
           | replace what were previously entire solutions built on a
           | "duct tape" collection of rule-based approaches (see the
           | early attempts at image recognition). So it would be nice to
           | solve the problem in a more general way.
        
         | montebicyclelo wrote:
         | Just a note that Stable Diffusion is/can be deterministic (if
         | you set an RNG seed).
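         | 
         | For example (assuming the Hugging Face diffusers
         | StableDiffusionPipeline; names are illustrative), passing an
         | explicit generator pins the sampling noise:
         | 
         |     import torch
         |     from diffusers import StableDiffusionPipeline
         | 
         |     pipe = StableDiffusionPipeline.from_pretrained(
         |         "CompVis/stable-diffusion-v1-4").to("cuda")
         |     gen = torch.Generator("cuda").manual_seed(42)  # fixed seed
         |     image = pipe("a photo of a cat", generator=gen).images[0]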
        
           | shrx wrote:
           | I was told (on the Unstable Diffusion discord, so this info
           | might not be reliable) that even with using the same seed the
           | results will differ if the model is running on a different
           | GPU. This was also my experience when I couldn't reproduce
           | the results generated by the discord's SD txt2img generating
           | bot.
        
             | nl wrote:
             | It absolutely should be reproducible, and in my experience
             | it is.
             | 
             | I do tend to use the HuggingFace version though.
        
             | montebicyclelo wrote:
             | I'm not sure about the different GPU issue. But if that is
             | an issue, the model can be made deterministic (probably
             | compromising inference speed), by making sure the
             | calculations are computed deterministically.
        
         | cma wrote:
         | With compression you often make a prediction then delta off of
         | it. A structurally garbled one could be discarded or just
         | result in a worse baseline for the delta.
        
       | bscphil wrote:
       | A few thoughts that aren't related to each other.
       | 
       | 1. This is a brilliant hack. Kudos.
       | 
       | 2. It would be great to see the best codecs included in the
       | comparison - AVIF and JPEG XL. Without those it's rather
       | incomplete. No surprise that JPEG and WEBP totally fall apart at
       | that bitrate.
       | 
       | 3. A significant limitation of the approach seems to be that it
       | targets extremely low bitrates where other codecs fall apart, but
       | at these bitrates it incurs problems of its own (artifacts take
       | the form of meaningful changes to the source image instead of
       | blur or blocking, very high computational complexity for the
       | decoder).
       | 
       | When only moderate compression is needed, codecs like JPEG XL
       | already achieve very good results. This proof of concept focuses
       | on the extreme case, but I wonder what would happen if you
       | targeted much higher bitrates, say 5x higher than used here. I
       | suspect (but have no evidence) that JPEG XL would improve in
       | fidelity _faster_ as you gave it more bits than this SD-based
       | technique. _Transparent_ compression, where the eye can't tell a
       | visual difference between source and transcode (at least without
       | zooming in) is the optimal case for JPEG XL. I wonder what sort
       | of bitrate you'd need to provide that kind of guarantee with this
       | technique.
        
         | leeoniya wrote:
         | also thought it was odd that AVIF was not compared - it would
         | show a major quality and size improvement over WebP.
        
         | [deleted]
        
         | goombacloud wrote:
         | The comparison doesn't make much sense because for fair
         | comparisons you have to measure decompressor size plus encoded
         | image size. The decompressor here is super huge because it
         | includes the whole AI model. Also, everyone needs to have the
         | exact same copy of the model in the decompressor for it to work
         | reliably.
        
           | wongarsu wrote:
           | Only if decompressor and image are transmitted over the same
           | channel at the same time, and you only have a small number of
           | images. When compressing images for the web I don't care if a
           | webp decompressor is smaller than a jpg or png decompressor,
           | because the recipient already has all of those.
           | 
           | Of course stable diffusion's 4GB is much more extreme than
           | Brotli's 120kb dictionary size, and would bloat a Browser's
           | install size substantially. But for someone like Instagram or
           | a Camera maker it could still make sense. Or imagine phones
           | having the dictionary shipped in the OS to save just a couple
           | kB on bad data connections.
        
             | operator-name wrote:
             | Even if dictionaries were shipped, the biggest difficulty
             | would be performance and resources. Most of these models
             | require beefy compute and a large amount of VRAM that isn't
             | likely to ever exist on end devices.
             | 
             | Unless that can be resolved it just doesn't make sense to
             | use it as a (de)compressor.
        
       | a-dub wrote:
       | hm. would be interesting to see if any of the perceptual image
       | compression quality metrics could be inserted into the vae step
       | to improve quality and performance...
        
       | fjkdlsjflkds wrote:
       | This is not really "stable-diffusion based image compression",
       | since it only uses the VAE part of "stable diffusion", and not
       | the denoising UNet.
       | 
       | Technically, this is simply "VAE-based image compression" (that
       | uses stable diffusion v1.4's pretrained variational autoencoder)
       | that takes the VAE representations and quantizes them.
       | 
       | (Note: not saying this is not interesting or useful; just that
       | it's not what it says on the label)
       | 
       | Using the "denoising UNet" would make the method more
       | computationally expensive, but probably even better (e.g., you
       | can quantize the internal VAE representations more aggressively,
       | since the denoising step might be able to recover the original
       | data anyway).
        
         | gliptic wrote:
         | It is using the UNet, though.
        
         | nl wrote:
         | It does use the UNet to denoise the VAE compressed image:
         | 
         | "The dithering of the palettized latents has introduced noise,
         | which distorts the decoded result. But since Stable Diffusion
         | is based on de-noising of latents, we can use the U-Net to
         | remove the noise introduced by the dithering."
         | 
         | The included Colab doesn't have line numbers, but you can see
         | the code doing it:
         | 
         |     # Use Stable Diffusion U-Net to de-noise the dithered latents
         |     latents = denoise(latents)
         |     denoised_img = to_img(latents)
         |     display(denoised_img)
         |     del latents
         |     print('VAE decoding of de-noised dithered 8-bit latents')
         |     print('size: {}b = {}kB'.format(sd_bytes, sd_bytes/1024.0))
         |     print_metrics(gt_img, denoised_img)
        
           | fjkdlsjflkds wrote:
           | I stand corrected, then :) cheers.
        
       | zcw100 wrote:
       | You can do lossless neural compression too.
        
       | fho wrote:
       | > Quantizing the latents from floating point to 8-bit unsigned
       | integers by scaling, clamping and then remapping them results in
       | only very little visible reconstruction error.
       | 
       | This might actually be interesting/important for the OpenVINO
       | adaptation of SD ... from what I gathered from the OpenVINO
       | documentation, quantizing is actually a big part of optimizing as
       | this allows the usage of Intels new(-ish) NN instruction sets.
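       | 
       | That quantization step is simple enough to sketch (a minimal
       | numpy version; the clamping range is an assumption, not the
       | article's exact values):
       | 
       |     import numpy as np
       | 
       |     def quantize_latents(latents, lo=-5.0, hi=5.0):
       |         # Clamp to an assumed working range, remap to 0..255.
       |         clamped = np.clip(latents, lo, hi)
       |         q = np.round((clamped - lo) / (hi - lo) * 255)
       |         return q.astype(np.uint8)
       | 
       |     def dequantize_latents(q, lo=-5.0, hi=5.0):
       |         return q.astype(np.float32) / 255 * (hi - lo) + lo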
        
       | stavros wrote:
       | Didn't I do this last week?
        
       | dwohnitmok wrote:
       | Indeed one way of looking at intelligence is that it is a method
       | of compressing the external universe.
       | 
       | See e.g. the Hutter Prize.
        
         | mjan22640 wrote:
         | The feeling of understanding is essentially a decompression
         | result being successfully pattern matched.
        
         | dan_mctree wrote:
         | Our sight is light detection compressed into human thought
         | 
         | Written language is human thought compressed into words
         | 
         | Digital images are light detection compressed into bits
         | 
         | Text-to-image AIs compress digital images into written language
         | 
         | Then how do the AI weights relate to human thought?
        
       | Jack000 wrote:
       | The VAE used in Stable Diffusion is not ideal for compression. I
       | think it would be better to use the vector-quantized variant (by
       | the same authors of latent diffusion) instead of the KL variant,
       | then store the indexes for each quantized vector using standard
       | entropy coding algorithms.
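       | 
       | Roughly like this (a toy numpy/zlib sketch; zlib stands in for a
       | proper entropy coder and the codebook here is random):
       | 
       |     import numpy as np, zlib
       | 
       |     def to_indices(latents, codebook):
       |         # latents: (N, D) vectors; codebook: (K, D) learned VQ
       |         # entries. Each vector becomes its nearest entry's index.
       |         d = np.linalg.norm(
       |             latents[:, None, :] - codebook[None, :, :], axis=-1)
       |         return d.argmin(axis=1).astype(np.uint16)
       | 
       |     idx = to_indices(np.random.rand(4096, 4),
       |                      np.random.rand(512, 4))
       |     payload = zlib.compress(idx.tobytes())  # "compressed image"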
       | 
       | From the paper, the VQ variant also performs better overall; SD
       | may have chosen the KL variant only to lower VRAM use.
        
         | GaggiX wrote:
         | KL models perform better than VQ models, as you can see in the
         | latent diffusion repo by CompVis.
        
           | Jack000 wrote:
           | just checked the paper again and yes you're right, the KL
           | version is better on the openimages dataset. The VQ version
           | is better in the inpainting comparison.
           | 
           | In this case you'd still want to use the VQ version though,
           | it doesn't make sense to do an 8bit quantization on the KL
           | vectors when there's an existing quantization learned through
           | training.
        
       | akvadrako wrote:
       | I would like to see this with much smaller file sizes - like 100
       | bytes. How well can SD preserve the core subjects or meaning of
       | the photos?
        
         | pishpash wrote:
         | You can already "compress" them down to a few words, so you
         | have your answer there.
        
       | fla wrote:
       | Is there a general name for this kind of latent space round-trip
       | compression ? If not, I think a good name could be "interpretive
       | compression"
        
       | pyinstallwoes wrote:
       | This relates to a strong hunch that consciousness is tightly
       | coupled to whatever compression is as an irreducible entity.
       | 
       | Memory <> Compression <> Language <> Signal Strength <> Harmonics
       | and Ratios
        
         | mjan22640 wrote:
         | Consciousness is IMHO being aware of being aware. The mystic
         | specialty of it is IMHO a mental illusion, like the Penrose
         | stairs optical illusion.
        
         | eru wrote:
         | I see the relation between compression and consciousness. But
         | what do you mean by irreducible entity, and how does it relate
         | to the two?
        
           | pyinstallwoes wrote:
           | By irreducible entity, as the yet undefined entity that sits
           | at the nexus of mathematics, philosophy, computation, logic
           | (consciousness).
           | 
           | It's not a well-defined ontology yet. So whatever it is, at
           | its irreducible size, pinpointing it as the thing which gives
           | rise to such other things.
        
             | eru wrote:
             | What kind of reductions would be disallowed?
        
           | nl wrote:
           | I don't understand much of what the OP is saying.
           | 
           | But I do like the Stephen Wolfram idea of consciousness being
           | the way a computationally bounded observer develops a
           | coherent view of a branching universe.
           | 
           | This is related to compression because it is a (lossy!)
           | reduction in information.
           | 
           | I understand that Wolfram is controversial, but the
           | information-transmission-centric view of reality he works
           | with makes a lot of intuitive sense to me.
           | 
           | https://writings.stephenwolfram.com/2021/03/what-is-
           | consciou...
        
       | jwr wrote:
       | While this is great as an experiment, before you jump into
       | practical applications, it is worth remembering that the
       | decompressor is roughly 5GB in size :-)
        
       | red75prime wrote:
       | It reminded me of a scene from "A Fire Upon the Deep" where
       | connection bitrate is abysmal, but the video is crisp and
       | realistic. It is used as a tool for deception, as it happens.
       | Invisible information loss has its costs.
        
       | Dwedit wrote:
       | This is why for compression tests, they incorporate the size of
       | everything needed to decompress the file. You can compress down
       | to 4.97KB all you want, just include the 4GB trained model.
        
         | janekm wrote:
         | Is that true? I have never seen this done for any image
         | compression comparisons that I have seen (i.e. only data that
         | is specific to the image that is being compressed is included,
         | not standard tables that are always used by the algorithm like
         | the quantisation tables used in JPG compression)
        
           | jerf wrote:
           | Yes, it is done all the time.
           | 
           | However, several people here are conflating "best compression
           | as determined for a competition" and "best compression for
           | use in the real world". There is an important relationship
           | between them, absolutely, but in the real world we do not
           | download custom decoders for every bit of compressed content.
           | Just because there is a competition that quite correctly
           | measures the entire size of the decompressor and encoded
           | content does not mean that is now the only valid metric to
           | measure decompression performance. The competitions use that
           | metric for good and valid reasons, but those good and valid
           | reasons are only vaguely correlated to the issues faced in
           | the normal world.
           | 
           | (Among the reasons why competitions must include the size of
           | the decoder is that without that the answer is trivial; I
           | define all your test inputs as a simple enumeration of them
           | and my decoder hard-codes the output as the test values. This
           | is trivially the optimal algorithm, making competition
           | useless. If you could have a real-world encoder that worked
           | this well, and had the storage to implement it, it would be
           | optimal, but you can't possibly store all possible messages.
           | For a humorous demonstration of this encoding method, see the
           | classic joke: https://onemansblog.com/2010/05/18/prison-joke/
           | )
        
           | fsiefken wrote:
           | For text compression benchmarks it's done
           | http://mattmahoney.net/dc/text.html
           | 
           | Matt doesn't do this on the Silesia corpus compression
           | benchmark, even though it would make sense there as well:
           | http://mattmahoney.net/dc/silesia.html
           | 
           | So a compressor of a few gigabytes would make sense if you
           | have a set of pictures of more than a few gigabytes. It's a
           | bit similar to preprocessing text compression with a
           | dictionary and adding the dictionary to the extractor to
           | squeeze out a few more bytes.
        
             | goombacloud wrote:
             | By the way, the leading nncp in the LTCB (text.html) "is a
             | free, experimental file compressor by Fabrice Bellard,
             | released May 8, 2019" :)
        
         | Gigachad wrote:
         | Do you also include the library to render a jpeg? And maybe the
         | whole OS required to display it on your screen?
         | 
         | There are very many uses where any fixed overhead is
         | meaningless. Imagine archiving billions of images for long term
         | storage. The 4GB model quickly becomes meaningless.
        
           | stavros wrote:
           | > Do you also include the library to render a jpeg? And maybe
           | the whole OS required to display it on your screen?
           | 
           | No, what does that have to do with reconstructing the
           | original data?
           | 
           | If the fixed overhead works for you, that's fine, but
           | including it is not meaningless.
        
           | 112233 wrote:
           | Fixed overheads are never meaningless. O(n^2) algorithm that
           | processes your data in 5s is faster on your data than O(log
           | n) that takes 20 hours.
           | 
           | Long term storage of billions of images is meaningless, if it
           | takes billions of years to archive these images.
        
             | Gigachad wrote:
             | It's a one time cost rather than per image. You need the
             | 4GB model only once and then you can uncompress unlimited
             | images.
        
               | 112233 wrote:
               | Yes, but each image needs access to this 4GB (actually, I
               | have no idea how much RAM it takes up), plus whatever the
               | working set size is. It is a non-trivial overhead that
               | really limits the throughput of your system, so you can
               | process fewer images in parallel, so compressing a billion
               | images in a reasonable time may suddenly cost much more
               | than the amount of storage it would save, compared to
               | other methods.
        
       | quickthrower2 wrote:
       | If this were used in the wild, do you need a copy of the model
       | locally to decompress the images?
        
         | coffee_beqn wrote:
         | And how much compute time/power does "decompressing" take
         | compared to a jpg?
        
         | mcbuilder wrote:
         | Yes, but possibly not the entire model, hypothetically for
         | instance some fine-tuning on compression and then distillation.
        
           | Gigachad wrote:
           | I can imagine some uses for this. Imagine having to archive a
           | massive dataset where it's unlikely any individual image will
           | be retrieved and where perfect accuracy isn't required.
           | 
           | Could cut down storage costs a lot.
        
       | kgeist wrote:
       | I heard Stable Diffusion's model is just 4 GB. It's incredible
       | that billions of images could be squeezed in just 4 GB. Sure it's
       | lossy compression but still.
        
         | eru wrote:
         | In this regard, stable diffusion is not so much comparable to a
         | corpus of JPEG images as to the JPEG compression algorithm
         | itself.
        
         | akomtu wrote:
         | I think it's easy to explain. If we split all those images into
         | small 8x8 chunks, and put all the chunks into a fuzzy and a bit
         | lossy hashtable, we'll see that many chunks are very similar
         | and can be merged into one. To address this "space of 8x8
         | chunks" we'll apply PCA to them, just like in jpeg, and use
         | only the top most significant components of the PCA vectors.
         | 
         | So in essence, this SD model is like an Alexandria library of
         | visual elements, arranged on multidimensional shelves.
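         | 
         | A toy numpy sketch of that intuition (random data standing in
         | for real 8x8 chunks):
         | 
         |     import numpy as np
         | 
         |     chunks = np.random.rand(10_000, 64)  # flattened 8x8 patches
         |     mean = chunks.mean(axis=0)
         |     _, _, vt = np.linalg.svd(chunks - mean, full_matrices=False)
         |     top = vt[:8]  # keep only the top principal directions
         | 
         |     def shelf_address(chunk):
         |         # "Shelf coordinates": the chunk's top-8 PCA coefficients.
         |         return (chunk - mean) @ top.T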
        
         | nl wrote:
         | I don't think that thinking of it as "compression" is useful,
         | any more than an artist recreating the Mona Lisa from memory is
         | "decompressing" it. The process that diffusion models use is
         | fundamentally different to decompression.
         | 
         | For example, if you prompt Stable Diffusion with "Mona Lisa"
         | and look at the iterations, it is clearer what is happening -
         | it's not decompressing so much as drawing something it knows
         | looks like Mona Lisa and then iterating to make it look clearer
         | and clearer.
         | 
         | It clearly "knows" what the Mona Lisa looks like, but what it
         | is doing isn't copying it - it's more like recreating a thing
         | that looks like it.
         | 
         | (And yes I realize lots of artists on Twitter are complaining
         | that it is copying their work. I think "forgery" is a better
         | analogy than "stealing" though - it can create art that looks
         | like a Picasso or whatever, but it isn't copying it in a
         | conventional sense)
        
           | Gigachad wrote:
           | Forgery requires some kind of deception/fraud. Painting an
           | imitation of the Mona Lisa isn't forgery. Trying to sell it
           | as if it is the original is.
        
             | nl wrote:
             | Yes I agree with this too.
             | 
             | I think using that language is better than "stealing",
             | because the immoral act is the passing off, not training of
             | the model.
        
       | ilaksh wrote:
       | What if I just want something pretty similar but not necessarily
       | the exact image. Maybe there could be a way to find a somewhat
       | similar text prompt as a starting point, and then add in some
       | compressed information to adjust the prompt output to be just a
       | bit closer to the original?
        
       | MarkusWandel wrote:
       | The one with the different buildings in the reconstructed image
       | is a bit spooky. I've always argued that human memory is highly
       | compressed, storing, for older memories anyway, a "vibe" plus
       | pointers to relevant experiences/details that can be used to
       | flesh it out as needed. Details may be wrong in the
       | recollecting/retelling, but the "feel" is right.
       | 
       | And here we have computers doing the same thing! Reconstructing
       | an image from a highly compressed memory and filling in
       | appropriate, if not necessarily exact details. Human eye looks at
       | it casually and yeah, that's it, that's how I remember it. Except
       | that not all the details are right.
       | 
       | Which is one of those "Whoa!" moments, like many many years ago,
       | when I wrote a "Connect 4" implementation in BASIC on the
       | Commodore 64, played it and lost! How did the machine get so
       | smart all of a sudden?
        
       | illubots wrote:
       | In theory, it would be possible to benefit from the ability of
       | Stable Diffusion to increase perceived image quality without even
       | using a new compression format. We could just enhance existing
       | JPG images in the browser.
       | 
       | There already are client side algorithms that increase the
       | quality of JPGs a lot. For some reason, they are not used in
       | browsers yet.
       | 
       | A Stable Diffusion based enhancement would probably be much nicer
       | in most cases.
       | 
       | There might be an interesting race to do client-side image
       | enhancements coming to browsers over the next few years.
        
       | codeflo wrote:
       | One interesting feature of ML-based image encoders is that it
       | might be hard to evaluate them with standard benchmarks, because
       | those are likely to be part of the training set, simply by virtue
       | of being scraped from the web. How many copies of Lenna has
       | Stable Diffusion been trained with? It's on so many websites.
        
         | zxexz wrote:
         | We might enter a time when every time a new model/compression
         | algo is introduced, a new series of benchmark images may need
         | to be introduced/taken and ALL historical benchmarks of major
         | compression algos redone on the new images.
        
       | seydor wrote:
       | Is there something like this for live video chat?
        
       | FrostKiwi wrote:
       | I thought this was another take on this parody post:
       | https://news.ycombinator.com/item?id=32671539
       | 
       | But no, it's the real deal. Great job author.
        
       | nl wrote:
       | This but for video using the "infilling" version for changing
       | parts between frames.
       | 
       | The structural changes per frame matter much less. Send a 5kB
       | image every keyframe then bytes per subsequent image with a
       | sketch of the changes and where to mask them on the frame.
       | 
       | Modern video codecs are pretty amazing though, so not sure how it
       | would compare in frame size
        
         | willbudd wrote:
         | I've been thinking about more or less the same idea, but the
         | computational edge inference costs probably makes it
         | impractical for most of today's client devices. I see a lot of
         | potential in this direction in the near future though.
        
           | nl wrote:
           | I think it's unclear how many computational resources the
           | decompression steps take.
           | 
           | At the moment it's fairly fast, but RAM hungry. But this
           | article makes it clear that quantizing the representation
           | works well (at least for the VAE). It's possible quantized
           | models could also do decent jobs.
        
       | swayvil wrote:
       | This is the algorithmic equivalent of a metaphor.
        
         | bane wrote:
         | Goodness, I love this. It's a great description of the
         | approach.
        
         | criddell wrote:
         | Before I clicked through to the article, I thought maybe they
         | were taking an image and spitting out a prompt that would
         | produce an image substantially similar to the original.
        
       | sod wrote:
       | This may give insights in how brain memory and thinking works.
       | 
       | Imagine if some day a computer could take a snapshot of the
       | weights and memory bits of the brain and then reconstruct
       | memories and thoughts.
        
         | epmaybe wrote:
         | This kind of already fits a little bit with how the brain
         | processes images where there is information lacking.
         | Neurocognitive specialists can likely correct me on the
         | following.
         | 
         | Glaucoma is a disease where one slowly loses peripheral vision,
         | until a small central island remains or you go completely
         | blind.
         | 
         | So do patients perceive black peripheral vision? Or blurred
         | peripheral vision?
         | 
         | Not really...patients actually make up the surrounding
         | peripheral vision, sometimes with objects!
        
       | SergeAx wrote:
       | Does anybody understand from the article how much data needs to
       | be downloaded first on the decompression side? The entire 2 GB
       | array of SD weights, right?
        
       | RosanaAnaDana wrote:
       | Something interesting about the San Francisco test image is that
       | if you start to look into the details, its clear that some real
       | changes have been made to the city. Rather than losing texture or
       | grain or clarity, the information lost in this is information
       | about the particular layout of a neighborhood of streets, which
       | has now been replaced as if someone were drawing the scene from
       | memory. A very different kind of loss that, without the original,
       | might be imperceptible, because the information that was lost
       | isn't replaced with random or systematic noise, but rather with
       | new, structured information.
        
         | jhrmnn wrote:
         | It's interesting that this is closer to how human memory
         | operates--we're quite good at unconsciously fabricating false
         | yet strong memories.
        
           | laundermaf wrote:
           | True, but I'd like to continue using products that produce
           | close-to-real images. Phones nowadays already process images
           | a lot. The moment they start replacing pixels it'll all be
           | fake.
           | 
           | And... Some manufacturer apparently already did it on their
           | ultra zoom phones when taking photos of the moon.
        
             | NavinF wrote:
             | Meh. Cameras have been "replacing pixels" for as long as
             | I've been alive. Consider that a 4K camera only has 2k*4k
             | pixels whereas a 4K screen has 2k*4k*3 subpixels.
             | 
             | 2/3 of the image is just dreamed up by the ISP (image
             | signal processor) when it debayers the raw image.
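             | 
             | To make that 2/3 concrete, here's a toy bilinear demosaic of
             | an RGGB mosaic in Python -- nothing like a real ISP pipeline,
             | just the naive version of the interpolation:
             | 
             |     import numpy as np
             |     from scipy.signal import convolve2d
             | 
             |     def demosaic_bilinear(raw):
             |         # raw: (h, w) sensor values, one colour sample per pixel
             |         h, w = raw.shape
             |         r = np.zeros((h, w), bool); r[0::2, 0::2] = True
             |         b = np.zeros((h, w), bool); b[1::2, 1::2] = True
             |         g = ~(r | b)
             |         k = np.ones((3, 3))
             |         out = np.empty((h, w, 3))
             |         for c, mask in enumerate((r, g, b)):
             |             samples = np.where(mask, raw, 0.0)
             |             # each missing value = local mean of measured samples
             |             out[..., c] = (
             |                 convolve2d(samples, k, "same", "symm")
             |                 / convolve2d(mask.astype(float), k, "same", "symm"))
             |         return out  # 3 values per pixel, only 1 was measured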
             | 
             | I'm not aware of any consumer hardware that has open source
             | ISP firmware or claims to optimize for accuracy over
             | beauty.
        
               | montroser wrote:
               | Okay, but a camera doing this is unlikely to dream up
               | plausible features that didn't actually exist in the
               | scene.
        
               | NavinF wrote:
               | Of course it is! Try feeding static into a modern ISP. It
               | will find patterns that don't exist.
        
         | taberiand wrote:
         | I would've thought anyone relying on lossy-compressed images of
         | any sort already needs to be aware of the potential effects, or
         | otherwise isn't really concerned about the effect on the image
         | (and I'd guess that the vast majority of use cases actually
         | don't care if parts of the image are essentially "imaginary").
        
         | aaaaaaaaaaab wrote:
         | The good old JBIG2 debacle.
         | 
         | "When used in lossy mode, JBIG2 compression can potentially
         | alter text in a way that's not discernible as corruption. This
         | is in contrast to some other algorithms, which simply degrade
         | into a blur, making the compression artifacts obvious.[14]
         | Since JBIG2 tries to match up similar-looking symbols, the
         | numbers "6" and "8" may get replaced, for example.
         | 
         | In 2013, various substitutions (including replacing "6" with
         | "8") were reported to happen on many Xerox Workcentre
         | photocopier and printer machines, where numbers printed on
         | scanned (but not OCR-ed) documents could have potentially been
         | altered. This has been demonstrated on construction blueprints
         | and some tables of numbers; the potential impact of such
         | substitution errors in documents such as medical prescriptions
         | was briefly mentioned."
         | 
         | https://en.m.wikipedia.org/wiki/JBIG2
        
         | tlrobinson wrote:
         | One thing that worries me about generative AI is the
         | degradation of "truth" over time. AI will be the cheapest way
           | to generate content, by far. It will sometimes get facts
         | subtly wrong, and eventually that AI generated content will be
         | used to train future models. Rinse and repeat.
        
           | jacobr1 wrote:
           | The interesting thing is that in some ways this is a return
           | to the pre-modern era of lossy information transmission
           | between generations. Every story is re-molded by the
           | re-teller. Languages change, and with them the contextual
           | interpretations. Even something as seemingly static as a book
           | gets slowly modified as scribes rewrite scrolls over the
           | centuries.
        
           | poszlem wrote:
           | We are getting closer and closer to a simulacrum and
           | hyperreality.
           | 
           | We used to create things that were trying to simulate
           | (reproduce) reality, but now we are using those "simulations"
           | we'd created as if they were the real thing. With time we
           | will be getting farther away from the "truth" (as you put
           | it), and yes - I share your worry about that.
           | 
           | https://en.wikipedia.org/wiki/Simulacrum
           | 
           | EDIT: A good example I heard that explains what a simulacrum
           | is was this: ask a random person to draw a picture of a
           | princess and see how many will draw a Disney princess (which
           | was already based on real princesses) vs. how many will draw
           | one looking like Catherine of Aragon or another real princess.
        
           | intrasight wrote:
           | art is truth
        
           | Xcelerate wrote:
           | So you've described humans.
        
             | _nalply wrote:
             | Currently computers can reliably do maths. Later AI will
             | unreliably do maths. Exactly like humans.
        
               | pishpash wrote:
                | So it will get stupider... maybe the singularity isn't
                | bad in the "too smart" way, but bad in the "dealing with
                | too many stupid people" way.
        
               | ballenf wrote:
               | Maybe making (certain kinds of) math mistakes is a sign
               | of intelligence.
        
               | ciphol wrote:
               | The nice thing about math is that often it's much harder
                | to find a proof than to verify that proof. So a math AI
                | is allowed to make lots of dumb mistakes; we just want it
                | to make the occasional real finding too.
        
               | MauranKilom wrote:
               | Unless we also ask AI to do the proof verification...
        
               | rowanG077 wrote:
               | Why would you do that? Proof verification is pretty much
               | a solved problem.
        
               | gpderetta wrote:
                | Both stupider and less deterministic, but also smarter
                | and more flexible. Like humans.
        
             | tlrobinson wrote:
             | Fair point, though I feel there's a difference as AI can
             | generate content much more quickly.
        
           | jefftk wrote:
           | Similar to how we have low-background (pre-nuclear) steel,
           | might we have pre-transformer content?
        
           | Lorin wrote:
           | Jpeg bitrot 2.0
        
           | blacksmith_tb wrote:
           | Certainly possible, though we also have many hundreds of
           | millions of people walking the globe taking pictures of
           | things with their phones (not all of which are public to be
           | used for training, but still).
        
           | fny wrote:
           | I've started seeing more of this crap show up on the front
           | page of Google.
        
           | sharemywin wrote:
           | Kind of like how chicken tastes like everything.
        
           | robbomacrae wrote:
           | Yes indeed. I've been looking for an auto summarizer that
           | reliably doesn't change the content. So far everything I've
           | tried will make up or edit a key fact once in a while.
        
           | z3c0 wrote:
           | Anywhere that truth matters will be unaffected. If such
           | deviations from the truth can persist, then the truth never
           | mattered. False assumptions will never hold where they can't,
           | because reality is quite pervasive. Ask anyone who's had to
           | productionize an ML model in a setting that requires a foot
           | in reality. Even a single-digit drop in accuracy can have
           | resounding effects.
        
         | thaumasiotes wrote:
         | There was a scandal when it was discovered that Xerox machines
         | were doing this; in that case, the example showed "photocopies"
         | replacing numbers in documents with other numbers.
        
           | smitec wrote:
           | There is a talk about that issue [1].
           | 
           | During my PhD this issue came up amongst those in the group
           | looking into compressed sensing in MRI. Many reconstruction
           | methods (AI being a modern variant) work well because a best
           | guess is visually plausible. These kinds of methods fall
           | apart when visually plausible and "true" are different in a
           | meaningful way. The simplest examples here being the numbers
           | in scanned documents, or in the MRI case, areas of the brain
           | where "normal brain tissue" was on average more plausible
           | than "tumor".
           | 
           | [1]: http://www.dkriesel.com/en/blog/2013/0802_xerox-
           | workcentres_...
        
             | nl wrote:
             | It's worth noting that these problems are things to be
             | aware of, not the complete showstoppers some people seem to
             | think that they are.
        
               | thaumasiotes wrote:
               | I'm having a hard time seeing where the random
               | substitution of all numbers isn't supposed to be a
               | complete showstopper.
        
               | nl wrote:
                | Well, for example, you could train the VAE to apply less
                | compression to characters.
        
               | kgwgk wrote:
               | The right amount of compression in a photocopy machine is
               | zero.
               | 
               | Compression that gives you a blurred image is a trade-
               | off.
               | 
               | But what does it mean to "be aware of" compression that
               | may give you a crisp image of some made up document?
        
               | nl wrote:
               | > The right amount of compression in a photocopy machine
               | is zero.
               | 
               | This isn't an obvious statement to me. If you've had the
               | misfortune of scanning documents to PDF and getting the
               | 100MB per page files automatically emailed to you then
               | you might see the benefit in all that white space being
               | compressed somehow.
               | 
               | > But what does it mean to "be aware of" compression that
               | may give you a crisp image of some made up document?
               | 
               | This isn't something I said. A good compression system
               | for documents will not change characters in any
               | circumstances.
        
               | rjmunro wrote:
               | If you are making an image of a cityscape to illustrate
               | an article it probably doesn't matter what the city looks
               | like. But if the article is about the architecture of the
               | specific city, it probably does, so you need to 'be
               | aware' that the image you are showing people isn't
               | correct, and reduce the compression.
        
               | kgwgk wrote:
               | This subthread was about changing numbers in scanned
               | documents and vanishing tumors in medical images.
        
               | rowanG077 wrote:
                | A medical sensor filling in "plausible" information is
                | not a showstopper? I hope you are never in control of
                | making decisions like that.
        
               | nl wrote:
               | To be aware of when you are building compression systems.
               | 
               | It's perfectly possible to build neural network based
               | compression systems that do not output false information.
        
               | lm28469 wrote:
               | > not the complete showstoppers some people seem to think
               | that they are.
               | 
                | idk, if I had to second-guess every single result coming
                | out of a machine it would be a showstopper for me. This
                | isn't Pokemon Go; tumor detection is a serious matter.
        
               | pishpash wrote:
                | Why you would want to lossily compress any medical image
                | is beyond me. You get equipment to make precise, high-
                | resolution measurements; it goes without saying that you
                | do not want noise added to that.
        
         | kybernetikos wrote:
         | Yeah, if it were actually adopted as a way to do compression,
         | it seems likely to lead to even worse problems than JBIG2 did
         | https://news.ycombinator.com/item?id=6156238
         | 
         | Invisibly changing the content rather than the image quality
         | seems like a really concerning failure mode for image
         | compression!
         | 
         | I wonder if it'd be possible to use SD as part of a lossless
         | system - use SD as something that tells us the likelihood of
         | various pixel values given the rest of the image, and combine
         | that likelihood with a Huffman encoding. Either way, fantastic
         | hack, but we really should avoid using anything lossy built on
         | AI for image compression.
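         | 
         | That's essentially how learned lossless codecs work: a model
         | supplies p(pixel | context) and an entropy coder turns those
         | probabilities into bits. A minimal sketch of the achievable
         | size, with predict_dist() as a stand-in for a hypothetical
         | SD-based predictor:
         | 
         |     import numpy as np
         | 
         |     def ideal_code_length_bits(image, predict_dist):
         |         # predict_dist(image, i) -> p(pixel i = v | other pixels),
         |         # an array of 256 probabilities summing to 1 (made-up API)
         |         total = 0.0
         |         for i, v in enumerate(image.ravel()):
         |             p = predict_dist(image, i)
         |             total += -np.log2(p[v])   # bits for this symbol
         |         return total  # an arithmetic coder gets within ~2 bits
         | 
         | Decoding is lossless because the decoder runs the same model in
         | the same order and inverts the entropy coder, so better guesses
         | mean a smaller file - but never a wrong pixel.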
        
           | pishpash wrote:
           | Give it "enough" bits and it won't be a problem. How many is
           | enough is the question.
        
           | eloisius wrote:
           | Imagine a world where bandwidth constraints meant
           | transmitting a hidden compressed representation that gets
           | expanded locally by smart TVs that have pretrained weights
           | baked into the OS. Everyone sees a slightly different
           | reconstitution of the same input video. Firmware updates that
           | push new weights to your TV result in stochastic changes to a
           | movie you've watched before.
        
             | jacobr1 wrote:
             | You could still use some kind of adaptive huffman coding.
             | Current compression schemes have some kind of dictionary
             | embedded in the file to map between the common strings and
             | the compressed representation. Google proposed SDCH a few
             | years ago, using a shared dictionary for web pages. There
             | isn't any reason why we can't be a bit more deterministic
             | and share a much larger latent representation of "human
             | visual comprehension" or whatever to do the same. It
             | doesn't need to be stochastic once generated.
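             | 
             | On a small scale zlib already supports exactly this via
             | preset dictionaries - the shared "model" is just a blob both
             | sides have ahead of time, and the round trip is fully
             | deterministic:
             | 
             |     import zlib
             | 
             |     # shipped with the client ahead of time (SDCH-style)
             |     shared = b"<html><head><title></title></head><body>" * 20
             | 
             |     def compress(data):
             |         c = zlib.compressobj(level=9, zdict=shared)
             |         return c.compress(data) + c.flush()
             | 
             |     def decompress(blob):
             |         d = zlib.decompressobj(zdict=shared)
             |         return d.decompress(blob) + d.flush()
             | 
             |     page = b"<html><head><title>hi</title></head><body>hey"
             |     assert decompress(compress(page)) == page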
        
             | kybernetikos wrote:
             | "The weather forecast was correct as broadcast, sir, it's
             | just your smart TV thought it was more likely that the
             | weather in your region would be warm on that day, so it
             | adjusted the symbol and temperature accordingly"
        
         | ZiiS wrote:
         | It opens up an interesting question: is it suggesting
         | "improvements" that could be made in the real world?
        
           | RosanaAnaDana wrote:
           | Are you suggesting a lossy but 'correct' version?
           | 
           | I.e., the algorithm ignores and loses the 'irrelevant'
           | information, but holds the important stuff?
        
         | phkahler wrote:
         | This needs to be compared with automated tests. A lack of
         | visual artifacts doesn't mean an accurate representation of the
         | image in this case.
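         | 
         | Even a crude pixel-wise metric would flag what the eye misses
         | here, since "plausible but different" content still moves a lot
         | of pixel values. A minimal PSNR check, for example:
         | 
         |     import numpy as np
         | 
         |     def psnr(original, reconstructed, peak=255.0):
         |         # higher is better; content swaps show up as a low score
         |         diff = original.astype(float) - reconstructed.astype(float)
         |         mse = np.mean(diff ** 2)
         |         return float("inf") if mse == 0 else 10 * np.log10(peak**2 / mse)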
        
         | freediver wrote:
         | Arguably this still fits the definition of lossy compression.
         | The compressed image still roughly conveys the idea of the
         | original image.
        
       | perryizgr8 wrote:
       | I believe ML techniques are the future of video/image
       | compression. When you read a well written novel, you can kind of
       | construct images of characters, locations and scenes in your
       | mind. You can even draw these scenes, and if you're a good
       | artist, those won't have any artifacts.
       | 
       | I don't expect future codecs to be able to reduce a movie to a
       | simple text stream, but maybe it could do something in the same
       | vein. Store abstract descriptions instead of bitmaps. If the
       | encoding and decoding are good enough, your phone could
       | reconstruct an image that closely resembles what the camera
       | recorded. If your phone has to store a 50 GB model for that, it
       | doesn't seem too bad, especially if the movie file could be
       | measured in tens of megabytes.
       | 
       | Or it could go in another direction, where file sizes remain in
       | the gigabytes, but quality jumps to extremely crisp 8k that you
       | can zoom into or move the camera around if you want.
       | 
       | Can't wait for this stuff!
        
       | UniverseHacker wrote:
       | From the title, I expected this to be basically pairing stable
       | diffusion with an image captioning algorithm by 'compressing' the
       | image to a simple human readable description, and then
       | regenerating a comparable image from the text. I imagine that
       | would work and be possible, essentially an autoencoder with a
       | 'latent space' of single short human readable sentences.
       | 
       | The way this actually works is pretty impressive. I wonder if it
       | could be made lossless or less lossy in a similar manner to FLAC
       | and/or video compression algorithms... basically first do the
       | compression, and then add on a correction that converts the
       | result partially or completely into the true image - e.g.
       | encoding the real pixels of the most egregiously modified
       | regions of the photo and putting them back over the result.
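       | 
       | A minimal sketch of the residual idea (assuming 8-bit images and
       | using zlib as the stand-in lossless coder):
       | 
       |     import numpy as np, zlib
       | 
       |     def encode_residual(original, lossy_recon):
       |         # stored alongside the latent; exactness costs only as much
       |         # as the decoder's mistakes fail to compress
       |         diff = original.astype(np.int16) - lossy_recon.astype(np.int16)
       |         return zlib.compress(diff.tobytes())
       | 
       |     def decode_residual(lossy_recon, blob):
       |         diff = np.frombuffer(zlib.decompress(blob), dtype=np.int16)
       |         diff = diff.reshape(lossy_recon.shape)
       |         return (lossy_recon.astype(np.int16) + diff).astype(np.uint8)
       | 
       | The catch is that where SD hallucinates structure, the residual is
       | itself image-like and compresses poorly, so the gains shrink
       | exactly where the correction matters most.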
        
       | Waterluvian wrote:
       | I wonder if this technique could be called something like
       | "abstraction" rather than "compression" given it will actually
       | change information rather than its quality.
       | 
       | Ie. "There's a neighbourhood here" is more of an abstraction than
       | "here's this exact neighbourhood with the correct layout just
       | fuzzy or noisy."
        
         | seydor wrote:
         | like a MIDI file
        
           | Sohcahtoa82 wrote:
           | Well, a MIDI file says nothing about the sound a Trumpet
           | makes, whereas this SD-based abstraction does give a general
           | idea of what your neighborhood should look like.
           | 
           | Maybe it's more like a MOD file?
        
         | rowanG077 wrote:
         | I would say any compression is abstraction in a certain sense.
         | A simple example is a gradient: a lossy compressor might
         | abstract over the precise pixel values and simply record a
         | gradient that almost matches the raw input. You could even
         | argue that lossless compression is abstraction. A 2D grid with
         | 5px lines and 50px spacing between them can be captured really
         | well by a classical compression scheme. What AI offers is just
         | a more powerful and opaque way of doing the same thing.
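         | 
         | The grid example is easy to check with an off-the-shelf
         | lossless coder - structure compresses to almost nothing, noise
         | doesn't:
         | 
         |     import numpy as np, zlib
         | 
         |     h = w = 500
         |     grid = np.zeros((h, w), np.uint8)
         |     for s in range(0, w, 50):
         |         grid[:, s:s + 5] = 255   # vertical lines
         |         grid[s:s + 5, :] = 255   # horizontal lines
         |     noise = np.random.randint(0, 256, (h, w), dtype=np.uint8)
         | 
         |     print(len(zlib.compress(grid.tobytes())))   # tiny
         |     print(len(zlib.compress(noise.tobytes())))  # ~the raw 250 kB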
        
       | ipunchghosts wrote:
       | What does Johannes Ballé have to say about this?
        
       ___________________________________________________________________
       (page generated 2022-09-20 23:00 UTC)