[HN Gopher] Exploring 12M of the 2.3B images used to train Stable Diffusion
       ___________________________________________________________________
        
       Exploring 12M of the 2.3B images used to train Stable Diffusion
        
       Author : detaro
       Score  : 53 points
       Date   : 2022-08-30 21:39 UTC (1 hour ago)
        
 (HTM) web link (waxy.org)
 (TXT) w3m dump (waxy.org)
        
       | rektide wrote:
       | So excellent. Flipping the story we see all the time on its
       | head. AI's quasi-mystical powers are endless spectacle; taking
       | a look through the other side of the looking glass is vastly
       | overdue. Amazing work.
       | 
       | This is just starting to scratch the surface: 2% of the data
       | gathered, sources identified. We can now name a handful of
       | sites as the primary sources powering this AI, yet the content
       | itself has barely been reviewed or dug into. We have so little
       | sense of or appreciation for what lurks beneath, but this is a
       | start.
        
       | TaylorAlexander wrote:
       | "The most frequent artist in the dataset? The Painter of Light
       | himself, Thomas Kinkade, with 9,268 images."
       | 
       | Oh, that's why it is so good at generating Thomas Kinkade
       | style paintings! I ran a bunch of those and they looked pretty
       | good. Some kind of garden-cottage prompt in Thomas Kinkade's
       | style works very well: good image consistency with a high
       | success rate and few weird artifacts.
        
       | lmarcos wrote:
       | I always had the crazy idea of "infinite entertainment":
       | somehow we manage to "tap" into the multiverse and are able to
       | watch TV from countless planets/universes (I think Rick and
       | Morty did something similar). So, on some channel at some time
       | you might see Brad Pitt fighting Godzilla while the monster
       | hacks into the Pentagon using ssh. Highly improbable, but on
       | multiverse TV everything is possible.
       | 
       | Now I think we don't need the multiverse for that. Give this
       | AI technology a few years and you'll have streaming services a
       | la Netflix where you provide the prompt to create your own
       | movie. What the hell, people will vote "best movie" among the
       | millions submitted by other people. We'll be movie producers
       | the way we are YouTubers today. An overabundance of high
       | quality material and so little time to watch it all. The same
       | goes for books, music, and everything else that is digital
       | (even software?).
        
         | temp_account_32 wrote:
         | Isn't everything representable in a digital form? I think we're
         | in the very early era of entertainment becoming commoditized to
         | an even higher degree than it is now.
         | 
         | I envision exactly the future you describe: feed a song to
         | the AI, and it spits out a whole new discography from the
         | artist, complete with lyrics and album art, that you can
         | listen to endlessly.
         | 
         | "Hey Siri, play me a series about chickens from outer space
         | invading Earth": No problem, here's a 12 hour marathon,
         | complete with a coherent storyline, plot twists, good acting
         | and voice lines.
         | 
         | The only thing that is currently limiting us is computing
         | power, and given enough time, the barrier will be overcome.
         | 
         | A human brain is just a series of inputs, a function that
         | transforms them, and a series of outputs.
        
       | gpm wrote:
       | Huh, there are a ton of duplicates in the data set... I would
       | have expected it to be worthwhile to remove those. Maybe
       | multiple descriptions of the same image help, but some of the
       | duplicates have duplicated descriptions as well. Maybe
       | deduplication happens after this step?
       | 
       | http://laion-aesthetic.datasette.io/laion-aesthetic-6pls/ima...
        
         | minimaxir wrote:
         | Per the project page:
         | https://laion.ai/blog/laion-400-open-dataset/
         | 
         | > There is a certain degree of duplication because we used
         | URL+text as deduplication criteria. The same image with the
         | same caption may sit at different URLs, causing duplicates. The
         | same image with other captions is not, however, considered
         | duplicated.
         | 
         | I am surprised that image-to-image dupes aren't removed,
         | though, as the cosine similarity trick the page mentions would
         | work for that too.
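
A minimal sketch of the embedding cosine-similarity check mentioned above (this is not LAION's actual pipeline; `near_duplicates`, the threshold value, and the toy 2-D vectors are all hypothetical, and a real system would use approximate nearest-neighbor search instead of this O(n^2) pairwise loop):

```python
import math

def cosine(a, b):
    # Cosine similarity of two vectors: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def near_duplicates(embeddings, threshold=0.95):
    """Flag index pairs whose embeddings are nearly parallel."""
    dupes = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                dupes.append((i, j))
    return dupes

# Toy 2-D "embeddings": the first two point in almost the same direction.
vecs = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(near_duplicates(vecs))  # [(0, 1)]
```

With real image embeddings (hundreds of dimensions), the same comparison flags visually identical images even when their URLs and captions differ.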
        
           | kaibee wrote:
           | I assume having multiple captions for the same image is very
           | helpful actually.
        
             | minimaxir wrote:
             | Scrolling through the sorted link from the GP, there are a
             | few dupes with identical images and captions, so that
             | doesn't always work either.
        
           | gchamonlive wrote:
           | Isn't it really expensive to dedupe images based on
           | content, since you have to compare every image to every
           | other image in the dataset?
           | 
           | How could one go about deduping images? Maybe something
           | similar to the rsync protocol: a cheap hash first, then a
           | more expensive one, then a full comparison. Even so, with
           | 2B+ images... and it's mostly about saving on storage
           | costs, which are quite cheap these days.
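
A minimal sketch of that staged idea, under stated assumptions: the inputs are raw byte blobs, `cheap_key` and `full_digest` are hypothetical helper names, and real image dedup would use perceptual hashes (robust to re-encoding) rather than exact byte hashes:

```python
import hashlib
from collections import defaultdict

def cheap_key(data: bytes) -> int:
    # Stage 1: hash only the first 4 KiB -- fast, used just for bucketing.
    return hash(data[:4096])

def full_digest(data: bytes) -> str:
    # Stage 2: full SHA-256 -- slower, but collisions are practically impossible.
    return hashlib.sha256(data).hexdigest()

def dedupe(blobs):
    """Return sorted indices of unique blobs, running the expensive
    hash only within buckets that the cheap hash grouped together."""
    buckets = defaultdict(list)
    for i, blob in enumerate(blobs):
        buckets[cheap_key(blob)].append(i)
    unique = []
    for indices in buckets.values():
        seen = {}
        for i in indices:
            d = full_digest(blobs[i])
            if d not in seen:
                seen[d] = i
                unique.append(i)
    return sorted(unique)

blobs = [b"image-a", b"image-b", b"image-a", b"image-c"]
print(dedupe(blobs))  # [0, 1, 3] -- the repeated blob collapses to one entry
```

The design point is the same as rsync's: most items fail the cheap check against everything else, so the expensive comparison runs on only a small fraction of pairs.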
        
       ___________________________________________________________________
       (page generated 2022-08-30 23:00 UTC)