[HN Gopher] Exploring 12M of the 2.3B images used to train Stabl...
___________________________________________________________________

Exploring 12M of the 2.3B images used to train Stable Diffusion

Author : detaro
Score : 53 points
Date : 2022-08-30 21:39 UTC (1 hour ago)

(HTM) web link (waxy.org)
(TXT) w3m dump (waxy.org)

| rektide wrote:
| So excellent. Flipping the story we see all the time on its
| head. AI's quasi-mystical powers are endless spectacle. Taking a
| look through the other side of the looking glass is vastly
| overdue. Amazing work.
| 
| We're just starting to scratch the surface: 2% of the data
| gathered, sources identified. These are a couple of sites we can
| now point to as the primary sources powering AI. The content
| itself has barely been reviewed or dived into. We have so little
| sense of & appreciation for what lurks beneath, but this is a
| start.

| TaylorAlexander wrote:
| "The most frequent artist in the dataset? The Painter of Light
| himself, Thomas Kinkade, with 9,268 images."
| 
| Oh, that's why it is so good at generating Thomas Kinkade-style
| paintings! I ran a bunch of those and they looked pretty good.
| Some kind of garden-cottage prompt in Thomas Kinkade's style
| works very well: good image consistency with a high success
| rate and few weird artifacts.

| lmarcos wrote:
| I always had the crazy idea of "infinite entertainment": somehow
| we manage to "tap" into the multiverse and are able to watch TV
| from countless planets/universes (I think Rick and Morty did
| something similar). So, on some channel at some time you may be
| able to see Brad Pitt fighting Godzilla while the monster is
| hacking into the Pentagon using ssh. Highly improbable, but on
| multiverse TV everything is possible.
| 
| Now I think we don't need the multiverse for that. Give this AI
| technology a few years and you'll have streaming services a la
| Netflix where you provide the prompt to create your own movie.
| What the hell, people will vote "best movie" among the millions
| submitted by other people. We'll be movie producers the way we
| are YouTubers nowadays. An overabundance of high-quality
| material and so little time to watch it all. The same goes for
| books, music, and everything else that is digital (even
| software?).

| temp_account_32 wrote:
| Isn't everything representable in digital form? I think we're
| in the very early era of entertainment becoming commoditized to
| an even higher degree than it is now.
| 
| I envision exactly the future you describe: feed a song to the
| AI and it spits out a completely new, whole discography from
| the artist, complete with lyrics and album art, that you can
| listen to infinitely.
| 
| "Hey Siri, play me a series about chickens from outer space
| invading Earth": no problem, here's a 12-hour marathon,
| complete with a coherent storyline, plot twists, good acting,
| and voice lines.
| 
| The only thing currently limiting us is computing power, and
| given enough time, that barrier will be overcome.
| 
| A human brain is just a series of inputs, a function that
| transforms them, and a series of outputs.

| gpm wrote:
| Huh, there's a ton of duplicates in the dataset... I would have
| expected it to be worthwhile to remove those. Maybe multiple
| descriptions of the same thing help, but some of the duplicates
| have duplicated descriptions as well. Maybe deduplication
| happens after this step?
| 
| http://laion-aesthetic.datasette.io/laion-aesthetic-6pls/ima...

| minimaxir wrote:
| Per the project page: https://laion.ai/blog/laion-400-open-
| dataset/
| 
| > There is a certain degree of duplication because we used
| URL+text as deduplication criteria. The same image with the
| same caption may sit at different URLs, causing duplicates. The
| same image with other captions is not, however, considered
| duplicated.
| 
| I am surprised that image-to-image dupes aren't removed,
| though, as the cosine-similarity trick the page mentions would
| work for that too.

| kaibee wrote:
| I assume having multiple captions for the same image is
| actually very helpful.

| minimaxir wrote:
| Scrolling through the sorted link from the GP, there are a few
| dupes with identical images and captions, so that criterion
| doesn't always work either.

| gchamonlive wrote:
| Isn't it really expensive to dedupe images based on content,
| since you have to compare every image to every other image in
| the dataset?
| 
| How could one go about deduping images? Maybe something similar
| to the rsync protocol: a cheap hash method, then a more
| expensive one, then a full comparison. Even so, 2B+ images...
| and mostly you are talking about saving on storage costs, which
| are quite cheap these days.
___________________________________________________________________
(page generated 2022-08-30 23:00 UTC)
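The tiered dedup idea gchamonlive floats (cheap hash first, then a more expensive perceptual check, then a full comparison) can be sketched roughly as below. This is a minimal illustration, not anything from the LAION pipeline: the function names are made up, and the perceptual tier assumes images have already been downscaled to a small grayscale grid, which a real pipeline would do with an image library.

```python
import hashlib

def exact_key(image_bytes: bytes) -> str:
    # Tier 1: cheap exact-duplicate check via a content hash.
    # Byte-identical files collide here and can be dropped immediately.
    return hashlib.sha256(image_bytes).hexdigest()

def average_hash(pixels, size=8):
    # Tier 2: a tiny "average hash" perceptual fingerprint.
    # `pixels` is assumed to be a size x size grid of grayscale
    # values (0-255), i.e. the image already downscaled.
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    # One bit per pixel: set if the pixel is brighter than the mean.
    return sum(1 << i for i, p in enumerate(flat) if p > mean)

def hamming(a: int, b: int) -> int:
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

def near_duplicate(h1: int, h2: int, threshold: int = 5) -> bool:
    # Fingerprints differing in only a few bits mark the pair as a
    # candidate for Tier 3: a full (expensive) pixel-by-pixel compare.
    return hamming(h1, h2) <= threshold
```

The point of the tiering is that the expensive comparison only ever runs on the small set of candidate pairs the cheap fingerprints flag, rather than on all ~2B^2 pairs; the embedding cosine-similarity trick minimaxir mentions plays the same role as Tier 2 here, just with a learned representation instead of a pixel-average one.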