[HN Gopher] Search 5.8B images used to train popular AI art models
       ___________________________________________________________________
        
       Search 5.8B images used to train popular AI art models
        
       Author : homarp
       Score  : 37 points
       Date   : 2022-09-14 21:17 UTC (1 hour ago)
        
 (HTM) web link (haveibeentrained.com)
 (TXT) w3m dump (haveibeentrained.com)
        
       | naet wrote:
       | I put in Donald Trump to see what kind of celebrity images might
       | be in there, and there are a TON of memes / photoshopped versions
       | of him looking like a caricature or otherwise warped. I wonder if
       | the AI will average these into a fair resemblance, or whether
       | prompts using his name will end up more cartoonish than other
       | names due to the source data...
        
       | shantara wrote:
       | Does this website search the same dataset as
       | https://knn5.laion.ai?
        
       | latchkey wrote:
       | I just uploaded a picture of my dog (Bichon Frise) and it showed
       | me a bazillion nearly exact similar dogs.
       | 
       | Why isn't this immediately being used as a missing persons (or
       | pet) database service?
        
         | at_a_remove wrote:
         | Let's think of reasons!
         | 
         | 1) Privacy.
         | 
         | 2) Lack of geolocation data associated.
         | 
         | 3) Privacy.
         | 
         | 4) Lack of contact information attached.
         | 
         | 5) Privacy.
         | 
         | 6) Lack of case numbers for various missing persons cases being
         | attached.
         | 
         | 7) Privacy.
        
           | latchkey wrote:
           | Certainly this could be used for evil (tm), but it seems like
           | something could also be built out that enables this for good
           | in ways that don't cause issues with Mr. Fibonacci.
        
         | groby_b wrote:
         | Because you're missing a person, not a picture of them?
        
         | philipkglass wrote:
         | I think that you answered your own question. You uploaded a
         | picture of your non-missing dog and it found many very similar
         | looking dogs.
        
         | fimdomeio wrote:
         | Sometimes people are missing; sometimes they just have
         | legitimate reasons to not want to be found.
        
       | version_five wrote:
       | I found this very similar to what you'd get with a google / bing
       | / etc image search. Is that where this database comes from? I
       | noticed there is a lot of "Shutterstock" watermarked stuff. And I
       | also checked a few "adult" terms (large breasts etc) and found
       | there is a lot of nude content. Only curious because I've seen
       | lots of the generative models have some post-filtering for
       | nudity. Why don't they just clean it out of the training data if
       | they're worried?
        
       | mgraczyk wrote:
       | Half-page popup cookie banner with no opt-out option. I am
       | completely fine with cookies, but clearly there is something
       | wrong with the state of things when I have to click through a
       | completely non-actionable popup.
        
       | 0xrisk wrote:
       | I'm one of the people building this. Hi HN, AMA :)
        
         | telotortium wrote:
         | What visual search engine are you using?
        
           | simandl wrote:
           | This is CLIP matching the text or image searched to the
           | images in the LAION-5B dataset.
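
Editor's sketch of the matching step described above: the query and every image are represented as embedding vectors, and results are ranked by cosine similarity. The three-dimensional vectors, the file names, and the `top_k` helper are made up for illustration; a real system would get its vectors from a CLIP model and would likely use an approximate nearest-neighbour index rather than a full sort.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Return the names of the k images most similar to the query."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical embeddings; real CLIP vectors have hundreds of dimensions.
index = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "puppy.jpg": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for an embedded text or image query

print(top_k(query, index))
```

Because text and images share one embedding space in CLIP, the same ranking function would serve both a typed query and an uploaded picture.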
        
         | tener wrote:
         | Privacy policy for searched strings?
         | 
         | "Rutkowski" returns a bunch of book covers, repeated a lot. Can
         | you ensure images returned have diverse embeddings? I expected
         | digital art, not detective stories.
         | 
         | Do you use CLIP or just metadata?
         | 
         | What is the intended process that starts after you get an
         | artist's email (for either purpose)?
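
Editor's sketch of one way to get the "diverse embeddings" asked for above: walk the ranked results and drop any result whose embedding is nearly identical to one already kept. The `diversify` helper, the 0.98 threshold, and the two-dimensional vectors are assumptions for illustration, not anything the site is known to do.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def diversify(ranked, threshold=0.98):
    """Greedily keep results that are not near-duplicates of kept ones.

    `ranked` is a list of (name, embedding) pairs, best match first.
    """
    kept = []
    for name, vec in ranked:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((name, vec))
    return [name for name, _ in kept]

# Hypothetical ranked results: two copies of one book cover, one painting.
ranked = [
    ("cover_a.jpg", [1.0, 0.0]),
    ("cover_a_copy.jpg", [1.0, 0.001]),
    ("painting.jpg", [0.6, 0.8]),
]

print(diversify(ranked))
```

The greedy pass preserves the original ranking order, so the best match always survives and only its repeats are removed.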
        
           | simandl wrote:
           | It's using CLIP to match the text to the image, so you can
           | actually prompt it like you might an art generator. Here's
           | "in the style of greg rutkowski":
           | https://haveibeentrained.com/?search_text=in%20the%20style%2...
           | 
           | In the next few weeks we'll be adding the ability to log in
           | and flag or upload your works (if they aren't there). Those
           | lists will have permissions assigned to them, starting with
           | simple opt-in or opt-out.
        
       | educaysean wrote:
       | One thing that sticks out to me is how many of the images
       | in the collection have really terrible labels. I uncovered a
       | large collection of pieces by an illustrator who was unsearchable
       | by name, only via image upload. The reason they were
       | unsearchable: the majority of this particular artist's images had
       | labels that were all in the format of:
       | 
       | {username}'s profile image
        
       | constantlm wrote:
       | I'd love to know how this works. I entered my own name for the
       | lols, and it returned mostly paintings of the Cape Winelands in
       | South Africa where I grew up, which is pretty creepy.
        
         | 0xrisk wrote:
         | That is crazy. Perhaps your family name is common in that
         | region?
        
       | prox wrote:
       | It surprises me how many meme images there are. Aren't they low
       | quality content?
       | 
       | I haven't tried, but I don't see SD making any memes by itself
       | yet.
        
         | simandl wrote:
         | Stable Diffusion used an aesthetic filter to train on a subset
         | of the English language images from this full 5.8 billion
         | multi-language set. That probably got a lot of what you're
         | finding.
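
Editor's sketch of the kind of subset selection described above: keep only English-language records whose predicted aesthetic score clears a cutoff. The field names (`lang`, `aesthetic`), the cutoff of 5.0, and the sample records are hypothetical; the actual filtering pipeline is not described in this thread.

```python
def select_subset(records, min_score=5.0, language="en"):
    """Keep records in the given language with a high enough score."""
    return [
        r for r in records
        if r["lang"] == language and r["aesthetic"] >= min_score
    ]

# Hypothetical metadata rows; scores would come from a learned predictor.
records = [
    {"url": "a.jpg", "lang": "en", "aesthetic": 6.2},
    {"url": "meme.jpg", "lang": "en", "aesthetic": 3.1},
    {"url": "b.jpg", "lang": "de", "aesthetic": 7.0},
]

subset = select_subset(records)
print([r["url"] for r in subset])
```

Under these made-up scores the meme is dropped for its low score and the German-captioned image for its language, which mirrors the point that much of the full set never reaches training.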
        
       | gauravphoenix wrote:
       | Interestingly, if you search for "127.0.0.1", it throws
       | {"message":"Forbidden"}
       | 
       | in the API response, and the page says:
       | 
       | Sorry, there was an error with your search. Please try a
       | different request.
        
         | jonas-w wrote:
         | Same for "localhost" but "[::1]" works.
        
       | gigel82 wrote:
       | It's surprising how poorly labeled these images are; who is
       | curating this collection?
       | 
       | Can't they crowd-source a proper labeling project? I wonder how
       | much better things like Stable Diffusion would be if their
       | training included correct, complete labels for the images. I'm
       | sure lots of folks would willingly spend a few minutes here and
       | there to aid with the labeling if it means they get to enjoy the
       | model for free.
        
         | orbital-decay wrote:
         | If you aren't Google, manually doing that with 5+ billion
         | images might prove difficult, to put it mildly. Large-scale
         | labeling is typically bootstrapped with smaller models and
         | whatever manual data you have. What's being curated is the
         | bootstrapping process.
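
Editor's sketch of the bootstrapping idea described above: a small model labels whatever it is confident about, and only the remainder is queued for people. The `bootstrap_labels` helper, the `toy_predict` stand-in, and the 0.9 confidence cutoff are all hypothetical; a real pipeline would use a trained classifier and retrain it as manual labels arrive.

```python
def bootstrap_labels(items, predict, min_conf=0.9):
    """Split items into auto-labeled ones and ones needing manual review.

    `predict` returns a (label, confidence) pair for an item.
    """
    auto, manual = {}, []
    for item in items:
        label, conf = predict(item)
        if conf >= min_conf:
            auto[item] = label
        else:
            manual.append(item)
    return auto, manual

def toy_predict(caption):
    """Stand-in for a small model trained on existing manual labels."""
    if "dog" in caption:
        return "dog", 0.95
    if "cat" in caption:
        return "cat", 0.92
    return "unknown", 0.30

captions = ["a dog on a beach", "user123's profile image", "cat asleep"]
auto, manual = bootstrap_labels(captions, toy_predict)
print(auto)
print(manual)
```

Here the uninformative "profile image" caption falls below the cutoff and lands in the manual queue, which is where artist-supplied labels could help.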
        
         | 0xrisk wrote:
         | Part of our idea is that artists can help with labelling their
         | own work.
        
         | simandl wrote:
         | This is LAION-5B; you can read more about it here:
         | https://laion.ai/blog/laion-5b/
         | 
         | Imagen and Stable Diffusion both used subsets of this full 5.8B
         | image set.
        
       | homarp wrote:
       | Tweet explaining:
       | https://twitter.com/matdryhurst/status/1570143343157575680
       | 
       | "Releasing our first Spawning tool to help artists see if they
       | are present in popular AI Art training data, and register to use
       | our tools to opt in and opt out of AI training
       | 
       | I think we have created a way to make this work out well for
       | everyone"
       | 
       | and Stable Diffusion's Emad Mostaque agreeing to support the
       | initiative:
       | https://twitter.com/EMostaque/status/1570158985852121090
        
       ___________________________________________________________________
       (page generated 2022-09-14 23:00 UTC)