[HN Gopher] Search 5.8B images used to train popular AI art models
___________________________________________________________________
 
Search 5.8B images used to train popular AI art models
 
Author : homarp
Score  : 37 points
Date   : 2022-09-14 21:17 UTC (1 hour ago)
 
(HTM) web link (haveibeentrained.com)
(TXT) w3m dump (haveibeentrained.com)
 
| naet wrote:
| I put in Donald Trump to see what kind of celebrity images might
| be in there, and there are a TON of memes / photoshopped versions
| of him looking like a caricature or otherwise warped. I wonder if
| the AI will average these into a fair resemblance, or whether
| prompts using his name will end up more cartoonish than other
| names due to the source data...
|
| shantara wrote:
| Does this website search the same dataset as
| https://knn5.laion.ai?
|
| latchkey wrote:
| I just uploaded a picture of my dog (Bichon Frise) and it showed
| me a bazillion nearly identical dogs.
|
| Why isn't this immediately being used as a missing persons (or
| pet) database service?
|
| at_a_remove wrote:
| Let's think of reasons!
|
| 1) Privacy.
|
| 2) Lack of geolocation data associated.
|
| 3) Privacy.
|
| 4) Lack of contact information attached.
|
| 5) Privacy.
|
| 6) Lack of case numbers for various missing persons cases being
| attached.
|
| 7) Privacy.
|
| latchkey wrote:
| Certainly this could be used for evil (tm), but it seems like
| something could also be built out that enables this for good in
| ways that don't cause issues with Mr. Fibonacci.
|
| groby_b wrote:
| Because you're missing a person, not a picture of them?
|
| philipkglass wrote:
| I think that you answered your own question. You uploaded a
| picture of your non-missing dog and it found many very
| similar-looking dogs.
|
| fimdomeio wrote:
| Sometimes people are missing, sometimes they just have
| legitimate reasons to not want to be found.
|
| version_five wrote:
| I found this very similar to what you'd get with a Google / Bing
| / etc. image search.
| Is that where this database comes from? I noticed there is a lot
| of "Shutterstock" watermarked stuff. And I also checked a few
| "adult" terms (large breasts etc.) and found there is a lot of
| nude content. Only curious because I've seen lots of the
| generative models have some post-filtering for nudity; why don't
| they just clean it out of the training data if they're worried?
|
| mgraczyk wrote:
| Half-page popup cookie banner with no opt-out option. I am
| completely fine with cookies, but clearly there is something
| wrong with the state of things when I have to click through a
| completely non-actionable popup.
|
| 0xrisk wrote:
| I'm one of the people building this. Hi HN, AMA :)
|
| telotortium wrote:
| What visual search engine are you using?
|
| simandl wrote:
| This is CLIP matching the text or image searched to the images
| in the LAION-5B dataset.
|
| tener wrote:
| Privacy policy for searched strings?
|
| "Rutkowski" returns a bunch of book covers, repeated a lot. Can
| you ensure images returned have diverse embeddings? I expected
| digital art, not detective stories.
|
| Do you use CLIP or just metadata?
|
| What is the intended process that starts after you get an
| artist's email (for either purpose)?
|
| simandl wrote:
| It's using CLIP to match the text to the image, so you can
| actually prompt it like you might an art generator. Here's
| "in the style of greg rutkowski":
| https://haveibeentrained.com/?search_text=in%20the%20style%2...
|
| In the next few weeks we'll be adding the ability to log in and
| flag or upload your works (if they aren't there). Those lists
| will have permissions assigned to them, starting with simple
| opt-in or opt-out.
|
| educaysean wrote:
| One thing that sticks out to me is how many of the images in the
| collection have really terrible labels. I uncovered a large
| collection of pieces by an illustrator who was unsearchable by
| name, only via image upload.
| The reason they were unsearchable: the majority of this
| particular artist's images had labels that were all in the
| format of:
|
| {username}'s profile image
|
| constantlm wrote:
| I'd love to know how this works. I entered my own name for the
| lols, and it returned mostly paintings of the Cape Winelands in
| South Africa, where I grew up, which is pretty creepy.
|
| 0xrisk wrote:
| That is crazy. Perhaps your family name is common in that
| region?
|
| prox wrote:
| It surprises me how many meme images there are. Aren't they
| low-quality content?
|
| I haven't tried, but I don't see SD making any memes by itself
| yet.
|
| simandl wrote:
| Stable Diffusion used an aesthetic filter to train on a subset
| of the English-language images from this full 5.8 billion
| multi-language set. That probably filtered out a lot of what
| you're finding.
|
| gauravphoenix wrote:
| Interestingly, if you search for "127.0.0.1", it throws
|
| {"message":"Forbidden"}
|
| in the API response, and the page says:
|
| Sorry, there was an error with your search. Please try a
| different request.
|
| jonas-w wrote:
| Same for "localhost", but "[::1]" works.
|
| gigel82 wrote:
| It's surprising how poorly labeled these images are; who is
| curating this collection?
|
| Can't they crowd-source a proper labeling project? I wonder how
| much better things like Stable Diffusion would be if their
| training included correct, complete labels for the images. I'm
| sure lots of folks would willingly spend a few minutes here and
| there to aid with the labeling if it means they get to enjoy the
| model for free.
|
| orbital-decay wrote:
| If you aren't Google, manually doing that with 5+ billion
| images might prove difficult, to put it mildly. Large-scale
| labeling is typically bootstrapped with smaller models and
| whatever manual data you have. What's being curated is the
| bootstrapping process.
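[Editor's note: the CLIP-based matching simandl describes in the thread — embedding both text queries and images into one vector space, then ranking by similarity — can be sketched in a few lines. This is a toy illustration only: the function names are made up, and the 3-dimensional vectors stand in for CLIP's real high-dimensional embeddings, which would come from the model itself.]

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_images(query_embedding, image_index, k=3):
    """Return the k image keys whose embeddings are closest to the query.

    image_index maps image URL -> embedding vector. Because CLIP embeds
    text and images into the same space, the query can come from either
    a typed search string or an uploaded picture.
    """
    ranked = sorted(
        image_index.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [url for url, _ in ranked[:k]]

# Toy 3-d "embeddings" standing in for CLIP's high-dimensional vectors.
index = {
    "dog_photo_1.jpg": [0.9, 0.1, 0.0],
    "dog_photo_2.jpg": [0.8, 0.2, 0.1],
    "landscape.jpg":   [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # e.g. the embedded text "a bichon frise"
print(nearest_images(query, index, k=2))
# -> ['dog_photo_1.jpg', 'dog_photo_2.jpg']
```

At the 5.8-billion-image scale of LAION-5B, this brute-force scan is replaced by an approximate nearest-neighbor index, but the ranking idea is the same.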
| 0xrisk wrote:
| Part of our idea is that artists can help with labelling their
| own work.
|
| simandl wrote:
| This is LAION-5B; you can read more about it here:
| https://laion.ai/blog/laion-5b/
|
| Imagen and Stable Diffusion both used subsets of this full 5.8B
| image set.
|
| homarp wrote:
| Tweet explaining:
| https://twitter.com/matdryhurst/status/1570143343157575680
|
| "Releasing our first Spawning tool to help artists see if they
| are present in popular AI Art training data, and register to use
| our tools to opt in and opt out of AI training.
|
| I think we have created a way to make this work out well for
| everyone"
|
| and Stable Diffusion's Emad Mostaque agreeing to support the
| initiative:
| https://twitter.com/EMostaque/status/1570158985852121090
___________________________________________________________________
(page generated 2022-09-14 23:00 UTC)