[HN Gopher] Laion-5B: A new era of open large-scale multi-modal ...
       ___________________________________________________________________
        
       Laion-5B: A new era of open large-scale multi-modal datasets
        
       Author : tosh
       Score  : 129 points
       Date   : 2022-12-12 12:18 UTC (10 hours ago)
        
 (HTM) web link (laion.ai)
 (TXT) w3m dump (laion.ai)
        
       | jerpint wrote:
       | LAION is arguably as important as imagenet was in the early 2010s
        
       | SubiculumCode wrote:
       | What makes this multimodal, labels?
        
         | ShamelessC wrote:
          | One mode is natural language, the other is imagery. It is a
          | combination because the model will learn statistical
          | associations between the modes, e.g. "text to image", "voice
          | to text".
          | 
          | Within these respective modes are even more subgroups, e.g.
          | language translation, audio diarization. For SD you can
          | consider animation and photographs as separate modes the model
          | has to learn. Although the language is fuzzy and I'm not being
          | statistically rigorous, as it is a weak point of mine.
        
       | minimaxir wrote:
       | For practical context, Stable Diffusion 2.X was trained on
       | LAION-5B as opposed to LAION-400M for Stable Diffusion 1.X.
       | 
       | At the least, Stable Diffusion 2.X is better at pain points of
       | image generation such as text legibility and hands, potentially
       | due to having more data points.
        
         | in3d wrote:
         | This is incorrect. Stable Diffusion 1.x was trained on "laion-
         | improved-aesthetics" (a subset of laion2B-en).
        
           | minimaxir wrote:
            | Double checked, and both the initial comment and the
            | correction are incorrect: the original v1.1 was trained on
            | LAION-2B, then subsequent versions were finetuned on the
            | aesthetics subset.
           | 
           | Either way, the main point is the same: more training data
           | gives better results.
           | 
           | https://github.com/CompVis/stable-diffusion#weights
        
             | in3d wrote:
             | 1.1 wasn't public. Public releases were trained as I said.
        
               | cma wrote:
               | 1.1 is available here:
               | https://huggingface.co/CompVis/stable-
               | diffusion-v-1-1-origin...
        
         | satvikpendem wrote:
          | SD 2 also removed quite a lot of images of humans due to their
          | fear of people generating CSAM, so the quality has actually
          | gotten worse for anything resembling humans compared to SD 1.
        
           | astrange wrote:
           | 2.0 removed too many of them due to a bug in the NSFW filter.
           | 2.1+ should be better again.
           | 
           | But they're harder to control without negative prompting.
        
         | Terretta wrote:
         | Problem with hands is probability.
         | 
         | It's more probable a finger has a finger on both sides of it
         | than not. So the model diffuses lots of adjacent fingers.
        
           | stavros wrote:
           | But that's the same for everything that has structure. A
           | small section of an arm is much more likely to have another
           | small section of an arm next to it than to have a hand, yet
           | SD's arms are usually well-proportioned.
        
             | sdenton4 wrote:
             | There's a lot of loooooong necks, though.
        
           | alar44 wrote:
           | No the problem with fingers is that they resemble hotdogs and
           | the AI really likes hotdogs so you get a lot of fingers.
           | 
           | I can make things up too!
        
       | lajamerr wrote:
        | While this is an impressive number of images today, I believe it
        | will be an underwhelming amount compared to what models are
        | trained on in the future.
        | 
        | This is an incomplete analogy, but in its first year a baby will
        | have seen roughly 1,892,160,000 frames of data per eye, or
        | 3,784,320,000 frames across both eyes, and that baby still knows
        | practically nothing about the world.
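        | 
        | Back-of-the-envelope math behind those numbers (a sketch,
        | assuming a 60 fps "equivalent" per eye, which is itself
        | debatable):
        | 
        |   fps = 60                               # assumed per-eye "FPS equivalent"
        |   secs_per_year = 60 * 60 * 24 * 365     # 31,536,000
        |   per_eye = fps * secs_per_year          # 1,892,160,000
        |   both_eyes = 2 * per_eye                # 3,784,320,000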
        
         | rom1504 wrote:
         | yes indeed. Video is the clear next step.
        
         | minimaxir wrote:
         | Most of those frames are redundant.
        
           | bena wrote:
           | And unclassified. And of poor quality.
           | 
           | Babies have a much harder task. They have to construct a
           | corpus of knowledge from absolutely nothing.
        
             | coolspot wrote:
              | Not absolutely nothing; the neural net is initialized with
              | some weights encoding basic things (breathing, sucking,
              | crying, etc.). A newborn horse walks and follows its mother
              | within the first 5-10 minutes.
        
             | trasz2 wrote:
             | How do we know they start from nothing?
        
               | CamperBob2 wrote:
               | In fact, we're pretty sure that they don't "start from
               | nothing." E.g.,
               | https://en.wikipedia.org/wiki/The_Language_Instinct
        
               | bena wrote:
               | We're not pretty sure of anything e.g.
               | https://en.wikipedia.org/wiki/Educating_Eve
        
               | CamperBob2 wrote:
               | On the surface, that sounds like a reasonable position to
               | take. ("Cowley proposes an alternative: that language
               | acquisition involves culturally determined language
               | skills, apprehended by a biologically determined faculty
               | that responds to them. In other words, he proposes that
               | each extreme is right in what it affirms, but wrong in
               | what it denies. Both cultural diversity of language, and
               | a learning instinct, can be affirmed; neither need be
               | denied.")
               | 
               | GPT's ability to fool intelligent people into thinking
               | that it is "intelligent" itself seems like a powerful
               | argument that language, more than anything else, is what
               | makes humans capable of higher thought. Language is all
               | GPT has. (Well, that and a huge-ass cultural database.)
               | 
               | Intelligence is one of those areas in which, once you
               | fake it well enough, you've effectively made it. Another
               | 10x will be enough to tie the game against an average
               | human player.
        
             | the8472 wrote:
             | The upside is that babies get to interact with the
             | environment they're training on. Image models can't move
             | the camera a few cm to the right if they're interested in
             | the perspective of a particular scene.
        
           | lajamerr wrote:
            | There's value in redundancy and in a continuous stream of
            | images where one frame follows the other.
            | 
            | It would be nice to have a dataset of a couple "raising" a
            | video recorder for 1 year as they would a baby: a continuous
            | stream of data.
           | 
           | Could train a model to predict the next frames based on what
           | it's seen so far.
        
             | mindcrime wrote:
              |  _It would be nice to have a dataset of a couple "raising"
              | a video recorder for 1 year as they would a baby: a
              | continuous stream of data._
             | 
             | The project I'm working on right now is to build a sort of
             | "body" for a (non ambulatory, totally non anthropomorphic)
             | "baby AI" that senses the world using cameras, microphones,
             | accelerometer/magnetometer/gyroscope sensor, temperature
             | sensors, gps, etc. The idea is exactly to carry it around
             | with me and "raise" it for long periods of time (a year?
             | Sure, absolutely, in principle. But see below) and explore
             | some ideas about how learning works in that regime.
             | 
             | The biggest (well, one of the biggest) challenge(s) is
             | going to be data storage. Once I start storing audio and
             | video the storage space required is going to ramp up
             | quickly, and since I'm paying for this out of my own pocket
             | I'm going to be limited in terms of how much data I can
             | keep around. Will I be able to keep a whole year? Don't
             | know yet.
             | 
             | There's also some legal and ethical stuff to work out,
             | around times when I take the thing out in public and am
             | therefore recording audio and video of other people.
        
               | lajamerr wrote:
                | Glad to hear you are working on such a project. There
                | will definitely be a lot of privacy concerns in any such
                | project, so it may be difficult to open source the data
                | to the broad public.
               | 
               | But could still be useful to research institutes who
               | follow privacy guidelines.
               | 
               | It might be best to do a short stint of 1 week to test
               | the feasibility. That should give you a good estimate on
               | future projections of how much data it will consume after
               | a month, 3 months, and a year.
               | 
               | I imagine any intelligent system could work with reduced
               | data quality/lossy data at least on the audio.
               | 
                | As long as it's consistent in the type/amount of
                | compression, it should be fine. So instead of
                | WAV/FLAC/RAW, you could encode it to something like Opus
                | at 100 kbps, and that would give you 394.2 gigabytes of
                | data for a single year of audio.
               | 
               | As for video... it would definitely require a lot of
               | tricks to store on a hobbyist level.
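                | 
                | A rough sizing sketch of that estimate (the video bitrate
                | below is purely a hypothetical assumption):
                | 
                |   secs_per_year = 60 * 60 * 24 * 365   # 31,536,000 s
                |   audio_bps = 100_000                  # Opus at 100 kbps
                |   video_bps = 1_000_000                # assumed 1 Mbps video
                |   audio_gb = audio_bps / 8 * secs_per_year / 1e9  # ~394.2
                |   video_gb = video_bps / 8 * secs_per_year / 1e9  # ~3942
                |   print(f"{audio_gb:.1f} GB audio, {video_gb:.0f} GB video")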
        
               | mindcrime wrote:
               | Yep. Your reply here encapsulates a lot of what I've been
               | thinking about for the past few weeks. I'd love to open-
               | source at least some of the data I collect, but the
               | privacy/ethics issues have to be considered. And as far
               | as that goes, there are legal/ethical issues around
                | simply _collecting_ data even if I don't share it, that
               | come into play where other people are involved.
               | 
               |  _It might be best to do a short stint of 1 week to test
               | the feasibility. That should give you a good estimate on
               | future projections of how much data it will consume after
               | a month, 3 months, and a year._
               | 
               | Yep. That's basically the approach I took with "phase 1"
               | where the only data being ingested was gps /
               | accelerometer data. I just let it run for a couple of
               | weeks and then extrapolated out what the storage
               | requirements would be for the future. Obviously audio and
               | video are going to change the equation a lot, but the
               | same principle is what I am planning to employ.
               | 
                |  _I imagine any intelligent system could work with
                | reduced data quality/lossy data at least on the audio._
               | 
               | Yep, that's another area I've been thinking a lot about.
               | The "instinct" is to capture everything at the highest
               | possible resolution / sampling rate / etc. and store in a
               | totally lossless format. But that is also the most
               | expensive scenario and if it's not strictly required,
                | then why do it? We know human hearing at least can work
                | with relatively crappy audio. Look at the POTS phone
                | system and its 8 kHz sampling rate, for example. Does
                | that analogy hold for video? Good question.
               | 
                |  _As long as it's consistent in the type/amount of
                | compression, it should be fine. So instead of
                | WAV/FLAC/RAW, you could encode it to something like Opus
                | at 100 kbps, and that would give you 394.2 gigabytes of
                | data for a single year of audio._
               | 
               | Agreed.
               | 
               |  _As for video... it would definitely require a lot of
               | tricks to store on a hobbyist level._
               | 
               | Definitely. One thing that may help with costs in the
               | short-term is that I'm very explicitly not (for now
               | anyway) using a cloud storage service. Data ingestion is
               | to a server I own and physically have in my home. I can
               | get away with this because while the aggregate total
               | amount of data may wind up fairly big over longer periods
               | of time, the rate at which I need to ingest data isn't
               | all that high (there's only one of these devices sending
               | to the server). And I can just keep adding 5TB or 10TB
               | drives as needed. When one fills up, I can unplug it,
               | replace it with another, label and store it, and move on.
               | The big risks here are that I don't really have any
               | redundancy in that scenario, especially if my home burns
               | down or something. But in that case I have bigger
               | problems to worry about anyway!
               | 
               | There are other downsides to this approach, like dealing
               | with the case of needing to access the entire year's
               | worth of data "at once" for analysis or training, but I'm
               | not sure that need will ever even arise.
        
               | sharemywin wrote:
                | There was an article on using latent embeddings for
                | compression. Might be useful.
               | 
               | https://pub.towardsai.net/stable-diffusion-based-image-
               | compr...
        
         | Hendrikto wrote:
         | Pretty sure this is a troll.
         | 
         | The assumption that human eyes can be measured in FPS is, in
         | itself, very questionable. And if it were indeed the case, then
            | it would surely be far in excess of 60fps...
        
           | dr_dshiv wrote:
           | Well, inhibitory alpha waves cycle across the visual field 10
           | times a second. People with faster alpha waves can detect two
           | flashes that people with slower alpha waves see as one flash.
        
           | mindcrime wrote:
           | _The assumption that human eyes can be measured in FPS is, in
           | itself, very questionable._
           | 
           | In the strictest sense, yes. But it seems quite reasonable to
           | think that there is something like an "FPS equivalent" for
           | the human eye. I mean, it's not magic, and physics comes into
           | play at some level. There's a shortest unit of time / amount
           | of change that the eye can resolve. From that you could work
           | out something that is analogous to a frame-rate.
           | 
            |  _And if it were indeed the case, then it would surely be far
            | in excess of 60fps_
           | 
           | Not necessarily. Quite a few people believe that the human
           | eye "FPS equivalent" is somewhere between 30-60 FPS. That's
           | by no means universally accepted and since it's just an
            | analogy to begin with, the whole thing is admittedly a little
            | bit dodgy. But by the same token, it's not immediately
           | obvious that the human "FPS equivalent" would be "far in
           | excess of 60 FPS" either.
        
         | satvikpendem wrote:
          | You are correct. DeepMind released a paper earlier this year
          | showing that data is the primary constraint holding back these
          | models, not their parameter count (i.e. a model with 5 billion
          | parameters is not much better than one with 1 billion, but more
          | data can make both much better) [0].
         | 
         | I will copy paste the main findings from the article here:
         | 
         | - Data, not size, is the currently active constraint on
         | language modeling performance. Current returns to additional
         | data are immense, and current returns to additional model size
         | are miniscule; indeed, most recent landmark models are
         | wastefully big.
         | 
         | - If we can leverage enough data, there is no reason to train
         | ~500B param models, much less 1T or larger models.
         | 
         | - If we _have_ to train models at these large sizes, it will
         | mean we have encountered a barrier to exploitation of data
         | scaling, which would be a great loss relative to what would
         | otherwise be possible.
         | 
         | - The literature is extremely unclear on how much text data is
         | actually available for training. We may be "running out" of
         | general-domain data, but the literature is too vague to know
         | one way or the other.
         | 
         | - The entire available quantity of data in highly specialized
         | domains like code is woefully tiny, compared to the gains that
         | would be possible if much more such data were available.
         | 
         | [0]
         | https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinc...
        
           | ma2rten wrote:
           | This post is about image generation, not language models.
        
             | satvikpendem wrote:
             | I'd imagine the situation is the same for image generation
             | models too.
        
       | gillesjacobs wrote:
        | Good to see open data and open models becoming a thing. I hope
        | this trend continues and that open AI will triumph like open
        | source software did.
        
       | ritwikgupta wrote:
       | This dataset is a massive failure when it comes to ethical
       | research practices. LAION-5B openly indexed copyrighted data that
       | it had no business collecting. They failed to go through an IRB
       | when curating this data. The ethics review for this paper was a
       | joke, where the ethics reviewer raises valid concerns and then
       | discards their review because "if they don't publish it here,
       | they'll publish it somewhere else anyways" [0].
       | 
       | LAION-5B has enabled some really cool technologies and a lot of
       | promising startups. This work should have been carried out
       | responsibly.
       | 
       | [0] https://openreview.net/forum?id=M3Y74vmsMcY
        
         | O__________O wrote:
         | What specifically are you claiming required a review board?
         | 
          | A quick review of their site and the paper turns up nothing
          | that would commonly be a topic meriting such a review.
         | 
         | Related FAQs:
         | 
         | - https://laion.ai/faq/
        
           | ritwikgupta wrote:
           | LAION-5B includes images of humans without their explicit
           | consent. Images of people generally involve IRB/HSR.
           | Additionally, almost any IRB will mention that if you're
           | using data _derived_ from humans, you must go through IRB.
           | 
           | LAION can say all they want that they're not including images
           | in their dataset. They include a script to download those
           | URLs into images on disk. By being a company that's not bound
           | to decades of university ethics regulations, they are
           | seemingly allowed to skirt what you learn on your first day
           | as a researcher in academia. It may be legal, but it sure is
           | not ethical.
        
             | O__________O wrote:
                | Please provide a link to another academic publication
                | agreeing with your claim that linking to online content
                | without the subject's explicit approval is unethical.
        
               | nl wrote:
                | That's a more specific claim, and one the OP didn't make.
        
         | OctopusLupid wrote:
         | > LAION-5B openly indexed copyrighted data that it had no
         | business collecting.
         | 
         | This seems to be legal in many countries (from what I know, the
         | UK, EU, Japan and Singapore) due to the TDM (Text and Data
         | Mining) exception, especially for researchers.
        
         | Blackthorn wrote:
         | > LAION-5B openly indexed copyrighted data that it had no
         | business collecting.
         | 
         | Seems like an open and shut fair use claim, web indexing (not
         | even scraping, just indexing) is not uncommon...
        
       | oth001 wrote:
        | Terribly unethical to use unlicensed images. They could have
        | crowdsourced image gathering and labeling instead of stealing
        | images.
        
         | astrange wrote:
         | This is like saying Google Image Search stole your image.
         | 
         | (In fact it's exactly the same; it's allowed under the same
         | laws and it respects robots.txt.)
        
           | oth001 wrote:
           | Does Google.com allow anybody to instantly mimic an artist's
           | style? Obviously AI laws haven't been put in place yet - it
           | doesn't mean it's not unethical.
        
             | astrange wrote:
              | It's always been possible to imitate an art style.
              | Nevertheless, art styles have never gotten IP protection -
              | they're more like trade secrets.
             | 
             | What's notable is "AI users are trying to copy an artist"
             | != "AI has learned from an artist" != "AI has seen the
             | artist's images in the first place". The most popular
              | supposedly stolen-from artist, Greg Rutkowski, is not in
              | Stable Diffusion's training images; even though users are
              | actively trying to copy him, it's a coincidence that it
              | appears to work. Is that unethical?
             | 
             | Also, AI laws (text and data mining exemptions) /have/ been
             | put in place - to make this explicitly legal!
        
       | satvikpendem wrote:
        | If you've ever actually looked into the Laion datasets, you'll
        | notice that they are hot garbage, in that the captions often
        | don't even correlate to what the image is about, and the images
        | are often low quality, badly cropped, and so on.
        | 
        | There are other datasets being developed that use high quality
        | images manually labeled by humans, such as by Unstable
        | Diffusion, which is running a Kickstarter right now [0]. They say
        | they will be able to get a much higher quality model due to such
        | high quality images and captioning, so we'll see. They also want
        | to make the model and code entirely open source, rather than use
        | the license that Stable Diffusion has, which is not open source
        | (it has many restrictions, enforceable or not, on the images
        | made).
       | 
       | [0]
       | https://www.kickstarter.com/projects/unstablediffusion/unsta...
        
         | infinityio wrote:
         | Obviously there would be limits as to how much could be
         | manually reviewed by hand (if 1000 people reviewed 1000 images
         | each, only 0.02% of the images would be reviewed assuming no
          | overlap was required), but I wonder if there would be any
          | benefit to attempting to crowdsource captions for the worst
          | available images in the dataset.
        
         | alsodumb wrote:
          | If you ever actually look into the Unstable Diffusion
          | Kickstarter, you'll notice that they're not actually claiming
          | they'll manually label a dataset the size of Laion-5B - that's
          | a much bigger task than you seem to think it is.
         | 
          | Even if a million people are labeling images, without any
          | overlap, 5 billion images would mean each of them has to label
          | 5000 images.
          | 
          | What the Unstable Diffusion folks seem to be doing is using a
          | few thousand labeled images to train a caption generation
          | model, and then using it to create a huge multimodal dataset
          | with text and high quality images.
        
           | satvikpendem wrote:
            | > _If you ever actually look into the Unstable Diffusion
            | Kickstarter, you'll notice that they're not actually claiming
            | they'll manually label a dataset the size of Laion-5B -
            | that's a much bigger task than you seem to think it is._
           | 
           | I never claimed this either.
        
         | fpgaminer wrote:
         | DALL-E, Stable Diffusion, GPT-3, Whisper, CLIP, etc are all
         | trained on "hot garbage" and all of them are SOTA. Whisper is a
         | great example, as it shows that this broader use of imperfect
         | training data helps to make the models more robust and general
         | than their "perfectly" trained counterparts. The trick behind
          | all of these is to build mechanisms on smaller-scale, human-
          | labelled data that can then be used to filter and label the
          | broader dataset. Or use training methods that are more robust
          | to imperfect data, like contrastive learning a la CLIP.
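          | 
          | A minimal sketch of that contrastive idea (not CLIP's actual
          | code; names and the temperature value are just assumptions):
          | 
          |   import torch
          |   import torch.nn.functional as F
          | 
          |   def clip_style_loss(img_emb, txt_emb, temperature=0.07):
          |       # Normalize so dot products are cosine similarities.
          |       img_emb = F.normalize(img_emb, dim=-1)
          |       txt_emb = F.normalize(txt_emb, dim=-1)
          |       logits = img_emb @ txt_emb.t() / temperature  # (N, N)
          |       labels = torch.arange(len(img_emb))  # matches on diagonal
          |       # Symmetric cross-entropy; noisy pairs only need to be
          |       # right relative to the rest of the batch.
          |       return (F.cross_entropy(logits, labels) +
          |               F.cross_entropy(logits.t(), labels)) / 2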
        
         | GaggiX wrote:
          | It is not possible to manually label hundreds of millions of
          | images to train a model on them; CFG (classifier-free
          | guidance) exists to deal with this problem. Also, Unstable
          | Diffusion will just finetune a Stable Diffusion model, so you
          | cannot simply change the licence to whatever you want.
        
           | operator-name wrote:
            | Boorus [0] contain millions of images, manually labeled to a
            | pretty high quality. Notably, diffusion models trained on
            | booru datasets have had good success.
           | 
           | This is not the only example of well curated image-tag pairs,
           | especially in artistic circles. It's just that most of them
           | are not CC.
           | 
           | [0]: https://en.wiktionary.org/wiki/booru
        
             | GaggiX wrote:
              | Boorus use tags instead of captions, so a model trained on
              | them is really limited; moreover, Danbooru has only 5
              | million images, while other boorus such as Gelbooru and
              | Sankaku have lower quality.
        
               | operator-name wrote:
                | Tags are limited how, exactly? Prompt crafting becomes a
                | case of selecting the relevant tags, and the embedding
                | space will still capture the dataset.
                | 
                | Danbooru is only one such example of well-curated
                | tagging, and if we ignore copyright there are far more
                | examples. These examples just serve as evidence that
                | refining poor labeling is not outside the realm of
                | possibility, as you suggested it was.
        
               | GaggiX wrote:
                | A tag-based system would completely lack any kind of
                | contextual information, and it would not be possible to
                | create any relationships between words; natural language
                | is much more powerful.
                | 
                | An example: an image is tagged kanna_kamui, kimono and
                | torhu_(maiddragon). Who has the kimono? Kanna, Torhu, or
                | both? It cannot be known from the tags alone, but with
                | natural language it is possible to describe who is
                | wearing what.
        
           | devmor wrote:
           | >It is not possible to manually label hundreds of millions of
           | images to train a model on them
           | 
           | Citation, please?
           | 
           | I think you mean "the developers of this technology do not
           | want to pay to have hundreds of millions of images labeled".
        
             | GaggiX wrote:
              | It is not believable that someone would pay humans to label
              | 400mln or 5bln images/samples to train a model on them, but
              | I guess if your argument is "everything is possible" then
              | gotcha.
        
               | satvikpendem wrote:
                | If it's done in a reCAPTCHA-like way, it can be done
                | fairly efficiently and for cheap. In fact Scale AI does
                | just this: they offer manual labor operations, such as
                | captioning images, as an API. Here's their product for
                | image labeling: https://scale.com/rapid.
                | 
                | Unstable Diffusion is also doing their captioning the way
                | I mentioned, with groups of volunteers as well as hired
                | individuals.
        
               | GaggiX wrote:
                | Scale seems to do, for example, image classification but
                | not captioning, as it would be hard to compare the
                | results across people to verify the quality (when you
                | have a discrete number of classes it is really
                | straightforward). Also, can you point to where you read
                | about the Unstable Diffusion plan for manually labeling
                | image datasets? I want to dig deeper.
        
               | satvikpendem wrote:
               | From their Reddit post about this: https://old.reddit.com
               | /r/StableDiffusion/comments/zhg18s/uns...
        
               | GaggiX wrote:
               | > We are releasing Unstable PhotoReal v0.5 trained on
               | thousands of tirelessly hand-captioned images
               | 
                | They seem to have created a much smaller dataset than
                | LAION's; it would not work to train a generative model on
                | such a small number of images (obviously the images here
                | are not restricted to a single domain).
        
               | devmor wrote:
               | You seem to be confusing "possibility" with your personal
               | opinion on what you think would be done by others.
        
               | GaggiX wrote:
                | As a human being I know human limitations; explicitly
                | labeling 400mln/5bln images for a particular task seems
                | absurd to me, but if you think it is realistically
                | possible, perhaps you can give an example.
        
             | whiplash451 wrote:
              | The LAION dataset was designed for the broader community in
              | the first place, so clearly the premise is that they don't
              | have millions to throw at the problem.
        
         | rom1504 wrote:
         | Looks like you missed the whole point of this dataset.
         | 
          | The idea we proved is that you can get a dataset with decent
          | captions and images (that do match, yes; you can see for
          | yourself at https://rom1504.github.io/clip-retrieval/ ) that
          | can be used to train well-performing models (e.g. OpenCLIP and
          | Stable Diffusion) while using only automated filtering of a
          | noisy source (Common Crawl).
          | 
          | We further proved that idea by using aesthetic prediction, NSFW
          | and watermark tags to select the best pictures.
          | 
          | Is it possible to write captions manually? Sure, but that
          | doesn't scale much and won't make it possible to train general
          | models.
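          | 
          | In spirit the filtering is just a set of thresholds applied to
          | each (url, caption) candidate; a simplified sketch, not the
          | actual pipeline code, with illustrative threshold values:
          | 
          |   candidates = [
          |       {"url": "https://example.com/cat.jpg",
          |        "caption": "a cat on a sofa",
          |        "clip_sim": 0.31, "aesthetic": 5.6,
          |        "p_watermark": 0.05, "p_nsfw": 0.01},
          |   ]
          | 
          |   def keep(c):
          |       return (c["clip_sim"] >= 0.28      # image-text similarity
          |               and c["aesthetic"] >= 5.0  # aesthetic score
          |               and c["p_watermark"] < 0.8 # watermark probability
          |               and c["p_nsfw"] < 0.5)     # NSFW probability
          | 
          |   filtered = [c for c in candidates if keep(c)]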
        
           | satvikpendem wrote:
            | > _Is it possible to write captions manually? Sure, but that
            | doesn't scale much and won't make it possible to train
            | general models._
            | 
            | Maybe; I don't think so, however, based on the above comments
            | by Unstable Diffusion. It seems like people are
            | underestimating the power of high quality data and just
            | throwing the kitchen sink at models. Perhaps a set of good
            | quality data can indeed outperform Laion-style datasets.
            | 
            | It's like the YC saying about doing things that don't scale:
            | perhaps with the high quality dataset we can train better
            | models than CLIP, and in turn use those to caption the rest
            | of the images, only now the caption model is much better than
            | previous ones.
        
             | GaggiX wrote:
              | The new Unstable Diffusion model will be one of the several
              | SD finetuned models out there. These models usually have
              | much higher quality (but smaller image diversity) because
              | they take the coherency of SD and constrain the
              | distribution to a small high quality portion. This means
              | you could train a model on a smaller high quality dataset
              | from scratch, but you would not, for example, have the same
              | level of coherency; that can only be obtained with an
              | incredible amount of images, and they don't need to be
              | "high quality": a man will almost always have 2 arms, 2
              | legs etc... regardless of the quality of the images. After
              | the model has fit the entire distribution you can finetune
              | it to produce high quality and coherent images with a small
              | dataset. That's why Unstable Diffusion will finetune an SD
              | checkpoint, and also why researchers use these big datasets
              | like LAION-400M/5B.
        
               | cma wrote:
               | > and they don't need to be "high quality", a man will
               | almost always have 2 arms, 2 legs etc...
               | 
               | At the next generation it feels like the training set
               | will be inbreeding on the flood of stable diffusion
               | images with 7 mangled fingers, heads coming out of legs,
               | etc.
        
             | version_five wrote:
             | I'd guess there is a bias-variance tradeoff. If you just
             | want to make a certain kind of image, no doubt a manually
             | labeled and curated dataset can be better. If you want a
             | generic generative model that has learned a wide variety of
             | stuff, scale wins.
             | 
             | I can see LAION playing a similar role to imagenet. The
             | main application of imagenet isn't directly training image
              | recognition models. It's pretraining on diverse data so that
             | a "big" (big in 2016) model can be fine tuned easily on a
             | small dataset, after learning to be a good feature
             | extractor. From that perspective, the label quality (and
             | concerns about bias and whatnot) are almost irrelevant
        
           | napier wrote:
            | It's possible to do a much better job with automation too.
            | Context-aware cropping, accurate aspect ratios, quality
            | filtering by various metrics... all solved problems long ago,
            | but absent from Laion-5B for some reason. Perhaps it would be
            | a good idea to collaborate more closely with image experts
            | for the next round.
        
       | [deleted]
        
       | abeppu wrote:
       | So the core of the dataset is image _URLs_ and text captions.
       | 
       | 1. From a reproducibility perspective, isn't this kinda brittle
       | in that even without malicious intent, some of those images will
       | no longer be available when other researchers attempt to download
       | them?
       | 
        | 2. From a resilience perspective, if your site hosts some of the
        | images in the dataset, could you swap in another image with the
        | correct dimensions? Could you poison or skew the model in any
        | interesting ways?
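        | 
        | For (1), a hedged sketch of what re-downloading with
        | verification could look like; the "sha256" field is a
        | hypothetical addition, not something the released metadata is
        | guaranteed to carry:
        | 
        |   import hashlib, requests
        | 
        |   def fetch_and_verify(entry):
        |       # entry: dict with "url", "caption", recorded "sha256"
        |       try:
        |           resp = requests.get(entry["url"], timeout=10)
        |           resp.raise_for_status()
        |       except requests.RequestException:
        |           return None  # link rot: the image is simply gone
        |       digest = hashlib.sha256(resp.content).hexdigest()
        |       if digest != entry["sha256"]:
        |           return None  # URL resolves but content was swapped
        |       return resp.content, entry["caption"]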
        
         | version_five wrote:
         | Imagenet (arguably the most used image dataset of the last 10
         | years) is the same, it's a list of URLs with full archives of
         | the downloaded images available under some conditions.
        
           | tbalsam wrote:
            | Fair enough, but Imagenet is sort of a nightmare right now. I
            | get that it's a crowd-funded and crowd-sourced effort, but
            | hopefully at some point some brave soul(s) will step up to
            | archive the data as-is in a very reproducible kind of way.
            | :D :))))
        
         | alsodumb wrote:
         | The key is the scale of the dataset. Both the points you
         | mention become irrelevant for a large dataset because
         | 
         | 1) The chance that a significant percentage of the images
         | become unavailable is low. Also, training on such a big dataset
         | means your model generalizes well and is usually robust.
         | 
         | 2) Again, you would need to inject adversarial/malicious images
         | to a significant number of those links in the dataset for it to
         | have actual impact on trained model. Again, unlikely.
        
           | [deleted]
        
           | abeppu wrote:
           | For point 1 ... it depends on the timescale. In the fullness
           | of time, surely a significant portion of images will be
           | unavailable. From the perspective of allowing other
           | researchers to work from the "same" baseline "today", this is
           | likely good enough. In a generation from now, if someone
           | wants to reproduce results from some landmark model trained
           | against this dataset, we'd have problems. In other fields
           | where people publish or share their datasets, would this be
           | considered sufficient?
           | 
           | For point 2, I think it's possible that for some narrow
           | topics, some domains have a significant share of images. I
           | think these can affect the model, which is in part why they
           | give special attention to watermarking. Suppose instead of
           | merely watermarking images, for every image on my large
           | collegiate track and field website I make sure someone is
           | wearing a garment with a visible Nike swoosh. Can I skew the
           | model towards associating Nike with the sport? I think this
           | kind of thing may be achievable for niche areas.
        
         | astrange wrote:
         | Since artists already appear to believe LAION is "stolen
         | content", actually downloading everything wouldn't help the
         | case that it's fine.
        
         | whiplash451 wrote:
         | And from the storing perspective? The full image dataset weighs
         | dozens of PB. How convenient is that to share?
        
       ___________________________________________________________________
       (page generated 2022-12-12 23:00 UTC)