[HN Gopher] MMC4: An open, billion-scale corpus of images interl...
       ___________________________________________________________________
        
       MMC4: An open, billion-scale corpus of images interleaved with text
        
       Author : tim_sw
       Score  : 46 points
       Date   : 2023-04-18 19:55 UTC (3 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ftxbro wrote:
       | "We introduce mmc4, a corpus of 585M images interleaved in 43B
       | English tokens from the popular c4 dataset."
       | 
       | This training corpus mmc4 is the one used by OpenFlamingo
       | (https://laion.ai/blog/open-flamingo/).
       | 
       | Something caught my eye when I read the pdf. They are using GPT-4
       | to name their word clusters: "Topic names are generated by GPT-4
       | conditioned on the top 20 words for each topic, prompted by a
       | request for a short 1-2 word summary."
       | 
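       | A rough sketch of what such a call might look like (not the
       | authors' actual code; this assumes the 2023-era `openai`
       | ChatCompletion API and guesses at the prompt wording):
       | 
       |     # Sketch, not the mmc4 authors' code: name a word cluster
       |     # with GPT-4 from its top words, per the paper's
       |     # description. Assumes OPENAI_API_KEY is set in the
       |     # environment; the exact prompt wording is a guess.
       |     import openai
       | 
       |     def name_topic(top_words):
       |         # Condition GPT-4 on the topic's top words and ask for
       |         # a short 1-2 word summary, as the paper describes.
       |         prompt = (
       |             "Top words for a topic from a topic model:\n"
       |             + ", ".join(top_words)
       |             + "\nGive a short 1-2 word name for this topic."
       |         )
       |         resp = openai.ChatCompletion.create(
       |             model="gpt-4",
       |             messages=[{"role": "user", "content": prompt}],
       |             temperature=0,
       |         )
       |         return resp["choices"][0]["message"]["content"].strip()
       | 
       |     # Example cluster from the pdf; a plausible answer is
       |     # "celebrations".
       |     print(name_topic(["fun", "wedding", "beautiful",
       |                       "christmas", "happy", "card", "birthday",
       |                       "gift", "blog", "perfect"]))
       | 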
       | I feel like LLMs will be used increasingly for this purpose:
       | whenever someone has to make a mostly arbitrary decision, the
       | kind that would otherwise spark 'bike shedding' arguments,
       | they can delegate it to the LLM. The point isn't that the LLM
       | is necessarily better or more objective, but rather that it's
       | a kind of 'Schelling point' for agreeing on things that don't
       | matter.
       | 
       | For example, I could imagine a super annoying peer reviewer
       | (or even a co-author) asking why you used 'celebrations' as
       | the topic name associated with the set of words 'fun, wedding,
       | beautiful, christmas, happy, card, birthday, gift, blog,
       | perfect', as in the pdf, and why not some other word. Instead
       | of having a meaningless back-and-forth 'bike shedding'
       | discussion with the reviewer over email, you can just say you
       | used GPT-4 and move on to more important things.
        
         | seydor wrote:
         | But then how can their license allow commercial use?
        
         | jmmcd wrote:
         | Great! Now the reviewer will say you should've used GPT4.2, the
         | version released on 29th August, not the version released on
         | 22nd August.
        
           | LawTalkingGuy wrote:
           | Lol, that's what a reviewer running on Llama13b would say!
        
       | andyjohnson0 wrote:
       | What's the best source for a directory/list of these training
       | data corpora?
        
       ___________________________________________________________________
       (page generated 2023-04-18 23:00 UTC)