[HN Gopher] MMC4: An open, billion-scale corpus of images interl...
___________________________________________________________________
MMC4: An open, billion-scale corpus of images interleaved with text

Author : tim_sw
Score  : 46 points
Date   : 2023-04-18 19:55 UTC (3 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| ftxbro wrote:
| "We introduce mmc4, a corpus of 585M images interleaved in 43B
| English tokens from the popular c4 dataset."
|
| This training corpus mmc4 is the one used by OpenFlamingo
| (https://laion.ai/blog/open-flamingo/).
|
| Something caught my eye when I read the pdf. They are using GPT-4
| to name their word clusters: "Topic names are generated by GPT-4
| conditioned on the top 20 words for each topic, prompted by a
| request for a short 1-2 word summary."
|
| I feel like LLMs will be used increasingly for this purpose;
| whenever someone has to make a mostly arbitrary decision like the
| kind that would make 'bike shedding arguments', they can delegate
| it to the LLM. The point isn't that the LLM is necessarily better
| or objective, but rather that it's a kind of 'schelling point'
| for agreeing on things that don't matter.
|
| For example I could imagine a super annoying peer reviewer (or
| even co-author) ask why you used the name 'celebrations' as a
| topic name associated to the set of words 'fun, wedding,
| beautiful, christmas, happy, card, birthday, gift, blog, perfect'
| like in the pdf, and why not use some other word. Instead of
| having some meaningless back and forth 'bike shedding' discussion
| with the reviewer over email, you can just say you used GPT-4 and
| move on to more important things.
| seydor wrote:
| But then how can their license allow commercial use?
| jmmcd wrote:
| Great! Now the reviewer will say you should've used GPT4.2, the
| version released on 29th August, not the version released on
| 22nd August.
| LawTalkingGuy wrote:
| Lol, that's what a reviewer running on Llama13b would say!
| andyjohnson0 wrote:
| What's the best source for a directory/list of these training data
| corpuses?
___________________________________________________________________
(page generated 2023-04-18 23:00 UTC)