[HN Gopher] No Language Left Behind ___________________________________________________________________ No Language Left Behind Author : pesenti Score : 86 points Date : 2022-07-06 19:52 UTC (3 hours ago) (HTM) web link (ai.facebook.com) (TXT) w3m dump (ai.facebook.com)
| TaupeRanger wrote:
| So they have a system that can translate to languages for which
| there isn't as much data as English, Spanish, etc. Waiting for a
| Twitter thread from a native speaker of one of these "low
| resource languages" to let us know how good the actual
| translations are. Cynically, I'd venture that they hired some
| native speakers to cherry-pick their best translations for the
| story books. But mostly this just seems like a nice bit of PR
| (calling it a "breakthrough", etc.). I can't imagine this is
| going to help anyone who actually speaks a random, e.g.,
| Nilo-Saharan language.
| hello_im_angela wrote:
| If you're curious to try the system yourself, it's actually
| being used to help Wikipedia editors write articles for
| low-resource language Wikipedias:
| https://twitter.com/Wikimedia/status/1544699850960281601
| onurcel wrote:
| in this work we tried to rely not only on automated evaluation
| scores but also on human evaluation, for exactly this reason: we
| wanted a better understanding of how our model actually performs
| and how that correlates with automated scores.
| alexott wrote:
| Twitter may not be representative, imho, because of the short
| texts. Translation there first runs into the problem of reliable
| language detection, and Twitter is quite often wrong there.
| microtherion wrote:
| As a native Swiss German speaker, my native language is not only
| low resource in general, but has the additional difficulty of not
| having a standardized orthography (many native speakers will
| exclusively write in Standard German, and use Swiss German only
| for spoken communication).
| | So you have a language with some economic opportunity (a few
| million speakers in a fairly wealthy country) but no clearly
| defined written interface, and an ambivalent attitude of many
| speakers towards the very idea of writing the language.
| rmbyrro wrote:
| This only makes the problem behind the NLLB project even more
| interesting to solve.
| hello_im_angela wrote:
| sooo real. Many low-resource languages have many different
| natural variants, can be written in multiple scripts, don't
| have as much written standardization, or are mainly oral. As
| part of the creation of our benchmark, FLORES-200, we tried to
| support languages in multiple scripts (if they are naturally
| written that way) and explored translating regional variants
| (such as Moroccan Arabic, not just Arabic).
| | As an aside, the question of how to think about language
| standardization is really complex. We wrote some thoughts in
| Appendix A of our paper:
| https://research.facebook.com/publications/no-language-left-...
| Etheryte wrote:
| I'll believe it when I actually see it. I'm a native speaker of a
| reasonably small language spoken by about a million people, and
| never have I ever seen a good automatic translation for it. The
| only translations that are good are the ones that have been
| manually entered, and those that match the structure of the
| manually entered ones. I think the sentiment is laudable and wish
| godspeed to the people working on this, but for the time being I
| don't see it becoming a reality yet. When Google Translate
| regularly struggles even with big pairs such as
| German-English-German, I have reservations about someone making
| it work for languages where datasets are orders of magnitude
| smaller.
| bobsmooth wrote:
| There's a section where you can try reading translated
| children's books. See if your language is supported and how
| good the translation is.
| hello_im_angela wrote:
| It's an extremely difficult problem indeed.
A lot of people on
| the team speak low-resource languages too (my native language
| as well!), so I definitely resonate with what you're saying. My
| overall feeling is: yeah, it's hard, and after decades we can't
| even do German translation perfectly. But if we don't work on
| it, it's not gonna happen. I really hope that people who are
| excited about technology for more languages can use what we've
| open-sourced.
| azinman2 wrote:
| > But if we don't work on it, it's not gonna happen.
| | That's exactly right. There's too much bias in society that
| if something isn't perfect, then why bother? Nothing is
| perfect, so with that attitude there can be no progress.
| Thank you for doing important work!
| Tabular-Iceberg wrote:
| My concern with this is that in low-resource languages the
| unavoidable biases of the ML models might overpower their own
| organic development.
| | We shrug off all the little quirks of machine-translated text
| because it usually gets the point across, and we recognize them
| as quirks because most of what we read was written by real people
| with no such quirks. But when most of what you read contains those
| quirks, I fear those will quickly become the standard way of
| writing and even speaking in those languages.
| texaslonghorn5 wrote:
| In a worst case you can end up with the Scots Wikipedia
| situation, where some power editor created a bunch of pages
| using an entirely fabricated, overly stereotypical language and
| that influenced what people thought Scots actually was.
| onurcel wrote:
| This is one of the examples we keep in mind, and that's also
| why we can't 100% trust public dataset labels. This motivated
| us to train a Language IDentification system for all the
| languages we wanted to handle in order to build the
| monolingual dataset. More details in the paper ;) Or here, if
| you have questions.
| protomyth wrote:
| I think it will be interesting when it runs into a language (e.g.
| Dakota) where the women and men speak differently. Should be an
| interesting test.
| zen_1 wrote:
| Doesn't seem to be a big issue for Arabic, where verbs are
| gendered (so in the sentence "I am going to the store", the
| verb "to go" will be either masculine or feminine, reflecting
| the speaker's gender).
| nemothekid wrote:
| Arabic is the 5th or 6th most spoken language. I think the
| concern for low-resource languages is that nuances like
| that won't get picked up.
| pesenti wrote:
| Blog post:
| https://ai.facebook.com/blog/nllb-200-high-quality-machine-t...
| | Paper:
| https://research.facebook.com/publications/no-language-left-...
| | Github: https://github.com/facebookresearch/fairseq/tree/nllb/
| robocat wrote:
| Also note comments from _hello_im_angela_ (= Angela Fan) and
| _jw4ng_ (= Jeff Wang). Those are the HN accounts for Angela and
| Jeff from No Language Left Behind.
| albertzeyer wrote:
| Note that very recently Google has done something very similar:
| "Building Machine Translation Systems for the Next Thousand
| Languages": https://arxiv.org/abs/2205.03983
| https://ai.googleblog.com/2022/05/24-new-languages-google-tr...
| | The Facebook paper has some direct comparison to that work.
| jkw wrote:
| Evaluation was important to us, and we really wanted to have a
| benchmark that covers all 200 languages.
| enos_feedler wrote:
| I was two sentences in before I realized the headline wasn't "No
| Luggage Left Behind"
| onurcel wrote:
| this is actually our recurring joke for our team meeting
| offsites!
| mikewarot wrote:
| The analogy I like the most is that they've found the "shape" of
| languages in high dimensions, and if you rotate the shape for
| English the right way, you get an unreasonably good fit for the
| shape of Spanish, and again for all the other languages.
| | We're at a point where it's now possible to determine the shape
| of every language, provided there are enough speakers of the
| language left who are both able and willing to help.
| | <Snark> Once done, Facebook can then commodify their dissent, and
| sell it back to them in their native language. </Snark>
| goldemerald wrote:
| The shape analogy doesn't really apply to modern language
| models. Each word gets its own context-dependent high-
| dimensional point. With everything being context dependent,
| simple transformations like rotations are impossible. A more
| accurate picture is that any concept expressible in language
| now has its own high-dimensional representation, which can then
| be decoded into any other language.
| labrador wrote:
| I'll know AI translators are any good when the United Nations
| starts using them.
| | _" Skills required: United Nations translators are required to
| have a perfect command of their main language and an excellent
| knowledge of, in most cases, two other official languages"_
| | https://www.un.org/dgacm/en/content/translation
| kwhitefoot wrote:
| What is a "low resource language"?
| pesenti wrote:
| https://datascience.stackexchange.com/questions/62868/high-l...
| jw4ng wrote:
| hey there, I work on this project. We categorize a language as
| low-resource if there are fewer than 1M publicly available,
| de-duplicated bitext samples.
| | also see section 3, table 1 in the paper:
| https://research.facebook.com/publications/no-language-left-...
| maestrae wrote:
| hey, this sounds silly, but I can't seem to find a link to all
| the languages covered among the 200. I've looked at the website
| and the blog post and neither has a readily available link.
| Seems like a major oversight. There is of course a drop-down in
| both, but it lists far fewer than 200 languages. I'm
| particularly interested in a list of the 55 African languages,
| for example.
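[Editor's note: jw4ng's low-resource criterion above is concrete enough to sketch in code. A minimal, hypothetical illustration (not the project's actual data tooling; the corpus and helper names are invented for illustration) of de-duplicating bitext pairs and applying the <1M threshold:]

```python
# Hypothetical sketch of the low-resource criterion described above:
# fewer than 1M publicly available, de-duplicated bitext samples.
# A "bitext sample" is a (source, target) sentence pair.

LOW_RESOURCE_THRESHOLD = 1_000_000

def dedup_bitext(corpus):
    """Keep one copy of each (source, target) sentence pair."""
    return {(src.strip(), tgt.strip()) for src, tgt in corpus}

def is_low_resource(corpus):
    """True if the de-duplicated pair count is below the threshold."""
    return len(dedup_bitext(corpus)) < LOW_RESOURCE_THRESHOLD

corpus = [
    ("Hello.", "Bonjour."),
    ("Hello.", "Bonjour."),    # exact duplicate; removed by dedup
    ("Thank you.", "Merci."),
]
print(len(dedup_bitext(corpus)))  # 2 unique pairs
print(is_low_resource(corpus))    # True
```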
| hello_im_angela wrote:
| We have a full list here (copy-pastable):
| https://github.com/facebookresearch/flores/tree/main/flores2...
| and Table 1 of our paper
| (https://research.facebook.com/publications/no-language-left-...)
| has a complete list as well.
| goodside wrote:
| Nice to see Esperanto made the cut -- the only artificial
| language to do so, AFAICT.
| hello_im_angela wrote:
| ha yes, that's correct. If you have thoughts on specific
| constructed languages where having translation would
| really help people, let us know!
| maestrae wrote:
| thank you!
| protomyth wrote:
| Looking at the list, I see a lack of Native American
| languages. Did anyone try to contact the tribes during this?
| hello_im_angela wrote:
| We interviewed speakers of low-resource languages from all
| over the world to understand the human need for this kind
| of technology --- what do people actually want, how would
| they use it, and what's the quality they would find useful?
| Many low-resource languages lack data online but are
| spoken by millions. However, many indigenous languages are
| spoken by smaller numbers of people, and we are definitely
| interested in partnering with local communities to
| co-develop technology. We have been actively investigating
| these collaborations but don't have much to share yet.
| vjerancrnjak wrote:
| What are the hardware requirements to run this?
| | I see the mixture model is ~300 GB and was trained on 256 GPUs.
| | I assume distilled versions can easily be run on one GPU.
| hello_im_angela wrote:
| We release several smaller models as well:
| https://github.com/facebookresearch/fairseq/tree/nllb/exampl...
| that are 1.3B and 615M parameters. These are usable on smaller
| GPUs. To create these smaller models while retaining good
| performance, we use knowledge distillation. If you're curious
| to learn more, we describe the process and results in Section
| 8.6 of our paper:
| https://research.facebook.com/publications/no-language-left-...
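[Editor's note: the knowledge distillation mentioned above can be illustrated in its generic form. This is a hedged toy sketch of the core idea, not the NLLB recipe from Section 8.6 of the paper: a small student model is trained to match a large teacher's temperature-softened output distribution. The logits and temperature below are invented toy values.]

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions.
    Minimizing this pushes the student toward the teacher's outputs."""
    p = softmax(teacher_logits, T)           # teacher's soft targets
    log_q = np.log(softmax(student_logits, T))
    return float((p * (np.log(p) - log_q)).sum(axis=-1).mean())

teacher = np.array([[4.0, 1.0, 0.5]])   # toy teacher logits
aligned = np.array([[3.8, 1.1, 0.4]])   # student close to the teacher
off     = np.array([[0.2, 3.0, 1.0]])   # student far from the teacher

loss_close = distillation_loss(aligned, teacher)
loss_far = distillation_loss(off, teacher)
print(loss_close < loss_far)  # True: the aligned student fits better
```

In real sequence-to-sequence distillation the student is typically trained on the teacher's generated translations rather than per-token logits alone, but the objective above is the textbook starting point.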
| jkw wrote:
| Hey all, I work on this project. The full list of languages can
| be found here:
| https://github.com/facebookresearch/flores/tree/main/flores2...
| | As well as in the research paper:
| https://research.facebook.com/publications/no-language-left-...
| jw4ng wrote:
| Jeff Wang here with my fellow Meta AI colleague Angela Fan from
| No Language Left Behind; we're seeing the comments flowing
| through. If you want to ask us anything, go for it!
| dangom wrote:
| What is the greatest insight you gained and could share with
| non-experts from working on this project?
| jw4ng wrote:
| I gained a deeper understanding of what it truly means to be
| inclusive. Every language is unique just like everybody, and
| making sure content works for all and including as many
| people as possible is really, really hard, but through this
| project I'm hopeful we are taking it one step further.
| Jabbles wrote:
| > Every language is unique just like everybody
| | TBH it just sounds like you've redefined the word "unique".
| pagekicker wrote:
| Hi, I'm putting together an online event called 31 Days of AI
| for Book-Lovers to coincide with US National Book Month,
| October 2022. I was struck by the specific call-out to
| translating literature on your demo page and would like to
| feature a specifically book-related application of NLLB on one
| of the 'anchor days'. Can someone work with me on this?
| shuraih wrote:
| Hey Jeff, I'm a native speaker of Dhivehi -- the language
| spoken by the people of the Maldives. Since I couldn't find a
| full list of supported languages, I was wondering if Dhivehi
| is / would be integrated.
| jkw wrote:
| Dhivehi is currently not supported, unfortunately. We view
| this as a starting point and are committed to expanding to
| many other languages, in the spirit of our project name.
| | Full list of currently supported languages can be found here:
| https://github.com/facebookresearch/flores/tree/main/flores2...
| jefflombardjr wrote:
| Gangi ther vel! ["Good luck!" in Icelandic]
| pesenti wrote:
| Are all the 200x200 translations going directly, or is English
| (or another language) used as an intermediate for some of them?
| jw4ng wrote:
| All translation directions are direct from language X to
| language Y, with no intermediary. We evaluate the quality
| of 40,602 different translation directions using
| FLORES-200. 2,440 directions have supervised training data
| created through our data effort, and the remaining 38,162 are
| zero-shot.
| btheshoe wrote:
| I'm not entirely sure why low-resource languages are seen as such
| a high priority for AI research. It seems that by definition
| there's little payoff to solving translation for them.
| goodside wrote:
| "Low-resource language" isn't just a euphemism for "language
| almost nobody speaks". There are many languages that are widely
| spoken but nonetheless hard to obtain training data for.
| Getting something like Wikipedia going for a minority language
| can be a difficult chicken-and-egg problem, because users will
| use English for its completeness/recency despite their limited
| fluency, and the native-language Wikipedia remains neglected.
| So you can end up in a situation where users use one language
| for social media and another for news/research, and Facebook is
| in a unique position to care about the former.
| quink wrote:
| The examples given are, with native-speaker numbers, Assamese
| (15 million), Catalan (4 million) and Kinyarwanda (10 million).
| These alone add up to more than an Australia.
| | Furthermore, Facebook considers the internet to consist of
| Facebook and Wikipedia (Zero).
| | I view this as just another extension of their Next Billion
| initiative, an effort to ensure that another billion people are
| monopolised by Facebook.
| | That's the payoff.
| dunefox wrote:
| Small data, big meaning is much more important than big data,
| little meaning. Much closer to real intelligence.
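[Editor's note: the direction counts jw4ng quotes above are internally consistent. The "200" in FLORES-200 is approximate; 40,602 ordered directions corresponds to 202 evaluated languages, since every ordered pair of distinct languages is one direction:]

```python
# Sanity-checking the evaluation figures quoted above.

n_languages = 202
directions = n_languages * (n_languages - 1)  # ordered pairs X -> Y
print(directions)   # 40602, matching the figure quoted above

supervised = 2_440  # directions with supervised training data
zero_shot = directions - supervised
print(zero_shot)    # 38162, matching the quoted zero-shot count
```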
| onurcel wrote:
| hi @btheshoe, I work on the data part of this project. As
| others mentioned, the amount of data available for a language
| is not correlated with the number of speakers of that language,
| which explains the potential impact of focusing on these.
| Jabbles wrote:
| Surely the fact that they did all the high-resource languages
| first and are only now getting round to the less-popular ones
| demonstrates that that is not, in fact, the case?
| tehsauce wrote:
| I think the reason low-resource languages are prioritized is to
| compensate for the fact that AI research normally has a
| tendency to marginalize these languages.
| btheshoe wrote:
| yes, but what principles justify the importance placed on
| low-resource languages?
| froskur wrote:
| Low resource in this context means that there are few
| resources available to train a neural network with, not
| that there are few speakers. Although many low-resource
| languages have relatively few speakers, there are also ones
| with tens of millions of speakers.
| | The reason for the emphasis is in my opinion twofold: 1)
| Allowing these people to use the fancy language technology
| in their own language is good in and of itself. 2) Training
| neural networks on fewer resources is more difficult than
| using more resources, and therefore a fun and interesting
| challenge.
| macintux wrote:
| Plus presumably we learn more from solving harder
| problems, and we prepare for one day needing to translate
| some alien language in a hurry.
| jw4ng wrote:
| We think it's important for AI to truly support everyone in the
| world. A world where AI only serves a subset of the population
| is not ideal. In machine translation, this means supporting as
| many languages as possible at high quality. We also imagine a
| future where anyone will be able to communicate with anyone
| else seamlessly; this also means solving translation for all
| languages.
| daniel-cussen wrote:
| Wouldn't that also entail a bot speaking in any language?
| wilde wrote:
| The point is that there are lots of humans who speak these
| languages and use tech. They just don't use Wikipedia, so
| getting a good translation corpus going was harder.
| gwern wrote:
| And it's both cumulative across all those languages (see
| above), cheap/amortized (if you can do a good multilingual
| NMT for 50 languages, how hard can 50+1 languages be?), and
| many of those languages are likely to grow both in terms of
| sheer population and in GDP. (Think about Southeast Asian or
| African countries like Indonesia or Nigeria.) The question
| isn't why are FB & Google investing so much in powerful
| multilingual models which handle hundreds of languages, but
| why aren't other entities as well?
| ausbah wrote:
| what other entities would really have access to the text
| resources that FB & Google have? outside of a few other large
| companies I can't imagine many
| munificent wrote:
| Cynical answer: It's good PR.
| albertzeyer wrote:
| I don't really remember the exact numbers anymore, but covering
| only the top 5 languages will cover maybe 40% of the world
| population, while covering the top 200 languages (many of them
| low resource) will cover maybe 90% of the world population.
| | Some numbers (though you cannot derive such cumulative figures
| exactly from them):
| https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
| | Some more numbers from here:
| https://www.sciencedirect.com/science/article/pii/S016763931...
| | "96% of the world's languages are spoken by only 4% of its
| people."
| | Although this statement is more about the tail of the approx.
| 7,000 languages.
| bvanderveen wrote:
| Great! Facebook no longer have to provide content moderation in
| all the various corners of the world where they could
| accidentally enable the dissemination of misinformation and hate
| speech in minority languages.
They can simply transform it into | English and run it back through the existing moderation tooling! | | Understanding foreign culture is about reading automated | translations of online comments into your native language. It has | nothing to do with putting the effort into learning a language | and understanding the nuances and current events and issues of | the culture it embeds. | | The ESL (English as a single language) speakers over at Facebook | don't even need to understand foreign cultures, because they | already know everyone in the world needs to spend their lives | staring into the Metaverse. So grateful that they are working on | the world's fattest pipeline for exporting Anglophone culture to | every corner of the planet! | LtWorf wrote: | Facebook translations are horrifying for the mainstream languages | already. They go from completely wrong to kinda understandable | but still wrong. | rmbyrro wrote: | Looks like they're investing to get better. The model is also | available and they called for contributions to improve it. ___________________________________________________________________ (page generated 2022-07-06 23:00 UTC)