[HN Gopher] No Language Left Behind
       ___________________________________________________________________
        
       No Language Left Behind
        
       Author : pesenti
       Score  : 86 points
       Date   : 2022-07-06 19:52 UTC (3 hours ago)
        
 (HTM) web link (ai.facebook.com)
 (TXT) w3m dump (ai.facebook.com)
        
       | TaupeRanger wrote:
       | So they have a system that can translate to languages for which
       | there isn't as much data as English, Spanish, etc. Waiting for a
       | Twitter thread from a native speaker of one of these "low
       | resource languages" to let us know how good the actual
       | translations are. Cynically, I'd venture that they hired some
       | native speakers to cherry pick their best translations for the
       | story books. But mostly this just seems like a nice bit of PR
       | (calling it a "breakthrough", etc.). I can't imagine this is
        | going to help anyone who actually speaks some random Nilo-Saharan
        | language, for example.
        
         | hello_im_angela wrote:
         | If you're curious to try the system yourself, it's actually
         | being used to help Wikipedia editors write articles for low-
         | resource language Wikipedias:
         | https://twitter.com/Wikimedia/status/1544699850960281601
        
         | onurcel wrote:
          | In this work we tried to rely not only on automated evaluation
          | scores but also on human evaluation, for exactly this reason: we
          | wanted a better understanding of how our model actually performs
          | and how well human judgments correlate with the automated scores.
        
         | alexott wrote:
          | Twitter may not be representative, IMHO, because of the short
          | texts. Translation there first runs into the problem of reliable
          | language detection, and Twitter quite often gets that wrong.
        
       | microtherion wrote:
       | As a native Swiss German speaker, my native language is not only
       | low resource in general, but has the additional difficulty of not
       | having a standardized orthography (many native speakers will
       | exclusively write in Standard German, and use Swiss German only
       | for spoken communication).
       | 
       | So you have a language with some economic opportunity (a few
       | million speakers in a fairly wealthy country) but no clearly
       | defined written interface, and an ambivalent attitude of many
       | speakers towards the very idea of writing the language.
        
         | rmbyrro wrote:
         | This only makes the problem behind the NLLB project even more
         | interesting to solve
        
         | hello_im_angela wrote:
         | sooo real. Many low-resource languages have many different
         | natural variants, can be written in multiple scripts, don't
         | have as much written standardization, or are mainly oral. As
         | part of the creation of our benchmark, FLORES-200, we tried to
         | support languages in multiple scripts (if they are naturally
         | written like that) and explored translating regional variants
         | (such as Moroccan Arabic, not just Arabic).
         | 
         | As an aside, the question of how to think about language
         | standardization is really complex. We wrote some thoughts in
         | Appendix A of our paper:
         | https://research.facebook.com/publications/no-language-left-...
        
       | Etheryte wrote:
       | I'll believe it when I actually see it. I'm a native of a
       | reasonably small language spoken by about a million people and
       | never have I ever seen a good automatic translation for it. The
       | only translations that are good are the ones that have been
       | manually entered, and those that match the structure of the
       | manually entered ones. I think the sentiment is laudable and wish
       | godspeed to the people working on this, but for the time being I
       | don't see it becoming a reality yet. When Google Translate
       | regularly struggles even with big pairs such as German-English-
       | German, I have reservations about someone making it work for
       | languages where datasets are orders of magnitude smaller.
        
         | bobsmooth wrote:
         | There's a section where you can try reading translated
         | children's books. See if your language is supported and how
         | good the translation is.
        
         | hello_im_angela wrote:
         | It's an extremely difficult problem indeed. A lot of people on
         | the team speak low-resource languages too (my native language
         | as well!), so definitely resonate with what you're saying. My
         | overall feeling is: yeah it's hard, and after decades we can't
         | even do German translation perfectly. But if we don't work on
         | it, it's not gonna happen. I really hope that people who are
         | excited about technology for more languages can use what we've
         | open sourced.
        
           | azinman2 wrote:
           | > But if we don't work on it, it's not gonna happen.
           | 
           | That's exactly right. There's too much bias in society that
           | if something isn't perfect, then why bother? Nothing is
           | perfect, so with that attitude there can be no progress.
           | Thank you for doing important work!
        
       | Tabular-Iceberg wrote:
       | My concern with this is that in low resource languages the
       | unavoidable biases of the ML models might overpower their own
       | organic development.
       | 
       | We shrug off all the little quirks of machine translated text
       | because it usually gets the point across, and we recognize them
       | as quirks because most of what we read was written by real people
        | with no such quirks. But when most of what you read contains those
       | quirks, I fear those will quickly become the standard way of
       | writing and even speaking in those languages.
        
         | texaslonghorn5 wrote:
         | In a worst case you can end up with the Scots Wikipedia
         | situation, where some power editor created a bunch of pages
         | using an entirely fabricated, overly stereotypical language and
         | that influenced what people thought Scots actually was.
        
           | onurcel wrote:
           | This is one of the examples we keep in mind and that's also
           | why we can't 100% trust public dataset labels. This motivated
           | us to train a Language IDentification system for all the
           | languages we wanted to handle in order to build the
           | monolingual dataset. More details in the paper ;) Or here, if
           | you have questions
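            | 
            | If you just want a feel for what off-the-shelf LID looks like,
            | here is a minimal Python sketch using fastText's public
            | lid.176.bin model (an illustration only; it is not the LID
            | system we trained for NLLB, and the model file is a separate
            | download from fasttext.cc):
            | 
            |     # Sketch: off-the-shelf language identification with the
            |     # public fastText LID model (not the NLLB-specific one).
            |     import fasttext
            | 
            |     lid = fasttext.load_model("lid.176.bin")
            |     labels, probs = lid.predict("Ceci est un exemple.", k=1)
            |     print(labels[0], probs[0])  # e.g. __label__fr and its score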
        
         | protomyth wrote:
          | I think it will be interesting when it runs into a language (e.g.
         | Dakota) where the women and men speak differently. Should be an
         | interesting test.
        
           | zen_1 wrote:
           | Doesn't seem to be a big issue for Arabic, where verbs are
           | gendered (so in the sentence "I am going to the store", the
           | verb "to go" will be either masculine or feminine, reflecting
           | the speaker's gender).
        
             | nemothekid wrote:
             | Arabic is the 5th or 6th most spoken language. I think the
             | concern for low resource languages is that nuances like
             | that won't get picked up.
        
       | pesenti wrote:
       | Blog post: https://ai.facebook.com/blog/nllb-200-high-quality-
       | machine-t...
       | 
       | Paper: https://research.facebook.com/publications/no-language-
       | left-...
       | 
       | Github: https://github.com/facebookresearch/fairseq/tree/nllb/
        
         | robocat wrote:
         | Also note comments from _hello_im_angela_ (= Angela Fan) and
         | _jw4ng_ (= Jeff Wang). Those are the HN accounts for Angela and
          | Jeff from No Language Left Behind.
        
       | albertzeyer wrote:
       | Note that very recently Google has done something very similar:
       | "Building Machine Translation Systems for the Next Thousand
       | Languages": https://arxiv.org/abs/2205.03983
       | https://ai.googleblog.com/2022/05/24-new-languages-google-tr...
       | 
       | The Facebook paper has some direct comparison to that work.
        
         | jkw wrote:
         | Evaluation was important to us, and we really wanted to have a
         | benchmark that covers all 200 languages
        
       | enos_feedler wrote:
       | I was two sentences in before I realized the headline wasn't "No
       | Luggage Left Behind"
        
         | onurcel wrote:
         | this is actually our recurring joke for our team meeting
         | offsites!
        
       | mikewarot wrote:
       | The analogy I like the most is that they've found the "shape" of
       | languages in high dimensions, and if you rotate the shape for
        | English the right way, you get an unreasonably good fit for the
        | shape of Spanish, and likewise for all the other languages.
       | 
       | We're at a point where it's now possible to determine the shape
       | of every language, provided there are enough speakers of the
       | language left who are both able and willing to help.
       | 
       | <Snark> Once done, Facebook can then commodify their dissent, and
       | sell it back to them in their native language. </Snark>
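        | 
        | For intuition, here is a tiny numpy sketch of the classical version
        | of that idea: aligning one static embedding space to another with a
        | learned rotation (orthogonal Procrustes). The matrices below are
        | toy data; this illustrates the analogy, not how NLLB itself works.
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     X_en = rng.normal(size=(1000, 300))  # toy "English" vectors
        |     R, _ = np.linalg.qr(rng.normal(size=(300, 300)))
        |     X_es = X_en @ R        # toy "Spanish" = rotated English
        | 
        |     # Best orthogonal map W minimizing ||X_es @ W - X_en||, via the
        |     # SVD of X_es^T X_en (the orthogonal Procrustes solution).
        |     U, _, Vt = np.linalg.svd(X_es.T @ X_en)
        |     W = U @ Vt
        |     print(np.allclose(X_es @ W, X_en, atol=1e-6))  # True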
        
         | goldemerald wrote:
         | The shape analogy doesn't really apply with modern language
         | models. Each word gets its own context dependent high
         | dimensional point. With everything being context dependent,
         | simple transformations like rotations are impossible. A more
        | accurate picture is that any concept expressible in language
         | now has its own high dimensional representation, which can then
         | be decoded into any other language.
        
       | labrador wrote:
       | I'll know AI translators are any good when the United Nations
       | starts using them
       | 
       |  _" Skills required: United Nations translators are required to
       | have a perfect command of their main language and an excellent
       | knowledge of, in most cases, two other official languages"_
       | 
       | https://www.un.org/dgacm/en/content/translation
        
       | kwhitefoot wrote:
       | What is a "low resource language"?
        
         | pesenti wrote:
         | https://datascience.stackexchange.com/questions/62868/high-l...
        
         | jw4ng wrote:
         | hey there, I work on this project. We categorize a language as
         | low-resource if there are fewer than 1M publicly available, de-
         | duplicated bitext samples.
         | 
         | also see section 3, table 1 in the paper:
         | https://research.facebook.com/publications/no-language-left-...
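          | 
          | To make that concrete, here is a toy sketch of the definition
          | (not our actual data pipeline): count unique aligned sentence
          | pairs and flag a language as low-resource below the 1M mark.
          | 
          |     # Hypothetical illustration of the "fewer than 1M
          |     # de-duplicated bitext samples" criterion above.
          |     def is_low_resource(bitext_pairs, threshold=1_000_000):
          |         unique_pairs = set(bitext_pairs)  # exact de-duplication
          |         return len(unique_pairs) < threshold
          | 
          |     pairs = [("Hello.", "Bonjour."), ("Hello.", "Bonjour.")]
          |     print(is_low_resource(pairs))  # True: only 1 unique pair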
        
           | maestrae wrote:
            | hey, this sounds silly but I can't seem to find a list of all
            | 200 languages covered. I've looked at the website and the blog
            | post and neither has a readily available link. Seems like a
            | major oversight. There is of course a drop-down in both, but
            | the languages there number far fewer than 200. I'm particularly
            | interested in a list of the 55 African languages, for example.
        
             | hello_im_angela wrote:
              | We have a full list here (copy-pastable):
              | https://github.com/facebookresearch/flores/tree/main/flores2...
              | and Table 1 of our paper
             | (https://research.facebook.com/publications/no-language-
             | left-...) has a complete list as well.
        
               | goodside wrote:
               | Nice to see Esperanto made the cut -- the only artificial
               | language to do so, AFAICT.
        
               | hello_im_angela wrote:
               | ha yes, that's correct. If you have thoughts on specific
               | constructed languages where having translation would
               | really help people, let us know!
        
               | maestrae wrote:
               | thank you!
        
           | protomyth wrote:
           | Looking at the list, I see a lack of Native American
           | languages. Did anyone try to contact the tribes during this?
        
             | hello_im_angela wrote:
             | We interviewed speakers of low-resource languages from all
             | over the world to understand the human need for this kind
             | of technology --- what do people actually want, how would
             | they use it, and what's the quality they would find useful?
             | Many low-resource languages lack data online, but are
             | spoken by millions. However, many indigenous languages are
             | spoken by smaller numbers of people, and we are definitely
             | interested in partnering with local communities to co-
             | develop technology and have been actively investigating
             | these collaborations but don't have much to share yet.
        
       | vjerancrnjak wrote:
       | What are hardware requirements to run this?
       | 
       | I see the mixture model is ~ 300 GB and was trained on 256 GPUs.
       | 
       | I assume distilled versions can easily be run on one GPU.
        
         | hello_im_angela wrote:
         | We release several smaller models as well:
         | https://github.com/facebookresearch/fairseq/tree/nllb/exampl...
         | that are 1.3B and 615M parameters. These are usable on smaller
         | GPUs. To create these smaller models but retain good
         | performance, we use knowledge distillation. If you're curious
         | to learn more, we describe the process and results in Section
         | 8.6 of our paper:
         | https://research.facebook.com/publications/no-language-left-...
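          | 
          | If you want to try one of them, here is a minimal sketch that
          | assumes the 600M distilled checkpoint can be loaded through
          | Hugging Face transformers (the model name, language codes, and
          | exact interface here are assumptions, and the fairseq release
          | works differently):
          | 
          |     # Sketch: one translation with an (assumed) distilled NLLB
          |     # checkpoint via Hugging Face transformers.
          |     from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
          | 
          |     name = "facebook/nllb-200-distilled-600M"  # assumed id
          |     tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
          |     model = AutoModelForSeq2SeqLM.from_pretrained(name)
          | 
          |     batch = tok("No language left behind.", return_tensors="pt")
          |     kin = tok.convert_tokens_to_ids("kin_Latn")  # Kinyarwanda
          |     out = model.generate(**batch, forced_bos_token_id=kin,
          |                          max_length=64)
          |     print(tok.batch_decode(out, skip_special_tokens=True)[0])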
        
       | jkw wrote:
       | Hey all, I work on this project. Full list of languages can be
       | found here:
       | https://github.com/facebookresearch/flores/tree/main/flores2...
       | 
       | As well as in the research paper:
       | https://research.facebook.com/publications/no-language-left-...
        
       | jw4ng wrote:
       | Jeff Wang here with my fellow Meta AI colleague Angela Fan from
        | No Language Left Behind, seeing the comments flowing through. If
       | you want to ask us anything, go for it!
        
         | dangom wrote:
         | What is the greatest insight you gained and could share with
         | non-experts from working on this project?
        
           | jw4ng wrote:
           | I gained a deeper understanding of what it truly means to be
            | inclusive. Every language is unique just like everybody, and
            | making sure content works for all and including as many
            | people as possible is really, really hard, but through this
            | project I'm hopeful we are taking it one step further.
        
             | Jabbles wrote:
             | > Every language is unique just like everybody
             | 
             | TBH it just sounds like you've redefined the word "unique".
        
         | pagekicker wrote:
         | Hi, I'm putting together an online event called 31 Days of AI
         | for Book-Lovers to coincide with US National Book Month,
         | October 2022. I was struck by the specific call-out to
         | translating literature on your demo page and would like to
         | feature a specifically book-related application of NLLB on one
          | of the 'anchor days'. Can someone work with me on this?
        
         | shuraih wrote:
         | Hey Jeff, I'm a native speaker of Dhivehi -- the language
         | spoken by the people of Maldives. Since I couldn't find a full
         | list of supported languages I was wondering if Dhivehi is /
         | would be integrated.
        
           | jkw wrote:
           | Dhivehi is currently not supported, unfortunately. We view
           | this as a starting point and are committed to expanding to
            | many other languages, in the spirit of our project name.
           | 
           | Full list of currently supported languages can be found here:
            | https://github.com/facebookresearch/flores/tree/main/flores2...
        
         | jefflombardjr wrote:
          | Gangi ther vel! (Icelandic for "good luck!")
        
         | pesenti wrote:
          | Are all the 200x200 translations done directly, or is English
         | (or another language) used as an intermediate for some of them?
        
           | jw4ng wrote:
           | All translation directions are direct from language X to
           | language Y, with no intermediary. We evaluate the quality
           | through 40,602 different translation directions using
           | FLORES-200. 2,440 directions contain supervised training data
           | created through our data effort, and the remaining 38,162 are
           | zero-shot.
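            | 
            | For a quick sanity check on those counts (assuming every
            | directed pair among the evaluated languages is counted): n
            | languages give n * (n - 1) directed pairs, 202 * 201 = 40,602,
            | and 40,602 - 2,440 = 38,162 zero-shot directions.
            | 
            |     n = 202                    # implied by 202 * 201 = 40,602
            |     total = n * (n - 1)        # 40,602 directed pairs
            |     zero_shot = total - 2_440  # 38,162 without supervised data
            |     print(total, zero_shot)    # 40602 38162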
        
       | btheshoe wrote:
       | I'm not entirely sure why low resource languages are seen as such
       | a high priority for AI research. It seems that by definition
       | there's little payoff to solving translation for them.
        
         | goodside wrote:
         | "Low-resource language" isn't just a euphemism for "language
         | almost nobody speaks". There are many languages that are widely
         | spoken but nonetheless are hard to obtain training data for.
         | Getting something like Wikipedia going for a minority language
         | can be a difficult chicken-and-egg problem because users will
         | use English for its completeness/recency, despite their limited
         | fluency, and the native-language Wikipedia remains neglected.
         | So you can end up in a situation where users use one language
         | for social media and another for news/research, and Facebook is
         | in a unique position to care about the former.
        
         | quink wrote:
         | The examples given are, with native speaker numbers, Assamese
         | (15 million), Catalan (4 million) and Kinyarwanda (10 million).
          | These alone add up to more speakers than Australia has people.
         | 
         | Furthermore, Facebook considers the internet to consist of
         | Facebook and Wikipedia (Zero).
         | 
         | I view this as just another extension of their Next Billion
         | initiative, an effort to ensure that another billion people are
         | monopolised by Facebook.
         | 
         | That's the payoff.
        
         | dunefox wrote:
         | Small data, big meaning is much more important than big data,
         | little meaning. Much closer to real intelligence.
        
         | onurcel wrote:
         | hi @btheshoe, I work on this project in the data part. As
         | others mentioned, the amount of data available for a language
          | is not necessarily correlated with the number of speakers of
          | that language, which explains the potential impact of focusing
          | on these languages.
        
         | Jabbles wrote:
         | Surely the fact that they did all the high-resource languages
         | first and are only now getting round to the less-popular ones
         | demonstrates that that is not, in fact, the case?
        
         | tehsauce wrote:
         | I think the reason low resource languages are prioritized is to
         | compensate for the fact that AI research normally has a
         | tendency to marginalize these languages.
        
           | btheshoe wrote:
           | yes, but what principles justify the importance placed on low
           | resource languages?
        
             | froskur wrote:
             | Low resource in this context means that there are few
             | resources available to train a neural network with, not
             | that there are few speakers. Although many low resource
             | languages have relatively few speakers, there are also ones
             | with tens of millions of speakers.
             | 
             | The reason for emphasis is in my opinion twofold: 1)
             | Allowing these people to use the fancy language technology
             | in their own language is good in and of itself. 2) Training
             | neural networks on fewer resources is more difficult than
             | using more resources and therefore a fun and interesting
             | challenge.
        
               | macintux wrote:
               | Plus presumably we learn more from solving harder
               | problems, and we prepare for one day needing to translate
               | some alien language in a hurry.
        
         | jw4ng wrote:
         | We think it's important for AI to truly support everyone in the
         | world. A world where AI only serves a subset of the population
         | is not ideal. In machine translation, this means supporting as
          | many languages as possible at high quality. We also imagine a
         | future where anyone will be able to communicate with anyone
         | else seamlessly; this also means solving translations for all
         | languages.
        
           | daniel-cussen wrote:
           | Wouldn't that also entail a bot speaking in any language?
        
         | wilde wrote:
         | The point is that there are lots of humans who speak these
         | languages and use tech. They just don't use Wikipedia so
         | getting a good translation corpus going was harder.
        
           | gwern wrote:
           | And it's both cumulative across all those languages (see
           | above), cheap/amortized (if you can do a good multilingual
           | NMT for 50 languages, how hard can 50+1 languages be?), and
           | many of those languages are likely to grow both in terms of
            | sheer population and in GDP. (Think about Asian or African
            | countries like Indonesia or Nigeria.) The question
           | isn't why are FB & Google investing so much in powerful
           | multilingual models which handle hundreds of languages, but
           | why aren't other entities as well?
        
             | ausbah wrote:
              | What other entities would really have access to the kind of
              | text resources that FB & Google have? Outside of a few other
              | large companies, I can't imagine many.
        
         | munificent wrote:
         | Cynical answer: It's good PR.
        
         | albertzeyer wrote:
         | I don't really remember the exact numbers anymore, but covering
         | only the top 5 languages will cover maybe 40% of the world
         | population, while covering the top 200 languages (many of them
         | low resource) will cover maybe 90% of the world population.
         | 
          | Some numbers (though you cannot directly derive such cumulative
          | figures from them):
         | https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
         | 
         | Some more numbers from here:
         | https://www.sciencedirect.com/science/article/pii/S016763931...
         | 
         | "96% of the world's languages are spoken by only 4% of its
         | people."
         | 
          | Although this statement is more about the long tail of the
          | approximately 7,000 languages.
        
       | bvanderveen wrote:
       | Great! Facebook no longer have to provide content moderation in
       | all the various corners of the world where they could
       | accidentally enable the dissemination of misinformation and hate
       | speech in minority languages. They can simply transform it into
       | English and run it back through the existing moderation tooling!
       | 
       | Understanding foreign culture is about reading automated
       | translations of online comments into your native language. It has
       | nothing to do with putting the effort into learning a language
       | and understanding the nuances and current events and issues of
       | the culture it embeds.
       | 
       | The ESL (English as a single language) speakers over at Facebook
       | don't even need to understand foreign cultures, because they
       | already know everyone in the world needs to spend their lives
       | staring into the Metaverse. So grateful that they are working on
       | the world's fattest pipeline for exporting Anglophone culture to
       | every corner of the planet!
        
       | LtWorf wrote:
       | Facebook translations are horrifying for the mainstream languages
       | already. They go from completely wrong to kinda understandable
       | but still wrong.
        
         | rmbyrro wrote:
         | Looks like they're investing to get better. The model is also
         | available and they called for contributions to improve it.
        
       ___________________________________________________________________
       (page generated 2022-07-06 23:00 UTC)