[HN Gopher] Show HN: LibreTranslate - Open-source neural machine...
___________________________________________________________________

Show HN: LibreTranslate - Open-source neural machine translation API

Author : pjfin123
Score  : 91 points
Date   : 2021-02-06 18:48 UTC (4 hours ago)

(HTM) web link (libretranslate.com)
(TXT) w3m dump (libretranslate.com)

| fartcannon wrote:
| Good. Now can we finally stop pretending Cantonese doesn't
| exist?

| bmicraft wrote:
| Well, it would be nice if it worked, but it couldn't even
| translate "merry christmas" to German; it just left it as is.
| Apparently it needs the C to be capitalized...

| dheera wrote:
| On a positive note, I think it's great that we're seeing efforts
| in this direction.
|
| Fixing capitalization and spelling is a fairly easy thing to do:
| just put a spell-checker before the input. Maybe that would be a
| good pull request.

| pjfin123 wrote:
| There's more in-depth discussion of this issue here:
| https://github.com/uav4geo/LibreTranslate/issues/20
|
| In some cases using all lowercase can help avoid this risk if
| capitalization isn't important.

| jarym wrote:
| Interesting. If I do English to French on the following:
|
| _Hello Sarah, what time is it?_
|
| it translates to
|
| _Bonjour Sarah, quelle heure est-il ?_
|
| Now if I change the input to
|
| _Hello Sara, what time is it?_
|
| it instead translates to:
|
| _Quelle heure est-il ?_
|
| Any idea why the one-character difference in the name affects
| the translation in this way?

| pjfin123 wrote:
| The process for translation is to "tokenize" a stream of
| characters into "tokens" like this:
|
| "Hello Sarah, what time is it?"
|
| <Hello><_Sarah><,><_what><_time><_is><_it><?>
|
| Then the tokens are used for inference with a Transformer
| network. There are no guarantees that the output will be
| consistent under even small changes to the input.
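The tokenization step described above can be illustrated with a toy greedy longest-match subword tokenizer. This is only a sketch with a hand-picked vocabulary; Argos Translate uses a learned SentencePiece-style vocabulary, and the "_" word-boundary marker here simply mirrors the <_Sarah> notation in the comment.

```python
# Toy greedy subword tokenizer (illustrative only; real systems
# learn the vocabulary from data). "_" marks a preceding space,
# as in the <_Sarah> notation above.

VOCAB = {"Hello", "_Sarah", "_Sa", "ra", ",",
         "_what", "_time", "_is", "_it", "?"}

def tokenize(text, vocab=VOCAB):
    s = text.replace(" ", "_")
    tokens = []
    i = 0
    while i < len(s):
        # Try the longest possible match first; fall back to a
        # single character (an <unk>-style escape) if nothing fits.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab:
                tokens.append(s[i:j])
                i = j
                break
        else:
            tokens.append(s[i])
            i += 1
    return tokens

print(tokenize("Hello Sarah, what time is it?"))
# -> ['Hello', '_Sarah', ',', '_what', '_time', '_is', '_it', '?']
print(tokenize("Hello Sara, what time is it?"))
# -> ['Hello', '_Sa', 'ra', ',', '_what', '_time', '_is', '_it', '?']
```

With this vocabulary, "Sarah" survives as a single token while "Sara" splits into two, which is exactly the kind of representational difference that can steer the Transformer toward a different output.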
| The network, based on the data it was trained on (and luck), has
| slightly different connections for the <_Sara> token than for
| the <_Sarah> token, leading to different output.
|
| Here's a video of some Linux YouTubers reviewing Argos Translate
| (https://github.com/argosopentech/argos-translate), the
| underlying translation library, and getting unexpected outputs:
| https://www.youtube.com/watch?v=geMs9dxl1N8

| tejtm wrote:
| Pigeonhole principle?
|
| More verb tenses in French than in English means ambiguity in
| where things may come from and go to.

| yorwba wrote:
| Hmm... "Let's see how well it works." seems to be handled
| correctly when translating from English into any language
| except Chinese, where the apostrophe is turned into <unk> and
| the sentence is otherwise left untranslated.
|
| Does that mean there's a different model for each language pair?

| pjfin123 wrote:
| There are different models for each language pair. Currently
| there are only pre-trained models to and from English, and
| other language pairs "pivot" through English.
|
| ex:
|
| es -> en -> fr
|
| Chinese is the weakest language pair currently, but I'm working
| on improving it:
| https://github.com/argosopentech/argos-translate/issues/17

| yorwba wrote:
| Thanks for the explanation. Pivoting through English isn't
| ideal, but I'm just glad someone is working on this at all.
|
| Thinking about it a bit more, it's a bit weird that the failure
| mode of a weak model would be to regurgitate the input
| unchanged. I'd rather have expected random Chinese gibberish in
| that case. Doesn't that mean the model has seen at least a few
| cases where English sentences were left untranslated in the
| training data?
|
| I wanted to download the training data to check, but the
| instructions here
| https://github.com/argosopentech/onmt-models#download-data say
| to use OPUS-Wikipedia, which has no en-zh pairs, so the Chinese
| data must be from some other source.
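The pivoting scheme described above (es -> en -> fr) amounts to chaining two per-pair models. A minimal sketch, where translate_pair stands in for running one trained model; the function names and toy dictionaries here are illustrative, not the real Argos Translate API:

```python
# Sketch of pivot translation through English. translate_pair is a
# stand-in for per-pair model inference; a real implementation
# would load the es->en or en->fr model and run the Transformer.

def translate_pair(text, source, target):
    # Toy "models": lookup tables instead of neural inference.
    fake_models = {
        ("es", "en"): {"hola": "hello"},
        ("en", "fr"): {"hello": "bonjour"},
    }
    return fake_models[(source, target)].get(text, text)

def translate(text, source, target, pivot="en"):
    # If one side is the pivot language, a direct model exists;
    # otherwise translate source -> pivot, then pivot -> target.
    if source == pivot or target == pivot:
        return translate_pair(text, source, target)
    intermediate = translate_pair(text, source, pivot)
    return translate_pair(intermediate, pivot, target)

print(translate("hola", "es", "fr"))  # -> bonjour (via "hello")
```

Note that errors compound: any information lost in the source -> pivot step cannot be recovered by the pivot -> target model, which is one reason direct or multilingual models give better quality.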
| pjfin123 wrote:
| Pivoting through English isn't inherent to Argos Translate; you
| could train a French-German model or whatever you want. I've
| just been focusing on training models to add new languages. The
| ideal strategy is to have models that know multiple languages.
|
| Quoting a previous HN comment:
|
| I think cloud translation is still pretty valuable in a lot of
| cases, since the model for one single-direction translation is
| ~100MB. In addition to offering more language options without a
| large download, cloud translations let you use more specialized
| models, for example French to Spanish. I just have a model to
| and from English for each language, and any other translations
| have to "pivot" through English. For cloud translations you can
| also use one model with multiple input and output languages,
| which gives you better-quality translation between languages
| that don't have as much data available and lets you support
| direct translation between a large number of languages. Here's
| a talk where Google explains how they do this for Google
| Translate: https://youtu.be/nR74lBO5M3s?t=1682. You could do
| this locally, but it would have its own set of challenges in
| getting the right model for the languages you want to
| translate.
|
| > Thinking about it a bit more, it's a bit weird that the
| > failure mode of a weak model would be to regurgitate the
| > input unchanged. I'd rather have expected random Chinese
| > gibberish in that case. Doesn't that mean the model has seen
| > at least a few cases where English sentences were left
| > untranslated in the training data?
|
| This was added last week; it's just not live on
| libretranslate.com yet:
|
| https://github.com/uav4geo/LibreTranslate/issues/33
|
| The training scripts are just an example for English-Spanish;
| OPUS (http://opus.nlpl.eu/) has data for English-Chinese.

| otagekki wrote:
| I tried translating a test sentence and got a rate-limit
| related error...
| Would be glad to see this for Malagasy, though.

| yorwba wrote:
| Data availability is going to be a problem, I think. Checking
| Malagasy Wikipedia (https://mg.wikipedia.org), there are only
| 93k articles. That's even less than Latin Wikipedia
| (https://la.wikipedia.org) at 134k. And much of the text in
| these articles probably isn't a direct translation of an
| article in another language, so the amount usable for
| parallel-text mining is going to be very small.

| yamrzou wrote:
| Well done. The UI is nice and easy to use. The results looked
| good for the few sentences I tried (Arabic <--> English).
|
| May I know which datasets were used to train the models?

| pjfin123 wrote:
| The OPUS parallel corpus: http://opus.nlpl.eu/
|
| It's really great; they have a large amount of data and
| organize it to make it easy to access.

| Mizza wrote:
| Very glad to see something like this; it was on my
| high-priority Free software needs list.
|
| I'd very much like it if it could be used programmatically, and
| not just via API - C/Python/Rust bindings, etc. I'd like to
| build some automatically translating forum software with it.

| rahimnathwani wrote:
| It's based on argos-translate, which has Python bindings:
|
| https://github.com/argosopentech/argos-translate

| pjfin123 wrote:
| And a PyQt native desktop app.

| drusepth wrote:
| OT: I'd definitely be interested in seeing the rest of your
| high-priority Free software needs list!
___________________________________________________________________
(page generated 2021-02-06 23:00 UTC)