[HN Gopher] Show HN: LibreTranslate - Open-source neural machine...
       ___________________________________________________________________
        
       Show HN: LibreTranslate - Open-source neural machine translation
       API
        
       Author : pjfin123
       Score  : 91 points
       Date   : 2021-02-06 18:48 UTC (4 hours ago)
        
 (HTM) web link (libretranslate.com)
 (TXT) w3m dump (libretranslate.com)
        
       | fartcannon wrote:
       | Good. Now can we finally stop pretending Cantonese doesn't exist?
        
       | bmicraft wrote:
        | Well, it would be nice if it worked, but it couldn't even
        | translate "merry christmas" into German; it just left it as is.
        | Apparently it needs the C to be capitalized ...
        
         | dheera wrote:
         | On a positive note I think it's great that we're seeing efforts
         | in this direction.
         | 
          | Fixing capitalization and spelling is a fairly easy thing to
          | do: just run the input through a spell-checker first. Maybe
          | that would be a good pull request.
        
           | pjfin123 wrote:
            | There's a more in-depth discussion of this issue here:
           | https://github.com/uav4geo/LibreTranslate/issues/20
           | 
            | In cases where capitalization isn't important, using all
            | lower case can help avoid this.
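            | 
            | As a rough sketch of that workaround against the HTTP API
            | (the endpoint and field names follow the LibreTranslate
            | docs, but treat them as assumptions for whatever version
            | you run):
            | 
            |       import requests
            | 
            |       def translate_lowercased(text, source="en", target="de"):
            |           # Lowercase first so e.g. "merry christmas" isn't
            |           # left untranslated; only do this when casing in
            |           # the output doesn't matter.
            |           resp = requests.post(
            |               "https://libretranslate.com/translate",
            |               json={"q": text.lower(), "source": source,
            |                     "target": target},
            |           )
            |           resp.raise_for_status()
            |           return resp.json()["translatedText"]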
        
       | jarym wrote:
        | Interesting: if I do English to French on the following:
       | 
       |  _Hello Sarah, what time is it?_
       | 
       | it translates to
       | 
       |  _Bonjour Sarah, quelle heure est-il ?_
       | 
       | Now if I change the input to
       | 
       |  _Hello Sara, what time is it?_
       | 
       | it translates instead to:
       | 
       |  _Quelle heure est-il ?_
       | 
       | Any idea why the one character difference in the name affects the
       | translation in this way?
        
         | pjfin123 wrote:
         | The process for translation is to "tokenize" a stream of
         | characters into "tokens" like this:
         | 
         | "Hello Sarah, what time is it?"
         | 
         | <Hello><_Sarah><,><_what><_time><_is><_it><?>
         | 
         | Then the tokens are used for inference with a Transformer net.
          | There are no guarantees that the output will be stable under
          | even small changes to the input. The network, based on the data
          | it was trained on (and luck), has slightly different connections
          | for the <_Sara> token than for the <_Sarah> token, leading to
          | different output.
         | 
         | Here's a video of some Linux YouTubers reviewing Argos
         | Translate (https://github.com/argosopentech/argos-translate),
         | the underlying translation library, and getting unexpected
         | outputs: https://www.youtube.com/watch?v=geMs9dxl1N8
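          | 
          | To make the tokenization step concrete, here's a small sketch
          | using the SentencePiece library (the subword tokenizer these
          | models typically use; the model filename below is hypothetical):
          | 
          |       import sentencepiece as spm
          | 
          |       sp = spm.SentencePieceProcessor()
          |       sp.Load("sentencepiece.model")  # hypothetical model path
          | 
          |       # A one-character change in a name can produce a different
          |       # token sequence, which the Transformer then maps to
          |       # different output.
          |       print(sp.EncodeAsPieces("Hello Sarah, what time is it?"))
          |       print(sp.EncodeAsPieces("Hello Sara, what time is it?"))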
        
         | tejtm wrote:
          | Pigeonhole principle?
         | 
          | French has more verb tenses than English, which means ambiguity
          | about where things may come from and go to.
        
       | yorwba wrote:
       | Hmm... "Let's see how well it works." seems to be handled
       | correctly when translating from English into any language except
       | Chinese, where the apostrophe is turned into <unk> and the
       | sentence is otherwise untranslated.
       | 
       | Does that mean there's a different model for each language pair?
        
         | pjfin123 wrote:
          | There are different models for each language pair. Currently
          | there are only pre-trained models to and from English, and
          | other language pairs "pivot" through English.
         | 
         | ex:
         | 
          | es -> en -> fr
         | 
          | Chinese is currently the weakest language pair, but I'm working
          | on improving it:
         | https://github.com/argosopentech/argos-translate/issues/17
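          | 
          | Conceptually the pivot is just two passes through the same
          | /translate endpoint. A sketch, done by hand against the HTTP
          | API (field names per the LibreTranslate docs; the helper is
          | only illustrative):
          | 
          |       import requests
          | 
          |       URL = "https://libretranslate.com/translate"
          | 
          |       def translate(text, source, target):
          |           r = requests.post(URL, json={
          |               "q": text, "source": source, "target": target})
          |           r.raise_for_status()
          |           return r.json()["translatedText"]
          | 
          |       # es -> en -> fr: pivot through English when there is no
          |       # direct Spanish-French model.
          |       english = translate("¿Qué hora es?", "es", "en")
          |       french = translate(english, "en", "fr")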
        
           | yorwba wrote:
           | Thanks for the explanation. Pivoting through English isn't
           | ideal, but I'm just glad someone is working on this at all.
           | 
           | Thinking about it a bit more, it's a bit weird that the
           | failure mode of a weak model would be to regurgitate the
           | input unchanged. I'd rather have expected random Chinese
           | gibberish in that case. Doesn't that mean the model has seen
           | at least a few cases where English sentences were left
           | untranslated in the training data?
           | 
            | I wanted to download the training data to check, but the
            | instructions here
            | https://github.com/argosopentech/onmt-models#download-data
            | say to use OPUS-Wikipedia, which has no en-zh pairs, so the
            | Chinese data must be from some other source.
        
             | pjfin123 wrote:
              | Pivoting through English isn't inherent to Argos Translate;
              | you could train a French-German model or whatever you want.
              | I've just been focusing on training models to add new
              | languages. The ideal strategy is to have models that know
              | multiple languages.
             | 
             | Quoting a previous HN comment:
             | 
              | I think cloud translation is still pretty valuable in a lot
              | of cases, since the model for one single-direction
              | translation is ~100MB. In addition to offering more language
              | options without a large download, cloud translation lets you
              | use more specialized models, for example French to Spanish.
              | I just have a model to and from English for each language,
              | and any other translations have to "pivot" through English.
              | 
              | For cloud translations you can also use one model with
              | multiple input and output languages, which gives you
              | better-quality translation between languages that don't have
              | as much data available and lets you support direct
              | translation between a large number of languages. Here's a
              | talk where Google explains how they do this for Google
              | Translate: https://youtu.be/nR74lBO5M3s?t=1682. You could do
              | this locally, but it would have its own set of challenges
              | for getting the right model for the languages you want to
              | translate.
             | 
             | > Thinking about it a bit more, it's a bit weird that the
             | failure mode of a weak model would be to regurgitate the
             | input unchanged. I'd rather have expected random Chinese
             | gibberish in that case. Doesn't that mean the model has
             | seen at least a few cases where English sentences were left
             | untranslated in the training data?
             | 
              | This was added last week; it's just not live on
              | libretranslate.com yet:
             | 
             | https://github.com/uav4geo/LibreTranslate/issues/33
             | 
              | The training scripts are just an example for
              | English-Spanish; OPUS (http://opus.nlpl.eu/) has data for
              | English-Chinese.
        
       | otagekki wrote:
        | I tried translating a test sentence and got a rate-limit-related
        | error...
        | 
        | Would be glad to see this for Malagasy, though.
        
         | yorwba wrote:
         | Data availability is going to be a problem, I think. Checking
          | Malagasy Wikipedia (https://mg.wikipedia.org), there are only
          | 93k articles. That's even fewer than Latin Wikipedia
          | (https://la.wikipedia.org) at 134k. And much of the text in these
         | articles probably isn't a direct translation of an article in
         | another language, so the amount usable for parallel-text mining
         | is going to be very small.
        
       | yamrzou wrote:
       | Well done. The UI is nice and easy to use. The results looked
       | good for the few sentences I tried (Arabic <--> English).
       | 
       | May I know which datasets were used to train the models?
        
         | pjfin123 wrote:
         | OPUS parallel corpus: http://opus.nlpl.eu/
         | 
          | It's really great; they have a large amount of data and
          | organize it to make it easy to access.
        
       | Mizza wrote:
        | Very glad to see something like this; it was on my high-priority
        | Free software needs list.
        | 
        | I'd very much like it if it could be used programmatically, and
        | not just via the API - C/Python/Rust bindings, etc. I'd like to
        | build some automatically translating forum software with it.
        
         | rahimnathwani wrote:
         | It's based on argos-translate, which has python bindings:
         | 
         | https://github.com/argosopentech/argos-translate
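          | 
          | A minimal sketch of what using those bindings might look like
          | (check the exact function names against the argostranslate
          | README; the .argosmodel filename below is hypothetical):
          | 
          |       from argostranslate import package, translate
          | 
          |       # Install a downloaded en->fr model package
          |       # (hypothetical file name).
          |       package.install_from_path("translate-en_fr.argosmodel")
          | 
          |       langs = translate.get_installed_languages()
          |       en = next(l for l in langs if l.code == "en")
          |       fr = next(l for l in langs if l.code == "fr")
          |       print(en.get_translation(fr).translate("Hello World"))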
        
           | pjfin123 wrote:
            | And a native PyQt desktop app.
        
         | drusepth wrote:
         | OT: I'd definitely be interested in seeing the rest of your
         | high-priority Free software needs list!
        
       ___________________________________________________________________
       (page generated 2021-02-06 23:00 UTC)