[HN Gopher] A highly efficient, real-time text-to-speech system ...
       ___________________________________________________________________
        
       A highly efficient, real-time text-to-speech system deployed on
       CPUs
        
       Author : moneil971
       Score  : 84 points
       Date   : 2020-05-15 16:22 UTC (6 hours ago)
        
 (HTM) web link (ai.facebook.com)
 (TXT) w3m dump (ai.facebook.com)
        
       | thelazydogsback wrote:
        | Personally, I dislike any "emotion" added to TTS -- I find
        | Alexa's emo markup, a la:
       | 
       | https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-...
       | 
       | to be disturbing and without much added value. (Such as used with
       | games like Jeopardy.)
       | 
        | If used, these tags need to be applied meticulously in the
        | proper context, somewhat non-deterministically, and with
        | randomized prosody. Repeated usage of the same
       | overstated emotive content is annoying and unnatural (worse than
       | a "flat" presentation) and only serves to underscore the
       | underlying inflexible conversational content.
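        | 
        | Purely as a sketch of what "non-deterministic application" could
        | look like (the amazon:emotion / prosody tags are real Alexa SSML,
        | but the jitter values and the 50% coin flip here are made up):
        | 
        |     import random
        | 
        |     def wrap_ssml(text, emotion=None):
        |         """Apply emotion tags only sometimes, with slightly
        |         randomized prosody so repeats don't sound canned."""
        |         rate = random.choice(["95%", "100%", "105%"])
        |         pitch = random.choice(["-2%", "+0%", "+2%"])
        |         body = (f'<prosody rate="{rate}" pitch="{pitch}">'
        |                 f'{text}</prosody>')
        |         if emotion and random.random() < 0.5:
        |             body = (f'<amazon:emotion name="{emotion}" '
        |                     f'intensity="low">{body}</amazon:emotion>')
        |         return f"<speak>{body}</speak>"
        | 
        |     print(wrap_ssml("You got it right!", emotion="excited"))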
        
         | teilo wrote:
         | I agree. All I care about is that the pronunciation is
         | contextually correct, and smooth. Accents are fine, and
         | necessary. But I don't want a non-human simulating human
         | emotions. Now if they aren't simulated, that's another story...
        
           | jandrese wrote:
           | Honestly, how do you expect to advance if you don't like
           | Genuine People Personalities(r)?
           | 
           | Who wouldn't want a door that is eternally cheerful about
           | opening up for you? Or a paranoid android?
        
             | AnimalMuppet wrote:
              | _The paranoid android_, that's who wouldn't like a
             | paranoid android. Particularly, wouldn't like _being_ one.
        
         | ge96 wrote:
         | I take it you didn't like the movie Her ha
        
         | npunt wrote:
          | Agree, there's definitely a risk of uncanny valley in voice
          | assistants around tone. Emotion in particular is an area where
          | I think we won't have sophisticated enough systems to address
          | it well for some time, which means it's mostly something that
          | makes things unintentionally worse. A few of these downsides
          | include:
         | 
         | 1. tone has a different valence than what the subject matter
         | requires ('Sure thing! The funeral home number is...'), which
         | reminds you of and amplifies negative feelings
         | 
         | 2. tone is used to get something from you, which makes you feel
         | manipulated
         | 
         | 3. tone is uniformly applied, which through repeat exposure
         | makes the world feel false
         | 
          | Totally agree that randomized prosody would help make
          | assistants more natural; tempo and rhythm don't have as much
          | emotional weight.
         | 
          | I'm not sure explicit markup is the right way to approach the
          | problem, given its lack of understanding of user context and
          | state of mind, and the likelihood that developers will execute
          | it poorly. It feels like a solution at the wrong level of
          | abstraction... getting in sync with a user's emotions is a
          | pretty high bar.
        
       | ge96 wrote:
        | Impressive, but it also still sounds "robotic", like AWS Polly.
        | I wonder if they'll fuse this with the tech where you can sample
        | someone's voice from a paragraph and build a model from it. Then
        | you could hire a voice actor (or actress) and maybe license
        | their voice? I don't know how that would work.
        
       | jandrese wrote:
        | Speech synthesis has always baffled me. You could run a
        | reasonable (albeit strangely accented) version on 16 MHz Macs
        | without major CPU impact. The code, including sound data, was
        | less than a megabyte.
       | 
       | In order to achieve modest improvements in dictation we're
       | throwing entire GPU arrays at the problem. What happened in the
       | middle? Was there really no room for improvement until we went
       | full AI?
        
         | Someone wrote:
          | IIRC, it was _with_ major CPU impact. An 8 MHz machine couldn't
         | do much else while talking.
         | 
         | Also, the original MacinTalk sounded a lot better if you fed it
         | phonemes instead of text. It didn't know how to pronounce that
         | many different words, and wasn't really good at making the
         | right choice when the pronunciation of a word depends on its
         | meaning.
         | 
         | For example, if you gave it the text "Read me", it always
         | pronounced "Read" in the past tense. That always seemed the
         | wrong bet to me, and I would think the developers had heard
         | that, too, but apparently, fixing it was not that simple.
         | 
          | I also think it didn't know whether to pronounce "Dr" as
          | "Drive" or "Doctor", depending on context, or "St" as "Saint"
          | or "Street", to mention a few examples, and it probably was
          | abysmal when you asked it to speak a phone book, with its
          | zillions of rare names (back in the eighties, that's an area
          | where AT&T's speech synthesizers excelled, I've been told).
         | 
         | And that's just the text-to-phoneme conversion. The arts of
         | picking the right intonation and speed of talking are in a
         | whole different ball park; they require a kind of sentiment
         | analysis.
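          | 
          | For what it's worth, that "Dr"/"St" problem is a text
          | normalization step that still sits in front of any modern TTS.
          | A toy rule-based version, just to illustrate the kind of
          | context it needs (real systems use lexicons and POS tags):
          | 
          |     import re
          | 
          |     def expand_abbrev(text):
          |         # "Dr"/"St" before a capitalized word are probably
          |         # titles ("Doctor", "Saint"); otherwise road names.
          |         text = re.sub(r'\bDr\.?\s+(?=[A-Z])', 'Doctor ', text)
          |         text = re.sub(r'\bDr\.?\b', 'Drive', text)
          |         text = re.sub(r'\bSt\.?\s+(?=[A-Z])', 'Saint ', text)
          |         text = re.sub(r'\bSt\.?\b', 'Street', text)
          |         return text
          | 
          |     print(expand_abbrev("Dr No lives on St Mark's Dr"))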
        
         | microtherion wrote:
         | There _was_ steady improvement for decades (e.g. on a Mac, you
         | can compare Fred, Victoria, Vicki, and Alex, as representatives
         | of 4 generations of TTS, and there were year-over-year
         | improvements as well).
         | 
         | But the latest neural techniques have added quite a bit of
         | naturalness, and their computational requirements, while high,
         | are within the reach of consumer level devices.
        
         | sdenton4 wrote:
         | Lots of the usual vocoder methods were developed in the 80s,
         | and kinda approximate voice as a collection of linear systems.
         | The neural systems allow nonlinear outputs, and are quickly
         | getting much cheaper. This is common in codecs: you get a
         | computationally expensive jump on quality, followed by years of
         | work making that quality jump computationally cheap.
         | 
         | You can also combine the linear and nonlinear approaches.
          | LPCNet uses a bit of each, for example: a linear LPC
          | prediction, which is then corrected by a neural net.
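          | 
          | Very roughly, the split looks like this (a sketch of the idea,
          | not LPCNet's actual code; `correction_net` stands in for its
          | learned sample-rate network):
          | 
          |     import numpy as np
          | 
          |     def lpc_predict(history, coeffs):
          |         # Linear part: next sample as a weighted sum of the
          |         # previous len(coeffs) samples (classic LPC).
          |         return np.dot(coeffs, history[-len(coeffs):][::-1])
          | 
          |     def synthesize(n, coeffs, correction_net, features):
          |         out = np.zeros(n)
          |         for t in range(len(coeffs), n):
          |             pred = lpc_predict(out[:t], coeffs)
          |             # The net only models what linear prediction
          |             # misses (the residual / excitation).
          |             out[t] = pred + correction_net(pred, out[:t],
          |                                            features[t])
          |         return out
          | 
          |     # Toy run with a do-nothing "network":
          |     audio = synthesize(160, np.array([1.3, -0.4]),
          |                        lambda p, h, f: 0.0, np.zeros(160))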
        
         | IshKebab wrote:
         | > a reasonable (albeit strangely accented) version
         | 
         | I think that's a little rose tinted. Classical speech synthesis
         | was _awful_ in comparison to this.
        
         | beagle3 wrote:
          | You can go back even further - SAM (Software Automatic Mouth)
          | on the C64 produced understandable, weirdly accented speech
          | with something like 10K of code+data IIRC, on a 1 MHz 6502;
          | see e.g. https://discordier.github.io/sam/ (though it did take
          | all the CPU, because it did software-based PCM on hardware
          | without PCM support... if there had been just a little more
          | control over the waveform generators, it would likely have
          | been possible with 1% CPU or so).
        
       | godelski wrote:
       | That video at the end really is deep in the uncanny valley.
        
         | microtherion wrote:
         | Meh. The synthesis quality is not terrible, but calling it
         | "state of the art", quality-wise, is a bit of a stretch.
        
       | ekelsen wrote:
       | Exciting to see our research making broad impact across the
       | industry! https://arxiv.org/abs/1802.08435
        
         | ajtulloch wrote:
         | Absolutely, it's super impressive work (as is your later work
         | with Marat :) ).
        
       | blickentwapft wrote:
        | It's a pity that all the best text-to-speech and speech-to-text
        | systems are cloud-based, with heavy vendor lock-in.
        
       | Avi-D-coder wrote:
        | Any chance of an open source implementation of this?
        | 
        | I could really use a better TTS for Linux.
        
         | brutt wrote:
         | No, it cannot be open sourced. It literally has no source to
         | open.
        
           | jandrese wrote:
           | Huh? It appears to be written in PyTorch according to the
           | article?
           | 
           | The training data could also be considered source.
           | 
           | And I agree that this is of limited use if I have to access
           | it by uploading and downloading everything from Facebook
            | servers. Not only are there privacy implications, but there's
            | also the need for a solid, fast, low-latency internet
            | connection that I can't guarantee.
        
             | brutt wrote:
              | AI is not written; AI is trained using a dataset, PyTorch,
              | and a lot of compute time (and manpower).
              | 
              | The dataset is not a big problem (if you can speak, you can
              | create your own). PyTorch is already open.
        
               | qchris wrote:
                | Depending on the architecture, though, it's possible to
                | export the trained model into a stand-alone file that can
                | be imported by somebody else's program, decoupling the
                | network's training data from the model it produces.
               | 
                | This is done pretty frequently in areas like computer
                | vision and speech recognition, with the pre-trained
                | weights for YOLO and Mozilla DeepSpeech[0] being
                | available for download. I'm not sure the word "open-
                | source" totally applies here, since, as you pointed out,
                | providing the actual source beyond a downloadable
                | dataset might be tough, but OP's question might be
                | answered by making the resulting models publicly
                | available along with the source code of the networks
                | used to train and deploy them?
               | 
               | [0] https://github.com/mozilla/DeepSpeech/releases/v0.6.0
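                | 
                | In PyTorch terms that usually means something like the
                | sketch below (a generic example, not FB's actual release
                | process; the dummy model just stands in for a trained
                | network):
                | 
                |     import torch
                | 
                |     # Stand-in for a trained torch.nn.Module.
                |     model = torch.nn.Sequential(
                |         torch.nn.Linear(80, 256), torch.nn.Tanh())
                | 
                |     # Option 1: weights only -- the loader also needs
                |     # the Python code that defines the model.
                |     torch.save(model.state_dict(), "tts_weights.pt")
                | 
                |     # Option 2: TorchScript -- a self-contained file
                |     # loadable (even from C++) without the original
                |     # class definitions.
                |     torch.jit.script(model).save("tts_model.pt")
                |     restored = torch.jit.load("tts_model.pt")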
        
       | birdyrooster wrote:
        | How long until computers can brainstorm all sorts of exciting new
        | voices for characters, removing the need for pesky contracts and
        | royalty payments?
        
       | bergstromm466 wrote:
        | Yeah, awesome! This proprietary transcription algorithm must make
        | it a hell of a lot easier for NSA databases. If this is deployed
        | and used by FB so that they send finished, full transcripts of
        | calls and other voice traffic [1] instead of the original audio
        | to be transcribed later, it will all be so much more efficient!
        | // sarcasm
       | 
       | [1] https://theintercept.com/2015/05/05/nsa-speech-
       | recognition-s...
        
       ___________________________________________________________________
       (page generated 2020-05-15 23:00 UTC)