[HN Gopher] A highly efficient, real-time text-to-speech system ...
___________________________________________________________________

A highly efficient, real-time text-to-speech system deployed on CPUs

Author : moneil971
Score  : 84 points
Date   : 2020-05-15 16:22 UTC (6 hours ago)

(HTM) web link (ai.facebook.com)
(TXT) w3m dump (ai.facebook.com)

| thelazydogsback wrote:
| Personally, I dislike any "emotion" added to TTS -- I find
| Alexa's emotion markup, a la
|
| https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-...
|
| disturbing and without much added value (as used in games like
| Jeopardy, for example).
|
| If used at all, these tags need to be applied meticulously in
| the proper context, somewhat non-deterministically, and with
| randomized prosody. Repeated use of the same overstated emotive
| content is annoying and unnatural (worse than a "flat"
| presentation) and only underscores the inflexibility of the
| underlying conversational content.

  | teilo wrote:
  | I agree. All I care about is that the pronunciation is
  | contextually correct and smooth. Accents are fine, and
  | necessary. But I don't want a non-human simulating human
  | emotions. Now, if they aren't simulated, that's another
  | story...

    | jandrese wrote:
    | Honestly, how do you expect to advance if you don't like
    | Genuine People Personalities(r)?
    |
    | Who wouldn't want a door that is eternally cheerful about
    | opening up for you? Or a paranoid android?

      | AnimalMuppet wrote:
      | _The paranoid android_, that's who wouldn't like a
      | paranoid android. Particularly, wouldn't like _being_
      | one.

    | ge96 wrote:
    | I take it you didn't like the movie Her, ha.

  | npunt wrote:
  | Agree, there's definitely a risk of an uncanny valley around
  | tone in voice assistants. Emotion in particular is an area
  | where I think systems won't be sophisticated enough for some
  | time, which means it's mostly something that makes things
  | unintentionally worse. A few of these downsides include:
  |
  | 1. The tone has a different valence than the subject matter
  | requires ("Sure thing! The funeral home number is..."), which
  | reminds you of and amplifies negative feelings.
  |
  | 2. The tone is used to get something from you, which makes
  | you feel manipulated.
  |
  | 3. The tone is uniformly applied, which through repeated
  | exposure makes the world feel false.
  |
  | Totally agree that randomized prosody would help make
  | assistants more natural; tempo and rhythm don't carry as much
  | emotional weight.
  |
  | I'm not sure explicit markup is the right way to approach the
  | problem, given its lack of understanding of user context and
  | state of mind, and the likelihood that developers will apply
  | it poorly. It feels like a solution at the wrong level of
  | abstraction... getting in sync with a user's emotions is a
  | pretty high bar.
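
(A minimal sketch of the randomized-prosody idea from this thread,
in Python. It assumes Alexa-style SSML output: the <prosody> and
<amazon:emotion> tags follow Amazon's documented markup, but the
ssml_with_jitter helper and its jitter ranges are invented here for
illustration, not anyone's shipping code.)

    import random

    def ssml_with_jitter(text, emotion=None, intensity="low"):
        """Wrap text in SSML, slightly randomizing rate and pitch so
        repeated responses don't carry identical, overstated prosody."""
        rate = random.randint(95, 105)   # percent of default speaking rate
        pitch = random.randint(-3, 3)    # percent offset from default pitch
        body = '<prosody rate="%d%%" pitch="%+d%%">%s</prosody>' % (
            rate, pitch, text)
        if emotion:
            # Apply emotion sparingly, only where context warrants it.
            body = '<amazon:emotion name="%s" intensity="%s">%s</amazon:emotion>' % (
                emotion, intensity, body)
        return "<speak>" + body + "</speak>"

    print(ssml_with_jitter("The funeral home number is..."))     # stays neutral
    print(ssml_with_jitter("That's correct!", emotion="excited"))  # rare case
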
| ge96 wrote:
| Impressive, but it still sounds "robotic," like AWS Polly. I
| wonder if they'll fuse this with that tech where you can sample
| someone's voice from a paragraph and build something from it.
| Then you could hire a voice actor and maybe license their
| voice? I don't know how that would work.

| jandrese wrote:
| Speech synthesis has always baffled me. You could run a
| reasonable (albeit strangely accented) version on 16 MHz Macs
| without major CPU impact. The code, including sound data, was
| less than a megabyte.
|
| In order to achieve modest improvements in dictation we're
| throwing entire GPU arrays at the problem. What happened in the
| middle? Was there really no room for improvement until we went
| full AI?

  | Someone wrote:
  | IIRC, it was _with_ major CPU impact. An 8 MHz machine
  | couldn't do much else while talking.
  |
  | Also, the original MacinTalk sounded a lot better if you fed
  | it phonemes instead of text. It didn't know how to pronounce
  | that many different words, and wasn't really good at making
  | the right choice when the pronunciation of a word depends on
  | its meaning.
  |
  | For example, if you gave it the text "Read me", it always
  | pronounced "Read" in the past tense. That always seemed the
  | wrong bet to me, and I would think the developers had heard
  | that too, but apparently fixing it was not that simple.
  |
  | I also think it didn't know whether to pronounce "Dr" as
  | "Drive" or "Doctor" depending on context, or "St" as "Saint"
  | or "Street", to mention a few examples, and it was probably
  | abysmal when you asked it to speak a phone book, with its
  | zillions of rare names (back in the eighties, that's an area
  | where AT&T's speech synthesizers excelled, I've been told).
  |
  | And that's just the text-to-phoneme conversion. The art of
  | picking the right intonation and speed of talking is in a
  | whole different ballpark; it requires a kind of sentiment
  | analysis.

  | microtherion wrote:
  | There _was_ steady improvement for decades (e.g. on a Mac,
  | you can compare Fred, Victoria, Vicki, and Alex as
  | representatives of four generations of TTS, and there were
  | year-over-year improvements as well).
  |
  | But the latest neural techniques have added quite a bit of
  | naturalness, and their computational requirements, while
  | high, are within the reach of consumer-level devices.

  | sdenton4 wrote:
  | Lots of the usual vocoder methods were developed in the '80s,
  | and they roughly approximate the voice as a collection of
  | linear systems. Neural systems allow nonlinear outputs, and
  | are quickly getting much cheaper. This is common in codecs:
  | you get a computationally expensive jump in quality, followed
  | by years of work making that quality jump computationally
  | cheap.
  |
  | You can also combine the linear and nonlinear approaches.
  | LPCNet uses a bit of each, for example: a linear LPC
  | prediction, which is then corrected by a neural net.
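
(A minimal sketch of the hybrid idea sdenton4 describes, assuming
nothing about LPCNet's actual implementation: a cheap linear LPC
predictor models each sample from its recent past, and the leftover
residual is what a small neural net would learn to generate or
correct. The toy signal and the order-16 predictor are illustrative
choices.)

    import numpy as np

    def lpc_coeffs(x, order=16):
        """Solve the LPC normal equations (Toeplitz autocorrelation
        system) -- the classic, computationally cheap part."""
        r = np.correlate(x, x, mode="full")[len(x) - 1:]
        R = np.array([[r[abs(i - j)] for j in range(order)]
                      for i in range(order)])
        return np.linalg.solve(R, r[1:order + 1])

    def lpc_predict(x, a):
        """Predict each sample as a linear combination of past samples."""
        p = len(a)
        pred = np.zeros_like(x)
        for n in range(p, len(x)):
            pred[n] = a @ x[n - p:n][::-1]   # s[n] ~ sum_k a[k] * s[n-k]
        return pred

    # Toy "speech": a decaying 200 Hz harmonic plus a little noise.
    t = np.arange(2000) / 8000.0
    x = np.sin(2 * np.pi * 200 * t) * np.exp(-3 * t) \
        + 0.01 * np.random.randn(len(t))

    a = lpc_coeffs(x)
    residual = x - lpc_predict(x, a)
    # In an LPCNet-style system the neural net only has to model this
    # residual (excitation) rather than the raw waveform, which is part
    # of what makes real-time synthesis on a CPU feasible.
    print("residual/signal energy ratio:",
          np.sum(residual**2) / np.sum(x**2))
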
  | IshKebab wrote:
  | > a reasonable (albeit strangely accented) version
  |
  | I think that's a little rose-tinted. Classical speech
  | synthesis was _awful_ in comparison to this.

  | beagle3 wrote:
  | You can go even further back: SAM (Software Automated Mouth)
  | on the C64 produced understandable, weirdly accented speech
  | with something like 10K of code+data IIRC, on a 1 MHz 6502;
  | see e.g. https://discordier.github.io/sam/ (though it did
  | take all the CPU, because it did software-based PCM on
  | hardware without PCM support... if there had been just a
  | little more control over the waveform generators, it would
  | likely have been possible with 1% CPU or so).

| godelski wrote:
| That video at the end really is deep in the uncanny valley.

| microtherion wrote:
| Meh. The synthesis quality is not terrible, but calling it
| "state of the art", quality-wise, is a bit of a stretch.

| ekelsen wrote:
| Exciting to see our research making broad impact across the
| industry! https://arxiv.org/abs/1802.08435

  | ajtulloch wrote:
  | Absolutely, it's super impressive work (as is your later work
  | with Marat :) ).

| blickentwapft wrote:
| It's a pity that all the best text-to-speech and speech-to-text
| systems are cloud-based, with heavy vendor lock-in.

| Avi-D-coder wrote:
| Any chance of an open source implementation of this?
|
| I could really use a better TTS for Linux.

  | brutt wrote:
  | No, it cannot be open sourced. It literally has no source to
  | open.

    | jandrese wrote:
    | Huh? It appears to be written in PyTorch, according to the
    | article?
    |
    | The training data could also be considered source.
    |
    | And I agree that this is of limited use if I have to access
    | it by uploading and downloading everything from Facebook's
    | servers. Not only are there privacy implications, but it
    | also requires a solid, fast, low-latency internet
    | connection that I can't guarantee.

      | brutt wrote:
      | AI is not written; AI is trained, using a dataset,
      | PyTorch, and a lot of compute time (and manpower).
      |
      | The dataset is not a big problem (if you can speak, you
      | can create your own). PyTorch is already open.

        | qchris wrote:
        | Depending on the architecture, though, it's possible to
        | export the trained model as a stand-alone file that can
        | be imported by somebody else's program, decoupling the
        | network's training data from the model it produces.
        |
        | This is done pretty frequently in areas like computer
        | vision and speech recognition, with pre-trained weights
        | for YOLO and Mozilla DeepSpeech [0] being available for
        | download. I'm not sure the term "open source" fully
        | applies here, since, as you pointed out, apart from
        | downloading the dataset, getting the source might be
        | tough. But OP's question might be answered by making
        | the resulting models publicly available along with the
        | source code of the networks used to train and deploy
        | them.
        |
        | [0] https://github.com/mozilla/DeepSpeech/releases/v0.6.0

| birdyrooster wrote:
| How long until computers can brainstorm all sorts of exciting
| new voices for characters, removing the need for pesky
| contracts and royalty payments?

| bergstromm466 wrote:
| Yeah, awesome! This proprietary transcription algorithm must
| make things a hell of a lot easier for NSA databases. If FB
| deploys this so that finished, full transcripts of calls and
| other voice traffic [1] are sent instead of the original audio
| to be transcribed later, it will all be so much more efficient!
| // sarcasm
|
| [1] https://theintercept.com/2015/05/05/nsa-speech-recognition-s...
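
(On the model-export point qchris raises above: a minimal sketch of
shipping trained weights without the training data, using PyTorch's
standard state_dict and TorchScript mechanisms. DemoTTS is a made-up
stand-in, not Facebook's actual architecture.)

    import torch
    import torch.nn as nn

    class DemoTTS(nn.Module):
        """Hypothetical stand-in model: mel frames in, samples out."""
        def __init__(self):
            super().__init__()
            self.rnn = nn.GRU(input_size=80, hidden_size=256,
                              batch_first=True)
            self.proj = nn.Linear(256, 1)

        def forward(self, mels):
            out, _ = self.rnn(mels)
            return self.proj(out)

    model = DemoTTS()
    # ... training happens here, on data that never has to be shipped ...

    # Option 1: weights only; the recipient also needs the class
    # definition to rebuild the model before loading the state dict.
    torch.save(model.state_dict(), "demo_tts_weights.pt")

    # Option 2: TorchScript; a self-contained artifact that can be
    # loaded and run (even from C++) without the original Python code
    # or the training data.
    torch.jit.script(model).save("demo_tts.ts")

    restored = torch.jit.load("demo_tts.ts")
    audio = restored(torch.randn(1, 100, 80))  # 100 mel frames in
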