[HN Gopher] Emotionally Expressive Text to Speech
___________________________________________________________________
Emotionally Expressive Text to Speech

Author : interweb
Score  : 124 points
Date   : 2020-05-16 04:54 UTC (18 hours ago)

(HTM) web link (www.sonantic.io)
(TXT) w3m dump (www.sonantic.io)

| microtherion wrote:
| The prosody sounds nice. But two of the longer samples have a lot of vocal fry, and the third sounds like the voice has a stuffy nose and/or a slight lisp. I wonder whether those mannerisms were chosen to camouflage artifacts inherent in their current implementation.
|
| rowanG077 wrote:
| Or the mannerisms were chosen to show that the system can produce real voices and not just perfect ones.
|
| microtherion wrote:
| Yeah, but wouldn't you include at least one "clean" voice in the samples to show what the system is capable of?
|
| sonantic wrote:
| Yep. Each of our TTS models is based on a real actor's voice that has its own nuanced characteristics. Some voices are naturally rougher / croaky while others are smooth. As in real life, our differences are what make us unique. Some voices will work better for certain character profiles / scenes - it's up to the user to decide.
|
| amelius wrote:
| Sounds nice but difficult to judge with the background music.
|
| hobofan wrote:
| I get the criticism (and "you need to find out for yourself" at ~0:44 sounds somewhat robotic), but given that it's aiming at the entertainment industry, where you will have background music most of the time, it also seems like a fair way of representing real-world usage (where background music might always hide the imperfections a bit).
|
| cemregr wrote:
| Is there an actual demo?
|
| DenisM wrote:
| This is very impressive.
|
| I wonder if attaching this to a modern-day ELIZA will improve the Turing test scores? Emotional load can reduce the requirement for semantic coherence.
| samcodes wrote:
| "Emotional load can reduce the requirement for semantic coherence." - great insight
|
| ArneVogel wrote:
| This site has to have one of the worst cookie choice popups: https://imgur.com/a/YLsGadP
|
| sonantic wrote:
| Touche! It is pretty bad, huh? Our new site is only a few days old (rolled out on Wednesday / around the same time as our demo) so we haven't got everything optimised just yet. Stay tuned for better cookies in the future!!!
|
| aantix wrote:
| GDPR has reverted the web to Geocities. It's popup hell.
|
| martimarkov wrote:
| It's not GDPR. It's all the tracking website owners put there.
|
| The fault is with the owners of the sites.
|
| [deleted]
|
| sergeykish wrote:
| As if the option doesn't really matter.
|
| Animats wrote:
| Can't hear the voices over the music.
|
| spaceprison wrote:
| My daughter is dyslexic and would love to play things like Stardew Valley, Pokemon or even Animal Crossing, but being text-only makes them such a slog for her.
|
| The same goes for subtitles; she'd be perfectly fine with a robot voice for the actors if it sounded as real as this.
|
| Game changer.
|
| londons_explore wrote:
| Why not just read the text to her?
|
| billme wrote:
| For the rest of her life?
|
| sonantic wrote:
| Thank you for your comment. One of the reasons we founded Sonantic was to improve accessibility so we are right there with you! We plan to do this by reducing the barriers (both financial and logistical) to voiced content for everyone from indie developers to big AAA studios. We've already begun to see progress on this through partnership with initial customers during our beta.
|
| mrec wrote:
| Is your TTS process necessarily ahead of time, or can it be done at runtime with all the flexibility (templating, generative text etc) that brings?
|
| sonantic wrote:
| That's the holy grail right there, isn't it?
| :) We're definitely working towards runtime but there's still some work to be done there to account for additional complexities and balance trade-offs re: speed, quality, accuracy etc. of the rendered output.
|
| disabled wrote:
| I have a print-related disability causing severe convergence insufficiency with my eyes, due to a rare neurological disease affecting my peripheral nervous system.
|
| I generally use Kurzweil 3000 (http://KurzweilEdu.com), which is made by Kurzweil Educational Systems, as a screen reader. You should definitely consider partnering with them in particular, as it would be very strategic.
|
| jariel wrote:
| I'd recommend editing the video down to 43-60 seconds.
|
| It would be nice to try it with actual text inputs right on the page; that this doesn't exist is a tiny flag.
|
| Working with voice actors is a great choice: because there isn't any 'pure' TTS that's good enough in the most general sense, having the actual voice actor as a working basis will help.
|
| Perhaps small game houses can just use something off the shelf, while big houses can use a customized voice and then not worry if they have to make tweaks or changes, since they don't have to do a whole production.
|
| sonantic wrote:
| Thanks for your feedback! We felt that this storyline / length was best in order to showcase the two different actors' artificial voices and build up to the actual cry.
|
| As you've mentioned, we do work with real actors to create our TTS and take misuse of their (artificial) voices very seriously. Because they sound so lifelike, we've made the decision not to allow public access/personal use at this time.
|
| Lastly, your assessment is spot on regarding standard vs custom voices. Lots of interest for both!
|
| zoomablemind wrote:
| TTS=text-to-speech, so it's quite reasonable to showcase that chain instead of an edited video.
|
| Not diminishing the quality of your product, just pointing out an obvious expectation of the audience that it's presented to. Perhaps there could be a way to test-drive it directly, with limited choices or combinations of the input text.
|
| dequalant wrote:
| This is amazing! I've been waiting for something like this to come along for a long time. Finally someone did it!
|
| schoolornot wrote:
| Between this and Lyrebird there seem to be a high number of cutting-edge TTS solutions being worked on in the private sector. Does anyone know why there hasn't been much advancement with the FOSS libraries?
|
| jonas21 wrote:
| There have been quite a few recent advances in open-source TTS, such as:
|
| https://github.com/NVIDIA/tacotron2
|
| https://github.com/CorentinJ/Real-Time-Voice-Cloning
|
| https://github.com/mozilla/TTS
|
| I think a lot of the remaining gap is due to a lack of high-quality training data -- most of the open-source models are trained on public-domain audiobooks (e.g. LJ Speech).
|
| However, good training data (large amounts of annotated recordings by professional voice actors) is expensive to create, and unlike code, there's not a tradition of people sharing it.
|
| vianneychevalie wrote:
| I'm convinced that the barrier to entry in this field, in terms of technological and financial investment, is too high for FOSS projects to compete with the commercial solutions.
|
| We don't see FOSS pharmaceutical research, for instance, I believe for the same reason. The amount of coordination needed and the impossibility of separating TTS projects into sub-parts could also be factors.
|
| sgk284 wrote:
| https://voice.mozilla.org/en
|
| "Common Voice is Mozilla's initiative to help teach machines how real people speak."
|
| ani-ani wrote:
| Sure, open source TTS seems to be lagging behind recent commercial offerings, but pharmaceutical research is actually an excellent example of a field with massive FOSS software usage.
| TTS is also "purely virtual", which makes it significantly different, and I would say significantly more approachable to open source collaboration.
|
| reubenmorais wrote:
| We have done Mean Opinion Score tests on Mozilla TTS [0] and gotten similar scores to real humans. The main problem for open sourcing higher quality models is licensing the dataset.
|
| [0] https://github.com/mozilla/TTS
|
| klodolph wrote:
| I think there are a couple of factors here. This is an incredibly difficult problem space. The solutions going forward involve ML techniques, which require ML experts (currently being hired away by industry) and the resources to create models, which include not only a large amount of computational resources but also a big chunk of training data which needs to be sourced somehow.
|
| I'm not really an expert. From what I understand, the "cutting-edge" stuff requires pushing past the point where we are splicing segments of speech together. Splicing segments together is hard enough.
|
| There are a couple of open-source efforts like Mozilla's, but if you want something like Lyrebird, well, that technology isn't even really productized commercially yet.
|
| blattimwind wrote:
| I could see this being used for RPG games to fix the choice deficiency that has been caused by going for fully voiced dialogue. Also, making Hitler read copypastas even more convincingly.
|
| ghaff wrote:
| Being as good as even not-top-shelf voice actor talent is a really high bar. I keep my eye on this space because there are a number of things I do where having even just decent "radio voice" TTS would be useful (and better than I can do myself). But nothing is really there today. In some respects, it's better than I can do myself but certainly not consistently.
| blattimwind wrote:
| The bar really isn't "has to be good out of the box"; if it requires some tweaking on a line-by-line basis, that would probably be ok and still much, much cheaper and much quicker to iterate on than voice actors for these high volumes of speech. In a lot of these games the existing voice acting is often consistently poor (literally everything Bethesda ever released comes to mind); certainly quite a few notches below the average AAA voice acting (which is occasionally bad, but on average good).
|
| ghaff wrote:
| Fair enough. I'm not really much of a gamer.
|
| One of the other challenges with using outside voice talent is that it can be inconvenient/expensive when you need to add/change something. I've been involved with podcasts using an external host and one of the negatives with that process is that if you discover a minor mistake/glitch in the narration late in the process you can't easily fix it.
|
| moron4hire wrote:
| Any plans to support languages other than English? This would be huge in the foreign language instruction field.
|
| sonantic wrote:
| Yes! Supporting additional languages (and dialects) is definitely on our roadmap.
|
| tomByrer wrote:
| @sonantic Seems you don't do real-time yet?
|
| If so, do you have plans for a Web Speech API plugin? I'm about to release a reader demo based around it.
| https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
|
| maxdo wrote:
| Wow sounds very real
|
| sonantic wrote:
| Hey HN - Zeena Qureshi (Co-Founder and CEO at Sonantic) here.
|
| Thanks for your thoughts and feedback thus far! I'd be happy to answer questions (within reason) about our latest cry demo / emotional TTS! Feel free to fire away on this thread.
|
| nmstoker wrote:
| Saw your YouTube videos a few days ago and was very impressed.
|
| Clearly you can't give away too much on your "secret sauce" but is there any insight you could share on two questions:
|
| 1.
|    Do the individual voice talents need to express the emotion types you use or can you layer it on after? (i.e. do they have to have recorded, say, "happy" to get happy outputs or can that be added to neutral recordings retrospectively?)
|
| 2. What are the ballpark audio amounts you need per voice? 10 hrs, 20 hrs or more?
|
| sonantic wrote:
| Hey! Thanks so much. Yea, can't go into too much detail here, but I will say that more def isn't always better when it comes to the size of datasets. :) We aim for quality over quantity in order to achieve natural expressiveness from our actor recording sessions.
|
| julvo wrote:
| Hey Zeena, great to see this on HN - brilliant video!
|
| sergeykish wrote:
| From the first moment of our lives we express emotions with our voice. Not only that - adults understand them. I can express my own emotions without words. And I can change my mood by singing.
|
| So the question is - what's there? Is it formants? Is it universal? Can we map them like syllables?
|
| And music touches the same emotions. Does it use the same mechanism?
|
| Edit: found "Emotional speech synthesis: Applications, history and possible future" [1], looks like melody _is_ part of emotion processing.
|
| If mapping is possible I'd love to see an application in dubbing, both for translation plus TTS with mapped emotions and for evaluating/autotuning dubbing actors.
|
| [1] https://www.researchgate.net/publication/268260426_EMOTIONAL...
|
| netcan wrote:
| You have the perfect name for someone making emotive machines.
|
| Would you do demos for well known speeches/texts? It'd be easier to put this into context that way.
|
| crazygringo wrote:
| This is fascinating.
|
| But I'm very curious what the emotional "parameters" are? There are literally at least a thousand different ways of saying "I love you" (serious-romantic, throwaway to a buddy, reassuring a scared child, sarcastic, choking up, full of gratitude, irritated, self-questioning, dismissive, etc.
| ad infinitum). Anyone who's worked as an actor and done script analysis knows there are hundreds of variables that go into a line reading. Just three words, by themselves, can communicate roughly an entire paragraph's worth of meaning solely by the exact way they're said -- which is one of the things that makes acting, and directing actors, such a rewarding challenge.
|
| Obviously it's far too complex to infer from text alone. So I'm curious how the team has simplified it. What are the emotional dimensions that you can specify? And how did they choose those dimensions over others? Are they geared towards the kind of "everyday" expression in a normal conversation between friends, or towards the more "dramatic" or "high comedy" of intense situations that much of film and TV lean towards?
|
| sergeykish wrote:
| I hear the same expression with different "strength". There is no play. No motion. Expression should change after a response. It does not. There is no dialog. To me it sounds bald, boring. It'd be better not to participate in such a dialog.
|
| We can express emotions without words:
|
| xxx: Distress
|
| yyy: Support
|
| xxx: Hope
|
| It maps onto music and we have a dictionary to describe it. The one I'm listening to is Sorrow and Hopeful - entire track. May be a good start. Write first (classification).
|
| The examples you gave, I feel, live on the same scale but at extreme values. So even harder.
|
| I'd imagine it working like autotune - enhancing human input.
|
| [deleted]
|
| diminish wrote:
| Impressive next step for text-to-speech. Wish there were some simple real demos. I also work on the same thing using DL and hope to open source the "emotional part" of it.
|
| We can soon create emotionally expressive YouTube videos with synthetic actors...
|
| sonantic wrote:
| Thanks for your comments and nice to hear you're also working on TTS!
| We have a few more samples (without background music) further down on our homepage and plan to add a full dedicated subpage in time!
|
| voiper1 wrote:
| Is there any pay-to-use or open source voice for Hebrew?
|
| Amazon Polly's English voice, Matthew, is pretty nice. But they don't have Hebrew. Google doesn't have Hebrew either. Bing has some attribution requirement that I haven't fully investigated.
|
| sarabande wrote:
| Here is one, Alma Reader: https://www.almareader.com/ (not nearly as emotionally expressive as this one though).
|
| yorwba wrote:
| If you're okay with very robotic speech, espeak-NG has had experimental support for Hebrew since last month: https://github.com/espeak-ng/espeak-ng/issues/732
|
| sarabande wrote:
| If this could generate well-done audiobooks instantly from a text, that would be fantastic. All e-books could have an audiobook version overnight.
|
| woah wrote:
| It's kind of odd that they are not pitching this as the primary use case. Seems much more plug and play than game voicing.
|
| mthoms wrote:
| It needs a human to annotate the text with the desired emotion.
|
| Ideally, it would be able to infer the emotion from the text itself, but I think that level of sophistication is a long way off.
|
| Edit: Actually, this might be a perfect candidate for some sort of crowdsourcing. Imagine Wikipedia pages containing hidden annotations for the proper text-to-speech "tone/cadence/whatever" of each sentence or paragraph.
|
| sergeykish wrote:
| <curious>how would it know?</curious>
|
| <dismissive>how would it know?</dismissive>
|
| <sorrow>how would it know?</sorrow>
|
| <angry>how would it know?</angry>
___________________________________________________________________
(page generated 2020-05-16 23:00 UTC)