[HN Gopher] Emotionally Expressive Text to Speech
       ___________________________________________________________________
        
       Emotionally Expressive Text to Speech
        
       Author : interweb
       Score  : 124 points
       Date   : 2020-05-16 04:54 UTC (18 hours ago)
        
 (HTM) web link (www.sonantic.io)
 (TXT) w3m dump (www.sonantic.io)
        
       | microtherion wrote:
       | The prosody sounds nice. But two of the longer samples have a lot
       | of vocal fry, and the third sounds like the voice has a stuffy
       | nose and/or a slight lisp. I wonder whether those mannerisms were
       | chosen to camouflage artifacts inherent in their current
       | implementation.
        
         | rowanG077 wrote:
         | Or the mannerisms were chosen to express that the system can
         | produce real voices and not just perfect ones.
        
           | microtherion wrote:
           | Yeah, but wouldn't you include at least one "clean" voice in
           | the samples to show what the system is capable of?
        
           | sonantic wrote:
           | Yep. Each of our TTS models is based on a real actor's voice
           | that has its own nuanced characteristics. Some voices are
           | naturally rougher / croaky while others smooth. As in real
           | life, our differences are what make us unique. Some voices
           | will work better for certain character profiles / scenes -
           | it's up to the user to decide.
        
       | amelius wrote:
       | Sounds nice but difficult to judge with the background music.
        
         | hobofan wrote:
         | I get the criticism (and "you need to find out for yourself" at
         | ~0:44 sounds somewhat robotic), but given that it's aiming at
         | the entertainment industry, where you will have background
         | music most of the time, it also seems like a fair choice of
         | representing real-world usage (where background music might
         | always hide the imperfections a bit).
        
       | cemregr wrote:
       | Is there an actual demo?
        
       | DenisM wrote:
       | This is very impressive.
       | 
       | I wonder if attaching this to a modern-day ELIZA will improve the
       | Turing test scores? Emotional load can reduce the requirement for
       | semantic coherence.
        
         | samcodes wrote:
         | "Emotional load can reduce the requirement for semantic
         | coherence." - great insight
        
       | ArneVogel wrote:
       | This site has to have one of the worst cookie choice decision
       | popups: https://imgur.com/a/YLsGadP
        
         | sonantic wrote:
         | Touche! It is pretty bad, huh? Our new site is only a few days
         | old (rolled out on Wednesday / around the same time as our
         | demo) so we haven't got everything optimised just yet. Stay
         | tuned for better cookies in the future!!!
        
         | aantix wrote:
         | GDPR has reverted the web to Geocities. It's popup hell.
        
           | martimarkov wrote:
           | It's not GDPR. It's all the tracking website owners put
           | there.
           | 
           | The fault is with the owners of the sites.
        
             | [deleted]
        
         | sergeykish wrote:
         | Looks like the option does not really matter.
        
       | Animats wrote:
       | Can't hear the voices over the music.
        
       | spaceprison wrote:
       | My daughter is dyslexic and would love to play things like
       | stardew valley, pokemon or even animal crossing but being text
       | only makes them such a slog for her.
       | 
       | The same goes for subtitles; she'd be perfectly fine with a
       | robot voice for the actors if they sounded real enough like this.
       | 
       | Game changer.
        
         | londons_explore wrote:
         | Why not just read the text to her?
        
           | billme wrote:
           | For the rest of her life?
        
         | sonantic wrote:
         | Thank you for your comment. One of the reasons we founded
         | Sonantic was to improve accessibility so we are right there
         | with you! We plan to do this by reducing the barriers (both
         | financial and logistical) of voiced content for everyone from
         | indie developers to big AAA studios. We've already begun to see
         | progress on this through partnership with initial customers
         | during our beta.
        
           | mrec wrote:
           | Is your TTS process necessarily ahead of time, or can it be
           | done at runtime with all the flexibility (templating,
           | generative text etc) that brings?
        
             | sonantic wrote:
             | That's the holy grail right there, isn't it? :) We're
             | definitely working towards runtime but still some work to
             | be done there to account for additional complexities and
             | balance trade-offs re: speed, quality, accuracy etc. of the
             | rendered output.
        
           | disabled wrote:
           | I have a print-related disability causing severe convergence
           | insufficiency with my eyes, due to a rare neurological
           | disease affecting my peripheral nervous system.
           | 
           | I generally use Kurzweil 3000 (http://KurzweilEdu.com) which
           | is made by Kurzweil Educational Systems, as a screen reader.
           | You should definitely consider partnering with them in
           | particular, as it would be very strategic.
        
       | jariel wrote:
       | Recommend editing the video down to 43-60 seconds.
       | 
       | It would be nice to try it with actual text inputs right on the
       | page; that this doesn't exist is a tiny flag.
       | 
       | A great choice to work with voice actors, because there isn't any
       | 'pure' TTS that's good enough in the most general sense; having
       | the actual voice actor as a working basis will help.
       | 
       | Perhaps for small game houses, they can just use something off
       | the shelf, big houses can use a customized voice, and then not
       | worry if they have to make tweaks or changes, they don't have to
       | do a whole production.
        
         | sonantic wrote:
         | Thanks for your feedback! We felt that this storyline / length
         | was best in order to showcase the two different actors'
         | artificial voices and build up to the actual cry.
         | 
         | As you've mentioned, we do work with real actors to create our
         | TTS and take misuse of their (artificial) voices very
         | seriously. Because they sound so lifelike, we've made the
         | decision not to allow public access/personal use at this time.
         | 
         | Lastly, your assessment is spot on regarding standard vs custom
         | voices. Lots of interest for both!
        
           | zoomablemind wrote:
           | TTS=text-to-speech, so it's quite reasonable to showcase that
           | chain instead of an edited video.
           | 
           | Not diminishing the quality of your product, just pointing
           | out an obvious expectation of the audience that it's
           | presented to. Perhaps, there could be a way to test-drive it
           | directly, with limited choices or combinations of the input
           | text.
        
       | dequalant wrote:
       | This is amazing! I've been waiting a long time for something like
       | this to come along. Finally someone did it!
        
       | schoolornot wrote:
       | Between this and Lyrebird there seem to be a high number of
       | cutting edge TTS solutions being worked on in the private sector.
       | Does anyone know why there hasn't been much advancement with the
       | FOSS libraries?
        
         | jonas21 wrote:
         | There have been quite a few recent advances in open-source TTS,
         | such as:
         | 
         | https://github.com/NVIDIA/tacotron2
         | 
         | https://github.com/CorentinJ/Real-Time-Voice-Cloning
         | 
         | https://github.com/mozilla/TTS
         | 
         | I think a lot of the remaining gap is due to a lack of high-
         | quality training data -- most of the open-source models are
         | trained on public-domain audiobooks (e.g. LJ Speech).
         | 
         | However, good training data (large amounts of annotated
         | recordings by professional voice actors) is expensive to
         | create, and unlike code, there's not a tradition of people
         | sharing it.
        
         | vianneychevalie wrote:
         | I'm convinced that the barrier to entry in this field in terms
         | of technological and financial investment is too high for FOSS
         | projects to compete with the commercial solutions.
         | 
         | We don't see FOSS pharmaceutical research for instance, I
         | believe for the same reason. The amount of coordination needed
         | and the impossibility to separate TTS projects into sub-parts
         | could also be factors.
        
           | sgk284 wrote:
           | https://voice.mozilla.org/en
           | 
           | "Common Voice is Mozilla's initiative to help teach machines
           | how real people speak."
        
           | ani-ani wrote:
           | Sure, open source TTS seems to be lagging behind recent
           | commercial offerings, but pharmaceutical research is actually
           | an excellent example of a field with massive FOSS software
           | usage. TTS is also "purely virtual" which makes it
           | significantly different, and I would say significantly more
           | approachable to open source collaboration.
        
         | reubenmorais wrote:
         | We have done Mean Opinion Score tests on Mozilla TTS [0] and
         | gotten similar scores to real humans. The main problem for open
         | sourcing higher quality models is licensing the dataset.
         | 
         | [0] https://github.com/mozilla/TTS
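
For readers unfamiliar with the term, a Mean Opinion Score is the average of listener ratings on a 1 (bad) to 5 (excellent) scale. A minimal Python sketch of the calculation follows. The ratings are invented for illustration and are not Mozilla TTS results; the function name `mean_opinion_score` and the normal-approximation 95% interval are assumptions of this sketch, not the procedure used in the tests mentioned above.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, 95% CI half-width) for a list of 1-5 listener ratings.

    The interval uses a normal approximation (1.96 * standard error),
    which is an assumption of this sketch.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    mean = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * std_err


# Hypothetical ratings from five listeners for one synthesized clip.
synthetic_ratings = [4, 5, 4, 3, 4]
mos, half_width = mean_opinion_score(synthetic_ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Comparing a synthetic voice to "real humans" then amounts to running the same calculation on ratings of recorded speech and checking whether the two intervals overlap.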
        
         | klodolph wrote:
         | I think there are a couple factors here. This is an incredibly
         | difficult problem space. The solutions going forward involve ML
         | techniques, which require ML experts (currently being hired
         | away by industry) and the resources to create models, which
         | includes not only a large amount of computational resources but
         | a big chunk of training data which needs to be sourced somehow.
         | 
         | I'm not really an expert. From what I understand, the "cutting-
         | edge" stuff requires pushing past the point where we are
         | splicing segments of speech together. Splicing segments
         | together is hard enough.
         | 
         | There are a couple open-source efforts like Mozilla's, but if
         | you want something like Lyrebird, well, that technology isn't
         | even really productized commercially yet.
        
       | blattimwind wrote:
       | I could see this being used for RPG games to fix the choice
       | deficiency that has been caused by going for fully voiced
       | dialogue. Also, making Hitler read copypastas even more
       | convincingly.
        
         | ghaff wrote:
         | As good as even not-top-shelf voice actor talent is a really
         | high bar. I keep my eye on this space because there are a
         | number of things I do where having even just decent "radio
         | voice" TTS would be useful (and better than I can do myself).
         | But nothing is really there today. In some respects, it's
         | better than I can do myself but certainly not consistently.
        
           | blattimwind wrote:
           | The bar really isn't "has to be good out of the box", if it
           | requires some tweaking on a line to line basis that would
           | probably be ok and still much, much cheaper and much quicker
           | to iterate on than voice actors for these high volumes of
           | speech. In a lot of these games the existing voice acting is
           | often consistently poor (literally everything Bethesda ever
           | released comes to mind); certainly quite a few notches below
           | the average AAA voice acting (which is occasionally bad, but
           | on average good).
        
             | ghaff wrote:
             | Fair enough. I'm not really much of a gamer.
             | 
             | One of the other challenges with using outside voice talent
             | is that it can be inconvenient/expensive when you need to
             | add/change something. I've been involved with podcasts
             | using an external host and one of the negatives with that
             | process is that if you discover a minor mistake/glitch in
             | the narration late in the process you can't easily fix it.
        
       | moron4hire wrote:
       | Any plans to support languages other than English? This would be
       | huge in the foreign language instruction field.
        
         | sonantic wrote:
         | Yes! Supporting additional languages (and dialects) is
         | definitely on our roadmap.
        
       | tomByrer wrote:
       | @sonantic Seems you don't do real-time yet?
       | 
       | If so, have plans for a Web Speech API plugin? I'm about to
       | release a reader demo based around it.
       | https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
        
       | maxdo wrote:
       | Wow sounds very real
        
       | sonantic wrote:
       | Hey HN - Zeena Qureshi (Co-Founder and CEO at Sonantic) here.
       | 
       | Thanks for your thoughts and feedback thus far! I'd be happy to
       | answer questions (within reason) about our latest cry demo /
       | emotional TTS! Feel free to fire away on this thread.
        
         | nmstoker wrote:
         | Saw your YouTube videos a few days ago and was very impressed.
         | 
         | Clearly you can't give away too much on your "secret sauce" but
         | is there any insight you could share on two questions:
         | 
         | 1. Do the individual voice talents need to express the emotion
         | types you use or can you layer it on after? (ie do they have to
         | have recorded say "happy" to get happy outputs or can that be
         | added to neutral recordings retrospectively)
         | 
         | 2. What are the ball park audio amounts you need per voice? 10
         | hrs, 20 hrs or more?
        
           | sonantic wrote:
           | Hey! Thanks so much. Yea, can't go into too much detail here,
           | but I will say that more def isn't always better when it
           | comes to the size of datasets. :) We aim for quality over
           | quantity in order to achieve natural expressiveness from our
           | actor recording sessions.
        
         | julvo wrote:
         | Hey Zeena, great to see this on HN - brilliant video!
        
         | sergeykish wrote:
         | From the first moment of our life we express emotions with
         | voice. Not only that - adults understand them. I can express my
         | own emotions without words. And I can change my mood by
         | singing.
         | 
         | So the question is - what's there? Is it formants? Is it
         | universal? Can we map them like syllables?
         | 
         | And music, it touches the same emotions. Does it use the same
         | mechanism?
         | 
         | Edit: found "Emotional speech synthesis: Applications, history
         | and possible future" [1], looks like melody _is_ part of
         | emotion processing.
         | 
         | If mapping is possible I'd love to see application in dubbing.
         | Both as translate and TTS with mapped emotions and dubbing
         | actors evaluation/autotune.
         | 
         | [1]
         | https://www.researchgate.net/publication/268260426_EMOTIONAL...
        
         | netcan wrote:
         | You have the perfect name for someone making emotive machines.
         | 
         | Would you do demos for well known speeches/texts? It'd be
         | easier to put this into context that way.
        
       | crazygringo wrote:
       | This is fascinating.
       | 
       | But I'm very curious what the emotional "parameters" are? There
       | are literally at least a thousand different ways of saying "I
       | love you" (serious-romantic, throwaway to a buddy, reassuring a
       | scared child, sarcastic, choking up, full of gratitude,
       | irritated, self-questioning, dismissive, etc. ad infinitum). Anyone
       | who's worked as an actor and done script analysis knows there are
       | hundreds of variables that go into a line reading. Just three words,
       | by themselves, can communicate roughly an entire paragraph's
       | worth of meaning solely by the exact way they're said -- which is
       | one of the things that makes acting, and directing actors, such a
       | rewarding challenge.
       | 
       | Obviously it's far too complex to infer from text alone. So
       | curious how the team has simplified it? What are emotional
       | dimensions that you can specify? And how did they choose those
       | dimensions over others? Are they geared towards the kind of
       | "everyday" expression in a normal conversation between friends,
       | or towards the more "dramatic" or "high comedy" of intense
       | situations that much of film and TV lean towards?
        
         | sergeykish wrote:
         | I hear same expression with different "strength". There is no
         | play. No motion. Expression should change after response. It
         | does not. There is no dialog. For me it sounds bald, boring.
         | It'd be better not to participate in such dialog.
         | 
         | We can express emotions without words:
         | 
         | xxx: Distress
         | 
         | yyy: Support
         | 
         | xxx: Hope
         | 
         | It maps onto music and we have a dictionary to describe it. The one
         | I'm listening to is Sorrow and Hopeful - entire track. May be a
         | good start. Write first (classification).
         | 
         | Examples you gave I feel live on same scale but extreme values.
         | So even harder.
         | 
         | I'd imagine it working like autotune - enhancing human input.
        
       | [deleted]
        
       | diminish wrote:
       | Impressive next step for text-to-speech. Wish there were some
       | simple real demos. I also work on the same thing using DL and
       | hope to open source the "emotional part" of it.
       | 
       | We can soon create emotionally expressive YouTube videos with
       | synthetic actors.
        
         | sonantic wrote:
         | Thanks for your comments and nice to hear you're also working
         | on TTS! We have a few more samples (without background music)
         | further down on our homepage and plan to add a full dedicated
         | subpage in time!
        
       | voiper1 wrote:
       | Is there any pay-to-use or open source voice for Hebrew?
       | 
       | Amazon's Polly English voice, Matthew, is pretty nice. But they
       | don't have Hebrew. Also Google doesn't have Hebrew. Bing has some
       | attribution requirement that I haven't fully investigated.
        
         | sarabande wrote:
         | Here is one, Alma Reader: https://www.almareader.com/ (not nearly
         | as emotionally expressive as this one though).
        
         | yorwba wrote:
         | If you're okay with very robotic speech, espeak-NG has
         | experimental support for Hebrew since last month:
         | https://github.com/espeak-ng/espeak-ng/issues/732
        
       | sarabande wrote:
       | If this could generate well-done audiobooks instantly from a
       | text, that would be fantastic. All e-books could have an
       | audiobook version overnight.
        
         | woah wrote:
         | It's kind of odd that they are not pitching this as the primary
         | use case. Seems much more plug and play than game voicing
        
           | mthoms wrote:
           | It needs a human to annotate the text with the desired
           | emotion.
           | 
           | Ideally, it would be able to infer the emotion from the text
           | itself, but I think that level of sophistication is a long
           | way off.
           | 
           | Edit: Actually, this might be a perfect candidate for some
           | sort of crowdsourcing. Imagine Wikipedia pages containing
           | hidden annotations for the proper text-to-speech
           | "tone/cadence/whatever" of each sentence or paragraph.
        
         | sergeykish wrote:
         | <curious>how would it know?</curious>
         | 
         | <dismissive>how would it know?</dismissive>
         | 
         | <sorrow>how would it know?</sorrow>
         | 
         | <angry>how would it know?</angry>
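
The tag notation in the comment above could be read by a small parser. The Python sketch below assumes a hypothetical markup in which each line of dialogue is wrapped in a single lowercase emotion tag with no nesting. It is not Sonantic's input format, which the thread does not describe, and the names `TAG_RE` and `parse_annotated` are made up for illustration.

```python
import re

# Matches <emotion>text</emotion> where the closing tag repeats the opening one.
# Assumes lowercase tag names and no nested tags (hypothetical markup).
TAG_RE = re.compile(r"<(?P<tag>[a-z]+)>(?P<text>.*?)</(?P=tag)>", re.DOTALL)


def parse_annotated(script):
    """Return a list of (emotion, text) pairs found in an annotated script."""
    return [
        (match.group("tag"), match.group("text").strip())
        for match in TAG_RE.finditer(script)
    ]


script = """
<dismissive>how would it know?</dismissive>
<sorrow>how would it know?</sorrow>
"""

for emotion, line in parse_annotated(script):
    # A real pipeline would pass each pair to a TTS engine; here we only print.
    print(f"{emotion}: {line}")
```

Text outside any tag is ignored by this sketch, so a production format would also need a default emotion for unannotated passages.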
        
       ___________________________________________________________________
       (page generated 2020-05-16 23:00 UTC)