[HN Gopher] Show HN: Neural text to speech with dozens of celebr...
       ___________________________________________________________________
        
       Show HN: Neural text to speech with dozens of celebrity voices
        
       Author : echelon
       Score  : 267 points
       Date   : 2020-07-27 15:06 UTC (7 hours ago)
        
 (HTM) web link (vocodes.com)
 (TXT) w3m dump (vocodes.com)
        
       | minerjoe wrote:
        | Hate to be that guy, but I can't participate in this discussion
        | because the landing page requires javascript.
        | 
        | As an outlier not running javascript, I'm reaping what I sow, but
        | it would be nice for me and others in the same boat if projects
        | made their landing pages viewable without javascript.
        
         | [deleted]
        
         | echelon wrote:
         | You can POST to https://mumble.stream/speak for a raw waveform.
         | 
         | Here's a request:
         | 
          | curl 'https://mumble.stream/speak' --compressed \
          |   -H 'Referer: https://vo.codes/' \
          |   -H 'Content-Type: application/json' \
          |   -H 'Origin: https://vo.codes' \
          |   -H 'Connection: keep-alive' \
          |   --data-raw '{"text":"testing 12345","speaker":"david-attenborough"}' \
          |   --output output.wav
         | 
         | The other speaker values:
         | 
         | http://mumble.stream/speakers
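[Ed.: the curl request above translates to a few lines of Python. This is a sketch under assumptions: the endpoint, headers, and payload come from the comment, but the service may not accept requests without them or may no longer be live, and the `requests` package is a third-party dependency.]

```python
import json

API_URL = "https://mumble.stream/speak"  # endpoint from the comment above

def build_speak_request(text, speaker):
    """Build the headers and JSON payload the /speak endpoint expects."""
    headers = {
        "Referer": "https://vo.codes/",
        "Content-Type": "application/json",
        "Origin": "https://vo.codes",
    }
    payload = json.dumps({"text": text, "speaker": speaker})
    return headers, payload

if __name__ == "__main__":
    # Requires the third-party `requests` package and a live service.
    import requests
    headers, payload = build_speak_request("testing 12345", "david-attenborough")
    resp = requests.post(API_URL, headers=headers, data=payload)
    with open("output.wav", "wb") as f:
        f.write(resp.content)
```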
        
         | thiagocsf wrote:
         | There's a text area and a button to say what you typed.
         | 
         | Surely you can enable or use a browser with JavaScript when you
         | choose to?
        
           | minerjoe wrote:
           | If you don't have javascript you see only "This page requires
           | Javascript", when I would hope, even if the thing requires
           | javascript to operate, I could at least find out if it is
           | worth switching to another machine with X11 and firing up
           | firefox.
        
         | jjeaff wrote:
         | Why don't you just turn on JavaScript?
        
           | minerjoe wrote:
           | I use links. No javascript.
        
       | ReedJessen wrote:
       | You haven't done very many female voices. Is this a limitation of
       | the modeling process?
        
       | [deleted]
        
       | vmception wrote:
       | This could make video games take up so much less space and have
       | much more robust speech, especially from NPCs.
       | 
        | Subreddit Simulator already produces pretty convincing
        | conversations. Put that together with high quality voices?
        | Mannnn, so many good applications.
       | 
        | Speaking of which, why don't people just talk about the good
        | applications? You'll get ostracized for speculating about more
        | bad things with COVID, but talk about how doomed we potentially
        | are with deep fakes? Give that blogger a Pulitzer Prize!
        
         | echelon wrote:
         | > This could make video games take up so much less space and
         | have much more robust speech, especially from NPCs.
         | 
          | Maybe, maybe not. You'll see some of the model sizes I posted
          | in comments above. These are quite large, and adding models for
          | multiple speakers multiplies the footprint. They have to live
          | in memory and probably can't be paged in selectively.
         | 
         | Once we achieve high fidelity multi-speaker embedding models
         | (where multiple speakers are encoded in a singular model), then
         | we'll have something compelling. I imagine the models will
         | become less dense over time as well.
         | 
         | Furthermore, if the models are deterministic, then the
         | designers will know what each line will sound like exactly
         | before it's produced.
        
       | aww_dang wrote:
       | Needs more Christopher Walken
        
         | echelon wrote:
         | I both love your HN username (I hope you don't troll dang), and
         | think that's an awesome suggestion. I don't know why it didn't
         | occur to me.
        
       | searchableguy wrote:
       | That's super cool.
       | 
        | I am worried about the potential abuse of this service. Are there
        | any existing services that can help identify audio deep fakes,
        | the way this one helps make them?
       | 
       | Found Resemblyzer: https://github.com/resemble-ai/Resemblyzer
        
         | azinman2 wrote:
          | I wonder if at some point we need to create legal requirements
          | that all deepfakes carry some kind of human-invisible (or
          | visible?) fingerprint for identification, restrictions on
          | frequency range, etc. We have crypto export restrictions; why
          | not put handcuffs on this as well, given its potential for
          | probably larger-scale harm?
        
           | Avicebron wrote:
            | Actually, there might be a precedent with Carrie Fisher from
            | Star Wars. I believe they did use some form of virtualization
            | after she passed. I don't know what the outcome was legally,
            | but it definitely is in this realm.
        
           | searchableguy wrote:
            | That will only stop law-abiding citizens from committing
            | crime.
        
             | azinman2 wrote:
              | That's a big start. If you look at Reddit, you'll notice
              | all these deepfake porn videos use scripts/apps that people
              | have packaged together. Those and any commercial variants
              | can abide by such restrictions, which will help document
              | the fakeness once this goes mainstream. Only the super
              | technical would be able to get around it, and if tutorials
              | etc. come out, then you have legal grounds to go after them
              | to minimize the harm.
             | 
             | Don't let perfect be the enemy of good. This has potential
             | to literally cause spilled blood, fraud, etc. Better to
             | have it for some than for zero.
        
               | searchableguy wrote:
                | I don't agree with that solution unless we as a society
                | stop putting trust in any digital media, but at that
                | point it wouldn't be necessary. Many governments would
                | love to use this tech, so they have an incentive to stop
                | others from using it while still letting people believe
                | in digital evidence via half-assed solutions like the
                | one you proposed.
               | 
               | The cat is out of the bag. Digital media should not be
               | trusted blindly.
        
               | azinman2 wrote:
                | I don't appreciate the condescension. It's half-assed in
                | that I put a rough direction out there for discussion of
                | a huge problem that will inevitably cause social
               | problems. Your reply isn't in line with HN guidelines and
               | certainly doesn't make me want to participate in
               | conversation with you about important topics.
        
         | GrantZvolsky wrote:
         | Such a system will always suffer from false positives and false
         | negatives.
         | 
         | On a more positive note, when deepfakes become a problem, we
         | will see the emergence of a culture where unsigned
         | authoritative content is not paid any attention.
        
           | underwater wrote:
           | Photoshop has existed for a long time, but people still take
           | photos at face value.
        
           | mywittyname wrote:
           | This is a big issue!
           | 
            | Lots of bad things happen, and they are only surfaced because
            | the person in question didn't notice the surreptitious
            | recording. When deep fakes become a problem, it will give
            | these people plausible deniability: they can just reject the
            | recording as "fake news."
        
           | kibwen wrote:
           | _> we will see the emergence of a culture where unsigned
           | authoritative content is not paid any attention_
           | 
           | If current events are any indication, that culture will only
           | emerge 30 years after the tech becomes widely usable, and in
           | the interim will lead to absolute chaos in the form of
           | weaponized disinformation.
        
       | mmastrac wrote:
       | I'd love to have an option for Majel Barrett
        
         | dsteinman wrote:
         | I was going to mention the same. It would be a childhood dream
         | come true to talk to my computer and have it talk back to me in
         | the TNG computer voice.
        
           | echelon wrote:
           | That's a fantastic suggestion! I'll get to it!
        
             | Baeocystin wrote:
             | Semi-serious follow-on question- would your model be able
             | to produce voices like GladOS, which are highly processed,
             | but in a consistent manner? Or are there too many
             | assumptions baked in regarding normal human speech?
        
       | svnpenn wrote:
        | I notice not a single young (or even younger) woman. Closest is
        | Hillary Clinton? Is this on purpose or an oversight?
        
       | rglover wrote:
       | Oh jeeze. I had to. Switch it to Bill Gates and pop this in:
       | 
       | > I'm going to steal your soul. One injection at a time. Slowly,
       | over the course of the next decade, the entire essence of your
       | being will be demolished until your body is nothing but a vessel
       | for my command.
        
       | 101008 wrote:
        | Can you comment a bit on the tech behind this? I tried something
        | similar with songs: I wanted artist X to sing a song from artist
        | Y. I cleaned the voices and the audio, but the transfer just
        | didn't work. I didn't do any annotations on the text (it
        | shouldn't be that hard since all lyrics are available), but if
        | you could recommend a path or maybe an open source project I'd
        | be grateful. Thanks, and great work by the way!
        
         | echelon wrote:
         | Thanks!
         | 
         | There are a lot of neat research threads ongoing in terms of
         | generating vocals.
         | 
         | Nvidia published Mellotron (code + paper + models), and the
         | results are promising:
         | 
         | https://github.com/NVIDIA/mellotron
         | 
         | https://nv-adlr.github.io/Mellotron
         | 
         | The best results I've seen are from researcher Ryuichi Yamamoto
         | (r9y9 on Github). He continually publishes astonishing results
         | and novel architectures:
         | 
         | https://github.com/r9y9
         | 
         | https://github.com/r9y9/nnsvs
         | 
         | https://soundcloud.com/r9y9/sets/dnn-based-singing-voice
         | 
         | These results lead me to believe he's going to have a
         | replacement for Vocaloid soon.
         | 
         | There's lots more stuff out there, and I can come back and edit
         | my post later.
         | 
         | Some folks are getting good results by simply combining
         | Tacotron with autotune:
         | 
         | - https://www.youtube.com/watch?v=3qR8I5zlMHs Mister Rogers
         | sings Beautiful World (amazing, super charming, and shows the
         | promise of this tech)
         | 
         | - https://www.youtube.com/watch?v=K1jrDgbRs9Q (Tupac, possibly
         | NSFW lyrics)
         | 
         | - https://www.youtube.com/watch?v=QW16_W0K3qU (Tupac with
         | various results, possibly NSFW)
         | 
         | There's a lot that gets posted to /r/VocalSynthesis and
         | occasionally /r/MediaSynthesis
        
           | 101008 wrote:
           | Thank you very much, I will look at them!
        
       | Firerouge wrote:
       | I skimmed your about, where you mention it as a hobby demo of
       | your deep work.
       | 
        | Do you have a GitHub repo or technical documentation about how
        | you built this sort of thing to work at scale?
        
         | echelon wrote:
         | I can make a blog post later, but at a high level:
         | 
          | A Rust TTS server hosts two models: a mel inference model and
          | a mel inversion model. The ones I'm using are glow-tts and
          | melgan. They fit together back to back in a pipeline.
         | 
         | I chose these models not for their fidelity, but for their
         | performance. They're 10x faster at inference than Tacotron 2.
         | If you want something that sounds amazing, you're better off
         | with a denser set of networks, like Tacotron 2 + WaveGlow. You
         | should use these for achieving superior offline results for
         | multimedia purposes.
         | 
         | Instead of using graphemes, I'm using ARPABET phonemes, and I
         | get these from a lookup table called "CMUdict" from Carnegie
         | Mellon. In the future I'll supplement this with a model that
         | predicts phonemes for missing entries.
         | 
         | Each TTS server only hosts one or two voices due to memory
         | constraints. These models are huge. This fleet is scaled
         | horizontally. A proxy server sits in front and decodes the
         | request and directs it to the appropriate backend based on a
         | ConfigMap that associates a service with the underlying model.
         | Kubernetes is used to wire all of this up.
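[Ed.: a minimal sketch of the shape of the architecture described above. Hedges: the backend names in the mapping are invented stand-ins for the ConfigMap entries, and plain callables stand in for the torchjit models; this illustrates the two-stage pipeline and speaker routing, not the actual service.]

```python
# Hypothetical speaker -> backend mapping, playing the role of the
# ConfigMap described above; real service names are not known from
# the post.
SPEAKER_BACKENDS = {
    "david-attenborough": "tts-backend-0",
    "barack-obama": "tts-backend-1",
}

def route_request(speaker):
    """Return the backend service hosting this speaker's models,
    as the proxy in front of the fleet would."""
    if speaker not in SPEAKER_BACKENDS:
        raise ValueError(f"unknown speaker: {speaker}")
    return SPEAKER_BACKENDS[speaker]

def synthesize(text, mel_model, vocoder):
    """Two-stage pipeline: text -> mel spectrogram -> waveform.

    `mel_model` stands in for glow-tts (mel inference) and `vocoder`
    for melgan (mel inversion); plain callables keep the pipeline
    shape visible without the real torchjit models.
    """
    mel = mel_model(text)   # stage 1: mel inference
    return vocoder(mel)     # stage 2: mel inversion to audio samples
```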
        
           | minerjoe wrote:
           | Can you share the cost of running this system?
        
             | echelon wrote:
             | I can come back and post a write up. Please refresh this
             | post later today.
             | 
             | I scaled for today, but it's pretty cheap to run day to
             | day.
             | 
             | I also have some architectural optimizations to make that
             | will greatly reduce the costs. Right now, nodes are
             | responsible for two speakers apiece. This is an under-
             | utilization since most speakers don't get used.
        
           | calebkaiser wrote:
           | This is incredibly cool. Do you mind sharing how big the
           | models are, and what kind of instances you're deploying them
           | on?
           | 
           | I ask because I help maintain an open source ML infra project
           | ( https://github.com/cortexlabs/cortex ) and we've recently
           | done a lot of work around autoscaling multi-model endpoints.
           | Always curious to see how others are approaching this.
        
             | echelon wrote:
              | glow-tts:
              | 
              |   total 4.2G
              |   -rw-r--r-- 1 bt bt 110M glow-tts_alan-rickman_ljstx_2020.07.22_expr-1_chkpt-4765.torchjit
              |   -rw-r--r-- 1 bt bt 110M glow-tts_anderson_cooper_ljstx_2020.07.21_expr-1_chkpt-6622.torchjit
              |   -rw-r--r-- 1 bt bt 110M glow-tts_arnold_schwarzenegger_ljstx_2020.07.16_expr-2_chkpt-9045.torchjit
              |   -rw-r--r-- 1 bt bt 110M glow-tts_barack_obama_ljstx_2020.06.28_expr-1_chkpt-1729.torchjit
              |   -rw-r--r-- 1 bt bt 110M glow-tts_ben-stein_ljstx_2020.07.21_expr-1_chkpt-7516.torchjit
              |   -rw-r--r-- 1 bt bt 110M glow-tts_betty_white_ljstx_2020.06.28_expr-1_chkpt-1666.torchjit
              |   ...
              | 
              | melgan:
              | 
              |   -rw-r--r-- 1 bt bt 17M melgan_manyvoice5.0_2020-07-23_12d5838_10760.torchjit
             | 
             | (All the voices use the same melgan, or derivations of it.)
             | 
             | I'll edit my post later with my deployment and cluster
             | architecture. In short, it's sharded and proxied from a
             | thin microservice at the top of the stack. I'll probably
             | introduce a job queue soon.
        
           | spdustin wrote:
           | > "...Instead of using graphemes, I'm using ARPABET
           | phonemes..."
           | 
           | Is this why some examples I tried seemed to skip some of the
           | words?
        
             | echelon wrote:
             | Exactly. If you type "I am a dangerous asdhfjahdsff
             | velociraptor, rawr."
             | 
             | There aren't entries for
             | 
             | - asdhfjahdsff
             | 
             | - rawr
             | 
             | I added around 500 new words, but I missed a lot of stuff.
             | 
             | The ultimate fix is to have grapheme -> phoneme prediction
             | so that all unseen words can be mapped to potential
             | phonemes (polyphones).
        
               | nmstoker wrote:
               | Are you logging the words people submit? That'd be a good
               | source for the most common OOV tokens to add.
        
       | echelon wrote:
       | This is my pandemic side project, and I'll be happy to answer any
       | questions about it.
        
         | jereees wrote:
          | You've clearly spent a good deal of your time creating this.
          | Bravo. What steps can I take to find more time to dig into
          | projects of my own? I'm assuming this is not what you make a
          | living from.
        
         | zevv wrote:
          | After some experiments playing text to people around me, we
          | decided that a huge factor in the perceived quality of the
          | voice comes from knowing who you are listening to before you
          | hear it, with the best perceived quality when the listener
          | actually gets to see a picture of the voice's owner. Was it a
          | deliberate choice to add those photographs for this reason?
        
       | netman21 wrote:
       | I definitely need this. Looks like I have to wait until you are
       | off the front page of HN though.
       | 
        | I am a writer and found that the best editing comes when I am
        | reviewing audio files of my books from voice talent. Of course,
        | by then it is way too late to change anything. With a tool like
        | this I can revise as much as I want!
        
       | boarnoah wrote:
       | On a more positive side to this technology.
       | 
       | I've been wondering about the possibility of using this sort of
       | tech (or the API offerings from Azure or GCP) to provide voice
       | overs in video games.
       | 
       | By that I mean for smaller budget Indie development, it would be
       | certainly interesting to either be able to generate voice audio
       | from transcripts in order to add voices to background NPCs and so
       | on (or even the possibility of doing it at run time to produce
       | much more dynamic worlds).
       | 
        | I guess the biggest blockers are the difficulty of conveying
        | emotion with what is currently available and the difficulty of
        | getting pronunciation correct (especially with proper nouns).
        
         | echelon wrote:
         | There are half a dozen startups in this space that provide the
         | tech. They use embedded style tokens or sliders to change the
         | emotion, pitch, timbre, etc. I don't have links off hand, but
         | they're not too difficult to find.
         | 
         | These companies tend to focus on off-the-shelf turnkey
         | solutions, so they'll have a suite of a few voice actors to
         | choose from for different character archetypes.
        
           | ethbro wrote:
           | Out of curiosity, are there legal concerns?
           | 
           | E.g. training off Schwarzenegger and offering an Arnold
           | transform
        
             | azinman2 wrote:
             | Likeness is a legal concept.
        
             | kevinmchugh wrote:
             | It's not settled caselaw so anyone basing a business off
             | this should expect to spend a lot of money defending it in
             | courts
             | 
              | I once saw a company that offered to be the sole purveyor
              | of a celebrity's synthesized voice. I haven't been able to
              | find them again, but that seems like a much safer way to
              | monetize this.
        
             | echelon wrote:
             | I'm not a lawyer, but I think we're entering into a legal
             | gray area. There are the existing frameworks of copyright,
             | parody, free speech, slander, libel, etc. that are all
             | somewhat tangential to this.
             | 
             | I believe (I'm not certain) that celebrity voice
             | _impersonation_ is legal as long as it is not used to sell
             | or endorse a product.
             | 
             | Most models are trained on the original speaker's voice,
             | but maybe only a little bit. Models might incorporate
             | learning from many speakers. We might even be able to boil
             | down a speaker representation to a small vector encoding in
             | the future. It'll be interesting if we can capture the
             | representation of a person with just a few numbers.
             | 
             | I don't think the legislature should be overly protective
             | against machine learning. It seems obvious to me that
             | neural networks will play a huge role in creating entirely
             | virtual musicians and influencers. We're already seeing
             | this start to happen. r9y9 on github has published some
             | models that rival Vocaloid in lyrical ability.
             | 
             | At the same time, we don't want these techniques used to
             | commit fraud, slander, or have them be used to falsely
             | accuse someone of committing some act. These are things we
             | might need new legal protections for.
             | 
             | But I don't know what I'm talking about. I'm not a lawyer.
        
               | ethbro wrote:
               | I asked not because I expected an answer, but because I
               | figured you'd have an insightful opinion.
               | 
               | It's essentially the performance of a composition vs the
               | composition question again: at what point am I mimicking
               | someone to the extent they have a valid claim on a
               | portion of my work?
               | 
               | I expect it'll enter the courts a few milliseconds after
               | someone clones a dead actor (without their estate's
               | permission) for a new performance.
               | 
               | There's always been an inherent tension in the US
               | distinction between a law of nature and a creative work
               | though. It seems a bit silly for me to claim patent /
               | trademark on a vector that encodes my likeness.
        
               | mywittyname wrote:
               | I suspect there's some plausible deniability built-in
               | that might allow for such matters to be legal.
               | 
                | For example, lots of people sound like Arnold
                | Schwarzenegger. So if you trained a model on a tall,
                | deep-voiced Austrian man, you could probably get
                | something that people will immediately associate with
                | Arnold without it actually being his voice, or someone
                | emulating him. Because much of what Americans associate
                | with his voice is really a regional accent which is
                | relatively uncommon in the US.
               | 
                | There may be a little more difficulty getting away with
                | this for someone like Gilbert Gottfried, whose voice is
                | much more distinctive. But I do think you could get away
                | with creating a voice that people _think_ sounds just
                | like him, but doesn't hold up in a side-by-side
                | comparison.
               | 
               | What I think will happen is celebrities like Morgan
               | Freeman will use their voice to train models like this,
               | then gift these to their estates for use in the future.
        
               | pbhjpbhj wrote:
               | > So if you trained a model with tall, deep-voiced
               | Austrian man, you could probably get something that
               | people will immediately associate with Arnold without
               | actually being his voice, or someone emulating him. //
               | 
               | I think "passing off", an unregistered element of
               | trademark laws, may be pertinent here. If the public
               | think that there's an association and you're knowingly
               | trading on that, even if the public are wrong, then you
               | can be 'passing off' your output as someone else's
               | goods/services/[vocal renditions].
               | 
                | It's likely you'd have to be very careful about use of
                | copyrighted material for training the voice (e.g.
                | extracting metrics that describe the voice). Fair Use
                | might apply in the USA though (even commercially).
               | 
               |  _IANAL, this is not legal advice._
        
               | tmoney1818 wrote:
               | >"Most models are trained on the original speaker's
               | voice, but maybe only a little bit."
               | 
               | Really cool that you got this to work. I used to work on
               | TTS (a few years ago, now), and we trained on celebrity
               | voices, but used full audiobooks.
               | https://github.com/Kyubyong/tacotron
               | 
                | Here are some of our Nick Offerman samples:
                | https://soundcloud.com/kyubyong-park/sets/tacotron_nick_215k
        
               | echelon wrote:
               | Hey! I've seen your results! Really fantastic work!
               | 
               | Thanks for making this so open and accessible.
        
         | dralley wrote:
         | > On a more positive side to this technology.
         | 
         | I'm not sure that making it easier to profit off of the
         | likeness of others is a positive side. If it's legal for indie
         | studios to do, it's legal for 20th Century Fox, Universal, and
         | so forth.
        
           | SkyBelow wrote:
           | It reduces cost to produce a specific good. I would think it
           | would be measured similar to other technology advancements
           | that do the same. Good for society, bad for the craftsman
           | that were made obsolete, overall net positive.
           | 
            | This purposely doesn't count the effect of being able to
            | fake people and the damage that does to society, but I think
            | that was implied by the previous poster asking for the more
            | positive side of the technology.
        
         | electriccello wrote:
         | Check out Modulate.ai! We make real-time, emotive voice skins
         | aimed at gaming voice chat. Audio watermarking is also built-in
         | to prevent fraud. Currently in a closed alpha stage but if
         | you're part of a game studio and have interest please reach
         | out!
        
           | minerjoe wrote:
            | Minor point: it's nice when people write URLs as links so
            | that we can just follow them, as opposed to requiring mouse
            | interactions to cut and paste.
        
             | I_complete_me wrote:
             | FYI in Firefox, just highlight and right click to select
             | 'Open in new tab'.
        
               | minerjoe wrote:
               | Doesn't that still require the mouse?
        
               | wolco wrote:
               | Wouldn't clicking a link involve a mouse?
        
               | BlueDingo wrote:
               | Nope, not the GP, but I use hotkeys with a browser
               | plugin!
        
               | minerjoe wrote:
               | Yea. Same here.
        
       | maxerickson wrote:
       | It doesn't like spelling mistakes.
       | 
        | Ask it to say "The Aristoocrats!" with Gilbert Gottfried.
        
         | echelon wrote:
         | It's currently sourcing phonemes from a lookup table called
         | CMUdict, which is constructed by Carnegie Mellon [1]. That
         | database has 140,000 entries, but even so, you'd be surprised
         | how many common words are omitted. And of course it is missing
         | terms for things like "pokemon" and "fortnite", which I had to
         | add myself.
         | 
         | I don't have generic grapheme -> phoneme/polyphone prediction,
         | but that's something I look to add soon. In my literature
         | review I didn't see anything in this space, so I was thinking I
         | might have to come up with something novel.
         | 
         | [1] http://www.speech.cs.cmu.edu/cgi-bin/cmudict
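[Ed.: the lookup behavior echelon describes, and the word-skipping it causes, can be sketched in a few lines. Hedges: the tiny `CMUDICT` dict is a toy stand-in for the 140,000-entry table mentioned above, and its ARPABET strings are illustrative approximations, not verified dictionary entries.]

```python
# Toy stand-in for the CMUdict lookup table described above; the real
# dictionary maps ~140k uppercase words to ARPABET phoneme strings.
CMUDICT = {
    "HELLO": "HH AH0 L OW1",
    "WORLD": "W ER1 L D",
}

def to_phonemes(text):
    """Map each word to ARPABET phonemes via dictionary lookup.

    Out-of-vocabulary words are silently dropped, which reproduces the
    word-skipping behavior described in the thread; a grapheme-to-phoneme
    model would be the fallback for unseen words.
    """
    phonemes = []
    for word in text.upper().replace(",", "").replace(".", "").split():
        if word in CMUDICT:
            phonemes.append(CMUDICT[word])
        # else: OOV ("rawr", "asdhfjahdsff", ...) -> skipped entirely
    return " ".join(phonemes)
```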
        
           | nmstoker wrote:
           | Espeak-ng has pretty decent English word to phoneme
           | translation. You run it in the mode where it just outputs the
           | IPA. The vocabulary can be extended too (as the coverage is
           | good but far from perfect)
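[Ed.: a small wrapper sketch for the espeak-ng mode nmstoker mentions, assuming the `-q` (quiet, no audio) and `--ipa` (print IPA phonemes) flags; it degrades gracefully when the binary isn't installed.]

```python
import shutil
import subprocess

def espeak_ipa_cmd(text):
    """Command line for espeak-ng's quiet IPA-output mode:
    -q suppresses audio, --ipa prints the phonemes."""
    return ["espeak-ng", "-q", "--ipa", text]

def to_ipa(text):
    """Run espeak-ng if it's installed; return None otherwise."""
    if shutil.which("espeak-ng") is None:
        return None
    result = subprocess.run(
        espeak_ipa_cmd(text), capture_output=True, text=True
    )
    return result.stdout.strip()
```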
        
       | holoduke wrote:
        | Really impressed by this.
        
         | echelon wrote:
         | Thanks! I kind of want pg to see :)
         | 
         | If he thinks this is egregious, I'll take it down.
        
       | mattigames wrote:
        | I was wondering: wouldn't it be possible to classify the voice
        | of every celebrity by mood, so the voices could be made less
        | monotonic? One could then add text metadata for the text-to-
        | speech conversion, e.g. "[Angry] I have a dream, [Calm] but it
        | has a patent so you can't copy it! (laughter) [Calm-fade-to-
        | angry] In reality insomnia took it from me!"
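[Ed.: the easy half of this idea, splitting such markup into (mood, text) segments, is simple to sketch. The bracket syntax is the commenter's hypothetical, not a real TTS control format; as the reply notes, the hard part is annotated training data for the style tokens themselves.]

```python
import re

TAG = re.compile(r"\[([^\]]+)\]")

def parse_mood_tags(text, default="Neutral"):
    """Split '[Mood] text ...' markup into (mood, text) segments."""
    segments = []
    mood = default
    pos = 0
    for m in TAG.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            segments.append((mood, chunk))
        mood = m.group(1)   # mood applies to the text that follows it
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((mood, tail))
    return segments
```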
        
         | echelon wrote:
         | Absolutely. These are called "style tokens" and they're an
         | active area of TTS research.
         | 
         | The problem is that currently your training data has to be
         | annotated with these tokens, and that adds a lot to the
         | difficulty of creating data sets.
         | 
         | I imagine that over time this will get much easier to do.
        
           | mywittyname wrote:
           | Are there good emotion detectors for speech-to-text? Much
           | like they have for facial recognition?
        
             | echelon wrote:
             | I'm not aware of any, and I haven't had much time to look
             | as I'm not to the point of doing style tokens yet. I'm
             | certain this would be useful for annotating data and for
             | all sorts of other applications. Sentiment analysis, etc.
        
       | ChadTheNomad wrote:
       | Well, there goes the neighborhood...
        
       | rojoca wrote:
       | This is great. Are you looking at supporting SSML?
        
       | vedran wrote:
       | This is great. I've been thinking about doing something similar
       | with cartoon characters to build a Disney-style companion for my
       | son as he gets older. I'm imagining something like an Alexa
       | assistant but with Mickey Mouse's voice.
        
         | pbhjpbhj wrote:
         | I know caselaw isn't settled at all on all this but I'd
         | absolutely avoid posting anything on the web mentioning D' and
         | the black and white mouse again unless you are interested in
         | finding out firsthand how the law gets settled here ;o).
         | 
         | Not legal advice, of course.
        
         | echelon wrote:
         | That's so cute! You should totally do it.
         | 
         | The hardest part of this is in dataset creation. It's hard to
         | clean and annotate the data and can be quite manual. That's why
         | companies with lots of data will win.
         | 
         | There are automated techniques to help with segmentation,
         | bandpass filtering, transcriptions, etc., but they're far from
         | perfect.
        
       | modzu wrote:
       | i can't be the only one wondering where Morgan Freeman is :')
        
       | galuggus wrote:
       | How long does it take to generate a good quality voice?
        
         | echelon wrote:
         | I trained a base model on the LJ Speech (LJS) dataset for
         | several days.
         | 
         | I then transfer learned for each of these speakers. Some
         | speakers have as little as 40 minutes of data, others have up
         | to five hours. The resulting quality isn't strictly a function
         | of the amount of training data, though more typically helps.
         | It's also important to have high fidelity text transcriptions
         | free of errors.
         | 
         | The transfer learning runs vary between six and thirty-six
         | hours.
         | 
         | I'm using 8xV100 instances to train glow-tts and 2x1080Ti to
         | train melgan. I'm continuously training melgan in the
         | background and simply adding more training data. The same model
         | works for all speakers.
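The recipe described above (pretrain a base model, then transfer-learn per speaker on as little as 40 minutes of data) can be illustrated on a toy problem. This is NOT glow-tts or melgan; it is a made-up linear model in numpy showing only one common variant of the idea: freeze the pretrained base and fit a small speaker-specific part on the new data.

```python
import numpy as np

# Toy transfer learning: a pretrained, frozen "base" feature extractor
# plus a small trainable head fit on a tiny "new speaker" dataset.
# All names, shapes, and hyperparameters are illustrative.
rng = np.random.default_rng(0)

def fine_tune_head(base_w, x, y, steps=200, lr=0.1):
    """Fit a scalar head on frozen base features with plain SGD."""
    head_w = 0.0
    for _ in range(steps):
        feats = x @ base_w                  # frozen base forward pass
        pred = feats * head_w               # trainable head
        grad = 2.0 * np.mean((pred - y) * feats)
        head_w -= lr * grad
    return head_w

base_w = rng.normal(size=4)                 # pretend this was pretrained
base_w /= np.linalg.norm(base_w)            # keep the toy problem well-conditioned
x = rng.normal(size=(32, 4))                # small "new speaker" dataset
y = (x @ base_w) * 3.0                      # true head weight is 3.0
head_w = fine_tune_head(base_w, x, y)
```

The real runs fine-tune millions of parameters over hours on GPUs; the point here is only that far less data is needed when the base already encodes the shared structure.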
        
           | blueblisters wrote:
           | Have you had any success with using speaker embeddings to
           | generate voices with fewer samples of speech? I did some
           | cursory experiments but couldn't get much further than
           | matching the target speaker's pitch.
           | 
           | My reasoning for this approach: IMO, if the model learns a
           | "universal human voice", it shouldn't need too much
           | additional information to get a target voice.
        
             | echelon wrote:
             | I did! I tried creating a multi-speaker embedding model for
             | practical concerns: saving on memory costs. I'm going to
             | have to add additional layers, because it didn't fit
             | individual speakers very well. I wish I'd saved audio
             | results to share. I might be able to publish my findings if
             | I look around for the model files.
             | 
             | I think you're right in that if we can get such a model to
             | work, training new embeddings won't require much data.
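The multi-speaker embedding approach discussed in this subthread can be sketched as a lookup table of per-speaker vectors conditioned into one shared model, so adding a speaker means training only a new table row. Shapes and names below are illustrative, not taken from either poster's actual models:

```python
import numpy as np

# Sketch of speaker-embedding conditioning: text features are
# concatenated frame-by-frame with a learned per-speaker vector.
# Dimensions here are made up for illustration.
N_SPEAKERS, EMB_DIM, TEXT_DIM = 8, 16, 64

rng = np.random.default_rng(1)
speaker_table = rng.normal(size=(N_SPEAKERS, EMB_DIM))  # learned rows

def condition_on_speaker(text_feats, speaker_id):
    """Tile the speaker embedding across time and concatenate."""
    t = text_feats.shape[0]                 # frames in the utterance
    emb = np.tile(speaker_table[speaker_id], (t, 1))
    return np.concatenate([text_feats, emb], axis=1)

frames = condition_on_speaker(rng.normal(size=(100, TEXT_DIM)), speaker_id=3)
```

If the shared model really learns a "universal human voice", a new speaker needs only enough data to fit one EMB_DIM-sized vector rather than a whole network.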
        
       | modzu wrote:
       | who owns the likeness of a voice? anyone? are these legally safe
       | to use in a product?
        
       | sunsetMurk wrote:
       | Very cool, and easy to use!
       | 
       | Can you give some more info on how you generated the models? I'm
       | also interested in the tech stack you're using to implement this
       | webapp... Would love some details!
       | 
       | ..What's next?
        
         | echelon wrote:
         | > Can you give some more info on how you generated the models?
         | 
         | glow-tts and melgan, which are somewhat unpopular choices given
         | the proliferation of Tacotron2/Waveglow. I chose these due to
         | their sparsity and speed.
         | 
         | > I'm also interested in the tech stack you're using to
         | implement this webapp... Would love some details!
         | 
         | It's a Rust microservice architecture. There's a proxy layer
         | that decodes the request and sends it to the appropriate
         | backend, and then there's the tts service that is horizontally
         | scaled and is responsible for loading the model pipeline and
         | turning requests into audio.
         | 
         | > ..What's next?
         | 
         | For me? Voice conversion in the near term. This takes
         | microphone input and turns it into the target speaker's voice.
         | 
         | I'm also spending a lot of time on photogrammetry. I have a 3d
         | volumetric webcam system right now that I have much bigger
         | plans for.
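For anyone wanting to script against the service: echelon posted the raw endpoint elsewhere in the thread (POST JSON with "text" and "speaker" fields to https://mumble.stream/speak, WAV bytes back). A stdlib-only Python sketch; the endpoint, headers, and field names are as posted and may change:

```python
import json
import urllib.request

def build_speak_request(text, speaker):
    """Build the POST request for the /speak endpoint."""
    body = json.dumps({"text": text, "speaker": speaker}).encode()
    return urllib.request.Request(
        "https://mumble.stream/speak",
        data=body,
        headers={"Content-Type": "application/json",
                 "Origin": "https://vo.codes"},
        method="POST",
    )

if __name__ == "__main__":
    # Mirrors the curl example from the thread; writes raw WAV audio.
    req = build_speak_request("testing 12345", "david-attenborough")
    with urllib.request.urlopen(req) as resp:
        open("output.wav", "wb").write(resp.read())
```

The speaker slugs (e.g. "david-attenborough") are listed at http://mumble.stream/speakers.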
        
         | lowdose wrote:
         | > What's next?
         | 
         | A text-to-video webapp that renders text into synchronized
         | video + voice of famous people.
         | 
         | Who wouldn't like to laugh 5X more when social scrolling?
         | 
         | The first platform that enables creators to produce deep
         | fakes of celebs from text that they can broadcast
         | as HQ video content to their audience will kill both Youtube &
         | Instagram.
         | 
         | Ranking based on likes so the best jokes of the day are
         | trending on top of the feed.
         | 
         | Recommendation engine with a multi-armed bandit ML algo from
         | so you can leverage all that incoming data.
        
           | sunsetMurk wrote:
           | Awesome. here come the 'deepMemes'!
        
       | baxtr wrote:
       | Please add Steve Jobs. I miss him
        
       | gitgud wrote:
       | Gilbert Gottfried sounds like "Bonzi Buddy" from the old desktop
       | widget days... Those were some crazy times
       | 
       | https://en.m.wikipedia.org/wiki/BonziBuddy
        
       | programmarchy wrote:
       | Looks awesome, but haven't been able to get a result back yet. I
       | think you may be getting hugged to death :)
        
         | echelon wrote:
         | That's odd. I'm testing it right now and it's working. Which
         | voices are you trying, and which device and browser are you
         | using?
        
           | imjared wrote:
           | I've tried Gilbert Gottfried and NDT. I do get a console
           | error about CORS: > Access to fetch at
           | 'https://mumble.stream/speak_spectrogram' from origin
           | 'https://vo.codes' has been blocked by CORS policy: No
           | 'Access-Control-Allow-Origin' header is present on the
           | requested resource. If an opaque response serves your needs,
           | set the request's mode to 'no-cors' to fetch the resource
           | with CORS disabled.
           | 
           | Using Chrome stable
        
             | echelon wrote:
             | Oh man, I thought I had this CORS stuff sorted.
             | 
             | Thanks for the help and info!
             | 
             | I'm using version 84.0.4147.89 (Official Build) (64-bit)
             | and getting back responses.
             | 
             | I got the following response headers:
             | 
             |     access-control-allow-origin: https://vo.codes
             |     content-length: 151689
             |     content-type: application/json
             |     date: Mon, 27 Jul 2020 15:55:37 GMT
             |     vary: Origin
             |     x-backend-hostname: tts-group-1-965d444f5-7kvkm
             | 
             | I'll try to dump the cache and reproduce.
             | 
             | edit: I must have an old browser. It works everywhere I'm
             | testing it. CORS is hard. :(
        
               | programmarchy wrote:
               | I'm also on Chrome (84) macOS, using the Craig Ferguson
               | model.
               | 
               | I switched to Safari and Disabled CORS, but a 500 error
               | is coming back now. So maybe the 500 response is the root
               | cause, and the error handler is not returning CORS
               | headers, masking the issue on Chrome.
               | 
               | Edit: by putting in a shorter input (sentence rather than
               | paragraph) I was able to get a response.
        
               | echelon wrote:
               | I need better error messages, but I believe it should
               | respond with something stating the length is too long.
               | 
               | What might've happened is that the instance your request
               | was farmed out to might have been OOM killed. I've
               | provided lots of memory, but these models are pretty
               | massive and each inference run has to spin up a lot of
               | matrices in memory.
               | 
               | This is all CPU inference, not GPU.
               | 
               | When the pods get OOM killed, they spin up again. The
               | clusters for each speaker are about 5-10 pods apiece
               | (with some double tenancy).
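The masking effect diagnosed in this subthread (an error path that skips CORS headers, so the browser reports a CORS failure instead of the underlying 500) can be sketched in framework-agnostic terms. The real backend is Rust; this Python pseudocode only illustrates attaching the headers uniformly to success and error responses alike:

```python
# If only the success path sets CORS headers, a 500 from an OOM-killed
# pod or an over-long input surfaces in the browser as a CORS error,
# hiding the real cause. Names and structure here are illustrative.
ALLOWED_ORIGINS = {"https://vo.codes"}

def with_cors(response, origin):
    """Attach CORS headers to any response, success or error alike."""
    if origin in ALLOWED_ORIGINS:
        response["headers"]["Access-Control-Allow-Origin"] = origin
        response["headers"]["Vary"] = "Origin"
    return response

def handle(request):
    try:
        body = synthesize(request)          # may raise on long input/OOM
        resp = {"status": 200, "headers": {}, "body": body}
    except Exception as exc:
        resp = {"status": 500, "headers": {}, "body": str(exc)}
    return with_cors(resp, request.get("origin", ""))

def synthesize(request):                    # stand-in for the TTS call
    if len(request["text"]) > 400:          # hypothetical length limit
        raise ValueError("input too long")
    return b"RIFF..."                       # placeholder audio bytes
```

With the header applied after the try/except, the client sees the real 500 (and its message) rather than an opaque CORS block.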
        
       | afarrell wrote:
       | Is this ethical?
        
         | tdeck wrote:
         | I too expected more discussion of this. People play around with
         | these things because they're interesting, then mostly hand wave
         | away concerns about the implications with "well, people will
         | just have to learn to be skeptical of recordings". But what
         | we're really doing is muddying a previously reliable avenue of
         | gaining quality evidence about the world. I expect this opinion
         | is unpopular on HN but I think people shouldn't be developing
         | these things, companies shouldn't be working on them, and they
         | should be banned before they get to the point of causing real
         | harm. I also believe that _can_ be prevented by drying up
         | funding and research, because bad actors have to rely on the
         | body of existing work to make their bad actions practical.
        
           | slickQ wrote:
           | As NN models get more advanced, speech synthesis will get
           | progressively more convincing and less expensive to
           | implement, even if the models aren't built for speech
           | synthesis specifically. The same can be said for image
           | generation/transformation. If we continue to develop AI,
           | then this is likely inevitable. There are benefits to these
           | models for mute people, for example. Adversarial models can
           | be built to detect fake audio samples. Regulation (ex: adding
           | tells/signatures in commercial products) would also help. The
           | government would have to ban most AI research or they would
           | only be prolonging the inevitable.
        
         | nepthar wrote:
         | Depends on your system of ethics.
        
       | phantom_rehan wrote:
       | Did you use machine learning?
        
         | echelon wrote:
         | Yeah, forks of two open source pytorch models.
         | 
         | https://github.com/jaywalnut310/glow-tts
         | 
         | https://github.com/seungwonpark/melgan
        
       | almstimplmntd wrote:
       | Played with the gilbert gottfried speaker option, gave me serious
       | Twin Peaks "Red Room" vibes.
        
       | Quequau wrote:
       | I just want to hear Douglas Rain's voice coming out of my
       | computer.
        
       | echelon wrote:
       | I've built a lot of celebrity text to speech models and host them
       | online:
       | 
       | https://vo.codes
       | 
       | It has celebrities like Sir David Attenborough and Arnold
       | Schwarzenegger, a bunch of the presidents, and also some
       | engineers: PG, Sam Altman, Peter Thiel, Mark Zuckerberg
       | 
       | I'm not far away from a working "real time" [1] voice conversion
       | (VC) system. This turns a source voice into a target voice. The
       | most difficult part is getting it to generalize to new, unheard
       | speakers. I haven't recorded my progress recently, but here are
       | some old rudimentary results that make my voice sound slightly
       | like Trump [2]. If you know what my voice sounds like and you
       | kind of squint at it a little, the results are pretty neat. I'll
       | try to publish newer results soon; they sound much better.
       | 
       | I was just about to submit all of this to HN (on "new").
       | 
       | Edit: well, my post [3] didn't make it (it fell to the second
       | page of new). But I'll be happy to answer questions here.
       | 
       | [1] It has ~1500ms of lag, but I think it can be improved.
       | 
       | [2]
       | https://drive.google.com/file/d/1vgnq09YjX6pYwf4ubFYHukDafxP...
       | 
       | [3] I'm only linking this because it failed to reach popularity.
       | https://news.ycombinator.com/item?id=23965787
        
         | dang wrote:
         | (These comments originally were in
         | https://news.ycombinator.com/item?id=23965106 but I've moved
         | them)
         | 
         | We'll re-up that thread (see
         | https://news.ycombinator.com/item?id=11662380 for how this
         | works generally). I'm going to move this comment there as well
         | because it includes more background info than you posted there.
        
           | echelon wrote:
           | Thanks, dang! :)
        
         | [deleted]
        
       | martinesko36 wrote:
       | Having met some of the people on there, this is uncannily
       | accurate, especially in the High Quality voices. Well done!
        
       ___________________________________________________________________
       (page generated 2020-07-27 23:00 UTC)