[HN Gopher] Show HN: Neural text to speech with dozens of celebr... ___________________________________________________________________ Show HN: Neural text to speech with dozens of celebrity voices Author : echelon Score : 267 points Date : 2020-07-27 15:06 UTC (7 hours ago) (HTM) web link (vocodes.com) (TXT) w3m dump (vocodes.com) | minerjoe wrote: | Hate to be that guy, but I can't participate in this discussion | due to javascript being required for the landing page. | | As an outlier not running javascript, I'm reaping what I sow, but | it would be nice for me and others in the same boat if projects | made their landing pages viewable without the need for javascript. | [deleted] | echelon wrote: | You can POST to https://mumble.stream/speak for a raw waveform. | | Here's a request: | | curl 'https://mumble.stream/speak' --compressed -H 'Referer: | https://vo.codes/' -H 'Content-Type: application/json' -H | 'Origin: https://vo.codes' -H 'Connection: keep-alive' | --data-raw '{"text":"testing 12345","speaker":"david-attenborough"}' | --output output.wav | | The other speaker values: | | http://mumble.stream/speakers | thiagocsf wrote: | There's a text area and a button to say what you typed. | | Surely you can enable or use a browser with JavaScript when you | choose to? | minerjoe wrote: | If you don't have javascript you see only "This page requires | Javascript". I would hope that, even if the thing requires | javascript to operate, I could at least find out whether it is | worth switching to another machine with X11 and firing up | firefox. | jjeaff wrote: | Why don't you just turn on JavaScript? | minerjoe wrote: | I use links. No javascript. | ReedJessen wrote: | You haven't done very many female voices. Is this a limitation of | the modeling process? | [deleted] | vmception wrote: | This could make video games take up so much less space and have | much more robust speech, especially from NPCs. 
| | Subreddit Simulator produces pretty convincing conversations; | put that together with high-quality voices? mannnn, so many good | applications. | | Speaking of which, why don't people just talk about the good | applications? You'll get ostracized for speculating more bad | things about COVID, but talk about how doomed we potentially are | with deep fakes? Give that blogger a Pulitzer Prize! | echelon wrote: | > This could make video games take up so much less space and | have much more robust speech, especially from NPCs. | | Maybe, maybe not. You'll see some of the model sizes I posted | in comments above. These are quite large, and adding models for | multiple speakers compounds the footprint. These have to live in | memory and probably can't be paged in selectively. | | Once we achieve high fidelity multi-speaker embedding models | (where multiple speakers are encoded in a single model), then | we'll have something compelling. I imagine the models will | become less dense over time as well. | | Furthermore, if the models are deterministic, then the | designers will know exactly what each line will sound like | before it's produced. | aww_dang wrote: | Needs more Christopher Walken | echelon wrote: | I both love your HN username (I hope you don't troll dang), and | think that's an awesome suggestion. I don't know why it didn't | occur to me. | searchableguy wrote: | That's super cool. | | I am worried about the potential abuse of this service. Are there | any existing services that can help identify audio deepfakes, the | way this one helps make them? | | Found Resemblyzer: https://github.com/resemble-ai/Resemblyzer | azinman2 wrote: | I wonder if at some point we need to create legal requirements | that all deepfakes carry some kind of human-invisible (or | visible?) fingerprint for identification, restriction on | frequency range, etc. We have crypto export restrictions, why | not put handcuffs on this as well, which probably has the | potential for larger-scale harm? 
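One way to picture the human-invisible fingerprint azinman2 describes: mix a keyed, low-amplitude pseudo-random signature into the waveform, then test for it later by correlation. A toy sketch only (all names illustrative; a production watermark would have to survive compression, resampling, and re-recording, which this does not):

```python
import numpy as np

def embed_watermark(audio, key=1234, strength=0.01):
    """Mix a keyed pseudo-random signature into a waveform at low amplitude."""
    rng = np.random.default_rng(key)
    signature = rng.standard_normal(len(audio))
    return audio + strength * signature

def detect_watermark(audio, key=1234, threshold=0.005):
    """Correlate against the keyed signature; watermarked audio scores
    near `strength`, unmarked audio scores near zero."""
    rng = np.random.default_rng(key)
    signature = rng.standard_normal(len(audio))
    score = float(np.dot(audio, signature)) / len(audio)
    return score > threshold
```

Only a holder of the key can run the check, which is roughly the spread-spectrum idea used by real audio watermarking schemes.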
| Avicebron wrote: | Actually, there might be a precedent with Carrie Fisher | from Star Wars. I believe they did use some form of | virtualization after she passed... I don't know what the | outcome was legally, but it definitely is in this realm. | searchableguy wrote: | That will only stop law-abiding citizens from committing | crime. | azinman2 wrote: | That's a big start. If you look at Reddit, you'll notice | all this deepfake porn is made with scripts/apps that people | have packaged together. Those and any commercial variants | can abide by such restrictions, which will help document the | fakeness once this goes mainstream. Only the super | technical would be able to get around it, and if tutorials | etc. come out, then you have legal grounds to go after them | and minimize the harm. | | Don't let perfect be the enemy of good. This has the potential | to literally cause spilled blood, fraud, etc. Better to | have it for some than for zero. | searchableguy wrote: | I don't agree with the solution unless we as a society | stop putting trust in any digital media, but at that | point this is not necessary. Many governments would love | to use this tech, so they have an incentive to stop others | from using it and still let people believe in digital | evidence by putting out a half-assed solution like the one | you proposed. | | The cat is out of the bag. Digital media should not be | trusted blindly. | azinman2 wrote: | I don't appreciate the condescension. It's half-assed in | that I put a rough direction out there for conversation | about a huge problem that will inevitably cause social | problems. Your reply isn't in line with HN guidelines and | certainly doesn't make me want to participate in | conversation with you about important topics. | GrantZvolsky wrote: | Such a system will always suffer from false positives and false | negatives. 
| | On a more positive note, when deepfakes become a problem, we | will see the emergence of a culture where unsigned | authoritative content is not paid any attention. | underwater wrote: | Photoshop has existed for a long time, but people still take | photos at face value. | mywittyname wrote: | This is a big issue! | | Lots of bad things happen, and they are only surfaced because | the person in question didn't notice the surreptitious | recording. When deepfakes become a problem, it will give | these people plausible deniability and they can just reject | it as "fake news." | kibwen wrote: | _> we will see the emergence of a culture where unsigned | authoritative content is not paid any attention_ | | If current events are any indication, that culture will only | emerge 30 years after the tech becomes widely usable, and in | the interim will lead to absolute chaos in the form of | weaponized disinformation. | mmastrac wrote: | I'd love to have an option for Majel Barrett | dsteinman wrote: | I was going to mention the same. It would be a childhood dream | come true to talk to my computer and have it talk back to me in | the TNG computer voice. | echelon wrote: | That's a fantastic suggestion! I'll get to it! | Baeocystin wrote: | Semi-serious follow-on question: would your model be able | to produce voices like GLaDOS, which are highly processed, | but in a consistent manner? Or are there too many | assumptions baked in regarding normal human speech? | svnpenn wrote: | I notice not a single young (or even younger) woman. Closest is | Hillary Clinton? Is this on purpose or an oversight? | rglover wrote: | Oh jeeze. I had to. Switch it to Bill Gates and pop this in: | | > I'm going to steal your soul. One injection at a time. Slowly, | over the course of the next decade, the entire essence of your | being will be demolished until your body is nothing but a vessel | for my command. | 101008 wrote: | Can you comment a bit on the tech behind this? 
I tried something | similar with songs: I wanted artist X to sing a song by artist | Y. I cleaned the voices and the audio, but the transfer just didn't | work. I didn't do any annotations on the text (it shouldn't be that | hard since all lyrics are available), but if you could recommend | a path or maybe an open-source project, I'd be grateful. Thanks, and | great work by the way! | echelon wrote: | Thanks! | | There are a lot of neat research threads ongoing in terms of | generating vocals. | | Nvidia published Mellotron (code + paper + models), and the | results are promising: | | https://github.com/NVIDIA/mellotron | | https://nv-adlr.github.io/Mellotron | | The best results I've seen are from researcher Ryuichi Yamamoto | (r9y9 on Github). He continually publishes astonishing results | and novel architectures: | | https://github.com/r9y9 | | https://github.com/r9y9/nnsvs | | https://soundcloud.com/r9y9/sets/dnn-based-singing-voice | | These results lead me to believe he's going to have a | replacement for Vocaloid soon. | | There's lots more stuff out there, and I can come back and edit | my post later. | | Some folks are getting good results by simply combining | Tacotron with autotune: | | - https://www.youtube.com/watch?v=3qR8I5zlMHs Mister Rogers | sings Beautiful World (amazing, super charming, and shows the | promise of this tech) | | - https://www.youtube.com/watch?v=K1jrDgbRs9Q (Tupac, possibly | NSFW lyrics) | | - https://www.youtube.com/watch?v=QW16_W0K3qU (Tupac with | various results, possibly NSFW) | | There's a lot that gets posted to /r/VocalSynthesis and | occasionally /r/MediaSynthesis | 101008 wrote: | Thank you very much, I will look at them! | Firerouge wrote: | I skimmed your About page, where you mention it as a hobby demo of | your deep work. | | Do you have a GitHub or technical documentation about how you | build this sort of thing to work at scale? 
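For anyone who wants to script the service rather than use the page, echelon's curl example near the top of the thread translates directly to Python. A hedged sketch (endpoint, headers, and payload shape are copied from that comment; the service may rate-limit, change, or go away):

```python
import json
import urllib.request

# Endpoint and request shape taken from echelon's curl example in this thread.
SPEAK_URL = "https://mumble.stream/speak"

def build_payload(text: str, speaker: str) -> bytes:
    """JSON body in the shape the /speak endpoint expects."""
    return json.dumps({"text": text, "speaker": speaker}).encode("utf-8")

def speak(text: str, speaker: str = "david-attenborough") -> bytes:
    """POST a synthesis request and return the raw WAV bytes."""
    req = urllib.request.Request(
        SPEAK_URL,
        data=build_payload(text, speaker),
        headers={
            "Content-Type": "application/json",
            "Origin": "https://vo.codes",
            "Referer": "https://vo.codes/",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # e.g. write these bytes to output.wav
```

Per the same comment, the list of valid `speaker` values is served at http://mumble.stream/speakers.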
| echelon wrote: | I can make a blog post later, but at a high level: | | A rust TTS server hosts two models: a mel inference model and a | mel inversion model. The ones I'm using are glow-tts and | melgan. They fit together back to back in a pipeline. | | I chose these models not for their fidelity, but for their | performance. They're 10x faster at inference than Tacotron 2. | If you want something that sounds amazing, you're better off | with a denser set of networks, like Tacotron 2 + WaveGlow. You | should use these for achieving superior offline results for | multimedia purposes. | | Instead of using graphemes, I'm using ARPABET phonemes, and I | get these from a lookup table called "CMUdict" from Carnegie | Mellon. In the future I'll supplement this with a model that | predicts phonemes for missing entries. | | Each TTS server only hosts one or two voices due to memory | constraints. These models are huge. This fleet is scaled | horizontally. A proxy server sits in front and decodes the | request and directs it to the appropriate backend based on a | ConfigMap that associates a service with the underlying model. | Kubernetes is used to wire all of this up. | minerjoe wrote: | Can you share the cost of running this system? | echelon wrote: | I can come back and post a write up. Please refresh this | post later today. | | I scaled for today, but it's pretty cheap to run day to | day. | | I also have some architectural optimizations to make that | will greatly reduce the costs. Right now, nodes are | responsible for two speakers apiece. This is an under- | utilization since most speakers don't get used. | calebkaiser wrote: | This is incredibly cool. Do you mind sharing how big the | models are, and what kind of instances you're deploying them | on? | | I ask because I help maintain an open source ML infra project | ( https://github.com/cortexlabs/cortex ) and we've recently | done a lot of work around autoscaling multi-model endpoints. 
| Always curious to see how others are approaching this. | echelon wrote: | glow-tts: total 4.2G -rw-r--r-- 1 | bt bt 110M glow-tts_alan- | rickman_ljstx_2020.07.22_expr-1_chkpt-4765.torchjit | -rw-r--r-- 1 bt bt 110M glow-tts_anderson_cooper_ljstx_2020 | .07.21_expr-1_chkpt-6622.torchjit -rw-r--r-- 1 bt | bt 110M glow-tts_arnold_schwarzenegger_ljstx_2020.07.16_exp | r-2_chkpt-9045.torchjit -rw-r--r-- 1 bt bt 110M | glow-tts_barack_obama_ljstx_2020.06.28_expr-1_chkpt-1729.to | rchjit -rw-r--r-- 1 bt bt 110M glow-tts_ben- | stein_ljstx_2020.07.21_expr-1_chkpt-7516.torchjit | -rw-r--r-- 1 bt bt 110M glow- | tts_betty_white_ljstx_2020.06.28_expr-1_chkpt-1666.torchjit | ... | | melgan: -rw-r--r-- 1 bt bt 17M | melgan_manyvoice5.0_2020-07-23_12d5838_10760.torchjit | | (All the voices use the same melgan, or derivations of it.) | | I'll edit my post later with my deployment and cluster | architecture. In short, it's sharded and proxied from a | thin microservice at the top of the stack. I'll probably | introduce a job queue soon. | spdustin wrote: | > "...Instead of using graphemes, I'm using ARPABET | phonemes..." | | Is this why some examples I tried seemed to skip some of the | words? | echelon wrote: | Exactly. If you type "I am a dangerous asdhfjahdsff | velociraptor, rawr." | | There aren't entries for | | - asdhfjahdsff | | - rawr | | I added around 500 new words, but I missed a lot of stuff. | | The ultimate fix is to have grapheme -> phoneme prediction | so that all unseen words can be mapped to potential | phonemes (polyphones). | nmstoker wrote: | Are you logging the words people submit? That'd be a good | source for the most common OOV tokens to add. | echelon wrote: | This is my pandemic side project, and I'll be happy to answer any | questions about it. | jereees wrote: | You've clearly spent a good deal of your time creating this. | Bravo. What steps can I take to find more time to dig in | projects of my own? 
Assuming this is not what you make a living | out of. | zevv wrote: | After some experiments with playing text to people around me, | we decided that a huge factor in the perceived quality of the | voice comes from knowing who you are listening to before you | get to listen, with the best perceived quality when the | listener actually gets to see the picture of the voice's owner. | Was it a deliberate choice to add those photographs for this | reason? | netman21 wrote: | I definitely need this. Looks like I have to wait until you are | off the front page of HN though. | | I am a writer and found that the best editing comes when I am | reviewing audio files of my books from voice talent. Of course, | then it is way too late to change anything. With a tool like this | I can revise as much as I want! | boarnoah wrote: | On a more positive side to this technology. | | I've been wondering about the possibility of using this sort of | tech (or the API offerings from Azure or GCP) to provide | voiceovers in video games. | | By that I mean that for smaller-budget indie development, it would | certainly be interesting to be able to either generate voice audio | from transcripts in order to add voices to background NPCs and so | on, or even do it at run time to produce much more dynamic worlds. | | I guess the biggest blocker is the difficulty in conveying | emotion with what is currently available, as well as the | difficulty in getting pronunciation correct (especially with | proper nouns). | echelon wrote: | There are half a dozen startups in this space that provide the | tech. They use embedded style tokens or sliders to change the | emotion, pitch, timbre, etc. I don't have links offhand, but | they're not too difficult to find. | | These companies tend to focus on off-the-shelf turnkey | solutions, so they'll have a suite of a few voice actors to | choose from for different character archetypes. 
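The "pitch slider" idea echelon mentions can be illustrated crudely. The simplest possible pitch shift is plain resampling; it is not what those commercial tools do (they use phase vocoders or neural conversion to preserve duration and timbre), but it shows the basic knob:

```python
import numpy as np

def shift_pitch_naive(audio: np.ndarray, semitones: float) -> np.ndarray:
    """Shift pitch by resampling the waveform.

    Raising the pitch by n semitones reads the signal back
    2**(n/12) times faster, so the output is correspondingly
    shorter; a real voice tool avoids this side effect.
    """
    factor = 2.0 ** (semitones / 12.0)
    positions = np.arange(0, len(audio), factor)
    return np.interp(positions, np.arange(len(audio)), audio)
```

Shifting up a full octave (+12 semitones) halves the sample count, which is exactly the duration artifact the more sophisticated methods exist to avoid.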
| ethbro wrote: | Out of curiosity, are there legal concerns? | | E.g. training off Schwarzenegger and offering an Arnold | transform | azinman2 wrote: | Likeness is a legal concept. | kevinmchugh wrote: | It's not settled caselaw so anyone basing a business off | this should expect to spend a lot of money defending it in | courts | | I saw once a company that offered to be the sole purveyor | of a celebrity's synthesized voice. I haven't been able to | find them again, but that seems like a much safer way to | monetize this. | echelon wrote: | I'm not a lawyer, but I think we're entering into a legal | gray area. There are the existing frameworks of copyright, | parody, free speech, slander, libel, etc. that are all | somewhat tangential to this. | | I believe (I'm not certain) that celebrity voice | _impersonation_ is legal as long as it is not used to sell | or endorse a product. | | Most models are trained on the original speaker's voice, | but maybe only a little bit. Models might incorporate | learning from many speakers. We might even be able to boil | down a speaker representation to a small vector encoding in | the future. It'll be interesting if we can capture the | representation of a person with just a few numbers. | | I don't think the legislature should be overly protective | against machine learning. It seems obvious to me that | neural networks will play a huge role in creating entirely | virtual musicians and influencers. We're already seeing | this start to happen. r9y9 on github has published some | models that rival Vocaloid in lyrical ability. | | At the same time, we don't want these techniques used to | commit fraud, slander, or have them be used to falsely | accuse someone of committing some act. These are things we | might need new legal protections for. | | But I don't know what I'm talking about. I'm not a lawyer. | ethbro wrote: | I asked not because I expected an answer, but because I | figured you'd have an insightful opinion. 
| | It's essentially the performance-of-a-composition vs. the | composition question again: at what point am I mimicking | someone to the extent that they have a valid claim on a | portion of my work? | | I expect it'll enter the courts a few milliseconds after | someone clones a dead actor (without their estate's | permission) for a new performance. | | There's always been an inherent tension in the US | distinction between a law of nature and a creative work, | though. It seems a bit silly for me to claim a patent / | trademark on a vector that encodes my likeness. | mywittyname wrote: | I suspect there's some plausible deniability built in | that might allow for such matters to be legal. | | For example, lots of people sound like Arnold | Schwarzenegger. So if you trained a model with a tall, | deep-voiced Austrian man, you could probably get | something that people will immediately associate with | Arnold without actually being his voice, or someone | emulating him. Because much of what Americans associate | with his voice is really a regional accent which is | relatively uncommon in the US. | | There may be a little bit more difficulty getting away | with it with someone like Gilbert Gottfried, whose voice is | much more distinctive. But I do think you could get away with | creating a voice that people _think_ sounds just like | him, but doesn't hold up in a side-by-side comparison. | | What I think will happen is celebrities like Morgan | Freeman will use their voices to train models like this, | then gift these to their estates for use in the future. | pbhjpbhj wrote: | > So if you trained a model with a tall, deep-voiced | Austrian man, you could probably get something that | people will immediately associate with Arnold without | actually being his voice, or someone emulating him. // | | I think "passing off", an unregistered element of | trademark law, may be pertinent here. 
If the public | think that there's an association and you're knowingly | trading on that, even if the public are wrong, then you | can be 'passing off' your output as someone else's | goods/services/[vocal renditions]. | | It's likely you'd have to be very careful about use of | copyright material for training the voice (eg extracting | metrics that describe the voice). Fair Use might apply in | USA though (even commercially). | | _IANAL, this is not legal advice._ | tmoney1818 wrote: | >"Most models are trained on the original speaker's | voice, but maybe only a little bit." | | Really cool that you got this to work. I used to work on | TTS (a few years ago, now), and we trained on celebrity | voices, but used full audiobooks. | https://github.com/Kyubyong/tacotron | | Here are some of our Nick Offerman samples: | https://soundcloud.com/kyubyong- | park/sets/tacotron_nick_215k . | echelon wrote: | Hey! I've seen your results! Really fantastic work! | | Thanks for making this so open and accessible. | dralley wrote: | > On a more positive side to this technology. | | I'm not sure that making it easier to profit off of the | likeness of others is a positive side. If it's legal for indie | studios to do, it's legal for 20th Century Fox, Universal, and | so forth. | SkyBelow wrote: | It reduces cost to produce a specific good. I would think it | would be measured similar to other technology advancements | that do the same. Good for society, bad for the craftsman | that were made obsolete, overall net positive. | | This is purposefully not counting in the effect of being able | to fake people and the damage that does to society, but I | think that was implied by the previous poster specifying | looking for the more positive side of the technology. | electriccello wrote: | Check out Modulate.ai! We make real-time, emotive voice skins | aimed at gaming voice chat. Audio watermarking is also built-in | to prevent fraud. 
Currently in a closed alpha stage, but if | you're part of a game studio and are interested, please reach | out! | minerjoe wrote: | Minor point: it's nice when people write URLs so that we can | just follow them, as opposed to requiring mouse interactions to cut | and paste. | I_complete_me wrote: | FYI in Firefox, just highlight and right-click to select | 'Open in new tab'. | minerjoe wrote: | Doesn't that still require the mouse? | wolco wrote: | Wouldn't clicking a link involve a mouse? | BlueDingo wrote: | Nope, not the GP, but I use hotkeys with a browser | plugin! | minerjoe wrote: | Yea. Same here. | maxerickson wrote: | It doesn't like spelling mistakes. | | Ask it to say " The Aristoocrats!" with Gilbert Gottfried. | echelon wrote: | It's currently sourcing phonemes from a lookup table called | CMUdict, which is constructed by Carnegie Mellon [1]. That | database has 140,000 entries, but even so, you'd be surprised | how many common words are omitted. And of course it is missing | terms for things like "pokemon" and "fortnite", which I had to | add myself. | | I don't have generic grapheme -> phoneme/polyphone prediction, | but that's something I plan to add soon. In my literature | review I didn't see anything in this space, so I was thinking I | might have to come up with something novel. | | [1] http://www.speech.cs.cmu.edu/cgi-bin/cmudict | nmstoker wrote: | Espeak-ng has pretty decent English word-to-phoneme | translation. You run it in the mode where it just outputs the | IPA. The vocabulary can be extended too (as the coverage is | good but far from perfect). | holoduke wrote: | Really impressed by this. | echelon wrote: | Thanks! I kind of want pg to see :) | | If he thinks this is egregious, I'll take it down. | mattigames wrote: | I was wondering: wouldn't it be possible to classify the voice of | every celebrity by mood, so the voices could be made less | monotonic? One could then add text metadata for the | text-to-speech conversion, e.g. 
"[Angry] I have a dream, [Calm] but it | has a patent so you can't copy it! (laughter) [Calm-fade-to- | angry] In reality insomnia took it from me!" | echelon wrote: | Absolutely. These are called "style tokens" and they're an | active area of TTS research. | | The problem is that currently your training data has to be | annotated with these tokens, and that adds a lot to the | difficulty of creating data sets. | | I imagine that over time this will get much easier to do. | mywittyname wrote: | Are there good emotion detectors for speech-to-text? Much | like they have for facial recognition? | echelon wrote: | I'm not aware of any, and I haven't had much time to look | as I'm not to the point of doing style tokens yet. I'm | certain this would be useful for annotating data and for | all sorts of other applications. Sentiment analysis, etc. | ChadTheNomad wrote: | Well, there goes the neighborhood... | rojoca wrote: | This is great. Are you looking at supporting SSML? | vedran wrote: | This is great. I've been thinking about doing something similar | with cartoon characters to build a Disney-style companion for my | son as he gets older. I'm imagining something like an Alexa | assistant but with Mickey Mouse's voice. | pbhjpbhj wrote: | I know caselaw isn't settled at all on all this but I'd | absolutely avoid posting anything on the web mentioning D' and | the black and white mouse again unless you are interested in | finding out firsthand how the law gets settled here ;o). | | Not legal advice, of course. | echelon wrote: | That's so cute! You should totally do it. | | The hardest part of this is in dataset creation. It's hard to | clean and annotate the data and can be quite manual. That's why | companies with lots of data will win. | | There are automated techniques to help with segmentation, | bandpass filtering, transcriptions, etc., but they're far from | perfect. | modzu wrote: | i can't be there only one wondering where is Morgan Freeman? 
:') | galuggus wrote: | How long does it take to generate a good quality voice? | echelon wrote: | I trained a base model on the Linda Johnson speech (LJS) data | set for several days. | | I then transfer learned for each of these speakers. Some | speakers have as little as 40 minutes of data, others have up | to five hours. The resulting quality isn't strictly a function | of the amount of training data, though more typically helps. | It's also important to have high fidelity text transcriptions | free of errors. | | The transfer learning runs vary between six hours and thirty | six hours. | | I'm using 8xV100 instances to train glow-tts and 2x1080Ti to | train melgan. I'm continuously training melgan in the | background and simply adding more training data. The same model | works for all speakers. | blueblisters wrote: | Have you had any success with using speaker embeddings to | generate voices with fewer samples of speech? I did some | cursory experiments but I couldn't get too far beyond getting | pitch similar to the target speaker. | | My reasoning for this approach: IMO, if the model learns a | "universal human voice", it shouldn't need too much | additional information to get a target voice. | echelon wrote: | I did! I tried creating a multi-speaker embedding model for | practical concerns: saving on memory costs. I'm going to | have to add additional layers, because it didn't fit | individual speakers very well. I wish I'd saved audio | results to share. I might be able to publish my findings if | I look around for the model files. | | I think you're right in that if we can get such a model to | work, training new embeddings won't require much data. | modzu wrote: | who owns the likeness of a voice? anyone? are these legally safe | to use in a product? | sunsetMurk wrote: | Very cool, and easy to use! | | Can you give some more info on how you generated the models? I'm | also interested in the tech stack you're using to implement this | webapp... 
Would love some details! | | ..What's next? | echelon wrote: | > Can you give some more info on how you generated the models? | | glow-tts and melgan, which are somewhat unpopular choices given | the proliferation of Tacotron2/Waveglow. I chose these due to | their sparsity and speed. | | > I'm also interested in the tech stack you're using to | implement this webapp... Would love some details! | | It's a Rust microservice architecture. There's a proxy layer | that decodes the request and sends it to the appropriate | backend, and then there's the TTS service, which is horizontally | scaled and is responsible for loading the model pipeline and | turning requests into audio. | | > ..What's next? | | For me? Voice conversion in the near term. This takes | microphone input and turns it into the target speaker's voice. | | I'm also spending a lot of time on photogrammetry. I have a 3D | volumetric webcam system right now that I have much bigger | plans for. | lowdose wrote: | > What's next? | | A text-to-video webapp that renders text into synchronized video | + voice of famous people. | | Who wouldn't like to laugh 5X more when social scrolling? | | The first platform that gives creators the ability to | produce deepfakes of celebs from text that they can broadcast | as HQ video content to their audience will kill both YouTube & | Instagram. | | Ranking based on likes, so the best jokes of the day are | trending on top of the feed. | | A recommendation engine with a multi-armed-bandit ML algo from the | start so you can leverage all that incoming data. | sunsetMurk wrote: | Awesome. Here come the 'deepMemes'! | baxtr wrote: | Please add Steve Jobs. I miss him. | gitgud wrote: | Gilbert Gottfried sounds like "Bonzi Buddy" from the old desktop | widget days... Those were some crazy times. | | https://en.m.wikipedia.org/wiki/BonziBuddy | programmarchy wrote: | Looks awesome, but I haven't been able to get a result back yet. 
I | think you may be getting hugged to death :) | echelon wrote: | That's odd. I'm testing it right now and it's working. Which | voices are you trying, and which device and browser are you | using? | imjared wrote: | I've tried Gilbert Gottfried and NDT. I do get a console | error about CORS: | | > Access to fetch at 'https://mumble.stream/speak_spectrogram' | from origin 'https://vo.codes' has been blocked by CORS policy: | No 'Access-Control-Allow-Origin' header is present on the | requested resource. If an opaque response serves your needs, | set the request's mode to 'no-cors' to fetch the resource | with CORS disabled. | | Using Chrome stable. | echelon wrote: | Oh man, I thought I had this CORS stuff sorted. | | Thanks for the help and info! | | I'm using version 84.0.4147.89 (Official Build) (64-bit) | and getting back responses. | | I got the following response headers:
| access-control-allow-origin: https://vo.codes
| content-length: 151689
| content-type: application/json
| date: Mon, 27 Jul 2020 15:55:37 GMT
| vary: Origin
| x-backend-hostname: tts-group-1-965d444f5-7kvkm
| | I'll try to dump the cache and reproduce. | | edit: I must have an old browser. It works everywhere I'm | testing it. CORS is hard. :( | programmarchy wrote: | I'm also on Chrome (84) on macOS, using the Craig Ferguson model. | | I switched to Safari and disabled CORS, but a 500 error | is coming back now. So maybe the 500 response is the root | cause, and the error handler is not returning CORS | headers, masking the issue on Chrome. | | Edit: by putting in a shorter input (a sentence rather than | a paragraph) I was able to get a response. | echelon wrote: | I need better error messages, but I believe it should | respond with something stating the length is too long. | | What might've happened is that the instance your request | was farmed out to was OOM killed. 
I've | provided lots of memory, but these models are pretty | massive and each inference run has to spin up a lot of | matrices in memory. | | This is all CPU inference, not GPU. | | When the pods get OOM killed, they spin up again. The | clusters for each speaker are about 5-10 pods apiece | (with some double tenancy). | afarrell wrote: | Is this ethical? | tdeck wrote: | I too expected more discussion of this. People play around with | these things because they're interesting, then mostly hand-wave | away concerns about the implications with "well, people will | just have to learn to be skeptical of recordings". But what | we're really doing is muddying a previously reliable avenue of | gaining quality evidence about the world. I expect this opinion | is unpopular on HN, but I think people shouldn't be developing | these things, companies shouldn't be working on them, and they | should be banned before they get to the point of causing real | harm. I also believe that _can_ be prevented by drying up | funding and research, because bad actors have to rely on the | body of existing work to make their bad actions practical. | slickQ wrote: | As NN models get more advanced, generated speech will get | progressively more convincing and less expensive to | produce, even if the models aren't built for speech | synthesis specifically. The same can be said for image | generation/transformation. If we are to continue developing AI, | then this is likely inevitable. There are benefits to these | models for mute people, for example. Adversarial models can | be built to detect fake audio samples. Regulation (e.g., adding | tells/signatures in commercial products) would also help. The | government would have to ban most AI research, or they would | only be prolonging the inevitable. | nepthar wrote: | Depends on your system of ethics. | phantom_rehan wrote: | Did you use machine learning? | echelon wrote: | Yeah, forks of two open-source PyTorch models. 
| | https://github.com/jaywalnut310/glow-tts | | https://github.com/seungwonpark/melgan | almstimplmntd wrote: | Played with the gilbert gottfried speaker option, gave me serious | Twin Peaks "Red Room" vibes. | Quequau wrote: | I just want to hear Douglas Rain's voice coming out of my | computer. | echelon wrote: | I've built a lot of celebrity text to speech models and host them | online: | | https://vo.codes | | It has celebrities like Sir David Attenborough and Arnold | Schwarzenegger, a bunch of the presidents, and also some | engineers: PG, Sam Altman, Peter Thiel, Mark Zuckerberg | | I'm not far away from a working "real time" [1] voice conversion | (VC) system. This turns a source voice into a target voice. The | most difficult part is getting it to generalize to new, unheard | speakers. I haven't recorded my progress recently, but here are | some old rudimentary results that make my voice sound slightly | like Trump [2]. If you know what my voice sounds like and you | kind of squint at it a little, the results are pretty neat. I'll | try to publish newer stuff soon, and that all sounds much better. | | I was just about to submit all of this to HN (on "new"). | | Edit: well, my post [3] didn't make it (it fell to the second | page of new). But I'll be happy to answer questions here. | | [1] It has about ~1500ms of lag, but I think it can be improved. | | [2] | https://drive.google.com/file/d/1vgnq09YjX6pYwf4ubFYHukDafxP... | | [3] I'm only linking this because it failed to reach popularity. | https://news.ycombinator.com/item?id=23965787 | dang wrote: | (These comments originally were in | https://news.ycombinator.com/item?id=23965106 but I've moved | them) | | We'll re-up that thread (see | https://news.ycombinator.com/item?id=11662380 for how this | works generally). I'm going to move this comment there as well | because it includes more background info than you posted there. | echelon wrote: | Thanks, dang! 
:) | [deleted] | martinesko36 wrote: | Having met some of the people on there, this is uncannily | accurate, especially in the High Quality voices. Well done! ___________________________________________________________________ (page generated 2020-07-27 23:00 UTC)