[HN Gopher] StyleTTS2 - open-source Eleven-Labs-quality Text To ...
___________________________________________________________________
StyleTTS2 - open-source Eleven-Labs-quality Text To Speech
Author : sandslides
Score  : 354 points
Date   : 2023-11-19 17:40 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| sandslides wrote:
| Just tried the Colab notebooks. Seems to be very good quality.
| It also supports voice cloning.
| fullstackchris wrote:
| Great stuff, took a look through the README but... what are the
| minimum hardware requirements to run this? Is this gonna blow
| up my CPU / hard drive?
| sandslides wrote:
| Not sure. The only inference demos are Colab notebooks. The
| models are approx 700 MB each, so I imagine they will run on a
| modest GPU.
| bbbruno222 wrote:
| Would it run on a cheap non-GPU server?
| dmw_ng wrote:
| Seems to run at about "2x realtime" on a 2015 4-core i7-6700HQ
| laptop, that is, 5 seconds to generate 10 seconds of output.
| Can imagine that being 4x or greater on a real machine.
| thot_experiment wrote:
| I skimmed the GitHub repo but didn't see any info on this: how
| long does it take to fine-tune to a particular voice?
| progbits wrote:
| > MIT license
|
| > Before using these models, you agree to [...]
|
| No, this is not MIT. If you don't like the MIT license then
| feel free to use something else, but you can't pretend this is
| open source and then attempt to slap additional restrictions on
| how the code can be used.
| sandslides wrote:
| Yes, I noticed that. Doesn't seem right, does it?
| weego wrote:
| I think you mis-parsed the disclaimer. It's just warning people
| that cloned voices come with a different set of rights from the
| software (because the person the voice is a clone of has rights
| to their voice).
| chrismorgan wrote:
| (Don't let's derail the conversation, please, but
| "disclaimer" is completely the wrong word here. This is a
| condition of use. A disclaimer is "this isn't mine" or "I'm
| not responsible for this". Disclaimers and disclosures are
| quite different things and commonly confused, but this isn't
| even either of them.)
| gosub100 wrote:
| This always annoys me when people put "disclaimers" on
| their posts. "IANAL": I'm so tired of hearing that one. It's
| pointless because even if you _were_ a lawyer, you cannot
| meaningfully comment on a case without the details,
| jurisdiction, circumstances, etc. Next, it's meaningless
| because: is anyone going to blindly bow down and obey if you
| state the opposite? "Yes, I AM a lawyer, you do not need to
| pay taxes, they are unconstitutional." Thirdly, when they
| "disclaimer" themselves as working at Google, that's not a
| _dis_-claimer, that's a "claimer", asserting the
| affirmative. I know their companies require them not to
| speak for the company without permission, but I hardly ever
| hear that one; usually it's just some useless self-disclosure
| that they might be biased because they work there. OK, who
| isn't biased?
|
| What bugs me overall is that it's usually vapid mimicry of
| a phrase they don't even understand.
| nielsole wrote:
| IANAL, but giving legal advice without being a lawyer may
| be illegal in some jurisdictions. Not sure if the
| disclaimer is effective or has ever been tested in court. The
| disclaimer/disclosure mix-up is super annoying, but
| disclosing obvious biases, even if not legally required,
| seems like good practice to me.
| gpm wrote:
| As I understand it, the source code is licensed MIT; the weights
| are licensed under a "weird proprietary license that doesn't
| explicitly grant you any rights and implicitly probably grants
| you some usage rights so long as you tell the listeners or have
| permission from the voice you cloned".
|
| Which, if you think the weights are copyrightable in the first
| place, makes them practically unusable for anything
| commercial/that you might get sued over, because relying on a
| vague implicit license is definitely not a good idea.
| ronsor wrote:
| And if you don't think weights are copyrightable, it means
| nothing at all.
| IshKebab wrote:
| I think that's referring to the pre-trained models, not the
| source code.
| ericra wrote:
| This bothered me as well. I opened an issue on the repo asking
| them to consider updating the license file to reflect these
| additional requirements.
|
| The wording they currently use suggests that this additional
| license requirement applies not only to their pre-trained
| models but to the MIT-licensed code as well.
| pdntspa wrote:
| As if anyone outside of corporate legal actually cares
| mlsu wrote:
| We're now at "free, local, AI friend that you can have
| conversations with on consumer hardware" territory.
|
| - synthesize an avatar using stablediffusion
|
| - synthesize conversation with llama
|
| - synthesize the voice with this text thing
|
| soon
|
| - VR
|
| - Video
|
| wild times!
| jpeter wrote:
| Which consumer GPU runs Llama 70B?
| sroussey wrote:
| Prosumer gear.
|
| MacBook Pro M3 Max.
| mlsu wrote:
| A Mac with a lot of unified RAM can do it, or a dual
| 3090/4090 setup gets you 48GB of VRAM.
| jadbox wrote:
| Does this actually work? I had thought that you can't use
| SLI to increase your net memory for the model?
| speedgoose wrote:
| It works. I use ollama these days, with litellm for the
| API compatibility, and it seems to use both 24GB GPUs on
| the server.
| benjaminwootton wrote:
| I've got a 64GB Mac M2. All of the openllm models seem to
| hang on startup or on API calls. I got them working through
| GCP Colab. Not sure if it's a configuration issue or if the
| hardware just isn't up to it?
| benreesman wrote:
| Valiant et al work great on my 64GB Studio at Q4_K_M.
| Happy to answer questions.
| wahnfrieden wrote:
| Try llama.cpp with Metal (critical) and GGUF models from
| TheBloke.
|
| Or wait another month or so for https://ChatOnMac.com
| brucethemoose2 wrote:
| A single 3090, or any 24GB GPU. Just barely.
|
| Yi 34B is a much better fit. I can cram 75K context onto 24GB
| without brutalizing the model with <3bpw quantization, like
| you have to do with 70B for 4K context.
| speedgoose wrote:
| Can it produce any meaningful outputs with such an extreme
| quantisation?
| brucethemoose2 wrote:
| Yeah, quite good actually, especially if you quantize it
| on text close to what you are trying to output.
|
| Llama 70B is a huge compromise at 2.65bpw... This does
| make the model much "dumber." Yi 34B is much better, as you
| can quantize it at ~4bpw and still have a huge context.
| lossolo wrote:
| How would you compare mistral-7b-instruct fp16 (or a
| similar 7B/13B model like llama2, etc.) to Yi-34B
| quantized?
| Hamcha wrote:
| Yup, and you can already mix and match both local and cloud AIs
| with stuff like SillyTavern/RealmPlay if you wanna try what the
| experience is like; people have been using it to roleplay for a
| while.
| cloudking wrote:
| Would be great to have a local home assistant voice interface
| with this + llama + whisper.
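A minimal sketch of what such a whisper + llama + StyleTTS2 loop
could look like, assuming the openai-whisper package for speech
recognition and a local model served through ollama's HTTP API. The
synthesize() helper is a hypothetical wrapper you would extract from
this repo's Demo notebooks (StyleTTS2 ships no packaged Python API),
and the 24 kHz output rate is likewise an assumption:

    # Sketch only: speech in -> local LLM -> speech out.
    import whisper
    import requests
    import soundfile as sf
    from styletts2_wrapper import synthesize  # hypothetical helper, not in this repo

    stt = whisper.load_model("base.en")  # openai-whisper

    def voice_turn(wav_path: str) -> str:
        text = stt.transcribe(wav_path)["text"]        # speech -> text
        answer = requests.post(
            "http://localhost:11434/api/generate",     # local ollama server
            json={"model": "llama2", "prompt": text, "stream": False},
        ).json()["response"]                           # text -> text
        wav = synthesize(answer)                       # text -> waveform (assumed 24 kHz)
        sf.write("reply.wav", wav, 24000)
        return answer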
| trafficante wrote:
| Seems like a fun afternoon project to get this hooked into one
| of the Skyrim TTS mods. I previously messed around with
| ElevenLabs, but it had too much latency and would be somewhat
| expensive long-term, so I'm excited to try local and free.
|
| I'm sure I have a lot of reading up to do first, but is it a
| safe assumption that I'd be better served running this on an M2
| MBP rather than taxing my desktop's poor 3070 running it on
| top of Skyrim VR?
| godelski wrote:
| Why name it Style<anything> if it isn't a StyleGAN? Looks like
| the first one wasn't either. Interesting to see moves away from
| flows, especially when none of the flows were modern.
|
| Also, is no one clicking on the audio links? There are some...
| questionable ones... and I'm pretty sure lots of mistakes.
| gwern wrote:
| > Looks like the first one wasn't either.
|
| The first one says it uses AdaIN layers to help control style?
| https://arxiv.org/pdf/2205.15439.pdf#page=2 Seems as
| justifiable as the original StyleGAN calling itself StyleX...
| godelski wrote:
| See my other comment. StyleGAN isn't about AdaIN. StyleGAN2
| even modified it.
| lhl wrote:
| It's not called a GAN TTS, right? StyleGAN is called what it is
| because of a "style-based" approach, and StyleTTS/2 seems to be
| doing the same (applying style transfer) through a different
| method (and disentangling style from the rest of the voice
| synthesis).
|
| (Actually, I looked at the original StyleTTS paper and it
| even partially uses AdaIN in the decoder, which is the
| same way that StyleGAN injected style information? Still, I
| think this is beside the point for the naming.)
| godelski wrote:
| Yeah, no, I get this, but the naming convention has become so
| prolific that anyone working in the generative space hears
| "Style<thing>" and thinks "GAN". (I work in generative
| vision, btw.)
|
| My point is not that it is technically right; it is that the
| name is strongly associated with the concept now. Such that if
| you use a style-based network and don't name it StyleX,
| it's odd and might look like you're trying to claim you've
| done more. Not that there aren't plenty of GANs that are
| using Karras's code and called something else.
|
| > AdaIN
|
| Yes, StyleGAN (version 1) uses AdaIN, but StyleGAN2 (and
| beyond) doesn't. AdaIN stands for Adaptive Instance
| Normalization. While they use it in that network, to be
| clear, they did not invent AdaIN, and the technique isn't
| specific to style; it's a normalization technique. One that
| StyleGAN2 modifies because the standard one creates strong
| and localized spikes in the statistics, which results in image
| artifacts.
| lhl wrote:
| So what I'm hearing is... no one should use "style" in its
| name anymore to describe style transfer because it's too
| closely associated with a set of models in a sub-field that
| uses a different concept to apply style that used "style"
| in its name, unless it also uses that unrelated concept in
| its implementation? Is that the gist of it? Because that
| sounds a bit mental.
|
| (I'm half kidding, I get what you mean, but also, think
| about it. The alternative is worse.)
| api wrote:
| It should be pretty easy to make training data for TTS. The
| Whisper STT models are open, so just chop up a ton of audio and
| use Whisper to annotate it, then train in the other direction
| to produce audio from text. So you're basically inverting
| Whisper.
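A rough sketch of that idea: Whisper's own segment timestamps can do
the chopping (with the caveat about clean data raised in the reply
below). This assumes the openai-whisper package; the filename|text
list it writes is LJSpeech-style and only approximates the
train-list format this repo actually expects:

    # Sketch only: build (audio, text) pairs by "inverting" Whisper.
    import os
    import soundfile as sf
    import whisper

    os.makedirs("wavs", exist_ok=True)
    model = whisper.load_model("medium.en")
    audio, sr = sf.read("narration.wav")        # long, reasonably clean speech
    result = model.transcribe("narration.wav")  # segments carry timestamps

    rows = []
    for i, seg in enumerate(result["segments"]):
        clip = audio[int(seg["start"] * sr):int(seg["end"] * sr)]
        sf.write(f"wavs/clip_{i:05d}.wav", clip, sr)
        rows.append(f"wavs/clip_{i:05d}.wav|{seg['text'].strip()}")

    with open("metadata.csv", "w") as f:
        f.write("\n".join(rows))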
| eginhard wrote:
| STT training data includes all kinds of "noisy" speech so that
| the model learns to recognise speech under any conditions. TTS
| training data needs to be as clean as possible so that you
| don't introduce artefacts in the output, and this high-quality
| data is much harder to get. A simple inversion is not really
| feasible, or at least requires filtering out much of the data.
| satvikpendem wrote:
| Funnily enough, the TTS2 examples sound _better_ than the ground
| truth [0]. For example, the "Then leaving the corpse within the
| house [...]" example has the ground truth pronounce "house"
| weirdly, with some change in the tonality that sounds higher,
| but the TTS2 version sounds more natural.
|
| I'm excited to use this for all my ePub files, many of which
| don't have corresponding audiobooks, such as a lot of Japanese
| light novels. I am currently using Moon+ Reader on Android,
| which has TTS, but it is very robotic.
|
| [0] https://styletts2.github.io/
| risho wrote:
| how are you planning on using this with epubs? i'm in a similar
| boat. would really like to leverage something like this for
| ebooks.
| satvikpendem wrote:
| I wonder if you can add a TTS engine to Android as an app or
| plugin, then have Moon+ Reader or another reader use that
| custom engine. That's probably how I'd do it, as the easiest
| approach, but if that doesn't work, I might just have to make
| my own app.
| a_wild_dandan wrote:
| I'm planning on making a self-hosted solution where you can
| upload files and the host sends back the audio to play, as
| a first pass on this tech. I'll open source the repo after
| fiddling and prototyping. I've needed this kinda thing for
| a long time!
| risho wrote:
| Please make sure to link it back to HN so that we can
| check it out!
| jrpear wrote:
| You can! [rhvoice](https://rhvoice.org/) is an open source
| example.
| KolmogorovComp wrote:
| The pace is better, but imho there is still a very
| noticeable "metallic" tone which makes it inferior to the real
| thing.
|
| Impressive results nonetheless, and superior to all other TTS.
| lhl wrote:
| I tested StyleTTS2 last month; my step-by-step notes might
| be useful for people doing local setup (not too hard):
| https://llm-tracker.info/books/howto-guides/page/styletts-2
|
| Also, I did a little speed/quality shootout with the LJSpeech
| model (vs VITS and XTTS). StyleTTS2 was pretty good and very
| fast: https://fediverse.randomfoo.net/notice/AaOgprU715gcT5GrZ2
| kelseyfrog wrote:
| > inferences at up to 15-95X (!) RT on my 4090
|
| That's incredible!
|
| Are infill and outpainting equivalents possible? Super-RT TTS
| at this level of quality opens up a diverse array of uses, esp
| for indie/experimental gamedev, that I'm excited for.
| refulgentis wrote:
| Not sure what you mean: If you mean could inpainting and
| outpainting with image models be faster, it's a "not even
| wrong" question, similar to asking if the United Airlines app
| could get faster because American Airlines did. (Yes, getting
| faster is an option available to ~all code.)
|
| If you mean could you inpaint and outpaint text... yes, by
| inserting and deleting characters.
|
| If you mean could you use an existing voice clip to generate
| speech by the same speaker in the clip, yes; part of the
| article is demonstrating generating speech by speakers not
| seen at training time.
| pedrovhb wrote:
| I'm not sure I understand what you mean to say.
| To me it's a reasonable question asking whether text to speech
| models can complete a missing part of some existing speech
| audio, or make it go on for longer, rather than only generating
| speech from scratch. I don't see a connection to your
| faster-apps analogy.
|
| Fwiw, I imagine this is possible, at least to some extent.
| I was recently playing with xtts and it can generate
| speaker embeddings from short periods of speech, so you
| could use those to provide a logical continuation to
| existing audio. However, I'm not sure it's possible to
| manage the "seams" between what is generated and what is
| preexisting very easily yet.
|
| It's certainly not a misguided question to me. Perhaps you
| could be less curt and offer your domain knowledge to
| contribute to the discussion?
|
| Edit: I see you've edited your post to be more informative;
| thanks for sharing more of your thoughts.
| refulgentis wrote:
| It imposes a cost on others when you make false claims,
| like that I said or felt the question was unreasonable.
|
| I didn't and don't.
|
| It is a hard question to understand and an interesting
| mind-bender to answer.
|
| Less policing of the metacontext and more focusing on the
| discussion at hand will help ensure there are interlocutors
| around to, at the very least, continue policing.
| kelseyfrog wrote:
| Ignore the speed comment; it is unrelated to my question.
|
| What I mean is, can output be conditioned on antecedent
| audio as well as text, analogous to how image diffusion
| models can condition inpainting and outpainting on static
| parts of an image and CLIP embeddings?
| refulgentis wrote:
| Yes, the paper and Eleven Labs have a major feature of
| "given $AUDIO_SET, generate speech for $TEXT in the same
| style as $AUDIO_SET".
|
| No, in that you can't cut it at an arbitrary midword
| point, say at "what tim" in "what time is it in beijing", and
| give it the string "what time is it in beijing", and have
| it recover seamlessly.
|
| Yes, in that you can cut it at an arbitrary phoneme
| boundary, say 'this, I.S. a; good: test! ok?' in IPA is
| 'd'Is, ,aI,es'eI; g'Ud: t'est! ,oUk'eI?', and I can cut
| it 'between' a phoneme, give it the text, and have it
| complete.
| kelseyfrog wrote:
| Perfect! Thank you.
| huac wrote:
| It is theoretically possible to train a model that, given
| some speech, attempts to continue the speech, e.g. Spectron:
| https://michelleramanovich.github.io/spectron/spectron/.
| Similarly, it is possible to train a model to edit the
| content, a la Voicebox:
| https://voicebox.metademolab.com/edit.html.
| jasonjmcghee wrote:
| I've been playing with XTTSv2 on my 3080 Ti, and it's slightly
| faster than the length of the final audio. It's also good
| quality, but these samples sound better.
|
| Excited to try it out!
| gjm11 wrote:
| HN title at present is "StyleTTS2 - open-source Eleven Labs
| quality Text To Speech". The actual title at the far end
| doesn't name any particular other product; the arXiv paper
| linked from there doesn't mention Eleven Labs either. I thought
| this sort of editorializing was frowned on.
| stevenhuang wrote:
| Eleven Labs is the gold standard for voice synthesis. There is
| nothing better out there.
|
| So it is extremely notable for an open source system to be able
| to approach this level of quality, which is why I'd imagine
| most would appreciate the comparison. I know it caught my
| attention.
| lucubratory wrote:
| OpenAI's TTS is better than Eleven Labs, but they don't let
| you train it to have a particular voice out of fear of the
| consequences.
| huac wrote:
| I concur that, for the use cases that OpenAI's voices
| cover, it is significantly better than Eleven.
| GaggiX wrote:
| Yes, it's against the guidelines. In fact, when I read the
| title, I didn't think it was a new research paper but a random
| GitHub project.
| modeless wrote:
| It is editorializing and it is an exaggeration. However, I've
| been using StyleTTS2 myself, and IMO it is the best open source
| TTS by far and definitely deserves a spot at the top of HN for
| a while.
| stevenhuang wrote:
| I really want to try this, but making the venv to install all
| the torch dependencies is starting to get old lol.
|
| How are other people dealing with this? Is there an easy way to
| get multiple venvs to share, like, a common torch venv? I can
| do this manually, but I'm wondering if there's a tool out there
| that does this.
| wczekalski wrote:
| I use nix to set up the python env (python version + poetry +
| sometimes python packages that are difficult to install with
| poetry) and use poetry for the rest.
|
| The workflow is:
|     > nix flake init -t github:dialohq/flake-templates#python
|     > nix develop -c $SHELL
|     > # I'm in the shell with the poetry env; I have a shell
|     >   hook in the nix devenv that does poetry install and
|     >   poetry activate.
| lukasga wrote:
| Can relate to this problem a lot. I have considered starting
| to use a Docker dev container, making a base image for shared
| dependencies which I can then customize in a Dockerfile for
| each new project; not sure if there's a better alternative
| though.
| eurekin wrote:
| Same here. I'm using conda and eyeing simply installing
| pytorch into the base conda env.
| lhl wrote:
| I don't think "base" works like that (while it can be a
| fallback for some dependencies, afaik Python packages are
| isolated/not in path). But even if you could, don't do it.
| Different packages usually have different pytorch
| dependencies (often CUDA as well), and it will definitely bite
| you.
|
| The biggest optimization I've found is to use mamba for
| everything. It's ridiculously faster than conda for package
| resolution. With everything cached, you're mostly just
| waiting for your SSD at that point.
|
| (I suppose you _could_ add the base env's lib path to the
| end of your PYTHONPATH, but that sounds like a sure way to
| get bitten by weird dependency/reproducibility issues down
| the line.)
| stavros wrote:
| I generally try to use Docker for this stuff, but yeah, it's
| the main reason why I pass on these, even though I've been
| looking for something like this. It's just too hard to figure
| out the dependencies.
| victorbjorklund wrote:
| This only works for English voices, right?
| e12e wrote:
| No? From the readme:
|
| In Utils folder, there are three pre-trained models:
|
| ASR folder: It contains the pre-trained text aligner, which was
| pre-trained on English (LibriTTS), Japanese (JVS), and Chinese
| (AiShell) corpus. It works well for most other languages
| without fine-tuning, but you can always train your own text
| aligner with the code here: yl4579/AuxiliaryASR.
|
| JDC folder: It contains the pre-trained pitch extractor, which
| was pre-trained on English (LibriTTS) corpus only. However, it
| works well for other languages too because F0 is independent of
| language.
| If you want to train on singing corpus, it is recommended to
| train a new pitch extractor with the code here:
| yl4579/PitchExtractor.
|
| PLBERT folder: It contains the pre-trained PL-BERT model, which
| was pre-trained on English (Wikipedia) corpus only. It probably
| does not work very well on other languages, so you will need to
| train a different PL-BERT for different languages using the
| repo here: yl4579/PL-BERT. You can also replace this module
| with other phoneme BERT models like XPhoneBERT, which is
| pre-trained on more than 100 languages.
| modeless wrote:
| Those are just parts of the system and don't make a complete
| TTS. In theory you could train a complete StyleTTS2 for other
| languages, but currently the pretrained models are English
| only.
| svapnil wrote:
| How fast is inference with this model?
|
| For reference, I'm using 11Labs to synthesize short messages -
| maybe a sentence or something, using voice cloning, and I'm
| getting around 400-500ms response times.
|
| Is there any open-source solution that gets me to around the
| same inference time?
| wczekalski wrote:
| It depends on hardware, but IIRC on V100s it took 0.01-0.03s
| for 1s of audio.
| eigenvalue wrote:
| It was somewhat annoying to get everything to work, as the
| documentation is a bit spotty, but after ~20 minutes it's all
| working well for me on WSL Ubuntu 22.04. Sound quality is very
| good, much better than other open source TTS projects I've
| seen. It's also SUPER fast (at least using a 4090 GPU).
|
| Not sure it's quite up to Eleven Labs quality. But to me, what
| makes Eleven so cool is that they have a large library of
| high-quality voices that are easy to choose from. I don't yet
| see any way with this library to get a different voice from the
| default female voice.
|
| Also, the real special sauce for Eleven is the near-instant
| voice cloning with just a single 5-minute sample, which works
| shockingly (even spookily) well. Can't wait to have that all
| available in a fully open source project! The services that
| provide this as an API are just too expensive for many use
| cases. Even the OpenAI one, which is on the cheaper side, costs
| ~10 cents for a couple-thousand-word generation.
| wczekalski wrote:
| Have you tested longer utterances with both ElevenLabs and with
| StyleTTS? Short audio synthesis is a ~solved problem in the TTS
| world, but things start falling apart once you want to do
| something like create an audiobook with text to speech.
| wingworks wrote:
| I can say that the paid service from ElevenLabs can do
| long-form TTS very well. I used it for a while to convert long
| articles to voice to listen to later instead of reading. It
| works very well. I only stopped because it gets a little
| pricey.
| wczekalski wrote:
| One thing I've seen done for style cloning is a high-quality
| fine-tuned TTS -> RVC pipeline to "enhance" the output. TTS for
| intonation + pronunciation, RVC for voice texture. With
| StyleTTS and this pipeline you should get close to ElevenLabs.
| eigenvalue wrote:
| I suspect they are doing many more things to make it sound
| better. I certainly hope open source solutions can approach
| that level of quality, but so far I've been very disappointed.
| sandslides wrote:
| The LibriTTS demo clones unseen speakers from a roughly
| five-second clip.
| eigenvalue wrote:
| Ah ok, thanks. I tried the other demo.
| eigenvalue wrote:
| I tried it. Sounds absolutely nothing like my voice or my
| wife's voice.
| I used the same sample files as I used 2 days ago on the Eleven
| Labs website, and they worked flawlessly there. So this is
| very, very far from being close to "Eleven Labs quality" when
| it comes to voice cloning.
| sandslides wrote:
| The speech generated is the best I've heard from an open
| source model. The one test I made didn't produce an exact
| clone either, but these are still early days. There's likely
| something not quite right. The cloned voice does speak
| without any artifacts or other weirdness that most TTS
| systems suffer from.
| thot_experiment wrote:
| Ah, that's disappointing. Have you tried
| https://git.ecker.tech/mrq/ai-voice-cloning ? I've had
| decent results with that, but inference is quite slow.
| jsjmch wrote:
| ElevenLabs is based on Tortoise-TTS, which was already
| pre-trained on millions of hours of data, but this one
| was only trained on LibriTTS, which is 500 hours at best.
| If you have seen millions of voices, there are definitely
| gonna be some of them that sound like you. It is just a
| matter of training data, but it is very difficult to have
| someone collect these large amounts of data and train on
| them.
| eigenvalue wrote:
| To save people some time, this is tested on Ubuntu 22.04
| (Google is being annoying about the download link, saying too
| many people have downloaded it in the past 24 hours, but if you
| wait a bit it should work again):
|     git clone https://github.com/yl4579/StyleTTS2.git
|     cd StyleTTS2
|     python3 -m venv venv
|     source venv/bin/activate
|     python3 -m pip install --upgrade pip
|     python3 -m pip install wheel
|     pip install -r requirements.txt
|     pip install phonemizer
|     sudo apt-get install -y espeak-ng
|     pip install gdown
|     gdown https://drive.google.com/uc?id=1K3jt1JEbtohBLUA0X75KLw36TW7U1yxq
|     7z x Models.zip
|     rm Models.zip
|     gdown https://drive.google.com/uc?id=1jK_VV3TnGM9dkrIMsdQ_upov8FrIymr7
|     7z x Models.zip
|     rm Models.zip
|     pip install ipykernel pickleshare nltk SoundFile
|     python -c "import nltk; nltk.download('punkt')"
|     pip install --upgrade jupyter ipywidgets librosa
|     python -m ipykernel install --user --name=venv --display-name="Python (venv)"
|     jupyter notebook
|
| Then navigate to /Demo and open either
| `Inference_LJSpeech.ipynb` or `Inference_LibriTTS.ipynb` and
| they should work.
| Evidlo wrote:
| What's a ballpark estimate for inference time on a modern CPU?
| beltsazar wrote:
| If AI renders some jobs obsolete, I suppose the first ones
| will be audiobook narrators and voice actors.
| washadjeffmad wrote:
| Hardly. Imagine licensing your voice to Amazon so that any
| customer could stream any book narrated in your likeness
| without you having to commit the time to record. You could
| still work as a custom voice artist, all with a "no clone"
| clause if you chose. You could profit from your performance and
| craft in a fraction of the time, focusing as your own agent on
| the management of your assets. Or, you could just keep and
| commit to your day job.
|
| Just imagine hearing the final novel of ASoIaF narrated by Roy
| Dotrice and knowing that a royalty went to his family and
| estate, or if David Attenborough willed the digital likeness of
| his voice and its performance to the BBC for use in nature
| documentaries after his death.
|
| The advent of recorded audio didn't put artists out of
| business; it expanded the industries that relied on them by
| allowing more of them to work.
| Film and tape didn't put artists out of business; they expanded
| the industries that relied on them by allowing more of them to
| work. Audio digitization and the internet didn't put artists
| out of business; they expanded the industries that relied on
| them by allowing more of them to work.
|
| And TTS won't put artists out of business, but it will create
| yet another new market with another niche that people will have
| to figure out how to monetize, even though 98% of the revenues
| will still somehow end up with the distributors.
| nikkwong wrote:
| What you're not considering here is that a large majority of
| this industry is made up of no-name voice actors who have a
| pleasant (but perfectly substitutable) voice, which is now
| something that AI can do perfectly and at a fraction of the
| price.
|
| Sure, celebrities and other well-known figures will have more
| to gain here, as they can license out their voices; but the
| majority of voice actors won't be able to capitalize on this.
| So this is actually even more perverse, because it again
| creates a system where all assets will accumulate at the top
| and there won't be any distributions for anyone else.
| bongodongobob wrote:
| The point is no one will pay for any of that if you can just
| clone someone's voice locally. Or just tell the AI how you
| want it to sound. Your argument literally ignores the entire
| elephant in the room.
| riquito wrote:
| I can see a future where the label "100% narrated by a human"
| (and similar in other industries) will be a thing.
| tomcam wrote:
| Very impressive. It would take me a long time to even guess
| that some of these are text to speech.
| carbocation wrote:
| Curious if we'll see a Civitai-style LoRA[1] marketplace for
| text-to-speech models.
|
| 1 = https://github.com/microsoft/LoRA
| swyx wrote:
| silicon valley is very leaky; eleven labs is widely rumored to
| have raised a huge round recently. great timing, because with
| OpenAI's TTS and now this thing, the options in the market have
| just expanded greatly.
| readyplayernull wrote:
| Someone please create a TTS with marked-down
| emotions/intonations.
| wg0 wrote:
| The quality is really, really INSANE and pretty much
| unimaginable in the early 2000s.
|
| Could have interesting prospects for games where you have an
| LLM assuming a character and such TTS giving those NPCs a
| voice.
| abraae wrote:
| This is a big thing for one area I'm interested in - golf
| simulation.
|
| Currently, playing in a golf simulator has a bit of a
| post-apocalyptic vibe. The birds are cheeping, the grass is
| rustling, the gameplay is realistic, but there's not a human
| to be seen. Just so different from the smack-talking of a real
| round, or the crowd noise at a big game.
|
| It's begging for some LLM-fuelled banter to be added.
| billylo wrote:
| Or the occasional "Fore!!"s. :-)
| wahnfrieden wrote:
| Is there a way to port this to iOS? Apple doesn't provide an
| API for their version of this.
| ddmma wrote:
| Well done, been waiting for a moment like this. Will give it a
| try!
| zsoltkacsandi wrote:
| Is it possible to somehow optimize the model to run on a
| Raspberry Pi with 4 GB of RAM?
| modeless wrote:
| I made a 100% local voice chatbot using StyleTTS2 and other
| open source pieces (Whisper and OpenHermes2-Mistral-7B). It
| responds _so_ much faster than ChatGPT. You can have a real
| conversation with it instead of the stilted Siri-style
| interaction you have with other voice assistants. Fun to play
| with!
|
| Anyone who has a Windows gaming PC with a 12 GB Nvidia GPU
| (tested on a 3060 12GB) can install and converse with StyleTTS2
| with one click, no fiddling with Python or CUDA needed:
| https://apps.microsoft.com/detail/9NC624PBFGB7
|
| The demo is janky in various ways (requires headphones, has no
| UI to speak of, voice recognition sometimes fails), but it's a
| sneak peek at what will soon be possible to run on a normal
| gaming PC just by putting together open source pieces. The
| models are improving rapidly; there are already several
| improved models I haven't yet incorporated.
| lucubratory wrote:
| How hard, on your end, does the task of making the chatbot
| converse naturally look? Specifically, I'm thinking about
| interruptions: if it's talking too long, I would like to be
| able to start talking and interrupt it, like in a normal
| conversation, or if I'm saying something, it could quickly
| interject something. Once you've got the extremely high speed,
| theoretically faster than real time, you can start doing that
| stuff, right?
|
| There is another thing remaining after that for fully natural
| conversation, which is making the AI context-aware like a human
| would be. Basically giving it eyes so it can see your face and
| judge body language to know if it's talking too long and needs
| to be more brief, the same way a human talks.
| modeless wrote:
| Yes, I implemented the ability to interrupt the chatbot while
| it is talking. It wasn't too hard, although it does require
| you to wear headphones so the bot doesn't hear itself and get
| interrupted.
|
| The other way around (the bot interrupting the user) is hard.
| Currently the bot starts processing a response after every
| word that the voice recognition outputs, to reduce latency.
| When new words come in before the response is ready, it starts
| over. If it finishes its response before any more words
| arrive (~1 second usually), it starts speaking. This is not
| ideal, because the user might not be done speaking, of course.
| If the user continues speaking, the bot will stop and listen.
| But deciding when the user is done speaking (or if the bot
| should interrupt before the user is done) is a hard problem.
| It could possibly be done zero-shot by prompting an LLM,
| but you'd probably need a GPT-4-level LLM to do a good job,
| and GPT-4 is too slow for instant responses right now. A
| better idea would be to train a turn-taking model that
| predicts who should speak next in conversations. I haven't
| thought much about how to source a dataset and train a model
| for that yet.
|
| Ultimately, the end state of this type of system is a complete
| end-to-end audio-to-audio language model. There should be
| only one model; it should take audio directly as input and
| produce audio directly as output. I believe that having TTS
| and voice recognition and language modeling all as separate
| systems will not get us to 100% natural human conversation. I
| think that such a system would be within reach of today's
| hardware too; all you need is the right training
| dataset/procedure and some architecture bits to make it
| efficient.
| causality0 wrote:
| What are the chances this gets packaged into something a little
| more streamlined to use? I have a lot of ebooks I'd love to
| generate audio versions of.
| carbocation wrote:
| Having now tried it (the linked repo links to pre-built Colab
| notebooks):
|
| 1) It does a fantastic job of text-to-speech.
|
| 2) I have had no success in getting any meaningful zero-shot
| voice cloning working. It technically runs and produces a
| voice, but it sounds nothing like the target voice. (This
| includes trying their microphone-based self-voice-cloning
| option.)
|
| Presumably fine-tuning is needed - but I am curious if anyone
| had better luck with the zero-shot approach.
___________________________________________________________________
(page generated 2023-11-19 23:00 UTC)