[HN Gopher] StyleTTS2 - open-source Eleven-Labs-quality Text To ...
       ___________________________________________________________________
        
       StyleTTS2 - open-source Eleven-Labs-quality Text To Speech
        
       Author : sandslides
       Score  : 354 points
       Date   : 2023-11-19 17:40 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | sandslides wrote:
        | Just tried the Colab notebooks. Seems to be very good quality.
       | It also supports voice cloning.
        
         | fullstackchris wrote:
         | Great stuff, took a look through the README but... what are the
         | minimum hardware requirements to run this? Is this gonna blow
          | up my CPU / hard drive?
        
           | sandslides wrote:
            | Not sure. The only inference demos are Colab notebooks. The
            | models are approx 700 MB each, so I imagine it will run on a
            | modest GPU.
        
             | bbbruno222 wrote:
             | Would it run in a cheap non-GPU server?
        
               | dmw_ng wrote:
                | Seems to run at about "2x realtime" on a 2015 4-core
                | i7-6700HQ laptop, that is, 5 seconds to generate 10
                | seconds of output. I can imagine that being 4x or greater
                | on a real machine.
        
         | thot_experiment wrote:
         | I skimmed the github but didn't see any info on this, how long
         | does it take to finetune to a particular voice?
        
       | progbits wrote:
       | > MIT license
       | 
       | > Before using these models, you agree to [...]
       | 
       | No, this is not MIT. If you don't like MIT license then feel free
       | to use something else, but you can't pretend this is open source
       | and then attempt to slap on additional restrictions on how the
       | code can be used.
        
         | sandslides wrote:
          | Yes, I noticed that. Doesn't seem right, does it?
        
         | weego wrote:
         | I think you mis-parsed the disclaimer. It's just warning people
          | that cloned voices come with a different set of rights than the
         | software (because the person the voice is a clone of has rights
         | to their voice).
        
           | chrismorgan wrote:
           | (Don't let's derail the conversation, please, but
           | "disclaimer" is completely the wrong word here. This is a
           | condition of use. A disclaimer is "this isn't mine" or "I'm
           | not responsible for this". Disclaimers and disclosures are
           | quite different things and commonly confused, but this isn't
           | even either of them.)
        
             | gosub100 wrote:
              | This always annoys me when people put "disclaimers" on
              | their posts. IANAL, so tired of hearing that one. It's
              | pointless because even if you _were_ a lawyer, you cannot
              | meaningfully comment on a case without the details,
              | jurisdiction, circumstance, etc. Next, it's meaningless:
              | is anyone going to blindly bow down and obey if you state
              | the opposite? "Yes, I AM a lawyer, you do not need to pay
              | taxes, they are unconstitutional." Thirdly, when they
              | "disclaimer" themselves as working at Google, that's not a
              | _dis_-claimer, that's a "claimer", asserting the
              | affirmative. I know their companies require them not to
              | speak for the company without permission, but I hardly ever
              | hear that one; usually it's just some useless self-
              | disclosure that they might be biased because they work
              | there. Ok, who isn't biased?
             | 
             | What bugs me overall is that it's usually vapid mimicry of
             | a phrase they don't even understand.
        
               | nielsole wrote:
               | Ianal, but giving legal advice without being a lawyer may
               | be illegal in some jurisdictions. Not sure if the
               | disclaimer is effective or was ever tested in court. The
               | disclaimer/disclosure mix-up is super annoying, but
               | disclosing obvious biases even if not legally required
               | seems like good practice to me.
        
         | gpm wrote:
         | As I understand it the source code is licensed MIT, the weights
         | are licensed "weird proprietary license that doesn't explicitly
         | grant you any rights and implicitly probably grants you some
         | usage rights so long as you tell the listeners or have
         | permission from the voice you cloned".
         | 
          | Which, if you think the weights are copyrightable in the first
          | place, makes them practically unusable for anything commercial
          | or anything you might get sued over, because relying on a vague
          | implicit license is definitely not a good idea.
        
           | ronsor wrote:
           | And if you don't think weights are copyrightable, it means
           | nothing at all.
        
         | IshKebab wrote:
         | I think that's referring to the pre-trained models, not the
         | source code.
        
         | ericra wrote:
         | This bothered me as well. I opened an issue on the repo asking
         | them to consider updating the license file to reflect these
         | additional requirements.
         | 
          | The wording they currently use suggests that this additional
          | license requirement applies to more than just their pre-trained
          | models.
        
         | pdntspa wrote:
         | As if anyone outside of corporate legal actually cares
        
       | mlsu wrote:
       | We're now at "free, local, AI friend that you can have
       | conversations with on consumer hardware" territory.
       | 
       | - synthesize an avatar using stablediffusion
       | 
       | - synthesize conversation with llama
       | 
       | - synthesize the voice with this text thing
       | 
       | soon
       | 
       | - VR
       | 
       | - Video
       | 
       | wild times!
        
         | jpeter wrote:
         | Which consumer gpu runs llama 70B?
        
           | sroussey wrote:
           | Prosumer gear.
           | 
           | MacBook Pro M3 Max.
        
           | mlsu wrote:
           | A Mac with a lot of unified RAM can do it, or a dual
           | 3090/4090 setup gets you 48gb of VRAM.
        
             | jadbox wrote:
                | Does this actually work? I had thought that you can't use
                | SLI to increase your net memory for the model?
        
               | speedgoose wrote:
               | It works. I use ollama these days, with litellm for the
               | api compatibility, and it seems to use both 24GB GPUs on
               | the server.
        
             | benjaminwootton wrote:
             | I've got a 64gb Mac M2. All of the openllm models seem to
             | hang on startup or on API calls. I got them working through
             | GCP colab. Not sure if it's a configuration issue or if the
             | hardware just isn't up to it?
        
               | benreesman wrote:
               | Valiant et al work great on my 64Gb Studio at Q4_K_M.
               | Happy to answer questions.
        
               | wahnfrieden wrote:
               | Try llama.cpp with Metal (critical) and GGUF models from
               | TheBloke
               | 
               | Or wait another month or so for https://ChatOnMac.com
        
           | brucethemoose2 wrote:
           | A single 3090, or any 24GB GPU. Just barely.
           | 
           | Yi 34B is a much better fit. I can cram 75K context onto 24GB
           | without brutalizing the model with <3bpw quantization, like
           | you have to do with 70B for 4K context.
        
             | speedgoose wrote:
             | Can it produce any meaningful outputs with such an extreme
             | quantisation?
        
               | brucethemoose2 wrote:
               | Yeah, quite good actually, especially if you quantize it
               | on text close to what you are trying to output.
               | 
                | Llama 70B is a huge compromise at 2.65bpw... This does
                | make the model much "dumber." Yi 34B is much better, as
                | you can quantize it at ~4bpw and still have a huge
                | context.
        
               | lossolo wrote:
               | How would you compare mistral-7b-instruct 16fp (or
               | similar 7b/13b model like llama2 etc) to Yi-34b
               | quantized?
        
         | Hamcha wrote:
         | Yup, and you can already mix and match both local and cloud AIs
         | with stuff like SillyTavern/RealmPlay if you wanna try what the
         | experience is like, people have been using it to roleplay for a
         | while.
        
         | cloudking wrote:
         | Would be great to have a local home assistant voice interface
         | with this + llama + whisper.
        
         | trafficante wrote:
         | Seems like a fun afternoon project to get this hooked into one
         | of the Skyrim TTS mods. I previously messed around with
         | elevenlabs, but it had too much latency and would be somewhat
         | expensive long term so I'm excited to try local and free.
         | 
         | I'm sure I have a lot of reading up to do first, but is it a
         | safe assumption that I'd be better served running this on an m2
         | mbp rather than tax out my desktop's poor 3070 running it on
         | top of Skyrim VR?
        
       | godelski wrote:
       | Why name it Style<anything> if it isn't a StyleGAN? Looks like
       | the first one wasn't either. Interesting to see moves away from
       | flows, especially when none of the flows were modern.
       | 
       | Also, is no one clicking on the audio links? There are some...
       | questionable ones... and I'm pretty sure lots of mistakes.
        
         | gwern wrote:
         | > Looks like the first one wasn't either.
         | 
         | The first one says it uses AdaIN layers to help control style?
         | https://arxiv.org/pdf/2205.15439.pdf#page=2 Seems as
         | justifiable as the original StyleGAN calling itself StyleX...
        
           | godelski wrote:
           | See my other comment. StyleGAN isn't about AdaIN. StyleGAN2
           | even modified it.
        
         | lhl wrote:
         | It's not called a GAN TTS right? StyleGAN is called what it is
         | because of a "style-based" approach and StyleTTS/2 seems to be
         | doing the same (applying style transfer) through different
         | method (and disentangling style from the rest of the voice
         | synthesis).
         | 
         | (Actually, looked at the original StyleTTS paper and it
         | actually even partially uses AdaIN in the decoder, which is the
         | same way that StyleGAN injected style information? Still, I
         | think is besides the point for the naming.)
        
           | godelski wrote:
            | Yeah, no, I get this, but the naming convention has become so
            | prolific that anyone working in the generative space hears
            | "Style<thing>" and thinks "GAN". (I work in generative vision
            | btw)
           | 
            | My point is not that it is technically right, it is that the
            | name is strongly associated with the concept now, such that
            | if you use a style-based network and don't name it StyleX,
            | it's odd and might look like you're trying to claim you've
            | done more. Not that there aren't plenty of GANs that are
            | using Karras's code and called something else.
           | 
           | > AdaIN
           | 
           | Yes, StyleGAN (version 1) uses AdaIN but StyleGAN2 (and
           | beyond) doesn't. AdaIN stands for Adaptive Instance
           | Normalization. While they use it in that network, to be
              | clear, they did not invent AdaIN, and the technique isn't
              | specific to style; it's a normalization technique, one that
           | StyleGAN2 modifies because the standard one creates strong
           | and localized spikes in the statistics which results in image
           | artifacts.
        
             | lhl wrote:
             | So what I'm hearing is... no one should use "style" in its
             | name anymore to describe style transfers because it's too
             | closely associated with a set of models in a sub-field that
             | uses a different concept to apply style that used "style"
             | in its name, unless it also uses that unrelated concept in
             | its implementation? Is that the gist of it, because that
             | sounds a bit mental.
             | 
             | (I'm half kidding, I get what you mean, but also, think
             | about it. The alternative is worse.)
        
       | api wrote:
       | It should be pretty easy to make training data for TTS. The
       | Whisper STT models are open so just chop up a ton of audio and
       | use Whisper to annotate it, then train the other direction to
       | produce audio from text. So you're basically inverting Whisper.
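        | 
        | A rough sketch of the annotation half (assuming the open-source
        | whisper package and a folder of pre-chopped clips; the output is
        | just an LJSpeech-style filename|transcript list you'd feed to
        | whatever TTS trainer you use):
        | 
        |     import csv, pathlib
        |     import whisper  # pip install openai-whisper
        | 
        |     # Bigger Whisper models give cleaner transcripts, at the cost of speed.
        |     model = whisper.load_model("medium")
        | 
        |     rows = []
        |     for clip in sorted(pathlib.Path("clips").glob("*.wav")):
        |         # Transcribe each pre-chopped clip (fp16=False keeps it CPU-friendly).
        |         result = model.transcribe(str(clip), fp16=False)
        |         rows.append((clip.name, result["text"].strip()))
        | 
        |     # LJSpeech-style metadata: one "filename|transcript" line per clip.
        |     with open("metadata.csv", "w", newline="") as f:
        |         csv.writer(f, delimiter="|").writerows(rows)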
        
         | eginhard wrote:
         | STT training data includes all kinds of "noisy" speech so that
         | the model learns to recognise speech in any conditions. TTS
         | training data needs to be as clean as possible so that you
         | don't introduce artefacts in the output and this high-quality
         | data is much harder to get. A simple inversion is not really
         | feasible or at least requires filtering out much of the data.
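          | 
          | A crude first-pass filter might look something like this
          | (purely illustrative, using librosa and a loud-vs-quiet frame
          | ratio as a rough stand-in for a real SNR/quality estimate):
          | 
          |     import pathlib
          |     import librosa
          |     import numpy as np
          | 
          |     def rough_snr_db(path):
          |         # Compare loud frames (mostly speech) to quiet frames
          |         # (mostly background) as a very rough cleanliness proxy.
          |         y, sr = librosa.load(str(path), sr=16000)
          |         rms = librosa.feature.rms(y=y)[0]
          |         speech = np.percentile(rms, 90)
          |         noise = np.percentile(rms, 10) + 1e-8
          |         return 20 * np.log10(speech / noise)
          | 
          |     clips = sorted(pathlib.Path("clips").glob("*.wav"))
          |     # Keep only clips that look clean; the threshold is arbitrary.
          |     clean = [p for p in clips if rough_snr_db(p) > 30]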
        
       | satvikpendem wrote:
       | Funnily enough, the TTS2 examples sound _better_ than the ground
       | truth [0]. For example, the  "Then leaving the corpse within the
       | house [...]" example has the ground truth pronounce "house"
       | weirdly, with some change in the tonality that sounds higher, but
       | the TTS2 version sounds more natural.
       | 
       | I'm excited to use this for all my ePub files, many of which
       | don't have corresponding audiobooks, such as a lot of Japanese
       | light novels. I am currently using Moon+ Reader on Android which
       | has TTS but it is very robotic.
       | 
       | [0] https://styletts2.github.io/
        
         | risho wrote:
         | how are you planning on using this with epubs? i'm in a similar
         | boat. would really like to leverage something like this for
         | ebooks.
        
           | satvikpendem wrote:
           | I wonder if you can add a TTS engine to Android as an app or
           | plugin, then make Moon+ Reader or another reader to use that
           | custom engine. That's probably how I'd do it for the easiest
           | approach, but if that doesn't work, I might just have to make
           | my own app.
        
             | a_wild_dandan wrote:
             | I'm planning on making a self-host solution where you can
             | upload files and the host sends back the audio to play, as
             | a first pass on this tech. I'll open source the repo after
             | fiddling and prototyping. I've needed this kinda thing for
             | a long time!
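              | 
              | For the extraction/chunking side, something like this is
              | probably enough to start (ebooklib + BeautifulSoup + nltk;
              | synthesize() is just a placeholder for however you end up
              | calling StyleTTS2):
              | 
              |     from ebooklib import epub, ITEM_DOCUMENT
              |     from bs4 import BeautifulSoup
              |     import nltk
              | 
              |     nltk.download("punkt")
              | 
              |     def epub_to_sentences(path):
              |         # Pull the text out of each HTML document in the
              |         # epub and split it into sentences.
              |         book = epub.read_epub(path)
              |         for item in book.get_items_of_type(ITEM_DOCUMENT):
              |             html = item.get_content()
              |             text = BeautifulSoup(html, "html.parser").get_text(" ")
              |             yield from nltk.sent_tokenize(text)
              | 
              |     def synthesize(sentence):
              |         # Placeholder: call StyleTTS2 (or any TTS) here and
              |         # return a waveform.
              |         raise NotImplementedError
              | 
              |     for sentence in epub_to_sentences("book.epub"):
              |         wav = synthesize(sentence)
              |         # ...concatenate the chunks and write out audio files...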
        
               | risho wrote:
               | Please make sure to link it back to HN so that we can
               | check it out!
        
             | jrpear wrote:
             | You can! [rhvoice](https://rhvoice.org/) is an open source
             | example.
        
         | KolmogorovComp wrote:
          | The pace is better, but imho there is still a very noticeable
          | "metallic" tone, which makes it inferior to the real thing.
         | 
         | Impressive results nonetheless, and superior to all other TTS.
        
       | lhl wrote:
        | I tested StyleTTS2 last month; here are my step-by-step notes,
        | which might be useful for people doing a local setup (not too
        | hard):
        | https://llm-tracker.info/books/howto-guides/page/styletts-2
       | 
        | Also I did a little speed/quality shootout with the LJSpeech
       | model (vs VITS and XTTS). StyleTTS2 was pretty good and very
       | fast: https://fediverse.randomfoo.net/notice/AaOgprU715gcT5GrZ2
        
         | kelseyfrog wrote:
         | > inferences at up to 15-95X (!) RT on my 4090
         | 
         | That's incredible!
         | 
         | Are infill and outpainting equivalents possible? Super-RT TTS
         | at this level of quality opens up a diverse array of uses esp
         | for indie/experimental gamedev that I'm excited for.
        
           | refulgentis wrote:
            | Not sure what you mean: If you mean could inpainting and
            | outpainting with image models be faster, it's a "not even
            | wrong" question, similar to asking if the United Airlines app
            | could get faster because American Airlines did. (Yes, getting
            | faster is an option available to ~all code.)
           | 
           | If you mean could you inpaint and outpaint text...yes, by
           | inserting and deleting characters.
           | 
           | If you mean could you use an existing voice clip to generate
           | speech by the same speaker in the clip, yes, part of the
           | article is demonstrating generating speech by speakers not
           | seen at training time
        
             | pedrovhb wrote:
             | I'm not sure I understand what you mean to say. To me it's
             | a reasonable question asking whether text to speech models
             | can complete a missing part of some existing speech audio,
             | or make it go on for longer, rather than only generating
             | speech from scratch. I don't see a connection to your
             | faster apps analogy.
             | 
             | Fwiw, I imagine this is possible, at least to some extent.
             | I was recently playing with xtts and it can generate
             | speaker embeddings from short periods of speech, so you
             | could use those to provide a logical continuation to
              | existing audio. However, I'm not sure it's yet possible or
              | easy to manage the "seams" between what is generated and
              | what is preexisting.
             | 
             | It's certainly not a misguided question to me. Perhaps you
             | could be less curt and offer your domain knowledge to
             | contribute to the discussion?
             | 
             | Edit: I see you've edited your post to be more informative,
             | thanks for sharing more of your thoughts.
        
               | refulgentis wrote:
                | It imposes a cost on others when you make false claims,
                | like that I said or felt the question was unreasonable.
               | 
               | I didn't and don't.
               | 
               | It is a hard question to understand and an interesting
               | mind-bender to answer.
               | 
                | Less policing of the metacontext and more focus on the
                | discussion at hand will help ensure there are
                | interlocutors around to, at the very least, continue
                | policing.
        
             | kelseyfrog wrote:
             | Ignore the speed comment; it is unrelated to my question.
             | 
              | What I mean is, can output be conditioned on antecedent
              | audio as well as text, analogous to how image diffusion
              | models can condition inpainting and outpainting on static
              | parts of an image and CLIP embeddings?
        
               | refulgentis wrote:
               | Yes, the paper and Eleven Labs have a major feature of
               | "given $AUDIO_SET, generate speech for $TEXT in the same
               | style of $AUDIO_SET"
               | 
                | No, in that, you can't cut it at an arbitrary midword
                | point, say at "what tim" in "what time is it in
                | beijing", and give it the string "what time is it in
                | beijing", and have it recover seamlessly.
                | 
                | Yes, in that, you can cut it at an arbitrary phoneme
                | boundary, say 'this, I.S. a; good: test! ok?' in IPA is
                | 'd'Is, ,aI,es'eI; g'Ud: t'est! ,oUk'eI?', and I can cut
                | it 'between' a phoneme, give it the text, and have it
                | complete.
        
               | kelseyfrog wrote:
               | Perfect! Thank you
        
           | huac wrote:
           | It is theoretically possible to train a model that, given
           | some speech, attempts to continue the speech, e.g. Spectron:
           | https://michelleramanovich.github.io/spectron/spectron/.
           | Similarly, it is possible to train a model to edit the
           | content, a la Voicebox:
           | https://voicebox.metademolab.com/edit.html.
        
       | jasonjmcghee wrote:
        | I've been playing with XTTSv2 on my 3080 Ti, and it's slightly
        | faster than the length of the final audio. It's also good
       | quality, but these samples sound better.
       | 
       | Excited to try it out!
        
       | gjm11 wrote:
       | HN title at present is "StyleTTS2 - open-source Eleven Labs
       | quality Text To Speech". Actual title at the far end doesn't name
       | any particular other product; arXiv paper linked from there
       | doesn't mention Eleven Labs either. I thought this sort of
       | editorializing was frowned on.
        
         | stevenhuang wrote:
         | Eleven Labs is the gold standard for voice synthesis. There is
         | nothing better out there.
         | 
         | So it is extremely notable for an open source system to be able
         | to approach this level of quality, which is why I'd imagine
         | most would appreciate the comparison. I know it caught my
         | attention.
        
           | lucubratory wrote:
           | OpenAI's TTS is better than Eleven Labs, but they don't let
           | you train it to have a particular voice out of fear of the
           | consequences.
        
             | huac wrote:
             | I concur that, for the use cases that OpenAI's voices
             | cover, it is significantly better than Eleven.
        
         | GaggiX wrote:
         | Yes, it's against the guidelines. In fact, when I read the
         | title, I didn't think it was a new research paper but a random
         | GitHub project.
        
         | modeless wrote:
         | It is editorializing and it is an exaggeration. However I've
         | been using StyleTTS2 myself and IMO it is the best open source
         | TTS by far and definitely deserves a spot on the top of HN for
         | a while.
        
       | stevenhuang wrote:
       | I really want to try this but making the venv to install all the
       | torch dependencies is starting to get old lol.
       | 
       | How are other people dealing with this? Is there an easy way to
       | get multiple venvs to share like a common torch venv? I can do
       | this manually but I'm wondering if there's a tool out there that
       | does this.
        
         | wczekalski wrote:
         | I use nix to setup the python env (python version + poetry +
         | sometimes python packages that are difficult to install with
         | poetry) and use poetry for the rest.
         | 
          | The workflow is:
          | 
          |     > nix flake init -t github:dialohq/flake-templates#python
          |     > nix develop -c $SHELL
          |     > # I'm in the shell with the poetry env; I have a shell
          |     > # hook in the nix devenv that does poetry install and
          |     > # poetry activate.
        
         | lukasga wrote:
          | Can relate to this problem a lot. I have considered starting to
          | use a Docker dev container and making a base image for shared
         | dependencies which I then can customize in a dockerfile for
         | each new project, not sure if there's a better alternative
         | though.
        
         | eurekin wrote:
          | Same here. I'm using conda and eyeing simply installing PyTorch
          | into the base conda env.
        
           | lhl wrote:
           | I don't think "base" works like that (while it can be a
           | fallback for some dependencies, afaik, Python packages are
           | isolated/not in path). But even if you could, don't do it.
           | Different packages usually have different pytorch
           | dependencies (often CUDA as well) and it will definitely bite
           | you.
           | 
           | The biggest optimization I've found is to use mamba for
           | everything. It's ridiculously faster than conda for package
           | resolution. With everything cached, you're mostly just
           | waiting for your SSD at that point.
           | 
            | (I suppose you _could_ add the base env's lib path to the
           | end of your PYTHONPATH, but that sounds like a sure way to
           | get bitten by weird dependency/reproducibility issues down
           | the line.)
        
         | stavros wrote:
         | I generally try to use Docker for this stuff, but yeah, it's
         | the main reason why I pass on these, even though I've been
         | looking for something like this. It's just too hard to figure
         | out the dependencies.
        
       | victorbjorklund wrote:
       | This only works for English voices right?
        
         | e12e wrote:
         | No? From the readme:
         | 
          | In Utils folder, there are three pre-trained models:
          | 
          | ASR folder: It contains the pre-trained text aligner, which was
          | pre-trained on English (LibriTTS), Japanese (JVS), and Chinese
          | (AiShell) corpus. It works well for most other languages
          | without fine-tuning, but you can always train your own text
          | aligner with the code here: yl4579/AuxiliaryASR.
          | 
          | JDC folder: It contains the pre-trained pitch extractor, which
          | was pre-trained on English (LibriTTS) corpus only. However, it
          | works well for other languages too because F0 is independent of
          | language. If you want to train on singing corpus, it is
          | recommended to train a new pitch extractor with the code here:
          | yl4579/PitchExtractor.
          | 
          | PLBERT folder: It contains the pre-trained PL-BERT model, which
          | was pre-trained on English (Wikipedia) corpus only. It probably
          | does not work very well on other languages, so you will need to
          | train a different PL-BERT for different languages using the
          | repo here: yl4579/PL-BERT. You can also replace this module
          | with other phoneme BERT models like XPhoneBERT, which is
          | pre-trained on more than 100 languages.
        
           | modeless wrote:
           | Those are just parts of the system and don't make a complete
           | TTS. In theory you could train a complete StyleTTS2 for other
           | languages but currently the pretrained models are English
           | only.
        
       | svapnil wrote:
       | How fast is inference with this model?
       | 
        | For reference, I'm using 11Labs to synthesize short messages -
        | maybe a sentence or something - using voice cloning, and I'm
        | getting around 400-500 ms response times.
       | 
       | Is there any OS solution that gets me to around the same
       | inference time?
        
         | wczekalski wrote:
         | It depends on hardware but IIRC on V100s it took 0.01-0.03s for
         | 1s of audio.
        
       | eigenvalue wrote:
       | Was somewhat annoying to get everything to work as the
       | documentation is a bit spotty, but after ~20 minutes it's all
       | working well for me on WSL Ubuntu 22.04. Sound quality is very
       | good, much better than other open source TTS projects I've seen.
       | It's also SUPER fast (at least using a 4090 GPU).
       | 
       | Not sure it's quite up to Eleven Labs quality. But to me, what
       | makes Eleven so cool is that they have a large library of high
       | quality voices that are easy to choose from. I don't yet see any
       | way with this library to get a different voice from the default
       | female voice.
       | 
       | Also, the real special sauce for Eleven is the near instant voice
       | cloning with just a single 5 minute sample, which works
       | shockingly (even spookily) well. Can't wait to have that all
       | available in a fully open source project! The services that
       | provide this as an API are just too expensive for many use cases.
       | Even the OpenAI one which is on the cheaper side costs ~10 cents
       | for a couple thousand word generation.
        
         | wczekalski wrote:
         | have you tested longer utterances with both ElevenLabs and with
         | StyleTTS? Short audio synthesis is a ~solved problem in the TTS
         | world but things start falling apart once you want to do
         | something like create an audiobook with text to speech.
        
           | wingworks wrote:
           | I can say that the paid service from ElevenLabs can do long
           | form TTS very well. I used it for a while to convert long
           | articles to voice to listen to later instead of reading. It
           | works very well. I only stopped because it gets a little
           | pricey.
        
         | wczekalski wrote:
         | One thing I've seen done for style cloning is a high quality
         | fine tuned TTS -> RVC pipeline to "enhance" the output. TTS for
         | intonation + pronunciation, RVC for voice texture. With
         | StyleTTS and this pipeline you should get close to ElevenLabs.
        
           | eigenvalue wrote:
            | I suspect they are doing many more things to make it sound
           | better. I certainly hope open source solutions can approach
           | that level of quality, but so far I've been very
           | disappointed.
        
         | sandslides wrote:
          | The LibriTTS demo clones unseen speakers from a clip of five
          | seconds or so.
        
           | eigenvalue wrote:
           | Ah ok, thanks. I tried the other demo.
        
             | eigenvalue wrote:
             | I tried it. Sounds absolutely nothing like my voice or my
             | wife's voice. I used the same sample files as I used 2 days
             | ago on the Eleven Labs website, and they worked flawlessly
             | there. So this is very, very far from being close to
             | "Eleven Labs quality" when it comes to voice cloning.
        
               | sandslides wrote:
               | The speech generated is the best I've heard from an open
               | source model. The one test I made didn't make an exact
               | clone either but this is still early days. There's likely
               | something not quite right. The cloned voice does speak
               | without any artifacts or other weirdness that most TTS
               | systems suffer from.
        
               | thot_experiment wrote:
               | Ah that's disappointing, have you tried
               | https://git.ecker.tech/mrq/ai-voice-cloning ? I've had
               | decent results with that, but inference is quite slow.
        
               | jsjmch wrote:
                | ElevenLabs is based on Tortoise-TTS, which was already
                | pre-trained on millions of hours of data, but this one
                | was only trained on LibriTTS, which is 500 hours at best.
                | If you have seen millions of voices, there are definitely
                | gonna be some of them that sound like you. It is just a
                | matter of training data, but it is very difficult to have
                | someone collect these large amounts of data and train on
                | them.
        
         | eigenvalue wrote:
          | To save people some time, this is tested on Ubuntu 22.04
          | (Google is being annoying about the download link, saying too
          | many people have downloaded it in the past 24 hours, but if you
          | wait a bit it should work again):
          | 
          |     git clone https://github.com/yl4579/StyleTTS2.git
          |     cd StyleTTS2
          |     python3 -m venv venv
          |     source venv/bin/activate
          |     python3 -m pip install --upgrade pip
          |     python3 -m pip install wheel
          |     pip install -r requirements.txt
          |     pip install phonemizer
          |     sudo apt-get install -y espeak-ng
          |     pip install gdown
          |     gdown https://drive.google.com/uc?id=1K3jt1JEbtohBLUA0X75KLw36TW7U1yxq
          |     7z x Models.zip
          |     rm Models.zip
          |     gdown https://drive.google.com/uc?id=1jK_VV3TnGM9dkrIMsdQ_upov8FrIymr7
          |     7z x Models.zip
          |     rm Models.zip
          |     pip install ipykernel pickleshare nltk SoundFile
          |     python -c "import nltk; nltk.download('punkt')"
          |     pip install --upgrade jupyter ipywidgets librosa
          |     python -m ipykernel install --user --name=venv --display-name="Python (venv)"
          |     jupyter notebook
          | 
          | Then navigate to /Demo and open either
          | `Inference_LJSpeech.ipynb` or `Inference_LibriTTS.ipynb` and
          | they should work.
        
       | Evidlo wrote:
       | What's a ballpark estimate for inference time on a modern CPU?
        
       | beltsazar wrote:
        | If AI renders some jobs obsolete, I suppose the first ones to go
        | will be audiobook narrators and voice actors.
        
         | washadjeffmad wrote:
         | Hardly. Imagine licensing your voice to Amazon so that any
         | customer could stream any book narrated in your likeness
         | without you having to commit the time to record. You could
         | still work as a custom voice artist, all with a "no clone"
         | clause if you chose. You could profit from your performance and
         | craft in a fraction of the time, focusing as your own agent on
         | the management of your assets. Or, you could just keep and
         | commit to your day job.
         | 
         | Just imagine hearing the final novel of ASoIaF narrated by Roy
         | Dotrice and knowing that a royalty went to his family and
         | estate, or if David Attenborough willed the digital likeness of
         | his voice and its performance to the BBC for use in nature
         | documentaries after his death.
         | 
         | The advent of recorded audio didn't put artists out of
         | business, it expanded the industries that relied on them by
         | allowing more of them to work. Film and tape didn't put artists
         | out of business, it expanded the industries that relied on them
         | by allowing more of them to work. Audio digitization and the
         | internet didn't put artists out of business; it expanded the
         | industries that relied on them by allowing more of them to
         | work.
         | 
         | And TTS won't put artists out of business, but it will create
         | yet another new market with another niche that people will have
         | to figure out how to monetize, even though 98% of the revenues
         | will still somehow end up with the distributors.
        
           | nikkwong wrote:
           | What you're not considering here is that a large majority of
           | this industry is made up of no-name voice actors who have a
            | pleasant (but perfectly substitutable) voice, which is now
           | something that AI can do perfectly and at a fraction of the
           | price.
           | 
           | Sure, celebrities and other well-known figures will have more
           | to gain here as they can license out their voice; but the
           | majority of voice actors won't be able to capitalize on this.
           | So this is actually even more perverse because it again
           | creates a system where all assets will accumulate at the top
           | and there won't be any distributions for everyone else.
        
           | bongodongobob wrote:
           | The point is no one will pay for any of that if you can just
           | clone someone's voice locally. Or just tell the AI how you
           | want it to sound. Your argument literally ignores the entire
           | elephant in the room.
        
         | riquito wrote:
         | I can see a future where the label "100% narrated by a human"
         | (and similar in other industries) will be a thing
        
       | tomcam wrote:
       | Very impressive. It would take me a long time to even guess that
       | some of these are text to speech.
        
       | carbocation wrote:
       | Curious if we'll see a Civitai-style LoRA[1] marketplace for
       | text-to-speech models.
       | 
       | 1 = https://github.com/microsoft/LoRA
        
       | swyx wrote:
        | Silicon Valley is very leaky; Eleven Labs is widely rumored to
        | have raised a huge round recently. Great timing, because with
        | OpenAI's TTS and now this thing, the options in the market have
        | just expanded greatly.
        
       | readyplayernull wrote:
       | Someone please create a TTS with marked-down
       | emotions/intonations.
        
       | wg0 wrote:
        | The quality is really, really INSANE and pretty much
        | unimaginable in the early 2000s.
        | 
        | Could have interesting prospects for games, where you have an
        | LLM assuming a character and a TTS like this giving those NPCs a
        | voice.
        
         | abraae wrote:
         | This is a big thing for one area I'm interested in - golf
         | simulation.
         | 
          | Currently, playing in a golf simulator has a bit of a post-
          | apocalyptic vibe. The birds are cheeping, the grass is
         | rustling, the game play is realistic, but there's not a human
         | to be seen. Just so different from the smacktalking of a real
         | round, or the crowd noise at a big game.
         | 
         | It's begging for some LLM-fuelled banter to be added.
        
           | billylo wrote:
           | Or the occasional "Fore!!"s. :-)
        
       | wahnfrieden wrote:
       | Is there a way to port this to iOS? Apple doesn't provide an API
       | for their version of this.
        
       | ddmma wrote:
       | Well done, been waiting for a moment like this. Will give it a
       | try!
        
       | zsoltkacsandi wrote:
        | Is it possible to somehow optimize the model to run on a
        | Raspberry Pi with 4 GB of RAM?
        
       | modeless wrote:
       | I made a 100% local voice chatbot using StyleTTS2 and other open
       | source pieces (Whisper and OpenHermes2-Mistral-7B). It responds
       | _so_ much faster than ChatGPT. You can have a real conversation
       | with it instead of the stilted Siri-style interaction you have
       | with other voice assistants. Fun to play with!
       | 
       | Anyone who has a Windows gaming PC with a 12 GB Nvidia GPU
       | (tested on 3060 12GB) can install and converse with StyleTTS2
       | with one click, no fiddling with Python or CUDA needed:
       | https://apps.microsoft.com/detail/9NC624PBFGB7
       | 
       | The demo is janky in various ways (requires headphones, has no UI
       | to speak of, voice recognition sometimes fails), but it's a sneak
       | peek at what will soon be possible to run on a normal gaming PC
       | just by putting together open source pieces. The models are
       | improving rapidly, there are already several improved models I
       | haven't yet incorporated.
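        | 
        | If anyone wants to roll their own, the skeleton is roughly the
        | sketch below. Whisper and sounddevice are used as-is; the LLM
        | call assumes an OpenAI-compatible local server (e.g. llama.cpp's
        | server) on port 8080, and speak() is a placeholder for the
        | StyleTTS2 call, so treat it as a starting point rather than what
        | the app actually ships:
        | 
        |     import requests
        |     import sounddevice as sd
        |     import whisper
        | 
        |     stt = whisper.load_model("base.en")
        | 
        |     def listen(seconds=5, sr=16000):
        |         # Record from the default mic, then transcribe with Whisper.
        |         audio = sd.rec(int(seconds * sr), samplerate=sr, channels=1,
        |                        dtype="float32")
        |         sd.wait()
        |         return stt.transcribe(audio.flatten(), fp16=False)["text"].strip()
        | 
        |     def reply(prompt):
        |         # Assumes an OpenAI-compatible local LLM server on :8080.
        |         r = requests.post("http://localhost:8080/v1/chat/completions",
        |                           json={"messages": [
        |                               {"role": "user", "content": prompt}]})
        |         return r.json()["choices"][0]["message"]["content"]
        | 
        |     def speak(text):
        |         # Placeholder: synthesize with StyleTTS2, return (wav, sr).
        |         raise NotImplementedError
        | 
        |     while True:
        |         heard = listen()
        |         if heard:
        |             wav, out_sr = speak(reply(heard))
        |             sd.play(wav, out_sr)
        |             sd.wait()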
        
         | lucubratory wrote:
         | How hard on your end does the task of making the chatbot
         | converse naturally look? Specifically I'm thinking about
         | interruptions, if it's talking too long I would like to be able
         | to start talking and interrupt it like in a normal
         | conversation, or if I'm saying something it could quickly
         | interject something. Once you've got the extremely high speed,
         | theoretically faster than real time, you can start doing that
         | stuff right?
         | 
         | There is another thing remaining after that for fully natural
         | conversation, which is making the AI context aware like a human
         | would be. Basically giving it eyes so it can see your face and
         | judge body language to know if it's talking too long and needs
         | to be more brief, the same way a human talks.
        
           | modeless wrote:
           | Yes, I implemented the ability to interrupt the chatbot while
           | it is talking. It wasn't too hard, although it does require
           | you to wear headphones so the bot doesn't hear itself and get
           | interrupted.
           | 
           | The other way around (bot interrupting the user) is hard.
           | Currently the bot starts processing a response after every
           | word that the voice recognition outputs, to reduce latency.
           | When new words come in before the response is ready it starts
           | over. If it finishes its response before any more words
           | arrive (~1 second usually) it starts speaking. This is not
           | ideal because the user might not be done speaking, of course.
           | If the user continues speaking the bot will stop and listen.
           | But deciding when the user is done speaking (or if the bot
           | should interrupt before the user is done) is a hard problem.
           | It could possibly be done zero-shot using prompting of a LLM
           | but you'd probably need a GPT-4 level LLM to do a good job
           | and GPT-4 is too slow for instant response right now. A
           | better idea would be to train a turn-taking model that
           | predicts who should speak next in conversations. I haven't
           | thought much about how to source a dataset and train a model
           | for that yet.
           | 
           | Ultimately the end state of this type of system is a complete
           | end-to-end audio-to-audio language model. There should be
           | only one model, it should take audio directly as input and
           | produce audio directly as output. I believe that having TTS
           | and voice recognition and language modeling all as separate
           | systems will not get us to 100% natural human conversation. I
           | think that such a system would be within reach of today's
           | hardware too, all you need is the right training
           | dataset/procedure and some architecture bits to make it
           | efficient.
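            | 
            | For the curious, the latency trick is roughly the loop below
            | (simplified and synchronous here; in the real thing the LLM
            | call runs asynchronously and gets cancelled, and the
            | recognizer feeding the word queue isn't shown):
            | 
            |     import queue, time
            | 
            |     words = queue.Queue()      # filled by the streaming recognizer
            |     SILENCE_SECONDS = 1.0      # "user is probably done" threshold
            | 
            |     def generate_reply(text):
            |         # Stand-in for the local LLM call.
            |         return "reply to: " + text
            | 
            |     def speak(text):
            |         # Stand-in for StyleTTS2 synthesis + playback.
            |         print("BOT:", text)
            | 
            |     transcript, reply = [], None
            |     last_word = time.monotonic()
            | 
            |     while True:
            |         try:
            |             w = words.get(timeout=0.05)
            |             transcript.append(w)
            |             last_word = time.monotonic()
            |             # A new word makes any in-flight reply stale: start over.
            |             reply = generate_reply(" ".join(transcript))
            |         except queue.Empty:
            |             pass
            |         # Reply ready and the user quiet for ~1s: take the turn.
            |         if reply and time.monotonic() - last_word > SILENCE_SECONDS:
            |             speak(reply)
            |             transcript, reply = [], None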
        
       | causality0 wrote:
       | What are the chances this gets packaged into something a little
       | more streamlined to use? I have a lot of ebooks I'd love to
       | generate audio versions of.
        
       | carbocation wrote:
       | Having now tried it (the linked repo links to pre-built colab
       | notebooks):
       | 
       | 1) It does a fantastic job of text-to-speech.
       | 
       | 2) I have had no success in getting any meaningful zero-shot
       | voice cloning working. It technically runs and produces a voice,
       | but it sounds nothing like the target voice. (This includes
       | trying their microphone-based self-voice-cloning option.)
       | 
       | Presumably fine-tuning is needed - but I am curious if anyone had
       | better luck with the zero-shot approach.
        
       ___________________________________________________________________
       (page generated 2023-11-19 23:00 UTC)