[HN Gopher] Building an end-to-end Speech Recognition model in P...
___________________________________________________________________
Building an end-to-end Speech Recognition model in PyTorch
Author : makaimc
Score  : 214 points
Date   : 2020-04-17 14:11 UTC (8 hours ago)
(HTM) web link (www.assemblyai.com)
(TXT) w3m dump (www.assemblyai.com)

| albertzeyer wrote:
| This seems to be a CTC model. CTC is not really the best option for a good end-to-end system. Encoder-decoder-attention models and RNN-T models are both better alternatives.
|
| There is also not really a problem with available open source code. There are countless open source projects which already have that mostly ready to use, for all the common DL frameworks: TF, PyTorch, Jax, MXNet, whatever. For anyone with a bit of ML experience, this should really not be too hard to set up.
|
| But then, to get good performance on your own dataset, what you really need is experience. Taking some existing pipeline will probably get you some model with an okay-ish word error rate, but then you should tune it. In any case, even without tuning, encoder-decoder-attention models will probably perform better than CTC models.

| dylanbfox wrote:
| That's right - most literature does show that encoder-decoder architectures outperform CTC. I think one of the main reasons for this is that CTC assumes the label outputs are conditionally independent of each other, which is a pretty big flaw in that loss function.
|
| The blog does mention Listen-Attend-Spell (which is an encoder-decoder architecture) as an alternative to the CTC model.

| tasubotadas wrote:
| It would seem that the best practical approach is to use RNN-T, as it still lets you do streaming predictions (which attention won't really allow).

| albertzeyer wrote:
| If you need streaming, then yes, RNN-T is a good option. If not, encoder-decoder-attention performs a bit better than RNN-T.
|
| Note that there are also approaches to make encoder-decoder-attention streaming capable, e.g. MoChA, hard attention, etc.
|
| Google uses RNN-T on-device. But they are researching extending it with another encoder-decoder-attention model on top, to get better performance.
|
| This is a quite active research area, and it has not really settled. But CTC is not really too relevant anymore, as RNN-T is just a better variant.

| woodson wrote:
| Recent work on transformer transducers with limited right (and left) context seems to give decent results as well: https://arxiv.org/abs/2002.02562

| p1esk wrote:
| I opened the pdf, did ctrl+F for 'github', got zero results. Have you reproduced their "decent results"?

| woodson wrote:
| It's by Google, so you won't get the code, but espnet apparently already has implementations of versions of transformer transducers with RNN-T loss (though with a different network architecture). At least the paper cites results on a freely available dataset, right?

| p1esk wrote:
| Why would you not expect Google to post their code? E.g. the transformer paper had it: https://arxiv.org/abs/1706.03762

| mikaelphi wrote:
| Author here! RNN-Ts are indeed a good approach. RNN-Ts with masked attention are capable of streaming as well.

| bginsburg wrote:
| Actually, end-to-end CTC models are very good - Wav2letter, Jasper, QuartzNet - all these models are much better than DeepSpeech2.
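
For readers following the thread above, here is a minimal sketch of how a CTC head is typically wired up in PyTorch (the loss dylanbfox refers to, with its conditional-independence assumption over output labels). The toy encoder, tensor shapes, and blank index are illustrative assumptions, not code from the linked article.

    import torch
    import torch.nn as nn

    batch, time_steps, n_feats, n_classes = 4, 200, 128, 29  # e.g. 28 characters + 1 blank

    # Toy acoustic encoder: per-frame scores over the label set.
    encoder = nn.Sequential(
        nn.Linear(n_feats, 256),
        nn.ReLU(),
        nn.Linear(256, n_classes),
    )

    feats = torch.randn(batch, time_steps, n_feats)       # (N, T, F) spectrogram frames
    log_probs = encoder(feats).log_softmax(dim=-1)        # (N, T, C)
    log_probs = log_probs.transpose(0, 1)                 # nn.CTCLoss expects (T, N, C)

    targets = torch.randint(1, n_classes, (batch, 50))    # label indices; 0 is the blank
    input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
    target_lengths = torch.full((batch,), 50, dtype=torch.long)

    ctc_loss = nn.CTCLoss(blank=0)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()

Because the loss factorizes over frames given the encoder output, no emitted label conditions on the previously emitted ones, which is exactly the independence assumption that the encoder-decoder and RNN-T models discussed above try to remove.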
| woodson wrote:
| Agreed, and that's where it seems having lots of experience working with speech data helps more than trying to brute-force it with just larger CTC models and "more" data of dubious quality.

| mikaelphi wrote:
| Author here! CTC models perform quite well and are easy for beginners to get started with, with the added benefit of real-time streaming capabilities. RNN-Ts and encoder-decoder models like Listen-Attend-Spell are also very solid choices, and the literature points to slightly higher accuracy with them on academic datasets. RNN-Ts, being an extension of CTC, are streamable as well.

| option wrote:
| I question whether you need full attention in the acoustic model. The pronunciation of a word in the middle of a phrase does not have much dependence on the beginning.
|
| You do need attention in the language model part of the pipeline.

| albertzeyer wrote:
| Almost all the literature which compares CTC and encoder-decoder-attention models shows pretty well that encoder-decoder-attention performs better than CTC in the acoustic model.
|
| See for example here for an overview (my own work, already a bit outdated, but attention has improved even more since then): https://openreview.net/pdf?id=S1gp9v_jsm

| zerop wrote:
| Good article. Speech recognition for real-time use cases really needs a working open source solution. I have been evaluating DeepSpeech, which is okay, but a lot of work is needed to get it working close to the Google Speech engine. Apart from a good deep neural network, a good speech recognition system needs two important things:
|
| 1. Tons of diverse data sets (real world)
| 2. A solution for noise - either de-noise and train, OR train with noise
|
| There are lots of extra challenges that the voice recognition problem has to solve which are not common in other deep learning problems:
|
| 1. Pitch
| 2. Speed of conversation
| 3. Accents (can be solved with more data, I think)
| 4. Real-time inference (low latency)
| 5. On the edge (i.e. offline on mobile devices)

| ALittleLight wrote:
| Your point about needing a dataset made me think about how a post on Hacker News like this may be a good way to get data. How many people would contribute by reading a prompt if they visited a link like this and had the option to donate some data? That would get many distinct voices and microphones and some different conditions.
|
| The article mentions that they used a dataset composed of 100 hours of audiobooks. A comment thread here [1] estimates 10-50k visitors from a successful Hacker News post. Call it 30k visitors. If 20% of visitors donated by reading a one-minute prompt, that's another 6,000 minutes, or, oddly, also 100 hours.
|
| Seems like a potentially easy way to double your dataset and make it more diverse.
|
| 1 - https://news.ycombinator.com/item?id=20612717

| starpilot wrote:
| There might be some sampling bias with an HN user dataset. At my company, many of our customer service calls are from older people, especially women, who call because they don't like using the internet (or they don't even have internet). Different voices and patterns of speech. This could be a really different demographic from HN users.

| Isn0gud wrote:
| You might be interested in a project doing exactly that: https://voice.mozilla.org/
|
| Audio data of people reading prompts is quite common; what is missing for robust voice recognition is plenty of data of, e.g., people screaming it across the room. There is only so much physics simulations can do.
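
zerop's second point above (train with noise rather than, or in addition to, de-noising) is usually implemented as on-the-fly augmentation: mix recorded background noise into each clean utterance at a random signal-to-noise ratio before computing features. A minimal sketch in plain PyTorch follows; the function name, SNR range, and stand-in tensors are assumptions for illustration, not part of the article's pipeline.

    import torch

    def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
        """Mix `noise` into `clean` (both 1-D waveforms) at the given SNR in dB."""
        # Tile or trim the noise so it covers the whole utterance.
        if noise.numel() < clean.numel():
            noise = noise.repeat(clean.numel() // noise.numel() + 1)
        noise = noise[: clean.numel()]

        clean_power = clean.pow(2).mean()
        noise_power = noise.pow(2).mean().clamp_min(1e-10)
        # Scale the noise so that 10 * log10(clean_power / scaled_noise_power) == snr_db.
        scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
        return clean + scale * noise

    # Usage: pick a random noise clip and SNR per training example.
    clean = torch.randn(16000)   # stand-in for a 1-second utterance at 16 kHz
    noise = torch.randn(8000)    # stand-in for a recorded background-noise clip
    snr_db = float(torch.empty(1).uniform_(0, 20))
    noisy = mix_at_snr(clean, noise, snr_db)

In practice the noise clips would come from recordings that match the deployment conditions (far-field rooms, phone lines, street noise), which is where the point above about data of people shouting across a room comes in.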
| ALittleLight wrote:
| That is interesting. I gave my contribution!

| spzb wrote:
| This is probably really good, but the linked Colab notebook is failing on the first step with some unresolvable dependencies. This does seem to be a bit of a common theme whenever I try running example ML projects.
|
| Edit: I think I've fixed it by changing the pip command to:
|
|     !pip install torchaudio comet_ml==3.0

| zanew101 wrote:
| Hah, classic. But in all seriousness, I think it's a pretty interesting issue. A lot of ML and data science sees people coming in who do not have formal computer science and software development backgrounds. We build tools and methodologies around abstracting away some of the code development process and hope that it lands us in an environment that's easy to share with others. This is unfortunately rarely the case.
|
| It's a problem that, as an industry, I think we are in the middle of "solving" (it probably can't be solved fully, but things are getting better). I'm really excited to see what kinds of tools and tests will be developed around getting ML projects to follow better practices.

| mikaelphi wrote:
| Author here, thanks for pointing that out; it's fixed now! I also pinned the pip versions for torchaudio and torch, so this should not be an issue anymore.

| option wrote:
| Have a look at NeMo https://github.com/nvidia/NeMo - it comes with QuartzNet (only 19M weights and better accuracy than DeepSpeech2) pretrained on thousands of hours of speech.

| sniper200012 wrote:
| really interesting repo

| komuher wrote:
| Dunno why (probably the dataset), but open source speech recognition models perform very poorly on real-world data compared to Google Speech-to-Text or Azure Cognitive Services.

| dylanbfox wrote:
| One of the main factors is probably dataset size. Commercial STT models are trained on tens of thousands of hours of real-world data. Even a decent model architecture is going to perform pretty well on that much data.
|
| Most open source models are trained on Libri, SWB, etc., which are not really big or diverse enough for real-world scenarios.
|
| But to max out results, the devil is in the details IMO (network architecture, optimizer, weight initialization, regularization, data augmentation, hyperparam tuning, etc.), which requires a lot of experiments.

| bginsburg wrote:
| There are new, very large public English speech datasets - Mozilla Common Voice, National Speech Corpus - which can be combined with LibriSpeech to train large models.

| eindiran wrote:
| If you combine them you get roughly 5k hours of speech for English, which is still fairly small compared to what most big players have access to.

| solidasparagus wrote:
| Amazon worked with 7k hours of labeled data + 1 million hours of unlabeled data - https://arxiv.org/pdf/1904.01624.pdf

| tasubotadas wrote:
| LibriSpeech is 1000h, WSJ is 73h.
|
| Google is training on datasets as big as 30k hours, and MS seems to work with a 10k-hour dataset.
|
| At the moment, I am working on a similar e2e system, but an 80h dataset makes it a really challenging task to generalize well.

| option wrote:
| The key is to fine-tune on your data. Take a publicly available pretrained model, fine-tune it on your data, and you can often get better results than Google's service or Azure Cognitive for your use case (Google and Azure ASR are great general services, but they cannot do better than a _custom_, say, health-center-specific model).
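
option's suggestion boils down to loading published weights and continuing training on in-domain data, typically with a small learning rate and the lower layers frozen. A minimal runnable sketch follows, using a toy stand-in network and a CTC loss purely for illustration, since the thread does not name a specific pretrained checkpoint or toolkit.

    import torch
    import torch.nn as nn

    # Toy stand-in for a pretrained acoustic model (illustrative only).
    class TinyASR(nn.Module):
        def __init__(self, n_feats=128, n_classes=29):
            super().__init__()
            self.encoder = nn.GRU(n_feats, 256, num_layers=2, batch_first=True)
            self.classifier = nn.Linear(256, n_classes)

        def forward(self, feats):
            hidden, _ = self.encoder(feats)
            return self.classifier(hidden).log_softmax(dim=-1)

    model = TinyASR()
    # In practice you would load published weights here, e.g.:
    # model.load_state_dict(torch.load("pretrained_asr.pt"))  # hypothetical checkpoint file

    # Freeze the encoder so only the output layer adapts to the new domain,
    # and use a small learning rate for the parameters that stay trainable.
    for param in model.encoder.parameters():
        param.requires_grad = False
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-5
    )

    # One fine-tuning step on a stand-in batch of in-domain features/labels.
    feats = torch.randn(4, 200, 128)
    targets = torch.randint(1, 29, (4, 50))
    input_lengths = torch.full((4,), 200, dtype=torch.long)
    target_lengths = torch.full((4,), 50, dtype=torch.long)

    log_probs = model(feats).transpose(0, 1)               # (T, N, C) for nn.CTCLoss
    loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    optimizer.step()

How many layers to freeze, and how small the learning rate should be, depends on how much in-domain data is available; with only tens of hours, freezing more of the network usually helps avoid overfitting.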
| coder543 wrote:
| Mentioned once in the other comments here without any link, but another open source speech recognition model I heard about recently is Mozilla DeepSpeech:
|
| https://github.com/mozilla/DeepSpeech
| https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-sp...
|
| I haven't had a chance to test it, and I wish there were a client-side WASM demo of it that I could just visit on Mozilla's site.

| mikaelphi wrote:
| Author here! DeepSpeech is an excellent repo if you just want to pip install something. We wanted to do a comprehensive writeup to give devs the ability to build their own end-to-end model.
___________________________________________________________________
(page generated 2020-04-17 23:00 UTC)