[HN Gopher] Building an end-to-end Speech Recognition model in P...
       ___________________________________________________________________
        
       Building an end-to-end Speech Recognition model in PyTorch
        
       Author : makaimc
       Score  : 214 points
       Date   : 2020-04-17 14:11 UTC (8 hours ago)
        
 (HTM) web link (www.assemblyai.com)
 (TXT) w3m dump (www.assemblyai.com)
        
       | albertzeyer wrote:
       | This seems to be a CTC model. CTC is not really the best option
       | for a good end-to-end system. Encoder-decoder-attention models or
       | RNN-T models are both better alternatives.
       | 
        | There is also not really a problem with available open source
        | code. There are countless open source projects which already
        | have that mostly ready to use, for all the common DL
        | frameworks: TF, PyTorch, Jax, MXNet, whatever. For anyone with
        | a bit of ML experience, this should really not be too hard to
        | set up.
        | 
        | But to get good performance on your own dataset, what you
        | really need is experience. Taking some existing pipeline will
        | probably get you some model with an okayish word error rate.
        | But then you should tune it. In any case, even without tuning,
        | encoder-decoder-attention models will probably perform better
        | than CTC models.
        
         | dylanbfox wrote:
         | That's right - most literature does show that encoder-decoder
         | architectures outperform CTC. I think one of the main reasons
         | for this is that CTC assumes the label outputs are
         | conditionally independent of each other, which is a pretty big
         | flaw in that loss function.
         | 
         | The blog does mention Listen-Attend-Spell (which is an encoder-
         | decoder architecture) as an alternative to the CTC model.
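          | 
          | For anyone curious, a minimal PyTorch sketch (not the blog's
          | code, just torch.nn.CTCLoss on random tensors) makes that
          | assumption concrete: the model emits one distribution per
          | frame, and the loss only marginalizes over alignments; it
          | never conditions one output character on another.
          | 
          | import torch
          | import torch.nn as nn
          | 
          | T, B, C, S = 50, 4, 29, 20  # frames, batch, chars, targets
          | x = torch.randn(T, B, C, requires_grad=True)
          | log_probs = x.log_softmax(2)          # per-frame outputs
          | targets = torch.randint(1, C, (B, S), dtype=torch.long)
          | input_lengths = torch.full((B,), T, dtype=torch.long)
          | target_lengths = torch.full((B,), S, dtype=torch.long)
          | 
          | # Each frame's distribution is produced independently; CTC
          | # just sums over all monotonic alignments of the targets.
          | loss = nn.CTCLoss(blank=0)(log_probs, targets,
          |                            input_lengths, target_lengths)
          | loss.backward()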
        
         | tasubotadas wrote:
          | It would seem that the best practical approach is to use
          | RNN-T, as it still lets you do streaming predictions (which
          | attention won't really allow).
        
           | albertzeyer wrote:
           | If you need streaming, then yes, RNNT is a good option. If
           | not, encoder-decoder-attention performs a bit better than
           | RNN-T.
           | 
            | Note that there are also approaches to make encoder-
            | decoder-attention streaming-capable, e.g. MoChA, hard
            | attention, etc.
            | 
            | Google uses RNN-T on-device. But they are researching
            | extending it with another encoder-decoder-attention model
            | on top, to get better performance.
            | 
            | This is quite an active research area, and it has not
            | really settled yet. But CTC is not really that relevant
            | anymore, as RNN-T is just a better variant.
        
             | woodson wrote:
              | Recent work on transformer transducers with limited
              | right (and left) context seems to give decent results as
              | well: https://arxiv.org/abs/2002.02562
        
               | p1esk wrote:
               | I opened the pdf, did ctrl+F for 'github', got zero
               | results. Have you reproduced their "decent results"?
        
               | woodson wrote:
                | It's by Google, so you won't get the code, but ESPnet
                | apparently already has implementations of transformer
                | transducers with the RNN-T loss (though with a
                | different network architecture). At least the paper
                | reports results on a freely available dataset, right?
        
               | p1esk wrote:
               | Why would you not expect Google to post their code? E.g.
               | the transformer paper had it:
               | https://arxiv.org/abs/1706.03762
        
           | mikaelphi wrote:
            | Author here! RNN-T is indeed a good approach. RNN-Ts with
            | masked attention are capable of streaming as well.
        
         | bginsburg wrote:
          | Actually, end-to-end CTC models are very good - Wav2letter,
          | Jasper, QuartzNet - all of these models are much better than
          | DeepSpeech2.
        
         | woodson wrote:
         | Agreed, and that's where it seems having lots of experience
         | working with speech data helps more than trying to brute-force
         | it with just larger CTC models and "more" data of dubious
         | quality.
        
         | mikaelphi wrote:
          | Author here! CTC models perform quite well and are easy for
          | beginners to get started with, with the added benefit of
          | real-time streaming capabilities. RNN-Ts and encoder-decoder
          | models like Listen-Attend-Spell are also very solid choices,
          | and the literature points to slightly higher accuracy with
          | them on academic datasets. RNN-Ts, being an extension of
          | CTC, are streamable as well.
        
         | option wrote:
          | I question whether you need full attention in the acoustic
          | model. The pronunciation of a word in the middle of a phrase
          | does not depend much on the beginning.
         | 
         | You do need attention in the language model part of the
         | pipeline
        
           | albertzeyer wrote:
           | Almost all the literature which compares CTC and encoder-
           | decoder-attention models shows pretty well that encoder-
           | decoder-attention performs better than CTC in the acoustic
           | model.
           | 
            | See for example this overview (my own work, already a bit
            | outdated, but attention has improved even more since
            | then): https://openreview.net/pdf?id=S1gp9v_jsm
        
       | zerop wrote:
        | Good article. Speech recognition for real-time use cases
        | really needs a working open source solution. I have been
        | evaluating DeepSpeech, which is okay, but a lot of work is
        | needed to get it working close to the Google Speech engine.
        | Apart from a good deep neural network, a good speech
        | recognition system needs two important things:
        | 
        | 1. Tons of diverse data sets (real world)
        | 
        | 2. A solution for noise - either de-noise and then train, OR
        | train with noise (see the sketch at the end of this comment).
       | 
        | There are lots of extra challenges that speech recognition has
        | to solve which are not common to other deep learning problems:
       | 
       | 1. Pitch
       | 
       | 2. Speed of conversation
       | 
       | 3. Accents (can be solved with more data, I think)
       | 
       | 4. Real time inference (low latency)
       | 
       | 5. On the edge (i.e. Offline on mobile devices)
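        | 
        | On the noise point, the simplest version of "train with noise"
        | is to mix a noise clip into each training utterance at a
        | random SNR. A rough PyTorch sketch (assumes 1-D float
        | waveforms; this is not from the article):
        | 
        | import torch
        | 
        | def mix_noise(speech, noise, snr_db):
        |     # Tile or trim the noise clip to the speech length
        |     reps = speech.numel() // noise.numel() + 1
        |     noise = noise.repeat(reps)[: speech.numel()]
        |     speech_power = speech.pow(2).mean()
        |     noise_power = noise.pow(2).mean()
        |     # Scale noise so the mixed SNR (in dB) equals snr_db
        |     scale = (speech_power
        |              / (noise_power * 10 ** (snr_db / 10))).sqrt()
        |     return speech + scale * noise
        | 
        | Sampling snr_db per utterance from, say, 0-30 dB gives the
        | model a range of conditions to learn from.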
        
         | ALittleLight wrote:
          | Your point about needing a dataset made me think about how a
          | post on Hacker News like this might be a good way to get
          | data. How many people would contribute by reading a prompt
          | if they visited a link like this and had the option to
          | donate some data? That would get many distinct voices and
          | microphones and some different conditions.
         | 
          | The article mentions that they used a dataset composed of
          | 100 hours of audiobooks. A comment thread here [1] estimates
          | 10-50k visitors from a successful Hacker News post. Call it
          | 30k visitors. If 20% of visitors donated by reading a one-
          | minute prompt, that's another 6,000 minutes, or, oddly, also
          | 100 hours.
         | 
         | Seems like a potentially easy way to double your dataset and
         | make it more diverse.
         | 
         | 1 - https://news.ycombinator.com/item?id=20612717
        
           | starpilot wrote:
           | There might be some sampling bias with an HN user dataset. At
           | my company, many of our customer service calls are from older
           | people, especially women, who call because they don't like
           | using the internet (or they don't even have internet).
           | Different voices and patterns of speech. This could be a
           | really different demographic from HN users.
        
           | Isn0gud wrote:
            | You might be interested in a project doing exactly that:
            | https://voice.mozilla.org/
            | 
            | Audio data of people reading prompts is quite common; what
            | is missing for robust voice recognition is plenty of data
            | of, e.g., people screaming it across the room. There is
            | only so much physics simulations can do.
        
             | ALittleLight wrote:
             | That is interesting. I gave my contribution!
        
       | spzb wrote:
       | This is probably really good but the linked Colab notebook is
       | failing on the first step with some unresolvable dependencies.
       | This does seem to be a bit of a common theme whenever I try
       | running example ML projects.
       | 
       | Edit: I think I've fixed it by changing the pip command to:
       | 
       | !pip install torchaudio comet_ml==3.0
        
         | zanew101 wrote:
          | Hah, classic. But in all seriousness, I think it's a pretty
          | interesting issue. A lot of ML and data science work sees
          | people coming in who do not have formal computer science and
          | software development backgrounds. We build tools and
          | methodologies around abstracting away some of the code
          | development process and hope that they land us in an
          | environment that's easy to share with others. This is
          | unfortunately rarely the case.
          | 
          | It's a problem that, as an industry, I think we are in the
          | middle of "solving" (it probably can't be solved fully, but
          | things are getting better). I'm really excited to see what
          | kinds of tools and tests will be developed to get ML
          | projects onto better practices.
        
         | mikaelphi wrote:
          | Author here, thanks for pointing that out - it's fixed now!
          | I also made sure to pin the pip versions for torchaudio and
          | torch, so this should not be an issue anymore.
        
       | option wrote:
        | Have a look at NeMo (https://github.com/nvidia/NeMo); it comes
        | with QuartzNet (only 19M weights and better accuracy than
        | DeepSpeech2) pretrained on thousands of hours of speech.
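        | 
        | Loading it is roughly this (a sketch; the exact class and
        | method names have moved around between NeMo releases):
        | 
        | import nemo.collections.asr as nemo_asr
        | 
        | # Downloads the pretrained QuartzNet15x5 checkpoint
        | model = nemo_asr.models.EncDecCTCModel.from_pretrained(
        |     "QuartzNet15x5Base-En")
        | # "sample.wav" stands in for any 16 kHz mono file of yours
        | print(model.transcribe(["sample.wav"]))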
        
         | sniper200012 wrote:
         | really interesting repo
        
       | komuher wrote:
        | Dunno why (probably the datasets), but open source speech
        | recognition models perform very poorly on real-world data
        | compared to Google Speech-to-Text or Azure Cognitive Services.
        
         | dylanbfox wrote:
          | One of the main factors for this is probably dataset size.
          | Commercial STT models are trained on tens of thousands of
          | hours of real-world data. Even a decent model architecture
          | is going to perform pretty well with that much data.
          | 
          | Most open source models are trained on Libri, SWB, etc.,
          | which are not really big or diverse enough for real-world
          | scenarios.
          | 
          | But to max out results, the devil is in the details IMO
          | (network architecture, optimizer, weight initialization,
          | regularization, data augmentation, hyperparameter tuning,
          | etc.), which requires a lot of experiments.
        
           | bginsburg wrote:
           | There are new, very large public English speech datasets:
           | Mozilla Common Voice, National Speech Corpus, which can be
           | combined with LibriSpeech to train large models.
        
             | eindiran wrote:
             | If you combine them you get 5ish K hours of speech for
             | English, which is still fairly small compared to what most
             | big players have access to.
        
             | solidasparagus wrote:
             | Amazon worked with 7k hours of labeled data + 1 million
             | hours of unlabeled data -
             | https://arxiv.org/pdf/1904.01624.pdf
        
         | tasubotadas wrote:
          | LibriSpeech is 1000h, WSJ is 73h.
          | 
          | Google is training on datasets as big as 30k hours, and MS
          | seems to work with a 10k hour dataset.
          | 
          | At the moment I am working on a similar e2e system, but an
          | 80h dataset makes it a really challenging task to generalize
          | well.
        
         | option wrote:
          | The key is to fine-tune on your data. Take a publicly
          | available pretrained model, fine-tune it on your data, and
          | you can often get better results than Google's service or
          | Azure Cognitive on your use case (Google and Azure ASR are
          | great general services, but they cannot do better than a
          | _custom_, say, health-center-specific model).
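          | 
          | As a very rough sketch of what that looks like in PyTorch
          | (pretrained_model and domain_loader are placeholders for
          | whatever checkpoint and in-domain dataset you have, and the
          | .encoder attribute / output shape are assumptions):
          | 
          | import torch
          | 
          | # Freeze the generic acoustic encoder, adapt the rest
          | for p in pretrained_model.encoder.parameters():
          |     p.requires_grad = False
          | 
          | optimizer = torch.optim.AdamW(
          |     (p for p in pretrained_model.parameters()
          |      if p.requires_grad), lr=1e-4)
          | ctc_loss = torch.nn.CTCLoss(blank=0)
          | 
          | for specs, labels, spec_lens, label_lens in domain_loader:
          |     out = pretrained_model(specs)  # (time, batch, classes)
          |     loss = ctc_loss(out.log_softmax(2), labels,
          |                     spec_lens, label_lens)
          |     optimizer.zero_grad()
          |     loss.backward()
          |     optimizer.step()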
        
       | coder543 wrote:
       | Mentioned once in the other comments here without any link, but
       | another open source speech recognition model I heard about
       | recently is Mozilla DeepSpeech:
       | 
       | https://github.com/mozilla/DeepSpeech
       | 
       | https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-sp...
       | 
       | I haven't had a chance to test it, and I wish there were a
       | client-side WASM demo of it that I could just visit on Mozilla's
       | site.
        
         | mikaelphi wrote:
          | Author here! DeepSpeech is an excellent repo if you just
          | want to pip install something. We wanted to do a
          | comprehensive writeup to give devs the ability to build
          | their own end-to-end model.
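          | 
          | For reference, "pip install deepspeech" plus the released
          | .pbmm model file from their GitHub releases page is roughly
          | all it takes; a minimal sketch against the 0.6.x Python API
          | (the beam-width argument was dropped in later releases, and
          | the file names are placeholders):
          | 
          | import wave
          | import numpy as np
          | from deepspeech import Model
          | 
          | # Model path and beam width (0.6.x signature)
          | ds = Model("deepspeech-0.6.1-models.pbmm", 500)
          | with wave.open("audio.wav", "rb") as w:  # 16 kHz, 16-bit mono
          |     audio = np.frombuffer(w.readframes(w.getnframes()),
          |                           dtype=np.int16)
          | print(ds.stt(audio))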
        
       ___________________________________________________________________
       (page generated 2020-04-17 23:00 UTC)