[HN Gopher] Wav2vec Overview: Semi and Unsupervised Speech Recog...
       ___________________________________________________________________
        
       Wav2vec Overview: Semi and Unsupervised Speech Recognition
        
       Author : vackosar
       Score  : 110 points
       Date   : 2021-07-03 15:39 UTC (7 hours ago)
        
 (HTM) web link (vaclavkosar.com)
 (TXT) w3m dump (vaclavkosar.com)
        
       | lunixbochs wrote:
       | One addendum to the linked post's notes:
       | 
       | > SoTa in low-resource setting Libri-light by a lot on WER clean
       | test 100h labeled: others ~4 vs theirs ~2.5
       | 
       | > SoTa on high-resource noisy data (3.3 vs 3.4) close to SoTa on
       | clean data
       | 
        | This note isn't very specific, but if I'm understanding it
        | correctly it's outdated: to my knowledge, the SOTA on this
        | data is held by Conformer 1B (a 1-billion-parameter model),
        | at 1.4 WER on the clean test set and 2.6 on the noisy one.
       | 
       | Conformer 1B is something like wav2vec 2.0 pretraining +
       | conformer + noisy student + specaugment.
       | 
       | https://arxiv.org/pdf/2010.10504.pdf
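        | 
        | Just the SpecAugment piece of that recipe is easy to sketch
        | with torchaudio's masking transforms (the mask sizes below are
        | illustrative, not the paper's):
        | 
        |     import torch
        |     import torchaudio.transforms as T
        | 
        |     # 1 s of fake 16 kHz audio -> mel spectrogram
        |     waveform = torch.randn(1, 16000)
        |     spec = T.MelSpectrogram(sample_rate=16000)(waveform)
        | 
        |     # SpecAugment-style masking: zero out random frequency
        |     # bands and time spans during training
        |     augment = torch.nn.Sequential(
        |         T.FrequencyMasking(freq_mask_param=27),
        |         T.TimeMasking(time_mask_param=100),
        |     )
        |     augmented = augment(spec)  # same shape, masked patches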
       | 
       | --
       | 
        | Wav2vec 2.0 is very cool, but I've had some trouble reproducing
        | the pretraining and fine-tuning reliably. It might need a lot of
        | resources (e.g. hundreds of clustered GPUs).
       | 
       | I think Wav2vec-U is extremely cool.
        
         | knuthsat wrote:
         | I always wonder how people figure out these successful gigantic
         | models if it takes hundreds of TPUs and days to train them.
         | 
          | I recently bought an RTX 3090 in hopes of playing around with
          | some computer vision applications, but I guess having 24 GB of
          | VRAM is nothing if I want to get something SOTA working.
        
           | qayxc wrote:
           | The RTX 3090 is a beast compared to what researchers had
           | available to them just a few years ago.
           | 
           | Don't try to chase SOTA - that's a fruitless endeavour.
           | 
            | 24 GB of VRAM is plenty for CV, and you can train some
            | excellent models with it. Also keep in mind that you don't
            | necessarily need to train models from scratch either.
           | 
            | You can achieve great things by downloading a well-tested,
            | pretrained model and fine-tuning it for your particular task
            | or application. Trying to come up with new models and
            | training them from scratch is an exercise in futility for
            | really big models.
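            | 
            | A minimal sketch of that workflow with torchvision (the
            | 10-class head and learning rate are placeholders for your
            | task):
            | 
            |     import torch
            |     import torch.nn as nn
            |     import torchvision.models as models
            | 
            |     # well-tested pretrained backbone
            |     model = models.resnet50(pretrained=True)
            | 
            |     # freeze the backbone; train only a new head
            |     for p in model.parameters():
            |         p.requires_grad = False
            |     model.fc = nn.Linear(model.fc.in_features, 10)
            | 
            |     optimizer = torch.optim.Adam(model.fc.parameters(),
            |                                  lr=1e-3)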
           | 
            | I usually only train smaller models (a couple of million
            | parameters), and training plus fine-tuning takes anywhere
            | from a few hours to a day or two. But then again, my
            | hardware is two generations older than yours.
        
           | sdenton4 wrote:
           | The EfficientNet paper has some good things to say on this.
           | 
           | If you're working at a place with giant datacenters full of
           | (T/G)PUs, you can train one giant model a few times, or train
            | smaller models hundreds of times. Without a hyperparameter
            | search, there's a really high chance that you're just
            | looking in the wrong region and will wind up with something
            | gigantic but kinda meh.
           | 
           | So, the simple strategy is to use the smaller models to find
           | a great mix of hyperparameters, and then scale up to a
           | gigantic model. The EfficientNet paper demonstrates some
           | fairly reliable ways to scale up the model, changing width
           | and depth together according to a scaling factor.
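            | 
            | Very roughly, the compound scaling rule from the paper
            | looks like this (the base depth/width/resolution are
            | placeholders; alpha, beta, gamma are the paper's searched
            | values):
            | 
            |     import math
            | 
            |     # alpha * beta**2 * gamma**2 ~= 2, so each +1 in phi
            |     # roughly doubles the compute
            |     alpha, beta, gamma = 1.2, 1.1, 1.15
            | 
            |     def scale(phi, depth=18, width=64, res=224):
            |         return {
            |             "depth": math.ceil(depth * alpha ** phi),
            |             "width": math.ceil(width * beta ** phi),
            |             "resolution": math.ceil(res * gamma ** phi),
            |         }
            | 
            |     print(scale(phi=3))  # one bigger model, ~8x compute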
           | 
           | But yeah, even for smaller model footprints, the ability to
           | run tens of experiments in parallel goes a very long way. If
           | you've got a single GPU to play with, I would instead try to
            | focus on a well-scoped, interesting question that you can
            | answer without having to demonstrate SOTA-ness, as that
            | will be an uphill climb.
           | 
           | Also remember that it's good to lean heavily on pre-trained
           | models to save time. Anything you can do to iterate faster,
           | really.
        
       | WillDaSilva wrote:
       | I wonder how much better this would be at capturing information
       | that doesn't translate well into text representations of speech.
       | 
        | Consider how with word2vec there are relationships in the
        | embedding space between semantically related words. I would
        | expect the classic word2vec examples (e.g. king -> queen being
        | a similar translation to man -> woman) to apply here too, but
        | can it also do things like place regular questions and
        | rhetorical questions in different regions of the embedding
        | space based on the inflection in the speech?
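        | 
        | The word2vec version of that analogy is a one-liner, e.g. with
        | gensim and Google's pretrained news vectors (the file name is
        | illustrative):
        | 
        |     from gensim.models import KeyedVectors
        | 
        |     vecs = KeyedVectors.load_word2vec_format(
        |         "GoogleNews-vectors-negative300.bin", binary=True)
        | 
        |     # king - man + woman ~= queen
        |     print(vecs.most_similar(positive=["king", "woman"],
        |                             negative=["man"], topn=1))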
       | 
       | It would also be interesting to see what relationships exist
       | between equivalent words in different languages within the
       | embedding space. I suppose something like that is probably
       | already used for text translation neural networks, but maybe some
       | notable differences exist when dealing with speech directly.
        
       | theropost wrote:
        | Does anyone know of some good open-source projects for OCR?
        | Tesseract always seems to be the default, and then it seems
        | Google Cloud and other such services are miles ahead. However,
        | for those who don't want to rely on the big tech companies, are
        | there any comparable alternatives?
        
         | ismaj wrote:
         | There is easyocr which is good enough but lacks maturity (it
         | was acknowledged at some point by Yann LeCun). The code base
         | isn't ideal. I'm currently working on my own custom OCR since
         | easyocr isn't perfect at detecting emails for example
         | Www.ismaj@gmail ;com
        
         | piceas wrote:
          | I recently came across CRAFT, which appears to have come out
          | of the ICDAR 2017 Robust Reading challenge.
         | 
         | It performed better than expected. I only tested a few images
         | so please don't take my word for it.
         | 
         | That led me to PaddleOCR. There is still plenty of room for
         | improvement but I found it way more convenient to use for my
         | purposes than messing with Tesseract.
         | 
         | https://github.com/clovaai/CRAFT-pytorch
         | 
         | https://github.com/PaddlePaddle/PaddleOCR
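          | 
          | In case it helps, the PaddleOCR quick start is roughly this
          | (the image path is a placeholder, and the exact result
          | format can vary by version):
          | 
          |     from paddleocr import PaddleOCR
          | 
          |     # downloads detection/recognition models on first run
          |     ocr = PaddleOCR(use_angle_cls=True, lang="en")
          | 
          |     for box, (text, score) in ocr.ocr("scan.png", cls=True):
          |         print(text, score)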
        
       | spijdar wrote:
       | As someone who's an idiot about machine learning, is it possible
       | to run this code in reverse? e.g. take the generated (or novel)
       | vectors and convert them back into audio/waveforms?
        
         | monocasa wrote:
          | Generalized reverse projection through even non-recurrent
          | neural networks is still an open research problem.
          | 
          | So no, not in this case.
        
           | spywaregorilla wrote:
           | That doesn't sound like a particularly realistic problem to
           | solve.
        
             | monocasa wrote:
              | I agree, but all the more glory if someone does solve it,
              | then. And the field is still new enough that I don't want
              | to be cited for decades like the "No wireless. Less space
              | than a Nomad. Lame." Slashdot comment on the iPod release.
        
         | [deleted]
        
         | jmalicki wrote:
          | If you look at the architecture diagram for Wav2Vec-U, the
          | "generator" is doing exactly that - generating waveforms from
          | the vectors. All GANs work this way, and that's how websites
          | like https://thispersondoesnotexist.com/ work. Of course, as
          | the sibling comment notes, the results today might not be
          | great for this task, and it is open research, but it's not as
          | if it just can't be done at all.
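          | 
          | As a toy illustration of that idea in PyTorch (sizes are
          | arbitrary, and the output is only meaningful after
          | adversarial training):
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     # latent vector -> 1 s of 16 kHz "audio"
          |     generator = nn.Sequential(
          |         nn.Linear(128, 512), nn.ReLU(),
          |         nn.Linear(512, 16000),
          |         nn.Tanh(),  # samples in [-1, 1]
          |     )
          | 
          |     z = torch.randn(1, 128)  # a vector to "invert"
          |     waveform = generator(z)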
        
           | lunixbochs wrote:
           | My reading of the generator diagram (figure 6) isn't that it
           | is generating waveforms, but that it is generating phoneme
           | probabilities.
           | 
           | You can train a similar system to produce audio on the output
           | of wav2vec, though it probably won't sound similar to the
           | input audio (accent/voice) unless you expose more features of
           | the input than phonemes.
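            | 
            | In that reading, the generator's output looks more like
            | this toy stand-in (not the paper's actual convolutional
            | generator; sizes are made up):
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     # fake wav2vec features: (batch, frames, dim)
            |     feats = torch.randn(1, 100, 512)
            | 
            |     # per-frame distribution over ~40 phonemes
            |     head = nn.Linear(512, 40)
            |     probs = head(feats).softmax(dim=-1)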
        
       ___________________________________________________________________
       (page generated 2021-07-03 23:00 UTC)