[HN Gopher] Wav2vec Overview: Semi and Unsupervised Speech Recog... ___________________________________________________________________ Wav2vec Overview: Semi and Unsupervised Speech Recognition Author : vackosar Score : 110 points Date : 2021-07-03 15:39 UTC (7 hours ago) (HTM) web link (vaclavkosar.com) (TXT) w3m dump (vaclavkosar.com) | lunixbochs wrote: | One addendum to the linked post's notes: | | > SoTa in low-resource setting Libri-light by a lot on WER clean | test 100h labeled: others ~4 vs theirs ~2.5 | | > SoTa on high-resource noisy data (3.3 vs 3.4) close to SoTa on | clean data | | This note isn't super specific, but it's outdated if I'm | understanding it correctly. To my understanding, the SOTA on this | data is held by Conformer 1B (a 1 billion parameter model), at | 1.4 clean, 2.6 noisy. | | Conformer 1B is something like wav2vec 2.0 pretraining + | conformer + noisy student + specaugment. | | https://arxiv.org/pdf/2010.10504.pdf | | -- | | Wav2vec 2.0 is very cool, but I've had some trouble reproducing | the pretraining and fine tuning reliably. It might need a lot of | resources (e.g. hundreds of clustered GPUs). | | I think Wav2vec-U is extremely cool. | knuthsat wrote: | I always wonder how people figure out these successful gigantic | models if it takes hundreds of TPUs and days to train them. | | I recently bought rtx 3090 in hopes of playing around with some | computer vision applications but I guess having 24GB VRAM is | nothing if I want to get something SOTA working. | qayxc wrote: | The RTX 3090 is a beast compared to what researchers had | available to them just a few years ago. | | Don't try to chase SOTA - that's a fruitless endeavour. | | 24GB of VRAM is plenty for CV and you can train some | excellent models with it. You also need to keep in mind that | you don't necessarily need to train models from scratch | either. 
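[A toy numpy sketch of the transfer-learning idea discussed in this subthread: keep a pretrained backbone frozen and train only a small linear head on its features. The "backbone" here is just a fixed random projection and the labels are synthetic, purely for illustration — a real setup would load actual pretrained weights.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pretrained backbone: a fixed projection that is
# never updated during fine-tuning (toy setup, not a real network).
W_frozen = rng.standard_normal((16, 4))

def features(x):
    """Frozen feature extractor -- no gradients flow through this."""
    return np.tanh(x @ W_frozen)

# Toy labeled data for the downstream task.
X = rng.standard_normal((200, 16))
F = features(X)                      # extract features once; backbone stays fixed
y = (F[:, 0] > 0).astype(float)      # synthetic labels, separable in feature space

# Only the small linear head is trained ("fine-tuning" just the head).
w, b, lr = np.zeros(4), 0.0, 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))  # sigmoid head
    w -= lr * F.T @ (p - y) / len(y)        # logistic-loss gradient step
    b -= lr * np.mean(p - y)

acc = np.mean(((F @ w + b) > 0) == (y > 0.5))
print(f"head-only training accuracy: {acc:.2f}")
```

[Training only the few parameters of the head is cheap even on modest hardware, which is the point being made about not needing to train big models from scratch.]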
| | You can achieve great things by downloading a well-tested, | pretrained model and fine-tuning it for your particular task or | application. Trying to come up with new models and training | them from scratch is an exercise in futility for really big | models. | | I usually only train smaller models (a couple of million | parameters) and training and finetuning usually takes | anywhere from a few hours to a day or two. But then again my | hardware is two generations older than yours. | sdenton4 wrote: | The EfficientNet paper has some good things to say on this. | | If you're working at a place with giant datacenters full of | (T/G)PUs, you can train one giant model a few times, or train | smaller models hundreds of times. Without hyperparameter | search, there's a really high chance that you're just looking | in the wrong region and wind up with something gigantic but | kinda meh. | | So, the simple strategy is to use the smaller models to find | a great mix of hyperparameters, and then scale up to a | gigantic model. The EfficientNet paper demonstrates some | fairly reliable ways to scale up the model, changing width | and depth together according to a scaling factor. | | But yeah, even for smaller model footprints, the ability to | run tens of experiments in parallel goes a very long way. If | you've got a single GPU to play with, I would instead try to | focus on a well-scoped interesting question that you can | answer without having to demonstrate SOTA-ness, as it will be | an uphill climb. | | Also remember that it's good to lean heavily on pre-trained | models to save time. Anything you can do to iterate faster, | really. | WillDaSilva wrote: | I wonder how much better this would be at capturing information | that doesn't translate well into text representations of speech. | | Consider how with word2vec there are relationships in the | embedding space between semantically related words. I would | expect the examples of that for word2vec (e.g.
king -> queen | being a similar translation as man -> woman) to apply here too, | but can it also do things like place regular questions and | rhetorical questions in different regions of the embedding space | based off of the inflection in the speech? | | It would also be interesting to see what relationships exist | between equivalent words in different languages within the | embedding space. I suppose something like that is probably | already used for text translation neural networks, but maybe some | notable differences exist when dealing with speech directly. | theropost wrote: | Does anyone know of some good open-source projects for OCR? | Tesseract always seems to be the default, and then it seems | Google Cloud and other services are miles ahead. However, for | those who don't want to rely on the big tech companies, are there | any comparable alternatives? | ismaj wrote: | There is easyocr which is good enough but lacks maturity (it | was acknowledged at some point by Yann LeCun). The code base | isn't ideal. I'm currently working on my own custom OCR since | easyocr isn't perfect at detecting emails, for example | Www.ismaj@gmail ;com | piceas wrote: | I recently came across CRAFT which appears to have come out of | the ICDAR2017 Robust reading challenge. | | It performed better than expected. I only tested a few images | so please don't take my word for it. | | That led me to PaddleOCR. There is still plenty of room for | improvement but I found it way more convenient to use for my | purposes than messing with Tesseract. | | https://github.com/clovaai/CRAFT-pytorch | | https://github.com/PaddlePaddle/PaddleOCR | spijdar wrote: | As someone who's an idiot about machine learning, is it possible | to run this code in reverse? e.g. take the generated (or novel) | vectors and convert them back into audio/waveforms? | monocasa wrote: | Generalized reverse projection through even non-recurrent | neural networks is still an open research problem.
| | So no in this case. | spywaregorilla wrote: | That doesn't sound like a particularly realistic problem to | solve. | monocasa wrote: | I agree, but all the more glory if someone does solve it | then. And the field is still new enough that I don't want | to be cited for decades like the iPod release "no wireless. | Less space than a Nomad. Lame." slashdot comment. | [deleted] | jmalicki wrote: | If you look at the architecture diagram for Wav2Vec-U, the | "generator" is doing exactly that - generating waveforms from | the vectors. All GANs work this way, and this is how websites like | https://thispersondoesnotexist.com/ work. Of course, as the | sibling comment notes, the results today might not be great for | this task, and it is open research, but it's not as if it just | can't be done at all. | lunixbochs wrote: | My reading of the generator diagram (figure 6) isn't that it | is generating waveforms, but that it is generating phoneme | probabilities. | | You can train a similar system to produce audio on the output | of wav2vec, though it probably won't sound similar to the | input audio (accent/voice) unless you expose more features of | the input than phonemes. ___________________________________________________________________ (page generated 2021-07-03 23:00 UTC)
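[Appendix: the embedding-arithmetic analogy mentioned upthread (king - man + woman landing near queen) can be sketched with numpy. The vectors below are hand-picked for illustration only; real word2vec embeddings are learned from data and typically 100-300 dimensional.]

```python
import numpy as np

# Hand-made toy embeddings, purely illustrative (not learned vectors).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy query: king - man + woman should be nearest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
candidates = [w for w in emb if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```

[Whether a speech embedding like wav2vec's carries analogous directions for prosody (e.g. rhetorical vs. regular questions) is exactly the open question WillDaSilva raises.]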