[HN Gopher] Show HN: LibreASR - An On-Premises, Streaming Speech... ___________________________________________________________________ Show HN: LibreASR - An On-Premises, Streaming Speech Recognition System Author : iceychris Score : 184 points Date : 2020-11-15 10:10 UTC (12 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | OnlyMortal wrote: | Having worked on ViaVoice for OS X back in the day, we had to have | models for different varieties of English. The US model couldn't | understand my northern English (think GoT) accent. It's why the | product came out with a UK localisation. | | Wondering if you might get better recognition of the French | President if you had a model per dialect of English? | iceychris wrote: | Yes, probably. The data I trained on mostly reflects UK and US | accents. | OnlyMortal wrote: | IBM kept their US and UK models apart. May have been for | historical reasons or dataset size. | | As an FYI, I was told "the money" was in specific | "dictionaries" for medical professionals and so forth. | Apparently, doctors liked to dictate straight into text. | Might be worth trying that? $$$ | fareesh wrote: | Is there an open-source or paid SDK/API that I can use to create | a group voice chat mobile app with "live" transcription? Or | something that can plug in to a system like this? | | I looked at Twilio but they seem to offer it only on their | VoIP/SIP product. | dcsan wrote: | How real-time do you need it? If you use a streaming API you | can even use Google; there isn't too much lag, and it's | continuous. | | Agora also talks about this, but I haven't used it myself: | https://www.agora.io/en/ | whimsicalism wrote: | > open-source or paid SDK/API that I can use to create a group | voice chat mobile app with "live" transcription? Or something | that can plug in to a system like this? | | Yes: Google, Amazon, and Microsoft all offer streaming solutions | (I wouldn't recommend Amazon's, however, and might recommend | Microsoft over Google). wav2letter from FB is the only open-source | framework worth looking at; DeepSpeech is not a seriously | usable framework. | woodson wrote: | Check out Kaldi. It's a toolkit rather than a ready-to-deploy | service but has some solid pretrained models and recipes for | training your own. You can use various existing projects for | deployment, e.g. vosk-server (also for on-device), which comes | with models for various languages and accents and has an | excellent support channel via Telegram. Quite frankly, | despite not being "end-to-end", you'll get much, much better | results in practice. | whimsicalism wrote: | I collected custom audio and had it transcribed by hand for | cash, then evaluated wav2letter and Vosk on it. At least | for that domain, wav2letter outperforms Vosk. | woodson wrote: | Good for you; it's the only way to know which tool works | best in your case. I did the same for my use case and | arrived at the opposite conclusion. | | What most people don't realize is that whether any given | model/algorithm will work better depends heavily on your | use case and domain. | sbr464 wrote: | Telnyx has media forking, the ability to clone a media stream | in real time without affecting the original call. It allows | receiving the stream directly and operating on it without added | latency. | | Not sure if it's relevant, though; it also uses their SIP | product. If the original service isn't using Telnyx, you could | get creative and have a Telnyx shadow user join the group call to | receive the stream, etc.
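For anyone wanting to try the Kaldi/Vosk route woodson suggests above, here is a minimal live-transcription sketch using Vosk's Python bindings. The model path ("model") and the 16 kHz mono microphone settings are illustrative assumptions, not details from the thread:

    import json
    import pyaudio
    from vosk import Model, KaldiRecognizer

    model = Model("model")               # path to an unpacked Vosk model directory
    rec = KaldiRecognizer(model, 16000)  # the recognizer expects 16 kHz mono PCM

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                     input=True, frames_per_buffer=4000)

    while True:
        data = stream.read(4000)
        if rec.AcceptWaveform(data):     # True once Vosk considers the utterance final
            print(json.loads(rec.Result())["text"])
        else:
            print(json.loads(rec.PartialResult())["partial"])

The same decode loop also works against a remote vosk-server over a WebSocket if the model should live on a separate machine.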
| Y_Y wrote: | Google Meet does this | donpdonp wrote: | The README could use a section on what CPU platform and | storage requirements are necessary to run this app. | sagz wrote: | This is great! Would you be able to integrate it with Live | Transcribe to make a great FOSS solution for the deaf and hard of | hearing? :) | | https://github.com/google/live-transcribe-speech-engine | kkielhofner wrote: | I LOVE that you provided a sample application targeting the | ESP32-LyraT! While the ESP8266/ESP32 get plenty of love on HN | (and elsewhere), I think the ESP-ADF (audio development framework) | and the various dev boards (Lyra, Korvo, etc.) are really | underappreciated and essentially unknown. | | I enjoy a Raspberry Pi, Jetson Nano, Arduino, whatever as much as | the next person, but the seemingly endless stream of projects and | resulting blog posts, etc. featuring them can get a little old. | | Great work! | lunixbochs wrote: | Can you post WER per dataset? Bucketing all of the WER together | means you can only directly compare to models that are validated | on the exact same combination of datasets. This excludes all other | ASR systems from comparison, as well as your own models if you | decide to add validation data in the future. | pw6hv wrote: | Is the transcription of Macron's speech completely off, or am I | not understanding what the two shown texts represent? | jiehong wrote: | Seems off indeed. Plus, the readme says French is not supported | yet. Not the best demo IMO. | iceychris wrote: | I have not yet trained a French model. Also, the gif shows | Macron speaking English to the US Congress with his French | accent [0] | | [0] https://www.youtube.com/watch?v=RqUc1h7bZQ4 | iceychris wrote: | The upper transcript is YouTube's automatic transcription. | Below it is the web app transcribing live. And yes, it is | actually missing a few words. | saurik wrote: | (Anyone know how the transcription quality compares to the | various cloud offerings from AWS, Google, and IBM?) | iceychris wrote: | As I commented above: very poorly. It's still early days. | iceychris wrote: | Hey HN! | | I've been working on this for a while now. While there are other | on-premises solutions using older models such as DeepSpeech [0], I | haven't found a deployable project supporting multiple languages | using the recent RNN-T architecture [1]. | | Please note that this does not achieve SotA performance. Also, | I've only trained it on one GPU, so there might be room for | improvement. | | Edit: Don't expect good performance :D This is still in early- | stage development. I am looking for contributors :) | | [0] https://github.com/mozilla/DeepSpeech | | [1] https://arxiv.org/abs/1811.06621 | the_biot wrote: | Why on earth would you put a demo front and center that shows | your software doing a terrible, terrible job? | brmgb wrote: | That's called setting expectations. If you know your | project might interest people but needs work, why pretend | it's good when it's not? They seem to be courting | contributors more than users anyway. | | I found the video funny. It nicely highlights both the | current limitations and the ambition of the project. A bold | choice, certainly, but I think it works. | dcsan wrote: | And it's maybe a dig at Macron's accent at the same time :D | although the author is a student in Germany. Anyway, you | should join the Discord; we discussed this there too...
| | https://discord.gg/pqTMeP5D3g | whimsicalism wrote: | Awesome project - I'm also working on a similar idea for an on- | premise ASR server! Any reason you decided to go with RNN-T? | woodson wrote: | You can also check out | https://github.com/TensorSpeech/TensorFlowASR for inspiration | (not my project, not involved). It implements streaming | Transformer and Conformer RNN-T models (but in TF2), with | on-device deployment via TFLite. So far, there aren't many usable | pretrained models available (just LibriSpeech), but with some | work it could turn out quite nicely. | th3h4mm3r wrote: | Hi! What would you need to implement another language, e.g. | Italian or French? I mean: is the problem a lack of data, or | something else? | | Another question: could you use, for example, Mozilla's voice | data to train/test? | iceychris wrote: | Data and compute are the largest hurdles. I only have one GPU | and training one model takes 3+ days, so I am limited by | that. Also, scraping from YouTube takes time and a lot of | storage (multiple TBs). | | Mozilla Common Voice data is already used for training. | jack_pp wrote: | Why does it take a lot of data? AFAIK you can select lower | quality in youtube-dl, and you don't even need the video, do | you? | whimsicalism wrote: | > Why does it take a lot of data? AFAIK you can select | lower quality in youtube-dl, and you don't even need the | video, do you? | | But you need the supervised data (transcripts) too. | klysm wrote: | I know you can scrape only the audio from YouTube with | youtube-dl, but it's somewhat annoying. | jerf wrote: | youtube-dl -f bestaudio $URL | | Dunno when that went in, but it works now. | Shared404 wrote: | I use something akin to | alias downloadmusic='youtube-dl --extract-audio | --audio-quality 0 --add-metadata' | in my .bashrc. | | I find that helps with the annoyance of downloading | things off of YT. This is for music obviously, but | there's an option to download subtitles as well. | | EDIT: Typed this from memory; there may be errors in the | alias. | th3h4mm3r wrote: | So do you scrape videos from YouTube with subtitles to | collect data? | th3h4mm3r wrote: | For the compute problem: maybe you can use a GPU-powered | cloud server such as https://www.paperspace.com/ - I don't | know the current prices, but I remember it was quite | affordable. | whimsicalism wrote: | > I remember it was quite affordable. | | Relative to what? Paperspace is one of the costlier GPU | providers. | th3h4mm3r wrote: | Okay, you are right, but it's also really performant, so | IMHO you can do a lot of work in less time. | | For something cheaper, I read this post on Reddit: | | https://amp.reddit.com/r/devops/comments/dqh09n/cheapest_clo... | whimsicalism wrote: | Performant? It's the same GPU...? | jiehong wrote: | What datasets are used to train the models? | iceychris wrote: | LibriSpeech, Tatoeba, Common Voice and scraped YouTube videos. | blackcat201 wrote: | Do you get good results when adding scraped YouTube audio? My | model's performance on LibriSpeech dev drops a bit when adding | YouTube audio to the training dataset (my guess is it's likely | due to poor alignment in the auto-generated captions). | iceychris wrote: | I haven't trained on LibriSpeech exclusively, but yes, the | performance on LibriSpeech dev is quite bad, around 60.0 WER. | If the poor alignment of the YouTube captions is the issue, | maybe concatenating multiple samples would help a bit. | lunixbochs wrote: | You should consider realignment; maybe start with | something like DSAlign or my wav2train project.
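For readers curious what the scraping step discussed above might look like, here is a rough sketch of grabbing audio plus YouTube's auto-generated captions with youtube-dl's Python API. The option keys are standard youtube-dl options, but the output layout is an assumption, and this is not necessarily how LibreASR's scraper works (the Macron video from the demo is reused as the example URL):

    import youtube_dl

    ydl_opts = {
        "format": "bestaudio/best",       # skip the video stream entirely
        "writeautomaticsub": True,        # save auto-generated captions...
        "subtitleslangs": ["en"],         # ...in English...
        "subtitlesformat": "vtt",         # ...as WebVTT, which carries timings
        "outtmpl": "data/%(id)s.%(ext)s", # one audio file + one .vtt per video
    }

    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download(["https://www.youtube.com/watch?v=RqUc1h7bZQ4"])

The caption timestamps are only approximate, which matches the alignment problems blackcat201 and lunixbochs describe; a realignment pass (e.g. DSAlign) over the downloaded pairs is the usual remedy.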
| dcsan wrote: | Would it be possible to train on output from any of the more | recent text-to-speech engines out there? Some of them are very | realistic. | | This would give you absolutely perfect sync down to the word, I | assume... I don't know about the cost if you paid rate card, | though; perhaps you can do a partnership with them, since yours | is a symmetrical product. | blackcat201 wrote: | Cool project; it seems like your model has a similar WER to mine | (4th reference in the readme). Do you plan to do any pre-training | on the encoder part in the future? Maybe something like this [1] | | [1] https://ai.facebook.com/blog/wav2vec-20-learning-the-structu... | bravura wrote: | Hey blackcat, I have some work in preprint for a NeurIPS | workshop demonstrating negative results for different audio | distances on pitch tasks. There is one particular w2v result | I'd like your feedback on. | | Do you mind emailing me? Lastname at gmail dot com (see my | profile for my name) | iceychris wrote: | Hey blackcat! Your project [0] helped me a lot! Pre-training | the encoder sounds great; I'll maybe add it in the future. | | [0] https://github.com/theblackcat102/Online-Speech-Recognition | guillem_lefait wrote: | Nice project, and bold to demo with a French native speaking | English. | | On a side project, I'm looking at the best interface to | facilitate further editing (correction) of the recognized text. | The target is local councils and regional parliaments, where | sessions are usually recorded but without transcripts. Even if | xx% accuracy is enough to identify keywords, manual editing is | still required so as not to distort the precise meaning. | | Nothing special in the interface, but two features seem | interesting: 1. Being able to collaborate in real time, maybe | using the Etherpad API to merge multiple edits. 2. Easily | validating text and labeling speakers so as to generate new | training data. | | Pointers to similar existing solutions would be very appreciated. | nsomaru wrote: | I'm really interested in this project too. Been thinking about | similar solutions for a while now. | | I looked into Kaldi and Mozilla DeepSpeech, but the former | seems geared toward ASR experts and the latter didn't seem | suited to my particular application (longer recorded audio or | a real-time stream). | whimsicalism wrote: | Use wav2letter | posguy wrote: | Mozilla DeepSpeech has had streaming audio support as of a few | releases ago, and the word error rate has also improved. | | I would recommend looking at Vosk too; it converts speech to | text much faster than Mozilla DeepSpeech while having | slightly better results: https://alphacephei.com/vosk/ | ahoka wrote: | In case anyone wondered what the title has to do with logic: | it's probably just a (common) malapropism. | | premise (noun): a previous statement or proposition from which | another is inferred or follows as a conclusion. | | premises (noun): a house or building, together with its land and | outbuildings, occupied by a business or considered in an official | context. | iceychris wrote: | Right, fixed it, thank you :D | Proven wrote: | It's not "mal" anything; many simply prefer to use "on-premise" | or "on-prem" for "on-premises". | | You didn't have any issue understanding the original title. | yjftsjthsd-h wrote: | > You didn't have any issue understanding the original title. | | Humans are very good at live error correction, but that | doesn't make it not wrong.
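On posguy's note above about DeepSpeech's streaming support, a minimal sketch of its streaming Python API. The model and scorer file names match the standard 0.9.x release artifacts; audio_chunks is a hypothetical iterable of 16-bit, 16 kHz mono PCM buffers, not something defined in the thread:

    import numpy as np
    from deepspeech import Model

    ds = Model("deepspeech-0.9.3-models.pbmm")
    ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")

    stream = ds.createStream()
    for chunk in audio_chunks:  # hypothetical source of 16-bit 16 kHz PCM bytes
        stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
        print(stream.intermediateDecode())  # partial hypothesis so far
    print(stream.finishStream())            # final transcript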
| woodson wrote: | This is somewhat off-topic, but as someone who has worked on | various speech processing and ASR projects, I'm curious to learn | from people who have specific problems and applications for this | technology where it can make a difference. | | That is to say, what are areas where you think ASR can enable new | products or make common and tedious tasks much more efficient? | crawfordcomeaux wrote: | My partner and I are raising our child in unique ways, with a | focus on avoiding certain linguistic patterns. I'm interested | in filming/recording our home, automating transcription, and | creating a system for voice-driven video editing we can do on | the fly to create highlights and capture discussions. I see | this as a way to create a dataset they can use in the future to | diagnose any trauma caused by our choices. | 72deluxe wrote: | You're effectively experimenting on your child? Don't you | think the repercussions might be severe? | crawfordcomeaux wrote: | Every parent is, though many will deny it. The | repercussions of doing so without intention and without | admitting it are already severe. Every parent is likely to | traumatize their child in some way, including the trauma of | protecting them from trauma to the point that they don't | learn to heal through it. I'm OK with intentionally | experimenting and normalizing healing within our family. | | Also, it's already paying off tremendously. When | repercussions can be severe, so can rewards. We have a | 2-year-old who is incredibly emotionally aware, has a huge | vocabulary, enjoys eating anything, explores freely, is | learning to play multiple instruments, draws with a pencil | grip in both hands, can sit and actively listen to music for | 20+ minutes at a time, and learns lyrics incredibly fast. | | If you have specific fears, I'm interested in hearing them, | because that gives us an opportunity to prepare. ___________________________________________________________________ (page generated 2020-11-15 23:00 UTC)