[HN Gopher] Show HN: LibreASR - An On-Premises, Streaming Speech...
       ___________________________________________________________________
        
       Show HN: LibreASR - An On-Premises, Streaming Speech Recognition
       System
        
       Author : iceychris
       Score  : 184 points
       Date   : 2020-11-15 10:10 UTC (12 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | OnlyMortal wrote:
        | Having worked on ViaVoice for OS X back in the day, we had to
        | have models for different varieties of English. The US model
        | couldn't understand my northern English (think GoT) accent.
        | That's why the product came out with a UK localisation.
       | 
        | I wonder if you might get better recognition of the French
        | President if you had a model per dialect of English?
        
         | iceychris wrote:
         | Yes, probably. The data I trained on mostly reflects UK and US
         | accents.
        
           | OnlyMortal wrote:
           | IBM kept their US and UK models apart. May have been historic
           | or dataset size.
           | 
            | As an FYI, I was told "the money" was in specific
           | "dictionaries" for medical professionals and so forth.
           | Apparently, doctors liked to dictate straight into text.
           | Might be worth trying that $$$EUREUREURPSPSPS?
        
       | fareesh wrote:
       | Is there an open-source or paid SDK/API that I can use to create
       | a group voice chat mobile app with "live" transcription? Or
       | something that can plug-in to a system like this?
       | 
       | I looked at Twilio but they seem to only offer a means to do it
       | on their VOIP/SIP product.
        
         | dcsan wrote:
          | How real-time do you need it? If you use a streaming API you
          | can even use Google; there isn't too much lag, and it's
          | continuous.
         | 
          | Agora also talks about this, but I haven't used it myself:
         | https://www.agora.io/en/
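          | 
          | In case it helps, a rough sketch of the Google streaming route
          | in Python (using the google-cloud-speech package; exact call
          | signatures vary a bit between library versions, and
          | audio_chunks stands in for whatever feeds audio from the call,
          | so treat this as an illustration):
          | 
          |     from google.cloud import speech
          | 
          |     client = speech.SpeechClient()
          |     config = speech.RecognitionConfig(
          |         encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
          |         sample_rate_hertz=16000,
          |         language_code="en-US",
          |     )
          |     streaming_config = speech.StreamingRecognitionConfig(
          |         config=config, interim_results=True)
          | 
          |     def requests(chunks):
          |         # chunks: an iterable of raw 16-bit PCM byte strings
          |         for chunk in chunks:
          |             yield speech.StreamingRecognizeRequest(
          |                 audio_content=chunk)
          | 
          |     responses = client.streaming_recognize(
          |         streaming_config, requests(audio_chunks))
          |     for response in responses:
          |         for result in response.results:
          |             print(result.is_final,
          |                   result.alternatives[0].transcript)
          | 
          | interim_results=True is what gives you the "live" partial
          | transcripts as people speak.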
        
         | whimsicalism wrote:
         | > open-source or paid SDK/API that I can use to create a group
         | voice chat mobile app with "live" transcription? Or something
         | that can plug-in to a system like this?
         | 
          | Yes, Google, Amazon, and Microsoft all offer streaming
          | solutions (I wouldn't recommend Amazon's, however, and might
          | recommend Microsoft over Google). wav2letter from FB is the
          | only open-source framework worth looking at; DeepSpeech is not
          | a seriously usable framework.
        
           | woodson wrote:
           | Check out Kaldi. It's a toolkit rather than a ready-to-deploy
           | service but has some solid pretrained models and recipes for
           | training your own. You can use various existing projects for
           | deployment, e.g. vosk-server (also for on-device) which comes
           | with models for various languages and accents and has an
            | excellent support channel via Telegram. Quite frankly,
            | despite not being "end-to-end", you'll get much, much
            | better results in practice.
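            | 
            | For a feel of how little code the vosk route takes, a
            | minimal sketch in Python (the model path and WAV file are
            | placeholders; assumes 16 kHz mono PCM input):
            | 
            |     import json
            |     import wave
            |     from vosk import Model, KaldiRecognizer
            | 
            |     wf = wave.open("speech.wav", "rb")
            |     rec = KaldiRecognizer(Model("path/to/vosk-model"),
            |                           wf.getframerate())
            | 
            |     while True:
            |         data = wf.readframes(4000)
            |         if not data:
            |             break
            |         if rec.AcceptWaveform(data):
            |             # a finalized segment of the transcript
            |             print(json.loads(rec.Result())["text"])
            |     print(json.loads(rec.FinalResult())["text"])
            | 
            | For streaming, rec.PartialResult() returns interim
            | hypotheses between finalized segments.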
        
             | whimsicalism wrote:
             | I collected custom audio and had it transcribed by hand for
             | cash, then evaluated it on wav2letter and vosk. At least
             | for that domain, wav2letter outperforms vosk.
        
               | woodson wrote:
               | Good for you, it's the only way to know which tool works
               | best in your case. I did the same for my use case and
               | arrived at the opposite conclusion.
               | 
               | What most people don't realize is that it heavily depends
               | on your use case and domain whether any given
               | model/algorithm will work better.
        
         | sbr464 wrote:
         | Telnyx has media forking, the ability to clone a media stream
         | in real time without affecting the original call. It allows
         | receiving the stream directly and operating on it without
         | latency.
         | 
          | Not sure if it's relevant though, as it also uses their SIP
          | product. If the original service isn't using Telnyx, you could
          | get creative and have a Telnyx shadow user join the group call
          | to receive the stream, etc.
        
         | Y_Y wrote:
         | Google Meet does this
        
       | donpdonp wrote:
        | The README could use a section on the CPU platform and storage
        | requirements needed to run this app.
        
       | sagz wrote:
        | This is great! Would you be able to integrate it with Live
       | Transcribe to make a great FOSS solution for the deaf and hard of
       | hearing? :)
       | 
       | https://github.com/google/live-transcribe-speech-engine
        
       | kkielhofner wrote:
        | I LOVE that you provided a sample application targeting the
        | ESP32-LyraT! While the ESP8266/ESP32 get plenty of love on HN
        | (and elsewhere), I think the ESP-ADF (audio development
        | framework) and its various dev boards (Lyra, Korvo, etc.) are
        | really underappreciated and essentially unknown.
       | 
       | I enjoy a Raspberry Pi, Jetson nano, Arduino, whatever as much as
       | the next person but the seemingly endless stream of projects and
       | resulting blog posts, etc featuring them can get a little old.
       | 
       | Great work!
        
       | lunixbochs wrote:
        | Can you post WER per dataset? Bucketing all of the WER together
        | means you can only directly compare to models that are validated
        | on the exact same combination of datasets. This excludes all
        | other ASR systems from comparison, as well as your own models if
        | you decide to add validation data in the future.
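        | 
        | For what it's worth, per-dataset numbers are cheap to produce,
        | e.g. with the jiwer package (the dataset names and transcripts
        | below are made up purely for illustration):
        | 
        |     import jiwer
        | 
        |     per_dataset = {
        |         "librispeech-dev": (["hello world"], ["hello word"]),
        |         "common-voice-dev": (["good morning"], ["good morning"]),
        |     }
        |     for name, (refs, hyps) in per_dataset.items():
        |         # WER computed over each validation set separately
        |         print(name, jiwer.wer(refs, hyps))
        | 
        | That way each number stays directly comparable to any other
        | system validated on the same individual dataset.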
        
       | pw6hv wrote:
       | Is the transcription of Macron's speech completely off or am I
       | not understanding what the two shown texts represent?
        
         | jiehong wrote:
         | Seems off indeed. Plus, the readme says French is not supported
         | yet. Not the best demo IMO.
        
           | iceychris wrote:
            | I have not yet trained a French model. Also, the GIF shows
            | Macron speaking in English to the US Congress, with his
            | accent [0]
            | 
            | [0] https://www.youtube.com/watch?v=RqUc1h7bZQ4
        
         | iceychris wrote:
         | The upper transcript is YouTube's automatic transcription.
         | Below is the web app transcribing live. And yes, it is actually
         | missing a few words.
        
       | saurik wrote:
       | (Anyone know how the transcription quality compares to the
       | various cloud offerings from AWS, Google, and IBM?)
        
         | iceychris wrote:
         | As I commented above, very poorly. It's still early days.
        
       | iceychris wrote:
       | Hey HN!
       | 
       | I've been working on this for a while now. While there are other
       | on-premise solutions using older models such as DeepSpeech [0], I
       | haven't found a deployable project supporting multiple languages
       | using the recent RNN-T Architecture [1].
       | 
       | Please note that this does not achieve SotA performance. Also,
       | I've only trained it on one GPU so there might be room for
       | improvement.
       | 
        | Edit: Don't expect good performance :D this is still in
        | early-stage development. I am looking for contributors :)
       | 
       | [0] https://github.com/mozilla/DeepSpeech
       | 
       | [1] https://arxiv.org/abs/1811.06621
        
         | the_biot wrote:
         | Why on earth would you put a demo front and center that shows
         | your software doing a terrible, terrible job?
        
           | brmgb wrote:
            | That's called setting expectations. If you know your
            | project might interest people but needs work, why pretend
            | it's good when it's not? They seem to be courting
            | contributors more than users anyway.
           | 
            | I found the video funny. It nicely highlights both the
            | current limitations and the ambition of the project. A bold
            | choice, certainly, but I think it works.
        
             | dcsan wrote:
              | And maybe it's a dig at Macron's accent at the same time
              | :D although the author is a student in Germany. Anyway,
              | you should join the Discord; we discussed this there
              | too...
             | 
             | https://discord.gg/pqTMeP5D3g
        
         | whimsicalism wrote:
         | Awesome project - I'm also working on a similar idea for an on-
         | premise ASR server! Any reason you decided to go with RNN-T?
        
         | woodson wrote:
         | You can also check out
         | https://github.com/TensorSpeech/TensorFlowASR for inspiration
          | (not my project, not involved). It implements streaming
          | Transformer and Conformer RNN-T models (but in TF2), with
          | on-device deployment via TFLite. So far, there aren't many
          | usable pretrained models available (just LibriSpeech), but
          | with some work it could turn out quite nicely.
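          | 
          | The TFLite side is just the standard interpreter API; a rough
          | sketch (the model file name and I/O shapes are placeholders,
          | and a real streaming RNN-T would also carry prediction-network
          | state between chunks):
          | 
          |     import numpy as np
          |     import tensorflow as tf
          | 
          |     interpreter = tf.lite.Interpreter(model_path="asr.tflite")
          |     interpreter.allocate_tensors()
          |     inp = interpreter.get_input_details()[0]
          |     out = interpreter.get_output_details()[0]
          | 
          |     # feed one dummy chunk shaped like the model's input
          |     chunk = np.zeros(inp["shape"], dtype=inp["dtype"])
          |     interpreter.set_tensor(inp["index"], chunk)
          |     interpreter.invoke()
          |     tokens = interpreter.get_tensor(out["index"])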
        
         | th3h4mm3r wrote:
          | Hi! What would you need to implement other languages, e.g.
          | Italian or French? I mean: is it a problem of insufficient
          | data, or something else?
          | 
          | Another question: could you use, for example, Mozilla voice
          | data to train/test?
        
           | iceychris wrote:
           | Data and compute are the largest hurdles. I only have one GPU
           | and training one model takes 3+ days, so I am limited by
           | that. Also, scraping from YouTube takes time and a lot of
           | storage (multiple TBs).
           | 
           | Mozilla Common Voice data is already used for training.
        
             | jack_pp wrote:
             | Why does it take a lot of data? Afaik you can select lower
             | quality in youtube-dl but you don't even need video do you?
        
               | whimsicalism wrote:
               | > Why does it take a lot of data? Afaik you can select
               | lower quality in youtube-dl but you don't even need video
               | do you?
               | 
               | But you need supervised data too.
        
               | klysm wrote:
                | I know you can scrape only the audio from YouTube with
                | youtube-dl, but it's somewhat annoying
        
               | jerf wrote:
               | youtube-dl -f bestaudio $URL
               | 
               | Dunno when that went in but it works now.
        
               | Shared404 wrote:
                | I use something akin to
                | 
                |     alias downloadmusic='youtube-dl --extract-audio --audio-quality 0 --add-metadata'
                | 
                | in my .bashrc
               | 
               | I find that helps with the annoyance of downloading
               | things off of YT. This is for music obviously, but
               | there's an option to download subtitles as well.
               | 
               | EDIT: Typed this from memory, there may be errors in the
               | alias.
        
             | th3h4mm3r wrote:
              | So do you scrape videos from YouTube with subtitles to
              | collect the data?
        
             | th3h4mm3r wrote:
              | For the compute problem: maybe you could use a GPU-powered
              | cloud server such as https://www.paperspace.com/. I don't
              | know the current prices, but I remember it was quite
              | affordable.
        
               | whimsicalism wrote:
               | > I remember it was quite affordable.
               | 
               | Relative to what? Paperspace is one of the costlier GPU
               | providers.
        
               | th3h4mm3r wrote:
                | Okay, you are right, but it's also really performant, so
                | IMHO you can get a lot of work done in less time.
                | 
                | For something cheaper, I read this post on Reddit:
               | 
               | https://amp.reddit.com/r/devops/comments/dqh09n/cheapest_
               | clo...
        
               | whimsicalism wrote:
                | Performant? It's the same GPU...?
        
       | jiehong wrote:
       | What datasets are used to train the models?
        
         | iceychris wrote:
         | LibriSpeech, Tatoeba, Common Voice and scraped YouTube videos.
        
           | blackcat201 wrote:
            | Do you get good results when adding scraped YouTube audio?
            | My model's performance on LibriSpeech dev drops a bit when
            | adding YouTube audio to the training dataset (my guess is
            | that it's due to poor alignment from auto-generated
            | captions).
        
             | iceychris wrote:
              | I haven't trained on LibriSpeech exclusively, but yes, the
              | performance on LibriSpeech dev is quite bad, around 60.0
              | WER. If the poor alignment of YouTube captions is the
              | issue, maybe concatenating multiple samples would help a
              | bit.
        
               | lunixbochs wrote:
               | You should consider realignment; maybe start with
               | something like DSAlign or my wav2train project.
        
           | dcsan wrote:
            | Would it be possible to train on any of the more recent
            | text-to-speech engines out there? Some of them are very
            | realistic.
            | 
            | This would give you absolutely perfect sync down to the
            | word, I assume... I don't know about the cost if you paid
            | the rate card, though; perhaps you could do some partnership
            | with them since yours is a symmetrical product.
        
       | blackcat201 wrote:
        | Cool project, seems like your model has a similar WER to mine
        | (4th reference in the readme). Do you plan to do any pre-training
        | on the encoder part in the future? Maybe something like this [1]
       | 
       | [1] https://ai.facebook.com/blog/wav2vec-20-learning-the-
       | structu...
        
         | bravura wrote:
          | Hey black cat, I have some work in preprint for a NeurIPS
         | workshop, demonstrating negative results of different audio
         | distances on pitch tasks. There is one particular w2v result
         | I'd like your feedback on.
         | 
         | Do you mind emailing me? Lastname at gmail dot com (see my
         | profile for my name)
        
         | iceychris wrote:
         | Hey blackcat! Your project [0] helped me a lot! Pre-training
         | the encoder sounds great, I'll maybe add it in the future.
         | 
         | [0] https://github.com/theblackcat102/Online-Speech-Recognition
        
       | guillem_lefait wrote:
        | Nice project, and bold to demo with a French native speaking
        | English.
       | 
        | On a side project, I'm looking at the best interface to
        | facilitate further editing (correction) of the recognized text.
        | The target is local councils and regional parliaments, where
        | sessions are usually recorded but without transcripts. If xx%
        | accuracy is enough to identify keywords, manual editing is still
        | required so as not to distort the precise meaning.
       | 
        | Nothing special in the interface, but two features seem
        | interesting: 1. being able to collaborate in real time, maybe
        | using the Etherpad API to merge multiple edits; 2. easily
        | validating text and labelling speakers so as to generate new
        | training data.
       | 
       | Pointers to similar existing solutions would be very appreciated.
        
         | nsomaru wrote:
         | I'm really interested in this project too. Been thinking about
         | similar solutions for a while now.
         | 
          | I looked into Kaldi and Mozilla DeepSpeech, but the former
          | seems geared toward ASR experts and the latter didn't seem
          | suited to my particular application (longer recorded audio or
          | a real-time stream).
        
           | whimsicalism wrote:
           | Use wav2letter
        
           | posguy wrote:
            | Mozilla DeepSpeech has had streaming audio support as of a
            | few releases ago, and the word error rate has also improved.
           | 
           | I would recommend looking at Vosk too, it converts speech to
           | text much faster than Mozilla DeepSpeech while having
           | slightly better results: https://alphacephei.com/vosk/
        
       | ahoka wrote:
        | In case anyone wondered what the title has to do with logic:
        | it's probably just a (common) malapropism.
       | 
       | premise (noun) a previous statement or proposition from which
       | another is inferred or follows as a conclusion.
       | 
       | premises (noun) a house or building, together with its land and
       | outbuildings, occupied by a business or considered in an official
       | context.
        
         | iceychris wrote:
         | Right, fixed it, thank you :D
        
         | Proven wrote:
         | It's not "mal" anything, many simply prefer to use "on-premise"
         | or on-prem for "on-premises".
         | 
         | You didn't have any issue understanding the original title.
        
           | yjftsjthsd-h wrote:
           | > You didn't have any issue understanding the original title.
           | 
           | Humans are very good at live error correction, but that
           | doesn't make it not wrong.
        
       | woodson wrote:
       | This is somewhat off-topic, but as someone who has worked on
       | various speech processing and ASR projects, I'm curious to learn
       | from people who have specific problems and applications for this
       | technology where it can make a difference.
       | 
       | That is to say, what are areas where you think ASR can enable new
       | products or make common and tedious tasks much more efficient?
        
         | crawfordcomeaux wrote:
         | My partner and I are raising our child in unique ways with a
         | focus on avoiding certain linguistic patterns. I'm interested
         | in filming/recording our home, automating transcription, and
         | creating a system for voice-driven video editing we can do on
         | the fly to create highlights and capture discussions. I see
         | this as a way to create a dataset they can use in the future to
         | diagnose any trauma caused by our choices.
        
           | 72deluxe wrote:
            | You're effectively experimenting on your child? Don't you
            | think the repercussions might be severe?
        
             | crawfordcomeaux wrote:
             | Every parent is, though many will deny it. The
             | repercussions of doing so without intention and without
             | admitting it are already severe. Every parent is likely to
             | traumatize their child in some way, including the trauma of
             | protecting them from trauma to the point that they don't
             | learn to heal through it. I'm ok with intentionally
             | experimenting and normalizing healing within our family.
             | 
             | Also, it's already paying off tremendously. When
             | repercussions can be severe, so can rewards. We have a
             | 2-year old who is incredibly emotionally aware, has a huge
             | vocabulary, enjoys eating anything, explores freely, is
             | learning to play multiple instruments, draws with a pencil
             | grip in both hands, can sit to actively listen to music for
             | 20+ minutes at a time, and learns lyrics incredibly fast.
             | 
             | If you have specific fears, I'm interested in hearing them
             | because that gives us an opportunity to prepare.
        
       ___________________________________________________________________
       (page generated 2020-11-15 23:00 UTC)