[HN Gopher] Show HN: Willow - Open-source privacy-focused voice ...
       ___________________________________________________________________
        
       Show HN: Willow - Open-source privacy-focused voice assistant
       hardware
        
        As the Home Assistant project says, it's the year of voice! I love
        Home Assistant and I've always thought the ESP BOX[0] hardware is
        cool. I finally got around to starting a project to use the ESP
        BOX hardware with Home Assistant and other platforms. Why?
         
        - It's actually "Alexa/Echo competitive". Wake word detection,
          voice activity detection, echo cancellation, automatic gain
          control, and high quality audio for $50 means that with Willow
          and the support of Home Assistant there are no compromises on
          looks, quality, accuracy, speed, or cost.
         
        - It's cheap. With a touch LCD display, dual microphones, speaker,
          enclosure, buttons, etc. it can be bought today for $50 all-in.
         
        - It's ready to go. Take it out of the box, flash it with Willow,
          put it somewhere.
         
        - It's not creepy. Voice is either sent to a self-hosted inference
          server or commands are recognized locally on the ESP BOX.
         
        - It doesn't hassle you or try to sell you things. If I hear "Did
          you know?" one more time from Alexa I think I'm going to lose
          it.
         
        - It's open source.
         
        - It's capable. This is the first "release" of Willow and I don't
          think we've even begun scratching the surface of what the
          hardware and software components are capable of.
         
        - It can integrate with anything. Simple on-the-wire format:
          speech output text is sent via HTTP POST to whatever URI you
          configure. Send it anywhere, and do anything!
         
        - It still does cool maker stuff. With 16 GPIOs exposed on the
          back of the enclosure there are all kinds of interesting
          possibilities.
         
        This is the first (and VERY early) release but we're really
        interested to hear what HN thinks!
         
        [0] - https://github.com/espressif/esp-box
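The on-the-wire format described above (speech text POSTed to a configured URI) can be received with a few lines of Python. A minimal sketch, assuming Willow sends the recognized text as a plain POST body; the port, path, and handler behavior here are illustrative, not from the Willow docs:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_speech(text: str) -> str:
    # Act on the recognized speech; here we just echo an acknowledgement.
    return f"heard: {text}"

class WillowHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumption: the speech-to-text output arrives as the raw POST body.
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        reply = handle_speech(text)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(reply.encode("utf-8"))

# To run: HTTPServer(("0.0.0.0", 8000), WillowHandler).serve_forever()
```

Point the configured URI at this host and anything that can parse an HTTP body can act on the speech.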
        
       Author : kkielhofner
       Score  : 421 points
       Date   : 2023-05-15 14:13 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Cheetah26 wrote:
       | This looks like something I've been wanting to see for a while.
       | 
       | I currently have a google home and I'm getting increasingly fed
       | up with it. Besides the privacy concerns, it seems like it's
       | getting worse at being an assistant. I'll want my light turned on
       | by saying "light 100" (for light to 100 percent) and it works
       | about 80% of the time, but the others it starts playing a song
       | with a similar name.
       | 
        | It'd be great if this allows limiting/customizing which words
        | and actions you want.
        
         | schainks wrote:
          | THIS. It's hilarious and infuriating that our digital
          | assistants struggle to understand variants of "set lights at
          | X% intensity".
         | 
         | However, if I spend the time to configure a "scene" with the
         | right presets, Google has no issue figuring it out.
         | 
         | If only it could notice regular patterns about light settings
         | and offer suggestions that I could approve/deny.
        
         | kkielhofner wrote:
         | Totally get it!
         | 
         | There are at least two ways to deal with this frustrating issue
         | with Willow:
         | 
          | - With local command recognition via ESP SR, recognition runs
          | completely on the device and the accepted command syntax is
          | defined ahead of time. It essentially does "fuzzy" matching to
          | address your light command ("light 100"), but there's no way
          | it's going to turn some random match into playing music.
         | 
         | - When using the inference server -or- local recognition we
         | send the speech to text output to the Home Assistant
         | conversation/intents[0] API and you can define valid
         | actions/matches there.
         | 
         | [0] - https://developers.home-assistant.io/docs/intent_index/
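The "fuzzy" matching idea in the first bullet can be illustrated with a toy version in Python. This is a sketch of the concept only, not the ESP SR/Multinet implementation (which runs on-device), and the command list is made up:

```python
import difflib

# A fixed, pre-defined command grammar, as with ESP SR local commands.
COMMANDS = [
    "turn on the light",
    "turn off the light",
    "set light to 100",
    "set light to 50",
]

def match_command(heard: str, cutoff: float = 0.6):
    """Return the closest known command, or None if nothing is close enough.

    Because matching only ever returns entries from COMMANDS, a sloppy
    utterance like "light 100" can map to a light command but can never
    produce an unrelated action (e.g. playing a song).
    """
    hits = difflib.get_close_matches(heard.lower(), COMMANDS, n=1, cutoff=cutoff)
    return hits[0] if hits else None
```

Out-of-grammar speech simply fails to match, which is exactly the "no random music" property described above.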
        
         | t-vi wrote:
          | Personally, I plugged a Jabra conference speaker into a
          | Raspberry Pi, and if it hears something interesting it sends
          | the audio to my local GPU machine for decoding (with Whisper),
          | answer-getting, and a response sent back to the Raspberry Pi
          | as audio (with a model from coqui-ai/TTS, but using more
          | plain PyTorch). Works really nicely for having very local
          | weather, calendar, ...
        
           | kkielhofner wrote:
           | Neat!
           | 
           | If you don't mind my asking, what do you mean "if it hears
           | something interesting"? Is that based on wake word, or always
           | listen/process?
        
             | t-vi wrote:
             | Both:
             | 
             | A long while ago, I wrote a little tutorial[0] on
             | quantizing a speech commands network to the Raspberry. I
             | used that to control lights directly and also for wake word
             | detection.
             | 
             | More recently, I found that I can just use more classic VAD
             | because my uses typically don't suffer if I turn on/off the
             | microphone. My main goal is to not get out the mobile phone
             | for information. That reduces the processing when I turn on
             | the radio...
             | 
              | Not as high-end as your solution, but nice enough for my
              | purposes.
             | 
              | [0] - https://devblog.pytorchlightning.ai/applying-quantization-to...
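The "more classic VAD" approach above can be as simple as an energy gate on the microphone stream. A toy sketch (not t-vi's actual setup; real VADs add smoothing, hangover, and adaptive thresholds):

```python
import math
import struct

def frame_rms(frame: bytes) -> float:
    """RMS energy of a frame of 16-bit signed little-endian PCM samples."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Classic energy-based VAD: a frame counts as 'speech' if it's loud
    enough. The threshold here is an arbitrary placeholder."""
    return frame_rms(frame) > threshold
```

Only frames that pass the gate need to be shipped off for decoding, which is the processing reduction described above.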
        
         | chankstein38 wrote:
          | This drives me nuts and happens all the time as well. To be
          | honest, I unplugged my Google Home device a while back and
          | haven't missed it. It mostly ended up being a clock for me,
          | because whenever I'd try to change my lights to a color it
          | apparently wasn't capable of, I'd have to sit there for
          | minutes listening to it list stores in the area that might
          | sell lights in that color. It wouldn't stop. That's just one
          | of many frustrating experiences I'd had with that thing.
        
       | andruby wrote:
        | Perhaps I am missing something obvious. Where can I buy an
        | ESP-BOX? The GitHub repo makes no mention. I can't find any on
        | AliExpress or Amazon.
        
         | kkielhofner wrote:
         | We mention the names of a bunch of vendors but we don't have
         | direct links. We probably should but I didn't want to wander
         | into the waters of appearing to suggest/recommend one vendor
         | over another.
        
         | stinger wrote:
          | You can get the links here:
          | https://www.espressif.com/en/dev-board/esp32-s3-box-en
        
         | qup wrote:
         | adafruit
        
       | Brendinooo wrote:
       | Nice. Any bits of Mycroft in here? That project just imploded and
       | I'm still sad about it.
        
         | jzebedee wrote:
         | The "TTS Output" and "Audio on device" sections make it seem
         | like there is no spoken output, only status beeps.
         | 
         | A former Mycroft dev, Michael Hansen[1], is still building
         | several year-of-the-voice projects after he was let go. I'm
         | especially excited about Piper[2], which is a C++/py
         | alternative to Mimic3.
         | 
         | [1] https://github.com/synesthesiam [2]
         | https://github.com/rhasspy/piper
        
           | kkielhofner wrote:
           | We plan to make a Home Assistant Willow component to use any
           | of their supported TTS modules to play speech output on
           | device. We just didn't get to it yet.
           | 
           | Our inference server (open source, releasing next week) has
           | highly optimized Whisper, LLaMA/Vicuna/etc, text to speech,
           | etc implementations as well.
           | 
           | It's actually not that hard on the device - if the response
           | from the HA component has audio, play it.
           | 
           | We just don't have the HA component yet :).
        
         | kkielhofner wrote:
         | Thanks!
         | 
         | None.
         | 
          | The ESP BOX hardware and the ESP SR speech recognition
          | library from Espressif handle the low-level audio stuff like
          | wake word detection, DSP work for quality voice, voice
          | activity detection, etc. to get usable far-field audio. The
          | wake word engine uses models from Espressif with wake words
          | like "Alexa", "Hi ESP", "Hi Lexin", etc. If we get traction,
          | Espressif can make us a wake engine model for whatever we
          | want (we're thinking "Hi Willow") but we're open to better
          | ideas!
         | 
         | We currently stream audio after wake in realtime to our very
         | high performance (optimized for "realtime" speech) Whisper
         | inference server implementation. We plan to open source this
         | next week.
         | 
          | We also patched in support for the most recent ESP SR
          | version, which has their genuinely impressive Multinet 6
          | speech command model that does recognition of up to 400
          | commands completely on device after wake activation. We
          | currently try to pull light and switch entities from your
          | configured Home Assistant instance to build the speech
          | commands, but it's really janky. We're working on this.
         | 
         | The default currently is to use our best-effort hosted
         | inference server implementation but like I say in the README,
         | etc we're open sourcing that next week so anyone can stand it
         | up and do all of this completely locally/inside your walls.
        
         | ftyers wrote:
         | Imploded?
        
           | Brendinooo wrote:
           | https://mycroft.ai/blog/update-from-the-ceo-part-1/
        
       | _joel wrote:
       | Is there an analogous thing for RPi? I've got some old ones and a
       | USB mic array from seeed etc that I've still not put to use. Also
       | got an ESP32 (vanilla with a small oled) if I can use that?
        
         | kkielhofner wrote:
          | The Home Assistant project (as part of the "year of voice")
          | is working on wake word, etc. for Raspberry Pi, from what I
          | understand. However, as someone who's tried to do exactly
          | this on a Raspberry Pi before: supporting wake word and
          | getting clean audio from 25 feet away (with background noise,
          | acoustic echo, etc.) using a random collection of software
          | and hardware is very challenging. I have an entire graveyard
          | of mic arrays from seeed and others myself :).
         | 
         | Espressif really did us all a solid with this hardware and
         | their ADF and SR frameworks.
         | 
          | Whether it's cost, being fully assembled and ready to go, or
          | wake word, AEC, AGC, BSS, NS, etc., at least as of now the
          | ESP BOX is essentially impossible to compete with in terms
          | of hardware in the open ecosystem.
         | 
         | I talk about this and more on our wiki pages[0] (check out
         | "Hardware" and "Home Assistant"). In short, the Espressif
         | frameworks we use /technically/ support the "regular" ESP32 but
         | it's so limited (and the ESP BOX/ESP S3 is so cheap) we're not
         | super interested in supporting it.
         | 
         | We're aiming for an end-user experience that's competitive with
         | Echo, Google Home, etc in every possible way - speed, quality,
         | reliability, functionality, and cost.
         | 
         | In fact, we want to crush them on all points to where there's
         | no reason left to buy one of them.
         | 
         | [0] - https://github.com/toverainc/willow/wiki/
        
           | _joel wrote:
           | Thanks, yes that's perfectly reasonable. Cheers for the
           | reply.
        
       | RileyJames wrote:
       | I've been living in a house for the past few months with a google
       | assistant. I only use it to put on music, but I have noticed I
       | play more music due to the ease of putting it on.
       | 
       | But I hate the privacy invasion aspect. I'm definitely in the
       | market for something like this. And this one looks great.
       | 
       | Additionally, I've noticed that the google voice assistant
       | (connected to Spotify) doesn't keep playing the albums I ask for.
       | 
        | It states it's playing the album, but after 4 or 5 songs it
        | starts playing different songs, or different artists.
        
         | chankstein38 wrote:
         | It also, at least in my case, frequently won't stop playing
         | when you tell it to. And, if you want a song that has a title
         | that isn't family friendly, it'll completely ignore that title
         | and just play whatever the heck it wants.
        
         | kkielhofner wrote:
         | Music output is "on the list".
         | 
          | The biggest fundamental issue is that the speaker built into
          | the ESP BOX is optimized for speech and isn't going to
          | impress anyone playing music.
         | 
         | That said, the ESP BOX (of course) supports bluetooth so we can
         | definitely pair with a speaker you bring.
         | 
          | Willow is the first of its kind that I'm aware of to enable
          | this kind of functionality at anything close to this price
          | point in the open source ecosystem. Either we or someone
          | else will likely manufacture an improved ESP BOX with
          | market-competitive speakers built in for music playback.
         | 
         | Then it's "just" a matter of actually getting the music audio
         | but we'll figure that out ;).
        
           | RileyJames wrote:
           | Nice. I guess I don't expect willow to cover the speaker
           | element. I'd rather connect with my existing hifi / Bluetooth
           | speakers.
           | 
           | But with google I'm stuck with their integration to Spotify.
           | It's that component I'd like control over, and that's why I'd
           | use willow.
           | 
           | That and not being spied on in my own home.
           | 
           | Definitely keen for one.
        
           | COGlory wrote:
           | Could we not use Willow to cast music, say, via Spotify or
           | some other network streamer, through HA, to my pre-existing
           | sound system?
        
             | kkielhofner wrote:
             | The approach there would be to ignore Willow for music
             | output and just do what it does today:
             | 
             | - Wake
             | 
             | - Get command
             | 
             | - Send to Home Assistant conversation/intents API[0]
             | 
             | - Home Assistant does whatever you define, including what
             | you describe just like it does today
             | 
             | So unless I'm missing something your use case should "just
             | work".
             | 
              | [0] - https://developers.home-assistant.io/docs/intent_index/
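For reference, the conversation step in the flow above is a plain authenticated HTTP POST per Home Assistant's REST API. A sketch, with a placeholder base URL and token:

```python
import json
import urllib.request

def build_conversation_request(base_url: str, token: str, text: str):
    """Build the POST that hands speech-to-text output to Home Assistant.

    Endpoint and payload follow Home Assistant's REST conversation API;
    base_url and token are placeholders for your own instance's values.
    """
    url = f"{base_url}/api/conversation/process"
    body = json.dumps({"text": text, "language": "en"}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

def send_command(base_url: str, token: str, text: str) -> dict:
    # Network call; needs a live Home Assistant instance to actually run.
    req = build_conversation_request(base_url, token, text)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Home Assistant then matches the text against whatever intents you've defined, as described above.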
        
       | odiroot wrote:
       | Does it also work with the lite version of ESP box?
        
         | stintel wrote:
         | It should. I personally haven't tested it as I don't own the
         | lite version, but everything should work besides touch. It
         | complains about not being able to initialize touch during boot,
         | but that's not fatal. Shouldn't be too hard to ifdef away that
         | error.
        
         | kkielhofner wrote:
         | It does for everything but the touch display because it doesn't
         | have one[0]. We don't support the three buttons at the bottom
         | but I have two ESP BOX Lites and we should be able to make it
         | happen pretty easily.
         | 
          | We haven't been focused on the ESP BOX Lite because it seems
          | kind of limited. That said, Espressif hasn't sold many of
          | these things since release, and judging from people looking
          | for stock, etc. in this thread I think that's about to
          | change.
         | 
         | Espressif has incredible manufacturing capacity and our hope is
         | they will ramp up manufacture of the ESP BOX family now because
         | (to my knowledge) Willow is the first project that actually
         | makes meaningful use of them.
         | 
          | The only gating component of the ESP BOX family that I can
          | see is the plastic enclosure. I'm sure Espressif can figure
          | out how to crank these things out ;).
         | 
         | [0] - https://github.com/toverainc/willow/wiki/Hardware
        
       | hitchstory wrote:
       | This looks amazing.
        
         | kkielhofner wrote:
         | Thanks!
        
       | b8 wrote:
       | This project reminds me of MyCroft
       | https://github.com/MycroftAI/mycroft-core.
        
         | nescioquid wrote:
          | I think Mycroft is dead at this point. The project had
          | intertwined relationships with other projects and Neon AI.
          | The suggestion from Mycroft at this point is to use NeonAI's
          | OS for the Mycroft device:
         | 
         | https://neon.ai/NeonAIforMycroftMarkII
        
           | masukomi wrote:
           | details from the mycroft forums:
           | 
            | https://community.mycroft.ai/t/faq-ovos-neon-and-the-future-...
           | 
           | Although MycroftAI, Inc. has ceased development, the
           | Assistant survives.
           | 
           | A few years ago, some of MycroftAI's partners started using
           | @JarbasAl 's code (more information below) which eventually
           | became a fork of the assistant. Now that MycroftAI is unable
           | to continue development, the fork's devs - The OVOS
           | Foundation - have decided to take up leadership of the
           | Assistant's development and its open source ecosystem.
           | 
           | MycroftAI has signed over Mark II software support, as well
           | as these forums, to one of those partners, a company called
           | NeonAI. Between the OVOS Foundation and NeonAI, the voice
           | assistant and the smart speaker project are getting a new
           | lease on life.
           | 
           | The OVOS Assistant - it'll get a better name soon, we promise
           | - started out as a drop-in replacement for Mycroft. It should
           | be compatible with all your classic Mycroft skills. It will
           | even accept existing configuration files! Because we have
           | been operating at a much smaller scale for the past three
           | years, things will seem rough around the edges for a little
           | while. However, we are scaling up. Read on.
        
       | michaelmior wrote:
        | I think your post suggests the answer is yes, but do you think
        | the ESP BOX hardware is a good long-term bet? That is, do you
       | see Willow as working with ESP BOX for the foreseeable future
       | with whatever improvements are planned? Just wondering if it's
       | worthwhile investing in the hardware now even if it doesn't
       | currently quite do what I want.
        
         | kkielhofner wrote:
         | Absolutely.
         | 
         | I cannot stress enough what a gift the ESP BOX and Espressif
          | component libraries are to Willow. As anyone who's dealt
          | with it can tell you, wake word detection (while minimizing
          | false wakes) and getting clean speech from 30 ft away with
          | acoustic echo, background noise, etc. is still a fairly hard
          | problem. I've been pretty deep in this space for a while and
          | I'm not aware of anything open source (or close to it) that
          | is even remotely competitive with the Espressif SR+AFE
          | implementations.
         | The ESP BOX has also been acoustically tuned by Espressif to
         | address all of the weird enclosure issues with audio. Their
         | AFE+SR interface has been tested and certified by Amazon
         | themselves for use in Alexa ecosystem devices. It's truly
         | excellent.
         | 
         | Espressif has an excellent track record for VERY long term
         | hardware and software support and if anything we're on the very
         | early bleeding edge of what the hardware and software
         | components are capable of. As one example, we're using the ESP
         | SR wake and local command support released by Espressif last
         | week!
        
       | WaitWaitWha wrote:
       | Congratulations! This is great news!
       | 
       | I do not see anything posted on the Home Assistant (HA) Community
       | forums.
       | 
       | > Configuring and building Willow for the ESP BOX is a multi-step
       | process. We're working on improving that but for now...
       | 
       | This is crucial as your "competitors" are ready out of the box. I
       | believe HA can be a Google/Alexa alternative to the masses only
       | if the "out-of-the-box" experience is comparable to the
       | commercial solutions.
       | 
       | Good luck, and keep us updated!
        
         | kkielhofner wrote:
         | Thanks!
         | 
         | HN was my first stop (of course) - I'll be heading over there
         | shortly to post.
         | 
         | Oh yeah, we're well aware of how much of a "pain" getting
         | Willow going can be. I don't like it (at all).
         | 
         | That said, you configure and build once for your environment
         | and then get a .bin that can be flashed to the ESP BOX with
         | anything that does ESP flashing (like various web interfaces,
         | etc) or you can re-run the flash command across X devices. So
         | even now, in this early stage, it's at least only painful once
         | ;).
         | 
          | Down the road we want to have a Willow Home Assistant
          | component that does everything inside of the HA dashboard,
          | so users can point-click-configure-flash entirely from the
          | HA dashboard (like esphome, and maybe even using esphome).
          | Not to mention ongoing dynamic configuration, over-the-air
          | updates, etc.
         | 
         | I talk about all of this on our wiki[0].
         | 
         | [0] - https://github.com/toverainc/willow/wiki/Home-Assistant
        
           | freedomben wrote:
           | IMHO better to release early like this to a group of hackers
           | than to wait until you have a nice out of the box setup
           | going. This way you're going to get a lot of great feedback
           | and hopefully some help. Awesome project!
        
             | kkielhofner wrote:
             | Bingo, thanks!
        
       | balloob wrote:
       | Some feedback to make your project easier to install and
       | integrate better with Home Assistant (I'm the founder):
       | 
       | Home Assistant is building a voice assistant as part of our Year
        | of the Voice theme:
        | https://www.home-assistant.io/blog/2023/04/27/year-of-the-vo...
       | 
       | As part of our recent chapter 2 milestone, we introduced new
       | Assist Pipelines. This allows users to configure multiple voice
       | assistants. Your project is using the old "conversation" API.
       | Instead it should use our new assist pipelines API. Docs:
       | https://developers.home-assistant.io/docs/voice/pipelines/
       | 
       | You can even off-load the STT and TTS fully to Home Assistant and
       | only focus on wake words.
       | 
       | You will see a lot higher adoption rate if users can just buy the
       | ESP BOX and install the software on it without
        | installing/compiling stuff. That's exactly why we created ESP
        | Web Tools. It allows projects to offer browser-based
        | installation directly from their website:
        | https://esphome.github.io/esp-web-tools/
       | 
       | If you're going the ESP Web Tools route (and you should!), we've
       | also created Improv Wi-Fi, a small protocol to configure Wi-Fi on
       | the ESP device. This will allow ESP Web Tools to offer an
       | onboarding wizard in the browser once the software has been
       | installed. More info at https://www.improv-wifi.com/
       | 
       | Good luck!
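For anyone comparing the two APIs: per the pipelines docs linked above, a run is started over Home Assistant's WebSocket API with a message along these lines. This is a sketch based on those docs (field values here are illustrative); a text-only run, intent stage to intent stage, would let a device like Willow keep doing its own STT:

```python
import json

def pipeline_run_message(msg_id: int, text: str) -> str:
    """JSON message to start an Assist pipeline run over the HA WebSocket
    API. Field names follow the pipelines documentation; running from the
    intent stage to the intent stage skips the STT/TTS stages."""
    return json.dumps({
        "id": msg_id,
        "type": "assist_pipeline/run",
        "start_stage": "intent",
        "end_stage": "intent",
        "input": {"text": text},
    })
```

The client would send this over an authenticated WebSocket connection and receive pipeline events back on the same socket.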
        
         | kkielhofner wrote:
         | Hey there!
         | 
         | First of all, everyone involved in this project has been big
         | fans and users of HA for many years (in my case at least a
         | decade). THANK YOU! For now Willow wouldn't do anything other
         | than light up a display and sit there without Home Assistant.
         | 
         | We will support the pipelines API and make it a configuration
         | option (eventually default). HA has very rapid release cycles
         | and as you note this is very new. At least for the time being
         | we like the option of people being able to point Willow at
         | older installs and have it "do something" today without
         | requiring an HA upgrade that may or may not include breaking
         | changes - hence the conversation API.
         | 
         | One of our devs is a contributor for esphome and we're heading
         | somewhere in that direction, and he's a big fan of improv :).
         | 
         | We have plans for a Willow HA component and we'd love to run
         | some ideas past the team. Conceptually, in my mind, we'll get
         | to:
         | 
         | - Flashing and initial configuration from HA like esphome
         | (possibly using esphome, but the Espressif ADF/SR/LCD/etc
         | frameworks appear to be quite a ways out for esphome).
         | 
         | - Configuration for all Willow parameters from wifi to local
         | speech commands in the HA dashboard, with dynamic and automatic
         | updates for everything including local speech commands.
         | 
         | - OTA update support.
         | 
         | - TTS and STT components for our inference server
         | implementation. These will (essentially) be very thin proxies
         | for Willow but also enable use of TTS and STT functionality
         | throughout HA.
         | 
         | - Various latency improvements. As the somewhat hasty and lame
         | demo video illustrates[0] we're already "faster" than Alexa
         | while maintaining Alexa competitive wake word, voice activity
         | detection, noise suppression, far-field speech quality,
         | accuracy, etc. With local command recognition on the Willow
         | device and my HA install using Wemo switches (completely local)
         | it's almost "you can't really believe it" fast and accurate.
         | 
         | I should be absolutely clear on something for all - our goal is
         | to be the best hardware voice interface in the world (open
         | source or otherwise) that happens to work very well with Home
         | Assistant. Our goal is not to be a Home Assistant Voice
         | Assistant. I hope that distinction makes at least a little
         | sense.
         | 
         | You and the team are doing incredible work on that goal and
         | while there is certainly some overlap we intend to maintain
         | broad usability and compatibility with just about any platform
         | (home automation, open source, closed source, commercial,
         | whatever) someone may want to use Willow with.
         | 
         | In fact, our "monetization strategy" (to the extent we have
         | one) is based on the various commercial opportunities I've been
         | approached with over the years. Turns out no one wants to see
         | an Amazon Echo in a doctor's office but healthcare is excited
         | about voice (as one example) :).
         | 
          | Essentially, Home Assistant support in Willow will be one of
          | many integration modules, with Willow using as many
          | bog-standard, lowest-common-denominator protocols and
          | transports as possible (so long as they don't compromise our
          | goals) while maintaining broad compatibility with just about
          | any integration someone wants to use with Willow.
         | 
         | This is the very early initial release of Willow. We're happy
         | for "end-users" to use it but we don't see the one-time
         | configuration and build step being a huge blocker for our
         | current target user - more technical early adopters who can
         | stand a little pain ;).
         | 
         | [0] - https://www.youtube.com/watch?v=8ETQaLfoImc
        
         | daredoes wrote:
         | Thanks for all your work!
        
       | macrolime wrote:
       | Nice. Siri is completely unusable with an accuracy of less than
       | 10%. I'm guessing Whisper on CPU is probably the same, so I
       | wouldn't risk wasting time on trying the inference server if it
       | only runs on CPU, but once that runs on GPU it would be cool to
       | try this out.
        
         | kkielhofner wrote:
          | GPU (currently CUDA only) is our primary target for our
          | inference server implementation. It "runs" on CPU, but our
          | goal is to enable an ecosystem that is competitive with
          | Alexa in every possible way, and even with the amazing work
          | of whisper.cpp and other efforts that's just not happening
          | on CPU (yet).
         | 
         | We're aware that's controversial and not really applicable to
         | many home users - that's why we want to support any TTS/STT
         | engine on any hardware supported by Home Assistant (or
         | elsewhere) in addition to ESP BOX on device local command
         | recognition.
         | 
         | But for the people such as yourself, and other
         | commercial/power/whatever users our inference server that we're
         | releasing next week that works with Willow provides impressive
         | results - on anything from a GTX 1060 to an H100 (we've tested
         | and optimized for anything in between the two).
         | 
         | We use ctranslate2 (like faster-whisper) and some other
         | optimizations for performance improvements and conservative
         | VRAM usage. We can simultaneously load large-v2, medium, and
         | base on a GTX 1060 3GB and handle requests without issue.
         | 
          | Again, it's controversial, but the fact remains that a $100
          | Tesla P4 from eBay (idles at 5 watts, 60 watt max TDP)
          | running our inference server implementation does the
          | following:
         | 
         | large-v2, beam 5 - 3.8s of speech, inference time 1.1s
         | 
         | medium, beam 1 (suitable for Willow tasks) - 3.8s of speech,
         | inference time 588ms
         | 
         | medium, beam 1 (suitable for Willow tasks), 29.2s of speech,
         | inference time 1.6s
         | 
         | An RTX 4090 with large-v2, beam 5 does 3.8s of speech in 140ms
         | and 29.2s of speech with medium beam 1 (greedy) in 84ms.
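One way to read the numbers above is as a real-time factor (inference time divided by speech duration; below 1.0 is faster than real time):

```python
# Benchmarks quoted above: (label, speech seconds, inference seconds)
BENCHMARKS = [
    ("P4, large-v2, beam 5", 3.8, 1.1),
    ("P4, medium, beam 1", 3.8, 0.588),
    ("P4, medium, beam 1 (long)", 29.2, 1.6),
    ("RTX 4090, large-v2, beam 5", 3.8, 0.140),
    ("RTX 4090, medium, beam 1 (long)", 29.2, 0.084),
]

def real_time_factor(speech_s: float, inference_s: float) -> float:
    """RTF < 1.0 means transcription finishes before the audio would."""
    return inference_s / speech_s

for label, speech_s, inference_s in BENCHMARKS:
    print(f"{label}: RTF {real_time_factor(speech_s, inference_s):.3f}")
```

Even the P4's worst case (large-v2, beam 5) comes in around an RTF of 0.29, i.e. roughly 3.5x faster than real time.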
        
           | macrolime wrote:
           | You've convinced me. Just ordered an ESP-BOX :p
           | 
           | Got a Home Assistant Yellow not long ago, so would be nice to
           | get some decent voice control for it.
        
         | CharlesW wrote:
         | > _Siri is completely unusable with an accuracy of less than
         | 10%._
         | 
         | That seems unusual. I've been using both for the last few weeks
         | while replacing my Homebridge setup, and Siri has been as
         | accurate as Alexa -- good enough that I've decided that I can
         | now leave the Alexa ecosystem. To be more specific, both are
         | (conservatively) 95%+ accurate for my home control scenarios.
        
           | macrolime wrote:
           | I've never tried any voice recognition system that works
           | well. Maybe my accent is too different from typical training
           | data or something. I had a voice recognition program on my
           | computer in 1994 that had about the same accuracy for me as
           | any modern voice recognition system that I have tried.
        
       | api wrote:
       | I love seeing lots of practical refutations of the "we have to do
       | the voice processing in the cloud for performance" rationales
       | peddled by the various home 1984 surveillance box vendors.
       | 
       | It's actually faster to do it locally. They want it tethered to
       | the cloud for surveillance.
        
         | kkielhofner wrote:
         | We can do either.
         | 
         | For "basic" command recognition the ESP SR (speech recognition)
         | library supports up to 400 defined speech commands that run
         | completely on the device. For most people this is plenty to
         | control devices around the home, etc. Because it is all local
         | it's extremely fast - as I said in another comment pushing "Did
         | that really just happen?" fast.
         | 
         | However, for cases where someone wants to be able to throw any
         | kind of random speech at it "Hey Willow what is the weather in
         | Sofia, Bulgaria?" that's probably beyond the fundamental
         | capabilities of a device with enclosure, display, mics, etc
         | that sells for $50.
         | 
         | That's why we plan to support any of the STT/TTS modules
         | provided by Home Assistant to run on local Raspberry Pis or
         | wherever they host HA. Additionally, we're open sourcing our
         | extremely fast highly optimized Whisper/LLM/TTS inference
         | server next week so people can self host that wherever they
         | want.
        
           | java_beyb wrote:
           | first, good initiative! thanks for sharing. i think you gotta
           | be more diligent and careful with the problem statement.
           | 
            | checking the weather in Sofia, Bulgaria requires current
            | information from the cloud. it's not "random speech". ESP SR
            | capability issues don't mean that you cannot process it
            | locally.
           | 
           | the comment was on "voice processing" i.e. sending speech to
           | the cloud, not sending a call request to get the weather
           | information.
           | 
            | besides local intent detection beyond 400 commands, there
            | are great local STT options, working better than most cloud
            | STTs for "random speech":
           | 
           | https://github.com/alphacep/vosk-api
           | https://picovoice.ai/platform/cheetah/
        
             | kkielhofner wrote:
             | Thanks!
             | 
             | There are at least two things here:
             | 
             | 1) The ability to do speech to text on random speech. I'm
             | going to stick by that description :). If you've ever
             | watched a little kid play with Alexa it's definitely what
             | you would call "random speech" haha!
             | 
             | 2) The ability to satisfy the request (intent) of the text
             | output. Up to and including current information via API,
             | etc.
             | 
             | Our soon to be released highly optimized open source
             | inference server uses Whisper and is ridiculously fast and
             | accurate. Based on our testing with nieces and nephews we
             | have "random speech" covered :). Our inference server also
             | supports LLaMA, Vicuna, etc and can chain together STT ->
             | LLM/API/etc -> TTS - with the output simply played over the
             | Willow speaker and/or displayed on the LCD.
             | 
             | Our goal is to make a Willow Home Assistant component that
             | assists with #2. There are plenty of HA integrations and
             | components to do things like get weather in real time, in
             | addition to satisfying user intent recognition. They have
             | an entire platform for it[0]. Additionally, we will make
             | our inference server implementation (that does truly unique
             | things for Willow) available as just another TTS/STT
             | integration option on top of the implementations they
             | already support so you can use whatever you want, or send
             | the audio output after wake to whatever you want like Vosk,
             | Cheetah, etc, etc.
             | 
              | [0] - https://developers.home-assistant.io/docs/intent_index/
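Since Willow's wire format is just speech text POSTed to a configured URI, a custom receiver for #2 can be tiny. A minimal sketch using only the Python standard library (the plain-text payload here is an assumption for illustration; adapt it to whatever Willow actually sends):

```python
# Minimal receiver for Willow-style "speech text via HTTP POST" output.
# Assumption: the body is plain UTF-8 text; adjust if Willow sends JSON.
from http.server import BaseHTTPRequestHandler, HTTPServer

class SpeechHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        # Route the recognized text anywhere: Home Assistant, an LLM, etc.
        print(f"heard: {text}")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

def serve(port: int = 8080) -> HTTPServer:
    # Returns the server; call .serve_forever() (or run it in a thread).
    return HTTPServer(("0.0.0.0", port), SpeechHandler)
```

Point Willow's configured URI at this host and port and every utterance arrives as a POST body.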
        
       | FloatArtifact wrote:
       | Is there a kit to be purchased?
        
         | kkielhofner wrote:
         | That's the best part - the ESP BOX is not a kit. You take it
         | out of the box, flash it, and put it on your kitchen counter or
         | wherever you want.
         | 
         | The only challenge at this point has been interest in Willow.
         | I've checked stock on ESP BOXes across various vendors and they
         | are selling out.
         | 
         | However - Espressif has tremendous manufacturing capacity. From
         | a review of the bill of materials for the ESP BOX as far as I
         | can tell they don't make a lot of them because they haven't
         | sold a lot of them. Until Willow, that is :).
         | 
         | We anticipate Espressif will ramp up manufacturing and crank
          | these things out like hot cakes. No one wants another Raspberry
         | Pi supply chain issue!
        
           | FloatArtifact wrote:
            | Well that's nice to know. Thanks for taking time to respond.
            | In the interest of repairability, is there a parts list? Will
            | replacement parts be available to the end user?
        
       | Vinnl wrote:
       | It looks great, but also like it might need to mature a bit
       | before it's usable for less advanced users like myself. Do you
       | have an RSS feed or newsletter I can subscribe to so I'm reminded
       | some time from now to check it out again?
        
         | kkielhofner wrote:
         | Thanks!
         | 
         | You're spot on - we're happy for anyone to test and use Willow
         | but the intended users at this moment are early adopters that
         | can build and flash as development is moving very, very
         | quickly.
         | 
          | Unfortunately we do not - maybe try "watching" the repo on
          | GitHub?
        
       | canadiantim wrote:
       | Wow this looks beyond epic. I've been looking for something like
       | this.
       | 
       | Going to try to hack this into something my mom can use (who has
       | trouble with confusion and memory). Could potentially be very
       | great.
       | 
       | Thank you
        
         | kkielhofner wrote:
         | Thanks!
         | 
         | We are really, truly, and seriously committed to building a
         | device that with support from Home Assistant and other
         | integrations doesn't leave any reason whatsoever to buy an Echo
         | or similar creepy commercial device. No compromises on cost,
         | performance, accuracy, speed, usability, functionality, etc.
         | 
         | We're really looking forward to getting additional testing and
         | feedback from the community on speech recognition results,
         | other integrations, etc. It's just two of us working on this
         | part time over the last month or so - this is VERY early but I
         | think we're off to a good start!
        
           | canadiantim wrote:
           | Wow yeah I think you're really onto something here. No one
           | actually wants the creepiness from Echo or Alexa etc. That's
           | what prevented me from trying any Home Assistant thing
           | before, but I know it could be very useful if actually
           | sensitive to privacy-concerns.
           | 
           | Best of luck with the development! I'll definitely be
           | following closely. Do you sell the pre-built hardware
           | yourself?
        
             | kkielhofner wrote:
             | Thanks!
             | 
             | When you're releasing a pet project of love like this you
             | never really know if other people are going to appreciate
             | it as much as you do. Looking here on HN it seems like
             | people appreciate it.
             | 
             | We don't sell the hardware currently because:
             | 
             | 1) Espressif has well established sales channels and
             | distribution worldwide.
             | 
             | 2) It's not our "business model". In my capacity as advisor
             | to a few startups in the space I've been approached by
             | various commercial entities that want a hardware voice
             | interface they fully control. In healthcare, for example,
             | there are all kinds of interesting audio and speech
             | applications but NO ONE, and I mean NO ONE is going to be
             | ok with seeing an Echo in their doctor's office. That's
             | where an ESP BOX or custom manufactured hardware and Willow
             | come in.
             | 
             | Our business model is to combine our soon to be released
             | very high performance inference and API server with Willow
             | to support these commercial applications (and home users
             | with HA, of course). In all but a few identified and very
             | limited cases all work will come back to the open source
             | projects like our inference server and Willow.
        
       | dingledork69 wrote:
       | Cool! What software is used for the wake word detection, speech
       | to text and text to speech?
        
         | kkielhofner wrote:
         | For wake word and voice activity detection, audio processing,
         | etc we use the ESP SR (speech recognition) framework from
         | Espressif[0].
         | 
         | For speech to text there are two options and more to come:
         | 
         | 1) Completely on device command recognition using the ESP SR
         | Multinet 6 model. Willow will (currently) pull your light and
         | switch entities from Home Assistant and generate the grammar
         | and command definition required by Multinet. We want to develop
         | a Willow Home Assistant component that will provide tighter
         | Willow integration with HA and allow users to do this point and
         | click with dynamic updates for new/changed entities, different
         | kinds of entities, etc all in the HA dashboard/config.
         | 
         | The only "issue" with Multinet is that it only supports 400
         | defined commands. You're not going to get something like
         | "What's the weather like in $CITY?" out of it.
         | 
         | For that we have:
         | 
          | 2-?) Our own highly optimized inference server using Whisper,
          | LLaMA/Vicuna, and SpeechT5 from transformers (more to come
         | soon). We're open sourcing it next week. Willow streams audio
         | after wake in realtime, gets the STT output, and sends it
         | wherever you want. With the Willow Home Assistant component
         | (doesn't exist yet) it will sit in between our inference server
         | implementation doing STT/TTS or any other STT/TTS
         | implementation supported by Home Assistant and handle all of
         | this for you.
         | 
         | [0] - https://github.com/espressif/esp-sr
        
           | dingledork69 wrote:
           | Thanks so much for doing this, that sounds very exciting and
           | I can't wait to try it out!
        
       | mdrzn wrote:
       | Very interesting, I would buy an "off the shelf" version if it
       | worked out of the box with Vicuna13 or similar LLM.
        
         | kkielhofner wrote:
         | Our inference server (open source - releasing next week) has
         | support for loading LLaMA and derivative models complete with
         | 4-bit quantization, etc. I like Vicuna 13B myself :). Not to
         | mention extremely fast and memory optimized Whisper via
         | ctranslate2 and a bunch of our own tweaks.
         | 
         | Our inference server also supports long-lived sessions via
         | WebRTC for transcription, etc applications ;).
         | 
         | You can chain speech to text -> LLM -> text to speech
         | completely in the inference server and input/output through
         | Willow, along with other APIs or whatever you want.
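The chain itself is plain function composition. A stub sketch of the idea (all three functions stand in for the real models - Whisper, a LLaMA-family model, SpeechT5 - and the names are illustrative, not the inference server's actual API):

```python
# Illustrative STT -> LLM -> TTS chain. Each stub stands in for a real
# model in the inference server; only the composition is the point.
def speech_to_text(audio: bytes) -> str:
    return "what's the weather outside"    # Whisper would go here

def llm_respond(prompt: str) -> str:
    return f"answer to: {prompt}"          # LLaMA/Vicuna would go here

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")            # SpeechT5 would go here

def pipeline(audio: bytes) -> bytes:
    # Output audio is played over the Willow speaker / shown on the LCD.
    return text_to_speech(llm_respond(speech_to_text(audio)))
```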
        
           | vlugorilla wrote:
            | Awesome work! May I ask what you're using for text-to-speech?
        
             | kkielhofner wrote:
             | Thanks, of course!
             | 
             | For wake word and voice activity detection, audio
             | processing, etc we use the ESP SR (speech recognition)
             | framework from Espressif[0]. For speech to text there are
             | two options and more to come:
             | 
             | 1) Completely on device command recognition using the ESP
             | SR Multinet 6 model. Willow will (currently) pull your
             | light and switch entities from Home Assistant and generate
             | the grammar and command definition required by Multinet. We
             | want to develop a Willow Home Assistant component that will
             | provide tighter Willow integration with HA and allow users
             | to do this point and click with dynamic updates for
             | new/changed entities, different kinds of entities, etc all
             | in the HA dashboard/config.
             | 
             | The only "issue" with Multinet is that it only supports 400
             | defined commands. You're not going to get something like
             | "What's the weather like in $CITY?" out of it.
             | 
             | For that we have:
             | 
              | 2-?) Our own highly optimized inference server using
              | Whisper, LLaMA/Vicuna, and SpeechT5 from transformers (more
             | to come soon). We're open sourcing it next week. Willow
             | streams audio after wake in realtime, gets the STT output,
             | and sends it wherever you want. With the Willow Home
             | Assistant component (doesn't exist yet) it will sit in
             | between our inference server implementation doing STT/TTS
             | or any other STT/TTS implementation supported by Home
             | Assistant and handle all of this for you - including
             | chaining together other HA components, APIs, etc.
             | 
             | [0] - https://github.com/espressif/esp-sr
        
       | tikkun wrote:
       | What are the biggest challenges that you see for improving it
       | even further?
       | 
       | Looks really promising!
        
         | kkielhofner wrote:
         | Thanks!
         | 
         | If I'm being perfectly honest I'm surprised we got it this far
         | already. If I wanted to be really critical:
         | 
         | - Far-field speech is actually kind of hard. There are at least
         | dozens of "knobs" we can tweak between the various component
         | libraries, etc to improve speech quality and reliability for
         | more users in more environments. We've tested as much as we can
         | considering there's only two of us but we need more testing
         | from more speakers in more environments.
         | 
         | - On the wire/protocol stuff. We're doing pretty rudimentary
         | "open new connection, stream voice, POST somewhere". This adds
         | extra latency and CPU usage because of repeated TLS handshakes,
         | etc. We have plans to use Websockets and what-not to cut down
         | on this.
         | 
         | - We don't really support audio playback yet. For a real
         | "Amazon Echo" type experience you need to be able to ask it
         | random things like "Hey what's the weather outside?" and it
         | needs to "tell" you.
         | 
         | - Ecosystem support. Using the example above, something like
         | Home Assistant or similar needs to know where you are, get the
         | weather, do text to speech, etc for Willow to be able to play
         | it back.
         | 
         | - Other integrations. Alexa has "skills" and stuff and we need
         | to be able to talk to more things.
         | 
         | - UI/UX work. We support the touch display but we did just
         | enough to show colors, print status, add a button, and make a
         | touch cursor that follows your finger around. We also only give
         | audio feedback with a kind-of annoying tone that beeps once for
         | success and twice for failure.
         | 
         | - Speaking of failure, we don't do a great job of telling you
         | what went wrong and where.
         | 
         | - Configuration and flashing. It's very static and has multiple
         | steps. There are all kinds of things that need to get done to
         | make Willow easy enough for less-technical users to deploy and
         | actually use daily without any hassle.
         | 
         | - Local command recognition. It's very early but as noted in
         | the README, wiki, etc the ESP BOX itself can recognize up to
         | 400 commands directly on the device. In testing it works
         | surprisingly well but we have a lot of work to do to make it
         | actually practical for most people.
         | 
         | - Open sourcing our inference server. We plan to do this next
         | week!
        
           | michaelmior wrote:
           | > the ESP BOX itself can recognize up to 400 commands
           | directly on the device.
           | 
           | That's really cool! Does this mean 400 specific commands,
           | e.g. "turn on the living room lights" or 400 commands that
           | can be applied to different targets, e.g. "turn on the X
           | lights" where X is some light. (400 actually feels like it
           | would be enough to speed up the vast majority of interactions
           | either way, but I'm curious :)
        
             | kkielhofner wrote:
             | 400 commands where "turn on X" is one and "turn off X" is
             | two.
             | 
             | With Home Assistant this means turning on and off two
             | hundred entities. We currently pull light and switch
             | entities from Home Assistant and build the local Multinet
             | speech grammar.
             | 
             | We have goals for better dynamic and adaptive configuration
             | of Willow and part of that is using a Willow Home Assistant
              | component with user configuration in the HA dashboard, etc
             | to easily select entities, define commands, etc and
             | dynamically update all associated Willow devices.
             | 
             | We feel that with this 400 commands is enough to be
             | practical and useful. Additionally, because the Multinet
             | model returns probability on match to command "fuzzy
             | matching" actually works quite well where "light",
             | "lights", and slightly mis-worded commands still match
             | correctly.
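Mechanically, that grammar generation can be sketched as: take entity names from Home Assistant, emit an "on" and an "off" command per entity, and stop at Multinet's 400-command ceiling (hence two hundred entities). A rough illustration, not Willow's actual code:

```python
# Sketch of building a Multinet-style command grammar from Home Assistant
# entity names, per the description above. Names and shapes are
# illustrative assumptions.
def build_commands(entity_names, limit=400):
    commands = []
    for name in entity_names:
        if len(commands) + 2 > limit:
            break  # Multinet supports at most `limit` defined commands
        commands.append(f"turn on {name}")
        commands.append(f"turn off {name}")
    return commands
```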
        
           | haarts wrote:
           | With regard to this:
           | 
           | > - On the wire/protocol stuff. We're doing pretty
           | rudimentary "open new connection, stream voice, POST
           | somewhere". This adds extra latency and CPU usage because of
           | repeated TLS handshakes, etc. We have plans to use Websockets
           | and what-not to cut down on this.
           | 
           | I've recently used the Noise protocol[1] to do some encrypted
           | communication between two services I control but separated by
           | the internet.
           | 
           | It was surprisingly easy!
           | 
           | [1]: https://noiseprotocol.org/
        
             | kkielhofner wrote:
             | Thanks for mentioning noise! I've certainly looked at it
             | before but our challenge is the sheer scope of what we're
             | doing. Not to mention (similar to WebRTC that people have
             | asked about) I'm not completely understanding the fit and
             | benefit for our use case and application.
             | 
             | I talk about websockets because they achieve our mission
             | and goal (in this case shaving milliseconds off command ->
             | action -> confirmation) with robust, battle-tested client
             | implementations already available in the ESP framework
             | libraries. Same thing for MQTT. Both are supported by Home
             | Assistant (and almost everything else in the space) today.
             | 
             | Because of this existing framework support, we'll have
             | websockets done today-ish. Then we can (for now) move on to
             | all of the other things people have asked for :). Hah,
             | priorities!
             | 
             | Not saying Noise won't/can't ever happen - just that this
             | is a very ambitious project as it stands and we have plenty
             | of work to do all over the place :)!
             | 
             | Want to write a noise implementation for ESP IDF :)?
        
           | michaelmior wrote:
           | > Open sourcing our inference server
           | 
           | I'm curious if this is something lightweight enough that
           | might be possible to run as a Home Assistant add-on on
           | relatively low-powered hardware such as an RPi.
        
             | kkielhofner wrote:
             | I talk about this a bit on the wiki[0] but our goal is to
             | have a Willow Home Assistant component do the Willow
             | specific stuff and enable users to use any of the STT/TTS
             | modules provided by Home Assistant.
             | 
             | We'll also (likely) be creating our own TTS/STT HA
             | component for our inference server that does some
             | special/unique things to support Willow.
             | 
              | [0] - https://github.com/toverainc/willow/wiki/Home-Assistant
        
           | tikkun wrote:
           | Nice! How's the speech recognition accuracy and response
           | latency?
        
             | kkielhofner wrote:
             | Thanks!
             | 
             | Faster than Alexa (and only going to get faster)[0].
             | 
             | Between the far-field speech optimizations provided by the
             | ESP BOX and Espressif frameworks and our inference server
             | (open sourcing next week) using Whisper, and our unique
             | streaming format we've found it to be comparable in terms
             | of quality to Alexa/Echo even with background noise and at
             | distances of up to 30 feet.
             | 
             | [0] - https://www.youtube.com/watch?v=8ETQaLfoImc
        
               | tikkun wrote:
               | That's really nice - and thanks for including the demo
               | link too, impressive!
        
               | kkielhofner wrote:
               | Thanks again!
               | 
               | Not only are we working on improving performance with the
               | inference server, local on device command recognition is
               | extremely fast. Like "did that really just happen?" fast.
               | 
               | In my local setup when using locally-controlled Wemo
               | switches I swear the latency with local devices is around
               | 300ms or so.
               | 
               | I should make another demo video with that...
        
       | kristofferR wrote:
       | I'm really looking forward to adding something like this to my
       | home.
       | 
       | Are there any similar devices, without a screen, available?
       | 
       | Preferably with a nice design, so I don't have to hide them?
        
         | kkielhofner wrote:
         | The only real hardware requirement of Willow (currently) is
         | something based on the ESP32 S3 that has a microphone.
         | 
         | I'm aware of various such devices out there but we've been
         | really focusing on the ESP BOX. If you or anyone else in the
         | community is interested in other devices we'll certainly look
         | at supporting them!
        
       | depingus wrote:
       | So I was just looking at the installation process for this
        | device's dev environment (ESP-IDF from Espressif) and it seems
       | kind of...insane.
       | 
       | The manual install method in the directions is not manual at all.
       | It's a script that calls several python scripts. One has 2660 LOC
       | and installs a root certificate (hard coded in the script itself)
       | because of course, even though you just cloned the whole repo, it
       | still has to download stuff from the internet. According to the
       | code, "This works around the issue with outdated certificate
       | stores in some installations".
       | 
        | Does anyone familiar with Espressif have an actual manual method
        | of installing a dev environment for this device that doesn't
        | involve pwning myself?
        
         | detaro wrote:
         | yes, do it in a container or VM. Welcome to the wonderful world
         | of hardware manufacturer SDKs.
        
           | stintel wrote:
           | Indeed. This is exactly the reason why we standardized on
           | building in a container.
        
         | yelite wrote:
         | If you are open to Nix, you can try
         | https://github.com/mirrexagon/nixpkgs-esp-dev. I used it for a
         | small project a while ago and the experience was pretty good.
        
           | kkielhofner wrote:
           | Nice!
           | 
           | For anyone who would try to use this with Willow (I like the
           | effort and CERTAINLY don't love the ESP dev environment as-
           | is):
           | 
           | - ESP ADF is actually the root of the project. ESP-IDF and
           | all other sub-components are components themselves to ADF.
           | 
           | - We use bleeding edge ESP SR[0] that we also include as an
           | ADF component.
           | 
           | - Plus LVGL, ESP-DSP, esp-lcd-touch, and likely others I'm
           | forgetting ATM.
           | 
           | [0] - https://github.com/espressif/esp-sr
        
       | yoavm wrote:
       | This is wonderful. I would love to replace my stupid Google Home
       | Minis with this if I can actually get the hardware for $50. The
       | Mycroft device is like $400 so I didn't even consider it, and I
       | never understood why it had to be so expensive. I don't even need
       | a screen - just a microphone. Will definitely give this a shot!
        
         | kkielhofner wrote:
         | Thank you!
         | 
         | Yes, this is why we went through the pain of doing what we're
         | able to do with this hardware.
         | 
         | Even in this initial release it's competitive with Echo, etc
         | even on cost.
        
       | COGlory wrote:
       | Ordered a box, can't wait to try this out! I've really been
       | looking for something like this. My dream would be to have an LLM
       | "agent" running locally, that knows who I am, etc, that can also
       | double as a smart assistant for HA.
        
       | streakfix wrote:
       | Can I replace google voice assistant on a pixel 7 with it? How
       | about a rooted pixel 7?
        
       | syntaxing wrote:
       | Anyway to tie this into homeassistant? Siri is such a POS that I
       | been debating for a while to replace it with something Whisper
       | based.
       | 
       | Edit: should have read the README more carefully....
        
       | mikece wrote:
       | I love the privacy-focused aspect but playing devil's advocate:
       | how could a device like this be hijacked and used for anti-
       | privacy purposes? Does this require physical access or has it
       | been subjected to the likes of the Black Hat conference to see if
       | it can be owned from the street outside someone's home?
        
         | kkielhofner wrote:
         | It's Wifi client only and supports WPA3, protected management
         | frames, etc, etc. It doesn't listen on any network sockets.
         | Even bluetooth is currently disabled.
         | 
         | Other than low level issues in the Espressif wifi stack (which
         | is very robust, mature, and has been beat on heavily) I don't
         | see any potential security issues.
         | 
         | That said the old expression "it's easy for someone to design a
         | lock they can't pick" certainly applies.
         | 
         | We'd welcome someone owning it and bringing any issues to our
         | attention!
        
       | edf13 wrote:
        | Sounds interesting - one question I have is about the mic
        | array... Isn't this one of the supposed benefits of a physical
        | Alexa device, which is rumored to be sold at a loss because of
        | the hardware quality?
       | 
        | How does the esp-box compare? E.g. in a noisy environment, tv in
       | the background, kids and dogs running around?
        
         | kkielhofner wrote:
         | The ESP BOX has an acoustically optimized enclosure with dual
          | microphones for noise cancellation, separation, etc.
         | 
          | Between that and the Espressif AFE (audio front end)
         | doing a bunch of DSP "stuff" in our testing it does remarkably
         | well in noisy environments and far-field (25-30 feet) use
         | cases.
         | 
         | Our inference server implementation (open source, releasing
         | next week) uses a highly performance optimized Whisper which
         | does famously well with less-than-ideal speech quality.
         | 
         | All in, even though it's all very early, it's very competitive
         | with Echo, etc.
        
           | stavros wrote:
           | Where can I get one? I can't find it on Ali :(
        
             | kkielhofner wrote:
             | You never know if people are going to love your pet project
             | as much as you do. We had a hunch the community would
             | appreciate Willow but like I said, you just never know.
             | 
             | My suspicion is Espressif (until now, hah) hasn't sold a
             | lot of ESP Boxes. We were concerned that if Willow takes
             | off they will sell out. That already appears to be
             | happening.
             | 
              | Espressif has tremendous manufacturing capacity and we hope
              | they will scale up ESP BOX production to meet demand now
              | that (with Willow) it exists. The only gating item for them
              | is probably the plastic enclosure, and they should be able
              | to figure out how to produce that en masse :).
        
               | stavros wrote:
               | I really hope so, I've been waiting for good audio
               | assistant hardware forever. I hope this is finally the
               | time where I ditch Alexa once and for all, thanks for
               | releasing Willow!
        
               | r1cht3r wrote:
               | fwiw I found them in stock on adafruit.com
        
               | stavros wrote:
               | Thanks, but I'm not in the US :( Good idea to check
               | Pimoroni, though, thanks!
               | 
               | EDIT: Found one with a direct link from Ali from
               | Espressif's site, even though it doesn't show up in a
               | search:
               | 
               | https://www.aliexpress.com/item/1005003980216150.html?spm
               | =a2...
        
               | nickthegreek wrote:
               | Thanks for the link. ~$66 with shipping, worth it to test
               | out this project though.
        
           | glenngillen wrote:
            | What's the latency of inference on a Raspberry Pi (I assume
            | it's not running directly on the device)? I think I read
            | previously that it was up to 7 secs, and if you wanted sub-
            | second you'd need an i5.
        
             | kkielhofner wrote:
              | Willow supports the Espressif ESP SR speech recognition
              | framework to do completely on-device speech recognition for
              | up to 400 commands. When configured, we pull light and
              | switch entities from Home Assistant and build the grammar
              | to turn them on and off. There's no reason it has to be
              | limited to that; we just need to do some extra work on
              | dynamic configuration and tighter integration with Home
              | Assistant to allow users to define up to 400 commands to
              | do whatever they want with their various entities.
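              | 
              | To illustrate (this is a hypothetical sketch, not Willow's
              | actual grammar code - the entity names and the "turn
              | on/turn off" phrasing are assumptions):

```python
# Hypothetical sketch: expand Home Assistant light/switch entities into
# a bounded on/off command grammar, as described above. Willow's real
# ESP SR grammar generation may differ.

def build_grammar(entities, max_commands=400):
    """Expand each entity into 'turn on <name>' / 'turn off <name>'."""
    commands = []
    for entity in entities:
        # Entity IDs look like 'light.kitchen' or 'switch.desk_lamp';
        # use the part after the dot as the spoken name.
        name = entity.split(".", 1)[1].replace("_", " ")
        commands.append(f"turn on {name}")
        commands.append(f"turn off {name}")
    # ESP SR recognizes a bounded command set, so cap the list.
    return commands[:max_commands]

print(build_grammar(["light.kitchen", "switch.desk_lamp"]))
# → ['turn on kitchen', 'turn off kitchen',
#    'turn on desk lamp', 'turn off desk lamp']
```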
             | 
              | With local command recognition, Willow can turn my Wemo
              | switches on and off, completely on device, in roughly
              | 300ms. That's not a typo. I'm going to make another demo
              | video showing that.
             | 
             | We also support live streaming of audio after wake to our
             | highly optimized Whisper inference server implementation
             | (open source, releasing next week). That's what our current
             | demo video uses[0]. It's really more intended for
             | pro/commercial applications as it supports CPU but really
             | flies with CUDA - where even on a GTX 1060 3GB you can do
             | 3-5 seconds of speech in ~500ms or so.
             | 
             | We also plan to have a Willow Home Assistant component to
             | support Willow "stuff" while enabling use of any of the
             | STT/TTS modules in Home Assistant (including another
             | component for our inference server you can self-host that
             | does special Willow stuff).
             | 
             | [0] - https://www.youtube.com/watch?v=8ETQaLfoImc
        
               | woodson wrote:
               | Have you considered K2/Sherpa for ASR instead of
               | ctranslate2/faster-whisper? It's much better suited for
               | streaming ASR (whisper transcribes 30 sec chunks, no
               | streaming). They're also working on adding context
               | biasing using Aho-Corasick automata, to handle dynamic
                | recognition of e.g. contact list entries or music library
               | titles (https://github.com/k2-fsa/icefall/pull/1038).
        
               | kkielhofner wrote:
                | Whisper, by design, processes audio in 30 second chunks
                | and doesn't support "streaming" by the strict definition.
               | 
                | You'll be able to see when we release our inference
                | server implementation next week that it's close enough
                | to "realtime" to fool nearly anyone, especially in an
                | application like this where you aren't consuming model
                | output in real time. You're streaming speech, buffering
                | on the server, waiting for the end of voice activity,
                | running Whisper, taking the transcription, and doing
                | something with it. Other than a cool demo I'm not really
                | sure what streaming ASR output provides, but that's
                | probably lack of imagination on my part :).
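                | 
                | As a concrete sketch of that flow (buffer, wait for end
                | of voice activity, then recognize) - note the energy-
                | threshold VAD and the transcribe() stub below are
                | illustrative assumptions, not our inference server:

```python
# Illustrative sketch of the buffer-until-end-of-speech flow described
# above. A real server would use a proper VAD and run Whisper; here the
# VAD is a naive peak-amplitude threshold and recognition is a stub.

SILENCE_THRESHOLD = 500    # peak amplitude below this counts as silence
END_OF_SPEECH_CHUNKS = 3   # consecutive silent chunks ending an utterance

def transcribe(samples):
    """Stub standing in for running Whisper on the buffered audio."""
    return f"<{len(samples)} samples transcribed>"

def process_stream(chunks):
    """Buffer incoming chunks of PCM samples until the VAD signals end
    of speech, then hand the whole utterance to the recognizer."""
    buffer, silent, heard_speech, results = [], 0, False, []
    for chunk in chunks:
        buffer.extend(chunk)
        if max(abs(s) for s in chunk) >= SILENCE_THRESHOLD:
            heard_speech, silent = True, 0
        else:
            silent += 1
        if heard_speech and silent >= END_OF_SPEECH_CHUNKS:
            results.append(transcribe(buffer))
            buffer, silent, heard_speech = [], 0, False
    return results
```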
               | 
               | That said, these are great pointers and we're certainly
               | not opposed to it! At the end of the day Willow does the
               | "work on the ground" of detecting wake word, getting
               | clean audio, and streaming the audio. Where it goes and
               | what happens then is up to you! There's no reason at all
               | we couldn't support streaming ASR output.
        
       | barbariangrunge wrote:
       | I never really considered getting a home assistant doodad because
       | of the privacy issues around them. This sounds like a cool
       | project
        
         | stintel wrote:
         | Thanks!
        
       ___________________________________________________________________
       (page generated 2023-05-15 23:00 UTC)