[HN Gopher] Show HN: Open-source macOS AI copilot (using vision ... ___________________________________________________________________ Show HN: Open-source macOS AI copilot (using vision and voice) Heeey! I built a macOS copilot that has been useful to me, so I open sourced it in case others would find it useful too. It's pretty simple:
- Use a keyboard shortcut to take a screenshot of your active macOS window and start recording the microphone.
- Speak your question, then press the keyboard shortcut again to send your question + screenshot off to OpenAI Vision.
- The Vision response is presented in-context/overlaid over the active window, and spoken to you as audio.
- The app keeps running in the background, only taking a screenshot/listening when activated by keyboard shortcut.
It's built with NodeJS/Electron, and uses OpenAI Whisper, Vision and TTS APIs under the hood (BYO API key). There's a simple demo and a longer walk-through in the GH readme https://github.com/elfvingralf/macOSpilot-ai-assistant, and I also posted a different demo on Twitter: https://twitter.com/ralfelfving/status/1732044723630805212 Author : ralfelfving Score : 333 points Date : 2023-12-12 13:17 UTC (9 hours ago) | pyryt wrote: | Do you have use case demo videos somewhere? Would be great to see | this in action | ralfelfving wrote: | There's one at 00:30 in this YouTube video (timestamped the | link): https://www.youtube.com/watch?v=1IdCWqTZLyA&t=32s | faceless3 wrote: | Wrote some similar scripts for my Linux setup, that I bind with | XFCE keyboard shortcuts: | | https://github.com/samoylenkodmitry/Linux-AI-Assistant-scrip... | |
| F1 - ask ChatGPT API about current clipboard content
| F5 - same, but opens editor before asking
| num+ - starts/stops recording microphone, then passes to Whisper (locally installed), copies to clipboard
| | I find myself rarely using them however. | ralfelfving wrote: | Nice!
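The flow described in the top post (screenshot + spoken question, Whisper for transcription, then Vision, then TTS) centers on one chat-completions request with the screenshot attached as a base64 data URL. Below is a minimal sketch of how that request body could be assembled in Node; the function name and `max_tokens` value are illustrative, not taken from the app's code:

```javascript
// Build a chat-completions request pairing the transcribed question
// with the window screenshot, following the public OpenAI vision
// payload shape. buildVisionPayload and max_tokens are illustrative.
function buildVisionPayload(question, screenshotBuffer) {
  return {
    model: "gpt-4-vision-preview",
    max_tokens: 500,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: question },
          {
            type: "image_url",
            image_url: {
              url:
                "data:image/png;base64," +
                screenshotBuffer.toString("base64"),
            },
          },
        ],
      },
    ],
  };
}
```

The reply's `choices[0].message.content` would then be handed off to the TTS API and the overlay window.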
| ProfessorZoom wrote: | e-e-e-electron... for this.. | ralfelfving wrote: | I don't know man. I'm new to development, it's what I chose, | probably don't know any better. Tell me what you would have | chosen instead? | xNeil wrote: | Electron's a really nice option, especially for people who | aren't interested in porting their apps or spending too much | time on development | | this is a macOS specific app it seems - if you want better | performance and more integration with the OS, I'd recommend | using Swift | ralfelfving wrote: | Time to learn Swift in the next project then! Thank | you for the deets. | Filligree wrote: | The good news is you already have a tool to help you with | inevitable Xcode issues. _grin_ | lolinder wrote: | Don't mind them--there's a certain subset of HN that is upset | that web tech has taken over the world. There are some | legitimate gripes about the performance of some Electron | apps, but with some people those have turned into compulsive | shallow dismissals of any web app that they believe could | have been native. | | There's nothing wrong with using web tech to build things! | It's often easier, the documentation is more comprehensive, | and if you ever wanted to make it cross-platform Electron | makes it trivial. | | If you were working for a company it might be worth | considering the trade-offs--do you need to support Macs with | less RAM?--but for a side project that's for yourself and | maybe some friends, just do what works for you! | ralfelfving wrote: | Thank you for the explanation! At the end of the day, I'm a | newbie and I'm in it to learn something new with each | project. Next time I'll probably try my hand at a different | framework. | millzlane wrote: | I just watched a video about building a startup. One of | the key points was to use what you know to get an MVP. | Don't fret over which language or library to use (unless | the goal is to learn a new framework). Just get building.
| I may not be a pro dev, but there is one thing I have | learned over the years from hanging out amongst all of | you. And that is, it doesn't matter if you are using emacs | or vim, tabs vs spaces, or Java vs Python, the end | product is what matters at the end of the day. | Code can always be refactored. | | Good luck in your development journey. | jdamon96 wrote: | ignore the naysayers; nice job building out your idea | ralfelfving wrote: | Thank you! I've got pretty thick skin, but there's always a bit of | insecurity involved in doing something the first time -- | first public GH repo and Show HN :D | airstrike wrote: | I think the parent comment is a shallow dismissal, but since | you're asking, I would have built it in SwiftUI | guytv wrote: | What's important is to get a product out there. Nobody cares | what stack you use, just us geeks. Don't get discouraged. You | did well :) | programmarchy wrote: | My two cents: I think you made a good, practical choice. If | you're happy with Electron, I'd say stick with it, especially | if you have cross-platform plans in the future. | | If you want to niche down into a more macOS specific app, you | could learn AppKit and SwiftUI and build a fully native macOS | app. | | If you want to stay cross-platform, but you're not happy with | Electron, then it might be worth checking out Tauri. It | provides a JavaScript-based API to display native UI | components, but without packaging a V8 runtime with your app | bundle. Instead, it uses a native JavaScript host, e.g. on | macOS it uses WebKit, so it significantly reduces the | download size of your app. | | In terms of developing this into a product, on one hand it | seems like deep integration with the host OS is the best way | to build a "moat", but then again, Apple could release their | own version and quickly blow a product like that out of the | water. | atraac wrote: | Ah yes, cause what's better than building a real, working MVP?
| Learning Rust for half a year just so you can 'optimize' the f | out of an app that does two REST calls. | wtallis wrote: | To be fair, this _does_ sound like the kind of app that would | benefit from being able to launch instantly, and potentially | registering with the OS as a service in a way that cross-platform | frameworks like Electron cannot easily accommodate. | But Rust would not be the easiest choice to avoid those | limitations. | havkom wrote: | A lot of negative comments here. However, I liked it! | | Perfect Show HN and a great start on a product, if the author | wants to pursue it. | ralfelfving wrote: | Thank you, it's my first GH project & Show HN.. and.. yeah.. | learning here :D | jonplackett wrote: | Also think this is fun. | | In general I'm pretty excited about LLM as interface and what | that is going to mean going forward. | | I think our kids are going to think mice and keyboards are | hilariously primitive. | ralfelfving wrote: | Before we know it, even voice might be obsolete when we can | just think :) But maybe at that point, even thinking | becomes obsolete because the AIs are doing all the | thinking for us?! | swiftcoder wrote: | Worth mentioning that if you are in a corporate environment, | running a service that sends arbitrary desktop screenshots to a | 3rd party cloud service is going to run afoul of pretty much | every security and regulatory control in existence | thelittleone wrote: | The control for that is endpoints should be locked down to | prevent installation of non-approved apps. Any org under regulatory | controls would have some variation of that. Safe to assume an | org's users are stupid or nefarious and build defences | accordingly. | ralfelfving wrote: | I assume that anyone capable of cloning the app, starting it | on their machine and obtaining + adding an OpenAI API key | understands that some data is being sent offsite -- and will be | aware of their corporate policies. I think that's a fair | assumption.
| greenie_beans wrote: | that's a fair assumption. feels like swiftcoder is just | trying to gotcha | brookst wrote: | True, but also true of other screen capture utilities that send | data to the cloud. Your PSA is true, but hardly unique to this | little utility. And probably not surprising to the intended | audience. | isoprophlex wrote: | You're telling me... the cloud... is other people's computers?! | abrichr wrote: | This is exactly why in https://github.com/OpenAdaptAI/OpenAdapt | we have implemented three separate PII scrubbing providers. | | Congrats to the OP on shipping! | jondwillis wrote: | You should add an option for streaming text as the response | instead of TTS. And also maybe text in place of the voice command | as well. I have been tire-kicking a similar kind of copilot for | a while, hit me up on discord @jonwilldoit | ralfelfving wrote: | There are definitely some improvements to be made to shuttling the data | between interface<->API, all that was done in a few hours on | day 1 and there's a few things I decided to fix later. | | I prefer speaking over typing, and I sit alone, so probably | won't add a text input anytime soon. But I'll hit you up on | Discord in a bit and share notes. | jondwillis wrote: | Yeah, just some features I could see adding value and not | being too hard to implement :) | tomComb wrote: | > text in place of the voice command as well | | That would be great for people with a Mac mini who don't have a | mic. | ralfelfving wrote: | Hmmm... what if I added functionality that uses the webcam to | read your lips? | | Just kidding. Text seems to be the most requested addition, | and it wasn't on my own list :) Will see if I add it, should | be fairly easy to make it configurable and render a text | input window with a button instead of triggering the | microphone. | | Won't make any promises, but might do it. | amelius wrote: | Please include "OpenAI-based" in the title. (Now many people here | are disappointed).
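Making the input mode configurable, as requested above, could be as small as one branch at the capture step. A sketch under assumed names (the mode flag and the injected `io` callbacks are hypothetical; the real app records the mic unconditionally):

```javascript
// Pick the question source based on a config flag: "voice" records the
// mic and transcribes it, "text" shows a typed-input window instead.
// The io callbacks are injected so the pipeline stays testable.
function captureQuestion(mode, io) {
  if (mode === "text") {
    return io.promptText(); // e.g. render a small text-input window
  }
  return io.transcribe(io.recordMic()); // e.g. mic -> Whisper API
}
```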
| ralfelfving wrote: | Fair point, didn't think it would matter so much. Can't edit it | any more, otherwise I'd change it to add OpenAI to the title! | ukuina wrote: | This is very cool! Thank you for working on it and sharing it | with us. | ralfelfving wrote: | Thank you for checking it out! <3 | netika wrote: | Such a shame it uses the Vision API, i.e. it cannot be replaced by | some random self-hosted LLM. | ralfelfving wrote: | It can be replaced with a self-hosted LLM, simply change the | code where the Vision API is being called. That's true for all | of the API calls in the app. | freedomben wrote: | Actually it's open source, so it _can_ be replaced by some | random self-hosted LLM | iandanforth wrote: | For example, one of these: | | https://opencompass.org.cn/leaderboard-multimodal | jackculpan wrote: | This is awesome | ralfelfving wrote: | Thanks, glad you liked it! | knowsuchagency wrote: | This is brilliant! | ralfelfving wrote: | Glad you liked it! | satchlj wrote: | It's not working for me, I get a "Too many requests" HTTP error | ralfelfving wrote: | Hmm.. OpenAI bunches a few different things into that error. IIRC | this could be because you're out of credits / don't have a valid | payment method on file, but it could also be that you're | hitting rate limits. The Vision API could be the culprit: while | in beta you can only call it X times per day (X | varies by account). | | Make the console.logs for the three API calls a bit more | verbose to find out which call is causing this, and if there's | more info in the error body. | I_am_tiberius wrote: | I would love to have something like this but using an open source | model and without any network requests. | trenchgun wrote: | Probably in three months, approximately. | dave1010uk wrote: | LLaVA, Whisper and a few bash scripts should be able to do it. | I don't know how helpful the model is with screenshots though. | | 1. Download LLaVA from https://github.com/Mozilla-Ocho/llamafile | | 2.
Run Whisper locally for speech to text | | 3. Save screenshots and send to the model, with a script like | https://til.dave.engineer/openai/gpt-4-vision/ | behat wrote: | Nice! Built something similar earlier to get fixes from ChatGPT | for error messages on screen. No voice input because I don't like | speaking. My approach then was Apple Computer Vision Kit for OCR | + ChatGPT. This reminds me to test out OpenAI's Vision API as a | replacement. | | Thanks for sharing! | ralfelfving wrote: | Thanks! You could probably grab what I have, and tweak it a | bit. Try checking if you can screenshot just the error message | and check what the value of window.owner is. It should be | the name of the application, so you could just append `Can you | help me with this error I get in ${window.owner}?` to the | Vision API call. | thomashop wrote: | Just used it with the digital audio workstation Ableton Live. It | is amazing! Its tips were spot-on. | | I can see how much time it will save me when I'm working with | software or a domain I don't know very well. | | Here is the video of my interaction: | https://www.youtube.com/watch?v=ikVdjom5t0E&feature=youtu.be | | Weird, these negative comments. Did people actually try it? | pelorat wrote: | I mean it does send a screenshot of your screen off to a 3rd | party, and that screenshot will most likely be used in future | AI training sets. | | So... beware when you use it. | thomashop wrote: | Beware of it seeing a screenshot of my music set? OpenAI will | start copying my song structure? | | You can turn it on and off. It's not necessary to turn it on when | editing confidential documents. | | You never enable screen-sharing in videoconferencing | software? | aaronscott wrote: | I completely agree. A huge business with a singular focus | isn't going to pivot into the music business (or any of the | myriad use cases the general public throws at it).
And if | they did use someone's info, it's more likely an unethical | employee than a genuine business tactic. | | Besides, the parent program uses the API, which allows | opting out of training or retaining that data. | mecsred wrote: | Yes this makes perfect sense. As we know, businesses | definitely do not treat data as a commodity and engage in | selling/buying data sets on the open market as a "genuine | business tactic". Therefore, since the company in | question doesn't have a clear business case for data | collection _currently_ , we can be sure this data will | never be used against our interests by any company. | zwily wrote: | OpenAI claims that data sent via the API (as opposed to | ChatGPT) will not be used in training. Whether or not you | believe them is a separate question, but that's the claim. | ralfelfving wrote: | I was so glad when I saw this, thanks for sharing! It was | exactly music production in Ableton that was the spark that lit this | idea in my head the other week. I tried to explain to a friend | who doesn't use GPT much that with Vision, you can speed up your | music production and learn how to use advanced tools like | Ableton more quickly. He didn't believe me. So I grabbed an | Ableton screenshot off Google and used ChatGPT -- then I felt | there had to be a better way, I realized that I have my own | use-cases, and it all evolved into this. | | I sent him your video, hopefully he'll believe me now :) | thomashop wrote: | You may be interested in two proofs of concept I've been | working on. I work with generative AI and music at a company. | | MidiJourney: ChatGPT integrated into Ableton Live to create | MIDI clips from prompts. https://github.com/korus-labs/MIDIjourney | | I have some work on a branch that makes ChatGPT a lot better | at generating symbolic music (a better prompt and music | notation). | | LayerMosaic lets you combine MusicGen text-to-music loops | with the music library of our company.
| https://layermosaic.pixelynx-ai.com/ | ralfelfving wrote: | Oooh. Yes, very interested in MusicGen. I played with | MusicGen for the first time the other week and created a | little script that uses GPT to create the prompt and params, | which are stored to a text file along with the output. Let | it loop for a few hours to get a few hundred output files that | allowed me to learn a bit more about what kinds of prompts | gave reasonable output (it was all bad, lol!) | ralfelfving wrote: | My brain read midjourney until I clicked on the GH link. | What a great name, MIDIjourney! | ralfelfving wrote: | Oh, LayerMosaic is dope. I'm not entirely sure how it works, | but the sounds coming out of it are good -- so you have me | intrigued! Can I read more about it somewhere? I might have | a crazy idea I'd like to use this for. | mikey_p wrote: | Is it just me or is it incredibly useless? | | "Here's a list of effects. Here's a list of things that make a | song. Is it good? Yes. What about my drum effects? Yes here's | the name of the two effects you are using on your drum channel" | | None of this is really helpful and I can't get over how much it | sounds like Eliza. | thomashop wrote: | I made that video right at the start, but since then I've | asked it, for example, what kind of compression parameters | would fit with a certain track, and it could explain to me how | to find an expert function which I would otherwise have had to consult | a manual for. | e28eta wrote: | Did you find that calling it "OSX" in the prompt worked better | than macOS? Or was that just an early choice that you didn't | spend much time on? | | I was skimming through the video you posted, and was curious. | | https://www.youtube.com/watch?v=1IdCWqTZLyA&t=32s | | code link: https://github.com/elfvingralf/macOSpilot-ai-assistant/blob/... | ralfelfving wrote: | No, this is an oversight by me. To be completely honest, up | until the other day I thought it was still called OSX.
So the | project was literally called cOSXpilot, but at some point I | double checked and realized it's been called macOS for many | years. Updated the project, but apparently not the code :) | | I suspect OSX vs macOS has marginal impact on the outcome :) | e28eta wrote: | Haha, makes perfect sense, thanks for the reply! | hot_gril wrote: | Heh. I remember calling it Mac OS back in the day and getting | corrected that it's actually OS X, as in "OS ten," and that it hasn't | been called Mac OS since Mac OS 9. Glad Apple finally saw it my | way (except it's cased macOS). | qainsights wrote: | Great. I created `kel` for terminal users. Please check it out at | https://github.com/qainsights/kel | causal wrote: | Chatblade is another good one: | https://github.com/npiv/chatblade | dave1010uk wrote: | Very cool! Have you had much luck with Llama models? | | I made Clipea, which is similar but has special integration | with zsh. | | https://github.com/dave1010/clipea | Jayakumark wrote: | Was following these two projects by some user on GitHub which | make similar things possible with local models. Sending | screenshots to OpenAI is expensive, if done every few seconds or | minutes. | | https://github.com/KoljaB/LocalAIVoiceChat | | While the below one uses OpenAI - don't see why it can't be | replaced with the above project and local mode. | | https://github.com/KoljaB/Linguflex | ralfelfving wrote: | Nice! Although the productivity increase from being able to | resolve blockers more quickly adds up to a lot (at least for | me), local models would be more cost effective -- and probably | feel less iffy for many people. | | I went for OpenAI because I wanted to build something quickly, | but you should be able to replace the external API calls with | calls to your internal models. | stephenblum wrote: | You made real-life Clippy! for the Mac. This would be great | for other Mac apps too. Add context of currently running apps.
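On the local-model point discussed above: many self-hosted servers expose an OpenAI-compatible HTTP API, so swapping backends can be as simple as making the base URL configurable. A sketch of that pattern; the env-var names are conventional but hypothetical here, not something the app actually reads:

```javascript
// Resolve API settings from the environment, defaulting to OpenAI.
// Pointing OPENAI_BASE_URL at e.g. http://localhost:1234/v1 would send
// the same requests to a local OpenAI-compatible server instead.
function resolveApiConfig(env) {
  return {
    baseURL: env.OPENAI_BASE_URL || "https://api.openai.com/v1",
    apiKey: env.OPENAI_API_KEY || "",
  };
}
```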
| ralfelfving wrote: | It should work for any macOS app. It just takes a screenshot of | the currently active window, you can even append the | application name if you'd like. | lordswork wrote: | This looks very cool. Does anyone know of something similar for | Windows? (or does OP intend to extend support to Windows?) | ralfelfving wrote: | Hey, OP here. I don't have a Windows machine so have not been | able to confirm if it works, and probably won't be able to | develop/test for it either -- sorry! :/ | | I suspect that you should be able to take my code and only | require a few tweaks to make it work tho, shouldn't be much | about it that is macOS only. | coolspot wrote: | For testing/development, you can download a free Windows VM | here: https://developer.microsoft.com/en-us/windows/downloads/virt... | poorman wrote: | Currently imagining my productivity while waiting 10 seconds for | the results of the `ls` command. | ralfelfving wrote: | It's a basic demo to show people how it works. I think you can | imagine many other examples where it'll save you a lot of time. | hot_gril wrote: | The demo on Twitter is a lot cooler, partially because you | scroll to show the AI what the page has. Maybe there's a more | impressive demo to put on the GH too? | jamesmurdza wrote: | Have you thought about integrating the macOS accessibility API | for either reading text or performing actions? | ralfelfving wrote: | No, my thought process never really stretched outside of what I | built. I had this particular idea, then sat down to build it. I | had some idea of getting OpenAI to respond with keyboard | shortcuts that the application could execute. | | E.g. in Photoshop: "How do I merge all layers" --> "To merge | all layers you can use the keyboard shortcut Shift + command + | E" | | If you can get that response in JSON, you could prompt the user | if they want to take the suggested action. I don't see myself | using it very often, so didn't think much further about it.
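The JSON idea above could work by asking the model to answer in a small machine-readable shape and validating it before offering to run the shortcut. A sketch; the `{answer, shortcut}` schema is made up for illustration, not part of the app:

```javascript
// Parse a model reply expected to look like
// {"answer": "...", "shortcut": "Shift+Cmd+E"}. Falls back to treating
// the raw reply as plain text, and only accepts a shortcut that looks
// like modifier+key chords. Schema and regex are illustrative.
function parseActionReply(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    return { answer: raw, shortcut: null }; // model replied in plain text
  }
  const shortcut =
    parsed &&
    typeof parsed.shortcut === "string" &&
    /^([A-Za-z]+\+)+[A-Za-z0-9]+$/.test(parsed.shortcut)
      ? parsed.shortcut
      : null;
  return {
    answer: parsed && typeof parsed.answer === "string" ? parsed.answer : raw,
    shortcut,
  };
}
```

With a validated shortcut in hand, the app could then show a "press it for me?" prompt instead of only reading the answer aloud.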
| quinncom wrote: | I'd love to see a version of this that uses text input/output | instead of voice. I often have someone sleeping in the room with | me and don't want to speak. | ralfelfving wrote: | You're not the first to request it. Might add it, can't promise | tho. | hackncheese wrote: | Love it! Will definitely use this when a quick screenshot will | help specify what I am confused about. Is there a way to hide the | window when I am not using it? i.e. I hit cmd+shift+' and it | shows the window, then when the response finishes reading, it | hides again? | ralfelfving wrote: | There's a way for sure, it's just not implemented. Allowing for | more configurability of the window(s) is on my list, because it | annoys me too! :) | hackncheese wrote: | Annoyance Driven Development(tm) | qup wrote: | I have a tangential question: my dad is old. I would love to be | able to have this feature, or any voice access to an LLM, | available to him via an easy-to-press external button. Kind of | like the big "easy button" from Staples. Is there anything like | that, that can be made to trigger a keypress perhaps? | ralfelfving wrote: | I personally have no experience with configuring or triggering | keyboard shortcuts beyond what I learned and implemented in | this project. But with that said, I'm very confident that what | you're describing is not only possible but fairly easy. | Art9681 wrote: | Make sure to set OpenAI API spend limits when using this or | you'll quickly find yourself learning the difference between the | cost of the text models and vision models. | | EDIT: I checked again and it seems the pricing is comparable. | Good stuff. | ralfelfving wrote: | I think a prompt cost estimator might be a nifty thing to add | to the UI. | | Right now there's also a daily limit on the Vision API that | kicks in before it gets really bad, 100+ requests | depending on what your max spend limit is. | qirpi wrote: | Awesome! I love it!
I was just about to sign up for ChatGPT Plus, | but maybe I will pay for the API instead. So much good stuff | coming out daily. | | How does the pricing per message + reply end up in practice? (If | my calculations are right, it shouldn't be too bad, but sounds a | bit too good to be true) | ralfelfving wrote: | I have a hard time saying how much this particular application | costs to run, because I use the Voice+Vision APIs for so many | different projects on a near daily basis and haven't | implemented a prompt cost estimator. | | But I also pay for ChatGPT Plus, and it's sooo worth it to me. | | If you'd like to skip Plus and use something else, I don't | think my project is the right one. I'd STRONGLY suggest you | check out TypingMind, the best wrapper I've found: | https://www.typingmind.com/ | qirpi wrote: | Wow, thanks for sharing that link, I've been looking for | something like this :) | spullara wrote: | Did you not find the built-in voice-to-text and text-to-speech | APIs to be sufficient? | ralfelfving wrote: | Didn't even think of them to be honest. | zmmmmm wrote: | I've been wanting to build something like this by integrating | into the terminal itself. Seems very straightforward and avoids | the screenshotting. So you would just type a comment in the | right format and it would recognise it:
|   $ ls
|   a.txt b.txt c.txt
|   $ # AI: concatenate these files and sort the result on the third column
|   $ # ....
|   $ # cat a.txt b.txt c.txt | sort -k 3
| | This already works brilliantly by just pasting into CodeLLaMa so | it's purely terminal integration to make it work. All I need is | the rest of life to stop being so annoyingly busy. | paulmedwards wrote: | I wrote a simple command line app to let me quickly ask a | question in the terminal - https://github.com/edwardsp/qq. It | outputs the command I need and puts it in the paste buffer. I | use it all the time now, e.g.
|   $ qq concatenate all files in the current directory and sort the result on the third column
|   cat * | sort -k3
| zmmmmm wrote: | yep absolutely - have seen a few of those. And how well they | work is what inspires me to want the next parts, which are | (a) send the surrounding lines and output as context - notice | above I can ask it about "these files" (b) automatically add | the result to terminal history so I can avoid copy/paste if I | want to run it. I think this could make these things | absolutely fluid, almost like autocomplete (another crazy | idea is to _actually_ tie it into bash-completion so when you | press tab it does the above). | | CodeLLaMa with GPU acceleration on a Mac M1 is almost instant | in response, it's really compelling. | smcleod wrote: | Nice project, any plans to make it work with local LLMs rather | than "open"AI? | ralfelfving wrote: | Thanks. Had no plans, but might give it a try at some point. | For me, personally, using OpenAI for this isn't an issue. | hmottestad wrote: | I think that LM Studio has an OpenAI "compliant" API, so if | there is something similar that supports vision+text then it | would be easy enough to make the base URL configurable and then | point it to localhost. | | Do you know of a simple setup that I can run locally with | support for both images and text? ___________________________________________________________________ (page generated 2023-12-12 23:00 UTC)