[HN Gopher] Segment Anything Model and the hard problems of comp...
___________________________________________________________________
 
Segment Anything Model and the hard problems of computer vision
 
Author : swyx
Score  : 100 points
Date   : 2023-04-13 17:01 UTC (5 hours ago)
 
(HTM) web link (www.latent.space)
(TXT) w3m dump (www.latent.space)
 
| endisneigh wrote:
| What's an interesting problem that's solved with Segment
| Anything?
| 
| swyx wrote:
| see the video demo where Joseph showed how it improves on SOTA:
| https://youtu.be/SZQSF-A-WkA
| 
| mritchie712 wrote:
| yep, the value is pretty clear from his demo. It goes from
| dozens of clicks to identify an object within an image to a
| single click. SAM does almost exactly what you'd want as a
| human in every one of his examples.
| 
| rampantraccoon wrote:
| The problem being solved is AI being able to distinguish unique
| objects within visual data. Before SAM, people had to label
| data and train a model on specific objects so it could
| recognize those objects. That becomes problematic given the
| variety of objects in the world, the settings they can be in,
| and their orientation in an image. SAM can identify objects it
| has never seen before, i.e. objects that might not be part of
| the training data.
| 
| Once you can determine which pixels belong to which object
| automatically, you can start to use that knowledge for other
| applications.
| 
| If you have SAM showing you all the objects, you can use other
| models to identify what each object is, understand its
| shape/size, understand depth/distance, etc. It's a foundational
| model to build on for any application that takes visual data as
| input.
| 
| DaiPlusPlus wrote:
| > SAM can identify objects it has never seen before
| 
| I'd love to see what SAM does when you send it a photo of
| rolling fog though, e.g.
| https://www.google.com/search?q=rolling+fog+scotland&tbm=isc...
| - what happens then? (and how can it meaningfully segment out
| fog?)
| 
| yeldarb wrote:
| Not sure if this is what you mean, but I grabbed some of those
| images and dropped them in to see what it predicted:
| https://imgur.com/a/CXLmYXo
| 
| idopmstuff wrote:
| It groups the fog as a single object (except where it's
| separated by things like hills).
| 
| You can see what it does - it's available to test at
| https://segment-anything.com/
| 
| endisneigh wrote:
| Yes, what I am interested in is the other applications.
| 
| swyx wrote:
| Hey HN! I'm very proud to release the deepest interview/deep
| dive into the SAM model I could find on the internet (seriously
| - i looked on YouTube and Listen Notes and all of them were
| pretty superficial). The Roboflow team has spent the past week
| hacking on and building with SAM, and I ran into Joseph Nelson
| this weekend and realized he might be the perfect non-Meta-AI
| person to discuss what it means for developers building with
| SAM.
| 
| so... enjoy! worked really hard on the prep and editing; any
| feedback and suggestions/recommendations welcome. still new to
| AI and new to the podcast game.
| 
| edit: the video demo is here in case people miss it:
| https://youtu.be/SZQSF-A-WkA
| 
| CyberDildonics wrote:
| It's easy to have someone lay down 30 points for a simple
| banana-shaped outline and compare segmentation to that, but how
| does this compare to other automatic techniques like spectral
| matting (which is now 16 years old)?
| http://people.csail.mit.edu/alevin/papers/spectral-matting-l...
| 
| re5i5tor wrote:
| Really great, thank you.
| 
| swyx wrote:
| any requests for under-covered topics? i felt like this one
| resonated because somehow the other podcasters/youtubers seemed
| to miss how big of a deal it was. hungry for more.
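 
As a concrete illustration of the pipeline rampantraccoon describes
above (SAM proposes masks for everything; downstream models consume
them), here is a minimal sketch against the API in Meta's
segment-anything repo. The image path is a placeholder, and the
checkpoint filename is the ViT-H weights file Meta distributes;
adjust both for your setup.
 
    import cv2
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
    
    # Load the ViT-H SAM checkpoint and wrap it in the automatic mask
    # generator, which prompts the model with a grid of points across
    # the whole image.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    mask_generator = SamAutomaticMaskGenerator(sam)
    
    # SAM expects an HxWx3 uint8 RGB array.
    image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)  # one dict per proposed object
    
    for m in masks:
        # m["segmentation"] is a boolean HxW array marking the object's
        # pixels; crop it out and hand it to a classifier, a depth
        # estimator, or whatever downstream model you like.
        print(m["bbox"], m["area"], m["predicted_iou"])
 
Each mask dict also carries quality estimates (predicted_iou,
stability_score), which can serve as a crude first-pass noise filter
for the over-segmentation issue discussed further down.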
| 
| Eisenstein wrote:
| Why is there no volume control on your podcast page player?
| 
| steventey wrote:
| Incredible stuff - glad you got to collab with Joseph on this!!
| 
| swyx wrote:
| thanks steven... would love to chat ShareGPT learnings and
| whatever else u have going on next time you are in town?
| 
| jeron wrote:
| Curious if ChatGPT can convert this podcast transcript into an
| article
| 
| LoganDark wrote:
| Then try it...? Maybe it also could have written a more
| thoughtful comment for you.
| 
| passion__desire wrote:
| Text just doesn't cut it for many now. I would love it if AI
| could generate minimal animations that go along with a comment.
| Something like this:
| https://www.youtube.com/shorts/Qnvt5GHTywU
| 
| swyx wrote:
| fun fact: i just sent some money to someone offering to do
| youtube shorts for me, so we might try to do this. honestly i
| don't think shorts are conducive to deep conversation or
| technical topics tho. people watch with their brains off
| 
| DonHopkins wrote:
| In the style of Hunter S. Thompson.
| https://www.youtube.com/watch?v=vUgs2O7Okqc
| 
| wmwmwm wrote:
| I've had some pretty remarkable results pasting lecture
| transcripts from YouTube into GPT-4 and getting well-formatted,
| relevant markdown summaries from meandering and mis-transcribed
| content! Needs chunking up, but it's surprisingly effective. It
| can even generate YouTube URLs with the right timestamps if you
| ask it nicely.
| 
| variousNick wrote:
| It's less configurable than what you're describing, but I've
| found this useful for at least determining whether a given
| video has the content I'm looking for:
| https://www.summarize.tech/
| 
| wmwmwm wrote:
| There's also a Python API for getting the transcript from a
| given YouTube video ID, so you can script the whole thing.
| 
| m3kw9 wrote:
| The thing is, when you segment everything in a scene, sometimes
| those segments are really parts of a single object, say a
| laptop: it starts segmenting the trackpad, individual keys, the
| screen. Then you need another algorithm or human intervention -
| a noise filter - to say that a given segment is pointless.
| 
| alsodumb wrote:
| Which is not bad, imo. In fact, the SAM paper actively proposes
| a fix for this: bounding-box models are relatively easy to
| train, and SAM can take a rough bounding box of the laptop as
| an input prompt and produce a detailed segmentation of the
| laptop. Their webpage has an example with a bunch of cats.
| 
| SAM is good at (i) producing detailed masks around segments and
| (ii) taking a wide range of prompts as input to decide exactly
| what the user wants segmented, and it processes those prompts
| with very low compute requirements.
| 
| I think SAM is a very well-designed architecture and I'm not
| sure how it could be much better. Coming back to your question:
| there has to be some signal that the user wants a segmentation
| of the whole laptop, and SAM takes exactly that as its prompt
| input.
| 
| dekhn wrote:
| The real hard problem in CV (and science in general) is bad
| papers that omit useful info.
| 
| Segment Anything requires an image embedding. They report in
| the paper that segmentation takes ~50ms, but conveniently leave
| out that computing the embedding of a 640x480 image with their
| model takes 2+ seconds (on a 3080 Ti). Well, at least they
| released all the code, the model, and enough instructions to
| figure that part out.
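 
On wmwmwm's point about scripting the whole thing: one such library
is the third-party youtube-transcript-api package (an assumption
about which API they meant; the thread doesn't name one). A minimal
sketch, using the demo video ID linked above:
 
    from youtube_transcript_api import YouTubeTranscriptApi
    
    # Fetch the (auto-generated) transcript as a list of
    # {"text", "start", "duration"} segments.
    segments = YouTubeTranscriptApi.get_transcript("SZQSF-A-WkA")
    text = " ".join(s["text"] for s in segments)
    
    # Chunk `text` to fit the context window, then feed each chunk to
    # GPT-4 for a markdown summary, as described above.
    print(text[:500])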
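 
alsodumb's box-prompt workflow and dekhn's timing complaint are two
sides of the same design: the heavy image encoder runs once per
image, and every prompt afterwards decodes cheaply against the
cached embedding. A sketch of that two-stage flow using the
segment-anything API; the image path and box coordinates are
placeholders:
 
    import time
    
    import cv2
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor
    
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)
    image = cv2.cvtColor(cv2.imread("laptop.jpg"), cv2.COLOR_BGR2RGB)
    
    t0 = time.perf_counter()
    predictor.set_image(image)  # runs the heavy ViT encoder once (slow)
    t1 = time.perf_counter()
    
    # A rough box around the laptop, in xyxy pixel coordinates; this
    # is the "signal that the user wants the whole laptop" prompt.
    box = np.array([100, 50, 500, 400])
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    t2 = time.perf_counter()
    
    print(f"embedding: {t1 - t0:.2f}s, prompt decode: {t2 - t1:.3f}s")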
| 
| p1esk wrote:
| They mention this multiple times in the paper. For example, in
| the "Limitations" section they write: "SAM can process prompts
| in real-time, but nevertheless SAM's overall performance is not
| real-time when using a heavy image encoder."
| 
| This paper is one of the highest-quality papers released this
| year. I wish more papers were so clear and informative.
| 
| dekhn wrote:
| Yes, I read that section. They should have included the time
| required by the heavy image encoder, unless you know of a way
| to make SAM work with another encoder.
| 
| haliskerbas wrote:
| I wonder if Lex Fridman will cover this
| 
| brianjking wrote:
| Why would it matter if he did?
| 
| not-my-account wrote:
| Many would find it interesting!
| 
| swyx wrote:
| i mean, maybe i'm the new lex fridman? i can ask my guests what
| the meaning of life is if you want
| 
| yeldarb wrote:
| The feature they discussed re: using Segment Anything to
| greatly speed up labeling is now live in Roboflow Annotate for
| anyone to try: https://blog.roboflow.com/label-data-segment-
| anything-model-...
| 
| Distilling these big, slow vision transformer models into
| something that can be used in real time on the edge is going to
| be huge.
| 
| swyx wrote:
| > Distilling these big, slow vision transformer models into
| something that can be used in real time on the edge is going to
| be huge.
| 
| something i didn't quite get to is: does Roboflow do this for
| you, or are you pointing to more work that you'd like to happen
| someday (possibly done by Roboflow, possibly someone else)?
| also, are you worried about the business model if people can
| distill models to run on their own devices (so they don't need
| to pay you anymore)?
| 
| yeldarb wrote:
| This is the core of what we do! Previously our job was
| distilling human knowledge into a model; now that knowledge is
| starting to come from bigger models, with humans managing the
| objectives instead of doing the labor.
| 
| > also, are you worried about the business model if people can
| distill models to run on their own devices (so they don't need
| to pay you anymore)?
| 
| This is probably a risk to the current business model over the
| long term, but we're constantly working on reinventing
| ourselves and finding new ways to provide value. If we don't
| adapt to the changing world, we deserve to go out of business
| someday. I'd much rather help build the thing that makes us
| obsolete than sit idly by while someone else builds it.
| 
| I think of this risk similarly to the way I'm marginally
| worried that, in the long run, AGI will obviate the need for my
| job. Probably true, but the opportunities it will present are
| far greater, and it's better to focus on how to be valuable in
| the future than cling to how I provide value today.
| 
| swyx wrote:
| haha, true true. thanks Brad, appreciate the responses, and
| would love to have you on the pod next time there's hot
| computer vision news!
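 
On yeldarb's distillation point, a minimal sketch of the usual
recipe under stated assumptions: SAM's masks act as pseudo-labels
for a small student network. The tiny student architecture and the
random data loader below are illustrative placeholders, not anything
Roboflow or Meta ships.
 
    import torch
    import torch.nn as nn
    
    # A deliberately tiny student; a real one would be a MobileNet-
    # or UNet-style encoder-decoder sized for the target edge device.
    student = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 1, 1),  # per-pixel mask logits
    )
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()
    
    def batches():
        # Placeholder loader: in practice, yield real images paired
        # with masks produced offline by SAM (the "teacher"), e.g.
        # via SamAutomaticMaskGenerator.
        for _ in range(100):
            images = torch.rand(8, 3, 256, 256)
            sam_masks = torch.randint(0, 2, (8, 1, 256, 256)).float()
            yield images, sam_masks
    
    for images, sam_masks in batches():
        logits = student(images)       # (B, 1, H, W) raw logits
        loss = bce(logits, sam_masks)  # imitate the teacher's masks
        opt.zero_grad()
        loss.backward()
        opt.step()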