[HN Gopher] Segment Anything Model and the hard problems of comp...
       ___________________________________________________________________
        
       Segment Anything Model and the hard problems of computer vision
        
       Author : swyx
       Score  : 100 points
       Date   : 2023-04-13 17:01 UTC (5 hours ago)
        
 (HTM) web link (www.latent.space)
 (TXT) w3m dump (www.latent.space)
        
       | endisneigh wrote:
       | What's an interesting problem that's solved with segment
       | anything?
        
         | swyx wrote:
          | see the video demo where Joseph showed how it improves on
          | SOTA: https://youtu.be/SZQSF-A-WkA
        
           | mritchie712 wrote:
           | yep, value is pretty clear from his demo. Goes from dozens of
           | clicks to identify an object within an image to a single
           | click. SAM does almost exactly what you'd want as a human in
           | every one of his examples.
        
         | rampantraccoon wrote:
         | The problem being solved is AI being able to distinguish unique
         | objects within visual data. Before SAM, people would have to
         | train a model on specific objects by labeling data and training
         | a model to understand those objects specifically. This becomes
         | problematic given the variety of objects in the world, settings
         | they can be in, and their orientation in an image. SAM can
         | identify objects it has never seen before, as in objects that
         | might not be part of the training data.
         | 
         | Once you can determine which pixels belong to which object
         | automatically, you can start to utilize that knowledge for
         | other applications.
         | 
         | If you have SAM showing you all objects, you can use other
          | models to identify what the object is, understand its
         | shape/size, understand depth/distance, etc. It's a foundational
         | model to build off of for any application that wants to use
         | visual data as an input.
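          | 
          | A rough sketch of that first step (assuming Meta's
          | segment_anything package and a downloaded ViT-H checkpoint;
          | the image filename is illustrative):
          | 
          |     import cv2
          |     from segment_anything import (SamAutomaticMaskGenerator,
          |                                   sam_model_registry)
          | 
          |     # Load pretrained SAM weights (checkpoint from Meta's repo)
          |     sam = sam_model_registry["vit_h"](
          |         checkpoint="sam_vit_h_4b8939.pth")
          | 
          |     # Propose a mask for every object SAM can find in the image
          |     generator = SamAutomaticMaskGenerator(sam)
          |     image = cv2.cvtColor(cv2.imread("scene.jpg"),
          |                          cv2.COLOR_BGR2RGB)
          |     masks = generator.generate(image)
          | 
          |     # Each mask is a dict: a boolean "segmentation" array plus
          |     # metadata ("area", "bbox", ...) for downstream models
          |     print(len(masks), masks[0]["bbox"])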
        
           | DaiPlusPlus wrote:
           | > SAM can identify objects it has never seen before
           | 
            | I'd love to see what SAM does when you send it a photo of
            | rolling fog though, e.g.
            | https://www.google.com/search?q=rolling+fog+scotland&tbm=isc...
            | - what happens then? (and how can it meaningfully segment
            | out fog?)
        
             | yeldarb wrote:
             | Not sure if this is what you mean, but I grabbed some of
             | those images & dropped them in to see what it predicted:
             | https://imgur.com/a/CXLmYXo
        
             | idopmstuff wrote:
             | It groups the fog as a single object (except where it's
             | separated by things like hills).
             | 
             | You can see what it does - it's available to test at
             | https://segment-anything.com/.
        
           | endisneigh wrote:
           | Yes, what I am interested in are the other applications.
        
       | swyx wrote:
        | Hey HN! I'm very proud to release the deepest interview/deep
        | dive on the SAM model I could find on the Internet (seriously -
        | i looked on youtube and listennotes and all of them were pretty
        | superficial). The Roboflow team has spent the past week hacking
        | on and building with SAM, and I ran into Joseph Nelson this
        | weekend and realized he might be the perfect non-Meta-AI person
        | to discuss what it means for developers building with SAM.
       | 
       | so.. enjoy! worked really hard on the prep and editing, any
       | feedback and suggestions/recommendations welcome. still new to AI
       | and new to the podcast game.
       | 
       | edit: Video demo is here in case people miss it
       | https://youtu.be/SZQSF-A-WkA
        
         | CyberDildonics wrote:
         | It's easy to have someone lay down 30 points for a simple
         | banana shaped outline and compare segmentation to that, but how
         | does this compare to other automatic techniques like spectral
         | matting (which is now 16 years old) ?
         | 
         | http://people.csail.mit.edu/alevin/papers/spectral-matting-l...
        
         | re5i5tor wrote:
         | Really great, thank you.
        
           | swyx wrote:
           | any requests for under-covered topics? i felt like this one
           | resonated because somehow the other podcasters/youtubers
           | seemed to miss how big of a deal it was. hungry for more.
        
         | Eisenstein wrote:
         | Why is there no volume control on your podcast page player?
        
       | steventey wrote:
       | Incredible stuff - glad you got to collab with Joseph on this!!
        
         | swyx wrote:
          | thanks steven... would love to chat about ShareGPT learnings
          | and whatever else you have going on next time you're in town!
        
       | jeron wrote:
        | Curious whether ChatGPT can convert this podcast transcript
        | into an article
        
         | LoganDark wrote:
          | Then try it...? Maybe it could also have written a more
          | thoughtful comment for you.
        
           | passion__desire wrote:
            | Text just doesn't cut it for many people now. I would love
            | it if AI could customize minimal animations that go along
            | with the comment. Something like this:
           | 
           | https://www.youtube.com/shorts/Qnvt5GHTywU
        
             | swyx wrote:
              | fun fact i just sent some money to someone offering to do
              | youtube shorts for me so we might try to do this.
              | honestly i don't think shorts are conducive to deep
              | conversation or technical topics though. people watch
              | with their brain off
        
           | DonHopkins wrote:
           | In the style of Hunter S. Thompson.
           | 
           | https://www.youtube.com/watch?v=vUgs2O7Okqc
        
         | wmwmwm wrote:
          | I've had some pretty remarkable results pasting lecture
          | transcripts from youtube into GPT-4 and getting well-
          | formatted, relevant markdown summaries from meandering and
          | mis-transcribed content! Needs chunking up, but it's
          | surprisingly effective. It can even generate youtube urls
          | with the right timestamps if you ask it nicely.
        
           | variousNick wrote:
           | It's less configurable than what you're describing, but I've
           | found this useful in at least determining if a given video
           | has the content I'm looking for: https://www.summarize.tech/
        
           | wmwmwm wrote:
           | There's also a python api for getting the transcript from a
           | given youtube video id so you can script the whole thing
        
       | m3kw9 wrote:
        | Thing is, when you segment everything in a scene, sometimes
        | those things are actually just one object, say a laptop, and
        | it starts segmenting the trackpad, individual keys, and screen
        | separately. Then you need another algorithm or human
        | intervention to say this segment is pointless, etc. - a noise
        | filter.
        
         | alsodumb wrote:
          | Which is not bad imo. In fact, SAM actively proposes a fix
          | for this: bounding-box models are relatively easy to train,
          | and SAM can take a rough bounding box of the laptop as an
          | input prompt and create a detailed segmentation of the
          | laptop. Their webpage has an example with a bunch of cats.
         | 
          | SAM is good at (i) drawing a detailed mask around a segment,
          | and (ii) taking a wide range of prompts as input to decide
          | what exactly the user wants segmented - and it processes
          | these prompts with very low compute requirements.
         | 
          | I think SAM is a very well designed architecture and I'm not
          | sure how much better it could be. Coming back to your
          | question: there has to be some signal that the user wants a
          | segmentation of the whole laptop, and SAM takes exactly that
          | as its prompt input.
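          | 
          | A minimal sketch of the box-prompt flow, assuming Meta's
          | segment_anything package (the box coordinates and filename
          | are made up):
          | 
          |     import cv2
          |     import numpy as np
          |     from segment_anything import SamPredictor, sam_model_registry
          | 
          |     sam = sam_model_registry["vit_h"](
          |         checkpoint="sam_vit_h_4b8939.pth")
          |     predictor = SamPredictor(sam)
          | 
          |     image = cv2.cvtColor(cv2.imread("desk.jpg"),
          |                          cv2.COLOR_BGR2RGB)
          |     predictor.set_image(image)
          | 
          |     # Rough XYXY box around the laptop -> one mask for the
          |     # whole laptop instead of keys/trackpad/screen fragments
          |     box = np.array([100, 80, 620, 460])
          |     masks, scores, _ = predictor.predict(
          |         box=box, multimask_output=False)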
        
       | dekhn wrote:
       | The real hard problem in CV (and science in general) is bad
       | papers that omit useful info.
       | 
       | Segment Anything requires an image embedding. They report in the
       | paper that segmentation takes ~50ms, but conveniently leave out
       | that computing an embedding of an image (640x480) in their model
       | takes ~2+ seconds (on a 3080 Ti). Well, at least they released
       | all the code and model and enough instructions to figure that
       | part out.
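        | 
        | For reference, a rough sketch of measuring both stages yourself
        | (assuming a CUDA GPU, the official ViT-H checkpoint, and an
        | illustrative test image):
        | 
        |     import time
        |     import cv2
        |     import numpy as np
        |     import torch
        |     from segment_anything import SamPredictor, sam_model_registry
        | 
        |     sam = sam_model_registry["vit_h"](
        |         checkpoint="sam_vit_h_4b8939.pth").to("cuda")
        |     predictor = SamPredictor(sam)
        |     image = cv2.cvtColor(cv2.imread("photo.jpg"),
        |                          cv2.COLOR_BGR2RGB)
        | 
        |     t0 = time.time()
        |     predictor.set_image(image)   # heavy ViT image encoder
        |     torch.cuda.synchronize()
        |     t1 = time.time()
        |     masks, _, _ = predictor.predict(   # light prompt decoder
        |         point_coords=np.array([[320, 240]]),
        |         point_labels=np.array([1]))
        |     t2 = time.time()
        |     print(f"embed: {t1 - t0:.2f}s  decode: {t2 - t1:.3f}s")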
        
         | p1esk wrote:
         | They mention this multiple times in the paper. For example, in
         | the "Limitations" section they write: "SAM can process prompts
         | in real-time, but nevertheless SAM's overall performance is not
         | real-time when using a heavy image encoder."
         | 
         | This paper is one of the highest quality papers released this
         | year. I wish more papers were so clear and informative.
        
           | dekhn wrote:
            | Yes, I read that section. They should have included the
            | time required by the heavy image encoder - unless you know
            | of a way to make SAM work with another encoder.
        
       | haliskerbas wrote:
        | I wonder if Lex Fridman will cover this
        
         | brianjking wrote:
         | Why would it matter if he did?
        
           | not-my-account wrote:
           | Many would find it interesting!
        
             | swyx wrote:
              | i mean maybe i'm the new lex fridman? i can ask my guests
              | what the meaning of life is if you want
        
       | yeldarb wrote:
        | The feature they discussed re: using Segment Anything to
        | greatly speed up labeling is now live in Roboflow Annotate for
        | anyone to try:
        | https://blog.roboflow.com/label-data-segment-anything-model-...
       | 
       | Distilling these big, slow vision transformer models into
       | something that can be used in realtime on the edge is going to be
       | huge.
        
         | swyx wrote:
         | > Distilling these big, slow vision transformer models into
         | something that can be used in realtime on the edge is going to
         | be huge.
         | 
          | something i didn't quite get to: does Roboflow do this for
          | you, or are you pointing to more work that you'd like to
          | happen someday? (possibly done by Roboflow, possibly someone
          | else) also, are you worried about the business model if
          | people can distill to run on their own devices (so they
          | don't need to pay you anymore)?
        
           | yeldarb wrote:
           | This is the core of what we do! Previously our job was
           | distilling human knowledge into a model; now that knowledge
           | is starting to come from bigger models with humans managing
           | the objectives vs doing the labor.
           | 
            | > also, are you worried about the business model if people
            | > can distill to run on their own devices (so they don't
            | > need to pay you anymore)?
           | 
           | This is probably a risk to the current business model over
           | the long term, but we're constantly working on reinventing
           | ourselves & finding new ways to provide value. If we don't
           | adapt to the changing world we deserve to go out of business
           | someday. I'd much rather help build the thing that makes us
           | obsolete than sit idly by while someone else builds it.
           | 
           | I think of this risk similarly to the way that I'm marginally
           | worried that, in the long run, AGI will obviate the need for
           | my job. Probably true, but the opportunities it will present
           | are far greater & it's better to focus on how to be valuable
           | in the future than cling to how I provide value today.
        
             | swyx wrote:
              | haha true true. thanks Brad, appreciate the responses -
              | would love to have you on the pod next time there's hot
              | computer vision news!
        
       ___________________________________________________________________
       (page generated 2023-04-13 23:00 UTC)