[HN Gopher] Real-Time Coherent 3D Reconstruction from Monocular ...
       ___________________________________________________________________
        
       Real-Time Coherent 3D Reconstruction from Monocular Video
        
       Author : samber
       Score  : 88 points
       Date   : 2022-03-14 17:50 UTC (5 hours ago)
        
 (HTM) web link (zju3dv.github.io)
 (TXT) w3m dump (zju3dv.github.io)
        
       | stefan_ wrote:
       | Looked cool, then I read that there is some Apple ARKit magic
       | black box in the middle of it all.
        
         | cmelbye wrote:
         | I don't think that's true. The paper says that a camera pose
         | estimated by a SLAM system is required. ARKit implements SLAM
         | and can easily provide camera pose for each frame through the
         | ARFrame class. But there are countless other implementations of
         | SLAM, including Android ARCore, Oculus Quest, Roomba, self-
         | driving cars, and a number of GitHub repos
         | (https://github.com/tzutalin/awesome-visual-slam).
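         | 
         | For illustration, here is a minimal Python sketch of the kind of
         | per-frame input such a method consumes: an RGB image plus a
         | SLAM-estimated camera-to-world pose and intrinsics. The file
         | layout below is hypothetical, not the paper's actual data
         | format; the point is that any SLAM system can supply the pose.
         | 
         |     import cv2
         |     import numpy as np
         | 
         |     def load_frame(i):
         |         # RGB image for frame i (cv2 loads BGR, so convert)
         |         bgr = cv2.imread(f"frames/{i:06d}.jpg")
         |         rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
         |         # 4x4 camera-to-world pose from the SLAM system; ARKit
         |         # exposes this matrix as ARFrame.camera.transform
         |         pose_c2w = np.loadtxt(f"poses/{i:06d}.txt").reshape(4, 4)
         |         # 3x3 pinhole intrinsics K
         |         K = np.loadtxt("intrinsics.txt").reshape(3, 3)
         |         return rgb, pose_c2w, K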
        
         | fxtentacle wrote:
         | Yeah, I also find it odd to use LIDAR-based poses and then
         | call it "monocular".
        
       | spyder wrote:
       | It's not exactly the same but Neural Radiance Fields are getting
       | more impressive:
       | 
       | The first one was this, but it was slow:
       | https://www.matthewtancik.com/nerf
       | 
       | Then it got faster: https://www.youtube.com/watch?v=fvXOjV7EHbk
       | 
       | Lots of interesting papers:
       | 
       | https://github.com/yenchenlin/awesome-NeRF
        
       | alhirzel wrote:
       | Does anyone know what the state of the art is for doing this type
       | of reconstruction as a streaming input to detection and
       | recognition algorithms? For instance, this could be used for
       | object detection and identification on a recycling conveyor line.
        
         | beambot wrote:
         | I don't believe that either does reconstruction... but for the
         | recycling application, there are a handful of companies
         | tackling this problem -- e.g. Everest Labs & Amp Robotics.
        
       | leobg wrote:
       | So much for the folks who think Tesla is on a fool's errand for
       | using cameras instead of LIDAR.
        
         | kajecounterhack wrote:
         | Companies like Waymo and Cruise use this kind of technology
         | too. Unfortunately there are tons of corner cases of weird
         | things you haven't seen before -- for example, some special
         | vehicles self-occlude and you never get enough coverage to
         | observe them correctly until you're too close. In general,
         | radars and lidars used in _conjunction_ with cameras can handle
         | occluded objects much better.
         | 
         | Also, to measure the performance / evaluate observations
         | generated from this tech, you would want to compare it to a
         | pretty sizable 3D ground truth set which Tesla does not
         | currently have. There are pretty big advantages to starting
         | with a maximal set of sensors even if (eventually)
         | breakthroughs turn them into unnecessary crutches.
        
           | leobg wrote:
           | That was very insightful. Do you work in that space? It is
           | comments like yours that make HN a special place.
        
         | ceejayoz wrote:
         | The failure mode (for example: decapitation;
         | https://www.latimes.com/business/la-fi-tesla-florida-
         | acciden...) is pretty significant when used in a Tesla. Less so
         | in this tech demo.
        
       | billconan wrote:
       | Can ARKit return an accurate camera position?
        
         | upbeat_general wrote:
         | I haven't looked at any metrics, but based on using ARKit
         | applications (and various VIO SLAM implementations) it can.
         | It depends heavily on the scene, the camera motion, and
         | whether LIDAR or stereo depth is available.
        
       | AndrewKemendo wrote:
       | Honestly this doesn't look any better than what we were doing
       | back in 2016-2017. I'm not sure what's novel here.
       | 
       | This is the only video I could find, but we were doing monocular
       | reconstruction from a limited number of RGB (not depth) images
       | AND doing voxel segmentation on the processing side.
       | https://www.youtube.com/watch?v=nqy44VSWh3g
       | 
       | Even as far back as 2010, people were doing reasonable monocular
       | reconstruction, including with software like Meshroom, etc. The
       | group at TU Munich under Matthias Niessner has also been doing
       | this for a while.
       | 
       | What's novel here?
        
         | tintor wrote:
         | Fast enough to be used for mobile robots?
        
         | nobbis wrote:
         | Their research doesn't just integrate depth maps into a TSDF -
         | it uses NNs to incorporate surface priors.
         | 
         | I don't recall you having similar real-time meshing
         | functionality in 2016-2017, Andrew. Can you show what you had?
         | 
         | As far as I'm aware, Abound was the first to demo real-time
         | monocular mobile meshing: on Android in early 2017 (e.g.
         | https://www.youtube.com/watch?v=K9CpT-sy7HE), and iOS in early
         | 2018 (e.g.
         | https://twitter.com/nobbis/status/972298968574013440).
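         | 
         | For reference, plain depth-map-into-TSDF integration (the
         | KinectFusion-style baseline that the learned surface priors go
         | beyond) looks roughly like the numpy sketch below. It is a
         | simplified illustration, not the paper's method:
         | 
         |     import numpy as np
         | 
         |     def integrate(tsdf, weight, depth, K, pose_c2w,
         |                   origin, voxel, trunc=0.1):
         |         # Classical running-average TSDF update for one depth map
         |         nx, ny, nz = tsdf.shape
         |         ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny),
         |                                  np.arange(nz), indexing="ij")
         |         offs = np.stack([ii, jj, kk], -1).reshape(-1, 3)
         |         pts_w = origin + voxel * offs  # voxel centres in world
         |         # Move voxels into the camera frame and project with K
         |         w2c = np.linalg.inv(pose_c2w)
         |         pts_c = pts_w @ w2c[:3, :3].T + w2c[:3, 3]
         |         z = pts_c[:, 2]
         |         z_safe = np.maximum(z, 1e-6)   # avoid divide-by-zero
         |         uv = pts_c @ K.T
         |         u = np.round(uv[:, 0] / z_safe).astype(int)
         |         v = np.round(uv[:, 1] / z_safe).astype(int)
         |         h, w = depth.shape
         |         ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
         |         dm = depth[np.clip(v, 0, h - 1), np.clip(u, 0, w - 1)]
         |         sdf = dm - z                   # signed distance along ray
         |         upd = ok & (dm > 0) & (sdf > -trunc)
         |         t, wgt = tsdf.reshape(-1), weight.reshape(-1)
         |         new = np.clip(sdf[upd] / trunc, -1.0, 1.0)
         |         t[upd] = (wgt[upd] * t[upd] + new) / (wgt[upd] + 1.0)
         |         wgt[upd] += 1.0
         |         return tsdf, weight
         | 
         |     # Usage: tsdf = np.ones(dims), weight = np.zeros(dims), then
         |     # call integrate() once per frame and mesh with marching cubes.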
        
         | pj_mukh wrote:
         | Looks like a much better response to white walls/textureless
         | surfaces.
        
         | fxtentacle wrote:
         | This is a paper about a new way of storing/merging 3D data.
         | 
         | The actual 3D reconstruction is so-so, I agree. And they kinda
         | cheat by using ARKit (which uses LIDAR internally) to get good
         | camera poses even if there is little texture.
         | 
         | So the novel part here is that they can immediately merge all
         | the images into a coherent representation of the 3D space, as
         | opposed to first doing bundle adjustment, then doing pairwise
         | depth matching, then doing streak-based depth matching, and
         | then merging the resulting point clouds.
         | 
         | Also, they can use learned 3D shape priors to improve their
         | results. Basically that means "if there is no visible gap,
         | assume the surface is flat". But AFAIK, that's not new.
         | 
         | EDIT: My main criticism of this paper, after looking at the
         | source code a bit, would be that due to the TSDF, which is like
         | a 3D voxel grid, they need insane amounts of GPU memory, or
         | else the scenes need to be either very small or low resolution.
         | That is most likely also the reason why the reconstruction
         | looks so cartoon-like and is smooth on all corners: they lack
         | the memory to store more high-frequency detail.
         | 
         | EDIT2: Mainly, it looks like they managed to reduce the GPU
         | memory consumption of Atlas [1], which is why they can
         | reconstruct larger areas and/or at higher resolution. But it
         | still has far less detail than Colmap [2].
         | 
         | [1] https://github.com/magicleap/Atlas
         | 
         | [2] https://colmap.github.io/
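         | 
         | To put rough numbers on the memory point above (illustrative
         | figures, not taken from the paper): a dense TSDF voxel grid
         | grows cubically with resolution.
         | 
         |     import math
         | 
         |     def dense_tsdf_bytes(extent_m, voxel_m, bytes_per_voxel=8):
         |         # 8 bytes = one float32 TSDF value + one float32 weight
         |         # per voxel; learned per-voxel feature channels would
         |         # multiply this further.
         |         nx, ny, nz = (math.ceil(e / voxel_m) for e in extent_m)
         |         return nx * ny * nz * bytes_per_voxel
         | 
         |     room = (10.0, 10.0, 3.0)        # scene extent in metres
         |     for v in (0.04, 0.02, 0.01):    # voxel size in metres
         |         print(f"{v * 100:.0f} cm voxels -> "
         |               f"{dense_tsdf_bytes(room, v) / 1e9:.2f} GB")
         |     # 4 cm voxels -> 0.04 GB
         |     # 2 cm voxels -> 0.30 GB
         |     # 1 cm voxels -> 2.40 GB
         |     # Halving the voxel size multiplies memory by 8, which is
         |     # why scenes stay small or the reconstruction stays coarse.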
        
         | closetnerd wrote:
         | Says it's real-time
        
           | AndrewKemendo wrote:
           | 2016 from Matthias Niessner's group
           | 
           | https://www.youtube.com/watch?v=keIirXrRb1k
           | 
           | http://graphics.stanford.edu/projects/bundlefusion/
        
             | jonas21 wrote:
             | That requires depth input
        
               | AndrewKemendo wrote:
               | Good point - I don't recall offhand which paper was the
               | mono-RT one.
               | 
               | At a minimum, though, 6D.ai and a few other companies
               | were selling this as a service at least as far back as
               | 2017.
        
           | fxtentacle wrote:
           | I always found ORB-SLAM2 pretty impressive; it can map 3D
           | neighborhoods in real time while you drive around in a car:
           | 
           | https://www.youtube.com/watch?v=ufvPS5wJAx0
           | 
           | https://www.youtube.com/watch?v=3BrXWH6zRHg
        
       | polishdude20 wrote:
       | Shame there's no Android or iPhone app available
        
       | adampk wrote:
       | I am surprised that the team didn't choose to add the usual
       | "Fusion" suffix to the name.
       | 
       | This seems to fit into the genealogy of KinectFusion,
       | ElasticFusion, BundleFusion, etc.
       | 
       | https://www.microsoft.com/en-us/research/wp-content/uploads/...
       | https://www.imperial.ac.uk/dyson-robotics-lab/downloads/elas...
       | https://graphics.stanford.edu/projects/bundlefusion/
       | 
       | Very impressive work. I have not seen any use cases for online 3D
       | reconstruction unfortunately. 6D.ai made terrific progress in
       | this tech but also could not find great use cases for online
       | reconstruction and ended up having to sell to Niantic.
       | 
       | Seems like what people want, if they want 3D reconstruction, is
       | extremely high-fidelity scans (a la Matterport), and they are
       | willing to wait for the model. Unfortunately, TSDF approaches
       | create a "slimy" end look, which isn't usually what people are
       | after if they want an accurate 3D reconstruction.
       | 
       | It SEEMS like _online_ 3D reconstruction would be helpful, but I
       | have yet to see a use case for  "online"...
        
         | [deleted]
        
         | tintor wrote:
         | Use case: Mobile robotics, lidar replacement in self-driving
         | vehicles
        
         | tonyarkles wrote:
         | I'm very curious to see how well this would work for online
         | terrain reconstruction. I've got a drone with a pretty powerful
         | onboard computer and it's always nice to be able to solve and
         | tune problems with software instead of additional (e.g. LIDAR)
         | hardware.
        
       ___________________________________________________________________
       (page generated 2022-03-14 23:00 UTC)