[HN Gopher] NeRF: Representing scenes as neural radiance fields ...
       ___________________________________________________________________
        
       NeRF: Representing scenes as neural radiance fields for view
       synthesis
        
       Author : dfield
       Score  : 148 points
       Date   : 2020-03-20 14:25 UTC (8 hours ago)
        
 (HTM) web link (www.matthewtancik.com)
 (TXT) w3m dump (www.matthewtancik.com)
        
       | kuprel wrote:
       | This would be great for instant replays
        
         | jayd16 wrote:
         | Intel already does this with their "True View" setup. They also
          | had a tech demo at CES where they synthesized camera positions for
         | movie sets. https://www.youtube.com/watch?v=9qd276AJg-o
        
       | blackhaz wrote:
       | Could someone ELI5, please?
        
         | type_enthusiast wrote:
          | They're modeling a scene mathematically as a "radiance field" -
          | a function that takes a viewing position and direction as inputs
          | and returns the color of the light arriving at that position from
          | that direction. They use the input images to train a neural
          | network to find a radiance field function that explains those
          | images. Once they have that function, they can construct images
          | from new angles by evaluating it at the (position, direction)
          | inputs needed by the pixels in the new image.
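
        To make that concrete, here is a minimal sketch, in PyTorch rather
        than the authors' code, of a radiance-field function approximated by
        a small neural network. The layer sizes and the 4-channel output
        (RGB plus a density value) are illustrative assumptions, not the
        architecture from the paper.

            import torch
            import torch.nn as nn

            # Illustrative sketch only: maps a 3D point and a viewing direction
            # to a color and a density (how opaque space is at that point).
            class TinyRadianceField(nn.Module):
                def __init__(self, hidden=256):
                    super().__init__()
                    self.net = nn.Sequential(
                        nn.Linear(3 + 3, hidden), nn.ReLU(),  # (x, y, z) + direction
                        nn.Linear(hidden, hidden), nn.ReLU(),
                        nn.Linear(hidden, 4),                 # RGB + density
                    )

                def forward(self, position, direction):
                    out = self.net(torch.cat([position, direction], dim=-1))
                    rgb = torch.sigmoid(out[..., :3])   # colors in [0, 1]
                    density = torch.relu(out[..., 3])   # non-negative opacity
                    return rgb, density

        Training then amounts to adjusting the network's weights until images
        rendered from this function match the input photos.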
        
         | ur-whale wrote:
         | >Could someone ELI5, please?
         | 
         | Smart, high-dimensional interpolator.
        
         | quadrature wrote:
          | It's a very similar concept to photogrammetry, which is
          | recovering a 3D representation of an object from pictures taken
          | from different angles.
         | 
         | In this work they take pictures of a scene from different
         | angles and are able to train a neural network to render the
         | scene from new angles that aren't in any source pictures.
         | 
          | The neural network takes in a location (x, y, z) and a viewing
          | direction and spits out the RGB color you would see if you viewed
          | the scene from that location and angle.
         | 
         | Using this network and traditional rendering techniques they
         | are able to render the whole scene.
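
        The "traditional rendering techniques" here are essentially volume
        rendering: march each camera ray through the scene, query the network
        at sample points, and blend the colors by how opaque each sample is.
        Below is a rough NumPy sketch of that blending for a single ray;
        field_fn is a stand-in for the trained network, and the near/far
        bounds and sample count are arbitrary assumptions.

            import numpy as np

            # Sketch of volume rendering along one camera ray. field_fn stands in
            # for the trained network; bounds and sample count are arbitrary.
            def render_ray(field_fn, origin, direction, near=2.0, far=6.0, n=64):
                ts = np.linspace(near, far, n)              # depths along the ray
                points = origin + ts[:, None] * direction   # 3D sample positions
                dirs = np.broadcast_to(direction, points.shape)
                rgb, density = field_fn(points, dirs)       # per-sample color/opacity

                deltas = np.diff(ts, append=ts[-1] + (far - near) / n)
                alpha = 1.0 - np.exp(-density * deltas)     # opacity of each segment
                # transmittance: fraction of light surviving up to each sample
                trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
                weights = alpha * trans
                return (weights[:, None] * rgb).sum(axis=0) # final pixel color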
        
           | wokwokwok wrote:
           | Significantly, the input is a sparse dataset.
           | 
            | i.e. few source images vs. traditional photogrammetry.
           | 
            | ...but basically yes. tl;dr: photogrammetry using neural
            | networks; this one is better than other recent attempts at the
            | same thing, but takes a really long time (2 days for this vs. 10
            | minutes for a voxel-based approach in one of their comparisons).
           | 
           | Why bother?
           | 
            | Mmm... there's some speculation that you might be able to
            | represent a photorealistic scene / 3D object as a neural model
            | instead of voxels or meshes.
            | 
            | That might be useful for some things: e.g. a voxel
            | representation of semi-transparent fog or of high-detail objects
            | like hair is impractically huge, and as a mesh they're very
            | difficult to represent.
        
             | visarga wrote:
             | > Why bother?
             | 
             | There might be 10x speedups to be gained with a tweaked
             | model.
        
             | rebuilder wrote:
             | A number of things this seems to do well would be pretty
              | much impossible with standard photogrammetry: trees with
             | leaves, fine details like rigging on a ship, reflective
             | surfaces, even refraction (!)
             | 
             | Of course the output is a new view, not a shaded mesh, but
             | given it appears to generate depth data, I think you should
             | be able to generate a point cloud and mesh it. Getting the
              | materials from the output might even be possible; I'm not
             | very up to date on the state of material capture nowadays.
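
        The point-cloud idea is plausible: the same per-sample weights used
        to blend colors along a ray also give an expected termination depth,
        and origin + depth * direction is then a candidate surface point. A
        speculative sketch (names follow the ray-marching sketch earlier in
        the thread; this is not the paper's pipeline):

            import numpy as np

            # Speculative sketch, not the paper's pipeline: turn per-ray
            # compositing weights into an expected depth and a 3D point.
            def expected_depth(weights, ts):
                # weights: per-sample blend weights; ts: sample depths along the ray
                return (weights * ts).sum() / max(weights.sum(), 1e-8)

            def surface_point(origin, direction, weights, ts):
                d = expected_depth(weights, ts)
                return np.asarray(origin) + d * np.asarray(direction)

            # Doing this for every pixel's ray yields a point cloud, which could
            # then be meshed with standard tools (e.g. Poisson reconstruction).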
        
         | imposter wrote:
          | Wow, great.
        
         | mooneater wrote:
         | If you give it a bunch of photos of a scene from different
         | angles, this machine learning method lets you see angles that
         | did not exist in the original set.
         | 
         | Better results than other methods so far.
        
           | notfed wrote:
           | Fist bump for actually answering as ELI5 (unlike the other
           | responses).
        
       | teknopurge wrote:
       | This is bad-ass, partly because it's so elegant.
        
       | jayd16 wrote:
       | Very cool. Reminds me of when I played with Google's Seurat.
       | 
        | The paper says it's 5 MB, 12 hours to train the NN, and then 30
        | seconds to render novel views of the scene on an Nvidia V100.
       | 
       | Sadly not something you can use in real time but still very cool.
       | 
        | Edit: 12 hours and a 5 MB NN, not 5 minutes.
        
         | ssivark wrote:
         | Huh, what? It needs almost a million views, and takes 1-2 days
         | to train on a GPU. I'm not sure where the "5 minutes" number
         | comes from.
         | 
         | EDIT: I was referring to the last paragraph of section 5.3
         | (Implementation details), but maybe I'm misunderstanding how
         | they use rays / sampled coordinates.
         | 
         | Very impressive visual quality. But it seems like they need a
          | LOT of data and computation for each scene. So, it's still
         | plausible that intelligently done photogrammetry will beat this
         | approach in efficiency, but a bunch of important details need
         | to be figured out to make that happen.
        
           | jayd16 wrote:
            | Excuse me, I meant 5 MB. It takes 12 hours to train.
           | 
           | >All compared single scene methods take at least 12 hours to
           | train per scene
           | 
           | But it seems to only need sparse images.
           | 
           | >Here, we visualize the set of 100 input views of the
           | synthetic Drums scene randomly captured on a surrounding
           | hemisphere, and we show two novel views rendered from our
           | optimized NeRF representation
        
           | scribu wrote:
           | > It needs almost a million views
           | 
           | Not sure what you mean by "views". The comparisons in the
           | paper use at most 100 input images per scene.
        
       | byt143 wrote:
        | If you're only looking for one novel view, can it use fewer views
        | that are close to the novel one?
        
       | uoaei wrote:
       | This is absolutely stunning.
       | 
        | As they say in ML, representation comes first, and this is one of
        | the most natural and elegant ways to represent 3D scenes and
        | subjective viewpoints. Great that it plugs into a rendering pipeline
        | such that it's end-to-end differentiable.
       | 
       | This is the first leap toward true high-quality real-time ML-
       | based rendering. I'm blown away.
        
       | lifeisstillgood wrote:
        | Well, that took some effort just to work out what they actually
        | did. How they actually did it, I have no idea. Impressive, however:
        | a sort of fill-in-the-blanks for the bits that are missing. If our
        | brains _don't_ do this, one would be surprised.
       | 
       | And we are all supposed to become AI developers this decade?!
       | 
        | Come back, Visual Basic, all is forgiven :-)
        
       | [deleted]
        
       | raidicy wrote:
        | This blows my mind. This is probably a naive thought, but this
        | technique looks like it could be combined with robotics to help a
        | robot navigate through its environment.
        | 
        | I'd also like to see what it does when you give it multiple views
        | of scenes in a video game: some from direct captures and some from
        | pictures of the monitor.
        
       | ssivark wrote:
       | Does anyone know how they do the "virtual object insertion"
       | demonstrated in the paper summary video? Can that be somehow done
       | on the network itself, or is that a diagnostic for scene accuracy
       | by performing SFM on network output?
        
         | theresistor wrote:
         | I'm pretty sure they're rendering a depth channel and
         | compositing it in.
        
           | teraflop wrote:
           | You could do that, but I think it's simpler to just introduce
           | additional objects during the raytracing process that
           | generates the images. That would produce accurate results
            | even with semitransparent objects, unlike compositing with a
            | depth buffer.
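
        For the depth-compositing approach theresistor describes, the
        per-pixel test might look like the sketch below. The array shapes and
        names are assumptions, and, as teraflop notes, this ignores
        semi-transparent content.

            import numpy as np

            # Assumed inputs: HxWx3 color images and HxW depth maps for the NeRF
            # render and the separately rendered virtual object, plus the object's
            # coverage mask. Keep whichever surface is closer at each pixel.
            def composite(scene_rgb, scene_depth, obj_rgb, obj_depth, obj_mask):
                in_front = obj_mask & (obj_depth < scene_depth)  # object wins here
                out = scene_rgb.copy()
                out[in_front] = obj_rgb[in_front]
                return out

        Inserting the object during the ray-marching step instead, as
        suggested above, would handle partial transparency correctly.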
        
       ___________________________________________________________________
       (page generated 2020-03-20 23:00 UTC)