[HN Gopher] An optimized 2D game engine can render 200k sprites ...
       ___________________________________________________________________
        
       An optimized 2D game engine can render 200k sprites at 200fps
       [video]
        
       Author : farzher
       Score  : 30 points
       Date   : 2022-05-01 17:54 UTC (2 days ago)
        
 (HTM) web link (www.youtube.com)
 (TXT) w3m dump (www.youtube.com)
        
       | _aavaa_ wrote:
        | By the looks of it, this is in Jonathan Blow's Jai language.
       | 
       | How are you finding working with it? Have you done a similar
       | thing in C++ to compare the results and the process of writing
       | it?
       | 
       | 200k at 200fps on an 8700k with a 1070 seems like a lot of
       | rabbits. Are there similar benchmarks to compare against in other
       | languages?
        
         | farzher wrote:
         | it's a lot of fun! jai is my intro to systems programming. so i
         | haven't tried this in C++ (actually i have tried a few times
          | over the past few years but never successfully).
         | 
          | this is just a test of opengl; C++ should give the exact
          | same performance, considering my cpu usage is only 7% while
          | gpu usage is 80%. but the process of writing it is
          | infinitely better than C++, since i never got C++ to
          | compile a hardware accelerated bunnymark.
         | 
         | the only bunnymarks i'm aware of are slow
         | https://www.reddit.com/r/Kha/comments/8hjupc/how_the_heck_is...
         | 
         | which is why i wrote this, to see how fast it could go.
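          | 
          | the core of a bunnymark is tiny. in C++ terms it's roughly
          | this (a sketch of the idea with made-up names, not my
          | actual jai code):
          | 
          |     #include <vector>
          | 
          |     struct Bunny { float x, y, vx, vy; };
          | 
          |     // placeholder constants and a stub for the sketch
          |     const float W = 1920, H = 1080, GRAVITY = 0.5f;
          |     void upload_and_draw(const std::vector<Bunny>&);
          | 
          |     void frame(std::vector<Bunny>& bunnies) {
          |         for (Bunny& b : bunnies) {      // O(n) cpu work
          |             b.vy += GRAVITY;
          |             b.x += b.vx;  b.y += b.vy;
          |             if (b.x < 0 || b.x > W) b.vx = -b.vx;
          |             if (b.y > H) { b.y = H; b.vy = -b.vy; }
          |         }
          |         // one buffer upload + one draw; gpu does the rest
          |         upload_and_draw(bunnies);
          |     }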
        
           | DantesKite wrote:
           | I thought Jai wasn't released yet. Are you a beta user or did
           | he release it already?
        
           | adamrezich wrote:
           | the official rendering modules are a bit all over the place
           | atm... did you use Simp, Render, GL, or handle the rendering
           | yourself?
        
       | xaedes wrote:
       | Nice demo! We need more of this approach.
       | 
        | You really can achieve amazing stuff with plain OpenGL (for
        | example) optimized for your rendering needs. With today's GPU
        | acceleration capabilities we could have town-building games
        | with huge map resolutions and millions of entities. Instead
        | it's mostly just used to make fancy graphics.
       | 
        | Actually, I am currently trying to build something like that
        | [1]. A big world with hundreds of millions of sprites is
        | achievable and runs smoothly; video RAM is the limit.
        | Admittedly it is not optimized to display those hundreds of
        | millions of sprites all at once, maybe just a few million.
        | That would be a bit too chaotic for a game anyway, I guess.
       | 
       | [1] https://www.youtube.com/watch?v=6ADWXIr_IUc
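        | 
        | The VRAM limit is easy to ballpark. Assuming a compact
        | 16-byte record per sprite (position plus a little
        | type/animation state; the exact layout doesn't matter much),
        | an 8 GB card tops out around half a billion sprites:
        | 
        |     #include <cstddef>
        | 
        |     // back-of-the-envelope; 16 bytes/sprite is an assumption
        |     constexpr std::size_t bytes_per_sprite = 16;
        |     constexpr std::size_t vram = 8ull << 30;  // 8 GiB card
        |     constexpr std::size_t max_sprites =
        |         vram / bytes_per_sprite;              // ~537 million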
        
         | p1necone wrote:
          | Is this not done because of technical limitations, or is it
          | just not done because a town-building game with millions of
          | entities would not be fun/manageable for the player?
         | 
          | That said, there are a few space 4X games that try this
          | "everything is simulated" kind of approach and succeed.
         | Allowing AI control of everything the player doesn't want to
         | manage themselves is one nice way of dealing with it. See:
         | https://store.steampowered.com/app/261470/Distant_Worlds_Uni...
        
         | bob1029 wrote:
         | > We need more of this approach.
         | 
         | 1000% agree.
         | 
          | I recently took it upon myself to see just how far I can
          | push modern hardware under some very tight constraints.
          | I've been playing around with a 100% custom 3D rasterizer
          | which operates purely on the CPU. For reasonable scenes
          | (<10k triangles) and resolutions (720~1080p), I have been
          | able to push over 30fps with a _single_ thread. On a 5950X,
          | I was able to support over 10 clients simultaneously
          | without any issues. The GPU in my workstation is just
          | moving the final content to the display device via whatever
          | means necessary. The machine generating the frames doesn't
          | even need a graphics device installed at all...
         | 
          | To be clear, this is an exceptionally primitive graphics
          | capability, but there are many styles of interactive
          | experience that do _not_ demand 4k textures, global
          | illumination, etc. I am also not fully exploiting the
          | capabilities of my CPU. There are many optimizations (e.g.
          | SIMD) that could be applied to get even more uplift.
         | 
         | One fun thing I discovered is just how low latency a pure CPU
         | rasterizer can be compared to a full CPU-GPU pipeline. I have
         | CPU-only user-interactive experiences that can go from input
         | event to final output frame in under 2 milliseconds. I don't
         | think even games like Overwatch can react to user input that
         | quickly.
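          | 
          | To give a flavor of the inner loop (a minimal half-space /
          | edge-function fill; a far cry from a full renderer, but the
          | same family of loop):
          | 
          |     #include <algorithm>
          |     #include <cstdint>
          | 
          |     struct V { float x, y; };
          | 
          |     // >0 when c is left of edge a->b (ccw winding assumed)
          |     float edge(V a, V b, V c) {
          |         return (b.x-a.x)*(c.y-a.y) - (b.y-a.y)*(c.x-a.x);
          |     }
          | 
          |     // fill one ccw triangle, scanning its bounding box
          |     void raster(uint32_t* fb, int w, int h,
          |                 V a, V b, V c, uint32_t col) {
          |         int x0 = std::max(0.f, std::min({a.x, b.x, c.x}));
          |         int y0 = std::max(0.f, std::min({a.y, b.y, c.y}));
          |         int x1 = std::min(w-1.f, std::max({a.x, b.x, c.x}));
          |         int y1 = std::min(h-1.f, std::max({a.y, b.y, c.y}));
          |         for (int y = y0; y <= y1; y++)
          |         for (int x = x0; x <= x1; x++) {
          |             V p{x + 0.5f, y + 0.5f};
          |             if (edge(a,b,p) >= 0 && edge(b,c,p) >= 0 &&
          |                 edge(c,a,p) >= 0)
          |                 fb[y*w + x] = col;
          |         }
          |     }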
        
           | kingcharles wrote:
           | Just to be clear - you're writing a "software-based" 3D
           | renderer, right? This is the sort of thing I excelled at back
           | in the late 80s, early 90s, before the first 3D accelerators
           | turned up around 1995 I think.
           | 
           | What features does your renderer support in terms of shading
           | and texturing? Are you writing this all in a high-level
           | language, e.g. C, or assembler? If assembler, what CPUs and
           | features are you targeting?
           | 
           | And of course, why?
        
         | syntheweave wrote:
          | The upper rendering limit generally isn't explored deeply
          | by games because as soon as you add simulation behaviors,
          | they impose new bottlenecks. And the design space of "large
          | scale" is often restricted by what is necessary to
          | implement it; many of Minecraft's bugs, for example, are
          | edge cases of streaming the world data in chunks.
         | 
         | Thus games that ship to a schedule are hugely incentivized to
         | favor making smaller play spaces with more authored detail,
         | since that controls all the outcomes and reduces the technical
         | dependencies of how scenes are authored.
         | 
         | There is a more philosophical reason to go in that direction
         | too: Simulation building is essentially the art of building
         | Plato's cave, and spending all your time on making the cave
         | very large and the puppets extremely elaborate is a rather
         | dubious idea.
        
       | SemanticStrengh wrote:
        | Yes, although the performance is probably largely due to
        | occlusion? Also, the sprites do not collide with their
        | environment.
        
       | chmod775 wrote:
        | Bit of a tangent and a useless thought experiment, but I
        | think you could render an infinite number of such bunnies, or
        | as many as you can fit in RAM/simulate. On the CPU, for each
        | frame, iterate over all bunnies. Do the simulation for that
        | bunny, and at the pixel corresponding to its position, store
        | its information in a texture, but only if it sits above the
        | bunny currently stored at that pixel (just its logical
        | position, don't put it in all the pixels of its texture!).
        | Then on the GPU, have a pixel shader look up (in surrounding
        | pixels) the topmost bunny for the current pixel and draw it
        | (or just draw all the overlaps using the z-buffer). For your
        | source texture, use 0 for no bunny, and other values to
        | indicate the bunny's z-position.
       | 
       | The CPU work would be O(n) and the rendering/GPU work O(m*k),
       | where n is the number of bunnies, m is the display resolution and
       | k is the size of our bunny sprite.
       | 
       | The advantage of this (in real applications utterly useless[1])
       | method is that CPU work only increases linearly with the number
       | of bunnies, you get to discard bunnies you don't care about
       | really early in the process, and GPU work is constant regardless
       | of how many bunnies you add.
       | 
       | It's conceptually similar to rendering voxels, except you're not
       | tracing rays deep, but instead sweeping wide.
       | 
       | As long as your GPU is fine with sampling that many surrounding
       | pixels, you're exploiting the capabilities of both your CPU and
        | GPU quite well. Also, the CPU work can be parallelized: each
        | thread operates on a subset of the bunnies and on its own
        | texture, and only in the final step are the textures combined
        | into one (which can also be done in parallel!). I wouldn't be
        | surprised if modern CPUs could handle millions of bunnies
        | while modern GPUs would just shrug, as long as the sprite is
        | small.
       | 
        | [1] In reality you don't have sprites of constant size, and
        | this method can't properly deal with transparency of any
        | kind.
       | The size of your sprites will be directly limited by how many
       | surrounding pixels your shader looks up during rendering, even if
       | you add support for multiple sprites/sprite sizes using other
       | channels on your textures.
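        | 
        | For concreteness, the CPU pass would look something like this
        | (single-threaded sketch, made-up names):
        | 
        |     #include <algorithm>
        |     #include <cstdint>
        |     #include <vector>
        | 
        |     struct Bunny { float x, y; uint8_t z; };  // z>0; 0=empty
        | 
        |     // O(n): keep only the topmost bunny per pixel in a
        |     // one-channel "position texture"; upload it and let the
        |     // fragment shader scan nearby texels for what to draw
        |     void build_position_texture(
        |             const std::vector<Bunny>& bunnies,
        |             std::vector<uint8_t>& tex, int w, int h) {
        |         std::fill(tex.begin(), tex.end(), 0);
        |         for (const Bunny& b : bunnies) {
        |             int px = (int)b.x, py = (int)b.y;
        |             if (px < 0 || px >= w || py < 0 || py >= h)
        |                 continue;                // early discard
        |             uint8_t& cell = tex[py*w + px];
        |             if (b.z > cell) cell = b.z;  // topmost wins
        |         }
        |     }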
        
       | sqrt_1 wrote:
       | I assume each sprite is moved on the CPU and the position data is
       | passed to the GPU for rendering.
       | 
        | Curious how you are passing the data to the GPU - do you use
        | a single dynamic vertex buffer that is uploaded each frame?
        | 
        | Is the vertex data a single position, with the GPU generating
        | the quad from it?
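        | 
        | For reference, the pattern I have in mind is roughly this (GL
        | 3.3 style, placeholder names; not necessarily what the video
        | does):
        | 
        |     // orphan last frame's storage, then stream this frame's
        |     // positions ('positions' holds one vec2 per sprite)
        |     glBindBuffer(GL_ARRAY_BUFFER, sprite_vbo);
        |     glBufferData(GL_ARRAY_BUFFER, n * 2 * sizeof(float),
        |                  nullptr, GL_STREAM_DRAW);     // orphan
        |     glBufferSubData(GL_ARRAY_BUFFER, 0,
        |                     n * 2 * sizeof(float), positions);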
        
       | farzher wrote:
       | i finally got around to writing an opengl "bunnymark" to check
       | how fast computers are.
       | 
       | i got 200k sprites at 200fps on a 1070 (while recording). i'm not
       | sure anyone could survive that many vampires
        
         | nick__m wrote:
         | that many rabbits, it's frightening!
         | 
          | Do you have the code somewhere? I would like to see how
          | it's made.
        
       | juancn wrote:
        | Neat. Isn't this akin to 400k triangles on a GPU? So as long
        | as you do instancing, it doesn't seem too difficult
        | (performance-wise) in itself. Even if there are many sprites,
        | texture mapping should take care of getting the pixels to the
        | screen.
        | 
        | My guess is that the rendering is not the hardest part,
        | although it's kinda cool.
        
         | moffkalast wrote:
          | 200k sprites is roughly a mesh with 400k triangles,
          | assuming each sprite is a quad and it's all
          | instanced/batched into one draw call, as it should be.
          | That's quite a bit, but most modern GPUs should be able to
          | handle it easily.
          | 
          | It's moving the individual quads around that can be kinda
          | tricky. Draw calls are still the most limiting thing, I
          | think, but a good ballpark was around 1k max per scene last
          | I checked, so merging the entire scene into one geometry
          | isn't exactly something that needs to be done in practical
          | terms. This is premature optimization at its best.
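          | 
          | For reference, instancing the quad is roughly this (GL 3.3,
          | placeholder names):
          | 
          |     // attribute 1 = sprite position, advanced per instance
          |     glBindBuffer(GL_ARRAY_BUFFER, instance_vbo);
          |     glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 0, 0);
          |     glEnableVertexAttribArray(1);
          |     glVertexAttribDivisor(1, 1);
          |     // one 4-vertex quad, drawn 200k times in one call
          |     glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, 200000);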
        
       ___________________________________________________________________
       (page generated 2022-05-03 23:00 UTC)