[HN Gopher] An optimized 2D game engine can render 200k sprites ...
___________________________________________________________________
 
An optimized 2D game engine can render 200k sprites at 200fps
[video]
 
Author : farzher
Score  : 30 points
Date   : 2022-05-01 17:54 UTC (2 days ago)
 
(HTM) web link (www.youtube.com)
(TXT) w3m dump (www.youtube.com)
 
| _aavaa_ wrote:
| By the looks of it, this is in Jonathan Blow's Jai language.
|
| How are you finding working with it? Have you done a similar
| thing in C++ to compare the results and the process of writing
| it?
|
| 200k at 200fps on an 8700k with a 1070 seems like a lot of
| rabbits. Are there similar benchmarks to compare against in
| other languages?
| farzher wrote:
| it's a lot of fun! jai is my intro to systems programming, so i
| haven't tried this in C++ (actually i have tried a few times
| over the past few years, but never successfully).
|
| this is just a test of opengl; C++ should have the exact same
| performance, considering my cpu usage is only 7% while gpu
| usage is 80%. but the process of writing it is infinitely
| better than C++, since i never got C++ to compile a hardware
| accelerated bunnymark.
|
| the only bunnymarks i'm aware of are slow
| https://www.reddit.com/r/Kha/comments/8hjupc/how_the_heck_is...
|
| which is why i wrote this, to see how fast it could go.
| DantesKite wrote:
| I thought Jai wasn't released yet. Are you a beta user or did
| he release it already?
| adamrezich wrote:
| the official rendering modules are a bit all over the place
| atm... did you use Simp, Render, GL, or handle the rendering
| yourself?
| xaedes wrote:
| Nice demo! We need more of this approach.
|
| You really can achieve amazing stuff with just plain OpenGL
| optimized for your rendering needs. With today's GPU
| acceleration capabilities we could have town-building games
| with huge map resolutions and millions of entities. Instead
| it's mostly only used to make fancy graphics.
|
| Actually, I am currently trying to build something like that
| [1]. A big, big world with hundreds of millions of sprites is
| achievable and runs smoothly; video RAM is the limit.
| Admittedly it is not optimized to display those hundreds of
| millions of sprites all at once, maybe just a few million.
| That would be a bit too chaotic for a game anyway, I guess.
|
| [1] https://www.youtube.com/watch?v=6ADWXIr_IUc
| p1necone wrote:
| Is this not done because of technical limitations, or is it
| just not done because a town-building game with millions of
| entities would not be fun/manageable for the player?
|
| Although, there are a few space 4X games that try this
| "everything is simulated" kind of approach and succeed.
| Allowing AI control of everything the player doesn't want to
| manage themselves is one nice way of dealing with it. See:
| https://store.steampowered.com/app/261470/Distant_Worlds_Uni...
| bob1029 wrote:
| > We need more of this approach.
|
| 1000% agree.
|
| I recently took it upon myself to see just how far I can push
| modern hardware with some very tight constraints. I've been
| playing around with a 100% custom 3D rasterizer which operates
| purely on the CPU. For reasonable scenes (<10k triangles) and
| resolutions (720~1080p), I have been able to push over 30fps
| with a _single_ thread. On a 5950X, I was able to support over
| 10 clients simultaneously without any issues. The GPU in my
| workstation is just moving the final content to the display
| device via whatever means necessary.
| The machine generating the frames doesn't even need a graphics
| device installed at all...
|
| To be clear, this is exceptionally primitive graphics
| capability, but there are many styles of interactive experience
| that do _not_ demand 4k textures, global illumination, etc. I
| am also not fully extracting the capabilities of my CPU. There
| are many optimizations (e.g. SIMD) that could be applied to get
| even more uplift.
|
| One fun thing I discovered is just how low latency a pure CPU
| rasterizer can be compared to a full CPU-GPU pipeline. I have
| CPU-only user-interactive experiences that can go from input
| event to final output frame in under 2 milliseconds. I don't
| think even games like Overwatch can react to user input that
| quickly.
| kingcharles wrote:
| Just to be clear - you're writing a "software-based" 3D
| renderer, right? This is the sort of thing I excelled at back
| in the late 80s and early 90s, before the first 3D accelerators
| turned up, around 1995 I think.
|
| What features does your renderer support in terms of shading
| and texturing? Are you writing this all in a high-level
| language, e.g. C, or in assembler? If assembler, what CPUs and
| features are you targeting?
|
| And of course, why?
| syntheweave wrote:
| The upper rendering limit generally isn't explored deeply by
| games because as soon as you add simulation behaviors, it
| imposes new bottlenecks. And the design space of "large scale"
| is often restricted by what is necessary to implement it; many
| of Minecraft's bugs, for example, are edge cases of streaming
| in the world data in chunks.
|
| Thus games that ship to a schedule are hugely incentivized to
| favor making smaller play spaces with more authored detail,
| since that controls all the outcomes and reduces the technical
| dependencies of how scenes are authored.
|
| There is a more philosophical reason to go in that direction
| too: simulation building is essentially the art of building
| Plato's cave, and spending all your time on making the cave
| very large and the puppets extremely elaborate is a rather
| dubious idea.
| SemanticStrengh wrote:
| Yes, although the performance is probably largely due to
| occlusion? Also, the sprites do not collide with their
| environment.
| chmod775 wrote:
| Bit of a tangent and a useless thought experiment, but I think
| you could render an infinite number of such bunnies, or as many
| as you can fit in RAM/simulate. On the CPU, for each frame,
| iterate over all bunnies. Do your simulation for that bunny
| and, at the pixel corresponding to its position, store its
| information in a texture at that pixel if it is positioned over
| the bunny currently stored there (just its logical position,
| don't put it in all the pixels of its texture!). Then on the
| GPU, have a pixel shader look up (in surrounding pixels) the
| topmost bunny for the current pixel and draw it (or just draw
| all the overlaps using the z-buffer). For your source texture,
| use 0 for no bunny, and other values to indicate the bunny's
| z-position.
|
| The CPU work would be O(n) and the rendering/GPU work O(m*k),
| where n is the number of bunnies, m is the display resolution,
| and k is the size of our bunny sprite.
|
| The advantage of this (in real applications utterly useless[1])
| method is that CPU work only increases linearly with the number
| of bunnies, you get to discard bunnies you don't care about
| really early in the process, and GPU work is constant
| regardless of how many bunnies you add.
|
| It's conceptually similar to rendering voxels, except you're
| not tracing rays deep, but instead sweeping wide.
|
| As long as your GPU is fine with sampling that many surrounding
| pixels, you're exploiting the capabilities of both your CPU and
| GPU quite well. Also, the CPU work can be parallelized: each
| thread operates on a subset of the bunnies and on its own
| texture, and only in the final step are the textures combined
| into one (which can also be done in parallel!). I wouldn't be
| surprised if modern CPUs could handle millions of bunnies while
| modern GPUs would just shrug, as long as the sprite is small.
|
| [1] In reality you don't have sprites at constant sizes, and
| this method can't properly deal with transparency of any kind.
| The size of your sprites will be directly limited by how many
| surrounding pixels your shader looks up during rendering, even
| if you add support for multiple sprites/sprite sizes using
| other channels on your textures.
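
A minimal sketch of the scheme chmod775 describes, in C with the pixel
shader embedded as GLSL. It is illustrative only, not code from
farzher's engine, and the names, the 16-pixel sprite size, and the
assumption that a sprite is anchored at its bottom-left corner are all
made up here: the CPU pass writes each bunny's z into a screen-sized
index texture at its logical position, keeping only the topmost bunny
per pixel, and the shader scans a K x K neighborhood of that texture to
find the topmost bunny covering the current pixel.

    /* CPU pass: O(n) over the bunnies, one float per screen pixel.
       0 = no bunny, otherwise the z of the topmost bunny anchored there. */
    #include <string.h>

    #define W 1920
    #define H 1080

    typedef struct { float x, y, z; } Bunny;   /* logical position + depth */

    static float index_tex[W * H];

    void build_index_texture(const Bunny *bunnies, int n) {
        memset(index_tex, 0, sizeof index_tex);
        for (int i = 0; i < n; i++) {
            int px = (int)bunnies[i].x, py = (int)bunnies[i].y;
            if (px < 0 || px >= W || py < 0 || py >= H) continue;
            float *slot = &index_tex[py * W + px];
            if (bunnies[i].z > *slot)        /* keep only the topmost bunny */
                *slot = bunnies[i].z;
        }
        /* upload index_tex as a single-channel float texture (e.g. with
           glTexSubImage2D), then draw one fullscreen quad with the shader
           below */
    }

    /* GPU pass: for each screen pixel, scan the K x K neighborhood of the
       index texture and draw the topmost bunny whose sprite covers it. */
    static const char *frag_src =
        "#version 330 core\n"
        "uniform sampler2D u_index;  // z of topmost bunny per pixel, 0 = none\n"
        "uniform sampler2D u_sprite; // the bunny sprite\n"
        "uniform vec2 u_screen;      // screen size in pixels\n"
        "const int K = 16;           // sprite size; bounds how far we scan\n"
        "out vec4 color;\n"
        "void main() {\n"
        "    float best_z = 0.0;\n"
        "    vec2 best_uv = vec2(0.0);\n"
        "    for (int dy = -K; dy <= 0; dy++)\n"
        "    for (int dx = -K; dx <= 0; dx++) {\n"
        "        vec2 p = gl_FragCoord.xy + vec2(dx, dy);\n"
        "        float z = texture(u_index, p / u_screen).r;\n"
        "        if (z > best_z) {\n"
        "            best_z = z;\n"
        "            best_uv = vec2(-dx, -dy) / float(K); // offset inside sprite\n"
        "        }\n"
        "    }\n"
        "    if (best_z == 0.0) discard;\n"
        "    color = texture(u_sprite, best_uv);\n"
        "}\n";

As chmod775 notes, K bounds the per-pixel shader work, which is why the
GPU cost stays constant no matter how many bunnies are simulated.
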
| sqrt_1 wrote:
| I assume each sprite is moved on the CPU and the position data
| is passed to the GPU for rendering.
|
| Curious how you are passing the data to the GPU - are you using
| a single dynamic vertex buffer that is uploaded each frame?
|
| Is the vertex data a single position, with the GPU generating
| the quad from it?
| farzher wrote:
| i finally got around to writing an opengl "bunnymark" to check
| how fast computers are.
|
| i got 200k sprites at 200fps on a 1070 (while recording). i'm
| not sure anyone could survive that many vampires
| nick__m wrote:
| that many rabbits, it's frightening!
|
| Do you have the code somewhere? I would like to see how it's
| made.
| juancn wrote:
| Neat. Isn't this akin to 400k triangles on a GPU? So as long as
| you do instancing it doesn't seem too difficult (performance
| wise) in itself. Even if there are many sprites, texture
| mapping should take care of getting the pixels to the screen.
|
| My guess is that the rendering is not the hardest part,
| although it's kinda cool.
| moffkalast wrote:
| 200k sprites is roughly a mesh with 400k triangles, assuming
| each sprite is a quad and it's all instanced/batched into one
| draw call as it should be. It's quite a bit, but most modern
| GPUs should be able to handle that easily.
|
| It's moving the individual quads around that can be kinda
| tricky. Draw calls are still the most limiting thing, I think,
| but a good ballpark for those was around 1k max for a scene
| last I checked, so merging the entire scene into one geometry
| isn't exactly something that needs to be done in practical
| terms. This is premature optimization at its best.
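
The approach sqrt_1 asks about and moffkalast assumes (a single dynamic
buffer of positions re-uploaded every frame, one unit quad expanded per
instance, and a single instanced draw call) might look roughly like the
sketch below in OpenGL. This is a guess at the general technique, not
farzher's actual Jai code: the names and buffer layout are invented,
and the vertex shader (not shown) would compute each corner as
a_position + a_corner * sprite_size.

    /* Sketch: instanced sprite rendering with a per-frame streamed
       instance buffer. Assumes a GL 3.3+ context and a loader such as
       glad. */
    #include <glad/glad.h>

    #define MAX_SPRITES 200000

    static GLuint vao, quad_vbo, inst_vbo;

    void sprites_init(void) {
        /* one unit quad as a triangle strip; attribute 0 = corner */
        static const float quad[8] = { 0,0,  1,0,  0,1,  1,1 };

        glGenVertexArrays(1, &vao);
        glBindVertexArray(vao);

        glGenBuffers(1, &quad_vbo);
        glBindBuffer(GL_ARRAY_BUFFER, quad_vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof quad, quad, GL_STATIC_DRAW);
        glEnableVertexAttribArray(0);
        glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, (void *)0);

        /* dynamic per-instance buffer; attribute 1 = sprite position */
        glGenBuffers(1, &inst_vbo);
        glBindBuffer(GL_ARRAY_BUFFER, inst_vbo);
        glBufferData(GL_ARRAY_BUFFER, MAX_SPRITES * 2 * sizeof(float), NULL,
                     GL_STREAM_DRAW);
        glEnableVertexAttribArray(1);
        glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 0, (void *)0);
        glVertexAttribDivisor(1, 1); /* advance once per instance, not per vertex */
    }

    /* positions: n (x, y) pairs, moved on the CPU this frame */
    void sprites_draw(const float *positions, int n) {
        glBindVertexArray(vao);
        glBindBuffer(GL_ARRAY_BUFFER, inst_vbo);
        /* orphan the old storage, then upload this frame's positions */
        glBufferData(GL_ARRAY_BUFFER, MAX_SPRITES * 2 * sizeof(float), NULL,
                     GL_STREAM_DRAW);
        glBufferSubData(GL_ARRAY_BUFFER, 0, n * 2 * sizeof(float), positions);

        /* 200k sprites -> 400k triangles, still a single draw call */
        glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, n);
    }

___________________________________________________________________
(page generated 2022-05-03 23:00 UTC)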