[HN Gopher] Accelerating Conway's Game of Life Using CUDA
       Accelerating Conway's Game of Life Using CUDA
       Author : brendanrayw
       Score  : 46 points
       Date   : 2021-06-03 16:24 UTC (1 days ago)
 (HTM) web link (brendanrayw.medium.com)
 (TXT) w3m dump (brendanrayw.medium.com)
       | iseanstevens wrote:
       | Need to re-read in detail, but seems like great learning
       | material.
       | I especially liked the line including "with less power comes
       | greater simplicity" :)
         | brendanrayw wrote:
         | Thanks for reading :D
       | brendanrayw wrote:
       | After having attended a few CUDA workshops at NVIDIA's latest GTC,
       | I was inspired to continue learning CUDA on my own. To do so I
       | decided to build John Conway's famous "Game of Life" and use CUDA
       | to accelerate the program. I explore multiple different CUDA
       | techniques including managed memory, pinned memory, multiple
       | streams, and asynchronous memory transfers.
         | IdiocyInAction wrote:
         | Nice. I took a CUDA course at uni where I built a neural
         | network and a physics simulation. Optimizing them was very
         | involved, but ultimately very cool; I learned a ton of stuff.
         | I'd love to work with CUDA in practice, but there's not that
         | many jobs around.
         | gtn42 wrote:
         | Nice, thanks for sharing your experience!
         | rrss wrote:
         | this was a fun read, thanks for sharing.
         | FYI, the transfers from pageable memory almost certainly do not
         | go to the storage device in your system, unless you have high
         | memory pressure. "pageable" (as a cuda-ism) does mean that the
         | buffer _may_ be paged out to storage, but as a result it means
         | that (more importantly) even if the buffer is in RAM, the GPU
         | cannot access it directly.
         | so for pageable copies the flow is probably not:
         | storage -> buffer in RAM -> device,
         | but rather:
         | original buffer in RAM (inaccessible to the device) ->
         | intermediate buffer in RAM (accessible to the device) -> device.
         | also, in several places you use the term 'stack' where I think
         | it should just be 'RAM' / main memory.
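
The pageable-versus-pinned distinction described above can be observed by timing the same host-to-device copy from both kinds of buffer. The following is a minimal sketch, not code from the article: the 64 MiB size and the `CUDA_CHECK` helper are assumptions chosen for illustration, and the timings will vary with hardware and driver.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s failed: %s\n", #call,                  \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Time one blocking host-to-device copy with CUDA events (milliseconds).
static float time_copy(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    float ms = 0.0f;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));
    CUDA_CHECK(cudaEventRecord(start));
    CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary illustrative size

    // Pageable buffer: ordinary malloc'd RAM that the OS is free to page out,
    // so the copy is staged through a driver-owned pinned buffer.
    unsigned char *pageable = static_cast<unsigned char *>(malloc(bytes));
    if (pageable == nullptr) return EXIT_FAILURE;
    memset(pageable, 1, bytes);

    // Pinned (page-locked) buffer: stays resident in RAM, so the GPU can
    // read it directly without the intermediate staging copy.
    unsigned char *pinned = nullptr;
    CUDA_CHECK(cudaMallocHost((void **)&pinned, bytes));
    memset(pinned, 1, bytes);

    unsigned char *device = nullptr;
    CUDA_CHECK(cudaMalloc((void **)&device, bytes));

    // Warm-up copy so one-time initialization does not skew the first timing.
    CUDA_CHECK(cudaMemcpy(device, pinned, bytes, cudaMemcpyHostToDevice));

    printf("pageable -> device: %.3f ms\n", time_copy(device, pageable, bytes));
    printf("pinned   -> device: %.3f ms\n", time_copy(device, pinned, bytes));

    CUDA_CHECK(cudaFree(device));
    CUDA_CHECK(cudaFreeHost(pinned));
    free(pageable);
    return 0;
}
```

Both buffers live in main memory here; neither copy involves the storage device unless the system is under enough memory pressure to page the `malloc`'d buffer out.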
           | brendanrayw wrote:
           | Thanks for reading! I appreciate the feedback and the info,
           | I'll keep that in mind.
         | joe_the_user wrote:
         | Thanks for your effort! I really like the idea, it's similar to
         | a more ambitious project I'm thinking of. And I do have
         | questions:
         | Is your board a giant two-dimensional array in memory?
         | Are your threads/kernels reading from this array and then
         | writing back to it?
         | Do you do synchronization to make sure reads happen before the
         | later writes?
         | Do you do any verification that your transitions happen
         | correctly?
         | Do you have an estimate for time spent in each of: transfer from
         | global GPU memory to each kernel, calculations in the kernel,
         | and time spent idling through synchronization (assuming you do
         | it)?
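
One common way to address the read-before-write and verification questions above is double buffering: each generation reads from one array and writes to a second, then the two are swapped, and the result is compared against a plain CPU version. The sketch below is an assumption about how this could be done, not the article's actual implementation; `life_step`, `life_step_cpu`, the 64x64 wrap-around board, and the glider seed are all hypothetical choices.

```cuda
#include <cstdio>
#include <cstdlib>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s failed: %s\n", #call,                  \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// One generation: each thread reads its cell's 8 neighbours from `in` and
// writes the next state to `out`. `in` is never written, so no thread can
// observe a half-updated board. Edges wrap around (toroidal board).
__global__ void life_step(const unsigned char *in, unsigned char *out,
                          int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int alive = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0)
                alive += in[((y + dy + height) % height) * width +
                            (x + dx + width) % width];
    unsigned char self = in[y * width + x];
    out[y * width + x] = (alive == 3 || (self && alive == 2)) ? 1 : 0;
}

// Independent CPU reference used only to verify the GPU result.
static void life_step_cpu(const std::vector<unsigned char> &in,
                          std::vector<unsigned char> &out, int w, int h) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int alive = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (dx != 0 || dy != 0)
                        alive += in[((y + dy + h) % h) * w + (x + dx + w) % w];
            out[y * w + x] = (alive == 3 || (in[y * w + x] && alive == 2)) ? 1 : 0;
        }
}

int main() {
    const int width = 64, height = 64, generations = 10;
    const size_t bytes = static_cast<size_t>(width) * height;

    // Seed a glider in the top-left corner.
    std::vector<unsigned char> host(bytes, 0), tmp(bytes, 0);
    host[0 * width + 1] = 1;
    host[1 * width + 2] = 1;
    host[2 * width + 0] = host[2 * width + 1] = host[2 * width + 2] = 1;
    std::vector<unsigned char> ref = host;

    unsigned char *d_in = nullptr, *d_out = nullptr;
    CUDA_CHECK(cudaMalloc((void **)&d_in, bytes));
    CUDA_CHECK(cudaMalloc((void **)&d_out, bytes));
    CUDA_CHECK(cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    for (int g = 0; g < generations; ++g) {
        // Launches on the same stream run in order, so generation g+1 only
        // starts after generation g has finished writing.
        life_step<<<grid, block>>>(d_in, d_out, width, height);
        CUDA_CHECK(cudaGetLastError());
        std::swap(d_in, d_out);  // the newest board is now in d_in
        life_step_cpu(ref, tmp, width, height);
        ref.swap(tmp);
    }

    // Blocking copy back; it also waits for the last kernel to finish.
    CUDA_CHECK(cudaMemcpy(host.data(), d_in, bytes, cudaMemcpyDeviceToHost));
    printf("%s\n", host == ref ? "GPU and CPU agree" : "MISMATCH");

    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    return host == ref ? 0 : 1;
}
```

Updating a single array in place would need extra synchronization, because a thread could read a neighbour that another thread has already advanced to the next generation. Two buffers avoid that at the cost of doubling the board's memory footprint.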
       | jacquesm wrote:
       | Interesting. I'm assuming you are familiar with Hashlife? If not,
       | check it out, it is absolutely amazing how fast it is, and as a
       | study in memoization maybe it will inspire you on how you can get
       | some more mileage out of your CUDA version.
       | https://en.wikipedia.org/wiki/Hashlife
         | buescher wrote:
         | Agreed! The CUDA implementation is nice and I was going to say
         | "now do Hashlife" myself. Here's the original paper
         | https://www.lri.fr/~filliatr/m1/gol/gosper-84.pdf
       (page generated 2021-06-04 23:01 UTC)