[HN Gopher] The Stack Monoid
       ___________________________________________________________________
        
       The Stack Monoid
        
       Author : raphlinus
       Score  : 121 points
       Date   : 2020-09-05 17:06 UTC (1 day ago)
        
 (HTM) web link (raphlinus.github.io)
 (TXT) w3m dump (raphlinus.github.io)
        
       | raphlinus wrote:
       | A note: I just updated the post with a bunch of related work that
       | people have sent me since I published the first draft. It also
       | includes a bit more intuition behind the monoid.
        
       | agumonkey wrote:
       | best thread of the year
        
       | andrewflnr wrote:
        | The elements of this monoid look a lot like type signatures
        | for Forth words. I was already playing with the idea of a
        | concurrent Forth-like language where each word gets its own
        | stack context. I'm really interested in neat models that
        | provide partial evaluation, so I will be giving this some
        | thought.
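        | 
        | For what it's worth, here's a rough Rust sketch of that
        | reading (my own names, not code from the post): an element is
        | a Forth-style stack effect, "pop this many, then push these",
        | and composing two effects is the monoid operation.
        | 
        |   // One element of the stack monoid: the net effect of a
        |   // chunk of input on the stack.
        |   #[derive(Clone, Debug, PartialEq)]
        |   struct StackEffect<T> {
        |       pops: usize,
        |       pushes: Vec<T>,
        |   }
        | 
        |   impl<T: Clone> StackEffect<T> {
        |       fn identity() -> Self {
        |           StackEffect { pops: 0, pushes: Vec::new() }
        |       }
        | 
        |       // Associative combine: the right effect's pops cancel
        |       // against the left effect's pushes; any excess pops
        |       // fall through to whatever came before both chunks.
        |       fn combine(&self, other: &Self) -> Self {
        |           let cancelled = other.pops.min(self.pushes.len());
        |           let keep = self.pushes.len() - cancelled;
        |           let mut pushes = self.pushes[..keep].to_vec();
        |           pushes.extend_from_slice(&other.pushes);
        |           StackEffect {
        |               pops: self.pops + other.pops - cancelled,
        |               pushes,
        |           }
        |       }
        |   }
        | 
        | The empty effect is the identity, which is what lets chunks
        | of input be summarized independently and combined in any
        | grouping.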
        
       | nilknarf wrote:
       | > Automatically generating monoids such as the stack monoid from
       | their corresponding simple sequential programs seems like an
       | extremely promising area for research.
       | 
       | I was interested in this problem a while back and the papers I
       | remembered being useful were:
       | 
       | "Automatic Inversion Generates Divide-and-Conquer Parallel
       | Programs": https://core.ac.uk/download/pdf/192663016.pdf
       | 
       | which is based on "The Third Homomorphism Theorem":
       | http://www.cs.ox.ac.uk/people/jeremy.gibbons/publications/th...
       | 
        | The motivating example is whether you can generate a mergesort
        | (where the monoid operation merges two sorted lists) given an
        | implementation of an insertion sort (where the sequential
        | program inserts elements one by one into a sorted list).
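        | 
        | To make that concrete, here's a toy Rust sketch (mine, not
        | from the papers) of the two sides: the sequential step, and
        | the merge that plays the role of the monoid operation.
        | 
        |   // Sequential program: insert one element into an already
        |   // sorted list.
        |   fn insert(mut sorted: Vec<i32>, x: i32) -> Vec<i32> {
        |       let pos = sorted.partition_point(|&y| y <= x);
        |       sorted.insert(pos, x);
        |       sorted
        |   }
        | 
        |   // The operation you'd hope to derive: merge two sorted
        |   // lists. It's associative with the empty list as identity,
        |   // so the fold can be split into chunks and the chunk
        |   // results combined in parallel.
        |   fn merge(a: &[i32], b: &[i32]) -> Vec<i32> {
        |       let mut out = Vec::with_capacity(a.len() + b.len());
        |       let (mut i, mut j) = (0, 0);
        |       while i < a.len() && j < b.len() {
        |           if a[i] <= b[j] {
        |               out.push(a[i]);
        |               i += 1;
        |           } else {
        |               out.push(b[j]);
        |               j += 1;
        |           }
        |       }
        |       out.extend_from_slice(&a[i..]);
        |       out.extend_from_slice(&b[j..]);
        |       out
        |   }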
       | 
       | I remember being disappointed by the answer after reading these
       | papers but in the citation graph there are a bunch of relatively
       | recent program synthesis papers. Maybe there's something better
       | now.
        
       | boulos wrote:
       | Raph, I think some of the order-independent transparency folks
       | (Bavoil, Wyman, Lefohn, Salvi, etc.) have had to do this bounded
       | k-stack plus overflow thing. Might be a good "hmm, how did they
       | do it" for comparison.
        
         | raphlinus wrote:
         | Good call, that rings a bell and I'll look into it. A bounded
         | k-stack with overflow seems like a really good fit for the
          | capabilities provided by GPUs.
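          | 
          | To sketch the shape of it (just a CPU-side Rust
          | illustration, not GPU code): keep the top of the stack in
          | a small fixed window and spill anything deeper to a slower
          | overflow area.
          | 
          |   // Bounded k-stack with overflow: `top` stands in for
          |   // fast on-chip storage, `overflow` for a spill area in
          |   // global memory.
          |   const K: usize = 8;
          | 
          |   struct BoundedStack<T> {
          |       top: Vec<T>,      // at most K entries
          |       overflow: Vec<T>, // everything below the window
          |   }
          | 
          |   impl<T> BoundedStack<T> {
          |       fn new() -> Self {
          |           BoundedStack {
          |               top: Vec::with_capacity(K),
          |               overflow: Vec::new(),
          |           }
          |       }
          | 
          |       fn push(&mut self, x: T) {
          |           if self.top.len() == K {
          |               // Spill the oldest fast entry to make room.
          |               let bottom = self.top.remove(0);
          |               self.overflow.push(bottom);
          |           }
          |           self.top.push(x);
          |       }
          | 
          |       fn pop(&mut self) -> Option<T> {
          |           let popped = self.top.pop();
          |           if popped.is_some() {
          |               // Refill the window from the spill area.
          |               if let Some(x) = self.overflow.pop() {
          |                   self.top.insert(0, x);
          |               }
          |           }
          |           popped
          |       }
          |   }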
        
           | boulos wrote:
           | Hmm. In the end, maybe transparency is a bad example, because
           | people immediately jumped to approximating it instead.
           | Wyman's review paper though is a good read regardless:
           | 
           | http://cwyman.org/papers/hpg16_oitContinuum.pdf
           | 
           | The GPU / accelerator BVH traversals are perhaps more
           | applicable (e.g., https://www.embree.org/papers/2019-HPG-
           | ShortStack.pdf) but they also have a different "I can restart
           | my computation" property that say JSON parsing wouldn't have
           | (though maybe if you're willing to do a lot of restarts once
           | you parse the inner portions and push them onto a heap...).
           | 
           | Anyway, cool problem!
        
             | raphlinus wrote:
             | The bell it rang for me was [1], which has come up in
             | discussions with Patrick Walton about maintaining a sort
             | order for coarse rasterization in 2D vector rendering.
             | Taking another look at that, it seems like they propose
             | _both_ bounded k-stacks and linked lists as solutions for
             | recording all the fragments to be combined for transparency
              | blending, but not the hybrid of the two. I wouldn't be
             | surprised if that's been considered.
             | 
             | The problem is slightly different, though, because the
             | major focus for transparency is concurrent writing from
             | potentially many different shaders, while in the stack case
             | each thread can build its data structure without any
             | concurrency or need for atomics; once an aggregate is
             | built, it is "published" (see my prefix sum blog for
             | details on that) and treated as immutable.
             | 
             | These kinds of concurrent approaches do become important
             | for later stages in the JSON parsing process, like building
              | hashmaps for keys in a dictionary. I'm pretty sure the
              | whole problem is tractable and I have thoughts on how
              | to solve it, but I deliberately kept the scope of this
              | blog post limited, as I figured it was challenging
              | enough. But it's very cool to reach people who
              | appreciate the ideas, and if you find anything else
              | that's relevant, I'd love to hear from you.
             | 
             | [1]: https://on-
             | demand.gputechconf.com/gtc/2014/presentations/S43...
        
       | wrnr wrote:
       | How do you get started with learning GPU programming? I mean that
       | in a pedagogical way, like what is the best educational way
       | forward. Build stupid stuff in shader toys?
        
         | Jhsto wrote:
          | In a sense, GPU programming is similar to cellular automata.
          | Each cell is an invocation on the GPU, and you can assume
          | that you have an unlimited number of these invocations. You
          | can freeze cells indefinitely or temporarily, but you cannot
          | group them. You can assume you know the identifier of each
          | cell, and you can assume you know the number of cells for
          | each program. The hard part is to generalize algorithms so
          | that the only control structures are based on freezing,
          | identifying, and counting cells. The cells can communicate
          | information to neighboring cells, and to any arbitrary cell,
          | but the latter is not good for performance.
         | 
          | Most universities don't really teach parallel algorithm
          | development in any way, despite core counts increasing in
          | both the CPU and GPU worlds.
         | 
          | Current languages like GLSL and CUDA are highly abstracted,
          | in the sense that they don't really tell you why one
          | approach is better or worse than another. To me, if you can
          | come up with a general cellular-automaton-like solution,
          | then that logic can be converted and is generally fine
          | performance-wise for SIMD-oriented programming.
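          | 
          | A minimal CPU-side Rust sketch of that mental model (names
          | are mine, not any particular GPU API): every cell runs the
          | same function, knowing only its own id and the total count,
          | reading the previous generation and writing only its own
          | slot.
          | 
          |   // One "generation": every invocation runs the same
          |   // kernel, given its id and the number of invocations.
          |   fn dispatch<F>(prev: &[u32], kernel: F) -> Vec<u32>
          |   where
          |       F: Fn(usize, usize, &[u32]) -> u32,
          |   {
          |       let n = prev.len();
          |       (0..n).map(|id| kernel(id, n, prev)).collect()
          |   }
          | 
          |   fn main() {
          |       let state = vec![0u32, 1, 2, 3, 4, 5, 6, 7];
          |       // Each cell reads only itself and its neighbors,
          |       // the kind of local communication that is cheap.
          |       let next = dispatch(&state, |id, n, prev| {
          |           let left = prev[(id + n - 1) % n];
          |           let right = prev[(id + 1) % n];
          |           left + prev[id] + right
          |       });
          |       println!("{:?}", next);
          |   }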
        
         | raphlinus wrote:
         | It's a good question, and it's been suggested I write a GPU
         | compute 101. I did do a talk at Jane Street earlier in the year
         | called "A taste of GPU compute" which has some relevant
         | resources (a Google search on that term will give you a number
         | of links).
         | 
         | I think there are probably a number of on-ramps. One easy way
         | to get started is shadertoy, which if nothing else should give
         | familiarity with GLSL syntax and intuition for performance in
         | the simplest case: each "thread" computes something (a pixel
         | value) independently of the others. You'll quickly run into
         | limitations though, as it can't really do any of the more
         | advanced compute stuff. I think as WebGPU becomes real, an
         | analogous tool that unlocks compute kernels (aka compute
         | shaders) could be very powerful for pedagogy.
         | 
         | I think most people that do compute on GPU use CUDA, as it's
         | the only really practical toolchain for it. That has a large
         | number of high quality open source libraries, which tend to be
         | well-documented and have good research papers behind them. You
         | can start by using these libraries, then digging deeper to see
         | how they're implemented.
         | 
         | As I've been going on about, I believe this space is ripe for
         | major growth in the next decade or so. As a rough guide, if you
         | can make your algorithm fit the GPU compute model nicely,
         | you'll get about 10x the compute per dollar, which is
         | effectively the same as compute per watt. Why leave such
         | performance on the table? The answer is that programming GPUs
         | is just too hard. In certain areas, including machine learning,
         | an investment of research has partially fixed that problem. But
         | in others there is fruit at medium height, ripe for the
         | picking. And in some other areas, you'll need a tall ladder, of
         | which I believe a solid understanding of monoids is but one
         | rung.
        
           | wrnr wrote:
            | Thanks, it was your work that kindled my interest in
            | programming the GPU. The fact that most vector graphics
            | are done by CPU libraries made me realize it's still a
            | very underdeveloped area. The hard thing about it is also
            | what makes it interesting: it's not just a different
            | programming language but a different architecture with
            | different runtime characteristics. Sort of the closest
            | you can get to owning a quantum computer. Anyway, I'll
            | learn some math and CUDA, and buy the book if you find
            | the time :)
        
       | bitdizzy wrote:
       | This language of matching brackets is also called the Dyck
       | language and its syntactic monoid is the bicyclic semigroup,
       | which is an interesting example of an inverse semigroup. Edward
       | Kmett gave a talk on programming with these sorts of algebraic
       | structures that some might find illuminating:
       | https://www.youtube.com/watch?v=HGi5AxmQUwU
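        | 
        | Concretely (my sketch, using the usual presentation q^a p^b
        | with pq = 1): an element of the bicyclic monoid is just a
        | (pops, pushes) pair of counts, i.e. the stack monoid with
        | the pushed values forgotten.
        | 
        |   #[derive(Clone, Copy, Debug, PartialEq)]
        |   struct Bicyclic {
        |       pops: u32,
        |       pushes: u32,
        |   }
        | 
        |   impl Bicyclic {
        |       fn identity() -> Bicyclic {
        |           Bicyclic { pops: 0, pushes: 0 }
        |       }
        | 
        |       // The right element's pops cancel against the left
        |       // element's pushes; the remainder passes through.
        |       fn combine(self, other: Bicyclic) -> Bicyclic {
        |           let cancelled = self.pushes.min(other.pops);
        |           Bicyclic {
        |               pops: self.pops + other.pops - cancelled,
        |               pushes: self.pushes + other.pushes - cancelled,
        |           }
        |       }
        |   }
        | 
        | Mapping '(' to (0, 1) and ')' to (1, 0), a bracket string is
        | in the Dyck language exactly when it folds to the identity.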
       | 
       | Inverse semigroups are used in the geometry of interaction to
       | give a parallelizable interpretation of lambda calculus. I don't
        | have a great introductory reference. Here are some people who
        | have implemented a system based on the theory:
       | https://www.cambridge.org/core/journals/mathematical-structu...
       | 
       | Sorry for the citation dump, I'm short on time and I thought they
       | would be interesting to those who found this submission
       | interesting.
       | 
       | References
       | 
       | https://en.wikipedia.org/wiki/Dyck_language
       | 
       | https://en.wikipedia.org/wiki/Bicyclic_semigroup
        
       | chombier wrote:
       | About the stack/monoid structure, there was a nice talk by Conal
       | Elliott recently describing how it can be used to implement a
       | compiler [0]. He also mentions parallel computations at some
       | point.
       | 
       | [0] https://www.youtube.com/watch?v=wvQbpS6wBa0
        
         | agumonkey wrote:
          | I may be extrapolating too much, but I think Steele's team
          | working on Fortress was also doing monoid-like thinking to
          | parallelize heavily. He gave a talk on how to delinearize
          | problems into partially solved subproblems to be reconciled
          | later.
        
       | acoye wrote:
        | I had this crazy idea a while back of using ray tracing APIs
        | to piggyback a database. Testing for a row could be done by
        | shooting parallel rays at a structure mapped to the data.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2020-09-06 23:00 UTC)