[HN Gopher] FlashAttention: Fast Transformer training with long ...
___________________________________________________________________

FlashAttention: Fast Transformer training with long sequences

Author : kristianp
Score  : 127 points
Date   : 2023-10-01 11:23 UTC (11 hours ago)

(HTM) web link (www.adept.ai)
(TXT) w3m dump (www.adept.ai)

| kken wrote:
| Here is a recent interview with the author of FlashAttention,
| Tri Dao:
|
| https://www.youtube.com/watch?v=J4-qZ6KBalk

| fzysingularity wrote:
| Also, isn't the author Tri Dao at Together AI now as their
| chief scientist?

| dang wrote:
| Related:
|
| _FlashAttention-2, 2x faster than FlashAttention_ -
| https://news.ycombinator.com/item?id=36761988 - July 2023
| (18 comments)
|
| _FlashAttention: Fast and Memory-Efficient Exact Attention
| with IO-Awareness_ -
| https://news.ycombinator.com/item?id=31568090 - May 2022
| (3 comments)

| sebzim4500 wrote:
| It's insane that FlashAttention was released only 16 months
| ago. It feels like a decade.

| ttul wrote:
| It's basically a way of making more efficient use of memory
| transfers during the calculation of the attention blocks in a
| transformer. You transfer a block at a time, increasing
| inference throughput because less time is spent overall
| fetching things from slow memory.

| [deleted]

| thawab wrote:
| The same author, Tri Dao, released FlashAttention-2 in July:
|
| https://together.ai/blog/tri-dao-flash-attention

| 1024core wrote:
| Has anybody used FlashAttention in their model? Are there any
| benchmark numbers on the quality impact?

| pama wrote:
| The result is identical to regular attention in transformers,
| but training can be about four times faster, so there is
| almost no reason not to use it.
___________________________________________________________________
(page generated 2023-10-01 23:00 UTC)
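
To make ttul's block-at-a-time description concrete, here is a
minimal PyTorch sketch of the tiling-plus-online-softmax idea behind
FlashAttention. It computes exact attention one key/value block at a
time, keeping running softmax statistics per query row, so the full
N x N score matrix is never materialized. This is an illustration of
the algorithm only, not the fused CUDA kernel; the names
tiled_attention and block_size are made up for this sketch.

    import torch

    def tiled_attention(q, k, v, block_size=128):
        """Exact attention computed one key/value block at a time,
        without materializing the full (N x N) score matrix."""
        n, d = q.shape
        scale = d ** -0.5
        out = torch.zeros_like(q)
        # Running softmax statistics for each query row.
        row_max = torch.full((n, 1), float("-inf"))
        row_sum = torch.zeros(n, 1)
        for start in range(0, n, block_size):
            kb = k[start:start + block_size]   # key block
            vb = v[start:start + block_size]   # value block
            scores = (q @ kb.T) * scale        # (N, B) partial scores
            block_max = scores.max(dim=-1, keepdim=True).values
            new_max = torch.maximum(row_max, block_max)
            # Rescale previously accumulated output and normalizer
            # to the new running max, then fold in this block.
            correction = torch.exp(row_max - new_max)
            p = torch.exp(scores - new_max)    # stable partial softmax
            row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
            out = out * correction + p @ vb
            row_max = new_max
        return out / row_sum

    # Sanity check against the naive implementation.
    q, k, v = (torch.randn(512, 64) for _ in range(3))
    ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
    assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5)

The real kernel gets its speed from running this loop inside fast
on-chip SRAM and fusing the whole thing into one pass, but the math
above is the same: the rescaling trick is what lets softmax be
computed incrementally across blocks.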
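On pama's point that there is little reason not to use it: a hedged
usage example, assuming PyTorch 2.x on a CUDA GPU whose inputs meet
the flash kernel's constraints (fp16/bf16, supported head dims).
torch.nn.functional.scaled_dot_product_attention can dispatch to a
FlashAttention-style kernel, and torch.backends.cuda.sdp_kernel
restricts which backend is used.

    import torch
    import torch.nn.functional as F

    # (batch, heads, seq_len, head_dim); flash kernels want fp16/bf16.
    q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    # Allow only the flash backend; if the inputs don't satisfy its
    # constraints, this errors rather than silently falling back to
    # the slower math path.
    with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                        enable_math=False,
                                        enable_mem_efficient=False):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)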