[HN Gopher] FlashAttention: Fast Transformer training with long ...
___________________________________________________________________

FlashAttention: Fast Transformer training with long sequences

Author : kristianp
Score  : 127 points
Date   : 2023-10-01 11:23 UTC (11 hours ago)

(HTM) web link (www.adept.ai)
(TXT) w3m dump (www.adept.ai)

| kken wrote:
| Here is a recent interview with the author of FlashAttention,
| Tri Dao:
|
| https://www.youtube.com/watch?v=J4-qZ6KBalk

| fzysingularity wrote:
| Also, isn't the author Tri Dao at Together AI now as their
| chief scientist?

| dang wrote:
| Related:
|
| _FlashAttention-2, 2x faster than FlashAttention_ -
| https://news.ycombinator.com/item?id=36761988 - July 2023
| (18 comments)
|
| _FlashAttention: Fast and Memory-Efficient Exact Attention
| with IO-Awareness_ -
| https://news.ycombinator.com/item?id=31568090 - May 2022
| (3 comments)

| sebzim4500 wrote:
| It's insane that FlashAttention was released only 16 months
| ago. It feels like a decade.

| ttul wrote:
| It's basically a way of making more efficient use of memory
| transfers during the calculation of the attention blocks in a
| transformer. You transfer a block at a time, increasing
| inference throughput because less time is spent overall
| fetching things from slow memory.

| [deleted]

| thawab wrote:
| The same author, Tri Dao, released FlashAttention-2 in July:
|
| https://together.ai/blog/tri-dao-flash-attention

| 1024core wrote:
| Has anybody used FlashAttention in their model? Are there any
| benchmark numbers on the quality impact?

| pama wrote:
| The result is identical to regular attention in transformers,
| but training can be about four times faster, so there is
| almost no reason not to use it.
___________________________________________________________________
(page generated 2023-10-01 23:00 UTC)
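
To make ttul's block-at-a-time description concrete, here is a
minimal PyTorch sketch of the tiling-plus-online-softmax idea behind
FlashAttention. It computes exact attention one key/value block at a
time, keeping running softmax statistics per query row, so the full
N x N score matrix is never materialized. This is an illustration of
the algorithm only, not the fused CUDA kernel; the names
tiled_attention and block_size are made up for this sketch.

    import torch

    def tiled_attention(q, k, v, block_size=128):
        """Exact attention computed one key/value block at a time,
        without materializing the full (N x N) score matrix."""
        n, d = q.shape
        scale = d ** -0.5
        out = torch.zeros_like(q)
        # Running softmax statistics for each query row.
        row_max = torch.full((n, 1), float("-inf"))
        row_sum = torch.zeros(n, 1)
        for start in range(0, n, block_size):
            kb = k[start:start + block_size]   # key block
            vb = v[start:start + block_size]   # value block
            scores = (q @ kb.T) * scale        # (N, B) partial scores
            block_max = scores.max(dim=-1, keepdim=True).values
            new_max = torch.maximum(row_max, block_max)
            # Rescale previously accumulated output and normalizer
            # to the new running max, then fold in this block.
            correction = torch.exp(row_max - new_max)
            p = torch.exp(scores - new_max)    # stable partial softmax
            row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
            out = out * correction + p @ vb
            row_max = new_max
        return out / row_sum

    # Sanity check against the naive implementation.
    q, k, v = (torch.randn(512, 64) for _ in range(3))
    ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
    assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5)

The real kernel gets its speed from running this loop inside fast
on-chip SRAM and fusing the whole thing into one pass, but the math
above is the same: the rescaling trick is what lets softmax be
computed incrementally across blocks.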
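On pama's point that there is little reason not to use it: a hedged
usage example, assuming PyTorch 2.x on a CUDA GPU whose inputs meet
the flash kernel's constraints (fp16/bf16, supported head dims).
torch.nn.functional.scaled_dot_product_attention can dispatch to a
FlashAttention-style kernel, and torch.backends.cuda.sdp_kernel
restricts which backend is used.

    import torch
    import torch.nn.functional as F

    # (batch, heads, seq_len, head_dim); flash kernels want fp16/bf16.
    q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    # Allow only the flash backend; if the inputs don't satisfy its
    # constraints, this errors rather than silently falling back to
    # the slower math path.
    with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                        enable_math=False,
                                        enable_mem_efficient=False):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)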