[HN Gopher] Understanding and coding the self-attention mechanism
of large language models
___________________________________________________________________
 
Understanding and coding the self-attention mechanism of large
language models
 
Author : mariuz
Score  : 52 points
Date   : 2023-02-10 18:04 UTC (4 hours ago)
 
(HTM) web link (sebastianraschka.com)
(TXT) w3m dump (sebastianraschka.com)
 
| hprotagonist wrote:
| https://arxiv.org/abs/2105.02723
| 
| _The strong performance of vision transformers on image
| classification and other vision tasks is often attributed to the
| design of their multi-head attention layers. However, the extent
| to which attention is responsible for this strong performance
| remains unclear.
| 
| In this short report, we ask: is the attention layer even
| necessary?
| 
| Specifically, we replace the attention layer in a vision
| transformer with a feed-forward layer applied over the patch
| dimension. The resulting architecture is simply a series of
| feed-forward layers applied over the patch and feature dimensions
| in an alternating fashion. In experiments on ImageNet, this
| architecture performs surprisingly well: a ViT/DeiT-base-sized
| model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9%
| for ViT and DeiT respectively.
| 
| These results indicate that aspects of vision transformers other
| than attention, such as the patch embedding, may be more
| responsible for their strong performance than previously thought.
| We hope these results prompt the community to spend more time
| trying to understand why our current models are as effective as
| they are._
| 
  | lostmsu wrote:
  | That's a pretty huge drop in accuracy.
  | 
    | thomasahle wrote:
    | ViT gives you 90% top-1 accuracy on ImageNet:
    | https://paperswithcode.com/sota/image-classification-on-imag...
    | I don't know where they get the 77.9% number from. 75% is
    | pretty bad. Similar to the 2015 VGG net, as the authors also
    | admit.
    | 
    | thomasahle wrote:
    | Never mind, I guess it's "ImageNet-1K trained models", on
    | which ViT gets 79.9%, and the 90% only comes from pretraining
    | on ImageNet-22K.
    | 
    | There are other non-attention-based networks that get 90%
    | too, though: https://arxiv.org/pdf/2212.11696v3.pdf
 
| mirker wrote:
| Isn't this very similar to Karpathy's nanoGPT?
___________________________________________________________________
(page generated 2023-02-10 23:00 UTC)
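___________________________________________________________________
 
A minimal sketch (PyTorch assumed; not code from the linked article
or paper) contrasting the two token-mixing strategies discussed in
the thread above: single-head scaled dot-product self-attention
versus the paper's replacement, a feed-forward layer applied over
the patch dimension. All class names, sizes, and tensor shapes here
are illustrative.
 
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
 
    class SelfAttention(nn.Module):
        """Single-head scaled dot-product self-attention."""
        def __init__(self, d_model: int):
            super().__init__()
            self.W_q = nn.Linear(d_model, d_model, bias=False)
            self.W_k = nn.Linear(d_model, d_model, bias=False)
            self.W_v = nn.Linear(d_model, d_model, bias=False)
 
        def forward(self, x):
            # x: (batch, num_patches, d_model)
            q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
            # Attention weights: softmax over scaled pairwise
            # query-key dot products.
            scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)
            return F.softmax(scores, dim=-1) @ v
 
    class PatchMixer(nn.Module):
        """Replacement: a linear layer mixing across patches."""
        def __init__(self, num_patches: int):
            super().__init__()
            self.ff = nn.Linear(num_patches, num_patches)
 
        def forward(self, x):
            # Transpose so the linear layer acts over the patch
            # dimension (shared across feature channels), then
            # transpose back.
            return self.ff(x.transpose(1, 2)).transpose(1, 2)
 
    x = torch.randn(2, 16, 64)         # (batch, num_patches, d_model)
    print(SelfAttention(64)(x).shape)  # torch.Size([2, 16, 64])
    print(PatchMixer(16)(x).shape)     # torch.Size([2, 16, 64])
 
The key difference: self-attention computes input-dependent mixing
weights (the softmax scores vary with x), while the feed-forward
replacement mixes patches with a fixed learned matrix, which is what
makes the 74.9% result quoted above notable.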