[HN Gopher] Understanding and coding the self-attention mechanis...
       ___________________________________________________________________
        
       Understanding and coding the self-attention mechanism of large
       language models
        
       Author : mariuz
       Score  : 52 points
       Date   : 2023-02-10 18:04 UTC (4 hours ago)
        
 (HTM) web link (sebastianraschka.com)
 (TXT) w3m dump (sebastianraschka.com)
        
       | hprotagonist wrote:
       | https://arxiv.org/abs/2105.02723
       | 
       |  _The strong performance of vision transformers on image
       | classification and other vision tasks is often attributed to the
       | design of their multi-head attention layers. However, the extent
       | to which attention is responsible for this strong performance
       | remains unclear.
       | 
       | In this short report, we ask: is the attention layer even
       | necessary?
       | 
       | Specifically, we replace the attention layer in a vision
       | transformer with a feed-forward layer applied over the patch
       | dimension. The resulting architecture is simply a series of feed-
       | forward layers applied over the patch and feature dimensions in
       | an alternating fashion. In experiments on ImageNet, this
       | architecture performs surprisingly well: a ViT/DeiT-base-sized
        | model obtains 74.9% top-1 accuracy, compared to 77.9% and
        | 79.9% for ViT and DeiT respectively.
       | 
       | These results indicate that aspects of vision transformers other
       | than attention, such as the patch embedding, may be more
       | responsible for their strong performance than previously thought.
       | We hope these results prompt the community to spend more time
       | trying to understand why our current models are as effective as
       | they are._
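
        A minimal PyTorch sketch of the idea in the quoted abstract:
        replace the self-attention layer with a feed-forward layer applied
        across the patch dimension, alternating with the usual feed-forward
        layer across the feature dimension. Module names and layer sizes
        below are illustrative assumptions, not taken from the paper's
        code.

            import torch
            import torch.nn as nn

            class FeedForward(nn.Module):
                """Two-layer MLP applied over the last dimension."""
                def __init__(self, dim, hidden_dim):
                    super().__init__()
                    self.net = nn.Sequential(
                        nn.Linear(dim, hidden_dim),
                        nn.GELU(),
                        nn.Linear(hidden_dim, dim),
                    )

                def forward(self, x):
                    return self.net(x)

            class PatchFFBlock(nn.Module):
                """Transformer-style block with the attention layer replaced
                by a feed-forward layer over the patch dimension."""
                def __init__(self, num_patches, dim):
                    super().__init__()
                    self.norm1 = nn.LayerNorm(dim)
                    self.patch_ff = FeedForward(num_patches, num_patches * 2)
                    self.norm2 = nn.LayerNorm(dim)
                    self.feature_ff = FeedForward(dim, dim * 4)

                def forward(self, x):                  # x: (batch, patches, dim)
                    # mix information across patches instead of attending
                    y = self.norm1(x).transpose(1, 2)  # (batch, dim, patches)
                    x = x + self.patch_ff(y).transpose(1, 2)
                    # standard per-patch feed-forward over the feature dimension
                    x = x + self.feature_ff(self.norm2(x))
                    return x

            x = torch.randn(8, 196, 768)               # e.g. 14x14 patches, width 768
            block = PatchFFBlock(num_patches=196, dim=768)
            print(block(x).shape)                      # torch.Size([8, 196, 768])

        Stacking such blocks gives the "series of feed-forward layers
        applied over the patch and feature dimensions in an alternating
        fashion" that the abstract describes.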
        
         | lostmsu wrote:
         | That's a pretty huge drop in accuracy.
        
         | thomasahle wrote:
         | ViT gives you 90% top-1 accuracy on ImageNet:
         | https://paperswithcode.com/sota/image-classification-on-imag...
          | I don't know where they get the 77.9% number from. 75% is
          | pretty bad, similar to the 2015 VGG net, as the authors
          | themselves admit.
        
           | thomasahle wrote:
            | Never mind, I guess it's "ImageNet-1K trained models", on
            | which ViT gets 79.9%; the 90% only comes with ImageNet-22K
            | pretraining.
            | 
            | There are other non-attention-based networks that reach 90%
            | too, though: https://arxiv.org/pdf/2212.11696v3.pdf
        
       | mirker wrote:
       | Isn't this very similar to Karpathy's nanoGPT?
        
       ___________________________________________________________________
       (page generated 2023-02-10 23:00 UTC)