       | Buttons840 wrote:
       | What's a good introduction to transformers?
         | tchalla wrote:
         | I have seen a lot of introductions that explain the mechanics.
         | However, I haven't seen one that explains the intuition behind
         | why it works.
         | lalaithion wrote:
         | Not sure if this is a good introduction, but a good second
         | paper to read is https://arxiv.org/abs/2106.06981
         | You can think of finite state machines as being two functions:
         | f(input, state) = output, and g(input, state) = next_state.
         | (Traditional FSMs have 3 'output' states, basically 'terminated
         | - success', 'terminated - failure', and 'still working', but in
         | theory it makes sense to fully generalize it).
         | If you think about plain neural networks as approximating
         | arbitrary functions f(input) = output, then recurrent neural
         | networks are "continuous state machines", where you have the
         | same two functions f(input, state) = output, and g(input,
         | state) = next_state, except instead of being finite symbols,
         | they're continuous points in N dimensional space. This, at
         | least to me, clarifies why recurrent neural networks work on
         | simple and short time-based problems, but can't efficiently
         | generalize to complex problems--they're just FSMs!
         | The paper I linked above provides a similar high-level
         | computational analogy to how transformers work.
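         | A minimal sketch of the analogy above, not taken from the linked
         | paper. The parity task, the function names, and the hand-picked
         | (untrained) weights are all assumptions made for illustration.
         | Both machines share the same f/g shape; only the type of the
         | state differs.

```python
import math

# Finite state machine: the state is one of a finite set of symbols.
# Hypothetical toy task: track the parity of the 1s seen so far.

def fsm_next_state(symbol, state):
    """g(input, state) = next_state: flip parity on a 1, keep it on a 0."""
    if symbol == 1:
        return "odd" if state == "even" else "even"
    return state

def fsm_output(symbol, state):
    """f(input, state) = output: report the parity after reading symbol."""
    return fsm_next_state(symbol, state)

def run_fsm(symbols, state="even"):
    outputs = []
    for s in symbols:
        outputs.append(fsm_output(s, state))
        state = fsm_next_state(s, state)
    return outputs, state

# "Continuous state machine": the state is a point in N-dimensional space.
# Weights are hand-picked assumptions; a real RNN would learn them.
W_IN = [0.5, -0.3]
W_REC = [[0.1, 0.2], [-0.2, 0.1]]
W_OUT = [1.0, -1.0]

def rnn_next_state(x, state, w_in, w_rec):
    """g(input, state) = next_state, computed as tanh(w_in * x + w_rec @ state)."""
    n = len(state)
    return [
        math.tanh(w_in[i] * x + sum(w_rec[i][j] * state[j] for j in range(n)))
        for i in range(n)
    ]

def rnn_output(x, state, w_in, w_rec, w_out):
    """f(input, state) = output: a linear readout of the updated state."""
    h = rnn_next_state(x, state, w_in, w_rec)
    return sum(w * hi for w, hi in zip(w_out, h))

def run_rnn(inputs, state, w_in, w_rec, w_out):
    outputs = []
    for x in inputs:
        outputs.append(rnn_output(x, state, w_in, w_rec, w_out))
        state = rnn_next_state(x, state, w_in, w_rec)
    return outputs, state

print(run_fsm([1, 0, 1, 1]))  # (['odd', 'odd', 'even', 'odd'], 'odd')
print(run_rnn([1.0, 0.0, 1.0, 1.0], [0.0, 0.0], W_IN, W_REC, W_OUT))
```

         | The loop in run_fsm and run_rnn is identical. Everything the
         | machine knows about the past has to fit in that one state value,
         | which is the bottleneck the comment is pointing at.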
         | atty wrote:
         | If you mean the technical details of attention models, the
         | original paper "Attention is all you need" is not too difficult
         | to read. If you're more interested in applications, Hugging
         | Face has a "course" on their website that walks through the
         | high level topics of applying transformers to natural language
         | processing (can't remember if they cover transformers for other
         | topics).
         | criticaltinker wrote:
         | The original paper that introduced the Transformer architecture
         | is quite accessible and outlines a lot of the history and
         | rationale for the design [1].
         | [1] https://arxiv.org/pdf/1706.03762.pdf
           | mrfusion wrote:
           | I tried that but it seems to gloss over what an encoder, etc
           | actually are.
           | I think I'd do better with pseudo code or a toy example.
             | saynay wrote:
             | The encoder is the neural-net that converts the input to
             | the embedding vector. The decoder is the neural-net that
             | converts that vector into output. What that embedding
             | vector "means" is whatever the entire algorithm has learned
             | it means.
             | For a more simplified look at embeddings, I would look at
             | Word2Vec (although, it doesn't involve transformers). It
             | encodes single words, instead of entire phrases, and does
             | so by looking at their relative position to other words
             | while being trained.
             | Embeddings are just vectors, and so you can do math or
             | compare them to other embeddings. The famous example is
             | E(king) - E(man) + E(woman) ≈ E(queen)
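             | A toy sketch of that arithmetic. The 3-dimensional vectors
             | below are made up by hand (assumed axes: royalty, maleness,
             | femaleness) and are not real Word2Vec output. With learned
             | embeddings the relation only holds approximately, so the
             | usual approach is to look for the nearest remaining word by
             | cosine similarity.

```python
import math

# Hypothetical hand-made embeddings; real ones are learned and much larger.
E = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
    "apple": [0.0, 0.1, 0.1],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is closest to vector."""
    candidates = [w for w in E if w not in exclude]
    return max(candidates, key=lambda w: cosine(vector, E[w]))

# E(king) - E(man) + E(woman)
target = add(sub(E["king"], E["man"]), E["woman"])

# The query words are excluded so the search cannot return an input word.
print(nearest(target, exclude=("king", "man", "woman")))  # queen
```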
               | mrfusion wrote:
               | So you're saying the encoding could be a neural net OR
               | something like word2vec?
             | criticaltinker wrote:
             | Check out The Annotated Transformer, it's one of my
             | favorite references! It contains straightforward Python
             | code side by side with excerpts from the original paper.
             | http://nlp.seas.harvard.edu/2018/04/03/attention.html
         | beefman wrote:
         | Transformers from Scratch
         | link: https://e2eml.school/transformers.html
         | discussion here: https://news.ycombinator.com/item?id=29315107
         | dpflan wrote:
         | Another resource: "The Illustrated Transformer"
         | - https://jalammar.github.io/illustrated-transformer/
         | - HN post for the article:
         | https://news.ycombinator.com/item?id=18351674
         | visarga wrote:
         | For accessibility I recommend Yannic Kilcher's video review of
         | "Attention Is All You Need"
         | https://www.youtube.com/watch?v=iDulhoQ2pro
         | Yannic has made about 62 other transformer paper reviews
         | since. You can find the usual suspects.
         | https://www.youtube.com/watch?v=u1_qMdb0kYU&list=PL1v8zpldgH...
         | moffkalast wrote:
         | Transformers (2007)
           | timy2shoes wrote:
           | I prefer to go to the original source, specifically The
           | Transformers (1984-87).
       | visarga wrote:
       | So transformers have done it again, another sub-field of ML with
       | all its past approaches surpassed by a simple language model, at
       | least when there is enough data.
       | So far they can handle: text, image, video, code, proteins and
       | now planning and behavior. It's like a universal algorithm for
       | learning and reminds me of the uniformity of the brain. Hope
       | we're going to see much more efficient hardware implementations
       | in the future.
         | blovescoffee wrote:
         | I wouldn't say they've "done it" quite yet. There's definitely
         | an application for imitation learning but that might be it. A
         | translation of the work in sequence-to-sequence to sequence-to-
         | action is something I've also considered researching. A few
         | challenges exist which the author touches on in just one
         | sentence. First, we need data about previous sequences of
         | actions and this is necessarily a challenge in many fields in
         | robotics/learning. A related problem is that of exploration.
         | How exactly should we inform the exploration of new sequences?
         | Also, if our policy is based on the prediction of a
         | Transformer, does it have the traditional desirable properties
         | of a policy in an RL environment? Off the top of my head it
         | seems like a Transformer fed into an MLP would probably be a fit,
         | but I'm not sure. Transformers do seem promising, but it's a
         | bit early to say they've "done it" :)
