• rgovostes 7 hours ago

    Thanks to the author for clarifying something that's been a mystery to me for a few years. The positional encoding scheme in the "Attention Is All You Need" paper is only given half a page and the construction appears to come out of nowhere.
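
    For reference, the half-page construction in question is just the sinusoidal scheme below; this is my own quick sketch of the paper's formula (the `sinusoidal_pe` helper name is made up for illustration), not code from the article:

      import torch

      def sinusoidal_pe(seq_len, d_model, base=10000.0):
          # PE[pos, 2i] = sin(pos / base^(2i/d)), PE[pos, 2i+1] = cos(pos / base^(2i/d))
          pos = torch.arange(seq_len, dtype=torch.float32)[:, None]
          i = torch.arange(0, d_model, 2, dtype=torch.float32)[None, :]
          angles = pos / base ** (i / d_model)
          pe = torch.zeros(seq_len, d_model)
          pe[:, 0::2] = torch.sin(angles)
          pe[:, 1::2] = torch.cos(angles)
          return pe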

    • FL33TW00D 9 minutes ago

      Thank you! Seemed like voodoo to me too, hence this post!

    • valine 8 hours ago

      One of the things I really love about RoPE is that it allows for a lot of interesting encoding schemes at inference time without model retraining. I’ve had a lot of fun playing with different relative positions. You can elicit a lot of interesting behaviors from the model when you use different rotations for keys vs queries; they don’t always have to match.

      For example, exact position doesn’t matter too much when tokens are spaced out. Let’s say you use token position 100 for your query: you can shift all the keys around position 100, and the further back they are in the context, the more freedom you have to play with the exact value.
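
      Roughly what I mean, as a toy sketch (the `rope` helper is just the standard pairwise rotation written from memory, not any particular model's actual code; the point is the mismatched position indices for queries and keys):

        import torch

        def rope(x, positions, base=10000.0):
            # standard RoPE: rotate consecutive channel pairs by angle position * theta_i
            seq, dim = x.shape
            theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
            ang = positions.float()[:, None] * theta[None, :]                      # (seq, dim/2)
            x1, x2 = x[:, 0::2], x[:, 1::2]
            out = torch.empty_like(x)
            out[:, 0::2] = x1 * ang.cos() - x2 * ang.sin()
            out[:, 1::2] = x1 * ang.sin() + x2 * ang.cos()
            return out

        seq_len, dim = 8, 64
        q, k = torch.randn(seq_len, dim), torch.randn(seq_len, dim)
        pos = torch.arange(seq_len)

        # usual case: queries and keys share the same positions
        scores = rope(q, pos) @ rope(k, pos).T

        # mismatched case: keep the query positions but shift all key positions back by 2
        scores_shifted = rope(q, pos) @ rope(k, pos - 2).T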

      • imjonse an hour ago

        I don't think the first code example should work (it does indeed print False here).

        When given a permuted sequence, the attention output will also be permuted, not identical. The need for positional encodings comes from two identical tokens producing the same values in the final attention matrix regardless of their absolute and relative positions; that is enough to lose a lot of meaning.
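
        A minimal sketch of what I mean, using plain scaled dot-product attention with made-up weights rather than the post's actual code; the two outputs only match after un-permuting:

          import torch
          import torch.nn.functional as F

          torch.manual_seed(0)
          d = 16
          x = torch.randn(5, d)                 # stand-in embeddings for "the dog chased the cat"
          perm = torch.tensor([0, 4, 2, 3, 1])  # swap "dog" and "cat"

          Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

          def attend(x):
              q, k, v = x @ Wq, x @ Wk, x @ Wv
              return F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v

          out, out_perm = attend(x), attend(x[perm])

          print(torch.allclose(out, out_perm))                  # False: rows come back in a different order
          print(torch.allclose(out[perm], out_perm, atol=1e-6)) # True: same values, just permuted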

        • FL33TW00D 24 minutes ago

          The first code example says False only because of the high precision of the comparison; I've updated the example.

          • jmmcd 18 minutes ago

            But u/imjonse's reasoning seems right. I haven't run either version of the code, but when reading it I expected that to be False. The output is still a list with an order.

            the dog chased the cat: position 1 in the output is attention(dog, everything)

            the cat chased the dog: position 1 in the output is attention(cat, everything)

            • FL33TW00D 9 minutes ago

              Run the code and look at the values!

        • jcims 5 hours ago

          I'm effectively a complete layman in this (although I do see some parallels to physical positional encoders, which is interesting), so at first read this entire thing went WAAAAY over my head. At first glance it seemed way overcomplicated just to encode position, so I figured I was missing something. ChatGPT was super helpful in explaining spiking neural networks to me, so I just spent 20 minutes asking ChatGPT to explain this to me, and I feel like I actually learned something.

          Then at the end I asked ChatGPT how this all relates to how it operates and it was interesting to see things like:

          >Tokens as Subword Units: I use a tokenization method called Byte Pair Encoding (BPE), which breaks text into subword units.

          I don't know if it's accurate or not, but it's wild seeing it talk about how it works.

          • gloflo 3 hours ago

            The context includes that "it" is ChatGPT. The fact that ChatGPT uses Byte Pair Encoding is widely published. It is to be expected that an LLM can regurgitate this kind of information; nothing wild about that.

            • astrange 2 hours ago

              Note if you don't have a good system prompt, other LLMs will also tell you they're ChatGPT or Claude.

            • refulgentis 4 hours ago

              100% accurate

            • elieb44 an hour ago

              How about context encoding more generally? Are there techniques to do that? E.g., during training, I want the string "Dubito ergo cogito, cogito ergo sum, sum ergo Deus est." to have René Descartes embedded as the main author, 1637 as the year of writing, and "Discours de la méthode" as the global context of writing.

              So that when the model is trained on another part of the same book, it can learn that both came from the same context.

              • throwawaymaths 6 hours ago

                Maybe someone could answer this for me: it seems like encoding the positional embeddings as augmentations to the "natural" activations, instead of as their own inputs concatenated onto the activations, makes things like sliding a window much harder... I guess the obvious drawback is that you end up with somewhat less textually derived information.

                I recall an early transformers video where they tried both and it turned out that adding the position onto the existing vectors was no worse, so they went with it... No further discussion of the motivation happened in that video.

                Is it worth revisiting that maybe now that activations have a gobsmackingly large dimension?
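
                To pin down the two layouts I mean (toy shapes only; the 64-dim position channel is an arbitrary choice, not taken from any real model):

                  import torch

                  seq_len, d_model, d_pos = 10, 512, 64
                  tok = torch.randn(seq_len, d_model)

                  # additive (what the paper does): the position vector shares the activation width
                  pos_add = torch.randn(seq_len, d_model)
                  x_added = tok + pos_add                       # (10, 512): projection matrices keep their size

                  # concatenated (the alternative I'm asking about): position gets its own channels
                  pos_cat = torch.randn(seq_len, d_pos)
                  x_concat = torch.cat([tok, pos_cat], dim=-1)  # (10, 576): every W_q/W_k/W_v has to grow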

                • stephantul 4 hours ago

                  They are not concatenated, but summed. I think concatenation wouldn’t work, as you indicate.

                  I think you mean the line in the original paper where they say they compared learned positional embeddings with the predefined sinusoidal encoding, and it made no difference.

                  • throwawaymaths an hour ago

                    > I think concatenation wouldn’t work, as you indicate.

                    Why do you say that?

                • logicchains 25 minutes ago

                  Does anyone know why 2D RoPE implementations apply two separate 1D rotations to pairs, instead of applying a 2D rotation to triplets?
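
                  To make the question concrete, this is roughly the "two separate 1D rotations" scheme I mean, sketched from memory rather than copied from any specific implementation: half the channels get rotated by the row index, the other half by the column index.

                    import torch

                    def rope_1d(x, pos, base=10000.0):
                        # standard 1D RoPE: rotate consecutive channel pairs by angle pos * theta_i
                        dim = x.shape[-1]
                        theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
                        ang = pos.float()[:, None] * theta[None, :]
                        x1, x2 = x[..., 0::2], x[..., 1::2]
                        out = torch.empty_like(x)
                        out[..., 0::2] = x1 * ang.cos() - x2 * ang.sin()
                        out[..., 1::2] = x1 * ang.sin() + x2 * ang.cos()
                        return out

                    def rope_2d_axial(x, rows, cols):
                        # the "two separate 1D rotations": one half of the channels sees the row
                        # coordinate, the other half sees the column coordinate
                        half = x.shape[-1] // 2
                        return torch.cat([rope_1d(x[..., :half], rows),
                                          rope_1d(x[..., half:], cols)], dim=-1)

                    # e.g. a flattened 4x4 grid of patches with 64-dim heads
                    rows = torch.arange(4).repeat_interleave(4)  # 0,0,0,0,1,1,1,1,...
                    cols = torch.arange(4).repeat(4)             # 0,1,2,3,0,1,2,3,...
                    q_rot = rope_2d_axial(torch.randn(16, 64), rows, cols)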

                  • cperciva 5 hours ago

                    The binary coding example would have been much better with Gray codes.
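
                    (For anyone unfamiliar: a binary-reflected Gray code reorders the integers so that consecutive values differ in exactly one bit, which is the appeal here. Quick sketch, not code from the article:)

                      def gray(n: int) -> int:
                          # binary-reflected Gray code: adjacent integers differ in exactly one bit
                          return n ^ (n >> 1)

                      for n in range(8):
                          print(n, format(n, "03b"), format(gray(n), "03b"))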

                    • Der_Einzige 3 hours ago

                      Similarly, "you" could have designed state of the art LLM sampling: https://openreview.net/forum?id=FBkpCyujtS&referrer=%5BTasks...