Comments Page - You could have designed state of the art positional encoding

« Back You could have designed state of the art positional encodingfleetwood.devSubmitted by Philpax 7 months ago

rgovostes 7 months ago
Thanks to the author for clarifying something that's been a mystery to me for a few years. The positional encoding scheme in the "Attention Is All You Need" paper is only given half a page and the construction appears to come out of nowhere.
- FL33TW00D 7 months ago
  Thank you! Seemed like voodoo to me too, hence this post!
valine 7 months ago
One of the things I really love about rope is that it allows for a lot of interesting encoding schemes at inference time without model retraining. I’ve had a lot of fun playing with different relative positions. You can elicit a lot of interesting behaviors from the model when you use different rotations for keys vs queries, they don’t always have to match.
For example exact position doesn’t matter too much when tokens are spaced out. Let’s say you use token position 100 for your query, you can shift all the keys around position 100, and the further they are back in the context the more freedom you have to play with the value.
- zackangelo 7 months ago
  I'm surprised this is the case! I've been working on a rope implementation for my own project (needed to account for padding in unique situations) and even an off by one error usually causes the model to produce non-sensical output.
  valine 7 months ago
  You have to be careful to keep the relative positions for adjacent and nearby tokens intact. The relative positions of distant tokens are less brittle.
- bhickey 7 months ago
  Can you describe the behaviors that you can elicit with this technique?
  valine 7 months ago
  One strategy I’ve been playing around with is to take an instruction I want the model to follow and squish the positional encodings for the keys down to position zero, and the new queries out slightly further in the window. The model will still follow the instruction but the behaviors are more global. It’s behaves more like a fine-tune and less like the instruction is part of the conversation.
  bhickey 7 months ago
  > squish the positional encodings for the keys down to position zero
  I might be misunderstanding, but wouldn't this turn your instructions into a bag of words?
  valine 7 months ago
  No, and that’s because we are talking about relative positions. Every query can have its own set of keys. From the perspective of token 100 token 3 would be squished down, but from the perspective of token 3 it is still at position 3 and can see tokens 0,1,2 without them being squished.
espadrine 7 months ago
> Furthermore, by rotating the vector, we have absolutely zero impact on the norm of the vector, which encodes the semantic information of our token.
Doesn’t the angle encode semantic information? Cosine similarity works for embeddings after all.
elieb44 7 months ago
How about context encoding more generally ? Are there techniques to do that. I.E, during training, I want the string "Dubito ergo cogito, cogito ergo sum, sum ergo Deus est." to have embedded René Descartes as main author, year 1637 as date of writing and "Discours de la méthode" as global context of writing.
So that when trained again another part of the same book, the model can learn they were from same context.
- jmmcd 7 months ago
  This is a good idea! The answer to my knowledge is no-one does this, we just the simplest, stupidest, possible method, which is to concatenate all the text in the world. That is during training, of course. At runtime, there is the system prompt.
  The second simplest method might indeed use something like a system prompt with metadata like that, injected before the current window of text. But what would happen at runtime, when that metadata is not present? Probably performance would be much worse.
jcims 7 months ago
I'm effectively a complete layman in this (although I do see some parallels to physical positional encoders, which is interesting) so at first read this entire thing went WAAAAY over my head. At first glance it seemed to be way overcomplicated just to encode position, so I figured I was missing something. ChatGPT was super helpful in explaining spiking neural networks to me so I just spent 20 minutes asking ChatGPT to explain this to me and I feel like I actually learned something.
Then at the end I asked ChatGPT how this all relates to how it operates and it was interesting to see things like:
>Tokens as Subword Units: I use a tokenization method called Byte Pair Encoding (BPE), which breaks text into subword units.
I don't know if it's accurate or not, but it's wild seeing it talk about how it works.
- gloflo 7 months ago
  The context includes that "it" is ChatGPT. The fact that ChatGPT uses Byte Pair Encoding is widely published. It is expectable that a LLM can regurgitate this kind of information, nothing wild about that.
  astrange 7 months ago
  Note if you don't have a good system prompt, other LLMs will also tell you they're ChatGPT or Claude.
  im3w1l 7 months ago
  That's kind of interesting. Like they will know they are an AI? Just not which one?
  astrange 7 months ago
  I think it's because they've been trained by copying answers from ChatGPT. They're not really very copyrighted after all.
  Though the other day I saw someone demonstrate this with Google's Gemini through the API, so maybe it is just picking up conversation traces off the internet.
  throwaway314155 7 months ago
  You think Google is above stealing outputs from OpenAI?
  astrange 7 months ago
  I think they know how to search and replace.
- refulgentis 7 months ago
  100% accurate
alok-g 7 months ago
On a related note, one thing I still do not understand is why are positional encodings 'added' to the token embeddings as opposed to (having a smaller position encoding vector that is) 'concatenated'. It would be great if someone could explain.
- d3m0t3p 7 months ago
  Increasing the dimension causes a lot more computation, this is one of the main reason. You can see evidence of this in the multi head where the dim is reduced via a linear projection.
  h_i = attention(W_i^Q Q^T @ W_i^K K) W_i^v V
  h = W_o @ concat(h_1...h_8)
  bfelbo 7 months ago
  How many dimensions would you need to increase by to capture positional information?
  Seems to me like it’d be a quite low number compared to the dimensionality of the semantic vectors?
throwawaymaths 7 months ago
Maybe someone could answer this for me: it seems like encoding the positional embeddings as augmentations to the "natural" activations instead of as their own inputs (concatenated onto the activations) make things like sliding a window much harder... I guess obviously the drawback is you have a somewhat less textually derived information.
I recall a early transformers video where they tried both and it turned out that adding the position onto the existing vectors was no worse so they went with it... No further discussion about motivations happened in that video.
Is it worth revisiting that maybe now that activations have a gobsmackingly large dimension?
- stephantul 7 months ago
  They are not concatenated, but summed. I think concatenation wouldn’t work, as you indicate.
  I think you mean the line in the original paper where they say compared the learned attention weights with the predefined encoding, and it made no difference.
  throwawaymaths 7 months ago
  > I think concatenation wouldn’t work, as you indicate.
  Why do you say that?
  donkeyboy 7 months ago
  Concat could work too although less efficient because you need to make a new tensor.
  Actually summing might learn a concat on its own. Imagine the embedding learned for a token takes up the first N-20 dimensions and leaves the last 20 dimensions as 0. And the positional encoding causes the first N-20 dims to be 0 and the last 20 to encode the information. Then when you sum you are actually concatenating. So I think of them as equivalent except add is more efficient/preserves the dim space, while concat would grow the dim space. And for something like position, which certainly does not need to occupy 1000+ dimensions, it would not make sense to concat all of that since it would be wasteful
  throwawaymaths 7 months ago
  why would you need to make a new tensor?
  Suppose you had a 4096 (llama-2) sized activations. Maybe, you make do with 3084 activations and concatenate 1024 positional activations onto that.
  Then you pass that to Mk Mq Mv and generate K, Q, V.
  The only thing that would change would be the Mff-out, which would now be a (big)x3084 matrix instead of (big)x4096
  In any case you would be retraining, so changing the dims of the tensors I think is not a big deal... In fact in this case they would be smaller (at the cost of fewer interlayer activations), but you would have the same number of tensors.
  > Actually summing might learn a concat on its own.
  But you see the point? You're forcing the model to learn something that maybe it didn't need to. That's like saying "well a fully connected network might learn convolution on its own". Historically breakthroughs in capability have accompanied one of: [more data | more layers | smarter constraints on activations]
  Unless you have some sort of argument that forcing it to learn position has carryover value in generating activations, it seems, naively, a bad idea.

imjonse 7 months ago

I don't think the first code example should work (it indeed says false here).

When given a permuted sequence, the attention output will also be permuted, not identical. The need for positional encodings is due to two tokens resulting in the same value in the final attention matrix regardless of the tokens' absolute and relative position; that is enough to miss a lot of meaning.

aconz2 7 months ago
To add on since this took me a while to understand: for a single token, self attention is permutation invariant because we take the qK (one query dot all the other keys) weighted sum of all the values; that sum is what gives the invariance because + is commutative. But for all the tokens, the mha output matrix will not be invariant, but rather equivariant, where you apply the same permutation to the output matrix as you did to the input tokens. What might be a more useful example is to take one position, like the last one, and compute its mha for every permutation of the previous tokens; those will/should all be the same.

FL33TW00D 7 months ago

The first code example says False because of high precision, I've updated the example.

jmmcd 7 months ago

But u/imjonse's reasoning seems right. I haven't run either version of the code, but when reading it I expected that to be False. The output is still a list with an order.

the dog chased the cat: position 1 in the output is attention(dog, everything)

the cat chased the dog: position 1 in the output is attention(cat, everything)

FL33TW00D 7 months ago

Run the code and look at the values!

jmmcd 7 months ago

Well, yes, I deserved that reply! And yes the code is printing True. It's not that I disbelieved you... but something is wrong here. Investigation below, thanks to Claude.ai for walking me through it!

    In [10]: o1[0, :, :3]
    Out[10]:
    tensor([[ 0.0053,  0.0017, -0.0012],
        [ 0.0053,  0.0017, -0.0012],
        [ 0.0053,  0.0017, -0.0012],
        [ 0.0053,  0.0017, -0.0012],
        [ 0.0053,  0.0017, -0.0012],
        [ 0.0053,  0.0017, -0.0012]],       grad_fn=<SliceBackward0>)

Every token has the same attention values. I expect attention(cat, everything) to differ from attention(dog, everything), even without positional encoding.

Further, the attention weights are uniform and identical for both sentences:

    In [46]: o1, aw1 = mha(W_q(e1), W_k(e1), W_v(e1))
    In [47]: o2, aw2 = mha(W_q(e2), W_k(e2), W_v(e2))
    In [48]: aw1.shape
    Out[48]: torch.Size([1, 6, 6])
    In [49]: aw2.shape
    Out[49]: torch.Size([1, 6, 6])
    In [50]: aw1
    Out[50]:
    tensor([[[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]]],
       grad_fn=<MeanBackward1>)

    In [51]: aw2
    Out[51]:
    tensor([[[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]]],
       grad_fn=<MeanBackward1>)

That is not expected. It's because the Linear layers are initialised with such small values. And the softmax causes a collapse.

Trying random weights on a larger scale:

    In [52]: W_q.weight.data *= 100
         W_k.weight.data *= 100
         W_v.weight.data *= 100

    In [55]: o1, aw1 = mha(W_q(e1), W_k(e1), W_v(e1))
    In [56]: o2, aw2 = mha(W_q(e2), W_k(e2), W_v(e2))
    In [57]: aw1
    Out[57]:
    tensor([[[0.2049, 0.1606, 0.1256, 0.1095, 0.1723, 0.2270],
         [0.0883, 0.2047, 0.1544, 0.2776, 0.1405, 0.1345],
         [0.1196, 0.1719, 0.1831, 0.1541, 0.1374, 0.2339],
         [0.1413, 0.2399, 0.1617, 0.2056, 0.1634, 0.0880],
         [0.1455, 0.1432, 0.2432, 0.1239, 0.1494, 0.1948],
         [0.1897, 0.1817, 0.1920, 0.1478, 0.1618, 0.1270]]],
       grad_fn=<MeanBackward1>)

    In [58]: aw2
    Out[58]:
    tensor([[[0.2049, 0.1606, 0.2270, 0.1095, 0.1723, 0.1256],
         [0.0883, 0.2047, 0.1345, 0.2776, 0.1405, 0.1544],
         [0.1897, 0.1817, 0.1270, 0.1478, 0.1618, 0.1920],
         [0.1413, 0.2399, 0.0880, 0.2056, 0.1634, 0.1617],
         [0.1455, 0.1432, 0.1948, 0.1239, 0.1494, 0.2432],
         [0.1196, 0.1719, 0.2339, 0.1541, 0.1374, 0.1831]]],
       grad_fn=<MeanBackward1>)

    In [60]: o1[:, :, :5]
    Out[60]:
    tensor([[[ 0.0145,  0.3128, -0.3659, -0.1884,  0.1724],
         [-0.2319,  0.1407, -0.6010, -0.4064,  0.4259],
         [-0.3231,  0.1622, -0.6351, -0.1711,  0.4014],
         [-0.0596,  0.2610, -0.7388, -0.2987,  0.3214],
         [-0.2750,  0.0676, -0.4140, -0.2024,  0.3383],
         [-0.1434,  0.0871, -0.3154, -0.0755,  0.3314]]],
       grad_fn=<SliceBackward0>)

    In [61]: o2[:, :, :5]
    Out[61]:
    tensor([[[ 0.0145,  0.3128, -0.3659, -0.1884,  0.1724],
         [-0.2319,  0.1407, -0.6010, -0.4064,  0.4259],
         [-0.1434,  0.0871, -0.3154, -0.0755,  0.3314],
         [-0.0596,  0.2610, -0.7388, -0.2987,  0.3214],
         [-0.2750,  0.0676, -0.4140, -0.2024,  0.3383],
         [-0.3231,  0.1622, -0.6351, -0.1711,  0.4014]]],
       grad_fn=<SliceBackward0>)

    In [62]: print("Matches: ", torch.allclose(o1, o2, atol=1e-6))
    Matches:  False

FL33TW00D 7 months ago
Hm! Very interesting! Thank you for taking the time to debug that.
I'm going to have to think hard about how to rewrite the motivating example to explain this best.
Edit: updated the post, thanks for pointing out the pernicious init values!

breadislove 7 months ago
There is this really interesting blog post about making rope (by the main author of the paper) multimodal as used by qwen2 vl. it's in chinese but google translate does a pretty good job: https://spaces.ac.cn/archives/10040
1024core 7 months ago
I didn't get the sudden leap from "position encodings" to "QKV" magic.
What is the connection between the two? Where does "Q" come from? What are "K" and "V"? (I know they stand for "Query", "Key", "Value"; but what do they have to do with position embeddings?)
- flebron 7 months ago
  All of them are vectors of embedded representations of tokens. In a transformer, you want to compute the inner product between a query (the token who is doing the attending) and the key (the token who is being attended to). An inductive bias we have is that the neural network's performance will be better if this inner product depends on the relative distance between the query token's position, and the key token's position. We thus encode each one with positional information, in such a way that (for RoPE at least) the inner product depends only on the distance between these tokens, and not their absolute positions in the input sentence.
- FL33TW00D 7 months ago
  "This post intends to limit the mathematical knowledge required to follow along, but some basic linear algebra, trigonometry and understanding of self attention is expected."
  If you're not sure on self attention, the post will be a little unclear
Scene_Cast2 7 months ago
If you're interested in positional embeddings for Transformers, check out this repo - https://github.com/gazelle93/Attention-Various-Positional-En... - it implements various popular ones.
majke 7 months ago
The best explanation of classic positional encoding I found in this blog https://medium.com/mantisnlp/positional-encodings-i-main-app...
Der_Einzige 7 months ago
Similarly, "you" could have designed state of the art LLM sampling: https://openreview.net/forum?id=FBkpCyujtS&referrer=%5BTasks...
cperciva 7 months ago
The binary coding example would have been much better with Gray codes.
logicchains 7 months ago
Does anyone know why 2D rope implementations apply two separate 1D rotations to pairs, instead of applying a 2d rotation to triplets?
- rini17 7 months ago
  No they apply many rotations, same as the number of dimensions of the embedding space.
JP_DW 7 months ago
The illustrations are really good to look at. I couldn't find thel on your github, are they created with manim?