minimaxir 8 days ago

> I EXPECT them to be. After all, they are the reverse of one another.

That isn't how tokenized inputs work. It's partially the same reason why "how many r's are in strawberry" is a hard problem for LLMs.
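
A quick way to see what the model actually receives (tiktoken here is just an illustrative tokenizer choice, not necessarily what any particular model uses):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")   # one public BPE vocabulary
  ids = enc.encode("strawberry")
  # The model sees a short list of subword IDs, not individual characters,
  # which is part of why letter-counting prompts are hard for it.
  print(ids, [enc.decode([i]) for i in ids])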

All these models are trained for semantic similarity by how they are actually used in relation to other words, so a data point where that doesn't follow intuitively is indeed weird.

godelski 8 days ago

I'm not talking about Tokenization.

It can get confusing because we usually roll tokenization and embedding up into a single process, but tokenization is the translation of our characters into numeric representations. The atomic units themselves are self-discovered during training (bounded by our vocabulary size).

The process is, at a high level: string -> integer -> vec<float>. You are learning the string splits, integer IDs, and vector embeddings. You are literally building a dictionary. The BPE paper is a good place to start[0], but it is far from where we are now.
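
As a minimal sketch of that pipeline (a toy hand-built vocabulary and PyTorch's embedding table standing in for the learned pieces):

  import torch

  # string -> integer: real systems learn the splits (e.g. BPE); here the
  # "dictionary" is written by hand just to show the shape of the pipeline.
  vocab = {"god": 0, "dog": 1, "cat": 2}
  ids = torch.tensor([vocab[w] for w in "god dog".split()])

  # integer -> vec<float>: each ID indexes a row of a learned embedding table.
  table = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)
  print(table(ids).shape)   # torch.Size([2, 4]); the rows are trained with the model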

The embeddings are this data in that latent representation space.

  > All these models are trained for semantic similarity
Citation needed...

There's no really good measure of semantic similarity, so it would be really naive to assume that this must be happening. There is a natural pressure for this to occur because words are generated in a biased way, but that's different from saying they're trained to be semantically similar. There's even a bit of discussion about this in the Word2Vec paper[1], but you should also follow some of the citations to dig deeper.
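
To make that concrete, here's a toy skip-gram run with gensim (my own two-sentence corpus and parameters, purely for illustration): the training objective only asks the model to predict co-occurring words; nothing in it mentions meaning.

  from gensim.models import Word2Vec

  # Skip-gram (sg=1): predict the words that appear near a center word.
  # "Semantic similarity" is never in the loss; nearby vectors just fall
  # out of similar co-occurrence statistics.
  corpus = [["the", "dog", "chased", "the", "cat"],
            ["god", "created", "the", "heavens", "and", "the", "earth"]]
  model = Word2Vec(corpus, vector_size=32, window=2, min_count=1, sg=1)
  print(model.wv.similarity("dog", "cat"))   # cosine between the learned vectors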

You need to think VERY carefully about the vector basis[2]. You can very easily create an infinite number of bases that are isomorphic to the standard Cartesian one. We usually use [[1,0],[0,1]], but there's no reason you can't use some rotation like [[1/sqrt(2), -1/sqrt(2)],[1/sqrt(2),1/sqrt(2)]]. Our (x,y) space is isomorphic to our new (u,v) space, but traveling along our u basis vector is not equivalent to traveling along the x basis vector (\hat{i}) or even the y one (\hat{j}): you are traveling along both of them equally! u is still orthogonal to v and x is still orthogonal to y, but it is a rotation. We can also do something more complex, like using polar coordinates. All of this is equivalent! They all provide linearly independent unit vectors that span our space.
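
A quick numerical version of that point (numpy, arbitrary toy vectors of my own): a rotation gives an equivalent basis even though none of its axes line up with x or y.

  import numpy as np

  # 45-degree rotation; its columns are the new basis vectors u and v,
  # i.e. [[1/sqrt(2), -1/sqrt(2)], [1/sqrt(2), 1/sqrt(2)]]
  theta = np.pi / 4
  R = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

  a, b = np.array([2.0, 1.0]), np.array([0.5, 3.0])

  # Dot products (hence lengths, angles, cosine similarities) are preserved,
  # so the (u, v) description of the plane is fully equivalent to (x, y)...
  print(np.dot(a, b), np.dot(R @ a, R @ b))

  # ...but u itself mixes x and y equally: moving along u is not moving along x.
  print(R[:, 0])   # approximately [0.707, 0.707]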

The point is, the semantics are a happy outcome, not a guaranteed or even specifically trained-for outcome. We should expect it to happen frequently because of how our languages evolved, but the "god"/"dog" example perfectly illustrates how naive that expectation is.

You *CANNOT* train for semantic similarity until you *DEFINE* semantic similarity. That definition needs to be a strong, rigorous, mathematical one, not an ad-hoc Justice Potter Stewart "know it when I see it" kinda policy. The way words are used in relation to other words is definitely not well aligned with semantics. I can talk about cats and dogs or cats and potatoes all day long; the real similarity we'll come up with there is that they're nouns, and that's not much in the way of semantics. Even the examples I gave aren't strictly nouns. Shit gets real fucking messy real fast[3]. It's not just English; it happens in every language[4].

We can get WAY more into this, but no, sorry, that's not how this works.

[0] https://arxiv.org/abs/1508.07909

[1] https://arxiv.org/abs/1301.3781

[2] https://en.wikipedia.org/wiki/Basis_(linear_algebra)

[3] I'll leave you with my favorite example of linguistic ambiguity

  Read rhymes with lead
  and lead rhymes with read
  but read doesn't rhyme with lead
  and lead doesn't rhyme with read
[4] https://en.wikipedia.org/wiki/Lion-Eating_Poet_in_the_Stone_...