godelski 8 days ago

  >  I like the histograms of similarity the best but they are in the weeds a lot with things like "god" ~ "dog".
I do like those too. But I think we have a tendency to misinterpret which direction we're moving in; the coordinate system is highly non-intuitive. It shouldn't be all that surprising that if we draw an n-ball around "dog" we get things that are more semantically meaningful, like "cat" or "animal", but jesus christ, we're in over a thousand dimensions. It shouldn't be surprising that one of those directions is letter permutation. I can't even think of what a thousand meaningful directions would be!

Honestly, I think we should be more surprised that cosine similarity even works! In that many dimensions, randomly chosen vectors are almost always nearly orthogonal, so everything "should" look unrelated. But clearly the manifold hypothesis is giving us a big leg up here, along with the semantic biases built into language.
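A minimal sketch of that point (not from the thread, just to make the near-orthogonality concrete): cosine similarity between random vectors concentrates around 0 as dimension grows, which is exactly why it's surprising that it carries so much signal for learned embeddings.

  import numpy as np

  rng = np.random.default_rng(0)

  for dim in (2, 32, 1024):
      a = rng.standard_normal((1000, dim))
      b = rng.standard_normal((1000, dim))
      cos = np.sum(a * b, axis=1) / (
          np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
      )
      print(f"dim={dim:5d}  mean |cos| = {np.abs(cos).mean():.3f}")

  # The mean |cos| shrinks roughly like 1/sqrt(dim):
  # random directions in high dimensions are nearly orthogonal.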

People wildly underestimate how complex this topic is. It's baffling. It's mindblowing. And that's why it is so awesome and should excite people! I think we're doing a lot of footgunning by thinking this stuff is simple or solved. It is a wonderfully rich topic with so much left to discover.

PaulHoule 7 days ago

I certainly wish we had embeddings that were factorizable. There ought to be some subvectors that contain different sorts of semantic information (big, small, in the future, in the past, animal, vegetable, mineral, is negated, ...), some with syntactic information (verb, noun, English, French, ...), and some with lexical information ("dog"-"god", token is 15 bytes, token is 7 characters, ...).
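Purely as a hypothetical illustration (no existing model guarantees this, and the dimensions and names below are made up): a "factorizable" embedding would mean fixed index ranges carrying those different kinds of information, so you could slice them apart.

  import numpy as np

  # Hypothetical layout for a 768-d embedding; nothing like this is
  # guaranteed by BERT or any other current model.
  SLICES = {
      "semantic":  slice(0, 512),    # big/small, past/future, animal/mineral, ...
      "syntactic": slice(512, 640),  # part of speech, language, ...
      "lexical":   slice(640, 768),  # surface form, token length, ...
  }

  def factor(embedding: np.ndarray) -> dict:
      """Split one 768-d vector into the named subvectors above."""
      return {name: embedding[s] for name, s in SLICES.items()}

  vec = np.random.default_rng(1).standard_normal(768)
  print({name: part.shape for name, part in factor(vec).items()})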

The summer BERT came out I was at a startup that was struggling with text representations: character vectors sort of worked but produced too many tokens, and it didn't seem like an efficient use of a CNN or RNN to make it act as a dictionary. Word vectors didn't really seem to work either. I was hoping we could develop some kind of factorizable vector with one subvector that models the word and another that models its context, but we never got that far. Then BERT came along, and it does give us a vector that represents both word and context; it just isn't factorizable.

godelski 7 days ago

I'm not sure the question is about factorization so much as about having the proper coordinates. Right now our basis vectors point in... well... random directions. So when we naively interpolate between two vectors we're certainly going to pass through unrelated semantic concepts; our basis vectors aren't aligned to those concepts. And LERPing certainly isn't the right way to move along this geometry either; we should be thinking about the geodesics of the surface.
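A small sketch of that LERP-vs-geodesic point, under the assumption that the embeddings of interest sit near a unit hypersphere: linear interpolation cuts through the interior and shrinks the norm, while spherical interpolation (SLERP) follows the great-circle geodesic and keeps every interpolant on the sphere.

  import numpy as np

  def lerp(a, b, t):
      # Straight-line interpolation in the ambient space.
      return (1 - t) * a + t * b

  def slerp(a, b, t):
      # Great-circle interpolation between two unit vectors.
      omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
      return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

  rng = np.random.default_rng(2)
  a = rng.standard_normal(1024); a /= np.linalg.norm(a)
  b = rng.standard_normal(1024); b /= np.linalg.norm(b)

  for t in (0.25, 0.5, 0.75):
      print(f"t={t}: |lerp| = {np.linalg.norm(lerp(a, b, t)):.3f}  "
            f"|slerp| = {np.linalg.norm(slerp(a, b, t)):.3f}")

  # |lerp| dips well below 1 mid-path (the chord passes inside the sphere);
  # |slerp| stays at 1 the whole way.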

We've gone about all this as if we're working with a 2D Cartesian coordinate system. I'm actually impressed things have worked this well! It's crazy how far we've gotten without caring about the structural preservation of our data. But I guess with scale you can kind of sidestep that, forcing the manifold hypothesis. To move forward and tackle these challenges, though? Might be time to get back into the weeds.