I certainly wish we had embeddings that were factorizable. There ought to be subvectors that carry different sorts of semantic information (big, small, in the future, in the past, animal, vegetable, mineral, is negated, ...), syntactic information (verb, noun, English, French, ...), and lexical information ("dog"-"god", token is 15 bytes, token is 7 characters, ...).
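A minimal sketch of what such a layout might look like if we could actually train it, with fixed index ranges carrying each kind of information. The slice names and dimension sizes here are made up purely for illustration:

```python
import numpy as np

# Hypothetical layout: fixed index ranges of the embedding carry
# different kinds of information. Sizes are arbitrary for illustration.
SLICES = {
    "semantic":  slice(0, 256),    # big/small, past/future, animal/vegetable/mineral, negation, ...
    "syntactic": slice(256, 384),  # part of speech, language, ...
    "lexical":   slice(384, 448),  # surface-form features: byte length, character count, ...
}

def subvector(embedding: np.ndarray, kind: str) -> np.ndarray:
    """Read one factor out of a (hypothetically) factorized embedding."""
    return embedding[SLICES[kind]]

emb = np.random.randn(448)          # stand-in for a learned, factorized embedding
sem = subvector(emb, "semantic")    # only the semantic factor
syn = subvector(emb, "syntactic")   # only the syntactic factor
```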
The summer BERT came out, I was at a startup struggling with text representations: character-level vectors sort of worked, but the sequences got too long, and using a CNN or RNN essentially as a dictionary didn't seem like an efficient use of the model. Word vectors didn't really work either. I was hoping we could develop some kind of factorizable vector with one subvector modeling the word and another modeling the context of the word, but we never got that far. Then BERT came along, and it does give us a vector representing both word and context; it just isn't factorizable.
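A rough sketch of the kind of factorization I had in mind, where the two factors stay separable by construction. The embedding functions here are stand-ins (random lookups and mean pooling), not any real API:

```python
import numpy as np

def embed_word(token: str) -> np.ndarray:
    """Stand-in for a static word embedding (fastText-style lookup)."""
    rng = np.random.default_rng(abs(hash(token)) % (2**32))
    return rng.standard_normal(128)

def embed_context(tokens: list[str], position: int) -> np.ndarray:
    """Stand-in for a representation of the surrounding context only."""
    neighbors = tokens[:position] + tokens[position + 1:]
    return np.mean([embed_word(t) for t in neighbors], axis=0)

def factorized_vector(tokens: list[str], position: int) -> np.ndarray:
    """[word subvector | context subvector] -- the factors stay separable,
    unlike a BERT hidden state where word and context are entangled."""
    return np.concatenate([embed_word(tokens[position]),
                           embed_context(tokens, position)])

vec = factorized_vector(["the", "dog", "barked"], position=1)
word_part, context_part = vec[:128], vec[128:]
```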
I'm not sure the question is about factorization so much as about having the proper coordinates. Right now our basis vectors point in... well... random directions, so when we naively interpolate between two vectors we're certainly going to pass through unrelated semantic concepts. Our basis vectors aren't aligned to those concepts. And LERPing certainly isn't the right way to move along this geometry either; we should be thinking about the geodesics of the surface.
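To make the LERP complaint concrete, here's a small comparison of naive linear interpolation against spherical interpolation. SLERP only gives the true geodesic if the embeddings really live on a hypersphere, which is an assumption; the actual data manifold is unknown:

```python
import numpy as np

def lerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Naive linear interpolation: cuts straight through the ambient space."""
    return (1 - t) * a + t * b

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation: follows the great circle between a and b,
    i.e. the geodesic *if* the embeddings lived on a hypersphere."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return lerp(a, b, t)
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

a, b = np.random.randn(768), np.random.randn(768)
mid_lerp, mid_slerp = lerp(a, b, 0.5), slerp(a, b, 0.5)
# For high-dimensional, nearly orthogonal vectors the LERP midpoint has a
# noticeably smaller norm than either endpoint, pulling it toward a region
# no real embedding occupies; the SLERP midpoint keeps a comparable norm.
print(np.linalg.norm(a), np.linalg.norm(mid_lerp), np.linalg.norm(mid_slerp))
```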
We've gone about all of this as if we were working with a 2D Cartesian coordinate system. I'm actually impressed things have worked this well! It's crazy how far we've gotten without any care for preserving the structure of our data. But I guess with scale you can kind of sidestep that, forcing the manifold hypothesis to hold. To move forward and tackle these challenges, though? It might be time to get back into the weeds.