The thing about high-dimensional vector spaces is that when N is large they are strangely different from the N=2 and N=3 cases we are familiar with. For instance, when N=3 you could imagine that a cube is not all that different from a sphere: just sand away the corners. If N=10,000, though, the cube has 2^10,000 corners, each a distance of sqrt(10,000) = 100 from the origin (for the cube spanning -1 to 1 on each axis), whereas the unit sphere never gets past 1. Hypercubes look something like this
https://www.amazon.com/Torre-Tagus-901918B-Spike-Sphere/dp/B...
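To make that concrete, here is a minimal numpy check (mine, not the article's) of the corner-distance claim and of how far typical points of the cube sit from the origin:

```python
import numpy as np

N = 10_000
rng = np.random.default_rng(0)

# A corner of the cube [-1, 1]^N: every coordinate is +1 or -1.
corner = rng.choice([-1.0, 1.0], size=N)
print(np.linalg.norm(corner))      # sqrt(N) = 100.0

# Even a typical point drawn uniformly from the cube is far out...
point = rng.uniform(-1.0, 1.0, size=N)
print(np.linalg.norm(point))       # ~ sqrt(N/3) ≈ 57.7

# ...while every point on the unit sphere is at distance exactly 1.
on_sphere = rng.normal(size=N)
on_sphere /= np.linalg.norm(on_sphere)
print(np.linalg.norm(on_sphere))   # 1.0
```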
A consequence of that is that many visualizations give people the wrong idea, so I wouldn't try too hard to visualize these spaces.
Of everything in the article I like the histograms of similarity the best, but they are deep in the weeds with things like "god" ~ "dog". When I was building search engines I looked a lot at graphs that showed the similarity distribution of relevant vs. irrelevant results.
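Roughly this kind of plot, for what it's worth (my own sketch; it assumes you already have embedded query/document pairs and relevance labels from judgments or clicks):

```python
import numpy as np
import matplotlib.pyplot as plt

def similarity_histograms(query_vecs, doc_vecs, relevant):
    """Overlay cosine-similarity histograms for relevant vs. irrelevant pairs.

    query_vecs, doc_vecs: arrays of shape (n_pairs, dim); relevant: boolean array.
    """
    sims = np.sum(query_vecs * doc_vecs, axis=1) / (
        np.linalg.norm(query_vecs, axis=1) * np.linalg.norm(doc_vecs, axis=1)
    )
    plt.hist(sims[relevant], bins=50, density=True, alpha=0.5, label="relevant")
    plt.hist(sims[~relevant], bins=50, density=True, alpha=0.5, label="irrelevant")
    plt.xlabel("cosine similarity")
    plt.legend()
    plt.show()
```

The more those two histograms overlap, the less the similarity score is actually telling you about relevance.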
I'll argue bitterly about word embeddings being "very good" for anything. That similarity distribution actually looks pretty good, but my experience is that when you are looking at N words, word vectors look promising when N=5 but break down completely once N>50 or so. I've worked on teams that were considering both RNN and CNN models. My thinking was that if word embeddings had any knowledge in them that a deep model could benefit from, you should also be able to train a classical ML model (say some kind of SVM) to classify words on some characteristic like "is a color", "is a kind of person", or "can be used as a verb", but I could never get it to work.
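The probe I have in mind is roughly this (a sketch, not my actual setup; it assumes gensim's downloadable GloVe vectors and a tiny hand-labeled word list):

```python
import gensim.downloader as api
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Pre-trained 50-d GloVe word vectors.
vectors = api.load("glove-wiki-gigaword-50")

# Tiny hand-labeled set for the "is a color" characteristic.
positives = ["red", "blue", "green", "yellow", "purple", "orange", "pink", "brown"]
negatives = ["dog", "run", "idea", "seven", "table", "quickly", "france", "happy"]
words = positives + negatives

X = np.array([vectors[w] for w in words])
y = np.array([1] * len(positives) + [0] * len(negatives))

clf = SVC(kernel="rbf", C=1.0)
print(cross_val_score(clf, X, y, cv=4).mean())
```

On a toy list like this it tends to look fine; the claim is that it falls apart as the word list and the set of characteristics grow.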
I went looking and never found that anyone had published positive or negative results for such a classifier. My feeling is that it's a terrible tarpit: when N is tiny it almost seems to work, but as N increases it always falls apart. Between the bias against publishing negative results, the tendency of people who get negative results to blame themselves rather than the word embeddings, and the hype around word embeddings, those results never got published.
I collect papers from arXiv where people do some boring text classification task, because I do boring text classification tasks, and I facepalm so often: people try 15 or so algorithms, most of which never work well, and word embeddings are reliably in that category. If people tried a few classical ML algorithms with bag-of-words plus pooled ModernBERT, they'd sample a good segment of the efficient frontier. A BERT embedding doesn't just capture the word, it captures the meaning of the word in context, which is night and day when it comes to relevance: matching the synonyms of all the different senses of a word brings in as many or more irrelevant matches as relevant ones.
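Something like this pair of baselines is what I mean (a sketch under my assumptions: the Hugging Face model id for ModernBERT, mean pooling over the last hidden state, and a plain linear model on top):

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from transformers import AutoModel, AutoTokenizer

# Baseline 1: bag-of-words + a linear model.
bow_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))

# Baseline 2: mean-pooled ModernBERT embeddings + the same linear model.
MODEL_ID = "answerdotai/ModernBERT-base"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(texts, batch_size=16):
    pooled = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state           # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
        pooled.append((hidden * mask).sum(1) / mask.sum(1))   # mean over real tokens
    return torch.cat(pooled).numpy()

# bow_clf.fit(train_texts, y_train)
# LogisticRegression(max_iter=1000).fit(embed(train_texts), y_train)
```

Two cheap baselines like these bracket a lot of the efficient frontier that the 15-algorithm grab bag is trying to sample.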
> I like the histograms of similarity the best but they are in the weeds a lot with things like "god" ~ "dog".
I do like those too. But I think we have a tendency to misinterpret what direction we're moving in; the coordinate system is highly non-intuitive. It shouldn't be all that surprising that if we create an n-ball around "dog" we get things that are more semantically meaningful, like "cat" or "animal", but jesus christ, we're in over a thousand dimensions. It shouldn't be surprising that one of those directions is letter permutation. I can't even think of what a thousand meaningful directions would be! Honestly, I think we should be more surprised that cosine similarity even works: random vectors in that many dimensions should be nearly orthogonal. But clearly the manifold hypothesis is giving us a big leg up here, along with the semantic biases built into language.
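The "everything should be orthogonal" intuition is easy to check numerically (my sketch; nothing here is specific to any embedding model):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 1024, 10_000

a = rng.normal(size=(n_pairs, dim))
b = rng.normal(size=(n_pairs, dim))
cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Random directions in 1024-d concentrate tightly around cosine 0 (std ≈ 1/sqrt(dim)).
print(cos.mean(), cos.std())   # ≈ 0.0, ≈ 0.03
```

So when real embeddings of related words land well above zero, that really is a lot of shared structure, not noise.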
People wildly underestimate how complex this topic is. It's baffling. It's mindblowing. And that's why it is so awesome and should excite people! I think we're doing a lot of footgunning by thinking this stuff is simple or solved. It is a wonderfully rich topic with so much left to discover.
I certainly wish we had embeddings that were factorizable. There ought to be subvectors that contain different sorts of semantic information (big, small, in the future, in the past, animal, vegetable, mineral, is negated, ...), syntactic information (verb, noun, English, French, ...), and lexical information ("dog"-"god", token is 15 bytes, token is 7 characters, ...).
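A rough sketch of what I mean by factorizable (entirely hypothetical; the subvector names and sizes are made up for illustration):

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Hypothetical embedding whose vector is a concatenation of named subvectors,
    each meant to carry one kind of information and be readable on its own."""

    def __init__(self, vocab_size, d_semantic=128, d_syntactic=32, d_lexical=32):
        super().__init__()
        self.semantic = nn.Embedding(vocab_size, d_semantic)     # animal/vegetable/mineral, tense, negation, ...
        self.syntactic = nn.Embedding(vocab_size, d_syntactic)   # part of speech, language, ...
        self.lexical = nn.Embedding(vocab_size, d_lexical)       # spelling- and length-level features
        self.slices = {
            "semantic": slice(0, d_semantic),
            "syntactic": slice(d_semantic, d_semantic + d_syntactic),
            "lexical": slice(d_semantic + d_syntactic, d_semantic + d_syntactic + d_lexical),
        }

    def forward(self, token_ids):
        # The full vector is the concatenation; each factor stays addressable via self.slices.
        return torch.cat([self.semantic(token_ids),
                          self.syntactic(token_ids),
                          self.lexical(token_ids)], dim=-1)
```

The structure is the easy part; nothing here forces the subvectors to actually carry those kinds of information, and that is the hard part.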
The summer BERT came out I was at a startup that was struggling with text representations: character vectors sort of worked but meant too many tokens, and it seemed like an inefficient use of a CNN or RNN to make it learn a dictionary. Word vectors didn't really seem to work. I was hoping we could develop some kind of factorizable vector with one subvector that models the word and another that models the context of the word, but we never got that far. Then BERT came along, and it does give us a vector which represents word and context; it's just not factorizable.
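That "word and context in one vector" point is easy to see directly (a sketch with bert-base-uncased; the example sentences are mine):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def token_vector(sentence, word):
    """Contextual vector of the first occurrence of `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    hidden = model(**enc).last_hidden_state[0]                        # (T, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

a = token_vector("the bank raised interest rates", "bank")
b = token_vector("we sat on the bank of the river", "bank")
print(torch.cosine_similarity(a, b, dim=0).item())  # well below 1: same word, different contexts
```

The two "bank" vectors differ because the context is baked in, but there's no clean way to split either of them back into a "bank" part and a "this particular sentence" part.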
I'm not sure the question is about factorization so much as about having the proper coordinates. Right now our basis vectors are pointing in... well... random directions. So when we naively interpolate between two vectors we're certainly going to pass through unrelated semantic concepts; our basis vectors aren't aligned to those concepts. And LERPing certainly isn't the right way to move along this geometry either; we should be thinking about the geodesics of the surface.
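Here's the contrast I mean in its simplest form (my sketch; slerp is only the geodesic of the unit sphere, which is at best a crude stand-in for whatever manifold the embeddings actually live on):

```python
import numpy as np

def lerp(a, b, t):
    # Straight line in the ambient space; leaves the sphere and shrinks the norm mid-path.
    return (1 - t) * a + t * b

def slerp(a, b, t):
    # Great-circle (geodesic) interpolation between two unit vectors.
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(0)
a, b = rng.normal(size=768), rng.normal(size=768)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)

print(np.linalg.norm(lerp(a, b, 0.5)))   # ≈ 0.71: the midpoint collapses toward the origin
print(np.linalg.norm(slerp(a, b, 0.5)))  # 1.0: stays on the sphere
```

And the real geometry is presumably much messier than a sphere, which is the point: a straight line between two embeddings has no reason to stay anywhere near the data.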
We've gone about all this as if we're working with a 2D Cartesian coordinate system. I'm actually impressed things have worked this well! It's crazy how far we've gotten without much care for preserving the structure of our data. But I guess with enough scale you can kind of sidestep that and lean on the manifold hypothesis. To move forward and tackle these challenges, though? Might be time to get back into the weeds.