pamelafox 8 days ago

I actually explicitly left the PCA graphs out of the blog post version, as I think they lose so much information as to be deceiving. That's what I told folks in person at the poster session as well.

I think the other graphs I included aren't deceiving, they're just not quite as fun as an attempt to visualize the similarity space.

godelski 8 days ago

Yeah PCA gets tough. It isn't great for non-linear relationships and I mean that's the whole reason we use activation functions haha. And don't get me started on how people refer to t-SNE as dimensionality reduction instead of visualization...
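To make that concrete, here's a rough sketch (assuming scikit-learn and a toy Swiss roll dataset, nothing from the original post) of how a linear projection like PCA scrambles a non-linear manifold's local structure compared to t-SNE, which only tries to preserve neighborhoods:

    # Rough sketch: PCA vs t-SNE on a non-linear manifold (Swiss roll).
    # Assumes numpy and scikit-learn; purely illustrative.
    import numpy as np
    from sklearn.datasets import make_swiss_roll
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE, trustworthiness

    X, _ = make_swiss_roll(n_samples=1000, random_state=0)

    # PCA: a linear projection that maximizes retained variance.
    X_pca = PCA(n_components=2).fit_transform(X)

    # t-SNE: a visualization method that tries to keep local neighborhoods.
    X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

    # Trustworthiness measures how well local neighborhoods survive the embedding.
    print("PCA   trustworthiness:", trustworthiness(X, X_pca, n_neighbors=10))
    print("t-SNE trustworthiness:", trustworthiness(X, X_tsne, n_neighbors=10))

The point isn't that t-SNE is "better", just that the two answer different questions, which is why calling t-SNE dimensionality reduction is misleading.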

I don't think the other graphs are necessarily deceiving, but they don't capture as much information as we often imply, and that ends up leading people to make wrong assumptions about what is happening in the data.

Embeddings and immersions get really fucking weird in high dimensions. It starts getting weird around 4D and is insane by 10D. The spaces we're talking about are incomprehensible: every piece of geometric intuition you have should be thrown out the window. It won't help you, it will harm you. If you start digging into high dimensional statistics and the metric theory for high dimensions you'll quickly see what I'm talking about, like the craziness of Lp distances and the contraction of variance. You have to really dig into why we prefer L1 over L2 and why even fractional p's are of interest. We run into all kinds of problems with i.i.d. assumptions and the like. It is wild how many assumptions are being made that we generally don't even think about. They seem obvious and natural to use, but they don't work very well when D > 3. I do think the visualizations become useful again once you get used to all of this, but more in the sense that you interpret them as very narrow views, with far less generalization in meaning.
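As a concrete illustration (just a quick numpy sketch, not tied to anything in the post): the relative gap between the nearest and farthest neighbor collapses as dimension grows, and it collapses faster under L2 than L1, which is part of why lower (and even fractional) p's get interesting.

    # Quick sketch of distance concentration: as dimension grows, the relative
    # contrast between the nearest and farthest point from a query collapses.
    # Assumes numpy; the specific numbers are just illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    for d in [2, 10, 100, 1000]:
        X = rng.uniform(size=(1000, d))   # 1000 random points in the unit cube
        q = rng.uniform(size=d)           # a random query point
        for p, name in [(1, "L1"), (2, "L2")]:
            dists = np.linalg.norm(X - q, ord=p, axis=1)
            contrast = (dists.max() - dists.min()) / dists.min()
            print(f"d={d:5d} {name}: relative contrast = {contrast:.3f}")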

I'm not trying to dunk on your post; I think it's fine. But our ML community needs to be having more conversations about these limits. We're really running into issues with them.

dleeftink 8 days ago

I think the heatmap + dendrogram approach can be useful for high dimensional comparisons (to a degree). Check out ClustVis for an interactive demo[0].

[0]: https://biit.cs.ut.ee/clustvis/
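For anyone curious, here's a rough sketch of that kind of heatmap + dendrogram view using seaborn's clustermap (with random embeddings as a stand-in for real data; this is not the ClustVis tool itself):

    # Minimal sketch of a clustered heatmap (heatmap + dendrograms).
    # Assumes numpy, seaborn, matplotlib; the data is random, just a stand-in
    # for a real (items x dimensions) embedding matrix.
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    emb = rng.normal(size=(30, 64))        # 30 items, 64-dim embeddings

    # Pairwise cosine similarity between items.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T

    # Hierarchical clustering on rows/columns plus a heatmap of similarities.
    sns.clustermap(sim, method="average", cmap="viridis")
    plt.show()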

been-jammin 8 days ago

I think visually, so this is very helpful, thanks. I also agree that once you get into higher dimensionality it becomes difficult to represent visually. Nevertheless, it's helpful for an 'old' (50) computer scientist wrapping my head around AI concepts so I can keep up with my team.

godelski 8 days ago

I'm a very visual person too, but at first that harmed me when working in high dimensions. I was generalizing too much instead of treating the visualizations as very narrow windows into the data.