anonymoushn 7 days ago

https://x.com/alexjc/status/1881410039639863622

I agree that it would be more useful to compare vocabs of identical size.

pama 7 days ago

Thanks! It seems to me that the performance gain here was due to a smaller vocab size. This type of change is almost guaranteed to backfire for larger models and larger datasets / lower loss requirements, so it probably is not very useful. Generally, the historical trend has been to use larger vocabularies as models themselves got larger.

anonymoushn 6 days ago

Well, he says he achieved a downstream win at the same vocab size, but it didn't translate into a win in perplexity, so he tried something else. Like I said, it's unfortunate.

anonymoushn 6 days ago

I actually wonder if he could just claim a win by computing validation-set bits per byte (BPB) for both equally-sized vocabs instead of targeting the same perplexity as the speedrun finish line lol
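For context: per-token perplexity depends on how the tokenizer splits text, so two models with different vocabs aren't directly comparable on it, whereas BPB normalizes total loss by the byte length of the same validation text. A minimal sketch of the distinction, with made-up numbers (`bits_per_byte` is an illustrative helper, not anyone's actual eval code):

```python
import math

def bits_per_byte(token_nll_nats, text_bytes):
    """Convert summed per-token NLL (in nats) to bits per byte.

    Unlike per-token perplexity, BPB divides by the byte length of
    the raw validation text, so models with different tokenizers are
    comparable on the same text.
    """
    total_nll = sum(token_nll_nats)
    return total_nll / (math.log(2) * text_bytes)

# Hypothetical: same 1000-byte validation text, two tokenizers.
text_bytes = 1000

# Tokenizer A: 250 tokens at an average of 2.0 nats/token.
bpb_a = bits_per_byte([2.0] * 250, text_bytes)

# Tokenizer B: 200 tokens (coarser segmentation), 2.4 nats/token.
bpb_b = bits_per_byte([2.4] * 200, text_bytes)

# Per-token perplexities (exp(2.0) vs exp(2.4)) suggest A is better,
# but B spends fewer total bits on the same bytes: BPB ranks B ahead.
```

The point is just that BPB measures total coding cost per byte of identical text, which sidesteps the "same perplexity target" problem when the token counts differ.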