itake 2 days ago

I've been learning Vietnamese. Unfortunately, a lot of social media (Reddit, FB, etc.) uses a new generation of language. The younger generation uses so many abbreviations and acronyms that ChatGPT and Google Translate can't keep up.

I think if your goal is to have properly written language using older writing styles, then you're correct.

ookdatnog 2 days ago

I don't think it's simply a stylistic matter: it seems reasonable to assume that text in books tends to have higher information density, and contains longer and more complicated arguments (when compared to text obtained from social media posts, blogs, shorter articles, etc). If you want models that appear more intelligent, I think you need them to train on this kind of high-quality content.

The fact that these tend to be written in an older writing style is, to me, incidental. You could rewrite all your college textbooks in contemporary social media slang and I would still consider them high-quality texts.