galaxyLogic 8 days ago

> The text-embedding-ada-002 model accepts up to 8192 "tokens", where a "token" is the unit of measurement for the model (typically corresponding to a word or syllable),

So the "input" is up to 8192 "units of measurement". What would that mean in practice? How are the units of measurement produced? Can they be anything?

striking 8 days ago

They're produced by a tokenizer, which splits the input text into pieces from a fixed vocabulary. Technically the pieces could be anything (even raw bytes), but you get better results by choosing a better tokenization strategy.
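
To make that concrete, here is a toy sketch comparing two strategies on the same text: one token per raw byte, and a greedy longest-match against a small vocabulary. The vocabulary `VOCAB` and both function names are made up for illustration; real tokenizers (byte-pair encoding, for instance) learn their vocabulary from data, and this is not how any particular model's tokenizer is implemented.

```python
# Toy illustration only. VOCAB is a hypothetical, hand-picked vocabulary;
# real tokenizers learn theirs from a large corpus.
VOCAB = ["unit", "of", "measure", "ment", " "]


def byte_tokens(text: str) -> list[int]:
    """One token per UTF-8 byte: always works, but sequences get long."""
    return list(text.encode("utf-8"))


def greedy_tokens(text: str, vocab: list[str]) -> list[str]:
    """Greedy longest-match against vocab, falling back to single characters."""
    pieces = sorted(vocab, key=len, reverse=True)  # try longest pieces first
    out, i = [], 0
    while i < len(text):
        match = next((p for p in pieces if text.startswith(p, i)), text[i])
        out.append(match)
        i += len(match)
    return out


text = "unit of measurement"
print(len(byte_tokens(text)))      # 19 tokens, one per byte
print(greedy_tokens(text, VOCAB))  # ['unit', ' ', 'of', ' ', 'measure', 'ment']
```

The same 19 characters cost 19 tokens with raw bytes but only 6 with the toy vocabulary, which is one reason the choice of strategy matters when the input is capped at a fixed number of tokens.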

mdp2021 8 days ago

A "token" is a basic unit of meaning (for example, 'unit', 'of', 'mean', '-ing') that has its own coordinates in an embedding space: one scalar per dimension, where the dimensions represent concepts that the machine reconstructs.

Allowing more tokens in an input means (as I understand it, in this context) that the model can output the embedding coordinates not just of a unit of meaning, not just of a word, but of whole expressions, sentences or paragraphs ('go'; 'going'; 'going away'; 'they are going away'; 'Mr. Fitzgerald and his wife are going away'...).
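
A minimal sketch of that idea, under heavy assumptions: each token gets a made-up 3-dimensional vector, and a whole input is embedded by averaging its token vectors and normalizing the result. `TOKEN_VECTORS` and `embed` are hypothetical names, and mean pooling is just one simple way to get a single vector from many tokens; it is not a claim about how text-embedding-ada-002 works internally.

```python
import math

# Hypothetical 3-dimensional token vectors; real models learn vectors
# with hundreds or thousands of dimensions.
TOKEN_VECTORS = {
    "they": [0.1, 0.0, 0.9],
    "are": [0.0, 0.2, 0.1],
    "go": [0.9, 0.1, 0.0],
    "ing": [0.2, 0.8, 0.0],
    "away": [0.7, 0.0, 0.3],
}
MAX_TOKENS = 8192  # the input limit quoted at the top of the thread


def embed(tokens: list[str]) -> list[float]:
    """Mean-pool token vectors into one fixed-size vector, then L2-normalize."""
    if not tokens:
        raise ValueError("need at least one token")
    tokens = tokens[:MAX_TOKENS]  # anything past the limit is dropped here
    dims = len(next(iter(TOKEN_VECTORS.values())))
    mean = [sum(TOKEN_VECTORS[t][d] for t in tokens) / len(tokens)
            for d in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


# One token or five tokens: the output always has the same number of coordinates.
print(len(embed(["go"])))
print(len(embed(["they", "are", "go", "ing", "away"])))
```

The point is only that the output is a single set of coordinates whose size does not depend on the input length, so a word, a sentence and a paragraph all end up as points in the same space.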