Tokenization
The process of dividing the input to an AI model into discrete units, each with a numerical ID, called tokens. Depending on the type of input (such as text, audio, or an image), tokens might correspond to words or subwords of text, or to slices/blocks of pixels in an image.
For example, consider the sentence, "The cat sat on the mat." A word-level tokenizer might split this sentence into the following words: "The," "cat," "sat," "on," "the," "mat." Then it replaces each word with a token (a number). The token "vocabulary" (the mapping of words to numbers) is predetermined and may vary from model to model.
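A minimal sketch of word-level tokenization in Python (the tiny vocabulary here is invented purely for illustration):

```python
# Word-level tokenization sketch; this five-word vocabulary is made up for
# illustration, and a real model's vocabulary is fixed ahead of time.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def tokenize(sentence: str) -> list[int]:
    # Lowercase and drop punctuation so "The" and "mat." match vocabulary entries.
    words = sentence.lower().replace(".", "").replace(",", "").split()
    return [vocab[word] for word in words]

print(tokenize("The cat sat on the mat."))  # [0, 1, 2, 3, 0, 4]
```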
But tokenizers in large language models (LLMs) are much more sophisticated than that. Among other things, they also tokenize punctuation (or combinations of words and punctuation) and break words into subwords, which lets them tokenize words they've never seen before.
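As a rough sketch of the subword idea, here is a greedy longest-match split over a made-up subword vocabulary (real LLM tokenizers learn their vocabularies from data, for example with byte-pair encoding):

```python
# Greedy longest-match subword tokenization sketch; the subword vocabulary
# below is invented for illustration.
subword_vocab = {"token": 10, "iz": 11, "ation": 12, "s": 13, "un": 14, "seen": 15}

def subword_tokenize(word: str) -> list[int]:
    ids, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at the current position.
        for j in range(len(word), i, -1):
            if word[i:j] in subword_vocab:
                ids.append(subword_vocab[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no subword match at position {i} in {word!r}")
    return ids

# Even words the vocabulary has never seen whole can be tokenized in pieces.
print(subword_tokenize("tokenizations"))  # [10, 11, 12, 13]
print(subword_tokenize("unseen"))         # [14, 15]
```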
Because LLMs are trained on these tokens, they don't actually understand words and letters the way we do. They can only recognize and generate information based on the token vocabulary they were trained on. (Popular LLMs have a token vocabulary of over 100,000 tokens.)
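For instance, with OpenAI's open-source tiktoken library (an assumption here; any tokenizer library would illustrate the same point), you can inspect the vocabulary size and see how a sentence maps to token IDs:

```python
# Sketch using the tiktoken library (pip install tiktoken); assumes the
# "cl100k_base" encoding, whose vocabulary has roughly 100,000 tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)                                  # vocabulary size (about 100k)
print(enc.encode("The cat sat on the mat."))        # a list of integer token IDs
print(enc.decode(enc.encode("The cat sat on the mat.")))  # round-trips back to the text
```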