What is a token? Why is a paragraph cut into many small pieces by AI?

AI Encyclopedia Admin 56 views

A token can be understood as the "smallest unit of work" the model uses when it processes text. It does not necessarily equal a character, a word, or a punctuation mark; it is more like a fragment that the model's tokenizer cuts out on its own terms. In English, a single word may be split into several tokens, and in Chinese, even a short sentence may be divided into multiple tokens.

This seems abstract, but it directly affects three very practical things: how much text you can fit in, how much a conversation costs, and why the model sometimes truncates long text. The reason is that the model does not measure its input and output in "paragraphs"; it counts them in tokens.

Why users keep running into it

  • When you upload a long document and the system reports a length limit, the usual cause is that the document exceeds the token limit.
  • A passage may look short to you, yet its actual token count can already be large.
  • Some models give shorter answers, not necessarily because they have nothing more to say, but because the available token budget is running out.
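
The three situations above come down to simple arithmetic on token counts. The following is a minimal sketch, not any provider's actual accounting: `TokenBudget`, `plan_request`, the window size, and the prices are all hypothetical, and it assumes that input and output share one context window.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical limits and prices; real values depend on the model and provider."""
    context_window: int         # assumed max tokens for input + output combined
    price_per_1k_input: float   # assumed price per 1,000 input tokens
    price_per_1k_output: float  # assumed price per 1,000 output tokens


def plan_request(prompt_tokens: int, desired_output_tokens: int,
                 budget: TokenBudget) -> dict:
    """Estimate whether a prompt fits, how long the answer can be, and the cost."""
    fits = prompt_tokens <= budget.context_window
    # Whatever the prompt does not use is left over for the answer.
    room = max(budget.context_window - prompt_tokens, 0)
    output_tokens = min(desired_output_tokens, room)
    cost = 0.0
    if fits:
        cost = (prompt_tokens * budget.price_per_1k_input
                + output_tokens * budget.price_per_1k_output) / 1000
    return {
        "fits": fits,                                      # False -> length-limit error
        "output_tokens": output_tokens,                    # how long the answer can be
        "truncated": output_tokens < desired_output_tokens,
        "estimated_cost": cost,
    }


# Hypothetical numbers purely for illustration.
budget = TokenBudget(context_window=8000, price_per_1k_input=0.5, price_per_1k_output=1.5)
print(plan_request(prompt_tokens=7600, desired_output_tokens=1000, budget=budget))
print(plan_request(prompt_tokens=9000, desired_output_tokens=1000, budget=budget))
```

In the first call the prompt fits, but only 400 tokens remain for the answer, so the reply is cut short. In the second call the prompt alone exceeds the window, which corresponds to the length-limit message.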

Many people meeting tokens for the first time assume they are just a billing unit. In fact, a token is closer to the model's "language granularity". The model first splits the text into tokens, then encodes them, applies attention over them, and generates output one token at a time, so tokens are also a prerequisite for understanding the context window. For Chinese users, punctuation, abbreviations, numbers, and code blocks can push the token count higher than intuition suggests. As a result, the same content written in Chinese and in English may occupy quite different numbers of tokens.
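
To make the "splitting into tokens" step concrete, here is a toy sketch. It uses greedy longest-match against a tiny hand-written vocabulary, which is a deliberate simplification: real tokenizers learn much larger vocabularies from data, and the vocabulary `TOY_VOCAB` and the function `toy_tokenize` are made up for illustration only.

```python
def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches.

    Characters not covered by the vocabulary become single-character tokens.
    """
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No vocabulary entry matched: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabulary, not taken from any real model.
TOY_VOCAB = {"token", "ization", "un", "believ", "able", "你好"}

print(toy_tokenize("tokenization", TOY_VOCAB))   # ['token', 'ization']
print(toy_tokenize("unbelievable", TOY_VOCAB))   # ['un', 'believ', 'able']
print(toy_tokenize("你好世界", TOY_VOCAB))         # ['你好', '世', '界']
```

One English word turns into two or three tokens, and a four-character Chinese phrase turns into three, because the pieces depend on what the vocabulary happens to contain rather than on word boundaries.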

The most practical takeaway

If you work on long-text processing, knowledge bases, or prompt design, don't focus only on word count; get into the habit of checking token counts. Word count and token count often diverge, especially when the text mixes Chinese and English or contains code, tables, and heavy punctuation.
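
A quick way to see why word count is a poor proxy is to compare a few easy measurements of the same text. This sketch uses only the Python standard library; the helper name `text_measures` is made up here. None of these numbers is the token count, which only the model's own tokenizer can report. For tokenizers that operate on UTF-8 bytes, the byte count can serve as a rough ceiling, though that is an assumption about the tokenizer and ignores any special tokens.

```python
def text_measures(text: str) -> dict:
    """Return simple size measures for a piece of text (none of them are tokens)."""
    return {
        "words": len(text.split()),              # whitespace-separated chunks
        "characters": len(text),                 # Unicode code points
        "utf8_bytes": len(text.encode("utf-8")), # bytes after UTF-8 encoding
    }


print(text_measures("Hello, world!"))
# {'words': 2, 'characters': 13, 'utf8_bytes': 13}
print(text_measures("你好，世界！"))
# {'words': 1, 'characters': 6, 'utf8_bytes': 18}
```

The Chinese sentence counts as a single "word" because it contains no spaces, yet it takes more bytes than the English one, which is one reason the two languages can end up with quite different token counts.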

Summary: The token is the unit in which the model actually processes text. Only by understanding it can you really understand context, cost, and length limits.
