What are tokens and how to use OpenAI Tokenizer to count them?

ChatGPT is built on the GPT-3.5 and GPT-4 models. These models process text in chunks called tokens: the basic units of text, which can be as short as one character or as long as one word.

In the world of natural language processing (NLP) and machine learning, understanding the concept of tokens is crucial. Tokens form the basic building blocks of text processing, allowing machines to comprehend and analyze written language.

Here, we will delve into what tokens are, how a token differs from a character, and how to use the OpenAI Tokenizer to count them.

What are tokens in GPT models?

Tokens are like the building blocks of language. They are small pieces of text that language models use to understand and generate text.

In English, a token can be as short as one character or as long as one word (like “b” or “banana”).

Tokens also matter when using the GPT APIs, because the number of tokens in an API call affects both the cost and the response time. You pay per token, and each model also has a maximum limit on how many tokens it can handle.

What is the difference between a token and a character?

A token is like a puzzle piece in language, while a character is like a single letter of that puzzle piece. Tokens can be small, like one letter, or bigger, like a whole word.

For example, let’s take the word ‘cat’. In this word, ‘c’, ‘a’, and ‘t’ are the characters. When we put these characters together, we get the token ‘cat’. The word ‘cat’ represents just one token.

Think of tokens as the LEGO blocks of language. Just like LEGO blocks can be small or big, tokens can be short (like a single letter) or longer (like a word). They help language models understand and work with different parts of words or sentences.

Therefore, the main difference between a token and a character is that a token can be made up of one or more characters, and it represents a meaningful unit in language. Characters, on the other hand, are the individual letters or symbols that make up those tokens.
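
To make the distinction concrete, here is a minimal Python sketch that compares the character count and the token count of the same word. It uses the tiktoken library (covered later in this article), and the gpt-3.5-turbo model name is an assumption for illustration; exact token counts depend on the encoding.

    import tiktoken

    # Load the encoding for a specific model (model name assumed for illustration).
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

    word = "cat"
    print(len(word))                   # 3 characters
    print(len(encoding.encode(word)))  # typically 1 token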

Why are tokens important?

Tokens are important for several reasons.

Knowing the total number of tokens used is crucial because OpenAI charges based on the token count when you use their GPT APIs. Every token used has a corresponding cost, so by tracking the token count you can estimate your expenses.

It helps you manage the cost and ensures that you stay within the allocated token limit to avoid any unexpected charges or issues with your API calls.
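
As a rough illustration, here is a minimal Python sketch of such an estimate. The per-token price below is a hypothetical placeholder, not an actual OpenAI rate; always check OpenAI’s pricing page for current prices.

    # Hypothetical cost estimate based on a token count.
    price_per_1k_tokens = 0.002   # placeholder price in USD per 1,000 tokens
    total_tokens = 1500           # tokens used by the prompt and response combined

    estimated_cost = total_tokens / 1000 * price_per_1k_tokens
    print(f"Estimated cost: ${estimated_cost:.4f}")  # Estimated cost: $0.0030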

In addition, GPT models, like ChatGPT itself, have length limits, and these limits are counted in tokens. By being aware of the total token count, you can ensure that your input and output remain within the limits set by the model.

Learn more: How to bypass ChatGPT character limit?

This understanding helps you manage content length effectively and avoid exceeding the model’s limits.

What is OpenAI Tokenizer?

Tokens don’t always directly correspond to individual characters in English, making counting them challenging. To simplify this process, OpenAI has developed an accessible web page called Tokenizer. Through the Tokenizer, you can count how many tokens and characters are in the text you input.

This tool breaks your input down into tokens, so you can easily count and determine the number of tokens present in a given text. It simplifies the process of understanding how text is tokenized.

How to Use OpenAI Tokenizer?

Here is a step-by-step guide on how to use the OpenAI tokenizer:

  1. Visit https://platform.openai.com/tokenizer.
  2. Choose between the GPT-3 and Codex models. Codex uses a different encoding that handles spaces more effectively.
  3. Enter the text you want to calculate tokens for.
  4. After entering the text, the total character count and token count will be automatically calculated.
  5. You can also view how the tokens are grouped in your text with the help of colored elements.

How to count tokens in programming languages?

It’s possible to use OpenAI’s GPT models via their API, and you are charged based on the number of tokens used. Therefore, calculating tokens within your software program, without relying on the web Tokenizer page, is crucial. You can leverage libraries developed by OpenAI for this purpose.

Learn more: How to call GPT-4 API?

There are different libraries available for counting tokens in programming languages. In Python, the Tiktoken package provides a programmatic interface for tokenizing text.

It is a fast BPE tokenizer designed for OpenAI models and offers faster performance compared to other open-source tokenizers.

To use Tiktoken:

  1. install it using the “pip install --upgrade tiktoken” command,
  2. import it into your Python file,
  3. load an encoding using the tiktoken.encoding_for_model() method,
  4. turn text into tokens with the encoding.encode() method.
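
Putting these steps together, here is a minimal sketch of counting tokens with Tiktoken; the model name gpt-3.5-turbo and the sample text are assumptions for illustration.

    import tiktoken

    # Load the encoding used by the chosen OpenAI model.
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

    text = "Tokens are the building blocks of language."
    tokens = encoding.encode(text)

    print(tokens)       # the list of integer token IDs
    print(len(tokens))  # the number of tokens in the text

The length of the encoded list is the token count that determines what an API call with that text would cost.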

Other programming languages also have libraries for token counting. In JavaScript, you can use OpenAI’s GPT-3-Encoder, an npm package compatible with Node.js.

For Java, the jtokkit library can be used, while the SharpToken library is available for .NET. In PHP, the GPT-3 Encoder can be utilized.

These libraries enable developers to count the number of tokens in text for different programming languages.