# How to count tokens with tiktoken

[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.

Given a text string (e.g., `"tiktoken is great!"`) and an encoding (e.g., `"gpt2"`), a tokenizer can split the text string into a list of tokens (e.g., `["t", "ik", "token", " is", " great", "!"]`).

Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.
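As a rough illustration of points (a) and (b), the sketch below checks a token count against a context limit and estimates a cost. The function name `estimate_request`, the limit, and the price are hypothetical placeholders, not values taken from OpenAI's documentation; substitute the numbers for the model you actually use.

```python
def estimate_request(num_tokens: int, context_limit: int, usd_per_1k_tokens: float) -> tuple[bool, float]:
    """Return (fits_in_context, estimated_cost_in_usd) for a token count."""
    fits = num_tokens <= context_limit
    cost = num_tokens / 1000 * usd_per_1k_tokens
    return fits, cost


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
fits, cost = estimate_request(num_tokens=6, context_limit=2048, usd_per_1k_tokens=0.02)
print(fits, round(cost, 6))
```

The token count itself would come from a tokenizer such as `tiktoken`, as shown in the numbered steps later in this guide.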

`tiktoken` supports three encodings used by OpenAI models:

| Encoding name           | OpenAI models                                       |
|-------------------------|-----------------------------------------------------|
| `gpt2` (or `r50k_base`) | Most GPT-3 models                                   |
| `p50k_base`             | Code models, `text-davinci-002`, `text-davinci-003` |
| `cl100k_base`           | `text-embedding-ada-002`                            |

`p50k_base` overlaps substantially with `gpt2`, and for non-code applications, they will usually give the same tokens.

## Tokenizer libraries and languages

For `gpt2` encodings, tokenizers are available in many languages.
- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))
- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)
- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)
- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)
- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)

(OpenAI makes no endorsements or guarantees of third-party libraries.)

For `p50k_base` and `cl100k_base` encodings, `tiktoken` is the only tokenizer available as of January 2023.
- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)

## How strings are typically tokenized

In English, tokens commonly range in length from one character to one word (e.g., `"t"` or `" great"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `" is"` instead of `"is "` or `" "`+`"is"`). You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer).
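The leading-space convention can be seen without running a tokenizer at all. The sketch below reuses the token strings from the `"tiktoken is great!"` example at the top of this guide and shows that joining them with no separator restores the text; the variable names are illustrative only.

```python
# Token strings from the example earlier in this guide ("gpt2" encoding).
tokens = ["t", "ik", "token", " is", " great", "!"]

# Concatenating the tokens, with no separator, restores the original text.
text = "".join(tokens)
print(text)

# Tokens that begin a new word carry the leading space.
starts_with_space = [tok for tok in tokens if tok.startswith(" ")]
print(starts_with_space)
```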

## 0. Install `tiktoken`

In your terminal, install `tiktoken` with `pip`:

```bash
pip install tiktoken
```

## 1. Import `tiktoken`

In [None]:
from typing import Iterable, Sequence, Optional

import tiktoken


## 2. Load an encoding

Use `tiktoken.get_encoding()` to load an encoding by name.

The first call downloads the encoding's data, so it requires an internet connection. Later runs reuse the downloaded copy and do not need one.

In [None]:
encoding = tiktoken.get_encoding("cl100k_base")


## 3. Turn text into tokens with `encoding.encode()`



The `.encode()` method converts a text string into a list of token integers.

In [None]:
encoding.encode("tiktoken is great!")


Count tokens by taking the length of the list returned by `.encode()`.

In [None]:
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


In [None]:
num_tokens_from_string("tiktoken is great!", "gpt2")
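The same counting idea extends to several strings at once. The following is a minimal sketch, assuming you pass in any function that maps a string to a list of token integers; `total_tokens` and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in used only so the example runs without downloading an encoding.

```python
from typing import Callable, Iterable, Sequence


def total_tokens(strings: Iterable[str], encode: Callable[[str], Sequence[int]]) -> int:
    """Sum the token counts of several strings using the given encode function."""
    return sum(len(encode(s)) for s in strings)


def toy_encode(text: str) -> list[int]:
    """Stand-in tokenizer (NOT a real encoding): one fake token per whitespace-separated word."""
    return list(range(len(text.split())))


print(total_tokens(["tiktoken is great!", "hello world"], toy_encode))
```

With `tiktoken` installed, passing `tiktoken.get_encoding("cl100k_base").encode` as `encode` should give real token counts instead of word counts.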


## 4. Turn tokens into text with `encoding.decode()`

`.decode()` converts a list of token integers to a string.

In [None]:
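# These integers are "gpt2" token ids, so decode them with "gpt2", not the "cl100k_base" encoding loaded above.
tiktoken.get_encoding("gpt2").decode([83, 1134, 30001, 318, 1049, 0])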


Warning: although `.decode()` can be applied to single tokens, it can be lossy for tokens that do not fall on UTF-8 character boundaries.

For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents.

In [None]:
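# The token ids are from "gpt2", so look up their bytes in the "gpt2" encoding.
[tiktoken.get_encoding("gpt2").decode_single_token_bytes(token) for token in [83, 1134, 30001, 318, 1049, 0]]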


(The `b` in front of the strings indicates that the strings are byte strings.)
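To make the UTF-8 boundary warning concrete, here is a small sketch that uses only Python's built-in `bytes` and `str` types. The split point is a hypothetical one chosen for illustration; it does not claim to match how any particular encoding splits this character.

```python
# "お" is a single character that takes three bytes in UTF-8.
char_bytes = "お".encode("utf-8")

# Hypothetical split: pretend a tokenizer put the first two bytes in one token
# and the last byte in the next.
pieces = [char_bytes[:2], char_bytes[2:]]

# Decoding each piece separately loses information: invalid sequences become U+FFFD.
lossy = [piece.decode("utf-8", errors="replace") for piece in pieces]
print(lossy)

# Joining the bytes first and decoding once recovers the character.
recovered = b"".join(pieces).decode("utf-8")
print(recovered)
```

This is why working with bytes is the safer choice when inspecting tokens one at a time.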

## 5. Comparing encodings

Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings.

In [None]:
def compare_encodings(example_string: str) -> None:
    """Prints a comparison of three string encodings."""
    # print the example string
    print(f'\nExample string: "{example_string}"')
    # for each encoding, print the # of tokens, the token integers, and the token bytes
    for encoding_name in ["gpt2", "p50k_base", "cl100k_base"]:
        encoding = tiktoken.get_encoding(encoding_name)
        token_integers = encoding.encode(example_string)
        num_tokens = len(token_integers)
        token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]
        print()
        print(f"{encoding_name}: {num_tokens} tokens")
        print(f"token integers: {token_integers}")
        print(f"token bytes: {token_bytes}")


In [None]:
compare_encodings("antidisestablishmentarianism")


In [None]:
compare_encodings("2 + 2 = 4")


In [None]:
compare_encodings("お誕生日おめでとう")


In [None]:
long_prompt = str(list(range(3000)))
num_tokens_from_string(long_prompt, 'cl100k_base')

In [None]:
import openai

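EMBEDDING_MODEL = 'text-embedding-ada-002'
EMBEDDING_CTX_LENGTH = 8191  # maximum number of input tokens accepted by EMBEDDING_MODEL

# Send the whole prompt in one request. If long_prompt has more than
# EMBEDDING_CTX_LENGTH tokens, the API rejects the request with an error.
openai.Embedding.create(input=long_prompt, model=EMBEDDING_MODEL)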



In [None]:
def truncate_string_tokens(text: str, encoding_name: str = 'cl100k_base', max_tokens: int = EMBEDDING_CTX_LENGTH) -> list[int]:
    """Truncate a string to have `max_tokens` according to the given encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return encoding.encode(text)[:max_tokens]


In [None]:
from itertools import islice

# From: https://docs.python.org/3/library/itertools.html#itertools-recipes
def batched(iterable, n):
    """Batch data into tuples of length n. The last batch may be shorter."""
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(islice(it, n))):
        yield batch


def chunked_tokens(text: str, encoding_name: str = 'cl100k_base', chunk_ctx_length: int = EMBEDDING_CTX_LENGTH):
    """Encode a string into tokens and yield them in chunks of at most `chunk_ctx_length` tokens."""
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    chunks_iterator = batched(tokens, chunk_ctx_length)
    yield from chunks_iterator
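To see what the chunking produces, here is a minimal sketch that uses list slicing instead of the `batched` recipe; slicing is an alternative that works because the token list is already fully in memory. `chunk_by_slicing` and the placeholder integers are illustrative assumptions, not part of `tiktoken`.

```python
def chunk_by_slicing(tokens: list[int], chunk_size: int) -> list[list[int]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least one")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


fake_tokens = list(range(20))  # placeholder integers, not real token ids
chunks = chunk_by_slicing(fake_tokens, chunk_size=8)
print([len(chunk) for chunk in chunks])
```

Every chunk except possibly the last has exactly `chunk_size` tokens, and concatenating the chunks gives back the original list.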


In [None]:
import numpy as np
from tenacity import retry, wait_random_exponential, stop_after_attempt


@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embedding(tokens: Sequence[int], model=EMBEDDING_MODEL) -> list[float]:
    return openai.Embedding.create(input=tokens, model=model)["data"][0]["embedding"]


def len_safe_get_embedding(text: str, model=EMBEDDING_MODEL, max_tokens: int = EMBEDDING_CTX_LENGTH, encoding_name: str = 'cl100k_base', reduction: Optional[str] = None):
    """Embed `text` chunk by chunk so that no single request exceeds `max_tokens` tokens."""
    chunk_embeddings = []
    chunk_lens = []  # number of tokens in each chunk, used as averaging weights
    for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_ctx_length=max_tokens):
        chunk_embeddings.append(get_embedding(chunk, model=model))
        chunk_lens.append(len(chunk))

    if reduction is None:
        # return one embedding per chunk
        return chunk_embeddings
    elif reduction == 'average':
        # weight by chunk token count; len() of an embedding is the same for every chunk
        return np.average(chunk_embeddings, axis=0, weights=chunk_lens).tolist()
    else:
        raise ValueError(f'reduction {reduction} not valid.')
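The `'average'` reduction combines the per-chunk embeddings into a single vector. As a sketch of the arithmetic only, assuming the weights are meant to be each chunk's token count, the pure-Python function below computes the same kind of weighted mean on toy data. The name `weighted_average` and the two-dimensional vectors are made up for illustration.

```python
def weighted_average(vectors: list[list[float]], weights: list[float]) -> list[float]:
    """Average equal-length vectors component-wise, weighting each vector."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * vec[i] for vec, w in zip(vectors, weights)) / total for i in range(dim)]


# Toy 2-dimensional "embeddings" for two chunks of 3 and 1 tokens (made-up values).
chunk_vectors = [[1.0, 0.0], [0.0, 1.0]]
chunk_token_counts = [3, 1]
print(weighted_average(chunk_vectors, chunk_token_counts))
```

The longer chunk pulls the result toward its own embedding, which is the point of weighting by length rather than taking a plain mean.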



