mirror of
https://github.com/james-m-jordan/openai-cookbook.git
synced 2025-05-09 19:32:38 +00:00
823 lines
25 KiB
Plaintext
{
|
||
|
"cells": [
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# How to count tokens with tiktoken\n",
|
||
|
"\n",
|
||
|
"[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n",
|
||
|
"\n",
|
||
|
"Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"gpt2\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n",
|
||
|
"\n",
|
||
|
"Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.\n",
|
||
|
"\n",
|
||
|
"`tiktoken` supports three encodings used by OpenAI models:\n",
|
||
|
"\n",
|
||
|
"| Encoding name | OpenAI models |\n",
|
||
|
"|-------------------------|-----------------------------------------------------|\n",
|
||
|
"| `gpt2` (or `r50k_base`) | Most GPT-3 models |\n",
|
||
|
"| `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |\n",
|
||
|
"| `cl100k_base` | `text-embedding-ada-002` |\n",
|
||
|
"\n",
|
||
|
"`p50k_base` overlaps substantially with `gpt2`, and for non-code applications, they will usually give the same tokens.\n",
|
||
|
"\n",
|
||
|
"## Tokenizer libraries and languages\n",
|
||
|
"\n",
|
||
|
"For `gpt2` encodings, tokenizers are available in many languages.\n",
|
||
|
"- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n",
|
||
|
"- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n",
|
||
|
"- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n",
|
||
|
"- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n",
|
||
|
"- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n",
|
||
|
"\n",
|
||
|
"(OpenAI makes no endorsements or guarantees of third-party libraries.)\n",
|
||
|
"\n",
|
||
|
"For `p50k_base` and `cl100k_base` encodings, `tiktoken` is the only tokenizer available as of January 2023.\n",
|
||
|
"- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n",
|
||
|
"\n",
|
||
|
"## How strings are typically tokenized\n",
|
||
|
"\n",
|
||
|
"In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer)."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## 0. Install `tiktoken`\n",
|
||
|
"\n",
|
||
|
"In your terminal, install `tiktoken` with `pip`:\n",
|
||
|
"\n",
|
||
|
"```bash\n",
|
||
|
"pip install tiktoken\n",
|
||
|
"```"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## 1. Import `tiktoken`"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from typing import Iterable, Sequence, Optional\n",
|
||
|
"\n",
|
||
|
"import tiktoken\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## 2. Load an encoding\n",
|
||
|
"\n",
|
||
|
"Use `tiktoken.get_encoding()` to load an encoding by name.\n",
|
||
|
"\n",
|
||
|
"The first time this runs, it will require an internet connection to download. Later runs won't need an internet connection."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"encoding = tiktoken.get_encoding(\"cl100k_base\")\n"
|
||
|
]
|
||
|
},
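{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"If you know the model name rather than the encoding name, `tiktoken.encoding_for_model()` looks up the matching encoding. The cell below is a minimal sketch: the model name is taken from the table above, the variable name `model_encoding` is only illustrative, and the lookup assumes the installed `tiktoken` version recognizes that model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# look up the encoding from a model name instead of hard-coding the encoding name\n",
"model_encoding = tiktoken.encoding_for_model(\"text-embedding-ada-002\")\n",
"model_encoding.name\n"
]
},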
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## 3. Turn text into tokens with `encoding.encode()`\n",
|
||
|
"\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"The `.encode()` method converts a text string into a list of token integers."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"encoding.encode(\"tiktoken is great!\")\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Count tokens by counting the length of the list returned by `.encode()`."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"def num_tokens_from_string(string: str, encoding_name: str) -> int:\n",
|
||
|
" \"\"\"Returns the number of tokens in a text string.\"\"\"\n",
|
||
|
" encoding = tiktoken.get_encoding(encoding_name)\n",
|
||
|
" num_tokens = len(encoding.encode(string))\n",
|
||
|
" return num_tokens\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"num_tokens_from_string(\"tiktoken is great!\", \"gpt2\")\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## 4. Turn tokens into text with `encoding.decode()`"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"`.decode()` converts a list of token integers to a string."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"encoding.decode([83, 1134, 30001, 318, 1049, 0])\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"[encoding.decode_single_token_bytes(token) for token in [83, 1134, 30001, 318, 1049, 0]]\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"(The `b` in front of the strings indicates that the strings are byte strings.)"
|
||
|
]
|
||
|
},
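{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why single-token `.decode()` can be lossy, the sketch below compares the two methods token by token on a Japanese string, where one character can span several tokens. It uses only the methods shown above; by default, `.decode()` substitutes the Unicode replacement character for bytes that are not valid UTF-8 on their own."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for token in encoding.encode(\"お誕生日おめでとう\"):\n",
"    token_bytes = encoding.decode_single_token_bytes(token)  # exact bytes of this token\n",
"    token_text = encoding.decode([token])  # may contain U+FFFD replacement characters\n",
"    print(token, token_bytes, repr(token_text))\n"
]
},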
|
||
|
{
|
||
|
"attachments": {},
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## 5. Comparing encodings\n",
|
||
|
"\n",
|
||
|
"Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"def compare_encodings(example_string: str) -> None:\n",
|
||
|
" \"\"\"Prints a comparison of three string encodings.\"\"\"\n",
|
||
|
" # print the example string\n",
|
||
|
" print(f'\\nExample string: \"{example_string}\"')\n",
|
||
|
" # for each encoding, print the # of tokens, the token integers, and the token bytes\n",
|
||
|
" for encoding_name in [\"gpt2\", \"p50k_base\", \"cl100k_base\"]:\n",
|
||
|
" encoding = tiktoken.get_encoding(encoding_name)\n",
|
||
|
" token_integers = encoding.encode(example_string)\n",
|
||
|
" num_tokens = len(token_integers)\n",
|
||
|
" token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]\n",
|
||
|
" print()\n",
|
||
|
" print(f\"{encoding_name}: {num_tokens} tokens\")\n",
|
||
|
" print(f\"token integers: {token_integers}\")\n",
|
||
|
" print(f\"token bytes: {token_bytes}\")#%% md\n",
|
||
|
"# How to count tokens with tiktoken\n",
|
||
|
"\n",
|
||
|
"[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n",
|
||
|
"\n",
|
||
|
"Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"gpt2\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n",
|
||
|
"\n",
|
||
|
"Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.\n",
|
||
|
"\n",
|
||
|
"`tiktoken` supports three encodings used by OpenAI models:\n",
|
||
|
"\n",
|
||
|
"| Encoding name | OpenAI models |\n",
|
||
|
"|-------------------------|-----------------------------------------------------|\n",
|
||
|
"| `gpt2` (or `r50k_base`) | Most GPT-3 models |\n",
|
||
|
"| `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |\n",
|
||
|
"| `cl100k_base` | `text-embedding-ada-002` |\n",
|
||
|
"\n",
|
||
|
"`p50k_base` overlaps substantially with `gpt2`, and for non-code applications, they will usually give the same tokens.\n",
|
||
|
"\n",
|
||
|
"## Tokenizer libraries and languages\n",
|
||
|
"\n",
|
||
|
"For `gpt2` encodings, tokenizers are available in many languages.\n",
|
||
|
"- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n",
|
||
|
"- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n",
|
||
|
"- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n",
|
||
|
"- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n",
|
||
|
"- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n",
|
||
|
"\n",
|
||
|
"(OpenAI makes no endorsements or guarantees of third-party libraries.)\n",
|
||
|
"\n",
|
||
|
"For `p50k_base` and `cl100k_base` encodings, `tiktoken` is the only tokenizer available as of January 2023.\n",
|
||
|
"- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n",
|
||
|
"\n",
|
||
|
"## How strings are typically tokenized\n",
|
||
|
"\n",
|
||
|
"In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer)."
|
||
|
],
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
}
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"compare_encodings(\"antidisestablishmentarianism\")\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"compare_encodings(\"2 + 2 = 4\")\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"compare_encodings(\"お誕生日おめでとう\")\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"long_prompt = str(list(range(3000)))\n",
|
||
|
"num_tokens_from_string(long_prompt, 'cl100k_base')"
|
||
|
],
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
}
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"import openai\n",
|
||
|
"\n",
|
||
|
"EMBEDDING_MODEL = 'text-embedding-ada-002'\n",
|
||
|
"EMBEDDING_CTX_LENGTH = 8191\n",
|
||
|
"\n",
|
||
|
"openai.Embedding.create(input=long_prompt, model=EMBEDDING_MODEL)\n",
|
||
|
"\n"
|
||
|
],
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
}
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"def truncate_string_tokens(text: str, encoding_name: str = 'cl100k_base', max_tokens: int = EMBEDDING_CTX_LENGTH) -> list[int]:\n",
|
||
|
" \"\"\"Truncate a string to have `max_tokens` according to the given encoding.\"\"\"\n",
|
||
|
" encoding = tiktoken.get_encoding(encoding_name)\n",
|
||
|
" return encoding.encode(text)[:max_tokens]\n"
|
||
|
],
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
}
|
||
|
},
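{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of the helper above, as a sketch: the truncated token list should be no longer than the limit, and decoding it should give back a prefix of the original text (assuming the cut does not fall inside a multi-byte character, which cannot happen for the all-ASCII `long_prompt`). The variable names are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"truncated_tokens = truncate_string_tokens(long_prompt)\n",
"print(len(truncated_tokens))  # at most EMBEDDING_CTX_LENGTH\n",
"truncated_text = tiktoken.get_encoding('cl100k_base').decode(truncated_tokens)\n",
"print(long_prompt.startswith(truncated_text))\n"
]
},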
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from itertools import islice\n",
|
||
|
"\n",
|
||
|
"# From: https://docs.python.org/3/library/itertools.html#itertools-recipes\n",
|
||
|
"def batched(iterable, n):\n",
|
||
|
" \"\"\"Batch data into tuples of length n. The last batch may be shorter.\"\"\"\n",
|
||
|
" # batched('ABCDEFG', 3) --> ABC DEF G\n",
|
||
|
" if n < 1:\n",
|
||
|
" raise ValueError('n must be at least one')\n",
|
||
|
" it = iter(iterable)\n",
|
||
|
" while (batch := tuple(islice(it, n))):\n",
|
||
|
" yield batch\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"def chunked_tokens(text: str, encoding_name: str = 'cl100k_base', chunk_ctx_length: int = EMBEDDING_CTX_LENGTH):\n",
|
||
|
" encoding = tiktoken.get_encoding(encoding_name)\n",
|
||
|
" tokens = encoding.encode(text)\n",
|
||
|
" chunks_iterator = batched(tokens, chunk_ctx_length)\n",
|
||
|
" yield from chunks_iterator\n"
|
||
|
],
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
}
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"import numpy as np\n",
|
||
|
"from tenacity import retry, wait_random_exponential, stop_after_attempt\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n",
|
||
|
"def get_embedding(tokens: Sequence[int], model=EMBEDDING_MODEL) -> list[float]:\n",
|
||
|
" return openai.Embedding.create(input=tokens, model=model)[\"data\"][0][\"embedding\"]\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"def len_safe_get_embedding(text: str, model=EMBEDDING_MODEL, max_tokens: int = EMBEDDING_CTX_LENGTH, encoding_name: str = 'cl100k_base', reduction: Optional[str] = None):\n",
|
||
|
" chunk_embeddings = []\n",
|
||
|
" for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_ctx_length=max_tokens):\n",
|
||
|
" chunk_embeddings.append(get_embedding(chunk, model=model))\n",
|
||
|
"\n",
|
||
|
" if reduction is None:\n",
|
||
|
" return chunk_embeddings\n",
|
||
|
" elif reduction == 'average':\n",
|
||
|
" return np.average(chunk_embeddings, axis=0, weights=[len(c) for c in chunk_embeddings]).tolist()\n",
|
||
|
" else:\n",
|
||
|
" raise ValueError(f'reduction {reduction} not valid.')\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"\n"
|
||
|
],
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
}
|
||
|
},
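{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function above can be exercised as follows. This is a sketch that assumes an OpenAI API key is configured for the `openai` package, and it makes real API calls. With `reduction=None` the result is one embedding per chunk; with `reduction='average'` the chunk embeddings are combined into a single vector. The variable names are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"chunk_embeddings = len_safe_get_embedding(long_prompt, reduction=None)\n",
"average_embedding = len_safe_get_embedding(long_prompt, reduction='average')\n",
"\n",
"print(f\"number of chunks: {len(chunk_embeddings)}\")\n",
"print(f\"dimensions of averaged embedding: {len(average_embedding)}\")\n"
]
},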
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"outputs": [],
|
||
|
"source": [],
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
}
|
||
|
}
|
||
|
],
|
||
|
"metadata": {
|
||
|
"kernelspec": {
|
||
|
"display_name": "openai",
|
||
|
"language": "python",
|
||
|
"name": "python3"
|
||
|
},
|
||
|
"language_info": {
|
||
|
"codemirror_mode": {
|
||
|
"name": "ipython",
|
||
|
"version": 3
|
||
|
},
|
||
|
"file_extension": ".py",
|
||
|
"mimetype": "text/x-python",
|
||
|
"name": "python",
|
||
|
"nbconvert_exporter": "python",
|
||
|
"pygments_lexer": "ipython3",
|
||
|
"version": "3.9.9"
|
||
|
},
|
||
|
"orig_nbformat": 4,
|
||
|
"vscode": {
|
||
|
"interpreter": {
|
||
|
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
|
||
|
}
|
||
|
}
|
||
|
},
|
||
|
"nbformat": 4,
|
||
|
"nbformat_minor": 2
|
||
|
}
|