From cc6df3a8c3551b69770d6a5f7fbbd4f43bb3c303 Mon Sep 17 00:00:00 2001 From: Filipe de Avila Belbute Peres Date: Tue, 17 Jan 2023 18:17:39 -0800 Subject: [PATCH 1/9] Write initial draft --- .../Truncate_prompts_to_context_length.ipynb | 822 ++++++++++++++++++ 1 file changed, 822 insertions(+) create mode 100644 examples/Truncate_prompts_to_context_length.ipynb diff --git a/examples/Truncate_prompts_to_context_length.ipynb b/examples/Truncate_prompts_to_context_length.ipynb new file mode 100644 index 0000000..a560dd1 --- /dev/null +++ b/examples/Truncate_prompts_to_context_length.ipynb @@ -0,0 +1,822 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to count tokens with tiktoken\n", + "\n", + "[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n", + "\n", + "Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"gpt2\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n", + "\n", + "Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). 
Different models use different encodings.\n", + "\n", + "`tiktoken` supports three encodings used by OpenAI models:\n", + "\n", + "| Encoding name | OpenAI models |\n", + "|-------------------------|-----------------------------------------------------|\n", + "| `gpt2` (or `r50k_base`) | Most GPT-3 models |\n", + "| `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |\n", + "| `cl100k_base` | `text-embedding-ada-002` |\n", + "\n", + "`p50k_base` overlaps substantially with `gpt2`, and for non-code applications, they will usually give the same tokens.\n", + "\n", + "## Tokenizer libraries and languages\n", + "\n", + "For `gpt2` encodings, tokenizers are available in many languages.\n", + "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n", + "- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n", + "- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n", + "- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n", + "- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n", + "\n", + "(OpenAI makes no endorsements or guarantees of third-party libraries.)\n", + "\n", + "For `p50k_base` and `cl100k_base` encodings, `tiktoken` is the only tokenizer available as of January 2023.\n", + "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n", + "\n", + "## How strings are typically tokenized\n", + "\n", + "In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). 
You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer)." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 0. Install `tiktoken`\n", + "\n", + "In your terminal, install `tiktoken` with `pip`:\n", + "\n", + "```bash\n", + "pip install tiktoken\n", + "```" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Import `tiktoken`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import Iterable, Sequence, Optional\n", + "\n", + "import tiktoken\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Load an encoding\n", + "\n", + "Use `tiktoken.get_encoding()` to load an encoding by name.\n", + "\n", + "The first time this runs, it will require an internet connection to download. Later runs won't need an internet connection." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "encoding = tiktoken.get_encoding(\"cl100k_base\")\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Turn text into tokens with `encoding.encode()`\n", + "\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `.encode()` method converts a text string into a list of token integers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "encoding.encode(\"tiktoken is great!\")\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Count tokens by counting the length of the list returned by `.encode()`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def num_tokens_from_string(string: str, encoding_name: str) -> int:\n", + " \"\"\"Returns the number of tokens in a text string.\"\"\"\n", + " encoding = tiktoken.get_encoding(encoding_name)\n", + " num_tokens = len(encoding.encode(string))\n", + " return num_tokens\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "num_tokens_from_string(\"tiktoken is great!\", \"gpt2\")\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Turn tokens into text with `encoding.decode()`" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`.decode()` converts a list of token integers to a string." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "encoding.decode([83, 1134, 30001, 318, 1049, 0])\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "[encoding.decode_single_token_bytes(token) for token in [83, 1134, 30001, 318, 1049, 0]]\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(The `b` in front of the strings indicates that the strings are byte strings.)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Comparing encodings\n", + "\n", + "Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "def compare_encodings(example_string: str) -> None:\n", + " \"\"\"Prints a comparison of three string encodings.\"\"\"\n", + " # print the example string\n", + " print(f'\\nExample string: \"{example_string}\"')\n", + " # for each encoding, print the # of tokens, the token integers, and the token bytes\n", + " for encoding_name in [\"gpt2\", \"p50k_base\", \"cl100k_base\"]:\n", + " encoding = tiktoken.get_encoding(encoding_name)\n", + " token_integers = encoding.encode(example_string)\n", + " num_tokens = len(token_integers)\n", + " token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]\n", + " print()\n", + " print(f\"{encoding_name}: {num_tokens} tokens\")\n", + " print(f\"token integers: {token_integers}\")\n", + " print(f\"token bytes: {token_bytes}\")#%% md\n", + "# How to count tokens with tiktoken\n", + "\n", + "[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n", + "\n", + "Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"gpt2\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n", + "\n", + "Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). 
Different models use different encodings.\n", + "\n", + "`tiktoken` supports three encodings used by OpenAI models:\n", + "\n", + "| Encoding name | OpenAI models |\n", + "|-------------------------|-----------------------------------------------------|\n", + "| `gpt2` (or `r50k_base`) | Most GPT-3 models |\n", + "| `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |\n", + "| `cl100k_base` | `text-embedding-ada-002` |\n", + "\n", + "`p50k_base` overlaps substantially with `gpt2`, and for non-code applications, they will usually give the same tokens.\n", + "\n", + "## Tokenizer libraries and languages\n", + "\n", + "For `gpt2` encodings, tokenizers are available in many languages.\n", + "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n", + "- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n", + "- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n", + "- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n", + "- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n", + "\n", + "(OpenAI makes no endorsements or guarantees of third-party libraries.)\n", + "\n", + "For `p50k_base` and `cl100k_base` encodings, `tiktoken` is the only tokenizer available as of January 2023.\n", + "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n", + "\n", + "## How strings are typically tokenized\n", + "\n", + "In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). 
You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer)." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## 0. Install `tiktoken`\n", + "\n", + "In your terminal, install `tiktoken` with `pip`:\n", + "\n", + "```bash\n", + "pip install tiktoken\n", + "```" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## 1. Import `tiktoken`" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from typing import Iterable, Sequence, Optional\n", + "\n", + "import tiktoken\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## 2. Load an encoding\n", + "\n", + "Use `tiktoken.get_encoding()` to load an encoding by name.\n", + "\n", + "The first time this runs, it will require an internet connection to download. Later runs won't need an internet connection." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "encoding = tiktoken.get_encoding(\"cl100k_base\")\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## 3. Turn text into tokens with `encoding.encode()`\n", + "\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "The `.encode()` method converts a text string into a list of token integers." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "encoding.encode(\"tiktoken is great!\")\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "Count tokens by counting the length of the list returned by `.encode()`." 
+ ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "def num_tokens_from_string(string: str, encoding_name: str) -> int:\n", + " \"\"\"Returns the number of tokens in a text string.\"\"\"\n", + " encoding = tiktoken.get_encoding(encoding_name)\n", + " num_tokens = len(encoding.encode(string))\n", + " return num_tokens\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "num_tokens_from_string(\"tiktoken is great!\", \"gpt2\")\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## 4. Turn tokens into text with `encoding.decode()`" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "`.decode()` converts a list of token integers to a string." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "encoding.decode([83, 1134, 30001, 318, 1049, 0])\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents." 
+ ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "[encoding.decode_single_token_bytes(token) for token in [83, 1134, 30001, 318, 1049, 0]]\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "(The `b` in front of the strings indicates that the strings are byte strings.)" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## 5. Comparing encodings\n", + "\n", + "Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "def compare_encodings(example_string: str) -> None:\n", + " \"\"\"Prints a comparison of three string encodings.\"\"\"\n", + " # print the example string\n", + " print(f'\\nExample string: \"{example_string}\"')\n", + " # for each encoding, print the # of tokens, the token integers, and the token bytes\n", + " for encoding_name in [\"gpt2\", \"p50k_base\", \"cl100k_base\"]:\n", + " encoding = tiktoken.get_encoding(encoding_name)\n", + " token_integers = encoding.encode(example_string)\n", + " num_tokens = len(token_integers)\n", + " token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]\n", + " print()\n", + " print(f\"{encoding_name}: {num_tokens} tokens\")\n", + " print(f\"token integers: {token_integers}\")\n", + " print(f\"token bytes: {token_bytes}\")\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "compare_encodings(\"antidisestablishmentarianism\")\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + 
"source": [ + "compare_encodings(\"2 + 2 = 4\")\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "compare_encodings(\"お誕生日おめでとう\")\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "long_prompt = str(list(range(3000)))\n", + "num_tokens_from_string(long_prompt, 'cl100k_base')" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "import openai\n", + "EMBEDDING_MODEL = 'text-embedding-ada-002'\n", + "EMBEDDING_CTX_LENGTH = 8191\n", + "openai.Embedding.create(input=long_prompt, model=EMBEDDING_MODEL)\n", + "\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "def truncate_string_tokens(text: str, encoding_name: str = 'cl100k_base', max_tokens: int = EMBEDDING_CTX_LENGTH) -> list[int]:\n", + " \"\"\"Truncate a string to have `max_tokens` according to the given encoding.\"\"\"\n", + " encoding = tiktoken.get_encoding(encoding_name)\n", + " return encoding.encode(text)[:max_tokens]\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from itertools import islice\n", + "\n", + "# From: https://docs.python.org/3/library/itertools.html#itertools-recipes\n", + "def batched(iterable, n):\n", + " \"\"\"Batch data into tuples of length n. 
The last batch may be shorter.\"\"\"\n", + " # batched('ABCDEFG', 3) --> ABC DEF G\n", + " if n < 1:\n", + " raise ValueError('n must be at least one')\n", + " it = iter(iterable)\n", + " while (batch := tuple(islice(it, n))):\n", + " yield batch\n", + "\n", + "\n", + "def chunked_tokens(text: str, encoding_name: str = 'cl100k_base', chunk_ctx_length: int = EMBEDDING_CTX_LENGTH):\n", + " encoding = tiktoken.get_encoding(encoding_name)\n", + " tokens = encoding.encode(text)\n", + " chunks_iterator = batched(tokens, chunk_ctx_length)\n", + " yield from chunks_iterator\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "import numpy as np\n", + "from tenacity import retry, wait_random_exponential, stop_after_attempt\n", + "\n", + "\n", + "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n", + "def get_embedding(tokens: Sequence[int], model=EMBEDDING_MODEL) -> list[float]:\n", + " return openai.Embedding.create(input=tokens, model=model)[\"data\"][0][\"embedding\"]\n", + "\n", + "\n", + "def len_safe_get_embedding(text: str, model=EMBEDDING_MODEL, max_tokens: int = EMBEDDING_CTX_LENGTH, encoding_name: str = 'cl100k_base', reduction: Optional[str]='average'):\n", + " chunk_embeddings = []\n", + " for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_ctx_length=max_tokens):\n", + " chunk_embeddings.append(get_embedding(chunk, model=model))\n", + "\n", + " if reduction is None:\n", + " return chunk_embeddings\n", + " elif reduction == 'average':\n", + " return np.mean(chunk_embeddings, weights=[len(c) for c in chunk_embeddings])\n", + " else:\n", + " raise NotI\n", + "\n", + "\n", + "\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, 
+ "outputs": [], + "source": [ + "compare_encodings(\"antidisestablishmentarianism\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "compare_encodings(\"2 + 2 = 4\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "compare_encodings(\"お誕生日おめでとう\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "long_prompt = str(list(range(3000)))\n", + "num_tokens_from_string(long_prompt, 'cl100k_base')" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "EMBEDDING_MODEL = 'text-embedding-ada-002'\n", + "EMBEDDING_CTX_LENGTH = 8191\n", + "\n", + "openai.Embedding.create(input=long_prompt, model=EMBEDDING_MODEL)\n", + "\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "def truncate_string_tokens(text: str, encoding_name: str = 'cl100k_base', max_tokens: int = EMBEDDING_CTX_LENGTH) -> list[int]:\n", + " \"\"\"Truncate a string to have `max_tokens` according to the given encoding.\"\"\"\n", + " encoding = tiktoken.get_encoding(encoding_name)\n", + " return encoding.encode(text)[:max_tokens]\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from itertools import islice\n", + "\n", + "# From: https://docs.python.org/3/library/itertools.html#itertools-recipes\n", + "def batched(iterable, n):\n", + " \"\"\"Batch data into tuples of length n. 
The last batch may be shorter.\"\"\"\n", + " # batched('ABCDEFG', 3) --> ABC DEF G\n", + " if n < 1:\n", + " raise ValueError('n must be at least one')\n", + " it = iter(iterable)\n", + " while (batch := tuple(islice(it, n))):\n", + " yield batch\n", + "\n", + "\n", + "def chunked_tokens(text: str, encoding_name: str = 'cl100k_base', chunk_ctx_length: int = EMBEDDING_CTX_LENGTH):\n", + " encoding = tiktoken.get_encoding(encoding_name)\n", + " tokens = encoding.encode(text)\n", + " chunks_iterator = batched(tokens, chunk_ctx_length)\n", + " yield from chunks_iterator\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "import numpy as np\n", + "from tenacity import retry, wait_random_exponential, stop_after_attempt\n", + "\n", + "\n", + "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n", + "def get_embedding(tokens: Sequence[int], model=EMBEDDING_MODEL) -> list[float]:\n", + " return openai.Embedding.create(input=tokens, model=model)[\"data\"][0][\"embedding\"]\n", + "\n", + "\n", + "def len_safe_get_embedding(text: str, model=EMBEDDING_MODEL, max_tokens: int = EMBEDDING_CTX_LENGTH, encoding_name: str = 'cl100k_base', reduction: Optional[str] = None):\n", + " chunk_embeddings = []\n", + " for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_ctx_length=max_tokens):\n", + " chunk_embeddings.append(get_embedding(chunk, model=model))\n", + "\n", + " if reduction is None:\n", + " return chunk_embeddings\n", + " elif reduction == 'average':\n", + " return np.average(chunk_embeddings, axis=0, weights=[len(c) for c in chunk_embeddings]).tolist()\n", + " else:\n", + " raise ValueError(f'reduction {reduction} not valid.')\n", + "\n", + "\n", + "\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [], + "metadata": { + "collapsed": false + } + } + ], + 
"metadata": { + "kernelspec": { + "display_name": "openai", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.9" + }, + "orig_nbformat": 4, + "vscode": { + "interpreter": { + "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97" + } + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 85032e08310699ec826117f058cdf1408e1c4671 Mon Sep 17 00:00:00 2001 From: Filipe de Avila Belbute Peres Date: Wed, 18 Jan 2023 17:50:31 -0800 Subject: [PATCH 2/9] Finish first draft --- .../Truncate_prompts_to_context_length.ipynb | 879 ++++-------------- 1 file changed, 185 insertions(+), 694 deletions(-) diff --git a/examples/Truncate_prompts_to_context_length.ipynb b/examples/Truncate_prompts_to_context_length.ipynb index a560dd1..1ed48f2 100644 --- a/examples/Truncate_prompts_to_context_length.ipynb +++ b/examples/Truncate_prompts_to_context_length.ipynb @@ -5,41 +5,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# How to count tokens with tiktoken\n", + "# Embedding texts that are larger than the model's context length\n", "\n", - "[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n", + "All models have a maximum context length for the input text they take in. However, this maximum length is defined in terms of _tokens_ instead of string length. 
If you are unfamiliar with tokenization, you can check out the [\"How to count tokens with tiktoken\"](How_to_count_tokens_with_tiktoken.ipynb) notebook in this same cookbook.\n", "\n", - "Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"gpt2\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n", - "\n", - "Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.\n", - "\n", - "`tiktoken` supports three encodings used by OpenAI models:\n", - "\n", - "| Encoding name | OpenAI models |\n", - "|-------------------------|-----------------------------------------------------|\n", - "| `gpt2` (or `r50k_base`) | Most GPT-3 models |\n", - "| `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |\n", - "| `cl100k_base` | `text-embedding-ada-002` |\n", - "\n", - "`p50k_base` overlaps substantially with `gpt2`, and for non-code applications, they will usually give the same tokens.\n", - "\n", - "## Tokenizer libraries and languages\n", - "\n", - "For `gpt2` encodings, tokenizers are available in many languages.\n", - "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n", - "- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n", - "- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n", - "- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n", - "- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n", - "\n", - "(OpenAI makes 
no endorsements or guarantees of third-party libraries.)\n", - "\n", - "For `p50k_base` and `cl100k_base` encodings, `tiktoken` is the only tokenizer available as of January 2023.\n", - "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n", - "\n", - "## How strings are typically tokenized\n", - "\n", - "In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer)." + "In this notebook, we will go over how to deal with texts that are larger than a model's context length. In these examples, we will focus on embedding texts using the `text-embedding-ada-002` model, but similar approaches can also be applied to other models and tasks. To learn about how to embed a text, check out the [Get embeddings](Get_embeddings.ipynb) notebook and the OpenAI [embeddings page](https://beta.openai.com/docs/guides/embeddings).\n" ] }, { @@ -47,596 +17,37 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 0. Install `tiktoken`\n", + "## 1. Model context length\n", "\n", - "In your terminal, install `tiktoken` with `pip`:\n", - "\n", - "```bash\n", - "pip install tiktoken\n", - "```" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. Import `tiktoken`" + "First, let us define the model we will be working with and a function to get embeddings from the API." ] }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from typing import Iterable, Sequence, Optional\n", - "\n", - "import tiktoken\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. 
Load an encoding\n", - "\n", - "Use `tiktoken.get_encoding()` to load an encoding by name.\n", - "\n", - "The first time this runs, it will require an internet connection to download. Later runs won't need an internet connection." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "encoding = tiktoken.get_encoding(\"cl100k_base\")\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. Turn text into tokens with `encoding.encode()`\n", - "\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `.encode()` method converts a text string into a list of token integers." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "encoding.encode(\"tiktoken is great!\")\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Count tokens by counting the length of the list returned by `.encode()`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def num_tokens_from_string(string: str, encoding_name: str) -> int:\n", - " \"\"\"Returns the number of tokens in a text string.\"\"\"\n", - " encoding = tiktoken.get_encoding(encoding_name)\n", - " num_tokens = len(encoding.encode(string))\n", - " return num_tokens\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "num_tokens_from_string(\"tiktoken is great!\", \"gpt2\")\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. Turn tokens into text with `encoding.decode()`" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`.decode()` converts a list of token integers to a string." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "encoding.decode([83, 1134, 30001, 318, 1049, 0])\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "[encoding.decode_single_token_bytes(token) for token in [83, 1134, 30001, 318, 1049, 0]]\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "(The `b` in front of the strings indicates that the strings are byte strings.)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. Comparing encodings\n", - "\n", - "Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def compare_encodings(example_string: str) -> None:\n", - " \"\"\"Prints a comparison of three string encodings.\"\"\"\n", - " # print the example string\n", - " print(f'\\nExample string: \"{example_string}\"')\n", - " # for each encoding, print the # of tokens, the token integers, and the token bytes\n", - " for encoding_name in [\"gpt2\", \"p50k_base\", \"cl100k_base\"]:\n", - " encoding = tiktoken.get_encoding(encoding_name)\n", - " token_integers = encoding.encode(example_string)\n", - " num_tokens = len(token_integers)\n", - " token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]\n", - " print()\n", - " print(f\"{encoding_name}: {num_tokens} tokens\")\n", - " print(f\"token integers: {token_integers}\")\n", - " print(f\"token bytes: {token_bytes}\")#%% md\n", - "# How to count tokens with tiktoken\n", - "\n", - "[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n", - "\n", - "Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"gpt2\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n", - "\n", - "Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). 
Different models use different encodings.\n", - "\n", - "`tiktoken` supports three encodings used by OpenAI models:\n", - "\n", - "| Encoding name | OpenAI models |\n", - "|-------------------------|-----------------------------------------------------|\n", - "| `gpt2` (or `r50k_base`) | Most GPT-3 models |\n", - "| `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |\n", - "| `cl100k_base` | `text-embedding-ada-002` |\n", - "\n", - "`p50k_base` overlaps substantially with `gpt2`, and for non-code applications, they will usually give the same tokens.\n", - "\n", - "## Tokenizer libraries and languages\n", - "\n", - "For `gpt2` encodings, tokenizers are available in many languages.\n", - "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n", - "- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n", - "- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n", - "- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n", - "- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n", - "\n", - "(OpenAI makes no endorsements or guarantees of third-party libraries.)\n", - "\n", - "For `p50k_base` and `cl100k_base` encodings, `tiktoken` is the only tokenizer available as of January 2023.\n", - "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n", - "\n", - "## How strings are typically tokenized\n", - "\n", - "In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). 
You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer)." - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "## 0. Install `tiktoken`\n", - "\n", - "In your terminal, install `tiktoken` with `pip`:\n", - "\n", - "```bash\n", - "pip install tiktoken\n", - "```" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "## 1. Import `tiktoken`" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "from typing import Iterable, Sequence, Optional\n", - "\n", - "import tiktoken\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "## 2. Load an encoding\n", - "\n", - "Use `tiktoken.get_encoding()` to load an encoding by name.\n", - "\n", - "The first time this runs, it will require an internet connection to download. Later runs won't need an internet connection." - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "encoding = tiktoken.get_encoding(\"cl100k_base\")\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "## 3. Turn text into tokens with `encoding.encode()`\n", - "\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "The `.encode()` method converts a text string into a list of token integers." - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "encoding.encode(\"tiktoken is great!\")\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "Count tokens by counting the length of the list returned by `.encode()`." 
- ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def num_tokens_from_string(string: str, encoding_name: str) -> int:\n", - " \"\"\"Returns the number of tokens in a text string.\"\"\"\n", - " encoding = tiktoken.get_encoding(encoding_name)\n", - " num_tokens = len(encoding.encode(string))\n", - " return num_tokens\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "num_tokens_from_string(\"tiktoken is great!\", \"gpt2\")\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "## 4. Turn tokens into text with `encoding.decode()`" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "`.decode()` converts a list of token integers to a string." - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "encoding.decode([83, 1134, 30001, 318, 1049, 0])\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries." - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents." 
- ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "[encoding.decode_single_token_bytes(token) for token in [83, 1134, 30001, 318, 1049, 0]]\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "(The `b` in front of the strings indicates that the strings are byte strings.)" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "## 5. Comparing encodings\n", - "\n", - "Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings." - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def compare_encodings(example_string: str) -> None:\n", - " \"\"\"Prints a comparison of three string encodings.\"\"\"\n", - " # print the example string\n", - " print(f'\\nExample string: \"{example_string}\"')\n", - " # for each encoding, print the # of tokens, the token integers, and the token bytes\n", - " for encoding_name in [\"gpt2\", \"p50k_base\", \"cl100k_base\"]:\n", - " encoding = tiktoken.get_encoding(encoding_name)\n", - " token_integers = encoding.encode(example_string)\n", - " num_tokens = len(token_integers)\n", - " token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]\n", - " print()\n", - " print(f\"{encoding_name}: {num_tokens} tokens\")\n", - " print(f\"token integers: {token_integers}\")\n", - " print(f\"token bytes: {token_bytes}\")\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "compare_encodings(\"antidisestablishmentarianism\")\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - 
"source": [ - "compare_encodings(\"2 + 2 = 4\")\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "compare_encodings(\"お誕生日おめでとう\")\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "long_prompt = str(list(range(3000)))\n", - "num_tokens_from_string(long_prompt, 'cl100k_base')" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, + "execution_count": 85, "outputs": [], "source": [ "import openai\n", - "EMBEDDING_MODEL = 'text-embedding-ada-002'\n", - "EMBEDDING_CTX_LENGTH = 8191\n", - "openai.Embedding.create(input=long_prompt, model=EMBEDDING_MODEL)\n", - "\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def truncate_string_tokens(text: str, encoding_name: str = 'cl100k_base', max_tokens: int = EMBEDDING_CTX_LENGTH) -> list[int]:\n", - " \"\"\"Truncate a string to have `max_tokens` according to the given encoding.\"\"\"\n", - " encoding = tiktoken.get_encoding(encoding_name)\n", - " return encoding.encode(text)[:max_tokens]\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "from itertools import islice\n", - "\n", - "# From: https://docs.python.org/3/library/itertools.html#itertools-recipes\n", - "def batched(iterable, n):\n", - " \"\"\"Batch data into tuples of length n. 
The last batch may be shorter.\"\"\"\n", - " # batched('ABCDEFG', 3) --> ABC DEF G\n", - " if n < 1:\n", - " raise ValueError('n must be at least one')\n", - " it = iter(iterable)\n", - " while (batch := tuple(islice(it, n))):\n", - " yield batch\n", - "\n", - "\n", - "def chunked_tokens(text: str, encoding_name: str = 'cl100k_base', chunk_ctx_length: int = EMBEDDING_CTX_LENGTH):\n", - " encoding = tiktoken.get_encoding(encoding_name)\n", - " tokens = encoding.encode(text)\n", - " chunks_iterator = batched(tokens, chunk_ctx_length)\n", - " yield from chunks_iterator\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "import numpy as np\n", "from tenacity import retry, wait_random_exponential, stop_after_attempt\n", "\n", "\n", + "EMBEDDING_MODEL = 'text-embedding-ada-002'\n", + "EMBEDDING_CTX_LENGTH = 8191\n", + "EMBEDDING_ENCODING = 'cl100k_base'\n", + "\n", + "\n", "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n", - "def get_embedding(tokens: Sequence[int], model=EMBEDDING_MODEL) -> list[float]:\n", - " return openai.Embedding.create(input=tokens, model=model)[\"data\"][0][\"embedding\"]\n", - "\n", - "\n", - "def len_safe_get_embedding(text: str, model=EMBEDDING_MODEL, max_tokens: int = EMBEDDING_CTX_LENGTH, encoding_name: str = 'cl100k_base', reduction: Optional[str]='average'):\n", - " chunk_embeddings = []\n", - " for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_ctx_length=max_tokens):\n", - " chunk_embeddings.append(get_embedding(chunk, model=model))\n", - "\n", - " if reduction is None:\n", - " return chunk_embeddings\n", - " elif reduction == 'average':\n", - " return np.mean(chunk_embeddings, weights=[len(c) for c in chunk_embeddings])\n", - " else:\n", - " raise NotI\n", - "\n", - "\n", - "\n" + "def get_embedding(text_or_tokens, model=EMBEDDING_MODEL):\n", + " return 
openai.Embedding.create(input=text_or_tokens, model=model)[\"data\"][0][\"embedding\"]" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "The `text-embedding-ada-002` model has a context length of 8191 tokens with the `cl100k_base` encoding, and we can see that going over that limit causes an error." ], "metadata": { "collapsed": false @@ -644,64 +55,114 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 94, + "outputs": [ + { + "ename": "InvalidRequestError", + "evalue": "This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", + "output_type": "error", + "traceback": [ + "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m", + "\u001B[0;31mInvalidRequestError\u001B[0m Traceback (most recent call last)", + "Cell \u001B[0;32mIn [94], line 4\u001B[0m\n\u001B[1;32m 1\u001B[0m \u001B[38;5;28;01mimport\u001B[39;00m \u001B[38;5;21;01mopenai\u001B[39;00m\n\u001B[1;32m 3\u001B[0m long_text \u001B[38;5;241m=\u001B[39m \u001B[38;5;124m'\u001B[39m\u001B[38;5;124mAGI \u001B[39m\u001B[38;5;124m'\u001B[39m \u001B[38;5;241m*\u001B[39m \u001B[38;5;241m5000\u001B[39m\n\u001B[0;32m----> 4\u001B[0m get_embedding(\u001B[43mopenai\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mEmbedding\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43minput\u001B[39;49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mlong_text\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmodel\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mEMBEDDING_MODEL\u001B[49m\u001B[43m)\u001B[49m)\n", + "File \u001B[0;32m~/code/openai-python/openai/api_resources/embedding.py:33\u001B[0m, in \u001B[0;36mEmbedding.create\u001B[0;34m(cls, *args, **kwargs)\u001B[0m\n\u001B[1;32m 31\u001B[0m 
\u001B[38;5;28;01mwhile\u001B[39;00m \u001B[38;5;28;01mTrue\u001B[39;00m:\n\u001B[1;32m 32\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m---> 33\u001B[0m response \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43msuper\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 35\u001B[0m \u001B[38;5;66;03m# If a user specifies base64, we'll just return the encoded string.\u001B[39;00m\n\u001B[1;32m 36\u001B[0m \u001B[38;5;66;03m# This is only for the default case.\u001B[39;00m\n\u001B[1;32m 37\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m user_provided_encoding_format:\n", + "File \u001B[0;32m~/code/openai-python/openai/api_resources/abstract/engine_api_resource.py:153\u001B[0m, in \u001B[0;36mEngineAPIResource.create\u001B[0;34m(cls, api_key, api_base, api_type, request_id, api_version, organization, **params)\u001B[0m\n\u001B[1;32m 127\u001B[0m \u001B[38;5;129m@classmethod\u001B[39m\n\u001B[1;32m 128\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mcreate\u001B[39m(\n\u001B[1;32m 129\u001B[0m \u001B[38;5;28mcls\u001B[39m,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 136\u001B[0m \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams,\n\u001B[1;32m 137\u001B[0m ):\n\u001B[1;32m 138\u001B[0m (\n\u001B[1;32m 139\u001B[0m deployment_id,\n\u001B[1;32m 140\u001B[0m engine,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 150\u001B[0m api_key, api_base, api_type, api_version, organization, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams\n\u001B[1;32m 151\u001B[0m )\n\u001B[0;32m--> 153\u001B[0m response, _, api_key \u001B[38;5;241m=\u001B[39m 
\u001B[43mrequestor\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mrequest\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 154\u001B[0m \u001B[43m \u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mpost\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\n\u001B[1;32m 155\u001B[0m \u001B[43m \u001B[49m\u001B[43murl\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 156\u001B[0m \u001B[43m \u001B[49m\u001B[43mparams\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mparams\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 157\u001B[0m \u001B[43m \u001B[49m\u001B[43mheaders\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 158\u001B[0m \u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mstream\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 159\u001B[0m \u001B[43m \u001B[49m\u001B[43mrequest_id\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_id\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 160\u001B[0m \u001B[43m \u001B[49m\u001B[43mrequest_timeout\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_timeout\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 161\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 163\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m stream:\n\u001B[1;32m 164\u001B[0m \u001B[38;5;66;03m# must be an iterator\u001B[39;00m\n\u001B[1;32m 165\u001B[0m \u001B[38;5;28;01massert\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(response, OpenAIResponse)\n", + "File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:227\u001B[0m, in \u001B[0;36mAPIRequestor.request\u001B[0;34m(self, method, url, params, headers, files, stream, request_id, request_timeout)\u001B[0m\n\u001B[1;32m 206\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mrequest\u001B[39m(\n\u001B[1;32m 207\u001B[0m \u001B[38;5;28mself\u001B[39m,\n\u001B[1;32m 208\u001B[0m method,\n\u001B[0;32m 
(...)\u001B[0m\n\u001B[1;32m 215\u001B[0m request_timeout: Optional[Union[\u001B[38;5;28mfloat\u001B[39m, Tuple[\u001B[38;5;28mfloat\u001B[39m, \u001B[38;5;28mfloat\u001B[39m]]] \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m,\n\u001B[1;32m 216\u001B[0m ) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m Tuple[Union[OpenAIResponse, Iterator[OpenAIResponse]], \u001B[38;5;28mbool\u001B[39m, \u001B[38;5;28mstr\u001B[39m]:\n\u001B[1;32m 217\u001B[0m result \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mrequest_raw(\n\u001B[1;32m 218\u001B[0m method\u001B[38;5;241m.\u001B[39mlower(),\n\u001B[1;32m 219\u001B[0m url,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 225\u001B[0m request_timeout\u001B[38;5;241m=\u001B[39mrequest_timeout,\n\u001B[1;32m 226\u001B[0m )\n\u001B[0;32m--> 227\u001B[0m resp, got_stream \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_interpret_response\u001B[49m\u001B[43m(\u001B[49m\u001B[43mresult\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 228\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m resp, got_stream, \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mapi_key\n", + "File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:620\u001B[0m, in \u001B[0;36mAPIRequestor._interpret_response\u001B[0;34m(self, result, stream)\u001B[0m\n\u001B[1;32m 612\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m (\n\u001B[1;32m 613\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_interpret_response_line(\n\u001B[1;32m 614\u001B[0m line, result\u001B[38;5;241m.\u001B[39mstatus_code, result\u001B[38;5;241m.\u001B[39mheaders, stream\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mTrue\u001B[39;00m\n\u001B[1;32m 615\u001B[0m )\n\u001B[1;32m 616\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m line \u001B[38;5;129;01min\u001B[39;00m 
parse_stream(result\u001B[38;5;241m.\u001B[39miter_lines())\n\u001B[1;32m 617\u001B[0m ), \u001B[38;5;28;01mTrue\u001B[39;00m\n\u001B[1;32m 618\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m 619\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m (\n\u001B[0;32m--> 620\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_interpret_response_line\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 621\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcontent\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdecode\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mutf-8\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 622\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mstatus_code\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 623\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 624\u001B[0m \u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43;01mFalse\u001B[39;49;00m\u001B[43m,\u001B[49m\n\u001B[1;32m 625\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m,\n\u001B[1;32m 626\u001B[0m \u001B[38;5;28;01mFalse\u001B[39;00m,\n\u001B[1;32m 627\u001B[0m )\n", + "File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:680\u001B[0m, in \u001B[0;36mAPIRequestor._interpret_response_line\u001B[0;34m(self, rbody, rcode, rheaders, stream)\u001B[0m\n\u001B[1;32m 678\u001B[0m stream_error \u001B[38;5;241m=\u001B[39m stream \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124merror\u001B[39m\u001B[38;5;124m\"\u001B[39m \u001B[38;5;129;01min\u001B[39;00m resp\u001B[38;5;241m.\u001B[39mdata\n\u001B[1;32m 679\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m stream_error 
\u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;241m200\u001B[39m \u001B[38;5;241m<\u001B[39m\u001B[38;5;241m=\u001B[39m rcode \u001B[38;5;241m<\u001B[39m \u001B[38;5;241m300\u001B[39m:\n\u001B[0;32m--> 680\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mhandle_error_response(\n\u001B[1;32m 681\u001B[0m rbody, rcode, resp\u001B[38;5;241m.\u001B[39mdata, rheaders, stream_error\u001B[38;5;241m=\u001B[39mstream_error\n\u001B[1;32m 682\u001B[0m )\n\u001B[1;32m 683\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m resp\n", + "\u001B[0;31mInvalidRequestError\u001B[0m: This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length." + ] + } + ], + "source": [ + "long_text = 'AGI ' * 5000\n", + "get_embedding(long_text)" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "Clearly we want to avoid these errors, particularly when handling a large number of embeddings programmatically. Yet we may still be faced with texts that are longer than the maximum context length. Below we describe and provide recipes for the main approaches to handling these longer texts: (1) simply truncating the text to the maximum allowed length, and (2) chunking the text and embedding each chunk individually." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## 1. Truncating the input text\n", + "\n", + "The simplest solution is to truncate the input text to the maximum allowed length. Since the context length is measured in tokens, we first have to tokenize the text before truncating it. 
The API accepts input either as text or as tokens, so as long as you use the appropriate encoding, there is no need to convert the tokens back into a string. Below is an example of such a truncation function." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 98, "outputs": [], + "source": [ + "import tiktoken\n", + "\n", + "def truncate_text_tokens(text, encoding_name=EMBEDDING_ENCODING, max_tokens=EMBEDDING_CTX_LENGTH):\n", + " \"\"\"Tokenize a string with the given encoding and keep at most the first `max_tokens` tokens.\"\"\"\n", + " encoding = tiktoken.get_encoding(encoding_name)\n", + " return encoding.encode(text)[:max_tokens]" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "Our example from before now works." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 97, + "outputs": [ + { + "data": { + "text/plain": "[-0.015384314581751823,\n 0.0031692360062152147,\n -0.007302511017769575,\n -0.02778581902384758,\n -0.013409210368990898,\n 0.0029592972714453936,\n -0.019119545817375183,\n -0.0004874778969679028,\n -0.010721994563937187,\n -0.023486273363232613,\n 0.016351712867617607,\n 0.005532307084649801,\n -0.009136536158621311,\n -0.014282556250691414,\n 0.005122506525367498,\n 0.02888757921755314,\n 0.020973725244402885,\n 0.009136536158621311,\n 0.003303596982732415,\n -0.013382338918745518,\n -0.024749264121055603,\n 0.03904525563120842,\n -0.01699664443731308,\n -0.010312194004654884,\n -0.009029047563672066,\n -0.001587137347087264,\n 0.017036953940987587,\n -0.056915245950222015,\n -0.011084768921136856,\n -0.006375421304255724,\n 0.011145230382680893,\n -0.01094368938356638,\n -0.010184550657868385,\n -0.009546336717903614,\n -0.012105911038815975,\n -0.004675756674259901,\n 0.002245505340397358,\n -0.0015040015568956733,\n -0.007457026280462742,\n 0.0029206685721874237,\n 
0.03993203863501549,\n -0.02390279248356819,\n 0.003399329027161002,\n -0.02109465003013611,\n -0.026590008288621902,\n 0.004457420669496059,\n -0.03638491407036781,\n -0.018958313390612602,\n 0.002221992239356041,\n -0.007846672087907791,\n -0.0106548136100173,\n 0.0019096032483503222,\n -0.015451495535671711,\n -0.00783995445817709,\n 0.016821976751089096,\n 0.007409999612718821,\n -0.017601268365979195,\n 0.01502154115587473,\n -0.026119744405150414,\n -0.011333336122334003,\n -0.017184749245643616,\n -0.0352562814950943,\n -0.002327801426872611,\n 0.015666473656892776,\n -0.023069754242897034,\n -0.016821976751089096,\n -0.0005298855248838663,\n 0.0010933612938970327,\n 0.0048571438528597355,\n -0.034503862261772156,\n 0.007712311577051878,\n 0.038024116307497025,\n -0.017856555059552193,\n -0.02415807731449604,\n 0.020664695650339127,\n -0.01742659881711006,\n 0.012072320096194744,\n 0.015249954536557198,\n -0.008357243612408638,\n 0.001610650448128581,\n 0.018017787486314774,\n -0.02247856743633747,\n -3.219936115783639e-05,\n 0.02421182207763195,\n 0.010594351217150688,\n 0.01800435036420822,\n -0.019777914509177208,\n 0.024695521220564842,\n 0.0013805575435981154,\n -0.0138122932985425,\n 0.02132306434214115,\n 0.023325040936470032,\n 0.027597714215517044,\n 0.06062360480427742,\n -0.019562937319278717,\n 0.009559772908687592,\n -0.02183363400399685,\n 0.0173728559166193,\n -0.028242645785212517,\n -0.03058052435517311,\n 0.01847461424767971,\n -0.026536263525485992,\n -0.007947443053126335,\n -0.007517488207668066,\n -0.026616880670189857,\n 0.009183562360703945,\n 0.01872989907860756,\n -0.022075483575463295,\n 0.019589809700846672,\n -0.023916227743029594,\n 0.019347960129380226,\n 0.02378186769783497,\n 0.019764477387070656,\n -0.0202616136521101,\n -0.019401703029870987,\n 0.006335113197565079,\n 0.015209645964205265,\n -0.029935592785477638,\n -0.007013635244220495,\n -0.0363042950630188,\n 0.00704050762578845,\n 0.01616360805928707,\n 
0.014981232583522797,\n -0.0013931537978351116,\n 0.030661141499876976,\n 0.01389290951192379,\n 0.007712311577051878,\n -0.01910611055791378,\n -0.0020792337600141764,\n -0.008404269814491272,\n 0.024722393602132797,\n 0.01699664443731308,\n 0.008350525982677937,\n 0.009727723896503448,\n -0.010695122182369232,\n 0.006560167297720909,\n -0.031386688351631165,\n 0.0263078510761261,\n -0.0001876852911664173,\n -0.01816558465361595,\n 0.019482320174574852,\n 0.023190679028630257,\n -0.015115593560039997,\n -0.015384314581751823,\n -0.005233354400843382,\n 0.004225648008286953,\n -0.0011555030941963196,\n -0.012092474848031998,\n 0.011602058075368404,\n -0.02179332636296749,\n -0.003029836807399988,\n 0.0030382343102246523,\n -0.011151948943734169,\n 0.007430153898894787,\n 0.001625766046345234,\n 0.010795892216265202,\n 0.0033136738929897547,\n 0.013167361728847027,\n -0.027033399790525436,\n 0.002052361611276865,\n 0.015061848796904087,\n 0.017762500792741776,\n 0.014349736273288727,\n -0.007047225721180439,\n 0.014887180179357529,\n 0.023190679028630257,\n 0.0055289482697844505,\n 0.018084967508912086,\n -0.0014888859586790204,\n -0.003711717901751399,\n 0.008290063589811325,\n 0.03740605339407921,\n 0.007960879243910313,\n 0.01809840463101864,\n 0.010916817001998425,\n 0.03504130616784096,\n 0.0031138123013079166,\n -0.005303893703967333,\n -0.022868212312459946,\n -0.01373839471489191,\n -0.013933218084275723,\n 0.008525194600224495,\n 0.05304565653204918,\n 0.014537842012941837,\n 0.006230983417481184,\n -0.004920965526252985,\n 0.002856847131624818,\n -0.015868013724684715,\n 0.006835607346147299,\n -0.027449917048215866,\n -0.0049041700549423695,\n 0.009808340109884739,\n 0.028914449736475945,\n -0.017386291176080704,\n -0.6199946403503418,\n -0.02336534857749939,\n -0.018353689461946487,\n -0.0028131799772381783,\n 0.019804786890745163,\n 0.04409722611308098,\n 0.005280380602926016,\n 0.011702828109264374,\n -0.024829881265759468,\n 0.01465876679867506,\n 
-0.013395775109529495,\n 0.025877896696329117,\n -0.01636514998972416,\n 0.01199842244386673,\n -0.01084291934967041,\n -0.008827506564557552,\n 0.00870658177882433,\n -0.020086944103240967,\n -0.0006025243201293051,\n 0.027812691405415535,\n -0.03404703363776207,\n 0.0019079238409176469,\n 0.0024403284769505262,\n -0.006099981721490622,\n 0.009082792326807976,\n -0.0050519672222435474,\n 0.014309428632259369,\n -0.022921957075595856,\n -0.02199486829340458,\n 0.003607588354498148,\n -0.008518476970493793,\n 0.00871329940855503,\n 0.02747678942978382,\n -0.020906545221805573,\n 0.04253863915801048,\n 0.000455147324828431,\n 0.014484097249805927,\n 0.033079635351896286,\n 0.026590008288621902,\n 0.05258882790803909,\n -0.025971949100494385,\n 0.010493581183254719,\n 0.026455648243427277,\n -0.008511758409440517,\n -0.019025493413209915,\n 0.020019764080643654,\n 0.01937483251094818,\n -0.013415928930044174,\n -0.0027863075956702232,\n -0.007860108278691769,\n 0.011662520468235016,\n -0.007255484815686941,\n 0.0033707772381603718,\n -0.01479312777519226,\n 0.009358231909573078,\n 0.007100970018655062,\n 0.02388935536146164,\n -0.017171313986182213,\n -0.008726735599339008,\n 0.010977279394865036,\n -0.003943490330129862,\n 0.004695910960435867,\n -0.003323751036077738,\n 0.011642365716397762,\n -0.014510969631373882,\n 0.0063888574950397015,\n -0.006832248065620661,\n 0.01937483251094818,\n 0.0011857342906296253,\n 0.006308240815997124,\n -0.0029643357265740633,\n 0.012569455429911613,\n -0.013677932322025299,\n 0.01015767827630043,\n -0.002561253262683749,\n -0.0055994875729084015,\n 0.024144640192389488,\n 0.0076988753862679005,\n -0.01128630992025137,\n -0.022277025505900383,\n 0.013422646559774876,\n 0.00892155896872282,\n 0.0036613326519727707,\n -0.009438848122954369,\n 0.04151749610900879,\n -0.005727130454033613,\n -0.00863268319517374,\n -0.012804587371647358,\n 0.011138512752950191,\n -0.003283442696556449,\n -0.00783995445817709,\n 0.028538240119814873,\n 
0.00030609077657572925,\n 0.006113417912274599,\n 0.0205303356051445,\n 0.0037721802946180105,\n -0.02425212971866131,\n 0.013771984726190567,\n 0.0034833045210689306,\n -0.01748034358024597,\n -0.0062444196082651615,\n -0.005653231870383024,\n 0.011037741787731647,\n 0.02684529311954975,\n -0.023822175338864326,\n 0.041598111391067505,\n -0.02915629930794239,\n -0.009895674884319305,\n 0.03240783140063286,\n -0.022639799863100052,\n 0.01879708096385002,\n -0.03727169334888458,\n -0.02415807731449604,\n -0.02132306434214115,\n 0.014940924011170864,\n -0.03536377102136612,\n 0.012925512157380581,\n 0.012421658262610435,\n 0.017117569223046303,\n -0.01281130500137806,\n -0.014269120059907436,\n -0.010144243016839027,\n 0.0049075293354690075,\n -0.01338905654847622,\n 0.00038628740003332496,\n -7.085434481268749e-05,\n -0.00503853103145957,\n -0.024507414549589157,\n -0.022653235122561455,\n 0.02374155819416046,\n 0.03141356259584427,\n -0.003390931524336338,\n 0.015653036534786224,\n -0.024386489763855934,\n 0.05546415224671364,\n 0.015438059344887733,\n 0.02504485845565796,\n -0.001958309207111597,\n 0.013684650883078575,\n -0.01979134976863861,\n -0.022706979885697365,\n 0.013825729489326477,\n 0.008753607980906963,\n -0.014537842012941837,\n 0.01672792248427868,\n -0.01663387008011341,\n -0.014121322892606258,\n -0.015451495535671711,\n -0.005186328198760748,\n 0.03512192144989967,\n -0.008746890351176262,\n 0.029693743214011192,\n -0.016486072912812233,\n 0.026482518762350082,\n -0.023969972506165504,\n -0.037916626781225204,\n -0.017722193151712418,\n -0.0202616136521101,\n 0.023298168554902077,\n 0.0012461966834962368,\n 0.024695521220564842,\n 0.021658966317772865,\n -0.010936971753835678,\n 0.002561253262683749,\n -0.005156097002327442,\n -0.01057419739663601,\n 0.009754596278071404,\n 0.03125232830643654,\n -0.021282754838466644,\n -0.031924132257699966,\n 0.016647307202219963,\n 0.013355466537177563,\n -0.00647283298894763,\n 0.019334523007273674,\n 
0.012032012455165386,\n 0.02778581902384758,\n -0.008001187816262245,\n 0.011071332730352879,\n 0.0004958754288963974,\n 0.006368703208863735,\n -0.002013732912018895,\n -0.011951396241784096,\n -0.02372812293469906,\n 0.0049075293354690075,\n 0.032810915261507034,\n -0.010963844135403633,\n -0.013986961916089058,\n 0.041974324733018875,\n -0.018541794270277023,\n 0.022586055099964142,\n -0.003758744103834033,\n 0.020355666056275368,\n 0.022129228338599205,\n 0.004353290889412165,\n -0.03531002625823021,\n 0.0034564323723316193,\n 0.00438352208584547,\n -0.0040207477286458015,\n -0.002181683899834752,\n 0.005488639697432518,\n 0.013227824121713638,\n -0.002729204250499606,\n 0.02899506688117981,\n -0.027328992262482643,\n 0.019536064937710762,\n -0.01735941879451275,\n 0.007235330529510975,\n -0.02126931957900524,\n 0.026388466358184814,\n 0.016284532845020294,\n 0.0048806569539010525,\n -0.018971748650074005,\n -0.008384115993976593,\n 0.00697332713752985,\n 0.023647505789995193,\n 0.011843906715512276,\n -0.004353290889412165,\n 0.013059872202575207,\n -0.014846871607005596,\n -0.0016291250940412283,\n 0.010896663181483746,\n -0.003557202871888876,\n 0.0031524410005658865,\n -0.0016249263426288962,\n -0.03509504720568657,\n 0.005639795679599047,\n 0.00781980063766241,\n 0.007786210160702467,\n 0.007907134480774403,\n -0.01518277358263731,\n 0.0005798509810119867,\n 0.006738195661455393,\n 0.014578149653971195,\n 0.023553453385829926,\n 0.013395775109529495,\n 0.015706781297922134,\n 0.019119545817375183,\n -0.006818812340497971,\n 0.01567990891635418,\n -0.0037251540925353765,\n 0.003856155788525939,\n 0.009425411932170391,\n 0.012052166275680065,\n -0.030392419546842575,\n 0.03697609901428223,\n -0.0009875521063804626,\n 0.05073465034365654,\n -0.012119347229599953,\n -0.03520253673195839,\n 0.027758946642279625,\n -0.012240272015333176,\n 0.018246199935674667,\n 0.0012361196568235755,\n -0.002601561602205038,\n 0.02625410631299019,\n -0.03541751578450203,\n 
0.0018071531085297465,\n 0.010795892216265202,\n 0.002294211182743311,\n 0.023486273363232613,\n 0.02646908350288868,\n 0.006677733268588781,\n 0.0025394195690751076,\n 0.010218140669167042,\n 0.013200951740145683,\n 0.02325785905122757,\n 0.020328793674707413,\n -0.004642166662961245,\n -0.015558984130620956,\n 0.008323653601109982,\n -0.003956926520913839,\n -0.015196209773421288,\n -0.006422447506338358,\n -0.010513735003769398,\n 0.009794904850423336,\n -0.011434106156229973,\n -0.018555231392383575,\n -0.014833435416221619,\n 0.008934995159506798,\n -0.00547184469178319,\n -0.0012193245347589254,\n -0.01504841260612011,\n 0.022693544626235962,\n 0.02642877586185932,\n 0.001154663390479982,\n -0.01336890272796154,\n -0.004890734329819679,\n 0.025071730837225914,\n -0.005928671453148127,\n 0.004323059692978859,\n -0.012018576264381409,\n 0.030822373926639557,\n 0.007517488207668066,\n 0.011675955727696419,\n 0.004951196722686291,\n 0.015827706083655357,\n 0.014094451442360878,\n -0.039018385112285614,\n 0.003486663568764925,\n 0.0029206685721874237,\n 0.02284134179353714,\n 0.00033863127464428544,\n -0.014403481036424637,\n -0.001502322033047676,\n 0.0244671069085598,\n 0.0013881153427064419,\n 0.0076316953636705875,\n -0.009700851514935493,\n -0.02435961924493313,\n -0.045494578778743744,\n 0.010124088265001774,\n 0.0009220512001775205,\n -0.01584114134311676,\n -0.005663308780640364,\n -0.012549301609396935,\n -0.0027980643790215254,\n -0.017292238771915436,\n -0.013241259381175041,\n 0.005199763923883438,\n -0.0024604827631264925,\n 0.011541595682501793,\n -0.005219918210059404,\n -0.01762814074754715,\n -0.0006478711147792637,\n 0.09980322420597076,\n 0.02805454097688198,\n 0.004746296443045139,\n 0.021699273958802223,\n 0.006892710458487272,\n 0.011971550062298775,\n -0.007779492065310478,\n -0.010426400229334831,\n 0.009768032468855381,\n -0.023029446601867676,\n 0.01742659881711006,\n 0.0013268132461234927,\n 0.002284134039655328,\n 
0.007766055874526501,\n 0.018461177125573158,\n -0.016244225203990936,\n -0.021336499601602554,\n -0.0093447957187891,\n 0.004245802294462919,\n -0.004783245734870434,\n -0.009808340109884739,\n 0.014631894417107105,\n -0.02362063340842724,\n 0.013651059940457344,\n 0.021954558789730072,\n -0.02114839479327202,\n 0.0031591590959578753,\n 0.003164197551086545,\n -0.010769020766019821,\n -0.006855761166661978,\n -0.016969772055745125,\n -0.00590515835210681,\n -0.015411186963319778,\n 0.0001366701617371291,\n -0.015196209773421288,\n 0.0011731380363926291,\n 0.009855367243289948,\n -0.018071532249450684,\n 0.03936772421002388,\n -0.027342429384589195,\n 0.029451893642544746,\n 0.0027476788964122534,\n -0.009828494861721992,\n -0.02257261984050274,\n 0.01479312777519226,\n -0.026119744405150414,\n -0.01007706206291914,\n 0.009559772908687592,\n -0.014752819202840328,\n -0.03135981783270836,\n 0.014322864823043346,\n -0.0008481527329422534,\n -0.01502154115587473,\n 0.004148390609771013,\n 0.010856354609131813,\n 0.013919781893491745,\n -0.03135981783270836,\n -0.010688403621315956,\n 0.008827506564557552,\n -0.017251931130886078,\n -0.009700851514935493,\n -0.012925512157380581,\n 0.010466708801686764,\n -0.019831659272313118,\n 0.009721006266772747,\n -0.028403880074620247,\n -0.0027325632981956005,\n 0.00016994545876514167,\n 0.0021850429475307465,\n 0.004924324341118336,\n 0.02958625555038452,\n 0.008216165006160736,\n -0.01915985345840454,\n -0.0005004940903745592,\n -0.004598499275743961,\n 0.02642877586185932,\n 0.011689391918480396,\n -0.0024201744236052036,\n 0.008384115993976593,\n 0.02309662662446499,\n 0.023378783836960793,\n -0.040657587349414825,\n -0.015908321365714073,\n -0.0019129622960463166,\n -0.019966019317507744,\n 0.009156690910458565,\n -0.01279115118086338,\n -0.0228950846940279,\n 0.002087631495669484,\n -0.0008225401979871094,\n 0.013281567953526974,\n -0.0075779506005346775,\n 0.00553902518004179,\n -0.019038928672671318,\n 
-0.02327129617333412,\n 0.002831654390320182,\n 0.023177243769168854,\n -0.02531358040869236,\n 0.0001918840571306646,\n -0.004662320949137211,\n 0.0281620305031538,\n -0.00968741625547409,\n 0.0023210833314806223,\n -0.011561749503016472,\n -0.0007293273811228573,\n 0.018770208582282066,\n 0.005458408500999212,\n 0.0038628738839179277,\n 0.00013467573444359004,\n -0.0010253411019220948,\n 0.014161631464958191,\n 0.003374136285856366,\n 0.012065602466464043,\n -0.013617469929158688,\n -0.0018323458498343825,\n 0.005045249126851559,\n 0.0011084768921136856,\n -0.006785221863538027,\n 0.009069356136023998,\n -0.0005160295404493809,\n -0.012636636383831501,\n -0.00517289200797677,\n 0.022357642650604248,\n -0.006953172851353884,\n -0.029666870832443237,\n 0.0021296192426234484,\n -0.0006881793960928917,\n -0.002552855759859085,\n 0.0004912567674182355,\n -0.004255879204720259,\n -0.0008040656102821231,\n 0.018810516223311424,\n -0.015196209773421288,\n -0.015962066128849983,\n -0.008565503172576427,\n 0.0014930847100913525,\n -0.023338476195931435,\n 0.012092474848031998,\n -0.025743534788489342,\n -0.010957125574350357,\n -0.033079635351896286,\n -0.00035395679879002273,\n 0.022760724648833275,\n -0.003933413419872522,\n -0.03563249111175537,\n -0.04122190177440643,\n 0.0057943109422922134,\n 0.017077261582016945,\n -0.008478168398141861,\n 0.023862482979893684,\n -0.004820194561034441,\n -0.004339854698628187,\n -0.007947443053126335,\n -0.01799091510474682,\n -0.01946888491511345,\n -0.027812691405415535,\n -0.019656989723443985,\n 0.01883738860487938,\n 0.02663031592965126,\n 0.028484495356678963,\n 0.02800079621374607,\n 0.020328793674707413,\n 0.02125588245689869,\n -5.395427069743164e-05,\n 0.02556886523962021,\n -0.012052166275680065,\n 0.03243470564484596,\n 0.004974709823727608,\n -0.0105204526335001,\n 0.011897651478648186,\n 0.013335312716662884,\n -0.013825729489326477,\n 0.014363172464072704,\n -0.01811183989048004,\n -0.0024403284769505262,\n 
0.03727169334888458,\n -0.012092474848031998,\n -0.016284532845020294,\n -0.008148984052240849,\n -0.031091095879673958,\n -0.01621735282242298,\n 0.006654220167547464,\n 0.020879672840237617,\n 0.005364356096833944,\n -0.03436949849128723,\n -0.025931639596819878,\n 0.013590597547590733,\n 0.008639401756227016,\n 0.017278803512454033,\n -0.012992692179977894,\n 0.021766453981399536,\n -0.003293519839644432,\n 0.013919781893491745,\n 0.009438848122954369,\n 0.015505239367485046,\n -0.02374155819416046,\n -0.010863073170185089,\n -0.0030936580151319504,\n 0.0107891745865345,\n 0.017668448388576508,\n 0.005965620744973421,\n 0.005928671453148127,\n 0.002952579176053405,\n -0.016714487224817276,\n -0.017036953940987587,\n 0.024964241310954094,\n -0.01173641812056303,\n -0.003752026241272688,\n 0.01094368938356638,\n -0.022747289389371872,\n 0.00047992009785957634,\n -0.01778937317430973,\n -0.05425490438938141,\n -0.01731911115348339,\n 0.020812492817640305,\n -0.0032431345898658037,\n -0.0292100440710783,\n -0.004279392305761576,\n 0.012482120655477047,\n -0.03541751578450203,\n 0.002704011742025614,\n -0.007759337779134512,\n 0.02304288186132908,\n 0.012199963442981243,\n 0.028538240119814873,\n 0.014860307797789574,\n -0.012307452037930489,\n -0.01936139538884163,\n -0.0033607003279030323,\n -0.004014029633253813,\n -0.007638412993401289,\n -0.010271885432302952,\n 0.008021341636776924,\n 0.0010925214737653732,\n -0.0373791828751564,\n 0.0024923933669924736,\n 0.008021341636776924,\n -0.00739656388759613,\n -0.02410433255136013,\n 0.025246400386095047,\n 0.005374433007091284,\n 0.010762302204966545,\n -0.006627347785979509,\n -0.015142465941607952,\n -0.050439056009054184,\n 0.04108754172921181,\n 0.03869592025876045,\n 0.004007311537861824,\n 0.003866232931613922,\n 0.004413753282278776,\n 0.015129029750823975,\n 0.023298168554902077,\n -0.024064024910330772,\n 0.0011177140986546874,\n -0.009270897135138512,\n 0.0016266057500615716,\n 0.017386291176080704,\n 
-0.013745113275945187,\n 0.01694290153682232,\n 0.003973721526563168,\n -0.012011858634650707,\n -7.295373507076874e-05,\n -0.016324840486049652,\n 0.011555030941963196,\n 0.014725946821272373,\n 0.003930054139345884,\n -0.012253707274794579,\n -0.01537087932229042,\n 0.0050519672222435474,\n 0.016136735677719116,\n -0.04573642462491989,\n -0.009647107683122158,\n -0.014201940037310123,\n -0.006543372292071581,\n 0.017655013129115105,\n 0.0035168947651982307,\n -0.00868642795830965,\n 0.011165385134518147,\n -0.023768430575728416,\n -0.011763290502130985,\n 0.03350958973169327,\n 0.003799052443355322,\n 0.0060966224409639835,\n 0.0007314267568290234,\n -0.004679115954786539,\n -0.003631101455539465,\n 0.007705593481659889,\n 0.010433118790388107,\n 0.029021939262747765,\n -0.008390833623707294,\n -0.023929663002490997,\n -0.010963844135403633,\n 0.00109504081774503,\n -0.0034161240328103304,\n 0.009304487146437168,\n -0.014282556250691414,\n -0.00626121461391449,\n 0.03221972659230232,\n -0.04299546405673027,\n 0.007121123839169741,\n 0.014551278203725815,\n -0.012206681072711945,\n -0.008169138804078102,\n 0.001264671329408884,\n -0.004766450263559818,\n 0.00836396124213934,\n 0.04237740486860275,\n 0.003034875262528658,\n -0.01231416966766119,\n -0.01523651834577322,\n 0.017775937914848328,\n 0.03990516811609268,\n -0.002383225131779909,\n 0.004830271936953068,\n 0.013563726097345352,\n 0.000969917222391814,\n 0.01346967276185751,\n 0.002389943227171898,\n -0.014806563034653664,\n 0.007436871994286776,\n -0.039448339492082596,\n 0.009015611372888088,\n 0.0007436032174155116,\n -0.004622012376785278,\n 0.004222289193421602,\n 0.016244225203990936,\n 0.01831338182091713,\n 0.005146019626408815,\n -0.013691368512809277,\n -0.03904525563120842,\n -0.024695521220564842,\n -0.019562937319278717,\n -0.013852601870894432,\n -0.009385104291141033,\n 0.003081901464611292,\n -0.013019564561545849,\n -0.025851024314761162,\n 0.011440824717283249,\n -0.02679155021905899,\n 
-0.025219528004527092,\n 0.01173641812056303,\n 0.01402727048844099,\n 0.02888757921755314,\n 0.020503463223576546,\n -0.007759337779134512,\n -0.013852601870894432,\n -0.005596128758043051,\n 0.0010958805214613676,\n -0.05113773047924042,\n -0.022236717864871025,\n -0.0123679144307971,\n 0.021954558789730072,\n 0.015196209773421288,\n -0.03004308231174946,\n -0.03135981783270836,\n -0.016284532845020294,\n -0.05863506719470024,\n -0.018138712272047997,\n 0.006852402351796627,\n 0.014282556250691414,\n 0.016459202393889427,\n -0.013006128370761871,\n 0.009613517671823502,\n 0.020705003291368484,\n 0.0090760737657547,\n 0.0022656593937426805,\n -0.006879274267703295,\n -0.02109465003013611,\n -0.003799052443355322,\n -0.006419088691473007,\n 0.000651650014333427,\n -0.01878364384174347,\n 0.002342917025089264,\n -0.015572420321404934,\n 0.010453272610902786,\n -0.015962066128849983,\n -0.00675163185223937,\n 0.021229011937975883,\n 0.0007910493877716362,\n -0.004830271936953068,\n -0.015518675558269024,\n 0.007087533827871084,\n 0.013295004144310951,\n 0.025877896696329117,\n 0.007873544469475746,\n -0.027973923832178116,\n -0.0028232568874955177,\n 0.008303498849272728,\n -0.0018491409718990326,\n -0.014712510630488396,\n -0.010056908242404461,\n 0.0013133770553395152,\n 0.0015375918010249734,\n 0.025394197553396225,\n -0.0009573209099471569,\n -0.0033640593755990267,\n -0.011749854311347008,\n -0.01386603806167841,\n 0.02336534857749939,\n 0.010809328407049179,\n 0.010695122182369232,\n 0.0006445121252909303,\n -0.010668249800801277,\n 0.0041886987164616585,\n 0.013630906119942665,\n 0.015397750772535801,\n -0.015505239367485046,\n -0.0025142270606011152,\n 0.02105434238910675,\n -0.005992493126541376,\n 0.019025493413209915,\n -0.04425845667719841,\n 0.007114405743777752,\n -0.023754995316267014,\n -0.010473426431417465,\n 0.012764278799295425,\n -0.012018576264381409,\n -0.04624699801206589,\n 0.022586055099964142,\n 0.00040014335536397994,\n 
-0.009257460944354534,\n -0.006214188411831856,\n 0.011252719908952713,\n -0.011467697098851204,\n -0.00610669981688261,\n -0.02998933754861355,\n 0.017184749245643616,\n -0.026482518762350082,\n -0.02105434238910675,\n -0.006083186715841293,\n -0.00826319120824337,\n -0.001481328159570694,\n -0.010983997955918312,\n 0.03326774016022682,\n 0.0012226835824549198,\n -0.004769809544086456,\n 0.19713421165943146,\n -0.014269120059907436,\n -0.0032179418485611677,\n 0.012105911038815975,\n -0.028511367738246918,\n -0.017131006345152855,\n 0.044177841395139694,\n 0.0017853195313364267,\n -0.0024302515666931868,\n 0.03783601149916649,\n -0.00892155896872282,\n 0.012549301609396935,\n 0.027275249361991882,\n -0.01050029881298542,\n 0.012723970226943493,\n -0.02199486829340458,\n -0.03536377102136612,\n -0.02347283624112606,\n 0.009519464336335659,\n 0.005925312638282776,\n 0.018703026697039604,\n 0.0024755983613431454,\n -0.01386603806167841,\n -0.011481133289635181,\n 0.0009497631108388305,\n 0.000136355243739672,\n -0.007302511017769575,\n -0.00022925317171029747,\n 0.010896663181483746,\n -0.0023731482215225697,\n -0.008934995159506798,\n -0.02426556497812271,\n -0.005135942716151476,\n 0.014040706679224968,\n -4.135794370085932e-05,\n 0.003822565544396639,\n 0.022236717864871025,\n 0.01215293724089861,\n 0.02794705331325531,\n 0.013476391322910786,\n 0.02304288186132908,\n 0.016297968104481697,\n 0.007846672087907791,\n -0.035229410976171494,\n -0.018716463819146156,\n 0.03547126054763794,\n ...]" + }, + "execution_count": 97, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "truncated = truncate_text_tokens(long_text)\n", + "get_embedding(truncated)" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", "source": [], "metadata": { "collapsed": false } }, { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "cell_type": "markdown", "source": [ - 
"compare_encodings(\"antidisestablishmentarianism\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "compare_encodings(\"2 + 2 = 4\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "compare_encodings(\"お誕生日おめでとう\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "long_prompt = str(list(range(3000)))\n", - "num_tokens_from_string(long_prompt, 'cl100k_base')" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "import openai\n", + "## 2. Chunking the input text\n", "\n", - "EMBEDDING_MODEL = 'text-embedding-ada-002'\n", - "EMBEDDING_CTX_LENGTH = 8191\n", + "Though the option above works, it has the clear drawback of simply discarding all text after the maximum context is filled. Another possible approach that addresses this issue is to in fact divide the input text into chunks and then embed each chunk individually. We can then either use the chunk embeddings separately, or combine them in some way, such as for example calculating their average (weighted by the size of each chunk).\n", "\n", - "openai.Embedding.create(input=long_prompt, model=EMBEDDING_MODEL)\n", - "\n" + "We will first take a function from python's own cookbook that breaks up a sequence into chunks." 
], "metadata": { "collapsed": false @@ -709,21 +170,7 @@ }, { "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def truncate_string_tokens(text: str, encoding_name: str = 'cl100k_base', max_tokens: int = EMBEDDING_CTX_LENGTH) -> list[int]:\n", - " \"\"\"Truncate a string to have `max_tokens` according to the given encoding.\"\"\"\n", - " encoding = tiktoken.get_encoding(encoding_name)\n", - " return encoding.encode(text)[:max_tokens]\n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, + "execution_count": 91, "outputs": [], "source": [ "from itertools import islice\n", @@ -736,14 +183,16 @@ " raise ValueError('n must be at least one')\n", " it = iter(iterable)\n", " while (batch := tuple(islice(it, n))):\n", - " yield batch\n", - "\n", - "\n", - "def chunked_tokens(text: str, encoding_name: str = 'cl100k_base', chunk_ctx_length: int = EMBEDDING_CTX_LENGTH):\n", - " encoding = tiktoken.get_encoding(encoding_name)\n", - " tokens = encoding.encode(text)\n", - " chunks_iterator = batched(tokens, chunk_ctx_length)\n", - " yield from chunks_iterator\n" + " yield batch" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "Now let's define a function that encodes a string into tokens and then breaks it up into chunks." 
], "metadata": { "collapsed": false @@ -753,25 +202,43 @@ "cell_type": "code", "execution_count": null, "outputs": [], + "source": [ + "def chunked_tokens(text, encoding_name, chunk_length):\n", + " encoding = tiktoken.get_encoding(encoding_name)\n", + " tokens = encoding.encode(text)\n", + " chunks_iterator = batched(tokens, chunk_length)\n", + " yield from chunks_iterator" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "Finally, we can write a function that safely handles embedding requests, even when the input text is longer than the maximum context length, by chunking the input tokens and embedding each chunk individually. The `reduction` flag can be set to either `'average'`, to return the weighted average of the chunk embeddings, or `None`, to simply return the unmodified list of chunk embeddings." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 101, + "outputs": [], "source": [ "import numpy as np\n", - "from tenacity import retry, wait_random_exponential, stop_after_attempt\n", "\n", "\n", - "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n", - "def get_embedding(tokens: Sequence[int], model=EMBEDDING_MODEL) -> list[float]:\n", - " return openai.Embedding.create(input=tokens, model=model)[\"data\"][0][\"embedding\"]\n", - "\n", - "\n", - "def len_safe_get_embedding(text: str, model=EMBEDDING_MODEL, max_tokens: int = EMBEDDING_CTX_LENGTH, encoding_name: str = 'cl100k_base', reduction: Optional[str] = None):\n", + "def len_safe_get_embedding(text, model=EMBEDDING_MODEL, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING, reduction=None):\n", " chunk_embeddings = []\n", - " for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_ctx_length=max_tokens):\n", + " for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens):\n", " chunk_embeddings.append(get_embedding(chunk, 
model=model))\n", "\n", " if reduction is None:\n", " return chunk_embeddings\n", " elif reduction == 'average':\n", - " return np.average(chunk_embeddings, axis=0, weights=[len(c) for c in chunk_embeddings]).tolist()\n", + " return [np.average(chunk_embeddings, axis=0, weights=[len(c) for c in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens)]).tolist()]\n", " else:\n", " raise ValueError(f'reduction {reduction} not valid.')\n", "\n", @@ -782,11 +249,35 @@ "collapsed": false } }, + { + "cell_type": "markdown", + "source": [ + "Once again, we can verify that we can now handle long input texts." + ], + "metadata": { + "collapsed": false + } + }, { "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [], + "execution_count": 102, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Setting reduction=None gives us 2 embedding vectors.\n", + "Setting reduction='average' gives us 1 embedding vector.\n" + ] + } + ], + "source": [ + "embedding_vectors_no_reduce = len_safe_get_embedding(long_text, reduction=None)\n", + "average_embedding_vector = len_safe_get_embedding(long_text, reduction='average')\n", + "\n", + "print(f\"Setting reduction=None gives us {len(embedding_vectors_no_reduce)} embedding vectors.\")\n", + "print(f\"Setting reduction='average' gives us {len(average_embedding_vector)} embedding vector.\")\n" + ], "metadata": { "collapsed": false } From f259e82ab66d8c5a8ff9ef3a350512907ca7ae16 Mon Sep 17 00:00:00 2001 From: Filipe de Avila Belbute Peres Date: Wed, 18 Jan 2023 17:51:32 -0800 Subject: [PATCH 3/9] Rename file --- ...mpts_to_context_length.ipynb => Embedding_long_inputs.ipynb} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename examples/{Truncate_prompts_to_context_length.ipynb => Embedding_long_inputs.ipynb} (99%) diff --git a/examples/Truncate_prompts_to_context_length.ipynb b/examples/Embedding_long_inputs.ipynb similarity index 99% rename from 
examples/Truncate_prompts_to_context_length.ipynb rename to 
examples/Embedding_long_inputs.ipynb index 1ed48f2..be629d2 100644 --- a/examples/Truncate_prompts_to_context_length.ipynb +++ b/examples/Embedding_long_inputs.ipynb @@ -5,7 +5,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Embedding texts that are larger than the model's context length\n", + "# Embedding texts that are longer than the model's context length\n", "\n", "All models have a maximum context length for the input text they take in. However, this maximum length is defined in terms of _tokens_ instead of string length. If you are unfamiliar with tokenization, you can check out the [\"How to count tokens with tiktoken\"](How_to_count_tokens_with_tiktoken.ipynb) notebook in this same cookbook.\n", "\n", From ec564da44d4f907592895426e022b60f4ae54883 Mon Sep 17 00:00:00 2001 From: Filipe de Avila Belbute Peres Date: Wed, 18 Jan 2023 18:16:22 -0800 Subject: [PATCH 4/9] Change len_safe function signature + other changes --- examples/Embedding_long_inputs.ipynb | 1172 +++++++++++++++++++++++--- 1 file changed, 1066 insertions(+), 106 deletions(-) diff --git a/examples/Embedding_long_inputs.ipynb b/examples/Embedding_long_inputs.ipynb index be629d2..eaf0c20 100644 --- a/examples/Embedding_long_inputs.ipynb +++ b/examples/Embedding_long_inputs.ipynb @@ -1,7 +1,6 @@ { "cells": [ { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -9,11 +8,10 @@ "\n", "All models have a maximum context length for the input text they take in. However, this maximum length is defined in terms of _tokens_ instead of string length. If you are unfamiliar with tokenization, you can check out the [\"How to count tokens with tiktoken\"](How_to_count_tokens_with_tiktoken.ipynb) notebook in this same cookbook.\n", "\n", - "In this notebook, we will go over how to deal with texts that are larger than a model's context length. 
In these examples, we will focus on embedding texts using the `text-embedding-ada-002`, but similar approaches can also be applied to other models and tasks. To learn about how to embed a text, check out the [Get embeddings](Get_embeddings.ipynb) notebook and the OpenAI [embeddings page](https://beta.openai.com/docs/guides/embeddings).\n" + "In this notebook, we will go over how to handle texts that are longer than a model's context length. In these examples, we will focus on embedding texts using the `text-embedding-ada-002` model, but similar approaches can also be applied to other models and tasks. To learn about how to embed a text, check out the [Get embeddings](Get_embeddings.ipynb) notebook and the OpenAI [embeddings page](https://beta.openai.com/docs/guides/embeddings).\n" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -24,38 +22,35 @@ }, { "cell_type": "code", - "execution_count": 85, + "execution_count": 16, + "metadata": {}, "outputs": [], "source": [ "import openai\n", - "from tenacity import retry, wait_random_exponential, stop_after_attempt\n", + "from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_not_exception_type\n", "\n", "\n", "EMBEDDING_MODEL = 'text-embedding-ada-002'\n", "EMBEDDING_CTX_LENGTH = 8191\n", "EMBEDDING_ENCODING = 'cl100k_base'\n", "\n", - "\n", - "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n", + "# let's make sure to not retry on an invalid request, because that is what we want to demonstrate\n", + "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6), retry=retry_if_not_exception_type(openai.InvalidRequestError))\n", "def get_embedding(text_or_tokens, model=EMBEDDING_MODEL):\n", " return openai.Embedding.create(input=text_or_tokens, model=model)[\"data\"][0][\"embedding\"]" - ], - "metadata": { - "collapsed": false - } + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "The 
`text-embedding-ada-002` model has a context length of 8191 tokens with the `cl100k_base` encoding, and we can see that going over that limit causes an error." - ], - "metadata": { - "collapsed": false - } + ] }, { "cell_type": "code", - "execution_count": 94, + "execution_count": 18, + "metadata": {}, "outputs": [ { "ename": "InvalidRequestError", @@ -64,7 +59,14 @@ "traceback": [ "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m", "\u001B[0;31mInvalidRequestError\u001B[0m Traceback (most recent call last)", - "Cell \u001B[0;32mIn [94], line 4\u001B[0m\n\u001B[1;32m 1\u001B[0m \u001B[38;5;28;01mimport\u001B[39;00m \u001B[38;5;21;01mopenai\u001B[39;00m\n\u001B[1;32m 3\u001B[0m long_text \u001B[38;5;241m=\u001B[39m \u001B[38;5;124m'\u001B[39m\u001B[38;5;124mAGI \u001B[39m\u001B[38;5;124m'\u001B[39m \u001B[38;5;241m*\u001B[39m \u001B[38;5;241m5000\u001B[39m\n\u001B[0;32m----> 4\u001B[0m get_embedding(\u001B[43mopenai\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mEmbedding\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43minput\u001B[39;49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mlong_text\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmodel\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mEMBEDDING_MODEL\u001B[49m\u001B[43m)\u001B[49m)\n", + "Cell \u001B[0;32mIn [18], line 2\u001B[0m\n\u001B[1;32m 1\u001B[0m long_text \u001B[38;5;241m=\u001B[39m \u001B[38;5;124m'\u001B[39m\u001B[38;5;124mAGI \u001B[39m\u001B[38;5;124m'\u001B[39m \u001B[38;5;241m*\u001B[39m \u001B[38;5;241m5000\u001B[39m\n\u001B[0;32m----> 2\u001B[0m \u001B[43mget_embedding\u001B[49m\u001B[43m(\u001B[49m\u001B[43mlong_text\u001B[49m\u001B[43m)\u001B[49m\n", + "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:326\u001B[0m, in \u001B[0;36mBaseRetrying.wraps..wrapped_f\u001B[0;34m(*args, **kw)\u001B[0m\n\u001B[1;32m 324\u001B[0m 
\u001B[38;5;129m@functools\u001B[39m\u001B[38;5;241m.\u001B[39mwraps(f)\n\u001B[1;32m 325\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mwrapped_f\u001B[39m(\u001B[38;5;241m*\u001B[39margs: t\u001B[38;5;241m.\u001B[39mAny, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkw: t\u001B[38;5;241m.\u001B[39mAny) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m t\u001B[38;5;241m.\u001B[39mAny:\n\u001B[0;32m--> 326\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mf\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkw\u001B[49m\u001B[43m)\u001B[49m\n", + "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:406\u001B[0m, in \u001B[0;36mRetrying.__call__\u001B[0;34m(self, fn, *args, **kwargs)\u001B[0m\n\u001B[1;32m 404\u001B[0m retry_state \u001B[38;5;241m=\u001B[39m RetryCallState(retry_object\u001B[38;5;241m=\u001B[39m\u001B[38;5;28mself\u001B[39m, fn\u001B[38;5;241m=\u001B[39mfn, args\u001B[38;5;241m=\u001B[39margs, kwargs\u001B[38;5;241m=\u001B[39mkwargs)\n\u001B[1;32m 405\u001B[0m \u001B[38;5;28;01mwhile\u001B[39;00m \u001B[38;5;28;01mTrue\u001B[39;00m:\n\u001B[0;32m--> 406\u001B[0m do \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43miter\u001B[49m\u001B[43m(\u001B[49m\u001B[43mretry_state\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mretry_state\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 407\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(do, DoAttempt):\n\u001B[1;32m 408\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n", + "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:351\u001B[0m, in \u001B[0;36mBaseRetrying.iter\u001B[0;34m(self, 
retry_state)\u001B[0m\n\u001B[1;32m 349\u001B[0m is_explicit_retry \u001B[38;5;241m=\u001B[39m retry_state\u001B[38;5;241m.\u001B[39moutcome\u001B[38;5;241m.\u001B[39mfailed \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(retry_state\u001B[38;5;241m.\u001B[39moutcome\u001B[38;5;241m.\u001B[39mexception(), TryAgain)\n\u001B[1;32m 350\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m (is_explicit_retry \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mretry(retry_state\u001B[38;5;241m=\u001B[39mretry_state)):\n\u001B[0;32m--> 351\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mfut\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mresult\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 353\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mafter \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[1;32m 354\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mafter(retry_state)\n", + "File \u001B[0;32m~/.pyenv/versions/3.9.9/lib/python3.9/concurrent/futures/_base.py:438\u001B[0m, in \u001B[0;36mFuture.result\u001B[0;34m(self, timeout)\u001B[0m\n\u001B[1;32m 436\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m CancelledError()\n\u001B[1;32m 437\u001B[0m \u001B[38;5;28;01melif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_state \u001B[38;5;241m==\u001B[39m FINISHED:\n\u001B[0;32m--> 438\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m__get_result\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 440\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_condition\u001B[38;5;241m.\u001B[39mwait(timeout)\n\u001B[1;32m 442\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m 
\u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_state \u001B[38;5;129;01min\u001B[39;00m [CANCELLED, CANCELLED_AND_NOTIFIED]:\n", + "File \u001B[0;32m~/.pyenv/versions/3.9.9/lib/python3.9/concurrent/futures/_base.py:390\u001B[0m, in \u001B[0;36mFuture.__get_result\u001B[0;34m(self)\u001B[0m\n\u001B[1;32m 388\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_exception:\n\u001B[1;32m 389\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m--> 390\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_exception\n\u001B[1;32m 391\u001B[0m \u001B[38;5;28;01mfinally\u001B[39;00m:\n\u001B[1;32m 392\u001B[0m \u001B[38;5;66;03m# Break a reference cycle with the exception in self._exception\u001B[39;00m\n\u001B[1;32m 393\u001B[0m \u001B[38;5;28mself\u001B[39m \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m\n", + "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:409\u001B[0m, in \u001B[0;36mRetrying.__call__\u001B[0;34m(self, fn, *args, **kwargs)\u001B[0m\n\u001B[1;32m 407\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(do, DoAttempt):\n\u001B[1;32m 408\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m--> 409\u001B[0m result \u001B[38;5;241m=\u001B[39m \u001B[43mfn\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 410\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mBaseException\u001B[39;00m: \u001B[38;5;66;03m# noqa: B902\u001B[39;00m\n\u001B[1;32m 411\u001B[0m retry_state\u001B[38;5;241m.\u001B[39mset_exception(sys\u001B[38;5;241m.\u001B[39mexc_info())\n", + "Cell \u001B[0;32mIn [16], line 12\u001B[0m, in 
\u001B[0;36mget_embedding\u001B[0;34m(text_or_tokens, model)\u001B[0m\n\u001B[1;32m 10\u001B[0m \u001B[38;5;129m@retry\u001B[39m(wait\u001B[38;5;241m=\u001B[39mwait_random_exponential(\u001B[38;5;28mmin\u001B[39m\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m1\u001B[39m, \u001B[38;5;28mmax\u001B[39m\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m20\u001B[39m), stop\u001B[38;5;241m=\u001B[39mstop_after_attempt(\u001B[38;5;241m6\u001B[39m), retry\u001B[38;5;241m=\u001B[39mretry_if_not_exception_type(openai\u001B[38;5;241m.\u001B[39mInvalidRequestError))\n\u001B[1;32m 11\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mget_embedding\u001B[39m(text_or_tokens, model\u001B[38;5;241m=\u001B[39mEMBEDDING_MODEL):\n\u001B[0;32m---> 12\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mopenai\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mEmbedding\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43minput\u001B[39;49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mtext_or_tokens\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmodel\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mmodel\u001B[49m\u001B[43m)\u001B[49m[\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mdata\u001B[39m\u001B[38;5;124m\"\u001B[39m][\u001B[38;5;241m0\u001B[39m][\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124membedding\u001B[39m\u001B[38;5;124m\"\u001B[39m]\n", "File \u001B[0;32m~/code/openai-python/openai/api_resources/embedding.py:33\u001B[0m, in \u001B[0;36mEmbedding.create\u001B[0;34m(cls, *args, **kwargs)\u001B[0m\n\u001B[1;32m 31\u001B[0m \u001B[38;5;28;01mwhile\u001B[39;00m \u001B[38;5;28;01mTrue\u001B[39;00m:\n\u001B[1;32m 32\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m---> 33\u001B[0m response \u001B[38;5;241m=\u001B[39m 
\u001B[38;5;28;43msuper\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 35\u001B[0m \u001B[38;5;66;03m# If a user specifies base64, we'll just return the encoded string.\u001B[39;00m\n\u001B[1;32m 36\u001B[0m \u001B[38;5;66;03m# This is only for the default case.\u001B[39;00m\n\u001B[1;32m 37\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m user_provided_encoding_format:\n", "File \u001B[0;32m~/code/openai-python/openai/api_resources/abstract/engine_api_resource.py:153\u001B[0m, in \u001B[0;36mEngineAPIResource.create\u001B[0;34m(cls, api_key, api_base, api_type, request_id, api_version, organization, **params)\u001B[0m\n\u001B[1;32m 127\u001B[0m \u001B[38;5;129m@classmethod\u001B[39m\n\u001B[1;32m 128\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mcreate\u001B[39m(\n\u001B[1;32m 129\u001B[0m \u001B[38;5;28mcls\u001B[39m,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 136\u001B[0m \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams,\n\u001B[1;32m 137\u001B[0m ):\n\u001B[1;32m 138\u001B[0m (\n\u001B[1;32m 139\u001B[0m deployment_id,\n\u001B[1;32m 140\u001B[0m engine,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 150\u001B[0m api_key, api_base, api_type, api_version, organization, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams\n\u001B[1;32m 151\u001B[0m )\n\u001B[0;32m--> 153\u001B[0m response, _, api_key \u001B[38;5;241m=\u001B[39m \u001B[43mrequestor\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mrequest\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 154\u001B[0m \u001B[43m 
\u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mpost\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\n\u001B[1;32m 155\u001B[0m \u001B[43m \u001B[49m\u001B[43murl\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 156\u001B[0m \u001B[43m \u001B[49m\u001B[43mparams\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mparams\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 157\u001B[0m \u001B[43m \u001B[49m\u001B[43mheaders\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 158\u001B[0m \u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mstream\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 159\u001B[0m \u001B[43m \u001B[49m\u001B[43mrequest_id\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_id\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 160\u001B[0m \u001B[43m \u001B[49m\u001B[43mrequest_timeout\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_timeout\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 161\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 163\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m stream:\n\u001B[1;32m 164\u001B[0m \u001B[38;5;66;03m# must be an iterator\u001B[39;00m\n\u001B[1;32m 165\u001B[0m \u001B[38;5;28;01massert\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(response, OpenAIResponse)\n", "File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:227\u001B[0m, in \u001B[0;36mAPIRequestor.request\u001B[0;34m(self, method, url, params, headers, files, stream, request_id, request_timeout)\u001B[0m\n\u001B[1;32m 206\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mrequest\u001B[39m(\n\u001B[1;32m 207\u001B[0m \u001B[38;5;28mself\u001B[39m,\n\u001B[1;32m 208\u001B[0m method,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 215\u001B[0m request_timeout: Optional[Union[\u001B[38;5;28mfloat\u001B[39m, Tuple[\u001B[38;5;28mfloat\u001B[39m, 
\u001B[38;5;28mfloat\u001B[39m]]] \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m,\n\u001B[1;32m 216\u001B[0m ) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m Tuple[Union[OpenAIResponse, Iterator[OpenAIResponse]], \u001B[38;5;28mbool\u001B[39m, \u001B[38;5;28mstr\u001B[39m]:\n\u001B[1;32m 217\u001B[0m result \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mrequest_raw(\n\u001B[1;32m 218\u001B[0m method\u001B[38;5;241m.\u001B[39mlower(),\n\u001B[1;32m 219\u001B[0m url,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 225\u001B[0m request_timeout\u001B[38;5;241m=\u001B[39mrequest_timeout,\n\u001B[1;32m 226\u001B[0m )\n\u001B[0;32m--> 227\u001B[0m resp, got_stream \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_interpret_response\u001B[49m\u001B[43m(\u001B[49m\u001B[43mresult\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 228\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m resp, got_stream, \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mapi_key\n", @@ -76,35 +78,29 @@ ], "source": [ "long_text = 'AGI ' * 5000\n", - "get_embedding(input=long_text)" - ], - "metadata": { - "collapsed": false - } + "get_embedding(long_text)" + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ - "Clearly we want to avoid these errors, particularly when dealing programatically with a large number of embeddings. Yet, we still might be faced with texts that are longer than the maximum context length. Below we will describe and provide recipes for the main approaches to dealing with these longer texts: (1) simply truncating the text to the maximum allowed length, and (2) chunking the text and embeddings each chunk individually." - ], - "metadata": { - "collapsed": false - } + "Clearly we want to avoid these errors, particularly when handling programatically with a large number of embeddings. 
Yet, we still might be faced with texts that are longer than the maximum context length. Below we will describe and provide recipes for the main approaches to handling these longer texts: (1) simply truncating the text to the maximum allowed length, and (2) chunking the text and embedding each chunk individually." + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "## 1. Truncating the input text\n", "\n", "The simplest solution is to truncate the input text to the maximum allowed length. Since the context length is in terms of tokens, we have to first tokenize the text before truncating it. The API accepts inputs both in the form of text or tokens, thus as long as you are careful that you are using the appropriate encoding, there is no need to convert the tokens back into string form. Below is an example of such a truncation function." - ], - "metadata": { - "collapsed": false - } + ] }, { "cell_type": "code", "execution_count": 98, + "metadata": {}, "outputs": [], "source": [ "import tiktoken\n", @@ -113,27 +109,1025 @@ " \"\"\"Truncate a string to have `max_tokens` according to the given encoding.\"\"\"\n", " encoding = tiktoken.get_encoding(encoding_name)\n", " return encoding.encode(text)[:max_tokens]" - ], - "metadata": { - "collapsed": false - } + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "Our example from before now works." 
- ], - "metadata": { - "collapsed": false - } + ] }, { "cell_type": "code", "execution_count": 97, + "metadata": {}, "outputs": [ { "data": { - "text/plain": "[-0.015384314581751823,\n 0.0031692360062152147,\n -0.007302511017769575,\n -0.02778581902384758,\n -0.013409210368990898,\n 0.0029592972714453936,\n -0.019119545817375183,\n -0.0004874778969679028,\n -0.010721994563937187,\n -0.023486273363232613,\n 0.016351712867617607,\n 0.005532307084649801,\n -0.009136536158621311,\n -0.014282556250691414,\n 0.005122506525367498,\n 0.02888757921755314,\n 0.020973725244402885,\n 0.009136536158621311,\n 0.003303596982732415,\n -0.013382338918745518,\n -0.024749264121055603,\n 0.03904525563120842,\n -0.01699664443731308,\n -0.010312194004654884,\n -0.009029047563672066,\n -0.001587137347087264,\n 0.017036953940987587,\n -0.056915245950222015,\n -0.011084768921136856,\n -0.006375421304255724,\n 0.011145230382680893,\n -0.01094368938356638,\n -0.010184550657868385,\n -0.009546336717903614,\n -0.012105911038815975,\n -0.004675756674259901,\n 0.002245505340397358,\n -0.0015040015568956733,\n -0.007457026280462742,\n 0.0029206685721874237,\n 0.03993203863501549,\n -0.02390279248356819,\n 0.003399329027161002,\n -0.02109465003013611,\n -0.026590008288621902,\n 0.004457420669496059,\n -0.03638491407036781,\n -0.018958313390612602,\n 0.002221992239356041,\n -0.007846672087907791,\n -0.0106548136100173,\n 0.0019096032483503222,\n -0.015451495535671711,\n -0.00783995445817709,\n 0.016821976751089096,\n 0.007409999612718821,\n -0.017601268365979195,\n 0.01502154115587473,\n -0.026119744405150414,\n -0.011333336122334003,\n -0.017184749245643616,\n -0.0352562814950943,\n -0.002327801426872611,\n 0.015666473656892776,\n -0.023069754242897034,\n -0.016821976751089096,\n -0.0005298855248838663,\n 0.0010933612938970327,\n 0.0048571438528597355,\n -0.034503862261772156,\n 0.007712311577051878,\n 0.038024116307497025,\n -0.017856555059552193,\n -0.02415807731449604,\n 0.020664695650339127,\n 
-0.01742659881711006,\n 0.012072320096194744,\n 0.015249954536557198,\n -0.008357243612408638,\n 0.001610650448128581,\n 0.018017787486314774,\n -0.02247856743633747,\n -3.219936115783639e-05,\n 0.02421182207763195,\n 0.010594351217150688,\n 0.01800435036420822,\n -0.019777914509177208,\n 0.024695521220564842,\n 0.0013805575435981154,\n -0.0138122932985425,\n 0.02132306434214115,\n 0.023325040936470032,\n 0.027597714215517044,\n 0.06062360480427742,\n -0.019562937319278717,\n 0.009559772908687592,\n -0.02183363400399685,\n 0.0173728559166193,\n -0.028242645785212517,\n -0.03058052435517311,\n 0.01847461424767971,\n -0.026536263525485992,\n -0.007947443053126335,\n -0.007517488207668066,\n -0.026616880670189857,\n 0.009183562360703945,\n 0.01872989907860756,\n -0.022075483575463295,\n 0.019589809700846672,\n -0.023916227743029594,\n 0.019347960129380226,\n 0.02378186769783497,\n 0.019764477387070656,\n -0.0202616136521101,\n -0.019401703029870987,\n 0.006335113197565079,\n 0.015209645964205265,\n -0.029935592785477638,\n -0.007013635244220495,\n -0.0363042950630188,\n 0.00704050762578845,\n 0.01616360805928707,\n 0.014981232583522797,\n -0.0013931537978351116,\n 0.030661141499876976,\n 0.01389290951192379,\n 0.007712311577051878,\n -0.01910611055791378,\n -0.0020792337600141764,\n -0.008404269814491272,\n 0.024722393602132797,\n 0.01699664443731308,\n 0.008350525982677937,\n 0.009727723896503448,\n -0.010695122182369232,\n 0.006560167297720909,\n -0.031386688351631165,\n 0.0263078510761261,\n -0.0001876852911664173,\n -0.01816558465361595,\n 0.019482320174574852,\n 0.023190679028630257,\n -0.015115593560039997,\n -0.015384314581751823,\n -0.005233354400843382,\n 0.004225648008286953,\n -0.0011555030941963196,\n -0.012092474848031998,\n 0.011602058075368404,\n -0.02179332636296749,\n -0.003029836807399988,\n 0.0030382343102246523,\n -0.011151948943734169,\n 0.007430153898894787,\n 0.001625766046345234,\n 0.010795892216265202,\n 0.0033136738929897547,\n 
0.013167361728847027,\n -0.027033399790525436,\n 0.002052361611276865,\n 0.015061848796904087,\n 0.017762500792741776,\n 0.014349736273288727,\n -0.007047225721180439,\n 0.014887180179357529,\n 0.023190679028630257,\n 0.0055289482697844505,\n 0.018084967508912086,\n -0.0014888859586790204,\n -0.003711717901751399,\n 0.008290063589811325,\n 0.03740605339407921,\n 0.007960879243910313,\n 0.01809840463101864,\n 0.010916817001998425,\n 0.03504130616784096,\n 0.0031138123013079166,\n -0.005303893703967333,\n -0.022868212312459946,\n -0.01373839471489191,\n -0.013933218084275723,\n 0.008525194600224495,\n 0.05304565653204918,\n 0.014537842012941837,\n 0.006230983417481184,\n -0.004920965526252985,\n 0.002856847131624818,\n -0.015868013724684715,\n 0.006835607346147299,\n -0.027449917048215866,\n -0.0049041700549423695,\n 0.009808340109884739,\n 0.028914449736475945,\n -0.017386291176080704,\n -0.6199946403503418,\n -0.02336534857749939,\n -0.018353689461946487,\n -0.0028131799772381783,\n 0.019804786890745163,\n 0.04409722611308098,\n 0.005280380602926016,\n 0.011702828109264374,\n -0.024829881265759468,\n 0.01465876679867506,\n -0.013395775109529495,\n 0.025877896696329117,\n -0.01636514998972416,\n 0.01199842244386673,\n -0.01084291934967041,\n -0.008827506564557552,\n 0.00870658177882433,\n -0.020086944103240967,\n -0.0006025243201293051,\n 0.027812691405415535,\n -0.03404703363776207,\n 0.0019079238409176469,\n 0.0024403284769505262,\n -0.006099981721490622,\n 0.009082792326807976,\n -0.0050519672222435474,\n 0.014309428632259369,\n -0.022921957075595856,\n -0.02199486829340458,\n 0.003607588354498148,\n -0.008518476970493793,\n 0.00871329940855503,\n 0.02747678942978382,\n -0.020906545221805573,\n 0.04253863915801048,\n 0.000455147324828431,\n 0.014484097249805927,\n 0.033079635351896286,\n 0.026590008288621902,\n 0.05258882790803909,\n -0.025971949100494385,\n 0.010493581183254719,\n 0.026455648243427277,\n -0.008511758409440517,\n -0.019025493413209915,\n 
0.020019764080643654,\n 0.01937483251094818,\n -0.013415928930044174,\n -0.0027863075956702232,\n -0.007860108278691769,\n 0.011662520468235016,\n -0.007255484815686941,\n 0.0033707772381603718,\n -0.01479312777519226,\n 0.009358231909573078,\n 0.007100970018655062,\n 0.02388935536146164,\n -0.017171313986182213,\n -0.008726735599339008,\n 0.010977279394865036,\n -0.003943490330129862,\n 0.004695910960435867,\n -0.003323751036077738,\n 0.011642365716397762,\n -0.014510969631373882,\n 0.0063888574950397015,\n -0.006832248065620661,\n 0.01937483251094818,\n 0.0011857342906296253,\n 0.006308240815997124,\n -0.0029643357265740633,\n 0.012569455429911613,\n -0.013677932322025299,\n 0.01015767827630043,\n -0.002561253262683749,\n -0.0055994875729084015,\n 0.024144640192389488,\n 0.0076988753862679005,\n -0.01128630992025137,\n -0.022277025505900383,\n 0.013422646559774876,\n 0.00892155896872282,\n 0.0036613326519727707,\n -0.009438848122954369,\n 0.04151749610900879,\n -0.005727130454033613,\n -0.00863268319517374,\n -0.012804587371647358,\n 0.011138512752950191,\n -0.003283442696556449,\n -0.00783995445817709,\n 0.028538240119814873,\n 0.00030609077657572925,\n 0.006113417912274599,\n 0.0205303356051445,\n 0.0037721802946180105,\n -0.02425212971866131,\n 0.013771984726190567,\n 0.0034833045210689306,\n -0.01748034358024597,\n -0.0062444196082651615,\n -0.005653231870383024,\n 0.011037741787731647,\n 0.02684529311954975,\n -0.023822175338864326,\n 0.041598111391067505,\n -0.02915629930794239,\n -0.009895674884319305,\n 0.03240783140063286,\n -0.022639799863100052,\n 0.01879708096385002,\n -0.03727169334888458,\n -0.02415807731449604,\n -0.02132306434214115,\n 0.014940924011170864,\n -0.03536377102136612,\n 0.012925512157380581,\n 0.012421658262610435,\n 0.017117569223046303,\n -0.01281130500137806,\n -0.014269120059907436,\n -0.010144243016839027,\n 0.0049075293354690075,\n -0.01338905654847622,\n 0.00038628740003332496,\n -7.085434481268749e-05,\n 
-0.00503853103145957,\n -0.024507414549589157,\n -0.022653235122561455,\n 0.02374155819416046,\n 0.03141356259584427,\n -0.003390931524336338,\n 0.015653036534786224,\n -0.024386489763855934,\n 0.05546415224671364,\n 0.015438059344887733,\n 0.02504485845565796,\n -0.001958309207111597,\n 0.013684650883078575,\n -0.01979134976863861,\n -0.022706979885697365,\n 0.013825729489326477,\n 0.008753607980906963,\n -0.014537842012941837,\n 0.01672792248427868,\n -0.01663387008011341,\n -0.014121322892606258,\n -0.015451495535671711,\n -0.005186328198760748,\n 0.03512192144989967,\n -0.008746890351176262,\n 0.029693743214011192,\n -0.016486072912812233,\n 0.026482518762350082,\n -0.023969972506165504,\n -0.037916626781225204,\n -0.017722193151712418,\n -0.0202616136521101,\n 0.023298168554902077,\n 0.0012461966834962368,\n 0.024695521220564842,\n 0.021658966317772865,\n -0.010936971753835678,\n 0.002561253262683749,\n -0.005156097002327442,\n -0.01057419739663601,\n 0.009754596278071404,\n 0.03125232830643654,\n -0.021282754838466644,\n -0.031924132257699966,\n 0.016647307202219963,\n 0.013355466537177563,\n -0.00647283298894763,\n 0.019334523007273674,\n 0.012032012455165386,\n 0.02778581902384758,\n -0.008001187816262245,\n 0.011071332730352879,\n 0.0004958754288963974,\n 0.006368703208863735,\n -0.002013732912018895,\n -0.011951396241784096,\n -0.02372812293469906,\n 0.0049075293354690075,\n 0.032810915261507034,\n -0.010963844135403633,\n -0.013986961916089058,\n 0.041974324733018875,\n -0.018541794270277023,\n 0.022586055099964142,\n -0.003758744103834033,\n 0.020355666056275368,\n 0.022129228338599205,\n 0.004353290889412165,\n -0.03531002625823021,\n 0.0034564323723316193,\n 0.00438352208584547,\n -0.0040207477286458015,\n -0.002181683899834752,\n 0.005488639697432518,\n 0.013227824121713638,\n -0.002729204250499606,\n 0.02899506688117981,\n -0.027328992262482643,\n 0.019536064937710762,\n -0.01735941879451275,\n 0.007235330529510975,\n -0.02126931957900524,\n 
0.026388466358184814,\n 0.016284532845020294,\n 0.0048806569539010525,\n -0.018971748650074005,\n -0.008384115993976593,\n 0.00697332713752985,\n 0.023647505789995193,\n 0.011843906715512276,\n -0.004353290889412165,\n 0.013059872202575207,\n -0.014846871607005596,\n -0.0016291250940412283,\n 0.010896663181483746,\n -0.003557202871888876,\n 0.0031524410005658865,\n -0.0016249263426288962,\n -0.03509504720568657,\n 0.005639795679599047,\n 0.00781980063766241,\n 0.007786210160702467,\n 0.007907134480774403,\n -0.01518277358263731,\n 0.0005798509810119867,\n 0.006738195661455393,\n 0.014578149653971195,\n 0.023553453385829926,\n 0.013395775109529495,\n 0.015706781297922134,\n 0.019119545817375183,\n -0.006818812340497971,\n 0.01567990891635418,\n -0.0037251540925353765,\n 0.003856155788525939,\n 0.009425411932170391,\n 0.012052166275680065,\n -0.030392419546842575,\n 0.03697609901428223,\n -0.0009875521063804626,\n 0.05073465034365654,\n -0.012119347229599953,\n -0.03520253673195839,\n 0.027758946642279625,\n -0.012240272015333176,\n 0.018246199935674667,\n 0.0012361196568235755,\n -0.002601561602205038,\n 0.02625410631299019,\n -0.03541751578450203,\n 0.0018071531085297465,\n 0.010795892216265202,\n 0.002294211182743311,\n 0.023486273363232613,\n 0.02646908350288868,\n 0.006677733268588781,\n 0.0025394195690751076,\n 0.010218140669167042,\n 0.013200951740145683,\n 0.02325785905122757,\n 0.020328793674707413,\n -0.004642166662961245,\n -0.015558984130620956,\n 0.008323653601109982,\n -0.003956926520913839,\n -0.015196209773421288,\n -0.006422447506338358,\n -0.010513735003769398,\n 0.009794904850423336,\n -0.011434106156229973,\n -0.018555231392383575,\n -0.014833435416221619,\n 0.008934995159506798,\n -0.00547184469178319,\n -0.0012193245347589254,\n -0.01504841260612011,\n 0.022693544626235962,\n 0.02642877586185932,\n 0.001154663390479982,\n -0.01336890272796154,\n -0.004890734329819679,\n 0.025071730837225914,\n -0.005928671453148127,\n 0.004323059692978859,\n 
-0.012018576264381409,\n 0.030822373926639557,\n 0.007517488207668066,\n 0.011675955727696419,\n 0.004951196722686291,\n 0.015827706083655357,\n 0.014094451442360878,\n -0.039018385112285614,\n 0.003486663568764925,\n 0.0029206685721874237,\n 0.02284134179353714,\n 0.00033863127464428544,\n -0.014403481036424637,\n -0.001502322033047676,\n 0.0244671069085598,\n 0.0013881153427064419,\n 0.0076316953636705875,\n -0.009700851514935493,\n -0.02435961924493313,\n -0.045494578778743744,\n 0.010124088265001774,\n 0.0009220512001775205,\n -0.01584114134311676,\n -0.005663308780640364,\n -0.012549301609396935,\n -0.0027980643790215254,\n -0.017292238771915436,\n -0.013241259381175041,\n 0.005199763923883438,\n -0.0024604827631264925,\n 0.011541595682501793,\n -0.005219918210059404,\n -0.01762814074754715,\n -0.0006478711147792637,\n 0.09980322420597076,\n 0.02805454097688198,\n 0.004746296443045139,\n 0.021699273958802223,\n 0.006892710458487272,\n 0.011971550062298775,\n -0.007779492065310478,\n -0.010426400229334831,\n 0.009768032468855381,\n -0.023029446601867676,\n 0.01742659881711006,\n 0.0013268132461234927,\n 0.002284134039655328,\n 0.007766055874526501,\n 0.018461177125573158,\n -0.016244225203990936,\n -0.021336499601602554,\n -0.0093447957187891,\n 0.004245802294462919,\n -0.004783245734870434,\n -0.009808340109884739,\n 0.014631894417107105,\n -0.02362063340842724,\n 0.013651059940457344,\n 0.021954558789730072,\n -0.02114839479327202,\n 0.0031591590959578753,\n 0.003164197551086545,\n -0.010769020766019821,\n -0.006855761166661978,\n -0.016969772055745125,\n -0.00590515835210681,\n -0.015411186963319778,\n 0.0001366701617371291,\n -0.015196209773421288,\n 0.0011731380363926291,\n 0.009855367243289948,\n -0.018071532249450684,\n 0.03936772421002388,\n -0.027342429384589195,\n 0.029451893642544746,\n 0.0027476788964122534,\n -0.009828494861721992,\n -0.02257261984050274,\n 0.01479312777519226,\n -0.026119744405150414,\n -0.01007706206291914,\n 
0.009559772908687592,\n -0.014752819202840328,\n -0.03135981783270836,\n 0.014322864823043346,\n -0.0008481527329422534,\n -0.01502154115587473,\n 0.004148390609771013,\n 0.010856354609131813,\n 0.013919781893491745,\n -0.03135981783270836,\n -0.010688403621315956,\n 0.008827506564557552,\n -0.017251931130886078,\n -0.009700851514935493,\n -0.012925512157380581,\n 0.010466708801686764,\n -0.019831659272313118,\n 0.009721006266772747,\n -0.028403880074620247,\n -0.0027325632981956005,\n 0.00016994545876514167,\n 0.0021850429475307465,\n 0.004924324341118336,\n 0.02958625555038452,\n 0.008216165006160736,\n -0.01915985345840454,\n -0.0005004940903745592,\n -0.004598499275743961,\n 0.02642877586185932,\n 0.011689391918480396,\n -0.0024201744236052036,\n 0.008384115993976593,\n 0.02309662662446499,\n 0.023378783836960793,\n -0.040657587349414825,\n -0.015908321365714073,\n -0.0019129622960463166,\n -0.019966019317507744,\n 0.009156690910458565,\n -0.01279115118086338,\n -0.0228950846940279,\n 0.002087631495669484,\n -0.0008225401979871094,\n 0.013281567953526974,\n -0.0075779506005346775,\n 0.00553902518004179,\n -0.019038928672671318,\n -0.02327129617333412,\n 0.002831654390320182,\n 0.023177243769168854,\n -0.02531358040869236,\n 0.0001918840571306646,\n -0.004662320949137211,\n 0.0281620305031538,\n -0.00968741625547409,\n 0.0023210833314806223,\n -0.011561749503016472,\n -0.0007293273811228573,\n 0.018770208582282066,\n 0.005458408500999212,\n 0.0038628738839179277,\n 0.00013467573444359004,\n -0.0010253411019220948,\n 0.014161631464958191,\n 0.003374136285856366,\n 0.012065602466464043,\n -0.013617469929158688,\n -0.0018323458498343825,\n 0.005045249126851559,\n 0.0011084768921136856,\n -0.006785221863538027,\n 0.009069356136023998,\n -0.0005160295404493809,\n -0.012636636383831501,\n -0.00517289200797677,\n 0.022357642650604248,\n -0.006953172851353884,\n -0.029666870832443237,\n 0.0021296192426234484,\n -0.0006881793960928917,\n -0.002552855759859085,\n 
0.0004912567674182355,\n -0.004255879204720259,\n -0.0008040656102821231,\n 0.018810516223311424,\n -0.015196209773421288,\n -0.015962066128849983,\n -0.008565503172576427,\n 0.0014930847100913525,\n -0.023338476195931435,\n 0.012092474848031998,\n -0.025743534788489342,\n -0.010957125574350357,\n -0.033079635351896286,\n -0.00035395679879002273,\n 0.022760724648833275,\n -0.003933413419872522,\n -0.03563249111175537,\n -0.04122190177440643,\n 0.0057943109422922134,\n 0.017077261582016945,\n -0.008478168398141861,\n 0.023862482979893684,\n -0.004820194561034441,\n -0.004339854698628187,\n -0.007947443053126335,\n -0.01799091510474682,\n -0.01946888491511345,\n -0.027812691405415535,\n -0.019656989723443985,\n 0.01883738860487938,\n 0.02663031592965126,\n 0.028484495356678963,\n 0.02800079621374607,\n 0.020328793674707413,\n 0.02125588245689869,\n -5.395427069743164e-05,\n 0.02556886523962021,\n -0.012052166275680065,\n 0.03243470564484596,\n 0.004974709823727608,\n -0.0105204526335001,\n 0.011897651478648186,\n 0.013335312716662884,\n -0.013825729489326477,\n 0.014363172464072704,\n -0.01811183989048004,\n -0.0024403284769505262,\n 0.03727169334888458,\n -0.012092474848031998,\n -0.016284532845020294,\n -0.008148984052240849,\n -0.031091095879673958,\n -0.01621735282242298,\n 0.006654220167547464,\n 0.020879672840237617,\n 0.005364356096833944,\n -0.03436949849128723,\n -0.025931639596819878,\n 0.013590597547590733,\n 0.008639401756227016,\n 0.017278803512454033,\n -0.012992692179977894,\n 0.021766453981399536,\n -0.003293519839644432,\n 0.013919781893491745,\n 0.009438848122954369,\n 0.015505239367485046,\n -0.02374155819416046,\n -0.010863073170185089,\n -0.0030936580151319504,\n 0.0107891745865345,\n 0.017668448388576508,\n 0.005965620744973421,\n 0.005928671453148127,\n 0.002952579176053405,\n -0.016714487224817276,\n -0.017036953940987587,\n 0.024964241310954094,\n -0.01173641812056303,\n -0.003752026241272688,\n 0.01094368938356638,\n -0.022747289389371872,\n 
0.00047992009785957634,\n -0.01778937317430973,\n -0.05425490438938141,\n -0.01731911115348339,\n 0.020812492817640305,\n -0.0032431345898658037,\n -0.0292100440710783,\n -0.004279392305761576,\n 0.012482120655477047,\n -0.03541751578450203,\n 0.002704011742025614,\n -0.007759337779134512,\n 0.02304288186132908,\n 0.012199963442981243,\n 0.028538240119814873,\n 0.014860307797789574,\n -0.012307452037930489,\n -0.01936139538884163,\n -0.0033607003279030323,\n -0.004014029633253813,\n -0.007638412993401289,\n -0.010271885432302952,\n 0.008021341636776924,\n 0.0010925214737653732,\n -0.0373791828751564,\n 0.0024923933669924736,\n 0.008021341636776924,\n -0.00739656388759613,\n -0.02410433255136013,\n 0.025246400386095047,\n 0.005374433007091284,\n 0.010762302204966545,\n -0.006627347785979509,\n -0.015142465941607952,\n -0.050439056009054184,\n 0.04108754172921181,\n 0.03869592025876045,\n 0.004007311537861824,\n 0.003866232931613922,\n 0.004413753282278776,\n 0.015129029750823975,\n 0.023298168554902077,\n -0.024064024910330772,\n 0.0011177140986546874,\n -0.009270897135138512,\n 0.0016266057500615716,\n 0.017386291176080704,\n -0.013745113275945187,\n 0.01694290153682232,\n 0.003973721526563168,\n -0.012011858634650707,\n -7.295373507076874e-05,\n -0.016324840486049652,\n 0.011555030941963196,\n 0.014725946821272373,\n 0.003930054139345884,\n -0.012253707274794579,\n -0.01537087932229042,\n 0.0050519672222435474,\n 0.016136735677719116,\n -0.04573642462491989,\n -0.009647107683122158,\n -0.014201940037310123,\n -0.006543372292071581,\n 0.017655013129115105,\n 0.0035168947651982307,\n -0.00868642795830965,\n 0.011165385134518147,\n -0.023768430575728416,\n -0.011763290502130985,\n 0.03350958973169327,\n 0.003799052443355322,\n 0.0060966224409639835,\n 0.0007314267568290234,\n -0.004679115954786539,\n -0.003631101455539465,\n 0.007705593481659889,\n 0.010433118790388107,\n 0.029021939262747765,\n -0.008390833623707294,\n -0.023929663002490997,\n 
-0.010963844135403633,\n 0.00109504081774503,\n -0.0034161240328103304,\n 0.009304487146437168,\n -0.014282556250691414,\n -0.00626121461391449,\n 0.03221972659230232,\n -0.04299546405673027,\n 0.007121123839169741,\n 0.014551278203725815,\n -0.012206681072711945,\n -0.008169138804078102,\n 0.001264671329408884,\n -0.004766450263559818,\n 0.00836396124213934,\n 0.04237740486860275,\n 0.003034875262528658,\n -0.01231416966766119,\n -0.01523651834577322,\n 0.017775937914848328,\n 0.03990516811609268,\n -0.002383225131779909,\n 0.004830271936953068,\n 0.013563726097345352,\n 0.000969917222391814,\n 0.01346967276185751,\n 0.002389943227171898,\n -0.014806563034653664,\n 0.007436871994286776,\n -0.039448339492082596,\n 0.009015611372888088,\n 0.0007436032174155116,\n -0.004622012376785278,\n 0.004222289193421602,\n 0.016244225203990936,\n 0.01831338182091713,\n 0.005146019626408815,\n -0.013691368512809277,\n -0.03904525563120842,\n -0.024695521220564842,\n -0.019562937319278717,\n -0.013852601870894432,\n -0.009385104291141033,\n 0.003081901464611292,\n -0.013019564561545849,\n -0.025851024314761162,\n 0.011440824717283249,\n -0.02679155021905899,\n -0.025219528004527092,\n 0.01173641812056303,\n 0.01402727048844099,\n 0.02888757921755314,\n 0.020503463223576546,\n -0.007759337779134512,\n -0.013852601870894432,\n -0.005596128758043051,\n 0.0010958805214613676,\n -0.05113773047924042,\n -0.022236717864871025,\n -0.0123679144307971,\n 0.021954558789730072,\n 0.015196209773421288,\n -0.03004308231174946,\n -0.03135981783270836,\n -0.016284532845020294,\n -0.05863506719470024,\n -0.018138712272047997,\n 0.006852402351796627,\n 0.014282556250691414,\n 0.016459202393889427,\n -0.013006128370761871,\n 0.009613517671823502,\n 0.020705003291368484,\n 0.0090760737657547,\n 0.0022656593937426805,\n -0.006879274267703295,\n -0.02109465003013611,\n -0.003799052443355322,\n -0.006419088691473007,\n 0.000651650014333427,\n -0.01878364384174347,\n 0.002342917025089264,\n 
-0.015572420321404934,\n 0.010453272610902786,\n -0.015962066128849983,\n -0.00675163185223937,\n 0.021229011937975883,\n 0.0007910493877716362,\n -0.004830271936953068,\n -0.015518675558269024,\n 0.007087533827871084,\n 0.013295004144310951,\n 0.025877896696329117,\n 0.007873544469475746,\n -0.027973923832178116,\n -0.0028232568874955177,\n 0.008303498849272728,\n -0.0018491409718990326,\n -0.014712510630488396,\n -0.010056908242404461,\n 0.0013133770553395152,\n 0.0015375918010249734,\n 0.025394197553396225,\n -0.0009573209099471569,\n -0.0033640593755990267,\n -0.011749854311347008,\n -0.01386603806167841,\n 0.02336534857749939,\n 0.010809328407049179,\n 0.010695122182369232,\n 0.0006445121252909303,\n -0.010668249800801277,\n 0.0041886987164616585,\n 0.013630906119942665,\n 0.015397750772535801,\n -0.015505239367485046,\n -0.0025142270606011152,\n 0.02105434238910675,\n -0.005992493126541376,\n 0.019025493413209915,\n -0.04425845667719841,\n 0.007114405743777752,\n -0.023754995316267014,\n -0.010473426431417465,\n 0.012764278799295425,\n -0.012018576264381409,\n -0.04624699801206589,\n 0.022586055099964142,\n 0.00040014335536397994,\n -0.009257460944354534,\n -0.006214188411831856,\n 0.011252719908952713,\n -0.011467697098851204,\n -0.00610669981688261,\n -0.02998933754861355,\n 0.017184749245643616,\n -0.026482518762350082,\n -0.02105434238910675,\n -0.006083186715841293,\n -0.00826319120824337,\n -0.001481328159570694,\n -0.010983997955918312,\n 0.03326774016022682,\n 0.0012226835824549198,\n -0.004769809544086456,\n 0.19713421165943146,\n -0.014269120059907436,\n -0.0032179418485611677,\n 0.012105911038815975,\n -0.028511367738246918,\n -0.017131006345152855,\n 0.044177841395139694,\n 0.0017853195313364267,\n -0.0024302515666931868,\n 0.03783601149916649,\n -0.00892155896872282,\n 0.012549301609396935,\n 0.027275249361991882,\n -0.01050029881298542,\n 0.012723970226943493,\n -0.02199486829340458,\n -0.03536377102136612,\n -0.02347283624112606,\n 
0.009519464336335659,\n 0.005925312638282776,\n 0.018703026697039604,\n 0.0024755983613431454,\n -0.01386603806167841,\n -0.011481133289635181,\n 0.0009497631108388305,\n 0.000136355243739672,\n -0.007302511017769575,\n -0.00022925317171029747,\n 0.010896663181483746,\n -0.0023731482215225697,\n -0.008934995159506798,\n -0.02426556497812271,\n -0.005135942716151476,\n 0.014040706679224968,\n -4.135794370085932e-05,\n 0.003822565544396639,\n 0.022236717864871025,\n 0.01215293724089861,\n 0.02794705331325531,\n 0.013476391322910786,\n 0.02304288186132908,\n 0.016297968104481697,\n 0.007846672087907791,\n -0.035229410976171494,\n -0.018716463819146156,\n 0.03547126054763794,\n ...]" + "text/plain": [ + "[-0.015384314581751823,\n", + " 0.0031692360062152147,\n", + " -0.007302511017769575,\n", + " -0.02778581902384758,\n", + " -0.013409210368990898,\n", + " 0.0029592972714453936,\n", + " -0.019119545817375183,\n", + " -0.0004874778969679028,\n", + " -0.010721994563937187,\n", + " -0.023486273363232613,\n", + " 0.016351712867617607,\n", + " 0.005532307084649801,\n", + " -0.009136536158621311,\n", + " -0.014282556250691414,\n", + " 0.005122506525367498,\n", + " 0.02888757921755314,\n", + " 0.020973725244402885,\n", + " 0.009136536158621311,\n", + " 0.003303596982732415,\n", + " -0.013382338918745518,\n", + " -0.024749264121055603,\n", + " 0.03904525563120842,\n", + " -0.01699664443731308,\n", + " -0.010312194004654884,\n", + " -0.009029047563672066,\n", + " -0.001587137347087264,\n", + " 0.017036953940987587,\n", + " -0.056915245950222015,\n", + " -0.011084768921136856,\n", + " -0.006375421304255724,\n", + " 0.011145230382680893,\n", + " -0.01094368938356638,\n", + " -0.010184550657868385,\n", + " -0.009546336717903614,\n", + " -0.012105911038815975,\n", + " -0.004675756674259901,\n", + " 0.002245505340397358,\n", + " -0.0015040015568956733,\n", + " -0.007457026280462742,\n", + " 0.0029206685721874237,\n", + " 0.03993203863501549,\n", + " -0.02390279248356819,\n", + " 
0.003399329027161002,\n", + " -0.02109465003013611,\n", + " -0.026590008288621902,\n", + " 0.004457420669496059,\n", + " -0.03638491407036781,\n", + " -0.018958313390612602,\n", + " 0.002221992239356041,\n", + " -0.007846672087907791,\n", + " -0.0106548136100173,\n", + " 0.0019096032483503222,\n", + " -0.015451495535671711,\n", + " -0.00783995445817709,\n", + " 0.016821976751089096,\n", + " 0.007409999612718821,\n", + " -0.017601268365979195,\n", + " 0.01502154115587473,\n", + " -0.026119744405150414,\n", + " -0.011333336122334003,\n", + " -0.017184749245643616,\n", + " -0.0352562814950943,\n", + " -0.002327801426872611,\n", + " 0.015666473656892776,\n", + " -0.023069754242897034,\n", + " -0.016821976751089096,\n", + " -0.0005298855248838663,\n", + " 0.0010933612938970327,\n", + " 0.0048571438528597355,\n", + " -0.034503862261772156,\n", + " 0.007712311577051878,\n", + " 0.038024116307497025,\n", + " -0.017856555059552193,\n", + " -0.02415807731449604,\n", + " 0.020664695650339127,\n", + " -0.01742659881711006,\n", + " 0.012072320096194744,\n", + " 0.015249954536557198,\n", + " -0.008357243612408638,\n", + " 0.001610650448128581,\n", + " 0.018017787486314774,\n", + " -0.02247856743633747,\n", + " -3.219936115783639e-05,\n", + " 0.02421182207763195,\n", + " 0.010594351217150688,\n", + " 0.01800435036420822,\n", + " -0.019777914509177208,\n", + " 0.024695521220564842,\n", + " 0.0013805575435981154,\n", + " -0.0138122932985425,\n", + " 0.02132306434214115,\n", + " 0.023325040936470032,\n", + " 0.027597714215517044,\n", + " 0.06062360480427742,\n", + " -0.019562937319278717,\n", + " 0.009559772908687592,\n", + " -0.02183363400399685,\n", + " 0.0173728559166193,\n", + " -0.028242645785212517,\n", + " -0.03058052435517311,\n", + " 0.01847461424767971,\n", + " -0.026536263525485992,\n", + " -0.007947443053126335,\n", + " -0.007517488207668066,\n", + " -0.026616880670189857,\n", + " 0.009183562360703945,\n", + " 0.01872989907860756,\n", + " -0.022075483575463295,\n", + " 
0.019589809700846672,\n", + " -0.023916227743029594,\n", + " 0.019347960129380226,\n", + " 0.02378186769783497,\n", + " 0.019764477387070656,\n", + " -0.0202616136521101,\n", + " -0.019401703029870987,\n", + " 0.006335113197565079,\n", + " 0.015209645964205265,\n", + " -0.029935592785477638,\n", + " -0.007013635244220495,\n", + " -0.0363042950630188,\n", + " 0.00704050762578845,\n", + " 0.01616360805928707,\n", + " 0.014981232583522797,\n", + " -0.0013931537978351116,\n", + " 0.030661141499876976,\n", + " 0.01389290951192379,\n", + " 0.007712311577051878,\n", + " -0.01910611055791378,\n", + " -0.0020792337600141764,\n", + " -0.008404269814491272,\n", + " 0.024722393602132797,\n", + " 0.01699664443731308,\n", + " 0.008350525982677937,\n", + " 0.009727723896503448,\n", + " -0.010695122182369232,\n", + " 0.006560167297720909,\n", + " -0.031386688351631165,\n", + " 0.0263078510761261,\n", + " -0.0001876852911664173,\n", + " -0.01816558465361595,\n", + " 0.019482320174574852,\n", + " 0.023190679028630257,\n", + " -0.015115593560039997,\n", + " -0.015384314581751823,\n", + " -0.005233354400843382,\n", + " 0.004225648008286953,\n", + " -0.0011555030941963196,\n", + " -0.012092474848031998,\n", + " 0.011602058075368404,\n", + " -0.02179332636296749,\n", + " -0.003029836807399988,\n", + " 0.0030382343102246523,\n", + " -0.011151948943734169,\n", + " 0.007430153898894787,\n", + " 0.001625766046345234,\n", + " 0.010795892216265202,\n", + " 0.0033136738929897547,\n", + " 0.013167361728847027,\n", + " -0.027033399790525436,\n", + " 0.002052361611276865,\n", + " 0.015061848796904087,\n", + " 0.017762500792741776,\n", + " 0.014349736273288727,\n", + " -0.007047225721180439,\n", + " 0.014887180179357529,\n", + " 0.023190679028630257,\n", + " 0.0055289482697844505,\n", + " 0.018084967508912086,\n", + " -0.0014888859586790204,\n", + " -0.003711717901751399,\n", + " 0.008290063589811325,\n", + " 0.03740605339407921,\n", + " 0.007960879243910313,\n", + " 0.01809840463101864,\n", + " 
0.010916817001998425,\n", + " 0.03504130616784096,\n", + " 0.0031138123013079166,\n", + " -0.005303893703967333,\n", + " -0.022868212312459946,\n", + " -0.01373839471489191,\n", + " -0.013933218084275723,\n", + " 0.008525194600224495,\n", + " 0.05304565653204918,\n", + " 0.014537842012941837,\n", + " 0.006230983417481184,\n", + " -0.004920965526252985,\n", + " 0.002856847131624818,\n", + " -0.015868013724684715,\n", + " 0.006835607346147299,\n", + " -0.027449917048215866,\n", + " -0.0049041700549423695,\n", + " 0.009808340109884739,\n", + " 0.028914449736475945,\n", + " -0.017386291176080704,\n", + " -0.6199946403503418,\n", + " -0.02336534857749939,\n", + " -0.018353689461946487,\n", + " -0.0028131799772381783,\n", + " 0.019804786890745163,\n", + " 0.04409722611308098,\n", + " 0.005280380602926016,\n", + " 0.011702828109264374,\n", + " -0.024829881265759468,\n", + " 0.01465876679867506,\n", + " -0.013395775109529495,\n", + " 0.025877896696329117,\n", + " -0.01636514998972416,\n", + " 0.01199842244386673,\n", + " -0.01084291934967041,\n", + " -0.008827506564557552,\n", + " 0.00870658177882433,\n", + " -0.020086944103240967,\n", + " -0.0006025243201293051,\n", + " 0.027812691405415535,\n", + " -0.03404703363776207,\n", + " 0.0019079238409176469,\n", + " 0.0024403284769505262,\n", + " -0.006099981721490622,\n", + " 0.009082792326807976,\n", + " -0.0050519672222435474,\n", + " 0.014309428632259369,\n", + " -0.022921957075595856,\n", + " -0.02199486829340458,\n", + " 0.003607588354498148,\n", + " -0.008518476970493793,\n", + " 0.00871329940855503,\n", + " 0.02747678942978382,\n", + " -0.020906545221805573,\n", + " 0.04253863915801048,\n", + " 0.000455147324828431,\n", + " 0.014484097249805927,\n", + " 0.033079635351896286,\n", + " 0.026590008288621902,\n", + " 0.05258882790803909,\n", + " -0.025971949100494385,\n", + " 0.010493581183254719,\n", + " 0.026455648243427277,\n", + " -0.008511758409440517,\n", + " -0.019025493413209915,\n", + " 0.020019764080643654,\n", + " 
0.01937483251094818,\n", + " -0.013415928930044174,\n", + " -0.0027863075956702232,\n", + " -0.007860108278691769,\n", + " 0.011662520468235016,\n", + " -0.007255484815686941,\n", + " 0.0033707772381603718,\n", + " -0.01479312777519226,\n", + " 0.009358231909573078,\n", + " 0.007100970018655062,\n", + " 0.02388935536146164,\n", + " -0.017171313986182213,\n", + " -0.008726735599339008,\n", + " 0.010977279394865036,\n", + " -0.003943490330129862,\n", + " 0.004695910960435867,\n", + " -0.003323751036077738,\n", + " 0.011642365716397762,\n", + " -0.014510969631373882,\n", + " 0.0063888574950397015,\n", + " -0.006832248065620661,\n", + " 0.01937483251094818,\n", + " 0.0011857342906296253,\n", + " 0.006308240815997124,\n", + " -0.0029643357265740633,\n", + " 0.012569455429911613,\n", + " -0.013677932322025299,\n", + " 0.01015767827630043,\n", + " -0.002561253262683749,\n", + " -0.0055994875729084015,\n", + " 0.024144640192389488,\n", + " 0.0076988753862679005,\n", + " -0.01128630992025137,\n", + " -0.022277025505900383,\n", + " 0.013422646559774876,\n", + " 0.00892155896872282,\n", + " 0.0036613326519727707,\n", + " -0.009438848122954369,\n", + " 0.04151749610900879,\n", + " -0.005727130454033613,\n", + " -0.00863268319517374,\n", + " -0.012804587371647358,\n", + " 0.011138512752950191,\n", + " -0.003283442696556449,\n", + " -0.00783995445817709,\n", + " 0.028538240119814873,\n", + " 0.00030609077657572925,\n", + " 0.006113417912274599,\n", + " 0.0205303356051445,\n", + " 0.0037721802946180105,\n", + " -0.02425212971866131,\n", + " 0.013771984726190567,\n", + " 0.0034833045210689306,\n", + " -0.01748034358024597,\n", + " -0.0062444196082651615,\n", + " -0.005653231870383024,\n", + " 0.011037741787731647,\n", + " 0.02684529311954975,\n", + " -0.023822175338864326,\n", + " 0.041598111391067505,\n", + " -0.02915629930794239,\n", + " -0.009895674884319305,\n", + " 0.03240783140063286,\n", + " -0.022639799863100052,\n", + " 0.01879708096385002,\n", + " 
-0.03727169334888458,\n", + " -0.02415807731449604,\n", + " -0.02132306434214115,\n", + " 0.014940924011170864,\n", + " -0.03536377102136612,\n", + " 0.012925512157380581,\n", + " 0.012421658262610435,\n", + " 0.017117569223046303,\n", + " -0.01281130500137806,\n", + " -0.014269120059907436,\n", + " -0.010144243016839027,\n", + " 0.0049075293354690075,\n", + " -0.01338905654847622,\n", + " 0.00038628740003332496,\n", + " -7.085434481268749e-05,\n", + " -0.00503853103145957,\n", + " -0.024507414549589157,\n", + " -0.022653235122561455,\n", + " 0.02374155819416046,\n", + " 0.03141356259584427,\n", + " -0.003390931524336338,\n", + " 0.015653036534786224,\n", + " -0.024386489763855934,\n", + " 0.05546415224671364,\n", + " 0.015438059344887733,\n", + " 0.02504485845565796,\n", + " -0.001958309207111597,\n", + " 0.013684650883078575,\n", + " -0.01979134976863861,\n", + " -0.022706979885697365,\n", + " 0.013825729489326477,\n", + " 0.008753607980906963,\n", + " -0.014537842012941837,\n", + " 0.01672792248427868,\n", + " -0.01663387008011341,\n", + " -0.014121322892606258,\n", + " -0.015451495535671711,\n", + " -0.005186328198760748,\n", + " 0.03512192144989967,\n", + " -0.008746890351176262,\n", + " 0.029693743214011192,\n", + " -0.016486072912812233,\n", + " 0.026482518762350082,\n", + " -0.023969972506165504,\n", + " -0.037916626781225204,\n", + " -0.017722193151712418,\n", + " -0.0202616136521101,\n", + " 0.023298168554902077,\n", + " 0.0012461966834962368,\n", + " 0.024695521220564842,\n", + " 0.021658966317772865,\n", + " -0.010936971753835678,\n", + " 0.002561253262683749,\n", + " -0.005156097002327442,\n", + " -0.01057419739663601,\n", + " 0.009754596278071404,\n", + " 0.03125232830643654,\n", + " -0.021282754838466644,\n", + " -0.031924132257699966,\n", + " 0.016647307202219963,\n", + " 0.013355466537177563,\n", + " -0.00647283298894763,\n", + " 0.019334523007273674,\n", + " 0.012032012455165386,\n", + " 0.02778581902384758,\n", + " -0.008001187816262245,\n", + " 
0.011071332730352879,\n", + " 0.0004958754288963974,\n", + " 0.006368703208863735,\n", + " -0.002013732912018895,\n", + " -0.011951396241784096,\n", + " -0.02372812293469906,\n", + " 0.0049075293354690075,\n", + " 0.032810915261507034,\n", + " -0.010963844135403633,\n", + " -0.013986961916089058,\n", + " 0.041974324733018875,\n", + " -0.018541794270277023,\n", + " 0.022586055099964142,\n", + " -0.003758744103834033,\n", + " 0.020355666056275368,\n", + " 0.022129228338599205,\n", + " 0.004353290889412165,\n", + " -0.03531002625823021,\n", + " 0.0034564323723316193,\n", + " 0.00438352208584547,\n", + " -0.0040207477286458015,\n", + " -0.002181683899834752,\n", + " 0.005488639697432518,\n", + " 0.013227824121713638,\n", + " -0.002729204250499606,\n", + " 0.02899506688117981,\n", + " -0.027328992262482643,\n", + " 0.019536064937710762,\n", + " -0.01735941879451275,\n", + " 0.007235330529510975,\n", + " -0.02126931957900524,\n", + " 0.026388466358184814,\n", + " 0.016284532845020294,\n", + " 0.0048806569539010525,\n", + " -0.018971748650074005,\n", + " -0.008384115993976593,\n", + " 0.00697332713752985,\n", + " 0.023647505789995193,\n", + " 0.011843906715512276,\n", + " -0.004353290889412165,\n", + " 0.013059872202575207,\n", + " -0.014846871607005596,\n", + " -0.0016291250940412283,\n", + " 0.010896663181483746,\n", + " -0.003557202871888876,\n", + " 0.0031524410005658865,\n", + " -0.0016249263426288962,\n", + " -0.03509504720568657,\n", + " 0.005639795679599047,\n", + " 0.00781980063766241,\n", + " 0.007786210160702467,\n", + " 0.007907134480774403,\n", + " -0.01518277358263731,\n", + " 0.0005798509810119867,\n", + " 0.006738195661455393,\n", + " 0.014578149653971195,\n", + " 0.023553453385829926,\n", + " 0.013395775109529495,\n", + " 0.015706781297922134,\n", + " 0.019119545817375183,\n", + " -0.006818812340497971,\n", + " 0.01567990891635418,\n", + " -0.0037251540925353765,\n", + " 0.003856155788525939,\n", + " 0.009425411932170391,\n", + " 0.012052166275680065,\n", 
+ " -0.030392419546842575,\n", + " 0.03697609901428223,\n", + " -0.0009875521063804626,\n", + " 0.05073465034365654,\n", + " -0.012119347229599953,\n", + " -0.03520253673195839,\n", + " 0.027758946642279625,\n", + " -0.012240272015333176,\n", + " 0.018246199935674667,\n", + " 0.0012361196568235755,\n", + " -0.002601561602205038,\n", + " 0.02625410631299019,\n", + " -0.03541751578450203,\n", + " 0.0018071531085297465,\n", + " 0.010795892216265202,\n", + " 0.002294211182743311,\n", + " 0.023486273363232613,\n", + " 0.02646908350288868,\n", + " 0.006677733268588781,\n", + " 0.0025394195690751076,\n", + " 0.010218140669167042,\n", + " 0.013200951740145683,\n", + " 0.02325785905122757,\n", + " 0.020328793674707413,\n", + " -0.004642166662961245,\n", + " -0.015558984130620956,\n", + " 0.008323653601109982,\n", + " -0.003956926520913839,\n", + " -0.015196209773421288,\n", + " -0.006422447506338358,\n", + " -0.010513735003769398,\n", + " 0.009794904850423336,\n", + " -0.011434106156229973,\n", + " -0.018555231392383575,\n", + " -0.014833435416221619,\n", + " 0.008934995159506798,\n", + " -0.00547184469178319,\n", + " -0.0012193245347589254,\n", + " -0.01504841260612011,\n", + " 0.022693544626235962,\n", + " 0.02642877586185932,\n", + " 0.001154663390479982,\n", + " -0.01336890272796154,\n", + " -0.004890734329819679,\n", + " 0.025071730837225914,\n", + " -0.005928671453148127,\n", + " 0.004323059692978859,\n", + " -0.012018576264381409,\n", + " 0.030822373926639557,\n", + " 0.007517488207668066,\n", + " 0.011675955727696419,\n", + " 0.004951196722686291,\n", + " 0.015827706083655357,\n", + " 0.014094451442360878,\n", + " -0.039018385112285614,\n", + " 0.003486663568764925,\n", + " 0.0029206685721874237,\n", + " 0.02284134179353714,\n", + " 0.00033863127464428544,\n", + " -0.014403481036424637,\n", + " -0.001502322033047676,\n", + " 0.0244671069085598,\n", + " 0.0013881153427064419,\n", + " 0.0076316953636705875,\n", + " -0.009700851514935493,\n", + " 
-0.02435961924493313,\n", + " -0.045494578778743744,\n", + " 0.010124088265001774,\n", + " 0.0009220512001775205,\n", + " -0.01584114134311676,\n", + " -0.005663308780640364,\n", + " -0.012549301609396935,\n", + " -0.0027980643790215254,\n", + " -0.017292238771915436,\n", + " -0.013241259381175041,\n", + " 0.005199763923883438,\n", + " -0.0024604827631264925,\n", + " 0.011541595682501793,\n", + " -0.005219918210059404,\n", + " -0.01762814074754715,\n", + " -0.0006478711147792637,\n", + " 0.09980322420597076,\n", + " 0.02805454097688198,\n", + " 0.004746296443045139,\n", + " 0.021699273958802223,\n", + " 0.006892710458487272,\n", + " 0.011971550062298775,\n", + " -0.007779492065310478,\n", + " -0.010426400229334831,\n", + " 0.009768032468855381,\n", + " -0.023029446601867676,\n", + " 0.01742659881711006,\n", + " 0.0013268132461234927,\n", + " 0.002284134039655328,\n", + " 0.007766055874526501,\n", + " 0.018461177125573158,\n", + " -0.016244225203990936,\n", + " -0.021336499601602554,\n", + " -0.0093447957187891,\n", + " 0.004245802294462919,\n", + " -0.004783245734870434,\n", + " -0.009808340109884739,\n", + " 0.014631894417107105,\n", + " -0.02362063340842724,\n", + " 0.013651059940457344,\n", + " 0.021954558789730072,\n", + " -0.02114839479327202,\n", + " 0.0031591590959578753,\n", + " 0.003164197551086545,\n", + " -0.010769020766019821,\n", + " -0.006855761166661978,\n", + " -0.016969772055745125,\n", + " -0.00590515835210681,\n", + " -0.015411186963319778,\n", + " 0.0001366701617371291,\n", + " -0.015196209773421288,\n", + " 0.0011731380363926291,\n", + " 0.009855367243289948,\n", + " -0.018071532249450684,\n", + " 0.03936772421002388,\n", + " -0.027342429384589195,\n", + " 0.029451893642544746,\n", + " 0.0027476788964122534,\n", + " -0.009828494861721992,\n", + " -0.02257261984050274,\n", + " 0.01479312777519226,\n", + " -0.026119744405150414,\n", + " -0.01007706206291914,\n", + " 0.009559772908687592,\n", + " -0.014752819202840328,\n", + " 
-0.03135981783270836,\n", + " 0.014322864823043346,\n", + " -0.0008481527329422534,\n", + " -0.01502154115587473,\n", + " 0.004148390609771013,\n", + " 0.010856354609131813,\n", + " 0.013919781893491745,\n", + " -0.03135981783270836,\n", + " -0.010688403621315956,\n", + " 0.008827506564557552,\n", + " -0.017251931130886078,\n", + " -0.009700851514935493,\n", + " -0.012925512157380581,\n", + " 0.010466708801686764,\n", + " -0.019831659272313118,\n", + " 0.009721006266772747,\n", + " -0.028403880074620247,\n", + " -0.0027325632981956005,\n", + " 0.00016994545876514167,\n", + " 0.0021850429475307465,\n", + " 0.004924324341118336,\n", + " 0.02958625555038452,\n", + " 0.008216165006160736,\n", + " -0.01915985345840454,\n", + " -0.0005004940903745592,\n", + " -0.004598499275743961,\n", + " 0.02642877586185932,\n", + " 0.011689391918480396,\n", + " -0.0024201744236052036,\n", + " 0.008384115993976593,\n", + " 0.02309662662446499,\n", + " 0.023378783836960793,\n", + " -0.040657587349414825,\n", + " -0.015908321365714073,\n", + " -0.0019129622960463166,\n", + " -0.019966019317507744,\n", + " 0.009156690910458565,\n", + " -0.01279115118086338,\n", + " -0.0228950846940279,\n", + " 0.002087631495669484,\n", + " -0.0008225401979871094,\n", + " 0.013281567953526974,\n", + " -0.0075779506005346775,\n", + " 0.00553902518004179,\n", + " -0.019038928672671318,\n", + " -0.02327129617333412,\n", + " 0.002831654390320182,\n", + " 0.023177243769168854,\n", + " -0.02531358040869236,\n", + " 0.0001918840571306646,\n", + " -0.004662320949137211,\n", + " 0.0281620305031538,\n", + " -0.00968741625547409,\n", + " 0.0023210833314806223,\n", + " -0.011561749503016472,\n", + " -0.0007293273811228573,\n", + " 0.018770208582282066,\n", + " 0.005458408500999212,\n", + " 0.0038628738839179277,\n", + " 0.00013467573444359004,\n", + " -0.0010253411019220948,\n", + " 0.014161631464958191,\n", + " 0.003374136285856366,\n", + " 0.012065602466464043,\n", + " -0.013617469929158688,\n", + " 
-0.0018323458498343825,\n", + " 0.005045249126851559,\n", + " 0.0011084768921136856,\n", + " -0.006785221863538027,\n", + " 0.009069356136023998,\n", + " -0.0005160295404493809,\n", + " -0.012636636383831501,\n", + " -0.00517289200797677,\n", + " 0.022357642650604248,\n", + " -0.006953172851353884,\n", + " -0.029666870832443237,\n", + " 0.0021296192426234484,\n", + " -0.0006881793960928917,\n", + " -0.002552855759859085,\n", + " 0.0004912567674182355,\n", + " -0.004255879204720259,\n", + " -0.0008040656102821231,\n", + " 0.018810516223311424,\n", + " -0.015196209773421288,\n", + " -0.015962066128849983,\n", + " -0.008565503172576427,\n", + " 0.0014930847100913525,\n", + " -0.023338476195931435,\n", + " 0.012092474848031998,\n", + " -0.025743534788489342,\n", + " -0.010957125574350357,\n", + " -0.033079635351896286,\n", + " -0.00035395679879002273,\n", + " 0.022760724648833275,\n", + " -0.003933413419872522,\n", + " -0.03563249111175537,\n", + " -0.04122190177440643,\n", + " 0.0057943109422922134,\n", + " 0.017077261582016945,\n", + " -0.008478168398141861,\n", + " 0.023862482979893684,\n", + " -0.004820194561034441,\n", + " -0.004339854698628187,\n", + " -0.007947443053126335,\n", + " -0.01799091510474682,\n", + " -0.01946888491511345,\n", + " -0.027812691405415535,\n", + " -0.019656989723443985,\n", + " 0.01883738860487938,\n", + " 0.02663031592965126,\n", + " 0.028484495356678963,\n", + " 0.02800079621374607,\n", + " 0.020328793674707413,\n", + " 0.02125588245689869,\n", + " -5.395427069743164e-05,\n", + " 0.02556886523962021,\n", + " -0.012052166275680065,\n", + " 0.03243470564484596,\n", + " 0.004974709823727608,\n", + " -0.0105204526335001,\n", + " 0.011897651478648186,\n", + " 0.013335312716662884,\n", + " -0.013825729489326477,\n", + " 0.014363172464072704,\n", + " -0.01811183989048004,\n", + " -0.0024403284769505262,\n", + " 0.03727169334888458,\n", + " -0.012092474848031998,\n", + " -0.016284532845020294,\n", + " -0.008148984052240849,\n", + " 
-0.031091095879673958,\n", + " -0.01621735282242298,\n", + " 0.006654220167547464,\n", + " 0.020879672840237617,\n", + " 0.005364356096833944,\n", + " -0.03436949849128723,\n", + " -0.025931639596819878,\n", + " 0.013590597547590733,\n", + " 0.008639401756227016,\n", + " 0.017278803512454033,\n", + " -0.012992692179977894,\n", + " 0.021766453981399536,\n", + " -0.003293519839644432,\n", + " 0.013919781893491745,\n", + " 0.009438848122954369,\n", + " 0.015505239367485046,\n", + " -0.02374155819416046,\n", + " -0.010863073170185089,\n", + " -0.0030936580151319504,\n", + " 0.0107891745865345,\n", + " 0.017668448388576508,\n", + " 0.005965620744973421,\n", + " 0.005928671453148127,\n", + " 0.002952579176053405,\n", + " -0.016714487224817276,\n", + " -0.017036953940987587,\n", + " 0.024964241310954094,\n", + " -0.01173641812056303,\n", + " -0.003752026241272688,\n", + " 0.01094368938356638,\n", + " -0.022747289389371872,\n", + " 0.00047992009785957634,\n", + " -0.01778937317430973,\n", + " -0.05425490438938141,\n", + " -0.01731911115348339,\n", + " 0.020812492817640305,\n", + " -0.0032431345898658037,\n", + " -0.0292100440710783,\n", + " -0.004279392305761576,\n", + " 0.012482120655477047,\n", + " -0.03541751578450203,\n", + " 0.002704011742025614,\n", + " -0.007759337779134512,\n", + " 0.02304288186132908,\n", + " 0.012199963442981243,\n", + " 0.028538240119814873,\n", + " 0.014860307797789574,\n", + " -0.012307452037930489,\n", + " -0.01936139538884163,\n", + " -0.0033607003279030323,\n", + " -0.004014029633253813,\n", + " -0.007638412993401289,\n", + " -0.010271885432302952,\n", + " 0.008021341636776924,\n", + " 0.0010925214737653732,\n", + " -0.0373791828751564,\n", + " 0.0024923933669924736,\n", + " 0.008021341636776924,\n", + " -0.00739656388759613,\n", + " -0.02410433255136013,\n", + " 0.025246400386095047,\n", + " 0.005374433007091284,\n", + " 0.010762302204966545,\n", + " -0.006627347785979509,\n", + " -0.015142465941607952,\n", + " -0.050439056009054184,\n", + 
" 0.04108754172921181,\n", + " 0.03869592025876045,\n", + " 0.004007311537861824,\n", + " 0.003866232931613922,\n", + " 0.004413753282278776,\n", + " 0.015129029750823975,\n", + " 0.023298168554902077,\n", + " -0.024064024910330772,\n", + " 0.0011177140986546874,\n", + " -0.009270897135138512,\n", + " 0.0016266057500615716,\n", + " 0.017386291176080704,\n", + " -0.013745113275945187,\n", + " 0.01694290153682232,\n", + " 0.003973721526563168,\n", + " -0.012011858634650707,\n", + " -7.295373507076874e-05,\n", + " -0.016324840486049652,\n", + " 0.011555030941963196,\n", + " 0.014725946821272373,\n", + " 0.003930054139345884,\n", + " -0.012253707274794579,\n", + " -0.01537087932229042,\n", + " 0.0050519672222435474,\n", + " 0.016136735677719116,\n", + " -0.04573642462491989,\n", + " -0.009647107683122158,\n", + " -0.014201940037310123,\n", + " -0.006543372292071581,\n", + " 0.017655013129115105,\n", + " 0.0035168947651982307,\n", + " -0.00868642795830965,\n", + " 0.011165385134518147,\n", + " -0.023768430575728416,\n", + " -0.011763290502130985,\n", + " 0.03350958973169327,\n", + " 0.003799052443355322,\n", + " 0.0060966224409639835,\n", + " 0.0007314267568290234,\n", + " -0.004679115954786539,\n", + " -0.003631101455539465,\n", + " 0.007705593481659889,\n", + " 0.010433118790388107,\n", + " 0.029021939262747765,\n", + " -0.008390833623707294,\n", + " -0.023929663002490997,\n", + " -0.010963844135403633,\n", + " 0.00109504081774503,\n", + " -0.0034161240328103304,\n", + " 0.009304487146437168,\n", + " -0.014282556250691414,\n", + " -0.00626121461391449,\n", + " 0.03221972659230232,\n", + " -0.04299546405673027,\n", + " 0.007121123839169741,\n", + " 0.014551278203725815,\n", + " -0.012206681072711945,\n", + " -0.008169138804078102,\n", + " 0.001264671329408884,\n", + " -0.004766450263559818,\n", + " 0.00836396124213934,\n", + " 0.04237740486860275,\n", + " 0.003034875262528658,\n", + " -0.01231416966766119,\n", + " -0.01523651834577322,\n", + " 0.017775937914848328,\n", 
+ " 0.03990516811609268,\n", + " -0.002383225131779909,\n", + " 0.004830271936953068,\n", + " 0.013563726097345352,\n", + " 0.000969917222391814,\n", + " 0.01346967276185751,\n", + " 0.002389943227171898,\n", + " -0.014806563034653664,\n", + " 0.007436871994286776,\n", + " -0.039448339492082596,\n", + " 0.009015611372888088,\n", + " 0.0007436032174155116,\n", + " -0.004622012376785278,\n", + " 0.004222289193421602,\n", + " 0.016244225203990936,\n", + " 0.01831338182091713,\n", + " 0.005146019626408815,\n", + " -0.013691368512809277,\n", + " -0.03904525563120842,\n", + " -0.024695521220564842,\n", + " -0.019562937319278717,\n", + " -0.013852601870894432,\n", + " -0.009385104291141033,\n", + " 0.003081901464611292,\n", + " -0.013019564561545849,\n", + " -0.025851024314761162,\n", + " 0.011440824717283249,\n", + " -0.02679155021905899,\n", + " -0.025219528004527092,\n", + " 0.01173641812056303,\n", + " 0.01402727048844099,\n", + " 0.02888757921755314,\n", + " 0.020503463223576546,\n", + " -0.007759337779134512,\n", + " -0.013852601870894432,\n", + " -0.005596128758043051,\n", + " 0.0010958805214613676,\n", + " -0.05113773047924042,\n", + " -0.022236717864871025,\n", + " -0.0123679144307971,\n", + " 0.021954558789730072,\n", + " 0.015196209773421288,\n", + " -0.03004308231174946,\n", + " -0.03135981783270836,\n", + " -0.016284532845020294,\n", + " -0.05863506719470024,\n", + " -0.018138712272047997,\n", + " 0.006852402351796627,\n", + " 0.014282556250691414,\n", + " 0.016459202393889427,\n", + " -0.013006128370761871,\n", + " 0.009613517671823502,\n", + " 0.020705003291368484,\n", + " 0.0090760737657547,\n", + " 0.0022656593937426805,\n", + " -0.006879274267703295,\n", + " -0.02109465003013611,\n", + " -0.003799052443355322,\n", + " -0.006419088691473007,\n", + " 0.000651650014333427,\n", + " -0.01878364384174347,\n", + " 0.002342917025089264,\n", + " -0.015572420321404934,\n", + " 0.010453272610902786,\n", + " -0.015962066128849983,\n", + " -0.00675163185223937,\n", + 
" 0.021229011937975883,\n", + " 0.0007910493877716362,\n", + " -0.004830271936953068,\n", + " -0.015518675558269024,\n", + " 0.007087533827871084,\n", + " 0.013295004144310951,\n", + " 0.025877896696329117,\n", + " 0.007873544469475746,\n", + " -0.027973923832178116,\n", + " -0.0028232568874955177,\n", + " 0.008303498849272728,\n", + " -0.0018491409718990326,\n", + " -0.014712510630488396,\n", + " -0.010056908242404461,\n", + " 0.0013133770553395152,\n", + " 0.0015375918010249734,\n", + " 0.025394197553396225,\n", + " -0.0009573209099471569,\n", + " -0.0033640593755990267,\n", + " -0.011749854311347008,\n", + " -0.01386603806167841,\n", + " 0.02336534857749939,\n", + " 0.010809328407049179,\n", + " 0.010695122182369232,\n", + " 0.0006445121252909303,\n", + " -0.010668249800801277,\n", + " 0.0041886987164616585,\n", + " 0.013630906119942665,\n", + " 0.015397750772535801,\n", + " -0.015505239367485046,\n", + " -0.0025142270606011152,\n", + " 0.02105434238910675,\n", + " -0.005992493126541376,\n", + " 0.019025493413209915,\n", + " -0.04425845667719841,\n", + " 0.007114405743777752,\n", + " -0.023754995316267014,\n", + " -0.010473426431417465,\n", + " 0.012764278799295425,\n", + " -0.012018576264381409,\n", + " -0.04624699801206589,\n", + " 0.022586055099964142,\n", + " 0.00040014335536397994,\n", + " -0.009257460944354534,\n", + " -0.006214188411831856,\n", + " 0.011252719908952713,\n", + " -0.011467697098851204,\n", + " -0.00610669981688261,\n", + " -0.02998933754861355,\n", + " 0.017184749245643616,\n", + " -0.026482518762350082,\n", + " -0.02105434238910675,\n", + " -0.006083186715841293,\n", + " -0.00826319120824337,\n", + " -0.001481328159570694,\n", + " -0.010983997955918312,\n", + " 0.03326774016022682,\n", + " 0.0012226835824549198,\n", + " -0.004769809544086456,\n", + " 0.19713421165943146,\n", + " -0.014269120059907436,\n", + " -0.0032179418485611677,\n", + " 0.012105911038815975,\n", + " -0.028511367738246918,\n", + " -0.017131006345152855,\n", + " 
0.044177841395139694,\n", + " 0.0017853195313364267,\n", + " -0.0024302515666931868,\n", + " 0.03783601149916649,\n", + " -0.00892155896872282,\n", + " 0.012549301609396935,\n", + " 0.027275249361991882,\n", + " -0.01050029881298542,\n", + " 0.012723970226943493,\n", + " -0.02199486829340458,\n", + " -0.03536377102136612,\n", + " -0.02347283624112606,\n", + " 0.009519464336335659,\n", + " 0.005925312638282776,\n", + " 0.018703026697039604,\n", + " 0.0024755983613431454,\n", + " -0.01386603806167841,\n", + " -0.011481133289635181,\n", + " 0.0009497631108388305,\n", + " 0.000136355243739672,\n", + " -0.007302511017769575,\n", + " -0.00022925317171029747,\n", + " 0.010896663181483746,\n", + " -0.0023731482215225697,\n", + " -0.008934995159506798,\n", + " -0.02426556497812271,\n", + " -0.005135942716151476,\n", + " 0.014040706679224968,\n", + " -4.135794370085932e-05,\n", + " 0.003822565544396639,\n", + " 0.022236717864871025,\n", + " 0.01215293724089861,\n", + " 0.02794705331325531,\n", + " 0.013476391322910786,\n", + " 0.02304288186132908,\n", + " 0.016297968104481697,\n", + " 0.007846672087907791,\n", + " -0.035229410976171494,\n", + " -0.018716463819146156,\n", + " 0.03547126054763794,\n", + " ...]" + ] }, "execution_count": 97, "metadata": {}, @@ -143,39 +1137,27 @@ "source": [ "truncated = truncate_text_tokens(long_text)\n", "get_embedding(truncated)" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [], - "metadata": { - "collapsed": false - } + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "## 2. Chunking the input text\n", "\n", "Though the option above works, it has the clear drawback of simply discarding all text after the maximum context is filled. Another possible approach that addresses this issue is to in fact divide the input text into chunks and then embed each chunk individually. 
We can then either use the chunk embeddings separately, or combine them in some way, such as for example calculating their average (weighted by the size of each chunk).\n", "\n", - "We will first take a function from python's own cookbook that breaks up a sequence into chunks." - ], - "metadata": { - "collapsed": false - } + "We will first take a function from [Python's own cookbook](https://docs.python.org/3/library/itertools.html#itertools-recipes) that breaks up a sequence into chunks." + ] }, { "cell_type": "code", "execution_count": 91, + "metadata": {}, "outputs": [], "source": [ "from itertools import islice\n", "\n", - "# From: https://docs.python.org/3/library/itertools.html#itertools-recipes\n", "def batched(iterable, n):\n", " \"\"\"Batch data into tuples of length n. The last batch may be shorter.\"\"\"\n", " # batched('ABCDEFG', 3) --> ABC DEF G\n", @@ -184,23 +1166,19 @@ " it = iter(iterable)\n", " while (batch := tuple(islice(it, n))):\n", " yield batch" - ], - "metadata": { - "collapsed": false - } + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "Now let's define a function that encodes a string into tokens and then breaks it up into chunks." - ], - "metadata": { - "collapsed": false - } + ] }, { "cell_type": "code", "execution_count": null, + "metadata": {}, "outputs": [], "source": [ "def chunked_tokens(text, encoding_name, chunk_length):\n", @@ -208,84 +1186,67 @@ " tokens = encoding.encode(text)\n", " chunks_iterator = batched(tokens, chunk_length)\n", " yield from chunks_iterator" - ], - "metadata": { - "collapsed": false - } + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ - "Finally, we can write a function that safely handles embedding requests, even when the input text is longer than the maximum context length, by chunking the input tokens and embedding each chunk individually. 
The `reduction` flag can be set to either `'average'`, to return the weighted average of the chunk embeddings, or `None`, to simply return the unmodified list of chunk embeddings." - ], - "metadata": { - "collapsed": false - } + "Finally, we can write a function that safely handles embedding requests, even when the input text is longer than the maximum context length, by chunking the input tokens and embedding each chunk individually. The `average` flag can be set to `True` to return the weighted average of the chunk embeddings, or `False` to simply return the unmodified list of chunk embeddings." + ] }, { "cell_type": "code", - "execution_count": 101, + "execution_count": 104, + "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "\n", - "def len_safe_get_embedding(text, model=EMBEDDING_MODEL, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING, reduction=None):\n", + "def len_safe_get_embedding(text, model=EMBEDDING_MODEL, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING, average=True):\n", " chunk_embeddings = []\n", " for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens):\n", " chunk_embeddings.append(get_embedding(chunk, model=model))\n", "\n", - " if reduction is None:\n", - " return chunk_embeddings\n", - " elif reduction == 'average':\n", - " return [np.average(chunk_embeddings, axis=0, weights=[len(c) for c in chunk_embeddings]).tolist()]\n", - " else:\n", - " raise ValueError(f'reduction {reduction} not valid.')\n", - "\n", - "\n", - "\n" - ], - "metadata": { - "collapsed": false - } + " if average:\n", + " chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=[len(c) for c in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens)]).tolist()\n", + " return chunk_embeddings" + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "Once again, we can verify that we can now handle long input texts." 
- ], - "metadata": { - "collapsed": false - } + ] }, { "cell_type": "code", - "execution_count": 102, + "execution_count": 105, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Setting reduction=None gives us 2 embedding vectors.\n", - "Setting reduction='average' gives us 1 embedding vectors.\n" + "Setting reduce=None gives us 2 embedding vectors.\n", + "Setting reduce='average' gives us 1 embedding vector.\n" ] } ], "source": [ - "embedding_vectors_no_reduce = len_safe_get_embedding(long_text, reduction=None)\n", - "average_embedding_vector = len_safe_get_embedding(long_text, reduction='average')\n", + "average_embedding_vector = len_safe_get_embedding(long_text, average=True)\n", + "chunks_embedding_vectors = len_safe_get_embedding(long_text, average=False)\n", "\n", - "print(f\"Setting reduction=None gives us {len(embedding_vectors_no_reduce)} embedding vectors.\")\n", - "print(f\"Setting reduction='average' gives us {len(average_embedding_vector)} embedding vector.\")\n" - ], - "metadata": { - "collapsed": false - } + "print(f\"Setting average=True gives us a single {len(average_embedding_vector)}-dimensional embedding vector for our long text.\")\n", + "print(f\"Setting average=False gives us {len(chunks_embedding_vectors)} embedding vectors, one for each of the chunks.\")\n" + ] } ], "metadata": { "kernelspec": { - "display_name": "openai", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -301,7 +1262,6 @@ "pygments_lexer": "ipython3", "version": "3.9.9" }, - "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97" From f3289404b2f764356620fe170d1aff2fc91b71cb Mon Sep 17 00:00:00 2001 From: Filipe de Avila Belbute Peres Date: Wed, 18 Jan 2023 18:20:01 -0800 Subject: [PATCH 5/9] Wrap error message to avoid saving all the trace to the notebook --- examples/Embedding_long_inputs.ipynb | 31 ++++++++-------------------- 
1 file changed, 9 insertions(+), 22 deletions(-) diff --git a/examples/Embedding_long_inputs.ipynb b/examples/Embedding_long_inputs.ipynb index eaf0c20..80f2b51 100644 --- a/examples/Embedding_long_inputs.ipynb +++ b/examples/Embedding_long_inputs.ipynb @@ -49,36 +49,23 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 22, "metadata": {}, "outputs": [ { - "ename": "InvalidRequestError", - "evalue": "This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", - "output_type": "error", - "traceback": [ - "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m", - "\u001B[0;31mInvalidRequestError\u001B[0m Traceback (most recent call last)", - "Cell \u001B[0;32mIn [18], line 2\u001B[0m\n\u001B[1;32m 1\u001B[0m long_text \u001B[38;5;241m=\u001B[39m \u001B[38;5;124m'\u001B[39m\u001B[38;5;124mAGI \u001B[39m\u001B[38;5;124m'\u001B[39m \u001B[38;5;241m*\u001B[39m \u001B[38;5;241m5000\u001B[39m\n\u001B[0;32m----> 2\u001B[0m \u001B[43mget_embedding\u001B[49m\u001B[43m(\u001B[49m\u001B[43mlong_text\u001B[49m\u001B[43m)\u001B[49m\n", - "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:326\u001B[0m, in \u001B[0;36mBaseRetrying.wraps..wrapped_f\u001B[0;34m(*args, **kw)\u001B[0m\n\u001B[1;32m 324\u001B[0m \u001B[38;5;129m@functools\u001B[39m\u001B[38;5;241m.\u001B[39mwraps(f)\n\u001B[1;32m 325\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mwrapped_f\u001B[39m(\u001B[38;5;241m*\u001B[39margs: t\u001B[38;5;241m.\u001B[39mAny, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkw: t\u001B[38;5;241m.\u001B[39mAny) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m t\u001B[38;5;241m.\u001B[39mAny:\n\u001B[0;32m--> 326\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m 
\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mf\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkw\u001B[49m\u001B[43m)\u001B[49m\n", - "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:406\u001B[0m, in \u001B[0;36mRetrying.__call__\u001B[0;34m(self, fn, *args, **kwargs)\u001B[0m\n\u001B[1;32m 404\u001B[0m retry_state \u001B[38;5;241m=\u001B[39m RetryCallState(retry_object\u001B[38;5;241m=\u001B[39m\u001B[38;5;28mself\u001B[39m, fn\u001B[38;5;241m=\u001B[39mfn, args\u001B[38;5;241m=\u001B[39margs, kwargs\u001B[38;5;241m=\u001B[39mkwargs)\n\u001B[1;32m 405\u001B[0m \u001B[38;5;28;01mwhile\u001B[39;00m \u001B[38;5;28;01mTrue\u001B[39;00m:\n\u001B[0;32m--> 406\u001B[0m do \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43miter\u001B[49m\u001B[43m(\u001B[49m\u001B[43mretry_state\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mretry_state\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 407\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(do, DoAttempt):\n\u001B[1;32m 408\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n", - "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:351\u001B[0m, in \u001B[0;36mBaseRetrying.iter\u001B[0;34m(self, retry_state)\u001B[0m\n\u001B[1;32m 349\u001B[0m is_explicit_retry \u001B[38;5;241m=\u001B[39m retry_state\u001B[38;5;241m.\u001B[39moutcome\u001B[38;5;241m.\u001B[39mfailed \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(retry_state\u001B[38;5;241m.\u001B[39moutcome\u001B[38;5;241m.\u001B[39mexception(), TryAgain)\n\u001B[1;32m 350\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m (is_explicit_retry 
\u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mretry(retry_state\u001B[38;5;241m=\u001B[39mretry_state)):\n\u001B[0;32m--> 351\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mfut\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mresult\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 353\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mafter \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[1;32m 354\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mafter(retry_state)\n", - "File \u001B[0;32m~/.pyenv/versions/3.9.9/lib/python3.9/concurrent/futures/_base.py:438\u001B[0m, in \u001B[0;36mFuture.result\u001B[0;34m(self, timeout)\u001B[0m\n\u001B[1;32m 436\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m CancelledError()\n\u001B[1;32m 437\u001B[0m \u001B[38;5;28;01melif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_state \u001B[38;5;241m==\u001B[39m FINISHED:\n\u001B[0;32m--> 438\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m__get_result\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 440\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_condition\u001B[38;5;241m.\u001B[39mwait(timeout)\n\u001B[1;32m 442\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_state \u001B[38;5;129;01min\u001B[39;00m [CANCELLED, CANCELLED_AND_NOTIFIED]:\n", - "File \u001B[0;32m~/.pyenv/versions/3.9.9/lib/python3.9/concurrent/futures/_base.py:390\u001B[0m, in \u001B[0;36mFuture.__get_result\u001B[0;34m(self)\u001B[0m\n\u001B[1;32m 388\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_exception:\n\u001B[1;32m 389\u001B[0m 
\u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m--> 390\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_exception\n\u001B[1;32m 391\u001B[0m \u001B[38;5;28;01mfinally\u001B[39;00m:\n\u001B[1;32m 392\u001B[0m \u001B[38;5;66;03m# Break a reference cycle with the exception in self._exception\u001B[39;00m\n\u001B[1;32m 393\u001B[0m \u001B[38;5;28mself\u001B[39m \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m\n", - "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:409\u001B[0m, in \u001B[0;36mRetrying.__call__\u001B[0;34m(self, fn, *args, **kwargs)\u001B[0m\n\u001B[1;32m 407\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(do, DoAttempt):\n\u001B[1;32m 408\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m--> 409\u001B[0m result \u001B[38;5;241m=\u001B[39m \u001B[43mfn\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 410\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mBaseException\u001B[39;00m: \u001B[38;5;66;03m# noqa: B902\u001B[39;00m\n\u001B[1;32m 411\u001B[0m retry_state\u001B[38;5;241m.\u001B[39mset_exception(sys\u001B[38;5;241m.\u001B[39mexc_info())\n", - "Cell \u001B[0;32mIn [16], line 12\u001B[0m, in \u001B[0;36mget_embedding\u001B[0;34m(text_or_tokens, model)\u001B[0m\n\u001B[1;32m 10\u001B[0m \u001B[38;5;129m@retry\u001B[39m(wait\u001B[38;5;241m=\u001B[39mwait_random_exponential(\u001B[38;5;28mmin\u001B[39m\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m1\u001B[39m, \u001B[38;5;28mmax\u001B[39m\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m20\u001B[39m), stop\u001B[38;5;241m=\u001B[39mstop_after_attempt(\u001B[38;5;241m6\u001B[39m), 
retry\u001B[38;5;241m=\u001B[39mretry_if_not_exception_type(openai\u001B[38;5;241m.\u001B[39mInvalidRequestError))\n\u001B[1;32m 11\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mget_embedding\u001B[39m(text_or_tokens, model\u001B[38;5;241m=\u001B[39mEMBEDDING_MODEL):\n\u001B[0;32m---> 12\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mopenai\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mEmbedding\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43minput\u001B[39;49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mtext_or_tokens\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmodel\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mmodel\u001B[49m\u001B[43m)\u001B[49m[\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mdata\u001B[39m\u001B[38;5;124m\"\u001B[39m][\u001B[38;5;241m0\u001B[39m][\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124membedding\u001B[39m\u001B[38;5;124m\"\u001B[39m]\n", - "File \u001B[0;32m~/code/openai-python/openai/api_resources/embedding.py:33\u001B[0m, in \u001B[0;36mEmbedding.create\u001B[0;34m(cls, *args, **kwargs)\u001B[0m\n\u001B[1;32m 31\u001B[0m \u001B[38;5;28;01mwhile\u001B[39;00m \u001B[38;5;28;01mTrue\u001B[39;00m:\n\u001B[1;32m 32\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m---> 33\u001B[0m response \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43msuper\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 35\u001B[0m \u001B[38;5;66;03m# If a user specifies base64, we'll just return the encoded string.\u001B[39;00m\n\u001B[1;32m 36\u001B[0m \u001B[38;5;66;03m# This is only for the default case.\u001B[39;00m\n\u001B[1;32m 37\u001B[0m 
\u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m user_provided_encoding_format:\n", - "File \u001B[0;32m~/code/openai-python/openai/api_resources/abstract/engine_api_resource.py:153\u001B[0m, in \u001B[0;36mEngineAPIResource.create\u001B[0;34m(cls, api_key, api_base, api_type, request_id, api_version, organization, **params)\u001B[0m\n\u001B[1;32m 127\u001B[0m \u001B[38;5;129m@classmethod\u001B[39m\n\u001B[1;32m 128\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mcreate\u001B[39m(\n\u001B[1;32m 129\u001B[0m \u001B[38;5;28mcls\u001B[39m,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 136\u001B[0m \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams,\n\u001B[1;32m 137\u001B[0m ):\n\u001B[1;32m 138\u001B[0m (\n\u001B[1;32m 139\u001B[0m deployment_id,\n\u001B[1;32m 140\u001B[0m engine,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 150\u001B[0m api_key, api_base, api_type, api_version, organization, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams\n\u001B[1;32m 151\u001B[0m )\n\u001B[0;32m--> 153\u001B[0m response, _, api_key \u001B[38;5;241m=\u001B[39m \u001B[43mrequestor\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mrequest\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 154\u001B[0m \u001B[43m \u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mpost\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\n\u001B[1;32m 155\u001B[0m \u001B[43m \u001B[49m\u001B[43murl\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 156\u001B[0m \u001B[43m \u001B[49m\u001B[43mparams\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mparams\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 157\u001B[0m \u001B[43m \u001B[49m\u001B[43mheaders\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 158\u001B[0m \u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mstream\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 159\u001B[0m \u001B[43m 
\u001B[49m\u001B[43mrequest_id\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_id\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 160\u001B[0m \u001B[43m \u001B[49m\u001B[43mrequest_timeout\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_timeout\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 161\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 163\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m stream:\n\u001B[1;32m 164\u001B[0m \u001B[38;5;66;03m# must be an iterator\u001B[39;00m\n\u001B[1;32m 165\u001B[0m \u001B[38;5;28;01massert\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(response, OpenAIResponse)\n", - "File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:227\u001B[0m, in \u001B[0;36mAPIRequestor.request\u001B[0;34m(self, method, url, params, headers, files, stream, request_id, request_timeout)\u001B[0m\n\u001B[1;32m 206\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mrequest\u001B[39m(\n\u001B[1;32m 207\u001B[0m \u001B[38;5;28mself\u001B[39m,\n\u001B[1;32m 208\u001B[0m method,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 215\u001B[0m request_timeout: Optional[Union[\u001B[38;5;28mfloat\u001B[39m, Tuple[\u001B[38;5;28mfloat\u001B[39m, \u001B[38;5;28mfloat\u001B[39m]]] \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m,\n\u001B[1;32m 216\u001B[0m ) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m Tuple[Union[OpenAIResponse, Iterator[OpenAIResponse]], \u001B[38;5;28mbool\u001B[39m, \u001B[38;5;28mstr\u001B[39m]:\n\u001B[1;32m 217\u001B[0m result \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mrequest_raw(\n\u001B[1;32m 218\u001B[0m method\u001B[38;5;241m.\u001B[39mlower(),\n\u001B[1;32m 219\u001B[0m url,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 225\u001B[0m request_timeout\u001B[38;5;241m=\u001B[39mrequest_timeout,\n\u001B[1;32m 226\u001B[0m )\n\u001B[0;32m--> 227\u001B[0m resp, got_stream \u001B[38;5;241m=\u001B[39m 
\u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_interpret_response\u001B[49m\u001B[43m(\u001B[49m\u001B[43mresult\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 228\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m resp, got_stream, \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mapi_key\n", - "File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:620\u001B[0m, in \u001B[0;36mAPIRequestor._interpret_response\u001B[0;34m(self, result, stream)\u001B[0m\n\u001B[1;32m 612\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m (\n\u001B[1;32m 613\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_interpret_response_line(\n\u001B[1;32m 614\u001B[0m line, result\u001B[38;5;241m.\u001B[39mstatus_code, result\u001B[38;5;241m.\u001B[39mheaders, stream\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mTrue\u001B[39;00m\n\u001B[1;32m 615\u001B[0m )\n\u001B[1;32m 616\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m line \u001B[38;5;129;01min\u001B[39;00m parse_stream(result\u001B[38;5;241m.\u001B[39miter_lines())\n\u001B[1;32m 617\u001B[0m ), \u001B[38;5;28;01mTrue\u001B[39;00m\n\u001B[1;32m 618\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m 619\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m (\n\u001B[0;32m--> 620\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_interpret_response_line\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 621\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcontent\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdecode\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mutf-8\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 622\u001B[0m \u001B[43m 
\u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mstatus_code\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 623\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 624\u001B[0m \u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43;01mFalse\u001B[39;49;00m\u001B[43m,\u001B[49m\n\u001B[1;32m 625\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m,\n\u001B[1;32m 626\u001B[0m \u001B[38;5;28;01mFalse\u001B[39;00m,\n\u001B[1;32m 627\u001B[0m )\n", - "File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:680\u001B[0m, in \u001B[0;36mAPIRequestor._interpret_response_line\u001B[0;34m(self, rbody, rcode, rheaders, stream)\u001B[0m\n\u001B[1;32m 678\u001B[0m stream_error \u001B[38;5;241m=\u001B[39m stream \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124merror\u001B[39m\u001B[38;5;124m\"\u001B[39m \u001B[38;5;129;01min\u001B[39;00m resp\u001B[38;5;241m.\u001B[39mdata\n\u001B[1;32m 679\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m stream_error \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;241m200\u001B[39m \u001B[38;5;241m<\u001B[39m\u001B[38;5;241m=\u001B[39m rcode \u001B[38;5;241m<\u001B[39m \u001B[38;5;241m300\u001B[39m:\n\u001B[0;32m--> 680\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mhandle_error_response(\n\u001B[1;32m 681\u001B[0m rbody, rcode, resp\u001B[38;5;241m.\u001B[39mdata, rheaders, stream_error\u001B[38;5;241m=\u001B[39mstream_error\n\u001B[1;32m 682\u001B[0m )\n\u001B[1;32m 683\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m resp\n", - "\u001B[0;31mInvalidRequestError\u001B[0m: This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length." 
+ "name": "stdout", + "output_type": "stream", + "text": [ + "This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.\n" ] } ], "source": [ "long_text = 'AGI ' * 5000\n", - "get_embedding(long_text)" + "try:\n", + " get_embedding(long_text)\n", + "except openai.InvalidRequestError as e:\n", + " print(e)" ] }, { From ee69beb8cd856144757339eeff7d73ec86577d71 Mon Sep 17 00:00:00 2001 From: Filipe de Avila Belbute Peres Date: Wed, 18 Jan 2023 18:22:03 -0800 Subject: [PATCH 6/9] Print len instead of full array --- examples/Embedding_long_inputs.ipynb | 1018 +------------------------- 1 file changed, 8 insertions(+), 1010 deletions(-) diff --git a/examples/Embedding_long_inputs.ipynb b/examples/Embedding_long_inputs.ipynb index 80f2b51..a1fdb9a 100644 --- a/examples/Embedding_long_inputs.ipynb +++ b/examples/Embedding_long_inputs.ipynb @@ -22,7 +22,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ @@ -49,7 +49,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 26, "metadata": {}, "outputs": [ { @@ -86,7 +86,7 @@ }, { "cell_type": "code", - "execution_count": 98, + "execution_count": 27, "metadata": {}, "outputs": [], "source": [ @@ -107,1023 +107,21 @@ }, { "cell_type": "code", - "execution_count": 97, + "execution_count": 32, "metadata": {}, "outputs": [ { "data": { - "text/plain": [ - "[-0.015384314581751823,\n", - " 0.0031692360062152147,\n", - " -0.007302511017769575,\n", - " -0.02778581902384758,\n", - " -0.013409210368990898,\n", - " 0.0029592972714453936,\n", - " -0.019119545817375183,\n", - " -0.0004874778969679028,\n", - " -0.010721994563937187,\n", - " -0.023486273363232613,\n", - " 0.016351712867617607,\n", - " 0.005532307084649801,\n", - " -0.009136536158621311,\n", - " -0.014282556250691414,\n", - " 0.005122506525367498,\n", - " 
0.02888757921755314,\n", - " 0.020973725244402885,\n", - " 0.009136536158621311,\n", - " 0.003303596982732415,\n", - " -0.013382338918745518,\n", - " -0.024749264121055603,\n", - " 0.03904525563120842,\n", - " -0.01699664443731308,\n", - " -0.010312194004654884,\n", - " -0.009029047563672066,\n", - " -0.001587137347087264,\n", - " 0.017036953940987587,\n", - " -0.056915245950222015,\n", - " -0.011084768921136856,\n", - " -0.006375421304255724,\n", - " 0.011145230382680893,\n", - " -0.01094368938356638,\n", - " -0.010184550657868385,\n", - " -0.009546336717903614,\n", - " -0.012105911038815975,\n", - " -0.004675756674259901,\n", - " 0.002245505340397358,\n", - " -0.0015040015568956733,\n", - " -0.007457026280462742,\n", - " 0.0029206685721874237,\n", - " 0.03993203863501549,\n", - " -0.02390279248356819,\n", - " 0.003399329027161002,\n", - " -0.02109465003013611,\n", - " -0.026590008288621902,\n", - " 0.004457420669496059,\n", - " -0.03638491407036781,\n", - " -0.018958313390612602,\n", - " 0.002221992239356041,\n", - " -0.007846672087907791,\n", - " -0.0106548136100173,\n", - " 0.0019096032483503222,\n", - " -0.015451495535671711,\n", - " -0.00783995445817709,\n", - " 0.016821976751089096,\n", - " 0.007409999612718821,\n", - " -0.017601268365979195,\n", - " 0.01502154115587473,\n", - " -0.026119744405150414,\n", - " -0.011333336122334003,\n", - " -0.017184749245643616,\n", - " -0.0352562814950943,\n", - " -0.002327801426872611,\n", - " 0.015666473656892776,\n", - " -0.023069754242897034,\n", - " -0.016821976751089096,\n", - " -0.0005298855248838663,\n", - " 0.0010933612938970327,\n", - " 0.0048571438528597355,\n", - " -0.034503862261772156,\n", - " 0.007712311577051878,\n", - " 0.038024116307497025,\n", - " -0.017856555059552193,\n", - " -0.02415807731449604,\n", - " 0.020664695650339127,\n", - " -0.01742659881711006,\n", - " 0.012072320096194744,\n", - " 0.015249954536557198,\n", - " -0.008357243612408638,\n", - " 0.001610650448128581,\n", - " 
0.018017787486314774,\n", - " -0.02247856743633747,\n", - " -3.219936115783639e-05,\n", - " 0.02421182207763195,\n", - " 0.010594351217150688,\n", - " 0.01800435036420822,\n", - " -0.019777914509177208,\n", - " 0.024695521220564842,\n", - " 0.0013805575435981154,\n", - " -0.0138122932985425,\n", - " 0.02132306434214115,\n", - " 0.023325040936470032,\n", - " 0.027597714215517044,\n", - " 0.06062360480427742,\n", - " -0.019562937319278717,\n", - " 0.009559772908687592,\n", - " -0.02183363400399685,\n", - " 0.0173728559166193,\n", - " -0.028242645785212517,\n", - " -0.03058052435517311,\n", - " 0.01847461424767971,\n", - " -0.026536263525485992,\n", - " -0.007947443053126335,\n", - " -0.007517488207668066,\n", - " -0.026616880670189857,\n", - " 0.009183562360703945,\n", - " 0.01872989907860756,\n", - " -0.022075483575463295,\n", - " 0.019589809700846672,\n", - " -0.023916227743029594,\n", - " 0.019347960129380226,\n", - " 0.02378186769783497,\n", - " 0.019764477387070656,\n", - " -0.0202616136521101,\n", - " -0.019401703029870987,\n", - " 0.006335113197565079,\n", - " 0.015209645964205265,\n", - " -0.029935592785477638,\n", - " -0.007013635244220495,\n", - " -0.0363042950630188,\n", - " 0.00704050762578845,\n", - " 0.01616360805928707,\n", - " 0.014981232583522797,\n", - " -0.0013931537978351116,\n", - " 0.030661141499876976,\n", - " 0.01389290951192379,\n", - " 0.007712311577051878,\n", - " -0.01910611055791378,\n", - " -0.0020792337600141764,\n", - " -0.008404269814491272,\n", - " 0.024722393602132797,\n", - " 0.01699664443731308,\n", - " 0.008350525982677937,\n", - " 0.009727723896503448,\n", - " -0.010695122182369232,\n", - " 0.006560167297720909,\n", - " -0.031386688351631165,\n", - " 0.0263078510761261,\n", - " -0.0001876852911664173,\n", - " -0.01816558465361595,\n", - " 0.019482320174574852,\n", - " 0.023190679028630257,\n", - " -0.015115593560039997,\n", - " -0.015384314581751823,\n", - " -0.005233354400843382,\n", - " 0.004225648008286953,\n", - " 
-0.0011555030941963196,\n", - " -0.012092474848031998,\n", - " 0.011602058075368404,\n", - " -0.02179332636296749,\n", - " -0.003029836807399988,\n", - " 0.0030382343102246523,\n", - " -0.011151948943734169,\n", - " 0.007430153898894787,\n", - " 0.001625766046345234,\n", - " 0.010795892216265202,\n", - " 0.0033136738929897547,\n", - " 0.013167361728847027,\n", - " -0.027033399790525436,\n", - " 0.002052361611276865,\n", - " 0.015061848796904087,\n", - " 0.017762500792741776,\n", - " 0.014349736273288727,\n", - " -0.007047225721180439,\n", - " 0.014887180179357529,\n", - " 0.023190679028630257,\n", - " 0.0055289482697844505,\n", - " 0.018084967508912086,\n", - " -0.0014888859586790204,\n", - " -0.003711717901751399,\n", - " 0.008290063589811325,\n", - " 0.03740605339407921,\n", - " 0.007960879243910313,\n", - " 0.01809840463101864,\n", - " 0.010916817001998425,\n", - " 0.03504130616784096,\n", - " 0.0031138123013079166,\n", - " -0.005303893703967333,\n", - " -0.022868212312459946,\n", - " -0.01373839471489191,\n", - " -0.013933218084275723,\n", - " 0.008525194600224495,\n", - " 0.05304565653204918,\n", - " 0.014537842012941837,\n", - " 0.006230983417481184,\n", - " -0.004920965526252985,\n", - " 0.002856847131624818,\n", - " -0.015868013724684715,\n", - " 0.006835607346147299,\n", - " -0.027449917048215866,\n", - " -0.0049041700549423695,\n", - " 0.009808340109884739,\n", - " 0.028914449736475945,\n", - " -0.017386291176080704,\n", - " -0.6199946403503418,\n", - " -0.02336534857749939,\n", - " -0.018353689461946487,\n", - " -0.0028131799772381783,\n", - " 0.019804786890745163,\n", - " 0.04409722611308098,\n", - " 0.005280380602926016,\n", - " 0.011702828109264374,\n", - " -0.024829881265759468,\n", - " 0.01465876679867506,\n", - " -0.013395775109529495,\n", - " 0.025877896696329117,\n", - " -0.01636514998972416,\n", - " 0.01199842244386673,\n", - " -0.01084291934967041,\n", - " -0.008827506564557552,\n", - " 0.00870658177882433,\n", - " -0.020086944103240967,\n", - 
" -0.0006025243201293051,\n", - " 0.027812691405415535,\n", - " -0.03404703363776207,\n", - " 0.0019079238409176469,\n", - " 0.0024403284769505262,\n", - " -0.006099981721490622,\n", - " 0.009082792326807976,\n", - " -0.0050519672222435474,\n", - " 0.014309428632259369,\n", - " -0.022921957075595856,\n", - " -0.02199486829340458,\n", - " 0.003607588354498148,\n", - " -0.008518476970493793,\n", - " 0.00871329940855503,\n", - " 0.02747678942978382,\n", - " -0.020906545221805573,\n", - " 0.04253863915801048,\n", - " 0.000455147324828431,\n", - " 0.014484097249805927,\n", - " 0.033079635351896286,\n", - " 0.026590008288621902,\n", - " 0.05258882790803909,\n", - " -0.025971949100494385,\n", - " 0.010493581183254719,\n", - " 0.026455648243427277,\n", - " -0.008511758409440517,\n", - " -0.019025493413209915,\n", - " 0.020019764080643654,\n", - " 0.01937483251094818,\n", - " -0.013415928930044174,\n", - " -0.0027863075956702232,\n", - " -0.007860108278691769,\n", - " 0.011662520468235016,\n", - " -0.007255484815686941,\n", - " 0.0033707772381603718,\n", - " -0.01479312777519226,\n", - " 0.009358231909573078,\n", - " 0.007100970018655062,\n", - " 0.02388935536146164,\n", - " -0.017171313986182213,\n", - " -0.008726735599339008,\n", - " 0.010977279394865036,\n", - " -0.003943490330129862,\n", - " 0.004695910960435867,\n", - " -0.003323751036077738,\n", - " 0.011642365716397762,\n", - " -0.014510969631373882,\n", - " 0.0063888574950397015,\n", - " -0.006832248065620661,\n", - " 0.01937483251094818,\n", - " 0.0011857342906296253,\n", - " 0.006308240815997124,\n", - " -0.0029643357265740633,\n", - " 0.012569455429911613,\n", - " -0.013677932322025299,\n", - " 0.01015767827630043,\n", - " -0.002561253262683749,\n", - " -0.0055994875729084015,\n", - " 0.024144640192389488,\n", - " 0.0076988753862679005,\n", - " -0.01128630992025137,\n", - " -0.022277025505900383,\n", - " 0.013422646559774876,\n", - " 0.00892155896872282,\n", - " 0.0036613326519727707,\n", - " 
-0.009438848122954369,\n", - " 0.04151749610900879,\n", - " -0.005727130454033613,\n", - " -0.00863268319517374,\n", - " -0.012804587371647358,\n", - " 0.011138512752950191,\n", - " -0.003283442696556449,\n", - " -0.00783995445817709,\n", - " 0.028538240119814873,\n", - " 0.00030609077657572925,\n", - " 0.006113417912274599,\n", - " 0.0205303356051445,\n", - " 0.0037721802946180105,\n", - " -0.02425212971866131,\n", - " 0.013771984726190567,\n", - " 0.0034833045210689306,\n", - " -0.01748034358024597,\n", - " -0.0062444196082651615,\n", - " -0.005653231870383024,\n", - " 0.011037741787731647,\n", - " 0.02684529311954975,\n", - " -0.023822175338864326,\n", - " 0.041598111391067505,\n", - " -0.02915629930794239,\n", - " -0.009895674884319305,\n", - " 0.03240783140063286,\n", - " -0.022639799863100052,\n", - " 0.01879708096385002,\n", - " -0.03727169334888458,\n", - " -0.02415807731449604,\n", - " -0.02132306434214115,\n", - " 0.014940924011170864,\n", - " -0.03536377102136612,\n", - " 0.012925512157380581,\n", - " 0.012421658262610435,\n", - " 0.017117569223046303,\n", - " -0.01281130500137806,\n", - " -0.014269120059907436,\n", - " -0.010144243016839027,\n", - " 0.0049075293354690075,\n", - " -0.01338905654847622,\n", - " 0.00038628740003332496,\n", - " -7.085434481268749e-05,\n", - " -0.00503853103145957,\n", - " -0.024507414549589157,\n", - " -0.022653235122561455,\n", - " 0.02374155819416046,\n", - " 0.03141356259584427,\n", - " -0.003390931524336338,\n", - " 0.015653036534786224,\n", - " -0.024386489763855934,\n", - " 0.05546415224671364,\n", - " 0.015438059344887733,\n", - " 0.02504485845565796,\n", - " -0.001958309207111597,\n", - " 0.013684650883078575,\n", - " -0.01979134976863861,\n", - " -0.022706979885697365,\n", - " 0.013825729489326477,\n", - " 0.008753607980906963,\n", - " -0.014537842012941837,\n", - " 0.01672792248427868,\n", - " -0.01663387008011341,\n", - " -0.014121322892606258,\n", - " -0.015451495535671711,\n", - " -0.005186328198760748,\n", - " 
0.03512192144989967,\n", - " -0.008746890351176262,\n", - " 0.029693743214011192,\n", - " -0.016486072912812233,\n", - " 0.026482518762350082,\n", - " -0.023969972506165504,\n", - " -0.037916626781225204,\n", - " -0.017722193151712418,\n", - " -0.0202616136521101,\n", - " 0.023298168554902077,\n", - " 0.0012461966834962368,\n", - " 0.024695521220564842,\n", - " 0.021658966317772865,\n", - " -0.010936971753835678,\n", - " 0.002561253262683749,\n", - " -0.005156097002327442,\n", - " -0.01057419739663601,\n", - " 0.009754596278071404,\n", - " 0.03125232830643654,\n", - " -0.021282754838466644,\n", - " -0.031924132257699966,\n", - " 0.016647307202219963,\n", - " 0.013355466537177563,\n", - " -0.00647283298894763,\n", - " 0.019334523007273674,\n", - " 0.012032012455165386,\n", - " 0.02778581902384758,\n", - " -0.008001187816262245,\n", - " 0.011071332730352879,\n", - " 0.0004958754288963974,\n", - " 0.006368703208863735,\n", - " -0.002013732912018895,\n", - " -0.011951396241784096,\n", - " -0.02372812293469906,\n", - " 0.0049075293354690075,\n", - " 0.032810915261507034,\n", - " -0.010963844135403633,\n", - " -0.013986961916089058,\n", - " 0.041974324733018875,\n", - " -0.018541794270277023,\n", - " 0.022586055099964142,\n", - " -0.003758744103834033,\n", - " 0.020355666056275368,\n", - " 0.022129228338599205,\n", - " 0.004353290889412165,\n", - " -0.03531002625823021,\n", - " 0.0034564323723316193,\n", - " 0.00438352208584547,\n", - " -0.0040207477286458015,\n", - " -0.002181683899834752,\n", - " 0.005488639697432518,\n", - " 0.013227824121713638,\n", - " -0.002729204250499606,\n", - " 0.02899506688117981,\n", - " -0.027328992262482643,\n", - " 0.019536064937710762,\n", - " -0.01735941879451275,\n", - " 0.007235330529510975,\n", - " -0.02126931957900524,\n", - " 0.026388466358184814,\n", - " 0.016284532845020294,\n", - " 0.0048806569539010525,\n", - " -0.018971748650074005,\n", - " -0.008384115993976593,\n", - " 0.00697332713752985,\n", - " 0.023647505789995193,\n", - 
" 0.011843906715512276,\n", - " -0.004353290889412165,\n", - " 0.013059872202575207,\n", - " -0.014846871607005596,\n", - " -0.0016291250940412283,\n", - " 0.010896663181483746,\n", - " -0.003557202871888876,\n", - " 0.0031524410005658865,\n", - " -0.0016249263426288962,\n", - " -0.03509504720568657,\n", - " 0.005639795679599047,\n", - " 0.00781980063766241,\n", - " 0.007786210160702467,\n", - " 0.007907134480774403,\n", - " -0.01518277358263731,\n", - " 0.0005798509810119867,\n", - " 0.006738195661455393,\n", - " 0.014578149653971195,\n", - " 0.023553453385829926,\n", - " 0.013395775109529495,\n", - " 0.015706781297922134,\n", - " 0.019119545817375183,\n", - " -0.006818812340497971,\n", - " 0.01567990891635418,\n", - " -0.0037251540925353765,\n", - " 0.003856155788525939,\n", - " 0.009425411932170391,\n", - " 0.012052166275680065,\n", - " -0.030392419546842575,\n", - " 0.03697609901428223,\n", - " -0.0009875521063804626,\n", - " 0.05073465034365654,\n", - " -0.012119347229599953,\n", - " -0.03520253673195839,\n", - " 0.027758946642279625,\n", - " -0.012240272015333176,\n", - " 0.018246199935674667,\n", - " 0.0012361196568235755,\n", - " -0.002601561602205038,\n", - " 0.02625410631299019,\n", - " -0.03541751578450203,\n", - " 0.0018071531085297465,\n", - " 0.010795892216265202,\n", - " 0.002294211182743311,\n", - " 0.023486273363232613,\n", - " 0.02646908350288868,\n", - " 0.006677733268588781,\n", - " 0.0025394195690751076,\n", - " 0.010218140669167042,\n", - " 0.013200951740145683,\n", - " 0.02325785905122757,\n", - " 0.020328793674707413,\n", - " -0.004642166662961245,\n", - " -0.015558984130620956,\n", - " 0.008323653601109982,\n", - " -0.003956926520913839,\n", - " -0.015196209773421288,\n", - " -0.006422447506338358,\n", - " -0.010513735003769398,\n", - " 0.009794904850423336,\n", - " -0.011434106156229973,\n", - " -0.018555231392383575,\n", - " -0.014833435416221619,\n", - " 0.008934995159506798,\n", - " -0.00547184469178319,\n", - " 
-0.0012193245347589254,\n", - " -0.01504841260612011,\n", - " 0.022693544626235962,\n", - " 0.02642877586185932,\n", - " 0.001154663390479982,\n", - " -0.01336890272796154,\n", - " -0.004890734329819679,\n", - " 0.025071730837225914,\n", - " -0.005928671453148127,\n", - " 0.004323059692978859,\n", - " -0.012018576264381409,\n", - " 0.030822373926639557,\n", - " 0.007517488207668066,\n", - " 0.011675955727696419,\n", - " 0.004951196722686291,\n", - " 0.015827706083655357,\n", - " 0.014094451442360878,\n", - " -0.039018385112285614,\n", - " 0.003486663568764925,\n", - " 0.0029206685721874237,\n", - " 0.02284134179353714,\n", - " 0.00033863127464428544,\n", - " -0.014403481036424637,\n", - " -0.001502322033047676,\n", - " 0.0244671069085598,\n", - " 0.0013881153427064419,\n", - " 0.0076316953636705875,\n", - " -0.009700851514935493,\n", - " -0.02435961924493313,\n", - " -0.045494578778743744,\n", - " 0.010124088265001774,\n", - " 0.0009220512001775205,\n", - " -0.01584114134311676,\n", - " -0.005663308780640364,\n", - " -0.012549301609396935,\n", - " -0.0027980643790215254,\n", - " -0.017292238771915436,\n", - " -0.013241259381175041,\n", - " 0.005199763923883438,\n", - " -0.0024604827631264925,\n", - " 0.011541595682501793,\n", - " -0.005219918210059404,\n", - " -0.01762814074754715,\n", - " -0.0006478711147792637,\n", - " 0.09980322420597076,\n", - " 0.02805454097688198,\n", - " 0.004746296443045139,\n", - " 0.021699273958802223,\n", - " 0.006892710458487272,\n", - " 0.011971550062298775,\n", - " -0.007779492065310478,\n", - " -0.010426400229334831,\n", - " 0.009768032468855381,\n", - " -0.023029446601867676,\n", - " 0.01742659881711006,\n", - " 0.0013268132461234927,\n", - " 0.002284134039655328,\n", - " 0.007766055874526501,\n", - " 0.018461177125573158,\n", - " -0.016244225203990936,\n", - " -0.021336499601602554,\n", - " -0.0093447957187891,\n", - " 0.004245802294462919,\n", - " -0.004783245734870434,\n", - " -0.009808340109884739,\n", - " 
0.014631894417107105,\n", - " -0.02362063340842724,\n", - " 0.013651059940457344,\n", - " 0.021954558789730072,\n", - " -0.02114839479327202,\n", - " 0.0031591590959578753,\n", - " 0.003164197551086545,\n", - " -0.010769020766019821,\n", - " -0.006855761166661978,\n", - " -0.016969772055745125,\n", - " -0.00590515835210681,\n", - " -0.015411186963319778,\n", - " 0.0001366701617371291,\n", - " -0.015196209773421288,\n", - " 0.0011731380363926291,\n", - " 0.009855367243289948,\n", - " -0.018071532249450684,\n", - " 0.03936772421002388,\n", - " -0.027342429384589195,\n", - " 0.029451893642544746,\n", - " 0.0027476788964122534,\n", - " -0.009828494861721992,\n", - " -0.02257261984050274,\n", - " 0.01479312777519226,\n", - " -0.026119744405150414,\n", - " -0.01007706206291914,\n", - " 0.009559772908687592,\n", - " -0.014752819202840328,\n", - " -0.03135981783270836,\n", - " 0.014322864823043346,\n", - " -0.0008481527329422534,\n", - " -0.01502154115587473,\n", - " 0.004148390609771013,\n", - " 0.010856354609131813,\n", - " 0.013919781893491745,\n", - " -0.03135981783270836,\n", - " -0.010688403621315956,\n", - " 0.008827506564557552,\n", - " -0.017251931130886078,\n", - " -0.009700851514935493,\n", - " -0.012925512157380581,\n", - " 0.010466708801686764,\n", - " -0.019831659272313118,\n", - " 0.009721006266772747,\n", - " -0.028403880074620247,\n", - " -0.0027325632981956005,\n", - " 0.00016994545876514167,\n", - " 0.0021850429475307465,\n", - " 0.004924324341118336,\n", - " 0.02958625555038452,\n", - " 0.008216165006160736,\n", - " -0.01915985345840454,\n", - " -0.0005004940903745592,\n", - " -0.004598499275743961,\n", - " 0.02642877586185932,\n", - " 0.011689391918480396,\n", - " -0.0024201744236052036,\n", - " 0.008384115993976593,\n", - " 0.02309662662446499,\n", - " 0.023378783836960793,\n", - " -0.040657587349414825,\n", - " -0.015908321365714073,\n", - " -0.0019129622960463166,\n", - " -0.019966019317507744,\n", - " 0.009156690910458565,\n", - " 
-0.01279115118086338,\n", - " -0.0228950846940279,\n", - " 0.002087631495669484,\n", - " -0.0008225401979871094,\n", - " 0.013281567953526974,\n", - " -0.0075779506005346775,\n", - " 0.00553902518004179,\n", - " -0.019038928672671318,\n", - " -0.02327129617333412,\n", - " 0.002831654390320182,\n", - " 0.023177243769168854,\n", - " -0.02531358040869236,\n", - " 0.0001918840571306646,\n", - " -0.004662320949137211,\n", - " 0.0281620305031538,\n", - " -0.00968741625547409,\n", - " 0.0023210833314806223,\n", - " -0.011561749503016472,\n", - " -0.0007293273811228573,\n", - " 0.018770208582282066,\n", - " 0.005458408500999212,\n", - " 0.0038628738839179277,\n", - " 0.00013467573444359004,\n", - " -0.0010253411019220948,\n", - " 0.014161631464958191,\n", - " 0.003374136285856366,\n", - " 0.012065602466464043,\n", - " -0.013617469929158688,\n", - " -0.0018323458498343825,\n", - " 0.005045249126851559,\n", - " 0.0011084768921136856,\n", - " -0.006785221863538027,\n", - " 0.009069356136023998,\n", - " -0.0005160295404493809,\n", - " -0.012636636383831501,\n", - " -0.00517289200797677,\n", - " 0.022357642650604248,\n", - " -0.006953172851353884,\n", - " -0.029666870832443237,\n", - " 0.0021296192426234484,\n", - " -0.0006881793960928917,\n", - " -0.002552855759859085,\n", - " 0.0004912567674182355,\n", - " -0.004255879204720259,\n", - " -0.0008040656102821231,\n", - " 0.018810516223311424,\n", - " -0.015196209773421288,\n", - " -0.015962066128849983,\n", - " -0.008565503172576427,\n", - " 0.0014930847100913525,\n", - " -0.023338476195931435,\n", - " 0.012092474848031998,\n", - " -0.025743534788489342,\n", - " -0.010957125574350357,\n", - " -0.033079635351896286,\n", - " -0.00035395679879002273,\n", - " 0.022760724648833275,\n", - " -0.003933413419872522,\n", - " -0.03563249111175537,\n", - " -0.04122190177440643,\n", - " 0.0057943109422922134,\n", - " 0.017077261582016945,\n", - " -0.008478168398141861,\n", - " 0.023862482979893684,\n", - " -0.004820194561034441,\n", - " 
-0.004339854698628187,\n", - " -0.007947443053126335,\n", - " -0.01799091510474682,\n", - " -0.01946888491511345,\n", - " -0.027812691405415535,\n", - " -0.019656989723443985,\n", - " 0.01883738860487938,\n", - " 0.02663031592965126,\n", - " 0.028484495356678963,\n", - " 0.02800079621374607,\n", - " 0.020328793674707413,\n", - " 0.02125588245689869,\n", - " -5.395427069743164e-05,\n", - " 0.02556886523962021,\n", - " -0.012052166275680065,\n", - " 0.03243470564484596,\n", - " 0.004974709823727608,\n", - " -0.0105204526335001,\n", - " 0.011897651478648186,\n", - " 0.013335312716662884,\n", - " -0.013825729489326477,\n", - " 0.014363172464072704,\n", - " -0.01811183989048004,\n", - " -0.0024403284769505262,\n", - " 0.03727169334888458,\n", - " -0.012092474848031998,\n", - " -0.016284532845020294,\n", - " -0.008148984052240849,\n", - " -0.031091095879673958,\n", - " -0.01621735282242298,\n", - " 0.006654220167547464,\n", - " 0.020879672840237617,\n", - " 0.005364356096833944,\n", - " -0.03436949849128723,\n", - " -0.025931639596819878,\n", - " 0.013590597547590733,\n", - " 0.008639401756227016,\n", - " 0.017278803512454033,\n", - " -0.012992692179977894,\n", - " 0.021766453981399536,\n", - " -0.003293519839644432,\n", - " 0.013919781893491745,\n", - " 0.009438848122954369,\n", - " 0.015505239367485046,\n", - " -0.02374155819416046,\n", - " -0.010863073170185089,\n", - " -0.0030936580151319504,\n", - " 0.0107891745865345,\n", - " 0.017668448388576508,\n", - " 0.005965620744973421,\n", - " 0.005928671453148127,\n", - " 0.002952579176053405,\n", - " -0.016714487224817276,\n", - " -0.017036953940987587,\n", - " 0.024964241310954094,\n", - " -0.01173641812056303,\n", - " -0.003752026241272688,\n", - " 0.01094368938356638,\n", - " -0.022747289389371872,\n", - " 0.00047992009785957634,\n", - " -0.01778937317430973,\n", - " -0.05425490438938141,\n", - " -0.01731911115348339,\n", - " 0.020812492817640305,\n", - " -0.0032431345898658037,\n", - " -0.0292100440710783,\n", - " 
-0.004279392305761576,\n", - " 0.012482120655477047,\n", - " -0.03541751578450203,\n", - " 0.002704011742025614,\n", - " -0.007759337779134512,\n", - " 0.02304288186132908,\n", - " 0.012199963442981243,\n", - " 0.028538240119814873,\n", - " 0.014860307797789574,\n", - " -0.012307452037930489,\n", - " -0.01936139538884163,\n", - " -0.0033607003279030323,\n", - " -0.004014029633253813,\n", - " -0.007638412993401289,\n", - " -0.010271885432302952,\n", - " 0.008021341636776924,\n", - " 0.0010925214737653732,\n", - " -0.0373791828751564,\n", - " 0.0024923933669924736,\n", - " 0.008021341636776924,\n", - " -0.00739656388759613,\n", - " -0.02410433255136013,\n", - " 0.025246400386095047,\n", - " 0.005374433007091284,\n", - " 0.010762302204966545,\n", - " -0.006627347785979509,\n", - " -0.015142465941607952,\n", - " -0.050439056009054184,\n", - " 0.04108754172921181,\n", - " 0.03869592025876045,\n", - " 0.004007311537861824,\n", - " 0.003866232931613922,\n", - " 0.004413753282278776,\n", - " 0.015129029750823975,\n", - " 0.023298168554902077,\n", - " -0.024064024910330772,\n", - " 0.0011177140986546874,\n", - " -0.009270897135138512,\n", - " 0.0016266057500615716,\n", - " 0.017386291176080704,\n", - " -0.013745113275945187,\n", - " 0.01694290153682232,\n", - " 0.003973721526563168,\n", - " -0.012011858634650707,\n", - " -7.295373507076874e-05,\n", - " -0.016324840486049652,\n", - " 0.011555030941963196,\n", - " 0.014725946821272373,\n", - " 0.003930054139345884,\n", - " -0.012253707274794579,\n", - " -0.01537087932229042,\n", - " 0.0050519672222435474,\n", - " 0.016136735677719116,\n", - " -0.04573642462491989,\n", - " -0.009647107683122158,\n", - " -0.014201940037310123,\n", - " -0.006543372292071581,\n", - " 0.017655013129115105,\n", - " 0.0035168947651982307,\n", - " -0.00868642795830965,\n", - " 0.011165385134518147,\n", - " -0.023768430575728416,\n", - " -0.011763290502130985,\n", - " 0.03350958973169327,\n", - " 0.003799052443355322,\n", - " 
0.0060966224409639835,\n", - " 0.0007314267568290234,\n", - " -0.004679115954786539,\n", - " -0.003631101455539465,\n", - " 0.007705593481659889,\n", - " 0.010433118790388107,\n", - " 0.029021939262747765,\n", - " -0.008390833623707294,\n", - " -0.023929663002490997,\n", - " -0.010963844135403633,\n", - " 0.00109504081774503,\n", - " -0.0034161240328103304,\n", - " 0.009304487146437168,\n", - " -0.014282556250691414,\n", - " -0.00626121461391449,\n", - " 0.03221972659230232,\n", - " -0.04299546405673027,\n", - " 0.007121123839169741,\n", - " 0.014551278203725815,\n", - " -0.012206681072711945,\n", - " -0.008169138804078102,\n", - " 0.001264671329408884,\n", - " -0.004766450263559818,\n", - " 0.00836396124213934,\n", - " 0.04237740486860275,\n", - " 0.003034875262528658,\n", - " -0.01231416966766119,\n", - " -0.01523651834577322,\n", - " 0.017775937914848328,\n", - " 0.03990516811609268,\n", - " -0.002383225131779909,\n", - " 0.004830271936953068,\n", - " 0.013563726097345352,\n", - " 0.000969917222391814,\n", - " 0.01346967276185751,\n", - " 0.002389943227171898,\n", - " -0.014806563034653664,\n", - " 0.007436871994286776,\n", - " -0.039448339492082596,\n", - " 0.009015611372888088,\n", - " 0.0007436032174155116,\n", - " -0.004622012376785278,\n", - " 0.004222289193421602,\n", - " 0.016244225203990936,\n", - " 0.01831338182091713,\n", - " 0.005146019626408815,\n", - " -0.013691368512809277,\n", - " -0.03904525563120842,\n", - " -0.024695521220564842,\n", - " -0.019562937319278717,\n", - " -0.013852601870894432,\n", - " -0.009385104291141033,\n", - " 0.003081901464611292,\n", - " -0.013019564561545849,\n", - " -0.025851024314761162,\n", - " 0.011440824717283249,\n", - " -0.02679155021905899,\n", - " -0.025219528004527092,\n", - " 0.01173641812056303,\n", - " 0.01402727048844099,\n", - " 0.02888757921755314,\n", - " 0.020503463223576546,\n", - " -0.007759337779134512,\n", - " -0.013852601870894432,\n", - " -0.005596128758043051,\n", - " 0.0010958805214613676,\n", - " 
-0.05113773047924042,\n", - " -0.022236717864871025,\n", - " -0.0123679144307971,\n", - " 0.021954558789730072,\n", - " 0.015196209773421288,\n", - " -0.03004308231174946,\n", - " -0.03135981783270836,\n", - " -0.016284532845020294,\n", - " -0.05863506719470024,\n", - " -0.018138712272047997,\n", - " 0.006852402351796627,\n", - " 0.014282556250691414,\n", - " 0.016459202393889427,\n", - " -0.013006128370761871,\n", - " 0.009613517671823502,\n", - " 0.020705003291368484,\n", - " 0.0090760737657547,\n", - " 0.0022656593937426805,\n", - " -0.006879274267703295,\n", - " -0.02109465003013611,\n", - " -0.003799052443355322,\n", - " -0.006419088691473007,\n", - " 0.000651650014333427,\n", - " -0.01878364384174347,\n", - " 0.002342917025089264,\n", - " -0.015572420321404934,\n", - " 0.010453272610902786,\n", - " -0.015962066128849983,\n", - " -0.00675163185223937,\n", - " 0.021229011937975883,\n", - " 0.0007910493877716362,\n", - " -0.004830271936953068,\n", - " -0.015518675558269024,\n", - " 0.007087533827871084,\n", - " 0.013295004144310951,\n", - " 0.025877896696329117,\n", - " 0.007873544469475746,\n", - " -0.027973923832178116,\n", - " -0.0028232568874955177,\n", - " 0.008303498849272728,\n", - " -0.0018491409718990326,\n", - " -0.014712510630488396,\n", - " -0.010056908242404461,\n", - " 0.0013133770553395152,\n", - " 0.0015375918010249734,\n", - " 0.025394197553396225,\n", - " -0.0009573209099471569,\n", - " -0.0033640593755990267,\n", - " -0.011749854311347008,\n", - " -0.01386603806167841,\n", - " 0.02336534857749939,\n", - " 0.010809328407049179,\n", - " 0.010695122182369232,\n", - " 0.0006445121252909303,\n", - " -0.010668249800801277,\n", - " 0.0041886987164616585,\n", - " 0.013630906119942665,\n", - " 0.015397750772535801,\n", - " -0.015505239367485046,\n", - " -0.0025142270606011152,\n", - " 0.02105434238910675,\n", - " -0.005992493126541376,\n", - " 0.019025493413209915,\n", - " -0.04425845667719841,\n", - " 0.007114405743777752,\n", - " 
-0.023754995316267014,\n", - " -0.010473426431417465,\n", - " 0.012764278799295425,\n", - " -0.012018576264381409,\n", - " -0.04624699801206589,\n", - " 0.022586055099964142,\n", - " 0.00040014335536397994,\n", - " -0.009257460944354534,\n", - " -0.006214188411831856,\n", - " 0.011252719908952713,\n", - " -0.011467697098851204,\n", - " -0.00610669981688261,\n", - " -0.02998933754861355,\n", - " 0.017184749245643616,\n", - " -0.026482518762350082,\n", - " -0.02105434238910675,\n", - " -0.006083186715841293,\n", - " -0.00826319120824337,\n", - " -0.001481328159570694,\n", - " -0.010983997955918312,\n", - " 0.03326774016022682,\n", - " 0.0012226835824549198,\n", - " -0.004769809544086456,\n", - " 0.19713421165943146,\n", - " -0.014269120059907436,\n", - " -0.0032179418485611677,\n", - " 0.012105911038815975,\n", - " -0.028511367738246918,\n", - " -0.017131006345152855,\n", - " 0.044177841395139694,\n", - " 0.0017853195313364267,\n", - " -0.0024302515666931868,\n", - " 0.03783601149916649,\n", - " -0.00892155896872282,\n", - " 0.012549301609396935,\n", - " 0.027275249361991882,\n", - " -0.01050029881298542,\n", - " 0.012723970226943493,\n", - " -0.02199486829340458,\n", - " -0.03536377102136612,\n", - " -0.02347283624112606,\n", - " 0.009519464336335659,\n", - " 0.005925312638282776,\n", - " 0.018703026697039604,\n", - " 0.0024755983613431454,\n", - " -0.01386603806167841,\n", - " -0.011481133289635181,\n", - " 0.0009497631108388305,\n", - " 0.000136355243739672,\n", - " -0.007302511017769575,\n", - " -0.00022925317171029747,\n", - " 0.010896663181483746,\n", - " -0.0023731482215225697,\n", - " -0.008934995159506798,\n", - " -0.02426556497812271,\n", - " -0.005135942716151476,\n", - " 0.014040706679224968,\n", - " -4.135794370085932e-05,\n", - " 0.003822565544396639,\n", - " 0.022236717864871025,\n", - " 0.01215293724089861,\n", - " 0.02794705331325531,\n", - " 0.013476391322910786,\n", - " 0.02304288186132908,\n", - " 0.016297968104481697,\n", - " 
0.007846672087907791,\n", - " -0.035229410976171494,\n", - " -0.018716463819146156,\n", - " 0.03547126054763794,\n", - " ...]" - ] + "text/plain": "1536" }, - "execution_count": 97, + "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "truncated = truncate_text_tokens(long_text)\n", - "get_embedding(truncated)" + "len(get_embedding(truncated))" ] }, { @@ -1139,7 +137,7 @@ }, { "cell_type": "code", - "execution_count": 91, + "execution_count": 29, "metadata": {}, "outputs": [], "source": [ From 14262d47e89086647307ca4dd40fa952cf2787cc Mon Sep 17 00:00:00 2001 From: Ted Sanders Date: Thu, 19 Jan 2023 10:04:31 -0800 Subject: [PATCH 7/9] polishes text and re-runs notebook --- examples/Embedding_long_inputs.ipynb | 64 ++++++++++++++++++---------- 1 file changed, 41 insertions(+), 23 deletions(-) diff --git a/examples/Embedding_long_inputs.ipynb b/examples/Embedding_long_inputs.ipynb index a1fdb9a..33be821 100644 --- a/examples/Embedding_long_inputs.ipynb +++ b/examples/Embedding_long_inputs.ipynb @@ -1,28 +1,30 @@ { "cells": [ { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "# Embedding texts that are longer than the model's context length\n", + "# Embedding texts that are longer than the model's maximum context length\n", "\n", - "All models have a maximum context length for the input text they take in. However, this maximum length is defined in terms of _tokens_ instead of string length. If you are unfamiliar with tokenization, you can check out the [\"How to count tokens with tiktoken\"](How_to_count_tokens_with_tiktoken.ipynb) notebook in this same cookbook.\n", + "OpenAI's embedding models cannot embed text that exceeds a maximum length. The maximum length varies by model, and is measured by _tokens_, not string length. 
If you are unfamiliar with tokenization, check out [How to count tokens with tiktoken](How_to_count_tokens_with_tiktoken.ipynb).\n", "\n", - "In this notebook, we will go over how to handle texts that are larger than a model's context length. In these examples, we will focus on embedding texts using the `text-embedding-ada-002`, but similar approaches can also be applied to other models and tasks. To learn about how to embed a text, check out the [Get embeddings](Get_embeddings.ipynb) notebook and the OpenAI [embeddings page](https://beta.openai.com/docs/guides/embeddings).\n" + "This notebook shows how to handle texts that are longer than a model's maximum context length. We'll demonstrate using embeddings from `text-embedding-ada-002`, but the same ideas can be applied to other models and tasks. To learn more about embeddings, check out the [OpenAI Embeddings Guide](https://beta.openai.com/docs/guides/embeddings).\n" ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Model context length\n", "\n", - "First, let us define the model we will be working with and a funciton to get embeddings from the API." + "First, we select the model and define a function to get embeddings from the API." ] }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -49,7 +51,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 2, "metadata": {}, "outputs": [ { @@ -69,24 +71,26 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "Clearly we want to avoid these errors, particularly when handling programatically with a large number of embeddings. Yet, we still might be faced with texts that are longer than the maximum context length. 
Below we will describe and provide recipes for the main approaches to handling these longer texts: (1) simply truncating the text to the maximum allowed length, and (2) chunking the text and embeddings each chunk individually." + "Clearly we want to avoid these errors, particularly when programmatically handling a large number of embeddings. Yet, we still might be faced with texts that are longer than the maximum context length. Below we describe and provide recipes for the main approaches to handling these longer texts: (1) simply truncating the text to the maximum allowed length, and (2) chunking the text and embedding each chunk individually." ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Truncating the input text\n", "\n", - "The simplest solution is to truncate the input text to the maximum allowed length. Since the context length is in terms of tokens, we have to first tokenize the text before truncating it. The API accepts inputs both in the form of text or tokens, thus as long as you are careful that you are using the appropriate encoding, there is no need to convert the tokens back into string form. Below is an example of such a truncation function." + "The simplest solution is to truncate the input text to the maximum allowed length. Because the context length is measured in tokens, we have to first tokenize the text before truncating it. The API accepts inputs in the form of either text or tokens, so as long as you are careful that you are using the appropriate encoding, there is no need to convert the tokens back into string form. Below is an example of such a truncation function." ] }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -99,22 +103,25 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "Our example from before now works." + "Our example from before now works without error." 
] }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 4, "metadata": {}, "outputs": [ { "data": { - "text/plain": "1536" + "text/plain": [ + "1536" + ] }, - "execution_count": 32, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -125,19 +132,20 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Chunking the input text\n", "\n", - "Though the option above works, it has the clear drawback of simply discarding all text after the maximum context is filled. Another possible approach that addresses this issue is to in fact divide the input text into chunks and then embed each chunk individually. We can then either use the chunk embeddings separately, or combine them in some way, such as for example calculating their average (weighted by the size of each chunk).\n", + "Though truncation works, discarding potentially relevant text is a clear drawback. Another approach is to divide the input text into chunks and then embed each chunk individually. Then, we can either use the chunk embeddings separately, or combine them in some way, such as averaging (weighted by the size of each chunk).\n", "\n", - "We will first take a function from [python's own cookbook](https://docs.python.org/3/library/itertools.html#itertools-recipes) that breaks up a sequence into chunks." + "We will take a function from [Python's own cookbook](https://docs.python.org/3/library/itertools.html#itertools-recipes) that breaks up a sequence into chunks." ] }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -154,15 +162,16 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "Now let's define a function that encodes a string into tokens and then breaks it up into chunks." + "Now we define a function that encodes a string into tokens and then breaks it up into chunks." 
] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -182,7 +191,7 @@ }, { "cell_type": "code", - "execution_count": 104, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -200,23 +209,24 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "Once again, we can verify that we can now handle long input texts." + "Once again, we can now handle long input texts." ] }, { "cell_type": "code", - "execution_count": 105, + "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Setting reduce=None gives us 2 embedding vectors.\n", - "Setting reduce='average' gives us 1 embedding vector.\n" + "Setting average=True gives us a single 1536-dimensional embedding vector for our long text.\n", + "Setting average=False gives us 2 embedding vectors, one for each of the chunks.\n" ] } ], @@ -227,6 +237,14 @@ "print(f\"Setting average=True gives us a single {len(average_embedding_vector)}-dimensional embedding vector for our long text.\")\n", "print(f\"Setting average=False gives us {len(chunks_embedding_vectors)} embedding vectors, one for each of the chunks.\")\n" ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In some cases, it may make sense to split chunks on paragraph boundaries or sentence boundaries to help preserve the meaning of the text." 
+ ] } ], "metadata": { From d1165a77574aa61a4fb880ca191057b899915516 Mon Sep 17 00:00:00 2001 From: Ted Sanders Date: Thu, 19 Jan 2023 10:11:34 -0800 Subject: [PATCH 8/9] normalizes averaged embeddings to length 1 --- examples/Embedding_long_inputs.ipynb | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/Embedding_long_inputs.ipynb b/examples/Embedding_long_inputs.ipynb index 33be821..fe6b5fa 100644 --- a/examples/Embedding_long_inputs.ipynb +++ b/examples/Embedding_long_inputs.ipynb @@ -205,6 +205,7 @@ "\n", " if average:\n", " chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=[len(c) for c in chunk_embeddings]).tolist()\n", + " chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings) # normalizes length to 1\n", " return chunk_embeddings" ] }, From 2e59263390e959f639cec7c795824800259e898e Mon Sep 17 00:00:00 2001 From: Filipe de Avila Belbute Peres Date: Thu, 19 Jan 2023 10:27:24 -0800 Subject: [PATCH 9/9] Convert back to list only after all np operations --- examples/Embedding_long_inputs.ipynb | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/examples/Embedding_long_inputs.ipynb b/examples/Embedding_long_inputs.ipynb index fe6b5fa..e4460f4 100644 --- a/examples/Embedding_long_inputs.ipynb +++ b/examples/Embedding_long_inputs.ipynb @@ -204,8 +204,9 @@ " chunk_embeddings.append(get_embedding(chunk, model=model))\n", "\n", " if average:\n", - " chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=[len(c) for c in chunk_embeddings]).tolist()\n", + " chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=[len(c) for c in chunk_embeddings])\n", " chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings) # normalizes length to 1\n", + " chunk_embeddings = chunk_embeddings.tolist()\n", " return chunk_embeddings" ] },