{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Embedding texts that are longer than the model's context length\n", "\n", "Every model has a maximum context length for the input text it takes in. This limit is defined in terms of _tokens_ rather than string length. If you are unfamiliar with tokenization, you can check out the [\"How to count tokens with tiktoken\"](How_to_count_tokens_with_tiktoken.ipynb) notebook in this same cookbook.\n", "\n", "In this notebook, we will go over how to handle texts that are longer than a model's context length. In these examples, we will focus on embedding texts using the `text-embedding-ada-002` model, but similar approaches can also be applied to other models and tasks. To learn how to embed a text, check out the [Get embeddings](Get_embeddings.ipynb) notebook and the OpenAI [embeddings page](https://beta.openai.com/docs/guides/embeddings).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Model context length\n", "\n", "First, let us define the model we will be working with and a function to get embeddings from the API." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import openai\n", "from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_not_exception_type\n", "\n", "\n", "EMBEDDING_MODEL = 'text-embedding-ada-002'\n", "EMBEDDING_CTX_LENGTH = 8191\n", "EMBEDDING_ENCODING = 'cl100k_base'\n", "\n", "# let's make sure to not retry on an invalid request, because that is what we want to demonstrate\n", "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6), retry=retry_if_not_exception_type(openai.InvalidRequestError))\n", "def get_embedding(text_or_tokens, model=EMBEDDING_MODEL):\n", " return openai.Embedding.create(input=text_or_tokens, model=model)[\"data\"][0][\"embedding\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `text-embedding-ada-002` model has a context length of 8191 tokens with the `cl100k_base` encoding, and we can see that going over that limit causes an error." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "ename": "InvalidRequestError", "evalue": "This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). 
Please reduce your prompt; or completion length.", "output_type": "error", "traceback": [ "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m", "\u001B[0;31mInvalidRequestError\u001B[0m Traceback (most recent call last)", "Cell \u001B[0;32mIn [18], line 2\u001B[0m\n\u001B[1;32m 1\u001B[0m long_text \u001B[38;5;241m=\u001B[39m \u001B[38;5;124m'\u001B[39m\u001B[38;5;124mAGI \u001B[39m\u001B[38;5;124m'\u001B[39m \u001B[38;5;241m*\u001B[39m \u001B[38;5;241m5000\u001B[39m\n\u001B[0;32m----> 2\u001B[0m \u001B[43mget_embedding\u001B[49m\u001B[43m(\u001B[49m\u001B[43mlong_text\u001B[49m\u001B[43m)\u001B[49m\n", "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:326\u001B[0m, in \u001B[0;36mBaseRetrying.wraps..wrapped_f\u001B[0;34m(*args, **kw)\u001B[0m\n\u001B[1;32m 324\u001B[0m \u001B[38;5;129m@functools\u001B[39m\u001B[38;5;241m.\u001B[39mwraps(f)\n\u001B[1;32m 325\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mwrapped_f\u001B[39m(\u001B[38;5;241m*\u001B[39margs: t\u001B[38;5;241m.\u001B[39mAny, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkw: t\u001B[38;5;241m.\u001B[39mAny) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m t\u001B[38;5;241m.\u001B[39mAny:\n\u001B[0;32m--> 326\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mf\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkw\u001B[49m\u001B[43m)\u001B[49m\n", "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:406\u001B[0m, in \u001B[0;36mRetrying.__call__\u001B[0;34m(self, fn, *args, **kwargs)\u001B[0m\n\u001B[1;32m 404\u001B[0m retry_state \u001B[38;5;241m=\u001B[39m 
RetryCallState(retry_object\u001B[38;5;241m=\u001B[39m\u001B[38;5;28mself\u001B[39m, fn\u001B[38;5;241m=\u001B[39mfn, args\u001B[38;5;241m=\u001B[39margs, kwargs\u001B[38;5;241m=\u001B[39mkwargs)\n\u001B[1;32m 405\u001B[0m \u001B[38;5;28;01mwhile\u001B[39;00m \u001B[38;5;28;01mTrue\u001B[39;00m:\n\u001B[0;32m--> 406\u001B[0m do \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43miter\u001B[49m\u001B[43m(\u001B[49m\u001B[43mretry_state\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mretry_state\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 407\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(do, DoAttempt):\n\u001B[1;32m 408\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n", "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:351\u001B[0m, in \u001B[0;36mBaseRetrying.iter\u001B[0;34m(self, retry_state)\u001B[0m\n\u001B[1;32m 349\u001B[0m is_explicit_retry \u001B[38;5;241m=\u001B[39m retry_state\u001B[38;5;241m.\u001B[39moutcome\u001B[38;5;241m.\u001B[39mfailed \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(retry_state\u001B[38;5;241m.\u001B[39moutcome\u001B[38;5;241m.\u001B[39mexception(), TryAgain)\n\u001B[1;32m 350\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m (is_explicit_retry \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mretry(retry_state\u001B[38;5;241m=\u001B[39mretry_state)):\n\u001B[0;32m--> 351\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mfut\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mresult\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 353\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mafter \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[1;32m 354\u001B[0m 
\u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mafter(retry_state)\n", "File \u001B[0;32m~/.pyenv/versions/3.9.9/lib/python3.9/concurrent/futures/_base.py:438\u001B[0m, in \u001B[0;36mFuture.result\u001B[0;34m(self, timeout)\u001B[0m\n\u001B[1;32m 436\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m CancelledError()\n\u001B[1;32m 437\u001B[0m \u001B[38;5;28;01melif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_state \u001B[38;5;241m==\u001B[39m FINISHED:\n\u001B[0;32m--> 438\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m__get_result\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 440\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_condition\u001B[38;5;241m.\u001B[39mwait(timeout)\n\u001B[1;32m 442\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_state \u001B[38;5;129;01min\u001B[39;00m [CANCELLED, CANCELLED_AND_NOTIFIED]:\n", "File \u001B[0;32m~/.pyenv/versions/3.9.9/lib/python3.9/concurrent/futures/_base.py:390\u001B[0m, in \u001B[0;36mFuture.__get_result\u001B[0;34m(self)\u001B[0m\n\u001B[1;32m 388\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_exception:\n\u001B[1;32m 389\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m--> 390\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_exception\n\u001B[1;32m 391\u001B[0m \u001B[38;5;28;01mfinally\u001B[39;00m:\n\u001B[1;32m 392\u001B[0m \u001B[38;5;66;03m# Break a reference cycle with the exception in self._exception\u001B[39;00m\n\u001B[1;32m 393\u001B[0m \u001B[38;5;28mself\u001B[39m \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m\n", "File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:409\u001B[0m, in \u001B[0;36mRetrying.__call__\u001B[0;34m(self, fn, *args, 
**kwargs)\u001B[0m\n\u001B[1;32m 407\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(do, DoAttempt):\n\u001B[1;32m 408\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m--> 409\u001B[0m result \u001B[38;5;241m=\u001B[39m \u001B[43mfn\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 410\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mBaseException\u001B[39;00m: \u001B[38;5;66;03m# noqa: B902\u001B[39;00m\n\u001B[1;32m 411\u001B[0m retry_state\u001B[38;5;241m.\u001B[39mset_exception(sys\u001B[38;5;241m.\u001B[39mexc_info())\n", "Cell \u001B[0;32mIn [16], line 12\u001B[0m, in \u001B[0;36mget_embedding\u001B[0;34m(text_or_tokens, model)\u001B[0m\n\u001B[1;32m 10\u001B[0m \u001B[38;5;129m@retry\u001B[39m(wait\u001B[38;5;241m=\u001B[39mwait_random_exponential(\u001B[38;5;28mmin\u001B[39m\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m1\u001B[39m, \u001B[38;5;28mmax\u001B[39m\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m20\u001B[39m), stop\u001B[38;5;241m=\u001B[39mstop_after_attempt(\u001B[38;5;241m6\u001B[39m), retry\u001B[38;5;241m=\u001B[39mretry_if_not_exception_type(openai\u001B[38;5;241m.\u001B[39mInvalidRequestError))\n\u001B[1;32m 11\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mget_embedding\u001B[39m(text_or_tokens, model\u001B[38;5;241m=\u001B[39mEMBEDDING_MODEL):\n\u001B[0;32m---> 12\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mopenai\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mEmbedding\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43minput\u001B[39;49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mtext_or_tokens\u001B[49m\u001B[43m,\u001B[49m\u001B[43m 
\u001B[49m\u001B[43mmodel\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mmodel\u001B[49m\u001B[43m)\u001B[49m[\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mdata\u001B[39m\u001B[38;5;124m\"\u001B[39m][\u001B[38;5;241m0\u001B[39m][\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124membedding\u001B[39m\u001B[38;5;124m\"\u001B[39m]\n", "File \u001B[0;32m~/code/openai-python/openai/api_resources/embedding.py:33\u001B[0m, in \u001B[0;36mEmbedding.create\u001B[0;34m(cls, *args, **kwargs)\u001B[0m\n\u001B[1;32m 31\u001B[0m \u001B[38;5;28;01mwhile\u001B[39;00m \u001B[38;5;28;01mTrue\u001B[39;00m:\n\u001B[1;32m 32\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m---> 33\u001B[0m response \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43msuper\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 35\u001B[0m \u001B[38;5;66;03m# If a user specifies base64, we'll just return the encoded string.\u001B[39;00m\n\u001B[1;32m 36\u001B[0m \u001B[38;5;66;03m# This is only for the default case.\u001B[39;00m\n\u001B[1;32m 37\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m user_provided_encoding_format:\n", "File \u001B[0;32m~/code/openai-python/openai/api_resources/abstract/engine_api_resource.py:153\u001B[0m, in \u001B[0;36mEngineAPIResource.create\u001B[0;34m(cls, api_key, api_base, api_type, request_id, api_version, organization, **params)\u001B[0m\n\u001B[1;32m 127\u001B[0m \u001B[38;5;129m@classmethod\u001B[39m\n\u001B[1;32m 128\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mcreate\u001B[39m(\n\u001B[1;32m 129\u001B[0m \u001B[38;5;28mcls\u001B[39m,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 136\u001B[0m 
\u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams,\n\u001B[1;32m 137\u001B[0m ):\n\u001B[1;32m 138\u001B[0m (\n\u001B[1;32m 139\u001B[0m deployment_id,\n\u001B[1;32m 140\u001B[0m engine,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 150\u001B[0m api_key, api_base, api_type, api_version, organization, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams\n\u001B[1;32m 151\u001B[0m )\n\u001B[0;32m--> 153\u001B[0m response, _, api_key \u001B[38;5;241m=\u001B[39m \u001B[43mrequestor\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mrequest\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 154\u001B[0m \u001B[43m \u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mpost\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\n\u001B[1;32m 155\u001B[0m \u001B[43m \u001B[49m\u001B[43murl\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 156\u001B[0m \u001B[43m \u001B[49m\u001B[43mparams\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mparams\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 157\u001B[0m \u001B[43m \u001B[49m\u001B[43mheaders\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 158\u001B[0m \u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mstream\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 159\u001B[0m \u001B[43m \u001B[49m\u001B[43mrequest_id\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_id\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 160\u001B[0m \u001B[43m \u001B[49m\u001B[43mrequest_timeout\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_timeout\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 161\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 163\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m stream:\n\u001B[1;32m 164\u001B[0m \u001B[38;5;66;03m# must be an iterator\u001B[39;00m\n\u001B[1;32m 165\u001B[0m \u001B[38;5;28;01massert\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m 
\u001B[38;5;28misinstance\u001B[39m(response, OpenAIResponse)\n", "File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:227\u001B[0m, in \u001B[0;36mAPIRequestor.request\u001B[0;34m(self, method, url, params, headers, files, stream, request_id, request_timeout)\u001B[0m\n\u001B[1;32m 206\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mrequest\u001B[39m(\n\u001B[1;32m 207\u001B[0m \u001B[38;5;28mself\u001B[39m,\n\u001B[1;32m 208\u001B[0m method,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 215\u001B[0m request_timeout: Optional[Union[\u001B[38;5;28mfloat\u001B[39m, Tuple[\u001B[38;5;28mfloat\u001B[39m, \u001B[38;5;28mfloat\u001B[39m]]] \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m,\n\u001B[1;32m 216\u001B[0m ) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m Tuple[Union[OpenAIResponse, Iterator[OpenAIResponse]], \u001B[38;5;28mbool\u001B[39m, \u001B[38;5;28mstr\u001B[39m]:\n\u001B[1;32m 217\u001B[0m result \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mrequest_raw(\n\u001B[1;32m 218\u001B[0m method\u001B[38;5;241m.\u001B[39mlower(),\n\u001B[1;32m 219\u001B[0m url,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 225\u001B[0m request_timeout\u001B[38;5;241m=\u001B[39mrequest_timeout,\n\u001B[1;32m 226\u001B[0m )\n\u001B[0;32m--> 227\u001B[0m resp, got_stream \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_interpret_response\u001B[49m\u001B[43m(\u001B[49m\u001B[43mresult\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 228\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m resp, got_stream, \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mapi_key\n", "File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:620\u001B[0m, in \u001B[0;36mAPIRequestor._interpret_response\u001B[0;34m(self, result, stream)\u001B[0m\n\u001B[1;32m 612\u001B[0m 
\u001B[38;5;28;01mreturn\u001B[39;00m (\n\u001B[1;32m 613\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_interpret_response_line(\n\u001B[1;32m 614\u001B[0m line, result\u001B[38;5;241m.\u001B[39mstatus_code, result\u001B[38;5;241m.\u001B[39mheaders, stream\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mTrue\u001B[39;00m\n\u001B[1;32m 615\u001B[0m )\n\u001B[1;32m 616\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m line \u001B[38;5;129;01min\u001B[39;00m parse_stream(result\u001B[38;5;241m.\u001B[39miter_lines())\n\u001B[1;32m 617\u001B[0m ), \u001B[38;5;28;01mTrue\u001B[39;00m\n\u001B[1;32m 618\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m 619\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m (\n\u001B[0;32m--> 620\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_interpret_response_line\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 621\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcontent\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdecode\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mutf-8\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 622\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mstatus_code\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 623\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 624\u001B[0m \u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43;01mFalse\u001B[39;49;00m\u001B[43m,\u001B[49m\n\u001B[1;32m 625\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m,\n\u001B[1;32m 626\u001B[0m \u001B[38;5;28;01mFalse\u001B[39;00m,\n\u001B[1;32m 627\u001B[0m )\n", "File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:680\u001B[0m, in 
\u001B[0;36mAPIRequestor._interpret_response_line\u001B[0;34m(self, rbody, rcode, rheaders, stream)\u001B[0m\n\u001B[1;32m 678\u001B[0m stream_error \u001B[38;5;241m=\u001B[39m stream \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124merror\u001B[39m\u001B[38;5;124m\"\u001B[39m \u001B[38;5;129;01min\u001B[39;00m resp\u001B[38;5;241m.\u001B[39mdata\n\u001B[1;32m 679\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m stream_error \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;241m200\u001B[39m \u001B[38;5;241m<\u001B[39m\u001B[38;5;241m=\u001B[39m rcode \u001B[38;5;241m<\u001B[39m \u001B[38;5;241m300\u001B[39m:\n\u001B[0;32m--> 680\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mhandle_error_response(\n\u001B[1;32m 681\u001B[0m rbody, rcode, resp\u001B[38;5;241m.\u001B[39mdata, rheaders, stream_error\u001B[38;5;241m=\u001B[39mstream_error\n\u001B[1;32m 682\u001B[0m )\n\u001B[1;32m 683\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m resp\n", "\u001B[0;31mInvalidRequestError\u001B[0m: This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length." ] } ], "source": [ "long_text = 'AGI ' * 5000\n", "get_embedding(long_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clearly we want to avoid these errors, particularly when programmatically handling a large number of embeddings. Yet we might still be faced with texts that are longer than the maximum context length. Below we describe and provide recipes for the two main approaches to handling these longer texts: (1) simply truncating the text to the maximum allowed length, and (2) chunking the text and embedding each chunk individually." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Truncating the input text\n", "\n", "The simplest solution is to truncate the input text to the maximum allowed length. Since the context length is measured in tokens, we first have to tokenize the text before truncating it. The API accepts input either as text or as tokens, so as long as you are careful to use the appropriate encoding, there is no need to convert the tokens back into string form. Below is an example of such a truncation function." ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "import tiktoken\n", "\n", "def truncate_text_tokens(text, encoding_name=EMBEDDING_ENCODING, max_tokens=EMBEDDING_CTX_LENGTH):\n", " \"\"\"Truncate a string to have `max_tokens` according to the given encoding.\"\"\"\n", " encoding = tiktoken.get_encoding(encoding_name)\n", " return encoding.encode(text)[:max_tokens]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our example from before now works." ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[-0.015384314581751823,\n", " 0.0031692360062152147,\n", " -0.007302511017769575,\n", " -0.02778581902384758,\n", " -0.013409210368990898,\n", " 0.0029592972714453936,\n", " -0.019119545817375183,\n", " -0.0004874778969679028,\n", " -0.010721994563937187,\n", " -0.023486273363232613,\n", " 0.016351712867617607,\n", " 0.005532307084649801,\n", " -0.009136536158621311,\n", " -0.014282556250691414,\n", " 0.005122506525367498,\n", " 0.02888757921755314,\n", " 0.020973725244402885,\n", " 0.009136536158621311,\n", " 0.003303596982732415,\n", " -0.013382338918745518,\n", " -0.024749264121055603,\n", " 0.03904525563120842,\n", " -0.01699664443731308,\n", " -0.010312194004654884,\n", " -0.009029047563672066,\n", " -0.001587137347087264,\n", " 0.017036953940987587,\n", " -0.056915245950222015,\n", " -0.011084768921136856,\n", " -0.006375421304255724,\n", " 0.011145230382680893,\n", " 
-0.01094368938356638,\n", " -0.010184550657868385,\n", " -0.009546336717903614,\n", " -0.012105911038815975,\n", " -0.004675756674259901,\n", " 0.002245505340397358,\n", " -0.0015040015568956733,\n", " -0.007457026280462742,\n", " 0.0029206685721874237,\n", " 0.03993203863501549,\n", " -0.02390279248356819,\n", " 0.003399329027161002,\n", " -0.02109465003013611,\n", " -0.026590008288621902,\n", " 0.004457420669496059,\n", " -0.03638491407036781,\n", " -0.018958313390612602,\n", " 0.002221992239356041,\n", " -0.007846672087907791,\n", " -0.0106548136100173,\n", " 0.0019096032483503222,\n", " -0.015451495535671711,\n", " -0.00783995445817709,\n", " 0.016821976751089096,\n", " 0.007409999612718821,\n", " -0.017601268365979195,\n", " 0.01502154115587473,\n", " -0.026119744405150414,\n", " -0.011333336122334003,\n", " -0.017184749245643616,\n", " -0.0352562814950943,\n", " -0.002327801426872611,\n", " 0.015666473656892776,\n", " -0.023069754242897034,\n", " -0.016821976751089096,\n", " -0.0005298855248838663,\n", " 0.0010933612938970327,\n", " 0.0048571438528597355,\n", " -0.034503862261772156,\n", " 0.007712311577051878,\n", " 0.038024116307497025,\n", " -0.017856555059552193,\n", " -0.02415807731449604,\n", " 0.020664695650339127,\n", " -0.01742659881711006,\n", " 0.012072320096194744,\n", " 0.015249954536557198,\n", " -0.008357243612408638,\n", " 0.001610650448128581,\n", " 0.018017787486314774,\n", " -0.02247856743633747,\n", " -3.219936115783639e-05,\n", " 0.02421182207763195,\n", " 0.010594351217150688,\n", " 0.01800435036420822,\n", " -0.019777914509177208,\n", " 0.024695521220564842,\n", " 0.0013805575435981154,\n", " -0.0138122932985425,\n", " 0.02132306434214115,\n", " 0.023325040936470032,\n", " 0.027597714215517044,\n", " 0.06062360480427742,\n", " -0.019562937319278717,\n", " 0.009559772908687592,\n", " -0.02183363400399685,\n", " 0.0173728559166193,\n", " -0.028242645785212517,\n", " -0.03058052435517311,\n", " 0.01847461424767971,\n", " 
-0.026536263525485992,\n", " -0.007947443053126335,\n", " -0.007517488207668066,\n", " -0.026616880670189857,\n", " 0.009183562360703945,\n", " 0.01872989907860756,\n", " -0.022075483575463295,\n", " 0.019589809700846672,\n", " -0.023916227743029594,\n", " 0.019347960129380226,\n", " 0.02378186769783497,\n", " 0.019764477387070656,\n", " -0.0202616136521101,\n", " -0.019401703029870987,\n", " 0.006335113197565079,\n", " 0.015209645964205265,\n", " -0.029935592785477638,\n", " -0.007013635244220495,\n", " -0.0363042950630188,\n", " 0.00704050762578845,\n", " 0.01616360805928707,\n", " 0.014981232583522797,\n", " -0.0013931537978351116,\n", " 0.030661141499876976,\n", " 0.01389290951192379,\n", " 0.007712311577051878,\n", " -0.01910611055791378,\n", " -0.0020792337600141764,\n", " -0.008404269814491272,\n", " 0.024722393602132797,\n", " 0.01699664443731308,\n", " 0.008350525982677937,\n", " 0.009727723896503448,\n", " -0.010695122182369232,\n", " 0.006560167297720909,\n", " -0.031386688351631165,\n", " 0.0263078510761261,\n", " -0.0001876852911664173,\n", " -0.01816558465361595,\n", " 0.019482320174574852,\n", " 0.023190679028630257,\n", " -0.015115593560039997,\n", " -0.015384314581751823,\n", " -0.005233354400843382,\n", " 0.004225648008286953,\n", " -0.0011555030941963196,\n", " -0.012092474848031998,\n", " 0.011602058075368404,\n", " -0.02179332636296749,\n", " -0.003029836807399988,\n", " 0.0030382343102246523,\n", " -0.011151948943734169,\n", " 0.007430153898894787,\n", " 0.001625766046345234,\n", " 0.010795892216265202,\n", " 0.0033136738929897547,\n", " 0.013167361728847027,\n", " -0.027033399790525436,\n", " 0.002052361611276865,\n", " 0.015061848796904087,\n", " 0.017762500792741776,\n", " 0.014349736273288727,\n", " -0.007047225721180439,\n", " 0.014887180179357529,\n", " 0.023190679028630257,\n", " 0.0055289482697844505,\n", " 0.018084967508912086,\n", " -0.0014888859586790204,\n", " -0.003711717901751399,\n", " 0.008290063589811325,\n", " 
0.03740605339407921,\n", " 0.007960879243910313,\n", " 0.01809840463101864,\n", " 0.010916817001998425,\n", " 0.03504130616784096,\n", " 0.0031138123013079166,\n", " -0.005303893703967333,\n", " -0.022868212312459946,\n", " -0.01373839471489191,\n", " -0.013933218084275723,\n", " 0.008525194600224495,\n", " 0.05304565653204918,\n", " 0.014537842012941837,\n", " 0.006230983417481184,\n", " -0.004920965526252985,\n", " 0.002856847131624818,\n", " -0.015868013724684715,\n", " 0.006835607346147299,\n", " -0.027449917048215866,\n", " -0.0049041700549423695,\n", " 0.009808340109884739,\n", " 0.028914449736475945,\n", " -0.017386291176080704,\n", " -0.6199946403503418,\n", " -0.02336534857749939,\n", " -0.018353689461946487,\n", " -0.0028131799772381783,\n", " 0.019804786890745163,\n", " 0.04409722611308098,\n", " 0.005280380602926016,\n", " 0.011702828109264374,\n", " -0.024829881265759468,\n", " 0.01465876679867506,\n", " -0.013395775109529495,\n", " 0.025877896696329117,\n", " -0.01636514998972416,\n", " 0.01199842244386673,\n", " -0.01084291934967041,\n", " -0.008827506564557552,\n", " 0.00870658177882433,\n", " -0.020086944103240967,\n", " -0.0006025243201293051,\n", " 0.027812691405415535,\n", " -0.03404703363776207,\n", " 0.0019079238409176469,\n", " 0.0024403284769505262,\n", " -0.006099981721490622,\n", " 0.009082792326807976,\n", " -0.0050519672222435474,\n", " 0.014309428632259369,\n", " -0.022921957075595856,\n", " -0.02199486829340458,\n", " 0.003607588354498148,\n", " -0.008518476970493793,\n", " 0.00871329940855503,\n", " 0.02747678942978382,\n", " -0.020906545221805573,\n", " 0.04253863915801048,\n", " 0.000455147324828431,\n", " 0.014484097249805927,\n", " 0.033079635351896286,\n", " 0.026590008288621902,\n", " 0.05258882790803909,\n", " -0.025971949100494385,\n", " 0.010493581183254719,\n", " 0.026455648243427277,\n", " -0.008511758409440517,\n", " -0.019025493413209915,\n", " 0.020019764080643654,\n", " 0.01937483251094818,\n", " 
-0.013415928930044174,\n", " -0.0027863075956702232,\n", " -0.007860108278691769,\n", " 0.011662520468235016,\n", " -0.007255484815686941,\n", " 0.0033707772381603718,\n", " -0.01479312777519226,\n", " 0.009358231909573078,\n", " 0.007100970018655062,\n", " 0.02388935536146164,\n", " -0.017171313986182213,\n", " -0.008726735599339008,\n", " 0.010977279394865036,\n", " -0.003943490330129862,\n", " 0.004695910960435867,\n", " -0.003323751036077738,\n", " 0.011642365716397762,\n", " -0.014510969631373882,\n", " 0.0063888574950397015,\n", " -0.006832248065620661,\n", " 0.01937483251094818,\n", " 0.0011857342906296253,\n", " 0.006308240815997124,\n", " -0.0029643357265740633,\n", " 0.012569455429911613,\n", " -0.013677932322025299,\n", " 0.01015767827630043,\n", " -0.002561253262683749,\n", " -0.0055994875729084015,\n", " 0.024144640192389488,\n", " 0.0076988753862679005,\n", " -0.01128630992025137,\n", " -0.022277025505900383,\n", " 0.013422646559774876,\n", " 0.00892155896872282,\n", " 0.0036613326519727707,\n", " -0.009438848122954369,\n", " 0.04151749610900879,\n", " -0.005727130454033613,\n", " -0.00863268319517374,\n", " -0.012804587371647358,\n", " 0.011138512752950191,\n", " -0.003283442696556449,\n", " -0.00783995445817709,\n", " 0.028538240119814873,\n", " 0.00030609077657572925,\n", " 0.006113417912274599,\n", " 0.0205303356051445,\n", " 0.0037721802946180105,\n", " -0.02425212971866131,\n", " 0.013771984726190567,\n", " 0.0034833045210689306,\n", " -0.01748034358024597,\n", " -0.0062444196082651615,\n", " -0.005653231870383024,\n", " 0.011037741787731647,\n", " 0.02684529311954975,\n", " -0.023822175338864326,\n", " 0.041598111391067505,\n", " -0.02915629930794239,\n", " -0.009895674884319305,\n", " 0.03240783140063286,\n", " -0.022639799863100052,\n", " 0.01879708096385002,\n", " -0.03727169334888458,\n", " -0.02415807731449604,\n", " -0.02132306434214115,\n", " 0.014940924011170864,\n", " -0.03536377102136612,\n", " 0.012925512157380581,\n", " 
0.012421658262610435,\n", " 0.017117569223046303,\n", " -0.01281130500137806,\n", " -0.014269120059907436,\n", " -0.010144243016839027,\n", " 0.0049075293354690075,\n", " -0.01338905654847622,\n", " 0.00038628740003332496,\n", " -7.085434481268749e-05,\n", " -0.00503853103145957,\n", " -0.024507414549589157,\n", " -0.022653235122561455,\n", " 0.02374155819416046,\n", " 0.03141356259584427,\n", " -0.003390931524336338,\n", " 0.015653036534786224,\n", " -0.024386489763855934,\n", " 0.05546415224671364,\n", " 0.015438059344887733,\n", " 0.02504485845565796,\n", " -0.001958309207111597,\n", " 0.013684650883078575,\n", " -0.01979134976863861,\n", " -0.022706979885697365,\n", " 0.013825729489326477,\n", " 0.008753607980906963,\n", " -0.014537842012941837,\n", " 0.01672792248427868,\n", " -0.01663387008011341,\n", " -0.014121322892606258,\n", " -0.015451495535671711,\n", " -0.005186328198760748,\n", " 0.03512192144989967,\n", " -0.008746890351176262,\n", " 0.029693743214011192,\n", " -0.016486072912812233,\n", " 0.026482518762350082,\n", " -0.023969972506165504,\n", " -0.037916626781225204,\n", " -0.017722193151712418,\n", " -0.0202616136521101,\n", " 0.023298168554902077,\n", " 0.0012461966834962368,\n", " 0.024695521220564842,\n", " 0.021658966317772865,\n", " -0.010936971753835678,\n", " 0.002561253262683749,\n", " -0.005156097002327442,\n", " -0.01057419739663601,\n", " 0.009754596278071404,\n", " 0.03125232830643654,\n", " -0.021282754838466644,\n", " -0.031924132257699966,\n", " 0.016647307202219963,\n", " 0.013355466537177563,\n", " -0.00647283298894763,\n", " 0.019334523007273674,\n", " 0.012032012455165386,\n", " 0.02778581902384758,\n", " -0.008001187816262245,\n", " 0.011071332730352879,\n", " 0.0004958754288963974,\n", " 0.006368703208863735,\n", " -0.002013732912018895,\n", " -0.011951396241784096,\n", " -0.02372812293469906,\n", " 0.0049075293354690075,\n", " 0.032810915261507034,\n", " -0.010963844135403633,\n", " -0.013986961916089058,\n", " 
0.041974324733018875,\n", " -0.018541794270277023,\n", " 0.022586055099964142,\n", " -0.003758744103834033,\n", " 0.020355666056275368,\n", " 0.022129228338599205,\n", " 0.004353290889412165,\n", " -0.03531002625823021,\n", " 0.0034564323723316193,\n", " 0.00438352208584547,\n", " -0.0040207477286458015,\n", " -0.002181683899834752,\n", " 0.005488639697432518,\n", " 0.013227824121713638,\n", " -0.002729204250499606,\n", " 0.02899506688117981,\n", " -0.027328992262482643,\n", " 0.019536064937710762,\n", " -0.01735941879451275,\n", " 0.007235330529510975,\n", " -0.02126931957900524,\n", " 0.026388466358184814,\n", " 0.016284532845020294,\n", " 0.0048806569539010525,\n", " -0.018971748650074005,\n", " -0.008384115993976593,\n", " 0.00697332713752985,\n", " 0.023647505789995193,\n", " 0.011843906715512276,\n", " -0.004353290889412165,\n", " 0.013059872202575207,\n", " -0.014846871607005596,\n", " -0.0016291250940412283,\n", " 0.010896663181483746,\n", " -0.003557202871888876,\n", " 0.0031524410005658865,\n", " -0.0016249263426288962,\n", " -0.03509504720568657,\n", " 0.005639795679599047,\n", " 0.00781980063766241,\n", " 0.007786210160702467,\n", " 0.007907134480774403,\n", " -0.01518277358263731,\n", " 0.0005798509810119867,\n", " 0.006738195661455393,\n", " 0.014578149653971195,\n", " 0.023553453385829926,\n", " 0.013395775109529495,\n", " 0.015706781297922134,\n", " 0.019119545817375183,\n", " -0.006818812340497971,\n", " 0.01567990891635418,\n", " -0.0037251540925353765,\n", " 0.003856155788525939,\n", " 0.009425411932170391,\n", " 0.012052166275680065,\n", " -0.030392419546842575,\n", " 0.03697609901428223,\n", " -0.0009875521063804626,\n", " 0.05073465034365654,\n", " -0.012119347229599953,\n", " -0.03520253673195839,\n", " 0.027758946642279625,\n", " -0.012240272015333176,\n", " 0.018246199935674667,\n", " 0.0012361196568235755,\n", " -0.002601561602205038,\n", " 0.02625410631299019,\n", " -0.03541751578450203,\n", " 0.0018071531085297465,\n", " 
0.010795892216265202,\n", " 0.002294211182743311,\n", " 0.023486273363232613,\n", " 0.02646908350288868,\n", " 0.006677733268588781,\n", " 0.0025394195690751076,\n", " 0.010218140669167042,\n", " 0.013200951740145683,\n", " 0.02325785905122757,\n", " 0.020328793674707413,\n", " -0.004642166662961245,\n", " -0.015558984130620956,\n", " 0.008323653601109982,\n", " -0.003956926520913839,\n", " -0.015196209773421288,\n", " -0.006422447506338358,\n", " -0.010513735003769398,\n", " 0.009794904850423336,\n", " -0.011434106156229973,\n", " -0.018555231392383575,\n", " -0.014833435416221619,\n", " 0.008934995159506798,\n", " -0.00547184469178319,\n", " -0.0012193245347589254,\n", " -0.01504841260612011,\n", " 0.022693544626235962,\n", " 0.02642877586185932,\n", " 0.001154663390479982,\n", " -0.01336890272796154,\n", " -0.004890734329819679,\n", " 0.025071730837225914,\n", " -0.005928671453148127,\n", " 0.004323059692978859,\n", " -0.012018576264381409,\n", " 0.030822373926639557,\n", " 0.007517488207668066,\n", " 0.011675955727696419,\n", " 0.004951196722686291,\n", " 0.015827706083655357,\n", " 0.014094451442360878,\n", " -0.039018385112285614,\n", " 0.003486663568764925,\n", " 0.0029206685721874237,\n", " 0.02284134179353714,\n", " 0.00033863127464428544,\n", " -0.014403481036424637,\n", " -0.001502322033047676,\n", " 0.0244671069085598,\n", " 0.0013881153427064419,\n", " 0.0076316953636705875,\n", " -0.009700851514935493,\n", " -0.02435961924493313,\n", " -0.045494578778743744,\n", " 0.010124088265001774,\n", " 0.0009220512001775205,\n", " -0.01584114134311676,\n", " -0.005663308780640364,\n", " -0.012549301609396935,\n", " -0.0027980643790215254,\n", " -0.017292238771915436,\n", " -0.013241259381175041,\n", " 0.005199763923883438,\n", " -0.0024604827631264925,\n", " 0.011541595682501793,\n", " -0.005219918210059404,\n", " -0.01762814074754715,\n", " -0.0006478711147792637,\n", " 0.09980322420597076,\n", " 0.02805454097688198,\n", " 0.004746296443045139,\n", " 
0.021699273958802223,\n", " 0.006892710458487272,\n", " 0.011971550062298775,\n", " -0.007779492065310478,\n", " -0.010426400229334831,\n", " 0.009768032468855381,\n", " -0.023029446601867676,\n", " 0.01742659881711006,\n", " 0.0013268132461234927,\n", " 0.002284134039655328,\n", " 0.007766055874526501,\n", " 0.018461177125573158,\n", " -0.016244225203990936,\n", " -0.021336499601602554,\n", " -0.0093447957187891,\n", " 0.004245802294462919,\n", " -0.004783245734870434,\n", " -0.009808340109884739,\n", " 0.014631894417107105,\n", " -0.02362063340842724,\n", " 0.013651059940457344,\n", " 0.021954558789730072,\n", " -0.02114839479327202,\n", " 0.0031591590959578753,\n", " 0.003164197551086545,\n", " -0.010769020766019821,\n", " -0.006855761166661978,\n", " -0.016969772055745125,\n", " -0.00590515835210681,\n", " -0.015411186963319778,\n", " 0.0001366701617371291,\n", " -0.015196209773421288,\n", " 0.0011731380363926291,\n", " 0.009855367243289948,\n", " -0.018071532249450684,\n", " 0.03936772421002388,\n", " -0.027342429384589195,\n", " 0.029451893642544746,\n", " 0.0027476788964122534,\n", " -0.009828494861721992,\n", " -0.02257261984050274,\n", " 0.01479312777519226,\n", " -0.026119744405150414,\n", " -0.01007706206291914,\n", " 0.009559772908687592,\n", " -0.014752819202840328,\n", " -0.03135981783270836,\n", " 0.014322864823043346,\n", " -0.0008481527329422534,\n", " -0.01502154115587473,\n", " 0.004148390609771013,\n", " 0.010856354609131813,\n", " 0.013919781893491745,\n", " -0.03135981783270836,\n", " -0.010688403621315956,\n", " 0.008827506564557552,\n", " -0.017251931130886078,\n", " -0.009700851514935493,\n", " -0.012925512157380581,\n", " 0.010466708801686764,\n", " -0.019831659272313118,\n", " 0.009721006266772747,\n", " -0.028403880074620247,\n", " -0.0027325632981956005,\n", " 0.00016994545876514167,\n", " 0.0021850429475307465,\n", " 0.004924324341118336,\n", " 0.02958625555038452,\n", " 0.008216165006160736,\n", " -0.01915985345840454,\n", " 
-0.0005004940903745592,\n", " -0.004598499275743961,\n", " 0.02642877586185932,\n", " 0.011689391918480396,\n", " -0.0024201744236052036,\n", " 0.008384115993976593,\n", " 0.02309662662446499,\n", " 0.023378783836960793,\n", " -0.040657587349414825,\n", " -0.015908321365714073,\n", " -0.0019129622960463166,\n", " -0.019966019317507744,\n", " 0.009156690910458565,\n", " -0.01279115118086338,\n", " -0.0228950846940279,\n", " 0.002087631495669484,\n", " -0.0008225401979871094,\n", " 0.013281567953526974,\n", " -0.0075779506005346775,\n", " 0.00553902518004179,\n", " -0.019038928672671318,\n", " -0.02327129617333412,\n", " 0.002831654390320182,\n", " 0.023177243769168854,\n", " -0.02531358040869236,\n", " 0.0001918840571306646,\n", " -0.004662320949137211,\n", " 0.0281620305031538,\n", " -0.00968741625547409,\n", " 0.0023210833314806223,\n", " -0.011561749503016472,\n", " -0.0007293273811228573,\n", " 0.018770208582282066,\n", " 0.005458408500999212,\n", " 0.0038628738839179277,\n", " 0.00013467573444359004,\n", " -0.0010253411019220948,\n", " 0.014161631464958191,\n", " 0.003374136285856366,\n", " 0.012065602466464043,\n", " -0.013617469929158688,\n", " -0.0018323458498343825,\n", " 0.005045249126851559,\n", " 0.0011084768921136856,\n", " -0.006785221863538027,\n", " 0.009069356136023998,\n", " -0.0005160295404493809,\n", " -0.012636636383831501,\n", " -0.00517289200797677,\n", " 0.022357642650604248,\n", " -0.006953172851353884,\n", " -0.029666870832443237,\n", " 0.0021296192426234484,\n", " -0.0006881793960928917,\n", " -0.002552855759859085,\n", " 0.0004912567674182355,\n", " -0.004255879204720259,\n", " -0.0008040656102821231,\n", " 0.018810516223311424,\n", " -0.015196209773421288,\n", " -0.015962066128849983,\n", " -0.008565503172576427,\n", " 0.0014930847100913525,\n", " -0.023338476195931435,\n", " 0.012092474848031998,\n", " -0.025743534788489342,\n", " -0.010957125574350357,\n", " -0.033079635351896286,\n", " -0.00035395679879002273,\n", " 
0.022760724648833275,\n", " -0.003933413419872522,\n", " -0.03563249111175537,\n", " -0.04122190177440643,\n", " 0.0057943109422922134,\n", " 0.017077261582016945,\n", " -0.008478168398141861,\n", " 0.023862482979893684,\n", " -0.004820194561034441,\n", " -0.004339854698628187,\n", " -0.007947443053126335,\n", " -0.01799091510474682,\n", " -0.01946888491511345,\n", " -0.027812691405415535,\n", " -0.019656989723443985,\n", " 0.01883738860487938,\n", " 0.02663031592965126,\n", " 0.028484495356678963,\n", " 0.02800079621374607,\n", " 0.020328793674707413,\n", " 0.02125588245689869,\n", " -5.395427069743164e-05,\n", " 0.02556886523962021,\n", " -0.012052166275680065,\n", " 0.03243470564484596,\n", " 0.004974709823727608,\n", " -0.0105204526335001,\n", " 0.011897651478648186,\n", " 0.013335312716662884,\n", " -0.013825729489326477,\n", " 0.014363172464072704,\n", " -0.01811183989048004,\n", " -0.0024403284769505262,\n", " 0.03727169334888458,\n", " -0.012092474848031998,\n", " -0.016284532845020294,\n", " -0.008148984052240849,\n", " -0.031091095879673958,\n", " -0.01621735282242298,\n", " 0.006654220167547464,\n", " 0.020879672840237617,\n", " 0.005364356096833944,\n", " -0.03436949849128723,\n", " -0.025931639596819878,\n", " 0.013590597547590733,\n", " 0.008639401756227016,\n", " 0.017278803512454033,\n", " -0.012992692179977894,\n", " 0.021766453981399536,\n", " -0.003293519839644432,\n", " 0.013919781893491745,\n", " 0.009438848122954369,\n", " 0.015505239367485046,\n", " -0.02374155819416046,\n", " -0.010863073170185089,\n", " -0.0030936580151319504,\n", " 0.0107891745865345,\n", " 0.017668448388576508,\n", " 0.005965620744973421,\n", " 0.005928671453148127,\n", " 0.002952579176053405,\n", " -0.016714487224817276,\n", " -0.017036953940987587,\n", " 0.024964241310954094,\n", " -0.01173641812056303,\n", " -0.003752026241272688,\n", " 0.01094368938356638,\n", " -0.022747289389371872,\n", " 0.00047992009785957634,\n", " -0.01778937317430973,\n", " 
-0.05425490438938141,\n", " -0.01731911115348339,\n", " 0.020812492817640305,\n", " -0.0032431345898658037,\n", " -0.0292100440710783,\n", " -0.004279392305761576,\n", " 0.012482120655477047,\n", " -0.03541751578450203,\n", " 0.002704011742025614,\n", " -0.007759337779134512,\n", " 0.02304288186132908,\n", " 0.012199963442981243,\n", " 0.028538240119814873,\n", " 0.014860307797789574,\n", " -0.012307452037930489,\n", " -0.01936139538884163,\n", " -0.0033607003279030323,\n", " -0.004014029633253813,\n", " -0.007638412993401289,\n", " -0.010271885432302952,\n", " 0.008021341636776924,\n", " 0.0010925214737653732,\n", " -0.0373791828751564,\n", " 0.0024923933669924736,\n", " 0.008021341636776924,\n", " -0.00739656388759613,\n", " -0.02410433255136013,\n", " 0.025246400386095047,\n", " 0.005374433007091284,\n", " 0.010762302204966545,\n", " -0.006627347785979509,\n", " -0.015142465941607952,\n", " -0.050439056009054184,\n", " 0.04108754172921181,\n", " 0.03869592025876045,\n", " 0.004007311537861824,\n", " 0.003866232931613922,\n", " 0.004413753282278776,\n", " 0.015129029750823975,\n", " 0.023298168554902077,\n", " -0.024064024910330772,\n", " 0.0011177140986546874,\n", " -0.009270897135138512,\n", " 0.0016266057500615716,\n", " 0.017386291176080704,\n", " -0.013745113275945187,\n", " 0.01694290153682232,\n", " 0.003973721526563168,\n", " -0.012011858634650707,\n", " -7.295373507076874e-05,\n", " -0.016324840486049652,\n", " 0.011555030941963196,\n", " 0.014725946821272373,\n", " 0.003930054139345884,\n", " -0.012253707274794579,\n", " -0.01537087932229042,\n", " 0.0050519672222435474,\n", " 0.016136735677719116,\n", " -0.04573642462491989,\n", " -0.009647107683122158,\n", " -0.014201940037310123,\n", " -0.006543372292071581,\n", " 0.017655013129115105,\n", " 0.0035168947651982307,\n", " -0.00868642795830965,\n", " 0.011165385134518147,\n", " -0.023768430575728416,\n", " -0.011763290502130985,\n", " 0.03350958973169327,\n", " 0.003799052443355322,\n", " 
0.0060966224409639835,\n", " 0.0007314267568290234,\n", " -0.004679115954786539,\n", " -0.003631101455539465,\n", " 0.007705593481659889,\n", " 0.010433118790388107,\n", " 0.029021939262747765,\n", " -0.008390833623707294,\n", " -0.023929663002490997,\n", " -0.010963844135403633,\n", " 0.00109504081774503,\n", " -0.0034161240328103304,\n", " 0.009304487146437168,\n", " -0.014282556250691414,\n", " -0.00626121461391449,\n", " 0.03221972659230232,\n", " -0.04299546405673027,\n", " 0.007121123839169741,\n", " 0.014551278203725815,\n", " -0.012206681072711945,\n", " -0.008169138804078102,\n", " 0.001264671329408884,\n", " -0.004766450263559818,\n", " 0.00836396124213934,\n", " 0.04237740486860275,\n", " 0.003034875262528658,\n", " -0.01231416966766119,\n", " -0.01523651834577322,\n", " 0.017775937914848328,\n", " 0.03990516811609268,\n", " -0.002383225131779909,\n", " 0.004830271936953068,\n", " 0.013563726097345352,\n", " 0.000969917222391814,\n", " 0.01346967276185751,\n", " 0.002389943227171898,\n", " -0.014806563034653664,\n", " 0.007436871994286776,\n", " -0.039448339492082596,\n", " 0.009015611372888088,\n", " 0.0007436032174155116,\n", " -0.004622012376785278,\n", " 0.004222289193421602,\n", " 0.016244225203990936,\n", " 0.01831338182091713,\n", " 0.005146019626408815,\n", " -0.013691368512809277,\n", " -0.03904525563120842,\n", " -0.024695521220564842,\n", " -0.019562937319278717,\n", " -0.013852601870894432,\n", " -0.009385104291141033,\n", " 0.003081901464611292,\n", " -0.013019564561545849,\n", " -0.025851024314761162,\n", " 0.011440824717283249,\n", " -0.02679155021905899,\n", " -0.025219528004527092,\n", " 0.01173641812056303,\n", " 0.01402727048844099,\n", " 0.02888757921755314,\n", " 0.020503463223576546,\n", " -0.007759337779134512,\n", " -0.013852601870894432,\n", " -0.005596128758043051,\n", " 0.0010958805214613676,\n", " -0.05113773047924042,\n", " -0.022236717864871025,\n", " -0.0123679144307971,\n", " 0.021954558789730072,\n", " 
0.015196209773421288,\n", " -0.03004308231174946,\n", " -0.03135981783270836,\n", " -0.016284532845020294,\n", " -0.05863506719470024,\n", " -0.018138712272047997,\n", " 0.006852402351796627,\n", " 0.014282556250691414,\n", " 0.016459202393889427,\n", " -0.013006128370761871,\n", " 0.009613517671823502,\n", " 0.020705003291368484,\n", " 0.0090760737657547,\n", " 0.0022656593937426805,\n", " -0.006879274267703295,\n", " -0.02109465003013611,\n", " -0.003799052443355322,\n", " -0.006419088691473007,\n", " 0.000651650014333427,\n", " -0.01878364384174347,\n", " 0.002342917025089264,\n", " -0.015572420321404934,\n", " 0.010453272610902786,\n", " -0.015962066128849983,\n", " -0.00675163185223937,\n", " 0.021229011937975883,\n", " 0.0007910493877716362,\n", " -0.004830271936953068,\n", " -0.015518675558269024,\n", " 0.007087533827871084,\n", " 0.013295004144310951,\n", " 0.025877896696329117,\n", " 0.007873544469475746,\n", " -0.027973923832178116,\n", " -0.0028232568874955177,\n", " 0.008303498849272728,\n", " -0.0018491409718990326,\n", " -0.014712510630488396,\n", " -0.010056908242404461,\n", " 0.0013133770553395152,\n", " 0.0015375918010249734,\n", " 0.025394197553396225,\n", " -0.0009573209099471569,\n", " -0.0033640593755990267,\n", " -0.011749854311347008,\n", " -0.01386603806167841,\n", " 0.02336534857749939,\n", " 0.010809328407049179,\n", " 0.010695122182369232,\n", " 0.0006445121252909303,\n", " -0.010668249800801277,\n", " 0.0041886987164616585,\n", " 0.013630906119942665,\n", " 0.015397750772535801,\n", " -0.015505239367485046,\n", " -0.0025142270606011152,\n", " 0.02105434238910675,\n", " -0.005992493126541376,\n", " 0.019025493413209915,\n", " -0.04425845667719841,\n", " 0.007114405743777752,\n", " -0.023754995316267014,\n", " -0.010473426431417465,\n", " 0.012764278799295425,\n", " -0.012018576264381409,\n", " -0.04624699801206589,\n", " 0.022586055099964142,\n", " 0.00040014335536397994,\n", " -0.009257460944354534,\n", " -0.006214188411831856,\n", " 
0.011252719908952713,\n", " -0.011467697098851204,\n", " -0.00610669981688261,\n", " -0.02998933754861355,\n", " 0.017184749245643616,\n", " -0.026482518762350082,\n", " -0.02105434238910675,\n", " -0.006083186715841293,\n", " -0.00826319120824337,\n", " -0.001481328159570694,\n", " -0.010983997955918312,\n", " 0.03326774016022682,\n", " 0.0012226835824549198,\n", " -0.004769809544086456,\n", " 0.19713421165943146,\n", " -0.014269120059907436,\n", " -0.0032179418485611677,\n", " 0.012105911038815975,\n", " -0.028511367738246918,\n", " -0.017131006345152855,\n", " 0.044177841395139694,\n", " 0.0017853195313364267,\n", " -0.0024302515666931868,\n", " 0.03783601149916649,\n", " -0.00892155896872282,\n", " 0.012549301609396935,\n", " 0.027275249361991882,\n", " -0.01050029881298542,\n", " 0.012723970226943493,\n", " -0.02199486829340458,\n", " -0.03536377102136612,\n", " -0.02347283624112606,\n", " 0.009519464336335659,\n", " 0.005925312638282776,\n", " 0.018703026697039604,\n", " 0.0024755983613431454,\n", " -0.01386603806167841,\n", " -0.011481133289635181,\n", " 0.0009497631108388305,\n", " 0.000136355243739672,\n", " -0.007302511017769575,\n", " -0.00022925317171029747,\n", " 0.010896663181483746,\n", " -0.0023731482215225697,\n", " -0.008934995159506798,\n", " -0.02426556497812271,\n", " -0.005135942716151476,\n", " 0.014040706679224968,\n", " -4.135794370085932e-05,\n", " 0.003822565544396639,\n", " 0.022236717864871025,\n", " 0.01215293724089861,\n", " 0.02794705331325531,\n", " 0.013476391322910786,\n", " 0.02304288186132908,\n", " 0.016297968104481697,\n", " 0.007846672087907791,\n", " -0.035229410976171494,\n", " -0.018716463819146156,\n", " 0.03547126054763794,\n", " ...]" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "truncated = truncate_text_tokens(long_text)\n", "get_embedding(truncated)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Chunking the input text\n", "\n", "Although truncation works, it has the clear drawback of discarding all text beyond the maximum context length. Another approach that addresses this issue is to divide the input text into chunks and embed each chunk individually. We can then either use the chunk embeddings separately or combine them in some way, such as by calculating their average (weighted by the size of each chunk).\n", "\n", "We will first take a function from [Python's own cookbook](https://docs.python.org/3/library/itertools.html#itertools-recipes) that breaks up a sequence into chunks." ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "from itertools import islice\n", "\n", "def batched(iterable, n):\n", "    \"\"\"Batch data into tuples of length n. The last batch may be shorter.\"\"\"\n", "    # batched('ABCDEFG', 3) --> ABC DEF G\n", "    if n < 1:\n", "        raise ValueError('n must be at least one')\n", "    it = iter(iterable)\n", "    while (batch := tuple(islice(it, n))):\n", "        yield batch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's define a function that encodes a string into tokens and then breaks it up into chunks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def chunked_tokens(text, encoding_name, chunk_length):\n", "    encoding = tiktoken.get_encoding(encoding_name)\n", "    tokens = encoding.encode(text)\n", "    chunks_iterator = batched(tokens, chunk_length)\n", "    yield from chunks_iterator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can write a function that safely handles embedding requests, even when the input text is longer than the maximum context length, by chunking the input tokens and embedding each chunk individually. 
The `average` flag can be set to `True` to return the weighted average of the chunk embeddings, or `False` to simply return the unmodified list of chunk embeddings." ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "\n", "def len_safe_get_embedding(text, model=EMBEDDING_MODEL, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING, average=True):\n", "    chunk_embeddings = []\n", "    chunk_lens = []\n", "    for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens):\n", "        chunk_embeddings.append(get_embedding(chunk, model=model))\n", "        chunk_lens.append(len(chunk))  # number of tokens in this chunk\n", "\n", "    if average:\n", "        # weight each chunk embedding by the number of tokens in its chunk\n", "        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens).tolist()\n", "    return chunk_embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once again, we can verify that we can now handle long input texts." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setting reduce=None gives us 2 embedding vectors.\n", "Setting reduce='average' gives us 1 embedding vector.\n" ] } ], "source": [ "average_embedding_vector = len_safe_get_embedding(long_text, average=True)\n", "chunks_embedding_vectors = len_safe_get_embedding(long_text, average=False)\n", "\n", "print(f\"Setting average=True gives us a single {len(average_embedding_vector)}-dimensional embedding vector for our long text.\")\n", "print(f\"Setting average=False gives us {len(chunks_embedding_vectors)} embedding vectors, one for each of the chunks.\")\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" }, "vscode": { "interpreter": { 
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97" } } }, "nbformat": 4, "nbformat_minor": 2 }