Mirror of https://github.com/james-m-jordan/openai-cookbook.git (synced 2025-05-09 19:32:38 +00:00)
Updating logprobs example and updating the model for the RAG database (#1667)
Parent: 3accc160d7
Commit: f68bdbd0e2
@@ -620,7 +620,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 15,
+"execution_count": 84,
 "id": "83100e64",
 "metadata": {},
 "outputs": [],
@@ -629,7 +629,7 @@
 "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))\n",
 "\n",
 "# Define the entities to look for\n",
-"def define_query(prompt, model=\"gpt-4-1106-preview\"):\n",
+"def define_query(prompt, model=\"gpt-4o\"):\n",
 " completion = client.chat.completions.create(\n",
 " model=model,\n",
 " temperature=0,\n",
@@ -1220,7 +1220,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 39,
+"execution_count": null,
 "id": "14f76f9d",
 "metadata": {},
 "outputs": [],
@@ -1230,7 +1230,7 @@
 "from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser\n",
 "\n",
 "\n",
-"llm = ChatOpenAI(temperature=0, model=\"gpt-4\")\n",
+"llm = ChatOpenAI(temperature=0, model=\"gpt-4o\")\n",
 "\n",
 "# LLM chain consisting of the LLM and a prompt\n",
 "llm_chain = LLMChain(llm=llm, prompt=prompt)\n",
@@ -7,7 +7,7 @@
 "# Using logprobs for classification and Q&A evaluation\n",
 "\n",
 "This notebook demonstrates the use of the `logprobs` parameter in the Chat Completions API. When `logprobs` is enabled, the API returns the log probabilities of each output token, along with a limited number of the most likely tokens at each token position and their log probabilities. The relevant request parameters are:\n",
-"* `logprobs`: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message. This option is currently not available on the `gpt-4-vision-preview` model.\n",
+"* `logprobs`: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message.\n",
 "* `top_logprobs`: An integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to true if this parameter is used.\n",
 "\n",
 "Log probabilities of output tokens indicate the likelihood of each token occurring in the sequence given the context. To simplify, a logprob is `log(p)`, where `p` = probability of a token occurring at a specific position based on the previous tokens in the context. Some key points about `logprobs`:\n",
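For reference, a minimal sketch of a request with these parameters enabled, using the `openai` Python client; the model and prompt are illustrative only, and `exp(logprob)` recovers the linear probability described above:

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

completion = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "user", "content": "Say hello"}],
    logprobs=True,
    top_logprobs=2,  # integer 0-5; requires logprobs=True
)

# exp(logprob) recovers the linear probability p for each output token
for tok in completion.choices[0].logprobs.content:
    print(f"{tok.token!r}: logprob={tok.logprob:.4f}, p={math.exp(tok.logprob):.2%}")
    for alt in tok.top_logprobs:  # runner-up candidates at this position
        print(f"  alt {alt.token!r}: p={math.exp(alt.logprob):.2%}")
```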
@@ -45,7 +45,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -60,7 +60,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": 31,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -117,7 +117,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 266,
+"execution_count": 32,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -137,7 +137,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 267,
+"execution_count": 33,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -150,7 +150,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 268,
+"execution_count": 69,
 "metadata": {},
 "outputs": [
 {
@@ -177,7 +177,7 @@
 " print(f\"\\nHeadline: {headline}\")\n",
 " API_RESPONSE = get_completion(\n",
 " [{\"role\": \"user\", \"content\": CLASSIFICATION_PROMPT.format(headline=headline)}],\n",
-" model=\"gpt-4\",\n",
+" model=\"gpt-4o\",\n",
 " )\n",
 " print(f\"Category: {API_RESPONSE.choices[0].message.content}\\n\")"
 ]
@@ -191,7 +191,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 269,
+"execution_count": 57,
 "metadata": {},
 "outputs": [
 {
@@ -205,7 +205,7 @@
 {
 "data": {
 "text/html": [
-"<span style='color: cyan'>Output token 1:</span> Technology, <span style='color: darkorange'>logprobs:</span> -2.4584822e-06, <span style='color: magenta'>linear probability:</span> 100.0%<br><span style='color: cyan'>Output token 2:</span> Techn, <span style='color: darkorange'>logprobs:</span> -13.781253, <span style='color: magenta'>linear probability:</span> 0.0%<br>"
+"<span style='color: cyan'>Output token 1:</span> Technology, <span style='color: darkorange'>logprobs:</span> 0.0, <span style='color: magenta'>linear probability:</span> 100.0%<br><span style='color: cyan'>Output token 2:</span> Technology, <span style='color: darkorange'>logprobs:</span> -18.75, <span style='color: magenta'>linear probability:</span> 0.0%<br>"
 ],
 "text/plain": [
 "<IPython.core.display.HTML object>"
@@ -227,7 +227,7 @@
 {
 "data": {
 "text/html": [
-"<span style='color: cyan'>Output token 1:</span> Politics, <span style='color: darkorange'>logprobs:</span> -2.4584822e-06, <span style='color: magenta'>linear probability:</span> 100.0%<br><span style='color: cyan'>Output token 2:</span> Technology, <span style='color: darkorange'>logprobs:</span> -13.937503, <span style='color: magenta'>linear probability:</span> 0.0%<br>"
+"<span style='color: cyan'>Output token 1:</span> Politics, <span style='color: darkorange'>logprobs:</span> -3.1281633e-07, <span style='color: magenta'>linear probability:</span> 100.0%<br><span style='color: cyan'>Output token 2:</span> Polit, <span style='color: darkorange'>logprobs:</span> -16.0, <span style='color: magenta'>linear probability:</span> 0.0%<br>"
 ],
 "text/plain": [
 "<IPython.core.display.HTML object>"
@@ -249,7 +249,7 @@
 {
 "data": {
 "text/html": [
-"<span style='color: cyan'>Output token 1:</span> Art, <span style='color: darkorange'>logprobs:</span> -0.009169078, <span style='color: magenta'>linear probability:</span> 99.09%<br><span style='color: cyan'>Output token 2:</span> Sports, <span style='color: darkorange'>logprobs:</span> -4.696669, <span style='color: magenta'>linear probability:</span> 0.91%<br>"
+"<span style='color: cyan'>Output token 1:</span> Art, <span style='color: darkorange'>logprobs:</span> -0.028133942, <span style='color: magenta'>linear probability:</span> 97.23%<br><span style='color: cyan'>Output token 2:</span> Sports, <span style='color: darkorange'>logprobs:</span> -4.278134, <span style='color: magenta'>linear probability:</span> 1.39%<br>"
 ],
 "text/plain": [
 "<IPython.core.display.HTML object>"
@@ -272,7 +272,7 @@
 " print(f\"\\nHeadline: {headline}\")\n",
 " API_RESPONSE = get_completion(\n",
 " [{\"role\": \"user\", \"content\": CLASSIFICATION_PROMPT.format(headline=headline)}],\n",
-" model=\"gpt-4\",\n",
+" model=\"gpt-4o-mini\",\n",
 " logprobs=True,\n",
 " top_logprobs=2,\n",
 " )\n",
@@ -292,9 +292,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"As expected from the first two headlines, `gpt-4` is nearly 100% confident in its classifications, as the content is clearly technology and politics focused respectively. However, the third headline combines both sports and art-related themes, so we see the model is less confident in its selection.\n",
+"As expected from the first two headlines, gpt-4o-mini is 100% confident in its classifications, as the content is clearly technology and politics focused, respectively. However, the third headline combines both sports and art-related themes, resulting in slightly lower confidence at 97%, while still demonstrating strong certainty in its classification.\n",
 "\n",
-"This shows how important using `logprobs` can be, as if we are using LLMs for classification tasks we can set confidence theshholds, or output several potential output tokens if the log probability of the selected output is not sufficiently high. For instance, if we are creating a recommendation engine to tag articles, we can automatically classify headlines crossing a certain threshold, and send the less certain headlines for manual review."
+"`logprobs` are quite useful for classification tasks. They allow us to set confidence thresholds or output multiple potential tokens if the log probability of the selected output is not sufficiently high. For instance, when creating a recommendation engine to tag articles, we can automatically classify headlines that exceed a certain threshold and send less certain ones for manual review."
 ]
 },
 {
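A minimal sketch of the thresholding idea from that markdown cell, reusing the notebook's `get_completion` helper and `CLASSIFICATION_PROMPT`; the 95% cutoff is a hypothetical value, not a recommendation:

```python
import math

CONFIDENCE_THRESHOLD = 0.95  # hypothetical cutoff, tune for your use case

def classify_or_escalate(headline: str):
    """Auto-tag confident classifications; route the rest to manual review."""
    api_response = get_completion(  # notebook helper defined earlier
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
        model="gpt-4o-mini",
        logprobs=True,
    )
    choice = api_response.choices[0]
    # Confidence of the first output token of the predicted category
    confidence = math.exp(choice.logprobs.content[0].logprob)
    if confidence >= CONFIDENCE_THRESHOLD:
        return choice.message.content  # auto-apply the tag
    return f"REVIEW ({confidence:.1%}): {headline}"  # send for manual review
```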
@@ -320,7 +320,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 270,
+"execution_count": 36,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -355,7 +355,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 271,
+"execution_count": 61,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -368,13 +368,13 @@
 },
 {
 "cell_type": "code",
-"execution_count": 272,
+"execution_count": 65,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/html": [
-"Questions clearly answered in article<p style=\"color:green\">Question: What nationality was Ada Lovelace?</p><p style=\"color:cyan\">has_sufficient_context_for_answer: True, <span style=\"color:darkorange\">logprobs: -3.1281633e-07, <span style=\"color:magenta\">linear probability: 100.0%</span></p><p style=\"color:green\">Question: What was an important finding from Lovelace's seventh note?</p><p style=\"color:cyan\">has_sufficient_context_for_answer: True, <span style=\"color:darkorange\">logprobs: -7.89631e-07, <span style=\"color:magenta\">linear probability: 100.0%</span></p>Questions only partially covered in the article<p style=\"color:green\">Question: Did Lovelace collaborate with Charles Dickens</p><p style=\"color:cyan\">has_sufficient_context_for_answer: True, <span style=\"color:darkorange\">logprobs: -0.06993677, <span style=\"color:magenta\">linear probability: 93.25%</span></p><p style=\"color:green\">Question: What concepts did Lovelace build with Charles Babbage</p><p style=\"color:cyan\">has_sufficient_context_for_answer: False, <span style=\"color:darkorange\">logprobs: -0.61807257, <span style=\"color:magenta\">linear probability: 53.9%</span></p>"
+"Questions clearly answered in article<p style=\"color:green\">Question: What nationality was Ada Lovelace?</p><p style=\"color:cyan\">has_sufficient_context_for_answer: True, <span style=\"color:darkorange\">logprobs: -3.1281633e-07, <span style=\"color:magenta\">linear probability: 100.0%</span></p><p style=\"color:green\">Question: What was an important finding from Lovelace's seventh note?</p><p style=\"color:cyan\">has_sufficient_context_for_answer: True, <span style=\"color:darkorange\">logprobs: -7.89631e-07, <span style=\"color:magenta\">linear probability: 100.0%</span></p>Questions only partially covered in the article<p style=\"color:green\">Question: Did Lovelace collaborate with Charles Dickens</p><p style=\"color:cyan\">has_sufficient_context_for_answer: False, <span style=\"color:darkorange\">logprobs: -0.008654992, <span style=\"color:magenta\">linear probability: 99.14%</span></p><p style=\"color:green\">Question: What concepts did Lovelace build with Charles Babbage</p><p style=\"color:cyan\">has_sufficient_context_for_answer: True, <span style=\"color:darkorange\">logprobs: -0.004082317, <span style=\"color:magenta\">linear probability: 99.59%</span></p>"
 ],
 "text/plain": [
 "<IPython.core.display.HTML object>"
@@ -398,7 +398,7 @@
 " ),\n",
 " }\n",
 " ],\n",
-" model=\"gpt-4\",\n",
+" model=\"gpt-4o-mini\",\n",
 " logprobs=True,\n",
 " )\n",
 " html_output += f'<p style=\"color:green\">Question: {question}</p>'\n",
@@ -417,7 +417,7 @@
 " ),\n",
 " }\n",
 " ],\n",
-" model=\"gpt-4\",\n",
+" model=\"gpt-4o\",\n",
 " logprobs=True,\n",
 " top_logprobs=3,\n",
 " )\n",
@@ -437,13 +437,6 @@
 "This self-evaluation can help reduce hallucinations, as you can restrict answers or re-prompt the user when your `sufficient_context_for_answer` log probability is below a certain threshold. Methods like this have been shown to significantly reduce RAG for Q&A hallucinations and errors ([Example](https://jfan001.medium.com/how-we-cut-the-rate-of-gpt-hallucinations-from-20-to-less-than-2-f3bfcc10e4ec)) "
 ]
 },
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
-},
 {
 "cell_type": "markdown",
 "metadata": {},
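A sketch of such a guard, assuming the notebook's `get_completion` helper and a `PROMPT` template that asks for a `has_sufficient_context_for_answer` True/False judgment; the 98% cutoff is hypothetical:

```python
import math

SUFFICIENT_CONTEXT_THRESHOLD = 0.98  # hypothetical cutoff, tune on your own data

def is_answer_grounded(question: str, article: str) -> bool:
    """Trust a 'True' self-evaluation only when its token probability clears the bar."""
    # PROMPT stands in for the notebook's self-evaluation template
    api_response = get_completion(
        [{"role": "user", "content": PROMPT.format(article=article, question=question)}],
        model="gpt-4o-mini",
        logprobs=True,
    )
    choice = api_response.choices[0]
    verdict = choice.message.content  # expected: "True" or "False"
    p_verdict = math.exp(choice.logprobs.content[0].logprob)
    # Restrict the answer (or re-prompt) when confidence in "True" is low
    return verdict == "True" and p_verdict >= SUFFICIENT_CONTEXT_THRESHOLD
```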
@@ -467,7 +460,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 273,
+"execution_count": 39,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -486,18 +479,18 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now, we can ask `gpt-3.5-turbo` to act as an autocomplete engine with whatever context the model is given. We can enable `logprobs` and can see how confident the model is in its prediction."
+"Now, we can ask `gpt-4o-mini` to act as an autocomplete engine with whatever context the model is given. We can enable `logprobs` and can see how confident the model is in its prediction."
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 274,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/html": [
-"<p>Sentence: My</p><p style=\"color:cyan\">Predicted next token: favorite, <span style=\"color:darkorange\">logprobs: -0.18245785, <span style=\"color:magenta\">linear probability: 83.32%</span></p><p style=\"color:cyan\">Predicted next token: dog, <span style=\"color:darkorange\">logprobs: -2.397172, <span style=\"color:magenta\">linear probability: 9.1%</span></p><p style=\"color:cyan\">Predicted next token: ap, <span style=\"color:darkorange\">logprobs: -3.8732424, <span style=\"color:magenta\">linear probability: 2.08%</span></p><br><p>Sentence: My least</p><p style=\"color:cyan\">Predicted next token: favorite, <span style=\"color:darkorange\">logprobs: -0.0146376295, <span style=\"color:magenta\">linear probability: 98.55%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -4.2417912, <span style=\"color:magenta\">linear probability: 1.44%</span></p><p style=\"color:cyan\">Predicted next token: favorite, <span style=\"color:darkorange\">logprobs: -9.748788, <span style=\"color:magenta\">linear probability: 0.01%</span></p><br><p>Sentence: My least favorite</p><p style=\"color:cyan\">Predicted next token: food, <span style=\"color:darkorange\">logprobs: -0.9481721, <span style=\"color:magenta\">linear probability: 38.74%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -1.3447137, <span style=\"color:magenta\">linear probability: 26.06%</span></p><p style=\"color:cyan\">Predicted next token: color, <span style=\"color:darkorange\">logprobs: -1.3887696, <span style=\"color:magenta\">linear probability: 24.94%</span></p><br><p>Sentence: My least favorite TV</p><p style=\"color:cyan\">Predicted next token: show, <span style=\"color:darkorange\">logprobs: -0.0007898556, <span style=\"color:magenta\">linear probability: 99.92%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -7.711523, <span style=\"color:magenta\">linear probability: 0.04%</span></p><p style=\"color:cyan\">Predicted next token: series, <span style=\"color:darkorange\">logprobs: -9.348547, <span style=\"color:magenta\">linear probability: 0.01%</span></p><br><p>Sentence: My least favorite TV show</p><p style=\"color:cyan\">Predicted next token: is, <span style=\"color:darkorange\">logprobs: -0.2851253, <span style=\"color:magenta\">linear probability: 75.19%</span></p><p style=\"color:cyan\">Predicted next token: of, <span style=\"color:darkorange\">logprobs: -1.55335, <span style=\"color:magenta\">linear probability: 21.15%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -3.4928775, <span style=\"color:magenta\">linear probability: 3.04%</span></p><br><p>Sentence: My least favorite TV show is</p><p style=\"color:cyan\">Predicted next token: \"My, <span style=\"color:darkorange\">logprobs: -0.69349754, <span style=\"color:magenta\">linear probability: 49.98%</span></p><p style=\"color:cyan\">Predicted next token: \"The, <span style=\"color:darkorange\">logprobs: -1.2899293, <span style=\"color:magenta\">linear probability: 27.53%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -2.4170141, <span style=\"color:magenta\">linear probability: 8.92%</span></p><br><p>Sentence: My least favorite TV show is Breaking Bad</p><p style=\"color:cyan\">Predicted next token: because, <span style=\"color:magenta\">linear probability: 83.71%</span></p><p style=\"color:cyan\">Predicted next token: ,, <span style=\"color:darkorange\">logprobs: -2.3946173, <span style=\"color:magenta\">linear probability: 9.12%</span></p><p style=\"color:cyan\">Predicted next token: ., <span style=\"color:darkorange\">logprobs: -3.1861975, <span style=\"color:magenta\">linear probability: 4.13%</span></p><br>"
+"<p>Sentence: My</p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -0.08344023, <span style=\"color:magenta\">linear probability: 91.99%</span></p><p style=\"color:cyan\">Predicted next token: dog, <span style=\"color:darkorange\">logprobs: -3.3334403, <span style=\"color:magenta\">linear probability: 3.57%</span></p><p style=\"color:cyan\">Predicted next token: ap, <span style=\"color:darkorange\">logprobs: -3.5834403, <span style=\"color:magenta\">linear probability: 2.78%</span></p><br><p>Sentence: My least</p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -0.1271426, <span style=\"color:magenta\">linear probability: 88.06%</span></p><p style=\"color:cyan\">Predicted next token: favorite, <span style=\"color:darkorange\">logprobs: -2.1271427, <span style=\"color:magenta\">linear probability: 11.92%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -9.127143, <span style=\"color:magenta\">linear probability: 0.01%</span></p><br><p>Sentence: My least favorite</p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -0.052905332, <span style=\"color:magenta\">linear probability: 94.85%</span></p><p style=\"color:cyan\">Predicted next token: food, <span style=\"color:darkorange\">logprobs: -4.0529056, <span style=\"color:magenta\">linear probability: 1.74%</span></p><p style=\"color:cyan\">Predicted next token: color, <span style=\"color:darkorange\">logprobs: -5.0529056, <span style=\"color:magenta\">linear probability: 0.64%</span></p><br><p>Sentence: My least favorite TV</p><p style=\"color:cyan\">Predicted next token: show, <span style=\"color:darkorange\">logprobs: -0.57662326, <span style=\"color:magenta\">linear probability: 56.18%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -0.82662326, <span style=\"color:magenta\">linear probability: 43.75%</span></p><p style=\"color:cyan\">Predicted next token: show, <span style=\"color:darkorange\">logprobs: -8.201623, <span style=\"color:magenta\">linear probability: 0.03%</span></p><br><p>Sentence: My least favorite TV show</p><p style=\"color:cyan\">Predicted next token: is, <span style=\"color:darkorange\">logprobs: -0.70817715, <span style=\"color:magenta\">linear probability: 49.25%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -0.70817715, <span style=\"color:magenta\">linear probability: 49.25%</span></p><p style=\"color:cyan\">Predicted next token: was, <span style=\"color:darkorange\">logprobs: -4.833177, <span style=\"color:magenta\">linear probability: 0.8%</span></p><br><p>Sentence: My least favorite TV show is</p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -0.47896808, <span style=\"color:magenta\">linear probability: 61.94%</span></p><p style=\"color:cyan\">Predicted next token: one, <span style=\"color:darkorange\">logprobs: -1.7289681, <span style=\"color:magenta\">linear probability: 17.75%</span></p><p style=\"color:cyan\">Predicted next token: the, <span style=\"color:darkorange\">logprobs: -2.9789681, <span style=\"color:magenta\">linear probability: 5.08%</span></p><br><p>Sentence: My least favorite TV show is Breaking Bad</p><p style=\"color:cyan\">Predicted next token: because, <span style=\"color:darkorange\">logprobs: -0.034502674, <span style=\"color:magenta\">linear probability: 96.61%</span></p><p style=\"color:cyan\">Predicted next token: ,, <span style=\"color:darkorange\">logprobs: -3.7845027, <span style=\"color:magenta\">linear probability: 2.27%</span></p><p style=\"color:cyan\">Predicted next token: because, <span style=\"color:darkorange\">logprobs: -5.0345025, <span style=\"color:magenta\">linear probability: 0.65%</span></p><br>"
 ],
 "text/plain": [
 "<IPython.core.display.HTML object>"
@@ -516,7 +509,7 @@
 " PROMPT = \"\"\"Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}\"\"\"\n",
 " API_RESPONSE = get_completion(\n",
 " [{\"role\": \"user\", \"content\": PROMPT.format(sentence=sentence)}],\n",
-" model=\"gpt-3.5-turbo\",\n",
+" model=\"gpt-4o-mini\",\n",
 " logprobs=True,\n",
 " top_logprobs=3,\n",
 " )\n",
@@ -544,16 +537,16 @@
 },
 {
 "cell_type": "code",
-"execution_count": 275,
+"execution_count": 48,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/plain": [
-"{'My least': 'favorite', 'My least favorite TV': 'show'}"
+"{'My least favorite TV show is Breaking Bad': 'because'}"
 ]
 },
-"execution_count": 275,
+"execution_count": 48,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -571,16 +564,16 @@
 },
 {
 "cell_type": "code",
-"execution_count": 276,
+"execution_count": 49,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/plain": [
-"{'My least favorite': 'food', 'My least favorite TV show is': '\"My'}"
+"{'My least favorite TV': 'show', 'My least favorite TV show': 'is'}"
 ]
 },
-"execution_count": 276,
+"execution_count": 49,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -594,7 +587,7 @@
 "metadata": {},
 "source": [
 "These are logical as well. It's pretty unclear what the user is going to say with just the prefix 'my least favorite', and it's really anyone's guess what the author's favorite TV show is. <br><br>\n",
-"So, using `gpt-3.5-turbo`, we can create the root of a dynamic autocompletion engine with `logprobs`!"
+"So, using `gpt-4o-mini`, we can create the root of a dynamic autocompletion engine with `logprobs`!"
 ]
 },
 {
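One way to grow that root into a suggestion gate, sketched under the assumption that the notebook's `get_completion` helper is in scope; the 95% probability bar is a made-up value:

```python
import math

AUTOCOMPLETE_PROMPT = (
    "Complete this sentence. You are acting as auto-complete. Simply complete "
    "the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}"
)

def suggest_next_word(sentence: str, min_prob: float = 0.95):
    """Surface a suggestion only when the top predicted token is high-confidence."""
    api_response = get_completion(  # notebook helper defined earlier
        [{"role": "user", "content": AUTOCOMPLETE_PROMPT.format(sentence=sentence)}],
        model="gpt-4o-mini",
        logprobs=True,
    )
    first_token = api_response.choices[0].logprobs.content[0]
    if math.exp(first_token.logprob) >= min_prob:
        return first_token.token  # confident enough to show the user
    return None  # below the bar: stay quiet rather than guess
```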
@@ -613,14 +606,14 @@
 },
 {
 "cell_type": "code",
-"execution_count": 277,
+"execution_count": 66,
 "metadata": {},
 "outputs": [],
 "source": [
 "PROMPT = \"\"\"What's the longest word in the English language?\"\"\"\n",
 "\n",
 "API_RESPONSE = get_completion(\n",
-" [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4\", logprobs=True, top_logprobs=5\n",
+" [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4o\", logprobs=True, top_logprobs=5\n",
 ")\n",
 "\n",
 "\n",
@@ -650,13 +643,13 @@
 },
 {
 "cell_type": "code",
-"execution_count": 278,
+"execution_count": 67,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/html": [
-"<span style='color: #FF00FF'>The</span><span style='color: #008000'> longest</span><span style='color: #FF8C00'> word</span><span style='color: #FF0000'> in</span><span style='color: #0000FF'> the</span><span style='color: #FF00FF'> English</span><span style='color: #008000'> language</span><span style='color: #FF8C00'>,</span><span style='color: #FF0000'> according</span><span style='color: #0000FF'> to</span><span style='color: #FF00FF'> the</span><span style='color: #008000'> Guinness</span><span style='color: #FF8C00'> World</span><span style='color: #FF0000'> Records</span><span style='color: #0000FF'>,</span><span style='color: #FF00FF'> is</span><span style='color: #008000'> '</span><span style='color: #FF8C00'>p</span><span style='color: #FF0000'>ne</span><span style='color: #0000FF'>um</span><span style='color: #FF00FF'>on</span><span style='color: #008000'>oul</span><span style='color: #FF8C00'>tram</span><span style='color: #FF0000'>icro</span><span style='color: #0000FF'>sc</span><span style='color: #FF00FF'>op</span><span style='color: #008000'>ics</span><span style='color: #FF8C00'>il</span><span style='color: #FF0000'>ic</span><span style='color: #0000FF'>ov</span><span style='color: #FF00FF'>ol</span><span style='color: #008000'>cano</span><span style='color: #FF8C00'>con</span><span style='color: #FF0000'>iosis</span><span style='color: #0000FF'>'.</span><span style='color: #FF00FF'> It</span><span style='color: #008000'> is</span><span style='color: #FF8C00'> a</span><span style='color: #FF0000'> type</span><span style='color: #0000FF'> of</span><span style='color: #FF00FF'> lung</span><span style='color: #008000'> disease</span><span style='color: #FF8C00'> caused</span><span style='color: #FF0000'> by</span><span style='color: #0000FF'> inh</span><span style='color: #FF00FF'>aling</span><span style='color: #008000'> ash</span><span style='color: #FF8C00'> and</span><span style='color: #FF0000'> sand</span><span style='color: #0000FF'> dust</span><span style='color: #FF00FF'>.</span>"
+"<span style='color: #FF00FF'>The</span><span style='color: #008000'> longest</span><span style='color: #FF8C00'> word</span><span style='color: #FF0000'> in</span><span style='color: #0000FF'> the</span><span style='color: #FF00FF'> English</span><span style='color: #008000'> language</span><span style='color: #FF8C00'> is</span><span style='color: #FF0000'> often</span><span style='color: #0000FF'> considered</span><span style='color: #FF00FF'> to</span><span style='color: #008000'> be</span><span style='color: #FF8C00'> \"</span><span style='color: #FF0000'>p</span><span style='color: #0000FF'>ne</span><span style='color: #FF00FF'>um</span><span style='color: #008000'>on</span><span style='color: #FF8C00'>oul</span><span style='color: #FF0000'>tr</span><span style='color: #0000FF'>amic</span><span style='color: #FF00FF'>ros</span><span style='color: #008000'>cop</span><span style='color: #FF8C00'>ics</span><span style='color: #FF0000'>ilic</span><span style='color: #0000FF'>ovol</span><span style='color: #FF00FF'>can</span><span style='color: #008000'>ocon</span><span style='color: #FF8C00'>iosis</span><span style='color: #FF0000'>,\"</span><span style='color: #0000FF'> a</span><span style='color: #FF00FF'> term</span><span style='color: #008000'> referring</span><span style='color: #FF8C00'> to</span><span style='color: #FF0000'> a</span><span style='color: #0000FF'> type</span><span style='color: #FF00FF'> of</span><span style='color: #008000'> lung</span><span style='color: #FF8C00'> disease</span><span style='color: #FF0000'> caused</span><span style='color: #0000FF'> by</span><span style='color: #FF00FF'> inhal</span><span style='color: #008000'>ing</span><span style='color: #FF8C00'> very</span><span style='color: #FF0000'> fine</span><span style='color: #0000FF'> sil</span><span style='color: #FF00FF'>icate</span><span style='color: #008000'> or</span><span style='color: #FF8C00'> quartz</span><span style='color: #FF0000'> dust</span><span style='color: #0000FF'>.</span><span style='color: #FF00FF'> However</span><span style='color: #008000'>,</span><span style='color: #FF8C00'> it's</span><span style='color: #FF0000'> worth</span><span style='color: #0000FF'> noting</span><span style='color: #FF00FF'> that</span><span style='color: #008000'> this</span><span style='color: #FF8C00'> word</span><span style='color: #FF0000'> was</span><span style='color: #0000FF'> coined</span><span style='color: #FF00FF'> more</span><span style='color: #008000'> for</span><span style='color: #FF8C00'> its</span><span style='color: #FF0000'> length</span><span style='color: #0000FF'> than</span><span style='color: #FF00FF'> for</span><span style='color: #008000'> practical</span><span style='color: #FF8C00'> use</span><span style='color: #FF0000'>.</span><span style='color: #0000FF'> There</span><span style='color: #FF00FF'> are</span><span style='color: #008000'> also</span><span style='color: #FF8C00'> chemical</span><span style='color: #FF0000'> names</span><span style='color: #0000FF'> for</span><span style='color: #FF00FF'> proteins</span><span style='color: #008000'> and</span><span style='color: #FF8C00'> other</span><span style='color: #FF0000'> compounds</span><span style='color: #0000FF'> that</span><span style='color: #FF00FF'> can</span><span style='color: #008000'> be</span><span style='color: #FF8C00'> much</span><span style='color: #FF0000'> longer</span><span style='color: #0000FF'>,</span><span style='color: #FF00FF'> but</span><span style='color: #008000'> they</span><span style='color: #FF8C00'> are</span><span style='color: #FF0000'> typically</span><span style='color: #0000FF'> not</span><span style='color: #FF00FF'> used</span><span style='color: #008000'> in</span><span style='color: #FF8C00'> everyday</span><span style='color: #FF0000'> language</span><span style='color: #0000FF'>.</span>"
 ],
 "text/plain": [
 "<IPython.core.display.HTML object>"
@@ -669,7 +662,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"Total number of tokens: 51\n"
+"Total number of tokens: 95\n"
 ]
 }
 ],
@@ -686,16 +679,68 @@
 },
 {
 "cell_type": "code",
-"execution_count": 279,
+"execution_count": 68,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
+"Token: Here\n",
+"Log prob: -0.054242473\n",
+"Linear prob: 94.72 %\n",
+"Bytes: [72, 101, 114, 101] \n",
+"\n",
+"Token: is\n",
+"Log prob: -0.0044352207\n",
+"Linear prob: 99.56 %\n",
+"Bytes: [32, 105, 115] \n",
+"\n",
+"Token: the\n",
+"Log prob: -2.1008714e-06\n",
+"Linear prob: 100.0 %\n",
+"Bytes: [32, 116, 104, 101] \n",
+"\n",
+"Token: blue\n",
+"Log prob: -0.0013290489\n",
+"Linear prob: 99.87 %\n",
+"Bytes: [32, 98, 108, 117, 101] \n",
+"\n",
+"Token: heart\n",
+"Log prob: 0.0\n",
+"Linear prob: 100.0 %\n",
+"Bytes: [32, 104, 101, 97, 114, 116] \n",
+"\n",
+"Token: emoji\n",
+"Log prob: 0.0\n",
+"Linear prob: 100.0 %\n",
+"Bytes: [32, 101, 109, 111, 106, 105] \n",
+"\n",
+"Token: and\n",
+"Log prob: -0.038287632\n",
+"Linear prob: 96.24 %\n",
+"Bytes: [32, 97, 110, 100] \n",
+"\n",
+"Token: its\n",
+"Log prob: 0.0\n",
+"Linear prob: 100.0 %\n",
+"Bytes: [32, 105, 116, 115] \n",
+"\n",
+"Token: name\n",
+"Log prob: -1.569009e-05\n",
+"Linear prob: 100.0 %\n",
+"Bytes: [32, 110, 97, 109, 101] \n",
+"\n",
+"Token: :\n",
+"\n",
+"\n",
+"Log prob: -0.11313002\n",
+"Linear prob: 89.3 %\n",
+"Bytes: [58, 10, 10] \n",
+"\n",
 "Token: \\xf0\\x9f\\x92\n",
-"Log prob: -0.0003056686\n",
-"Linear prob: 99.97 %\n",
+"Log prob: -0.09048584\n",
+"Linear prob: 91.35 %\n",
 "Bytes: [240, 159, 146] \n",
 "\n",
 "Token: \\x99\n",
@@ -703,31 +748,28 @@
 "Linear prob: 100.0 %\n",
 "Bytes: [153] \n",
 "\n",
-"Token: -\n",
-"Log prob: -0.0096905725\n",
-"Linear prob: 99.04 %\n",
-"Bytes: [32, 45] \n",
-"\n",
 "Token: Blue\n",
-"Log prob: -0.00042042506\n",
-"Linear prob: 99.96 %\n",
+"Log prob: -0.023958502\n",
+"Linear prob: 97.63 %\n",
 "Bytes: [32, 66, 108, 117, 101] \n",
 "\n",
 "Token: Heart\n",
-"Log prob: -7.302705e-05\n",
-"Linear prob: 99.99 %\n",
+"Log prob: -6.2729996e-06\n",
+"Linear prob: 100.0 %\n",
 "Bytes: [32, 72, 101, 97, 114, 116] \n",
 "\n",
-"Bytes array: [240, 159, 146, 153, 32, 45, 32, 66, 108, 117, 101, 32, 72, 101, 97, 114, 116]\n",
-"Decoded bytes: 💙 - Blue Heart\n",
-"Joint prob: 98.96 %\n"
+"Bytes array: [72, 101, 114, 101, 32, 105, 115, 32, 116, 104, 101, 32, 98, 108, 117, 101, 32, 104, 101, 97, 114, 116, 32, 101, 109, 111, 106, 105, 32, 97, 110, 100, 32, 105, 116, 115, 32, 110, 97, 109, 101, 58, 10, 10, 240, 159, 146, 153, 32, 66, 108, 117, 101, 32, 72, 101, 97, 114, 116]\n",
+"Decoded bytes: Here is the blue heart emoji and its name:\n",
+"\n",
+"💙 Blue Heart\n",
+"Joint prob: 72.19 %\n"
 ]
 }
 ],
 "source": [
 "PROMPT = \"\"\"Output the blue heart emoji and its name.\"\"\"\n",
 "API_RESPONSE = get_completion(\n",
-" [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4\", logprobs=True\n",
+" [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4o\", logprobs=True\n",
 ")\n",
 "\n",
 "aggregated_bytes = []\n",
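The aggregation behind that output works by concatenating each token's `bytes` field and summing the logprobs (the log of a joint probability is the sum of the individual logs); a sketch, assuming `API_RESPONSE` from the cell above:

```python
import math

# Re-assemble the response from raw token bytes, reusing API_RESPONSE above
aggregated_bytes = []
joint_logprob = 0.0

for token in API_RESPONSE.choices[0].logprobs.content:
    aggregated_bytes.extend(token.bytes)  # raw UTF-8 bytes behind this token
    joint_logprob += token.logprob        # log P(sequence) = sum of token logprobs

print("Decoded bytes:", bytes(aggregated_bytes).decode("utf-8"))  # emoji re-assembled
print(f"Joint prob: {math.exp(joint_logprob):.2%}")
```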
@@ -771,12 +813,12 @@
 "\n",
 "When looking to assess the model's confidence in a result, it can be useful to calculate perplexity, which is a measure of the uncertainty. Perplexity can be calculated by exponentiating the negative of the average of the logprobs. Generally, a higher perplexity indicates a more uncertain result, and a lower perplexity indicates a more confident result. As such, perplexity can be used to both assess the result of an individual model run and also to compare the relative confidence of results between model runs. While a high confidence doesn't guarantee result accuracy, it can be a helpful signal that can be paired with other evaluation metrics to build a better understanding of your prompt's behavior.\n",
 "\n",
-"For example, let's say that I want to use `gpt-3.5-turbo` to learn more about artificial intelligence. I could ask a question about recent history and a question about the future:"
+"For example, let's say that I want to use `gpt-4o-mini` to learn more about artificial intelligence. I could ask a question about recent history and a question about the future:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 4,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
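The formula above translates directly to code; a small helper, assuming a list of token logprobs extracted from a logprobs-enabled response:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative mean logprob; lower = more confident."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Usage with a response like the ones above:
# logprobs = [t.logprob for t in API_RESPONSE.choices[0].logprobs.content]
# print(perplexity(logprobs))
```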
@@ -784,18 +826,18 @@
 "output_type": "stream",
 "text": [
 "Prompt: In a short sentence, has artifical intelligence grown in the last decade?\n",
-"Response: Yes, artificial intelligence has grown significantly in the last decade. \n",
+"Response: Yes, artificial intelligence has grown significantly in the last decade, advancing in capabilities and applications across various fields. \n",
 "\n",
-"Tokens: Yes , artificial intelligence has grown significantly in the last decade .\n",
-"Logprobs: -0.00 -0.00 -0.00 -0.00 -0.00 -0.53 -0.11 -0.00 -0.00 -0.01 -0.00 -0.00\n",
-"Perplexity: 1.0564125277713383 \n",
+"Tokens: Yes , artificial intelligence has grown significantly in the last decade , advancing in capabilities and applications across various fields .\n",
+"Logprobs: -0.00 0.00 -0.00 0.00 -0.00 -0.73 -0.00 -0.01 -0.02 -0.00 0.00 -0.02 -0.66 -0.03 -0.62 -0.47 -0.02 -0.39 -0.01 -0.20 -0.00\n",
+"Perplexity: 1.1644170003987546 \n",
 "\n",
 "Prompt: In a short sentence, what are your thoughts on the future of artificial intelligence?\n",
-"Response: The future of artificial intelligence holds great potential for transforming industries and improving efficiency, but also raises ethical and societal concerns that must be carefully addressed. \n",
+"Response: The future of artificial intelligence holds immense potential for transformative advancements across various sectors, but it also requires careful consideration of ethical and societal impacts. \n",
 "\n",
-"Tokens: The future of artificial intelligence holds great potential for transforming industries and improving efficiency , but also raises ethical and societal concerns that must be carefully addressed .\n",
-"Logprobs: -0.19 -0.03 -0.00 -0.00 -0.00 -0.30 -0.51 -0.24 -0.03 -1.45 -0.23 -0.03 -0.22 -0.83 -0.48 -0.01 -0.38 -0.07 -0.47 -0.63 -0.18 -0.26 -0.01 -0.14 -0.00 -0.59 -0.55 -0.00\n",
-"Perplexity: 1.3220795252314004 \n",
+"Tokens: The future of artificial intelligence holds immense potential for transformative advancements across various sectors , but it also requires careful consideration of ethical and societal impacts .\n",
+"Logprobs: -0.02 -0.00 0.00 -0.00 0.00 -0.05 -0.35 -0.01 -0.02 -0.64 -0.43 -0.25 -0.16 -0.51 -0.02 -0.43 -0.08 -0.07 -0.97 -0.02 -0.48 -0.00 -0.00 -0.48 -0.01 -0.58 -0.00\n",
+"Perplexity: 1.2292170270768858 \n",
 "\n"
 ]
 }
@@ -809,7 +851,7 @@
 "for prompt in prompts:\n",
 " API_RESPONSE = get_completion(\n",
 " [{\"role\": \"user\", \"content\": prompt}],\n",
-" model=\"gpt-3.5-turbo\",\n",
+" model=\"gpt-4o-mini\",\n",
 " logprobs=True,\n",
 " )\n",
 "\n",
@@ -835,7 +877,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In this example, `gpt-3.5-turbo` returned a lower perplexity score for a more deterministic question about recent history, and a higher perplexity score for a more speculative assessment about the near future. Again, while these differences don't guarantee accuracy, they help point the way for our interpretation of the model's results and our future use of them."
+"In this example, `gpt-4o-mini` returned a lower perplexity score for a more deterministic question about recent history, and a higher perplexity score for a more speculative assessment about the near future. Again, while these differences don't guarantee accuracy, they help point the way for our interpretation of the model's results and our future use of them."
 ]
 },
 {