diff --git a/examples/RAG_with_graph_db.ipynb b/examples/RAG_with_graph_db.ipynb index 221b630..2ba9453 100644 --- a/examples/RAG_with_graph_db.ipynb +++ b/examples/RAG_with_graph_db.ipynb @@ -620,7 +620,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 84, "id": "83100e64", "metadata": {}, "outputs": [], @@ -629,7 +629,7 @@ "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"\"))\n", "\n", "# Define the entities to look for\n", - "def define_query(prompt, model=\"gpt-4-1106-preview\"):\n", + "def define_query(prompt, model=\"gpt-4o\"):\n", " completion = client.chat.completions.create(\n", " model=model,\n", " temperature=0,\n", @@ -1220,7 +1220,7 @@ }, { "cell_type": "code", - "execution_count": 39, + "execution_count": null, "id": "14f76f9d", "metadata": {}, "outputs": [], @@ -1230,7 +1230,7 @@ "from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser\n", "\n", "\n", - "llm = ChatOpenAI(temperature=0, model=\"gpt-4\")\n", + "llm = ChatOpenAI(temperature=0, model=\"gpt-4o\")\n", "\n", "# LLM chain consisting of the LLM and a prompt\n", "llm_chain = LLMChain(llm=llm, prompt=prompt)\n", diff --git a/examples/Using_logprobs.ipynb b/examples/Using_logprobs.ipynb index ee3c336..e3aa394 100644 --- a/examples/Using_logprobs.ipynb +++ b/examples/Using_logprobs.ipynb @@ -7,7 +7,7 @@ "# Using logprobs for classification and Q&A evaluation\n", "\n", "This notebook demonstrates the use of the `logprobs` parameter in the Chat Completions API. When `logprobs` is enabled, the API returns the log probabilities of each output token, along with a limited number of the most likely tokens at each token position and their log probabilities. The relevant request parameters are:\n", - "* `logprobs`: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message. 
This option is currently not available on the `gpt-4-vision-preview` model.\n", + "* `logprobs`: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message.\n", "* `top_logprobs`: An integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to true if this parameter is used.\n", "\n", "Log probabilities of output tokens indicate the likelihood of each token occurring in the sequence given the context. To simplify, a logprob is `log(p)`, where `p` = probability of a token occurring at a specific position based on the previous tokens in the context. Some key points about `logprobs`:\n", @@ -45,7 +45,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -60,7 +60,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 31, "metadata": {}, "outputs": [], "source": [ @@ -117,7 +117,7 @@ }, { "cell_type": "code", - "execution_count": 266, + "execution_count": 32, "metadata": {}, "outputs": [], "source": [ @@ -137,7 +137,7 @@ }, { "cell_type": "code", - "execution_count": 267, + "execution_count": 33, "metadata": {}, "outputs": [], "source": [ @@ -150,7 +150,7 @@ }, { "cell_type": "code", - "execution_count": 268, + "execution_count": 69, "metadata": {}, "outputs": [ { @@ -177,7 +177,7 @@ " print(f\"\\nHeadline: {headline}\")\n", " API_RESPONSE = get_completion(\n", " [{\"role\": \"user\", \"content\": CLASSIFICATION_PROMPT.format(headline=headline)}],\n", - " model=\"gpt-4\",\n", + " model=\"gpt-4o\",\n", " )\n", " print(f\"Category: {API_RESPONSE.choices[0].message.content}\\n\")" ] @@ -191,7 +191,7 @@ }, { "cell_type": "code", - "execution_count": 269, + "execution_count": 57, "metadata": {}, "outputs": [ { @@ -205,7 +205,7 @@ { "data": { "text/html": [ - "Output token 1: 
Technology, logprobs: -2.4584822e-06, linear probability: 100.0%
Output token 2: Techn, logprobs: -13.781253, linear probability: 0.0%
" + "Output token 1: Technology, logprobs: 0.0, linear probability: 100.0%
Output token 2: Technology, logprobs: -18.75, linear probability: 0.0%
" ], "text/plain": [ "" @@ -227,7 +227,7 @@ { "data": { "text/html": [ - "Output token 1: Politics, logprobs: -2.4584822e-06, linear probability: 100.0%
Output token 2: Technology, logprobs: -13.937503, linear probability: 0.0%
" + "Output token 1: Politics, logprobs: -3.1281633e-07, linear probability: 100.0%
Output token 2: Polit, logprobs: -16.0, linear probability: 0.0%
" ], "text/plain": [ "" @@ -249,7 +249,7 @@ { "data": { "text/html": [ - "Output token 1: Art, logprobs: -0.009169078, linear probability: 99.09%
Output token 2: Sports, logprobs: -4.696669, linear probability: 0.91%
" + "Output token 1: Art, logprobs: -0.028133942, linear probability: 97.23%
Output token 2: Sports, logprobs: -4.278134, linear probability: 1.39%
" ], "text/plain": [ "" @@ -272,7 +272,7 @@ " print(f\"\\nHeadline: {headline}\")\n", " API_RESPONSE = get_completion(\n", " [{\"role\": \"user\", \"content\": CLASSIFICATION_PROMPT.format(headline=headline)}],\n", - " model=\"gpt-4\",\n", + " model=\"gpt-4o-mini\",\n", " logprobs=True,\n", " top_logprobs=2,\n", " )\n", @@ -292,9 +292,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As expected from the first two headlines, `gpt-4` is nearly 100% confident in its classifications, as the content is clearly technology and politics focused respectively. However, the third headline combines both sports and art-related themes, so we see the model is less confident in its selection.\n", + "As expected from the first two headlines, gpt-4o-mini is 100% confident in its classifications, as the content is clearly technology and politics focused, respectively. However, the third headline combines both sports and art-related themes, resulting in slightly lower confidence at 97%, while still demonstrating strong certainty in its classification.\n", "\n", - "This shows how important using `logprobs` can be, as if we are using LLMs for classification tasks we can set confidence theshholds, or output several potential output tokens if the log probability of the selected output is not sufficiently high. For instance, if we are creating a recommendation engine to tag articles, we can automatically classify headlines crossing a certain threshold, and send the less certain headlines for manual review." + "`logprobs` are quite useful for classification tasks. They allow us to set confidence thresholds or output multiple potential tokens if the log probability of the selected output is not sufficiently high. For instance, when creating a recommendation engine to tag articles, we can automatically classify headlines that exceed a certain threshold and send less certain ones for manual review." 
] }, { @@ -320,7 +320,7 @@ }, { "cell_type": "code", - "execution_count": 270, + "execution_count": 36, "metadata": {}, "outputs": [], "source": [ @@ -355,7 +355,7 @@ }, { "cell_type": "code", - "execution_count": 271, + "execution_count": 61, "metadata": {}, "outputs": [], "source": [ @@ -368,13 +368,13 @@ }, { "cell_type": "code", - "execution_count": 272, + "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ - "Questions clearly answered in article

Question: What nationality was Ada Lovelace?

has_sufficient_context_for_answer: True, logprobs: -3.1281633e-07, linear probability: 100.0%

Question: What was an important finding from Lovelace's seventh note?

has_sufficient_context_for_answer: True, logprobs: -7.89631e-07, linear probability: 100.0%

Questions only partially covered in the article

Question: Did Lovelace collaborate with Charles Dickens

has_sufficient_context_for_answer: True, logprobs: -0.06993677, linear probability: 93.25%

Question: What concepts did Lovelace build with Charles Babbage

has_sufficient_context_for_answer: False, logprobs: -0.61807257, linear probability: 53.9%

" + "Questions clearly answered in article

Question: What nationality was Ada Lovelace?

has_sufficient_context_for_answer: True, logprobs: -3.1281633e-07, linear probability: 100.0%

Question: What was an important finding from Lovelace's seventh note?

has_sufficient_context_for_answer: True, logprobs: -7.89631e-07, linear probability: 100.0%

Questions only partially covered in the article

Question: Did Lovelace collaborate with Charles Dickens

has_sufficient_context_for_answer: False, logprobs: -0.008654992, linear probability: 99.14%

Question: What concepts did Lovelace build with Charles Babbage

has_sufficient_context_for_answer: True, logprobs: -0.004082317, linear probability: 99.59%

" ], "text/plain": [ "" @@ -398,7 +398,7 @@ " ),\n", " }\n", " ],\n", - " model=\"gpt-4\",\n", + " model=\"gpt-4o-mini\",\n", " logprobs=True,\n", " )\n", " html_output += f'

Question: {question}

'\n", @@ -417,7 +417,7 @@ " ),\n", " }\n", " ],\n", - " model=\"gpt-4\",\n", + " model=\"gpt-4o\",\n", " logprobs=True,\n", " top_logprobs=3,\n", " )\n", @@ -437,13 +437,6 @@ "This self-evaluation can help reduce hallucinations, as you can restrict answers or re-prompt the user when your `sufficient_context_for_answer` log probability is below a certain threshold. Methods like this have been shown to significantly reduce RAG for Q&A hallucinations and errors ([Example](https://jfan001.medium.com/how-we-cut-the-rate-of-gpt-hallucinations-from-20-to-less-than-2-f3bfcc10e4ec)) " ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, { "cell_type": "markdown", "metadata": {}, @@ -467,7 +460,7 @@ }, { "cell_type": "code", - "execution_count": 273, + "execution_count": 39, "metadata": {}, "outputs": [], "source": [ @@ -486,18 +479,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now, we can ask `gpt-3.5-turbo` to act as an autocomplete engine with whatever context the model is given. We can enable `logprobs` and can see how confident the model is in its prediction." + "Now, we can ask `gpt-4o-mini` to act as an autocomplete engine with whatever context the model is given. We can enable `logprobs` and can see how confident the model is in its prediction." ] }, { "cell_type": "code", - "execution_count": 274, + "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ - "

Sentence: My

Predicted next token: favorite, logprobs: -0.18245785, linear probability: 83.32%

Predicted next token: dog, logprobs: -2.397172, linear probability: 9.1%

Predicted next token: ap, logprobs: -3.8732424, linear probability: 2.08%


Sentence: My least

Predicted next token: favorite, logprobs: -0.0146376295, linear probability: 98.55%

Predicted next token: My, logprobs: -4.2417912, linear probability: 1.44%

Predicted next token: favorite, logprobs: -9.748788, linear probability: 0.01%


Sentence: My least favorite

Predicted next token: food, logprobs: -0.9481721, linear probability: 38.74%

Predicted next token: My, logprobs: -1.3447137, linear probability: 26.06%

Predicted next token: color, logprobs: -1.3887696, linear probability: 24.94%


Sentence: My least favorite TV

Predicted next token: show, logprobs: -0.0007898556, linear probability: 99.92%

Predicted next token: My, logprobs: -7.711523, linear probability: 0.04%

Predicted next token: series, logprobs: -9.348547, linear probability: 0.01%


Sentence: My least favorite TV show

Predicted next token: is, logprobs: -0.2851253, linear probability: 75.19%

Predicted next token: of, logprobs: -1.55335, linear probability: 21.15%

Predicted next token: My, logprobs: -3.4928775, linear probability: 3.04%


Sentence: My least favorite TV show is

Predicted next token: \"My, logprobs: -0.69349754, linear probability: 49.98%

Predicted next token: \"The, logprobs: -1.2899293, linear probability: 27.53%

Predicted next token: My, logprobs: -2.4170141, linear probability: 8.92%


Sentence: My least favorite TV show is Breaking Bad

Predicted next token: because, logprobs: -0.17786823, linear probability: 83.71%

Predicted next token: ,, logprobs: -2.3946173, linear probability: 9.12%

Predicted next token: ., logprobs: -3.1861975, linear probability: 4.13%


" + "

Sentence: My

Predicted next token: My, logprobs: -0.08344023, linear probability: 91.99%

Predicted next token: dog, logprobs: -3.3334403, linear probability: 3.57%

Predicted next token: ap, logprobs: -3.5834403, linear probability: 2.78%


Sentence: My least

Predicted next token: My, logprobs: -0.1271426, linear probability: 88.06%

Predicted next token: favorite, logprobs: -2.1271427, linear probability: 11.92%

Predicted next token: My, logprobs: -9.127143, linear probability: 0.01%


Sentence: My least favorite

Predicted next token: My, logprobs: -0.052905332, linear probability: 94.85%

Predicted next token: food, logprobs: -4.0529056, linear probability: 1.74%

Predicted next token: color, logprobs: -5.0529056, linear probability: 0.64%


Sentence: My least favorite TV

Predicted next token: show, logprobs: -0.57662326, linear probability: 56.18%

Predicted next token: My, logprobs: -0.82662326, linear probability: 43.75%

Predicted next token: show, logprobs: -8.201623, linear probability: 0.03%


Sentence: My least favorite TV show

Predicted next token: is, logprobs: -0.70817715, linear probability: 49.25%

Predicted next token: My, logprobs: -0.70817715, linear probability: 49.25%

Predicted next token: was, logprobs: -4.833177, linear probability: 0.8%


Sentence: My least favorite TV show is

Predicted next token: My, logprobs: -0.47896808, linear probability: 61.94%

Predicted next token: one, logprobs: -1.7289681, linear probability: 17.75%

Predicted next token: the, logprobs: -2.9789681, linear probability: 5.08%


Sentence: My least favorite TV show is Breaking Bad

Predicted next token: because, logprobs: -0.034502674, linear probability: 96.61%

Predicted next token: ,, logprobs: -3.7845027, linear probability: 2.27%

Predicted next token: because, logprobs: -5.0345025, linear probability: 0.65%


" ], "text/plain": [ "" @@ -516,7 +509,7 @@ " PROMPT = \"\"\"Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}\"\"\"\n", " API_RESPONSE = get_completion(\n", " [{\"role\": \"user\", \"content\": PROMPT.format(sentence=sentence)}],\n", - " model=\"gpt-3.5-turbo\",\n", + " model=\"gpt-4o-mini\",\n", " logprobs=True,\n", " top_logprobs=3,\n", " )\n", @@ -544,16 +537,16 @@ }, { "cell_type": "code", - "execution_count": 275, + "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'My least': 'favorite', 'My least favorite TV': 'show'}" + "{'My least favorite TV show is Breaking Bad': 'because'}" ] }, - "execution_count": 275, + "execution_count": 48, "metadata": {}, "output_type": "execute_result" } @@ -571,16 +564,16 @@ }, { "cell_type": "code", - "execution_count": 276, + "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'My least favorite': 'food', 'My least favorite TV show is': '\"My'}" + "{'My least favorite TV': 'show', 'My least favorite TV show': 'is'}" ] }, - "execution_count": 276, + "execution_count": 49, "metadata": {}, "output_type": "execute_result" } @@ -594,7 +587,7 @@ "metadata": {}, "source": [ "These are logical as well. It's pretty unclear what the user is going to say with just the prefix 'my least favorite', and it's really anyone's guess what the author's favorite TV show is.

\n", - "So, using `gpt-3.5-turbo`, we can create the root of a dynamic autocompletion engine with `logprobs`!" + "So, using `gpt-4o-mini`, we can create the root of a dynamic autocompletion engine with `logprobs`!" ] }, { @@ -613,14 +606,14 @@ }, { "cell_type": "code", - "execution_count": 277, + "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "PROMPT = \"\"\"What's the longest word in the English language?\"\"\"\n", "\n", "API_RESPONSE = get_completion(\n", - " [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4\", logprobs=True, top_logprobs=5\n", + " [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4o\", logprobs=True, top_logprobs=5\n", ")\n", "\n", "\n", @@ -650,13 +643,13 @@ }, { "cell_type": "code", - "execution_count": 278, + "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ - "The longest word in the English language, according to the Guinness World Records, is 'pneumonoultramicroscopicsilicovolcanoconiosis'. It is a type of lung disease caused by inhaling ash and sand dust." + "The longest word in the English language is often considered to be \"pneumonoultramicroscopicsilicovolcanoconiosis,\" a term referring to a type of lung disease caused by inhaling very fine silicate or quartz dust. However, it's worth noting that this word was coined more for its length than for practical use. There are also chemical names for proteins and other compounds that can be much longer, but they are typically not used in everyday language." 
], "text/plain": [ "" @@ -669,7 +662,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Total number of tokens: 51\n" + "Total number of tokens: 95\n" ] } ], @@ -686,16 +679,68 @@ }, { "cell_type": "code", - "execution_count": 279, + "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ + "Token: Here\n", + "Log prob: -0.054242473\n", + "Linear prob: 94.72 %\n", + "Bytes: [72, 101, 114, 101] \n", + "\n", + "Token: is\n", + "Log prob: -0.0044352207\n", + "Linear prob: 99.56 %\n", + "Bytes: [32, 105, 115] \n", + "\n", + "Token: the\n", + "Log prob: -2.1008714e-06\n", + "Linear prob: 100.0 %\n", + "Bytes: [32, 116, 104, 101] \n", + "\n", + "Token: blue\n", + "Log prob: -0.0013290489\n", + "Linear prob: 99.87 %\n", + "Bytes: [32, 98, 108, 117, 101] \n", + "\n", + "Token: heart\n", + "Log prob: 0.0\n", + "Linear prob: 100.0 %\n", + "Bytes: [32, 104, 101, 97, 114, 116] \n", + "\n", + "Token: emoji\n", + "Log prob: 0.0\n", + "Linear prob: 100.0 %\n", + "Bytes: [32, 101, 109, 111, 106, 105] \n", + "\n", + "Token: and\n", + "Log prob: -0.038287632\n", + "Linear prob: 96.24 %\n", + "Bytes: [32, 97, 110, 100] \n", + "\n", + "Token: its\n", + "Log prob: 0.0\n", + "Linear prob: 100.0 %\n", + "Bytes: [32, 105, 116, 115] \n", + "\n", + "Token: name\n", + "Log prob: -1.569009e-05\n", + "Linear prob: 100.0 %\n", + "Bytes: [32, 110, 97, 109, 101] \n", + "\n", + "Token: :\n", + "\n", + "\n", + "Log prob: -0.11313002\n", + "Linear prob: 89.3 %\n", + "Bytes: [58, 10, 10] \n", + "\n", "Token: \\xf0\\x9f\\x92\n", - "Log prob: -0.0003056686\n", - "Linear prob: 99.97 %\n", + "Log prob: -0.09048584\n", + "Linear prob: 91.35 %\n", "Bytes: [240, 159, 146] \n", "\n", "Token: \\x99\n", @@ -703,31 +748,28 @@ "Linear prob: 100.0 %\n", "Bytes: [153] \n", "\n", - "Token: -\n", - "Log prob: -0.0096905725\n", - "Linear prob: 99.04 %\n", - "Bytes: [32, 45] \n", - "\n", "Token: Blue\n", - "Log prob: -0.00042042506\n", - "Linear prob: 99.96 
%\n", + "Log prob: -0.023958502\n", + "Linear prob: 97.63 %\n", "Bytes: [32, 66, 108, 117, 101] \n", "\n", "Token: Heart\n", - "Log prob: -7.302705e-05\n", - "Linear prob: 99.99 %\n", + "Log prob: -6.2729996e-06\n", + "Linear prob: 100.0 %\n", "Bytes: [32, 72, 101, 97, 114, 116] \n", "\n", - "Bytes array: [240, 159, 146, 153, 32, 45, 32, 66, 108, 117, 101, 32, 72, 101, 97, 114, 116]\n", - "Decoded bytes: 💙 - Blue Heart\n", - "Joint prob: 98.96 %\n" + "Bytes array: [72, 101, 114, 101, 32, 105, 115, 32, 116, 104, 101, 32, 98, 108, 117, 101, 32, 104, 101, 97, 114, 116, 32, 101, 109, 111, 106, 105, 32, 97, 110, 100, 32, 105, 116, 115, 32, 110, 97, 109, 101, 58, 10, 10, 240, 159, 146, 153, 32, 66, 108, 117, 101, 32, 72, 101, 97, 114, 116]\n", + "Decoded bytes: Here is the blue heart emoji and its name:\n", + "\n", + "💙 Blue Heart\n", + "Joint prob: 72.19 %\n" ] } ], "source": [ "PROMPT = \"\"\"Output the blue heart emoji and its name.\"\"\"\n", "API_RESPONSE = get_completion(\n", - " [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4\", logprobs=True\n", + " [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4o\", logprobs=True\n", ")\n", "\n", "aggregated_bytes = []\n", @@ -771,71 +813,71 @@ "\n", "When looking to assess the model's confidence in a result, it can be useful to calculate perplexity, which is a measure of the uncertainty. Perplexity can be calculated by exponentiating the negative of the average of the logprobs. Generally, a higher perplexity indicates a more uncertain result, and a lower perplexity indicates a more confident result. As such, perplexity can be used to both assess the result of an individual model run and also to compare the relative confidence of results between model runs. 
While a high confidence doesn't guarantee result accuracy, it can be a helpful signal that can be paired with other evaluation metrics to build a better understanding of your prompt's behavior.\n", "\n", - "For example, let's say that I want to use `gpt-3.5-turbo` to learn more about artificial intelligence. I could ask a question about recent history and a question about the future:" + "For example, let's say that I want to use `gpt-4o-mini` to learn more about artificial intelligence. I could ask a question about recent history and a question about the future:" ] }, { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Prompt: In a short sentence, has artifical intelligence grown in the last decade?\n", - "Response: Yes, artificial intelligence has grown significantly in the last decade. \n", - "\n", - "Tokens: Yes , artificial intelligence has grown significantly in the last decade .\n", - "Logprobs: -0.00 -0.00 -0.00 -0.00 -0.00 -0.53 -0.11 -0.00 -0.00 -0.01 -0.00 -0.00\n", - "Perplexity: 1.0564125277713383 \n", - "\n", - "Prompt: In a short sentence, what are your thoughts on the future of artificial intelligence?\n", - "Response: The future of artificial intelligence holds great potential for transforming industries and improving efficiency, but also raises ethical and societal concerns that must be carefully addressed. 
\n", - "\n", - "Tokens: The future of artificial intelligence holds great potential for transforming industries and improving efficiency , but also raises ethical and societal concerns that must be carefully addressed .\n", - "Logprobs: -0.19 -0.03 -0.00 -0.00 -0.00 -0.30 -0.51 -0.24 -0.03 -1.45 -0.23 -0.03 -0.22 -0.83 -0.48 -0.01 -0.38 -0.07 -0.47 -0.63 -0.18 -0.26 -0.01 -0.14 -0.00 -0.59 -0.55 -0.00\n", - "Perplexity: 1.3220795252314004 \n", - "\n" - ] - } - ], - "source": [ - "prompts = [\n", - " \"In a short sentence, has artifical intelligence grown in the last decade?\",\n", - " \"In a short sentence, what are your thoughts on the future of artificial intelligence?\",\n", - "]\n", - "\n", - "for prompt in prompts:\n", - " API_RESPONSE = get_completion(\n", - " [{\"role\": \"user\", \"content\": prompt}],\n", - " model=\"gpt-3.5-turbo\",\n", - " logprobs=True,\n", - " )\n", - "\n", - " logprobs = [token.logprob for token in API_RESPONSE.choices[0].logprobs.content]\n", - " response_text = API_RESPONSE.choices[0].message.content\n", - " response_text_tokens = [token.token for token in API_RESPONSE.choices[0].logprobs.content]\n", - " max_starter_length = max(len(s) for s in [\"Prompt:\", \"Response:\", \"Tokens:\", \"Logprobs:\", \"Perplexity:\"])\n", - " max_token_length = max(len(s) for s in response_text_tokens)\n", - " \n", - "\n", - " formatted_response_tokens = [s.rjust(max_token_length) for s in response_text_tokens]\n", - " formatted_lps = [f\"{lp:.2f}\".rjust(max_token_length) for lp in logprobs]\n", - "\n", - " perplexity_score = np.exp(-np.mean(logprobs))\n", - " print(\"Prompt:\".ljust(max_starter_length), prompt)\n", - " print(\"Response:\".ljust(max_starter_length), response_text, \"\\n\")\n", - " print(\"Tokens:\".ljust(max_starter_length), \" \".join(formatted_response_tokens))\n", - " print(\"Logprobs:\".ljust(max_starter_length), \" \".join(formatted_lps))\n", - " print(\"Perplexity:\".ljust(max_starter_length), perplexity_score, \"\\n\")" - ] 
- }, + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Prompt: In a short sentence, has artificial intelligence grown in the last decade?\n", + "Response: Yes, artificial intelligence has grown significantly in the last decade, advancing in capabilities and applications across various fields. \n", + "\n", + "Tokens: Yes , artificial intelligence has grown significantly in the last decade , advancing in capabilities and applications across various fields .\n", + "Logprobs: -0.00 0.00 -0.00 0.00 -0.00 -0.73 -0.00 -0.01 -0.02 -0.00 0.00 -0.02 -0.66 -0.03 -0.62 -0.47 -0.02 -0.39 -0.01 -0.20 -0.00\n", + "Perplexity: 1.1644170003987546 \n", + "\n", + "Prompt: In a short sentence, what are your thoughts on the future of artificial intelligence?\n", + "Response: The future of artificial intelligence holds immense potential for transformative advancements across various sectors, but it also requires careful consideration of ethical and societal impacts. 
\n", + "\n", + "Tokens: The future of artificial intelligence holds immense potential for transformative advancements across various sectors , but it also requires careful consideration of ethical and societal impacts .\n", + "Logprobs: -0.02 -0.00 0.00 -0.00 0.00 -0.05 -0.35 -0.01 -0.02 -0.64 -0.43 -0.25 -0.16 -0.51 -0.02 -0.43 -0.08 -0.07 -0.97 -0.02 -0.48 -0.00 -0.00 -0.48 -0.01 -0.58 -0.00\n", + "Perplexity: 1.2292170270768858 \n", + "\n" + ] + } + ], + "source": [ + "prompts = [\n", + " \"In a short sentence, has artifical intelligence grown in the last decade?\",\n", + " \"In a short sentence, what are your thoughts on the future of artificial intelligence?\",\n", + "]\n", + "\n", + "for prompt in prompts:\n", + " API_RESPONSE = get_completion(\n", + " [{\"role\": \"user\", \"content\": prompt}],\n", + " model=\"gpt-4o-mini\",\n", + " logprobs=True,\n", + " )\n", + "\n", + " logprobs = [token.logprob for token in API_RESPONSE.choices[0].logprobs.content]\n", + " response_text = API_RESPONSE.choices[0].message.content\n", + " response_text_tokens = [token.token for token in API_RESPONSE.choices[0].logprobs.content]\n", + " max_starter_length = max(len(s) for s in [\"Prompt:\", \"Response:\", \"Tokens:\", \"Logprobs:\", \"Perplexity:\"])\n", + " max_token_length = max(len(s) for s in response_text_tokens)\n", + " \n", + "\n", + " formatted_response_tokens = [s.rjust(max_token_length) for s in response_text_tokens]\n", + " formatted_lps = [f\"{lp:.2f}\".rjust(max_token_length) for lp in logprobs]\n", + "\n", + " perplexity_score = np.exp(-np.mean(logprobs))\n", + " print(\"Prompt:\".ljust(max_starter_length), prompt)\n", + " print(\"Response:\".ljust(max_starter_length), response_text, \"\\n\")\n", + " print(\"Tokens:\".ljust(max_starter_length), \" \".join(formatted_response_tokens))\n", + " print(\"Logprobs:\".ljust(max_starter_length), \" \".join(formatted_lps))\n", + " print(\"Perplexity:\".ljust(max_starter_length), perplexity_score, \"\\n\")" + ] + }, { 
"cell_type": "markdown", "metadata": {}, "source": [ - "In this example, `gpt-3.5-turbo` returned a lower perplexity score for a more deterministic question about recent history, and a higher perplexity score for a more speculative assessment about the near future. Again, while these differences don't guarantee accuracy, they help point the way for our interpretation of the model's results and our future use of them." + "In this example, `gpt-4o-mini` returned a lower perplexity score for a more deterministic question about recent history, and a higher perplexity score for a more speculative assessment about the near future. Again, while these differences don't guarantee accuracy, they help point the way for our interpretation of the model's results and our future use of them." ] }, {