"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"html_output = \"\"\n",
"html_output += \"Questions clearly answered in article\"\n",
"\n",
"for question in easy_questions:\n",
" API_RESPONSE = get_completion(\n",
" [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": PROMPT.format(\n",
" article=ada_lovelace_article, question=question\n",
" ),\n",
" }\n",
" ],\n",
" model=\"gpt-4\",\n",
" logprobs=True,\n",
" )\n",
" html_output += f'Question: {question}
'\n",
" for logprob in API_RESPONSE.choices[0].logprobs.content:\n",
" html_output += f'has_sufficient_context_for_answer: {logprob.token}, logprobs: {logprob.logprob}, linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%
'\n",
"\n",
"html_output += \"Questions only partially covered in the article\"\n",
"\n",
"for question in medium_questions:\n",
" API_RESPONSE = get_completion(\n",
" [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": PROMPT.format(\n",
" article=ada_lovelace_article, question=question\n",
" ),\n",
" }\n",
" ],\n",
" model=\"gpt-4\",\n",
" logprobs=True,\n",
" top_logprobs=3,\n",
" )\n",
" html_output += f'Question: {question}
'\n",
" for logprob in API_RESPONSE.choices[0].logprobs.content:\n",
" html_output += f'has_sufficient_context_for_answer: {logprob.token}, logprobs: {logprob.logprob}, linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%
'\n",
"\n",
"display(HTML(html_output))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the first two questions, our model asserts with (near) 100% confidence that the article has sufficient context to answer the posed questions.
\n",
"On the other hand, for the more tricky questions which are less clearly answered in the article, the model is less confident that it has sufficient context. This is a great guardrail to help ensure our retrieved content is sufficient.
\n",
"This self-evaluation can help reduce hallucinations, as you can restrict answers or re-prompt the user when your `sufficient_context_for_answer` log probability is below a certain threshold. Methods like this have been shown to significantly reduce RAG for Q&A hallucinations and errors ([Example]((https://jfan001.medium.com/how-we-cut-the-rate-of-gpt-hallucinations-from-20-to-less-than-2-f3bfcc10e4ec)))"
]
},
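 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "Here is a minimal sketch of that guardrail in code (an illustration, not part of the original recipe): a hypothetical `answer_with_guardrail` helper that only proceeds when the model outputs `True` with high confidence. The 0.98 threshold is an arbitrary choice, and it assumes the section's `PROMPT` asks for a single `True`/`False` token."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "def answer_with_guardrail(article, question, threshold=0.98):\n",
   "    # Ask the model whether the article gives sufficient context for the question.\n",
   "    response = get_completion(\n",
   "        [\n",
   "            {\n",
   "                \"role\": \"user\",\n",
   "                \"content\": PROMPT.format(article=article, question=question),\n",
   "            }\n",
   "        ],\n",
   "        model=\"gpt-4\",\n",
   "        logprobs=True,\n",
   "    )\n",
   "    first_token = response.choices[0].logprobs.content[0]\n",
   "    confidence = np.exp(first_token.logprob)  # convert logprob to a linear probability\n",
   "    # Only proceed when the model answers 'True' with high confidence; otherwise abstain.\n",
   "    if first_token.token.strip().lower() == \"true\" and confidence >= threshold:\n",
   "        return f\"Sufficient context (p = {confidence:.2%}); safe to answer.\"\n",
   "    return \"Low confidence that the article covers this question; abstaining.\"\n",
   "\n",
   "\n",
   "print(answer_with_guardrail(ada_lovelace_article, easy_questions[0]))"
  ]
 },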
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Autocomplete"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another use case for `logprobs` are autocomplete systems. Without creating the entire autocomplete system end-to-end, let's demonstrate how `logprobs` could help us decide how to suggest words as a user is typing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's come up with a sample sentence: `\"My least favorite TV show is Breaking Bad.\"` Let's say we want it to dynamically recommend the next word or token as we are typing the sentence, but *only* if the model is quite sure of what the next word will be. To demonstrate this, let's break up the sentence into sequential components."
]
},
{
"cell_type": "code",
"execution_count": 273,
"metadata": {},
"outputs": [],
"source": [
"sentence_list = [\n",
" \"My\",\n",
" \"My least\",\n",
" \"My least favorite\",\n",
" \"My least favorite TV\",\n",
" \"My least favorite TV show\",\n",
" \"My least favorite TV show is\",\n",
" \"My least favorite TV show is Breaking Bad\",\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can ask `gpt-3.5-turbo` to act as an autocomplete engine with whatever context the model is given. We can enable `logprobs` and can see how confident the model is in its prediction."
]
},
{
"cell_type": "code",
"execution_count": 274,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Sentence: My
Predicted next token: favorite, logprobs: -0.18245785, linear probability: 83.32%
Predicted next token: dog, logprobs: -2.397172, linear probability: 9.1%
Predicted next token: ap, logprobs: -3.8732424, linear probability: 2.08%
Sentence: My least
Predicted next token: favorite, logprobs: -0.0146376295, linear probability: 98.55%
Predicted next token: My, logprobs: -4.2417912, linear probability: 1.44%
Predicted next token: favorite, logprobs: -9.748788, linear probability: 0.01%
Sentence: My least favorite
Predicted next token: food, logprobs: -0.9481721, linear probability: 38.74%
Predicted next token: My, logprobs: -1.3447137, linear probability: 26.06%
Predicted next token: color, logprobs: -1.3887696, linear probability: 24.94%
Sentence: My least favorite TV
Predicted next token: show, logprobs: -0.0007898556, linear probability: 99.92%
Predicted next token: My, logprobs: -7.711523, linear probability: 0.04%
Predicted next token: series, logprobs: -9.348547, linear probability: 0.01%
Sentence: My least favorite TV show
Predicted next token: is, logprobs: -0.2851253, linear probability: 75.19%
Predicted next token: of, logprobs: -1.55335, linear probability: 21.15%
Predicted next token: My, logprobs: -3.4928775, linear probability: 3.04%
Sentence: My least favorite TV show is
Predicted next token: \"My, logprobs: -0.69349754, linear probability: 49.98%
Predicted next token: \"The, logprobs: -1.2899293, linear probability: 27.53%
Predicted next token: My, logprobs: -2.4170141, linear probability: 8.92%
Sentence: My least favorite TV show is Breaking Bad
Predicted next token: because, logprobs: -0.17786823, linear probability: 83.71%
Predicted next token: ,, logprobs: -2.3946173, linear probability: 9.12%
Predicted next token: ., logprobs: -3.1861975, linear probability: 4.13%
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"high_prob_completions = {}\n",
"low_prob_completions = {}\n",
"html_output = \"\"\n",
"\n",
"for sentence in sentence_list:\n",
" PROMPT = \"\"\"Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}\"\"\"\n",
" API_RESPONSE = get_completion(\n",
" [{\"role\": \"user\", \"content\": PROMPT.format(sentence=sentence)}],\n",
" model=\"gpt-3.5-turbo\",\n",
" logprobs=True,\n",
" top_logprobs=3,\n",
" )\n",
" html_output += f'Sentence: {sentence}
'\n",
" first_token = True\n",
" for token in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs:\n",
" html_output += f'Predicted next token: {token.token}, logprobs: {token.logprob}, linear probability: {np.round(np.exp(token.logprob)*100,2)}%
'\n",
" if first_token:\n",
" if np.exp(token.logprob) > 0.95:\n",
" high_prob_completions[sentence] = token.token\n",
" if np.exp(token.logprob) < 0.60:\n",
" low_prob_completions[sentence] = token.token\n",
" first_token = False\n",
" html_output += \"
\"\n",
"\n",
"display(HTML(html_output))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the high confidence autocompletions:"
]
},
{
"cell_type": "code",
"execution_count": 275,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'My least': 'favorite', 'My least favorite TV': 'show'}"
]
},
"execution_count": 275,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"high_prob_completions\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These look reasonable! We can feel confident in those suggestions. It's pretty likely you want to write 'show' after writing 'My least favorite TV'! Now let's look at the autocompletion suggestions the model was less confident about:"
]
},
{
"cell_type": "code",
"execution_count": 276,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'My least favorite': 'food', 'My least favorite TV show is': '\"My'}"
]
},
"execution_count": 276,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"low_prob_completions\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These are logical as well. It's pretty unclear what the user is going to say with just the prefix 'my least favorite', and it's really anyone's guess what the author's favorite TV show is.
\n",
"So, using `gpt-3.5-turbo`, we can create the root of a dynamic autocompletion engine with `logprobs`!"
]
},
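 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "Here is a minimal sketch of that idea (an illustration, not part of the original recipe): a hypothetical `suggest_next_word` helper that only surfaces a suggestion when the model's top next-token probability clears a threshold. The 0.95 cutoff is an arbitrary choice, and the prompt mirrors the one used above."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "def suggest_next_word(prefix, threshold=0.95):\n",
   "    # Ask the model to continue the prefix, requesting log probabilities for the first token.\n",
   "    prompt = (\n",
   "        \"Complete this sentence. You are acting as auto-complete. \"\n",
   "        \"Simply complete the sentence to the best of your ability, \"\n",
   "        f\"make sure it is just ONE sentence: {prefix}\"\n",
   "    )\n",
   "    response = get_completion(\n",
   "        [{\"role\": \"user\", \"content\": prompt}],\n",
   "        model=\"gpt-3.5-turbo\",\n",
   "        logprobs=True,\n",
   "        top_logprobs=1,\n",
   "    )\n",
   "    top = response.choices[0].logprobs.content[0].top_logprobs[0]\n",
   "    # Only surface a suggestion when the model is confident about the next token.\n",
   "    if np.exp(top.logprob) >= threshold:\n",
   "        return top.token.strip()\n",
   "    return None  # stay quiet rather than distract the user with a low-confidence guess\n",
   "\n",
   "\n",
   "for prefix in [\"My least\", \"My least favorite\"]:\n",
   "    print(prefix, \"->\", suggest_next_word(prefix))"
  ]
 },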
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Highlighter and bytes parameter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's quickly touch on creating a simple token highlighter with `logprobs`, and using the bytes parameter. First, we can create a function that counts and highlights each token. While this doesn't use the log probabilities, it uses the built in tokenization that comes with enabling `logprobs`."
]
},
{
"cell_type": "code",
"execution_count": 277,
"metadata": {},
"outputs": [],
"source": [
"PROMPT = \"\"\"What's the longest word in the English language?\"\"\"\n",
"\n",
"API_RESPONSE = get_completion(\n",
" [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4\", logprobs=True, top_logprobs=5\n",
")\n",
"\n",
"\n",
"def highlight_text(api_response):\n",
" colors = [\n",
" \"#FF00FF\", # Magenta\n",
" \"#008000\", # Green\n",
" \"#FF8C00\", # Dark Orange\n",
" \"#FF0000\", # Red\n",
" \"#0000FF\", # Blue\n",
" ]\n",
" tokens = api_response.choices[0].logprobs.content\n",
"\n",
" color_idx = 0 # Initialize color index\n",
" html_output = \"\" # Initialize HTML output\n",
" for t in tokens:\n",
" token_str = bytes(t.bytes).decode(\"utf-8\") # Decode bytes to string\n",
"\n",
" # Add colored token to HTML output\n",
" html_output += f\"{token_str}\"\n",
"\n",
" # Move to the next color\n",
" color_idx = (color_idx + 1) % len(colors)\n",
" display(HTML(html_output)) # Display HTML output\n",
" print(f\"Total number of tokens: {len(tokens)}\")"
]
},
{
"cell_type": "code",
"execution_count": 278,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"The longest word in the English language, according to the Guinness World Records, is 'pneumonoultramicroscopicsilicovolcanoconiosis'. It is a type of lung disease caused by inhaling ash and sand dust."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of tokens: 51\n"
]
}
],
"source": [
"highlight_text(API_RESPONSE)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's reconstruct a sentence using the bytes parameter. With `logprobs` enabled, we are given both each token and the ASCII (decimal utf-8) values of the token string. These ASCII values can be helpful when handling tokens of or containing emojis or special characters."
]
},
{
"cell_type": "code",
"execution_count": 279,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Token: \\xf0\\x9f\\x92\n",
"Log prob: -0.0003056686\n",
"Linear prob: 99.97 %\n",
"Bytes: [240, 159, 146] \n",
"\n",
"Token: \\x99\n",
"Log prob: 0.0\n",
"Linear prob: 100.0 %\n",
"Bytes: [153] \n",
"\n",
"Token: -\n",
"Log prob: -0.0096905725\n",
"Linear prob: 99.04 %\n",
"Bytes: [32, 45] \n",
"\n",
"Token: Blue\n",
"Log prob: -0.00042042506\n",
"Linear prob: 99.96 %\n",
"Bytes: [32, 66, 108, 117, 101] \n",
"\n",
"Token: Heart\n",
"Log prob: -7.302705e-05\n",
"Linear prob: 99.99 %\n",
"Bytes: [32, 72, 101, 97, 114, 116] \n",
"\n",
"Bytes array: [240, 159, 146, 153, 32, 45, 32, 66, 108, 117, 101, 32, 72, 101, 97, 114, 116]\n",
"Decoded bytes: 💙 - Blue Heart\n",
"Joint prob: 98.96 %\n"
]
}
],
"source": [
"PROMPT = \"\"\"Output the blue heart emoji and its name.\"\"\"\n",
"API_RESPONSE = get_completion(\n",
" [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4\", logprobs=True\n",
")\n",
"\n",
"aggregated_bytes = []\n",
"joint_logprob = 0.0\n",
"\n",
"# Iterate over tokens, aggregate bytes and calculate joint logprob\n",
"for token in API_RESPONSE.choices[0].logprobs.content:\n",
" print(\"Token:\", token.token)\n",
" print(\"Log prob:\", token.logprob)\n",
" print(\"Linear prob:\", np.round(exp(token.logprob) * 100, 2), \"%\")\n",
" print(\"Bytes:\", token.bytes, \"\\n\")\n",
" aggregated_bytes += token.bytes\n",
" joint_logprob += token.logprob\n",
"\n",
"# Decode the aggregated bytes to text\n",
"aggregated_text = bytes(aggregated_bytes).decode(\"utf-8\")\n",
"\n",
"# Assert that the decoded text is the same as the message content\n",
"assert API_RESPONSE.choices[0].message.content == aggregated_text\n",
"\n",
"# Print the results\n",
"print(\"Bytes array:\", aggregated_bytes)\n",
"print(f\"Decoded bytes: {aggregated_text}\")\n",
"print(\"Joint prob:\", np.round(exp(joint_logprob) * 100, 2), \"%\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we see that while the first token was `\\xf0\\x9f\\x92'`, we can get its ASCII value and append it to a bytes array. Then, we can easily decode this array into a full sentence, and validate with our assert statement that the decoded bytes is the same as our completion message!\n",
"\n",
"Additionally, we can get the joint probability of the entire completion, which is the exponentiated product of each token's log probability. This gives us how `likely` this given completion is given the prompt. Since, our prompt is quite directive (asking for a certain emoji and its name), the joint probability of this output is high! If we ask for a random output however, we'll see a much lower joint probability. This can also be a good tactic for developers during prompt engineering. "
]
},
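 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "As a minimal sketch of that tactic (an illustration, not part of the original recipe), we could wrap the joint-probability calculation in a hypothetical `completion_joint_probability` helper and compare a directive prompt against a more open-ended one. We'd generally expect the directive prompt to score much higher."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "def completion_joint_probability(api_response):\n",
   "    # Sum the per-token log probabilities, then exponentiate to get the joint probability.\n",
   "    total_logprob = sum(t.logprob for t in api_response.choices[0].logprobs.content)\n",
   "    return np.exp(total_logprob)\n",
   "\n",
   "\n",
   "directive = get_completion(\n",
   "    [{\"role\": \"user\", \"content\": \"Output the blue heart emoji and its name.\"}],\n",
   "    model=\"gpt-4\",\n",
   "    logprobs=True,\n",
   ")\n",
   "open_ended = get_completion(\n",
   "    [{\"role\": \"user\", \"content\": \"Write a one-sentence story.\"}],\n",
   "    model=\"gpt-4\",\n",
   "    logprobs=True,\n",
   ")\n",
   "\n",
   "# We'd generally expect the directive prompt to yield a much higher joint probability.\n",
   "print(\"Directive prompt joint probability:\", np.round(completion_joint_probability(directive) * 100, 2), \"%\")\n",
   "print(\"Open-ended prompt joint probability:\", np.round(completion_joint_probability(open_ended) * 100, 2), \"%\")"
  ]
 },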
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nice! We were able to use the `logprobs` parameter to build a more robust classifier, evaluate our retrieval for Q&A system, and encode and decode each 'byte' of our tokens! `logprobs` adds useful information and signal to our completions output, and we are excited to see how developers incorporate it to improve applications."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Possible extensions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many other use cases for `logprobs` that are not covered in this cookbook. We can use `logprobs` for:\n",
" - Evaluations (e.g.: calculate `perplexity` of outputs, which is the evaluation metric of uncertainty or surprise of the model at its outcomes)\n",
" - Moderation\n",
" - Keyword selection\n",
" - Improve prompts and interpretability of outputs\n",
" - Token healing\n",
" - and more!"
]
}
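,
 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "For example, here is a minimal sketch of the perplexity idea (an illustration, not part of the original recipe): perplexity is the exponential of the negative mean token log probability, so lower values mean the model was less surprised by its own output."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "def perplexity(api_response):\n",
   "    # Collect the log probability of each generated token.\n",
   "    logprobs = [t.logprob for t in api_response.choices[0].logprobs.content]\n",
   "    # Perplexity = exp(-mean(logprobs)); values near 1 mean the model was confident throughout.\n",
   "    return np.exp(-np.mean(logprobs))\n",
   "\n",
   "\n",
   "# API_RESPONSE here is the blue heart completion from the bytes section above.\n",
   "print(\"Perplexity of the last completion:\", np.round(perplexity(API_RESPONSE), 3))"
  ]
 }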
],
"metadata": {
"kernelspec": {
"display_name": "openai",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 2
}