Add evalsapi notebooks (#1762)

josiah-openai 2025-04-08 09:50:54 -07:00 committed by GitHub
parent 9500f506cd
commit e82d689fc6
6 changed files with 1370 additions and 1 deletion

View File

@@ -267,3 +267,8 @@ erikakettleson-openai:
name: "Erika Kettleson"
website: "https://www.linkedin.com/in/erika-kettleson-85763196/"
avatar: "https://avatars.githubusercontent.com/u/186107044?v=4"
josiah-openai:
name: "Josiah Grace"
website: "https://www.linkedin.com/in/josiahbgrace"
avatar: "https://avatars.githubusercontent.com/u/181146311?v=4"

View File

@@ -21,6 +21,9 @@
}
},
"source": [
"**Note: OpenAI now has a hosted evals product with an API! We recommend you use this instead.\n",
"See [Evals](https://platform.openai.com/docs/guides/evals)**\n",
"\n",
"The [OpenAI Evals](https://github.com/openai/evals/tree/main) framework consists of\n", "The [OpenAI Evals](https://github.com/openai/evals/tree/main) framework consists of\n",
"1. A framework to evaluate an LLM (large language model) or a system built on top of an LLM.\n", "1. A framework to evaluate an LLM (large language model) or a system built on top of an LLM.\n",
"2. An open-source registry of challenging evals\n", "2. An open-source registry of challenging evals\n",
@@ -419,7 +422,7 @@
"text": [ "text": [
"[2024-03-26 19:44:39,836] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/evals\n", "[2024-03-26 19:44:39,836] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/evals\n",
"[2024-03-26 19:44:43,623] [registry.py:257] Loading registry from /Users/shyamal/.evals/evals\n", "[2024-03-26 19:44:43,623] [registry.py:257] Loading registry from /Users/shyamal/.evals/evals\n",
"[2024-03-26 19:44:43,635] [oaieval.py:189] \u001B[1;35mRun started: 240327024443FACXGMKA\u001B[0m\n", "[2024-03-26 19:44:43,635] [oaieval.py:189] \u001b[1;35mRun started: 240327024443FACXGMKA\u001b[0m\n",
"[2024-03-26 19:44:43,663] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/modelgraded\n", "[2024-03-26 19:44:43,663] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/modelgraded\n",
"[2024-03-26 19:44:43,851] [registry.py:257] Loading registry from /Users/shyamal/.evals/modelgraded\n", "[2024-03-26 19:44:43,851] [registry.py:257] Loading registry from /Users/shyamal/.evals/modelgraded\n",
"[2024-03-26 19:44:43,853] [data.py:90] Fetching /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/data/sql/spider_sql.jsonl\n", "[2024-03-26 19:44:43,853] [data.py:90] Fetching /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/data/sql/spider_sql.jsonl\n",

View File

@@ -0,0 +1,453 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluations Example: Push Notifications Bulk Experimentation \n",
"\n",
"Evals are **task oriented** and iterative, they're the best way to check how your LLM integration is doing and improve it.\n",
"\n",
"In the following eval, we are going to focus on the task of **testing many variants of models and prompts**.\n",
"\n",
"Our use-case is:\n",
"1. I want to get the best possible performance out of my push notifications summarizer\n",
"\n",
"## Evals structure\n",
"\n",
"Evals have two parts, the \"Eval\" and the \"Run\". An \"Eval\" holds the configuration for your testing criteria and the structure of the data for your \"Runs\". An Eval `has_many` runs, that are evaluated by your testing criteria."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pydantic\n",
"import openai\n",
"from openai.types.chat import ChatCompletion\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = os.environ.get(\"OPENAI_API_KEY\", \"your-api-key\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use-case\n",
"\n",
"We're testing the following integration, a push notifications summarizer, which takes in multiple push notifications and collapses them into a single message."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class PushNotifications(pydantic.BaseModel):\n",
" notifications: str\n",
"\n",
"print(PushNotifications.model_json_schema())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DEVELOPER_PROMPT = \"\"\"\n",
"You are a helpful assistant that summarizes push notifications.\n",
"You are given a list of push notifications and you need to collapse them into a single one.\n",
"Output only the final summary, nothing else.\n",
"\"\"\"\n",
"\n",
"def summarize_push_notification(push_notifications: str) -> ChatCompletion:\n",
" result = openai.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[\n",
" {\"role\": \"developer\", \"content\": DEVELOPER_PROMPT},\n",
" {\"role\": \"user\", \"content\": push_notifications},\n",
" ],\n",
" )\n",
" return result\n",
"\n",
"example_push_notifications_list = PushNotifications(notifications=\"\"\"\n",
"- Alert: Unauthorized login attempt detected.\n",
"- New comment on your blog post: \"Great insights!\"\n",
"- Tonight's dinner recipe: Pasta Primavera.\n",
"\"\"\")\n",
"result = summarize_push_notification(example_push_notifications_list.notifications)\n",
"print(result.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setting up your eval\n",
"\n",
"An Eval holds the configuration that is shared across multiple *Runs*, it has two components:\n",
"1. Data source configuration `data_source_config` - the schema (columns) that your future *Runs* conform to.\n",
" - The `data_source_config` uses JSON Schema to define what variables are available in the Eval.\n",
"2. Testing Criteria `testing_criteria` - How you'll determine if your integration is working for each *row* of your data source.\n",
"\n",
"For this use-case, we want to test if the push notification summary completion is good, so we'll set-up our eval with this in mind."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# We want our input data to be available in our variables, so we set the item_schema to\n",
"# PushNotifications.model_json_schema()\n",
"data_source_config = {\n",
" \"type\": \"custom\",\n",
" \"item_schema\": PushNotifications.model_json_schema(),\n",
" # We're going to be uploading completions from the API, so we tell the Eval to expect this\n",
" \"include_sample_schema\": True,\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This data_source_config defines what variables are available throughout the eval.\n",
"\n",
"This item schema:\n",
"```json\n",
"{\n",
" \"properties\": {\n",
" \"notifications\": {\n",
" \"title\": \"Notifications\",\n",
" \"type\": \"string\"\n",
" }\n",
" },\n",
" \"required\": [\"notifications\"],\n",
" \"title\": \"PushNotifications\",\n",
" \"type\": \"object\"\n",
"}\n",
"```\n",
"Means that we'll have the variable `{{item.notifications}}` available in our eval.\n",
"\n",
"`\"include_sample_schema\": True`\n",
"Mean's that we'll have the variable `{{sample.output_text}}` available in our eval.\n",
"\n",
"**Now, we'll use those variables to set up our test criteria.**"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"GRADER_DEVELOPER_PROMPT = \"\"\"\n",
"Categorize the following push notification summary into the following categories:\n",
"1. concise-and-snappy\n",
"2. drops-important-information\n",
"3. verbose\n",
"4. unclear\n",
"5. obscures-meaning\n",
"6. other \n",
"\n",
"You'll be given the original list of push notifications and the summary like this:\n",
"\n",
"<push_notifications>\n",
"...notificationlist...\n",
"</push_notifications>\n",
"<summary>\n",
"...summary...\n",
"</summary>\n",
"\n",
"You should only pick one of the categories above, pick the one which most closely matches and why.\n",
"\"\"\"\n",
"GRADER_TEMPLATE_PROMPT = \"\"\"\n",
"<push_notifications>{{item.notifications}}</push_notifications>\n",
"<summary>{{sample.output_text}}</summary>\n",
"\"\"\"\n",
"push_notification_grader = {\n",
" \"name\": \"Push Notification Summary Grader\",\n",
" \"type\": \"label_model\",\n",
" \"model\": \"o3-mini\",\n",
" \"input\": [\n",
" {\n",
" \"role\": \"developer\",\n",
" \"content\": GRADER_DEVELOPER_PROMPT,\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": GRADER_TEMPLATE_PROMPT,\n",
" },\n",
" ],\n",
" \"passing_labels\": [\"concise-and-snappy\"],\n",
" \"labels\": [\n",
" \"concise-and-snappy\",\n",
" \"drops-important-information\",\n",
" \"verbose\",\n",
" \"unclear\",\n",
" \"obscures-meaning\",\n",
" \"other\",\n",
" ],\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `push_notification_grader` is a model grader (llm-as-a-judge) which looks at the input `{{item.notifications}}` and the generated summary `{{sample.output_text}}` and labels it as \"correct\" or \"incorrect\"\n",
"We then instruct via the \"passing_labels\" what constitutes a passing answer.\n",
"\n",
"Note: under the hood, this uses structured outputs so that labels are always valid.\n",
"\n",
"**Now we'll create our eval, and start adding data to it!**"
]
},
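{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustrative sketch (the Evals API does this for you under the hood), a label grader can constrain its output with Structured Outputs by declaring the label as a JSON Schema enum:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: roughly how a label grader can constrain its output\n",
"# with Structured Outputs. The Evals API handles this for you; you don't need to.\n",
"GRADER_RESPONSE_SCHEMA = {\n",
"    \"type\": \"object\",\n",
"    \"properties\": {\n",
"        # The enum guarantees the grader can only emit one of our labels.\n",
"        \"label\": {\"type\": \"string\", \"enum\": push_notification_grader[\"labels\"]},\n",
"    },\n",
"    \"required\": [\"label\"],\n",
"    \"additionalProperties\": False,\n",
"}\n",
"print(GRADER_RESPONSE_SCHEMA)"
]
},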
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"eval_create_result = openai.evals.create(\n",
" name=\"Push Notification Bulk Experimentation Eval\",\n",
" metadata={\n",
" \"description\": \"This eval tests many prompts and models to find the best performing combination.\",\n",
" },\n",
" data_source_config=data_source_config,\n",
" testing_criteria=[push_notification_grader],\n",
")\n",
"eval_id = eval_create_result.id"
]
},
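{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (assuming your installed SDK version exposes `openai.evals.retrieve`), you can fetch the eval you just created:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: fetch the eval back by id.\n",
"# Assumes the installed openai SDK exposes evals.retrieve.\n",
"fetched_eval = openai.evals.retrieve(eval_id)\n",
"print(fetched_eval.name)"
]
},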
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating runs\n",
"\n",
"Now that we have our eval set-up with our testing_criteria, we can start to add a bunch of runs!\n",
"We'll start with some push notification data."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"push_notification_data = [\n",
" \"\"\"\n",
"- New message from Sarah: \"Can you call me later?\"\n",
"- Your package has been delivered!\n",
"- Flash sale: 20% off electronics for the next 2 hours!\n",
"\"\"\",\n",
" \"\"\"\n",
"- Weather alert: Thunderstorm expected in your area.\n",
"- Reminder: Doctor's appointment at 3 PM.\n",
"- John liked your photo on Instagram.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Breaking News: Local elections results are in.\n",
"- Your daily workout summary is ready.\n",
"- Check out your weekly screen time report.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Your ride is arriving in 2 minutes.\n",
"- Grocery order has been shipped.\n",
"- Don't miss the season finale of your favorite show tonight!\n",
"\"\"\",\n",
" \"\"\"\n",
"- Event reminder: Concert starts at 7 PM.\n",
"- Your favorite team just scored!\n",
"- Flashback: Memories from 3 years ago.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Low battery alert: Charge your device.\n",
"- Your friend Mike is nearby.\n",
"- New episode of \"The Tech Hour\" podcast is live!\n",
"\"\"\",\n",
" \"\"\"\n",
"- System update available.\n",
"- Monthly billing statement is ready.\n",
"- Your next meeting starts in 15 minutes.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Alert: Unauthorized login attempt detected.\n",
"- New comment on your blog post: \"Great insights!\"\n",
"- Tonight's dinner recipe: Pasta Primavera.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Special offer: Free coffee with any breakfast order.\n",
"- Your flight has been delayed by 30 minutes.\n",
"- New movie release: \"Adventures Beyond\" now streaming.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Traffic alert: Accident reported on Main Street.\n",
"- Package out for delivery: Expected by 5 PM.\n",
"- New friend suggestion: Connect with Emma.\n",
"\"\"\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we're going to set up a bunch of prompts to test.\n",
"\n",
"We want to test a basic prompt, with a couple of variations:\n",
"1. In one variation, we'll just have the basic prompt\n",
"2. In the next one, we'll include some positive examples of what we want the summaries to look like\n",
"3. In the final one, we'll include both positive and negative examples.\n",
"\n",
"We'll also include a list of models to use."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"PROMPT_PREFIX = \"\"\"\n",
"You are a helpful assistant that takes in an array of push notifications and returns a collapsed summary of them.\n",
"The push notification will be provided as follows:\n",
"<push_notifications>\n",
"...notificationlist...\n",
"</push_notifications>\n",
"\n",
"You should return just the summary and nothing else.\n",
"\"\"\"\n",
"\n",
"PROMPT_VARIATION_BASIC = f\"\"\"\n",
"{PROMPT_PREFIX}\n",
"\n",
"You should return a summary that is concise and snappy.\n",
"\"\"\"\n",
"\n",
"PROMPT_VARIATION_WITH_EXAMPLES = f\"\"\"\n",
"{PROMPT_VARIATION_BASIC}\n",
"\n",
"Here is an example of a good summary:\n",
"<push_notifications>\n",
"- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.\n",
"</push_notifications>\n",
"<summary>\n",
"Traffic alert, package expected by 5pm, suggestion for new friend (Emily).\n",
"</summary>\n",
"\"\"\"\n",
"\n",
"PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES = f\"\"\"\n",
"{PROMPT_VARIATION_WITH_EXAMPLES}\n",
"\n",
"Here is an example of a bad summary:\n",
"<push_notifications>\n",
"- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.\n",
"</push_notifications>\n",
"<summary>\n",
"Traffic alert reported on main street. You have a package that will arrive by 5pm, Emily is a new friend suggested for you.\n",
"</summary>\n",
"\"\"\"\n",
"\n",
"prompts = [\n",
" (\"basic\", PROMPT_VARIATION_BASIC),\n",
" (\"with_examples\", PROMPT_VARIATION_WITH_EXAMPLES),\n",
" (\"with_negative_examples\", PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES),\n",
"]\n",
"\n",
"models = [\"gpt-4o\", \"gpt-4o-mini\", \"o3-mini\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now we can just loop through all prompts and all models to test a bunch of configurations at once!**\n",
"\n",
"We'll use the 'completion' run data source with template variables for our push notification list.\n",
"\n",
"OpenAI will handle making the completions calls for you and populating \"sample.output_text\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for prompt_name, prompt in prompts:\n",
" for model in models:\n",
" run_data_source = {\n",
" \"type\": \"completions\",\n",
" \"input_messages\": {\n",
" \"type\": \"template\",\n",
" \"template\": [\n",
" {\n",
" \"role\": \"developer\",\n",
" \"content\": prompt,\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": \"<push_notifications>{{item.notifications}}</push_notifications>\",\n",
" },\n",
" ],\n",
" },\n",
" \"model\": model,\n",
" \"source\": {\n",
" \"type\": \"file_content\",\n",
" \"content\": [\n",
" {\n",
" \"item\": PushNotifications(notifications=notification).model_dump()\n",
" }\n",
" for notification in push_notification_data\n",
" ],\n",
" },\n",
" }\n",
"\n",
" run_create_result = openai.evals.runs.create(\n",
" eval_id=eval_id,\n",
" name=f\"bulk_{prompt_name}_{model}\",\n",
" data_source=run_data_source,\n",
" )\n",
" print(f\"Report URL {model}, {prompt_name}:\", run_create_result.report_url)\n"
]
},
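{
"cell_type": "markdown",
"metadata": {},
"source": [
"Runs are graded asynchronously, so they may take a few minutes to finish. As a sketch (assuming the SDK exposes `openai.evals.runs.list` and that runs report `status` and `result_counts`), you can poll them programmatically instead of watching the report URLs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: poll the status of every run in this eval. Assumes the SDK exposes\n",
"# evals.runs.list and that finished runs report status and result_counts.\n",
"for run in openai.evals.runs.list(eval_id=eval_id):\n",
"    print(run.name, run.status, run.result_counts)"
]
},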
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Congratulations, you just tested 9 different prompt and model variations across your dataset!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "openai",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -0,0 +1,411 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluations Example: Push Notifications Summarizer Monitoring\n",
"\n",
"Evals are **task-oriented** and iterative, they're the best way to check how your LLM integration is doing and improve it.\n",
"\n",
"In the following eval, we are going to focus on the task of **detecting our prompt changes for regressions**.\n",
"\n",
"Our use-case is:\n",
"1. We have been logging chat completion requests by setting `store=True` in our production chat completions requests. Note that you can also enable \"on by default\" logging in your admin panel (https://platform.openai.com/settings/organization/data-controls/data-retention).\n",
"2. We want to see whether our prompt changes have introduced regressions.\n",
"\n",
"## Evals structure\n",
"\n",
"Evals have two parts, the \"Eval\" and the \"Run\". An \"Eval\" holds the configuration for your testing criteria and the structure of the data for your \"Runs\". An Eval can have many Runs, which are each evaluated using your testing criteria."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from openai import AsyncOpenAI\n",
"import os\n",
"import asyncio\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = os.environ.get(\"OPENAI_API_KEY\", \"your-api-key\")\n",
"client = AsyncOpenAI()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use-case\n",
"\n",
"We're testing the following integration, a push notifications summary, which takes in multiple push notifications and collapses them into a single one, this is a chat completions call."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generate our test data\n",
"\n",
"I'm going to produce simulated production chat completions requests with two different prompt versions to test how each performs. The first is a \"good\" prompt, the second is a \"bad\" prompt. These will have different metadata which we'll use later."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"push_notification_data = [\n",
" \"\"\"\n",
"- New message from Sarah: \"Can you call me later?\"\n",
"- Your package has been delivered!\n",
"- Flash sale: 20% off electronics for the next 2 hours!\n",
"\"\"\",\n",
" \"\"\"\n",
"- Weather alert: Thunderstorm expected in your area.\n",
"- Reminder: Doctor's appointment at 3 PM.\n",
"- John liked your photo on Instagram.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Breaking News: Local elections results are in.\n",
"- Your daily workout summary is ready.\n",
"- Check out your weekly screen time report.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Your ride is arriving in 2 minutes.\n",
"- Grocery order has been shipped.\n",
"- Don't miss the season finale of your favorite show tonight!\n",
"\"\"\",\n",
" \"\"\"\n",
"- Event reminder: Concert starts at 7 PM.\n",
"- Your favorite team just scored!\n",
"- Flashback: Memories from 3 years ago.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Low battery alert: Charge your device.\n",
"- Your friend Mike is nearby.\n",
"- New episode of \"The Tech Hour\" podcast is live!\n",
"\"\"\",\n",
" \"\"\"\n",
"- System update available.\n",
"- Monthly billing statement is ready.\n",
"- Your next meeting starts in 15 minutes.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Alert: Unauthorized login attempt detected.\n",
"- New comment on your blog post: \"Great insights!\"\n",
"- Tonight's dinner recipe: Pasta Primavera.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Special offer: Free coffee with any breakfast order.\n",
"- Your flight has been delayed by 30 minutes.\n",
"- New movie release: \"Adventures Beyond\" now streaming.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Traffic alert: Accident reported on Main Street.\n",
"- Package out for delivery: Expected by 5 PM.\n",
"- New friend suggestion: Connect with Emma.\n",
"\"\"\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PROMPTS = [\n",
" (\n",
" \"\"\"\n",
" You are a helpful assistant that summarizes push notifications.\n",
" You are given a list of push notifications and you need to collapse them into a single one.\n",
" Output only the final summary, nothing else.\n",
" \"\"\",\n",
" \"v1\"\n",
" ),\n",
" (\n",
" \"\"\"\n",
" You are a helpful assistant that summarizes push notifications.\n",
" You are given a list of push notifications and you need to collapse them into a single one.\n",
" The summary should be longer than it needs to be and include more information than is necessary.\n",
" Output only the final summary, nothing else.\n",
" \"\"\",\n",
" \"v2\"\n",
" )\n",
"]\n",
"\n",
"tasks = []\n",
"for notifications in push_notification_data:\n",
" for (prompt, version) in PROMPTS:\n",
" tasks.append(client.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[\n",
" {\"role\": \"developer\", \"content\": prompt},\n",
" {\"role\": \"user\", \"content\": notifications},\n",
" ],\n",
" store=True,\n",
" metadata={\"prompt_version\": version, \"usecase\": \"push_notifications_summarizer\"},\n",
" ))\n",
"await asyncio.gather(*tasks)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can view the completions you just created at https://platform.openai.com/logs. \n",
"\n",
"**Make sure that the chat completions show up, as they are necessary for the next step.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"completions = await client.chat.completions.list()\n",
"assert completions.data, \"No completions found. You may need to enable logs in your admin panel.\"\n",
"completions.data[0]"
]
},
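{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also confirm that the metadata you attached made it onto the stored completions, since the eval below filters on it (a sketch; the `metadata` field name on list items is an assumption):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: verify our metadata landed on the stored completions,\n",
"# since the eval below filters on it. Assumes a `metadata` field on each item.\n",
"for completion in completions.data[:3]:\n",
"    print(completion.id, completion.metadata)"
]
},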
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setting up your eval\n",
"\n",
"An Eval holds the configuration that is shared across multiple *Runs*, it has two components:\n",
"1. Data source configuration `data_source_config` - the schema (columns) that your future *Runs* conform to.\n",
" - The `data_source_config` uses JSON Schema to define what variables are available in the Eval.\n",
"2. Testing Criteria `testing_criteria` - How you'll determine if your integration is working for each *row* of your data source.\n",
"\n",
"For this use-case, we're using stored-completions, so we'll set up that data_source_config\n",
"\n",
"**Important**\n",
"You are likely to have many different stored completions use-cases, metadata is the best way to keep track of this for evals to keep them focused and task oriented."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# We want our input data to be available in our variables, so we set the item_schema to\n",
"# PushNotifications.model_json_schema()\n",
"data_source_config = {\n",
" \"type\": \"stored_completions\",\n",
" \"metadata\": {\n",
" \"usecase\": \"push_notifications_summarizer\"\n",
" }\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This data_source_config defines what variables are available throughout the eval.\n",
"\n",
"The stored completions config provides two variables for you to use throughout your eval:\n",
"1. {{item.input}} - the messages sent to the completions call\n",
"2. {{sample.output_text}} - the text response from the assistant\n",
"\n",
"**Now, we'll use those variables to set up our test criteria.**"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"GRADER_DEVELOPER_PROMPT = \"\"\"\n",
"Label the following push notification summary as either correct or incorrect.\n",
"The push notification and the summary will be provided below.\n",
"A good push notificiation summary is concise and snappy.\n",
"If it is good, then label it as correct, if not, then incorrect.\n",
"\"\"\"\n",
"GRADER_TEMPLATE_PROMPT = \"\"\"\n",
"Push notifications: {{item.input}}\n",
"Summary: {{sample.output_text}}\n",
"\"\"\"\n",
"push_notification_grader = {\n",
" \"name\": \"Push Notification Summary Grader\",\n",
" \"type\": \"label_model\",\n",
" \"model\": \"o3-mini\",\n",
" \"input\": [\n",
" {\n",
" \"role\": \"developer\",\n",
" \"content\": GRADER_DEVELOPER_PROMPT,\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": GRADER_TEMPLATE_PROMPT,\n",
" },\n",
" ],\n",
" \"passing_labels\": [\"correct\"],\n",
" \"labels\": [\"correct\", \"incorrect\"],\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `push_notification_grader` is a model grader (llm-as-a-judge), which looks at the input `{{item.input}}` and the generated summary `{{sample.output_text}}` and labels it as \"correct\" or \"incorrect\".\n",
"\n",
"Note: under the hood, this uses structured outputs so that labels are always valid.\n",
"\n",
"**Now we'll create our eval!, and start adding data to it**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"eval_create_result = await client.evals.create(\n",
" name=\"Push Notification Completion Monitoring\",\n",
" metadata={\"description\": \"This eval monitors completions\"},\n",
" data_source_config=data_source_config,\n",
" testing_criteria=[push_notification_grader],\n",
")\n",
"\n",
"eval_id = eval_create_result.id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating runs\n",
"\n",
"Now that we have our eval set-up with our test_criteria, we can start adding runs.\n",
"I want to compare the performance between my two **prompt versions**\n",
"\n",
"To do this, we just define our source as \"stored_completions\" with a metadata filter for each of our prompt versions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Grade prompt_version=v1\n",
"eval_run_result = await client.evals.runs.create(\n",
" eval_id=eval_id,\n",
" name=\"v1-run\",\n",
" data_source={\n",
" \"type\": \"completions\",\n",
" \"source\": {\n",
" \"type\": \"stored_completions\",\n",
" \"metadata\": {\n",
" \"prompt_version\": \"v1\",\n",
" }\n",
" }\n",
" }\n",
")\n",
"print(eval_run_result.report_url)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Grade prompt_version=v2\n",
"eval_run_result_v2 = await client.evals.runs.create(\n",
" eval_id=eval_id,\n",
" name=\"v2-run\",\n",
" data_source={\n",
" \"type\": \"completions\",\n",
" \"source\": {\n",
" \"type\": \"stored_completions\",\n",
" \"metadata\": {\n",
" \"prompt_version\": \"v2\",\n",
" }\n",
" }\n",
" }\n",
")\n",
"print(eval_run_result_v2.report_url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just for to be thorough, let's see how this prompt would do with 4o, instead of 4o-mini, with both prompt versions as the starting point.\n",
"\n",
"All we have to do is reference the input messages ({{item.input}}) and set the model to 4o. Since we don't already have any stored completions for 4o, this eval run will generate new completions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tasks = []\n",
"for prompt_version in [\"v1\", \"v2\"]:\n",
" tasks.append(client.evals.runs.create(\n",
" eval_id=eval_id,\n",
" name=f\"post-fix-new-model-run-{prompt_version}\",\n",
" data_source={\n",
" \"type\": \"completions\",\n",
" \"input_messages\": {\n",
" \"type\": \"item_reference\",\n",
" \"item_reference\": \"item.input\",\n",
" },\n",
" \"model\": \"gpt-4o\",\n",
" \"source\": {\n",
" \"type\": \"stored_completions\",\n",
" \"metadata\": {\n",
" \"prompt_version\": prompt_version,\n",
" }\n",
" }\n",
" },\n",
" ))\n",
"result = await asyncio.gather(*tasks)\n",
"for run in result:\n",
" print(run.report_url)"
]
},
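{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also compare the runs programmatically rather than in the UI (a sketch, assuming the SDK exposes `evals.runs.list` and that finished runs report `result_counts`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: compare pass counts across all runs of this eval. Assumes the SDK\n",
"# exposes evals.runs.list and that finished runs report result_counts.\n",
"runs = await client.evals.runs.list(eval_id=eval_id)\n",
"for run in runs.data:\n",
"    print(run.name, run.status, run.result_counts)"
]
},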
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you view that report, you'll see that we can see that prompt_version=v2 has a regression!\n",
"\n",
"## Congratulations, you just discovered a bug, you could revert it, or make another prompt change, etc.!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "openai",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -0,0 +1,470 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluations Example: Push Notifications Summarizer Prompt Regression,\n",
"\n",
"Evals are **task oriented** and iterative, they're the best way to check how your LLM integration is doing and improve it.\n",
"\n",
"In the following eval, we are going to focus on the task of **detecting if my prompt change is a regression**.\n",
"\n",
"Our use-case is:\n",
"1. I have an llm integration that takes a list of push notifications and summarizes them into a single condensed statement.\n",
"2. I want to detect if a prompt change regresses the behavior\n",
"\n",
"## Evals structure\n",
"\n",
"Evals have two parts, the \"Eval\" and the \"Run\". An \"Eval\" holds the configuration for your testing criteria and the structure of the data for your \"Runs\". An Eval can have many runs that are evaluated by your testing criteria."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"from openai.types.chat import ChatCompletion\n",
"import pydantic\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = os.environ.get(\"OPENAI_API_KEY\", \"your-api-key\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use-case\n",
"\n",
"We're testing the following integration, a push notifications summary, which takes in multiple push notifications and collapses them into a single one, this is a chat completions call."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class PushNotifications(pydantic.BaseModel):\n",
" notifications: str\n",
"\n",
"print(PushNotifications.model_json_schema())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DEVELOPER_PROMPT = \"\"\"\n",
"You are a helpful assistant that summarizes push notifications.\n",
"You are given a list of push notifications and you need to collapse them into a single one.\n",
"Output only the final summary, nothing else.\n",
"\"\"\"\n",
"\n",
"def summarize_push_notification(push_notifications: str) -> ChatCompletion:\n",
" result = openai.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[\n",
" {\"role\": \"developer\", \"content\": DEVELOPER_PROMPT},\n",
" {\"role\": \"user\", \"content\": push_notifications},\n",
" ],\n",
" )\n",
" return result\n",
"\n",
"example_push_notifications_list = PushNotifications(notifications=\"\"\"\n",
"- Alert: Unauthorized login attempt detected.\n",
"- New comment on your blog post: \"Great insights!\"\n",
"- Tonight's dinner recipe: Pasta Primavera.\n",
"\"\"\")\n",
"result = summarize_push_notification(example_push_notifications_list.notifications)\n",
"print(result.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setting up your eval\n",
"\n",
"An Eval holds the configuration that is shared across multiple *Runs*, it has two components:\n",
"1. Data source configuration `data_source_config` - the schema (columns) that your future *Runs* conform to.\n",
" - The `data_source_config` uses JSON Schema to define what variables are available in the Eval.\n",
"2. Testing Criteria `testing_criteria` - How you'll determine if your integration is working for each *row* of your data source.\n",
"\n",
"For this use-case, we want to test if the push notification summary completion is good, so we'll set-up our eval with this in mind."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# We want our input data to be available in our variables, so we set the item_schema to\n",
"# PushNotifications.model_json_schema()\n",
"data_source_config = {\n",
" \"type\": \"custom\",\n",
" \"item_schema\": PushNotifications.model_json_schema(),\n",
" # We're going to be uploading completions from the API, so we tell the Eval to expect this\n",
" \"include_sample_schema\": True,\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This data_source_config defines what variables are available throughout the eval.\n",
"\n",
"This item schema:\n",
"```json\n",
"{\n",
" \"properties\": {\n",
" \"notifications\": {\n",
" \"title\": \"Notifications\",\n",
" \"type\": \"string\"\n",
" }\n",
" },\n",
" \"required\": [\"notifications\"],\n",
" \"title\": \"PushNotifications\",\n",
" \"type\": \"object\"\n",
"}\n",
"```\n",
"Means that we'll have the variable `{{item.notifications}}` available in our eval.\n",
"\n",
"`\"include_sample_schema\": True`\n",
"Mean's that we'll have the variable `{{sample.output_text}}` available in our eval.\n",
"\n",
"**Now, we'll use those variables to set up our test criteria.**"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"GRADER_DEVELOPER_PROMPT = \"\"\"\n",
"Label the following push notification summary as either correct or incorrect.\n",
"The push notification and the summary will be provided below.\n",
"A good push notificiation summary is concise and snappy.\n",
"If it is good, then label it as correct, if not, then incorrect.\n",
"\"\"\"\n",
"GRADER_TEMPLATE_PROMPT = \"\"\"\n",
"Push notifications: {{item.notifications}}\n",
"Summary: {{sample.output_text}}\n",
"\"\"\"\n",
"push_notification_grader = {\n",
" \"name\": \"Push Notification Summary Grader\",\n",
" \"type\": \"label_model\",\n",
" \"model\": \"o3-mini\",\n",
" \"input\": [\n",
" {\n",
" \"role\": \"developer\",\n",
" \"content\": GRADER_DEVELOPER_PROMPT,\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": GRADER_TEMPLATE_PROMPT,\n",
" },\n",
" ],\n",
" \"passing_labels\": [\"correct\"],\n",
" \"labels\": [\"correct\", \"incorrect\"],\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `push_notification_grader` is a model grader (llm-as-a-judge), which looks at the input `{{item.notifications}}` and the generated summary `{{sample.output_text}}` and labels it as \"correct\" or \"incorrect\".\n",
"We then instruct via. the \"passing_labels\", what constitutes a passing answer.\n",
"\n",
"Note: under the hood, this uses structured outputs so that labels are always valid.\n",
"\n",
"**Now we'll create our eval!, and start adding data to it**"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"eval_create_result = openai.evals.create(\n",
" name=\"Push Notification Summary Workflow\",\n",
" metadata={\n",
" \"description\": \"This eval checks if the push notification summary is correct.\",\n",
" },\n",
" data_source_config=data_source_config,\n",
" testing_criteria=[push_notification_grader],\n",
")\n",
"\n",
"eval_id = eval_create_result.id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating runs\n",
"\n",
"Now that we have our eval set-up with our test_criteria, we can start to add a bunch of runs!\n",
"We'll start with some push notification data."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"push_notification_data = [\n",
" \"\"\"\n",
"- New message from Sarah: \"Can you call me later?\"\n",
"- Your package has been delivered!\n",
"- Flash sale: 20% off electronics for the next 2 hours!\n",
"\"\"\",\n",
" \"\"\"\n",
"- Weather alert: Thunderstorm expected in your area.\n",
"- Reminder: Doctor's appointment at 3 PM.\n",
"- John liked your photo on Instagram.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Breaking News: Local elections results are in.\n",
"- Your daily workout summary is ready.\n",
"- Check out your weekly screen time report.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Your ride is arriving in 2 minutes.\n",
"- Grocery order has been shipped.\n",
"- Don't miss the season finale of your favorite show tonight!\n",
"\"\"\",\n",
" \"\"\"\n",
"- Event reminder: Concert starts at 7 PM.\n",
"- Your favorite team just scored!\n",
"- Flashback: Memories from 3 years ago.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Low battery alert: Charge your device.\n",
"- Your friend Mike is nearby.\n",
"- New episode of \"The Tech Hour\" podcast is live!\n",
"\"\"\",\n",
" \"\"\"\n",
"- System update available.\n",
"- Monthly billing statement is ready.\n",
"- Your next meeting starts in 15 minutes.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Alert: Unauthorized login attempt detected.\n",
"- New comment on your blog post: \"Great insights!\"\n",
"- Tonight's dinner recipe: Pasta Primavera.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Special offer: Free coffee with any breakfast order.\n",
"- Your flight has been delayed by 30 minutes.\n",
"- New movie release: \"Adventures Beyond\" now streaming.\n",
"\"\"\",\n",
" \"\"\"\n",
"- Traffic alert: Accident reported on Main Street.\n",
"- Package out for delivery: Expected by 5 PM.\n",
"- New friend suggestion: Connect with Emma.\n",
"\"\"\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our first run will be our default grader from the completions function above `summarize_push_notification`\n",
"We'll loop through our dataset, make completions calls, and then submit them as a run to be graded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_data = []\n",
"for push_notifications in push_notification_data:\n",
" result = summarize_push_notification(push_notifications)\n",
" run_data.append({\n",
" \"item\": PushNotifications(notifications=push_notifications).model_dump(),\n",
" \"sample\": result.model_dump()\n",
" })\n",
"\n",
"eval_run_result = openai.evals.runs.create(\n",
" eval_id=eval_id,\n",
" name=\"baseline-run\",\n",
" data_source={\n",
" \"type\": \"jsonl\",\n",
" \"source\": {\n",
" \"type\": \"file_content\",\n",
" \"content\": run_data,\n",
" }\n",
" },\n",
")\n",
"print(eval_run_result)\n",
"# Check out the results in the UI\n",
"print(eval_run_result.report_url)"
]
},
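{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond the report UI, you can pull per-row results programmatically (a sketch, assuming the SDK exposes `evals.runs.output_items.list`; grading runs asynchronously, so items may take a moment to appear):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: fetch the graded rows for the baseline run. Assumes the SDK exposes\n",
"# evals.runs.output_items.list; grading is async, so this may initially be empty.\n",
"output_items = openai.evals.runs.output_items.list(\n",
"    run_id=eval_run_result.id,\n",
"    eval_id=eval_id,\n",
")\n",
"for item in output_items:\n",
"    print(item.status)"
]
},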
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's simulate a regression, here's our original prompt, let's simulate a developer breaking the prompt.\n",
"\n",
"```python\n",
"DEVELOPER_PROMPT = \"\"\"\n",
"You are a helpful assistant that summarizes push notifications.\n",
"You are given a list of push notifications and you need to collapse them into a single one.\n",
"Output only the final summary, nothing else.\n",
"\"\"\"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"DEVELOPER_PROMPT = \"\"\"\n",
"You are a helpful assistant that summarizes push notifications.\n",
"You are given a list of push notifications and you need to collapse them into a single one.\n",
"You should make the summary longer than it needs to be and include more information than is necessary.\n",
"\"\"\"\n",
"\n",
"def summarize_push_notification_bad(push_notifications: str) -> ChatCompletion:\n",
" result = openai.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[\n",
" {\"role\": \"developer\", \"content\": DEVELOPER_PROMPT},\n",
" {\"role\": \"user\", \"content\": push_notifications},\n",
" ],\n",
" )\n",
" return result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_data = []\n",
"for push_notifications in push_notification_data:\n",
" result = summarize_push_notification_bad(push_notifications)\n",
" run_data.append({\n",
" \"item\": PushNotifications(notifications=push_notifications).model_dump(),\n",
" \"sample\": result.model_dump()\n",
" })\n",
"\n",
"eval_run_result = openai.evals.runs.create(\n",
" eval_id=eval_id,\n",
" name=\"regression-run\",\n",
" data_source={\n",
" \"type\": \"jsonl\",\n",
" \"source\": {\n",
" \"type\": \"file_content\",\n",
" \"content\": run_data,\n",
" }\n",
" },\n",
")\n",
"print(eval_run_result.report_url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you view that report, you'll see that it has a score that's much lower than the baseline-run.\n",
"\n",
"## Congratulations, you just prevented a bug from shipping to users\n",
"\n",
"Quick note:\n",
"Evals doesn't yet support the `responses` api natively, however, you can transform it to the `completions` format with the following code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def summarize_push_notification_responses(push_notifications: str):\n",
" result = openai.responses.create(\n",
" model=\"gpt-4o\",\n",
" input=[\n",
" {\"role\": \"developer\", \"content\": DEVELOPER_PROMPT},\n",
" {\"role\": \"user\", \"content\": push_notifications},\n",
" ],\n",
" )\n",
" return result\n",
"def transform_response_to_completion(response):\n",
" completion = {\n",
" \"model\": response.model,\n",
" \"choices\": [{\n",
" \"index\": 0,\n",
" \"message\": {\n",
" \"role\": \"assistant\",\n",
" \"content\": response.output_text\n",
" },\n",
" \"finish_reason\": \"stop\",\n",
" }]\n",
" }\n",
" return completion\n",
"\n",
"run_data = []\n",
"for push_notifications in push_notification_data:\n",
" response = summarize_push_notification_responses(push_notifications)\n",
" completion = transform_response_to_completion(response)\n",
" run_data.append({\n",
" \"item\": PushNotifications(notifications=push_notifications).model_dump(),\n",
" \"sample\": completion\n",
" })\n",
"\n",
"report_response = openai.evals.runs.create(\n",
" eval_id=eval_id,\n",
" name=\"responses-run\",\n",
" data_source={\n",
" \"type\": \"jsonl\",\n",
" \"source\": {\n",
" \"type\": \"file_content\",\n",
" \"content\": run_data,\n",
" }\n",
" },\n",
")\n",
"print(report_response.report_url)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "openai",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -4,6 +4,33 @@
# should build pages for, and indicates metadata such as tags, creation date and
# authors for each page.
- title: EvalsAPI Use-case - Detecting prompt regressions
path: examples/evaluation/use-cases/regression.ipynb
date: 2025-04-08
authors:
- josiah-openai
tags:
- evalsapi
- completions
- title: EvalsAPI Use-case - Bulk model and prompt experimentation
path: examples/evaluation/use-cases/bulk-experimentation.ipynb
date: 2025-04-08
authors:
- josiah-openai
tags:
- evalsapi
- completions
- title: EvalsAPI Use-case - Monitoring stored completions
path: examples/evaluation/use-cases/completion-monitoring.ipynb
date: 2025-04-08
authors:
- josiah-openai
tags:
- evalsapi
- completions
- title: Multi-Tool Orchestration with RAG approach using OpenAI's Responses API
path: examples/responses_api/responses_api_tool_orchestration.ipynb
date: 2025-03-28