mirror of
https://github.com/james-m-jordan/openai-cookbook.git
synced 2025-05-09 19:32:38 +00:00
454 lines
15 KiB
Plaintext
454 lines
15 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Evaluations Example: Push Notifications Bulk Experimentation \n",
|
|
"\n",
|
|
"Evals are **task oriented** and iterative, they're the best way to check how your LLM integration is doing and improve it.\n",
|
|
"\n",
|
|
"In the following eval, we are going to focus on the task of **testing many variants of models and prompts**.\n",
|
|
"\n",
|
|
"Our use-case is:\n",
|
|
"1. I want to get the best possible performance out of my push notifications summarizer\n",
|
|
"\n",
|
|
"## Evals structure\n",
|
|
"\n",
|
|
"Evals have two parts, the \"Eval\" and the \"Run\". An \"Eval\" holds the configuration for your testing criteria and the structure of the data for your \"Runs\". An Eval `has_many` runs, that are evaluated by your testing criteria."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pydantic\n",
|
|
"import openai\n",
|
|
"from openai.types.chat import ChatCompletion\n",
|
|
"import os\n",
|
|
"\n",
|
|
"os.environ[\"OPENAI_API_KEY\"] = os.environ.get(\"OPENAI_API_KEY\", \"your-api-key\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Use-case\n",
|
|
"\n",
|
|
"We're testing the following integration, a push notifications summarizer, which takes in multiple push notifications and collapses them into a single message."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"class PushNotifications(pydantic.BaseModel):\n",
|
|
" notifications: str\n",
|
|
"\n",
|
|
"print(PushNotifications.model_json_schema())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"DEVELOPER_PROMPT = \"\"\"\n",
|
|
"You are a helpful assistant that summarizes push notifications.\n",
|
|
"You are given a list of push notifications and you need to collapse them into a single one.\n",
|
|
"Output only the final summary, nothing else.\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"def summarize_push_notification(push_notifications: str) -> ChatCompletion:\n",
|
|
" result = openai.chat.completions.create(\n",
|
|
" model=\"gpt-4o-mini\",\n",
|
|
" messages=[\n",
|
|
" {\"role\": \"developer\", \"content\": DEVELOPER_PROMPT},\n",
|
|
" {\"role\": \"user\", \"content\": push_notifications},\n",
|
|
" ],\n",
|
|
" )\n",
|
|
" return result\n",
|
|
"\n",
|
|
"example_push_notifications_list = PushNotifications(notifications=\"\"\"\n",
|
|
"- Alert: Unauthorized login attempt detected.\n",
|
|
"- New comment on your blog post: \"Great insights!\"\n",
|
|
"- Tonight's dinner recipe: Pasta Primavera.\n",
|
|
"\"\"\")\n",
|
|
"result = summarize_push_notification(example_push_notifications_list.notifications)\n",
|
|
"print(result.choices[0].message.content)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Setting up your eval\n",
|
|
"\n",
|
|
"An Eval holds the configuration that is shared across multiple *Runs*, it has two components:\n",
|
|
"1. Data source configuration `data_source_config` - the schema (columns) that your future *Runs* conform to.\n",
|
|
" - The `data_source_config` uses JSON Schema to define what variables are available in the Eval.\n",
|
|
"2. Testing Criteria `testing_criteria` - How you'll determine if your integration is working for each *row* of your data source.\n",
|
|
"\n",
|
|
"For this use-case, we want to test if the push notification summary completion is good, so we'll set-up our eval with this in mind."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# We want our input data to be available in our variables, so we set the item_schema to\n",
|
|
"# PushNotifications.model_json_schema()\n",
|
|
"data_source_config = {\n",
|
|
" \"type\": \"custom\",\n",
|
|
" \"item_schema\": PushNotifications.model_json_schema(),\n",
|
|
" # We're going to be uploading completions from the API, so we tell the Eval to expect this\n",
|
|
" \"include_sample_schema\": True,\n",
|
|
"}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This data_source_config defines what variables are available throughout the eval.\n",
|
|
"\n",
|
|
"This item schema:\n",
|
|
"```json\n",
|
|
"{\n",
|
|
" \"properties\": {\n",
|
|
" \"notifications\": {\n",
|
|
" \"title\": \"Notifications\",\n",
|
|
" \"type\": \"string\"\n",
|
|
" }\n",
|
|
" },\n",
|
|
" \"required\": [\"notifications\"],\n",
|
|
" \"title\": \"PushNotifications\",\n",
|
|
" \"type\": \"object\"\n",
|
|
"}\n",
|
|
"```\n",
|
|
"Means that we'll have the variable `{{item.notifications}}` available in our eval.\n",
|
|
"\n",
|
|
"`\"include_sample_schema\": True`\n",
|
|
"Mean's that we'll have the variable `{{sample.output_text}}` available in our eval.\n",
|
|
"\n",
|
|
"**Now, we'll use those variables to set up our test criteria.**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"GRADER_DEVELOPER_PROMPT = \"\"\"\n",
|
|
"Categorize the following push notification summary into the following categories:\n",
|
|
"1. concise-and-snappy\n",
|
|
"2. drops-important-information\n",
|
|
"3. verbose\n",
|
|
"4. unclear\n",
|
|
"5. obscures-meaning\n",
|
|
"6. other \n",
|
|
"\n",
|
|
"You'll be given the original list of push notifications and the summary like this:\n",
|
|
"\n",
|
|
"<push_notifications>\n",
|
|
"...notificationlist...\n",
|
|
"</push_notifications>\n",
|
|
"<summary>\n",
|
|
"...summary...\n",
|
|
"</summary>\n",
|
|
"\n",
|
|
"You should only pick one of the categories above, pick the one which most closely matches and why.\n",
|
|
"\"\"\"\n",
|
|
"GRADER_TEMPLATE_PROMPT = \"\"\"\n",
|
|
"<push_notifications>{{item.notifications}}</push_notifications>\n",
|
|
"<summary>{{sample.output_text}}</summary>\n",
|
|
"\"\"\"\n",
|
|
"push_notification_grader = {\n",
|
|
" \"name\": \"Push Notification Summary Grader\",\n",
|
|
" \"type\": \"label_model\",\n",
|
|
" \"model\": \"o3-mini\",\n",
|
|
" \"input\": [\n",
|
|
" {\n",
|
|
" \"role\": \"developer\",\n",
|
|
" \"content\": GRADER_DEVELOPER_PROMPT,\n",
|
|
" },\n",
|
|
" {\n",
|
|
" \"role\": \"user\",\n",
|
|
" \"content\": GRADER_TEMPLATE_PROMPT,\n",
|
|
" },\n",
|
|
" ],\n",
|
|
" \"passing_labels\": [\"concise-and-snappy\"],\n",
|
|
" \"labels\": [\n",
|
|
" \"concise-and-snappy\",\n",
|
|
" \"drops-important-information\",\n",
|
|
" \"verbose\",\n",
|
|
" \"unclear\",\n",
|
|
" \"obscures-meaning\",\n",
|
|
" \"other\",\n",
|
|
" ],\n",
|
|
"}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The `push_notification_grader` is a model grader (llm-as-a-judge) which looks at the input `{{item.notifications}}` and the generated summary `{{sample.output_text}}` and labels it as \"correct\" or \"incorrect\"\n",
|
|
"We then instruct via the \"passing_labels\" what constitutes a passing answer.\n",
|
|
"\n",
|
|
"Note: under the hood, this uses structured outputs so that labels are always valid.\n",
|
|
"\n",
|
|
"**Now we'll create our eval, and start adding data to it!**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"eval_create_result = openai.evals.create(\n",
|
|
" name=\"Push Notification Bulk Experimentation Eval\",\n",
|
|
" metadata={\n",
|
|
" \"description\": \"This eval tests many prompts and models to find the best performing combination.\",\n",
|
|
" },\n",
|
|
" data_source_config=data_source_config,\n",
|
|
" testing_criteria=[push_notification_grader],\n",
|
|
")\n",
|
|
"eval_id = eval_create_result.id"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Creating runs\n",
|
|
"\n",
|
|
"Now that we have our eval set-up with our testing_criteria, we can start to add a bunch of runs!\n",
|
|
"We'll start with some push notification data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"push_notification_data = [\n",
|
|
" \"\"\"\n",
|
|
"- New message from Sarah: \"Can you call me later?\"\n",
|
|
"- Your package has been delivered!\n",
|
|
"- Flash sale: 20% off electronics for the next 2 hours!\n",
|
|
"\"\"\",\n",
|
|
" \"\"\"\n",
|
|
"- Weather alert: Thunderstorm expected in your area.\n",
|
|
"- Reminder: Doctor's appointment at 3 PM.\n",
|
|
"- John liked your photo on Instagram.\n",
|
|
"\"\"\",\n",
|
|
" \"\"\"\n",
|
|
"- Breaking News: Local elections results are in.\n",
|
|
"- Your daily workout summary is ready.\n",
|
|
"- Check out your weekly screen time report.\n",
|
|
"\"\"\",\n",
|
|
" \"\"\"\n",
|
|
"- Your ride is arriving in 2 minutes.\n",
|
|
"- Grocery order has been shipped.\n",
|
|
"- Don't miss the season finale of your favorite show tonight!\n",
|
|
"\"\"\",\n",
|
|
" \"\"\"\n",
|
|
"- Event reminder: Concert starts at 7 PM.\n",
|
|
"- Your favorite team just scored!\n",
|
|
"- Flashback: Memories from 3 years ago.\n",
|
|
"\"\"\",\n",
|
|
" \"\"\"\n",
|
|
"- Low battery alert: Charge your device.\n",
|
|
"- Your friend Mike is nearby.\n",
|
|
"- New episode of \"The Tech Hour\" podcast is live!\n",
|
|
"\"\"\",\n",
|
|
" \"\"\"\n",
|
|
"- System update available.\n",
|
|
"- Monthly billing statement is ready.\n",
|
|
"- Your next meeting starts in 15 minutes.\n",
|
|
"\"\"\",\n",
|
|
" \"\"\"\n",
|
|
"- Alert: Unauthorized login attempt detected.\n",
|
|
"- New comment on your blog post: \"Great insights!\"\n",
|
|
"- Tonight's dinner recipe: Pasta Primavera.\n",
|
|
"\"\"\",\n",
|
|
" \"\"\"\n",
|
|
"- Special offer: Free coffee with any breakfast order.\n",
|
|
"- Your flight has been delayed by 30 minutes.\n",
|
|
"- New movie release: \"Adventures Beyond\" now streaming.\n",
|
|
"\"\"\",\n",
|
|
" \"\"\"\n",
|
|
"- Traffic alert: Accident reported on Main Street.\n",
|
|
"- Package out for delivery: Expected by 5 PM.\n",
|
|
"- New friend suggestion: Connect with Emma.\n",
|
|
"\"\"\"]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we're going to set up a bunch of prompts to test.\n",
|
|
"\n",
|
|
"We want to test a basic prompt, with a couple of variations:\n",
|
|
"1. In one variation, we'll just have the basic prompt\n",
|
|
"2. In the next one, we'll include some positive examples of what we want the summaries to look like\n",
|
|
"3. In the final one, we'll include both positive and negative examples.\n",
|
|
"\n",
|
|
"We'll also include a list of models to use."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"PROMPT_PREFIX = \"\"\"\n",
|
|
"You are a helpful assistant that takes in an array of push notifications and returns a collapsed summary of them.\n",
|
|
"The push notification will be provided as follows:\n",
|
|
"<push_notifications>\n",
|
|
"...notificationlist...\n",
|
|
"</push_notifications>\n",
|
|
"\n",
|
|
"You should return just the summary and nothing else.\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"PROMPT_VARIATION_BASIC = f\"\"\"\n",
|
|
"{PROMPT_PREFIX}\n",
|
|
"\n",
|
|
"You should return a summary that is concise and snappy.\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"PROMPT_VARIATION_WITH_EXAMPLES = f\"\"\"\n",
|
|
"{PROMPT_VARIATION_BASIC}\n",
|
|
"\n",
|
|
"Here is an example of a good summary:\n",
|
|
"<push_notifications>\n",
|
|
"- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.\n",
|
|
"</push_notifications>\n",
|
|
"<summary>\n",
|
|
"Traffic alert, package expected by 5pm, suggestion for new friend (Emily).\n",
|
|
"</summary>\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES = f\"\"\"\n",
|
|
"{PROMPT_VARIATION_WITH_EXAMPLES}\n",
|
|
"\n",
|
|
"Here is an example of a bad summary:\n",
|
|
"<push_notifications>\n",
|
|
"- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.\n",
|
|
"</push_notifications>\n",
|
|
"<summary>\n",
|
|
"Traffic alert reported on main street. You have a package that will arrive by 5pm, Emily is a new friend suggested for you.\n",
|
|
"</summary>\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"prompts = [\n",
|
|
" (\"basic\", PROMPT_VARIATION_BASIC),\n",
|
|
" (\"with_examples\", PROMPT_VARIATION_WITH_EXAMPLES),\n",
|
|
" (\"with_negative_examples\", PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES),\n",
|
|
"]\n",
|
|
"\n",
|
|
"models = [\"gpt-4o\", \"gpt-4o-mini\", \"o3-mini\"]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Now we can just loop through all prompts and all models to test a bunch of configurations at once!**\n",
|
|
"\n",
|
|
"We'll use the 'completion' run data source with template variables for our push notification list.\n",
|
|
"\n",
|
|
"OpenAI will handle making the completions calls for you and populating \"sample.output_text\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for prompt_name, prompt in prompts:\n",
|
|
" for model in models:\n",
|
|
" run_data_source = {\n",
|
|
" \"type\": \"completions\",\n",
|
|
" \"input_messages\": {\n",
|
|
" \"type\": \"template\",\n",
|
|
" \"template\": [\n",
|
|
" {\n",
|
|
" \"role\": \"developer\",\n",
|
|
" \"content\": prompt,\n",
|
|
" },\n",
|
|
" {\n",
|
|
" \"role\": \"user\",\n",
|
|
" \"content\": \"<push_notifications>{{item.notifications}}</push_notifications>\",\n",
|
|
" },\n",
|
|
" ],\n",
|
|
" },\n",
|
|
" \"model\": model,\n",
|
|
" \"source\": {\n",
|
|
" \"type\": \"file_content\",\n",
|
|
" \"content\": [\n",
|
|
" {\n",
|
|
" \"item\": PushNotifications(notifications=notification).model_dump()\n",
|
|
" }\n",
|
|
" for notification in push_notification_data\n",
|
|
" ],\n",
|
|
" },\n",
|
|
" }\n",
|
|
"\n",
|
|
" run_create_result = openai.evals.runs.create(\n",
|
|
" eval_id=eval_id,\n",
|
|
" name=f\"bulk_{prompt_name}_{model}\",\n",
|
|
" data_source=run_data_source,\n",
|
|
" )\n",
|
|
" print(f\"Report URL {model}, {prompt_name}:\", run_create_result.report_url)\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\n",
|
|
"\n",
|
|
"## Congratulations, you just tested 9 different prompt and model variations across your dataset!"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "openai",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.11.8"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|