diff --git a/examples/Custom-LLM-as-a-Judge.ipynb b/examples/Custom-LLM-as-a-Judge.ipynb new file mode 100644 index 0000000..894ed0c --- /dev/null +++ b/examples/Custom-LLM-as-a-Judge.ipynb @@ -0,0 +1,966 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Building an LLM-as-a-judge evaluation to detect hallucinations with Braintrust\n", + "\n", + "Let's say you're working on a customer service bot and trying to evaluate the quality of its responses. Consider a question like \"What is your return policy?\" If the correct answer is \"You can return items within 30 days of purchase,\" but your bot generates \"You can return items within 30 days,\" how would you evaluate whether this is a good response?\n", + "\n", + "A heuristic like the `Levenshtein` string distance would indicate that the response is incorrect. However, a better approach is to use an LLM-as-a-judge to assess the accuracy of the response. LLM-as-a-judge is a technique that leverages an LLM to score the quality of answers. LLMs can reason about language beyond surface-level string comparisons, enabling them to evaluate answers more accurately.\n", + "\n", + "In this cookbook, we'll walk through how to build an LLM-as-a-judge scorer that can detect hallucinations using [Braintrust](https://www.braintrust.dev/), a third-party evaluation platform that is compatible with OpenAI's models.\n", + "\n", + "## Installing dependencies\n", + "\n", + "Let's install a few basic dependencies. We'll use the CoQA dataset (via DuckDB), [Braintrust](https://www.braintrust.dev/) for evals, and [OpenAI's models](https://platform.openai.com/docs/models). Please note that Braintrust is a third-party evaluation platform and you should review their [terms of service and privacy policy](https://www.braintrust.dev/legal/terms-of-service) before proceeding.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install autoevals duckdb braintrust openai --quiet\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, let's initialize the OpenAI client. We'll use the `AsyncOpenAI` client so that we can parallelize our requests. The `braintrust.wrap_openai` function\n", + "wraps the OpenAI client to enable logging LLM calls to [Braintrust](https://www.braintrust.dev/). We'll use Braintrust to facilitate the evaluations below.\n", + "Before proceeding, you should sign up for a [Braintrust account](https://www.braintrust.dev/signup) and set `BRAINTRUST_API_KEY` in your environment to a valid API key.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "import braintrust\n", + "from openai import AsyncOpenAI\n", + "\n", + "braintrust.login(api_key=os.environ[\"BRAINTRUST_API_KEY\"])\n", + "client = braintrust.wrap_openai(\n", + " AsyncOpenAI(api_key=os.environ[\"OPENAI_API_KEY\"]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explore the dataset\n", + "\n", + "We'll use the [CoQA dataset](https://stanfordnlp.github.io/coqa/) which contains a diverse set of passages, questions, and answers. Because CoQA is quite large, we'll just look at the first several passages. 
As with any public dataset, there's a chance that the underlying LLMs have memorized aspects of the dataset, so when developing your own scorers, it's a good idea to test them using\n", + "your own private data.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Passage:\n", + "(CNN)A chiseled boxer's Instagram feed shows him making constant references to the Bible and enjoying gospel singing with his wife. \n", + "\n", + "Another features his formidable opponent counting stacks of money, hanging out in strip clubs, and flashing diamond watches and Ferraris. \n", + "\n", + "Welcome to the world of boxing promotion, circa 2015. \n", + "\n", + "American Floyd Mayweather and Filipino Manny Pacquiao are set to officially announce their heavily anticipated boxing match at a press conference in Los Angeles Wednesday. \n", + "\n", + "With the combined purse for the May 2 bout in Las Vegas reported to touch $300 million pending viewership numbers, the incentives to self-promote could not be higher. \n", + "\n", + "\"Nowadays you have to be on social media to launch the fight and to build hype,\" says boxing promoter Nisse Sauerland, CEO of Team Sauerland. \"It couldn't be done without it.\" \n", + "\n", + "Thirty-eight year old Mayweather (47-0, 26 knockouts), who favors the moniker \"The Money Man\" or \"TBE\" (The Best Ever), boasts nearly five million Instagram followers, 5.65 million followers on Twitter and 9.2 million Facebook likes. \n", + "\n", + "He famously confirmed the fight via Shots, a photo sharing social media application that he's invested in, and displays links to his clothing brand, The Money Team, on all his accounts. \n", + "\n", + "Along with professing to the be the best fighter of all time, he could also stake a claim to be one of the greatest social media users in sports. \n", + "\n", + "\"I think they're both playing their roles,\" says Sauerland, who promotes over 45 boxers. \"You've got the bad guy and the good guy, really. You've got the guy who throws the money around (Mayweather), that's his image, and Pacquiao, he's the hope of a nation.\" \n", + "\n", + "Question:\n", + "Who are the two boxer featured in this article?\n", + "\n", + "Answer:\n", + "Floyd Mayweather and Manny Pacquiao\n" + ] + } + ], + "source": [ + "import duckdb\n", + "\n", + "# DuckDB has an easy wrapper for loading datasets from Hugging Face.\n", + "con = duckdb.connect(\":memory:\")\n", + "full_result = con.query(\"\"\"\n", + " SELECT * FROM 'hf://datasets/stanfordnlp/coqa/data/validation-00000-of-00001.parquet'\n", + " LIMIT 40\n", + "\"\"\").fetchall()\n", + "\n", + "single_result = full_result[10]\n", + "\n", + "print(\"Passage:\")\n", + "print(single_result[1])\n", + "\n", + "print(\"\\nQuestion:\")\n", + "print(single_result[2][0])\n", + "\n", + "print(\"\\nAnswer:\")\n", + "print(single_result[3][\"input_text\"][0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The data contains a series of passages, each with a number of questions and answers. 
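\n", + "\n", + "Each row, as we use it here, keeps the passage at index 1, the list of questions at index 2, and the answers under `r[3][\"input_text\"]`. As a quick sanity check of that layout (a sketch, not part of the original notebook), you can confirm that every question has a matching reference answer before pairing them up:\n", + "\n", + "```python\n", + "for r in full_result[:5]:\n", + "    questions = r[2]\n", + "    answers = r[3][\"input_text\"]\n", + "    # Each question should line up with exactly one reference answer.\n", + "    assert len(questions) == len(answers)\n", + "    print(f\"{len(questions)} questions and answers\")\n", + "```\n", + "\n", + "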
Let's flatten this into a list of `(passage, question, answer)` tuples.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "629\n" + ] + } + ], + "source": [ + "from dataclasses import dataclass\n", + "\n", + "\n", + "@dataclass\n", + "class QuestionAnswer:\n", + " passage: str\n", + " question: str\n", + " expected_answer: str\n", + " generated_answer: str\n", + "\n", + "\n", + "qa_pairs = [\n", + " QuestionAnswer(\n", + " passage=r[1],\n", + " question=question,\n", + " generated_answer=r[3][\"input_text\"][i],\n", + " expected_answer=r[3][\"input_text\"][i],\n", + " )\n", + " for r in full_result\n", + " for (i, question) in enumerate(r[2])\n", + "]\n", + "\n", + "print(len(qa_pairs))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Adding hallucinations\n", + "\n", + "Because Braintrust's scorer is designed to test hallucinations, we can use the QA pairs to generate known hallucinations. We'll create hallucinated answers by asking an\n", + "LLM to confidently generate an answer to each question without using the passage.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Passage:\n", + "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. \n", + "\n", + "\"What are you doing, Cotton?!\" \n", + "\n", + "\"I only wanted to be more like you\". \n", + "\n", + "Cotton's mommy rubbed her face on Cotton's and said \"Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way\". And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry. \n", + "\n", + "\"Don't ever do that again, Cotton!\" they all cried. \"Next time you might mess up that pretty white fur of yours and we wouldn't want that!\" \n", + "\n", + "Then Cotton thought, \"I change my mind. 
I like being special\".\n", + "\n", + "Question:\n", + "What color was Cotton?\n", + "\n", + "Expected Answer:\n", + "white\n", + "\n", + "Generated Answer:\n", + "Cotton is typically a natural off-white color when it is picked from the plant, although during the late 1800s, fields in the southern United States often sprouted rare and vibrant purple cotton bolls that were highly prized for their unique appearance.\n", + "\n", + "\n", + "Number of hallucinations: 259\n" + ] + } + ], + "source": [ + "import asyncio\n", + "import random\n", + "\n", + "random.seed(42)\n", + "\n", + "\n", + "async def hallucinate_answer(qa):\n", + " response = await client.chat.completions.create(\n", + " model=\"gpt-4o\",\n", + " messages=[\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"\"\"\\\n", + "You are a helpful hallucinating assistant, who makes up fake answers to questions.\n", + "\n", + "Answer the following question in 1 sentence. If you know the answer, then make up some fake\n", + "superfluous details that are not in the passage you have memorized.\n", + "\n", + "Make sure to always answer it confidently, even if you don't know the answer. Do not use words\n", + "like \"perhaps\", \"likely\", \"maybe\", etc. or punctuation like \"...\".Do not admit that you cannot\n", + "or do not know the answer.\"\"\",\n", + " },\n", + " {\"role\": \"user\", \"content\": qa.question},\n", + " ],\n", + " temperature=1,\n", + " max_tokens=100,\n", + " )\n", + " return response.choices[0].message.content\n", + "\n", + "\n", + "hallucinated_answers = await asyncio.gather(\n", + " *[hallucinate_answer(qa) for qa in qa_pairs]\n", + ")\n", + "\n", + "\n", + "hallucinations = [\n", + " QuestionAnswer(\n", + " passage=qa.passage,\n", + " question=qa.question,\n", + " expected_answer=qa.expected_answer,\n", + " generated_answer=hallucination,\n", + " )\n", + " for (qa, hallucination) in zip(qa_pairs, hallucinated_answers)\n", + " # Exclude simple yes/no answers.\n", + " if \"yes\" not in hallucination.lower() and \"no\" not in hallucination.lower()\n", + "]\n", + "\n", + "print(\"Passage:\")\n", + "print(hallucinations[0].passage)\n", + "print(\"\\nQuestion:\")\n", + "print(hallucinations[0].question)\n", + "print(\"\\nExpected Answer:\")\n", + "print(hallucinations[0].expected_answer)\n", + "print(\"\\nGenerated Answer:\")\n", + "print(hallucinations[0].generated_answer)\n", + "\n", + "print(\"\\n\\nNumber of hallucinations:\", len(hallucinations))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating the evaluators\n", + "\n", + "We'll consider a few popular approaches for creating an LLM-as-a-judge. For each approach, we'll create a scorer and then \"meta-evaluate\" it to see how it performs.\n", + "Since we know that the hallucinated answers are incorrect, we'll assess the quality of an evaluator by testing how often it scores the hallucinated answers as `0`.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### LLM-as-a-judge #1: Numeric rater\n", + "\n", + "A common initial intuition when creating an LLM-as-a-judge is asking the LLM to rate the answer on a scale of 1 to 5. 
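\n", + "\n", + "As a minimal sketch of how such a rating maps onto the 0-1 range used for scoring (the `normalize_rating` helper is illustrative, not part of this notebook):\n", + "\n", + "```python\n", + "def normalize_rating(rating: int, scale: int = 5) -> float:\n", + "    # Map a 1..scale rating onto 0..1; e.g. 3 on a 1-to-5 scale becomes 0.5.\n", + "    return (rating - 1) / (scale - 1)\n", + "\n", + "\n", + "print(normalize_rating(5))  # 1.0 -- the top rating maps to a perfect score\n", + "print(normalize_rating(1))  # 0.0 -- the bottom rating maps to zero\n", + "```\n", + "\n", + "The numeric rater below applies the same normalization to a 1-to-10 rating via `(rating - 1) / 9`.\n", + "\n", + "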
The benefit of this approach is that\n", + "it's easy to convert the LLM's output into a numeric score.\n", + "\n", + "We'll use a modified version of the [Factuality](https://github.com/braintrustdata/autoevals/blob/main/templates/factuality.yaml) template, but ask the LLM to\n", + "rate the answer on a scale of 1 to 10.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face\n", + "1.0\n", + "Why? On a hallucinated answer: Because the intricate balance of cosmic forces dictated the alignment of elements, guided by the invisible hand of interstellar diplomacy, causing events to unfold as they do.\n", + "0.0\n" + ] + } + ], + "source": [ + "import json\n", + "\n", + "PROMPT = \"\"\"\\\n", + "You are comparing a submitted answer to an expert answer on a given question. Here is the data:\n", + "[BEGIN DATA]\n", + "************\n", + "[Question]: {input}\n", + "************\n", + "[Expert]: {expected}\n", + "************\n", + "[Submission]: {output}\n", + "************\n", + "[END DATA]\n", + "\n", + "Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.\n", + "Rate the submission on a scale of 1 to 10.\n", + "\"\"\"\n", + "\n", + "\n", + "@braintrust.traced\n", + "async def numeric_rater(input, output, expected):\n", + " response = await client.chat.completions.create(\n", + " model=\"gpt-4o\",\n", + " messages=[\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": PROMPT.format(input=input, output=output, expected=expected),\n", + " }\n", + " ],\n", + " temperature=0,\n", + " tools=[\n", + " {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"rate\",\n", + " \"description\": \"Rate the submission on a scale of 1 to 10.\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"rating\": {\"type\": \"integer\", \"minimum\": 1, \"maximum\": 10},\n", + " },\n", + " \"required\": [\"rating\"],\n", + " },\n", + " },\n", + " }\n", + " ],\n", + " tool_choice={\"type\": \"function\", \"function\": {\"name\": \"rate\"}},\n", + " )\n", + " arguments = json.loads(\n", + " response.choices[0].message.tool_calls[0].function.arguments)\n", + " return (arguments[\"rating\"] - 1) / 9\n", + "\n", + "\n", + "print(qa_pairs[10].question, \"On a correct answer:\",\n", + " qa_pairs[10].generated_answer)\n", + "print(\n", + " await numeric_rater(\n", + " qa_pairs[10].question,\n", + " qa_pairs[10].generated_answer,\n", + " qa_pairs[10].expected_answer,\n", + " )\n", + ")\n", + "\n", + "print(\n", + " hallucinations[10].question,\n", + " \"On a hallucinated answer:\",\n", + " hallucinations[10].generated_answer,\n", + ")\n", + "print(\n", + " await numeric_rater(\n", + " hallucinations[10].question,\n", + " hallucinations[10].generated_answer,\n", + " hallucinations[10].expected_answer,\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This looks promising! Now that we have sanity checked it on a single example, let's run a proper evaluation and see how it performs on a wider set of data. An evaluation consists of three components:\n", + "\n", + "- **Data**: In this case, the `input` is the question, hallucinated answer, and ground truth answer. The scorer will convert this into a score between 0 and 1. 
The expected score is 0, since it's a hallucination.\n", + "- **Task**: The task is simply calling the numeric rater for each input.\n", + "- **Scores**: We'll assess the quality of the generated score by comparing it with the ground truth score. Since we know both numbers are between 0 and 1, we can use the normalized difference as the score.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Experiment Numeric rater is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater\n", + "LLM-as-a-judge [experiment_name=Numeric rater] (data): 259it [00:00, 104685.82it/s]\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "02fe772e41ae4b4cbc51b5e02a975208", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "LLM-as-a-judge [experiment_name=Numeric rater] (tasks): 0%| | 0/259 [00:00