"# Building an LLM-as-a-judge evaluation to detect hallucinations with Braintrust\n",
"\n",
"Let's say you're working on a customer service bot and trying to evaluate the quality of its responses. Consider a question like \"What is your return policy?\" If the correct answer is \"You can return items within 30 days of purchase,\" but your bot generates \"You can return items within 30 days,\" how would you evaluate whether this is a good response?\n",
"\n",
"A heuristic like the `Levenshtein` string distance would indicate that the response is incorrect. However, a better approach is to use an LLM-as-a-judge to assess the accuracy of the response. LLM-as-a-judge is a technique that leverages an LLM to score the quality of answers. LLMs can reason about language beyond surface-level string comparisons, enabling them to evaluate answers more accurately.\n",
"\n",
"In this cookbook, we'll walk through how to build an LLM-as-a-judge scorer that can detect hallucinations using [Braintrust](https://www.braintrust.dev/), a third-party evaluation platform that is compatible with OpenAI's models.\n",
"\n",
"## Installing dependencies\n",
"\n",
"Let's install a few basic dependencies. We'll use the CoQA dataset (via DuckDB), [Braintrust](https://www.braintrust.dev/) for evals, and [OpenAI's models](https://platform.openai.com/docs/models). Please note that Braintrust is a third-party evaluation platform and you should review their [terms of service and privacy policy](https://www.braintrust.dev/legal/terms-of-service) before proceeding.\n"
"Next, let's initialize the OpenAI client. We'll use the `AsyncOpenAI` client so that we can parallelize our requests. The `braintrust.wrap_openai` function\n",
"wraps the OpenAI client to enable logging LLM calls to [Braintrust](https://www.braintrust.dev/). We'll use Braintrust to facilitate the evaluations below.\n",
"Before proceeding, you should sign up for a [Braintrust account](https://www.braintrust.dev/signup) and set `BRAINTRUST_API_KEY` in your environment to a valid API key.\n"
"We'll use the [CoQA dataset](https://stanfordnlp.github.io/coqa/) which contains a diverse set of passages, questions, and answers. Because CoQA is quite large, we'll just look at the first several passages. As with any public dataset, there's a chance that the underlying LLMs have memorized aspects of the dataset, so when developing your own scorers, it's a good idea to test them using\n",
"your own private data.\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Passage:\n",
"(CNN)A chiseled boxer's Instagram feed shows him making constant references to the Bible and enjoying gospel singing with his wife. \n",
"\n",
"Another features his formidable opponent counting stacks of money, hanging out in strip clubs, and flashing diamond watches and Ferraris. \n",
"\n",
"Welcome to the world of boxing promotion, circa 2015. \n",
"\n",
"American Floyd Mayweather and Filipino Manny Pacquiao are set to officially announce their heavily anticipated boxing match at a press conference in Los Angeles Wednesday. \n",
"\n",
"With the combined purse for the May 2 bout in Las Vegas reported to touch $300 million pending viewership numbers, the incentives to self-promote could not be higher. \n",
"\n",
"\"Nowadays you have to be on social media to launch the fight and to build hype,\" says boxing promoter Nisse Sauerland, CEO of Team Sauerland. \"It couldn't be done without it.\" \n",
"\n",
"Thirty-eight year old Mayweather (47-0, 26 knockouts), who favors the moniker \"The Money Man\" or \"TBE\" (The Best Ever), boasts nearly five million Instagram followers, 5.65 million followers on Twitter and 9.2 million Facebook likes. \n",
"\n",
"He famously confirmed the fight via Shots, a photo sharing social media application that he's invested in, and displays links to his clothing brand, The Money Team, on all his accounts. \n",
"\n",
"Along with professing to the be the best fighter of all time, he could also stake a claim to be one of the greatest social media users in sports. \n",
"\n",
"\"I think they're both playing their roles,\" says Sauerland, who promotes over 45 boxers. \"You've got the bad guy and the good guy, really. You've got the guy who throws the money around (Mayweather), that's his image, and Pacquiao, he's the hope of a nation.\" \n",
"\n",
"Question:\n",
"Who are the two boxer featured in this article?\n",
"\n",
"Answer:\n",
"Floyd Mayweather and Manny Pacquiao\n"
]
}
],
"source": [
"import duckdb\n",
"\n",
"# DuckDB has an easy wrapper for loading datasets from Hugging Face.\n",
"con = duckdb.connect(\":memory:\")\n",
"full_result = con.query(\"\"\"\n",
" SELECT * FROM 'hf://datasets/stanfordnlp/coqa/data/validation-00000-of-00001.parquet'\n",
" LIMIT 40\n",
"\"\"\").fetchall()\n",
"\n",
"single_result = full_result[10]\n",
"\n",
"print(\"Passage:\")\n",
"print(single_result[1])\n",
"\n",
"print(\"\\nQuestion:\")\n",
"print(single_result[2][0])\n",
"\n",
"print(\"\\nAnswer:\")\n",
"print(single_result[3][\"input_text\"][0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data contains a series of passages, each with a number of questions and answers. Let's flatten this into a list of `(passage, question, answer)` tuples.\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"629\n"
]
}
],
"source": [
"from dataclasses import dataclass\n",
"\n",
"\n",
"@dataclass\n",
"class QuestionAnswer:\n",
" passage: str\n",
" question: str\n",
" expected_answer: str\n",
" generated_answer: str\n",
"\n",
"\n",
"qa_pairs = [\n",
" QuestionAnswer(\n",
" passage=r[1],\n",
" question=question,\n",
" generated_answer=r[3][\"input_text\"][i],\n",
" expected_answer=r[3][\"input_text\"][i],\n",
" )\n",
" for r in full_result\n",
" for (i, question) in enumerate(r[2])\n",
"]\n",
"\n",
"print(len(qa_pairs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adding hallucinations\n",
"\n",
"Because Braintrust's scorer is designed to test hallucinations, we can use the QA pairs to generate known hallucinations. We'll create hallucinated answers by asking an\n",
"LLM to confidently generate an answer to each question without using the passage.\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Passage:\n",
"Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. \n",
"\n",
"\"What are you doing, Cotton?!\" \n",
"\n",
"\"I only wanted to be more like you\". \n",
"\n",
"Cotton's mommy rubbed her face on Cotton's and said \"Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way\". And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry. \n",
"\n",
"\"Don't ever do that again, Cotton!\" they all cried. \"Next time you might mess up that pretty white fur of yours and we wouldn't want that!\" \n",
"\n",
"Then Cotton thought, \"I change my mind. I like being special\".\n",
"\n",
"Question:\n",
"What color was Cotton?\n",
"\n",
"Expected Answer:\n",
"white\n",
"\n",
"Generated Answer:\n",
"Cotton is typically a natural off-white color when it is picked from the plant, although during the late 1800s, fields in the southern United States often sprouted rare and vibrant purple cotton bolls that were highly prized for their unique appearance.\n",
" *[hallucinate_answer(qa) for qa in qa_pairs]\n",
")\n",
"\n",
"\n",
"hallucinations = [\n",
" QuestionAnswer(\n",
" passage=qa.passage,\n",
" question=qa.question,\n",
" expected_answer=qa.expected_answer,\n",
" generated_answer=hallucination,\n",
" )\n",
" for (qa, hallucination) in zip(qa_pairs, hallucinated_answers)\n",
" # Exclude simple yes/no answers.\n",
" if \"yes\" not in hallucination.lower() and \"no\" not in hallucination.lower()\n",
"]\n",
"\n",
"print(\"Passage:\")\n",
"print(hallucinations[0].passage)\n",
"print(\"\\nQuestion:\")\n",
"print(hallucinations[0].question)\n",
"print(\"\\nExpected Answer:\")\n",
"print(hallucinations[0].expected_answer)\n",
"print(\"\\nGenerated Answer:\")\n",
"print(hallucinations[0].generated_answer)\n",
"\n",
"print(\"\\n\\nNumber of hallucinations:\", len(hallucinations))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the evaluators\n",
"\n",
"We'll consider a few popular approaches for creating an LLM-as-a-judge. For each approach, we'll create a scorer and then \"meta-evaluate\" it to see how it performs.\n",
"Since we know that the hallucinated answers are incorrect, we'll assess the quality of an evaluator by testing how often it scores the hallucinated answers as `0`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### LLM-as-a-judge #1: Numeric rater\n",
"\n",
"A common initial intuition when creating an LLM-as-a-judge is asking the LLM to rate the answer on a scale of 1 to 5. The benefit of this approach is that\n",
"it's easy to convert the LLM's output into a numeric score.\n",
"\n",
"We'll use a modified version of the [Factuality](https://github.com/braintrustdata/autoevals/blob/main/templates/factuality.yaml) template, but ask the LLM to\n",
"rate the answer on a scale of 1 to 10.\n"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face\n",
"1.0\n",
"Why? On a hallucinated answer: Because the intricate balance of cosmic forces dictated the alignment of elements, guided by the invisible hand of interstellar diplomacy, causing events to unfold as they do.\n",
"0.0\n"
]
}
],
"source": [
"import json\n",
"\n",
"PROMPT = \"\"\"\\\n",
"You are comparing a submitted answer to an expert answer on a given question. Here is the data:\n",
"[BEGIN DATA]\n",
"************\n",
"[Question]: {input}\n",
"************\n",
"[Expert]: {expected}\n",
"************\n",
"[Submission]: {output}\n",
"************\n",
"[END DATA]\n",
"\n",
"Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.\n",
"print(qa_pairs[10].question, \"On a correct answer:\",\n",
" qa_pairs[10].generated_answer)\n",
"print(\n",
" await numeric_rater(\n",
" qa_pairs[10].question,\n",
" qa_pairs[10].generated_answer,\n",
" qa_pairs[10].expected_answer,\n",
" )\n",
")\n",
"\n",
"print(\n",
" hallucinations[10].question,\n",
" \"On a hallucinated answer:\",\n",
" hallucinations[10].generated_answer,\n",
")\n",
"print(\n",
" await numeric_rater(\n",
" hallucinations[10].question,\n",
" hallucinations[10].generated_answer,\n",
" hallucinations[10].expected_answer,\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This looks promising! Now that we have sanity checked it on a single example, let's run a proper evaluation and see how it performs on a wider set of data. An evaluation consists of three components:\n",
"\n",
"- **Data**: In this case, the `input` is the question, hallucinated answer, and ground truth answer. The scorer will convert this into a score between 0 and 1. The expected score is 0, since it's a hallucination.\n",
"- **Task**: The task is simply calling the numeric rater for each input.\n",
"- **Scores**: We'll assess the quality of the generated score by comparing it with the ground truth score. Since we know both numbers are between 0 and 1, we can use the normalized difference as the score.\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Experiment Numeric rater is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater\n",
"It looks like the numeric rater scored almost 94% in total. That's not bad, but if 6% of your evals are incorrectly judged, that could make it very hard to trust them. Let's dig into the Braintrust\n",
"It looks like a number of the incorrect answers were scored with numbers between 1 and 10. However, we do not currently have any insight into why the model gave these scores. Let's see if we can\n",
"fix that next.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### LLM-as-a-judge #2: Adding reasoning\n",
"\n",
"Let's tweak the prompt to get the LLM to also reason about its rating. This method is called [Chain of Thought Reasoning](https://en.wikipedia.org/wiki/Chain_of_thought_reasoning). In addition\n",
"to potentially improving the score, it will give us some insight into why the model gave these scores.\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face\n",
"1.0\n",
"Why? On a hallucinated answer: Because the intricate balance of cosmic forces dictated the alignment of elements, guided by the invisible hand of interstellar diplomacy, causing events to unfold as they do.\n",
" \"description\": \"Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.\",\n",
"print(qa_pairs[10].question, \"On a correct answer:\",\n",
" qa_pairs[10].generated_answer)\n",
"print(\n",
" await numeric_rater(\n",
" qa_pairs[10].question,\n",
" qa_pairs[10].generated_answer,\n",
" qa_pairs[10].expected_answer,\n",
" )\n",
")\n",
"\n",
"print(\n",
" hallucinations[10].question,\n",
" \"On a hallucinated answer:\",\n",
" hallucinations[10].generated_answer,\n",
")\n",
"print(\n",
" await numeric_rater(\n",
" hallucinations[10].question,\n",
" hallucinations[10].generated_answer,\n",
" hallucinations[10].expected_answer,\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Experiment Numeric rater with reasoning is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning\n",
"LLM-as-a-judge [experiment_name=Numeric rater with reasoning] (data): 259it [00:00, 38500.31it/s]\n"
"See results for Numeric rater with reasoning at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning\n"
" experiment_name=\"Numeric rater with reasoning\",\n",
" max_concurrency=10,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It doesn't look like adding reasoning helped the score (in fact, it's half a percent worse). However, if we look at one of the failures, we'll get some insight into\n",
"what the model was thinking. Here is an example of a hallucinated answer:\n",
"It looks like the model is applying its own judgement to compute partial credit. This is a common problem with numeric rating—both for models and for humans—and can often be solved\n",
"by using better prompting.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### LLM-as-a-judge #3: Classifying instead of rating\n",
"\n",
"Next, we'll spell out specific criteria and ask the model to classify the answer according to those criteria. This method allows us to more precisely guide the model\n",
"towards the hallucinations we're testing for. Intuitively, giving the model specific criteria to rate will result in a more accurate score.\n"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face\n",
"1\n",
"Why? On a hallucinated answer: Because the intricate balance of cosmic forces dictated the alignment of elements, guided by the invisible hand of interstellar diplomacy, causing events to unfold as they do.\n",
"0\n"
]
}
],
"source": [
"PROMPT = \"\"\"\\\n",
"You are comparing a submitted answer to an expert answer on a given question. Here is the data:\n",
"[BEGIN DATA]\n",
"************\n",
"[Question]: {input}\n",
"************\n",
"[Expert]: {expected}\n",
"************\n",
"[Submission]: {output}\n",
"************\n",
"[END DATA]\n",
"\n",
"Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.\n",
"The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:\n",
"(A) The submitted answer is a subset of the expert answer and is fully consistent with it.\n",
"(B) The submitted answer is a superset of the expert answer and is fully consistent with it.\n",
"(C) The submitted answer contains all the same details as the expert answer.\n",
"(D) There is a disagreement between the submitted answer and the expert answer.\n",
"(E) The answers differ, but these differences don't matter from the perspective of factuality.\n",
"\n",
"Answer the question by calling `select_choice` with your reasoning in a step-by-step matter to be\n",
"sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a\n",
"single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.\n",
"\"\"\"\n",
"\n",
"# Since we're testing for hallucinations, penalize (B) as much as (D).\n",
" \"description\": \"Call this function to select a choice.\",\n",
" \"parameters\": {\n",
" \"properties\": {\n",
" \"reasons\": {\n",
" \"description\": \"Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.\",\n",
"The classifier scored 98.5% which is a significant improvement!\n",
"\n",
"### Codifying this pattern\n",
"\n",
"The classifier above can simply be rewritten as:\n",
"\n",
"```python\n",
"PROMPT = \"\"\"\\\n",
"You are comparing a submitted answer to an expert answer on a given question. Here is the data:\n",
"[BEGIN DATA]\n",
"************\n",
"[Question]: {{input}}\n",
"************\n",
"[Expert]: {{expected}}\n",
"************\n",
"[Submission]: {{output}}\n",
"************\n",
"[END DATA]\n",
"\n",
"Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.\n",
"The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:\n",
"(A) The submitted answer is a subset of the expert answer and is fully consistent with it.\n",
"(B) The submitted answer is a superset of the expert answer and is fully consistent with it.\n",
"(C) The submitted answer contains all the same details as the expert answer.\n",
"(D) There is a disagreement between the submitted answer and the expert answer.\n",
"(E) The answers differ, but these differences don't matter from the perspective of factuality.\n",
"\n",
"Answer the question by calling `select_choice` with your reasoning in a step-by-step matter to be\n",
"sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a\n",
"single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.\n",
"As a next step, you could dig into the individual improvements and regressions to assess them and consider future improvements to the prompt. You could also test it on your own data, and double check that the results hold for your use case.\n",
"You could also measure a model like o1, try fine-tuning a smaller model and see if the results are reproducible, or use few-shot prompting to align the model with more subjective criteria.\n",
"In all cases, you should strive to evaluate your results, so you can rigorously assess the impact of each change.\n"