Fix llm as a judge cookbook images (#1517)
@@ -499,7 +499,7 @@
"It looks like the numeric rater scored almost 94% in total. That's not bad, but if 6% of your evals are incorrectly judged, that could make it very hard to trust them. Let's dig into the Braintrust\n",
"UI to get some insight into what's going on.\n",
"\n",
"\n",
"\n",
"\n",
"It looks like a number of the incorrect answers were scored with numbers between 1 and 10. However, we do not currently have any insight into why the model gave these scores. Let's see if we can\n",
"fix that next.\n"
@@ -670,11 +670,11 @@
"It doesn't look like adding reasoning helped the score (in fact, it's half a percent worse). However, if we look at one of the failures, we'll get some insight into\n",
"what the model was thinking. Here is an example of a hallucinated answer:\n",
"\n",
"\n",
"\n",
"\n",
"And the score along with its reasoning:\n",
"\n",
"\n"
"\n"
]
},
{
Before Width: | Height: | Size: 207 KiB After Width: | Height: | Size: 207 KiB |
Before Width: | Height: | Size: 1.1 MiB After Width: | Height: | Size: 1.1 MiB |
Before Width: | Height: | Size: 300 KiB After Width: | Height: | Size: 300 KiB |
Before Width: | Height: | Size: 191 KiB |
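The hunks above describe a numeric judge that rates answers on a 1–10 scale and, in the second pass, emits its reasoning alongside the score. As a minimal sketch of how such a judge's reply might be handled, here is a small parser that splits a "reasoning, then `Score:`" reply and normalizes the 1–10 score to 0–1. The reply format and the function name are illustrative assumptions, not the cookbook's actual scorer:

```python
import re


def parse_judge_response(text: str) -> tuple[str, float]:
    """Split a judge reply of the assumed form '<reasoning>\nScore: N' into
    its reasoning text and a score normalized from the 1-10 scale to 0-1."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError("judge reply contains no 'Score:' line")
    raw = float(match.group(1))
    if not 1 <= raw <= 10:
        raise ValueError(f"score {raw} is outside the 1-10 scale")
    reasoning = text[: match.start()].strip()
    return reasoning, (raw - 1) / 9  # map 1-10 linearly onto 0-1


# Hypothetical judge reply for a hallucinated answer:
reply = "The answer invents a citation that does not exist.\nScore: 3"
reasoning, score = parse_judge_response(reply)
print(round(score, 3))  # → 0.222
```

Asking the model for reasoning before the score (rather than after) gives it room to "think" before committing to a number, which is the motivation for the second experiment above even though it did not improve the aggregate score here.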