```
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step.
```

```text-davinci-002
There are 16 balls in total.
Half of the balls are golf balls.
That means that there are 8 golf balls.
Half of the golf balls are blue.
That means that there are 4 blue golf balls.
```
The rest of this article shares techniques for improving reliability of large language models.

One way to give a model more time and space to think is to break tasks into simpler pieces.

As an example, consider a task where we ask the model a multiple-choice question about some text - in this case, a game of Clue. When asked directly, `text-davinci-002` isn't able to put clues 3 & 5 together, and answers incorrectly:
```text-davinci-002
Use the following clues to answer the following multiple-choice question.

Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.

Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick

Solution:
```

```text-davinci-002
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick
```
Although clues 3 and 5 establish that Colonel Mustard was the only person in the observatory and that the person in the observatory had the candlestick, the model fails to combine them into a correct answer of (a) Yes.

However, instead of asking for the answer directly, we can split the task into three pieces:

- First, go through the clues one by one and consider whether the clue is potentially relevant
- Second, combine the relevant clues to reason out the answer to the question
- Third, write the final answer: either (a), (b), or (c)
```text-davinci-002
Use the following clues to answer the following multiple-choice question, using the following procedure:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant
(2) Second, combine the relevant clues to reason out the answer to the question
(3) Third, map the answer to one of the multiple choice answers: either (a), (b), or (c)

Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.

Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick

Solution:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant:
```
```text-davinci-002
Clue 1: Miss Scarlett was the only person in the lounge. This clue is not relevant.
Clue 2: The person with the pipe was in the kitchen. This clue is not relevant.
Clue 3: Colonel Mustard was the only person in the observatory. This clue is relevant.
Clue 4: Professor Plum was not in the library nor the billiard room. This clue is not relevant.
Clue 5: The person with the candlestick was in the observatory. This clue is relevant.

(2) Second, combine the relevant clues to reason out the answer to the question:
Clue 3 and Clue 5 both say that the person with the candlestick was in the observatory. Colonel Mustard was the only person in the observatory, so Colonel Mustard was the person with the candlestick.

(3) Third, write the final answer:
(a) Yes; Colonel Mustard was in the observatory with the candlestick
```
By giving the model more time and space to think, and guiding it along a reasoning plan, it's able to figure out the correct answer of (a) Yes.
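In code, running such a structured prompt might look something like the minimal sketch below. It uses the completions API of the `text-davinci-002` era; the function name and prompt assembly are ours, not from the article:

```python
import openai

def answer_with_reasoning_plan(clues: str, question: str) -> str:
    # Prepend a reasoning procedure, then seed the completion with step (1)
    # so the model works through the clues before committing to an answer.
    prompt = (
        "Use the following clues to answer the following multiple-choice "
        "question, using the following procedure:\n"
        "(1) First, go through the clues one by one and consider whether the "
        "clue is potentially relevant\n"
        "(2) Second, combine the relevant clues to reason out the answer to "
        "the question\n"
        "(3) Third, map the answer to one of the multiple choice answers: "
        "either (a), (b), or (c)\n\n"
        f"Clues:\n{clues}\n\n"
        f"Question: {question}\n\n"
        "Solution:\n"
        "(1) First, go through the clues one by one and consider whether the "
        "clue is potentially relevant:"
    )
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=512,
        temperature=0,  # deterministic decoding; we want one best answer
    )
    return response.choices[0].text.strip()
```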
Another benefit of splitting complex instructions into smaller subtasks is that it can help keep the model focused on each subtask.

Summary:

```text-davinci-002
The text explains that statistics is a science that studies the variability, collection, organization, analysis, interpretation, and presentation of data, as well as the random process that generates them following the laws of probability.
```

However, if we first ask the model to identify the language of the text, and then summarize the text, it becomes more reliable:
```text-davinci-002
First, identify the language of the text. Second, summarize the text using the original language of the text. The summary should be one sentence long.
```
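The same idea can also be implemented as two separate completion calls, feeding the model's answer to the first step into the prompt for the second. A rough sketch, under the assumption that a simple wrapper like the hypothetical `complete` helper below is acceptable:

```python
import openai

def complete(prompt: str) -> str:
    # Thin wrapper around the completions endpoint of that era.
    response = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, max_tokens=200, temperature=0
    )
    return response.choices[0].text.strip()

def summarize_in_original_language(text: str) -> str:
    # Step 1: ask only for the language of the text.
    language = complete(
        f"What language is the following text written in?\n\n{text}\n\nLanguage:"
    )
    # Step 2: summarize, explicitly naming the identified language.
    return complete(
        f"Summarize the following text in one sentence, written in {language}.\n\n"
        f"Text: {text}\n\nSummary:"
    )
```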
Another powerful technique for improving the reliability of answers is to prompt the model to reason out its answer before giving a final answer.

Published by [Takeshi Kojima et al. in 2022](https://arxiv.org/abs/2205.11916), the easiest way to prompt a model to reason out the answer is to simply prepend answers with `Let's think step by step.` Figure 2 illustrates an example:
[Source: _Large Language Models are Zero-Shot Reasoners_ by Takeshi Kojima et al. (2022)](https://arxiv.org/abs/2205.11916)
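In API terms the trick is just string concatenation. A minimal sketch, with the function name being our own:

```python
import openai

def ask_step_by_step(question: str) -> str:
    # Seed the answer with the zero-shot reasoning trigger from Kojima et al.
    prompt = f"Q: {question}\nA: Let's think step by step."
    response = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, max_tokens=256, temperature=0
    )
    return response.choices[0].text.strip()
```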
#### Results

Applying this simple trick to the MultiArith math dataset, the authors found that `Let's think step by step` quadrupled accuracy, from 18% to 79%!
[Source: _Large Language Models are Zero-Shot Reasoners_ by Takeshi Kojima et al. (2022)](https://arxiv.org/abs/2205.11916)
#### Implications

Although the `Let's think step by step` trick works well on math problems, it's not effective on all tasks. The authors found that it was most helpful for multi-step arithmetic problems, symbolic reasoning problems, strategy problems, and other reasoning problems. It didn't help with simple math problems or common sense questions, and presumably wouldn't help with many other non-reasoning tasks either.
[Source: _Large Language Models are Zero-Shot Reasoners_ by Takeshi Kojima et al. (2022)](https://arxiv.org/abs/2205.11916)
To learn more, read the [full paper](https://arxiv.org/abs/2205.11916).

Prompting the model to reason out its answers can be done in many ways. One way is to demonstrate with a few examples ('few-shot'), as studied by [Jason Wei and Denny Zhou et al. from Google](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html). Here's an example few-shot chain-of-thought prompt:
[Source: _Chain of Thought Prompting Elicits Reasoning in Large Language Models_ by Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html)
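Assembling such a few-shot prompt is just a matter of concatenating worked examples in front of the new question. A sketch, with one demonstration borrowed from the paper and the rest of the scaffolding our own:

```python
import openai

# (question, written-out reasoning chain) demonstrations, as in Wei et al.
DEMONSTRATIONS = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
        "tennis balls. 5 + 6 = 11. The answer is 11.",
    ),
]

def few_shot_chain_of_thought(question: str) -> str:
    # Concatenate the demonstrations, then append the new question.
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in DEMONSTRATIONS)
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=f"{demos}\n\nQ: {question}\nA:",
        max_tokens=256,
        temperature=0,
    )
    return response.choices[0].text.strip()
```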
More demonstrations of reasoning chains written by human labelers:

[Source: _Chain of Thought Prompting Elicits Reasoning in Large Language Models_ by Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html)

[(Note that it has been called into question whether pears actually float)](https://twitter.com/Meaningness/status/1561062170074370048?s=20&t=mpHt8f3RRboztXxdhLFnWQ)
Testing on grade school math problems, the authors found that chain of thought prompting tripled the solve rate, from 18% to 57%.
[Source: _Chain of Thought Prompting Elicits Reasoning in Large Language Models_ by Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html)
In addition to math problems, chain of thought prompting also lifted performance on questions related to sports understanding, coin flip tracking, and last letter concatenation. In most cases, not many examples were needed to saturate the performance gains (less than 8 or so).
[Source: _Chain of Thought Prompting Elicits Reasoning in Large Language Models_ by Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html)
To learn more, read the [full paper](https://arxiv.org/abs/2201.11903).

In general, to eke out maximum performance on a task, you'll need to fine-tune a custom model.

In 2022, Eric Zelikman and Yuhuai Wu et al. published a clever procedure for using a few-shot prompt to generate a dataset of explanations that could be used to fine-tune a model. The idea is to use a few-shot prompt to generate candidate explanations, and only keep the explanations that produce the correct answer. Then, to get additional explanations for some of the incorrect answers, retry the few-shot prompt, but with correct answers given as part of the question. The authors called their procedure STaR (Self-taught Reasoner):
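The core filtering loop is easy to sketch. Here `generate_explanation` stands in for a few-shot chain-of-thought sampler (like the one sketched above) and `extract_answer` for a parser of final answers; both are hypothetical helpers, and the rationalization hint format is ours, not the paper's:

```python
def build_star_dataset(problems):
    """problems: list of (question, gold_answer) pairs."""
    keep = []
    for question, gold in problems:
        explanation = generate_explanation(question)  # few-shot CoT sample
        if extract_answer(explanation) == gold:
            keep.append((question, explanation))  # correct: keep for fine-tuning
        else:
            # Rationalization: retry with the correct answer revealed, so the
            # model only has to explain it; keep the new explanation if it
            # leads back to the right answer.
            hinted = f"{question}\n(Hint: the correct answer is {gold}.)"
            explanation = generate_explanation(hinted)
            if extract_answer(explanation) == gold:
                keep.append((question, explanation))
    return keep  # fine-tune the model on these (question, explanation) pairs
```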
[Source: _STaR: Bootstrapping Reasoning With Reasoning_ by Eric Zelikman and Yuhuai Wu et al. (2022)](https://arxiv.org/abs/2203.14465)
With this technique, you can combine the benefits of fine-tuning with the benefits of chain-of-thought prompting without needing to write thousands of example explanations.
When the authors applied this technique to a Common Sense Q&A dataset, they found that STaR outperformed both chain-of-thought prompting alone (73% > 37%) and fine-tuning alone (73% > 60%):

[Source: _STaR: Bootstrapping Reasoning With Reasoning_ by Eric Zelikman and Yuhuai Wu et al. (2022)](https://arxiv.org/abs/2203.14465)
To learn more, read the [full paper](https://arxiv.org/abs/2203.14465).

A number of extensions of chain-of-thought prompting have been published as well.

Published by Antonia Creswell et al., one extension of the chain-of-thought technique is to split the single prompt for generating explanations and answers into smaller parts. First, a prompt selects a relevant subset of facts from the text ('selection prompt'). Then, a second prompt infers a conclusion from the selected facts ('inference prompt'). These prompts are then alternated in a loop to generate multiple steps of reasoning and eventually land on a final answer. The authors illustrate the idea in the following figure:
[Source: _Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2205.09712)
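The alternation is natural to express as a loop. In this sketch, `select` and `infer` stand in for the two prompts, and `answers_question` for a check that a derived fact settles the question; all three are hypothetical helpers:

```python
def selection_inference(facts, question, max_steps=5):
    context = list(facts)
    for _ in range(max_steps):
        selected = select(context, question)  # selection prompt: pick relevant facts
        conclusion = infer(selected)          # inference prompt: derive a new fact
        context.append(conclusion)            # the new fact becomes available evidence
        if answers_question(conclusion, question):
            return conclusion
    return None  # step budget exhausted without reaching an answer
```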
#### Results

When applied to a 7B-parameter model, the authors found that selection-inference prompting substantially improved performance relative to chain-of-thought prompting on the bAbI and ProofWriter benchmark tasks (both of which require longer sequences of reasoning steps). The best performance they achieved combined selection-inference prompting with fine-tuning.
[Source: _Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2205.09712)
#### Implications

The halter model brings a couple of advantages:

- it can tell the selection-inference process to stop or keep going, as necessary
- if the process never halts, you'll get no answer, which is often preferable to a hallucinated guess
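A sketch of how a halter might fit into the selection-inference loop above; `should_halt` and `final_answer` stand in for the fine-tuned halter model and the answer-extraction step, and are hypothetical helpers:

```python
def faithful_reasoning(facts, question, max_steps=10):
    context = list(facts)
    for _ in range(max_steps):
        selected = select(context, question)
        conclusion = infer(selected)
        context.append(conclusion)
        # Halter: after each step, ask whether the question is now answerable.
        if should_halt(context, question):
            return final_answer(context, question)
    return None  # never halted: better no answer than a hallucinated guess
```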
[Source: _Faithful Reasoning Using Large Language Models_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271)
Second, the authors add a value function, which is used to assess the quality of reasoning steps and search over multiple reasoning trajectories. This echoes a common theme for increasing reliability: instead of generating a single answer from the model, generate a set of answers and then use some type of value function / discriminator / verifier model to pick the best one.
[Source: _Faithful Reasoning Using Large Language Models_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271)
In addition to these two extensions, the authors also use a trick to reduce hallucination of fake facts. Rather than asking the model to write out factual sentences, they fine-tune a model to work with sentence labels (e.g., sen1) instead. This helps prevent the model from hallucinating fake facts not mentioned in the prompt context.
[Source: _Faithful Reasoning Using Large Language Models_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271)
#### Results

The authors evaluated their technique on two benchmarks: the ProofWriter task (not shown) and [EntailmentBankQA](https://allenai.org/data/entailmentbank) (shown). The technique increased accuracy substantially, especially on harder reasoning problems.

|
||||
<br>Source: *Faithful Reasoning Using Large Language Models* by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271)
|
||||

|
||||
<br>Source: _Faithful Reasoning Using Large Language Models_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271)
|
||||
|
||||
In addition, their sentence label manipulation trick essentially eliminated hallucination!

|
||||
<br>Source: *Faithful Reasoning Using Large Language Models* by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271)
|
||||

|
||||
<br>Source: _Faithful Reasoning Using Large Language Models_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271)
|
||||
|
||||
#### Implications

Least-to-most prompting is another technique that splits up reasoning tasks into smaller, more reliable subtasks. The idea is to elicit a subtask from the model by prompting it with something like `To solve {question}, we need to first solve: "`. Then, with that subtask in hand, the model can generate a solution. The solution is appended to the original question and the process is repeated until a final answer is produced.
[Source: _Least-to-most Prompting Enables Complex Reasoning in Large Language Models_ by Denny Zhou et al. (2022)](https://arxiv.org/abs/2205.10625)
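A loose sketch of the loop, reusing the hypothetical `complete` helper sketched earlier; the stopping rule here is simplified to a fixed subproblem budget, which is our simplification rather than the paper's:

```python
def least_to_most(question: str, max_subproblems: int = 3) -> str:
    transcript = ""
    for _ in range(max_subproblems):
        # Elicit the next subproblem from the model.
        subproblem = complete(
            f'{transcript}To solve "{question}", we need to first solve: "'
        ).rstrip('"')
        # Solve the subproblem, then append the solution to the transcript.
        answer = complete(f"{transcript}Q: {subproblem}\nA:")
        transcript += f"Q: {subproblem}\nA: {answer}\n\n"
    # Finally, answer the original question given the solved subproblems.
    return complete(f"{transcript}Q: {question}\nA:")
```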
#### Results

When applied to benchmarks involving long reasoning chains using `code-davinci-002` (which is optimized for code but can still understand text), the authors measured gains as large as 16% -> 99.7%!
[Source: _Least-to-most Prompting Enables Complex Reasoning in Large Language Models_ by Denny Zhou et al. (2022)](https://arxiv.org/abs/2205.10625)
#### Implications

To learn more, read the [full paper](https://arxiv.org/abs/2205.10625).
#### Method

In contrast to the previous techniques, which try to maximize the likelihood of correct answers, another approach is to use GPT-3 to generate a tree of possible explanations (both correct _and incorrect_), and then analyze their relationships to guess at which set is correct. This technique was coined maieutic prompting by [Jaehun Jung et al. in May 2022](https://arxiv.org/abs/2205.11822) (maieutic means relating to the Socratic method of asking questions to elicit ideas).
The method is complicated, and works as follows:

- Use a solver to find the most self-consistent set of beliefs, and take those as true
[Source: _Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations_ by Jaehun Jung et al. (2022)](https://arxiv.org/abs/2205.11822)
#### Results

[Source: _Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations_ by Jaehun Jung et al. (2022)](https://arxiv.org/abs/2205.11822)
#### Implications

To learn more, read the [full paper](https://arxiv.org/abs/2205.11822).

For tasks with a discrete set of answers, one simple way to improve reliability is to sample multiple explanations & answers from the model (using a positive temperature) and then pick the final answer that appears most often.
[Source: _Self-Consistency Improves Chain of Thought Reasoning in Language Models_ by Xuezhi Wang et al. (2022)](https://arxiv.org/abs/2203.11171)
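This one takes only a few lines with the completions API, since `n` samples can be requested in a single call. In the sketch below, the answer parser assumes each sampled explanation ends with "The answer is ...", which is our convention for illustration:

```python
import collections

import openai

def extract_answer(text: str) -> str:
    # Assumes the sampled explanation ends with "The answer is X."
    return text.rsplit("The answer is", 1)[-1].strip(" .\n")

def self_consistent_answer(prompt: str, n: int = 10) -> str:
    # Sample several reasoning paths at a positive temperature...
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=256,
        temperature=0.7,
        n=n,
    )
    # ...then take a majority vote over the final answers.
    answers = [extract_answer(choice.text) for choice in response.choices]
    return collections.Counter(answers).most_common(1)[0][0]
```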
#### Results

This technique lifted accuracies by anywhere from 1 to 24 percentage points on a suite of math and reasoning benchmarks. (Plotted below are results from Google's LaMDA model; using Google's larger PaLM model, the baselines were higher but the gains were a bit smaller.)
[Source: _Self-Consistency Improves Chain of Thought Reasoning in Language Models_ by Xuezhi Wang et al. (2022)](https://arxiv.org/abs/2203.11171)
#### Implications

In 2021, OpenAI researchers applied this technique to grade school math problems:

- Using those solutions, with some labeled correct and some labeled incorrect, they fine-tuned a verifier model to classify whether a question and candidate solution was correct or incorrect
- Finally, at test time, the generative model creates 100 solutions to each problem, and the one with the highest score according to the verifier model is picked as the final answer
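At test time, the selection step reduces to a best-of-n search. A sketch, where `verifier_score` stands in for the fine-tuned verifier model (a hypothetical helper), and `text-davinci-002` stands in for the fine-tuned generator:

```python
import openai

def best_of_n(question: str, n: int = 100) -> str:
    # Sample n candidate solutions from the generator at a positive temperature.
    response = openai.Completion.create(
        model="text-davinci-002",  # stand-in for the fine-tuned generator
        prompt=f"Q: {question}\nA:",
        max_tokens=256,
        temperature=0.7,
        n=n,
    )
    candidates = [choice.text.strip() for choice in response.choices]
    # Keep the candidate the verifier scores as most likely to be correct.
    return max(candidates, key=lambda sol: verifier_score(question, sol))
```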
[Source: _Training Verifiers to Solve Math Word Problems_ by Karl Cobbe et al. (2021)](https://arxiv.org/abs/2110.14168)
#### Results

With a 175B GPT-3 model and 8,000 training examples, this technique substantially lifted grade school math accuracy from ~33% to ~55%.
[Source: _Training Verifiers to Solve Math Word Problems_ by Karl Cobbe et al. (2021)](https://arxiv.org/abs/2110.14168)
#### Implications

Although the techniques above vary in their approach, they all share the goal of improving reliability.

This paradigm of trying to build a reliable system out of less reliable components is reminiscent of probabilistic programming, and many of the analysis techniques of that field can be applied to this one.

In the paper _Language Model Cascades_, David Dohan et al. interpret the above techniques in the paradigm of probabilistic graphical models:
#### Chain of thought prompting

[Source: _Language Model Cascades_ by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342)
#### Fine-tuned chain of thought prompting / Self-taught reasoner

[Source: _Language Model Cascades_ by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342)
#### Selection-inference prompting

[Source: _Language Model Cascades_ by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342)
#### Verifiers

[Source: _Language Model Cascades_ by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342)
#### Implications

In the future, expect better models and better techniques to be published. Even as the specific techniques here are surpassed, the general principles behind them will likely remain a key part of the expert user's toolkit.
## Bibliography

| Lesson | Paper | Date |
| --- | --- | --- |
| Break complex tasks into simpler subtasks (and consider exposing the intermediate outputs to users) | [AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts](https://arxiv.org/abs/2110.01691) | 2021 Oct |
| You can improve output by generating many candidates, and then picking the one that looks best | [Training Verifiers to Solve Math Word Problems](https://arxiv.org/abs/2110.14168) | 2021 Oct |
| On reasoning tasks, models do better when they reason step-by-step before answering | [Chain of Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903) | 2022 Jan |
| On long reasoning problems, you can improve step-by-step reasoning by splitting the problem into pieces to solve incrementally | [Least-to-most Prompting Enables Complex Reasoning in Large Language Models](https://arxiv.org/abs/2205.10625) | 2022 May |
| You can have the model analyze both good and bogus explanations to figure out which set of explanations are most consistent | [Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations](https://arxiv.org/abs/2205.11822) | 2022 May |
| You can think about these techniques in terms of probabilistic programming, where systems comprise unreliable components | [Language Model Cascades](https://arxiv.org/abs/2207.10342) | 2022 Jul |
| You can eliminate hallucination with sentence label manipulation, and you can reduce wrong answers with a 'halter' prompt | [Faithful Reasoning Using Large Language Models](https://arxiv.org/abs/2208.14271) | 2022 Aug |