# How to handle rate limits

When you call the OpenAI API repeatedly, you may encounter error messages that say `429: 'Too Many Requests'` or `RateLimitError`. These error messages come from exceeding the API's rate limits.

This guide shares tips for avoiding and handling rate limit errors.

To see an example script for throttling parallel requests to avoid rate limit errors, see [api_request_parallel_processor.py](https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py).

## Why rate limits exist

Rate limits are a common practice for APIs, and they're put in place for a few different reasons.

- First, they help protect against abuse or misuse of the API. For example, a malicious actor could flood the API with requests in an attempt to overload it or cause disruptions in service. By setting rate limits, OpenAI can prevent this kind of activity.
- Second, rate limits help ensure that everyone has fair access to the API. If one person or organization makes an excessive number of requests, it could bog down the API for everyone else. By throttling the number of requests that a single user can make, OpenAI ensures that everyone has an opportunity to use the API without experiencing slowdowns.
- Lastly, rate limits can help OpenAI manage the aggregate load on its infrastructure. If requests to the API increase dramatically, it could tax the servers and cause performance issues. By setting rate limits, OpenAI can help maintain a smooth and consistent experience for all users.

Although hitting rate limits can be frustrating, rate limits exist to protect the reliable operation of the API for its users.

## Default rate limits

Your rate limit and spending limit (quota) are automatically adjusted based on a number of factors. As your usage of the OpenAI API goes up and you successfully pay the bill, we automatically increase your usage tier. You can find specific information regarding rate limits using the resources below.

### Other rate limit resources

Read more about OpenAI's rate limits in these other resources:

- [Guide: Rate limits](https://platform.openai.com/docs/guides/rate-limits?context=tier-free)
- [Help Center: Is API usage subject to any rate limits?](https://help.openai.com/en/articles/5955598-is-api-usage-subject-to-any-rate-limits)
- [Help Center: How can I solve 429: 'Too Many Requests' errors?](https://help.openai.com/en/articles/5955604-how-can-i-solve-429-too-many-requests-errors)

### Requesting a rate limit increase

If you'd like your organization's rate limit increased, please visit your [Limits settings page](https://platform.openai.com/account/limits) to see how you can increase your usage tier.


In [None]:
import openai
import os

client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

## Example rate limit error

A rate limit error occurs when API requests are sent too quickly. If you are using the OpenAI Python library, the error will look something like:

```
RateLimitError: Rate limit reached for default-codex in organization org-{id} on requests per min. Limit: 20.000000 / min. Current: 24.000000 / min. Contact support@openai.com if you continue to have issues or if you’d like to request an increase.
```

Below is example code for triggering a rate limit error.

In [3]:
# request a bunch of completions in a loop
for _ in range(100):
    client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=10,
    )
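
To observe the error without an unhandled exception stopping the run, one option is to catch it and report how many requests succeeded first. The helper below is a minimal sketch: `run_until_rate_limited` and its parameters are hypothetical names introduced for illustration, and the exception type is passed in as an argument so the function does not depend on any particular library.

```python
def run_until_rate_limited(send_request, rate_limit_error, max_requests=100):
    """Call send_request() repeatedly, stopping at the first rate limit error.

    Returns a tuple (successful_request_count, caught_error_or_None).
    """
    for sent in range(max_requests):
        try:
            send_request()
        except rate_limit_error as err:
            # `sent` requests completed before this one failed
            return sent, err
    # No rate limit error was raised within max_requests calls
    return max_requests, None
```

With the client defined earlier, `send_request` could be a lambda wrapping `client.chat.completions.create(...)` and `rate_limit_error` could be `openai.RateLimitError`.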

## How to avoid rate limit errors

### Retrying with exponential backoff

One easy way to avoid rate limit errors is to automatically retry requests with a random exponential backoff. Retrying with exponential backoff means performing a short sleep when a rate limit error is hit, then retrying the unsuccessful request. If the request is still unsuccessful, the sleep length is increased and the process is repeated. This continues until the request is successful or until a maximum number of retries is reached.

This approach has many benefits:

- Automatic retries mean you can recover from rate limit errors without crashes or missing data
- Exponential backoff means that your first retries can be tried quickly, while still benefiting from longer delays if your first few retries fail
- Adding random jitter to the delay helps prevent retries from all hitting at the same time

Note that unsuccessful requests contribute to your per-minute limit, so continuously resending a request won’t work.
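
To make the schedule concrete, the sketch below computes the sleep durations such a policy might produce. It assumes one common variant ("full jitter", where each sleep is drawn uniformly between zero and an exponentially growing, capped ceiling); `backoff_delays` and its parameter names are hypothetical and are not taken from any of the libraries discussed in this guide.

```python
import random


def backoff_delays(initial_delay=1.0, exponential_base=2.0, max_delay=60.0,
                   max_retries=6, rng=None):
    """Return one sleep duration (in seconds) per retry, using full jitter."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(max_retries):
        # The ceiling grows exponentially with each attempt but never exceeds max_delay
        ceiling = min(max_delay, initial_delay * exponential_base ** attempt)
        # Full jitter: sleep a random amount between 0 and the ceiling
        delays.append(rng.uniform(0, ceiling))
    return delays
```

With the defaults, the ceilings are 1, 2, 4, 8, 16, and 32 seconds, so early retries happen quickly while later ones wait longer.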

Below are a few example solutions.

#### Example #1: Using the Tenacity library

[Tenacity](https://tenacity.readthedocs.io/en/latest/) is an Apache 2.0 licensed general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just about anything.

To add exponential backoff to your requests, you can use the `tenacity.retry` [decorator](https://peps.python.org/pep-0318/). The following example uses the `tenacity.wait_random_exponential` function to add random exponential backoff to a request.

Note that the Tenacity library is a third-party tool, and OpenAI makes no guarantees about its reliability or security.

In [6]:
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)


completion_with_backoff(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Once upon a time,"}])

ChatCompletion(id='chatcmpl-8PAu6anX2JxQdYmJRzps38R8u0ZBC', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content='in a small village nestled among green fields and rolling hills, there lived a kind-hearted and curious young girl named Lily. Lily was known for her bright smile and infectious laughter, bringing joy to everyone around her.\n\nOne sunny morning, as Lily played in the meadows, she stumbled upon a mysterious book tucked away beneath a tall oak tree. Intrigued, she picked it up and dusted off its weathered cover to reveal intricate golden patterns. Without hesitation, she opened it, discovering that its pages were filled with magical tales and enchanting adventures.\n\nAmong the stories she found, one particularly caught her attention—a tale of a long-lost treasure hidden deep within a mysterious forest. Legend had it that whoever found this hidden treasure would be granted one wish, no matter how big or small. Excited by the prospect of findin

#### Example #2: Using the backoff library

Another library that provides function decorators for backoff and retry is [backoff](https://pypi.org/project/backoff/).

Like Tenacity, the backoff library is a third-party tool, and OpenAI makes no guarantees about its reliability or security.

In [10]:
import backoff  # for exponential backoff

@backoff.on_exception(backoff.expo, openai.RateLimitError)
def completions_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)


completions_with_backoff(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Once upon a time,"}])


ChatCompletion(id='chatcmpl-8PAwkg7Q9pPeAkvVuAZ8AyA108WhR', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content="in a small village, there lived a young girl named Lily. She had fiery red hair, lively green eyes, and a spirit as wild as the rushing river nearby. Lily was known for her curious nature and her desire to explore the world beyond the village boundaries.\n\nOne day, while playing near the river, Lily spotted an injured bird nested on a branch. Its wing was broken, and it seemed unable to fly away. Lily's heart filled with sadness, and she knew she couldn't leave the bird alone.\n\nCarefully, she climbed up the tree and gently placed the bird inside her pocket. Lily brought it home and made a cozy bed for it in a small wooden box. She named the bird Ruby, after its shimmering red feathers.\n\nDays turned into weeks, and Ruby's wing slowly healed under Lily's constant care and attention. As they spent time together, a deep bond grew between the

#### Example #3: Manual backoff implementation

If you don't want to use third-party libraries, you can implement your own backoff logic.

In [11]:
# imports
import random
import time

# define a retry decorator
def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    exponential_base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
    errors: tuple = (openai.RateLimitError,),
):
    """Retry a function with exponential backoff."""

    def wrapper(*args, **kwargs):
        # Initialize variables
        num_retries = 0
        delay = initial_delay

        # Loop until a successful response or max_retries is hit or an exception is raised
        while True:
            try:
                return func(*args, **kwargs)

            # Retry on specified errors
            except errors as e:
                # Increment retries
                num_retries += 1

                # Check if max retries has been reached
                if num_retries > max_retries:
                    # Chain the original error so the root cause stays visible
                    raise Exception(
                        f"Maximum number of retries ({max_retries}) exceeded."
                    ) from e

                # Increment the delay
                delay *= exponential_base * (1 + jitter * random.random())

                # Sleep for the delay
                time.sleep(delay)

            # Raise exceptions for any errors not specified
            except Exception as e:
                raise e

    return wrapper


@retry_with_exponential_backoff
def completions_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)


completions_with_backoff(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Once upon a time,"}])

ChatCompletion(id='chatcmpl-8PAxGvV3GbLpnOoKSvJ00XCUdOglM', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content="in a faraway kingdom, there lived a young princess named Aurora. She was known for her beauty, grace, and kind heart. Aurora's kingdom was filled with lush green meadows, towering mountains, and sparkling rivers. The princess loved spending time exploring the enchanting forests surrounding her castle.\n\nOne day, while Aurora was wandering through the woods, she stumbled upon a hidden clearing. At the center stood a majestic oak tree, its branches reaching towards the sky. Aurora approached the tree with curiosity, and as she got closer, she noticed a small door at its base.\n\nIntrigued, she gently pushed open the door and was amazed to find herself in a magical realm. The forest transformed into a breathtaking wonderland, with colorful flowers blooming in every direction and woodland creatures frolicking joyously. Aurora's eyes widened with

## How to maximize throughput of batch processing given rate limits

If you're processing real-time requests from users, backoff and retry is a great strategy to minimize latency while avoiding rate limit errors.

However, if you're processing large volumes of batch data, where throughput matters more than latency, there are a few other things you can do in addition to backoff and retry.

### Proactively adding delay between requests

If you are constantly hitting the rate limit, then backing off, then hitting the rate limit again, then backing off again, it's possible that a good fraction of your request budget will be 'wasted' on requests that need to be retried. This limits your processing throughput, given a fixed rate limit.

Here, one potential solution is to calculate your rate limit and add a delay equal to its reciprocal (e.g., if your rate limit is 20 requests per minute, add a delay of 3–6 seconds to each request). This can help you operate near the rate limit ceiling without hitting it and incurring wasted requests.

#### Example of adding delay to a request

In [12]:
# imports
import time

# Define a function that adds a delay to a Completion API call
def delayed_completion(delay_in_seconds: float = 1, **kwargs):
    """Delay a completion by a specified amount of time."""

    # Sleep for the delay
    time.sleep(delay_in_seconds)

    # Call the Completion API and return the result
    return client.chat.completions.create(**kwargs)


# Calculate the delay based on your rate limit
rate_limit_per_minute = 20
delay = 60.0 / rate_limit_per_minute

delayed_completion(
    delay_in_seconds=delay,
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Once upon a time,"}]
)


ChatCompletion(id='chatcmpl-8PAyCR1axKsomV0e349XiCN1Z81pH', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content="in a small village, there lived a young girl named Maya. Maya was known for her kindness and love for nature. She spent hours exploring the forests surrounding the village, admiring the vibrant flowers and talking to the animals.\n\nOne sunny day, as Maya was picking wildflowers, she stumbled upon a wounded blackbird with a broken wing. Feeling sorry for the bird, Maya gently picked it up and cradled it in her hands. She knew she had to help the bird, so she hurried back to her cottage.\n\nMaya set up a cozy nest for the blackbird and carefully splinted its wing. She fed it worms and berries, doing everything she could to nurse it back to health. Each day, she would sing lullabies and tell stories to keep the blackbird company. Slowly, the bird's wing healed, and before long, it was ready to fly again.\n\nOn a beautiful morning, Maya opened t
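
The fixed sleep above delays every call by the full interval, even when the previous request already took some of that time. A possible refinement, sketched below, is to sleep only for the part of the interval that has not yet elapsed. `RequestPacer` is a hypothetical helper, and the `clock` and `sleep` parameters are assumptions added so the behavior can be checked without real waiting.

```python
import time


class RequestPacer:
    """Space calls at least 60 / requests_per_minute seconds apart."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                # Sleep only for the portion of the interval still outstanding
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

A typical use would be to create one pacer, for example `RequestPacer(20)`, and call its `wait()` method immediately before each API request.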



### Batching requests

The OpenAI API has separate limits for requests per minute and tokens per minute.

If you're hitting the limit on requests per minute, but have headroom on tokens per minute, you can increase your throughput by batching multiple tasks into each request. This will allow you to process more tokens per minute, especially with the smaller models.
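
As a rough illustration of why batching helps, the sketch below computes an upper bound on tokens processed per minute under both limits. The function name and the numbers in the example (20 requests per minute, 40,000 tokens per minute, 100 tokens per task) are hypothetical and chosen only to show the arithmetic.

```python
def max_tokens_per_minute(requests_per_minute, tokens_per_minute,
                          tokens_per_task, tasks_per_request=1):
    """Upper bound on tokens processed per minute given both rate limits."""
    # Tokens reachable if the request limit is the only constraint
    request_bound = requests_per_minute * tasks_per_request * tokens_per_task
    # The effective ceiling is whichever limit is hit first
    return min(tokens_per_minute, request_bound)


# One task per request: the request limit caps throughput
print(max_tokens_per_minute(20, 40_000, 100, tasks_per_request=1))   # 2000
# Ten tasks per request: ten times the throughput, still under the token limit
print(max_tokens_per_minute(20, 40_000, 100, tasks_per_request=10))  # 20000
# Fifty tasks per request: now the token limit is the constraint
print(max_tokens_per_minute(20, 40_000, 100, tasks_per_request=50))  # 40000
```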

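Sending in a batch of prompts works exactly the same as a normal API call, except that you pass in a list of strings to the `prompt` parameter instead of a single string. Note that `prompt` is a parameter of the Completions endpoint (`client.completions.create`); the Chat Completions endpoint takes `messages` instead.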

**Warning:** the response object may not return completions in the order of the prompts, so always remember to match responses back to prompts using the `index` field.

#### Example without batching

In [13]:
num_stories = 10
content = "Once upon a time,"

# serial example, with one story completion per request
for _ in range(num_stories):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": content}],
        max_tokens=20,
    )

    # print story
    print(content + response.choices[0].message.content)


Once upon a time,in a small village nestled between rolling green hills, there lived a young girl named Lily. She had
Once upon a time,in a small village nestled in the heart of a lush forest, lived a young girl named Evelyn.
Once upon a time,in a faraway kingdom, there lived a young princess named Aurora. She was known for her kind
Once upon a time,in a faraway kingdom called Enchantia, there lived a young girl named Ella. Ella was
Once upon a time,in a small village nestled among the rolling hills, lived a young woman named Lucy. Lucy was known
Once upon a time,in a small village nestled between rolling hills, there lived a young girl named Ava. Ava was a
Once upon a time,in a faraway kingdom, there lived a wise and just king named Arthur. King Arthur ruled over
Once upon a time,in a small village nestled among towering mountains, lived a young girl named Lily. She was known for
Once upon a time,in a small village nestled in the heart of a lush forest, there lived a young girl named 

#### Example with batching

In [15]:
num_stories = 10
prompts = ["Once upon a time,"] * num_stories

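# batched example, with 10 story completions per request
# note: a list of prompts is accepted by the Completions endpoint, not Chat Completions
response = client.completions.create(
    model="curie",  # legacy model that produced the output below; substitute a Completions model available to you
    prompt=prompts,
    max_tokens=20,
)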

# match completions to prompts by index
stories = [""] * len(prompts)
for choice in response.choices:
    stories[choice.index] = prompts[choice.index] + choice.text

# print stories
for story in stories:
    print(story)


Once upon a time, I lived in hope. I convinced myself I knew best, because, naive as it might sound,
Once upon a time, Thierry Henry was invited to have a type of frosty exchange with English fans, in which
Once upon a time, and a long time ago as well, PV was passively cooled because coils cooled by use of metal driving
Once upon a time, there was a land called Texas. It was about the size of Wisconsin. It contained, however,
Once upon a time, there was an old carpenter who had three sons. The locksmith never learned to read or write
Once upon a time, there was a small farming town called Moonridge Village, far West across the great vast plains that lay
Once upon a time, California’s shorelines, lakes, and valleys were host to expanses of untamed wilderness
Once upon a time, she said. It started with a simple question: Why don’t we know any stories?
Once upon a time, when I was a young woman, there was a movie named Wuthering Heights. Stand by alleges
Once upon a time, a very long tim
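
When there are more prompts than you want to send in a single request, the same index-matching idea extends to several batches. The sketch below is one way to do that bookkeeping; `chunk_prompts` and `place_results` are hypothetical helpers, and the batch size is an assumption you would tune against your own limits.

```python
def chunk_prompts(prompts, batch_size):
    """Yield (start_offset, batch) pairs covering the full prompt list in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield start, prompts[start:start + batch_size]


def place_results(results, start, indexed_texts):
    """Store each (index_within_batch, text) pair at its position in the full list."""
    for index, text in indexed_texts:
        # The batch-local index is offset by where the batch started
        results[start + index] = text
```

For each batch, `indexed_texts` could be built from the response as `[(choice.index, choice.text) for choice in response.choices]`, so completions land in the right slot even if they arrive out of order.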

## Example parallel processing script

We've written an example script for parallel processing large quantities of API requests: [api_request_parallel_processor.py](https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py).

The script combines some handy features:
- Streams requests from file, to avoid running out of memory for giant jobs
- Makes requests concurrently, to maximize throughput
- Throttles both request and token usage, to stay under rate limits
- Retries failed requests, to avoid missing data
- Logs errors, to diagnose problems with requests

Feel free to use it as is or modify it to suit your needs.
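
To give a feel for what throttling both request and token usage can involve, here is a minimal sketch of a capacity tracker that refills continuously over time. It is not the script's actual code: `CapacityTracker`, its method names, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time


class CapacityTracker:
    """Track remaining request and token capacity, refilled in proportion to elapsed time."""

    def __init__(self, max_requests_per_minute, max_tokens_per_minute, clock=time.monotonic):
        self.max_requests = max_requests_per_minute
        self.max_tokens = max_tokens_per_minute
        self.available_requests = float(max_requests_per_minute)
        self.available_tokens = float(max_tokens_per_minute)
        self._clock = clock
        self._last_update = clock()

    def _refill(self):
        """Add back capacity for the time elapsed, capped at the per-minute maximum."""
        now = self._clock()
        elapsed = now - self._last_update
        self.available_requests = min(
            self.max_requests,
            self.available_requests + self.max_requests * elapsed / 60.0,
        )
        self.available_tokens = min(
            self.max_tokens,
            self.available_tokens + self.max_tokens * elapsed / 60.0,
        )
        self._last_update = now

    def try_consume(self, tokens):
        """Deduct one request and `tokens` tokens if both fit now; return whether they did."""
        self._refill()
        if self.available_requests >= 1 and self.available_tokens >= tokens:
            self.available_requests -= 1
            self.available_tokens -= tokens
            return True
        return False
```

A request loop could call `try_consume(estimated_tokens)` before sending each request and pause briefly whenever it returns `False`.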