# Summarizing with Controllable Detail

The objective of this notebook is to demonstrate how to summarize large documents with a controllable level of detail. If you give a GPT model the task of summarizing a long document (e.g. 10k or more tokens), you'll tend to get back a relatively short summary that isn't proportional to the length of the document. For instance, a summary of a 20k token document will not be twice as long as a summary of a 10k token document. One way we can fix this is to split our document up into pieces, and produce a summary piecewise. After many queries to a GPT model, the full summary can be reconstructed. By controlling the number of text chunks and their sizes, we can ultimately control the level of detail in the output.

In [1]:
import openai
import tiktoken
from typing import List, Tuple, Optional
from tqdm import tqdm

In [2]:
# open dataset containing part of the text of the Wikipedia page for the United States
with open("data/united_states_wikipedia.txt", "r") as file:
    united_states_wikipedia_text = file.read()

In [3]:
# load encoding and check the length of dataset
encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')
len(encoding.encode(united_states_wikipedia_text))

15781

We'll define a simple utility to wrap calls to the OpenAI API.

In [4]:
def get_chat_completion(messages, model='gpt-3.5-turbo'):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message['content']
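
The wrapper above targets the pre-1.0 `openai` package; `openai.ChatCompletion` was removed in version 1.0. A minimal sketch of the same wrapper for `openai>=1.0` follows, assuming the API key is available in the `OPENAI_API_KEY` environment variable. The name `get_chat_completion_v1`, the optional `client` argument, and the stub client are illustrative additions, not part of the original notebook.

```python
from types import SimpleNamespace
from typing import Dict, List, Optional


def get_chat_completion_v1(
    messages: List[Dict[str, str]],
    model: str = "gpt-3.5-turbo",
    client: Optional[object] = None,
) -> str:
    """Sketch of the wrapper for openai>=1.0 (hypothetical name)."""
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    # In openai>=1.0 the message is an object, so use attribute access.
    return response.choices[0].message.content


# Usage with a stub client (no network call); the stub is purely illustrative.
def _stub_create(model, messages, temperature):
    reply = SimpleNamespace(content="stub summary")
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])


stub_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_stub_create))
)
result = get_chat_completion_v1(
    [{"role": "user", "content": "Summarize this."}], client=stub_client
)
print(result)
```

Passing an explicit `client` keeps the function testable without network access; omitting it falls back to a real `OpenAI()` client.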

Next we'll define some utilities to chunk a large document into smaller pieces.

In [5]:
def tokenize(text: str) -> List[int]:
    encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')
    return encoding.encode(text)


def chunk_on_delimiter(input_string, max_tokens, delimiter):
    chunks = input_string.split(delimiter)
    combined_chunks, _, dropped_chunk_count = combine_chunks_with_no_minimum(
        chunks, max_tokens, chunk_delimiter=delimiter, add_ellipsis_for_overflow=True
    )
    if dropped_chunk_count > 0:
        print(f"warning: {dropped_chunk_count} chunks were dropped due to overflow")
    combined_chunks = [f"{chunk}{delimiter}" for chunk in combined_chunks]
    return combined_chunks


def combine_chunks_with_no_minimum(
    chunks: List[str],
    max_tokens: int,
    chunk_delimiter="\n\n",
    header: Optional[str] = None,
    add_ellipsis_for_overflow=False,
) -> Tuple[List[str], List[List[int]], int]:
    dropped_chunk_count = 0
    output = []  # list to hold the final combined chunks
    output_indices = []  # list to hold the indices of the final combined chunks
    candidate = (
        [] if header is None else [header]
    )  # list to hold the current combined chunk candidate
    candidate_indices = []
    for chunk_i, chunk in enumerate(chunks):
        chunk_with_header = [chunk] if header is None else [header, chunk]
        if len(tokenize(chunk_delimiter.join(chunk_with_header))) > max_tokens:
            print("warning: chunk overflow")
            dropped_chunk_count += 1  # count every oversized chunk that is skipped
            if (
                add_ellipsis_for_overflow
                and len(tokenize(chunk_delimiter.join(candidate + ["..."]))) <= max_tokens
            ):
                candidate.append("...")  # mark the position of the dropped chunk
            continue  # this case would break downstream assumptions
        # estimate token count with the current chunk added
        extended_candidate_token_count = len(tokenize(chunk_delimiter.join(candidate + [chunk])))
        # If the token count exceeds max_tokens, add the current candidate to output and start a new candidate
        if extended_candidate_token_count > max_tokens:
            output.append(chunk_delimiter.join(candidate))
            output_indices.append(candidate_indices)
            candidate = chunk_with_header  # re-initialize candidate
            candidate_indices = [chunk_i]
        # otherwise keep extending the candidate
        else:
            candidate.append(chunk)
            candidate_indices.append(chunk_i)
    # add the remaining candidate to output if it's not empty
    if (header is not None and len(candidate) > 1) or (header is None and len(candidate) > 0):
        output.append(chunk_delimiter.join(candidate))
        output_indices.append(candidate_indices)
    return output, output_indices, dropped_chunk_count
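
To see the packing behaviour in isolation, here is a simplified sketch of the same greedy strategy. It assumes a whitespace word count as a stand-in for the `tiktoken` tokenizer, and the names `count_words` and `greedy_pack` are hypothetical; headers, ellipsis markers, and index tracking are left out.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    """Stand-in token counter: whitespace-separated words, NOT tiktoken tokens."""
    return len(text.split())


def greedy_pack(
    pieces: List[str],
    max_tokens: int,
    delimiter: str,
    count_tokens: Callable[[str], int] = count_words,
) -> List[str]:
    """Greedily join consecutive pieces while the result stays within max_tokens."""
    packed: List[str] = []
    candidate: List[str] = []
    for piece in pieces:
        if count_tokens(piece) > max_tokens:
            continue  # a single oversized piece is skipped
        if count_tokens(delimiter.join(candidate + [piece])) > max_tokens:
            # Adding this piece would overflow: flush and start a new candidate.
            packed.append(delimiter.join(candidate))
            candidate = [piece]
        else:
            candidate.append(piece)
    if candidate:
        packed.append(delimiter.join(candidate))
    return packed


text = "one two three. four five. six seven eight nine. ten"
chunks = greedy_pack(text.split("."), max_tokens=5, delimiter=".")
print(chunks)
```

Each chunk is filled up to the limit before a new one is started, so the last chunk simply holds whatever is left over and can be much shorter than the others.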

Now we can define a utility to summarize text with a controllable level of detail (note the detail parameter).

In [6]:
def summarize(text: str,
              detail: float = 0,
              model: str = 'gpt-3.5-turbo',
              additional_instructions: Optional[str] = None,
              minimum_chunk_size: Optional[int] = 500,
              chunk_delimiter: str = ".",
              summarize_recursively: bool = False,
              verbose: bool = False):
    """
    Summarizes a given text by splitting it into chunks, each of which is summarized individually. 
    The level of detail in the summary can be adjusted, and the process can optionally be made recursive.

    Parameters:
    - text (str): The text to be summarized.
    - detail (float, optional): A value between 0 and 1 indicating the desired level of detail in the summary.
      0 leads to a higher level summary, and 1 results in a more detailed summary. Defaults to 0.
    - model (str, optional): The model to use for generating summaries. Defaults to 'gpt-3.5-turbo'.
    - additional_instructions (Optional[str], optional): Additional instructions to provide to the model for customizing summaries.
    - minimum_chunk_size (Optional[int], optional): The minimum size for text chunks, in tokens. Defaults to 500.
    - chunk_delimiter (str, optional): The delimiter used to split the text into chunks. Defaults to ".".
    - summarize_recursively (bool, optional): If True, summaries are generated recursively, using previous summaries for context.
    - verbose (bool, optional): If True, prints detailed information about the chunking process.

    Returns:
    - str: The final compiled summary of the text.

    The function first determines the number of chunks by interpolating between a minimum and a maximum chunk count based on the `detail` parameter. 
    It then splits the text into chunks and summarizes each chunk. If `summarize_recursively` is True, each summary is based on the previous summaries, 
    adding more context to the summarization process. The function returns a compiled summary of all chunks.
    """
    
    # check detail is set correctly
    assert 0 <= detail <= 1

    # interpolate the number of chunks to get the specified level of detail
    max_chunks = len(chunk_on_delimiter(text, minimum_chunk_size, chunk_delimiter))
    min_chunks = 1
    num_chunks = int(min_chunks + detail * (max_chunks - min_chunks))

    # adjust chunk_size based on interpolated number of chunks
    document_length = len(tokenize(text))
    chunk_size = max(minimum_chunk_size, document_length // num_chunks)
    text_chunks = chunk_on_delimiter(text, chunk_size, chunk_delimiter)
    if verbose:
        print(f"Splitting the text into {len(text_chunks)} chunks to be summarized.")
        print(f"Chunk lengths are {[len(tokenize(x)) for x in text_chunks]}")

    # set system message
    system_message_content = "Summarize the following text."
    if additional_instructions is not None:
        system_message_content += f"\n\n{additional_instructions}"

    accumulated_summaries = []
    for chunk in tqdm(text_chunks):
        if summarize_recursively and accumulated_summaries:
            # Creating a structured prompt for recursive summarization
            accumulated_summaries_string = '\n\n'.join(accumulated_summaries)
            user_message_content = f"Previous summaries:\n\n{accumulated_summaries_string}\n\nText to summarize next:\n\n{chunk}"
        else:
            # Directly passing the chunk for summarization without recursive context
            user_message_content = chunk

        # Constructing messages based on whether recursive summarization is applied
        messages = [
            {"role": "system", "content": system_message_content},
            {"role": "user", "content": user_message_content}
        ]

        # Request a summary of this chunk from the model
        response = get_chat_completion(messages, model=model)
        accumulated_summaries.append(response)

    # Compile final summary from partial summaries
    final_summary = '\n\n'.join(accumulated_summaries)

    return final_summary

Now we can use this utility to produce summaries with varying levels of detail. Increasing `detail` from 0 to 1 yields progressively longer summaries of the underlying document, because a higher value makes the utility split the document into a greater number of chunks. Each chunk is then summarized, and the final summary is a concatenation of all the chunk summaries.

In [7]:
summary_with_detail_0 = summarize(united_states_wikipedia_text, detail=0, verbose=True)

Splitting the text into 1 chunks to be summarized.
Chunk lengths are [15781]


100%|██████████| 1/1 [00:05<00:00,  5.98s/it]


In [8]:
summary_with_detail_pt1 = summarize(united_states_wikipedia_text, detail=0.1, verbose=True)

Splitting the text into 5 chunks to be summarized.
Chunk lengths are [3945, 3941, 3943, 3915, 37]


100%|██████████| 5/5 [00:15<00:00,  3.18s/it]


In [9]:
summary_with_detail_pt2 = summarize(united_states_wikipedia_text, detail=0.2, verbose=True)

Splitting the text into 8 chunks to be summarized.
Chunk lengths are [2214, 2253, 2249, 2255, 2254, 2255, 2221, 84]


100%|██████████| 8/8 [00:19<00:00,  2.46s/it]


In [10]:
summary_with_detail_pt4 = summarize(united_states_wikipedia_text, detail=0.4, verbose=True)

Splitting the text into 14 chunks to be summarized.
Chunk lengths are [1198, 1209, 1210, 1209, 1212, 1192, 1176, 1205, 1212, 1201, 1210, 1210, 1192, 154]


100%|██████████| 14/14 [00:37<00:00,  2.69s/it]


In [11]:
summary_with_detail_pt8 = summarize(united_states_wikipedia_text, detail=0.8, verbose=True)

Splitting the text into 27 chunks to be summarized.
Chunk lengths are [602, 596, 601, 601, 604, 598, 572, 594, 592, 592, 604, 593, 578, 582, 597, 600, 596, 555, 582, 601, 582, 587, 581, 595, 598, 568, 445]


100%|██████████| 27/27 [01:20<00:00,  2.99s/it]


In [12]:
summary_with_detail_1 = summarize(united_states_wikipedia_text, detail=1.0, verbose=True)

Splitting the text into 33 chunks to be summarized.
Chunk lengths are [490, 443, 475, 490, 501, 470, 472, 487, 479, 477, 447, 442, 490, 468, 488, 477, 493, 493, 472, 491, 490, 501, 493, 468, 500, 500, 474, 460, 489, 462, 490, 482, 445]


100%|██████████| 33/33 [01:25<00:00,  2.58s/it]
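
The chunk sizes reported above follow from the interpolation inside `summarize`. The sketch below reproduces only that arithmetic, using the document length of 15781 tokens and taking `max_chunks` to be 33, the chunk count observed at `detail=1.0`. The function name `plan_chunks` is hypothetical.

```python
from typing import Tuple


def plan_chunks(
    document_length: int,
    max_chunks: int,
    detail: float,
    minimum_chunk_size: int = 500,
) -> Tuple[int, int]:
    """Return (target number of chunks, chunk size in tokens) for a detail level."""
    assert 0 <= detail <= 1
    min_chunks = 1
    # Linear interpolation between min_chunks and max_chunks, truncated to an int.
    num_chunks = int(min_chunks + detail * (max_chunks - min_chunks))
    # The chunk size never drops below the configured minimum.
    chunk_size = max(minimum_chunk_size, document_length // num_chunks)
    return num_chunks, chunk_size


# 15781 is the document's token count; 33 is the chunk count seen at detail=1.0.
for detail in (0, 0.1, 0.2, 0.4, 0.8, 1.0):
    print(detail, plan_chunks(15781, 33, detail))
```

At `detail=0.1` this gives a target of 4 chunks of at most 3945 tokens, which matches the first chunk length in the output above. The fifth, 37-token chunk is presumably the remainder left over when chunks are packed on sentence boundaries, which would also explain why the other runs report one chunk more than the interpolated target.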


The original document is about 15.8k tokens long. Notice how large the gap is between the lengths of `summary_with_detail_0` and `summary_with_detail_1`.

In [13]:
# lengths of summaries
[len(tokenize(x)) for x in [summary_with_detail_0, summary_with_detail_pt1, summary_with_detail_pt2, summary_with_detail_pt4, summary_with_detail_pt8, summary_with_detail_1]]

[291, 681, 965, 1734, 3542, 4182]
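
To put these lengths in proportion, the short sketch below expresses each summary as a share of the 15781-token source document. The numbers are copied from the outputs above; the variable names are illustrative.

```python
document_tokens = 15781
# detail level -> summary length in tokens, as reported above
summary_tokens = {0.0: 291, 0.1: 681, 0.2: 965, 0.4: 1734, 0.8: 3542, 1.0: 4182}

for detail, tokens in summary_tokens.items():
    share = tokens / document_tokens
    print(f"detail={detail}: {tokens} tokens ({share:.1%} of the document)")

# How much longer the most detailed summary is than the least detailed one.
growth = summary_tokens[1.0] / summary_tokens[0.0]
print(f"detail=1.0 summary is {growth:.1f}x longer than detail=0")
```

The summary grows from under 2% of the document at `detail=0` to roughly a quarter of it at `detail=1.0`.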

Let's inspect the summaries to get a feel for what that means.

In [14]:
print(summary_with_detail_0)

The United States of America is a diverse country located in North America, with a population exceeding 334 million. It is a federation of 50 states, a federal capital district, and various territories. The country has a rich history, from the migration of Paleo-Indians over 12,000 years ago to the American Revolution and the Civil War. The U.S. emerged as a superpower after World War II and played a significant role in the Cold War era.

The U.S. government is a presidential constitutional republic with three separate branches: legislative, executive, and judicial. The country has a strong emphasis on liberty, equality under the law, individualism, and limited government. Economically, the U.S. has the largest nominal GDP in the world and is a leader in economic competitiveness, innovation, and human rights. The U.S. is also a founding member of various international organizations.

The U.S. has a rich cultural landscape, with influences from various ethnic groups and traditions. Amer

In [15]:
print(summary_with_detail_pt2)

The United States of America is a country located in North America, consisting of 50 states, a federal capital district, and various territories. It has a rich history, from the arrival of Paleo-Indians over 12,000 years ago to British colonization, the American Revolution, and the Civil War. The U.S. is a presidential constitutional republic with a strong emphasis on liberty, equality, and limited government. It is a global economic powerhouse, with the largest nominal GDP since 1890 and significant influence in international organizations. The country's history includes European colonization, conflicts with Native Americans, the Revolutionary War, and westward expansion. The U.S. Constitution, drafted in 1787, established a federal government with three branches and a system of checks and balances. The U.S. has played a significant role in world events, including World War II and the Cold War, emerging as a superpower after the collapse of the Soviet Union.

The text discusses key ev

Note that this utility also allows passing additional instructions.

In [16]:
summary_with_additional_instructions = summarize(united_states_wikipedia_text, detail=0.1, additional_instructions="Write in point form and focus on numerical data.")
print(summary_with_additional_instructions)

100%|██████████| 5/5 [00:17<00:00,  3.54s/it]

- The USA is a federation of 50 states, a federal capital district, and 326 Indian reservations.
- It has sovereignty over five major unincorporated island territories and various uninhabited islands.
- The population of the USA exceeds 334 million.
- The USA has the world's third-largest land area and largest maritime exclusive economic zone.
- The USA has had the largest nominal GDP in the world since 1890 and accounted for over 25% of the global economy in 2023.
- The USA has the highest median income per capita of any non-microstate.
- The USA ranks high in economic competitiveness, productivity, innovation, human rights, and higher education.
- The USA is a founding member of various international organizations such as the World Bank, IMF, NATO, and the UN Security Council.

- In the early 1960s, President Lyndon Johnson's Great Society plan led to groundbreaking laws and policies to counteract institutional racism.
- By 1985, the majority of women aged 16 and older in the US were




Finally, note that the utility allows for recursive summarization, where each chunk is summarized with the previous summaries included as context. This can be enabled by setting the `summarize_recursively` parameter to True. It is more computationally expensive, since every request also carries all of the earlier summaries, but it can increase the consistency and coherence of the combined summary.
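
The effect on prompt size can be sketched without calling the API. The example below mirrors the prompt layout used in `summarize` and swaps the model for a stand-in that keeps the first three words of each chunk. `build_user_message` and `fake_summarize` are hypothetical helpers for illustration only.

```python
from typing import List


def build_user_message(chunk: str, previous_summaries: List[str]) -> str:
    """Mirror the prompt layout used when summarize_recursively=True."""
    if not previous_summaries:
        return chunk  # the first chunk is sent as-is
    joined = "\n\n".join(previous_summaries)
    return f"Previous summaries:\n\n{joined}\n\nText to summarize next:\n\n{chunk}"


def fake_summarize(chunk: str) -> str:
    """Stand-in for the model call: keeps the first three words of the chunk."""
    return " ".join(chunk.split()[:3])


chunks = ["alpha beta gamma delta", "epsilon zeta eta theta", "iota kappa lambda mu"]
summaries: List[str] = []
prompt_lengths: List[int] = []
for chunk in chunks:
    prompt = build_user_message(chunk, summaries)
    prompt_lengths.append(len(prompt))  # characters sent for this chunk
    summaries.append(fake_summarize(chunk))

print(prompt_lengths)
```

Because every prompt includes all summaries produced so far, the prompt grows with each chunk, which is where the extra cost of recursive mode comes from.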

In [18]:
recursive_summary = summarize(united_states_wikipedia_text, detail=0.1, summarize_recursively=True, additional_instructions="Don't overuse repetitive phrases to introduce each section")
print(recursive_summary)

100%|██████████| 5/5 [00:12<00:00,  2.40s/it]

The United States of America is a country located in North America with 50 states, a federal capital district, and various territories. It has a rich history, from early indigenous migrations to European colonization, the American Revolution, Civil War, and emergence as a global superpower. The U.S. government is a presidential constitutional republic with a strong emphasis on liberty, equality, and limited government. Economically, the U.S. is a powerhouse with a significant global influence. The country has been involved in major historical events such as World War II, the Cold War, and the civil rights movement.

In the 1960s, the U.S. saw significant social changes with President Lyndon Johnson's Great Society plan addressing institutional racism, the counterculture movement influencing attitudes towards drug use and sexuality, and opposition to the Vietnam War. The 1980s and 1990s marked the end of the Cold War, solidifying the U.S. as a superpower. The 1990s saw economic growth, 


