{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Processing and narrating a video with GPT-4.1-mini's visual capabilities and the GPT-4o mini TTS API\n",
    "\n",
    "This notebook demonstrates how to use GPT's visual capabilities with a video. Although GPT-4.1-mini doesn't take videos as input directly, we can use its vision capabilities together with the 1M-token context window to describe the static frames of a whole video at once. We'll walk through two examples:\n",
    "\n",
    "1. Using GPT-4.1-mini to get a description of a video\n",
    "2. Generating a voiceover for a video with the GPT-4o mini TTS API\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import display, Image, Audio\n",
    "\n",
    "import cv2  # We're using OpenCV to read the video; install with: pip install opencv-python\n",
    "import base64\n",
    "import time\n",
    "from openai import OpenAI\n",
    "import os\n",
    "\n",
    "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Using GPT's visual capabilities to get a description of a video\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we use OpenCV to extract frames from a nature [video](https://www.youtube.com/watch?v=kQ_7GtE529M) featuring bison and wolves:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "video = cv2.VideoCapture(\"data/bison.mp4\")\n",
    "\n",
    "base64Frames = []\n",
    "while video.isOpened():\n",
    "    success, frame = video.read()\n",
    "    if not success:\n",
    "        break\n",
    "    _, buffer = cv2.imencode(\".jpg\", frame)\n",
    "    base64Frames.append(base64.b64encode(buffer).decode(\"utf-8\"))\n",
    "\n",
    "video.release()\n",
    "print(len(base64Frames), \"frames read.\")"
   ]
  },
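  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For long videos, holding every full-resolution frame in memory gets expensive. As an optional, minimal sketch (not part of the original walkthrough; the 512-pixel target width is an arbitrary choice), you can downscale each frame with `cv2.resize` before encoding to shrink the base64 payload:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: downscale frames before encoding to reduce payload size.\n",
    "# Hedged sketch; the 512 px target width is an arbitrary assumption.\n",
    "video = cv2.VideoCapture(\"data/bison.mp4\")\n",
    "\n",
    "small_frames = []\n",
    "while video.isOpened():\n",
    "    success, frame = video.read()\n",
    "    if not success:\n",
    "        break\n",
    "    h, w = frame.shape[:2]\n",
    "    resized = cv2.resize(frame, (512, int(h * 512 / w)))  # dsize is (width, height)\n",
    "    _, buffer = cv2.imencode(\".jpg\", resized)\n",
    "    small_frames.append(base64.b64encode(buffer).decode(\"utf-8\"))\n",
    "\n",
    "video.release()\n",
    "print(len(small_frames), \"downscaled frames read.\")"
   ]
  },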
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display frames to make sure we've read them in correctly:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display_handle = display(None, display_id=True)\n",
    "for img in base64Frames:\n",
    "    display_handle.update(Image(data=base64.b64decode(img.encode(\"utf-8\"))))\n",
    "    time.sleep(0.025)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have the video frames, we craft our prompt and send a request to GPT. Note that we don't need to send every frame for the model to follow what's happening; here we sample every 25th frame:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = client.responses.create(\n",
    "    model=\"gpt-4.1-mini\",\n",
    "    input=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                {\n",
    "                    \"type\": \"input_text\",\n",
    "                    \"text\": (\n",
    "                        \"These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.\"\n",
    "                    )\n",
    "                },\n",
    "                *[\n",
    "                    {\n",
    "                        \"type\": \"input_image\",\n",
    "                        \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n",
    "                    }\n",
    "                    for frame in base64Frames[0::25]\n",
    "                ]\n",
    "            ]\n",
    "        }\n",
    "    ],\n",
    ")\n",
    "\n",
    "print(response.output_text)"
   ]
  },
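  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're on the Chat Completions API rather than the Responses API, the same request can be expressed as below. This is an optional, hedged sketch: it assumes the standard Chat Completions vision message format, and `chat_response` is just an illustrative variable name:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional alternative: the same request via the Chat Completions API.\n",
    "chat_response = client.chat.completions.create(\n",
    "    model=\"gpt-4.1-mini\",\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                {\"type\": \"text\", \"text\": \"These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.\"},\n",
    "                *[\n",
    "                    {\"type\": \"image_url\", \"image_url\": {\"url\": f\"data:image/jpeg;base64,{frame}\"}}\n",
    "                    for frame in base64Frames[0::25]\n",
    "                ]\n",
    "            ]\n",
    "        }\n",
    "    ],\n",
    ")\n",
    "\n",
    "print(chat_response.choices[0].message.content)"
   ]
  },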
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Generating a voiceover for a video with GPT-4.1-mini and the GPT-4o mini TTS API\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a voiceover for this video in the style of David Attenborough. Using the same video frames, we prompt GPT for a short script:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_result = client.responses.create(\n",
    "    model=\"gpt-4.1-mini\",\n",
    "    input=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                {\n",
    "                    \"type\": \"input_text\",\n",
    "                    \"text\": (\n",
    "                        \"These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.\"\n",
    "                    )\n",
    "                },\n",
    "                *[\n",
    "                    {\n",
    "                        \"type\": \"input_image\",\n",
    "                        \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n",
    "                    }\n",
    "                    for frame in base64Frames[0::60]\n",
    "                ]\n",
    "            ]\n",
    "        }\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(new_result.output_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can work with the GPT-4o mini TTS model and provide it a set of instructions on how the voice should sound. You can play around with the voices and instructions at [openai.fm](https://www.openai.fm). We then pass in the script we generated above with GPT-4.1-mini and generate audio of the voiceover:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "instructions = \"\"\"\n",
    "Voice Affect: Calm, measured, and warmly engaging; convey awe and quiet reverence for the natural world.\n",
    "\n",
    "Tone: Inquisitive and insightful, with a gentle sense of wonder and deep respect for the subject matter.\n",
    "\n",
    "Pacing: Even and steady, with slight lifts in rhythm when introducing a new species or unexpected behavior; natural pauses to allow the viewer to absorb visuals.\n",
    "\n",
    "Emotion: Subtly emotive—imbued with curiosity, empathy, and admiration without becoming sentimental or overly dramatic.\n",
    "\n",
    "Emphasis: Highlight scientific and descriptive language (“delicate wings shimmer in the sunlight,” “a symphony of unseen life,” “ancient rituals played out beneath the canopy”) to enrich imagery and understanding.\n",
    "\n",
    "Pronunciation: Clear and articulate, with precise enunciation and slightly rounded vowels to ensure accessibility and authority.\n",
    "\n",
    "Pauses: Insert thoughtful pauses before introducing key facts or transitions (“And then... with a sudden rustle...”), allowing space for anticipation and reflection.\n",
    "\"\"\"\n",
    "\n",
    "audio_response = client.audio.speech.create(\n",
    "    model=\"gpt-4o-mini-tts\",\n",
    "    voice=\"echo\",\n",
    "    instructions=instructions,\n",
    "    input=new_result.output_text,\n",
    "    response_format=\"wav\"\n",
    ")\n",
    "\n",
    "audio_bytes = audio_response.content\n",
    "Audio(data=audio_bytes)"
   ]
  },
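  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To keep the narration around, you can write the returned bytes straight to disk. This is a small optional addition; the output path `data/bison_voiceover.wav` is an arbitrary choice:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optionally persist the voiceover; the filename is an arbitrary choice.\n",
    "with open(\"data/bison_voiceover.wav\", \"wb\") as f:\n",
    "    f.write(audio_bytes)"
   ]
  }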
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}