{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Processing and narrating a video with GPT-4.1-mini's visual capabilities and GPT-4o TTS API\n",
"\n",
"This notebook demonstrates how to use GPT's visual capabilities with a video. Although GPT-4.1-mini doesn't take videos as input directly, we can use vision and the 1M token context window to describe the static frames of a whole video at once. We'll walk through two examples:\n",
"\n",
"1. Using GPT-4.1-mini to get a description of a video\n",
"2. Generating a voiceover for a video with GPT-4o TTS API\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display, Image, Audio\n",
"\n",
"import cv2 # We're using OpenCV to read video, to install !pip install opencv-python\n",
"import base64\n",
"import time\n",
"from openai import OpenAI\n",
"import os\n",
"\n",
"client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Using GPT's visual capabilities to get a description of a video\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we use OpenCV to extract frames from a nature [video](https://www.youtube.com/watch?v=kQ_7GtE529M) containing bisons and wolves:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"video = cv2.VideoCapture(\"data/bison.mp4\")\n",
"\n",
"base64Frames = []\n",
"while video.isOpened():\n",
" success, frame = video.read()\n",
" if not success:\n",
" break\n",
" _, buffer = cv2.imencode(\".jpg\", frame)\n",
" base64Frames.append(base64.b64encode(buffer).decode(\"utf-8\"))\n",
"\n",
"video.release()\n",
"print(len(base64Frames), \"frames read.\")"
]
},
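{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, large frames can be downscaled before encoding to keep the request payload small. The helper below is a minimal sketch (the 768-pixel cap on the longest side is an illustrative choice, not an API requirement) that could be called inside the read loop in place of the direct `cv2.imencode` call:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def encode_frame(frame, max_dim=768):\n",
"    # Downscale so the longest side is at most max_dim pixels, then base64-encode as JPEG.\n",
"    h, w = frame.shape[:2]\n",
"    scale = max_dim / max(h, w)\n",
"    if scale < 1:\n",
"        frame = cv2.resize(frame, (int(w * scale), int(h * scale)))\n",
"    _, buffer = cv2.imencode(\".jpg\", frame)\n",
"    return base64.b64encode(buffer).decode(\"utf-8\")\n",
"\n",
"# In the loop above, this would replace the two encoding lines:\n",
"# base64Frames.append(encode_frame(frame))"
]
},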
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display frames to make sure we've read them in correctly:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"display_handle = display(None, display_id=True)\n",
"for img in base64Frames:\n",
" display_handle.update(Image(data=base64.b64decode(img.encode(\"utf-8\"))))\n",
" time.sleep(0.025)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have the video frames, we craft our prompt and send a request to GPT (Note that we don't need to send every frame for GPT to understand what's going on):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = client.responses.create(\n",
" model=\"gpt-4.1-mini\",\n",
" input=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"input_text\",\n",
" \"text\": (\n",
" \"These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.\"\n",
" )\n",
" },\n",
" *[\n",
" {\n",
" \"type\": \"input_image\",\n",
" \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n",
" }\n",
" for frame in base64Frames[0::25]\n",
" ]\n",
" ]\n",
" }\n",
" ],\n",
")\n",
"\n",
"print(response.output_text)"
]
},
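{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on request size, you can count how many frames the stride of 25 actually sends; a larger stride means fewer images and fewer tokens, at the cost of some detail:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Every 25th frame was included in the request above.\n",
"print(len(base64Frames[0::25]), \"frames sent to the model\")"
]
},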
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Generating a voiceover for a video with GPT-4.1 and the GPT-4o TTS API\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a voiceover for this video in the style of David Attenborough. Using the same video frames we prompt GPT to give us a short script:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_result = client.responses.create(\n",
" model=\"gpt-4.1-mini\",\n",
" input=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"input_text\",\n",
" \"text\": (\n",
" \"These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.\"\n",
" )\n",
" },\n",
" *[\n",
" {\n",
" \"type\": \"input_image\",\n",
" \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n",
" }\n",
" for frame in base64Frames[0::60]\n",
" ]\n",
" ]\n",
" }\n",
" ]\n",
")\n",
"\n",
"print(new_result.output_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can work with the GPT-4o TTS model and provide it a set of instructions on how the voice should sound. You can play around with the voice models and instructers at [OpenAI.fm](openai.fm). We can then pass in the script we generated above with GPT-4.1-mini and generate audio of the voiceover:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"instructions = \"\"\"\n",
"Voice Affect: Calm, measured, and warmly engaging; convey awe and quiet reverence for the natural world.\n",
"\n",
"Tone: Inquisitive and insightful, with a gentle sense of wonder and deep respect for the subject matter.\n",
"\n",
"Pacing: Even and steady, with slight lifts in rhythm when introducing a new species or unexpected behavior; natural pauses to allow the viewer to absorb visuals.\n",
"\n",
"Emotion: Subtly emotive—imbued with curiosity, empathy, and admiration without becoming sentimental or overly dramatic.\n",
"\n",
"Emphasis: Highlight scientific and descriptive language (“delicate wings shimmer in the sunlight,” “a symphony of unseen life,” “ancient rituals played out beneath the canopy”) to enrich imagery and understanding.\n",
"\n",
"Pronunciation: Clear and articulate, with precise enunciation and slightly rounded vowels to ensure accessibility and authority.\n",
"\n",
"Pauses: Insert thoughtful pauses before introducing key facts or transitions (“And then... with a sudden rustle...”), allowing space for anticipation and reflection.\n",
"\"\"\"\n",
"\n",
"audio_response = response = client.audio.speech.create(\n",
" model=\"gpt-4o-mini-tts\",\n",
" voice=\"echo\",\n",
" instructions=instructions,\n",
" input=new_result.output_text,\n",
" response_format=\"wav\"\n",
")\n",
"\n",
"audio_bytes = audio_response.content\n",
"Audio(data=audio_bytes)"
]
}
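,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to keep the voiceover alongside the video, the audio bytes can also be written straight to disk (the output path below is just an example):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Save the generated voiceover to a WAV file; the filename is an arbitrary example.\n",
"with open(\"data/bison_voiceover.wav\", \"wb\") as f:\n",
"    f.write(audio_bytes)"
]
}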
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}