{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Processing and narrating a video with GPT-4.1-mini's visual capabilities and the GPT-4o TTS API\n",
"\n",
"This notebook demonstrates how to use GPT's visual capabilities with a video. Although GPT-4.1-mini doesn't take videos as input directly, we can use its vision support and 1M-token context window to describe the static frames of a whole video at once. We'll walk through two examples:\n",
"\n",
"1. Using GPT-4.1-mini to get a description of a video\n",
"2. Generating a voiceover for a video with the GPT-4o TTS API\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display, Image, Audio\n",
"\n",
"import cv2  # OpenCV reads the video; install it with: pip install opencv-python\n",
"import base64\n",
"import time\n",
"from openai import OpenAI\n",
"import os\n",
"\n",
"client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Using GPT's visual capabilities to get a description of a video\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we use OpenCV to extract frames from a nature [video](https://www.youtube.com/watch?v=kQ_7GtE529M) containing bison and wolves:\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"video = cv2.VideoCapture(\"data/bison.mp4\")\n",
"\n",
"base64Frames = []\n",
"while video.isOpened():\n",
"    success, frame = video.read()\n",
"    # Stop at the end of the stream or on a read error\n",
"    if not success:\n",
"        break\n",
"    # JPEG-encode the frame, then base64-encode it so it can be sent in a request\n",
"    _, buffer = cv2.imencode(\".jpg\", frame)\n",
"    base64Frames.append(base64.b64encode(buffer).decode(\"utf-8\"))\n",
"\n",
"video.release()\n",
"print(len(base64Frames), \"frames read.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display frames to make sure we've read them in correctly:\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"display_handle = display(None, display_id=True)\n",
"for img in base64Frames:\n",
"    display_handle.update(Image(data=base64.b64decode(img.encode(\"utf-8\"))))\n",
"    time.sleep(0.025)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have the video frames, we craft our prompt and send a request to GPT. Note that we don't need to send every frame for GPT to understand what's going on; the request below samples every 25th frame:\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Witness an intense and gripping wildlife encounter in the heart of a snowy wilderness. This extraordinary video captures a fearless pack of wolves as they courageously surround and confront a mighty bison. As the icy wind sweeps across the barren landscape, watch the raw power and strategic teamwork of the wolves unfold in a dramatic struggle for survival. The video showcases nature's harsh realities and the delicate balance of predator and prey, highlighting the wolves' determination and the bison's formidable strength. Prepare for an unforgettable glimpse into the relentless dance of life and death in the wild. Don't miss this captivating moment of nature's untamed drama!\n"
]
}
],
"source": [
"response = client.responses.create(\n",
"    model=\"gpt-4.1-mini\",\n",
"    input=[\n",
"        {\n",
"            \"role\": \"user\",\n",
"            \"content\": [\n",
"                {\n",
"                    \"type\": \"input_text\",\n",
"                    \"text\": (\n",
"                        \"These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.\"\n",
"                    )\n",
"                },\n",
"                *[\n",
"                    {\n",
"                        \"type\": \"input_image\",\n",
"                        \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n",
"                    }\n",
"                    for frame in base64Frames[0::25]\n",
"                ]\n",
"            ]\n",
"        }\n",
"    ],\n",
")\n",
"\n",
"print(response.output_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Generating a voiceover for a video with GPT-4.1-mini and the GPT-4o TTS API\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a voiceover for this video in the style of David Attenborough. Using the same video frames, we prompt GPT to give us a short script:\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In the vast, unforgiving expanse of the winter tundra, a stealthy pack of wolves encircles their formidable prey: the mighty bison. With calculated precision, they close in, working as a cohesive unit to isolate their target from the herd. Though the bison stands its ground, outnumbered and pressured, the relentless wolves persist, their survival dependent on this crucial hunt. Each movement tells a story of nature's balance—predator and prey locked in an age-old dance for existence under the stark skies. Here, in this frozen wilderness, life and death hang in a delicate balance, governed by the instinct to endure.\n"
]
}
],
"source": [
"new_result = client.responses.create(\n",
"    model=\"gpt-4.1-mini\",\n",
"    input=[\n",
"        {\n",
"            \"role\": \"user\",\n",
"            \"content\": [\n",
"                {\n",
"                    \"type\": \"input_text\",\n",
"                    \"text\": (\n",
"                        \"These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.\"\n",
"                    )\n",
"                },\n",
"                *[\n",
"                    {\n",
"                        \"type\": \"input_image\",\n",
"                        \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n",
"                    }\n",
"                    for frame in base64Frames[0::60]\n",
"                ]\n",
"            ]\n",
"        }\n",
"    ]\n",
")\n",
"\n",
"print(new_result.output_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can use the GPT-4o TTS model and give it a set of instructions on how the voice should sound. You can experiment with the voices and instructions at [OpenAI.fm](https://openai.fm). We then pass in the script we generated above with GPT-4.1-mini and generate audio of the voiceover:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"instructions = \"\"\"\n",
"Voice Affect: Calm, measured, and warmly engaging; convey awe and quiet reverence for the natural world.\n",
"\n",
"Tone: Inquisitive and insightful, with a gentle sense of wonder and deep respect for the subject matter.\n",
"\n",
"Pacing: Even and steady, with slight lifts in rhythm when introducing a new species or unexpected behavior; natural pauses to allow the viewer to absorb visuals.\n",
"\n",
"Emotion: Subtly emotive—imbued with curiosity, empathy, and admiration without becoming sentimental or overly dramatic.\n",
"\n",
"Emphasis: Highlight scientific and descriptive language (“delicate wings shimmer in the sunlight,” “a symphony of unseen life,” “ancient rituals played out beneath the canopy”) to enrich imagery and understanding.\n",
"\n",
"Pronunciation: Clear and articulate, with precise enunciation and slightly rounded vowels to ensure accessibility and authority.\n",
"\n",
"Pauses: Insert thoughtful pauses before introducing key facts or transitions (“And then... with a sudden rustle...”), allowing space for anticipation and reflection.\n",
"\"\"\"\n",
"\n",
"# Use a dedicated name so the earlier `response` from section 1 is not overwritten\n",
"audio_response = client.audio.speech.create(\n",
"    model=\"gpt-4o-mini-tts\",\n",
"    voice=\"echo\",\n",
"    instructions=instructions,\n",
"    input=new_result.output_text,\n",
"    response_format=\"wav\"\n",
")\n",
"\n",
"audio_bytes = audio_response.content\n",
"Audio(data=audio_bytes)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}