{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Processing and narrating a video with GPT-4.1-mini's visual capabilities and GPT-4o TTS API\n",
"\n",
"This notebook demonstrates how to use GPT's visual capabilities with a video. Although GPT-4.1-mini doesn't take videos as input directly, we can use vision and the 1M token context window to describe the static frames of a whole video at once. We'll walk through two examples:\n",
"\n",
"1. Using GPT-4.1-mini to get a description of a video\n",
"2. Generating a voiceover for a video with GPT-4o TTS API\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display, Image, Audio\n",
"\n",
"import cv2 # We're using OpenCV to read video, to install !pip install opencv-python\n",
"import base64\n",
"import time\n",
"from openai import OpenAI\n",
"import os\n",
"\n",
"client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Using GPT's visual capabilities to get a description of a video\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we use OpenCV to extract frames from a nature [video](https://www.youtube.com/watch?v=kQ_7GtE529M) containing bisons and wolves:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"video = cv2.VideoCapture(\"data/bison.mp4\")\n",
"\n",
"base64Frames = []\n",
"while video.isOpened():\n",
" success, frame = video.read()\n",
" if not success:\n",
" break\n",
" _, buffer = cv2.imencode(\".jpg\", frame)\n",
" base64Frames.append(base64.b64encode(buffer).decode(\"utf-8\"))\n",
"\n",
"video.release()\n",
"print(len(base64Frames), \"frames read.\")"
]
},
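{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, large frames can be downscaled before encoding to keep the request payload small. The helper below is a minimal sketch (the 768-pixel cap on the longest side is an illustrative choice, not an API requirement) that could be called inside the read loop in place of the direct `cv2.imencode` call:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def encode_frame(frame, max_dim=768):\n",
"    # Downscale so the longest side is at most max_dim pixels, then base64-encode as JPEG.\n",
"    h, w = frame.shape[:2]\n",
"    scale = max_dim / max(h, w)\n",
"    if scale < 1:\n",
"        frame = cv2.resize(frame, (int(w * scale), int(h * scale)))\n",
"    _, buffer = cv2.imencode(\".jpg\", frame)\n",
"    return base64.b64encode(buffer).decode(\"utf-8\")\n",
"\n",
"# In the loop above, this would replace the two encoding lines:\n",
"# base64Frames.append(encode_frame(frame))"
]
},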
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display frames to make sure we've read them in correctly:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"display_handle = display(None, display_id=True)\n",
"for img in base64Frames:\n",
" display_handle.update(Image(data=base64.b64decode(img.encode(\"utf-8\"))))\n",
" time.sleep(0.025)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have the video frames, we craft our prompt and send a request to GPT (Note that we don't need to send every frame for GPT to understand what's going on):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = client.responses.create(\n",
" model=\"gpt-4.1-mini\",\n",
" input=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"input_text\",\n",
" \"text\": (\n",
" \"These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.\"\n",
" )\n",
" },\n",
" *[\n",
" {\n",
" \"type\": \"input_image\",\n",
" \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n",
" }\n",
" for frame in base64Frames[0::25]\n",
" ]\n",
" ]\n",
" }\n",
" ],\n",
")\n",
"\n",
"print(response.output_text)"
]
},
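{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on request size, you can count how many frames the stride of 25 actually sends; a larger stride means fewer images and fewer tokens, at the cost of some detail:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Every 25th frame was included in the request above.\n",
"print(len(base64Frames[0::25]), \"frames sent to the model\")"
]
},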
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Generating a voiceover for a video with GPT-4.1 and the GPT-4o TTS API\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a voiceover for this video in the style of David Attenborough. Using the same video frames we prompt GPT to give us a short script:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_result = client.responses.create(\n",
" model=\"gpt-4.1-mini\",\n",
" input=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"input_text\",\n",
" \"text\": (\n",
" \"These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.\"\n",
" )\n",
" },\n",
" *[\n",
" {\n",
" \"type\": \"input_image\",\n",
" \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n",
" }\n",
" for frame in base64Frames[0::60]\n",
" ]\n",
" ]\n",
" }\n",
" ]\n",
")\n",
"\n",
"print(new_result.output_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can work with the GPT-4o TTS model and provide it a set of instructions on how the voice should sound. You can play around with the voice models and instructers at [OpenAI.fm](openai.fm). We can then pass in the script we generated above with GPT-4.1-mini and generate audio of the voiceover:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"instructions = \"\"\"\n",
"Voice Affect: Calm, measured, and warmly engaging; convey awe and quiet reverence for the natural world.\n",
"\n",
"Tone: Inquisitive and insightful, with a gentle sense of wonder and deep respect for the subject matter.\n",
"\n",
"Pacing: Even and steady, with slight lifts in rhythm when introducing a new species or unexpected behavior; natural pauses to allow the viewer to absorb visuals.\n",
"\n",
"Emotion: Subtly emotive—imbued with curiosity, empathy, and admiration without becoming sentimental or overly dramatic.\n",
"\n",
"Emphasis: Highlight scientific and descriptive language (“delicate wings shimmer in the sunlight,” “a symphony of unseen life,” “ancient rituals played out beneath the canopy”) to enrich imagery and understanding.\n",
"\n",
"Pronunciation: Clear and articulate, with precise enunciation and slightly rounded vowels to ensure accessibility and authority.\n",
"\n",
"Pauses: Insert thoughtful pauses before introducing key facts or transitions (“And then... with a sudden rustle...”), allowing space for anticipation and reflection.\n",
"\"\"\"\n",
"\n",
"audio_response = response = client.audio.speech.create(\n",
" model=\"gpt-4o-mini-tts\",\n",
" voice=\"echo\",\n",
" instructions=instructions,\n",
" input=new_result.output_text,\n",
" response_format=\"wav\"\n",
")\n",
"\n",
"audio_bytes = audio_response.content\n",
"Audio(data=audio_bytes)"
]
}
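,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to keep the voiceover alongside the video, the audio bytes can also be written straight to disk (the output path below is just an example):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Save the generated voiceover to a WAV file; the filename is an arbitrary example.\n",
"with open(\"data/bison_voiceover.wav\", \"wb\") as f:\n",
"    f.write(audio_bytes)"
]
}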
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}