{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Processing and narrating a video with GPT-4.1-mini's visual capabilities and the GPT-4o mini TTS API\n",
    "\n",
    "This notebook demonstrates how to use GPT's visual capabilities with a video. Although GPT-4.1-mini doesn't take videos as input directly, we can use its vision capabilities together with the 1M-token context window to describe the static frames of a whole video at once. We'll walk through two examples:\n",
    "\n",
    "1. Using GPT-4.1-mini to get a description of a video\n",
    "2. Generating a voiceover for a video with the GPT-4o mini TTS API\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import display, Image, Audio\n",
    "\n",
    "import cv2  # We're using OpenCV to read the video; install with: pip install opencv-python\n",
    "import base64\n",
    "import time\n",
    "from openai import OpenAI\n",
    "import os\n",
    "\n",
    "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Using GPT's visual capabilities to get a description of a video\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we use OpenCV to extract frames from a nature [video](https://www.youtube.com/watch?v=kQ_7GtE529M) featuring bison and wolves:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "video = cv2.VideoCapture(\"data/bison.mp4\")\n",
    "\n",
    "base64Frames = []\n",
    "while video.isOpened():\n",
    "    success, frame = video.read()\n",
    "    if not success:\n",
    "        break\n",
    "    _, buffer = cv2.imencode(\".jpg\", frame)\n",
    "    base64Frames.append(base64.b64encode(buffer).decode(\"utf-8\"))\n",
    "\n",
    "video.release()\n",
    "print(len(base64Frames), \"frames read.\")"
   ]
  },
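  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For long videos, holding every full-resolution frame in memory gets expensive. As an optional, minimal sketch (not part of the original walkthrough; the 512-pixel target width is an arbitrary choice), you can downscale each frame with `cv2.resize` before encoding to shrink the base64 payload:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: downscale frames before encoding to reduce payload size.\n",
    "# Hedged sketch; the 512 px target width is an arbitrary assumption.\n",
    "video = cv2.VideoCapture(\"data/bison.mp4\")\n",
    "\n",
    "small_frames = []\n",
    "while video.isOpened():\n",
    "    success, frame = video.read()\n",
    "    if not success:\n",
    "        break\n",
    "    h, w = frame.shape[:2]\n",
    "    resized = cv2.resize(frame, (512, int(h * 512 / w)))  # dsize is (width, height)\n",
    "    _, buffer = cv2.imencode(\".jpg\", resized)\n",
    "    small_frames.append(base64.b64encode(buffer).decode(\"utf-8\"))\n",
    "\n",
    "video.release()\n",
    "print(len(small_frames), \"downscaled frames read.\")"
   ]
  },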
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display frames to make sure we've read them in correctly:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display_handle = display(None, display_id=True)\n",
    "for img in base64Frames:\n",
    "    display_handle.update(Image(data=base64.b64decode(img.encode(\"utf-8\"))))\n",
    "    time.sleep(0.025)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have the video frames, we craft our prompt and send a request to GPT. Note that we don't need to send every frame for the model to follow what's happening; here we sample every 25th frame:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = client.responses.create(\n",
    "    model=\"gpt-4.1-mini\",\n",
    "    input=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                {\n",
    "                    \"type\": \"input_text\",\n",
    "                    \"text\": (\n",
    "                        \"These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.\"\n",
    "                    )\n",
    "                },\n",
    "                *[\n",
    "                    {\n",
    "                        \"type\": \"input_image\",\n",
    "                        \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n",
    "                    }\n",
    "                    for frame in base64Frames[0::25]\n",
    "                ]\n",
    "            ]\n",
    "        }\n",
    "    ],\n",
    ")\n",
    "\n",
    "print(response.output_text)"
   ]
  },
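  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're on the Chat Completions API rather than the Responses API, the same request can be expressed as below. This is an optional, hedged sketch: it assumes the standard Chat Completions vision message format, and `chat_response` is just an illustrative variable name:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional alternative: the same request via the Chat Completions API.\n",
    "chat_response = client.chat.completions.create(\n",
    "    model=\"gpt-4.1-mini\",\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                {\"type\": \"text\", \"text\": \"These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.\"},\n",
    "                *[\n",
    "                    {\"type\": \"image_url\", \"image_url\": {\"url\": f\"data:image/jpeg;base64,{frame}\"}}\n",
    "                    for frame in base64Frames[0::25]\n",
    "                ]\n",
    "            ]\n",
    "        }\n",
    "    ],\n",
    ")\n",
    "\n",
    "print(chat_response.choices[0].message.content)"
   ]
  },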
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Generating a voiceover for a video with GPT-4.1-mini and the GPT-4o mini TTS API\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a voiceover for this video in the style of David Attenborough. Using the same video frames, we prompt GPT for a short script:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_result = client.responses.create(\n",
    "    model=\"gpt-4.1-mini\",\n",
    "    input=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                {\n",
    "                    \"type\": \"input_text\",\n",
    "                    \"text\": (\n",
    "                        \"These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.\"\n",
    "                    )\n",
    "                },\n",
    "                *[\n",
    "                    {\n",
    "                        \"type\": \"input_image\",\n",
    "                        \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n",
    "                    }\n",
    "                    for frame in base64Frames[0::60]\n",
    "                ]\n",
    "            ]\n",
    "        }\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(new_result.output_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can work with the GPT-4o mini TTS model and provide it a set of instructions on how the voice should sound. You can play around with the voices and instructions at [openai.fm](https://www.openai.fm). We then pass in the script we generated above with GPT-4.1-mini and generate audio of the voiceover:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "instructions = \"\"\"\n",
    "Voice Affect: Calm, measured, and warmly engaging; convey awe and quiet reverence for the natural world.\n",
    "\n",
    "Tone: Inquisitive and insightful, with a gentle sense of wonder and deep respect for the subject matter.\n",
    "\n",
    "Pacing: Even and steady, with slight lifts in rhythm when introducing a new species or unexpected behavior; natural pauses to allow the viewer to absorb visuals.\n",
    "\n",
    "Emotion: Subtly emotive—imbued with curiosity, empathy, and admiration without becoming sentimental or overly dramatic.\n",
    "\n",
    "Emphasis: Highlight scientific and descriptive language (“delicate wings shimmer in the sunlight,” “a symphony of unseen life,” “ancient rituals played out beneath the canopy”) to enrich imagery and understanding.\n",
    "\n",
    "Pronunciation: Clear and articulate, with precise enunciation and slightly rounded vowels to ensure accessibility and authority.\n",
    "\n",
    "Pauses: Insert thoughtful pauses before introducing key facts or transitions (“And then... with a sudden rustle...”), allowing space for anticipation and reflection.\n",
    "\"\"\"\n",
    "\n",
    "audio_response = client.audio.speech.create(\n",
    "    model=\"gpt-4o-mini-tts\",\n",
    "    voice=\"echo\",\n",
    "    instructions=instructions,\n",
    "    input=new_result.output_text,\n",
    "    response_format=\"wav\"\n",
    ")\n",
    "\n",
    "audio_bytes = audio_response.content\n",
    "Audio(data=audio_bytes)"
   ]
  },
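  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To keep the narration around, you can write the returned bytes straight to disk. This is a small optional addition; the output path `data/bison_voiceover.wav` is an arbitrary choice:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optionally persist the voiceover; the filename is an arbitrary choice.\n",
    "with open(\"data/bison_voiceover.wav\", \"wb\") as f:\n",
    "    f.write(audio_bytes)"
   ]
  }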
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}