{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Processing and narrating a video with GPT-4.1-mini's visual capabilities and the GPT-4o TTS API\n", "\n", "This notebook demonstrates how to use GPT's visual capabilities with a video. Although GPT-4.1-mini doesn't take videos as input directly, we can use vision and the 1M token context window to describe the static frames of a whole video at once. We'll walk through two examples:\n", "\n", "1. Using GPT-4.1-mini to get a description of a video\n", "2. Generating a voiceover for a video with the GPT-4o TTS API\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display, Image, Audio\n", "\n", "import cv2  # OpenCV reads the video frames; install it with: pip install opencv-python\n", "import base64\n", "import time\n", "from openai import OpenAI\n", "import os\n", "\n", "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Using GPT's visual capabilities to get a description of a video\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we use OpenCV to extract frames from a nature [video](https://www.youtube.com/watch?v=kQ_7GtE529M) containing bison and wolves:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "video = cv2.VideoCapture(\"data/bison.mp4\")\n", "\n", "base64Frames = []\n", "while video.isOpened():\n", "    success, frame = video.read()\n", "    if not success:\n", "        break\n", "    # Encode the frame as JPEG, then as base64 text so it can be sent to the API\n", "    _, buffer = cv2.imencode(\".jpg\", frame)\n", "    base64Frames.append(base64.b64encode(buffer).decode(\"utf-8\"))\n", "\n", "video.release()\n", "print(len(base64Frames), \"frames read.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the frames to make sure we've read them in correctly:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display_handle = display(None, display_id=True)\n", "for img in base64Frames:\n", "    display_handle.update(Image(data=base64.b64decode(img.encode(\"utf-8\"))))\n", "    time.sleep(0.025)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have the video frames, we craft our prompt and send a request to GPT. Note that we don't need to send every frame for GPT to understand what's going on:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = client.responses.create(\n", "    model=\"gpt-4.1-mini\",\n", "    input=[\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": [\n", "                {\n", "                    \"type\": \"input_text\",\n", "                    \"text\": (\n", "                        \"These are frames from a video that I want to upload. 
Generate a compelling description that I can upload along with the video.\"\n", "                    )\n", "                },\n", "                *[\n", "                    {\n", "                        \"type\": \"input_image\",\n", "                        \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n", "                    }\n", "                    for frame in base64Frames[0::25]\n", "                ]\n", "            ]\n", "        }\n", "    ],\n", ")\n", "\n", "print(response.output_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Generating a voiceover for a video with GPT-4.1-mini and the GPT-4o TTS API\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a voiceover for this video in the style of David Attenborough. Using the same video frames, we prompt GPT to give us a short script:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_result = client.responses.create(\n", "    model=\"gpt-4.1-mini\",\n", "    input=[\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": [\n", "                {\n", "                    \"type\": \"input_text\",\n", "                    \"text\": (\n", "                        \"These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.\"\n", "                    )\n", "                },\n", "                *[\n", "                    {\n", "                        \"type\": \"input_image\",\n", "                        \"image_url\": f\"data:image/jpeg;base64,{frame}\"\n", "                    }\n", "                    for frame in base64Frames[0::60]\n", "                ]\n", "            ]\n", "        }\n", "    ]\n", ")\n", "\n", "print(new_result.output_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can work with the GPT-4o TTS model and provide it a set of instructions on how the voice should sound. You can play around with the voice models and instructions at [OpenAI.fm](https://www.openai.fm). 
We can then pass in the script we generated above with GPT-4.1-mini and generate audio of the voiceover:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "instructions = \"\"\"\n", "Voice Affect: Calm, measured, and warmly engaging; convey awe and quiet reverence for the natural world.\n", "\n", "Tone: Inquisitive and insightful, with a gentle sense of wonder and deep respect for the subject matter.\n", "\n", "Pacing: Even and steady, with slight lifts in rhythm when introducing a new species or unexpected behavior; natural pauses to allow the viewer to absorb visuals.\n", "\n", "Emotion: Subtly emotive—imbued with curiosity, empathy, and admiration without becoming sentimental or overly dramatic.\n", "\n", "Emphasis: Highlight scientific and descriptive language (“delicate wings shimmer in the sunlight,” “a symphony of unseen life,” “ancient rituals played out beneath the canopy”) to enrich imagery and understanding.\n", "\n", "Pronunciation: Clear and articulate, with precise enunciation and slightly rounded vowels to ensure accessibility and authority.\n", "\n", "Pauses: Insert thoughtful pauses before introducing key facts or transitions (“And then... with a sudden rustle...”), allowing space for anticipation and reflection.\n", "\"\"\"\n", "\n", "# Use a dedicated variable so the earlier `response` object is not overwritten\n", "audio_response = client.audio.speech.create(\n", "    model=\"gpt-4o-mini-tts\",\n", "    voice=\"echo\",\n", "    instructions=instructions,\n", "    input=new_result.output_text,\n", "    response_format=\"wav\"\n", ")\n", "\n", "audio_bytes = audio_response.content\n", "Audio(data=audio_bytes)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }