openai-cookbook/examples/gpt4o/introduction_to_gpt4o.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to GPT-4o\n",
    "GPT-4o (\"o\" for \"omni\") is designed to handle a combination of text, audio, and video inputs, and can generate outputs in text, audio, and image formats.\n",
    "\n",
    "### Background\n",
    "Before GPT-4o, users could interact with ChatGPT using Voice Mode, which operated with three separate models. GPT-4o will integrate these capabilities into a single model that's trained across text, vision, and audio. This unified approach ensures that all inputs—whether text, visual, or auditory—are processed cohesively by the same neural network.\n",
    "\n",
    "### Current API Capabilities\n",
    "Currently, the API supports `{text, image}` inputs only, with `{text}` outputs, the same modalities as `gpt-4-turbo`. Additional modalities, including audio, will be introduced soon. This guide will help you get started with using GPT-4o for text, image, and video understanding.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting Started"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install OpenAI SDK for Python\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --upgrade openai --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure the OpenAI client and submit a test request\n",
    "To setup the client for our use, we need to create an API key to use with our request. Skip these steps if you already have an API key for usage. \n",
    "\n",
    "You can get an API key by following these steps:\n",
    "1. [Create a new project](https://help.openai.com/en/articles/9186755-managing-your-work-in-the-api-platform-with-projects)\n",
    "2. [Generate an API key in your project](https://platform.openai.com/api-keys)\n",
    "3. (RECOMMENDED, BUT NOT REQUIRED) [Setup your API key for all projects as an env var](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key)\n",
    "\n",
    "Once we have this setup, let's start with a simple {text} input to the model for our first request. We'll use both `system` and `user` messages for our first request, and we'll receive a response from the `assistant` role."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI \n",
    "import os\n",
    "\n",
    "## Set the API key and model name\n",
    "MODEL=\"gpt-4o\"\n",
    "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as an env var>\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Assistant: Sure! The sum of 2 + 2 is 4. If you have any more questions or need further assistance, feel free to ask!\n"
     ]
    }
   ],
   "source": [
    "completion = client.chat.completions.create(\n",
    "  model=MODEL,\n",
    "  messages=[\n",
    "    {\"role\": \"system\", \"content\": \"You are a helpful assistant. Help me with my math homework!\"}, # <-- This is the system message that provides context to the model\n",
    "    {\"role\": \"user\", \"content\": \"Hello! Could you solve 2+2?\"}  # <-- This is the user message for which the model will generate a response\n",
    "  ]\n",
    ")\n",
    "\n",
    "print(\"Assistant: \" + completion.choices[0].message.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Image Processing\n",
    "GPT-4o can directly process images and take intelligent actions based on the image. We can provide images in two formats:\n",
    "1. Base64 Encoded\n",
    "2. URL\n",
    "\n",
    "Let's first view the image we'll use, then try sending this image as both Base64 and as a URL link to the API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAnEAAAEcCAAAAACNZL39AAAACXBIWXMAAAsSAAALEgHS3X78AAAfF0lEQVR42u2d63LjuLKlv0xAAMmyXKf3mT3v/34zs0902S2JFICcH3K55Lsk60JKWBHd0W1bFJhYyAQSeRGjouKM0CqCisq4isq4iorKuIrKuIqKyriKyriKyriKisq4iiuFryI4BwTEfKLe8FTGnYVv4qw4S3iplKtW9QyE82TFMlqIlXGVEKdGF7xJInvKDwb5cevrr6r5U69pg7af2RAGaJdiLlXGVZyScE5my42ouwU6s6FZVsZVnAjNutw9Aob4BN0CaJdSKuMqTiLcu8dmnV0xA6RZuwGZDfjEDbtJKuNOBrUwiGmhXQAaejHDFZdddm64WblXxp1MsmjBJ54OCj6rZAPxSTXhwuJWF2KlxokId08JpChPJ9PsizfAEhguL5tbFUzVcSdAt6RZ4TJ/NmyzhOaNyLVEem51L1cZdwqhgst3j2LdPzwTTqT85tj8EV/wurpF4Vereny+SYfL8RELz4QjNSa0T//ziBbK4JHKuIrvopHIwpvr+bf1f37sV4j+9vyaFSsaHrnFvVy1qkeWJ+CT2OtdmvgEW55f2fzRDe7lqo47Kt8ktlCQaK+o1KSgbutnlhCN8QbtatVxxxSmFnDF3lFdgsvE1Yu1bkAc7MZmoDLuqHxzWYuz8laoqtuukicS4iS1yxuzrNWqHotv8rNEshq5vEehHFReBlwbuOKX3FiUZtVxR5KjFiCsbXOL+vb3Yf3W2LYrNlb4luKXqo47CiJGgGDY+/elNri3p4SlRS/ZIatbWptVx30f7QpXzGf7ZEvmCu+FxYmjaAa/vhVhVR33/UUra5psIdknx04pTt7zhIQ8s3wPtxOjWXXctyWIw4pK/nQzpqHn/dBfkbiiWd2MM7jquO+hk0DORUqxz3f/fSC8+wuTVWCNOr0Nb3Bl3LfglgwOkK/MYvn4D7IMmtFst0G5yrjDIdIUURyE/KVJ9ImPDgdFSiPJe4lyA5yr+7jDRRcTkhApu8hQtuLj3vmtc2uDdunytU9I1XGH8k2c5gx6vxPh2I6Pewd5CKBLn69ezVUdd5jYVIoRU9EdE+xdwZfPuPljIX/9BxW59gJMlXGHCI1ugcSVzHZOAnwdH/fOX4SBZp0JV55YWK3q3miFsHDOVth6Z3KEV/Fxb2G9xVX2hCvXAVXH7S8yMS2OLLpHyZq38XHvYpZ8uvIZqTpuX/zAXCFA2a9GUnb0O/xZEpGrLiNZGbcfGrdwmhuW+3oxIrn7OhAug85x13xgrYzbC7O+xFnxiWZf25cIi691nN2TH7rhmqel1gHeA+KN0HtJuL1DKAvQfblrNiH2izgUl69WiPXksLOoFLPYg8j+dBBiv1Otc1e6hU8OrrWSZrWqu1JGXJkbvUI4SP/0u9Wdzn6BEXK+1q1c1XG7wZlipV0XykF3AgJ3j2q7BF6KNEstV9v8oTJuJymFgU0PkKeCl/s/QaQIO35WlILMhniN+Q/Vqu4Ah+BQI5DKgUvUZtiuZUaseJgl+mucnarjvsQs4bIWvObDd/OiRfexx04yzuUrjF2q3pGvoBaR7NfyveunQtnnxJFF/vpPbsr1aYSq4z7H/YMnic0fvrmPF8LAXlX1hTCg5erOD3Uf9+msy8PPlDpzD9+vR7Pmg9SaD2AMf+ElXltXparjPpENIWeAI4SCi5iWu4d9PnL/ALh8bWmFVcd9TBINQ4dr7jhGAJEhPO71iV+G+tzQXFeOV9VxH0qmXeKTluPoGMHlsN43816epuea2ipVHffRXOsSTW1pj1VRMLPe+0lWWkW5KpdC1XHvwmdkNsDRNlECyCHkFWmWQa6okmZl3PuEa1nCXVkcTTxyIOMQQirE/loOEJVxbxBLQgugcsSIoYMZxyy5HPtmpdcRMlf3ca9wL0PC3UEIeRwhamtzXa8rrqQuetVxL+EKPjfLo/jgXgr6YB0Hgpj64To8c1XHvcCsQJotPTqmJD6zRkpQfxWOuarjtoXhsstfJs8f+Oxv6DhA2wX4dAWFcKqO2yKFh0wK2Ckcrg0WDv90WQLJ5elv5irjnvl2T5o50JPU/WjCiuY71aWLx5GlnXwqa7WqvwUhIhnBTrM93/Dk/u/vPGNGAv76n2lPWdVxADiPFecxO5FzP4q4uz2v8l9jnSMN/xOmreWqjoOn3pN3jyf0Pgg+0X33CkOgGcqBJxAZh3Ol6jhEWooapyQcQRNyhDszWZXWOMRNMuN/iTzvAvVisQGVcXMosZQTF25bF22//xTDGkrUA8qiz5L7v/hIlB8i4i3/14U4d+tWtVsn9QMixZ30TktNy1GcadKsUErT72tavSTnhpAoRF0eK+7vBhkXB6BZHSi+TRBRWNupxX9wtNK7T4rrIrpnZZJZwicxLYRhE4l1mamftlXVKAORsArcy0xmcS9j46LicTMdwoRuLC0w78tPk9zs87E2dRRaVxrNMMBPd5nxT1jHCcD8cZYKEAaftNCsdn4fifTRhmZ1ltV+PB0Hrtw94vJeil3oFk6SmE/QLb6V7X2rjBMxN7Men5pBMs6vCavd20ZG6bl7pFmd56rymIwD4W4Y2KMGiiOsaJc+rIrL7fKSkz5Zxklja02+lPhUeTL27ZJmvSt/hLtHYn+uQs/HZRw/FpvEwl1r4YjL3P/akO/CUX9T3ccJq94yqbiVmcnck91S71Y7xsnK7CcLtA8TrSyu8K8sjn63vZz4HPjFHBfI3YVnbqI6rqHH5flj+U3AdinWLtkpQy8OagZ3i3K2tz+yjoN2BY0Nux2y7x/ioG5ttEvCcNlj0kQZJ+At/7EpswT45PMOJdrUfHKZTzuzjZ9xSMh511Bl6RZigkFclzBwyaYRk7SqIh6aLH82McmBc6TGvvTsRwMEVTvnWm+O/kRLGUHYJXwpLgAL5qQv3UBMzeWCniap4/71P5tradtWIWLi/JcOA8VCSe2SMztAj6/j4MfCJ9CmfOkSEu4fNzvcJuUgPcr87wvN/BQZJ8wf1F51YfuxaJdfb1Hk90vvV3RmnIxDwtruHgHaxed/2T53AxAtxHWJ/VcfqYx7hvOp+ETs9x664Mps6Ba7e+2OSbjjMw6kWTVl/WOPsBdPRku4XGLY9PZxUtZC8e6pSK782PkVRDUzsKK1MxMOH0/zXFuxSrYUZh9SshHZ9of4fEchczEfyQRPDn9ZjsVmzJAfrUhvu10Qivy00hqN5/z25L4/0YMtYs7MJXn/nbu+Jy63frni8SfgFs2Fzg6Ts6oSe7rNNc39gyFSdovdnWUD7h4v5BiIw0msKuAK+KSzdX7nS9ulT7F/sc3T+S/iuhCtv4Qgpsc4lYLLYZDQm/gkzvU+fV0hpFtu7r65UPWOk5wcnh4toe8W6Bt3tmjBp9jzYs8mEAYRP9R93I4zVwIlDpgZtqaxXk2Lky8mZUWXu4Ta9XVYM+vjglC0eaNNnCQxvL6gIayRs+9kt0Y1LfGqaaFZOcrTbZb3vejn3eXF4XstFy2IdUIdB0KzwuXXl/RqNEOhWb3wXHorBj5dqu7mxBgn+Gy8KDovsUfaxceUc79Fe8lyWCdlHAjE9etAcgnSE/wibLVbEtHclFS4lDtualbVk01jLNvH0x7CIgwf3CNFKR0uOv59hQb1GSauL+2r3tNS+rkOCx388Pyz4HJcDSWG5aWGOjEd58k+8SLIS2j74hPvuzR9RuIKCP1FB35iHQe+aCam/OKMcPfI/S8Nf27BXIFowzvHjMq496EGtEPevjNQc5kwvCtEIQzzB9zF9dvJGYeII9EMeXu5NevcrJ5vWH7XjnJxUWNHdoQFR7PMvDxpBZohmrw3zz7zOCddsUF9Fg0pO1Zl6/Yhsc6smphEpZNWXG5zTBwlVfvwlTGxk4Oa5lfx/a6IZt4Jp276sDZ8HkOd8NPrOGh6l3F5O2ROECngE2hBNYXh8FTL42Byt1zWvl4lyizDm/RNEXQwdamMgXAe5qcex4osCNuXrGaIJxacK47iSITVZVfg1M6qpXsk3b0o/efaITa
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import Image, display, Audio, Markdown\n",
    "import base64\n",
    "\n",
    "IMAGE_PATH = \"data/triangle.png\"\n",
    "\n",
    "# Preview image for context\n",
    "display(Image(IMAGE_PATH))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Base64 Image Processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "To find the area of the triangle, we can use Heron's formula. Heron's formula states that the area of a triangle with sides of length \\(a\\), \\(b\\), and \\(c\\) is:\n",
       "\n",
       "\\[ \\text{Area} = \\sqrt{s(s-a)(s-b)(s-c)} \\]\n",
       "\n",
       "where \\(s\\) is the semi-perimeter of the triangle:\n",
       "\n",
       "\\[ s = \\frac{a + b + c}{2} \\]\n",
       "\n",
       "For the given triangle, the side lengths are \\(a = 5\\), \\(b = 6\\), and \\(c = 9\\).\n",
       "\n",
       "First, calculate the semi-perimeter \\(s\\):\n",
       "\n",
       "\\[ s = \\frac{5 + 6 + 9}{2} = \\frac{20}{2} = 10 \\]\n",
       "\n",
       "Now, apply Heron's formula:\n",
       "\n",
       "\\[ \\text{Area} = \\sqrt{10(10-5)(10-6)(10-9)} \\]\n",
       "\\[ \\text{Area} = \\sqrt{10 \\cdot 5 \\cdot 4 \\cdot 1} \\]\n",
       "\\[ \\text{Area} = \\sqrt{200} \\]\n",
       "\\[ \\text{Area} = 10\\sqrt{2} \\]\n",
       "\n",
       "So, the area of the triangle is \\(10\\sqrt{2}\\) square units."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Open the image file and encode it as a base64 string\n",
    "def encode_image(image_path):\n",
    "    with open(image_path, \"rb\") as image_file:\n",
    "        return base64.b64encode(image_file.read()).decode(\"utf-8\")\n",
    "\n",
    "base64_image = encode_image(IMAGE_PATH)\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "    model=MODEL,\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": \"You are a helpful assistant that responds in Markdown. Help me with my math homework!\"},\n",
    "        {\"role\": \"user\", \"content\": [\n",
    "            {\"type\": \"text\", \"text\": \"What's the area of the triangle?\"},\n",
    "            {\"type\": \"image_url\", \"image_url\": {\n",
    "                \"url\": f\"data:image/png;base64,{base64_image}\"}\n",
    "            }\n",
    "        ]}\n",
    "    ],\n",
    "    temperature=0.0,\n",
    ")\n",
    "\n",
    "display(Markdown(response.choices[0].message.content))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### URL Image Processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "To find the area of the triangle, we can use Heron's formula. First, we need to find the semi-perimeter of the triangle.\n",
       "\n",
       "The sides of the triangle are 6, 5, and 9.\n",
       "\n",
       "1. Calculate the semi-perimeter \\( s \\):\n",
       "\\[ s = \\frac{a + b + c}{2} = \\frac{6 + 5 + 9}{2} = 10 \\]\n",
       "\n",
       "2. Use Heron's formula to find the area \\( A \\):\n",
       "\\[ A = \\sqrt{s(s-a)(s-b)(s-c)} \\]\n",
       "\n",
       "Substitute the values:\n",
       "\\[ A = \\sqrt{10(10-6)(10-5)(10-9)} \\]\n",
       "\\[ A = \\sqrt{10 \\cdot 4 \\cdot 5 \\cdot 1} \\]\n",
       "\\[ A = \\sqrt{200} \\]\n",
       "\\[ A = 10\\sqrt{2} \\]\n",
       "\n",
       "So, the area of the triangle is \\( 10\\sqrt{2} \\) square units."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = client.chat.completions.create(\n",
    "    model=MODEL,\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": \"You are a helpful assistant that responds in Markdown. Help me with my math homework!\"},\n",
    "        {\"role\": \"user\", \"content\": [\n",
    "            {\"type\": \"text\", \"text\": \"What's the area of the triangle?\"},\n",
    "            {\"type\": \"image_url\", \"image_url\": {\n",
    "                \"url\": \"https://upload.wikimedia.org/wikipedia/commons/e/e2/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png\"}\n",
    "            }\n",
    "        ]}\n",
    "    ],\n",
    "    temperature=0.0,\n",
    ")\n",
    "\n",
    "display(Markdown(response.choices[0].message.content))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Video Processing\n",
    "While it's not possible to directly send a video to the API, GPT-4o can understand videos if you sample frames and then provide them as images. It performs better at this task than GPT-4 Turbo.\n",
    "\n",
    "Since GPT-4o in the API does not yet support audio-in (as of May 2024), we'll use a combination of GPT-4o and Whisper to process both the audio and visual for a provided video, and showcase two usecases:\n",
    "1. Summarization\n",
    "2. Question and Answering\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup for Video Processing\n",
    "We'll use two python packages for video processing - opencv-python and moviepy. \n",
    "\n",
    "These require [ffmpeg](https://ffmpeg.org/about.html), so make sure to install this beforehand. Depending on your OS, you may need to run `brew install ffmpeg` or `sudo apt install ffmpeg`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install opencv-python --quiet\n",
    "%pip install moviepy --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Process the video into two components: frames and audio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cv2\n",
    "from moviepy.editor import VideoFileClip\n",
    "import time\n",
    "import base64\n",
    "\n",
    "# We'll be using the OpenAI DevDay Keynote Recap video. You can review the video here: https://www.youtube.com/watch?v=h02ti0Bl6zk\n",
    "VIDEO_PATH = \"data/keynote_recap.mp4\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MoviePy - Writing audio in data/keynote_recap.mp3\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                      \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MoviePy - Done.\n",
      "Extracted 218 frames\n",
      "Extracted audio to data/keynote_recap.mp3\n"
     ]
    }
   ],
   "source": [
    "def process_video(video_path, seconds_per_frame=2):\n",
    "    base64Frames = []\n",
    "    base_video_path, _ = os.path.splitext(video_path)\n",
    "\n",
    "    video = cv2.VideoCapture(video_path)\n",
    "    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))\n",
    "    fps = video.get(cv2.CAP_PROP_FPS)\n",
    "    frames_to_skip = int(fps * seconds_per_frame)\n",
    "    curr_frame=0\n",
    "\n",
    "    # Loop through the video and extract frames at specified sampling rate\n",
    "    while curr_frame < total_frames - 1:\n",
    "        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)\n",
    "        success, frame = video.read()\n",
    "        if not success:\n",
    "            break\n",
    "        _, buffer = cv2.imencode(\".jpg\", frame)\n",
    "        base64Frames.append(base64.b64encode(buffer).decode(\"utf-8\"))\n",
    "        curr_frame += frames_to_skip\n",
    "    video.release()\n",
    "\n",
    "    # Extract audio from video\n",
    "    audio_path = f\"{base_video_path}.mp3\"\n",
    "    clip = VideoFileClip(video_path)\n",
    "    clip.audio.write_audiofile(audio_path, bitrate=\"32k\")\n",
    "    clip.audio.close()\n",
    "    clip.close()\n",
    "\n",
    "    print(f\"Extracted {len(base64Frames)} frames\")\n",
    "    print(f\"Extracted audio to {audio_path}\")\n",
    "    return base64Frames, audio_path\n",
    "\n",
    "# Extract 1 frame per second. You can adjust the `seconds_per_frame` parameter to change the sampling rate\n",
    "base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/jpeg": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAIBAQEBAQIBAQECAgICAgQDAgICAgUEBAMEBgUGBgYFBgYGBwkIBgcJBwYGCAsICQoKCgoKBggLDAsKDAkKCgr/2wBDAQICAgICAgUDAwUKBwYHCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgr/wAARCALQBQADASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD4Dooor6g/lcKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooA
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {
      "image/jpeg": {
       "width": 600
      }
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "                <audio  controls=\"controls\" >\n",
       "                    <source src=\"data:audio/mpeg;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//tQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAASW5mbwAAAA8AAB+PAAziXQADBggLDRASFRcaHB8hJCYpKy4wMzU4Oj1AQkZIS01QUlVXWlxfYWRmaWtucHN1eHp9gIKFh4qMkJKVl5qcn6GkpqmrrrCztbi6vcDCxcfKzM/R1Nba3N/h5Obp6+7w8/X4+v0AAAAATGF2YzU4LjU0AAAAAAAAAAAAAAAAJAU0AAAAAAAM4l1dmhK4AAAAAAAAAAAAAAAAAAAAAP/7EGQAD/AAAGkAAAAIAAANIAAAAQAAAaQAAAAgAAA0gAAABAUQKxYyaP/mjIKIFYqaMnPzJpUx8eTAfuTAz4CD8MHfgy0p0DOUWP4gqYR24rJ+wBBIBamBwyjXBuQSvMpccm1qsYxi//sSZCIP8AAAaQAAAAgAAA0gAAABAAABpAAAACAAADSAAAAEgGaL0VOC5ASBJQ73I/uwxtkNYvd+/RVrTSoAB0RmACncQCJRQfBmKSihL11Oos9n0a7g4zAS7oGjySnpFxsZQ7Tm0Mcx//sQZESP8AAAaQAAAAgAAA0gAAABAAABpAAAACAAADSAAAAE3TJ0QNh4UGD3mm/16dfx+v//Q4vUTMLN7jBhSjGpl6m+rf/31cf8KT8mHhogqBBzeILNZ6z1N/1ABCDSABeAHgkuAj7/+xJkZo/wAABpAAAACAAADSAAAAEAAAGkAAAAIAAANIAAAASU0396/z37EUAAMx+BhCXhjkqQO+Yvc/i/9X+uAANQTYAAtuzjKHNKbjIteSUABAmVBlUAAbbRWl/w/eARcwXJouRnOvr/+xBkiQMwAABpAAAACAAADSAAAAEBJALXAARgICSAWuAAjAS2fEcVKwuIAEVzoyYb0IjKSXfnACGmSwRGjorgyt7ABw4PSm1Z5TVH05vj/0lEQEGcwIWzzhIDZ4uimyj+z2/sUgAACP/7EkShgZBtAMCAIxgCDqDYQARgCEOAGxEgjECgSINi2BEYBEYAGGqQo3DeUoPytMi2CjEUUdEoWP3w+WtfpLU3VGagAAEaoI1QABUECBQJcMdEelaa9fVyn9qCkVR9i6pKU9BAaSbJ2//7EGSeifC+BkbJJiiQFKDIkATKBAIoSRgEgExARgMiwAM0CEWjHD1kLjh8M1wG/OUmswcUsmSmTHhu6Nk8A9pbiQ7flrveP/rmagAAgCBgAAo9ByPjROYH/PD2gMCSJZq01tEAYACF//sSZJiBEJsGRoGBWYAUQMjVDSUSAmQbMaEEomBDAiW0EJgEr7qPLIaFwAASIkAKnYhLlUi9Sg8PGrRd2rvfg9xezM6C5CqgITKzE2caRUBpiBT6qa2uxlSrHbrsgDuiLPT46d+TagAC//sQZJUBENcGSWmBOYAQYNjQCYMSA0gZIMCxYIBMgiW0EKAEFmKjS+4yk1YHabn3bqrLuz9d6AAAAIIGXAAASkkpyknDxbAVykexxqjBqTT4zQ2kNOayu6j7YE8qDUpSVWOxEfjDZ2r/+xJkiomRFAhHSG8YQBgA2U0EZRID4CEewbxBAFMDZGQmIEiYLu6RztCJI8IHEiR/GgCBeRIwXcehy6y+pKYZid7FHVylKrlor4RcoJQIrQgiD6gbWYMZF4afGJdL7fV7awAAgGDbEGb/+xBkdYkRGghH0G8oQB2CSPklojID8GkhJ4RMwFMDZfQShCATU8BM4F9/0YP6TqiF1GHrdOdmtP4VrAAOMNRgACAOBAqy2WnwDj6aUWbEu+EdkVocghBlwjjlwTgWGT5bIwKysdFDOP/7EmRdCZD+B8poLDgoFyI40DwlWARUSRonhE4AVgNkJBSUEGB6svFLuTf9pCB7iFvNohiUNFTSmDFYBciojvz6XI9/rVuH1UAAz4BD8xdQ84G0SSktU73BsfvLTt1pgIn9PqKGCBsYNv/7EGRIDRDuCUhB7BDAGsDJSgWFBANcHyRI4SAAawMlqYSITBHfQNUeVioFLFPB+wjU56jSE4m/o/1gIYdTE2E4HrBaS4PDIRNYClS2n3dV+6ERjPhj9HcUgJ0xGDoETxRK6BmrwEqt//sSZDKBkOgaSsNYKCgaYNkRaywAA7hHJs0w4kBWg2TZlhRIdjn5Ky6pLCh8JnLBNgoY/Oq8VGAeknv8uyJ6UFb/y/6hwEAmIoEwdMoMExQP/6P61THlKnWYg0PZUTbEpbSctv5z7CGY//sQZB+PcNQRyYMaOCAPwNmFPGIVA0RHJA0ArEA9g2VBhIhI69eAT/00w08e6RLREEgi2M+7a7iNGBmiR3nQLHy+q/SZ97he0vdUeZyUEsMSII5tDECd9N/6L/4p8AG0JQWmPjAjA1P/+xJkF48wpQbKAxpYIA8iOWBhIhICRBswDCQCwDiDZhTxAMz4WV7G8r/UbKAUomIsT/yvqMHKX5KXDdj1UXoeLloQNgtlf+n+mshUnyJEPUviwuJcb5LyoMAmrB2GMjWD5QM59ApLWiD/+xBkGAfwhAbMgww4kA7g2VBhghIByBs0rCQC4D6DZYD0iEh3BezEBEdS4aW9Ho/l1SoV0fTnCLoUQT2+DLnch9nGUlWKw0h+eF6qEbyzuj0wCCdAMyGqFbR6sNl+r5X+uETMFAFsQP/7EmQbD/B/BkyBuUgAD4D5YGFnEgIEGzYMJKJAO4PmAPMcSJiSnwgy92jyqqhiieVBnlLNDvUFiv/bYoGyb7E9jn02Ecaj//ztYOsIMAIJI0DYp5X/lv6PaYCkrQWChZDuFH+70/1V6P/7EGQeD/BvBk2DBiiQDsDJkBnnBAHcDzYMYMAAO4NmAYScSAwnpBhQedoQLgdP38X/pwHSFVaVZGWUCc//5L+nhxwrox7mdQQzVCcFiKeihkGrRMIxtNFfUHNTd/JKugwZpAr4tYq6//sSZCKP8H4GzYMMEJAOAJmAYSASAeAhNAwMRmA6g2XAnCQAON8AGL/8pUEaBzgXisjYdAg46P+VmiVgKSaYPOxjOdYR1Xf14liqP4UqwFpI+3g6sR/8tYiBWglrwIw32xvpOoGYWnXW//sQZCcP8H0GzoMPKJAOYNmQPAIwAeQbOAw84kA+A2YBhKxITkGspEkOex1qqBxv/K1gxAoYBRG0pqisLJA2W9v/6IISsvwAUVzIeEUhE3/p/pWoDpU4N2keFAi53j08TR/hi6hqvoT/+xJkKo/wgQhNgwgRmA6A2bBh6hICLBs4DDziQD4DZsDzKIACDZyARR4aNu/ojoiiCAJxZUrcU71ELy9+7ys0OmHnpqBGvWL+mBN0j0J8ASGUoCyQx0zmNfPTj3JySPf/bSg0JkEZ5J7/+xBkLQ/whAhNgwwQsA6A2bBhKhICHB84DDziQDsDJsGGIExxPk843FX/01ArY3AIvkUHoOiTKwJpPW7yVcQNFtPkUPm7dVfW//9dpRU5cYMAmraZCNYFGt/7/5T+teAAWGAAKQ3gEv/7EmQvj/CdBk4B+HgAD0DZoGAGMAIwGzYMJCSAOQHmgYxgALi0wN//XdHJjrwE4aG7ZA3TAki//u/VodoSQEAVmWD0r/CHv3+U6xBgw7NJSlWHfBhZH/EP8rUfxecGKXndsY3YWaoyPv/7EGQwjzCUEc6DCTkADyDZ5T1CEQIoGzoMICSAPINmwYMEkKphTgVkFAFrX8PIqcA7j/+p/81ii6DUrPBxuxo1Qif/1ApABohIuxQwY0uGikEb93Lwmh5CrCQ4IjO+gz+j34LaMBUu//sSZDGP8IMGzoMGOQAPIMmgPwYCAlQZOAxhAEA3AycBhhxIQEV/yIjr6G/V8pAyArDAApSPlWmA6q7+usHLOj0dQpKOM4q3mQg03uu8QwfGgSYRuRnNZ5wWzpD/OQaTgsqnm1+AOFt+//sQZDQD8IMG0CsPKJgNgMmwQekCAfAbPAwkBIA4g2cAp5wQI8Rf/I9DAIReyxobS0Aqb/
       "                    Your browser does not support the audio element.\n",
       "                </audio>\n",
       "              "
      ],
      "text/plain": [
       "<IPython.lib.display.Audio object>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Display the frames and audio for context\n",
    "display_handle = display(None, display_id=True)\n",
    "for img in base64Frames:\n",
    "    display_handle.update(Image(data=base64.b64decode(img.encode(\"utf-8\")), width=600))\n",
    "    time.sleep(0.025)\n",
    "\n",
    "Audio(audio_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example 1: Summarization\n",
    "Now that we have both the video frames and the audio, let's run a few different tests to generate a video summary to compare the results of using the models with different modalities. We should expect to see that the summary generated with context from both visual and audio inputs will be the most accurate, as the model is able to use the entire context from the video.\n",
    "\n",
    "1. Visual Summary\n",
    "2. Audio Summary\n",
    "3. Visual + Audio Summary\n",
    "\n",
    "#### Visual Summary\n",
    "The visual summary is generated by sending the model only the frames from the video. With just the frames, the model is likely to capture the visual aspects, but will miss any details discussed by the speaker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "## Video Summary\n",
       "\n",
       "The video appears to be a presentation from OpenAI's DevDay event. Here is a summary based on the provided frames:\n",
       "\n",
       "1. **Introduction**:\n",
       "   - The video starts with the title \"OpenAI DevDay\" and a \"Keynote Recap\" slide.\n",
       "   - The event venue is shown, with attendees gathering and the stage being set up.\n",
       "\n",
       "2. **Keynote Presentation**:\n",
       "   - A speaker, likely a representative from OpenAI, takes the stage to deliver the keynote address.\n",
       "   - The presentation covers several key topics and announcements:\n",
       "     - **GPT-4 Turbo**: Introduction of GPT-4 Turbo, highlighting its capabilities and improvements.\n",
       "     - **JSON Mode**: A feature that allows structured data output in JSON format.\n",
       "     - **Function Calling**: Demonstration of how the model can call functions based on user instructions.\n",
       "     - **Enhanced Features**: Discussion on improvements such as increased context length, better control, and enhanced knowledge.\n",
       "     - **DALL-E 3**: Introduction of DALL-E 3, a new version of the image generation model.\n",
       "     - **Custom Models**: Announcement of the ability to create custom models tailored to specific needs.\n",
       "     - **Token Efficiency**: Explanation of the new token efficiency, with 3x less input tokens and 2x less output tokens.\n",
       "     - **API Enhancements**: Overview of new API features, including threading, retrieval, code interpreter, and function calling.\n",
       "\n",
       "3. **Closing Remarks**:\n",
       "   - The speaker emphasizes the importance of building with natural language and the potential of the new tools and features.\n",
       "   - The presentation concludes with a thank you to the audience and a final display of the OpenAI DevDay logo.\n",
       "\n",
       "4. **Audience Engagement**:\n",
       "   - The video shows the audience's reactions and engagement during the presentation, with applause and focused attention.\n",
       "\n",
       "Overall, the video captures the highlights of OpenAI's DevDay event, showcasing new advancements and features in their AI models and tools."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = client.chat.completions.create(\n",
    "    model=MODEL,\n",
    "    messages=[\n",
    "    {\"role\": \"system\", \"content\": \"You are generating a video summary. Please provide a summary of the video. Respond in Markdown.\"},\n",
    "    {\"role\": \"user\", \"content\": [\n",
    "        \"These are the frames from the video.\",\n",
    "        *map(lambda x: {\"type\": \"image_url\", \n",
    "                        \"image_url\": {\"url\": f'data:image/jpg;base64,{x}', \"detail\": \"low\"}}, base64Frames)\n",
    "        ],\n",
    "    }\n",
    "    ],\n",
    "    temperature=0,\n",
    ")\n",
    "display(Markdown(response.choices[0].message.content))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results are as expected - the model is able to capture the high level aspects of the video visuals, but misses the details provided in the speech.\n",
    "\n",
    "#### Audio Summary\n",
    "The audio summary is generated by sending the model the audio transcript. With just the audio, the model is likely to bias towards the audio content, and will miss the context provided by the presentations and visuals.\n",
    "\n",
    "`{audio}` input for GPT-4o isn't currently available but will be coming soon! For now, we use our existing `whisper-1` model to process the audio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "### Summary\n",
       "\n",
       "Welcome to OpenAI's first-ever Dev Day. Key announcements include:\n",
       "\n",
       "- **GPT-4 Turbo**: A new model supporting up to 128,000 tokens of context, featuring JSON mode for valid JSON responses, improved instruction following, and better knowledge retrieval from external documents or databases. It is also significantly cheaper than GPT-4.\n",
       "- **New Features**: \n",
       "  - **Dolly 3**, **GPT-4 Turbo with Vision**, and a new **Text-to-Speech model** are now available in the API.\n",
       "  - **Custom Models**: A program where OpenAI researchers help companies create custom models tailored to their specific use cases.\n",
       "  - **Increased Rate Limits**: Doubling tokens per minute for established GPT-4 customers and allowing requests for further rate limit changes.\n",
       "- **GPTs**: Tailored versions of ChatGPT for specific purposes, programmable through conversation, with options for private or public sharing, and a forthcoming GPT Store.\n",
       "- **Assistance API**: Includes persistent threads, built-in retrieval, a code interpreter, and improved function calling.\n",
       "\n",
       "OpenAI is excited about the future of AI integration and looks forward to seeing what users will create with these new tools. The event concludes with an invitation to return next year for more advancements."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Transcribe the audio\n",
    "transcription = client.audio.transcriptions.create(\n",
    "    model=\"whisper-1\",\n",
    "    file=open(audio_path, \"rb\"),\n",
    ")\n",
    "## OPTIONAL: Uncomment the line below to print the transcription\n",
    "#print(\"Transcript: \", transcription.text + \"\\n\\n\")\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "    model=MODEL,\n",
    "    messages=[\n",
    "    {\"role\": \"system\", \"content\":\"\"\"You are generating a transcript summary. Create a summary of the provided transcription. Respond in Markdown.\"\"\"},\n",
    "    {\"role\": \"user\", \"content\": [\n",
    "        {\"type\": \"text\", \"text\": f\"The audio transcription is: {transcription.text}\"}\n",
    "        ],\n",
    "    }\n",
    "    ],\n",
    "    temperature=0,\n",
    ")\n",
    "display(Markdown(response.choices[0].message.content))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The audio summary is biased towards the content discussed during the speech, but comes out with much less structure than the video summary.\n",
    "\n",
    "#### Audio + Visual Summary\n",
    "The Audio + Visual summary is generated by sending the model both the visual and the audio from the video at once. When sending both of these, the model is expected to better summarize since it can perceive the entire video at once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "## Video Summary\n",
       "\n",
       "### Event Introduction\n",
       "- **Title:** OpenAI Dev Day\n",
       "- **Keynote Recap:** The event begins with a keynote recap, setting the stage for the announcements.\n",
       "\n",
       "### Venue and Audience\n",
       "- **Location:** The event is held at a venue with a sign reading \"OpenAI DevDay.\"\n",
       "- **Audience:** The venue is filled with attendees, eagerly awaiting the presentations.\n",
       "\n",
       "### Key Announcements\n",
       "1. **GPT-4 Turbo:**\n",
       "   - **Launch:** Introduction of GPT-4 Turbo.\n",
       "   - **Features:** Supports up to 128,000 tokens of context.\n",
       "   - **JSON Mode:** Ensures responses in valid JSON format.\n",
       "   - **Function Calling:** Improved ability to call multiple functions and follow instructions.\n",
       "   - **Knowledge Update:** Knowledge up to April 2023, with ongoing improvements.\n",
       "   - **API Integration:** Available in the API along with DALL-E 3 and a new Text-to-Speech model.\n",
       "   - **Custom Models:** New program for creating custom models tailored to specific use cases.\n",
       "   - **Rate Limits:** Doubling tokens per minute for established GPT-4 customers, with options to request further changes.\n",
       "   - **Pricing:** GPT-4 Turbo is significantly cheaper than GPT-4 (3x less for prompt tokens, 2x less for completion tokens).\n",
       "\n",
       "2. **GPTs:**\n",
       "   - **Introduction:** Tailored versions of ChatGPT for specific purposes.\n",
       "   - **Features:** Combine instructions, expanded knowledge, and actions for better performance and control.\n",
       "   - **Ease of Use:** Can be programmed through conversation, no coding required.\n",
       "   - **Customization:** Options to create private GPTs, share publicly, or make them exclusive to a company.\n",
       "   - **GPT Store:** Launching later this month for sharing and discovering GPTs.\n",
       "\n",
       "3. **Assistance API:**\n",
       "   - **Features:** Includes persistent threads, built-in retrieval, code interpreter, and improved function calling.\n",
       "   - **Integration:** Designed to integrate intelligence into various applications, providing \"superpowers on demand.\"\n",
       "\n",
       "### Closing Remarks\n",
       "- **Future Outlook:** The technology launched today is just the beginning, with more advancements in the pipeline.\n",
       "- **Gratitude:** Thanks to the attendees and a promise of more exciting developments in the future.\n",
       "\n",
       "### Conclusion\n",
       "- **Event End:** The event concludes with applause and a final thank you to the audience."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "## Generate a summary with visual and audio\n",
    "response = client.chat.completions.create(\n",
    "    model=MODEL,\n",
    "    messages=[\n",
    "    {\"role\": \"system\", \"content\":\"\"\"You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown\"\"\"},\n",
    "    {\"role\": \"user\", \"content\": [\n",
    "        \"These are the frames from the video.\",\n",
    "        *map(lambda x: {\"type\": \"image_url\", \n",
    "                        \"image_url\": {\"url\": f'data:image/jpg;base64,{x}', \"detail\": \"low\"}}, base64Frames),\n",
    "        {\"type\": \"text\", \"text\": f\"The audio transcription is: {transcription.text}\"}\n",
    "        ],\n",
    "    }\n",
    "],\n",
    "    temperature=0,\n",
    ")\n",
    "display(Markdown(response.choices[0].message.content))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After combining both the video and audio, we're able to get a much more detailed and comprehensive summary for the event which uses information from both the visual and audio elements from the video.\n",
    "\n",
    "### Example 2: Question and Answering\n",
    "For the Q&A, we'll use the same concept as before to ask questions of our processed video while running the same 3 tests to demonstrate the benefit of combining input modalities:\n",
    "1. Visual Q&A\n",
    "2. Audio Q&A\n",
    "3. Visual + Audio Q&A "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "QUESTION = \"Question: Why did Sam Altman have an example about raising windows and turning the radio on?\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Visual QA:Sam Altman used the example about raising windows and turning the radio on to demonstrate the function calling capabilities of the new model. The example illustrated how the model can interpret and execute specific commands by calling appropriate functions, showcasing its ability to handle complex tasks and integrate with external systems or APIs. This feature enhances the model's utility in practical applications by allowing it to perform actions based on user instructions."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "qa_visual_response = client.chat.completions.create(\n",
    "    model=MODEL,\n",
    "    messages=[\n",
    "    {\"role\": \"system\", \"content\": \"Use the video to answer the provided question. Respond in Markdown.\"},\n",
    "    {\"role\": \"user\", \"content\": [\n",
    "        \"These are the frames from the video.\",\n",
    "        *map(lambda x: {\"type\": \"image_url\", \"image_url\": {\"url\": f'data:image/jpg;base64,{x}', \"detail\": \"low\"}}, base64Frames),\n",
    "        QUESTION\n",
    "        ],\n",
    "    }\n",
    "    ],\n",
    "    temperature=0,\n",
    ")\n",
    "display(Markdown(\"Visual QA:\" + qa_visual_response.choices[0].message.content))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Audio QA:\n",
       "The provided transcription does not include any mention of Sam Altman or an example about raising windows and turning the radio on. Therefore, I cannot provide an answer based on the given transcription."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "qa_audio_response = client.chat.completions.create(\n",
    "    model=MODEL,\n",
    "    messages=[\n",
    "    {\"role\": \"system\", \"content\":\"\"\"Use the transcription to answer the provided question. Respond in Markdown.\"\"\"},\n",
    "    {\"role\": \"user\", \"content\": f\"The audio transcription is: {transcription.text}. \\n\\n {QUESTION}\"},\n",
    "    ],\n",
    "    temperature=0,\n",
    ")\n",
    "display(Markdown(\"Audio QA:\\n\" + qa_audio_response.choices[0].message.content))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Both QA:\n",
       "Sam Altman used the example of raising windows and turning the radio on to demonstrate the improved function calling capabilities of GPT-4 Turbo. The example illustrated how the model can now handle multiple function calls more effectively and follow instructions better. In the demonstration, the model was able to interpret the command to raise the windows and turn the radio on, showing how it can execute multiple actions in response to a single prompt. This highlights the enhanced ability of GPT-4 Turbo to manage complex tasks and provide more accurate and useful responses."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "qa_both_response = client.chat.completions.create(\n",
    "    model=MODEL,\n",
    "    messages=[\n",
    "    {\"role\": \"system\", \"content\":\"\"\"Use the video and transcription to answer the provided question.\"\"\"},\n",
    "    {\"role\": \"user\", \"content\": [\n",
    "        \"These are the frames from the video.\",\n",
    "        *map(lambda x: {\"type\": \"image_url\", \n",
    "                        \"image_url\": {\"url\": f'data:image/jpg;base64,{x}', \"detail\": \"low\"}}, base64Frames),\n",
    "                        {\"type\": \"text\", \"text\": f\"The audio transcription is: {transcription.text}\"},\n",
    "        QUESTION\n",
    "        ],\n",
    "    }\n",
    "    ],\n",
    "    temperature=0,\n",
    ")\n",
    "display(Markdown(\"Both QA:\\n\" + qa_both_response.choices[0].message.content))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Comparing the three answers, the most accurate answer is generated by using both the audio and visual from the video. Sam Altman did not discuss the raising windows or radio on during the Keynote, but referenced an improved capability for the model to execute multiple functions in a single request while the examples were shown behind him.\n",
    "\n",
    "## Conclusion\n",
    "Integrating many input modalities such as audio, visual, and textual, significantly enhances the performance of the model on a diverse range of tasks. This multimodal approach allows for more comprehensive understanding and interaction, mirroring more closely how humans perceive and process information. \n",
    "\n",
    "Currently, GPT-4o in the API supports text and image inputs, with audio capabilities coming soon."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}