{ "cells": [ { "cell_type": "markdown", "id": "c6806af9-68ae-4714-851c-9a967aee0e23", "metadata": {}, "source": [ "# Leveraging model distillation to fine-tune a model\n", "\n", "OpenAI recently released **Distillation**, which lets you use the outputs of a (large) model to fine-tune another (smaller) model. This can significantly reduce cost and latency for specific tasks as you move to a smaller model. In this cookbook we'll look at a dataset, distill the outputs of gpt-4o into gpt-4o-mini, and show that the distilled model gets significantly better results than a generic, non-distilled gpt-4o-mini.\n", "\n", "We'll also leverage **Structured Outputs** for a classification problem, using an enum of allowed values. We'll see how a fine-tuned model can benefit from Structured Outputs and how this affects performance. We'll show that **Structured Outputs** work with all of those models, including the distilled one.\n", "\n", "We'll first analyze the dataset and get the outputs of both gpt-4o and gpt-4o-mini, highlighting the difference in performance between the two models. We'll then proceed to the distillation and analyze the performance of the distilled model." ] }, { "cell_type": "markdown", "id": "5dd8fd2f-dfdf-47c2-9627-02acbe3fb7a2", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "Let's install and load dependencies.\n", "Make sure your OpenAI API key is set in your environment as \"OPENAI_API_KEY\"; the client will load it automatically." ] }, { "cell_type": "code", "execution_count": 8, "id": "e16ed9ef-0220-4f23-a8eb-40813eacf210", "metadata": {}, "outputs": [], "source": [ "! 
pip install openai tiktoken numpy pandas tqdm --quiet" ] }, { "cell_type": "code", "execution_count": 9, "id": "7b643798-3b2b-43e4-bfb5-ebcf74066253", "metadata": {}, "outputs": [], "source": [ "import openai\n", "import json\n", "import tiktoken\n", "from tqdm import tqdm\n", "from openai import OpenAI\n", "import numpy as np\n", "import concurrent.futures\n", "import pandas as pd\n", "\n", "client = OpenAI()" ] }, { "cell_type": "markdown", "id": "246364b6-2fed-4b54-b540-09569a197a6b", "metadata": {}, "source": [ "## Loading and understanding the dataset\n", "\n", "For this cookbook, we'll load the data from the following Kaggle dataset: [https://www.kaggle.com/datasets/zynicide/wine-reviews](https://www.kaggle.com/datasets/zynicide/wine-reviews).\n", "\n", "This dataset has a large number of rows and you're free to run this cookbook on the whole data, but as a biased French wine lover, I'll narrow the dataset down to French wines only, so that we work with fewer rows and fewer grape varieties.\n", "\n", "We're looking at a classification problem where we'd like to guess the grape variety based on all the other criteria available, including the description, subregion and province, which we'll include in the prompt. This gives the model a lot of information. You're free to remove some of the fields that significantly help the model, such as the region in which the wine was produced, to see whether it still does a good job at finding the grape.\n", "\n", "Let's filter out the grape varieties that have fewer than 5 occurrences in reviews.\n", "\n", "Let's proceed with a subset of 500 random rows from this dataset." ] }, { "cell_type": "code", "execution_count": 10, "id": "759d1705-2213-443a-9fc3-050bc00177e6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "|  | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 95206 | 95206 | France | Full, fat, ripe, perfumed wine that is full of... | Château de Mercey Premier Cru | 91 | 35.0 | Burgundy | Mercurey | NaN | Roger Voss | @vossroger | Antonin Rodet 2010 Château de Mercey Premier C... | Pinot Noir | Antonin Rodet |\n",
"| 66403 | 66403 | France | For simple Chablis, this is impressive, rich, ... | Domaine | 89 | 26.0 | Burgundy | Chablis | NaN | Roger Voss | @vossroger | William Fèvre 2005 Domaine (Chablis) | Chardonnay | William Fèvre |\n",
"| 71277 | 71277 | France | This 50-50 blend of Marselan and Merlot opens ... | La Remise | 84 | 13.0 | France Other | Vin de France | NaN | Lauren Buzzeo | @laurbuzz | Domaine de la Mordorée 2014 La Remise Red (Vin... | Red Blend | Domaine de la Mordorée |\n",
"| 27484 | 27484 | France | The medium-intense nose of this solid and easy... | Authentic & Chic | 86 | 10.0 | France Other | Vin de France | NaN | Lauren Buzzeo | @laurbuzz | Romantic 2014 Authentic & Chic Cabernet Sauvig... | Cabernet Sauvignon | Romantic |\n",
"| 124917 | 124917 | France | Fresh, pure notes of Conference pear peel enti... | NaN | 89 | 30.0 | Alsace | Alsace | NaN | Anne Krebiehl MW | @AnneInVino | Domaine Vincent Stoeffler 2015 Pinot Gris (Als... | Pinot Gris | Domaine Vincent Stoeffler |\n", "