{ "cells": [ { "cell_type": "markdown", "id": "c6806af9-68ae-4714-851c-9a967aee0e23", "metadata": {}, "source": [ "# Leveraging model distillation to fine-tune a model\n", "\n", "OpenAI recently released **Distillation**, which lets you use the outputs of a (large) model to fine-tune another (smaller) model. This can significantly reduce the price and latency of specific tasks as you move to a smaller model. In this cookbook we'll look at a dataset, distill the output of gpt-4o into gpt-4o-mini, and show how we can get significantly better results than with a generic, non-distilled gpt-4o-mini.\n", "\n", "We'll also leverage **Structured Outputs** for a classification problem, using an enum list of allowed values. We'll see how a fine-tuned model can benefit from structured outputs and how this affects performance. We'll show that **Structured Outputs** works with all of those models, including the distilled one.\n", "\n", "We'll first analyze the dataset and get the outputs of both gpt-4o and gpt-4o-mini, highlighting the difference in performance between the two models, then proceed to the distillation and analyze the performance of the distilled model." ] }, { "cell_type": "markdown", "id": "5dd8fd2f-dfdf-47c2-9627-02acbe3fb7a2", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "Let's install and load dependencies.\n", "Make sure your OpenAI API key is defined in your environment as \"OPENAI_API_KEY\"; the client will load it directly." ] }, { "cell_type": "code", "execution_count": 8, "id": "e16ed9ef-0220-4f23-a8eb-40813eacf210", "metadata": {}, "outputs": [], "source": [ "! 
pip install openai tiktoken numpy pandas tqdm --quiet" ] }, { "cell_type": "code", "execution_count": 9, "id": "7b643798-3b2b-43e4-bfb5-ebcf74066253", "metadata": {}, "outputs": [], "source": [ "import openai\n", "import json\n", "import tiktoken\n", "from tqdm import tqdm\n", "from openai import OpenAI\n", "import numpy as np\n", "import concurrent.futures\n", "import pandas as pd\n", "\n", "client = OpenAI()" ] }, { "cell_type": "markdown", "id": "246364b6-2fed-4b54-b540-09569a197a6b", "metadata": {}, "source": [ "## Loading and understanding the dataset\n", "\n", "For this cookbook, we'll load the data from the following Kaggle dataset: [https://www.kaggle.com/datasets/zynicide/wine-reviews](https://www.kaggle.com/datasets/zynicide/wine-reviews).\n", "\n", "This dataset has a large number of rows and you're free to run this cookbook on the whole dataset, but as a biased French wine lover, I'll narrow it down to French wines only, to focus on fewer rows and fewer grape varieties.\n", "\n", "We're looking at a classification problem where we'd like to guess the grape variety based on all the other criteria available, including the description, subregion and province, which we'll include in the prompt. This gives the model a lot of information; you're free to remove fields that help the model significantly, such as the region in which the wine was produced, to see whether it still does a good job at finding the grape.\n", "\n", "Let's filter out the grape varieties that have fewer than 5 occurrences in the reviews.\n", "\n", "Let's proceed with a subset of 500 random rows from this dataset." ] }, { "cell_type": "code", "execution_count": 10, "id": "759d1705-2213-443a-9fc3-050bc00177e6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Unnamed: 0</th><th>country</th><th>description</th><th>designation</th><th>points</th><th>price</th><th>province</th><th>region_1</th><th>region_2</th><th>taster_name</th><th>taster_twitter_handle</th><th>title</th><th>variety</th><th>winery</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>95206</th><td>95206</td><td>France</td><td>Full, fat, ripe, perfumed wine that is full of...</td><td>Château de Mercey Premier Cru</td><td>91</td><td>35.0</td><td>Burgundy</td><td>Mercurey</td><td>NaN</td><td>Roger Voss</td><td>@vossroger</td><td>Antonin Rodet 2010 Château de Mercey Premier C...</td><td>Pinot Noir</td><td>Antonin Rodet</td></tr>\n",
"    <tr><th>66403</th><td>66403</td><td>France</td><td>For simple Chablis, this is impressive, rich, ...</td><td>Domaine</td><td>89</td><td>26.0</td><td>Burgundy</td><td>Chablis</td><td>NaN</td><td>Roger Voss</td><td>@vossroger</td><td>William Fèvre 2005 Domaine (Chablis)</td><td>Chardonnay</td><td>William Fèvre</td></tr>\n",
"    <tr><th>71277</th><td>71277</td><td>France</td><td>This 50-50 blend of Marselan and Merlot opens ...</td><td>La Remise</td><td>84</td><td>13.0</td><td>France Other</td><td>Vin de France</td><td>NaN</td><td>Lauren Buzzeo</td><td>@laurbuzz</td><td>Domaine de la Mordorée 2014 La Remise Red (Vin...</td><td>Red Blend</td><td>Domaine de la Mordorée</td></tr>\n",
"    <tr><th>27484</th><td>27484</td><td>France</td><td>The medium-intense nose of this solid and easy...</td><td>Authentic &amp; Chic</td><td>86</td><td>10.0</td><td>France Other</td><td>Vin de France</td><td>NaN</td><td>Lauren Buzzeo</td><td>@laurbuzz</td><td>Romantic 2014 Authentic &amp; Chic Cabernet Sauvig...</td><td>Cabernet Sauvignon</td><td>Romantic</td></tr>\n",
"    <tr><th>124917</th><td>124917</td><td>France</td><td>Fresh, pure notes of Conference pear peel enti...</td><td>NaN</td><td>89</td><td>30.0</td><td>Alsace</td><td>Alsace</td><td>NaN</td><td>Anne Krebiehl MW</td><td>@AnneInVino</td><td>Domaine Vincent Stoeffler 2015 Pinot Gris (Als...</td><td>Pinot Gris</td><td>Domaine Vincent Stoeffler</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Unnamed: 0 country description \\\n", "95206 95206 France Full, fat, ripe, perfumed wine that is full of... \n", "66403 66403 France For simple Chablis, this is impressive, rich, ... \n", "71277 71277 France This 50-50 blend of Marselan and Merlot opens ... \n", "27484 27484 France The medium-intense nose of this solid and easy... \n", "124917 124917 France Fresh, pure notes of Conference pear peel enti... \n", "\n", " designation points price province \\\n", "95206 Château de Mercey Premier Cru 91 35.0 Burgundy \n", "66403 Domaine 89 26.0 Burgundy \n", "71277 La Remise 84 13.0 France Other \n", "27484 Authentic & Chic 86 10.0 France Other \n", "124917 NaN 89 30.0 Alsace \n", "\n", " region_1 region_2 taster_name taster_twitter_handle \\\n", "95206 Mercurey NaN Roger Voss @vossroger \n", "66403 Chablis NaN Roger Voss @vossroger \n", "71277 Vin de France NaN Lauren Buzzeo @laurbuzz \n", "27484 Vin de France NaN Lauren Buzzeo @laurbuzz \n", "124917 Alsace NaN Anne Krebiehl MW @AnneInVino \n", "\n", " title variety \\\n", "95206 Antonin Rodet 2010 Château de Mercey Premier C... Pinot Noir \n", "66403 William Fèvre 2005 Domaine (Chablis) Chardonnay \n", "71277 Domaine de la Mordorée 2014 La Remise Red (Vin... Red Blend \n", "27484 Romantic 2014 Authentic & Chic Cabernet Sauvig... Cabernet Sauvignon \n", "124917 Domaine Vincent Stoeffler 2015 Pinot Gris (Als... 
Pinot Gris \n", "\n", " winery \n", "95206 Antonin Rodet \n", "66403 William Fèvre \n", "71277 Domaine de la Mordorée \n", "27484 Romantic \n", "124917 Domaine Vincent Stoeffler " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/winemag/winemag-data-130k-v2.csv')\n", "df_france = df[df['country'] == 'France']\n", "\n", "# Filter out grape varieties with fewer than 5 reviews. We'd like to identify those too, but they're\n", "# outliers we don't want to optimize for: they would make the enum list too long and could add noise\n", "# for the rest of the dataset, eventually reducing accuracy.\n", "variety_counts = df_france['variety'].value_counts()\n", "varieties_less_than_five_list = variety_counts[variety_counts < 5].index.tolist()\n", "df_france = df_france[~df_france['variety'].isin(varieties_less_than_five_list)]\n", "\n", "# Work on a random subset of 500 rows\n", "df_france_subset = df_france.sample(n=500)\n", "df_france_subset.head()" ] }, { "cell_type": "markdown", "id": "b96cd12f-cbdf-46af-958f-3d553598be1d", "metadata": {}, "source": [ "Let's retrieve all grape varieties to include them in the prompt and in our Structured Outputs enum list." 
] }, { "cell_type": "code", "execution_count": 11, "id": "06f5dbea-549a-455d-9b6e-051de9d38723", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Gewürztraminer', 'Pinot Gris', 'Gamay',\n", " 'Bordeaux-style White Blend', 'Champagne Blend', 'Chardonnay',\n", " 'Petit Manseng', 'Riesling', 'White Blend', 'Pinot Blanc',\n", " 'Alsace white blend', 'Bordeaux-style Red Blend', 'Malbec',\n", " 'Tannat-Cabernet', 'Rhône-style Red Blend', 'Ugni Blanc-Colombard',\n", " 'Savagnin', 'Pinot Noir', 'Rosé', 'Melon',\n", " 'Rhône-style White Blend', 'Pinot Noir-Gamay', 'Colombard',\n", " 'Chenin Blanc', 'Sylvaner', 'Sauvignon Blanc', 'Red Blend',\n", " 'Chenin Blanc-Chardonnay', 'Cabernet Sauvignon', 'Cabernet Franc',\n", " 'Syrah', 'Sparkling Blend', 'Duras', 'Provence red blend',\n", " 'Tannat', 'Merlot', 'Malbec-Merlot', 'Chardonnay-Viognier',\n", " 'Cabernet Franc-Cabernet Sauvignon', 'Muscat', 'Viognier',\n", " 'Picpoul', 'Altesse', 'Provence white blend', 'Mondeuse',\n", " 'Grenache-Syrah', 'G-S-M', 'Pinot Meunier', 'Cabernet-Syrah',\n", " 'Vermentino', 'Marsanne', 'Colombard-Sauvignon Blanc',\n", " 'Gros and Petit Manseng', 'Jacquère', 'Negrette', 'Mauzac',\n", " 'Pinot Auxerrois', 'Grenache', 'Roussanne', 'Gros Manseng',\n", " 'Tannat-Merlot', 'Aligoté', 'Chasselas', \"Loin de l'Oeil\",\n", " 'Malbec-Tannat', 'Carignan', 'Colombard-Ugni Blanc', 'Sémillon',\n", " 'Syrah-Grenache', 'Sciaccerellu', 'Auxerrois', 'Mourvèdre',\n", " 'Tannat-Cabernet Franc', 'Braucol', 'Trousseau',\n", " 'Merlot-Cabernet Sauvignon'], dtype='