openai-cookbook/examples/SDG1.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "I_IHQTO8xXBn"
   },
   "source": [
    "# Synthetic Data generation (Part 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VBoxtnxVdTWZ"
   },
   "source": [
    "\n",
    "Synthetic data generation using large language models (LLMs) offers a powerful solution to a commonly faced problem: the availability of high-quality, diverse, and privacy-compliant data. This could be used in a number of scenarios such as training a data science  machine learning model (SVMs, decision trees, KNN's), finetuning a different GPT model on the data, as a solution to the coldstart problem, helping build compelling demos/apps with realistic data, scenario testing etc.\n",
    "\n",
    "There are a number of key drivers which may see you wanting to leverage synthetic data. \n",
    "1. Human data may have privacy restrictions and/or identifiable data within it which we do not want to be used. \n",
    "2. Synthetic data can be much more structured and therefore easier to manipulate than real data. \n",
    "3. In domains where data is sparse or data of certain categories is sparse we may want to augment the data. \n",
    "4. When dealing with imbalanced datasets or datasets which lack diversity, we may want to create data to improve the richness of our datasets.\n",
    "\n",
    "Unlike traditional data augmentation or manual data creation methods, using LLMs allows for the generation of rich, nuanced, and contextually relevant datasets that can significantly enhance it's usefulness to enterprises and developers.\n",
    "\n",
    "We split this tutorial into 2 parts. In this cookbook, we will have the following agenda:\n",
    "1. CSV with a structured prompt\n",
    "2. CSV with a Python program\n",
    "3. Multitable CSV with a python program\n",
    "4. Simply creating textual data\n",
    "5. Dealing with imbalanced or non-diverse textual data\n",
    "while in part 2, we will look at prompting strategies for getting better textual data.\n",
    "\n",
    "The last two in particular are useful for creating synthetic data to finetune another GPT model. For example using higher quality data produced by `gpt-4o` to finetune the cheaper and quicker `gpt-3.5-turbo` for improved performance while reducing costs.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NE9Rr29zlRsA"
   },
   "source": [
    "### Getting setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "YGncxYrgQ8eb"
   },
   "outputs": [],
   "source": [
    "%pip install openai\n",
    "%pip install pandas\n",
    "%pip install scikit-learn\n",
    "%pip install matplotlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "id": "8pzwvE-YQPtU"
   },
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "import os\n",
    "import re\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.cluster import KMeans\n",
    "import matplotlib.pyplot as plt\n",
    "import json\n",
    "import matplotlib\n",
    "\n",
    "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "B8eAx4-JxaZB"
   },
   "source": [
    "### 1. CSV with a structure prompt\n",
    "Here we create data in the simplest way. You can quickly generate data by addressing 3 key points: telling it the format of the data (CSV), the schema, and useful information regarding how columns relate (the LLM will be able to deduce this from the column names but a helping hand will improve performance)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "dqbvepd0n4vS",
    "outputId": "8735cacc-baa5-463e-938c-783e6b508b00"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "```csv\n",
      "id,house_size_m2,house_price,location,number_of_bedrooms\n",
      "1,50,150000,Suburban,2\n",
      "2,75,250000,City Center,3\n",
      "3,100,350000,Suburban,4\n",
      "4,120,450000,Suburban,4\n",
      "5,80,300000,City Center,3\n",
      "6,90,400000,City Center,3\n",
      "7,150,600000,Premium Area,5\n",
      "8,200,750000,Premium Area,5\n",
      "9,55,180000,Suburban,2\n",
      "10,300,950000,Premium Area,6\n",
      "```\n"
     ]
    }
   ],
   "source": [
    "datagen_model = \"gpt-4o-mini\"\n",
    "question = \"\"\"\n",
    "Create a CSV file with 10 rows of housing data.\n",
    "Each row should include the following fields:\n",
    " - id (incrementing integer starting at 1)\n",
    " - house size (m^2)\n",
    " - house price\n",
    " - location\n",
    " - number of bedrooms\n",
    "\n",
    "Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense). Also only respond with the CSV.\n",
    "\"\"\"\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "  model=datagen_model,\n",
    "  messages=[\n",
    "    {\"role\": \"system\", \"content\": \"You are a helpful assistant designed to generate synthetic data.\"},\n",
    "    {\"role\": \"user\", \"content\": question}\n",
    "  ]\n",
    ")\n",
    "res = response.choices[0].message.content\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6ym0NiIyxiVj"
   },
   "source": [
    "### 2. CSV with a Python program\n",
    "The issue with generating data directly is we are limited in the amount of data we can generate because of the context. Instead what we can do is ask the LLM to generate a python program to generate the synthetic data. This allows us to scale to much more data while also providing us a view into how the data was generated by inspecting the python program.\n",
    "\n",
    "This would then let us edit the python program as we desire while giving us a good basis to start from.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "2yDuwB5ZxWS3",
    "outputId": "dcbe1093-90f0-4f60-d9c6-34bf679bb092"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Certainly! Below is a Python program that generates synthetic housing data according to your specifications. We will create a pandas DataFrame with the defined fields and characteristics.\n",
      "\n",
      "```python\n",
      "import pandas as pd\n",
      "import random\n",
      "\n",
      "def generate_housing_data(num_rows):\n",
      "    data = []\n",
      "    \n",
      "    locations = [\n",
      "        ('City Center', 10000, 150),  # (location name, base price per m², base size)\n",
      "        ('Suburban Area', 8000, 100),\n",
      "        ('Country Side', 5000, 80),\n",
      "        ('Coastal Region', 12000, 110),\n",
      "        ('Urban Neighborhood', 9000, 130)\n",
      "    ]\n",
      "    \n",
      "    for i in range(1, num_rows + 1):\n",
      "        # Randomly pick a location\n",
      "        location, base_price_per_m2, base_size = random.choice(locations)\n",
      "        \n",
      "        # Generate number of bedrooms (1 to 5)\n",
      "        number_of_bedrooms = random.randint(1, 5)\n",
      "        \n",
      "        # Calculate house size based on the number of bedrooms\n",
      "        house_size = base_size + (10 * number_of_bedrooms) + random.randint(-5, 15)  # Adding some noise\n",
      "        \n",
      "        # Calculate house price based on house size and location\n",
      "        house_price = base_price_per_m2 * house_size + random.randint(-5000, 10000)  # Adding some noise\n",
      "\n",
      "        # Append the generated data to the list\n",
      "        data.append({\n",
      "            'id': i,\n",
      "            'house_size_m2': house_size,\n",
      "            'house_price': house_price,\n",
      "            'location': location,\n",
      "            'number_of_bedrooms': number_of_bedrooms\n",
      "        })\n",
      "\n",
      "    # Create a pandas DataFrame\n",
      "    df = pd.DataFrame(data)\n",
      "    return df\n",
      "\n",
      "# Generate 100 rows of housing data\n",
      "housing_data_df = generate_housing_data(100)\n",
      "\n",
      "# Show the result\n",
      "print(housing_data_df)\n",
      "```\n",
      "\n",
      "### Explanation:\n",
      "- The `generate_housing_data` function creates synthetic housing data for a specified number of rows (`num_rows`).\n",
      "- We define different locations with corresponding base prices per square meter and average house sizes.\n",
      "- For each house, we randomly select a location, number of bedrooms, and calculate house size and price to ensure a sensible correlation between the values.\n",
      "- Finally, we create a pandas DataFrame from the generated data and return it.\n",
      "\n",
      "You can run this program in your Python environment, and it will output a DataFrame containing 100 rows of synthetic housing data.\n"
     ]
    }
   ],
   "source": [
    "question = \"\"\"\n",
    "Create a Python program to generate 100 rows of housing data.\n",
    "I want you to at the end of it output a pandas dataframe with 100 rows of data.\n",
    "Each row should include the following fields:\n",
    " - id (incrementing integer starting at 1)\n",
    " - house size (m^2)\n",
    " - house price\n",
    " - location\n",
    " - number of bedrooms\n",
    "\n",
    "Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).\n",
    "\"\"\"\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "  model=datagen_model,\n",
    "  messages=[\n",
    "    {\"role\": \"system\", \"content\": \"You are a helpful assistant designed to generate synthetic data.\"},\n",
    "    {\"role\": \"user\", \"content\": question}\n",
    "  ]\n",
    ")\n",
    "res = response.choices[0].message.content\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to make sure to parse the output of this appropriately as often there may be surrounding text to the python code. We can also explicitly ask it to state all assumptions it made about the data it's generating, however in this circumstance it told us that automatically."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HZaJs7q8xm3L"
   },
   "source": [
    "### 3. Multitable CSV with a python program\n",
    "For more complex relationships however we need to make sure to specify a few more characteristics. \n",
    "\n",
    "To create multiple different datasets which relate to each other (for example housing, location, house type), as before we would need to specify the format, schema and useful information. However, the useful information required to get good performance is higher now. It's case-specific but a good amount of things to describe would be how the datasets relate to each other, addressing the size of the datasets in relation to one another, making sure foreign and primary keys are made appropriately and ideally using previously generated datasets to populate new ones so the actual data values match where necessary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "3TWAhIYIxnbS",
    "outputId": "8f766838-b2f0-419a-a4fb-543d029afce5"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Certainly! Below is a Python program that generates the three specified pandas DataFrames for housing data, location data, and house types. Each DataFrame will include the necessary fields, and the foreign keys will ensure proper relationships among them.\n",
      "\n",
      "```python\n",
      "import pandas as pd\n",
      "import numpy as np\n",
      "\n",
      "# Set random seed for reproducibility\n",
      "np.random.seed(0)\n",
      "\n",
      "# Function to generate location DataFrame\n",
      "def generate_location_data(num_locations):\n",
      "    locations = {\n",
      "        \"id\": range(1, num_locations + 1),\n",
      "        \"country\": np.random.choice(['USA', 'Canada', 'UK'], num_locations),\n",
      "        \"city\": np.random.choice(['New York', 'Toronto', 'London', 'Vancouver', 'Manchester'], num_locations),\n",
      "        \"population\": np.random.randint(50000, 1000000, num_locations),\n",
      "        \"area\": np.random.randint(10000, 500000, num_locations)\n",
      "    }\n",
      "    return pd.DataFrame(locations)\n",
      "\n",
      "# Function to generate house types DataFrame\n",
      "def generate_house_type_data(num_house_types):\n",
      "    house_types = {\n",
      "        \"id\": range(1, num_house_types + 1),\n",
      "        \"house_type\": np.random.choice(['Detached', 'Semi-Detached', 'Terraced', 'Flat'], num_house_types),\n",
      "        \"average_house_type_price\": np.random.randint(100000, 1000000, num_house_types),\n",
      "        \"number_of_houses\": np.random.randint(10, 1000, num_house_types)\n",
      "    }\n",
      "    return pd.DataFrame(house_types)\n",
      "\n",
      "# Function to generate housing data DataFrame\n",
      "def generate_housing_data(num_houses, location_df, house_type_df):\n",
      "    house_sizes = np.random.randint(50, 300, num_houses)  # size in m^2\n",
      "    location_ids = np.random.choice(location_df['id'], num_houses)\n",
      "    house_type_ids = np.random.choice(house_type_df['id'], num_houses)\n",
      "    \n",
      "    # Generate prices based on size, location, and house type\n",
      "    house_prices = (house_sizes * np.random.randint(2000, 5000, num_houses) // 10) + \\\n",
      "                   (location_ids * 1000) + \\\n",
      "                   (house_type_df.loc[house_type_ids - 1, 'average_house_type_price'].values // 4)\n",
      "    \n",
      "    housing_data = {\n",
      "        \"id\": range(1, num_houses + 1),\n",
      "        \"house_size\": house_sizes,\n",
      "        \"house_price\": house_prices,\n",
      "        \"location_id\": location_ids,\n",
      "        \"bedrooms\": np.random.randint(1, 6, num_houses),\n",
      "        \"house_type_id\": house_type_ids\n",
      "    }\n",
      "    \n",
      "    return pd.DataFrame(housing_data)\n",
      "\n",
      "# Generate DataFrames\n",
      "num_locations = 10\n",
      "num_house_types = 4\n",
      "num_houses = 100\n",
      "\n",
      "location_df = generate_location_data(num_locations)\n",
      "house_type_df = generate_house_type_data(num_house_types)\n",
      "housing_df = generate_housing_data(num_houses, location_df, house_type_df)\n",
      "\n",
      "# Display the generated DataFrames\n",
      "print(\"Location DataFrame:\")\n",
      "print(location_df.head(), \"\\n\")\n",
      "\n",
      "print(\"House Types DataFrame:\")\n",
      "print(house_type_df.head(), \"\\n\")\n",
      "\n",
      "print(\"Housing DataFrame:\")\n",
      "print(housing_df.head(), \"\\n\")\n",
      "\n",
      "# Printing the DataFrame shapes\n",
      "print(f\"Shapes: \\nLocation: {location_df.shape}, House Types: {house_type_df.shape}, Housing: {housing_df.shape}\")\n",
      "```\n",
      "\n",
      "### Explanation of the Code:\n",
      "1. **Location DataFrame:** \n",
      "   - Generates random locations with attributes such as country, city, population, and area.\n",
      "  \n",
      "2. **House Types DataFrame:** \n",
      "   - Generates different types of houses along with average prices and quantity available.\n",
      "  \n",
      "3. **Housing DataFrame:** \n",
      "   - Generates housing data with increments on price based on house size, location, and house type, while also ensuring foreign keys (IDs) for location and house type.\n",
      "\n",
      "### Output:\n",
      "The three DataFrames generated will logically relate to one another with consistent data types and primary–foreign key relationships, resulting in a coherent representation of the housing dataset. The output displays heads of each DataFrame and their shapes for verification.\n"
     ]
    }
   ],
   "source": [
    "question = \"\"\"\n",
    "Create a Python program to generate 3 different pandas dataframes.\n",
    "\n",
    "1. Housing data\n",
    "I want 100 rows. Each row should include the following fields:\n",
    " - id (incrementing integer starting at 1)\n",
    " - house size (m^2)\n",
    " - house price\n",
    " - location\n",
    " - number of bedrooms\n",
    " - house type\n",
    " + any relevant foreign keys\n",
    "\n",
    "2. Location\n",
    "Each row should include the following fields:\n",
    " - id (incrementing integer starting at 1)\n",
    " - country\n",
    " - city\n",
    " - population\n",
    " - area (m^2)\n",
    " + any relevant foreign keys\n",
    "\n",
    " 3. House types\n",
    " - id (incrementing integer starting at 1)\n",
    " - house type\n",
    " - average house type price\n",
    " - number of houses\n",
    " + any relevant foreign keys\n",
    "\n",
    "Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).\n",
    "Make sure that the dataframe generally follow common sense checks, e.g. the size of the dataframes make sense in comparison with one another.\n",
    "Make sure the foreign keys match up and you can use previously generated dataframes when creating each consecutive dataframes.\n",
    "You can use the previously generated dataframe to generate the next dataframe.\n",
    "\"\"\"\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "  model=datagen_model,\n",
    "  messages=[\n",
    "    {\"role\": \"system\", \"content\": \"You are a helpful assistant designed to generate synthetic data.\"},\n",
    "    {\"role\": \"user\", \"content\": question}\n",
    "  ]\n",
    ")\n",
    "res = response.choices[0].message.content\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Yv9XlRtauZYZ"
   },
   "source": [
    "### 4. Simply creating textual data\n",
    "Here we take a first look at creating textual data. This can be used to finetune another GPT model for example. In this case we imagine ourselves a retailer trying to streamline the process of creating descriptions for items they are selling. We again need to specify the format of the data, in particular in this case we want one which is easy to parse as an output."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The example we consider below is one in which we want to create input output training pairs for GPT model to finetune on. We will have the products' name and the category it belongs to as input and the output will be a description. \n",
    "\n",
    "Specifying the structure of the output explicitly and giving commands to not deviate from this help enforce the output structure. You can run this in a loop and append the data to generate more synthetic data. Again, as before we will need to parse the data well so that our code further downstream does not break."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "id": "2KJVwjV0upby"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.\n",
      "Input: Wireless Bluetooth Headphones, Electronics\n",
      "Output: Immerse yourself in high-quality sound with these Wireless Bluetooth Headphones, featuring active noise cancellation and a comfortable over-ear design for extended listening sessions.\n",
      "\n",
      "2.\n",
      "Input: Organic Green Tea, Beverages\n",
      "Output: Enjoy a refreshing cup of Organic Green Tea, sourced from the finest leaves, packed with antioxidants, and perfect for a healthy, invigorating boost anytime.\n",
      "\n",
      "3.\n",
      "Input: Stainless Steel Kitchen Knife, Kitchenware\n",
      "Output: Cut with precision and ease using this Stainless Steel Kitchen Knife, designed with an ergonomic handle and a sharp blade for all your culinary tasks.\n",
      "\n",
      "4.\n",
      "Input: Hiking Backpack, Outdoor Gear\n",
      "Output: Explore the great outdoors with this durable Hiking Backpack, featuring multiple compartments for optimal organization and a breathable design for ultimate comfort on long treks.\n",
      "\n",
      "5.\n",
      "Input: Air Fryer, Kitchen Appliances\n",
      "Output: Cook your favorite meals with less oil using this Air Fryer\n"
     ]
    }
   ],
   "source": [
    "output_string = \"\"\n",
    "for i in range(3):\n",
    "  question = f\"\"\"\n",
    "  I am creating input output training pairs to fine tune my gpt model. The usecase is a retailer generating a description for a product from a product catalogue. I want the input to be product name and category (to which the product belongs to) and output to be description.\n",
    "  The format should be of the form:\n",
    "  1.\n",
    "  Input: product_name, category\n",
    "  Output: description\n",
    "  2.\n",
    "  Input: product_name, category\n",
    "  Output: description\n",
    "\n",
    "  Do not add any extra characters around that formatting as it will make the output parsing break.\n",
    "  Create as many training pairs as possible.\n",
    "  \"\"\"\n",
    "\n",
    "  response = client.chat.completions.create(\n",
    "    model=datagen_model,\n",
    "    messages=[\n",
    "      {\"role\": \"system\", \"content\": \"You are a helpful assistant designed to generate synthetic data.\"},\n",
    "      {\"role\": \"user\", \"content\": question}\n",
    "    ]\n",
    "  )\n",
    "  res = response.choices[0].message.content\n",
    "  output_string += res + \"\\n\" + \"\\n\"\n",
    "print(output_string[:1000]) #displaying truncated response\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "K5EmKTEa7GlC"
   },
   "source": [
    "Note: the above output is truncated. And now we can parse it as below to get a list of products, categories and their descriptions. For example, let's take a look at the products it's generated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "owvoyJBh0o2n",
    "outputId": "ee48bcc9-fd29-42bf-9beb-ef3800cdbcb2"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Wireless Bluetooth Headphones',\n",
       " 'Organic Green Tea',\n",
       " 'Stainless Steel Kitchen Knife',\n",
       " 'Hiking Backpack',\n",
       " 'Air Fryer',\n",
       " \"Kids' Educational Tablet\",\n",
       " 'Bluetooth Speaker',\n",
       " 'Yoga Mat',\n",
       " 'Memory Foam Mattress',\n",
       " 'Smartwatch',\n",
       " 'Leather Wallet',\n",
       " 'Portable Phone Charger',\n",
       " 'Non-Stick Cookware Set',\n",
       " 'Pet Dog Bed',\n",
       " 'Fitness Tracker',\n",
       " 'Wireless Earbuds',\n",
       " 'Organic Green Tea',\n",
       " 'Reusable Water Bottle',\n",
       " 'Yoga Mat',\n",
       " 'Leather Wallet',\n",
       " 'Air Fryer',\n",
       " 'Gaming Mouse',\n",
       " 'Crochet Kit',\n",
       " 'Hiking Boots',\n",
       " 'Scented Candles',\n",
       " 'Bluetooth Speaker',\n",
       " 'Stainless Steel Cookware Set',\n",
       " 'Fitness Tracker',\n",
       " 'Decorative Throw Pillows',\n",
       " 'Eco-Friendly Cleaning Supplies',\n",
       " 'Wireless Noise Cancelling Headphones',\n",
       " 'Organic Green Tea',\n",
       " 'Adjustable Yoga Mat',\n",
       " 'Bluetooth Smart Scale',\n",
       " 'Stainless Steel Water Bottle',\n",
       " 'Soft Cotton Bedding Set',\n",
       " 'Multi-Functional Kitchen Blender',\n",
       " 'Eco-Friendly Reusable Bags',\n",
       " 'Portable Phone Charger',\n",
       " 'Classic Leather Wallet',\n",
       " 'Suede Chelsea Boots',\n",
       " 'Non-Stick Cookware Set',\n",
       " 'Pet-Friendly Indoor Plants',\n",
       " 'High-Protein Snack Bars',\n",
       " 'LED Desk Lamp with USB Port']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#regex to parse data\n",
    "pattern = re.compile(r'Input:\\s*(.+?),\\s*(.+?)\\nOutput:\\s*(.+?)(?=\\n\\n|\\Z)', re.DOTALL)\n",
    "matches = pattern.findall(output_string)\n",
    "products = []\n",
    "categories = []\n",
    "descriptions = []\n",
    "\n",
    "for match in matches:\n",
    "    product, category, description = match\n",
    "    products.append(product.strip())\n",
    "    categories.append(category.strip())\n",
    "    descriptions.append(description.strip())\n",
    "products"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bO3PgRwpyocn"
   },
   "source": [
    "\n",
    "### 5. Dealing with imbalanced or non-diverse textual data\n",
    "Some of the most important aspects of generating high-quality synthetic data are accuracy (does the data make sense), consistency (are two separate data points for the same input roughly the same) and diversity (making sure our data distribution matches as much of the distribution that exists in production).\n",
    "\n",
    "\n",
    "To increase the diversity of our data, we start first by clustering the data. This will provide us information about which clusters are underrepresented (imbalanced dataset) or which data is not addressed at all (widening the data distribution). Then, we will either suggest new clusters (using self-reflection type call from GPT) or ask the next iteration of our synthetic generation calls to explicitly target the underrepresented clusters. \n",
    "\n",
    "We can then recursively run this generation and analysis of cluster loop to automate generating diverse synthetic data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ubdPEFYR-myU"
   },
   "source": [
    "For demonstrative purposes, we explicitly prompt the LLM to generate information about 4 different topical areas: vehicle, clothing, toiletries, food. We will then cluster the data and see if it managed to find these 4 topic areas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "m-yncn8s1hWZ",
    "outputId": "35ae248d-4859-4d3f-ba29-94478aed7305"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. vehicle  \n",
      "Input: \"Tesla Model 3, Electric Car\"  \n",
      "Output: \"The Tesla Model 3 is a revolutionary electric car with impressive range and cutting-edge technology, designed to provide an exhilarating driving experience while minimizing environmental impact.\"\n",
      "\n",
      "2. clothing  \n",
      "Input: \"Nike Air Max, Shoes\"  \n",
      "Output: \"Elevate your sneaker game with Nike Air Max. Combining iconic style with superior comfort and support, these shoes are perfect for both workouts and casual outings.\"\n",
      "\n",
      "3. toiletries  \n",
      "Input: \"Oral-B Pro 1000, Electronic Toothbrush\"  \n",
      "Output: \"Achieve a superior clean with the Oral-B Pro 1000. This electronic toothbrush features 3D cleaning action that pulsates and oscillates to remove more plaque than a regular manual toothbrush.\"\n",
      "\n",
      "4. food  \n",
      "Input: \"Chobani Greek Yogurt, Yogurt\"  \n",
      "Output: \"Indulge in a nutritious snack with Chobani Greek Yogurt. Packed with protein and delicious flavors, it’s the perfect choice for a healthy breakfast or a satisfying treat anytime.\"\n",
      "\n",
      "5. vehicle  \n",
      "\n"
     ]
    }
   ],
   "source": [
    "output_string = \"\"\n",
    "for i in range(3):\n",
    "  question = f\"\"\"\n",
    "  I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description. the category should be things like: mobile phones, shoes, headphones, laptop, electronic toothbrush, etc. and also more importantly the categories should come under 4 main topics: vehicle, clothing, toiletries, food)\n",
    "  After the number of each example also state the topic area. The format should be of the form:\n",
    "  1. topic_area\n",
    "  Input: product_name, category\n",
    "  Output: description\n",
    "\n",
    "  Do not add any extra characters around that formatting as it will make the output parsing break.\n",
    "\n",
    "  Here are some helpful examples so you get the style of output correct.\n",
    "\n",
    "  1) clothing\n",
    "  Input: \"Shoe Name, Shoes\"\n",
    "  Output: \"Experience unparalleled comfort. These shoes feature a blend of modern style and the traditional superior cushioning, perfect for those always on the move.\"\n",
    "  \"\"\"\n",
    "\n",
    "  response = client.chat.completions.create(\n",
    "    model=\"gpt-4o-mini\",\n",
    "    messages=[\n",
    "      {\"role\": \"system\", \"content\": \"You are a helpful assistant designed to generate synthetic data.\"},\n",
    "      {\"role\": \"user\", \"content\": question}\n",
    "    ]\n",
    "  )\n",
    "  res = response.choices[0].message.content\n",
    "  output_string += res + \"\\n\" + \"\\n\"\n",
    "print(output_string[:1000]) #displaying truncated response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: The above output is truncated. In the example above, we would explicitly include the topic area as part of the response per example as it helps condition the proceeding output and tends to give better performance. We can also give it an actual example of what the output should look like so it gets the right idea of style of output but also to help enforce structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "id": "k7GAokEC1hUY"
   },
   "outputs": [],
   "source": [
    "pattern = re.compile(r'(\\d+)\\.\\s*(\\w+)\\s*Input:\\s*\"(.+?),\\s*(.+?)\"\\s*Output:\\s*\"(.*?)\"', re.DOTALL)\n",
    "matches = pattern.findall(output_string)\n",
    "\n",
    "topics = []\n",
    "products = []\n",
    "categories = []\n",
    "descriptions = []\n",
    "\n",
    "for match in matches:\n",
    "    number, topic, product, category, description = match\n",
    "    topics.append(topic)\n",
    "    products.append(product)\n",
    "    categories.append(category)\n",
    "    descriptions.append(description)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Z49B3LrJ1hSG",
    "outputId": "d76a9038-1879-44d9-f1db-dc933e066c54"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Tesla Model 3',\n",
       " 'Nike Air Max',\n",
       " 'Oral-B Pro 1000',\n",
       " 'Chobani Greek Yogurt',\n",
       " 'Ford F-150',\n",
       " \"Levi's 511\",\n",
       " 'Philips Sonicare',\n",
       " 'Quaker Oatmeal',\n",
       " 'Toyota Camry',\n",
       " 'Adidas Ultraboost',\n",
       " 'Toyota Camry',\n",
       " 'Nike Air Max',\n",
       " 'Colgate Electric Toothbrush',\n",
       " 'Blue Diamond Almonds',\n",
       " 'Harley Davidson Fat Boy',\n",
       " 'Adidas UltraBoost',\n",
       " \"Dove Men's Body Wash\",\n",
       " 'Quaker Oats',\n",
       " 'Ford F-150',\n",
       " \"Levi's 501 Jeans\",\n",
       " 'Tesla Model 3',\n",
       " 'Nike Air Max',\n",
       " 'Oral-B Pro 1000',\n",
       " 'Organic Almond Butter',\n",
       " 'Yamaha YZF-R3',\n",
       " 'Adidas Ultraboost',\n",
       " 'Philips Sonicare',\n",
       " 'Organic Quinoa']"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "products"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "J1qKxLkAzKq0"
   },
   "source": [
    "We will now cluster the data to analyze it. We will use K-means clustering to segregate the data. An important parameter of K-means to set is K, the number of clusters.\n",
    "\n",
    "We know that there should be 4 cluster (4 topics) since we specified this in prompt: vehicle, electronics, clothing, food. However in general for our data, we do not know the number of clusters that exist. Therefore we will use the elbow method to find the optimal number of clusters.\n",
    "\n",
    "In the elbow method, we iterate through a range of different K's, each time storing the inertia. The inertia measures the sum of the squared distances between each point in a cluster and the centroid of that cluster thus telling us how well-separated and dense each cluster is. If we plot K against the inertia, we are able to see how the inertia drops and where the drop in inertia is least rapid (often making an elbow shape) we can set our optimal number of clusters. You can read into more depth about the elbow method [here](https://en.wikipedia.org/wiki/Elbow_method_(clustering))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1BxwPTkpGzu8"
   },
   "source": [
    "First let's store our data into a pandas dataframe for ease of analysis\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "id": "XcPBzORtKWv6"
   },
   "outputs": [],
   "source": [
    "data = {\n",
    "    'Product': products,\n",
    "    'Category': categories,\n",
    "    'Description': descriptions\n",
    "}\n",
    "\n",
    "df = pd.DataFrame(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HQbg6r37KjG0"
   },
   "source": [
    "Next let us embed our data as the embeddings is what we will cluster since they should be close to each other in vector space if they are similar."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "id": "l8M7MX-1Jctr"
   },
   "outputs": [],
   "source": [
    "def get_embedding(text, model=\"text-embedding-3-small\"):\n",
    "    text = text.replace(\"\\n\", \" \")\n",
    "\n",
    "    response = client.embeddings.create(input=[text], model=model)\n",
    "\n",
    "    return response.data[0].embedding\n",
    "\n",
    "embedding_model = \"text-embedding-3-small\"\n",
    "df[\"embedding\"] = df.Category.apply(lambda x: get_embedding(x, model=embedding_model))\n",
    "\n",
    "# Ensure there are embeddings to concatenate\n",
    "if len(df.embedding.values) > 0:\n",
    "    matrix = np.vstack(df.embedding.values)\n",
    "else:\n",
    "    matrix = np.array([])  # Handle the case where there are no embeddings\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Product</th>\n",
       "      <th>Category</th>\n",
       "      <th>Description</th>\n",
       "      <th>embedding</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tesla Model 3</td>\n",
       "      <td>Electric Car</td>\n",
       "      <td>The Tesla Model 3 is a revolutionary electric ...</td>\n",
       "      <td>[0.003255360759794712, -0.039260633289813995, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Nike Air Max</td>\n",
       "      <td>Shoes</td>\n",
       "      <td>Elevate your sneaker game with Nike Air Max. C...</td>\n",
       "      <td>[0.03943369910120964, 0.022045187652111053, -0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Oral-B Pro 1000</td>\n",
       "      <td>Electronic Toothbrush</td>\n",
       "      <td>Achieve a superior clean with the Oral-B Pro 1...</td>\n",
       "      <td>[-0.003470012918114662, -0.01911414973437786, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Chobani Greek Yogurt</td>\n",
       "      <td>Yogurt</td>\n",
       "      <td>Indulge in a nutritious snack with Chobani Gre...</td>\n",
       "      <td>[0.0208318829536438, -0.02645781636238098, -0....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Ford F-150</td>\n",
       "      <td>Pickup Truck</td>\n",
       "      <td>The Ford F-150 is the ultimate pickup truck, d...</td>\n",
       "      <td>[0.007467855699360371, -0.05288049206137657, -...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Levi's 511</td>\n",
       "      <td>Jeans</td>\n",
       "      <td>Step out in style with Levi's 511 jeans. Featu...</td>\n",
       "      <td>[0.0037206460256129503, 0.022772302851080894, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Philips Sonicare</td>\n",
       "      <td>Electric Toothbrush</td>\n",
       "      <td>Discover a new level of oral care with the Phi...</td>\n",
       "      <td>[-0.00724813062697649, -0.011600878089666367, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Quaker Oatmeal</td>\n",
       "      <td>Breakfast Cereal</td>\n",
       "      <td>Start your day right with Quaker Oatmeal. This...</td>\n",
       "      <td>[-0.006529285106807947, 0.007865572348237038, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Toyota Camry</td>\n",
       "      <td>Sedan</td>\n",
       "      <td>The Toyota Camry stands out in the sedan categ...</td>\n",
       "      <td>[-0.02088991366326809, -0.006191295105963945, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Adidas Ultraboost</td>\n",
       "      <td>Running Shoes</td>\n",
       "      <td>Run like never before in the Adidas Ultraboost...</td>\n",
       "      <td>[0.02679188922047615, 0.014639599248766899, 8....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Toyota Camry</td>\n",
       "      <td>Car</td>\n",
       "      <td>The Toyota Camry is a reliable midsize sedan k...</td>\n",
       "      <td>[0.008056452497839928, -0.007912316359579563, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Nike Air Max</td>\n",
       "      <td>Shoes</td>\n",
       "      <td>Step up your sneaker game with the Nike Air Ma...</td>\n",
       "      <td>[0.03943241760134697, 0.02208484522998333, -0....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Colgate Electric Toothbrush</td>\n",
       "      <td>Electronic Toothbrush</td>\n",
       "      <td>Transform your oral hygiene routine with the C...</td>\n",
       "      <td>[-0.003470012918114662, -0.01911414973437786, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Blue Diamond Almonds</td>\n",
       "      <td>Nuts</td>\n",
       "      <td>Snack healthy with Blue Diamond Almonds. These...</td>\n",
       "      <td>[-0.013289917260408401, 0.036334190517663956, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Harley Davidson Fat Boy</td>\n",
       "      <td>Motorcycle</td>\n",
       "      <td>Experience the thrill of the open road with th...</td>\n",
       "      <td>[0.012365399859845638, 0.03552943095564842, -0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Adidas UltraBoost</td>\n",
       "      <td>Sneakers</td>\n",
       "      <td>Enjoy a perfect blend of comfort and performan...</td>\n",
       "      <td>[0.013107392005622387, 0.02963760495185852, -0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Dove Men's Body Wash</td>\n",
       "      <td>Body Wash</td>\n",
       "      <td>Refresh and hydrate your skin with Dove Men's ...</td>\n",
       "      <td>[0.03760576993227005, -0.008475445210933685, -...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Quaker Oats</td>\n",
       "      <td>Oats</td>\n",
       "      <td>Start your day right with Quaker Oats. Packed ...</td>\n",
       "      <td>[-0.00903365109115839, 0.00896345917135477, 0....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Ford F-150</td>\n",
       "      <td>Truck</td>\n",
       "      <td>The Ford F-150 is a durable and dependable tru...</td>\n",
       "      <td>[0.023461222648620605, -0.026651185005903244, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Levi's 501 Jeans</td>\n",
       "      <td>Jeans</td>\n",
       "      <td>Discover the timeless style of Levi's 501 Jean...</td>\n",
       "      <td>[0.003762696636840701, 0.02275814116001129, -0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Tesla Model 3</td>\n",
       "      <td>Mobile Phones</td>\n",
       "      <td>Explore the future of driving with the Tesla M...</td>\n",
       "      <td>[0.03703858703374863, 0.03407958149909973, 0.0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Nike Air Max</td>\n",
       "      <td>Shoes</td>\n",
       "      <td>Step up your game with the Nike Air Max. Desig...</td>\n",
       "      <td>[0.03943369910120964, 0.022045187652111053, -0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Oral-B Pro 1000</td>\n",
       "      <td>Electronic Toothbrush</td>\n",
       "      <td>Achieve a superior clean with the Oral-B Pro 1...</td>\n",
       "      <td>[-0.003470012918114662, -0.01911414973437786, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Organic Almond Butter</td>\n",
       "      <td>Food</td>\n",
       "      <td>Indulge in the creamy goodness of Organic Almo...</td>\n",
       "      <td>[-0.014613640494644642, -0.002179765608161688,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Yamaha YZF-R3</td>\n",
       "      <td>Mobile Phones</td>\n",
       "      <td>Introducing the Yamaha YZF-R3, the ultimate sp...</td>\n",
       "      <td>[0.03703858703374863, 0.03407958149909973, 0.0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>Adidas Ultraboost</td>\n",
       "      <td>Shoes</td>\n",
       "      <td>Discover the Adidas Ultraboost, a shoe that of...</td>\n",
       "      <td>[0.03944042697548866, 0.022062409669160843, -0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>Philips Sonicare</td>\n",
       "      <td>Electronic Toothbrush</td>\n",
       "      <td>Experience the dental care revolution with Phi...</td>\n",
       "      <td>[-0.003470012918114662, -0.01911414973437786, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>Organic Quinoa</td>\n",
       "      <td>Food</td>\n",
       "      <td>Nourish your body with Organic Quinoa, a nutri...</td>\n",
       "      <td>[-0.014613640494644642, -0.002179765608161688,...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        Product               Category  \\\n",
       "0                 Tesla Model 3           Electric Car   \n",
       "1                  Nike Air Max                  Shoes   \n",
       "2               Oral-B Pro 1000  Electronic Toothbrush   \n",
       "3          Chobani Greek Yogurt                 Yogurt   \n",
       "4                    Ford F-150           Pickup Truck   \n",
       "5                    Levi's 511                  Jeans   \n",
       "6              Philips Sonicare    Electric Toothbrush   \n",
       "7                Quaker Oatmeal       Breakfast Cereal   \n",
       "8                  Toyota Camry                  Sedan   \n",
       "9             Adidas Ultraboost          Running Shoes   \n",
       "10                 Toyota Camry                    Car   \n",
       "11                 Nike Air Max                  Shoes   \n",
       "12  Colgate Electric Toothbrush  Electronic Toothbrush   \n",
       "13         Blue Diamond Almonds                   Nuts   \n",
       "14      Harley Davidson Fat Boy             Motorcycle   \n",
       "15            Adidas UltraBoost               Sneakers   \n",
       "16         Dove Men's Body Wash              Body Wash   \n",
       "17                  Quaker Oats                   Oats   \n",
       "18                   Ford F-150                  Truck   \n",
       "19             Levi's 501 Jeans                  Jeans   \n",
       "20                Tesla Model 3          Mobile Phones   \n",
       "21                 Nike Air Max                  Shoes   \n",
       "22              Oral-B Pro 1000  Electronic Toothbrush   \n",
       "23        Organic Almond Butter                   Food   \n",
       "24                Yamaha YZF-R3          Mobile Phones   \n",
       "25            Adidas Ultraboost                  Shoes   \n",
       "26             Philips Sonicare  Electronic Toothbrush   \n",
       "27               Organic Quinoa                   Food   \n",
       "\n",
       "                                          Description  \\\n",
       "0   The Tesla Model 3 is a revolutionary electric ...   \n",
       "1   Elevate your sneaker game with Nike Air Max. C...   \n",
       "2   Achieve a superior clean with the Oral-B Pro 1...   \n",
       "3   Indulge in a nutritious snack with Chobani Gre...   \n",
       "4   The Ford F-150 is the ultimate pickup truck, d...   \n",
       "5   Step out in style with Levi's 511 jeans. Featu...   \n",
       "6   Discover a new level of oral care with the Phi...   \n",
       "7   Start your day right with Quaker Oatmeal. This...   \n",
       "8   The Toyota Camry stands out in the sedan categ...   \n",
       "9   Run like never before in the Adidas Ultraboost...   \n",
       "10  The Toyota Camry is a reliable midsize sedan k...   \n",
       "11  Step up your sneaker game with the Nike Air Ma...   \n",
       "12  Transform your oral hygiene routine with the C...   \n",
       "13  Snack healthy with Blue Diamond Almonds. These...   \n",
       "14  Experience the thrill of the open road with th...   \n",
       "15  Enjoy a perfect blend of comfort and performan...   \n",
       "16  Refresh and hydrate your skin with Dove Men's ...   \n",
       "17  Start your day right with Quaker Oats. Packed ...   \n",
       "18  The Ford F-150 is a durable and dependable tru...   \n",
       "19  Discover the timeless style of Levi's 501 Jean...   \n",
       "20  Explore the future of driving with the Tesla M...   \n",
       "21  Step up your game with the Nike Air Max. Desig...   \n",
       "22  Achieve a superior clean with the Oral-B Pro 1...   \n",
       "23  Indulge in the creamy goodness of Organic Almo...   \n",
       "24  Introducing the Yamaha YZF-R3, the ultimate sp...   \n",
       "25  Discover the Adidas Ultraboost, a shoe that of...   \n",
       "26  Experience the dental care revolution with Phi...   \n",
       "27  Nourish your body with Organic Quinoa, a nutri...   \n",
       "\n",
       "                                            embedding  \n",
       "0   [0.003255360759794712, -0.039260633289813995, ...  \n",
       "1   [0.03943369910120964, 0.022045187652111053, -0...  \n",
       "2   [-0.003470012918114662, -0.01911414973437786, ...  \n",
       "3   [0.0208318829536438, -0.02645781636238098, -0....  \n",
       "4   [0.007467855699360371, -0.05288049206137657, -...  \n",
       "5   [0.0037206460256129503, 0.022772302851080894, ...  \n",
       "6   [-0.00724813062697649, -0.011600878089666367, ...  \n",
       "7   [-0.006529285106807947, 0.007865572348237038, ...  \n",
       "8   [-0.02088991366326809, -0.006191295105963945, ...  \n",
       "9   [0.02679188922047615, 0.014639599248766899, 8....  \n",
       "10  [0.008056452497839928, -0.007912316359579563, ...  \n",
       "11  [0.03943241760134697, 0.02208484522998333, -0....  \n",
       "12  [-0.003470012918114662, -0.01911414973437786, ...  \n",
       "13  [-0.013289917260408401, 0.036334190517663956, ...  \n",
       "14  [0.012365399859845638, 0.03552943095564842, -0...  \n",
       "15  [0.013107392005622387, 0.02963760495185852, -0...  \n",
       "16  [0.03760576993227005, -0.008475445210933685, -...  \n",
       "17  [-0.00903365109115839, 0.00896345917135477, 0....  \n",
       "18  [0.023461222648620605, -0.026651185005903244, ...  \n",
       "19  [0.003762696636840701, 0.02275814116001129, -0...  \n",
       "20  [0.03703858703374863, 0.03407958149909973, 0.0...  \n",
       "21  [0.03943369910120964, 0.022045187652111053, -0...  \n",
       "22  [-0.003470012918114662, -0.01911414973437786, ...  \n",
       "23  [-0.014613640494644642, -0.002179765608161688,...  \n",
       "24  [0.03703858703374863, 0.03407958149909973, 0.0...  \n",
       "25  [0.03944042697548866, 0.022062409669160843, -0...  \n",
       "26  [-0.003470012918114662, -0.01911414973437786, ...  \n",
       "27  [-0.014613640494644642, -0.002179765608161688,...  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZcdZscBNKy0F"
   },
   "source": [
    "Now we perform the elbow method. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 981
    },
    "id": "1Azw_xgVl_aY",
    "outputId": "5b7076aa-a03c-4a40-f52c-08b09248cf18"
   },
   "outputs": [],
   "source": [
    "# Determine the optimal number of clusters using the elbow method\n",
    "inertias = []\n",
    "range_of_clusters = range(1, 13)  # Adjust the range as necessary\n",
    "\n",
    "for n_clusters in range_of_clusters:\n",
    "    kmeans = KMeans(n_clusters=n_clusters, init=\"k-means++\", random_state=42, n_init=10)\n",
    "    kmeans.fit(matrix)\n",
    "    inertias.append(kmeans.inertia_)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will output a chart for us in which we have to visually tell where the optimal cluster point is. We can see below that we see a gradual decrease of inertia rather than a sharp elbow but the point of steepest decrease appears to occur around 3, 4 or 5 clusters which lines up with our expectations given our prompt. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAA0oAAAIjCAYAAAA9VuvLAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy80BEi2AAAACXBIWXMAAA9hAAAPYQGoP6dpAAB/oElEQVR4nO3dd1yVdf/H8ffhMEWGoCxBxC0OcmQ5cue4TStT245um5qpaWn9Sm1oe95m2dD20NLUypFbMzfuLW4QFWWoIHCu3x/GiamgwMV4PR+P8yiucc7nOgcP532uz/X9WgzDMAQAAAAAsHMwuwAAAAAAKGkISgAAAACQDUEJAAAAALIhKAEAAABANgQlAAAAAMiGoAQAAAAA2RCUAAAAACAbghIAAAAAZENQAgAAAIBsCEqACSwWi8aPH2//efz48bJYLDp9+rR5RZVQ1atX12233Vbkj7Ns2TJZLBYtW7asyB8LWbVv317t27c3u4xic+jQIVksFk2fPr1cPXZhyTiGt956y+xS8iUpKUmDBw9WQECALBaLhg8fXij3O336dFksFh06dKhQ7g9ATgQloJBk/NHK6/b333+bXeI1q169uiwWizp37pzr+k8//dR+nBs2bCjw/e/cuVPjx48vFX/w//rrL40fP17nzp0r1PvNCMsZtwoVKqhatWrq2bOnpk2bppSUlGu+76Kquaw4c+aMRo8erbp168rV1VU+Pj7q2rWr5s2bd133+9133+m9994rnCKLWcYXBxaLRRs3bsyxfuDAgapYsaIJlZU+EydO1PTp0/X444/r66+/1oMPPnjF7dPT0zVt2jS1b99ePj4+cnFxUfXq1TVo0KBren+9Vr///nuWL/SA8sjR7AKAsuall15SWFhYjuW1atUyoZrC4+rqqqVLlyomJkYBAQFZ1n377bdydXVVcnLyNd33zp07NWHCBLVv317Vq1cvhGqLzl9//aUJEyZo4MCB8vb2LvT7nzJliipWrKiUlBQdP35cCxYs0EMPPaT33ntP8+bNU0hISImr+XotXLjQtMfes2ePOnXqpFOnTmnQoEFq3ry5zp07p2+//VY9e/bUqFGj9Oabb17TfX/33Xfavn17jjMIoaGhunjxopycnArhCIre+PHjNXfuXLPLKLWWLFmim2++WePGjbvqthcvXlTv3r01f/58tW3bVs8995x8fHx06NAh/fTTT/ryyy915MgRBQcHF3ndv//+uyZPnkxYQrlGUAIKWffu3dW8eXOzyyh0rVu31vr16/Xjjz/qqaeesi8/duyYVq5cqTvvvFM///yziRWWDX369FHlypXtP7/44ov69ttv1b9/f/Xt27dEnZm8cOGCKlSocN334+zsXAjVFFxqaqr69Omjs2fPasWKFbrpppvs60aMGKH7779fb731lpo3b66777670B7XYrHI1dW10O6vKN1www2aN2+eNm3apKZNm5pdTrE6f/683N3dr/t+YmNjFR4enq9tR48erfnz5+vdd9/NEbDHjRund99997rrMZNhGEpOTpabm5vZpQD5QusdUIKcPn1a/fr1k6enp3x9ffXUU0/lOEuTlpaml19+WTVr1rS3ZDz33HNZWrNGjhwpX19fGYZhX/bkk0/KYrHogw8+sC87efKkLBaLpkyZctXaXF1d1bt3b3333XdZln///feqVKmSunbtmut+u3fvVp8+feTj4yNXV1c1b95cc+bMsa+fPn26+vbtK0nq0KGDvd0n+7VCq1atUosWLeTq6qoaNWroq6++yvFYBw8eVN++feXj46MKFSro5ptv1m+//ZZju2PHjumOO+6Qu7u7/Pz8NGLEiHy1to0fP16jR4+WJIWFhdlrzWgZzM9rcy3uv/9+DR48WGvXrtWiRYuyrFu7dq26desmLy8vVahQQe3atdPq1avzXbMkffPNN2rWrJnc3Nzk4+Oje+65R0ePHs3yOO3bt1fDhg21ceNGtW3bVhUqVNBzzz2X5XqRyZMnq0aNGqpQoYK6dOmio0ePyjAMvfzyywoODpabm5tuv/12xcXF5bjvzNcoZbR9/fTTT3r11VcVHBwsV1dXderUSfv378/x/FztOcjLzz//rO3bt2vMmDFZQpIkWa1WffLJJ/L29s7yjXpGbT/++KOee+45BQQEyN3dXb169crynLVv316//fabDh8+bH/OM86W5nadUEYr25EjR3TbbbepYsWKqlq1qiZPnixJ2rZtmzp27Ch3d3eFhobm+HcYFxenUaNGqVGjRqpYsaI8PT3VvXt3bdmy5arPw5U8+eSTqlSpUr7OKmS/9jJD9erVNXDgQPvPGW3Kq1at0rBhw1SlShV5e3vr0Ucf1aVLl3Tu3Dn1799flSpVUqVKlfTMM89keS/L7N1331VoaKjc3NzUrl07bd++Pcc2V3sPylzT8uXL9cQTT8jPz++qZ21iY2P13//+V/7+/nJ1dVVERIS+/PJL+/qM35WoqCj99ttvuf7by+zYsWP65JNPdOutt+Z6HZPVatWoUaOuWFd+X4PU1FRNmDBBtWvXlqurq3x9fdWmTRv7+8vAgQPtv3uZW4Iz2Gw2vffee2rQoIFcXV3l7++vRx99VGfPns3xuLfddpsWLFig5s2by83NTZ988okkadGiRWrTpo28vb1VsWJF1a1bV88991yexwaYgTNKQCGLj4/PMSiDxWKRr6/vVfft16+fqlevrkmTJunvv//WBx98oLNnz2YJBYMHD9aXX36pPn366Omnn9batWs1adIk7dq1S7NmzZIk3XLLLXr33Xe1Y8cONWzYUJK0cuVKOTg4aOXKlRo2bJh9mSS1bds2X8d23333qUuXLjpw4IBq1qwp6XJ7UZ8+fXJtI9qxY4dat26tqlWrasyYMXJ3d9dPP/2kO+64Qz///LPuvPNOtW3bVsOGDdMHH3yg5557TvXr15ck+38laf/+/erTp4/++9//asCAAfriiy80cOBANWvWTA0aNJB0OfS1atVKFy5c0LBhw+Tr66svv/xSvXr10syZM3XnnXdKutza0qlTJx05ckTDhg1TUFCQvv76ay1ZsuSqx9+7d2/t3btX33//vd599137mZ8qVark+7W5Vg8++KCmTp2qhQsX6tZbb5V0uaWne/fuatasmcaNGycHBwdNmzZNHTt21MqVK9WiRYur1vzqq6/qhRdeUL9+/TR48GCdOnVKH374odq2bavNmzdnadU7c+aMunfvrnvuuUcPPPCA/P397eu+/fZbXbp0SU8++aTi4uL0xhtvqF+/furYsaOWLVumZ599Vvv379eHH36oUaNG6YsvvrjqMb/22mtycHDQqFGjFB8frzfeeEP333+/1q5da98mP89BXjLayfr375/rei8vL91+++368ssvtX///izts6+++qosFoueffZZxcbG6r333lPnzp0VGRkpNzc3Pf/884qPj9exY8fsZwGudk1Penq6unfvrrZt2+qNN97Qt99+q6FDh8rd3V3PP/+87r//fvXu3Vsff/yx+vfvr5YtW9rbfA8ePKjZs2erb9++CgsL08mTJ/XJJ5+oXbt22rlzp4KCgq76fOfG09NTI0aM0IsvvljoZ5WefPJJBQQEaMKECfr77781depUeXt766+//lK1atU0ceJE/f7773rzzTfVsGHDHK/TV199pcTERA0ZMkTJycl6//331bFjR23bts3+u5mf96DMnnjiCVWpUkUvvviizp8/n2ftFy9eVPv27bV//34NHTpUYWFhmjFjhgYOHKhz587pqaeeUv369fX1119rxIgRCg4O1tNPPy3p33972f3xxx9KS0u76jVMhWH8+PGaNGmSBg8erBYtWighIUEbNmzQpk2bdOutt+rRRx/ViRMntGjRIn399dc59n/00Uc1ffp0DRo0SMOGDVNUVJT+97//afPmzVq9enWWvwd79uzRvffeq0cffVQPP/yw6tatqx07dui2225T48aN9dJLL8nFxUX79+/P1xccQLEyABSKadOmGZJyvbm4uGTZVpIxbtw4+8/jxo0zJBm9evXKst0TTzxhSDK2bNliGIZhREZGGpK
      "text/plain": [
       "<Figure size 1000x600 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Plotting the elbow plot\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(range_of_clusters, inertias, '-o')\n",
    "plt.title('Elbow Method to Determine Optimal Number of Clusters')\n",
    "plt.xlabel('Number of Clusters')\n",
    "plt.ylabel('Inertia')\n",
    "plt.xticks(range_of_clusters)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![elbow_chart](../images/elbow_chart.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NN7NbTmiLe_-"
   },
   "source": [
    "For demonstration purposes we will pick 5 as the optimal cluster number to show it doesn't matter exactly where we pick it as long as we are approximately right. There are numerous correct ways to categorize data. We also store which cluster each data point belongs to."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "KvrDe3WYKWgZ",
    "outputId": "d0d95227-b9d2-4c52-a5f7-59159ba848d2"
   },
   "outputs": [],
   "source": [
    "n_clusters = 5\n",
    "\n",
    "kmeans = KMeans(n_clusters=n_clusters, init=\"k-means++\", random_state=42)\n",
    "kmeans.fit(matrix)\n",
    "labels = kmeans.labels_\n",
    "df[\"Cluster\"] = labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tLvO4AISDM0J"
   },
   "source": [
    "We will analyze the cluster data now. There are two separate things we will look to address. 1. imbalanced data, 2. Expanding the data distribution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zaQ_mdhpOJqs"
   },
   "source": [
    "First for imbalanced data we count the number of examples in each cluster. Then we select a few examples from each cluster at random and ask the LLM what topics these map to. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "crUT7OR7QcFD",
    "outputId": "04f49ac2-1259-4622-bcf2-0686af93068c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cluster\n",
      "0    5\n",
      "1    7\n",
      "2    8\n",
      "3    6\n",
      "4    2\n",
      "Name: count, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "cluster_counts = df[\"Cluster\"].value_counts().sort_index()\n",
    "print(cluster_counts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8i1FGtM-Xx3k"
   },
   "source": [
    "We can see the topics found here:\n",
    "Eco-friendly Transportation, Luxury and Leisure Items, Personal Care Products, Electronic Toothbrushes and Clothing and Apparel\n",
    "match well enough but not exactly to our initial prompt of:\n",
    "vehicle, clothing, toiletries, food.\n",
    "\n",
    "As we chose 5 clusters, it split up toiletries into Skincare and Personal Care which doesn't affect us too much further downstream."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Product</th>\n",
       "      <th>Category</th>\n",
       "      <th>Description</th>\n",
       "      <th>embedding</th>\n",
       "      <th>Cluster</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tesla Model 3</td>\n",
       "      <td>Electric Car</td>\n",
       "      <td>The Tesla Model 3 is a revolutionary electric ...</td>\n",
       "      <td>[0.003255360759794712, -0.039260633289813995, ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Nike Air Max</td>\n",
       "      <td>Shoes</td>\n",
       "      <td>Elevate your sneaker game with Nike Air Max. C...</td>\n",
       "      <td>[0.03943369910120964, 0.022045187652111053, -0...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Oral-B Pro 1000</td>\n",
       "      <td>Electronic Toothbrush</td>\n",
       "      <td>Achieve a superior clean with the Oral-B Pro 1...</td>\n",
       "      <td>[-0.003470012918114662, -0.01911414973437786, ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Chobani Greek Yogurt</td>\n",
       "      <td>Yogurt</td>\n",
       "      <td>Indulge in a nutritious snack with Chobani Gre...</td>\n",
       "      <td>[0.0208318829536438, -0.02645781636238098, -0....</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Ford F-150</td>\n",
       "      <td>Pickup Truck</td>\n",
       "      <td>The Ford F-150 is the ultimate pickup truck, d...</td>\n",
       "      <td>[0.007467855699360371, -0.05288049206137657, -...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Levi's 511</td>\n",
       "      <td>Jeans</td>\n",
       "      <td>Step out in style with Levi's 511 jeans. Featu...</td>\n",
       "      <td>[0.0037206460256129503, 0.022772302851080894, ...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Philips Sonicare</td>\n",
       "      <td>Electric Toothbrush</td>\n",
       "      <td>Discover a new level of oral care with the Phi...</td>\n",
       "      <td>[-0.00724813062697649, -0.011600878089666367, ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Quaker Oatmeal</td>\n",
       "      <td>Breakfast Cereal</td>\n",
       "      <td>Start your day right with Quaker Oatmeal. This...</td>\n",
       "      <td>[-0.006529285106807947, 0.007865572348237038, ...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Toyota Camry</td>\n",
       "      <td>Sedan</td>\n",
       "      <td>The Toyota Camry stands out in the sedan categ...</td>\n",
       "      <td>[-0.02088991366326809, -0.006191295105963945, ...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Adidas Ultraboost</td>\n",
       "      <td>Running Shoes</td>\n",
       "      <td>Run like never before in the Adidas Ultraboost...</td>\n",
       "      <td>[0.02679188922047615, 0.014639599248766899, 8....</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Toyota Camry</td>\n",
       "      <td>Car</td>\n",
       "      <td>The Toyota Camry is a reliable midsize sedan k...</td>\n",
       "      <td>[0.008056452497839928, -0.007912316359579563, ...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Nike Air Max</td>\n",
       "      <td>Shoes</td>\n",
       "      <td>Step up your sneaker game with the Nike Air Ma...</td>\n",
       "      <td>[0.03943241760134697, 0.02208484522998333, -0....</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Colgate Electric Toothbrush</td>\n",
       "      <td>Electronic Toothbrush</td>\n",
       "      <td>Transform your oral hygiene routine with the C...</td>\n",
       "      <td>[-0.003470012918114662, -0.01911414973437786, ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Blue Diamond Almonds</td>\n",
       "      <td>Nuts</td>\n",
       "      <td>Snack healthy with Blue Diamond Almonds. These...</td>\n",
       "      <td>[-0.013289917260408401, 0.036334190517663956, ...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Harley Davidson Fat Boy</td>\n",
       "      <td>Motorcycle</td>\n",
       "      <td>Experience the thrill of the open road with th...</td>\n",
       "      <td>[0.012365399859845638, 0.03552943095564842, -0...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Adidas UltraBoost</td>\n",
       "      <td>Sneakers</td>\n",
       "      <td>Enjoy a perfect blend of comfort and performan...</td>\n",
       "      <td>[0.013107392005622387, 0.02963760495185852, -0...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Dove Men's Body Wash</td>\n",
       "      <td>Body Wash</td>\n",
       "      <td>Refresh and hydrate your skin with Dove Men's ...</td>\n",
       "      <td>[0.03760576993227005, -0.008475445210933685, -...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Quaker Oats</td>\n",
       "      <td>Oats</td>\n",
       "      <td>Start your day right with Quaker Oats. Packed ...</td>\n",
       "      <td>[-0.00903365109115839, 0.00896345917135477, 0....</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Ford F-150</td>\n",
       "      <td>Truck</td>\n",
       "      <td>The Ford F-150 is a durable and dependable tru...</td>\n",
       "      <td>[0.023461222648620605, -0.026651185005903244, ...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Levi's 501 Jeans</td>\n",
       "      <td>Jeans</td>\n",
       "      <td>Discover the timeless style of Levi's 501 Jean...</td>\n",
       "      <td>[0.003762696636840701, 0.02275814116001129, -0...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Tesla Model 3</td>\n",
       "      <td>Mobile Phones</td>\n",
       "      <td>Explore the future of driving with the Tesla M...</td>\n",
       "      <td>[0.03703858703374863, 0.03407958149909973, 0.0...</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Nike Air Max</td>\n",
       "      <td>Shoes</td>\n",
       "      <td>Step up your game with the Nike Air Max. Desig...</td>\n",
       "      <td>[0.03943369910120964, 0.022045187652111053, -0...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Oral-B Pro 1000</td>\n",
       "      <td>Electronic Toothbrush</td>\n",
       "      <td>Achieve a superior clean with the Oral-B Pro 1...</td>\n",
       "      <td>[-0.003470012918114662, -0.01911414973437786, ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Organic Almond Butter</td>\n",
       "      <td>Food</td>\n",
       "      <td>Indulge in the creamy goodness of Organic Almo...</td>\n",
       "      <td>[-0.014613640494644642, -0.002179765608161688,...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Yamaha YZF-R3</td>\n",
       "      <td>Mobile Phones</td>\n",
       "      <td>Introducing the Yamaha YZF-R3, the ultimate sp...</td>\n",
       "      <td>[0.03703858703374863, 0.03407958149909973, 0.0...</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>Adidas Ultraboost</td>\n",
       "      <td>Shoes</td>\n",
       "      <td>Discover the Adidas Ultraboost, a shoe that of...</td>\n",
       "      <td>[0.03944042697548866, 0.022062409669160843, -0...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>Philips Sonicare</td>\n",
       "      <td>Electronic Toothbrush</td>\n",
       "      <td>Experience the dental care revolution with Phi...</td>\n",
       "      <td>[-0.003470012918114662, -0.01911414973437786, ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>Organic Quinoa</td>\n",
       "      <td>Food</td>\n",
       "      <td>Nourish your body with Organic Quinoa, a nutri...</td>\n",
       "      <td>[-0.014613640494644642, -0.002179765608161688,...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        Product               Category  \\\n",
       "0                 Tesla Model 3           Electric Car   \n",
       "1                  Nike Air Max                  Shoes   \n",
       "2               Oral-B Pro 1000  Electronic Toothbrush   \n",
       "3          Chobani Greek Yogurt                 Yogurt   \n",
       "4                    Ford F-150           Pickup Truck   \n",
       "5                    Levi's 511                  Jeans   \n",
       "6              Philips Sonicare    Electric Toothbrush   \n",
       "7                Quaker Oatmeal       Breakfast Cereal   \n",
       "8                  Toyota Camry                  Sedan   \n",
       "9             Adidas Ultraboost          Running Shoes   \n",
       "10                 Toyota Camry                    Car   \n",
       "11                 Nike Air Max                  Shoes   \n",
       "12  Colgate Electric Toothbrush  Electronic Toothbrush   \n",
       "13         Blue Diamond Almonds                   Nuts   \n",
       "14      Harley Davidson Fat Boy             Motorcycle   \n",
       "15            Adidas UltraBoost               Sneakers   \n",
       "16         Dove Men's Body Wash              Body Wash   \n",
       "17                  Quaker Oats                   Oats   \n",
       "18                   Ford F-150                  Truck   \n",
       "19             Levi's 501 Jeans                  Jeans   \n",
       "20                Tesla Model 3          Mobile Phones   \n",
       "21                 Nike Air Max                  Shoes   \n",
       "22              Oral-B Pro 1000  Electronic Toothbrush   \n",
       "23        Organic Almond Butter                   Food   \n",
       "24                Yamaha YZF-R3          Mobile Phones   \n",
       "25            Adidas Ultraboost                  Shoes   \n",
       "26             Philips Sonicare  Electronic Toothbrush   \n",
       "27               Organic Quinoa                   Food   \n",
       "\n",
       "                                          Description  \\\n",
       "0   The Tesla Model 3 is a revolutionary electric ...   \n",
       "1   Elevate your sneaker game with Nike Air Max. C...   \n",
       "2   Achieve a superior clean with the Oral-B Pro 1...   \n",
       "3   Indulge in a nutritious snack with Chobani Gre...   \n",
       "4   The Ford F-150 is the ultimate pickup truck, d...   \n",
       "5   Step out in style with Levi's 511 jeans. Featu...   \n",
       "6   Discover a new level of oral care with the Phi...   \n",
       "7   Start your day right with Quaker Oatmeal. This...   \n",
       "8   The Toyota Camry stands out in the sedan categ...   \n",
       "9   Run like never before in the Adidas Ultraboost...   \n",
       "10  The Toyota Camry is a reliable midsize sedan k...   \n",
       "11  Step up your sneaker game with the Nike Air Ma...   \n",
       "12  Transform your oral hygiene routine with the C...   \n",
       "13  Snack healthy with Blue Diamond Almonds. These...   \n",
       "14  Experience the thrill of the open road with th...   \n",
       "15  Enjoy a perfect blend of comfort and performan...   \n",
       "16  Refresh and hydrate your skin with Dove Men's ...   \n",
       "17  Start your day right with Quaker Oats. Packed ...   \n",
       "18  The Ford F-150 is a durable and dependable tru...   \n",
       "19  Discover the timeless style of Levi's 501 Jean...   \n",
       "20  Explore the future of driving with the Tesla M...   \n",
       "21  Step up your game with the Nike Air Max. Desig...   \n",
       "22  Achieve a superior clean with the Oral-B Pro 1...   \n",
       "23  Indulge in the creamy goodness of Organic Almo...   \n",
       "24  Introducing the Yamaha YZF-R3, the ultimate sp...   \n",
       "25  Discover the Adidas Ultraboost, a shoe that of...   \n",
       "26  Experience the dental care revolution with Phi...   \n",
       "27  Nourish your body with Organic Quinoa, a nutri...   \n",
       "\n",
       "                                            embedding  Cluster  \n",
       "0   [0.003255360759794712, -0.039260633289813995, ...        1  \n",
       "1   [0.03943369910120964, 0.022045187652111053, -0...        2  \n",
       "2   [-0.003470012918114662, -0.01911414973437786, ...        1  \n",
       "3   [0.0208318829536438, -0.02645781636238098, -0....        3  \n",
       "4   [0.007467855699360371, -0.05288049206137657, -...        0  \n",
       "5   [0.0037206460256129503, 0.022772302851080894, ...        2  \n",
       "6   [-0.00724813062697649, -0.011600878089666367, ...        1  \n",
       "7   [-0.006529285106807947, 0.007865572348237038, ...        3  \n",
       "8   [-0.02088991366326809, -0.006191295105963945, ...        0  \n",
       "9   [0.02679188922047615, 0.014639599248766899, 8....        2  \n",
       "10  [0.008056452497839928, -0.007912316359579563, ...        0  \n",
       "11  [0.03943241760134697, 0.02208484522998333, -0....        2  \n",
       "12  [-0.003470012918114662, -0.01911414973437786, ...        1  \n",
       "13  [-0.013289917260408401, 0.036334190517663956, ...        3  \n",
       "14  [0.012365399859845638, 0.03552943095564842, -0...        0  \n",
       "15  [0.013107392005622387, 0.02963760495185852, -0...        2  \n",
       "16  [0.03760576993227005, -0.008475445210933685, -...        1  \n",
       "17  [-0.00903365109115839, 0.00896345917135477, 0....        3  \n",
       "18  [0.023461222648620605, -0.026651185005903244, ...        0  \n",
       "19  [0.003762696636840701, 0.02275814116001129, -0...        2  \n",
       "20  [0.03703858703374863, 0.03407958149909973, 0.0...        4  \n",
       "21  [0.03943369910120964, 0.022045187652111053, -0...        2  \n",
       "22  [-0.003470012918114662, -0.01911414973437786, ...        1  \n",
       "23  [-0.014613640494644642, -0.002179765608161688,...        3  \n",
       "24  [0.03703858703374863, 0.03407958149909973, 0.0...        4  \n",
       "25  [0.03944042697548866, 0.022062409669160843, -0...        2  \n",
       "26  [-0.003470012918114662, -0.01911414973437786, ...        1  \n",
       "27  [-0.014613640494644642, -0.002179765608161688,...        3  "
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "RRwIet9DUdKe",
    "outputId": "8e7835c9-884a-4504-bbed-f556382dd9f5"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\n",
      "  {\n",
      "    \"cluster\": 0,\n",
      "    \"topic\": \"Automotive  \"\n",
      "  },\n",
      "  {\n",
      "    \"cluster\": 1,\n",
      "    \"topic\": \"Personal Care  \"\n",
      "  },\n",
      "  {\n",
      "    \"cluster\": 2,\n",
      "    \"topic\": \"Footwear  \"\n",
      "  },\n",
      "  {\n",
      "    \"cluster\": 3,\n",
      "    \"topic\": \"Food  \"\n",
      "  },\n",
      "  {\n",
      "    \"cluster\": 4,\n",
      "    \"topic\": \"Automotive  \"\n",
      "  }\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3, replace=True)).reset_index(drop=True)\n",
    "\n",
    "# Format the selected examples\n",
    "formatted_examples = \"\\n\".join(\n",
    "    f'Input: \"{row[\"Product\"]}, {row[\"Category\"]}\"\\nOutput: \"{row[\"Description\"]}\"\\nCluster: \"{row[\"Cluster\"]}\"'\n",
    "    for _, row in selected_examples.iterrows()\n",
    ")\n",
    "\n",
    "topic_prompt = f\"\"\"\n",
    "    I previously generated some examples of input output trainings pairs and then I clustered them based on category. From each cluster I picked 3 example data point which you can find below.\n",
    "    I want you identify the broad topic areas these clusters belong to.\n",
    "    Previous examples:\n",
    "    {formatted_examples}\n",
    "\n",
    "\n",
    "    Your output should be strictly of the format:\n",
    "    Cluster: number, topic: topic\n",
    "    Cluster: number, topic: topic\n",
    "    Cluster: number, topic: topic\n",
    "\n",
    "    Do not add any extra characters around that formatting as it will make the output parsing break.\n",
    "    \"\"\"\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "  model=datagen_model,\n",
    "  messages=[\n",
    "    {\"role\": \"system\", \"content\": \"You are a helpful assistant designed analyze clustered data\"},\n",
    "    {\"role\": \"user\", \"content\": topic_prompt}\n",
    "  ]\n",
    ")\n",
    "res = response.choices[0].message.content\n",
    "\n",
    "pattern = r\"Cluster: (\\d+), topic: ([^\\n]+)\"\n",
    "matches = re.findall(pattern, res)\n",
    "clusters = [{\"cluster\": int(cluster), \"topic\": topic} for cluster, topic in matches]\n",
    "json_output = json.dumps(clusters, indent=2)\n",
    "print(json_output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "x5hszl-SZVdi"
   },
   "source": [
    "We now have the clusters and their counts so we could prompt the LLM to generate more examples within the topics we want. However for this example we won't take that further as they are well-split and you would just follow the procedure above for prompting the model to generate data while passing in the underrepresented topics."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yVD_TPsHYvDb"
   },
   "source": [
    "Next, we will try and deal with increasing the diversity of our data distribution. \n",
    "\n",
    "First we start in a similar way by finding a few examples from each cluster at random and ask the LLM what topics these map to. In addition to this in the same LLM call, we will ask it to generate more topics to increase the diversity of our data. We do this in one call to save time/cost."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 53
    },
    "id": "mZjBbfFaZ3mn",
    "outputId": "8864421a-e9d4-4ea6-f747-a76a3291a593"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Cluster topic mapping\n",
      "Cluster: 0, topic: Automotive\n",
      "Cluster: 1, topic: Personal Care\n",
      "Cluster: 2, topic: Footwear\n",
      "Cluster: 3, topic: Food\n",
      "Cluster: 4, topic: Electric Vehicles\n",
      "\n",
      "2. New topics\n",
      "1. topic: Home Appliances\n",
      "2. topic: Outdoor Equipment\n",
      "3. topic: Smart Home Technology\n",
      "4. topic: Fitness Equipment\n"
     ]
    }
   ],
   "source": [
    "selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3, replace=True)).reset_index(drop=True)\n",
    "\n",
    "# Format the selected examples\n",
    "formatted_examples = \"\\n\".join(\n",
    "    f'Input: \"{row[\"Product\"]}, {row[\"Category\"]}\"\\nOutput: \"{row[\"Description\"]}\"\\nCluster: \"{row[\"Cluster\"]}\"'\n",
    "    for _, row in selected_examples.iterrows()\n",
    ")\n",
    "\n",
    "topic_prompt = f\"\"\"\n",
    "    I previously generated some examples of input output trainings pairs and then I clustered them based on category. From each cluster I picked 3 example data point which you can find below.\n",
    "    I want to promote diversity in my examples across categories so follow the procedure below:\n",
    "    1. You must identify the broad topic areas these clusters belong to.\n",
    "    2. You should generate further topic areas which don't exist so I can generate data within these topics to improve diversity.\n",
    "\n",
    "\n",
    "    Previous examples:\n",
    "    {formatted_examples}\n",
    "\n",
    "\n",
    "    Your output should be strictly of the format:\n",
    "\n",
    "    1. Cluster topic mapping\n",
    "    Cluster: number, topic: topic\n",
    "    Cluster: number, topic: topic\n",
    "    Cluster: number, topic: topic\n",
    "\n",
    "    2. New topics\n",
    "    1. topic\n",
    "    2. topic\n",
    "    3. topic\n",
    "    4. topic\n",
    "\n",
    "    Do not add any extra characters around that formatting as it will make the output parsing break. It is very important you stick to that output format\n",
    "    \"\"\"\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "  model=datagen_model,\n",
    "  messages=[\n",
    "    {\"role\": \"system\", \"content\": \"You are a helpful assistant designed to analyze clustered data\"},\n",
    "    {\"role\": \"user\", \"content\": topic_prompt}\n",
    "  ]\n",
    ")\n",
    "res = response.choices[0].message.content\n",
    "print(res)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see here again that we explicitly prompt the output structure it should follow. I also tell it the purpose of generating topics (to promote diversity) so the model has full context."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "s254oJ-Ecka0"
   },
   "source": [
    "We then parse the data into a list of cluster-mapping jsons and a list of topics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "HTS4ybspcivw",
    "outputId": "52ea363c-cbf5-420e-b81e-0d2710c50203"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "([{'cluster': 0, 'topic': 'Automotive'},\n",
       "  {'cluster': 1, 'topic': 'Personal Care'},\n",
       "  {'cluster': 2, 'topic': 'Footwear'},\n",
       "  {'cluster': 3, 'topic': 'Food'},\n",
       "  {'cluster': 4, 'topic': 'Electric Vehicles'}],\n",
       " ['topic: Home Appliances',\n",
       "  'topic: Outdoor Equipment',\n",
       "  'topic: Smart Home Technology',\n",
       "  'topic: Fitness Equipment'])"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parts = res.split(\"\\n\\n\")\n",
    "cluster_mapping_part = parts[0]\n",
    "new_topics_part = parts[1]\n",
    "\n",
    "# Parse cluster topic mapping\n",
    "cluster_topic_mapping_lines = cluster_mapping_part.split(\"\\n\")[1:]  # Skip the first two lines\n",
    "cluster_topic_mapping = [{\"cluster\": int(line.split(\",\")[0].split(\":\")[1].strip()), \"topic\": line.split(\":\")[2].strip()} for line in cluster_topic_mapping_lines]\n",
    "\n",
    "# Parse new topics\n",
    "new_topics_lines = new_topics_part.split(\"\\n\")[1:]  # Skip the first line\n",
    "new_topics = [line.split(\". \")[1] for line in new_topics_lines]\n",
    "\n",
    "cluster_topic_mapping, new_topics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CX26-PGdcui0"
   },
   "source": [
    "And finally we can use this information to further prompt a model to keep generating synthetic data. We do this by passing all the topics in the list of jsons to the prompt below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "id": "zHf4LnVk0aHw"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Automotive  \n",
      "Input: \"Tesla Model S, Electric Vehicles\"  \n",
      "Output: \"The Tesla Model S delivers exhilarating performance with advanced electric technology, offering a sleek design, impressive range, and an industry-leading infotainment system.\"\n",
      "\n",
      "2. Personal Care  \n",
      "Input: \"Oral-B Pro 1000, Electronic Toothbrush\"  \n",
      "Output: \"The Oral-B Pro 1000 features a 3D cleaning action that oscillates, rotates, and pulsates to remove plaque, ensuring a deeper clean for healthier gums.\"\n",
      "\n",
      "3. Footwear  \n",
      "Input: \"Nike Air Max 270, Shoes\"  \n",
      "Output: \"Step into comfort and style with Nike Air Max 270, designed with a large Max Air unit for superior cushioning and a breathable upper for a snug fit.\"\n",
      "\n",
      "4. Electronics  \n",
      "Input: \"Apple iPhone 12, Mobile Phones\"  \n",
      "Output: \"The Apple iPhone 12 combines powerful performance with stunning design, equipped with A14 Bionic chip and advanced camera systems for capturing every moment in stunning detail.\"\n",
      "\n",
      "5. Food  \n",
      "Input: \"Nature Valley Granola Bars, Snacks\"  \n",
      "Output: \"Nature Valley Granola Bars offer a wholesome crunch made from simple, delicious ingredients, providing a perfect snack that fuels your adventure.\"\n",
      "\n",
      "6. Automotive  \n",
      "Input: \"Ford F-150, Electric Vehicles\"  \n",
      "Output: \"The Ford F-150 stands at the forefront of durability and innovation, with its powerful electric version setting new standards for strength and sustainability in the truck category.\" \n",
      "\n",
      "7. Personal Care  \n",
      "Input: \"Philips Sonicare, Electronic Toothbrush\"  \n",
      "Output: \"Philips Sonicare delivers superior cleaning with dynamic technology that provides up to 31,000 strokes per minute for a healthier mouth and brighter smile.\"\n",
      "\n",
      "8. Footwear  \n",
      "Input: \"Adidas Ultraboost, Shoes\"  \n",
      "Output: \"The Adidas Ultraboost is a game-changer in running footwear, featuring responsive cushioning and a knit upper for a snug, supportive fit that adapts to any run.\"\n",
      "\n",
      "9. Electronics  \n",
      "Input: \"Dell XPS 13, Laptop\"  \n",
      "Output: \"The Dell XPS 13 is a remarkable laptop with an ultra-thin design, featuring a stunning InfinityEdge display and powerful performance to accommodate your multitasking needs.\"\n",
      "\n",
      "10. Food  \n",
      "Input: \"Kraft Macaroni & Cheese, Instant Food\"  \n",
      "Output: \"Kraft Macaroni & Cheese offers quick and convenient comfort food, combining creamy cheese sauce with perfectly cooked pasta for a simple meal that satisfies.\"\n",
      "\n",
      "1. Automotive  \n",
      "Input: \"Toyota Camry, Mobile Phones\"  \n",
      "Output: \"The Toyota Camry is a midsize sedan that combines efficiency with modern technology. It offers a spacious interior and the latest features for an enjoyable driving experience.\"\n",
      "\n",
      "2. Personal Care  \n",
      "Input: \"Oral-B Pro 1000, Electronic Toothbrush\"  \n",
      "Output: \"The Oral-B Pro 1000 not only provides powerful cleaning action but also enhances your oral hygiene routine with its smart pressure sensor and various cleaning modes.\"\n",
      "\n",
      "3. Footwear  \n",
      "Input: \"Nike Air Max, Shoes\"  \n",
      "Output: \"Step into comfort with the Nike Air Max. With cutting-edge technology and a sleek design, these shoes are perfect for athletes and casual wearers alike.\"\n",
      "\n",
      "4. Food  \n",
      "Input: \"Nature's Valley Granola Bar, Food\"  \n",
      "Output: \"Savor the wholesome goodness of Nature's Valley Granola Bar, crafted with real ingredients to fuel your day with delicious flavor and crunchy satisfaction.\"\n",
      "\n",
      "5. Electric Vehicles  \n",
      "Input: \"Tesla Model 3, Mobile Phones\"  \n",
      "Output: \"The Tesla Model 3 is a revolutionary electric vehicle that combines performance with sustainability, featuring an intuitive interface and cutting-edge technology for an exceptional driving experience.\"\n",
      "\n",
      "1. Automotive  \n",
      "Input: \"Tesla Model 3, Electric Vehicles\"  \n",
      "Output: \"The Tesla Model 3 combines cutting-edge technology with eco-friendly driving. Enjoy a sleek design, impressive range, and top-notch safety features, making it the perfect electric car for the modern driver.\"\n",
      "\n",
      "2. Personal Care  \n",
      "Input: \"Oral-B Pro 1000, Electronic Toothbrush\"  \n",
      "Output: \"Achieve a superior clean with the Oral-B Pro 1000. Featuring advanced 3D cleaning action, this electronic toothbrush ensures effective plaque removal while being gentle on gums, allowing you to maintain optimum oral health.\"\n",
      "\n",
      "3. Footwear  \n",
      "Input: \"Nike Air Max, Shoes\"  \n",
      "Output: \"Step up your game with Nike Air Max shoes. Combining iconic cushioning technology and bold style, these shoes provide ultimate comfort and support, perfect for both casual wear and athletic performance.\"\n",
      "\n",
      "4. Food  \n",
      "Input: \"Oreo Cookies, Snacks\"  \n",
      "Output: \"Indulge in the classic taste of Oreo Cookies. With their irresistible cream filling sandwiched between two crunchy chocolate wafers, these treats are perfect for satisfying your sweet tooth any time of the day.\"\n",
      "\n",
      "5. Personal Care  \n",
      "Input: \"Garnier Micellar Water, Skincare\"  \n",
      "Output: \"Garnier Micellar Water gently removes makeup and impurities while hydrating the skin. This soothing formula is suitable for all skin types, making it a must-have in your daily skincare routine.\"\n",
      "\n",
      "6. Automotive  \n",
      "Input: \"Ford F-150, Trucks\"  \n",
      "Output: \"The Ford F-150 is the quintessential pickup truck, combining power, reliability, and innovative technology. Equipped with advanced towing capabilities and a spacious interior, it's designed for both work and play.\"\n",
      "\n",
      "7. Electronics  \n",
      "Input: \"Samsung Galaxy S21, Mobile Phones\"  \n",
      "Output: \"Experience the future of mobile technology with the Samsung Galaxy S21. This smartphone features a stunning display, powerful processor, and multiple camera options, perfect for capturing life's moments in high definition.\"\n",
      "\n",
      "8. Footwear  \n",
      "Input: \"Adidas Ultraboost, Shoes\"  \n",
      "Output: \"Run in style with Adidas Ultraboost shoes. Known for their comfort and performance, these shoes utilize responsive cushioning to provide unmatched energy return with every step you take.\" \n",
      "\n",
      "9. Electronics  \n",
      "Input: \"Dell XPS 13, Laptops\"  \n",
      "Output: \"The Dell XPS 13 redefines the laptop experience with its stunning InfinityEdge display, powerful performance, and sleek design. Ideal for both professionals and students looking for portability and functionality.\"\n",
      "\n",
      "10. Personal Care  \n",
      "Input: \"Philips Sonicare, Electronic Toothbrush\"  \n",
      "Output: \"Philips Sonicare's electronic toothbrush guarantees a superior cleaning experience with its advanced sonic technology. This toothbrush not only helps remove plaque but also promotes healthier gums for a brighter smile.\"\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "output_string = \"\"\n",
    "for i in range(3):\n",
    "  question = f\"\"\"\n",
    "  I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description. the category should be things like: mobile phones, shoes, headphones, laptop, electronic toothbrush, etc. and also more importantly the categories should come under some main topics: {[entry['topic'] for entry in cluster_topic_mapping]})\n",
    "  After the number of each example also state the topic area. The format should be of the form:\n",
    "  1. topic_area\n",
    "  Input: product_name, category\n",
    "  Output: description\n",
    "\n",
    "  Do not add any extra characters around that formatting as it will make the output parsing break.\n",
    "\n",
    "  Here are some helpful examples so you get the style of output correct.\n",
    "\n",
    "  1) clothing\n",
    "  Input: \"Shoe Name, Shoes\"\n",
    "  Output: \"Experience unparalleled comfort. These shoes feature a blend of modern style and the traditional superior cushioning, perfect for those always on the move.\"\n",
    "  \"\"\"\n",
    "\n",
    "  response = client.chat.completions.create(\n",
    "    model=\"gpt-4o-mini\",\n",
    "    messages=[\n",
    "      {\"role\": \"system\", \"content\": \"You are a helpful assistant designed to generate synthetic data.\"},\n",
    "      {\"role\": \"user\", \"content\": question}\n",
    "    ]\n",
    "  )\n",
    "  res = response.choices[0].message.content\n",
    "  output_string += res + \"\\n\" + \"\\n\"\n",
    "print(output_string)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RQMHQxnZdRug"
   },
   "source": [
    "You can run this in a loop to append to your previous data and in this way you can keep generating more textual synthetic data to train another GPT model while making sure that we cater to imbalanced datasets and generating a diversity of data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Hiim8Xg5djGH"
   },
   "source": [
    "You have now completed part 1 of the synthetic data generation tutorial where we have gone through:\n",
    "*   CSV with a structured prompt\n",
    "*   CSV with a Python program\n",
    "*   Multitable CSV with a python program\n",
    "*   Simply creating textual data\n",
    "*   Dealing with imbalanced or non-diverse textual data\n",
    "\n",
    "In part 2 you will find find out techniques for better prompting an LLM to enhance textual synthetic data generation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}