{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Load the dataset\n", "\n", "The dataset used in this example is the [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) dataset from Amazon. It contains a total of 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be either positive or negative. Each review has a ProductId, UserId, Score, review title (Summary), and review body (Text).\n", "\n", "We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Time</th><th>ProductId</th><th>UserId</th><th>Score</th><th>Summary</th><th>Text</th><th>combined</th></tr>\n",
" <tr><th>Id</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>1</th><td>1303862400</td><td>B001E4KFG0</td><td>A3SGXH7AUHU8GW</td><td>5</td><td>Good Quality Dog Food</td><td>I have bought several of the Vitality canned d...</td><td>Title: Good Quality Dog Food; Content: I have ...</td></tr>\n",
" <tr><th>2</th><td>1346976000</td><td>B00813GRG4</td><td>A1D87F6ZCVE5NK</td><td>1</td><td>Not as Advertised</td><td>Product arrived labeled as Jumbo Salted Peanut...</td><td>Title: Not as Advertised; Content: Product arr...</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Time ProductId UserId Score Summary \\\n", "Id \n", "1 1303862400 B001E4KFG0 A3SGXH7AUHU8GW 5 Good Quality Dog Food \n", "2 1346976000 B00813GRG4 A1D87F6ZCVE5NK 1 Not as Advertised \n", "\n", " Text \\\n", "Id \n", "1 I have bought several of the Vitality canned d... \n", "2 Product arrived labeled as Jumbo Salted Peanut... \n", "\n", " combined \n", "Id \n", "1 Title: Good Quality Dog Food; Content: I have ... \n", "2 Title: Not as Advertised; Content: Product arr... " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# load the reviews, using the first column (Id) as the index\n", "df = pd.read_csv('input/Reviews.csv', index_col=0)\n", "df = df[['Time', 'ProductId', 'UserId', 'Score', 'Summary', 'Text']]\n", "df = df.dropna()\n", "# combine the title and body into the single text that will be embedded\n", "df['combined'] = \"Title: \" + df.Summary.str.strip() + \"; Content: \" + df.Text.str.strip()\n", "df.head(2)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1000" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import GPT2TokenizerFast\n", "\n", "# take the 1,100 most recent reviews, leaving headroom for the length filter below\n", "df = df.sort_values('Time').tail(1_100)\n", "df.drop('Time', axis=1, inplace=True)\n", "\n", "tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")\n", "\n", "# drop reviews of 2,000 tokens or more, then keep the 1,000 most recent\n", "df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))\n", "df = df[df.n_tokens < 2000].tail(1_000)\n", "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Get embeddings and save them for future reuse" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from openai.embeddings_utils import get_embedding\n", "\n", "# embed each review with a similarity model and a search (document) model\n", "# This will take just under 10 minutes\n", "df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='text-similarity-babbage-001'))\n", "df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, engine='text-search-babbage-doc-001'))\n", "df.to_csv('output/embedded_1k_reviews.csv')" ] } ], "metadata": { "interpreter": { "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8" }, "kernelspec": { "display_name": "Python 3.7.3 64-bit ('base': conda)", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }