diff --git a/examples/third_party_examples/Visualizing_embeddings_in_Kangas.ipynb b/examples/third_party_examples/Visualizing_embeddings_in_Kangas.ipynb new file mode 100644 index 0000000..bc0c377 --- /dev/null +++ b/examples/third_party_examples/Visualizing_embeddings_in_Kangas.ipynb @@ -0,0 +1,439 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "0wjP9mrldJsd" + }, + "source": [ + "## Visualizing the embeddings in Kangas\n", + "\n", + "In this Jupyter Notebook, we construct a Kangas DataGrid containing the data and projections of the embeddings into 2 dimensions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4tPKQqqldJsj" + }, + "source": [ + "## What is Kangas?\n", + "\n", + "[Kangas](https://github.com/comet-ml/kangas/) is an open source, mixed-media, dataframe-like tool for data scientists. It was developed by [Comet](https://comet.com/), a company designed to help reduce the friction of moving models into production. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6sNsB2iFdJsk" + }, + "source": [ + "### 1. Setup\n", + "\n", + "To get started, we pip install kangas and import it." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "N8gi529adL-f", + "outputId": "c12e9973-a179-41e3-c5a8-f241804d99ad" + }, + "outputs": [], + "source": [ + "%pip install kangas --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "htxjXThodRxD" + }, + "outputs": [], + "source": [ + "import kangas as kg" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2. Constructing a Kangas DataGrid\n", + "\n", + "We create a Kangas DataGrid with the original data and the embeddings. The data is composed of rows of reviews, and each embedding is composed of 1536 floating-point values. 
In this example, we get the data directly from GitHub, in case you aren't running this notebook inside OpenAI's repo.\n", + "\n", + "We use Kangas to read the CSV file into a DataGrid for further processing." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0SxWlRTrdVJq", + "outputId": "d36c3a14-2e80-4315-e285-f39f6b008976" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "1001it [00:00, 2412.90it/s]\n", + "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2899.16it/s]\n" + ] + } + ], + "source": [ + "data = kg.read_csv(\"https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can review the fields of the CSV file:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bzhQgoRGeMCp", + "outputId": "791c4e40-fb28-409e-d1e9-20b753fb1215" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DataGrid (in memory)\n", + " Name : fine_food_reviews_with_embeddings_1k\n", + " Rows : 1,000\n", + " Columns: 9\n", + "# Column Non-Null Count DataGrid Type \n", + "--- -------------------- --------------- --------------------\n", + "1 Column 1 1,000 INTEGER \n", + "2 ProductId 1,000 TEXT \n", + "3 UserId 1,000 TEXT \n", + "4 Score 1,000 INTEGER \n", + "5 Summary 1,000 TEXT \n", + "6 Text 1,000 TEXT \n", + "7 combined 1,000 TEXT \n", + "8 n_tokens 1,000 INTEGER \n", + "9 embedding 1,000 TEXT \n" + ] + } + ], 
+ "source": [ + "data.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And get a glimpse of the first and last rows:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 349 + }, + "id": "Q95N832aeaBr", + "outputId": "aaea2816-e5a1-4e52-f228-c3e6aca6fa3e" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
" + ], + "text/plain": [ + " row-id Column 1 ProductId UserId Score Summary Text combined n_tokens embedding \n", + " 1 0 B003XPF9BO A3R7JR3FMEBXQB 5 where does one Wanted to save Title: where do 52 [0.007018072064 \n", + " 2 297 B003VXHGPK A21VWSCGW7UUAR 4 Good, but not W Honestly, I hav Title: Good, bu 178 [-0.00314055196 \n", + " 3 296 B008JKTTUA A34XBAIFT02B60 1 Should advertis First, these sh Title: Should a 78 [-0.01757248118 \n", + " 4 295 B000LKTTTW A14MQ40CCU8B13 5 Best tomato sou I have a hard t Title: Best tom 111 [-0.00139322795 \n", + " 5 294 B001D09KAM A34XBAIFT02B60 1 Should advertis First, these sh Title: Should a 78 [-0.01757248118 \n", + "...\n", + " 996 623 B0000CFXYA A3GS4GWPIBV0NT 1 Strange inflamm Truthfully wasn Title: Strange 110 [0.000110913533 \n", + " 997 624 B0001BH5YM A1BZ3HMAKK0NC 5 My favorite and You've just got Title: My favor 80 [-0.02086931467 \n", + " 998 625 B0009ET7TC A2FSDQY5AI6TNX 5 My furbabies LO Shake the conta Title: My furba 47 [-0.00974910240 \n", + " 999 619 B007PA32L2 A15FF2P7RPKH6G 5 got this for th all i have hear Title: got this 50 [-0.00521062919 \n", + " 1000 999 B001EQ5GEO A3VYU0VO6DYV6I 5 I love Maui Cof My first experi Title: I love M 118 [-0.00605782261 \n", + "\n", + " [1000 rows x 9 columns] \n", + "\n", + "* Use DataGrid.save() to save to disk\n", + "** Use DataGrid.show() to start user interface" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we create a new DataGrid, converting the numbers into an Embedding:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "Bu0erP68dvLU" + }, + "outputs": [], + "source": [ + "import ast # to convert string of a list of numbers into a list of numbers\n", + "\n", + "dg = kg.DataGrid(\n", + " name=\"openai_embeddings\",\n", + " columns=data.get_columns(),\n", + " converters={\"Score\": 
str},\n", + ")\n", + "# Column indices: row[3] is Score, row[4] is Summary, row[8] is the embedding string\n", + "for row in data:\n", + "    # Parse the stringified list into a list of floats\n", + "    embedding = ast.literal_eval(row[8])\n", + "    row[8] = kg.Embedding(\n", + "        embedding,\n", + "        name=str(row[3]),\n", + "        text=\"%s - %.10s\" % (row[3], row[4]),\n", + "        projection=\"umap\",\n", + "    )\n", + "    dg.append(row)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The new DataGrid now has an embedding column with the proper datatype." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "gd6Od4Bmhijy", + "outputId": "9aa38221-0272-4a63-e393-706e0a0c5879" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DataGrid (in memory)\n", + " Name : openai_embeddings\n", + " Rows : 1,000\n", + " Columns: 9\n", + "# Column Non-Null Count DataGrid Type \n", + "--- -------------------- --------------- --------------------\n", + "1 Column 1 1,000 INTEGER \n", + "2 ProductId 1,000 TEXT \n", + "3 UserId 1,000 TEXT \n", + "4 Score 1,000 TEXT \n", + "5 Summary 1,000 TEXT \n", + "6 Text 1,000 TEXT \n", + "7 combined 1,000 TEXT \n", + "8 n_tokens 1,000 INTEGER \n", + "9 embedding 1,000 EMBEDDING-ASSET \n" + ] + } + ], + "source": [ + "dg.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We simply save the DataGrid, and we're done." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dg.save()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3. Render 2D Projections\n", + "\n", + "To render the data directly in the notebook, simply show it. Note that each row contains an embedding projection. \n", + "\n", + "Scroll to the far right to see the embedding projection for each row.\n", + "\n", + "The color of the point in projection space represents the Score." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 771 + }, + "id": "Z8j-GdpiijU0", + "outputId": "20a0b1ca-3059-4384-cd8c-b32b1aa1c270" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "dg.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Group by \"Score\" to see rows of each group." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "dg.show(group=\"Score\", sort=\"Score\", rows=5, select=\"Score,embedding\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vLIxfmK5dJsq" + }, + "source": [ + "An example of this datagrid is hosted here: https://kangas.comet.com/?datagrid=/data/openai_embeddings.datagrid" + ] + } + ], + "metadata": { + "accelerator": "TPU", + "colab": { + "gpuType": "V100", + "machine_shape": "hm", + "provenance": [] + }, + "gpuClass": "standard", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + }, + "vscode": { + "interpreter": { + "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}