diff --git a/examples/vector_databases/tair/Getting_started_with_Tair_and_OpenAI.ipynb b/examples/vector_databases/tair/Getting_started_with_Tair_and_OpenAI.ipynb new file mode 100644 index 0000000..1b69017 --- /dev/null +++ b/examples/vector_databases/tair/Getting_started_with_Tair_and_OpenAI.ipynb @@ -0,0 +1,546 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [
+ "# Using Tair as a vector database for OpenAI embeddings\n", + "\n", + "This notebook guides you step by step through using Tair as a vector database for OpenAI embeddings.\n", + "\n", + "This notebook presents an end-to-end process of:\n", + "1. Using precomputed embeddings created by the OpenAI API.\n", + "2. Storing the embeddings in a cloud instance of Tair.\n", + "3. Converting a raw text query to an embedding with the OpenAI API.\n", + "4. Using Tair to perform the nearest neighbour search in the created collection.\n", + "\n",
+ "### What is Tair\n", + "\n", + "[Tair](https://www.alibabacloud.com/help/en/tair/latest/what-is-tair) is a cloud native in-memory database service developed by Alibaba Cloud. Tair is compatible with open source Redis and provides a variety of data models and enterprise-class capabilities to support your real-time online scenarios. Tair also introduces persistent memory-optimized instances that are based on the new non-volatile memory (NVM) storage medium. These instances can reduce costs by 30%, ensure data persistence, and provide almost the same performance as in-memory databases. Tair has been widely used in areas such as government affairs, finance, manufacturing, healthcare, and the Internet industry to meet high-speed query and computing requirements.\n", + "\n",
+ "[TairVector](https://www.alibabacloud.com/help/en/tair/latest/tairvector) is an in-house data structure that provides high-performance real-time storage and retrieval of vectors. TairVector provides two indexing algorithms: Hierarchical Navigable Small World (HNSW) and Flat Search. Additionally, TairVector supports multiple distance functions, such as Euclidean distance, inner product, and Jaccard distance. Compared with traditional vector retrieval services, TairVector has the following advantages:\n", + "- Stores all data in memory and supports real-time index updates to reduce the latency of read and write operations.\n", + "- Uses an optimized data structure in memory to better utilize storage capacity.\n", + "- Functions as an out-of-the-box data structure in a simple and efficient architecture without complex modules or dependencies.\n", + "\n",
+ "### Deployment options\n", + "\n", + "- Using [Tair Cloud Vector Database](https://www.alibabacloud.com/help/en/tair/latest/getting-started-overview). [Click here](https://www.alibabacloud.com/product/tair) to deploy it quickly.\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "For the purposes of this exercise we need to prepare a few things:\n", + "\n", + "1. A Tair cloud server instance.\n", + "2. The `tair` library to interact with the Tair database.\n", + "3. An [OpenAI API key](https://beta.openai.com/account/api-keys).\n", + "\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Install requirements\n", + "\n", + "This notebook requires the `openai` and `tair` packages, along with a few additional libraries. 
The following command installs them all:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "ExecuteTime": { + "end_time": "2023-02-16T12:05:05.718972Z", + "start_time": "2023-02-16T12:04:30.434820Z" + }, + "pycharm": { + "is_executing": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Looking in indexes: http://sg.mirrors.cloud.aliyuncs.com/pypi/simple/\n", + "Requirement already satisfied: openai in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (0.28.0)\n", + "Requirement already satisfied: redis in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (5.0.0)\n", + "Requirement already satisfied: tair in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (1.3.6)\n", + "Requirement already satisfied: pandas in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (2.1.0)\n", + "Requirement already satisfied: wget in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (3.2)\n", + "Requirement already satisfied: requests>=2.20 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (2.31.0)\n", + "Requirement already satisfied: tqdm in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (4.66.1)\n", + "Requirement already satisfied: aiohttp in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (3.8.5)\n", + "Requirement already satisfied: async-timeout>=4.0.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from redis) (4.0.3)\n", + "Requirement already satisfied: numpy>=1.22.4 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (1.25.2)\n", + "Requirement already satisfied: python-dateutil>=2.8.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2.8.2)\n", + "Requirement already satisfied: pytz>=2020.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2023.3.post1)\n", + "Requirement already satisfied: tzdata>=2022.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2023.3)\n", + "Requirement already satisfied: six>=1.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (3.2.0)\n", + "Requirement already satisfied: idna<4,>=2.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (3.4)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2.0.4)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2023.7.22)\n", + "Requirement already satisfied: attrs>=17.3.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (22.1.0)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (6.0.4)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.9.2)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.4.0)\n", + "Requirement already satisfied: aiosignal>=1.1.2 in 
/root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.3.1)\n", + "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "! pip install openai redis tair pandas wget" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare your OpenAI API key\n", + "\n", + "The OpenAI API key is used for vectorization of the documents and queries.\n", + "\n", + "If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n", + "\n", + "Once you have your key, please provide it below using `getpass`." + ] + },
+ { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "ExecuteTime": { + "end_time": "2023-02-16T12:05:05.730338Z", + "start_time": "2023-02-16T12:05:05.723351Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input your OpenAI API key:········\n" + ] + } + ], + "source": [ + "import getpass\n", + "import openai\n", + "\n", + "openai.api_key = getpass.getpass(\"Input your OpenAI API key:\")" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Connect to Tair\n", + "\n", + "Connecting to a running instance of the Tair server is easy with the official Python library. Provide the connection URL below when prompted." + ] + },
+ { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input your tair url:········\n" + ] + } + ], + "source": [ + "# The format of url: redis://[[username]:[password]]@localhost:6379/0\n", + "TAIR_URL = getpass.getpass(\"Input your tair url:\")" + ] + },
+ { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "from tair import Tair as TairClient\n", + "\n", + "# Connect to Tair from the URL and create a client\n", + "\n", + "url = TAIR_URL\n", + "client = TairClient.from_url(url)" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can test the connection with a ping:" + ] + },
+ { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "ExecuteTime": { + "end_time": "2023-02-16T12:05:06.848488Z", + "start_time": "2023-02-16T12:05:06.832612Z" + }, + "pycharm": { + "is_executing": true + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client.ping()" + ] + },
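+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Download data\n", + "\n", + "Next we download a snapshot of precomputed embeddings for Wikipedia articles (created with the `text-embedding-ada-002` model, as used later in this notebook), so you don't have to spend your own credits embedding the articles. The archive is about 700 MB, so the download can take a while." + ] + },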
+ { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "ExecuteTime": { + "end_time": "2023-02-16T12:05:37.371951Z", + "start_time": "2023-02-16T12:05:06.851634Z" + }, + "pycharm": { + "is_executing": true + }, + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100% [......................................................................] 698933052 / 698933052" + ] + }, + { + "data": { + "text/plain": [ + "'vector_database_wikipedia_articles_embedded (1).zip'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import wget\n", + "\n", + "embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n", + "\n", + "# The file is ~700 MB, so the download will take some time\n", + "wget.download(embeddings_url)" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The downloaded file then has to be extracted:" + ] + },
+ { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "ExecuteTime": { + "end_time": "2023-02-16T12:06:01.538851Z", + "start_time": "2023-02-16T12:05:37.376042Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.\n" + ] + } + ], + "source": [ + "import zipfile\n", + "import os\n", + "\n", + "current_directory = os.getcwd()\n", + "zip_file_path = os.path.join(current_directory, \"vector_database_wikipedia_articles_embedded.zip\")\n", + "output_directory = os.path.join(current_directory, \"../../data\")\n", + "\n", + "with zipfile.ZipFile(zip_file_path, \"r\") as zip_ref:\n", + " zip_ref.extractall(output_directory)\n", + "\n", + "\n", + "# Check that the CSV file exists\n", + "file_name = \"vector_database_wikipedia_articles_embedded.csv\"\n", + "data_directory = os.path.join(current_directory, \"../../data\")\n", + "file_path = os.path.join(data_directory, file_name)\n", + "\n", + "\n", + "if os.path.exists(file_path):\n", + " print(f\"The file {file_name} exists in the data directory.\")\n", + "else:\n", + " print(f\"The file {file_name} does not exist in the data directory.\")\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create Index\n", + "\n", + "Tair stores data in indexes, where each object is described by one key. Each key holds a vector and multiple attribute keys.\n", + "\n", + "We will start by creating two indexes, one for **title_vector** and one for **content_vector**, and then we will fill them with our precomputed embeddings." + ] + },
+ { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Index already exists\n", + "Index already exists\n" + ] + } + ], + "source": [ + "# Set index parameters\n", + "index = \"openai_test\"\n", + "embedding_dim = 1536\n", + "distance_type = \"L2\"\n", + "index_type = \"HNSW\"\n", + "data_type = \"FLOAT32\"\n", + "\n", + "# Create two indexes, one for title_vector and one for content_vector; skip if they already exist\n", + "index_names = [index + \"_title_vector\", index + \"_content_vector\"]\n", + "for index_name in index_names:\n", + " index_connection = client.tvs_get_index(index_name)\n", + " if index_connection is not None:\n", + " print(\"Index already exists\")\n", + " else:\n", + " client.tvs_create_index(name=index_name, dim=embedding_dim, distance_type=distance_type,\n", + " index_type=index_type, data_type=data_type)" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load data\n", + "\n", + "In this section we are going to load the data prepared earlier, so you don't have to recompute the embeddings of the Wikipedia articles with your own credits." + ] + },
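+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before running the full import, it can help to sanity-check the file on a small sample. The optional sketch below (the column names and the 1536 dimension follow from the dataset and the index definition above) parses only the first few rows and confirms that the embedding strings deserialize into vectors of the expected size." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from ast import literal_eval\n", + "\n", + "# Optional sanity check: parse only the first five rows of the CSV\n", + "sample_df = pd.read_csv('../../data/vector_database_wikipedia_articles_embedded.csv', nrows=5)\n", + "sample_df['title_vector'] = sample_df.title_vector.apply(literal_eval)\n", + "\n", + "print(sample_df[['id', 'url', 'title']])\n", + "# Each vector should match the index dimension (1536 for text-embedding-ada-002)\n", + "print(f\"Embedding dimension: {len(sample_df.title_vector[0])}\")" + ] + },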
+ { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from ast import literal_eval\n", + "# Path to your local CSV file\n", + "csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'\n", + "article_df = pd.read_csv(csv_file_path)\n", + "\n", + "# Read vectors from strings back into lists\n", + "article_df['title_vector'] = article_df.title_vector.apply(literal_eval).values\n", + "article_df['content_vector'] = article_df.content_vector.apply(literal_eval).values\n", + "\n", + "# Add/update data in both indexes\n", + "for i in range(len(article_df)):\n", + " # Add data to the index with title_vector\n", + " client.tvs_hset(index=index_names[0], key=article_df.id[i].item(), vector=article_df.title_vector[i], is_binary=False,\n", + " **{\"url\": article_df.url[i], \"title\": article_df.title[i], \"text\": article_df.text[i]})\n", + " # Add data to the index with content_vector\n", + " client.tvs_hset(index=index_names[1], key=article_df.id[i].item(), vector=article_df.content_vector[i], is_binary=False,\n", + " **{\"url\": article_df.url[i], \"title\": article_df.title[i], \"text\": article_df.text[i]})" + ] + },
+ { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "ExecuteTime": { + "end_time": "2023-02-16T12:30:40.675202Z", + "start_time": "2023-02-16T12:30:40.655654Z" + }, + "pycharm": { + "is_executing": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Count in openai_test_title_vector:25000\n", + "Count in openai_test_content_vector:25000\n" + ] + } + ], + "source": [ + "# Check the data count to make sure all the points have been stored\n", + "for index_name in index_names:\n", + " stats = client.tvs_get_index(index_name)\n", + " count = int(stats[\"current_record_count\"]) - int(stats[\"delete_record_count\"])\n", + " print(f\"Count in {index_name}:{count}\")\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Search data\n", + "\n", + "Once the data is stored in Tair, we can start querying the collection for the closest vectors. The additional `vector_name` parameter lets us switch from title-based to content-based search. Since the precomputed embeddings were created with the `text-embedding-ada-002` OpenAI model, we also have to use it during search.\n" + ] + },
+ { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "ExecuteTime": { + "end_time": "2023-02-16T12:30:38.024370Z", + "start_time": "2023-02-16T12:30:37.712816Z" + } + }, + "outputs": [], + "source": [ + "import openai\n", + "import numpy as np\n", + "\n", + "def query_tair(client, query, vector_name=\"title_vector\", top_k=5):\n", + "\n", + " # Create an embedding vector from the user query\n", + " embedded_query = openai.Embedding.create(\n", + " input=query,\n", + " model=\"text-embedding-ada-002\",\n", + " )[\"data\"][0]['embedding']\n", + " embedded_query = np.array(embedded_query)\n", + "\n", + " # Search for the top k approximate nearest neighbours of the vector in an index\n", + " query_result = client.tvs_knnsearch(index=index+\"_\"+vector_name, k=top_k, vector=embedded_query)\n", + "\n", + " return query_result" + ] + },
+ { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "ExecuteTime": { + "end_time": "2023-02-16T12:30:39.379566Z", + "start_time": "2023-02-16T12:30:38.031041Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. Museum of Modern Art (Distance: 0.125)\n", + "2. Western Europe (Distance: 0.133)\n", + "3. Renaissance art (Distance: 0.136)\n", + "4. Pop art (Distance: 0.14)\n", + "5. Northern Europe (Distance: 0.145)\n" + ] + } + ], + "source": [ + "query_result = query_tair(client=client, query=\"modern art in Europe\", vector_name=\"title_vector\")\n", + "for i in range(len(query_result)):\n", + " title = client.tvs_hmget(index+\"_\"+\"content_vector\", query_result[i][0].decode('utf-8'), \"title\")\n", + " print(f\"{i + 1}. {title[0].decode('utf-8')} (Distance: {round(query_result[i][1],3)})\")" + ] + },
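+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each key also carries the attributes we stored with `tvs_hset` (`url`, `title`, `text`). As a sketch — assuming `tvs_hmget` accepts several attribute names at once, mirroring HMGET semantics — we can pull more than one attribute for the best match:" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Fetch the url and title attributes of the top match in one call\n", + "# (assumes tvs_hmget accepts multiple attribute names)\n", + "top_key = query_result[0][0].decode('utf-8')\n", + "url, title = client.tvs_hmget(index + \"_title_vector\", top_key, \"url\", \"title\")\n", + "print(f\"Best match: {title.decode('utf-8')} -> {url.decode('utf-8')}\")" + ] + },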
+ { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "ExecuteTime": { + "end_time": "2023-02-16T12:30:40.652676Z", + "start_time": "2023-02-16T12:30:39.382555Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. Battle of Bannockburn (Distance: 0.131)\n", + "2. Wars of Scottish Independence (Distance: 0.139)\n", + "3. 1651 (Distance: 0.147)\n", + "4. First War of Scottish Independence (Distance: 0.15)\n", + "5. Robert I of Scotland (Distance: 0.154)\n" + ] + } + ], + "source": [ + "# This time we'll query using the content vector\n", + "query_result = query_tair(client=client, query=\"Famous battles in Scottish history\", vector_name=\"content_vector\")\n", + "for i in range(len(query_result)):\n", + " title = client.tvs_hmget(index+\"_\"+\"content_vector\", query_result[i][0].decode('utf-8'), \"title\")\n", + " print(f\"{i + 1}. {title[0].decode('utf-8')} (Distance: {round(query_result[i][1],3)})\")" + ] + },
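+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you want to re-run the notebook from scratch, you can drop the demo indexes first. This is an optional sketch, assuming the `tair` client exposes `tvs_del_index` for the TVS.DELINDEX command:" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Optional cleanup: remove both demo indexes and their data\n", + "# (assumes tvs_del_index is available in the tair client)\n", + "for index_name in index_names:\n", + " client.tvs_del_index(index_name)" + ] + },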
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:notebook] *", + "language": "python", + "name": "conda-env-notebook-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +}
diff --git a/examples/vector_databases/tair/QA_with_Langchain_Tair_and_OpenAI.ipynb b/examples/vector_databases/tair/QA_with_Langchain_Tair_and_OpenAI.ipynb new file mode 100644 index 0000000..c8e9cae --- /dev/null +++ b/examples/vector_databases/tair/QA_with_Langchain_Tair_and_OpenAI.ipynb @@ -0,0 +1,496 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Question Answering with Langchain, Tair and OpenAI\n", + "\n", + "This notebook presents how to implement a Question Answering system with Langchain, Tair as a knowledge base, and OpenAI embeddings. If you are not familiar with Tair, it’s best to check out the [Getting_started_with_Tair_and_OpenAI.ipynb](Getting_started_with_Tair_and_OpenAI.ipynb) notebook first.\n", + "\n", + "This notebook presents an end-to-end process of:\n", + "- Calculating the embeddings with the OpenAI API.\n", + "- Storing the embeddings in a Tair instance to build a knowledge base.\n", + "- Converting a raw text query to an embedding with the OpenAI API.\n", + "- Using Tair to perform the nearest neighbour search in the created collection to find some context.\n", + "- Asking the LLM to find the answer in the given context.\n", + "\n", + "All the steps will be simplified to calling some corresponding Langchain methods." + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "For the purposes of this exercise we need to prepare a few things:\n", + "\n", + "- A [Tair cloud instance](https://www.alibabacloud.com/help/en/tair/latest/what-is-tair).\n", + "- [Langchain](https://github.com/hwchase17/langchain) as a framework.\n", + "- An OpenAI API key." + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Install requirements\n", + "This notebook requires the following Python packages: `openai`, `tiktoken`, `langchain` and `tair`.\n", + "- `openai` provides convenient access to the OpenAI API.\n", + "- `tiktoken` is a fast BPE tokeniser for use with OpenAI's models.\n", + "- `langchain` helps us build applications with LLMs more easily.\n", + "- The `tair` library is used to interact with the Tair vector database." + ] + },
+ { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "ExecuteTime": { + "end_time": "2023-05-06T10:21:40.843630Z", + "start_time": "2023-05-06T10:21:38.796769Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Looking in indexes: http://sg.mirrors.cloud.aliyuncs.com/pypi/simple/\n", + "Requirement already satisfied: openai in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (0.28.0)\n", + "Requirement already satisfied: tiktoken in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (0.4.0)\n", + "Requirement already satisfied: langchain in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (0.0.281)\n", + "Requirement already satisfied: tair in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (1.3.6)\n", + "Requirement already satisfied: requests>=2.20 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (2.31.0)\n", + "Requirement already satisfied: tqdm in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (4.66.1)\n", + "Requirement already satisfied: aiohttp in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (3.8.5)\n", + "Requirement already satisfied: regex>=2022.1.18 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from tiktoken) (2023.8.8)\n", + "Requirement already satisfied: PyYAML>=5.3 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from langchain) (6.0.1)\n", + "Requirement already satisfied: SQLAlchemy<3,>=1.4 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from langchain) (2.0.20)\n", + "Requirement already satisfied: async-timeout<5.0.0,>=4.0.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from langchain) (4.0.3)\n", + "Requirement already satisfied: dataclasses-json<0.6.0,>=0.5.7 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from langchain) (0.5.14)\n", + "Requirement already satisfied: langsmith<0.1.0,>=0.0.21 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from langchain) (0.0.33)\n", + "Requirement already satisfied: numexpr<3.0.0,>=2.8.4 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from langchain) (2.8.5)\n", + "Requirement already satisfied: numpy<2,>=1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from langchain) (1.25.2)\n", + "Requirement already satisfied: pydantic<3,>=1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from langchain) (1.10.12)\n", + "Requirement already satisfied: tenacity<9.0.0,>=8.1.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from langchain) (8.2.3)\n", + "Requirement 
already satisfied: redis>=4.4.4 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from tair) (5.0.0)\n", + "Requirement already satisfied: attrs>=17.3.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (22.1.0)\n", + "Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (3.2.0)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (6.0.4)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.9.2)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.4.0)\n", + "Requirement already satisfied: aiosignal>=1.1.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.3.1)\n", + "Requirement already satisfied: marshmallow<4.0.0,>=3.18.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from dataclasses-json<0.6.0,>=0.5.7->langchain) (3.20.1)\n", + "Requirement already satisfied: typing-inspect<1,>=0.4.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from dataclasses-json<0.6.0,>=0.5.7->langchain) (0.9.0)\n", + "Requirement already satisfied: typing-extensions>=4.2.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pydantic<3,>=1->langchain) (4.7.1)\n", + "Requirement already satisfied: idna<4,>=2.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (3.4)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2.0.4)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2023.7.22)\n", + "Requirement already satisfied: greenlet!=0.4.17 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from SQLAlchemy<3,>=1.4->langchain) (2.0.2)\n", + "Requirement already satisfied: packaging>=17.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from marshmallow<4.0.0,>=3.18.0->dataclasses-json<0.6.0,>=0.5.7->langchain) (23.1)\n", + "Requirement already satisfied: mypy-extensions>=0.3.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from typing-inspect<1,>=0.4.0->dataclasses-json<0.6.0,>=0.5.7->langchain) (1.0.0)\n", + "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "! pip install openai tiktoken langchain tair" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare your OpenAI API key\n", + "The OpenAI API key is used for vectorization of the documents and queries.\n", + "\n", + "If you don't have an OpenAI API key, you can get one from [https://platform.openai.com/account/api-keys](https://platform.openai.com/account/api-keys).\n", + "\n", + "Once you have your key, please provide it below using `getpass`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "ExecuteTime": { + "end_time": "2023-05-06T10:21:40.974668Z", + "start_time": "2023-05-06T10:21:40.845980Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input your OpenAI API key:········\n" + ] + } + ], + "source": [ + "import getpass\n", + "\n", + "openai_api_key = getpass.getpass(\"Input your OpenAI API key:\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare your Tair URL\n", + "To build the Tair connection, you need to have `TAIR_URL`." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "ExecuteTime": { + "end_time": "2023-05-06T10:21:41.574807Z", + "start_time": "2023-05-06T10:21:40.976664Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input your tair url:········\n" + ] + } + ], + "source": [ + "# The format of url: redis://[[username]:[password]]@localhost:6379/0\n", + "TAIR_URL = getpass.getpass(\"Input your tair url:\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load data\n", + "In this section we are going to load the data containing some natural questions and answers to them. All the data will be used to create a Langchain application with Tair being the knowledge base." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100% [..............................................................................] 95372 / 95372" + ] + }, + { + "data": { + "text/plain": [ + "'answers (2).json'" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import wget\n", + "\n", + "# All the examples come from https://ai.google.com/research/NaturalQuestions\n", + "# This is a sample of the training set that we download and extract for some\n", + "# further processing.\n", + "wget.download(\"https://storage.googleapis.com/dataset-natural-questions/questions.json\")\n", + "wget.download(\"https://storage.googleapis.com/dataset-natural-questions/answers.json\")" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "with open(\"questions.json\", \"r\") as fp:\n", + " questions = json.load(fp)\n", + "\n", + "with open(\"answers.json\", \"r\") as fp:\n", + " answers = json.load(fp)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "when is the last episode of season 8 of the walking dead\n" + ] + } + ], + "source": [ + "print(questions[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "No . overall No. in season Title Directed by Written by Original air date U.S. viewers ( millions ) 100 `` Mercy '' Greg Nicotero Scott M. Gimple October 22 , 2017 ( 2017 - 10 - 22 ) 11.44 Rick , Maggie , and Ezekiel rally their communities together to take down Negan . Gregory attempts to have the Hilltop residents side with Negan , but they all firmly stand behind Maggie . The group attacks the Sanctuary , taking down its fences and flooding the compound with walkers . 
With the Sanctuary defaced , everyone leaves except Gabriel , who reluctantly stays to save Gregory , but is left behind when Gregory abandons him . Surrounded by walkers , Gabriel hides in a trailer , where he is trapped inside with Negan . 101 `` The Damned '' Rosemary Rodriguez Matthew Negrete & Channing Powell October 29 , 2017 ( 2017 - 10 - 29 ) 8.92 Rick 's forces split into separate parties to attack several of the Saviors ' outposts , during which many members of the group are killed ; Eric is critically injured and rushed away by Aaron . Jesus stops Tara and Morgan from executing a group of surrendered Saviors . While clearing an outpost with Daryl , Rick is confronted and held at gunpoint by Morales , a survivor he met in the initial Atlanta camp , who is now with the Saviors . 102 `` Monsters '' Greg Nicotero Matthew Negrete & Channing Powell November 5 , 2017 ( 2017 - 11 - 05 ) 8.52 Daryl finds Morales threatening Rick and kills him ; the duo then pursue a group of Saviors who are transporting weapons to another outpost . Gregory returns to Hilltop , and after a heated argument , Maggie ultimately allows him back in the community . Eric dies from his injuries , leaving Aaron distraught . Despite Tara and Morgan 's objections , Jesus leads the group of surrendered Saviors to Hilltop . Ezekiel 's group attacks another Savior compound , during which several Kingdommers are shot while protecting Ezekiel . 103 `` Some Guy '' Dan Liu David Leslie Johnson November 12 , 2017 ( 2017 - 11 - 12 ) 8.69 Ezekiel 's group is overwhelmed by the Saviors , who kill all of them except for Ezekiel himself and Jerry . Carol clears the inside of the compound , killing all but two Saviors , who almost escape but are eventually caught by Rick and Daryl . En route to the Kingdom , Ezekiel , Jerry , and Carol are surrounded by walkers , but Shiva sacrifices herself to save them . The trio returns to the Kingdom , where Ezekiel 's confidence in himself as a leader has diminished . 104 5 `` The Big Scary U '' Michael E. Satrazemis Story by : Scott M. Gimple & David Leslie Johnson & Angela Kang Teleplay by : David Leslie Johnson & Angela Kang November 19 , 2017 ( 2017 - 11 - 19 ) 7.85 After confessing their sins to each other , Gabriel and Negan manage to escape from the trailer . Simon and the other lieutenants grow suspicious of each other , knowing that Rick 's forces must have inside information . The workers in the Sanctuary become increasingly frustrated with their living conditions , and a riot nearly ensues , until Negan returns and restores order . Gabriel is locked in a cell , where Eugene discovers him sick and suffering . Meanwhile , Rick and Daryl argue over how to take out the Saviors , leading Daryl to abandon Rick . 105 6 `` The King , the Widow , and Rick '' John Polson Angela Kang & Corey Reed November 26 , 2017 ( 2017 - 11 - 26 ) 8.28 Rick visits Jadis in hopes of convincing her to turn against Negan ; Jadis refuses , and locks Rick in a shipping container . Carl encounters Siddiq in the woods and recruits him to Alexandria . Daryl and Tara plot to deviate from Rick 's plans by destroying the Sanctuary . Ezekiel isolates himself at the Kingdom , where Carol tries to encourage him to be the leader his people need . Maggie has the group of captured Saviors placed in a holding area and forces Gregory to join them as punishment for betraying Hilltop . 
106 7 `` Time for After '' Larry Teng Matthew Negrete & Corey Reed December 3 , 2017 ( 2017 - 12 - 03 ) 7.47 After learning of Dwight 's association with Rick 's group , Eugene affirms his loyalty to Negan and outlines a plan to get rid of the walkers surrounding the Sanctuary . With help from Morgan and Tara , Daryl drives a truck through the Sanctuary 's walls , flooding its interior with walkers , killing many Saviors . Rick finally convinces Jadis and the Scavengers to align with him , and they plan to force the Saviors to surrender . However , when they arrive at the Sanctuary , Rick is horrified to see the breached walls and no sign of the walker herd . 107 8 `` How It 's Gotta Be '' Michael E. Satrazemis David Leslie Johnson & Angela Kang December 10 , 2017 ( 2017 - 12 - 10 ) 7.89 Eugene 's plan allows the Saviors to escape , and separately , the Saviors waylay the Alexandria , Hilltop , and Kingdom forces . The Scavengers abandon Rick , after which he returns to Alexandria . Ezekiel ensures that the Kingdom residents are able to escape before locking himself in the community with the Saviors . Eugene aids Gabriel and Doctor Carson in escaping the Sanctuary in order to ease his conscience . Negan attacks Alexandria , but Carl devises a plan to allow the Alexandria residents to escape into the sewers . Carl reveals he was bitten by a walker while escorting Siddiq to Alexandria . 108 9 `` Honor '' Greg Nicotero Matthew Negrete & Channing Powell February 25 , 2018 ( 2018 - 02 - 25 ) 8.28 After the Saviors leave Alexandria , the survivors make for the Hilltop while Rick and Michonne stay behind to say their final goodbyes to a dying Carl , who pleads with Rick to build a better future alongside the Saviors before killing himself . In the Kingdom , Morgan and Carol launch a rescue mission for Ezekiel . Although they are successful and retake the Kingdom , the Saviors ' lieutenant Gavin is killed by Benjamin 's vengeful brother Henry . 109 10 `` The Lost and the Plunderers '' TBA TBA March 4 , 2018 ( 2018 - 03 - 04 ) TBD 110 11 `` Dead or Alive Or '' TBA TBA March 11 , 2018 ( 2018 - 03 - 11 ) TBD 111 12 `` The Key '' TBA TBA March 18 , 2018 ( 2018 - 03 - 18 ) TBD\n" + ] + } + ], + "source": [ + "print(answers[0])" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Chain definition\n", + "\n", + "Langchain is already integrated with Tair and performs all the indexing for a given list of documents. In our case we are going to store the set of answers we have." + ] + },
+ { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.vectorstores import Tair\n", + "from langchain.embeddings import OpenAIEmbeddings\n", + "from langchain import VectorDBQA, OpenAI\n", + "\n", + "embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)\n", + "doc_store = Tair.from_texts(\n", + " texts=answers, embedding=embeddings, tair_url=TAIR_URL,\n", + ")" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "At this stage all the possible answers are already stored in Tair, so we can define the whole QA chain." + ] + },
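+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before wiring up the chain, we can optionally sanity-check the knowledge base with a direct similarity search. This sketch relies only on the standard Langchain `VectorStore` interface (`similarity_search`), which the Tair integration implements:" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Retrieve the single closest stored answer for a test question\n", + "docs = doc_store.similarity_search(\"properties of a red black tree\", k=1)\n", + "print(docs[0].page_content[:250])" + ] + },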
+ { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/root/anaconda3/envs/notebook/lib/python3.10/site-packages/langchain/chains/retrieval_qa/base.py:251: UserWarning: `VectorDBQA` is deprecated - please use `from langchain.chains import RetrievalQA`\n", + " warnings.warn(\n" + ] + } + ], + "source": [ + "llm = OpenAI(openai_api_key=openai_api_key)\n", + "qa = VectorDBQA.from_chain_type(\n", + " llm=llm,\n", + " chain_type=\"stuff\",\n", + " vectorstore=doc_store,\n", + " return_source_documents=False,\n", + ")" + ] + },
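+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As the warning above notes, `VectorDBQA` is deprecated. An equivalent, forward-compatible sketch uses `RetrievalQA` with the store's `as_retriever()` method; the rest of this notebook keeps `VectorDBQA` so the recorded outputs stay reproducible:" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.chains import RetrievalQA\n", + "\n", + "# The same chain built on the non-deprecated RetrievalQA API\n", + "qa_retrieval = RetrievalQA.from_chain_type(\n", + " llm=llm,\n", + " chain_type=\"stuff\",\n", + " retriever=doc_store.as_retriever(),\n", + ")" + ] + },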
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Search data\n", + "\n", + "Once the data is put into Tair we can start asking questions. A question will be automatically vectorized by the OpenAI model, and the resulting vector will be used to find possibly matching answers in Tair. Once retrieved, the most similar answers will be incorporated into the prompt sent to the OpenAI Large Language Model.\n" + ] + },
+ { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "\n", + "random.seed(52)\n", + "selected_questions = random.choices(questions, k=5)" + ] + },
+ { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "> where do frankenstein and the monster first meet\n", + " Frankenstein and the monster first meet in the mountains.\n", + "\n", + "> who are the actors in fast and furious\n", + " The actors in Fast & Furious are Vin Diesel ( Dominic Toretto ), Paul Walker ( Brian O'Conner ), Michelle Rodriguez ( Letty Ortiz ), Jordana Brewster ( Mia Toretto ), Tyrese Gibson ( Roman Pearce ), Ludacris ( Tej Parker ), Lucas Black ( Sean Boswell ), Sung Kang ( Han Lue ), Gal Gadot ( Gisele Yashar ), and Dwayne Johnson ( Luke Hobbs ).\n", + "\n", + "> properties of red black tree in data structure\n", + " The properties of a red-black tree in data structure are that each node is either red or black, the root is black, if a node is red then both its children must be black, and every path from a given node to any of its descendant NIL nodes contains the same number of black nodes.\n", + "\n", + "> who designed the national coat of arms of south africa\n", + " Iaan Bekker\n", + "\n", + "> caravaggio's death of the virgin pamela askew\n", + " I don't know.\n", + "\n" + ] + } + ], + "source": [ + "import time\n", + "for question in selected_questions:\n", + " print(\">\", question)\n", + " print(qa.run(question), end=\"\\n\\n\")\n", + " # Wait 20 seconds because of the rate limit\n", + " time.sleep(20)" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Custom prompt templates\n", + "\n", + "The `stuff` chain type in Langchain uses a specific prompt with the question and context documents incorporated. This is what the default prompt looks like:\n", + "\n", + "```text\n", + "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n", + "{context}\n", + "Question: {question}\n", + "Helpful Answer:\n", + "```\n", + "\n", + "We can, however, provide our own prompt template and change the behaviour of the OpenAI LLM, while still using the `stuff` chain type. It is important to keep `{context}` and `{question}` as placeholders.\n", + "\n", + "#### Experimenting with custom prompts\n", + "\n", + "We can try using a different prompt template, so the model:\n", + "1. Responds with a single-sentence answer if it knows it.\n", + "2. Suggests a random song title if it doesn't know the answer to our question." + ] + },
+ { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.prompts import PromptTemplate\n", + "\n", + "custom_prompt = \"\"\"\n", + "Use the following pieces of context to answer the question at the end. Please provide\n", + "a short single-sentence summary answer only. If you don't know the answer or if it's\n", + "not present in given context, don't try to make up an answer, but suggest me a random\n", + "unrelated song title I could listen to.\n", + "Context: {context}\n", + "Question: {question}\n", + "Helpful Answer:\n", + "\"\"\"\n", + "\n", + "custom_prompt_template = PromptTemplate(\n", + " template=custom_prompt, input_variables=[\"context\", \"question\"]\n", + ")" + ] + },
+ { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "custom_qa = VectorDBQA.from_chain_type(\n", + " llm=llm,\n", + " chain_type=\"stuff\",\n", + " vectorstore=doc_store,\n", + " return_source_documents=False,\n", + " chain_type_kwargs={\"prompt\": custom_prompt_template},\n", + ")" + ] + },
+ { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "> what was uncle jesse's original last name on full house\n", + "Uncle Jesse's original last name on Full House was Cochran.\n", + "\n", + "> when did the volcano erupt in indonesia 2018\n", + "The given context does not mention any volcanic eruption in Indonesia in 2018. Suggested song title: \"The Heat Is On\" by Glenn Frey.\n", + "\n", + "> what does a dualist way of thinking mean\n", + "Dualism means the belief that there is a distinction between the mind and the body, and that the mind is a non-extended, non-physical substance.\n", + "\n", + "> the first civil service commission in india was set up on the basis of recommendation of\n", + "The first Civil Service Commission in India was not set up on the basis of the recommendation of the Election Commission of India's Model Code of Conduct.\n", + "\n", + "> how old do you have to be to get a tattoo in utah\n", + "You must be at least 18 years old to get a tattoo in Utah.\n", + "\n" + ] + } + ], + "source": [ + "random.seed(41)\n", + "for question in random.choices(questions, k=5):\n", + " print(\">\", question)\n", + " print(custom_qa.run(question), end=\"\\n\\n\")\n", + " # Wait 20 seconds because of the rate limit\n", + " time.sleep(20)" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:notebook] *", + "language": "python", + "name": "conda-env-notebook-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +}