{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cb1537e6",
   "metadata": {},
   "source": [
    "# Vector Database Introduction\n",
    "\n",
    "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n",
    "\n",
    "The demo flow is:\n",
    "- **Setup**: Import packages and set any required variables\n",
    "- **Load data**: Load a dataset and embed it using OpenAI embeddings\n",
    "- **Pinecone**\n",
    "    - *Setup*: Here we setup the Python client for Pinecone. For more details go [here](https://docs.pinecone.io/docs/quickstart)\n",
    "    - *Index Data*: We'll create an index with namespaces for __titles__ and __content__\n",
    "    - *Search Data*: We'll test out both namespaces with search queries to confirm it works\n",
    "- **Weaviate**\n",
    "    - *Setup*: Here we setup the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)\n",
    "    - *Index Data*: We'll create an index with __title__ search vectors in it\n",
    "    - *Search Data*: We'll run a few searches to confirm it works\n",
    "\n",
    "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2b59250",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Here we import the required libraries and set the embedding model that we'd like to use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "id": "5be94df6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import openai\n",
    "\n",
    "import tiktoken\n",
    "from tenacity import retry, wait_random_exponential, stop_after_attempt\n",
    "from typing import List, Iterator\n",
    "import concurrent\n",
    "from tqdm import tqdm\n",
    "import pandas as pd\n",
    "from datasets import load_dataset\n",
    "import numpy as np\n",
    "import os\n",
    "\n",
    "# Pinecone's client library for Python\n",
    "import pinecone\n",
    "\n",
    "# Weaviate's client library for Python\n",
    "import weaviate\n",
    "\n",
    "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n",
    "MODEL = \"text-embedding-ada-002\"\n",
    "\n",
    "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n",
    "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5d9d2e1",
   "metadata": {},
   "source": [
    "## Load data\n",
    "\n",
    "In this section we'll source the data for this task, embed it and format it for insertion into a vector database\n",
    "\n",
    "*Thanks to Ryan Greene for the template used for the batch ingestion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "id": "bd99e08e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simple function to take in a list of text objects and return them as a list of embeddings\n",
    "def get_embeddings(input: List):\n",
    "    response = openai.Embedding.create(\n",
    "        input=input,\n",
    "        model=MODEL,\n",
    "    )[\"data\"]\n",
    "    return [data[\"embedding\"] for data in response]\n",
    "\n",
    "def batchify(iterable, n=1):\n",
    "    l = len(iterable)\n",
    "    for ndx in range(0, l, n):\n",
    "        yield iterable[ndx : min(ndx + n, l)]\n",
    "\n",
    "# Function for batching and parallel processing the embeddings\n",
    "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n",
    "def embed_corpus(\n",
    "    corpus: List[str],\n",
    "    batch_size=64,\n",
    "    num_workers=8,\n",
    "    max_context_len=8191,\n",
    "):\n",
    "\n",
    "    # Encode the corpus, truncating to max_context_len\n",
    "    encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
    "    encoded_corpus = [\n",
    "        encoded_article[:max_context_len] for encoded_article in encoding.encode_batch(corpus)\n",
    "    ]\n",
    "\n",
    "    # Calculate corpus statistics: the number of inputs, the total number of tokens, and the estimated cost to embed\n",
    "    num_tokens = sum(len(article) for article in encoded_corpus)\n",
    "    cost_to_embed_tokens = num_tokens / 1_000 * 0.0004\n",
    "    print(\n",
    "        f\"num_articles={len(encoded_corpus)}, num_tokens={num_tokens}, est_embedding_cost={cost_to_embed_tokens:.2f} USD\"\n",
    "    )\n",
    "\n",
    "    # Embed the corpus\n",
    "    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:\n",
    "        \n",
    "        try:\n",
    "            futures = [\n",
    "                executor.submit(get_embeddings, text_batch)\n",
    "                for text_batch in batchify(encoded_corpus, batch_size)\n",
    "            ]\n",
    "\n",
    "            with tqdm(total=len(encoded_corpus)) as pbar:\n",
    "                for _ in concurrent.futures.as_completed(futures):\n",
    "                    pbar.update(batch_size)\n",
    "\n",
    "            embeddings = []\n",
    "            for future in futures:\n",
    "                data = future.result()\n",
    "                embeddings.extend(data)\n",
    "                \n",
    "            return embeddings\n",
    "                \n",
    "        except Exception as e:\n",
    "            print('Get embeddings failed, returning exception')\n",
    "            \n",
    "            return e\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c1c73cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding\n",
    "dataset = list(load_dataset(\"wikipedia\", \"20220301.simple\")[\"train\"])\n",
    "# Limited to 50k articles for demo purposes\n",
    "dataset = dataset[:50_000]  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "id": "e6ee90ce",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "num_articles=50000, num_tokens=18272526, est_embedding_cost=7.31 USD\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "50048it [02:30, 332.26it/s]                                                                                                                                                      \n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "num_articles=50000, num_tokens=202363, est_embedding_cost=0.08 USD\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "50048it [00:53, 942.94it/s]                                                                                                                                                      "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 48.7 s, sys: 1min 19s, total: 2min 7s\n",
      "Wall time: 5min 53s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# Embed the article text\n",
    "dataset_embeddings = embed_corpus([article[\"text\"] for article in dataset])\n",
    "# Embed the article titles separately\n",
    "title_embeddings = embed_corpus([article[\"title\"] for article in dataset])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "id": "1410daaa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>url</th>\n",
       "      <th>title</th>\n",
       "      <th>text</th>\n",
       "      <th>title_vector</th>\n",
       "      <th>content_vector</th>\n",
       "      <th>vector_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/April</td>\n",
       "      <td>April</td>\n",
       "      <td>April is the fourth month of the year in the J...</td>\n",
       "      <td>[0.00107035250402987, -0.02077057771384716, -0...</td>\n",
       "      <td>[-0.011253940872848034, -0.013491976074874401,...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/August</td>\n",
       "      <td>August</td>\n",
       "      <td>August (Aug.) is the eighth month of the year ...</td>\n",
       "      <td>[0.0010461278725415468, 0.0008924593566916883,...</td>\n",
       "      <td>[0.0003609954728744924, 0.007262262050062418, ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/Art</td>\n",
       "      <td>Art</td>\n",
       "      <td>Art is a creative activity that expresses imag...</td>\n",
       "      <td>[0.0033627033699303865, 0.006122018210589886, ...</td>\n",
       "      <td>[-0.004959689453244209, 0.015772193670272827, ...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/A</td>\n",
       "      <td>A</td>\n",
       "      <td>A or a is the first letter of the English alph...</td>\n",
       "      <td>[0.015406121499836445, -0.013689860701560974, ...</td>\n",
       "      <td>[0.024894846603274345, -0.022186409682035446, ...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>9</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/Air</td>\n",
       "      <td>Air</td>\n",
       "      <td>Air refers to the Earth's atmosphere. Air is a...</td>\n",
       "      <td>[0.022219523787498474, -0.020443666726350784, ...</td>\n",
       "      <td>[0.021524671465158463, 0.018522677943110466, -...</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  id                                       url   title  \\\n",
       "0  1   https://simple.wikipedia.org/wiki/April   April   \n",
       "1  2  https://simple.wikipedia.org/wiki/August  August   \n",
       "2  6     https://simple.wikipedia.org/wiki/Art     Art   \n",
       "3  8       https://simple.wikipedia.org/wiki/A       A   \n",
       "4  9     https://simple.wikipedia.org/wiki/Air     Air   \n",
       "\n",
       "                                                text  \\\n",
       "0  April is the fourth month of the year in the J...   \n",
       "1  August (Aug.) is the eighth month of the year ...   \n",
       "2  Art is a creative activity that expresses imag...   \n",
       "3  A or a is the first letter of the English alph...   \n",
       "4  Air refers to the Earth's atmosphere. Air is a...   \n",
       "\n",
       "                                        title_vector  \\\n",
       "0  [0.00107035250402987, -0.02077057771384716, -0...   \n",
       "1  [0.0010461278725415468, 0.0008924593566916883,...   \n",
       "2  [0.0033627033699303865, 0.006122018210589886, ...   \n",
       "3  [0.015406121499836445, -0.013689860701560974, ...   \n",
       "4  [0.022219523787498474, -0.020443666726350784, ...   \n",
       "\n",
       "                                      content_vector vector_id  \n",
       "0  [-0.011253940872848034, -0.013491976074874401,...         0  \n",
       "1  [0.0003609954728744924, 0.007262262050062418, ...         1  \n",
       "2  [-0.004959689453244209, 0.015772193670272827, ...         2  \n",
       "3  [0.024894846603274345, -0.022186409682035446, ...         3  \n",
       "4  [0.021524671465158463, 0.018522677943110466, -...         4  "
      ]
     },
     "execution_count": 119,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We then store the result in another dataframe, and prep the data for insertion into a vector DB\n",
    "article_df = pd.DataFrame(dataset)\n",
    "article_df['title_vector'] = title_embeddings\n",
    "article_df['content_vector'] = dataset_embeddings\n",
    "article_df['vector_id'] = article_df.index\n",
    "article_df['vector_id'] = article_df['vector_id'].apply(str)\n",
    "article_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed32fc87",
   "metadata": {},
   "source": [
    "## Pinecone\n",
    "\n",
    "Now we'll look to index these embedded documents in a vector database and search them. The first option we'll look at is **Pinecone**, a managed vector database which offers a cloud-native option.\n",
    "\n",
    "Before you proceed with this step you'll need to navigate to [Pinecone](pinecone.io), sign up and then save your API key as an environment variable titled ```PINECONE_API_KEY```.\n",
    "\n",
    "For section we will:\n",
    "- Create an index with multiple namespaces for article titles and content\n",
    "- Store our data in the index with separate searchable \"namespaces\" for article **titles** and **content**\n",
    "- Fire some similarity search queries to verify our setup is working"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "id": "92e6152a",
   "metadata": {},
   "outputs": [],
   "source": [
    "api_key = os.getenv(\"PINECONE_API_KEY\")\n",
    "pinecone.init(api_key=api_key)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63b28543",
   "metadata": {},
   "source": [
    "### Create Index\n",
    "\n",
    "First we need to create an index, which we'll call `wikipedia-articles`. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult [this article](https://docs.pinecone.io/docs/namespaces#:~:text=Pinecone%20allows%20you%20to%20partition,different%20subsets%20of%20your%20index.)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "id": "0a71c575",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Models a simple batch generator that make chunks out of an input DataFrame\n",
    "class BatchGenerator:\n",
    "    \n",
    "    \n",
    "    def __init__(self, batch_size: int = 10) -> None:\n",
    "        self.batch_size = batch_size\n",
    "    \n",
    "    # Makes chunks out of an input DataFrame\n",
    "    def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:\n",
    "        splits = self.splits_num(df.shape[0])\n",
    "        if splits <= 1:\n",
    "            yield df\n",
    "        else:\n",
    "            for chunk in np.array_split(df, splits):\n",
    "                yield chunk\n",
    "\n",
    "    # Determines how many chunks DataFrame contains\n",
    "    def splits_num(self, elements: int) -> int:\n",
    "        return round(elements / self.batch_size)\n",
    "    \n",
    "    __call__ = to_batches\n",
    "\n",
    "df_batcher = BatchGenerator(300)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "id": "7ea9ad46",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['wikipedia-articles']"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Pick a name for the new index\n",
    "index_name = 'wikipedia-articles'\n",
    "\n",
    "# Check whether the index with the same name already exists - if so, delete it\n",
    "if index_name in pinecone.list_indexes():\n",
    "    pinecone.delete_index(index_name)\n",
    "    \n",
    "# Creates new index\n",
    "pinecone.create_index(name=index_name, dimension=len(article_df['content_vector'][0]))\n",
    "index = pinecone.Index(index_name=index_name)\n",
    "\n",
    "# Confirm our index was created\n",
    "pinecone.list_indexes()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "id": "5daeba00",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Uploading vectors to content namespace..\n"
     ]
    }
   ],
   "source": [
    "# Upsert content vectors in content namespace\n",
    "print(\"Uploading vectors to content namespace..\")\n",
    "for batch_df in df_batcher(article_df):\n",
    "    index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "id": "5fc1b083",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Uploading vectors to title namespace..\n"
     ]
    }
   ],
   "source": [
    "# Upsert title vectors in title namespace\n",
    "print(\"Uploading vectors to title namespace..\")\n",
    "for batch_df in df_batcher(article_df):\n",
    "    index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "id": "f90c7fba",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'dimension': 1536,\n",
       " 'index_fullness': 0.2,\n",
       " 'namespaces': {'content': {'vector_count': 50000},\n",
       "                'title': {'vector_count': 50000}},\n",
       " 'total_vector_count': 100000}"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check index size for each namespace\n",
    "index.describe_index_stats()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2da40a69",
   "metadata": {},
   "source": [
    "### Search data\n",
    "\n",
    "Now we'll enter some dummy searches and check we get decent results back"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "id": "d701b3c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# First we'll create dictionaries mapping vector IDs to their outputs so we can retrieve the text for our search results\n",
    "titles_mapped = dict(zip(article_df.vector_id,article_df.title))\n",
    "content_mapped = dict(zip(article_df.vector_id,article_df.text))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "id": "3c8c2aa1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def query_article(query, namespace, top_k=5):\n",
    "    '''Queries an article using its title in the specified\n",
    "     namespace and prints results.'''\n",
    "\n",
    "    # Create vector embeddings based on the title column\n",
    "    embedded_query = openai.Embedding.create(\n",
    "                                                input=query,\n",
    "                                                model=MODEL,\n",
    "                                            )[\"data\"][0]['embedding']\n",
    "\n",
    "    # Query namespace passed as parameter using title vector\n",
    "    query_result = index.query(embedded_query, \n",
    "                                      namespace=namespace, \n",
    "                                      top_k=top_k)\n",
    "\n",
    "    # Print query results \n",
    "    print(f'\\nMost similar results querying {query} in \"{namespace}\" namespace:\\n')\n",
    "    if not query_result.matches:\n",
    "        print('no query result')\n",
    "    \n",
    "    matches = query_result.matches\n",
    "    ids = [res.id for res in matches]\n",
    "    scores = [res.score for res in matches]\n",
    "    df = pd.DataFrame({'id':ids, \n",
    "                       'score':scores,\n",
    "                       'title': [titles_mapped[_id] for _id in ids],\n",
    "                       'content': [content_mapped[_id] for _id in ids],\n",
    "                       })\n",
    "    \n",
    "    counter = 0\n",
    "    for k,v in df.iterrows():\n",
    "        counter += 1\n",
    "        print(f'Result {counter} with a score of {v.score} is {v.title}')\n",
    "    \n",
    "    print('\\n')\n",
    "\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "id": "67b3584d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Most similar results querying modern art in Europe in \"title\" namespace:\n",
      "\n",
      "Result 1 with a score of 0.890994787 is Early modern Europe\n",
      "Result 2 with a score of 0.875286043 is Museum of Modern Art\n",
      "Result 3 with a score of 0.867404044 is Western Europe\n",
      "Result 4 with a score of 0.864250064 is Renaissance art\n",
      "Result 5 with a score of 0.860506058 is Pop art\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "query_output = query_article('modern art in Europe','title')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "id": "3e7ac79b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Most similar results querying Famous battles in Scottish history in \"content\" namespace:\n",
      "\n",
      "Result 1 with a score of 0.869324744 is Battle of Bannockburn\n",
      "Result 2 with a score of 0.861479 is Wars of Scottish Independence\n",
      "Result 3 with a score of 0.852555931 is 1651\n",
      "Result 4 with a score of 0.84969604 is First War of Scottish Independence\n",
      "Result 5 with a score of 0.846192539 is Robert I of Scotland\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "content_query_output = query_article(\"Famous battles in Scottish history\",'content')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d939342f",
   "metadata": {},
   "source": [
    "## Weaviate\n",
    "\n",
    "The other vector database option we'll explore here is **Weaviate**, which offers both a managed, SaaS option like Pinecone, as well as a self-hosted option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.\n",
    "\n",
    "For this we will:\n",
    "- Set up a local deployment of Weaviate\n",
    "- Create indices in Weaviate\n",
    "- Store our data there\n",
    "- Fire some similarity search queries\n",
    "- Try a real use case"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bfdfe260",
   "metadata": {},
   "source": [
    "### Setup\n",
    "\n",
    "To get Weaviate running locally we used Docker and followed the instructions contained in this article: https://weaviate.io/developers/weaviate/current/installation/docker-compose.html\n",
    "\n",
    "For an example docker-compose.yaml file please refer to `./weaviate/docker-compose.yaml` in this repo\n",
    "\n",
    "You can start Weaviate up locally by navigating to this directory and running `docker-compose up -d `"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "id": "b9ea472d",
   "metadata": {},
   "outputs": [],
   "source": [
    "client = weaviate.Client(\"http://localhost:8080/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "id": "13be220d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'classes': []}"
      ]
     },
     "execution_count": 108,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "client.schema.delete_all()\n",
    "client.schema.get()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "id": "73d33184",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 109,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "client.is_ready()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03a926b9",
   "metadata": {},
   "source": [
    "### Index data\n",
    "\n",
    "In Weaviate you create __schemas__ to capture each of the entities you will be searching. \n",
    "\n",
    "In this case we'll create a schema called **Article** with the **title** vector from above included for us to search by.\n",
    "\n",
    "The next few steps closely follow the documents Weaviate provides [here](https://weaviate.io/developers/weaviate/current/tutorials/how-to-use-weaviate-without-modules.htm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "id": "e868d143",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'classes': [{'class': 'Article',\n",
       "   'invertedIndexConfig': {'bm25': {'b': 0.75, 'k1': 1.2},\n",
       "    'cleanupIntervalSeconds': 60,\n",
       "    'stopwords': {'additions': None, 'preset': 'en', 'removals': None}},\n",
       "   'properties': [{'dataType': ['text'],\n",
       "     'description': 'Title of the article',\n",
       "     'name': 'title',\n",
       "     'tokenization': 'word'},\n",
       "    {'dataType': ['text'],\n",
       "     'description': 'Contents of the article',\n",
       "     'name': 'content',\n",
       "     'tokenization': 'word'}],\n",
       "   'shardingConfig': {'virtualPerPhysical': 128,\n",
       "    'desiredCount': 1,\n",
       "    'actualCount': 1,\n",
       "    'desiredVirtualCount': 128,\n",
       "    'actualVirtualCount': 128,\n",
       "    'key': '_id',\n",
       "    'strategy': 'hash',\n",
       "    'function': 'murmur3'},\n",
       "   'vectorIndexConfig': {'skip': False,\n",
       "    'cleanupIntervalSeconds': 300,\n",
       "    'maxConnections': 64,\n",
       "    'efConstruction': 128,\n",
       "    'ef': -1,\n",
       "    'dynamicEfMin': 100,\n",
       "    'dynamicEfMax': 500,\n",
       "    'dynamicEfFactor': 8,\n",
       "    'vectorCacheMaxObjects': 2000000,\n",
       "    'flatSearchCutoff': 40000,\n",
       "    'distance': 'cosine'},\n",
       "   'vectorIndexType': 'hnsw',\n",
       "   'vectorizer': 'none'}]}"
      ]
     },
     "execution_count": 110,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "class_obj = {\n",
    "    \"class\": \"Article\",\n",
    "    \"vectorizer\": \"none\", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves through our BERT model\n",
    "    \"properties\": [{\n",
    "        \"name\": \"title\",\n",
    "        \"description\": \"Title of the article\",\n",
    "        \"dataType\": [\"text\"]\n",
    "    },\n",
    "        {\n",
    "        \"name\": \"content\",\n",
    "        \"description\": \"Contents of the article\",\n",
    "        \"dataType\": [\"text\"]\n",
    "    }]\n",
    "}\n",
    "\n",
    "# Create the schema in Weaviate\n",
    "client.schema.create_class(class_obj)\n",
    "\n",
    "# Check that we've created it as intended\n",
    "client.schema.get()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "id": "786d437f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Uploading vectors to article schema..\n"
     ]
    }
   ],
   "source": [
    "# Convert DF into a list of tuples\n",
    "data_objects = []\n",
    "for k,v in article_df.iterrows():\n",
    "    data_objects.append((v['title'],v['text'],v['title_vector'],v['vector_id']))\n",
    "\n",
    "# Upsert into article schema\n",
    "print(\"Uploading vectors to article schema..\")\n",
    "uuids = []\n",
    "for articles in data_objects:\n",
    "    uuid = client.data_object.create(\n",
    "                              {\n",
    "                                  \"title\": articles[0],\n",
    "                                  \"content\": articles[1]\n",
    "                              },\n",
    "                              \"Article\",\n",
    "                              vector=articles[2]\n",
    "                            )\n",
    "    uuids.append(uuid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "id": "3658693c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'content': 'Eddie Cantor (January 31, 1892 - October 10, 1964) was an American comedian, singer, actor, songwriter. Familiar to Broadway, radio and early television audiences, this \"Apostle of Pep\" was regarded almost as a family member by millions because his top-rated radio shows revealed intimate stories and amusing anecdotes about his wife Ida and five daughters. His eye-rolling song-and-dance routines eventually led to his nickname, Banjo Eyes, and in 1933, the artist Frederick J. Garner caricatured Cantor with large round and white eyes resembling the drum-like pot of a banjo. Cantor\\'s eyes became his trademark, often exaggerated in illustrations, and leading to his appearance on Broadway in the musical Banjo Eyes (1941). He was the original singer of 1929 hit song \"Makin\\' Whoopie\".\\n\\nReferences\\n\\nPresidents of the Screen Actors Guild\\nAmerican stage actors\\nComedians from New York City\\nAmerican Jews\\nActors from New York City\\nSingers from New York City\\nAmerican television actors\\nAmerican radio actors\\n1892 births\\n1964 deaths',\n",
       " 'title': 'Eddie Cantor'}"
      ]
     },
     "execution_count": 112,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "client.data_object.get()['objects'][0]['properties']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46050ca9",
   "metadata": {},
   "source": [
    "### Search Data\n",
    "\n",
    "As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "id": "5acd5437",
   "metadata": {},
   "outputs": [],
   "source": [
    "def query_weaviate(query, schema, top_k=20):\n",
    "\n",
    "    # Creates embedding vector from user query\n",
    "    embedded_query = openai.Embedding.create(\n",
    "                                                input=query,\n",
    "                                                model=MODEL,\n",
    "                                            )[\"data\"][0]['embedding']\n",
    "    \n",
    "    near_vector = {\"vector\": embedded_query}\n",
    "\n",
    "    # Queries input schema with vectorised user query\n",
    "    query_result = client.query.get(schema,[\"title\",\"content\", \"_additional {certainty}\"]) \\\n",
    "    .with_near_vector(near_vector) \\\n",
    "    .with_limit(top_k) \\\n",
    "    .do()\n",
    "    \n",
    "    return query_result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "id": "15def653",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Title: Early modern Europe Certainty: 0.9454971551895142\n",
      "2. Title: Museum of Modern Art Certainty: 0.9376430511474609\n",
      "3. Title: Western Europe Certainty: 0.9337018430233002\n",
      "4. Title: Renaissance art Certainty: 0.932124525308609\n",
      "5. Title: Pop art Certainty: 0.9302527010440826\n",
      "6. Title: Art exhibition Certainty: 0.9282020926475525\n",
      "7. Title: History of Europe Certainty: 0.927833616733551\n",
      "8. Title: Northern Europe Certainty: 0.9273514151573181\n",
      "9. Title: Concert of Europe Certainty: 0.9268475472927094\n",
      "10. Title: Hellenistic art Certainty: 0.9264959394931793\n",
      "11. Title: Piet Mondrian Certainty: 0.9235787093639374\n",
      "12. Title: Modernist literature Certainty: 0.9235587120056152\n",
      "13. Title: European Capital of Culture Certainty: 0.9227772951126099\n",
      "14. Title: Art film Certainty: 0.9217384457588196\n",
      "15. Title: Europa Certainty: 0.9216940104961395\n",
      "16. Title: Art rock Certainty: 0.9212885200977325\n",
      "17. Title: Central Europe Certainty: 0.9212715923786163\n",
      "18. Title: Art Certainty: 0.9207542240619659\n",
      "19. Title: European Certainty: 0.9207191467285156\n",
      "20. Title: Byzantine art Certainty: 0.9204496443271637\n"
     ]
    }
   ],
   "source": [
    "query_result = query_weaviate('modern art in Europe','Article')\n",
    "counter = 0\n",
    "for article in query_result['data']['Get']['Article']:\n",
    "    counter += 1\n",
    "    print(f\"{counter}. Title: {article['title']} Certainty: {article['_additional']['certainty']}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "id": "93c4a696",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Title: Historic Scotland Certainty: 0.9465253949165344\n",
      "2. Title: First War of Scottish Independence Certainty: 0.9461104869842529\n",
      "3. Title: Battle of Bannockburn Certainty: 0.9455604553222656\n",
      "4. Title: Wars of Scottish Independence Certainty: 0.944368839263916\n",
      "5. Title: Second War of Scottish Independence Certainty: 0.9394940435886383\n",
      "6. Title: List of Scottish monarchs Certainty: 0.9366503059864044\n",
      "7. Title: Kingdom of Scotland Certainty: 0.9353288412094116\n",
      "8. Title: Scottish Borders Certainty: 0.9317235946655273\n",
      "9. Title: List of rivers of Scotland Certainty: 0.9296278059482574\n",
      "10. Title: Braveheart Certainty: 0.9294214248657227\n",
      "11. Title: John of Scotland Certainty: 0.9292325675487518\n",
      "12. Title: Duncan II of Scotland Certainty: 0.9291643798351288\n",
      "13. Title: Bannockburn Certainty: 0.929103285074234\n",
      "14. Title: The Scotsman Certainty: 0.9280981719493866\n",
      "15. Title: Flag of Scotland Certainty: 0.9270428121089935\n",
      "16. Title: Banff and Macduff Certainty: 0.9267247915267944\n",
      "17. Title: Guardians of Scotland Certainty: 0.9260668158531189\n",
      "18. Title: Scottish Parliament Certainty: 0.9251855313777924\n",
      "19. Title: Holyrood Abbey Certainty: 0.925055593252182\n",
      "20. Title: Scottish Certainty: 0.9249534606933594\n"
     ]
    }
   ],
   "source": [
    "query_result = query_weaviate('Famous battles in Scottish history','Article')\n",
    "counter = 0\n",
    "for article in query_result['data']['Get']['Article']:\n",
    "    counter += 1\n",
    "    print(f\"{counter}. Title: {article['title']} Certainty: {article['_additional']['certainty']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad74202e",
   "metadata": {},
   "source": [
    "Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}