{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Search augmented by query generation and embeddings reranking\n", "\n", "Searching for relevant information can sometimes feel like looking for a needle in a haystack, but don’t despair: GPTs can actually do a lot of this work for us. In this guide we explore a way to augment existing search systems with various AI techniques, helping us sift through the noise.\n", "\n", "Two ways of retrieving information for GPT are:\n", "\n", "1. **Mimicking Human Browsing:** [GPT triggers a search](https://openai.com/blog/chatgpt-plugins#browsing), evaluates the results, and modifies the search query if necessary. It can also follow up on specific search results to form a chain of thought, much like a human user would do.\n", "2. **Retrieval with Embeddings:** Calculate [embeddings](https://platform.openai.com/docs/guides/embeddings) for your content and a user query, and then [retrieve the content](Question_answering_using_embeddings.ipynb) most closely related, as measured by cosine similarity. This technique is [used heavily](https://blog.google/products/search/search-language-understanding-bert/) by search engines like Google.\n", "\n", "These approaches are both promising, but each has its shortcomings: the first one can be slow due to its iterative nature, and the second one requires embedding your entire knowledge base in advance, continuously embedding new content, and maintaining a vector database.\n", "\n", "By combining these approaches, and drawing inspiration from [re-ranking](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) methods, we identify an approach that sits in the middle. **This approach can be implemented on top of any existing search system, like the Slack search API, or an internal ElasticSearch instance with private data**. 
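Both the embedding-retrieval approach and the re-ranking step below rest on cosine similarity between embedding vectors. As a quick refresher, a minimal sketch (the `cosine_similarity` helper here is illustrative, not part of this notebook's code):

```python
# Illustrative helper: cosine similarity between two embedding vectors.
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the vectors divided by the product of their norms.
    return dot(a, b) / (norm(a) * norm(b))
```

OpenAI's `text-embedding-ada-002` vectors are normalized to length 1, so a plain dot product yields the same ranking; that is why the setup cell below imports only `dot` from NumPy.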
Here’s how it works:\n", "\n", "![search_augmented_by_query_generation_and_embeddings_reranking.png](../images/search_augmentation_embeddings.png)\n", "\n", "**Step 1: Search**\n", "\n", "1. User asks a question.\n", "2. GPT generates a list of potential queries.\n", "3. Search queries are executed in parallel.\n", "\n", "**Step 2: Re-rank**\n", "\n", "1. Embeddings for each result are used to calculate semantic similarity to a generated hypothetical ideal answer to the user question.\n", "2. Results are ranked and filtered based on this similarity metric.\n", "\n", "**Step 3: Answer**\n", "\n", "1. Given the top search results, the model generates an answer to the user’s question, including references and links.\n", "\n", "This hybrid approach offers relatively low latency and can be integrated into any existing search endpoint, without requiring the upkeep of a vector database. Let's dive into it! We will use the [News API](https://newsapi.org/) as an example domain to search over.\n", "\n", "## Setup\n", "\n", "In addition to your `OPENAI_API_KEY`, you'll have to include a `NEWS_API_KEY` in your environment. 
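Before running anything else, it can save debugging time to confirm that both variables are actually visible to the kernel. A minimal sanity check (the `missing_keys` helper is illustrative, not part of this notebook):

```python
import os

def missing_keys(required=('OPENAI_API_KEY', 'NEWS_API_KEY')) -> list[str]:
    # Return the names of any required environment variables that are
    # unset or empty.
    return [key for key in required if not os.getenv(key)]

# Warn about anything missing before the API-calling cells run.
for key in missing_keys():
    print(f'{key} is not set; the cells below will fail without it')
```

This only reads the environment and makes no API calls.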
You can get an API key [here](https://newsapi.org/).\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%env NEWS_API_KEY = YOUR_API_KEY\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Dependencies\n", "from datetime import date, timedelta # date handling for fetching recent news\n", "from IPython import display # for pretty printing\n", "import json # for parsing the JSON API responses and model outputs\n", "from numpy import dot # for cosine similarity\n", "import openai # for using GPT and getting embeddings\n", "import os # for loading environment variables\n", "import requests # for making the API requests\n", "from tqdm import tqdm # for printing progress bars\n", "\n", "# Load environment variables\n", "news_api_key = os.getenv(\"NEWS_API_KEY\")\n", "\n", "\n", "# Helper functions\n", "def json_gpt(input: str) -> dict:\n", " completion = openai.ChatCompletion.create(\n", " model=\"gpt-4\",\n", " messages=[\n", " {\"role\": \"system\", \"content\": \"Output only valid JSON\"},\n", " {\"role\": \"user\", \"content\": input},\n", " ],\n", " temperature=1,\n", " )\n", "\n", " text = completion.choices[0].message.content\n", " parsed = json.loads(text)\n", "\n", " return parsed\n", "\n", "\n", "def embeddings(input: list[str]) -> list[list[float]]:\n", " response = openai.Embedding.create(\n", " model=\"text-embedding-ada-002\", input=input)\n", " return [data.embedding for data in response.data]\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Search\n", "\n", "It all starts with a user question.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# User asks a question\n", "USER_QUESTION = \"Who won the NBA championship? And who was the MVP? 
Tell me a bit about the last game.\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now, in order to be as exhaustive as possible, we use the model to generate a list of diverse queries based on this question.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['NBA championship winner',\n", " 'NBA finals MVP',\n", " 'last NBA championship game',\n", " 'recent NBA finals results',\n", " 'NBA finals champions',\n", " 'NBA finals MVP and winner',\n", " 'latest NBA championship game details',\n", " 'NBA championship winning team',\n", " 'most recent NBA finals MVP',\n", " 'last NBA finals game summary',\n", " 'latest NBA finals champion and MVP',\n", " 'Who won the NBA championship? And who was the MVP? Tell me a bit about the last game.']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "QUERIES_INPUT = f\"\"\"\n", "Generate an array of search queries that are relevant to this question.\n", "Use a variation of related keywords for the queries, trying to be as general as possible.\n", "Include as many queries as you can think of, including and excluding terms.\n", "For example, include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2'].\n", "Be creative. 
The more queries you include, the more likely you are to find relevant results.\n", "\n", "User question: {USER_QUESTION}\n", "\n", "Format: {{\"queries\": [\"query_1\", \"query_2\", \"query_3\"]}}\n", "\"\"\"\n", "\n", "queries = json_gpt(QUERIES_INPUT)[\"queries\"]\n", "\n", "# Let's include the original question as well for good measure\n", "queries.append(USER_QUESTION)\n", "\n", "queries\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The queries look good, so let's run the searches.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 12/12 [00:04<00:00, 2.69it/s]\n" ] } ], "source": [ "def search_news(query: str):\n", " # get date 1 week ago\n", " one_week_ago = (date.today() - timedelta(weeks=1)).strftime(\"%Y-%m-%d\")\n", "\n", " response = requests.get(\n", " \"https://newsapi.org/v2/everything\",\n", " params={\n", " \"q\": query,\n", " \"apiKey\": news_api_key,\n", " \"pageSize\": 50,\n", " \"sortBy\": \"relevancy\",\n", " \"from\": one_week_ago,\n", " },\n", " )\n", "\n", " return response.json()\n", "\n", "\n", "articles = []\n", "\n", "for query in tqdm(queries):\n", " result = search_news(query)\n", " if result[\"status\"] == \"ok\":\n", " articles = articles + result[\"articles\"]\n", " else:\n", " raise Exception(result[\"message\"])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of articles: 356\n", "Top 5 articles of query 1: \n", "\n", "Title: Nascar takes on Le Mans as LeBron James gets centenary race under way\n", "Description: The crowd chanted “U-S-A! U-S-A!” as Nascar driver lineup for the 24 Hours of Le Mans passed through the city cente…\n", "Content: The crowd chanted U-S-A! U-S-A! 
as Nascar driver lineup for the 24 Hours of Le Mans passed through t...\n", "\n", "Title: Futura and Michelob ULTRA Toast to the NBA Finals With Abstract Artwork Crafted From the Brand’s 2023 Limited-Edition Championship Bottles\n", "Description: The sun is out to play, and so is Michelob ULTRA. With the 2022-2023 NBA Finals underway, the beermaker is back with its celebratory NBA Champ Bottles. This year, the self-proclaimed MVP of joy is dropping a limited-edition bottle made in collaboration with a…\n", "Content: The sun is out to play, and so is Michelob ULTRA. With the 2022-2023 NBA Finals underway, the beerma...\n", "\n", "Title: Signed and Delivered, Futura and Michelob ULTRA Will Gift Hand-Painted Bottles to This Year’s NBA Championship Team\n", "Description: Michelob ULTRA, the MVP of joy and official beer sponsor of the NBA is back to celebrate with basketball lovers and sports fans around the globe as the NBA 2022-2023 season comes to a nail-biting close. In collaboration with artist Futura, Michelob ULTRA will…\n", "Content: Michelob ULTRA, the MVP of joy and official beer sponsor of the NBA is back to celebrate with basket...\n", "\n", "Title: Alexis Ohanian and Serena Williams are building a mini-sports empire with a new golf team that's part of a league created by Tiger Woods and Rory McIlroy\n", "Description: Ohanian and Williams are already co-owners of the National Women's Soccer League Los Angeles team, Angel City FC.\n", "Content: Alexis Ohanian and Serena Williams attend The 2023 Met Gala.Cindy Ord/Getty Images\n", "