# How to build an AI that can answer questions about your website

This tutorial walks through a simple example of crawling a website (in this example, the OpenAI website), turning the crawled pages into embeddings using the [Embeddings API](/docs/guides/embeddings), and then creating a basic search functionality that allows a user to ask questions about the embedded information. This is intended to be a starting point for more sophisticated applications that make use of custom knowledge bases.

## Getting started

Some basic knowledge of Python and GitHub is helpful for this tutorial. Before diving in, make sure to [set up an OpenAI API key](/docs/api-reference/introduction) and walk through the [quickstart tutorial](/docs/quickstart). This will give you a good intuition for how to use the API to its full potential.

Python is used as the main programming language along with the OpenAI, Pandas, transformers, NumPy, and other popular packages. If you run into any issues working through this tutorial, please ask a question on the [OpenAI Community Forum](https://community.openai.com).

To start with the code, clone the [full code for this tutorial on GitHub](https://github.com/openai/web-crawl-q-and-a-example). Alternatively, follow along and copy each section into a Jupyter notebook and run the code step by step, or just read along. A good way to avoid any issues is to set up a new virtual environment and install the required packages by running the following commands:

```bash
python -m venv env
source env/bin/activate
pip install -r requirements.txt
```

## Setting up a web crawler

The primary focus of this tutorial is the OpenAI API, so if you prefer, you can skip the context on how to create a web crawler and just [download the source code](https://github.com/openai/web-crawl-q-and-a-example). Otherwise, expand the section below to work through the scraping mechanism implementation.

<Image
  png="https://cdn.openai.com/API/docs/images/tutorials/web-qa/DALL-E-coding-a-web-crawling-system-pixel-art.png"
  webp="https://cdn.openai.com/API/docs/images/tutorials/web-qa/DALL-E-coding-a-web-crawling-system-pixel-art.webp"
  alt="DALL-E: Coding a web crawling system pixel art"
  width="1024"
  height="1024"
/>

Acquiring data in text form is the first step in using embeddings. This tutorial creates a new set of data by crawling the OpenAI website, a technique that you can also use for your own company or personal website.

<Button
  size="small"
  color={ButtonColor.neutral}
  href="https://github.com/openai/web-crawl-q-and-a-example"
  target="_blank"
>
  View source code
</Button>

While this crawler is written from scratch, open source packages like [Scrapy](https://github.com/scrapy/scrapy) can also help with these operations.

This crawler will start from the root URL passed in at the bottom of the code below, visit each page, find additional links, and visit those pages as well (as long as they have the same root domain). To begin, import the required packages, set up the basic URL, and define a `HyperlinkParser` class built on Python's `HTMLParser`.

```python
import requests
import re
import urllib.request
from bs4 import BeautifulSoup
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urlparse
import os

# Regex pattern to match a URL
HTTP_URL_PATTERN = r'^http[s]*://.+'

domain = "openai.com" # <- put your domain to be crawled
full_url = "https://openai.com/" # <- put your domain to be crawled with https or http

# Create a class to parse the HTML and get the hyperlinks
class HyperlinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Create a list to store the hyperlinks
        self.hyperlinks = []

    # Override the HTMLParser's handle_starttag method to get the hyperlinks
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)

        # If the tag is an anchor tag and it has an href attribute, add the href attribute to the list of hyperlinks
        if tag == "a" and "href" in attrs:
            self.hyperlinks.append(attrs["href"])
```

The next function takes a URL as an argument, opens the URL, and reads the HTML content. Then, it returns all the hyperlinks found on that page.

```python
# Function to get the hyperlinks from a URL
def get_hyperlinks(url):

    # Try to open the URL and read the HTML
    try:
        # Open the URL and read the HTML
        with urllib.request.urlopen(url) as response:

            # If the response is not HTML, return an empty list
            if not response.info().get('Content-Type').startswith("text/html"):
                return []

            # Decode the HTML
            html = response.read().decode('utf-8')
    except Exception as e:
        print(e)
        return []

    # Create the HTML Parser and then Parse the HTML to get hyperlinks
    parser = HyperlinkParser()
    parser.feed(html)

    return parser.hyperlinks
```

The goal is to crawl through and index only the content that lives under the OpenAI domain. For this purpose, a function is needed that calls `get_hyperlinks` but filters out any URLs that are not part of the specified domain.

```python
# Function to get the hyperlinks from a URL that are within the same domain
def get_domain_hyperlinks(local_domain, url):
    clean_links = []
    for link in set(get_hyperlinks(url)):
        clean_link = None

        # If the link is a URL, check if it is within the same domain
        if re.search(HTTP_URL_PATTERN, link):
            # Parse the URL and check if the domain is the same
            url_obj = urlparse(link)
            if url_obj.netloc == local_domain:
                clean_link = link

        # If the link is not a URL, check if it is a relative link
        else:
            if link.startswith("/"):
                link = link[1:]
            elif link.startswith("#") or link.startswith("mailto:"):
                continue
            clean_link = "https://" + local_domain + "/" + link

        if clean_link is not None:
            if clean_link.endswith("/"):
                clean_link = clean_link[:-1]
            clean_links.append(clean_link)

    # Return the list of hyperlinks that are within the same domain
    return list(set(clean_links))
```

The `crawl` function is the final step in the web scraping task setup. It keeps track of the visited URLs to avoid repeating the same page, which might be linked across multiple pages on a site. It also extracts the raw text from a page without the HTML tags, and writes the text content into a local .txt file specific to the page.

```python
def crawl(url):
    # Parse the URL and get the domain
    local_domain = urlparse(url).netloc

    # Create a queue to store the URLs to crawl
    queue = deque([url])

    # Create a set to store the URLs that have already been seen (no duplicates)
    seen = set([url])

    # Create a directory to store the text files
    if not os.path.exists("text/"):
        os.mkdir("text/")

    if not os.path.exists("text/"+local_domain+"/"):
        os.mkdir("text/" + local_domain + "/")

    # Create a directory to store the csv files
    if not os.path.exists("processed"):
        os.mkdir("processed")

    # While the queue is not empty, continue crawling
    while queue:

        # Get the next URL from the queue
        url = queue.pop()
        print(url) # for debugging and to see the progress

        # Save text from the url to a .txt file
        with open('text/'+local_domain+'/'+url[8:].replace("/", "_") + ".txt", "w", encoding="UTF-8") as f:

            # Get the text from the URL using BeautifulSoup
            soup = BeautifulSoup(requests.get(url).text, "html.parser")

            # Get the text but remove the tags
            text = soup.get_text()

            # If the page requires JavaScript to render, flag it so you know the saved text is incomplete
            if ("You need to enable JavaScript to run this app." in text):
                print("Unable to parse page " + url + " due to JavaScript being required")

            # Write the text to the file in the text directory
            f.write(text)

        # Get the hyperlinks from the URL and add them to the queue
        for link in get_domain_hyperlinks(local_domain, url):
            if link not in seen:
                queue.append(link)
                seen.add(link)

crawl(full_url)
```

The last line of the above example runs the crawler, which goes through all the accessible links and turns those pages into text files. This will take a few minutes to run, depending on the size and complexity of your site.
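
Once the crawl finishes, a quick sanity check can confirm how many pages were captured. This is a minimal sketch that assumes the default `text/` output directory and the `domain` variable defined above:

```python
import os

# Count the text files the crawler wrote for this domain
output_dir = "text/" + domain + "/"
pages = [f for f in os.listdir(output_dir) if f.endswith(".txt")]
print(f"Saved {len(pages)} pages under {output_dir}")
```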

## Building an embeddings index

<Image
  png="https://cdn.openai.com/API/docs/images/tutorials/web-qa/DALL-E-woman-turning-a-stack-of-papers-into-numbers-pixel-art.png"
  webp="https://cdn.openai.com/API/docs/images/tutorials/web-qa/DALL-E-woman-turning-a-stack-of-papers-into-numbers-pixel-art.webp"
  alt="DALL-E: Woman turning a stack of papers into numbers pixel art"
  width="1024"
  height="1024"
/>

CSV is a common format for storing embeddings. You can use this format with Python by converting the raw text files (which are in the text directory) into Pandas data frames. Pandas is a popular open source library that helps you work with tabular data (data stored in rows and columns).

Blank lines can clutter the text files and make them harder to process. A simple function can remove those lines and tidy up the files.

```python
def remove_newlines(serie):
    serie = serie.str.replace('\n', ' ')
    serie = serie.str.replace('\\n', ' ')
    serie = serie.str.replace('  ', ' ')
    serie = serie.str.replace('  ', ' ')
    return serie
```

Converting the text to CSV requires looping through the text files in the text directory created earlier. After opening each file, remove the extra spacing and append the modified text to a list. Then, add the text with the new lines removed to an empty Pandas data frame and write the data frame to a CSV file.

Extra spacing and new lines can clutter the text and complicate the embeddings process. The code used here helps to remove some of them, but you may find third-party libraries or other methods useful to get rid of more unnecessary characters.
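
For example, a regex-based cleanup could collapse runs of whitespace more aggressively. This is a minimal sketch, and the `normalize_whitespace` helper is not part of the tutorial code:

```python
import re

def normalize_whitespace(text: str) -> str:
    # Collapse any run of spaces, tabs, or newlines into a single space
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace("OpenAI   API\n\nreference\t"))  # "OpenAI API reference"
```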

```python
import pandas as pd

# Create a list to store the text files
texts=[]

# Get all the text files in the text directory
for file in os.listdir("text/" + domain + "/"):

    # Open the file and read the text
    with open("text/" + domain + "/" + file, "r", encoding="UTF-8") as f:
        text = f.read()

        # Omit the first 11 characters of the file name ("openai.com_") and the last 4 (".txt"),
        # then replace - and _ with spaces and drop "#update"
        texts.append((file[11:-4].replace('-',' ').replace('_', ' ').replace('#update',''), text))

# Create a dataframe from the list of texts
df = pd.DataFrame(texts, columns = ['fname', 'text'])

# Set the text column to be the raw text with the newlines removed
df['text'] = df.fname + ". " + remove_newlines(df.text)
df.to_csv('processed/scraped.csv')
df.head()
```

Tokenization is the next step after saving the raw text into a CSV file. This process splits the input text into tokens by breaking down the sentences and words. A visual demonstration of this can be seen by [checking out our Tokenizer](/tokenizer) in the docs.

> A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
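
As a quick illustration of that rule of thumb (the sample sentence below is arbitrary), you can count tokens directly with tiktoken:

```python
import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")
sample = "Embeddings measure the relatedness of text strings."
print(len(sample), "characters ->", len(tokenizer.encode(sample)), "tokens")
```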

The API has a limit on the maximum number of input tokens for embeddings. To stay below the limit, the text in the CSV file needs to be broken down into multiple rows. The existing length of each row will be recorded first to identify which rows need to be split.

```python
import tiktoken

# Load the cl100k_base tokenizer which is designed to work with the ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

df = pd.read_csv('processed/scraped.csv', index_col=0)
df.columns = ['title', 'text']

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Visualize the distribution of the number of tokens per row using a histogram
df.n_tokens.hist()
```

<img
  src="https://cdn.openai.com/API/docs/images/tutorials/web-qa/embeddings-initial-histrogram.png"
  alt="Embeddings histogram"
  width="553"
  height="413"
/>

The newest embeddings model can handle inputs with up to 8191 input tokens, so most of the rows will not need any chunking. That may not be the case for every subpage scraped, however, so the next code block splits the longer rows into smaller chunks.

```python
max_tokens = 500

# Function to split the text into chunks of a maximum number of tokens
def split_into_many(text, max_tokens = max_tokens):

    # Split the text into sentences
    sentences = text.split('. ')

    # Get the number of tokens for each sentence
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]

    chunks = []
    tokens_so_far = 0
    chunk = []

    # Loop through the sentences and tokens joined together in a tuple
    for sentence, token in zip(sentences, n_tokens):

        # If the number of tokens so far plus the number of tokens in the current sentence is greater
        # than the max number of tokens, then add the chunk to the list of chunks and reset
        # the chunk and tokens so far
        if tokens_so_far + token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk = []
            tokens_so_far = 0

        # If the number of tokens in the current sentence is greater than the max number of
        # tokens, skip it and go to the next sentence
        if token > max_tokens:
            continue

        # Otherwise, add the sentence to the chunk and add the number of tokens to the total
        chunk.append(sentence)
        tokens_so_far += token + 1

    # Add the last chunk, which would otherwise be dropped
    if chunk:
        chunks.append(". ".join(chunk) + ".")

    return chunks


shortened = []

# Loop through the dataframe
for row in df.iterrows():

    # If the text is None, go to the next row
    if row[1]['text'] is None:
        continue

    # If the number of tokens is greater than the max number of tokens, split the text into chunks
    if row[1]['n_tokens'] > max_tokens:
        shortened += split_into_many(row[1]['text'])

    # Otherwise, add the text to the list of shortened texts
    else:
        shortened.append(row[1]['text'])
```

Visualizing the updated histogram again can help to confirm if the rows were successfully split into shortened sections.

```python
df = pd.DataFrame(shortened, columns = ['text'])
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))
df.n_tokens.hist()
```

<img
  src="https://cdn.openai.com/API/docs/images/tutorials/web-qa/embeddings-tokenized-output.png"
  alt="Embeddings tokenized output"
  width="552"
  height="418"
/>

The content is now broken down into smaller chunks, and a simple request can be sent to the OpenAI API specifying the use of the new text-embedding-ada-002 model to create the embeddings:

```python
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

df['embeddings'] = df.text.apply(lambda x: client.embeddings.create(input=x, model='text-embedding-ada-002').data[0].embedding)

df.to_csv('processed/embeddings.csv')
df.head()
```

This should take about 3-5 minutes, but afterwards you will have your embeddings ready to use!
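
Embedding every row with one request per row can run into rate limits on larger sites. A minimal retry sketch (the `get_embedding` helper and its retry/backoff parameters are illustrative assumptions, not part of the tutorial code) might look like this:

```python
import time

def get_embedding(text, model="text-embedding-ada-002", retries=3, backoff=2.0):
    # Retry transient failures with a simple linear backoff
    for attempt in range(retries):
        try:
            return client.embeddings.create(input=text, model=model).data[0].embedding
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))

# Drop-in replacement for the apply call above:
# df['embeddings'] = df.text.apply(get_embedding)
```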

## Building a question answer system with your embeddings

<Image
  png="https://cdn.openai.com/API/docs/images/tutorials/web-qa/DALL-E-friendly-robot-question-and-answer-system-pixel-art.png"
  webp="https://cdn.openai.com/API/docs/images/tutorials/web-qa/DALL-E-friendly-robot-question-and-answer-system-pixel-art.webp"
  alt="DALL-E: Friendly robot question and answer system pixel art"
  width="1024"
  height="1024"
/>

The embeddings are ready, and the final step of this process is to create a simple question and answer system. This will take a user's question, create an embedding of it, and compare it with the existing embeddings to retrieve the most relevant text from the scraped website. The gpt-3.5-turbo model will then generate a natural sounding answer based on the retrieved text.

---

Turning the embeddings into a NumPy array is the first step; this provides more flexibility given the many functions available that operate on NumPy arrays. It also flattens each embedding to a 1-D array, which is the format required for many subsequent operations.

```python
import numpy as np
# Note: embeddings_utils ships with pre-1.0 versions of the openai package;
# a drop-in cosine-distance replacement is sketched further below.
from openai.embeddings_utils import distances_from_embeddings

df = pd.read_csv('processed/embeddings.csv', index_col=0)
df['embeddings'] = df['embeddings'].apply(eval).apply(np.array)

df.head()
```

Now that the data is ready, the question needs to be converted to an embedding with a simple function. This is important because the embeddings search compares vectors of numbers (the converted raw text) using cosine distance: vectors that are close in cosine distance are likely to be related and may contain the answer to the question. Older (pre-1.0) versions of the OpenAI Python package include a built-in `distances_from_embeddings` function which is useful here.
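
If your installed version of the openai package no longer ships `openai.embeddings_utils`, a small NumPy-based replacement can compute the same cosine distances. This is a minimal sketch that mirrors the helper's name and call signature:

```python
import numpy as np

def distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine"):
    # Cosine distance = 1 - cosine similarity, computed against every stored embedding
    query = np.array(query_embedding)
    return [
        1 - np.dot(query, emb) / (np.linalg.norm(query) * np.linalg.norm(emb))
        for emb in embeddings
    ]
```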

```python
def create_context(
    question, df, max_len=1800, size="ada"
):
    """
    Create a context for a question by finding the most similar context from the dataframe
    """

    # Get the embeddings for the question
    q_embeddings = client.embeddings.create(input=question, model='text-embedding-ada-002').data[0].embedding

    # Get the distances from the embeddings
    df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')

    returns = []
    cur_len = 0

    # Sort by distance and add the text to the context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():

        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4

        # If the context is too long, break
        if cur_len > max_len:
            break

        # Else add it to the text that is being returned
        returns.append(row["text"])

    # Return the context
    return "\n\n###\n\n".join(returns)
```

Because the text was broken up into smaller sets of tokens, looping through in ascending order of distance and continuing to add text is a critical step to ensure a full answer. The max_len can also be modified to something smaller if more content than desired is returned.

The previous step only retrieved chunks of text that are semantically related to the question, so they might contain the answer, but there's no guarantee of it. The chance of finding an answer can be further increased by returning the top 5 most likely results.
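
For example, a variant of the retrieval step (a sketch, not part of the tutorial code) could take the five closest chunks rather than filling the context up to `max_len`:

```python
def top_k_contexts(question, df, k=5):
    # Embed the question and return the k closest text chunks by cosine distance
    q_embeddings = client.embeddings.create(input=question, model='text-embedding-ada-002').data[0].embedding
    df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')
    return df.sort_values('distances', ascending=True).head(k)['text'].tolist()
```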

The answering prompt will then try to extract the relevant facts from the retrieved contexts in order to formulate a coherent answer. If no relevant answer is found in the context, the model will respond with "I don't know".

A realistic sounding answer to the question can then be created with the chat completions endpoint using `gpt-3.5-turbo`.

```python
def answer_question(
    df,
    model="gpt-3.5-turbo",
    question="Am I allowed to publish model outputs to Twitter, without a human review?",
    max_len=1800,
    size="ada",
    debug=False,
    max_tokens=150,
    stop_sequence=None
):
    """
    Answer a question based on the most similar context from the dataframe texts
    """
    context = create_context(
        question,
        df,
        max_len=max_len,
        size=size,
    )
    # If debug, print the retrieved context
    if debug:
        print("Context:\n" + context)
        print("\n\n")

    try:
        # Create a chat completion using the question and context
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Answer the question based on the context below, and if the question can't be answered based on the context, say \"I don't know\"\n\n"},
                {"role": "user", "content": f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"}
            ],
            temperature=0,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=stop_sequence,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(e)
        return ""
```

It is done! A working Q/A system that has the knowledge embedded from the OpenAI website is now ready. A few quick tests can be done to see the quality of the output:

```python
answer_question(df, question="What day is it?", debug=False)

answer_question(df, question="What is our newest embeddings model?")

answer_question(df, question="What is ChatGPT?")
```

The responses will look something like the following:

```response
"I don't know."

'The newest embeddings model is text-embedding-ada-002.'

'ChatGPT is a model trained to interact in a conversational way. It is able to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.'
```

If the system is not able to answer a question that it should know, it is worth searching through the raw text files to see whether the expected information actually ended up being embedded. The crawl was initially set up to skip sites outside the original domain provided, so the system may lack knowledge that lives on a subdomain.
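
One quick way to check is to filter the processed dataframe for a phrase you expect the crawl to have captured (a minimal sketch; the phrase below is just an example):

```python
phrase = "fine-tuning"  # any keyword you expect to appear in the scraped pages

# Show how many text chunks mention the phrase
matches = df[df['text'].str.contains(phrase, case=False, na=False)]
print(f"{len(matches)} chunks mention '{phrase}'")
```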

Currently, the dataframe is passed in each time a question is answered. For production workflows, a [vector database solution](/docs/guides/embeddings/how-can-i-retrieve-k-nearest-embedding-vectors-quickly) should be used instead of storing the embeddings in a CSV file, but the current approach is a great option for prototyping.
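
As an intermediate step before moving to a vector database, the embeddings can be stacked into a single NumPy matrix once, so each query becomes one vectorized similarity computation instead of a per-row Python loop. This is a hedged sketch; the `embedding_matrix` and `closest_chunks` names are illustrative assumptions:

```python
import numpy as np

# Stack all chunk embeddings into one (n_chunks, n_dims) matrix up front
embedding_matrix = np.vstack(df['embeddings'].values)
matrix_norms = np.linalg.norm(embedding_matrix, axis=1)

def closest_chunks(q_embedding, k=5):
    # Vectorized cosine similarity against every stored chunk at once
    q = np.array(q_embedding)
    sims = embedding_matrix @ q / (matrix_norms * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return df.iloc[top]['text'].tolist()
```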