# Semantic Search with Pinecone and OpenAI

In this guide you will learn how to use the OpenAI Embedding API to generate language embeddings, and then index those embeddings in the Pinecone vector database for fast and scalable vector search.

This is a powerful and common combination for building semantic search, question-answering, threat-detection, and other applications that rely on NLP and search over a large corpus of text data.

The basic workflow looks like this:

**Embed and index**

* Use the OpenAI Embedding API to generate vector embeddings of your documents (or any text data).
* Upload those vector embeddings into Pinecone, which can store and index millions or billions of them and search through them at very low latency.

**Search**

* Pass your query text or document through the OpenAI Embedding API again.
* Take the resulting vector embedding and send it as a query to Pinecone.
* Get back semantically similar documents, even if they don't share any keywords with the query.

![Architecture overview](https://files.readme.io/6a3ea5a-pinecone-openai-overview.png)
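To make the two phases concrete before touching any external service, here is a minimal in-memory sketch of the same embed, index, and search flow. Everything in it is hypothetical: `toy_embed` is a bag-of-words stand-in for the OpenAI Embedding API (it matches shared words only, not meaning), and `ToyIndex` is a brute-force stand-in for Pinecone, not its actual implementation.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Hypothetical brute-force index: stores vectors, scans all on query."""

    def __init__(self):
        self.records = {}

    def upsert(self, vectors):
        # each item is an (id, vector, metadata) tuple
        for id_, vector, meta in vectors:
            self.records[id_] = (vector, meta)

    def query(self, vector, top_k=3):
        scored = [
            (cosine(vector, stored), id_, meta)
            for id_, (stored, meta) in self.records.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


# embed and index
docs = [
    "how did the great depression start",
    "what crop failure caused the famine",
    "when did the war start",
]
toy_index = ToyIndex()
toy_index.upsert(
    [(str(i), toy_embed(doc), {"text": doc}) for i, doc in enumerate(docs)]
)

# search
results = toy_index.query(toy_embed("great depression"), top_k=2)
for score, id_, meta in results:
    print(f"{score:.2f}: {meta['text']}")
```

A real embedding model replaces `toy_embed` with dense vectors that capture meaning, and a vector database replaces the linear scan with an index built for large collections.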

Let's get started...

## Setup

We first need to set up our environment and retrieve API keys for OpenAI and Pinecone. Let's start with the environment: we need Hugging Face *Datasets* for our data, plus the OpenAI and Pinecone clients:

In [None]:
!pip install -qU pinecone-client openai datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pinecone-client
  Downloading pinecone_client-2.0.13-py3-none-any.whl (175 kB)
Collecting openai
  Downloading openai-0.25.0.tar.gz (44 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting datasets
  Downloading datasets-2.8.0-py3-none-any.whl (452 kB)
Collecting loguru>=0.5.0
  Downloading loguru-0.6.0-py3-none-any.whl (58 kB)
Collecting pandas-stubs>=1.1.0.11
  Downloading pandas_stubs-1.5.2.221213-py3-none-any.whl (147 kB)
Collecting types-pytz>=2

### Creating Embeddings

Next we initialize our connections to OpenAI and Pinecone, starting with OpenAI. Sign up for an API key over at [OpenAI](https://beta.openai.com/signup) and [Pinecone](https://app.pinecone.io).
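If you would rather not paste keys directly into the notebook, one option is to read them from environment variables. The sketch below is an assumption about how you might organize this, not something either client requires: `require_env` is a hypothetical helper, and the variable names are arbitrary.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first")
    return value


# Possible usage (variable names are assumptions, not client requirements):
# openai.api_key = require_env("OPENAI_API_KEY")
# pinecone.init(
#     api_key=require_env("PINECONE_API_KEY"),
#     environment=require_env("PINECONE_ENVIRONMENT"),
# )
```

Failing early with a clear message is usually easier to debug than an authentication error from a later API call.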

In [None]:
import openai

openai.api_key = "OPENAI_API_KEY"
# get API key from top-right dropdown on OpenAI website

openai.Engine.list()  # check we have authenticated

<OpenAIObject list at 0x7f98f1d704a0> JSON: {
  "data": [
    {
      "created": null,
      "id": "babbage",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "ada",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "davinci",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "babbage-code-search-code",
      "object": "engine",
      "owner": "openai-dev",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "text-similarity-babbage-001",
      "object": "engine",
      "owner": "openai-dev",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "text-davinci-003",
      "object": "engine",
      "owner": "openai-intern

We can now create embeddings with OpenAI's `text-embedding-ada-002` model like so:

In [None]:
MODEL = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=MODEL
)
res

<OpenAIObject list at 0x7f98efba5130> JSON: {
  "data": [
    {
      "embedding": [
        -0.003040769835934043,
        0.011684642173349857,
        -0.005026957020163536,
        -0.027237210422754288,
        -0.016361193731427193,
        0.03234503045678139,
        -0.016159038990736008,
        -0.001036894042044878,
        -0.025822116062045097,
        -0.00666779326274991,
        0.02014825865626335,
        0.016657691448926926,
        -0.009164425544440746,
        0.023423193022608757,
        -0.0101212989538908,
        0.01344340294599533,
        0.02522912435233593,
        -0.016873324289917946,
        0.012115909717977047,
        -0.016361193731427193,
        -0.00426887022331357,
        -0.006502698641270399,
        -0.004369948524981737,
        0.020808637142181396,
        -0.01053908932954073,
        -0.003652293002232909,
        0.01369272917509079,
        -0.026361199095845222,
        -0.0003171329153701663,
        -0.0022186669521033764,
   

In [None]:
print(f"vector 0: {len(res['data'][0]['embedding'])}\nvector 1: {len(res['data'][1]['embedding'])}")

vector 0: 1536
vector 1: 1536


In [None]:
# we can extract embeddings to a list
embeds = [record['embedding'] for record in res['data']]
len(embeds)

2
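Each record in the embedding response also carries an `index` field giving its position in the input list. The list comprehension above relies on the records already being in input order; a slightly more defensive sketch sorts on that field first. The helper name and the `fake_response` dictionary are made up for illustration, and the sort may well be redundant in practice.

```python
def embeddings_in_input_order(response):
    """Return embeddings sorted by each record's `index` field."""
    records = sorted(response["data"], key=lambda record: record["index"])
    return [record["embedding"] for record in records]


# hypothetical response with records deliberately out of order
fake_response = {
    "data": [
        {"index": 1, "embedding": [0.3, 0.4]},
        {"index": 0, "embedding": [0.1, 0.2]},
    ]
}
ordered = embeddings_in_input_order(fake_response)
print(ordered)
```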

Next, we initialize a Pinecone index to store the vector embeddings. The index dimension must match the embedding length, so we check that first.

In [None]:
len(embeds[0])

1536

In [None]:
import pinecone

index_name = 'semantic-search-openai'

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="PINECONE_API_KEY",
    environment="PINECONE_ENVIRONMENT"  # find next to api key in console
)
# check if the index already exists (only create it if not)
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=len(embeds[0]))
# connect to index
index = pinecone.Index(index_name)

## Populating the Index

Now we will take the first 1,000 questions from the TREC dataset.

In [None]:
from datasets import load_dataset

# load the first 1K rows of the TREC dataset
trec = load_dataset('trec', split='train[:1000]')
trec



Dataset({
    features: ['text', 'coarse_label', 'fine_label'],
    num_rows: 1000
})

In [None]:
trec[0]

{'text': 'How did serfdom develop in and then leave Russia ?',
 'coarse_label': 2,
 'fine_label': 26}

Then we create a vector embedding for each phrase using OpenAI, and `upsert` the ID, vector embedding, and original text for each phrase to Pinecone.

In [None]:
from tqdm.auto import tqdm

batch_size = 32  # process everything in batches of 32
for i in tqdm(range(0, len(trec['text']), batch_size)):
    # set end position of batch
    i_end = min(i+batch_size, len(trec['text']))
    # get batch of lines and IDs (row positions serve as unique IDs)
    lines_batch = trec['text'][i:i_end]
    ids_batch = [str(n) for n in range(i, i_end)]
    # create embeddings
    res = openai.Embedding.create(input=lines_batch, engine=MODEL)
    embeds = [record['embedding'] for record in res['data']]
    # prep metadata and upsert batch
    meta = [{'text': line} for line in lines_batch]
    to_upsert = zip(ids_batch, embeds, meta)
    # upsert to Pinecone
    index.upsert(vectors=list(to_upsert))

  0%|          | 0/32 [00:00<?, ?it/s]
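The batching arithmetic in the loop above can be checked in isolation, without calling either API. The sketch below uses a hypothetical `iter_batches` helper and placeholder strings rather than the TREC rows; it shows that 1,000 items in batches of 32 produce 32 batches, with the final batch holding the remaining 8 items.

```python
def iter_batches(items, batch_size):
    """Yield (start, end, chunk) for consecutive slices of `items`."""
    for start in range(0, len(items), batch_size):
        end = min(start + batch_size, len(items))
        yield start, end, items[start:end]


rows = [f"question {n}" for n in range(1000)]  # placeholder text, not TREC data

batch_sizes = []
last_ids = []
for start, end, chunk in iter_batches(rows, 32):
    # row positions double as unique string IDs, as in the loop above
    ids = [str(n) for n in range(start, end)]
    batch_sizes.append(len(chunk))
    last_ids = ids

print(len(batch_sizes), batch_sizes[-1], last_ids[0], last_ids[-1])
```

Deriving the IDs from the row offsets keeps them stable across reruns, so upserting the same batch twice overwrites records instead of duplicating them.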

---

## Querying

With our data indexed, we're now ready to move on to performing searches. This follows a similar process to indexing. We start with a text `query` that we would like to use to find similar sentences. As before, we encode it with the same OpenAI embedding model (`text-embedding-ada-002`) to create a *query vector* `xq`. We then use `xq` to query the Pinecone index.

In [None]:
query = "What caused the 1929 Great Depression?"

xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

Now query...

In [None]:
res = index.query([xq], top_k=5, include_metadata=True)
res

{'matches': [{'id': '932',
              'metadata': {'text': 'Why did the world enter a global '
                                   'depression in 1929 ?'},
              'score': 0.917971551,
              'sparseValues': {},
              'values': []},
             {'id': '787',
              'metadata': {'text': "When was `` the Great Depression '' ?"},
              'score': 0.87167418,
              'sparseValues': {},
              'values': []},
             {'id': '400',
              'metadata': {'text': 'What crop failure caused the Irish Famine '
                                   '?'},
              'score': 0.812044263,
              'sparseValues': {},
              'values': []},
             {'id': '775',
              'metadata': {'text': 'What historical event happened in Dogtown '
                                   'in 1899 ?'},
              'score': 0.798895657,
              'sparseValues': {},
              'values': []},
             {'id': '481',
            

The response from Pinecone includes our original text in the `metadata` field. Let's print out the `top_k` most similar questions and their respective similarity scores.

In [None]:
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.92: Why did the world enter a global depression in 1929 ?
0.87: When was `` the Great Depression '' ?
0.81: What crop failure caused the Irish Famine ?
0.80: What historical event happened in Dogtown in 1899 ?
0.79: What caused the Lynmouth floods ?
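The lower-ranked matches above are only loosely related to the query, since `top_k` always returns five results regardless of how similar they are. One way to drop weak matches is a score cutoff. The sketch below reuses three matches from the response shown earlier; the `filter_matches` helper and the 0.85 threshold are assumptions chosen for illustration, and a sensible cutoff depends on your data and on the similarity metric the index uses.

```python
# a subset of the response shown above, as a plain dictionary
response = {
    "matches": [
        {
            "id": "932",
            "score": 0.917971551,
            "metadata": {
                "text": "Why did the world enter a global depression in 1929 ?"
            },
        },
        {
            "id": "787",
            "score": 0.87167418,
            "metadata": {"text": "When was `` the Great Depression '' ?"},
        },
        {
            "id": "400",
            "score": 0.812044263,
            "metadata": {"text": "What crop failure caused the Irish Famine ?"},
        },
    ]
}


def filter_matches(matches, min_score):
    """Keep only matches whose similarity score is at least `min_score`."""
    return [match for match in matches if match["score"] >= min_score]


kept = filter_matches(response["matches"], min_score=0.85)  # arbitrary cutoff
for match in kept:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
```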


Looks good. Let's make it harder and replace *"depression"* with the incorrect term *"recession"*.

In [None]:
query = "What was the cause of the major recession in the early 20th century?"

# create the query embedding
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

# query, returning the top 5 most similar results
res = index.query([xq], top_k=5, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.88: Why did the world enter a global depression in 1929 ?
0.83: When was `` the Great Depression '' ?
0.81: What crop failure caused the Irish Famine ?
0.80: When did World War I start ?
0.80: What were popular songs and types of songs in the 1920s ?


And again...

In [None]:
query = "Why was there a long-term economic downturn in the early 20th century?"

# create the query embedding
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

# query, returning the top 5 most similar results
res = index.query([xq], top_k=5, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.90: Why did the world enter a global depression in 1929 ?
0.84: When was `` the Great Depression '' ?
0.80: When did World War I start ?
0.80: What crop failure caused the Irish Famine ?
0.80: When did the Dow first reach ?


Looks great. Our semantic search pipeline clearly captures the meaning of each query and returns the most semantically similar questions from those already indexed.

Once we're finished with the index we delete it to save resources.

In [None]:
pinecone.delete_index(index_name)

---