"In this quickstart you will learn how to build a \"philosophy quote finder & generator\" using OpenAI's vector embeddings and [Apache Cassandra®](https://cassandra.apache.org), or equivalently DataStax [Astra DB through CQL](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html), as the vector store for data persistence.\n",
"The basic workflow of this notebook is outlined below. You will evaluate and store the vector embeddings for a number of quotes by famous philosophers, use them to build a powerful search engine and, after that, even a generator of new quotes!\n",
"The notebook exemplifies some of the standard usage patterns of vector search -- while showing how easy is it to get started with the vector capabilities of [Cassandra](https://cassandra.apache.org/doc/trunk/cassandra/vector-search/overview.html) / [Astra DB through CQL](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html).\n",
"For a background on using vector search and text embeddings to build a question-answering system, please check out this excellent hands-on notebook: [Question answering using embeddings](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb).\n",
"\n",
"#### _Choose-your-framework_\n",
"\n",
"Please note that this notebook uses the [CassIO library](https://cassio.org), but we cover other choices of technology to accomplish the same task. Check out this folder's [README](https://github.com/openai/openai-cookbook/tree/main/examples/vector_databases/cassandra_astradb) for other options. This notebook can run either as a Colab notebook or as a regular Jupyter notebook.\n",
"\n",
"Table of contents:\n",
"- Setup\n",
"- Get DB connection\n",
"- Connect to OpenAI\n",
"- Load quotes into the Vector Store\n",
"- Use case 1: **quote search engine**\n",
"- Use case 2: **quote generator**\n",
"- (Optional) exploit partitioning in the Vector Store"
]
},
{
"cell_type": "markdown",
"id": "cddf17cc-eef4-4021-b72a-4d3832a9b4a7",
"metadata": {},
"source": [
"### How it works\n",
"\n",
"**Indexing**\n",
"\n",
"Each quote is made into an embedding vector with OpenAI's `Embedding`. These are saved in the Vector Store for later use in searching. Some metadata, including the author's name and a few other pre-computed tags, are stored alongside, to allow for search customization.\n",
"To find a quote similar to the provided search quote, the latter is made into an embedding vector on the fly, and this vector is used to query the store for similar vectors ... i.e. similar quotes that were previously indexed. The search can optionally be constrained by additional metadata (\"find me quotes by Spinoza similar to this one ...\").\n",
"The key point here is that \"quotes similar in content\" translates, in vector space, to vectors that are metrically close to each other: thus, vector similarity search effectively implements semantic similarity. _This is the key reason vector embeddings are so powerful._\n",
"\n",
"The sketch below tries to convey this idea. Each quote, once it's made into a vector, is a point in space. Well, in this case it's on a sphere, since OpenAI's embedding vectors, as most others, are normalized to _unit length_. Oh, and the sphere is actually not three-dimensional, rather 1536-dimensional!\n",
"\n",
"So, in essence, a similarity search in vector space returns the vectors that are closest to the query vector:\n",
"Given a suggestion (a topic or a tentative quote), the search step is performed, and the first returned results (quotes) are fed into an LLM prompt which asks the generative model to invent a new text along the lines of the passed examples _and_ the initial suggestion.\n",
" Make sure you have both strings -- which are obtained in the [Astra UI](https://astra.datastax.com) once you sign in. For more information, see here: [database ID](https://awesome-astra.github.io/docs/pages/astra/faq/#where-should-i-find-a-database-identifier) and [Token](https://awesome-astra.github.io/docs/pages/astra/create-token/#c-procedure).\n",
"If you want to _connect to a Cassandra cluster_ (which however must [support](https://cassandra.apache.org/doc/trunk/cassandra/vector-search/overview.html) Vector Search), replace with `cassio.init(session=..., keyspace=...)` with suitable Session and keyspace name for your cluster."
"_(Incidentally, you could also use any Cassandra cluster (as long as it provides Vector capabilities), just by [changing the parameters](https://docs.datastax.com/en/developer/python-driver/latest/getting_started/#connecting-to-cassandra) to the following `Cluster` instantiation.)_"
"You will compute the embeddings for the quotes and save them into the Vector Store, along with the text itself and the metadata planned for later use. Note that the author is added as a metadata field along with the \"tags\" already found with the quote itself.\n",
"_(Note: for faster execution, Cassandra and CassIO would let you do concurrent inserts, which we don't do here for a more straightforward demo code.)_"
"For the quote-search functionality, you need first to make the input quote into a vector, and then use it to query the store (besides handling the optional metadata into the search call, that is).\n",
"\n",
"Encapsulate the search-engine functionality into a function for ease of re-use:"
" ('We give up leisure in order that we may have leisure, just as we go to war in order that we may have peace.',\n",
" 'aristotle'),\n",
" ('Perhaps the gods are kind to us, by making life more disagreeable as we grow older. In the end death seems less intolerable than the manifold burdens we carry',\n",
"[('Mankind will never see an end of trouble until lovers of wisdom come to hold political power, or the holders of power become lovers of wisdom',\n",
" 'plato'),\n",
" ('Everything the State says is a lie, and everything it has it has stolen.',\n",
"find_quote_and_author(\"We struggle all our life for nothing\", 2, tags=[\"politics\"])"
]
},
{
"cell_type": "markdown",
"id": "746fe38f-139f-44a6-a225-a63e40d3ddf5",
"metadata": {},
"source": [
"### Cutting out irrelevant results\n",
"\n",
"The vector similarity search generally returns the vectors that are closest to the query, even if that means results that might be somewhat irrelevant if there's nothing better.\n",
"\n",
"To keep this issue under control, you can get the actual \"distance\" between the query and each result, and then set a cutoff on it, effectively discarding results that are beyond that threshold.\n",
"Tuning this threshold correctly is not an easy problem: here, we'll just show you the way.\n",
"\n",
"To get a feeling on how this works, try the following query and play with the choice of quote and threshold to compare the results:\n",
"_Note (for the mathematically inclined): this \"distance\" is exactly the cosine similarity between the vectors, i.e. the scalar product divided by the product of the norms of the two vectors. As such, it is a number ranging from -1 to +1, where -1 is for exactly opposite-facing vectors and +1 for identically-oriented vectors. Elsewhere (e.g. in the \"CQL\" counterpart of this demo) you would get a rescaling of this quantity to fit the [0, 1] interval, which means the resulting numerical values and adequate thresholds there are transformed accordingly._"
"For this task you need another component from OpenAI, namely an LLM to generate the quote for us (based on input obtained by querying the Vector Store).\n",
"\n",
"You also need a template for the prompt that will be filled for the generate-quote LLM completion task."
"Just passing a text (a \"quote\", but one can actually just suggest a topic since its vector embedding will still end up at the right place in the vector space):"
"** - Our moral virtues benefit mainly other people; intellectual virtues, on the other hand, benefit primarily ourselves; therefore the former make us universally popular, the latter unpopular. (schopenhauer)\n",
"** - Because Christian morality leaves animals out of account, they are at once outlawed in philosophical morals; they are mere 'things,' mere means to any ends whatsoever. They can therefore be used for vivisection, hunting, coursing, bullfights, and horse racing, and can be whipped to death as they struggle along with heavy carts of stone. Shame on such a morality that is worthy of pariahs, and that fails to recognize the eternal essence that exists in every living thing, and shines forth with inscrutable significance from all eyes that see the sun! (schopenhauer)\n",
"** - The assumption that animals are without rights, and the illusion that our treatment of them has no moral significance, is a positively outrageous example of Western crudity and barbarity. Universal compassion is the only guarantee of morality. (schopenhauer)\n",
"There's an interesting topic to examine before completing this quickstart. While, generally, tags and quotes can be in any relationship (e.g. a quote having multiple tags), _authors_ are effectively an exact grouping (they define a \"disjoint partitioning\" on the set of quotes): each quote has exactly one author (for us, at least).\n",
"\n",
"Now, suppose you know in advance your application will usually (or always) run queries on a _single author_. Then you can take full advantage of the underlying database structure: if you group quotes in **partitions** (one per author), vector queries on just an author will use less resources and return much faster.\n",
"\n",
"We'll not dive into the details here, which have to do with the Cassandra storage internals: the important message is that **if your queries are run within a group, consider partitioning accordingly to boost performance**.\n",
"\n",
"You'll now see this choice in action."
]
},
{
"cell_type": "markdown",
"id": "2b7eb294-85e0-4b5f-8a2f-4731361c2bd9",
"metadata": {},
"source": [
"First, you need a different table abstraction from CassIO:"
"Compared to what you have seen earlier, there is a crucial difference in that now the quote's author is stored as the _partition id_ for the inserted row, instead of being added to the catch-all \"metadata\" dictionary.\n",
"\n",
"While you are at it, by way of demonstration, you will insert all quotes by a given author _concurrently_: with CassIO, this is done by usng the asynchronous `put_async` method for each quote, collecting the resulting list of `Future` objects, and calling the `result()` method on them all afterwards, to ensure they all have executed. Cassandra / Astra DB well supports a high degree of concurrency in I/O operations.\n",
"_(Note: one could have cached the embeddings computed previously to save a few API tokens -- here, however, we wanted to keep the code easier to inspect.)_"
" ('We give up leisure in order that we may have leisure, just as we go to war in order that we may have peace.',\n",
" 'aristotle'),\n",
" ('Perhaps the gods are kind to us, by making life more disagreeable as we grow older. In the end death seems less intolerable than the manifold burdens we carry',\n",
"find_quote_and_author_p(\"We struggle all our life for nothing\", 2, author=\"nietzsche\")"
]
},
{
"cell_type": "markdown",
"id": "871da950-2a06-4a77-a86b-c528935da3a6",
"metadata": {},
"source": [
"Well, you _would_ notice a performance gain, if you had a realistic-size dataset. In this demo, with a few tens of entries, there's no noticeable difference -- but you get the idea."
"Congratulations! You have learned how to use OpenAI for vector embeddings and Cassandra / Astra DB through CQL for storage in order to build a sophisticated philosophical search engine and quote generator.\n",
"This example used [CassIO](https://cassio.org) to interface with the Vector Store - but this is not the only choice. Check the [README](https://github.com/openai/openai-cookbook/tree/main/examples/vector_databases/cassandra_astradb) for other options and integration with popular frameworks.\n",
"To find out more on how Astra DB's Vector Search capabilities can be a key ingredient in your ML/GenAI applications, visit [Astra DB](https://docs.datastax.com/en/astra/home/astra.html)'s web page on the topic."