Mirror of https://github.com/james-m-jordan/openai-cookbook.git (synced 2025-05-09 19:32:38 +00:00)
[{"filename": "rag-deck.pdf", "text": "RAG\nTechnique\n\nFebruary 2024\n\n\fOverview\n\nRetrieval-Augmented Generation \nenhances the capabilities of language \nmodels by combining them with a \nretrieval system. This allows the model \nto leverage external knowledge sources \nto generate more accurate and \ncontextually relevant responses.\n\nExample use cases\n\n- Provide answers with up-to-date \n\ninformation\n\n- Generate contextual responses\n\nWhat we\u2019ll cover\n\n\u25cf Technical patterns\n\n\u25cf Best practices\n\n\u25cf Common pitfalls\n\n\u25cf Resources\n\n3\n\n\fWhat is RAG\n\nRetrieve information to Augment the model\u2019s knowledge and Generate the output\n\n\u201cWhat is your \nreturn policy?\u201d\n\nask\n\nresult\n\nsearch\n\nLLM\n\nreturn information\n\nTotal refunds: 0-14 days\n50% of value vouchers: 14-30 days\n$5 discount on next order: > 30 days\n\n\u201cYou can get a full refund up \nto 14 days after the \npurchase, then up to 30 days \nyou would get a voucher for \nhalf the value of your order\u201d\n\nKnowledge \nBase / External \nsources\n\n4\n\n\fWhen to use RAG\n\nGood for \u2705\n\nNot good for \u274c\n\n\u25cf\n\n\u25cf\n\nIntroducing new information to the model \n\n\u25cf\n\nTeaching the model a speci\ufb01c format, style, \n\nto update its knowledge\n\nReducing hallucinations by controlling \n\ncontent\n\n/!\\ Hallucinations can still happen with RAG\n\nor language\n\u2794 Use \ufb01ne-tuning or custom models instead\n\n\u25cf\n\nReducing token usage\n\u2794 Consider \ufb01ne-tuning depending on the use \n\ncase\n\n5\n\n\fTechnical patterns\n\nData preparation\n\nInput processing\n\nRetrieval\n\nAnswer Generation\n\n\u25cf Chunking\n\n\u25cf\n\n\u25cf\n\nEmbeddings\n\nAugmenting \ncontent\n\n\u25cf\n\nInput \naugmentation\n\n\u25cf NER\n\n\u25cf\n\nSearch\n\n\u25cf Context window\n\n\u25cf Multi-step \nretrieval\n\n\u25cf Optimisation\n\n\u25cf\n\nSafety checks\n\n\u25cf\n\nEmbeddings\n\n\u25cf Re-ranking\n\n6\n\n\fTechnical patterns\nData preparation\n\nchunk documents into multiple \npieces for easier consumption\n\ncontent\n\nembeddings\n\n0.983, 0.123, 0.289\u2026\n\n0.876, 0.145, 0.179\u2026\n\n0.983, 0.123, 0.289\u2026\n\nAugment content \nusing LLMs\n\nEx: parse text only, ask gpt-4 to rephrase & \nsummarize each part, generate bullet points\u2026\n\nBEST PRACTICES\n\nPre-process content for LLM \nconsumption: \nAdd summary, headers for each \npart, etc.\n+ curate relevant data sources\n\nKnowledge \nBase\n\nCOMMON PITFALLS\n\n\u2794 Having too much low-quality \n\ncontent\n\n\u2794 Having too large documents\n\n7\n\n\fTechnical patterns\nData preparation: chunking\n\nWhy chunking?\n\nIf your system doesn\u2019t require \nentire documents to provide \nrelevant answers, you can \nchunk them into multiple pieces \nfor easier consumption (reduced \ncost & latency).\n\nOther approaches: graphs or \nmap-reduce\n\nThings to consider\n\n\u25cf\n\nOverlap:\n\n\u25cb\n\n\u25cb\n\nShould chunks be independent or overlap one \nanother?\nIf they overlap, by how much?\n\n\u25cf\n\nSize of chunks: \n\n\u25cb What is the optimal chunk size for my use case?\n\u25cb\n\nDo I want to include a lot in the context window or \njust the minimum?\n\n\u25cf Where to chunk:\n\n\u25cb\n\n\u25cb\n\nShould I chunk every N tokens or use speci\ufb01c \nseparators? 
\nIs there a logical way to split the context that would \nhelp the retrieval process?\n\n\u25cf What to return:\n\n\u25cb\n\n\u25cb\n\nShould I return chunks across multiple documents \nor top chunks within the same doc?\nShould chunks be linked together with metadata to \nindicate common properties?\n\n8\n\n\fTechnical patterns\nData preparation: embeddings\n\nWhat to embed?\n\nDepending on your use case \nyou might not want just to \nembed the text in the \ndocuments but metadata as well \n- anything that will make it easier \nto surface this speci\ufb01c chunk or \ndocument when performing a \nsearch\n\nExamples\n\nEmbedding Q&A posts in a forum\nYou might want to embed the title of the posts, \nthe text of the original question and the content of \nthe top answers.\nAdditionally, if the posts are tagged by topic or \nwith keywords, you can embed those too.\n\nEmbedding product specs\nIn additional to embedding the text contained in \ndocuments describing the products, you might \nwant to add metadata that you have on the \nproduct such as the color, size, etc. in your \nembeddings.\n\n9\n\n\fTechnical patterns\nData preparation: augmenting content\n\nWhat does \u201cAugmenting \ncontent\u201d mean?\n\nAugmenting content refers to \nmodi\ufb01cations of the original content \nto make it more digestible for a \nsystem relying on RAG. The \nmodi\ufb01cations could be a change in \nformat, wording, or adding \ndescriptive content such as \nsummaries or keywords.\n\nExample approaches\n\nMake it a guide*\nReformat the content to look more like \na step-by-step guide with clear \nheadings and bullet-points, as this \nformat is more easily understandable \nby an LLM.\n\nAdd descriptive metadata*\nConsider adding keywords or text that \nusers might search for when thinking \nof a speci\ufb01c product or service.\n\nMultimodality\nLeverage models \nsuch as Whisper or \nGPT-4V to \ntransform audio or \nvisual content into \ntext.\nFor example, you \ncan use GPT-4V to \ngenerate tags for \nimages or to \ndescribe slides.\n\n* GPT-4 can do this for you with the right prompt\n\n10\n\n\fTechnical patterns\nInput processing\n\nProcess input according to task\n\nQ&A\nHyDE: Ask LLM to hypothetically answer the \nquestion & use the answer to search the KB\n\nembeddings\n\n0.983, 0.123, 0.289\u2026\n\n0.876, 0.145, 0.179\u2026\n\nContent search\nPrompt LLM to rephrase input & optionally add \nmore context\n\nquery\n\nSELECT * from items\u2026\n\nDB search\nNER: Find relevant entities to be used for a \nkeyword search or to construct a search query\n\nkeywords\n\nred\n\nsummer\n\nBEST PRACTICES\n\nConsider how to transform the \ninput to match content in the \ndatabase\nConsider using metadata to \naugment the user input\n\nCOMMON PITFALLS\n\n\u2794 Comparing directly the input \nto the database without \nconsidering the task \nspeci\ufb01cities \n\n11\n\n\fTechnical patterns\nInput processing: input augmentation\n\nWhat is input augmentation?\n\nExample approaches\n\nAugmenting the input means turning \nit into something di\ufb00erent, either \nrephrasing it, splitting it in several \ninputs or expanding it.\nThis helps boost performance as \nthe LLM might understand better \nthe user intent.\n\nQuery \nexpansion*\nRephrase the \nquery to be \nmore \ndescriptive\n\nHyDE*\nHypothetically \nanswer the \nquestion & use \nthe answer to \nsearch the KB\n\nSplitting a query in N*\nWhen there is more than 1 question or \nintent in a user query, consider \nsplitting it in several queries\n\nFallback\nConsider 
\nimplementing a \n\ufb02ow where the LLM \ncan ask for \nclari\ufb01cation when \nthere is not enough \ninformation in the \noriginal user query \nto get a result\n(Especially relevant \nwith tool usage)\n\n* GPT-4 can do this for you with the right prompt\n\n12\n\n\fTechnical patterns\nInput processing: NER\n\nWhy use NER?\n\nUsing NER (Named Entity \nRecognition) allows to extract \nrelevant entities from the input, that \ncan then be used for more \ndeterministic search queries. \nThis can be useful when the scope \nis very constrained.\n\nExample\n\nSearching for movies\nIf you have a structured database containing \nmetadata on movies, you can extract genre, \nactors or directors names, etc. from the user \nquery and use this to search the database\n\nNote: You can use exact values or embeddings after \nhaving extracted the relevant entities\n\n13\n\n\fTechnical patterns\nRetrieval\n\nre-ranking\n\nINPUT\n\nembeddings\n\n0.983, 0.123, 0.289\u2026\n\n0.876, 0.145, 0.179\u2026\n\nquery\n\nSELECT * from items\u2026\n\nkeywords\n\nred\n\nsummer\n\nSemantic \nsearch\n\nRESULTS\n\nRESULTS\n\nvector DB\n\nrelational / \nnosql db\n\nFINAL RESULT\n\nUsed to \ngenerate output\n\nBEST PRACTICES\n\nUse a combination of semantic \nsearch and deterministic queries \nwhere possible\n\n+ Cache output where possible\n\nCOMMON PITFALLS\n\n\u2794 The wrong elements could be \ncompared when looking at \ntext similarity, that is why \nre-ranking is important\n\n14\n\n\fTechnical patterns\nRetrieval: search\n\nHow to search?\n\nSemantic search\n\nKeyword search\n\nSearch query\n\nThere are many di\ufb00erent \napproaches to search depending on \nthe use case and the existing \nsystem.\n\nUsing embeddings, you \ncan perform semantic \nsearches. You can \ncompare embeddings \nwith what is in your \ndatabase and \ufb01nd the \nmost similar.\n\nIf you have extracted \nspeci\ufb01c entities or \nkeywords to search for, \nyou can search for these \nin your database.\n\nBased on the extracted \nentities you have or the \nuser input as is, you can \nconstruct search queries \n(SQL, cypher\u2026) and use \nthese queries to search \nyour database.\n\nYou can use a hybrid approach and combine several of these.\nYou can perform multiple searches in parallel or in sequence, or \nsearch for keywords with their embeddings for example.\n\n15\n\n\fTechnical patterns\nRetrieval: multi-step retrieval\n\nWhat is multi-step retrieval?\n\nIn some cases, there might be \nseveral actions to be performed to \nget the required information to \ngenerate an answer.\n\nThings to consider\n\n\u25cf\n\nFramework to be used:\n\n\u25cb When there are multiple steps to perform, \nconsider whether you want to handle this \nyourself or use a framework to make it easier\n\n\u25cf\n\nCost & Latency:\n\n\u25cb\n\n\u25cb\n\nPerforming multiple steps at the retrieval \nstage can increase latency and cost \nsigni\ufb01cantly\nConsider performing actions in parallel to \nreduce latency\n\n\u25cf\n\nChain of Thought:\n\n\u25cb\n\n\u25cb\n\nGuide the assistant with the chain of thought \napproach: break down instructions into \nseveral steps, with clear guidelines on \nwhether to continue, stop or do something \nelse. 
\nThis is more appropriate when tasks need to \nbe performed sequentially - for example: \u201cif \nthis didn\u2019t work, then do this\u201d\n\n16\n\n\fTechnical patterns\nRetrieval: re-ranking\n\nWhat is re-ranking?\n\nExample approaches\n\nRe-ranking means re-ordering the \nresults of the retrieval process to \nsurface more relevant results.\nThis is particularly important when \ndoing semantic searches.\n\nRule-based re-ranking\nYou can use metadata to rank results by relevance. For \nexample, you can look at the recency of the documents, at \ntags, speci\ufb01c keywords in the title, etc.\n\nRe-ranking algorithms\nThere are several existing algorithms/approaches you can use \nbased on your use case: BERT-based re-rankers, \ncross-encoder re-ranking, TF-IDF algorithms\u2026\n\n17\n\n\fTechnical patterns\nAnswer Generation\n\nFINAL RESULT\n\nPiece of content \nretrieved\n\nLLM\n\nPrompt including \nthe content\n\nUser sees the \n\ufb01nal result\n\nBEST PRACTICES\n\nEvaluate performance after each \nexperimentation to assess if it\u2019s \nworth exploring other paths\n+ Implement guardrails if applicable\n\nCOMMON PITFALLS\n\n\u2794 Going for \ufb01ne-tuning without \ntrying other approaches\n\u2794 Not paying attention to the \nway the model is prompted\n\n18\n\n\fTechnical patterns\nAnswer Generation: context window\n\nHow to manage context?\n\nDepending on your use case, there are \nseveral things to consider when \nincluding retrieved content into the \ncontext window to generate an answer. \n\nThings to consider\n\n\u25cf\n\nContext window max size:\n\n\u25cb\n\n\u25cb\n\nThere is a maximum size, so putting too \nmuch content is not ideal\nIn conversation use cases, the \nconversation will be part of the context \nas well and will add to that size\n\n\u25cf\n\nCost & Latency vs Accuracy:\n\n\u25cb More context results in increased \n\nlatency and additional costs since there \nwill be more input tokens\nLess context might also result in \ndecreased accuracy\n\n\u25cb\n\n\u25cf\n\n\u201cLost in the middle\u201d problem:\n\n\u25cb When there is too much context, LLMs \ntend to forget the text \u201cin the middle\u201d of \nthe content and might look over some \nimportant information.\n\n19\n\n\fTechnical patterns\nAnswer Generation: optimisation\n\nHow to optimise?\n\nThere are a few di\ufb00erent \nmethods to consider when \noptimising a RAG application.\nTry them from left to right, and \niterate with several of these \napproaches if needed.\n\nPrompt Engineering\n\nFew-shot examples\n\nFine-tuning\n\nAt each point of the \nprocess, experiment with \ndi\ufb00erent prompts to get \nthe expected input format \nor generate a relevant \noutput.\nTry guiding the model if \nthe process to get to the \n\ufb01nal outcome contains \nseveral steps.\n\nIf the model doesn\u2019t \nbehave as expected, \nprovide examples of what \nyou want e.g. provide \nexample user inputs and \nthe expected processing \nformat.\n\nIf giving a few examples \nisn\u2019t enough, consider \n\ufb01ne-tuning a model with \nmore examples for each \nstep of the process: you \ncan \ufb01ne-tune to get a \nspeci\ufb01c input processing \nor output format.\n\n20\n\n\fTechnical patterns\nAnswer Generation: safety checks\n\nWhy include safety checks?\n\nJust because you provide the model \nwith (supposedly) relevant context \ndoesn\u2019t mean the answer will \nsystematically be truthful or on-point.\nDepending on the use case, you \nmight want to double-check. 
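One lightweight way to implement such a double-check is to ask a second model whether the generated answer is actually supported by the retrieved context before showing it to the user. The sketch below is a hypothetical illustration in Python using the OpenAI chat completions client; the model name, prompt wording, and yes/no convention are assumptions made for the example, not something the deck prescribes.

```python
# Hypothetical post-generation safety check: ask a judge model whether the
# answer is grounded in the retrieved context before returning it.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer_is_grounded(question: str, context: str, answer: str) -> bool:
    """Return True if the judge model says every claim is supported by the context."""
    judge_prompt = (
        "You are a strict fact checker. Given a question, retrieved context, and a "
        "candidate answer, reply with only 'yes' if every claim in the answer is "
        "supported by the context, otherwise reply with only 'no'.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model name; any capable chat model could be used
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```

In practice you would only surface the answer (or trigger a fallback such as "I couldn't verify this") based on the boolean returned here, and log failures for later evaluation.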
\n\nExample evaluation framework: RAGAS\n\n21\n\n\f", "pages_description": ["**Overview**\n\nRetrieval-Augmented Generation (RAG) enhances language models by integrating them with a retrieval system. This combination allows the model to access external knowledge sources, resulting in more accurate and contextually relevant responses. \n\n**Example Use Cases:**\n- Providing answers with up-to-date information\n- Generating contextual responses\n\n**What We\u2019ll Cover:**\n- Technical patterns\n- Best practices\n- Common pitfalls\n- Resources", "What is RAG\n\nRAG stands for \"Retrieve information to Augment the model\u2019s knowledge and Generate the output.\" This process involves using a language model (LLM) to enhance its responses by accessing external information sources.\n\nHere's how it works:\n\n1. **User Query**: A user asks a question, such as \"What is your return policy?\"\n\n2. **LLM Processing**: The language model receives the question and initiates a search for relevant information.\n\n3. **Information Retrieval**: The LLM accesses a knowledge base or external sources to find the necessary details. In this example, the information retrieved includes:\n - Total refunds available from 0 to 14 days.\n - 50% value vouchers for returns between 14 to 30 days.\n - A $5 discount on the next order for returns after 30 days.\n\n4. **Response Generation**: The LLM uses the retrieved information to generate a coherent response for the user. For instance, it might say, \"You can get a full refund up to 14 days after the purchase, then up to 30 days you would get a voucher for half the value of your order.\"\n\nThis method allows the model to provide accurate and up-to-date answers by leveraging external data sources.", "When to use RAG\n\n**Good for:**\n\n- **Introducing new information to the model:** RAG (Retrieval-Augmented Generation) is effective for updating a model's knowledge by incorporating new data.\n\n- **Reducing hallucinations by controlling content:** While RAG can help minimize hallucinations, it's important to note that they can still occur.\n\n**Not good for:**\n\n- **Teaching the model a specific format, style, or language:** For these tasks, it's better to use fine-tuning or custom models.\n\n- **Reducing token usage:** If token usage is a concern, consider fine-tuning based on the specific use case.", "**Technical Patterns**\n\nThis image outlines four key technical patterns involved in data processing and answer generation:\n\n1. **Data Preparation**\n - **Chunking**: Breaking down data into smaller, manageable pieces.\n - **Embeddings**: Converting data into numerical formats that can be easily processed by machine learning models.\n - **Augmenting Content**: Enhancing data with additional information to improve its quality or usefulness.\n\n2. **Input Processing**\n - **Input Augmentation**: Adding extra data or features to the input to improve model performance.\n - **NER (Named Entity Recognition)**: Identifying and classifying key entities in the text, such as names, dates, and locations.\n - **Embeddings**: Similar to data preparation, embeddings are used here to represent input data in a format suitable for processing.\n\n3. **Retrieval**\n - **Search**: Locating relevant information from a dataset.\n - **Multi-step Retrieval**: Using multiple steps or methods to refine the search process and improve accuracy.\n - **Re-ranking**: Adjusting the order of retrieved results based on relevance or other criteria.\n\n4. 
**Answer Generation**\n - **Context Window**: Using a specific portion of data to generate relevant answers.\n - **Optimisation**: Improving the efficiency and accuracy of the answer generation process.\n - **Safety Checks**: Ensuring that the generated answers are safe and appropriate for use.", "**Technical Patterns: Data Preparation**\n\nThis presentation focuses on the process of preparing data for easier consumption by large language models (LLMs). \n\n1. **Content Chunking**: \n - Documents are divided into smaller, manageable pieces. This makes it easier for LLMs to process the information.\n\n2. **Embeddings**:\n - Each chunk of content is converted into embeddings, which are numerical representations (e.g., 0.983, 0.123, 0.289) that capture the semantic meaning of the text. These embeddings are then stored in a knowledge base.\n\n3. **Augmenting Content**:\n - Content can be enhanced using LLMs. For example, GPT-4 can be used to rephrase, summarize, and generate bullet points from the text.\n\n4. **Best Practices**:\n - Pre-process content for LLM consumption by adding summaries and headers for each part.\n - Curate relevant data sources to ensure quality and relevance.\n\n5. **Common Pitfalls**:\n - Avoid having too much low-quality content.\n - Ensure documents are not too large, as this can hinder processing efficiency.\n\nThis approach helps in organizing and optimizing data for better performance and understanding by LLMs.", "**Technical Patterns: Data Preparation - Chunking**\n\n**Why Chunking?**\n\nChunking is a technique used when your system doesn't need entire documents to provide relevant answers. By breaking documents into smaller pieces, you can make data easier to process, which reduces cost and latency. This approach is beneficial for systems that need to handle large volumes of data efficiently. Other methods for data preparation include using graphs or map-reduce.\n\n**Things to Consider**\n\n1. **Overlap:**\n - Should chunks be independent or overlap with one another?\n - If they overlap, by how much should they do so?\n\n2. **Size of Chunks:**\n - What is the optimal chunk size for your specific use case?\n - Do you want to include a lot of information in the context window, or just the minimum necessary?\n\n3. **Where to Chunk:**\n - Should you chunk every N tokens or use specific separators?\n - Is there a logical way to split the context that would aid the retrieval process?\n\n4. **What to Return:**\n - Should you return chunks across multiple documents or focus on top chunks within the same document?\n - Should chunks be linked together with metadata to indicate common properties?\n\nThese considerations help in designing an efficient chunking strategy that aligns with your system's requirements and goals.", "# Technical Patterns: Data Preparation - Embeddings\n\n## What to Embed?\n\nWhen preparing data for embedding, it's important to consider not just the text but also the metadata. This approach can enhance the searchability and relevance of the data. Here are some examples:\n\n### Examples\n\n1. **Embedding Q&A Posts in a Forum**\n - You might want to include the title of the posts, the original question, and the top answers.\n - Additionally, if the posts are tagged by topic or keywords, these can be embedded as well.\n\n2. 
**Embedding Product Specs**\n - Besides embedding the text from product descriptions, you can add metadata such as color, size, and other specifications to your embeddings.\n\nBy embedding both text and metadata, you can improve the ability to surface specific chunks or documents during a search.", "**Technical Patterns: Data Preparation - Augmenting Content**\n\n**What does \u201cAugmenting content\u201d mean?**\n\nAugmenting content involves modifying the original material to make it more accessible and understandable for systems that rely on Retrieval-Augmented Generation (RAG). These modifications can include changes in format, wording, or the addition of descriptive elements like summaries or keywords.\n\n**Example Approaches:**\n\n1. **Make it a Guide:**\n - Reformat the content into a step-by-step guide with clear headings and bullet points. This structure is more easily understood by a Language Learning Model (LLM). GPT-4 can assist with this transformation using the right prompts.\n\n2. **Add Descriptive Metadata:**\n - Incorporate keywords or text that users might search for when considering a specific product or service. This helps in making the content more searchable and relevant.\n\n3. **Multimodality:**\n - Utilize models like Whisper or GPT-4V to convert audio or visual content into text. For instance, GPT-4V can generate tags for images or describe slides, enhancing the content's accessibility and utility.", "**Technical Patterns: Input Processing**\n\nThis slide discusses methods for processing input data according to specific tasks, focusing on three main areas: Q&A, content search, and database (DB) search.\n\n1. **Q&A**: \n - Uses a technique called HyDE, where a large language model (LLM) is asked to hypothetically answer a question. This answer is then used to search the knowledge base (KB).\n\n2. **Content Search**:\n - Involves prompting the LLM to rephrase the input and optionally add more context to improve search results.\n\n3. **DB Search**:\n - Utilizes Named Entity Recognition (NER) to find relevant entities. These entities are then used for keyword searches or to construct a search query.\n\nThe slide also highlights different output formats:\n- **Embeddings**: Numerical representations of data, such as vectors (e.g., 0.983, 0.123, 0.289).\n- **Query**: SQL-like statements for database searches (e.g., SELECT * from items).\n- **Keywords**: Specific terms extracted from the input (e.g., \"red,\" \"summer\").\n\n**Best Practices**:\n- Transform the input to match the content in the database.\n- Use metadata to enhance user input.\n\n**Common Pitfalls**:\n- Avoid directly comparing input to the database without considering the specific requirements of the task.", "**Technical Patterns: Input Processing - Input Augmentation**\n\n**What is input augmentation?**\n\nInput augmentation involves transforming the input into something different, such as rephrasing it, splitting it into several inputs, or expanding it. This process enhances performance by helping the language model (LLM) better understand the user's intent.\n\n**Example Approaches:**\n\n1. **Query Expansion**\n - Rephrase the query to make it more descriptive. This helps the LLM grasp the context and details more effectively.\n\n2. **HyDE**\n - Hypothetically answer the question and use that answer to search the knowledge base (KB). This approach can provide more relevant results by anticipating possible answers.\n\n3. 
**Splitting a Query in N**\n - When a user query contains multiple questions or intents, consider dividing it into several queries. This ensures each part is addressed thoroughly.\n\n4. **Fallback**\n - Implement a flow where the LLM can ask for clarification if the original query lacks sufficient information. This is particularly useful when using tools that require precise input.\n\n*Note: GPT-4 can perform these tasks with the appropriate prompt.*", "Technical Patterns: Input Processing - NER\n\n**Why use NER?**\n\nNamed Entity Recognition (NER) is a technique used to extract relevant entities from input data. This process is beneficial for creating more deterministic search queries, especially when the scope is very constrained. By identifying specific entities, such as names, dates, or locations, NER helps in refining and improving the accuracy of searches.\n\n**Example: Searching for Movies**\n\nConsider a structured database containing metadata on movies. By using NER, you can extract specific entities like genre, actors, or directors' names from a user's query. This information can then be used to search the database more effectively. \n\n**Note:** After extracting the relevant entities, you can use exact values or embeddings to enhance the search process.", "Technical Patterns: Retrieval\n\nThis diagram illustrates a retrieval process using technical patterns. The process begins with three types of input: embeddings, queries, and keywords.\n\n1. **Embeddings**: These are numerical representations (e.g., 0.983, 0.123, 0.289) used for semantic search. They are processed through a vector database (vector DB).\n\n2. **Query**: This involves structured queries (e.g., \"SELECT * from items...\") that interact with a relational or NoSQL database.\n\n3. **Keywords**: Simple search terms like \"red\" and \"summer\" are also used with the relational or NoSQL database.\n\nThe results from both the vector and relational/NoSQL databases are combined. The initial results undergo a re-ranking process to ensure accuracy and relevance, leading to the final result, which is then used to generate output.\n\n**Best Practices**:\n- Combine semantic search with deterministic queries for more effective retrieval.\n- Cache outputs where possible to improve efficiency.\n\n**Common Pitfalls**:\n- Incorrect element comparison during text similarity checks can occur, highlighting the importance of re-ranking to ensure accurate results.", "Technical Patterns: Retrieval - Search\n\n**How to search?**\n\nThere are various approaches to searching, which depend on the use case and the existing system. Here are three main methods:\n\n1. **Semantic Search**:\n - This method uses embeddings to perform searches. \n - By comparing embeddings with the data in your database, you can find the most similar matches.\n\n2. **Keyword Search**:\n - If you have specific entities or keywords extracted, you can search for these directly in your database.\n\n3. **Search Query**:\n - Based on extracted entities or direct user input, you can construct search queries (such as SQL or Cypher) to search your database.\n\nAdditionally, you can use a hybrid approach by combining several methods. This can involve performing multiple searches in parallel or in sequence, or searching for keywords along with their embeddings.", "**Technical Patterns: Retrieval - Multi-step Retrieval**\n\n**What is multi-step retrieval?**\n\nMulti-step retrieval involves performing several actions to obtain the necessary information to generate an answer. 
This approach is useful when a single step is insufficient to gather all required data.\n\n**Things to Consider**\n\n1. **Framework to be Used:**\n - When multiple steps are needed, decide whether to manage this process yourself or use a framework to simplify the task.\n\n2. **Cost & Latency:**\n - Performing multiple steps can significantly increase both latency and cost.\n - To mitigate latency, consider executing actions in parallel.\n\n3. **Chain of Thought:**\n - Use a chain of thought approach to guide the process. Break down instructions into clear steps, providing guidelines on whether to continue, stop, or take alternative actions.\n - This method is particularly useful for tasks that must be performed sequentially, such as \"if this didn\u2019t work, then do this.\"", "**Technical Patterns: Retrieval - Re-ranking**\n\n**What is re-ranking?**\n\nRe-ranking involves re-ordering the results of a retrieval process to highlight more relevant outcomes. This is especially crucial in semantic searches, where understanding the context and meaning of queries is important.\n\n**Example Approaches**\n\n1. **Rule-based Re-ranking**\n - This approach uses metadata to rank results by relevance. For instance, you might consider the recency of documents, tags, or specific keywords in the title to determine their importance.\n\n2. **Re-ranking Algorithms**\n - There are various algorithms available for re-ranking based on specific use cases. Examples include BERT-based re-rankers, cross-encoder re-ranking, and TF-IDF algorithms. These methods apply different techniques to assess and order the relevance of search results.", "**Technical Patterns: Answer Generation**\n\nThis diagram illustrates the process of generating answers using a language model (LLM). Here's a breakdown of the components and concepts:\n\n1. **Process Flow:**\n - A piece of content is retrieved and used to create a prompt.\n - This prompt is fed into the LLM, which processes it to generate a final result.\n - The user then sees this final result.\n\n2. **Best Practices:**\n - It's important to evaluate performance after each experiment. This helps determine if exploring other methods is beneficial.\n - Implementing guardrails can be useful to ensure the model's outputs are safe and reliable.\n\n3. **Common Pitfalls:**\n - Avoid jumping straight to fine-tuning the model without considering other approaches that might be more effective or efficient.\n - Pay close attention to how the model is prompted, as this can significantly impact the quality of the output.\n\nBy following these guidelines, you can optimize the use of LLMs for generating accurate and useful answers.", "# Technical Patterns: Answer Generation - Context Window\n\n## How to Manage Context?\n\nWhen generating answers using a context window, it's important to consider several factors based on your specific use case. Here are key points to keep in mind:\n\n### Things to Consider\n\n- **Context Window Max Size:**\n - The context window has a maximum size, so overloading it with too much content is not ideal.\n - In conversational scenarios, the conversation itself becomes part of the context, contributing to the overall size.\n\n- **Cost & Latency vs. 
Accuracy:**\n - Including more context can lead to increased latency and higher costs due to the additional input tokens required.\n - Conversely, using less context might reduce accuracy.\n\n- **\"Lost in the Middle\" Problem:**\n - When the context is too extensive, language models may overlook or forget information that is \"in the middle\" of the content, potentially missing important details.", "**Technical Patterns: Answer Generation Optimisation**\n\n**How to optimise?**\n\nWhen optimising a Retrieval-Augmented Generation (RAG) application, there are several methods to consider. These methods should be tried sequentially from left to right, and multiple approaches can be iterated if necessary.\n\n1. **Prompt Engineering**\n - Experiment with different prompts at each stage of the process to achieve the desired input format or generate relevant output.\n - Guide the model through multiple steps to reach the final outcome.\n\n2. **Few-shot Examples**\n - If the model's behavior is not as expected, provide examples of the desired outcome.\n - Include sample user inputs and the expected processing format to guide the model.\n\n3. **Fine-tuning**\n - If a few examples are insufficient, consider fine-tuning the model with more examples for each process step.\n - Fine-tuning can help achieve a specific input processing or output format.", "Technical Patterns: Answer Generation - Safety Checks\n\n**Why include safety checks?**\n\nSafety checks are crucial because providing a model with supposedly relevant context does not guarantee that the generated answer will be truthful or accurate. Depending on the use case, it is important to double-check the information to ensure reliability.\n\n**RAGAS Score Evaluation Framework**\n\nThe RAGAS score is an evaluation framework that assesses both the generation and retrieval aspects of answer generation:\n\n- **Generation:**\n - **Faithfulness:** This measures how factually accurate the generated answer is.\n - **Answer Relevancy:** This evaluates how relevant the generated answer is to the question.\n\n- **Retrieval:**\n - **Context Precision:** This assesses the signal-to-noise ratio of the retrieved context, ensuring that the information is precise.\n - **Context Recall:** This checks if all relevant information required to answer the question is retrieved.\n\nBy using this framework, one can systematically evaluate and improve the quality of generated answers."]}, {"filename": "models-page.pdf", "text": "26/02/2024, 17:58\n\nModels - OpenAI API\n\nDocumentation\n\nAPI reference\n\nForum \n\nHelp \n\nModels\n\nOverview\n\nThe OpenAI API is powered by a diverse set of models with different capabilities and\nprice points. 
You can also make customizations to our models for your specific use\n\ncase with fine-tuning.\n\nMODEL\n\nDE S CRIPTION\n\nGPT-4 and GPT-4 Turbo A set of models that improve on GPT-3.5 and can\n\nunderstand as well as generate natural language or code\n\nGPT-3.5 Turbo\n\nA set of models that improve on GPT-3.5 and can\n\nunderstand as well as generate natural language or code\n\nDALL\u00b7E\n\nA model that can generate and edit images given a natural\n\nlanguage prompt\n\nTTS\n\nA set of models that can convert text into natural sounding\n\nspoken audio\n\nWhisper\n\nA model that can convert audio into text\n\nEmbeddings\n\nA set of models that can convert text into a numerical form\n\nModeration\n\nA fine-tuned model that can detect whether text may be\n\nsensitive or unsafe\n\nGPT base\n\nDeprecated\n\nA set of models without instruction following that can\nunderstand as well as generate natural language or code\n\nA full list of models that have been deprecated along with\nthe suggested replacement\n\nWe have also published open source models including Point-E, Whisper, Jukebox, and\nCLIP.\n\nContinuous model upgrades\n\nhttps://platform.openai.com/docs/models/overview\n\n1/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\ngpt-3.5-turbo , gpt-4 , and gpt-4-turbo-preview point to the latest model\nversion. You can verify this by looking at the response object after sending a request.\nThe response will include the specific model version used (e.g. gpt-3.5-turbo-\n0613 ).\n\nWe also offer static model versions that developers can continue using for at least\nthree months after an updated model has been introduced. With the new cadence of\nmodel updates, we are also giving people the ability to contribute evals to help us\n\nimprove the model for different use cases. If you are interested, check out the OpenAI\nEvals repository.\n\nLearn more about model deprecation on our deprecation page.\n\nGPT-4 and GPT-4 Turbo\n\nGPT-4 is a large multimodal model (accepting text or image inputs and outputting text)\nthat can solve difficult problems with greater accuracy than any of our previous\n\nmodels, thanks to its broader general knowledge and advanced reasoning capabilities.\n\nGPT-4 is available in the OpenAI API to paying customers. Like gpt-3.5-turbo , GPT-\n\n4 is optimized for chat but works well for traditional completions tasks using the Chat\nCompletions API. Learn how to use GPT-4 in our text generation guide.\n\nMODEL\n\nDE S CRIPTION\n\nCONTEXT\nWIND OW\n\nTRAINING\nDATA\n\ngpt-4-0125-preview\n\nNew GPT-4 Turbo\n\n128,000\n\nUp to\n\nDec\n\n2023\n\nThe latest GPT-4 model\n\ntokens\n\nintended to reduce cases of\n\n\u201claziness\u201d where the model\ndoesn\u2019t complete a task.\nReturns a maximum of\n\n4,096 output tokens.\nLearn more.\n\ngpt-4-turbo-preview\n\nCurrently points to gpt-4-\n\n0125-preview.\n\ngpt-4-1106-preview\n\nGPT-4 Turbo model\nfeaturing improved\ninstruction following, JSON\n\nmode, reproducible outputs,\nparallel function calling, and\nmore. Returns a maximum\nof 4,096 output tokens. This\n\n128,000\ntokens\n\nUp to\nDec\n2023\n\n128,000\ntokens\n\nUp to\nApr 2023\n\nhttps://platform.openai.com/docs/models/overview\n\n2/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nMODEL\n\nDE S CRIPTION\n\nis a preview model.\nLearn more.\n\nCONTEXT\nWIND OW\n\nTRAINING\nDATA\n\ngpt-4-vision-preview\n\nGPT-4 with the ability to\nunderstand images, in\n\n128,000\ntokens\n\nUp to\nApr 2023\n\naddition to all other GPT-4\nTurbo capabilities. 
Currently\npoints to gpt-4-1106-\n\nvision-preview.\n\ngpt-4-1106-vision-preview GPT-4 with the ability to\n\nunderstand images, in\naddition to all other GPT-4\n\nTurbo capabilities. Returns a\nmaximum of 4,096 output\n\ntokens. This is a preview\n\nmodel version. Learn more.\n\n128,000\ntokens\n\nUp to\nApr 2023\n\ngpt-4\n\ngpt-4-0613\n\nCurrently points to gpt-4-\n\n8,192\n\nUp to\n\n0613. See\n\ntokens\n\nSep 2021\n\ncontinuous model upgrades.\n\nSnapshot of gpt-4 from\n\nJune 13th 2023 with\n\nimproved function calling\n\nsupport.\n\n8,192\ntokens\n\nUp to\nSep 2021\n\ngpt-4-32k\n\nCurrently points to gpt-4-\n\ngpt-4-32k-0613\n\n32k-0613. See\n\ncontinuous model upgrades.\nThis model was never rolled\nout widely in favor of GPT-4\n\nTurbo.\n\nSnapshot of gpt-4-32k\n\nfrom June 13th 2023 with\nimproved function calling\nsupport. This model was\nnever rolled out widely in\n\nfavor of GPT-4 Turbo.\n\n32,768\n\ntokens\n\nUp to\n\nSep 2021\n\n32,768\n\ntokens\n\nUp to\n\nSep 2021\n\nFor many basic tasks, the difference between GPT-4 and GPT-3.5 models is not\nsignificant. However, in more complex reasoning situations, GPT-4 is much more\ncapable than any of our previous models.\n\nhttps://platform.openai.com/docs/models/overview\n\n3/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nMultilingual capabilities\n\nGPT-4 outperforms both previous large language models and as of 2023, most state-\nof-the-art systems (which often have benchmark-specific training or hand-\nengineering). On the MMLU benchmark, an English-language suite of multiple-choice\nquestions covering 57 subjects, GPT-4 not only outperforms existing models by a\nconsiderable margin in English, but also demonstrates strong performance in other\nlanguages.\n\nGPT-3.5 Turbo\n\nGPT-3.5 Turbo models can understand and generate natural language or code and\nhave been optimized for chat using the Chat Completions API but work well for non-\nchat tasks as well.\n\nCONTEXT\nWIND OW\n\nTRAINING\nDATA\n\n16,385\n\ntokens\n\nUp to Sep\n\n2021\n\nMODEL\n\nDE S CRIPTION\n\ngpt-3.5-turbo-0125\n\nNew Updated GPT 3.5 Turbo\n\nThe latest GPT-3.5 Turbo\nmodel with higher accuracy at\n\nresponding in requested\n\nformats and a fix for a bug\n\nwhich caused a text encoding\nissue for non-English\n\nlanguage function calls.\n\nReturns a maximum of 4,096\n\noutput tokens. Learn more.\n\ngpt-3.5-turbo\n\nCurrently points to gpt-3.5-\n\n4,096\n\nUp to Sep\n\nturbo-0613. The gpt-3.5-\n\ntokens\n\n2021\n\nturbo model alias will be\n\nautomatically upgraded from\ngpt-3.5-turbo-0613 to\n\ngpt-3.5-turbo-0125 on\n\nFebruary 16th.\n\ngpt-3.5-turbo-1106\n\nGPT-3.5 Turbo model with\nimproved instruction\n\n16,385\ntokens\n\nUp to Sep\n2021\n\nfollowing, JSON mode,\nreproducible outputs, parallel\nfunction calling, and more.\nReturns a maximum of 4,096\n\noutput tokens. Learn more.\n\nhttps://platform.openai.com/docs/models/overview\n\n4/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nMODEL\n\nDE S CRIPTION\n\ngpt-3.5-turbo-instruct Similar capabilities as GPT-3\nera models. 
Compatible with\nlegacy Completions endpoint\nand not Chat Completions.\n\nCONTEXT\nWIND OW\n\nTRAINING\nDATA\n\n4,096\ntokens\n\nUp to Sep\n2021\n\ngpt-3.5-turbo-16k\n\nLegacy Currently points to\ngpt-3.5-turbo-16k-0613.\n\n16,385\ntokens\n\nUp to Sep\n2021\n\ngpt-3.5-turbo-0613\n\nLegacy Snapshot of gpt-3.5-\n\nturbo from June 13th 2023.\n\nWill be deprecated on June 13,\n2024.\n\n4,096\ntokens\n\nUp to Sep\n2021\n\ngpt-3.5-turbo-16k-0613\n\nLegacy Snapshot of gpt-3.5-\n\n16,385\n\nUp to Sep\n\n16k-turbo from June 13th\n\ntokens\n\n2021\n\n2023. Will be deprecated on\n\nJune 13, 2024.\n\nDALL\u00b7E\n\nDALL\u00b7E is a AI system that can create realistic images and art from a description in\n\nnatural language. DALL\u00b7E 3 currently supports the ability, given a prompt, to create a\n\nnew image with a specific size. DALL\u00b7E 2 also support the ability to edit an existing\n\nimage, or create variations of a user provided image.\n\nDALL\u00b7E 3 is available through our Images API along with DALL\u00b7E 2. You can try DALL\u00b7E 3\n\nthrough ChatGPT Plus.\n\nMODEL\n\nDE S CRIPTION\n\ndall-e-3\n\nNew DALL\u00b7E 3\n\nThe latest DALL\u00b7E model released in Nov 2023. Learn more.\n\ndall-e-2 The previous DALL\u00b7E model released in Nov 2022. The 2nd iteration of\nDALL\u00b7E with more realistic, accurate, and 4x greater resolution images\nthan the original model.\n\nTTS\n\nTTS is an AI model that converts text to natural sounding spoken text. We offer two\ndifferent model variates, tts-1 is optimized for real time text to speech use cases\nand tts-1-hd is optimized for quality. These models can be used with the Speech\n\nendpoint in the Audio API.\n\nhttps://platform.openai.com/docs/models/overview\n\n5/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nMODEL\n\nDE S CRIPTION\n\ntts-1\n\nNew Text-to-speech 1\nThe latest text to speech model, optimized for speed.\n\ntts-1-hd\n\nNew Text-to-speech 1 HD\nThe latest text to speech model, optimized for quality.\n\nWhisper\n\nWhisper is a general-purpose speech recognition model. It is trained on a large dataset\nof diverse audio and is also a multi-task model that can perform multilingual speech\nrecognition as well as speech translation and language identification. The Whisper v2-\n\nlarge model is currently available through our API with the whisper-1 model name.\n\nCurrently, there is no difference between the open source version of Whisper and the\n\nversion available through our API. However, through our API, we offer an optimized\ninference process which makes running Whisper through our API much faster than\n\ndoing it through other means. For more technical details on Whisper, you can read the\n\npaper.\n\nEmbeddings\n\nEmbeddings are a numerical representation of text that can be used to measure the\n\nrelatedness between two pieces of text. Embeddings are useful for search, clustering,\n\nrecommendations, anomaly detection, and classification tasks. 
You can read more\nabout our latest embedding models in the announcement blog post.\n\nMODEL\n\nDE S CRIPTION\n\ntext-embedding-\n3-large\n\nNew Embedding V3 large\nMost capable embedding model for both\n\nenglish and non-english tasks\n\ntext-embedding-\n\nNew Embedding V3 small\n\n3-small\n\nIncreased performance over 2nd generation ada\nembedding model\n\ntext-embedding-\nada-002\n\nMost capable 2nd generation embedding\nmodel, replacing 16 first generation models\n\nOUTP UT\nDIMENSION\n\n3,072\n\n1,536\n\n1,536\n\nModeration\n\nhttps://platform.openai.com/docs/models/overview\n\n6/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nThe Moderation models are designed to check whether content complies with\nOpenAI's usage policies. The models provide classification capabilities that look for\ncontent in the following categories: hate, hate/threatening, self-harm, sexual,\nsexual/minors, violence, and violence/graphic. You can find out more in our moderation\n\nguide.\n\nModeration models take in an arbitrary sized input that is automatically broken up into\nchunks of 4,096 tokens. In cases where the input is more than 32,768 tokens,\n\ntruncation is used which in a rare condition may omit a small number of tokens from\nthe moderation check.\n\nThe final results from each request to the moderation endpoint shows the maximum\n\nvalue on a per category basis. For example, if one chunk of 4K tokens had a category\nscore of 0.9901 and the other had a score of 0.1901, the results would show 0.9901 in the\nAPI response since it is higher.\n\nMODEL\n\nDE S CRIPTION\n\nMAX\nTOKENS\n\ntext-moderation-latest Currently points to text-moderation-\n\n32,768\n\n007.\n\ntext-moderation-stable Currently points to text-moderation-\n\n32,768\n\n007.\n\ntext-moderation-007\n\nMost capable moderation model across\nall categories.\n\n32,768\n\nGPT base\n\nGPT base models can understand and generate natural language or code but are not\ntrained with instruction following. These models are made to be replacements for our\n\noriginal GPT-3 base models and use the legacy Completions API. Most customers\n\nshould use GPT-3.5 or GPT-4.\n\nMODEL\n\nDE S CRIPTION\n\nbabbage-002 Replacement for the GPT-3 ada and\n\nbabbage base models.\n\ndavinci-002 Replacement for the GPT-3 curie and\n\ndavinci base models.\n\nMAX\nTOKENS\n\nTRAINING\nDATA\n\n16,384\ntokens\n\n16,384\ntokens\n\nUp to Sep\n2021\n\nUp to Sep\n2021\n\nHow we use your data\n\nhttps://platform.openai.com/docs/models/overview\n\n7/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nYour data is your data.\n\nAs of March 1, 2023, data sent to the OpenAI API will not be used to train or improve\n\nOpenAI models (unless you explicitly opt in). One advantage to opting in is that the\nmodels may get better at your use case over time.\n\nTo help identify abuse, API data may be retained for up to 30 days, after which it will be\n\ndeleted (unless otherwise required by law). For trusted customers with sensitive\napplications, zero data retention may be available. 
With zero data retention, request\nand response bodies are not persisted to any logging mechanism and exist only in\nmemory in order to serve the request.\n\nNote that this data policy does not apply to OpenAI's non-API consumer services like\nChatGPT or DALL\u00b7E Labs.\n\nDefault usage policies by endpoint\n\nENDP OINT\n\nDATA USED\nFOR TRAINING\n\nDEFAULT\nRETENTION\n\nELIGIBLE FOR\nZERO RETENTION\n\n/v1/chat/completions*\n\nNo\n\n30 days\n\nYes, except\n\nimage inputs*\n\n/v1/files\n\n/v1/assistants\n\n/v1/threads\n\n/v1/threads/messages\n\n/v1/threads/runs\n\n/v1/threads/runs/steps\n\n/v1/images/generations\n\n/v1/images/edits\n\n/v1/images/variations\n\n/v1/embeddings\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\n/v1/audio/transcriptions No\n\nUntil deleted by\n\nNo\n\ncustomer\n\nUntil deleted by\n\nNo\n\ncustomer\n\n60 days *\n\n60 days *\n\n60 days *\n\n60 days *\n\n30 days\n\n30 days\n\n30 days\n\n30 days\n\nZero data\nretention\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nYes\n\n-\n\nhttps://platform.openai.com/docs/models/overview\n\n8/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nENDP OINT\n\nDATA USED\nFOR TRAINING\n\nDEFAULT\nRETENTION\n\nELIGIBLE FOR\nZERO RETENTION\n\n/v1/audio/translations\n\nNo\n\n/v1/audio/speech\n\n/v1/fine_tuning/jobs\n\n/v1/moderations\n\n/v1/completions\n\nNo\n\nNo\n\nNo\n\nNo\n\nZero data\nretention\n\n30 days\n\nUntil deleted by\ncustomer\n\nZero data\nretention\n\n-\n\nNo\n\nNo\n\n-\n\n30 days\n\nYes\n\n* Image inputs via the gpt-4-vision-preview model are not eligible for zero\nretention.\n\n* For the Assistants API, we are still evaluating the default retention period during the\n\nBeta. We expect that the default retention period will be stable after the end of the\n\nBeta.\n\nFor details, see our API data usage policies. To learn more about zero retention, get in\n\ntouch with our sales team.\n\nModel endpoint compatibility\n\nENDP OINT\n\nL ATE ST MODEL S\n\n/v1/assistants\n\nAll models except gpt-3.5-turbo-0301\n\nsupported. The retrieval tool requires gpt-4-\n\nturbo-preview (and subsequent dated model\n\nreleases) or gpt-3.5-turbo-1106 (and\n\nsubsequent versions).\n\n/v1/audio/transcriptions whisper-1\n\n/v1/audio/translations\n\nwhisper-1\n\n/v1/audio/speech\n\ntts-1, tts-1-hd\n\n/v1/chat/completions\n\ngpt-4 and dated model releases, gpt-4-turbo-\n\npreview and dated model releases, gpt-4-\n\nvision-preview, gpt-4-32k and dated model\n\nreleases, gpt-3.5-turbo and dated model\n\nhttps://platform.openai.com/docs/models/overview\n\n9/10\n\n\f26/02/2024, 17:58\n\nENDP OINT\n\nModels - OpenAI API\n\nL ATE ST MODEL S\n\nreleases, gpt-3.5-turbo-16k and dated model\n\nreleases, fine-tuned versions of gpt-3.5-turbo\n\n/v1/completions (Legacy) gpt-3.5-turbo-instruct, babbage-002,\n\ndavinci-002\n\n/v1/embeddings\n\ntext-embedding-3-small, text-embedding-\n\n3-large, text-embedding-ada-002\n\n/v1/fine_tuning/jobs\n\ngpt-3.5-turbo, babbage-002, davinci-002\n\n/v1/moderations\n\ntext-moderation-stable, text-\n\nhttps://platform.openai.com/docs/models/overview\n\n10/10\n\n\f", "pages_description": ["**GPT-4 and GPT-4 Turbo**\n\nGPT-4 is a sophisticated multimodal model capable of processing both text and image inputs to produce text outputs. It is designed to tackle complex problems with higher accuracy than previous models, leveraging its extensive general knowledge and advanced reasoning skills. 
GPT-4 is accessible through the OpenAI API for paying customers and is optimized for chat applications, although it can also handle traditional completion tasks using the Chat Completions API.\n\n**Model Versions:**\n\n1. **gpt-4-0125-preview**\n - **Description:** This is the latest GPT-4 Turbo model, designed to minimize instances where the model fails to complete a task, known as \"laziness.\" It can return up to 4,096 output tokens.\n - **Context Window:** 128,000 tokens\n - **Training Data:** Up to December 2023\n\n2. **gpt-4-turbo-preview**\n - **Description:** This version currently points to the gpt-4-0125-preview model.\n - **Context Window:** 128,000 tokens\n - **Training Data:** Up to December 2023\n\n3. **gpt-4-1106-preview**\n - **Description:** This version of GPT-4 Turbo includes enhancements such as improved instruction following, JSON mode, reproducible outputs, and parallel function calling. It also supports up to 4,096 output tokens.\n - **Context Window:** 128,000 tokens\n - **Training Data:** Up to April 2023\n\nThese models are part of OpenAI's ongoing efforts to provide developers with robust tools for various applications, ensuring flexibility and improved performance across different use cases.", "**Models - OpenAI API Overview**\n\nThis document provides an overview of various GPT-4 models, highlighting their capabilities, context windows, and training data timelines.\n\n1. **gpt-4-vision-preview**\n - **Description**: This model has the ability to understand images, in addition to all other GPT-4 Turbo capabilities. It currently points to the gpt-4-1106-vision-preview model.\n - **Context Window**: 128,000 tokens\n - **Training Data**: Up to April 2023\n\n2. **gpt-4-1106-vision-preview**\n - **Description**: Similar to the gpt-4-vision-preview, this model can understand images and includes all GPT-4 Turbo capabilities. It returns a maximum of 4,096 output tokens and is a preview model version.\n - **Context Window**: 128,000 tokens\n - **Training Data**: Up to April 2023\n\n3. **gpt-4**\n - **Description**: This model currently points to gpt-4-0613 and includes continuous model upgrades.\n - **Context Window**: 8,192 tokens\n - **Training Data**: Up to September 2021\n\n4. **gpt-4-0613**\n - **Description**: A snapshot of gpt-4 from June 13th, 2023, with improved function calling support.\n - **Context Window**: 8,192 tokens\n - **Training Data**: Up to September 2021\n\n5. **gpt-4-32k**\n - **Description**: This model points to gpt-4-32k-0613 and includes continuous model upgrades. It was not widely rolled out in favor of GPT-4 Turbo.\n - **Context Window**: 32,768 tokens\n - **Training Data**: Up to September 2021\n\n6. **gpt-4-32k-0613**\n - **Description**: A snapshot of gpt-4-32k from June 13th, 2023, with improved function calling support. Like gpt-4-32k, it was not widely rolled out in favor of GPT-4 Turbo.\n - **Context Window**: 32,768 tokens\n - **Training Data**: Up to September ", "**Multilingual Capabilities and GPT-3.5 Turbo**\n\n**Multilingual Capabilities**\n\nGPT-4 surpasses previous large language models and, as of 2023, most state-of-the-art systems. It excels in the MMLU benchmark, which involves English-language multiple-choice questions across 57 subjects. GPT-4 not only outperforms existing models in English but also shows strong performance in other languages.\n\n**GPT-3.5 Turbo**\n\nGPT-3.5 Turbo models are designed to understand and generate natural language or code. 
They are optimized for chat using the Chat Completions API but are also effective for non-chat tasks.\n\n**Model Descriptions:**\n\n1. **gpt-3.5-turbo-0125**\n - **Description:** Updated GPT-3.5 Turbo with improved accuracy and a fix for a text encoding bug in non-English language function calls. It returns up to 4,096 output tokens.\n - **Context Window:** 16,385 tokens\n - **Training Data:** Up to September 2021\n\n2. **gpt-3.5-turbo**\n - **Description:** Currently points to gpt-3.5-turbo-0613. The alias will automatically upgrade to gpt-3.5-turbo-0125 on February 16th.\n - **Context Window:** 4,096 tokens\n - **Training Data:** Up to September 2021\n\n3. **gpt-3.5-turbo-1106**\n - **Description:** Features improved instruction following, JSON mode, reproducible outputs, and parallel function calling. It returns up to 4,096 output tokens.\n - **Context Window:** 16,385 tokens\n - **Training Data:** Up to September 2021", "**Models - OpenAI API**\n\n**GPT-3.5 Models:**\n\n1. **gpt-3.5-turbo-instruct**\n - **Description:** Similar capabilities to GPT-3 era models. Compatible with legacy Completions endpoint, not Chat Completions.\n - **Context Window:** 4,096 tokens\n - **Training Data:** Up to September 2021\n\n2. **gpt-3.5-turbo-16k**\n - **Description:** Legacy model pointing to gpt-3.5-turbo-16k-0613.\n - **Context Window:** 16,385 tokens\n - **Training Data:** Up to September 2021\n\n3. **gpt-3.5-turbo-0613**\n - **Description:** Legacy snapshot of gpt-3.5-turbo from June 13, 2023. Will be deprecated on June 13, 2024.\n - **Context Window:** 4,096 tokens\n - **Training Data:** Up to September 2021\n\n4. **gpt-3.5-turbo-16k-0613**\n - **Description:** Legacy snapshot of gpt-3.5-turbo-16k-turbo from June 13, 2023. Will be deprecated on June 13, 2024.\n - **Context Window:** 16,385 tokens\n - **Training Data:** Up to September 2021\n\n**DALL-E:**\n\n- DALL-E is an AI system that creates realistic images and art from natural language descriptions. DALL-E 3 supports creating new images with specific sizes and editing existing images or creating variations. Available through the Images API and ChatGPT Plus.\n\n1. **dall-e-3**\n - **Description:** The latest DALL-E model released in November 2023.\n\n2. **dall-e-2**\n - **Description:** Released in November 2022, this model offers more realistic, accurate, and higher resolution images than the original.\n\n**TTS (Text-to-Speech):**\n\n- TTS converts text to natural-sounding spoken text. Two model variants are offered:\n - **tts-1:** Optimized for real-time text-to-speech use cases.\n - **tts-1-hd:** Optimized for quality.\n- These models can be used with the Speech endpoint in", "**Models - OpenAI API**\n\n**Text-to-Speech Models:**\n\n1. **tts-1**: This is a new text-to-speech model optimized for speed, providing efficient conversion of text into spoken words.\n \n2. **tts-1-hd**: This model is optimized for quality, offering high-definition text-to-speech conversion.\n\n**Whisper:**\n\nWhisper is a versatile speech recognition model capable of handling diverse audio inputs. It supports multilingual speech recognition, speech translation, and language identification. The Whisper v2-large model is accessible via the API under the name \"whisper-1.\" While the open-source version and the API version are similar, the API offers an optimized inference process for faster performance. 
More technical details can be found in the associated paper.\n\n**Embeddings:**\n\nEmbeddings are numerical representations of text, useful for measuring the relatedness between text pieces. They are applied in search, clustering, recommendations, anomaly detection, and classification tasks.\n\n- **text-embedding-3-large**: The most capable embedding model for both English and non-English tasks, with an output dimension of 3,072.\n \n- **text-embedding-3-small**: Offers improved performance over the second-generation ada embedding model, with an output dimension of 1,536.\n \n- **text-embedding-ada-002**: A second-generation embedding model replacing 16 first-generation models, also with an output dimension of 1,536.\n\n**Moderation:**\n\nThe document mentions a section on moderation, likely related to content moderation capabilities, though specific details are not provided in the visible content.", "**Moderation Models and GPT Base**\n\n**Moderation Models**\n\nThe moderation models are designed to ensure content compliance with OpenAI's usage policies. They classify content into categories such as hate, hate/threatening, self-harm, sexual, sexual/minors, violence, and violence/graphic. These models process inputs by breaking them into chunks of 4,096 tokens. If the input exceeds 32,768 tokens, some tokens may be truncated, potentially omitting a few from the moderation check.\n\nThe moderation endpoint provides the maximum score per category from each request. For instance, if one chunk scores 0.9901 and another scores 0.1901 in a category, the API response will show 0.9901.\n\n- **text-moderation-latest**: Points to text-moderation-007 with a max of 32,768 tokens.\n- **text-moderation-stable**: Also points to text-moderation-007 with a max of 32,768 tokens.\n- **text-moderation-007**: The most capable model across all categories with a max of 32,768 tokens.\n\n**GPT Base**\n\nGPT base models are capable of understanding and generating natural language or code but are not trained for instruction following. They serve as replacements for the original GPT-3 base models and utilize the legacy Completions API. Most users are advised to use GPT-3.5 or GPT-4.\n\n- **babbage-002**: Replaces the GPT-3 ada and babbage models, with a max of 16,384 tokens and training data up to September 2021.\n- **davinci-002**: Replaces the GPT-3 curie and davinci models, with a max of 16,384 tokens and training data up to September 2021.", "Your Data is Your Data\n\nAs of March 1, 2023, data sent to the OpenAI API is not used to train or improve OpenAI models unless you explicitly opt in. Opting in can help models improve for your specific use case over time.\n\nTo prevent abuse, API data may be retained for up to 30 days before deletion, unless legally required otherwise. Trusted customers with sensitive applications may have zero data retention, meaning request and response bodies are not logged and exist only in memory to serve the request.\n\nThis data policy does not apply to OpenAI's non-API consumer services like ChatGPT or DALL-E Labs.\n\n**Default Usage Policies by Endpoint**\n\n- **/v1/chat/completions**: Data is not used for training. Default retention is 30 days, and it is eligible for zero retention except for image inputs.\n- **/v1/files**: Data is not used for training. Retention is until deleted by the customer, with no zero retention option.\n- **/v1/assistants**: Data is not used for training. 
Retention is until deleted by the customer, with no zero retention option.\n- **/v1/threads**: Data is not used for training. Retention is 60 days, with no zero retention option.\n- **/v1/threads/messages**: Data is not used for training. Retention is 60 days, with no zero retention option.\n- **/v1/threads/runs**: Data is not used for training. Retention is 60 days, with no zero retention option.\n- **/v1/threads/runs/steps**: Data is not used for training. Retention is 60 days, with no zero retention option.\n- **/v1/images/generations**: Data is not used for training. Retention is 30 days, with no zero retention option.\n- **/v1/images/edits**: Data is not used for training. Retention is 30 days, with no zero retention option.\n- **/v1/images/variations**: Data is not used for training. Retention is 30 days, with no zero retention option.\n- **/v1/embeddings**: Data is not used for training. Retention is 30 days, and it is eligible for zero retention.\n- **/v1/audio/transcriptions**: Data is not used for training", "### Model Endpoint Compatibility and Data Retention\n\n#### Data Retention Details\n\nThe table outlines the data retention policies for various API endpoints:\n\n- **/v1/audio/translations**: No data is used for training, and there is zero data retention.\n- **/v1/audio/speech**: No data is used for training, with a default retention period of 30 days. It is not eligible for zero retention.\n- **/v1/fine_tuning/jobs**: No data is used for training, and data is retained until deleted by the customer. It is not eligible for zero retention.\n- **/v1/moderations**: No data is used for training, and there is zero data retention.\n- **/v1/completions**: No data is used for training, with a default retention period of 30 days. It is eligible for zero retention.\n\nAdditional notes:\n- Image inputs via the `gpt-4-vision-preview` model are not eligible for zero retention.\n- The default retention period for the Assistants API is still being evaluated during the Beta phase.\n\n#### Model Endpoint Compatibility\n\nThe table provides information on the compatibility of endpoints with the latest models:\n\n- **/v1/assistants**: Supports all models except `gpt-3.5-turbo-0301`. The `retrieval` tool requires `gpt-4-turbo-preview` or `gpt-3.5-turbo-1106`.\n- **/v1/audio/transcriptions**: Compatible with `whisper-1`.\n- **/v1/audio/translations**: Compatible with `whisper-1`.\n- **/v1/audio/speech**: Compatible with `tts-1` and `tts-1-hd`.\n- **/v1/chat/completions**: Compatible with `gpt-4`, `gpt-4-turbo-preview`, `gpt-4-vision-preview`, `gpt-4-32k`, and `gpt-3.5-turbo`.\n\nFor more details, users are encouraged to refer to the API data usage policies or contact the sales team for information on zero retention.", "LATEST MODELS\n\nThis document outlines the latest models available for different endpoints in the OpenAI API:\n\n1. **/v1/completions (Legacy)**:\n - Models: `gpt-3.5-turbo-instruct`, `babbage-002`, `davinci-002`\n - These models are used for generating text completions based on input prompts.\n\n2. **/v1/embeddings**:\n - Models: `text-embedding-3-small`, `text-embedding-3-large`, `text-embedding-ada-002`\n - These models are designed to convert text into numerical vectors, which can be used for various tasks like similarity comparison and clustering.\n\n3. **/v1/fine_tuning/jobs**:\n - Models: `gpt-3.5-turbo`, `babbage-002`, `davinci-002`\n - These models support fine-tuning, allowing users to customize the models for specific tasks by training them on additional data.\n\n4. 
**/v1/moderations**:\n - Models: `text-moderation-stable`\n - This model is used for content moderation, helping to identify and filter out inappropriate or harmful content.\n\nAdditionally, the document mentions the availability of `gpt-3.5-turbo-16k` and other fine-tuned versions of `gpt-3.5-turbo`, indicating enhancements in model capabilities and performance."]}, {"filename": "evals-decks.pdf", "text": "Evaluation\nTechnique\n\nFebruary 2024\n\n\fOverview\n\nEvaluation is the process of validating \nand testing the outputs that your LLM \napplications are producing. Having \nstrong evaluations (\u201cevals\u201d) will mean a \nmore stable, reliable application which is \nresilient to code and model changes.\n\nExample use cases\n\n- Quantify a solution\u2019s reliability\n- Monitor application performance in \n\nproduction\nTest for regressions \n\n-\n\nWhat we\u2019ll cover\n\n\u25cf What are evals\n\n\u25cf Technical patterns\n\n\u25cf Example framework\n\n\u25cf Best practices\n\n\u25cf Resources\n\n3\n\n\fWhat are evals\nExample\n\nAn evaluation contains a question and a correct answer. We call this the ground truth.\n\nQuestion\n\nWhat is the population \nof Canada?\n\nThought: I don\u2019t know. I \nshould use a tool\nAction: Search\nAction Input: What is the \npopulation of Canada?\n\nLLM\n\nSearch\n\nThere are 39,566,248 people \nin Canada as of 2023.\n\nThe current population of \nCanada is 39,566,248 as of \nTuesday, May 23, 2023\u2026.\n\nActual result\n\n4\n\n\fWhat are evals\nExample\n\nOur ground truth matches the predicted answer, so the evaluation passes!\n\nEvaluation\n\nQuestion\n\nGround Truth\n\nPredicted Answer\n\nWhat is the population \nof Canada?\n\nThe population of Canada in \n2023 is 39,566,248 people.\n\nThere are 39,566,248 people \nin Canada as of 2023.\n\n5\n\n\fTechnical patterns\n\nMetric-based evaluations\n\nComponent evaluations\n\nSubjective evaluations\n\n\u25cf\n\n\u25cf\n\nComparison metrics like \nBLEU, ROUGE\n\nGives a score to \ufb01lter and \nrank results\n\n\u25cf\n\n\u25cf\n\nCompares ground \ntruth to prediction\n\nGives Pass/Fail\n\n\u25cf\n\n\u25cf\n\nUses a scorecard to \nevaluate subjectively\n\nScorecard may also \nhave a Pass/Fail\n\n6\n\n\fTechnical patterns\nMetric-based evaluations\n\nROUGE is a common metric for evaluating machine summarizations of text\n\nROUGE\n\nMetric for evaluating \nsummarization tasks\n\nOriginal\n\nOpenAI's mission is to ensure that \narti\ufb01cial general intelligence (AGI) \nbene\ufb01ts all of humanity. OpenAI \nwill build safe and bene\ufb01cial AGI \ndirectly, but will also consider its \nmission ful\ufb01lled if its work aids \nothers to achieve this outcome. \nOpenAI follows several key \nprinciples for this purpose. First, \nbroadly distributed bene\ufb01ts - any \nin\ufb02uence over AGI's deployment \nwill be used for the bene\ufb01t of all, \nand to avoid harmful uses or undue \nconcentration of power\u2026\n\nMachine \nSummary\n\nOpenAI aims to ensure AGI is \nfor everyone's use, totally \navoiding harmful stuff or big \npower concentration. \nCommitted to researching \nAGI's safe side, promoting \nthese studies in AI folks. 
\nOpenAI wants to be top in AI \nthings and works with \nworldwide research, policy \ngroups to \ufb01gure AGI's stuff.\n\nROUGE \nScore\n\n0.51162\n\n7\n\n\fTechnical patterns\nMetric-based evaluations\n\nBLEU score is another standard metric, this time focusing on machine translation tasks\n\nBLEU\n\nOriginal text\n\nReference\nTranslation\n\nPredicted \nTranslation\n\nMetric for \nevaluating \ntranslation tasks\n\nY gwir oedd \ndoedden nhw \nddim yn dweud \ncelwyddau wedi'r \ncwbl.\n\nThe truth was \nthey were not \ntelling lies after \nall.\n\nThe truth was \nthey weren't \ntelling lies after \nall.\n\nBLEU \nScore\n\n0.39938\n\n8\n\n\fTechnical patterns\nMetric-based evaluations\n\nWhat they\u2019re good for\n\nWhat to be aware of\n\n\u25cf\n\n\u25cf\n\nA good starting point for evaluating a \n\n\u25cf Not tuned to your speci\ufb01c context\n\nfresh solution\n\nUseful yardstick for automated testing \n\nof whether a change has triggered a \n\nmajor performance shift\n\n\u25cf Most customers require more \n\nsophisticated evaluations to go to \n\nproduction\n\n\u25cf Cheap and fast\n\n9\n\n\fTechnical patterns\nComponent evaluations\n\nComponent evaluations (or \u201cunit tests\u201d) cover a single input/output of the application. They check \nwhether each component works in isolation, comparing the input to a ground truth ideal result\n\nIs this the \ncorrect action?\n\nExact match \ncomparison\n\nDoes this answer \nuse the context?\n\nExtract numbers \nfrom each and \ncompare\n\nWhat is the population \nof Canada?\n\nThought: I don\u2019t know. I \nshould use a tool\nAction: Search\nAction Input: What is the \npopulation of Canada?\n\nAgent\n\nSearch\n\nThere are 39,566,248 people \nin Canada as of 2023.\n\nThe current population of \nCanada is 39,566,248 as of \nTuesday, May 23, 2023\u2026.\n\nIs this the right \nsearch result?\n\nTag the right \nanswer and do \nan exact match \ncomparison with \nthe retrieval.\n\n10\n\n\fTechnical patterns\nSubjective evaluations\n\nBuilding up a good scorecard for automated testing bene\ufb01ts from a few rounds of detailed human \nreview so we can learn what is valuable. 
\n\nA policy of \u201cshow rather than tell\u201d is also advised for GPT-4, so include examples of what a 1, 3 and \n8 out of 10 look like so the model can appreciate the spread.\n\nExample \nscorecard\n\nYou are a helpful evaluation assistant who grades how well the Assistant has answered the customer\u2019s query.\n\nYou will assess each submission against these metrics, please think through these step by step:\n\n-\n\nrelevance: Grade how relevant the search content is to the question from 1 to 5 // 5 being highly relevant and 1 being \nnot relevant at all.\n\n- credibility: Grade how credible the sources provided are from 1 to 5 // 5 being an established newspaper, \n\n-\n\ngovernment agency or large company and 1 being unreferenced.\nresult: Assess whether the question is correct given only the content returned from the search and the user\u2019s \nquestion // acceptable values are \u201ccorrect\u201d or \u201cincorrect\u201d\n\nYou will output this as a JSON document: {relevance: integer, credibility: integer, result: string}\n\nUser: What is the population of Canada?\nAssistant: Canada's population was estimated at 39,858,480 on April 1, 2023 by Statistics Canada.\nEvaluation: {relevance: 5, credibility: 5, result: correct}\n\n11\n\n\fExample framework\n\nYour evaluations can be grouped up into test suites called runs and executed in a batch to test \nthe e\ufb00ectiveness of your system.\n\nEach run should have its contents logged and stored at the most granular level possible \n(\u201ctracing\u201d) so you can investigate failure reasons, make tweaks and then rerun your evals.\n\nRun ID Model\n\nScore\n\nAnnotation feedback\n\nChanges since last run\n\n1\n\n2\n\n3\n\n4\n\n5\n\ngpt-3.5-turbo 28/50\n\ngpt-4\n\n36/50\n\ngpt-3.5-turbo 34/50\n\n\u25cf 18 incorrect with correct search results\n\u25cf 4 incorrect searches\n\nN/A\n\n\u25cf 10 incorrect with correct search results\n\u25cf 4 incorrect searches\n\n\u25cf 12 incorrect with correct search results\n\u25cf 4 incorrect searches\n\nModel updated to GPT-4\n\nAdded few-shot examples\n\ngpt-3.5-turbo 42/50\n\n\u25cf 8 incorrect with correct search results\n\nAdded metadata to search\nPrompt engineering for Answer step\n\ngpt-3.5-turbo 48/50\n\n\u25cf 2 incorrect with correct search results\n\nPrompt engineering to Answer step\n\n12\n\n\fExample framework\n\nI want to return a \nT-shirt I bought on \nAmazon on March 3rd.\n\nUser\n\nRouter\n\nLLM\n\nExpected: return\nPredicted: return\nPASS\n\nReturn\nAssistant\n\nLLM\n\nComponent evals\n\nSubjective evals\n\nExpected: return_policy\nPredicted: return_policy\nPASS\n\nKnowledge \nbase\n\nQuestion: Does this response adhere to \nour guidelines\nScore: \nPoliteness: 5, Coherence: 4, Relevancy: 4\nPASS\n\nSure - because we\u2019re \nwithin 14 days of the \npurchase, I can \nprocess the return\n\nQuestion: I want to return a T-shirt I \nbought on Amazon on March 3rd.\nGround truth: Eligible for return\nPASS\n\n13\n\n\fBest practices\n\nLog everything\n\n\u25cf\n\nEvals need test cases - log everything as you develop so you can mine your logs for good eval cases\n\nCreate a feedback loop\n\n\u25cf\n\u25cf\n\nBuild evals into your application so you can quickly run them, iterate and rerun to see the impact\nEvals also provide a useful structure for few-shot or \ufb01ne-tuning examples when optimizing\n\nEmploy expert labellers who know the process\n\n\u25cf Use experts to help create your eval cases - these need to be as lifelike as possible\n\nEvaluate early and often\n\n\u25cf\n\nEvals are 
something you should build as soon as you have your \ufb01rst functioning prompt - you won\u2019t be \nable to optimize without this baseline, so build it early\n\n\u25cf Making evals early also forces you to engage with what a good response looks like\n\n\f", "pages_description": ["## Overview\n\nEvaluation is the process of validating and testing the outputs that your Large Language Model (LLM) applications are producing. Strong evaluations, referred to as \"evals,\" contribute to creating a more stable and reliable application that can withstand changes in code and model updates.\n\n### Example Use Cases\n- **Quantify a solution\u2019s reliability**: Measure how dependable your application is.\n- **Monitor application performance in production**: Keep track of how well your application performs in real-world scenarios.\n- **Test for regressions**: Ensure that new updates do not negatively impact existing functionality.\n\n### What We\u2019ll Cover\n- **What are evals**: Understanding the concept and importance of evaluations.\n- **Technical patterns**: Exploring common methods and strategies used in evaluations.\n- **Example framework**: Providing a structured approach to implementing evaluations.\n- **Best practices**: Sharing tips and guidelines for effective evaluations.\n- **Resources**: Offering additional materials for further learning and exploration.", "What are evals\n\nAn evaluation, or \"eval,\" involves a question and a correct answer, known as the ground truth. In this example, the question posed is, \"What is the population of Canada?\" \n\nThe process begins with a person asking this question. The language model (LLM) initially does not know the answer and decides to use a tool to find it. The LLM takes the action of searching, with the input being the question about Canada's population.\n\nThe search tool then provides the answer: \"The current population of Canada is 39,566,248 as of Tuesday, May 23, 2023.\" This result matches the actual result expected, which is that there are 39,566,248 people in Canada as of 2023. \n\nThis example illustrates how evaluations are used to verify the accuracy of information provided by a language model.", "What are evals\n\nThis slide provides an example of an evaluation process, often referred to as \"evals.\" The purpose of evals is to compare a predicted answer to a known correct answer, called the \"ground truth,\" to determine if they match.\n\nIn this example, the question posed is: \"What is the population of Canada?\" The ground truth states that the population of Canada in 2023 is 39,566,248 people. The predicted answer is: \"There are 39,566,248 people in Canada as of 2023.\"\n\nSince the predicted answer matches the ground truth, the evaluation is successful, as indicated by a checkmark. This process is crucial for verifying the accuracy of predictions in various applications.", "**Technical Patterns**\n\nThis slide outlines three types of evaluation methods used in technical assessments:\n\n1. **Metric-based Evaluations**:\n - These evaluations use comparison metrics such as BLEU and ROUGE. \n - They provide a score that helps in filtering and ranking results, making it easier to assess the quality of outputs quantitatively.\n\n2. **Component Evaluations**:\n - This method involves comparing the ground truth to predictions.\n - It results in a simple Pass/Fail outcome, which is useful for determining whether specific components meet the required standards.\n\n3. 
**Subjective Evaluations**:\n - These evaluations rely on a scorecard to assess outputs subjectively.\n - The scorecard can also include a Pass/Fail option, allowing for a more nuanced evaluation that considers qualitative aspects.", "Technical Patterns: Metric-based Evaluations\n\nROUGE is a common metric for evaluating machine summarizations of text. It is specifically used to assess the quality of summaries by comparing them to reference summaries. The slide provides an example of how ROUGE is applied:\n\n- **Original Text**: This is a detailed description of OpenAI's mission, emphasizing the development of artificial general intelligence (AGI) that benefits humanity. It highlights the importance of safety, broad distribution of benefits, and avoiding harmful uses or power concentration.\n\n- **Machine Summary**: This is a condensed version of the original text. It focuses on ensuring AGI is safe and accessible, avoiding harm and power concentration, and promoting research and collaboration in AI.\n\n- **ROUGE Score**: The score given is 0.51162, which quantifies the similarity between the machine-generated summary and the original text. A higher score indicates a closer match to the reference summary.\n\nOverall, ROUGE helps in evaluating how well a machine-generated summary captures the essence of the original text.", "# Technical Patterns: Metric-based Evaluations\n\nThe slide discusses the BLEU score, a standard metric used to evaluate machine translation tasks. BLEU stands for Bilingual Evaluation Understudy and is a method for assessing the quality of text that has been machine-translated from one language to another.\n\n### Key Elements:\n\n- **BLEU**: This is a metric specifically designed for evaluating translation tasks. It compares the machine-generated translation to one or more reference translations.\n\n- **Original Text**: The example given is in Welsh: \"Y gwir oedd doedden nhw ddim yn dweud celwyddau wedi'r cwbl.\"\n\n- **Reference Translation**: This is the human-generated translation used as a standard for comparison: \"The truth was they were not telling lies after all.\"\n\n- **Predicted Translation**: This is the translation produced by the machine: \"The truth was they weren't telling lies after all.\"\n\n- **BLEU Score**: The score for this translation is 0.39938. 
This score indicates how closely the machine translation matches the reference translation, with a higher score representing a closer match.\n\nThe BLEU score is widely used in the field of natural language processing to provide a quantitative measure of translation quality.", "Technical Patterns: Metric-based Evaluations\n\n**What they\u2019re good for:**\n\n- **Starting Point**: They provide a good starting point for evaluating a new solution, helping to establish initial benchmarks.\n- **Automated Testing**: These evaluations serve as a useful yardstick for automated testing, particularly in determining if a change has caused a significant performance shift.\n- **Cost-Effective**: They are cheap and fast, making them accessible for quick assessments.\n\n**What to be aware of:**\n\n- **Context Specificity**: These evaluations are not tailored to specific contexts, which can limit their effectiveness in certain situations.\n- **Sophistication Needs**: Most customers require more sophisticated evaluations before moving to production, indicating that metric-based evaluations might not be sufficient on their own for final decision-making.", "**Technical Patterns: Component Evaluations**\n\nComponent evaluations, also known as \"unit tests,\" focus on assessing a single input/output of an application. The goal is to verify that each component functions correctly in isolation by comparing the input to a predefined ideal result, known as the ground truth.\n\n**Process Overview:**\n\n1. **Input Question:** \n - The process begins with a question: \"What is the population of Canada?\"\n\n2. **Agent's Role:**\n - The agent receives the question and processes it. The agent's thought process is: \"I don\u2019t know. I should use a tool.\"\n - The agent decides on an action: \"Search.\"\n - The action input is the original question: \"What is the population of Canada?\"\n\n3. **Search Component:**\n - The search component is tasked with finding the answer. It retrieves the information: \"The current population of Canada is 39,566,248 as of Tuesday, May 23, 2023\u2026.\"\n\n4. **Evaluation Steps:**\n - **Correct Action Check:** Is the agent's decision to search the correct action?\n - **Exact Match Comparison:** Does the retrieved answer match the expected result exactly?\n - **Contextual Relevance:** Does the answer use the context provided in the question?\n - **Number Extraction and Comparison:** Extract numbers from both the expected and retrieved answers and compare them for accuracy.\n\n5. **Final Output:**\n - The final output is the verified answer: \"There are 39,566,248 people in Canada as of 2023.\"\n\nThis process ensures that each component of the application is functioning correctly and producing accurate results by systematically evaluating each step against the ground truth.", "**Technical Patterns: Subjective Evaluations**\n\nBuilding an effective scorecard for automated testing is enhanced by incorporating detailed human reviews. This process helps identify what is truly valuable. 
The approach of \"show rather than tell\" is recommended for GPT-4, meaning that examples of scores like 1, 3, and 8 out of 10 should be provided to help the model understand the range.\n\n**Example Scorecard:**\n\n- **Role**: You are an evaluation assistant assessing how well the Assistant has answered a customer's query.\n \n- **Metrics for Assessment**:\n - **Relevance**: Rate the relevance of the search content to the question on a scale from 1 to 5, where 5 is highly relevant and 1 is not relevant at all.\n - **Credibility**: Rate the credibility of the sources from 1 to 5, where 5 is an established newspaper, government agency, or large company, and 1 is unreferenced.\n - **Result**: Determine if the question is answered correctly based on the search content and the user's question. Acceptable values are \"correct\" or \"incorrect.\"\n\n- **Output Format**: Provide the evaluation as a JSON document with fields for relevance, credibility, and result.\n\n**Example Evaluation**:\n- **User Query**: \"What is the population of Canada?\"\n- **Assistant's Response**: \"Canada's population was estimated at 39,858,480 on April 1, 2023, by Statistics Canada.\"\n- **Evaluation**: `{relevance: 5, credibility: 5, result: correct}`\n\nThis structured approach ensures clarity and consistency in evaluating the performance of automated systems.", "**Example Framework**\n\nThis framework outlines a method for evaluating the effectiveness of a system by grouping evaluations into test suites called \"runs.\" These runs are executed in batches, and each run's contents are logged and stored at a detailed level, known as \"tracing.\" This allows for investigation of failures, making adjustments, and rerunning evaluations.\n\nThe table provides a summary of different runs:\n\n- **Run ID 1**: \n - Model: gpt-3.5-turbo\n - Score: 28/50\n - Annotation Feedback: 18 incorrect with correct search results, 4 incorrect searches\n - Changes: N/A\n\n- **Run ID 2**: \n - Model: gpt-4\n - Score: 36/50\n - Annotation Feedback: 10 incorrect with correct search results, 4 incorrect searches\n - Changes: Model updated to GPT-4\n\n- **Run ID 3**: \n - Model: gpt-3.5-turbo\n - Score: 34/50\n - Annotation Feedback: 12 incorrect with correct search results, 4 incorrect searches\n - Changes: Added few-shot examples\n\n- **Run ID 4**: \n - Model: gpt-3.5-turbo\n - Score: 42/50\n - Annotation Feedback: 8 incorrect with correct search results\n - Changes: Added metadata to search, Prompt engineering for Answer step\n\n- **Run ID 5**: \n - Model: gpt-3.5-turbo\n - Score: 48/50\n - Annotation Feedback: 2 incorrect with correct search results\n - Changes: Prompt engineering to Answer step\n\nThis framework emphasizes the importance of detailed logging and iterative improvements to enhance system performance.", "Example Framework\n\nThis diagram illustrates a framework for processing a return request using a language model (LLM) system. Here's a breakdown of the process:\n\n1. **User Input**: The user wants to return a T-shirt purchased on Amazon on March 3rd.\n\n2. **Router**: The initial input is processed by a router LLM, which determines the nature of the request. The expected and predicted outcomes are both \"return,\" and the process passes this evaluation.\n\n3. **Return Assistant**: The request is then handled by a return assistant LLM. It interacts with a knowledge base to verify the return policy.\n\n4. 
**Knowledge Base**: The system checks the return policy, confirming that the item is eligible for return within 14 days of purchase. The expected and predicted outcomes are \"return_policy,\" and this step also passes.\n\n5. **Response to User**: The system responds to the user, confirming that the return can be processed because it is within the 14-day window.\n\n6. **Evaluation**: The response is evaluated for adherence to guidelines, scoring 5 for politeness, 4 for coherence, and 4 for relevancy, resulting in a pass.\n\nThe framework uses both component evaluations (red dashed lines) and subjective evaluations (orange dashed lines) to ensure the process is accurate and user-friendly.", "Best Practices\n\n1. **Log Everything**\n - It's important to log all test cases during development. This allows you to mine your logs for effective evaluation cases.\n\n2. **Create a Feedback Loop**\n - Integrate evaluations into your application to quickly run, iterate, and rerun them to observe impacts.\n - Evaluations provide a useful structure for few-shot or fine-tuning examples during optimization.\n\n3. **Employ Expert Labelers Who Know the Process**\n - Use experts to help create evaluation cases, ensuring they are as realistic as possible.\n\n4. **Evaluate Early and Often**\n - Build evaluations as soon as you have a functioning prompt. This baseline is crucial for optimization.\n - Early evaluations help you understand what a good response looks like, facilitating better engagement."]}, {"filename": "fine-tuning-deck.pdf", "text": "Fine-tuning\nTechnique\n\nFebruary 2024\n\n\fOverview\n\nFine-tuning involves adjusting the \nparameters of pre-trained models on a \nspeci\ufb01c dataset or task. This process \nenhances the model's ability to generate \nmore accurate and relevant responses for \nthe given context by adapting it to the \nnuances and speci\ufb01c requirements of the \ntask at hand.\n\nExample use cases\n\n- Generate output in a consistent \n\n-\n\nformat\nProcess input by following speci\ufb01c \ninstructions\n\nWhat we\u2019ll cover\n\n\u25cf When to \ufb01ne-tune\n\n\u25cf Preparing the dataset\n\n\u25cf Best practices\n\n\u25cf Hyperparameters\n\n\u25cf Fine-tuning advances\n\n\u25cf Resources\n\n3\n\n\fWhat is Fine-tuning\n\nPublic Model\n\nTraining data\n\nTraining\n\nFine-tuned \nmodel\n\nFine-tuning a model consists of training the \nmodel to follow a set of given input/output \nexamples.\n\nThis will teach the model to behave in a \ncertain way when confronted with a similar \ninput in the future.\n\nWe recommend using 50-100 examples \n\neven if the minimum is 10.\n\n4\n\n\fWhen to \ufb01ne-tune\n\nGood for \u2705\n\nNot good for \u274c\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\nFollowing a given format or tone for the \n\noutput\n\nProcessing the input following speci\ufb01c, \n\ncomplex instructions\n\nImproving latency\n\nReducing token usage\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\nTeaching the model new knowledge\n\u2794 Use RAG or custom models instead\n\nPerforming well at multiple, unrelated tasks\n\u2794 Do prompt-engineering or create multiple \n\nFT models instead\n\nInclude up-to-date content in responses\n\u2794 Use RAG instead\n\n5\n\n\fPreparing the dataset\n\nExample format\n\n{\n\n\"messages\": [\n\n{\n\n\"role\": \"system\",\n\"content\": \"Marv is a factual chatbot \nthat is also sarcastic.\"\n\n},\n{\n\n\"role\": \"user\",\n\"content\": \"What's the capital of \nFrance?\"\n\n},\n{\n\n\"role\": \"assistant\",\n\"content\": \"Paris, as if everyone \ndoesn't know 
that already.\"\n\n}\n\n]\n\n}\n\n.jsonl\n\n\u2794 Take the set of instructions and prompts that you \n\nfound worked best for the model prior to \ufb01ne-tuning. \nInclude them in every training example\n\n\u2794 If you would like to shorten the instructions or \n\nprompts, it may take more training examples to arrive \nat good results\n\nWe recommend using 50-100 examples \n\neven if the minimum is 10.\n\n6\n\n\fBest practices\n\nCurate examples carefully\n\nDatasets can be di\ufb03cult to build, start \nsmall and invest intentionally. \nOptimize for fewer high-quality \ntraining examples.\n\n\u25cf Consider \u201cprompt baking\u201d, or using a basic \nprompt to generate your initial examples\n\u25cf If your conversations are multi-turn, ensure \n\nyour examples are representative\n\n\u25cf Collect examples to target issues detected \n\nin evaluation\n\n\u25cf Consider the balance & diversity of data\n\u25cf Make sure your examples contain all the \n\ninformation needed in the response\n\nIterate on hyperparameters\n\nEstablish a baseline\n\nStart with the defaults and adjust \nbased on performance.\n\n\u25cf If the model does not appear to converge, \n\nincrease the learning rate multiplier\n\u25cf If the model does not follow the training \ndata as much as expected increase the \nnumber of epochs\n\n\u25cf If the model becomes less diverse than \n\nexpected decrease the # of epochs by 1-2\n\nAutomate your feedback \npipeline\n\nIntroduce automated evaluations to \nhighlight potential problem cases to \nclean up and use as training data.\n\nConsider the G-Eval approach of \nusing GPT-4 to perform automated \ntesting using a scorecard.\n\nOften users start with a \nzero-shot or few-shot prompt to \nbuild a baseline evaluation \nbefore graduating to \ufb01ne-tuning.\n\nOften users start with a \nzero-shot or few-shot prompt to \nbuild a baseline evaluation \nOptimize for latency and \nbefore graduating to \ufb01ne-tuning.\ntoken e\ufb03ciency\n\nWhen using GPT-4, once you \nhave a baseline evaluation and \ntraining examples consider \n\ufb01ne-tuning 3.5 to get similar \nperformance for less cost and \nlatency.\n\nExperiment with reducing or \nremoving system instructions \nwith subsequent \ufb01ne-tuned \nmodel versions.\n\n\fHyperparameters\n\nEpochs\nRefers to 1 full cycle through the training dataset\nIf you have hundreds of thousands of examples, we would recommend \nexperimenting with two epochs (or one) to avoid over\ufb01tting.\n\ndefault: auto (standard is 4)\n\nBatch size\nNumber of training examples used to train a single \nforward & backward pass\nIn general, we've found that larger batch sizes tend to work better for larger datasets\n\ndefault: ~0.2% x N* (max 256)\n\n*N = number of training examples\n\nLearning rate multiplier\nScaling factor for the original learning rate\nWe recommend experimenting with values between 0.02-0.2. We've found that \nlarger learning rates often perform better with larger batch sizes.\n\ndefault: 0.05, 0.1 or 0.2*\n\n*depends on \ufb01nal batch size\n\n8\n\n\f", "pages_description": ["**Overview**\n\nFine-tuning involves adjusting the parameters of pre-trained models on a specific dataset or task. 
This process enhances the model's ability to generate more accurate and relevant responses for the given context by adapting it to the nuances and specific requirements of the task at hand.\n\n**Example Use Cases:**\n- Generate output in a consistent format.\n- Process input by following specific instructions.\n\n**What We\u2019ll Cover:**\n- When to fine-tune\n- Preparing the dataset\n- Best practices\n- Hyperparameters\n- Fine-tuning advances\n- Resources", "What is Fine-tuning\n\nFine-tuning is a process in machine learning where a pre-existing model, known as a public model, is further trained using specific training data. This involves adjusting the model to follow a set of given input/output examples. The goal is to teach the model to respond in a particular way when it encounters similar inputs in the future.\n\nThe diagram illustrates this process: starting with a public model, training data is used in a training phase to produce a fine-tuned model. This refined model is better suited to specific tasks or datasets.\n\nIt is recommended to use 50-100 examples for effective fine-tuning, although the minimum requirement is 10 examples. This ensures the model learns adequately from the examples provided.", "When to Fine-Tune\n\n**Good for:**\n\n- **Following a given format or tone for the output:** Fine-tuning is effective when you need the model to adhere to a specific style or structure in its responses.\n \n- **Processing the input following specific, complex instructions:** It helps in handling detailed and intricate instructions accurately.\n\n- **Improving latency:** Fine-tuning can enhance the speed of the model's responses.\n\n- **Reducing token usage:** It can optimize the model to use fewer tokens, making it more efficient.\n\n**Not good for:**\n\n- **Teaching the model new knowledge:** Fine-tuning is not suitable for adding new information to the model. Instead, use Retrieval-Augmented Generation (RAG) or custom models.\n\n- **Performing well at multiple, unrelated tasks:** For diverse tasks, it's better to use prompt engineering or create multiple fine-tuned models.\n\n- **Including up-to-date content in responses:** Fine-tuning is not ideal for ensuring the model has the latest information. RAG is recommended for this purpose.", "**Preparing the Dataset**\n\nThis slide provides guidance on preparing a dataset for training a chatbot model. It includes an example format using JSONL (JSON Lines) to structure the data. The example shows a conversation with three roles:\n\n1. **System**: Sets the context by describing the chatbot as \"Marv is a factual chatbot that is also sarcastic.\"\n2. **User**: Asks a question, \"What's the capital of France?\"\n3. **Assistant**: Responds with a sarcastic answer, \"Paris, as if everyone doesn't know that already.\"\n\nKey recommendations for dataset preparation include:\n\n- Use a set of instructions and prompts that have proven effective for the model before fine-tuning. These should be included in every training example.\n- If you choose to shorten instructions or prompts, be aware that more training examples may be needed to achieve good results.\n- It is recommended to use 50-100 examples, even though the minimum required is 10.", "**Best Practices**\n\n1. 
**Curate Examples Carefully**\n - Building datasets can be challenging, so start small and focus on high-quality examples.\n - Use \"prompt baking\" to generate initial examples.\n - Ensure multi-turn conversations are well-represented.\n - Collect examples to address issues found during evaluation.\n - Balance and diversify your data.\n - Ensure examples contain all necessary information for responses.\n\n2. **Iterate on Hyperparameters**\n - Begin with default settings and adjust based on performance.\n - Increase the learning rate multiplier if the model doesn't converge.\n - Increase the number of epochs if the model doesn't follow training data closely.\n - Decrease the number of epochs by 1-2 if the model becomes less diverse.\n\n3. **Establish a Baseline**\n - Start with zero-shot or few-shot prompts to create a baseline before fine-tuning.\n\n4. **Automate Your Feedback Pipeline**\n - Use automated evaluations to identify and clean up problem cases for training data.\n - Consider using the G-Eval approach with GPT-4 for automated testing with a scorecard.\n\n5. **Optimize for Latency and Token Efficiency**\n - After establishing a baseline, consider fine-tuning with GPT-3.5 for similar performance at lower cost and latency.\n - Experiment with reducing or removing system instructions in subsequent fine-tuned versions.", "Hyperparameters\n\n**Epochs**\n- An epoch refers to one complete cycle through the training dataset.\n- For datasets with hundreds of thousands of examples, it is recommended to use fewer epochs (one or two) to prevent overfitting.\n- Default setting is auto, with a standard of 4 epochs.\n\n**Batch Size**\n- This is the number of training examples used to train in a single forward and backward pass.\n- Larger batch sizes are generally more effective for larger datasets.\n- The default batch size is approximately 0.2% of the total number of training examples (N), with a maximum of 256.\n\n**Learning Rate Multiplier**\n- This is a scaling factor for the original learning rate.\n- Experimentation with values between 0.02 and 0.2 is recommended.\n- Larger learning rates often yield better results with larger batch sizes.\n- Default values are 0.05, 0.1, or 0.2, depending on the final batch size."]}] |