RAG - Say what?

Not the usual beginner RAG tutorial

Ever wondered what happens under the hood when you ask ChatGPT a question, or when you “chat with your pdf”? Heard people talk about semantic search, vector databases, retrieval and indexing, but never understood why?

Enter RAG, aka Retrieval-Augmented Generation! Search for RAG and you will find plenty of tutorials, all covering the basics, which are not exactly useful when we take demos to production on large datasets powering real-world use cases. This will be a series of articles, each dedicated to one step of the process with an emphasis on advanced techniques, hopefully helpful to your efforts in building production-grade applications.

So, let’s try to break the process down step by step. In this post, we shall only be looking at the first step of the RAG pipeline, i.e. Query Translation (boxed in black in the infographic above). If you like what you read here, please make sure to subscribe as I build out this newsletter, geared towards keeping up with the latest and greatest.

I shall reference the terminology in the well-detailed infographic on the cover, shared by the amazing Lance Martin at LangChain. If you are a beginner looking to quickly get up to speed with RAG concepts, I highly recommend his quick tutorials on YouTube before jumping into the advanced topics below.

There are a number of moving parts when it comes to building an optimal solution for Q&A over your knowledge base. The only way to optimize each of these parts is to experiment and iterate, as there is no playbook for what happens inside an LLM, aka “I beg you to please not make up stuff”.

So let's begin with the obvious question: why do RAG? Aren’t LLMs good enough?

Problem 1: LLMs do not know your data, or any recent data for that matter. They have been pretrained on publicly available (*cough*) information from the internet, so they are neither experts in, nor up to date on, your proprietary databases, which are most probably domain-specific.

Problem 2: Context window – Computational constraints mean that every LLM has a maximum limit on the number of tokens (as a rule of thumb, 100 tokens correspond to roughly 75 words) it can ingest in one submission. Once that window is exceeded, context is lost, which hurts accuracy, degrades retrieval and increases hallucination.

Problem 3: Lost in the middle – even if LLMs could ingest all the data in one go, they are not always able to retrieve information depending on where it appears in the document. The study referenced here shows, for example, that LLM performance degrades significantly if the relevant information is buried somewhere near the middle of the document rather than at the beginning or the end.

Hence, we need to do RAG! The obvious next question is how. Let’s start with the application we are all familiar with at this point: upload a PDF and chat with it (think ChatGPT).

When we ask a question to an AI-native Q&A chatbot built on a large data source, here is how it flows:

Query Translation:

a) Query Decomposition

Your question, while obvious to you, may not be so to the LLM. It may be too vague, too specific or lacking context. It is always recommended to reconstruct the query before sending it to the embedding model (more on this below). And who better to do this than an LLM itself? Following are a few ways to do it:

  • Rewrite-Retrieve-Read: This approach focuses on the query input by the user (i.e. rewriting it), rather than just adapting the retriever or the reader. It prompts an LLM to rewrite the query and then uses a web search engine to retrieve context. The pipeline is further aligned via a trainable scheme using a small language model. The summary is well captured in the graphic below:
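A minimal sketch of the flow, assuming a generic `llm` callable and a `search` client (both stand-ins here, shown with stubs; the prompt wording is illustrative, not from the paper):

```python
# Rewrite-Retrieve-Read with injected callables: only the control
# flow is real; `llm` and `search` are hypothetical stand-ins.

REWRITE_PROMPT = (
    "Rewrite the following question so that a search engine "
    "returns the most relevant results:\n\n"
    "Question: {question}\nRewritten:"
)

def rewrite_retrieve_read(question, llm, search):
    # 1. Rewrite: ask the LLM for a search-friendly query.
    rewritten = llm(REWRITE_PROMPT.format(question=question))
    # 2. Retrieve: fetch context with the rewritten query.
    context = search(rewritten)
    # 3. Read: answer the original question grounded in the context.
    return llm(f"Context:\n{context}\n\nAnswer the question: {question}")

# Stub components, just to show the three steps end to end.
stub_llm = lambda p: "rewritten query" if "Rewritten:" in p else "final answer"
stub_search = lambda q: f"results for '{q}'"

print(rewrite_retrieve_read("what is RAG?", stub_llm, stub_search))
```

In a real pipeline the stubs would be an actual LLM call and a web search API, and the rewriter could be the trainable small model mentioned above.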

  • Follow-up question to condensed/standalone one: This is generally used to provide context in a conversation to the chatbot, by rephrasing a conversation into a good standalone question. This prompt template by LangChain is a good example of it:

    "Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

    Chat History: {chat_history}

    Follow Up Input: {question}

    Standalone Question:"
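Filling that template in code is straightforward; here is a small sketch (the chat-history formatting is my own convention, not LangChain's):

```python
# Build the condense-question prompt from chat history and a follow-up.
CONDENSE_TEMPLATE = (
    "Given the following conversation and a follow up question, "
    "rephrase the follow up question to be a standalone question.\n\n"
    "Chat History: {chat_history}\n\n"
    "Follow Up Input: {question}\n\n"
    "Standalone Question:"
)

def condense_prompt(chat_history, question):
    # Join (role, text) turns into a single history string.
    history = "\n".join(f"{role}: {text}" for role, text in chat_history)
    return CONDENSE_TEMPLATE.format(chat_history=history, question=question)

prompt = condense_prompt(
    [("Human", "Who wrote Dune?"), ("AI", "Frank Herbert.")],
    "When was it published?",
)
# Sending `prompt` to an LLM should yield something like
# "When was Dune published?" as the standalone question.
```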

  • RAG Fusion: Combination of RAG and reciprocal rank fusion (RRF), by generating multiple queries (to add context from different perspectives), reranking them with reciprocal scores and then fusing the documents and scores. This leads to more comprehensive and accurate answers.
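The reciprocal rank fusion step at the heart of RAG Fusion is simple enough to write in plain Python. Each ranked list below stands for the retrieval results of one generated query; `k=60` is the commonly used constant from the original RRF formulation:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    # Each document earns 1 / (k + rank) from every list it appears in.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Three generated queries retrieved overlapping document lists:
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_a", "doc_d", "doc_b"],
])
print(fused)  # doc_a ranks first: it is near the top of all three lists
```

Documents that consistently rank high across queries float to the top, even if no single query ranked them first.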

  • Step-Back Prompting: This is a more technical prompting technique whereby the LLM performs abstraction to derive high-level concepts and first principles. It is an iterative process where the user question is used to generate a step-back question. The step-back answer is then used for reasoning to generate the final answer.
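In code, the two-hop flow looks roughly like this (a sketch with hypothetical `llm` and `retrieve` callables; the prompt wordings are mine, not the paper's):

```python
# Step-Back Prompting: abstract first, then reason with both contexts.

def step_back_answer(question, llm, retrieve):
    # 1. Abstract the user question into a higher-level "step back" question.
    step_back_q = llm(f"What is the broader concept behind this question? {question}")
    # 2. Retrieve context for both the original and the step-back question.
    context = retrieve(question) + retrieve(step_back_q)
    # 3. Reason over the combined context to answer the original question.
    return llm(f"Context: {context}\nAnswer: {question}")

# Stubs, just to exercise the control flow.
stub_llm = lambda p: "step back question" if "broader concept" in p else "final answer"
stub_retrieve = lambda q: [f"docs for: {q}"]

print(step_back_answer("Why does ice float?", stub_llm, stub_retrieve))
```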

  • Query Expansion: This is a process whereby the LLM is given a query and generates new terms to expand it. This is very powerful in document retrieval, especially when applied to Chain-of-Thought prompts.

    - Expansion with generated answers: In this method, we prompt our LLM to generate a hypothetical answer based on our query. Then we append that answer to our query to form a joint query, and do the embedding search with that. By expanding the query with the hypothetical answer, it now lands in a more relevant region of the embedding space, yielding more accurate results.

    - Expansion with multiple queries: We ask the LLM to generate additional similar queries based on our query. Then we pass the additional queries along with our original query to the vector database for retrieval, resulting in a significant improvement in accuracy vs using just the original query. It is crucial to iterate on the prompts to evaluate which ones lead to best results.

Below is a view representing the impact: the orange crosses are the additional queries generated and the green circles show the retrieved information, which is much more dispersed than that for the original query alone (in red). The upside is that we now have a better chance of capturing all relevant context; the downside is the higher number of retrieved results (reranking is a decent solution for this – we shall discuss it in a later post on retrieval).
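Both expansion flavours can be sketched in a few lines (again with a hypothetical `llm` callable and illustrative prompt wordings):

```python
# Two flavours of query expansion.

def expand_with_answer(query, llm):
    # Variant 1: append a hypothetical answer to form a joint query
    # that is then embedded and searched as one string.
    hypothetical = llm(f"Write a short hypothetical answer to: {query}")
    return f"{query} {hypothetical}"

def expand_with_queries(query, llm, n=3):
    # Variant 2: generate n similar queries and search with all of them,
    # one vector-DB lookup per query (results merged downstream).
    raw = llm(f"Generate {n} different rephrasings of: {query}")
    return [query] + raw.splitlines()

# Stub LLM to show the two shapes of output.
stub_llm = lambda p: ("Vitamin D is made in the skin."
                      if "hypothetical" in p else "q1\nq2\nq3")

joint = expand_with_answer("How is vitamin D produced?", stub_llm)
variants = expand_with_queries("How is vitamin D produced?", stub_llm)
print(joint)
print(variants)
```

Note that variant 2 pairs naturally with the reciprocal rank fusion shown earlier for merging the per-query result lists.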

b) Pseudo documents:

  • Hypothetical Document Embeddings (HyDE): What is worse than asking the wrong question? Asking it of a vector database that may or may not be well labelled for relevance. So, a better approach has been found to be to create a hypothetical answer and then search the embeddings for a match. As a note of caution, while this works better than matching query embeddings directly to answer embeddings, for questions that are highly irrelevant to the context there is a higher chance of hallucinated answers, so the process needs to be optimized with those edge cases in mind.

To re-emphasize, there is no right or wrong way of doing Query Translation, which is only the first step in creating a RAG-based workflow. It is an experimental field and the only way to know what works best for your use case is to build. There is no dearth of choices when it comes to LLMs or vector DBs, so go try them out and please do let me know how it went or what worked best.

Next up: Routing and Query Construction.
