Advanced RAG Series: Generation and Evaluation

“You can’t manage what you can’t measure” – Peter Drucker

Having put in the work to optimize Query Translation, Routing and Query Construction, Indexing, and Retrieval, today we get to see the results with Generation and Evaluation, rounding off our Advanced RAG pipeline.

Generation:

This is the final step in setting up the RAG pipeline, the moment of truth! While Indexing and Retrieval go a long way toward ensuring the integrity of the output, it is crucial to critique the retrieved results, followed by a decision-making step that triggers appropriate actions, before generating an output for the user. Cognitive Architectures for Language Agents (CoALA) is a great framework to put this in context, whereby the retrieved information is evaluated against a set of possible actions. This is best represented in the figure below.

There are a few approaches to implement this decision procedure for action selection.

For those with a discerning eye, we discussed CRAG (Corrective Retrieval Augmented Generation) briefly in the previous post on Retrieval, but it overlaps with the generation setup and is hence worth expanding on further in this post.

The way CRAG enhances generation is by using a lightweight “Retrieval Evaluator”, which returns a confidence score for each retrieved document. This score then determines which retrieval action to trigger. For example, based on the confidence score the evaluator can label each retrieved document into one of three buckets: Correct, Ambiguous, or Incorrect.

If all of the retrieved docs have a confidence score below the threshold, the retrieval is assumed “Incorrect”. This triggers the action of bringing in new sources of knowledge (external, i.e. web search) to enable quality generation.

If at least one retrieved doc’s confidence score is greater than the threshold, the retrieval is assumed “Correct”, which triggers the knowledge refinement method on the retrieved docs. Knowledge refinement entails splitting each doc into “knowledge strips”, scoring each strip for relevance, and recomposing the most relevant ones as internal knowledge for generation. Refer to the image above for a more intuitive visual representation.

The retrieval is assumed “Ambiguous” when the retrieval evaluator is not confident in its judgement, leading to a mix of the above strategies. This is well represented in the following decision tree:
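To make the decision tree concrete, here is a minimal sketch of a CRAG-style action selector in Python. The evaluator, the web search helper, the thresholds and the strip-splitting heuristic are hypothetical stand-ins rather than the paper’s implementation.

```python
# Minimal sketch of a CRAG-style action selector (illustrative, not the paper's code).
# `evaluator.score`, `web_search` and the thresholds below are hypothetical stand-ins.
from typing import Callable, List

UPPER, LOWER = 0.7, 0.3  # assumed confidence thresholds


def refine(query: str, docs: List[str], evaluator) -> List[str]:
    """Knowledge refinement: split docs into strips and keep the most relevant ones."""
    strips = [s.strip() for d in docs for s in d.split("\n\n") if s.strip()]
    scored = sorted(((evaluator.score(query, s), s) for s in strips), reverse=True)
    return [s for _, s in scored[:5]]


def crag_select(query: str, docs: List[str], evaluator, web_search: Callable) -> List[str]:
    scores = [evaluator.score(query, d) for d in docs]
    if max(scores) >= UPPER:
        # "Correct": at least one doc is trusted, so refine internal knowledge.
        return refine(query, docs, evaluator)
    if max(scores) <= LOWER:
        # "Incorrect": nothing is trusted, so fall back to external knowledge.
        return web_search(query)
    # "Ambiguous": hedge by combining refined internal knowledge with web results.
    return refine(query, docs, evaluator) + web_search(query)
```

The generation step would then condition the LLM on whichever knowledge set this selector returns.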

This methodology has been reported to produce the best results across the four datasets used in the paper. As shown, CRAG outperforms RAG by a significant margin. Self-CRAG (see Self-RAG below for reference) makes these margins even more significant and, crucially, shows how adaptable CRAG can be as a “plug-and-play” option for RAG pipelines. Another big advantage of CRAG vs other methods (such as Self-RAG) is its flexibility in replacing the underlying LLM, which is important if we were to swap in a much more powerful LLM in the future.

The obvious limitation is that CRAG is heavily dependent on the quality of the retrieval evaluator and susceptible to the biases a web search can introduce. Fine-tuning the retrieval evaluator may hence be inevitable, and guardrails will be necessary to ensure the quality and accuracy of the output.

Self-RAG is another framework to think through how we can improve an LLM’s quality and factuality while preserving its versatility. Instead of retrieving a fixed number of passages irrespective of whether they are required or relevant, this framework focuses on on-demand retrieval and self-reflection.

Step 1: An arbitrary LLM is trained to self-reflect on its own generations using special tokens known as reflection tokens (retrieval and critique). Retrieval is triggered by a retrieval token, which the LLM outputs based on the input prompt and the preceding generations (ref image).

Step 2: The LLM then processes the retrieved passages concurrently, generating outputs in parallel, and emits critique tokens to evaluate each of these outputs.

Step 3: Based on “factuality” and “relevance”, the best output is then chosen for the final generation. The algorithm is well depicted in the paper, with the tokens defined as follows:

The following schematic by LangChain illustrates well the Self-RAG inference algorithm for decision making based on the reflection tokens defined above.
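As a rough illustration of that decision flow, the sketch below mimics Self-RAG inference with plain prompts instead of a model trained to emit reflection tokens; `llm`, `retriever` and the “[Retrieve]” marker are assumptions made for this example.

```python
# Illustrative Self-RAG-style inference loop. The real method relies on a model
# fine-tuned to emit reflection tokens; here plain prompting stands in for that,
# and `llm.generate` / `retriever.search` are assumed interfaces.
def self_rag(query: str, llm, retriever, k: int = 3) -> str:
    # Step 1: decide whether retrieval is needed (retrieval token).
    decision = llm.generate(f"{query}\nIs retrieval needed? Answer [Retrieve] or [No Retrieve]:")
    if "[Retrieve]" not in decision:
        return llm.generate(query)

    # Step 2: generate one candidate per retrieved passage and critique each
    # (critique tokens), shown sequentially here for simplicity.
    candidates = []
    for passage in retriever.search(query, k=k):
        answer = llm.generate(f"Context: {passage}\nQuestion: {query}\nAnswer:")
        critique = llm.generate(
            "Score 0-1 for relevance and factual support.\n"
            f"Context: {passage}\nAnswer: {answer}\nScore:"
        )
        try:
            score = float(critique.strip())
        except ValueError:
            score = 0.0
        candidates.append((score, answer))

    # Step 3: keep the candidate the critiques score highest.
    return max(candidates)[1]
```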

In terms of performance, Self-RAG significantly outperforms the baseline LLM with or without retrieval. Refer to the CRAG discussion earlier for performance on benchmarks; to note, Self-CRAG builds further on this framework to improve performance.

There can be limitations on cost and latency given the number of LLM calls made in this framework. It is worth looking at workarounds, such as performing the generation once rather than twice for every chunk.

“The RRR model [Ma et al., 2023a] introduces the Rewrite-Retrieve-Read process, utilizing the LLM performance as a reinforcement learning incentive for a rewriting module. This enables the rewriter to fine-tune retrieval queries, thereby improving the downstream task performance of the reader”

The assumption in this framework is that the user query can be further optimized (i.e. rewritten) by the LLM for more accurate retrieval. While this can improve performance by making the LLM “think” via the query rewriting process, problems such as reasoning errors or invalid searches are potential barriers to deployment in production environments.

To solve for this, and as depicted in the figure above, adding a small trainable LLM as the rewriter has been shown to result in consistent performance improvement with training, making this framework scalable and efficient. The training implementation here occurs in two phases – “warm-up” and reinforcement learning – the key breakthrough being the ability to integrate trainable modules into larger LLM pipelines.
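A minimal sketch of the Rewrite-Retrieve-Read flow might look as follows; `rewriter`, `retriever` and `reader` are placeholder components, with the rewriter being the small trainable model and the reader a frozen LLM.

```python
# Rewrite-Retrieve-Read, sketched with placeholder components. The reward feedback
# to the rewriter (reinforcement learning phase) is only noted in a comment here.
def rewrite_retrieve_read(user_query: str, rewriter, retriever, reader) -> str:
    # 1) Rewrite: the small trainable model reformulates the query for retrieval.
    search_query = rewriter.generate(f"Rewrite as a web search query: {user_query}")

    # 2) Retrieve: fetch documents with the rewritten query.
    docs = retriever.search(search_query, k=5)

    # 3) Read: the frozen reader LLM answers from the retrieved context.
    context = "\n\n".join(docs)
    answer = reader.generate(f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:")

    # During training, a reward derived from answer quality (e.g. exact match with
    # the gold answer) would be fed back to the rewriter via reinforcement learning.
    return answer
```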

Evaluation

I would be remiss not to touch on one of the most important steps of any RAG pipeline: Evaluation. There are various ways one can do it, starting with having a set of Q&A pairs as a test dataset and verifying the output against the actual answers. The obvious pitfall with this approach is that it is not only time consuming, but also adds the risk of optimizing the pipeline for that dataset rather than the larger real-world use case, which may have many edge cases (captured in the metrics discussed below).

This is where RAGAs (short for RAG Assessment) comes in. It is an open-source framework to evaluate RAG pipelines, and it helps by:

a) Providing ways to generate test data based on the “ground truth”

b) Offering metrics-based evaluation of the retrieval and generation steps, either individually or end-to-end.

It evaluates the following aspects of the RAG system, i.e. its “metrics”:

  1. Faithfulness: factual consistency of the answer against the retrieved context

  2. Answer Relevance: relevancy of the answer vs the prompt

  3. Context Precision: checks whether the relevant chunks are ranked higher or not

  4. Aspect Critique: assesses submissions based on predefined aspects such as harmlessness and correctness

  5. Context Recall: compares ground truth vs contexts to check if all relevant information is retrieved

  6. Context Entities Recall: evaluates the entities present in the retrieved context vs the ground truth

  7. Context Relevancy: relevancy of the retrieved context vs the prompt

  8. Answer Semantic Similarity: how semantically similar the generated answer is vs the actual answer

  9. Answer Correctness: evaluates the accuracy and alignment of the generated answer vs the actual answer

This is quite an exhaustive list and provides good optionality on how one may want to evaluate their RAG setup – I highly recommend checking out their documentation here.
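As a quick illustration, running a handful of these metrics takes only a few lines; note that the exact import paths and column names vary across ragas versions, and an LLM API key (OpenAI by default) is needed under the hood.

```python
# Minimal RAGAs evaluation sketch (column names and imports may differ by version).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

samples = Dataset.from_dict({
    "question": ["What does CRAG add on top of vanilla RAG?"],
    "answer": ["A retrieval evaluator that scores documents and triggers corrective actions."],
    "contexts": [["CRAG uses a lightweight retrieval evaluator to assign confidence scores..."]],
    "ground_truth": ["A retrieval evaluator plus corrective actions such as web search."],
})

# Requires an LLM + embeddings provider configured (OpenAI by default).
result = evaluate(samples, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(result)  # dict-like scores per metric
```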

Also worth mentioning is LangSmith by LangChain as an observability and monitoring tool in the above context.

“LangSmith is a platform that helps to debug, test, evaluate, and monitor chains and agents built on any LLM framework”

Incorporating LangSmith into the RAGAs evaluation framework can be helpful when we want to go deeper into the results. Where LangSmith is particularly useful is in the logging and tracing part of the evaluation, to understand which step can be further optimized to improve either the retriever or the generator, as informed by the evaluation.
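Getting traces flowing is mostly configuration; with the environment variables below set, each chain run (retrieval calls, prompts, LLM responses) shows up in the LangSmith UI. The project name here is arbitrary.

```python
# Enable LangSmith tracing for a LangChain app via environment variables.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "rag-evaluation"  # arbitrary project name

# Any chain or retriever invoked after this point is traced automatically,
# so you can inspect each retrieval and generation step next to RAGAs scores.
```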

The other evaluation framework worth a mention here is DeepEval. It offers 14+ metrics covering both RAG and fine-tuning, including G-Eval, RAGAS, Summarization, Hallucination, Bias, Toxicity and more.

The self-explanatory nature of these metrics helps with the explainability of the metric scores, making it easier to debug. This is a key differentiator vs the RAGAs framework above (and also why I mentioned LangSmith in the same breath as RAGAs earlier).

It has other nice-to-have features such as Pytest integration (developer friendly) and modular components, and it is also open source.
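A small example of the Pytest-style workflow is below; the metric choice and threshold are arbitrary here, and an LLM API key is needed for the metric’s judge model.

```python
# DeepEval test case sketch (run with `deepeval test run test_rag.py` or pytest).
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_rag_answer():
    test_case = LLMTestCase(
        input="What does CRAG add on top of vanilla RAG?",
        actual_output="A retrieval evaluator that scores documents and triggers corrective actions.",
        retrieval_context=["CRAG uses a lightweight retrieval evaluator to assign confidence scores..."],
    )
    # The metric calls an LLM judge under the hood and returns an explainable score.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```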

Evaluate → Iterate → Optimize

With the Generation and Evaluation steps in place, we are now well equipped to not only deploy a RAG system, but also evaluate, iterate and optimize for the specific use case we are designing it for.

There is no correct or incorrect way of doing things when it comes to LLMs, black boxes as they are. RAG pipelines are notoriously hard to build, and harder to maintain. The only way to know what works best is to build…and evaluate!

Until next time…happy reading!
