RAG systems that show great promise in development can be particularly challenging to deploy in production. One major hurdle is ensuring access and security: in production, a RAG system must handle a large volume of user requests while maintaining the security and integrity of its data. This requires robust access controls, encryption, and monitoring, which can be difficult to implement and maintain. Development environments, by contrast, often run with relaxed security settings, making it easy to test and iterate on RAG systems without the added complexity of security protocols.
“The case for production-grade RAG systems in enterprises warrants much deeper scrutiny of system design, given performance, cost and security considerations.”
In this nine-part series, we discuss the system design considerations that directly impact RAG system performance, cost and security, serving as a guide for CTOs, CISOs and AI Engineers.
Download the complete guide on Advanced RAG for Enterprises
Post-retrieval ranking and stacking approaches for performance and cost optimization
The post-retrieval stage of a RAG system is a critical phase in which the retrieved search results are refined and optimized for LLM operations, ensuring a safe, informative, and engaging user experience. This stage encompasses two vital processes: post-retrieval filtering for safety and compliance, and re-ranking for relevance and performance.
Filtering
Post-retrieval filtering is a crucial step in the RAG pipeline, ensuring that the search results are not only relevant but also safe, diverse, and respectful of privacy and confidentiality. Before generating the final answer, filtering out irrelevant or harmful content is essential to maintain the integrity of the system.
- Duplicate content: One key aspect of post-retrieval filtering is removing duplicates, which helps to prevent redundant information and promote diversity in the search results. This step is particularly important in RAG systems, where multiple queries may retrieve similar results.
- Inappropriate content: Another critical aspect of post-retrieval filtering is removing potentially harmful content, such as copyrighted material, violent or hateful content, or information that may violate privacy and confidentiality. This step helps to ensure that the system adheres to ethical standards and avoids generating harmful or offensive responses.
- Irrelevant content: Additionally, post-retrieval filtering may involve removing outdated or irrelevant information, filtering out biased or low-quality sources, and detecting potential errors or inaccuracies in the search results. By performing these checks, RAG systems can increase the accuracy and reliability of their responses, ultimately leading to a better user experience.
In the context of RAG systems, post-retrieval filtering serves as a quality control mechanism, ensuring that the final response is not only relevant but also responsible, respectful, and informative. By filtering out harmful or irrelevant content, RAG systems can maintain user trust and provide a safe and engaging interaction experience.
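As a minimal sketch of the filtering steps above, the snippet below drops exact duplicates (via a content hash) and screens out chunks containing blocked terms. The chunk structure, scores, and `BLOCKLIST` terms are illustrative assumptions; a production system would use semantic deduplication and a dedicated content-safety classifier rather than a keyword list.

```python
import hashlib

# Hypothetical retrieved chunks; in practice these come from the vector store.
results = [
    {"text": "RAG combines retrieval with generation.", "score": 0.91},
    {"text": "RAG combines retrieval with generation.", "score": 0.88},  # duplicate
    {"text": "Internal memo: CONFIDENTIAL salary data.", "score": 0.85},
    {"text": "Re-ranking refines the retrieved candidates.", "score": 0.80},
]

# Assumption: a simple keyword screen stands in for a real safety classifier.
BLOCKLIST = {"confidential"}

def post_retrieval_filter(chunks):
    """Drop exact duplicates and chunks containing blocked terms."""
    seen, kept = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk["text"].strip().lower().encode()).hexdigest()
        if digest in seen:
            continue  # duplicate content
        if any(term in chunk["text"].lower() for term in BLOCKLIST):
            continue  # inappropriate or sensitive content
        seen.add(digest)
        kept.append(chunk)
    return kept

filtered = post_retrieval_filter(results)
```

Here only two of the four chunks survive: the duplicate and the blocked chunk are removed before anything reaches the LLM, keeping prompt tokens (and cost) down as a side benefit.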
Re-ranking
Re-ranking is a refinement technique used to enhance the initial search results from a retrieval system, aiming to improve their relevance and accuracy. In the context of RAG retrieval, re-ranking serves as a quality control mechanism that enables our librarian (the LLM) to fine-tune the list of potential responses before generating the final answer. By applying additional ranking criteria or incorporating contextual information, re-ranking helps align the top-k candidates with the user’s query, ensuring more precise and informative responses.
In practice, re-ranking can be achieved through various techniques:
- Ensemble Models: Combining multiple language models or ranking algorithms to provide more accurate and diverse results.
- Contextual Re-Ranking: Incorporating user preferences, interaction history, or other contextual information to personalize the ranking criteria.
- Feature-based Re-Ranking: Assigning scores based on predefined features like term frequency, document length, or entity overlap to re-rank candidates.
- Learning to Re-Rank (LTR): Training models to predict relevance based on user queries and labeled data.
- User Feedback Integration: Incorporating user interactions like clicks, likes, or ratings to learn user preferences and adjust re-ranking over time.
These re-ranking techniques can be applied individually or in combination to improve the quality of search results and provide more accurate and engaging responses.
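To make the feature-based variant concrete, here is a small sketch that blends the retriever's original score with a lexical-overlap feature between the query and each candidate. The candidate chunks, scores, and the `alpha` weighting are illustrative assumptions; real deployments typically use a trained cross-encoder or LTR model instead of hand-tuned features.

```python
def rerank(query, chunks, top_k=3, alpha=0.5):
    """Re-rank candidates by blending the retriever's score with
    query-term overlap; alpha weighs the original score vs. the feature."""
    q_terms = set(query.lower().split())

    def blended(chunk):
        c_terms = set(chunk["text"].lower().split())
        overlap = len(q_terms & c_terms) / max(len(q_terms), 1)
        return alpha * chunk["score"] + (1 - alpha) * overlap

    return sorted(chunks, key=blended, reverse=True)[:top_k]

# Hypothetical first-stage retrieval results.
candidates = [
    {"text": "vector databases store embeddings", "score": 0.70},
    {"text": "re-ranking improves retrieval relevance", "score": 0.60},
    {"text": "chunking splits documents", "score": 0.65},
]

top = rerank("how does re-ranking improve relevance", candidates, top_k=2)
```

Note how the second candidate, despite the lowest retriever score, moves to the top once query overlap is factored in. Trimming to `top_k` after re-ranking also caps the context passed to the LLM, which is where the cost optimization comes from.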
Overall, production RAG systems present the following key questions. Dive into each subtopic through the links below:
- API model access vs hosted model on self-managed instances
- Choice of model and precision as a trade-off between performance and running cost
- Choice of vector databases based on types of supported search algorithm and security options
- Data pipeline design for reliability, content safety and performance-focused data pre-processing
- Choice of chunking approach based on type of content: length, sentences or logical chunks
- Pre-retrieval filters and transformations for security and retrieval performance optimization
- Post-retrieval ranking and stacking approaches for performance and cost optimization
- Guardrail implementation with consideration for different modalities of inputs and outputs
- Logging mechanisms to facilitate performance, cost and security analyses