RAG systems, which have shown great promise in development environments, can be particularly challenging to deploy in production. One major hurdle is access and security: in production, RAG systems must handle a large volume of user requests while maintaining the security and integrity of the underlying data. This requires robust access controls, encryption, and monitoring, all of which can be difficult to implement and maintain. Development environments, by contrast, often run with more relaxed security settings, making it easier to test and iterate on RAG systems without the added complexity of these protocols.
“The case for production-grade RAG systems in enterprises warrants much deeper scrutiny of system design, given performance, cost and security considerations.”
In this nine-part series, we discuss the system design considerations that directly impact RAG system performance, cost and security, serving as a guide for CTOs, CISOs and AI Engineers.
Download the complete guide on Advanced RAG for Enterprises
Choice of chunking approach based on type of content: length, sentences or logical chunks
When it comes to building production-grade RAG (Retrieval-Augmented Generation) systems, the focus often falls on selecting the right models and fine-tuning hyperparameters. However, one crucial aspect that receives far less attention is chunking – the process of breaking content down into manageable blocks.
The way content is chunked has a profound impact on the eventual performance of the RAG system and is best illustrated through an analogy. Building a RAG system is like searching for a specific book in a vast library. If the books are not properly catalogued and shelved (chunked), the librarian (search process) will struggle to find the correct book, even with the most precise query. And, even if the librarian manages to locate a book, if it’s not the relevant one (most relevant content), the most skilled reader (best model) won’t be able to extract meaningful insights from it. Just as a well-organized library enables the librarian to efficiently find the right book, proper chunking enables the RAG system to retrieve the most relevant content, unlocking the full potential of the model to generate accurate and meaningful results.
We introduce three broad approaches to chunking below but emphasize that the optimal chunking approach ultimately depends on: (i) the type of content used in the RAG system, e.g. code, question-answer pairs, paragraphs; (ii) the purpose and type of questions the system is meant to answer; and (iii) the way search will be conducted as part of the retrieval step, e.g. hybrid, keyword-based or question-based.
Length-Based Chunking
Length-based chunking involves dividing content into fixed-length segments, usually measured in characters or words. It is the approach native to many low/no-code RAG frameworks and many online guides to RAG development, but it may not be effective for most production-grade RAG systems. One key limitation is that this approach often splits sentences mid-thought or combines unrelated ideas.
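To make this concrete, here is a minimal sketch of word-based, fixed-length chunking with a small overlap between consecutive chunks. The function name and parameter values are illustrative, not taken from any specific framework.

```python
def chunk_by_length(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-length word windows with a small overlap."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap softens, but does not remove, the mid-thought splits described above, which is one reason this approach is rarely sufficient on its own in production.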
Sentence-Based Chunking
Sentence-based chunking involves breaking content into individual sentences or sentence-like units. This approach is particularly useful for content that follows a formal structure, such as news articles or technical documents. By chunking content at the sentence level, models can better understand the relationships between ideas and improve performance in tasks like question answering or text summarization. However, sentence-based chunking may not be effective for content with long, complex sentences or sentences that contain multiple ideas. There are also variations on how sentence-based chunking can be done. For instance, some implementations combine two or more sentences in a given chunk to capture the logical flow of connected ideas across sentences, as the sketch below illustrates.
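The following sketch splits on sentence boundaries and then groups a few consecutive sentences into each chunk. It uses a naive regex splitter to stay self-contained; a production system would typically rely on a dedicated sentence tokenizer (e.g. spaCy or NLTK) instead.

```python
import re

def chunk_by_sentences(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Split text on sentence boundaries, then group consecutive sentences
    so each chunk retains the logical flow across adjacent sentences."""
    # Naive boundary detection; real systems should use a proper tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```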
Logical Chunking
Logical chunking involves creating synthetic chunks or dividing content into meaningful units based on semantic relationships, such as paragraphs, sections, or topics. This approach is useful for content that requires a deeper understanding of context and relationships, such as technical documentation or instructional materials. The following real-world examples show how logical chunking can bring related content together into individual chunks to optimize search.
Example 1: Enriched logical chunks for software code
Individual functions, classes, views or variables are stored as single logical chunks. Each of these chunks is also enriched with contextual information such as related functions or variables. This way, individual chunks preserve the logical connections between different parts of the code and how they work together.
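For Python source code, this kind of enrichment can be sketched with the standard ast module: each function becomes one chunk, annotated with the names of other functions it calls. The chunk field names here are illustrative assumptions, not a prescribed schema.

```python
import ast

def enrich_code_chunks(source: str) -> list[dict]:
    """Store each function as one chunk, enriched with the names of the
    other functions it calls so related logic stays retrievable together."""
    tree = ast.parse(source)
    functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    defined = {f.name for f in functions}
    chunks = []
    for func in functions:
        # Collect calls to other functions defined in the same file.
        calls = {
            node.func.id
            for node in ast.walk(func)
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
        }
        chunks.append({
            "name": func.name,
            "content": ast.get_source_segment(source, func),
            "related_functions": sorted(calls & defined),  # contextual enrichment
        })
    return chunks
```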
Example 2: Synthetic logical chunks from financial documents
Content chunks about financial line items are created from structured XBRL documents. Each chunk is represented by a constructed statement that contains not just the numerical value, but also the unit of measure, the relevant period, prior period values, year-on-year and quarter-on-quarter changes, the line item definition and any associated supplementary notes.
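A simplified sketch of constructing such a statement is shown below. It assumes the XBRL facts have already been extracted into a plain dictionary; the field names and the example figures are purely illustrative.

```python
def build_financial_chunk(fact: dict) -> str:
    """Turn one extracted financial line item into a self-contained statement
    carrying value, unit, period, changes, definition and notes."""
    yoy = (fact["value"] - fact["prior_year_value"]) / fact["prior_year_value"] * 100
    qoq = (fact["value"] - fact["prior_quarter_value"]) / fact["prior_quarter_value"] * 100
    return (
        f"{fact['line_item']} for {fact['period']} was {fact['value']:,} {fact['unit']} "
        f"({yoy:+.1f}% year-on-year, {qoq:+.1f}% quarter-on-quarter). "
        f"Definition: {fact['definition']} Notes: {fact['notes']}"
    )

# Illustrative example only (not real reported figures)
example_fact = {
    "line_item": "Revenue",
    "period": "Q2 FY2024",
    "value": 1_250_000,
    "unit": "USD",
    "prior_year_value": 1_000_000,
    "prior_quarter_value": 1_200_000,
    "definition": "Total income from sales of goods and services.",
    "notes": "Includes one-off licensing income of USD 50,000.",
}
print(build_financial_chunk(example_fact))
```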
As can be seen, there are many different approaches and virtually no limits to how content chunking can be done. Ultimately, in the context of RAG systems, organizations should consider the intended use case, how search and retrieval will be conducted and therefore how ideas and information should be packed into individual document chunks for storage within a database.
Overall, production-environment RAG systems raise the following key questions. Dive into each of the subtopics through the links below:
- API model access vs hosted model on self-managed instances
- Choice of model and precision as a trade-off between performance and running cost
- Choice of vector databases based on types of supported search algorithm and security options
- Data pipeline design for reliability, content safety and performance-focused data pre-processing
- Choice of chunking approach based on type of content: length, sentences or logical chunks
- Pre-retrieval filters and transformations for security and retrieval performance optimization
- Post-retrieval ranking and stacking approaches for performance and cost optimization
- Guardrail implementation with consideration for different modalities of inputs and outputs
- Logging mechanisms to facilitate performance, cost and security analyses
Download the complete guide on Advanced RAG for Enterprises