Guide on Performance and Security for Advanced Production RAG: Part 1 – Model Deployment Mode

API model access vs hosted model on self-managed instance: A key decision point for CTOs, CISOs and AI Engineers

RAG systems that show great promise in development can be surprisingly challenging to deploy in production. One major hurdle is access and security: in production, a RAG system must handle a large volume of user requests while maintaining the security and integrity of its data. This requires robust access controls, encryption, and monitoring, which can be difficult to implement and maintain. Development environments, by contrast, often run with more relaxed security settings, making it easier to test and iterate without the added complexity of security protocols.

“The case for production-grade RAG systems in enterprises warrants much deeper scrutiny of system design, given performance, cost and security considerations.”

In this nine-part series, we discuss the system design considerations that directly impact RAG system performance, cost and security, serving as a guide for CTOs, CISOs and AI Engineers.

Download the complete guide on Advanced RAG for Enterprises

API model access vs hosted model on self-managed instance

API-based access to Large Language Models (LLMs) provides a convenient and scalable way to tap into pre-trained models. Examples include using OpenAI’s API endpoint or leveraging model-as-a-service platforms like Amazon Bedrock or Azure.

With APIs, the model is hosted and managed by the provider, and access is granted through an endpoint. This approach offers ease of use, rapid deployment, and scalability. However, it also means relying on the provider’s infrastructure, security, and pricing model. From a security standpoint, API-based access may introduce additional risks, such as data exposure and dependence on the provider’s security measures. It is also important to note that on most model-as-a-service platforms, such as Amazon Bedrock or Azure, not all models are available in all regions. This is a practical limitation for enterprises whose data is not allowed to cross national boundaries, when the model they need is served only from a different region.
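
As a concrete illustration, the sketch below calls a provider-hosted model through the openai Python client. The model name and prompt are placeholders, and any OpenAI-compatible endpoint (or a platform SDK such as boto3 for Amazon Bedrock) follows the same request/response pattern.

```python
# Minimal sketch of API-based model access via the openai Python client.
# Model name and prompt contents are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; select per your requirements
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Context: ...\n\nQuestion: ..."},
    ],
)
print(response.choices[0].message.content)
```

Note that the request, including any retrieved context, leaves your infrastructure and is processed on the provider’s side; this is the data-exposure trade-off discussed above.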

Self-hosted models, on the other hand, offer greater control and flexibility. By running LLMs like Llama-3 on a GPU compute instance, organizations can maintain complete ownership and control over the model and its outputs. Self-hosting ensures sensitive data remains within the organization’s infrastructure, reducing security risks. Additionally, self-hosted models provide extensive customization options, allowing organizations to fine-tune the model to their specific use case.
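
For self-hosting, a minimal sketch is shown below using vLLM as the serving engine. vLLM is our assumption here (this guide does not prescribe a particular serving stack), and the model name is illustrative; the key point is that the weights load and run entirely on your own GPU instance.

```python
# Minimal self-hosting sketch: run Llama-3 locally with vLLM on a GPU instance.
# The model identifier is illustrative; weights stay within your infrastructure.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Context: ...\n\nQuestion: ..."], params)
print(outputs[0].outputs[0].text)
```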

From a cost perspective, self-hosted models carry largely fixed costs (compute is billed whether or not it is used), whereas API-based access comes with variable, usage-based costs that can add up quickly at scale. In production environments, that predictability and control over spend can be critical for businesses.
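
A back-of-envelope calculation makes the trade-off concrete. All figures in the sketch below are illustrative assumptions, not quoted prices; substitute your provider’s actual rates and your own traffic profile.

```python
# Back-of-envelope comparison: variable API cost vs fixed GPU cost.
# Every figure here is an illustrative assumption, not a real price quote.
api_cost_per_1k_tokens = 0.002      # USD, hypothetical blended rate
tokens_per_request = 2_000          # prompt + completion, assumed average
requests_per_month = 500_000        # assumed production traffic

api_monthly = api_cost_per_1k_tokens * tokens_per_request / 1_000 * requests_per_month

gpu_hourly_rate = 4.00              # USD, hypothetical on-demand GPU instance
gpu_monthly = gpu_hourly_rate * 24 * 30  # always-on; excludes ops overhead

print(f"API (variable):      ${api_monthly:,.0f}/month")   # $2,000 at these rates
print(f"Self-hosted (fixed): ${gpu_monthly:,.0f}/month")   # $2,880 at these rates
```

The crossover point depends entirely on traffic: below it, usage-based API pricing is cheaper; above it, the fixed GPU cost wins, which is why the decision should be revisited as request volume grows.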

Security Differences

  • API-based access: relies on the provider’s security measures and may introduce data exposure risks
  • Self-hosted models: keep sensitive data within the organization’s infrastructure, reducing security risks; model provenance checks and regular model scans are, however, essential (see the sketch after this list)
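
A minimal sketch of one such provenance check, assuming the model provider publishes a SHA-256 digest for the weight files (the file path and digest below are placeholders):

```python
# Provenance check for self-hosted model weights: compare the downloaded
# file's SHA-256 against a digest published by the model provider.
import hashlib

EXPECTED_SHA256 = "<digest published by the model provider>"  # placeholder

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large weight files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

assert sha256_of("model.safetensors") == EXPECTED_SHA256, "weights do not match"
```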

Options and Controls

  • API-based access: limited customization options, dependent on provider’s offerings
  • Self-hosted models: extensive customization options, including fine-tuning the model to a specific use case (a minimal fine-tuning sketch follows this list)
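
As one example of the customization available with self-hosting, the sketch below attaches LoRA adapters to a base model using the Hugging Face peft library. The library choice, model name, and hyperparameters are all assumptions for illustration; the training loop itself is omitted.

```python
# Illustrative fine-tuning setup: LoRA adapters via Hugging Face peft.
# Model name and hyperparameters are placeholders; training loop omitted.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
lora = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable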

Benefits and Limitations

  • API-based access: easy to use, rapid deployment, scalable, but limited control
  • Self-hosted models: complete control, customization options, predictable costs, but requires expertise and resources

Cost Considerations

  • API-based access: variable costs, can add up quickly
  • Self-hosted models: high fixed costs for GPU instances, more predictability and control over costs in production environments

Overall, production RAG systems present the following key questions. Dive into each of the subtopics through the links below:

  1. API model access vs hosted model on self-managed instances
  2. Choice of model and precision as a trade-off between performance and running cost
  3. Choice of vector databases based on types of supported search algorithm and security options
  4. Data pipeline design for reliability, content safety and performance-focused data pre-processing
  5. Choice of chunking approach based on type of content: length, sentences or logical chunks
  6. Pre-retrieval filters and transformations for security and retrieval performance optimization
  7. Post-retrieval ranking and stacking approaches for performance and cost optimization
  8. Guardrail implementation with consideration for different modalities of inputs and outputs
  9. Logging mechanisms to facilitate performance, cost and security analyses

Download the complete guide on Advanced RAG for Enterprises
