RAG in Production, Section 2: Choosing the Right Model — Performance, Precision, and Legal Considerations

Deploying a production RAG system requires choosing a language model, and that choice is more consequential than it first appears. Organisations that default to the most capable or most popular model available often encounter unexpected trade-offs: performance that does not match the use case, licence restrictions that prevent commercial deployment, or inference costs that make production economics unworkable at scale. The Aigos Blueprint treats model selection as a systems engineering decision with legal, ethical, and operational dimensions that demand systematic evaluation.

📄 Download the Full Blueprint: Advanced Production RAG – Performance and Security

Use Case Determines Model Requirements

Model selection starts with the intended use case. Models optimised for coding tasks differ significantly from those tuned for general question-answering or document analysis. Multimodal models add capabilities that text-only alternatives lack, but they also introduce new security risks. A model mismatched to the use case does not merely reduce performance; it can produce confidently wrong outputs that erode user trust and create downstream risks in environments where users rely on AI-generated responses to make consequential decisions.

The performance metrics that matter most to one application may be irrelevant to another. Tokens per second is critical for interactive conversational interfaces where latency directly affects user experience. Factual accuracy and recall are paramount for enterprise knowledge base applications. Context window length determines whether the model can reason across long documents or must rely on chunking to compensate for a limited working memory. Evaluate models against the specific metrics that matter for the deployment, not against benchmark rankings that reflect different use cases.
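One way to make this use-case weighting concrete is a simple weighted score over measured metrics. The following is a minimal sketch, not a method from the Blueprint: the `Candidate` type, the model names, the metric values, and the weights are all hypothetical placeholders that would be replaced with measurements from your own evaluation set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Measured results for one model. All values below are hypothetical."""
    name: str
    tokens_per_second: float
    recall: float          # fraction in [0, 1] from your own evaluation set
    context_window: int    # maximum context length in tokens


METRICS = ("tokens_per_second", "recall", "context_window")


def rank(candidates, weights):
    """Rank candidates by a weighted sum of metrics scaled to the best value.

    Each metric is divided by the best value among the candidates, so the
    leader on a metric scores 1.0 on it. `weights` maps metric name to its
    importance for the use case.
    """
    best = {m: max(getattr(c, m) for c in candidates) for m in METRICS}
    scored = []
    for c in candidates:
        score = sum(
            weights.get(m, 0.0) * (getattr(c, m) / best[m] if best[m] else 0.0)
            for m in METRICS
        )
        scored.append((c.name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates and use-case weightings, for illustration only.
CANDIDATES = [
    Candidate("fast-small", tokens_per_second=100.0, recall=0.60, context_window=8_000),
    Candidate("slow-accurate", tokens_per_second=40.0, recall=0.90, context_window=32_000),
]
CHAT_WEIGHTS = {"tokens_per_second": 0.7, "recall": 0.2, "context_window": 0.1}
KNOWLEDGE_BASE_WEIGHTS = {"tokens_per_second": 0.1, "recall": 0.7, "context_window": 0.2}

if __name__ == "__main__":
    print("chat:", rank(CANDIDATES, CHAT_WEIGHTS))
    print("knowledge base:", rank(CANDIDATES, KNOWLEDGE_BASE_WEIGHTS))
```

With these placeholder numbers the same two models rank in opposite order under the two weightings, which is the point: the ranking is a property of the deployment, not of the model alone.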

Legal and Ethical Dimensions of Model Selection

Every organisation deploying a language model in production must understand the legal foundations of that model. Some models are available for commercial use under open licences; others impose restrictions on commercial deployment, fine-tuning, or redistribution that may conflict with the intended use. Deploying a model whose licence prohibits commercial use is not a technical problem. It is a legal one, with consequences that include contractual liability and reputational harm.

Ethical screening is an equally important dimension. Some models are explicitly screened during training to remove inappropriate content, copyrighted material, or biased representations. Others are not, and the organisation deploying them inherits those omissions. In financial services, healthcare, and government, the provenance and screening history of a model is a compliance requirement. Confirming that the chosen model aligns with organisational values and meets applicable regulatory standards must be part of the selection process.

Precision, Quantisation, and the Performance-Cost Trade-off

Model precision, the numerical format in which model weights are stored and computed, is one of the most significant but least discussed factors in production RAG system design. Full-precision (FP32) and half-precision (FP16 or BF16) models preserve the most accuracy but require substantial GPU memory and compute. Quantised models operating at lower precision (INT8, INT4) reduce memory footprint and increase inference speed at the cost of some output quality. The right precision level depends on performance requirements, acceptable quality trade-offs, and available infrastructure for each specific use case.

The relationship between precision and cost is direct: higher precision means higher memory requirements, translating to more expensive GPU infrastructure and lower throughput per dollar. For organisations running high-volume production workloads, the cost difference between FP16 and INT8 inference is substantial. Treat this as an engineering trade-off requiring empirical evaluation against real production data, not a binary choice between best quality and fastest performance.
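The memory side of this trade-off can be estimated with simple arithmetic: parameter count multiplied by bytes per weight. The sketch below assumes a hypothetical 7-billion-parameter model and counts weights only; activations, the KV cache, and any overhead added by a particular quantisation scheme are deliberately left out, so real requirements will be higher.

```python
# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "BF16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_parameters: int, precision: str) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    Excludes activations, KV cache, and quantisation metadata.
    """
    bits = BITS_PER_WEIGHT[precision]
    return num_parameters * bits / 8 / 1e9


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    for precision in ("FP32", "FP16", "INT8", "INT4"):
        print(f"{precision}: {weight_memory_gb(params, precision):.1f} GB")
    # Prints:
    # FP32: 28.0 GB
    # FP16: 14.0 GB
    # INT8: 7.0 GB
    # INT4: 3.5 GB
```

Halving the bits per weight halves the weight footprint, which is why moving from FP16 to INT8 can change which GPU class a model fits on. The estimate says nothing about output quality, which still has to be measured against real production data.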

Embedding Generation: API-Based vs. Local

Beyond the generation model, production RAG systems require an embedding model to convert documents and queries into vector representations for retrieval. API-based embedding generation reduces local resource overhead and avoids GPU investment but introduces additional latency and per-call costs. Local embedding generation provides control and eliminates external data transmission but requires significant GPU processing capacity. The production embedding strategy must be selected with the same care as generation model selection. Cost, latency, and data sensitivity all factor into the decision.
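The cost part of this decision can be framed as a break-even calculation between a per-token API price and a fixed monthly cost for local hardware. The sketch below is an illustration under stated assumptions: the function names, the price, and the hardware cost are hypothetical, and it models cost only. Latency and data sensitivity need to be weighed separately.

```python
def api_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Monthly spend on an embedding API billed per million tokens."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_tokens(local_monthly_cost: float, price_per_million_tokens: float) -> float:
    """Monthly token volume at which API spend equals the fixed local cost.

    Below this volume the API is cheaper; above it, local generation is.
    """
    if price_per_million_tokens <= 0:
        raise ValueError("price_per_million_tokens must be positive")
    return local_monthly_cost / price_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Hypothetical figures: substitute your provider's price and your
    # amortised GPU, power, and operations cost.
    local_cost = 1000.0   # currency units per month
    api_price = 0.5       # currency units per million tokens
    threshold = break_even_tokens(local_cost, api_price)
    print(f"Break-even volume: {threshold:,.0f} tokens per month")
```

A workload well below the break-even volume favours the API on cost alone, but a data-sensitivity requirement can override that result regardless of volume.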
