What RAG is

The typical flow: (1) the user asks a question, (2) the system searches a vector database for relevant documents, (3) those documents are passed as context to the LLM, (4) the LLM generates an answer grounded in them.
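A minimal sketch of that flow in Python. The embedding function and the final LLM call are stand-ins (a bag-of-words vector and a print) chosen only to make the four steps concrete; a real system would call an embedding model and an actual vector database:

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model (e.g. text-embedding-3-small):
    # a bag-of-words count vector, only so the flow is runnable here.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "Pinecone is a managed vector database.",
    "BM25 is a keyword-based ranking function.",
    "Chunking splits documents into searchable pieces.",
]
index = [(d, embed(d)) for d in docs]  # (1)-adjacent: index the documents

def retrieve(query, k=2):
    # (2) search the "vector database" for the nearest chunks
    qv = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

def build_prompt(query):
    # (3) pass the retrieved documents as context to the LLM
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# (4) this prompt would now go to the LLM (Claude, GPT, Gemini)
print(build_prompt("What is a managed vector database?"))
```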

The advantage over fine-tuning: the data lives in the vector database, so it is easy to update, and the LLM stays generic, so there is no need to retrain.

Components

Embedding model (turns text into vectors): OpenAI text-embedding-3-small, Cohere embed-v3, or Voyage AI. Vector database: Pinecone (managed), Weaviate, Chroma (free, local), pgvector (PostgreSQL extension). LLM: Claude, GPT, Gemini. Orchestration: LangChain, LlamaIndex, or custom code.

Chunking strategies

Much of RAG's quality hinges on chunking: how you split documents into searchable pieces. Fixed-size: e.g. 500 tokens per chunk; simple. Semantic: split by section or paragraph, respecting context. Hierarchical: multiple levels (paragraph, page, document), retrieving at whichever level fits the query.
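The two simplest strategies can be sketched in a few lines. A caveat on the sketch: "tokens" here are whitespace-split words, not real tokenizer tokens, and the size/overlap values are illustrative defaults, not tuned numbers:

```python
def chunk_fixed(text, size=500, overlap=50):
    # Fixed-size chunking with overlap, so sentences straddling a
    # boundary still appear whole in at least one chunk.
    tokens = text.split()
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks

def chunk_semantic(text):
    # Semantic chunking at its simplest: split on blank lines,
    # keeping each paragraph intact.
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

Hierarchical chunking applies the same idea at several granularities and stores all levels in the index.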

Reranking: the secret of good RAG

Vector search returns the 10 nearest chunks by similarity, but nearest ≠ most useful. Reranking is a second pass with a smarter model (a cross-encoder) that re-orders those 10 results by actual relevance to the query. Common choices: Cohere Rerank or BAAI/bge-reranker-large.
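The reranking step itself is just a re-sort by a pairwise query-document score. In this sketch the cross-encoder is replaced by a toy term-overlap scorer so the code runs anywhere; in practice you would plug in a real model (e.g. bge-reranker-large) that scores each (query, document) pair jointly:

```python
def toy_cross_score(query, doc):
    # Stand-in for a real cross-encoder score: fraction of query
    # terms that appear in the document. Only for illustration.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def rerank(query, candidates, top_n=3, score=toy_cross_score):
    # Second pass: re-order the vector-search candidates by pairwise
    # query-document relevance and keep the best top_n.
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)[:top_n]

candidates = [
    "the cat sat on the mat",
    "vector databases store embeddings",
]
print(rerank("vector databases", candidates, top_n=1))
```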

Impact: typically improves answer quality 20-30%. It's the difference between mediocre RAG and good RAG.

Hybrid search

Vector search is great for semantics ("similar concept") but weak on exact matches: numbers, specific names, acronyms. Hybrid search = vector + BM25 (keyword). It combines the best of both worlds.
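One common way to combine the two result lists is Reciprocal Rank Fusion (RRF), which merges rankings without having to normalize their incompatible scores. A sketch with made-up document IDs:

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: each list contributes 1/(k + rank) per
    # document; documents ranked well by both lists float to the top.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]  # semantic neighbors
bm25_hits   = ["d1", "d9", "d3"]  # exact keyword matches
print(rrf([vector_hits, bm25_hits]))  # → ['d1', 'd3', 'd9', 'd7']
```

k=60 is the conventional smoothing constant from the original RRF formulation; it keeps a single #1 ranking from dominating the fusion.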

Realistic costs

For 10K documents (~1M tokens): the one-time embedding run with text-embedding-3-small ($0.02/1M tokens) costs only about $0.02, so setup is negligible. The recurring cost, Pinecone Standard tier ($70/month) plus LLM calls, lands at roughly $100-200/month depending on query volume.
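The arithmetic, spelled out. The per-query LLM cost below is an assumed illustrative figure, not a quoted price, chosen to land inside the stated range:

```python
EMBED_PRICE_PER_M = 0.02  # text-embedding-3-small, $ per 1M tokens
PINECONE_MONTHLY  = 70.0  # Pinecone Standard tier, $/month

corpus_tokens = 1_000_000  # ~10K documents
one_time_embed = corpus_tokens / 1_000_000 * EMBED_PRICE_PER_M
print(f"one-time embedding: ${one_time_embed:.2f}")  # → $0.02

queries_per_month = 30_000  # assumed traffic
cost_per_query = 0.003      # assumed avg LLM + query-embed cost
monthly = PINECONE_MONTHLY + queries_per_month * cost_per_query
print(f"monthly: ${monthly:.0f}")  # → $160
```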

Common errors

(1) Bad chunking: too small loses context, too large dilutes relevance. (2) Skipping reranking: raw vector results aren't enough. (3) Source mixing: retrieving from inconsistent sources confuses the LLM. (4) Not measuring: RAG quality is measurable with metrics like Hit Rate and MRR.
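Both metrics from point (4) are a few lines each, given per-query retrieved lists and ground-truth relevant sets:

```python
def hit_rate(results, relevant, k=5):
    # Fraction of queries with at least one relevant doc in the top-k.
    hits = sum(1 for res, rel in zip(results, relevant) if set(res[:k]) & rel)
    return hits / len(results)

def mrr(results, relevant):
    # Mean Reciprocal Rank: 1/rank of the first relevant hit,
    # averaged over queries (0 when nothing relevant is retrieved).
    total = 0.0
    for res, rel in zip(results, relevant):
        for rank, doc_id in enumerate(res, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)

results  = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]  # retrieved per query
relevant = [{"d2"}, {"d9"}]                          # ground truth per query
print(hit_rate(results, relevant))  # → 0.5
print(mrr(results, relevant))       # → 0.25
```

Run these on a held-out question set before and after each chunking or reranking change; that turns RAG tuning from guesswork into measurement.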

When NOT to use RAG

If your data fits in the context window (1M tokens with Gemini 1.5/2.0), just pass it all in. RAG is needed when the data exceeds the context window, or when latency or cost forces selective retrieval.

Conclusion

RAG remains the standard pattern for enterprise applications over private data. The fundamentals: good embeddings + smart chunking + reranking + measurement. Get those four right and you have a competitive solution at low cost.