Feb 18, 2025
Ask a raw LLM about your internal docs and it will confidently make something up. Retrieval-augmented generation (RAG) is how you fix that: embed your content, retrieve the relevant pieces at query time, and hand them to the model as grounded context. The result is answers that are accurate, current, and citable.
This guide covers the three building blocks — embeddings, vector databases, and the retrieval pattern — with the current best options for each.
| Model | Price / 1M | Standout |
|---|---|---|
| OpenAI text-embedding-3-small | ~$0.02 | Best default for ~90% of projects |
| OpenAI text-embedding-3-large | higher | Strong general retrieval |
| Voyage AI voyage-3-large | ~$0.18 | Top retrieval quality; best for code/legal/medical |
| Cohere Embed v4 | usage-based | Best multilingual |
| Google text-embedding-005 | ~$0.006 | Best value, near top quality |
from openai import OpenAI
client = OpenAI()
vec = client.embeddings.create(model="text-embedding-3-small", input="How do refunds work?").data[0].embedding
Note: MongoDB-owned Voyage shipped the Voyage 4 family (a mixture-of-experts embedding architecture) in early 2026 — worth testing if retrieval quality directly drives your UX.
| DB | ~Cost at 10M vectors | Best for |
|---|---|---|
| Pinecone Serverless | ~$70/mo | Zero-ops, scale-to-zero startups |
| Qdrant Cloud | ~$65/mo | Best price/performance, self-host option |
| Weaviate Cloud | ~$135/mo | Best hybrid search for RAG agents |
| pgvector (Postgres) | ~$45/mo | Cheapest if you already run Postgres |
# Pinecone serverless example
from pinecone import Pinecone
pc = Pinecone(api_key="YOUR_KEY")
index = pc.Index("docs")
index.upsert(vectors=[{"id": "doc1#0", "values": vec, "metadata": {"source": "refunds.md"}}])
hits = index.query(vector=question_vec, top_k=5, include_metadata=True)
def answer(question):
q_vec = embed(question)
hits = index.query(vector=q_vec, top_k=20, include_metadata=True)
# Rerank the 20 candidates down to the best 5 (e.g. Cohere Rerank)
context = rerank(question, hits)[:5]
prompt = "Answer using ONLY this context; cite sources. If unknown, say so.\n\n" + context
return llm(prompt)
Reranking is the highest-ROI upgrade in most RAG systems: retrieve broadly, then use a reranker (Cohere Rerank, Voyage) to put the truly relevant chunks first.
Find every embedding model and vector database in our AI API directory.