Building AI-Powered Apps with RAG and PgVector: A Practical Guide
A hands-on walkthrough of Retrieval-Augmented Generation (RAG) using PgVector and Node.js — from chunking strategy to production deployment.
Retrieval-Augmented Generation (RAG) is one of the most impactful AI patterns you can implement today — and it doesn’t require a PhD or a team of ML engineers. In this guide, I’ll walk you through a production-grade RAG pipeline using Node.js, TypeScript, and PgVector (PostgreSQL’s native vector extension).
What Is RAG and Why Does It Matter?
LLMs like GPT-4 are powerful but have a critical limitation: their knowledge is frozen at training time. RAG solves this by retrieving relevant context from your own data and injecting it into the prompt at inference time.
The result: an AI that answers questions about your documents, database, or knowledge base — accurately and without hallucination.
The Architecture
User Query → Embed Query → Vector Search (PgVector) → Inject Top-K Chunks → LLM → Answer
This pipeline has three moving parts:
- Ingestion — chunk documents, embed them, store in Postgres with PgVector
- Retrieval — embed the user query, do a cosine similarity search
- Generation — inject retrieved chunks into the LLM prompt
Setting Up PgVector
PgVector is a Postgres extension that adds a native vector type and efficient similarity search operators. Enable it in your database:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536), -- OpenAI text-embedding-3-small dimensions
source TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- IVFFlat index for approximate nearest-neighbor search at scale
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
Embedding and Ingestion (TypeScript)
import OpenAI from 'openai';
import { Pool } from 'pg';
const openai = new OpenAI();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
async function embedAndStore(content: string, source: string): Promise<void> {
const { data } = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: content,
});
const embedding = data[0].embedding;
await pool.query(
'INSERT INTO document_chunks (content, embedding, source) VALUES ($1, $2, $3)',
[content, JSON.stringify(embedding), source]
);
}
Retrieval
async function retrieveTopK(query: string, k = 5): Promise<string[]> {
const { data } = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: query,
});
const embedding = data[0].embedding;
const { rows } = await pool.query<{ content: string }>(
`SELECT content
FROM document_chunks
ORDER BY embedding <=> $1
LIMIT $2`,
[JSON.stringify(embedding), k]
);
return rows.map((r) => r.content);
}
Generation
async function answerWithContext(question: string): Promise<string> {
const chunks = await retrieveTopK(question);
const context = chunks.join('\n\n---\n\n');
const { choices } = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{
role: 'system',
content: `You are a helpful assistant. Answer based only on the context provided.
If the answer is not in the context, say so.`,
},
{
role: 'user',
content: `Context:\n${context}\n\nQuestion: ${question}`,
},
],
});
return choices[0].message.content ?? '';
}
Production Considerations
- Chunking strategy matters. Overlapping chunks (e.g. 512 tokens with 64-token overlap) dramatically improve retrieval quality.
- Hybrid search. Combine vector search with full-text search (
tsvector) for best-of-both results. - Metadata filtering. Store document source, date, and category alongside embeddings so you can scope retrieval.
- Caching. Cache embeddings for repeated queries to cut costs.
Conclusion
RAG is the right tool when you need an LLM to reason over private, domain-specific data. With PgVector in Postgres, you already have the infrastructure if you’re running a relational database — no separate vector database required.
If you’re building an AI feature and need a production-grade implementation, get in touch.