Retrieval-Augmented Generation (RAG) is one of the most impactful AI patterns you can implement today — and it doesn’t require a PhD or a team of ML engineers. In this guide, I’ll walk you through a production-grade RAG pipeline using Node.js, TypeScript, and PgVector (PostgreSQL’s native vector extension).

What Is RAG and Why Does It Matter?

LLMs like GPT-4 are powerful but have a critical limitation: their knowledge is frozen at training time. RAG solves this by retrieving relevant context from your own data and injecting it into the prompt at inference time.

The result: an AI that answers questions about your documents, database, or knowledge base — accurately and without hallucination.

The Architecture

User Query → Embed Query → Vector Search (PgVector) → Inject Top-K Chunks → LLM → Answer

This pipeline has three moving parts:

Ingestion — chunk documents, embed them, store in Postgres with PgVector
Retrieval — embed the user query, do a cosine similarity search
Generation — inject retrieved chunks into the LLM prompt

Setting Up PgVector

PgVector is a Postgres extension that adds a native vector type and efficient similarity search operators. Enable it in your database:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
  id SERIAL PRIMARY KEY,
  content TEXT NOT NULL,
  embedding vector(1536),  -- OpenAI text-embedding-3-small dimensions
  source TEXT,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- IVFFlat index for approximate nearest-neighbor search at scale
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

Embedding and Ingestion (TypeScript)

import OpenAI from 'openai';
import { Pool } from 'pg';

const openai = new OpenAI();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function embedAndStore(content: string, source: string): Promise<void> {
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: content,
  });

  const embedding = data[0].embedding;

  await pool.query(
    'INSERT INTO document_chunks (content, embedding, source) VALUES ($1, $2, $3)',
    [content, JSON.stringify(embedding), source]
  );
}

Retrieval

async function retrieveTopK(query: string, k = 5): Promise<string[]> {
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });

  const embedding = data[0].embedding;

  const { rows } = await pool.query<{ content: string }>(
    `SELECT content
     FROM document_chunks
     ORDER BY embedding <=> $1
     LIMIT $2`,
    [JSON.stringify(embedding), k]
  );

  return rows.map((r) => r.content);
}

Generation

async function answerWithContext(question: string): Promise<string> {
  const chunks = await retrieveTopK(question);
  const context = chunks.join('\n\n---\n\n');

  const { choices } = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `You are a helpful assistant. Answer based only on the context provided.
                  If the answer is not in the context, say so.`,
      },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${question}`,
      },
    ],
  });

  return choices[0].message.content ?? '';
}

Production Considerations

Chunking strategy matters. Overlapping chunks (e.g. 512 tokens with 64-token overlap) dramatically improve retrieval quality.
Hybrid search. Combine vector search with full-text search (tsvector) for best-of-both results.
Metadata filtering. Store document source, date, and category alongside embeddings so you can scope retrieval.
Caching. Cache embeddings for repeated queries to cut costs.

Conclusion

RAG is the right tool when you need an LLM to reason over private, domain-specific data. With PgVector in Postgres, you already have the infrastructure if you’re running a relational database — no separate vector database required.

If you’re building an AI feature and need a production-grade implementation, get in touch.