Building a Knowledge Base Chat with Supabase and Claude
A complete walkthrough of building a RAG-based knowledge base chat: pgvector schema, embedding models, the retrieval function, chunking strategy, and the Claude prompt pattern — including the gotchas that cause silent failures.
The basic idea is simple: store your documents in Supabase, embed them as vectors, search for the relevant chunks when a user asks a question, and pass those chunks to Claude to generate an answer. The implementation has a handful of non-obvious decisions that determine whether the system actually works in production.
This is the full setup — schema, embeddings, retrieval function, and the Claude prompt pattern — with the gotchas that cost the most time.
When to Use This vs. Full-Context Injection#
If your knowledge base is small and curated — say, under 100 documents — you can often skip vector search entirely and inject all of it directly into the system prompt. Simpler, no embedding costs, and retrieval can't fail because there's no retrieval step.
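A minimal sketch of that shape, assuming the documents are already loaded in memory (the docs array, prompt wording, and model name are illustrative):

// Full-context variant: no embeddings, no retrieval step — everything goes in the system prompt.
async function answerFromFullContext(
  question: string,
  docs: { title: string; content: string }[], // illustrative shape
): Promise<string> {
  const context = docs
    .map((d) => `### ${d.title}\n\n${d.content}`)
    .join('\n\n---\n\n');

  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514', // illustrative model name
    max_tokens: 1024,
    system: 'Answer using only the documents below.\n\n' + context,
    messages: [{ role: 'user', content: question }],
  });

  return response.content[0].type === 'text' ? response.content[0].text : '';
}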
Vector search becomes worth it when the knowledge base is large enough that injecting everything would overflow the context window or produce worse answers from noise. The crossover is roughly when your documents stop fitting comfortably in 50–100k tokens. Below that threshold, consider whether RAG complexity is actually necessary.
This post covers the RAG approach for when you genuinely need it.
Schema Setup#
Enable pgvector and create your documents table. Use halfvec instead of vector — it stores embeddings as 16-bit floats rather than 32-bit, cutting storage in half with negligible quality loss, and it unlocks HNSW indexing for models with more than 2000 dimensions.
create extension if not exists vector with schema extensions;
create table documents (
  id bigint primary key generated always as identity,
  title text not null,
  content text not null,
  embedding extensions.halfvec(1536),
  metadata jsonb,
  created_at timestamptz default now()
);
create index on documents
using hnsw (embedding extensions.halfvec_cosine_ops);
The HNSW index is the right default — unlike IVFFlat, it updates itself on write so you don't need to rebuild the index as you add documents. IVFFlat requires the table to have data before you create the index (its clusters are derived from the existing data distribution); build it on an empty table and the index is useless.
Choosing an Embedding Model#
For most use cases, OpenAI's text-embedding-3-small at 1536 dimensions is the practical default: good quality, low cost, and straightforward to integrate. If accuracy is the priority, Voyage AI's voyage-3.5-lite consistently outperforms the OpenAI models on retrieval benchmarks and is the model Anthropic uses in their own RAG cookbook.
The dimension you embed at must match the column type exactly. There is no implicit truncation — a 3072-dim vector inserted into a halfvec(1536) column fails with a type error. If you ever switch models, you need to alter the column, re-embed every row, and rebuild the index.
OpenAI supports Matryoshka truncation, which lets you request a smaller dimension at embedding time and retain most of the quality:
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: text,
  dimensions: 512, // trade some quality for storage savings
});
At 512 dimensions the model retains roughly 95% of full-dimension retrieval quality while using a third of the storage. Match whatever dimensions you pass here to your Postgres column size.
The Retrieval Function#
PostgREST does not expose pgvector's distance operators directly. You need a Postgres function and call it via .rpc():
create or replace function match_documents (
  query_embedding extensions.halfvec(1536),
  match_threshold float,
  match_count int
)
returns table (
  id bigint,
  title text,
  content text,
  similarity float
)
language sql stable
as $$
  select
    documents.id,
    documents.title,
    documents.content,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by documents.embedding <=> query_embedding asc
  limit match_count;
$$;
Two things to get right here. The <=> operator returns cosine distance (0 = identical), so 1 - distance gives cosine similarity (1 = identical) — that's what you compare against match_threshold. But the ORDER BY clause must use the raw distance, not the derived similarity, so the query planner uses the HNSW index. Using ORDER BY similarity DESC produces a sequential scan with no error message — just silent, full-table performance.
Calling it from TypeScript:
const { data: chunks } = await supabase.rpc('match_documents', {
  query_embedding: queryEmbedding, // number[]
  match_threshold: 0.78, // tune between 0.75–0.85
  match_count: 8,
});
Chunking Documents#
Before embedding, split documents into chunks. The practical default: 512 tokens per chunk, 64-token overlap between adjacent chunks. Overlap preserves context across chunk boundaries — without it, sentences that span a split can lose meaning in both halves.
Chunk too large and the embedding averages too many ideas, making it hard for any single query to score well. Chunk too small and the retrieved context is too fragmentary to be useful. 512 tokens is a reasonable starting point; tune based on your actual documents.
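The splitIntoChunks helper used in the ingest code below isn't a library function. A minimal sketch, assuming a rough four-characters-per-token estimate instead of a real tokenizer:

// Sketch only: approximates tokens as ~4 characters and slides a fixed window
// with overlap. A real implementation would count tokens with a tokenizer
// (e.g. tiktoken) and prefer splitting on paragraph or sentence boundaries.
function splitIntoChunks(
  text: string,
  opts: { maxTokens: number; overlap: number },
): string[] {
  const charsPerToken = 4;
  const chunkSize = opts.maxTokens * charsPerToken;
  const step = (opts.maxTokens - opts.overlap) * charsPerToken;

  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}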
Prepend the document title to each chunk before embedding — it anchors the embedding to the right topic:
const textToEmbed = `# ${document.title}\n\n${chunk}`;
Writing to the Database#
At ingest time: chunk, embed, insert.
async function ingestDocument(title: string, content: string) {
  const chunks = splitIntoChunks(content, { maxTokens: 512, overlap: 64 });

  for (const chunk of chunks) {
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: `# ${title}\n\n${chunk}`,
    });

    await supabase.from('documents').insert({
      title,
      content: chunk,
      embedding: response.data[0].embedding,
    });
  }
}
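The per-chunk loop is the simplest thing that works. Because the embeddings endpoint accepts an array of inputs, a batched variant (a sketch, using the same splitIntoChunks helper) cuts this to one embedding call and one insert per document:

async function ingestDocumentBatched(title: string, content: string) {
  const chunks = splitIntoChunks(content, { maxTokens: 512, overlap: 64 });

  // One embeddings request for all chunks; results come back in input order.
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks.map((chunk) => `# ${title}\n\n${chunk}`),
  });

  // One bulk insert instead of one per chunk.
  await supabase.from('documents').insert(
    chunks.map((chunk, i) => ({
      title,
      content: chunk,
      embedding: response.data[i].embedding,
    })),
  );
}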
If you update document content, re-generate the embedding — Supabase won't do it automatically unless you set up the pgmq trigger pipeline. Stale embeddings return stale results silently.
Querying and Calling Claude#
At query time: embed the user's question, retrieve the relevant chunks, build a prompt, call Claude.
async function answer(question: string): Promise<string> {
  // 1. Embed the question
  const embeddingRes = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });
  const queryEmbedding = embeddingRes.data[0].embedding;

  // 2. Retrieve matching chunks
  const { data: chunks } = await supabase.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_threshold: 0.78,
    match_count: 8,
  });
  if (!chunks?.length) {
    return "I couldn't find relevant information for that question.";
  }

  // 3. Build context
  const context = chunks
    .map((c) => `### ${c.title}\n\n${c.content}`)
    .join('\n\n---\n\n');
  // 4. Call Claude — the context goes in the system prompt, the question in the
  // user turn. Model name and prompt wording here are illustrative.
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system:
      'Answer the question using only the context below. If the context does not contain the answer, say so.\n\n' +
      context,
    messages: [
      {
        role: 'user',
        content: question,
      },
    ],
  });

  return response.content[0].type === 'text' ? response.content[0].text : '';
}
Place the context before the question in the prompt. For large context blocks (20k+ tokens), Anthropic's benchmarks show up to 30% better response quality when documents come first and the question comes last.
The Gotchas That Cost Time#
Dimension mismatch surfaces at query time, not setup time. Nothing in your TypeScript flags an embedding of the wrong size — you only find out when the insert or the retrieval call throws at runtime, long after the schema looked fine. Lock the dimension in a constant and reuse it everywhere.
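One way to lock it down, sketched below (the constant name is illustrative):

// Single source of truth — must match halfvec(1536) in the schema
// and the signature of match_documents.
const EMBEDDING_DIMENSIONS = 1536;

const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: text,
  dimensions: EMBEDDING_DIMENSIONS,
});

const embedding = response.data[0].embedding;
if (embedding.length !== EMBEDDING_DIMENSIONS) {
  throw new Error(`expected ${EMBEDDING_DIMENSIONS} dims, got ${embedding.length}`);
}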
ORDER BY similarity breaks index usage. As mentioned: always order by the raw distance expression, not the derived similarity column. Check with EXPLAIN ANALYZE if you suspect a sequential scan.
Context window creep. Eight retrieved chunks at 512 tokens each is ~4k tokens before you add the system prompt and conversation history. That's fine, but verify your actual usage — verbose documents or high match_count values can quietly push you into expensive territory. Pre-filter with match_threshold to cut low-relevance results before they reach the prompt.
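A cheap way to check, sketched with a rough four-characters-per-token estimate (the budget value is arbitrary):

// Keep retrieved context under a token budget before building the prompt.
const CONTEXT_TOKEN_BUDGET = 6000; // illustrative ceiling — tune to your model and prompt
const estimateTokens = (s: string) => Math.ceil(s.length / 4); // rough ~4 chars/token

let used = 0;
const kept: string[] = [];
for (const c of chunks) {
  const cost = estimateTokens(c.content);
  if (used + cost > CONTEXT_TOKEN_BUDGET) break; // chunks arrive sorted by similarity
  kept.push(`### ${c.title}\n\n${c.content}`);
  used += cost;
}
const context = kept.join('\n\n---\n\n');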
IVFFlat on an empty table. If you chose IVFFlat instead of HNSW, build the index only after loading data. An empty-table IVFFlat index has no clusters and produces useless results with no warning.
Stale embeddings after content updates. There is no automatic re-embedding unless you wire up the pgmq trigger pipeline. A simple safeguard: add an embedding_updated_at column and run a periodic check for rows where updated_at > embedding_updated_at.
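One way to run that sweep, as a sketch — it assumes a hypothetical stale_documents Postgres function returning rows where updated_at > embedding_updated_at, since PostgREST filters can't compare two columns directly:

// stale_documents is hypothetical — a small SQL function you would add yourself.
const { data: stale } = await supabase.rpc('stale_documents');

for (const row of stale ?? []) {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: `# ${row.title}\n\n${row.content}`,
  });

  await supabase
    .from('documents')
    .update({
      embedding: res.data[0].embedding,
      embedding_updated_at: new Date().toISOString(),
    })
    .eq('id', row.id);
}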