Why Add Vector Search to PostgreSQL?
Vector similarity search — finding the most semantically similar items to a query embedding — is the engine behind semantic search, recommendation systems, duplicate detection, and most RAG pipelines. The canonical advice used to be: add a dedicated vector database. But most teams already run PostgreSQL and the operational overhead of a second database for embeddings — schema sync, separate backups, another connection pool, one more service to monitor — adds up quickly.
pgvector is an open-source PostgreSQL extension that adds a vector column type and ANN (approximate nearest neighbour) index methods — IVFFlat and HNSW — directly to your existing Postgres instance. Embeddings live in the same transaction scope as the rest of your data. You get ACID consistency, standard SQL JOINs across structured and vector data, existing RLS policies, and row-level security all working on a single table — without any serialisation overhead.
This guide covers pgvector from initial setup through production-grade indexing, hybrid search, batched ingestion, and connection pooling. For a comparison of pgvector against dedicated vector databases, see our guide on vector databases compared — pgvector vs Qdrant vs Weaviate vs Pinecone.
Operational simplicity
Embeddings in Postgres — same backups, same RLS, same connection pooling as your OLTP data.
SQL JOIN power
JOIN nearest neighbours with structured tables, apply WHERE filters, use subqueries — standard SQL, no custom query DSL.
Scale ceiling
Up to ~100 million vectors on a large instance. Beyond that, consider Qdrant or a managed service like Supabase Vector.
Installation and Setup
pgvector ships as a standard PostgreSQL extension. On Ubuntu/Debian it is packaged as postgresql-16-pgvector. On managed services — AWS RDS, Supabase, Neon, Timescale — it is available without compilation. Enable it per database with a single SQL command.
-- Ubuntu / Debian (PostgreSQL 16)
sudo apt install -y postgresql-16-pgvector
-- macOS (Homebrew)
brew install pgvector
-- Compile from source (any Postgres version)
git clone https://github.com/pgvector/pgvector.git
cd pgvector && make && sudo make install
-- Enable the extension in your database
CREATE EXTENSION IF NOT EXISTS vector;
-- Verify
SELECT extversion FROM pg_extension WHERE extname = 'vector';
-- → 0.8.0 (or current version)With the extension enabled, create a table with a vector(N) column where N is the embedding dimensionality. OpenAI text-embedding-3-small uses 1536 dimensions; text-embedding-3-large supports 3072 (or a smaller truncated dimension). sentence-transformers models like all-MiniLM-L6-v2 use 384.
-- Documents table with pgvector embedding column
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
metadata JSONB DEFAULT '{}',
embedding vector(1536), -- OpenAI text-embedding-3-small
created_at TIMESTAMPTZ DEFAULT NOW(),
tenant_id UUID NOT NULL -- for multi-tenant RLS
);
-- Full-text search index (used for hybrid search later)
CREATE INDEX ON documents USING gin(to_tsvector('english', content));
-- Metadata index for filtered ANN queries
CREATE INDEX ON documents USING gin(metadata jsonb_path_ops);Generating Embeddings and Batched Insertion
The embedding generation step runs outside PostgreSQL — in your application or a pipeline worker. The key performance lever is batch size: embedding APIs (OpenAI, Cohere, sentence-transformers) process lists of texts significantly faster per item than single requests. Insert in batches using executemany or copy_records_to_table to avoid per-row round-trips.
import openai
import psycopg2
from psycopg2.extras import execute_values
from pgvector.psycopg2 import register_vector
import numpy as np
openai_client = openai.OpenAI()
def get_embeddings(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
"""Batch embed up to 2048 texts per API call."""
response = openai_client.embeddings.create(input=texts, model=model)
return [item.embedding for item in response.data]
def upsert_documents(conn, docs: list[dict]) -> None:
"""
Batch upsert documents with embeddings.
docs = [{"id": ..., "content": ..., "metadata": ..., "tenant_id": ...}]
"""
register_vector(conn)
# Generate embeddings in one batched API call
texts = [d["content"] for d in docs]
vectors = get_embeddings(texts)
rows = [
(
d["id"],
d["content"],
d.get("metadata", {}),
np.array(v, dtype=np.float32),
d["tenant_id"],
)
for d, v in zip(docs, vectors)
]
with conn.cursor() as cur:
execute_values(
cur,
"""
INSERT INTO documents (id, content, metadata, embedding, tenant_id)
VALUES %s
ON CONFLICT (id) DO UPDATE
SET content = EXCLUDED.content,
metadata = EXCLUDED.metadata,
embedding = EXCLUDED.embedding
""",
rows,
template="(%s, %s, %s, %s, %s)",
page_size=500, -- commit every 500 rows
)
conn.commit()
# Usage: process a corpus in chunks of 500
BATCH_SIZE = 500
with psycopg2.connect("postgresql://user:pass@localhost/mydb") as conn:
for i in range(0, len(all_docs), BATCH_SIZE):
batch = all_docs[i : i + BATCH_SIZE]
upsert_documents(conn, batch)
print(f"Upserted {min(i + BATCH_SIZE, len(all_docs))} / {len(all_docs)}")Note
IVFFlat vs HNSW: Choosing the Right ANN Index
pgvector ships two ANN index types with fundamentally different trade-offs. Understanding them is the single most important performance decision you will make.
| Property | IVFFlat | HNSW |
|---|---|---|
| Build time | Fast (minutes for 1M vectors) | Slow (hours for 1M at m=16) |
| Build memory | Low — proportional to list count | High — entire graph in memory during build |
| Query speed | Good (tune probes) | Excellent — faster at same recall |
| Recall at k=10 | ~95% with probes=10 | ~99% with ef=64 |
| Incremental inserts | Degrades over time — needs REINDEX | Handles inserts well without rebuild |
| Memory footprint | Lower | Higher (graph edges stored) |
| Best for | Batch-built static corpora, RAM-constrained hosts | Production with continuous inserts, low-latency queries |
For new production deployments start with HNSW. The build time cost is a one-time investment; the ongoing query latency and recall benefits compound on every search.
-- ── HNSW index (recommended for production) ─────────────────────────
-- m: number of connections per node (16 is the default, 8–48 useful range)
-- ef_construction: beam width during index build — higher = better recall, slower build
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Per-query: ef controls the search beam width (overrides index default at query time)
SET hnsw.ef_search = 100; -- higher = better recall, slower query
-- ── IVFFlat index (static corpora or low-memory hosts) ───────────────
-- lists: number of inverted lists (√n_rows is a good starting point)
-- For 1 million rows: lists = 1000
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
-- Per-query: probes controls how many lists to scan
SET ivfflat.probes = 10; -- 1 (fastest, lowest recall) to lists (exact scan)
-- ── Distance operators ────────────────────────────────────────────────
-- vector_cosine_ops → <=> operator (cosine distance, 1 - cosine similarity)
-- vector_l2_ops → <-> operator (Euclidean / L2 distance)
-- vector_ip_ops → <#> operator (negative inner product — for dot-product models)
-- Use vector_cosine_ops for normalised embeddings (OpenAI, sentence-transformers)
-- Use vector_ip_ops for inner-product models (some CLIP variants, PaLM)
-- Use vector_l2_ops when your model training used L2 distance explicitlyNote
maintenance_work_mem = '4GB' (or higher) before building an HNSW index on large tables. pgvector builds the graph entirely in memory — if maintenance_work_mem is too small, the index build spills to disk and takes 10–50× longer. Run SET maintenance_work_mem = '8GB' in your session before CREATE INDEX.Similarity Queries: k-NN, Filtered Search, and Distance Thresholds
pgvector adds distance operators directly to SQL. The query structure is a standard ORDER BY embedding <=> $query_vector LIMIT k. You can combine it with any SQL WHERE clause, JOIN, or subquery — the ANN index is used when the planner recognises the distance operator in the ORDER BY and no prohibitive filter is applied first.
-- Basic k-NN query: find 10 most similar documents
SELECT
id,
content,
metadata,
1 - (embedding <=> ${query_vector}::vector) AS cosine_similarity
FROM documents
ORDER BY embedding <=> ${query_vector}::vector
LIMIT 10;
-- Filtered k-NN: only search within a specific tenant
-- HNSW/IVFFlat index is still used — the index scan filters rows post-retrieval
SELECT id, content, 1 - (embedding <=> ${query_vector}::vector) AS score
FROM documents
WHERE tenant_id = ${tenant_uuid}
ORDER BY embedding <=> ${query_vector}::vector
LIMIT 10;
-- Distance threshold: only return results with similarity > 0.80
SELECT id, content, 1 - (embedding <=> ${query_vector}::vector) AS score
FROM documents
WHERE 1 - (embedding <=> ${query_vector}::vector) > 0.80
ORDER BY embedding <=> ${query_vector}::vector
LIMIT 20;
-- JOIN vector results with structured data in one query
SELECT
d.id,
d.content,
u.display_name AS author,
p.name AS project_name,
1 - (d.embedding <=> ${query_vector}::vector) AS score
FROM documents d
JOIN users u ON u.id = d.author_id
JOIN projects p ON p.id = d.project_id
WHERE p.workspace_id = ${workspace_id}
AND d.created_at > NOW() - INTERVAL '90 days'
ORDER BY d.embedding <=> ${query_vector}::vector
LIMIT 10;Filtered queries are the trickiest performance case. When the WHERE clause is highly selective (returning <1% of rows), Postgres may choose a sequential scan over the ANN index — the planner estimates that the index overhead outweighs the savings. Force the index with SET enable_seqscan = off in development to verify the index is usable, then consider a partial index scoped to the most common filter.
-- Partial HNSW index scoped to a specific workspace
-- Only vectors for this workspace are indexed — smaller, faster build
CREATE INDEX idx_docs_hnsw_ws_abc ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
WHERE workspace_id = 'ws-abc-123';
-- The partial index is used when the WHERE clause exactly matches
SELECT id, content, 1 - (embedding <=> ${v}::vector) AS score
FROM documents
WHERE workspace_id = 'ws-abc-123' -- matches partial index predicate
ORDER BY embedding <=> ${v}::vector
LIMIT 10;
-- For dynamic tenant filtering, use a composite index approach:
-- Store tenant_id as a prefix in the embedding space is NOT recommended.
-- Instead: if you have O(10) large tenants, create one partial index per tenant.
-- For O(100+) tenants, rely on the full HNSW index with WHERE filtering
-- and accept the small recall degradation from post-filter ANN results.Hybrid Search: Vector + BM25 Full-Text with RRF Fusion
Pure vector search works well for semantic similarity but struggles with keyword precision — searching for "error code 503" or a specific product SKU where exact token matching matters. Hybrid search combines vector similarity with BM25-style full-text search scores using Reciprocal Rank Fusion (RRF), which re-ranks documents based on their position in each result list without needing to normalise the raw scores to a common scale.
This is the retrieval pattern used by most production RAG pipelines — vector search for semantic recall, full-text for keyword precision, RRF to merge the lists.
-- Hybrid search with Reciprocal Rank Fusion (RRF)
-- k=60 is the standard RRF constant (balances the two lists)
WITH vector_results AS (
SELECT
id,
ROW_NUMBER() OVER (ORDER BY embedding <=> ${query_vector}::vector) AS rn
FROM documents
WHERE tenant_id = ${tenant_id}
ORDER BY embedding <=> ${query_vector}::vector
LIMIT 40
),
fts_results AS (
SELECT
id,
ROW_NUMBER() OVER (
ORDER BY ts_rank_cd(
to_tsvector('english', content),
plainto_tsquery('english', ${query_text})
) DESC
) AS rn
FROM documents
WHERE
tenant_id = ${tenant_id}
AND to_tsvector('english', content) @@ plainto_tsquery('english', ${query_text})
LIMIT 40
),
rrf AS (
SELECT
COALESCE(v.id, f.id) AS id,
COALESCE(1.0 / (60 + v.rn), 0) + COALESCE(1.0 / (60 + f.rn), 0) AS rrf_score
FROM vector_results v
FULL OUTER JOIN fts_results f USING (id)
)
SELECT
d.id,
d.content,
r.rrf_score
FROM rrf r
JOIN documents d USING (id)
ORDER BY r.rrf_score DESC
LIMIT 10;Note
k=60 (the constant in the denominator). If you want to weight vector similarity higher than full-text, multiply the vector RRF score by a weight factor (e.g. 2.0 * COALESCE(1.0 / (60 + v.rn), 0)). Tune this on a golden evaluation set — 60/40 vector/text is a common starting point for document retrieval.Python Query Interface: psycopg2, asyncpg, and SQLAlchemy
The pgvector-python package provides adapters for psycopg2, asyncpg, and SQLAlchemy that handle serialisation of NumPy arrays and lists into the PostgreSQL wire format. Without the adapter, you must cast vectors manually as SQL string literals — fragile and slow.
# pip install pgvector psycopg2-binary asyncpg sqlalchemy
# ── psycopg2 (synchronous) ────────────────────────────────────────────
import psycopg2
import numpy as np
from pgvector.psycopg2 import register_vector
conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
register_vector(conn)
query_vec = np.array(get_embedding("how to reset my password"), dtype=np.float32)
with conn.cursor() as cur:
cur.execute(
"""
SELECT id, content, 1 - (embedding <=> %s) AS score
FROM documents
WHERE tenant_id = %s
ORDER BY embedding <=> %s
LIMIT 5
""",
(query_vec, tenant_id, query_vec),
)
rows = cur.fetchall()
# ── asyncpg (async / FastAPI) ────────────────────────────────────────
import asyncpg
from pgvector.asyncpg import register_vector
async def search(pool: asyncpg.Pool, query_text: str, tenant_id: str):
query_vec = await embed_async(query_text) # async embedding call
async with pool.acquire() as conn:
await register_vector(conn)
return await conn.fetch(
"""
SELECT id, content, 1 - (embedding <=> $1) AS score
FROM documents
WHERE tenant_id = $2
ORDER BY embedding <=> $1
LIMIT 10
""",
query_vec,
tenant_id,
)
# ── SQLAlchemy ORM (pgvector 0.3+) ────────────────────────────────────
from sqlalchemy import Column, Text, String
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import DeclarativeBase
from pgvector.sqlalchemy import Vector
class Base(DeclarativeBase):
pass
class Document(Base):
__tablename__ = "documents"
id = Column(String, primary_key=True)
content = Column(Text, nullable=False)
metadata = Column(JSONB)
embedding = Column(Vector(1536))
# Query using the ORM cosine distance method
results = (
session.query(Document)
.order_by(Document.embedding.cosine_distance(query_vec))
.limit(10)
.all()
)Connection Pooling with PgBouncer for High-Throughput Search
Each Postgres backend process consumes ~5 MB of memory. At 100 concurrent application connections that is 500 MB just for connection overhead before any query executes. PgBouncer in transaction pooling mode multiplexes hundreds of application connections onto a small pool of Postgres backends, dramatically reducing memory pressure and connection overhead.
There is one critical constraint: pgvector's per-session settings (hnsw.ef_search, ivfflat.probes) are session-scoped. With transaction pooling, a SET command applies to the backend connection, not your application session — a different backend may serve the next transaction with the default setting. Always pass these as inline SET LOCAL within a transaction, or configure them as database-level defaults in postgresql.conf.
# pgbouncer.ini — transaction pooling for pgvector workloads
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20 # 20 Postgres backends serve 1000 clients
min_pool_size = 5
reserve_pool_size = 5
server_idle_timeout = 60
log_connections = 0 # disable in production (noisy)-- Always use SET LOCAL inside a transaction to avoid leaking settings
-- across PgBouncer transaction-pool connections
BEGIN;
SET LOCAL hnsw.ef_search = 100; -- applies to this transaction only
SELECT id, content, 1 - (embedding <=> ${query_vector}::vector) AS score
FROM documents
WHERE tenant_id = ${tenant_id}
ORDER BY embedding <=> ${query_vector}::vector
LIMIT 10;
COMMIT;
-- Or set database-level defaults to avoid per-query SET overhead:
ALTER DATABASE mydb SET hnsw.ef_search = 80;
ALTER DATABASE mydb SET ivfflat.probes = 10;RAG Integration: Full Retrieval Pipeline
pgvector slots into a RAG pipeline as the retrieval layer. The application embeds the user query, fetches top-k chunks from Postgres, and passes them as context to the generation model. Because the chunks are in the same Postgres database as your application data, you can filter by user permissions, project scope, or recency using ordinary SQL predicates — no synchronisation between two systems. See our deeper guide on vector databases in production for RAG at scale for architectural patterns beyond single-database deployments.
import openai
import psycopg2
import numpy as np
from pgvector.psycopg2 import register_vector
openai_client = openai.OpenAI()
def retrieve_context(
conn,
question: str,
tenant_id: str,
k: int = 5,
min_score: float = 0.75,
) -> list[dict]:
"""Embed question and fetch top-k chunks from Postgres."""
register_vector(conn)
# 1. Embed the query
resp = openai_client.embeddings.create(
model="text-embedding-3-small",
input=question,
)
query_vec = np.array(resp.data[0].embedding, dtype=np.float32)
# 2. Retrieve from pgvector with cosine similarity threshold
with conn.cursor() as cur:
cur.execute(
"""
SELECT
id,
content,
metadata,
1 - (embedding <=> %s) AS score
FROM documents
WHERE
tenant_id = %s
AND 1 - (embedding <=> %s) > %s
ORDER BY embedding <=> %s
LIMIT %s
""",
(query_vec, tenant_id, query_vec, min_score, query_vec, k),
)
rows = cur.fetchall()
return [
{"id": r[0], "content": r[1], "metadata": r[2], "score": float(r[3])}
for r in rows
]
def answer_with_rag(conn, question: str, tenant_id: str) -> str:
"""Retrieve context from Postgres and generate an answer."""
chunks = retrieve_context(conn, question, tenant_id)
context = "
---
".join(c["content"] for c in chunks)
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": "Answer the question using only the context provided. "
"If the context is insufficient, say so.",
},
{
"role": "user",
"content": f"Context:
{context}
Question: {question}",
},
],
temperature=0,
)
return response.choices[0].message.contentIndex Maintenance: VACUUM, REINDEX, and Index Health
Vector indexes degrade over time as rows are inserted, updated, and deleted. HNSW is more resilient than IVFFlat to incremental changes — the graph reconnects around deleted nodes. However both index types accumulate dead tuples that increase query scan time until VACUUM reclaims them. Monitor index bloat with the pgstattuple extension and schedule REINDEX CONCURRENTLY during low-traffic windows if dead tuple ratio exceeds 20%.
-- Check dead tuple ratio for the documents table
SELECT
n_live_tup,
n_dead_tup,
ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'documents';
-- Check HNSW index size and entry count
SELECT
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE tablename = 'documents';
-- Rebuild the HNSW index online (does not block reads/writes)
REINDEX INDEX CONCURRENTLY idx_documents_hnsw;
-- Tune autovacuum to be more aggressive for high-write vector tables
ALTER TABLE documents SET (
autovacuum_vacuum_scale_factor = 0.01, -- vacuum after 1% dead tuples
autovacuum_analyze_scale_factor = 0.005
);pgvector vs Dedicated Vector Databases: When to Choose Each
pgvector is the right choice for most teams under 50 million vectors who already run PostgreSQL. The operational simplicity and ACID consistency outweigh the query performance gap with dedicated systems in the vast majority of production RAG and semantic search use cases.
| Criteria | pgvector | Dedicated (Qdrant / Pinecone) |
|---|---|---|
| Vector count | Up to ~100M on a large Postgres instance | Billions with horizontal sharding |
| Query latency (p99) | 5–30 ms with HNSW on hot index | 1–5 ms for Qdrant/Pinecone optimised clusters |
| Operational overhead | Zero — extends existing Postgres | New service, new backups, new monitoring |
| Metadata filtering | Full SQL — JOINs, subqueries, complex WHERE | Custom filter DSL, usually less expressive |
| Consistency with app data | ACID — embeddings and metadata in one transaction | Eventual — separate write paths, sync required |
| Multi-tenancy | Row-Level Security — native Postgres RLS | Collection-per-tenant or namespace separation |
| Cost at scale | Amortised — existing Postgres bill | Additional service cost, can be significant |
Migrate to a dedicated system when: query p99 exceeds your SLA at current vector count, you need multi-region replication of embeddings independently of OLTP data, or you exceed 200M vectors and Postgres memory cannot hold the HNSW index in shared_buffers.
pgvector Production Checklist
Use HNSW for production — IVFFlat only if building from a static corpus on a RAM-constrained host
Set maintenance_work_mem to at least 4 GB before CREATE INDEX on large tables to avoid disk-spill during HNSW build
Use SET LOCAL hnsw.ef_search inside transactions when behind PgBouncer transaction-pool mode
Batch embedding inserts with execute_values and page_size=500 — never insert one row at a time
Add a GIN full-text index on the content column for hybrid RRF search
Monitor dead tuple ratio on the documents table — autovacuum the vector table more aggressively than OLTP tables
Set shared_buffers high enough to hold the entire HNSW index in cache — cold-cache queries are 5-10× slower
Use partial HNSW indexes scoped to large tenants when per-tenant filtered queries dominate your query mix
Store embeddings as vector(N) not JSONB or float[] — pgvector's native type uses half the storage and enables ANN indexes
Test recall: run your k-NN queries with enable_seqscan=off and compare results with exact search to verify HNSW recall meets your requirement
Adding semantic search or RAG retrieval to your PostgreSQL application and unsure whether to add pgvector or spin up a dedicated vector database?
We design and implement pgvector-based semantic search and RAG systems — from HNSW index configuration and embedding ingestion pipeline design with batched upserts to hybrid vector plus full-text RRF search, partial index strategies for multi-tenant filtering, PgBouncer connection pool setup with per-transaction ANN parameters, RAG retrieval pipeline integration with OpenAI or open-source embedding models, and decision frameworks for migrating from pgvector to dedicated vector databases when scale demands it. Let’s talk.
Let's Talk