Databases · Migrations · PostgreSQL · Zero Downtime · DevOps · Backend

Zero-Downtime Database Migrations: Expand-Contract, Backfills, and Online Schema Change

A practical guide to evolving production database schemas without downtime: the expand-contract pattern, batched backfills, non-blocking DDL in PostgreSQL, shadow-table tools like gh-ost and pg_repack, feature-flag cutovers, and Flyway/Liquibase migration pipelines in CI/CD.

2026-04-20

The Maintenance Window Trap

Every team hits the same inflection point: a table that started at a few thousand rows now holds 200 million, and the product team wants to rename a column, add a NOT NULL constraint, or backfill a new field. The naive approach is to schedule a maintenance window, run ALTER TABLE, wait for the lock, and hope the migration finishes before the SLA expires.

This approach does not scale. On a 200-million-row PostgreSQL table, a blocking ALTER TABLE ADD COLUMN NOT NULL DEFAULT (before version 11, or with a volatile default on any version) can run for hours — holding an ACCESS EXCLUSIVE lock that blocks every read and write on the table. On MySQL, many ALTER TABLE operations still rebuild the entire table in place, causing the same extended lock. Neither is acceptable for systems that are expected to stay up.

Zero-downtime migrations are not a single technique — they are a combination of patterns: the expand-contract workflow for schema transitions, shadow tables and online schema-change tools for structural rewrites, non-blocking DDL operations for index creation, and feature flags to decouple code deployment from data cutover. This article walks through each one with production-grade examples.

The Expand-Contract Pattern

The expand-contract pattern (also called parallel change) is the foundational approach to zero-downtime schema evolution. Instead of atomically changing a schema in one migration, you split the change into three phases that each deploy independently — with no downtime required at any step.

Phase 1: Expand — Add the New State

Add the new column, table, or index without removing anything old. The schema now contains both the old and new representations. Application code writes to both (dual-write) but reads from the old. No existing queries break.

Phase 2: Migrate — Backfill and Switch Reads

Run a backfill job to populate the new column for existing rows. Once backfill is complete and verified, switch application reads to the new column. Dual-write continues. Both old and new columns are consistent.

Phase 3: Contract — Remove the Old State

Once all application instances have been deployed with code that only writes to the new column, remove the old column and its associated indexes. This migration is trivially fast — it only removes metadata, not data.

The following example walks through renaming a user_name column to username on a high-traffic PostgreSQL table using Flyway.

-- Migration V001__expand_add_username_column.sql
-- Phase 1: Add new column alongside old one.
-- Both columns exist; application writes to both.

ALTER TABLE users ADD COLUMN username VARCHAR(255);

-- Populate new column from existing data (batched below via application)
-- Do NOT add NOT NULL yet — backfill must complete first.

-- Create index CONCURRENTLY to avoid lock
CREATE INDEX CONCURRENTLY idx_users_username ON users (username);

-- Migration V002__migrate_backfill_username.sql
-- Phase 2a: Backfill existing rows in batches (run via a background job,
-- not this migration — see Python backfill script below).
-- This migration just marks the backfill as done via a tracking table.

INSERT INTO schema_migrations_meta (key, value, completed_at)
VALUES ('backfill_username', 'done', NOW())
ON CONFLICT (key) DO UPDATE SET value = 'done', completed_at = NOW();

-- Phase 2b: Enforce data integrity now that backfill is complete.
-- Note: SET NOT NULL scans the whole table under an ACCESS EXCLUSIVE
-- lock. On Postgres 12+, first add CHECK (username IS NOT NULL) NOT VALID
-- and VALIDATE it; SET NOT NULL then skips the scan and the lock is brief.
ALTER TABLE users ALTER COLUMN username SET NOT NULL;

-- Migration V003__contract_drop_user_name.sql
-- Phase 3: Remove old column — only safe after ALL app instances
-- have been redeployed without any reference to 'user_name'.
-- Verify with: SELECT relname, attname FROM pg_attribute ...

ALTER TABLE users DROP COLUMN user_name;
DROP INDEX IF EXISTS idx_users_user_name;

Note

Never combine Phase 1 and Phase 3 into a single migration. The time between adding the new column and dropping the old one must span at least one full deployment cycle — long enough for every application instance across all regions to be running code that no longer references the old column. In a rolling deployment with 10 instances, if even one instance is still referencing user_name when you run Phase 3, you get a hard crash.

Safe Backfills: Batching to Avoid Lock Contention

Running UPDATE users SET username = user_name WHERE username IS NULL as a single statement locks every row it touches, holding those row-level locks for the duration of one enormous transaction. On a 200M-row table this creates extreme I/O pressure, bloats the table, and can push replication lag far past acceptable bounds. The correct approach is to batch by primary key with small transactions.

import psycopg2
import time
import logging

log = logging.getLogger(__name__)

BATCH_SIZE = 5_000
SLEEP_BETWEEN_BATCHES_MS = 50  # throttle to avoid saturating I/O


def backfill_username(conn_string: str) -> None:
    conn = psycopg2.connect(conn_string)
    conn.autocommit = False

    with conn.cursor() as cur:
        cur.execute("SELECT MIN(id), MAX(id) FROM users WHERE username IS NULL")
        min_id, max_id = cur.fetchone()

    if min_id is None:
        log.info("No rows to backfill — username column already populated.")
        conn.close()
        return

    log.info(f"Backfilling rows {min_id} → {max_id} in batches of {BATCH_SIZE}")

    current = min_id
    total_updated = 0

    while current <= max_id:
        batch_end = current + BATCH_SIZE - 1
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE users
                SET    username = user_name
                WHERE  id BETWEEN %s AND %s
                  AND  username IS NULL
                """,
                (current, batch_end),
            )
            rows_updated = cur.rowcount

        conn.commit()
        total_updated += rows_updated
        log.info(f"  Batch [{current}, {batch_end}]: {rows_updated} rows updated")

        current = batch_end + 1
        time.sleep(SLEEP_BETWEEN_BATCHES_MS / 1000)

    log.info(f"Backfill complete — {total_updated} total rows updated.")
    conn.close()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    backfill_username("postgresql://user:pass@db-host/mydb")

Note

Add a replication_lag_threshold_ms check inside the loop if you are running on a primary with read replicas. Query pg_stat_replication after each batch and pause if lag exceeds your threshold. This prevents backfills from causing replica lag spikes that degrade read traffic.
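That check can be sketched as a small helper called between batches — a minimal sketch assuming a psycopg2-style connection and PostgreSQL 10+, where pg_stat_replication exposes replay_lag. The threshold, poll interval, and the name wait_for_replicas are illustrative, not prescriptions:

```python
import time

# Illustrative values — tune to your own replica SLOs.
LAG_THRESHOLD_MS = 500
POLL_INTERVAL_S = 2.0

# replay_lag is available in pg_stat_replication on PostgreSQL 10+.
LAG_QUERY = """
    SELECT COALESCE(MAX(EXTRACT(EPOCH FROM replay_lag)) * 1000, 0)
    FROM pg_stat_replication
"""


def wait_for_replicas(conn, threshold_ms: float = LAG_THRESHOLD_MS) -> float:
    """Block until every replica's replay lag is below threshold_ms.

    Call this on the primary between backfill batches; returns the
    last observed lag in milliseconds.
    """
    while True:
        with conn.cursor() as cur:
            cur.execute(LAG_QUERY)
            lag_ms = float(cur.fetchone()[0] or 0.0)
        if lag_ms < threshold_ms:
            return lag_ms
        time.sleep(POLL_INTERVAL_S)
```

Dropping a call to wait_for_replicas(conn) right after conn.commit() in the backfill loop above is enough to keep replicas within bounds.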

Non-Blocking DDL in PostgreSQL

PostgreSQL has steadily expanded its list of operations that can run without an ACCESS EXCLUSIVE lock. Understanding which DDL operations are safe and which are not is critical for zero-downtime deployments.

| Operation | Lock Level | Safe? |
| --- | --- | --- |
| ADD COLUMN (nullable, no default) | ACCESS EXCLUSIVE — instant (metadata only) | ✓ Safe on all supported versions |
| ADD COLUMN NOT NULL DEFAULT (constant) | ACCESS EXCLUSIVE — instant on PG 11+ | ✗ Table rewrite, blocking on PG < 11 |
| ADD COLUMN with volatile default | ACCESS EXCLUSIVE — table rewrite | ✗ Always blocking |
| CREATE INDEX CONCURRENTLY | SHARE UPDATE EXCLUSIVE | ✓ Non-blocking |
| DROP INDEX CONCURRENTLY | SHARE UPDATE EXCLUSIVE | ✓ Non-blocking |
| DROP COLUMN | ACCESS EXCLUSIVE — instant (metadata only) | ✓ Safe (brief lock) |
| RENAME COLUMN | ACCESS EXCLUSIVE — instant | ~ Use expand-contract instead |
| ADD CONSTRAINT (validated) | ACCESS EXCLUSIVE — full table scan | ✗ Use NOT VALID + VALIDATE |

The pattern for adding a NOT NULL constraint safely on PostgreSQL:

-- Step 1: Add constraint as NOT VALID (no table scan — instant)
-- Existing rows are not checked; only new inserts/updates are validated.
ALTER TABLE orders ADD CONSTRAINT chk_status_not_empty
  CHECK (status <> '') NOT VALID;

-- Step 2: Validate existing rows in the background
-- VALIDATE CONSTRAINT takes ShareUpdateExclusiveLock, not ACCESS EXCLUSIVE.
-- Reads and writes proceed normally during validation.
ALTER TABLE orders VALIDATE CONSTRAINT chk_status_not_empty;

-- The same pattern works for FOREIGN KEY constraints:
ALTER TABLE order_items ADD CONSTRAINT fk_order
  FOREIGN KEY (order_id) REFERENCES orders(id) NOT VALID;

ALTER TABLE order_items VALIDATE CONSTRAINT fk_order;

Shadow Tables and Online Schema Changes

For structural changes that PostgreSQL or MySQL cannot perform without a table rewrite — changing a column's data type, reorganizing storage, or adding a column with a volatile default on older versions — online schema-change (OSC) tools use a shadow-table approach to perform the migration without holding a lock for its duration.

The two most widely used tools are gh-ost (MySQL, by GitHub) and pt-online-schema-change (MySQL, by Percona). For PostgreSQL, pg_repack reclaims table bloat and reorganizes storage without holding long-lived locks.

The shadow-table algorithm all three tools share:

1. Create a ghost table — an exact copy of the original table schema, with the desired structural change already applied. The ghost table starts empty.

2. Install change capture — triggers (pt-osc) or a hook into the binary log stream (gh-ost) replay all INSERT/UPDATE/DELETE operations on the original table onto the ghost table in near-real-time.

3. Chunk-copy existing rows from the original to the ghost table in small batches, throttled by replication lag and server load. This can take hours or days on large tables — normal traffic is unaffected.

4. Cutover — once the ghost table is within seconds of the original, briefly lock both tables, apply any remaining changelog, rename ghost → original (and original → backup), and release the locks. Total downtime: milliseconds.
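The four steps can be illustrated with a toy in-memory simulation — this is not any tool's actual implementation, just the shape of the algorithm, with dicts standing in for tables and a list standing in for the captured changelog:

```python
import copy


def online_schema_change(original: dict, transform, live_writes):
    """Toy simulation of the shadow-table algorithm.

    original     -- {pk: row} mapping standing in for the real table
    transform    -- row-level schema change to apply (the 'ALTER')
    live_writes  -- iterable of (pk, row_or_None) changes arriving
                    while the copy is in flight; None means DELETE
    """
    # Step 1: empty ghost table with the new schema.
    ghost = {}

    # Step 2: change capture — real tools use triggers (pt-osc) or the
    # binlog (gh-ost); here we simply record changes into a changelog.
    changelog = []

    # Step 3: chunk-copy existing rows while "concurrent" writes land.
    writes = iter(live_writes)
    for pk in sorted(original):
        ghost[pk] = transform(copy.deepcopy(original[pk]))
        for change in writes:  # interleave one captured write per chunk
            changelog.append(change)
            break

    # Drain remaining captured writes, then replay the changelog —
    # later writes win, so replay after copy is correct.
    changelog.extend(writes)
    for pk, row in changelog:
        if row is None:
            ghost.pop(pk, None)
        else:
            ghost[pk] = transform(copy.deepcopy(row))

    # Step 4: cutover — an atomic rename in real systems.
    return ghost
```

The real tools differ in how changes are captured and throttled, but the invariant is the same: every row is either chunk-copied or replayed from the changelog, and replaying captured writes after the copy preserves last-writer-wins ordering.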

# gh-ost — online ALTER TABLE for MySQL
# Renames column 'user_name' to 'username' on a live production table
# with 200M rows, zero blocking.

gh-ost \
  --host=db-primary.internal \
  --port=3306 \
  --user=ghosts_user \
  --password="${DB_PASS}" \
  --database=app \
  --table=users \
  --alter="CHANGE COLUMN user_name username VARCHAR(255) NOT NULL" \
  --chunk-size=2000 \
  --max-lag-millis=1500 \
  --max-load=Threads_running=30 \
  --throttle-additional-flag-file=/tmp/gh-ost.throttle \
  --ok-to-drop-table \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --verbose \
  --execute

# --max-lag-millis=1500: pause if replica lag exceeds 1.5s
# --max-load=Threads_running=30: pause if server load is too high

# To pause the migration at any time (creates the throttle file):
touch /tmp/gh-ost.throttle

# To resume:
rm /tmp/gh-ost.throttle

# To monitor progress — gh-ost serves an interactive command socket,
# named after the migrated schema and table by default:
echo status | nc -U /tmp/gh-ost.app.users.sock

Note

gh-ost connects to a replica (not the primary) for binlog streaming by default — this avoids adding I/O load to the write path. If gh-ost cannot auto-detect the topology, point it at the primary explicitly with --assume-master-host. In multi-AZ setups, ensure the ghost user has REPLICATION SLAVE and REPLICATION CLIENT privileges and that the binlog format is set to ROW.

Feature Flags as a Migration Safety Layer

Even with a perfect schema migration strategy, application code changes and schema changes can drift. A feature flag decouples the two, letting you deploy the code that uses a new column days before the column actually goes live — and giving you a fast rollback lever that doesn't require a redeploy.

The canonical pattern is a dual-write flag: when the flag is off, the application reads and writes only the old column. When the flag is on, it writes to both columns simultaneously and reads from the new one. The flag is flipped in production only after the backfill is verified as complete.

# Using LaunchDarkly (or any OpenFeature-compatible provider)
# to gate the dual-write phase of a column migration

import ldclient
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
ld = ldclient.get()

USER_CONTEXT = {"key": "service/user-service", "kind": "service"}


def save_user(conn, user_id: int, name: str) -> None:
    # 'conn' is assumed to be a psycopg3-style connection whose
    # execute() returns a cursor; with psycopg2, go through conn.cursor().
    flag = ld.variation("migration.users.username_dual_write", USER_CONTEXT, False)

    if flag:
        # Phase 2: dual-write — write both columns, read from new
        conn.execute(
            "UPDATE users SET user_name = %s, username = %s WHERE id = %s",
            (name, name, user_id),
        )
    else:
        # Phase 1: write old column only (new column nullable, backfill pending)
        conn.execute(
            "UPDATE users SET user_name = %s WHERE id = %s",
            (name, user_id),
        )


def get_user_name(conn, user_id: int) -> str:
    flag_read_new = ld.variation("migration.users.username_read_new", USER_CONTEXT, False)

    if flag_read_new:
        row = conn.execute(
            "SELECT username FROM users WHERE id = %s", (user_id,)
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT user_name FROM users WHERE id = %s", (user_id,)
        ).fetchone()

    return row[0]

This flag-driven approach means your rollback for a bad migration is a single flag flip — no emergency hotfix, no re-migration, no downtime. The rollback completes in under a second across all instances.

Flyway and Liquibase: Structuring Versioned Migrations

Both Flyway and Liquibase track which migrations have been applied to a database and run pending ones in version order. The key difference: Flyway uses versioned SQL files (V001__description.sql) while Liquibase uses XML/YAML/JSON changesets with explicit rollback definitions.

For zero-downtime migrations, both tools require discipline around when each migration phase deploys. A common CI/CD pattern: migration scripts run as a pre-deployment job, before rolling out new application pods. The application code is always forward-compatible with both the old and new schema simultaneously.

# flyway.conf — production configuration
flyway.url=jdbc:postgresql://db-host:5432/mydb
flyway.user=${DB_USER}
flyway.password=${DB_PASSWORD}
flyway.schemas=public
flyway.locations=filesystem:./migrations
flyway.baselineOnMigrate=false
flyway.outOfOrder=false

# Fail fast — never skip a migration or allow dirty state
flyway.validateOnMigrate=true
# NEVER allow clean in production
flyway.cleanDisabled=true

# Liquibase changeset with explicit rollback
# db/changelog/0042-add-username-column.yaml
---
databaseChangeLog:
  - changeSet:
      id: 0042-add-username-column
      author: platform-team
      comment: "Phase 1 expand: add nullable username column"
      changes:
        - addColumn:
            tableName: users
            columns:
              - column:
                  name: username
                  type: VARCHAR(255)
                  constraints:
                    nullable: true
      rollback:
        - dropColumn:
            tableName: users
            columnName: username

  - changeSet:
      id: 0042-add-username-index
      author: platform-team
      runInTransaction: false  # CREATE INDEX CONCURRENTLY cannot run in a transaction
      changes:
        - sql:
            sql: CREATE INDEX CONCURRENTLY idx_users_username ON users (username)
            splitStatements: false
      rollback:
        - sql:
            sql: DROP INDEX CONCURRENTLY IF EXISTS idx_users_username
            splitStatements: false

Note

CREATE INDEX CONCURRENTLY cannot run inside a transaction block. Flyway wraps each migration in a transaction by default — disable it for that script (for example via a script config file setting executeInTransaction=false, or by running the index build as a separate out-of-band step) and configure Liquibase with runInTransaction: false for those changesets. Forgetting this makes the statement fail outright with "CREATE INDEX CONCURRENTLY cannot run inside a transaction block" — and the tempting quick fix, dropping CONCURRENTLY, ships a blocking index build.

CI/CD Integration for Safe Migration Pipelines

A robust migration CI/CD pipeline has four gates: lint the SQL, validate migration idempotency, run against a staging clone, and generate a lock analysis report before promoting to production.

# .github/workflows/migrate.yml
# Zero-downtime migration pipeline for PostgreSQL

name: Database Migration

on:
  push:
    paths: ["migrations/**"]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint SQL with squawk
        uses: sbdchd/squawk-action@v2
        with:
          pattern: "migrations/*.sql"
          # squawk detects blocking DDL: https://squawkhq.com/docs/rules
          config: .squawk.toml

  validate:
    runs-on: ubuntu-latest
    needs: lint
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_USER: testuser
          POSTGRES_PASSWORD: testpass
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - name: Run Flyway migrate (test DB)
        run: |
          flyway \
            -url=jdbc:postgresql://localhost:5432/testdb \
            -user=testuser \
            -password=testpass \
            -locations=filesystem:./migrations \
            migrate
      - name: Verify migration idempotency (run twice)
        run: |
          # Should be a no-op on the second run
          flyway \
            -url=jdbc:postgresql://localhost:5432/testdb \
            -user=testuser \
            -password=testpass \
            -locations=filesystem:./migrations \
            migrate

  production:
    runs-on: ubuntu-latest
    needs: validate
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Run Flyway migrate (production)
        env:
          DB_URL: ${{ secrets.PROD_DB_URL }}
          DB_USER: ${{ secrets.PROD_DB_USER }}
          DB_PASSWORD: ${{ secrets.PROD_DB_PASSWORD }}
        run: |
          flyway \
            -url="${DB_URL}" \
            -user="${DB_USER}" \
            -password="${DB_PASSWORD}" \
            -locations=filesystem:./migrations \
            migrate

Squawk is a static analysis tool for PostgreSQL migrations that catches blocking DDL operations in CI before they reach production. It detects patterns like missing CONCURRENTLY, unvalidated constraints, and type changes that require a table rewrite — integrating directly into GitHub Actions or GitLab CI.
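A minimal .squawk.toml might look like the following — a sketch only; consult squawk's documentation for the full option and rule list:

```toml
# .squawk.toml — sketch; rule names come from squawk's published rule set
excluded_rules = [
    # Contract-phase migrations intentionally drop columns, so we
    # silence this rule rather than fail the pipeline on Phase 3.
    "ban-drop-column",
]
```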

Production Checklist

Set lock_timeout and statement_timeout

Any DDL statement that needs an ACCESS EXCLUSIVE lock will queue behind existing long-running queries — and while it waits, all subsequent queries queue behind it. Set lock_timeout = 2s and statement_timeout = 30s on your migration connection so a blocked migration fails fast rather than silently accumulating a queue of blocked app requests. Retry with exponential backoff in CI.
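That discipline can be sketched as a reusable helper — run_ddl_with_retry is a hypothetical name, and the broad except is a placeholder for psycopg2's specific errors.LockNotAvailable / errors.QueryCanceled classes:

```python
import random
import time


def run_ddl_with_retry(conn, ddl: str, attempts: int = 5) -> None:
    """Run a DDL statement with fail-fast timeouts and backoff retries.

    lock_timeout makes the ALTER give up quickly instead of queueing
    app traffic behind it; statement_timeout bounds the whole statement.
    """
    for attempt in range(attempts):
        try:
            with conn.cursor() as cur:
                # SET LOCAL scopes the timeouts to this transaction only.
                cur.execute("SET LOCAL lock_timeout = '2s'")
                cur.execute("SET LOCAL statement_timeout = '30s'")
                cur.execute(ddl)
            conn.commit()
            return
        except Exception:
            # In real code, catch psycopg2.errors.LockNotAvailable and
            # psycopg2.errors.QueryCanceled here, re-raising anything else.
            conn.rollback()
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter before retrying the lock.
            time.sleep(min(2 ** attempt, 30) + random.random())
```

Each retry re-enters the lock queue fresh, so a transient blocker (a long report query finishing) no longer stalls the whole application behind the waiting DDL.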

Never run migrations in application startup

Running Flyway or Liquibase on application boot means every pod start races to acquire a migration lock. In a rolling deployment with 20 pods starting simultaneously, 19 of them block on the schema lock. Run migrations as a separate Kubernetes Job or CI step before updating the Deployment.
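As a sketch, a pre-deployment migration Job might look like this — the image tag, JDBC host, and secret names are placeholders for your own environment:

```yaml
# migration-job.yaml — run to completion before updating the Deployment
apiVersion: batch/v1
kind: Job
metadata:
  name: flyway-migrate
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: flyway
          image: flyway/flyway:10
          args: ["migrate"]
          env:
            - name: FLYWAY_URL
              value: jdbc:postgresql://db-host:5432/mydb
            - name: FLYWAY_USER
              valueFrom:
                secretKeyRef: {name: db-credentials, key: user}
            - name: FLYWAY_PASSWORD
              valueFrom:
                secretKeyRef: {name: db-credentials, key: password}
```

Exactly one Job runs the migration; the Deployment rollout is gated on the Job succeeding, so no pod ever races for the schema lock at boot.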

Monitor pg_stat_activity during migrations

Before running any DDL, check for long-running queries that will block your migration. A query that has been running for 20 minutes will cause your ALTER TABLE to wait — and queue every new connection behind it. Use SELECT pid, now() - query_start AS age, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY age DESC to identify blockers.

Store migrations in version control, never edit them

Flyway checksums each applied migration and will refuse to run if the file content changes after application. Never edit an applied migration — always add a new one. This prevents state divergence between environments and ensures the migration history is auditable and reproducible.

Test rollback paths explicitly

Liquibase rollback definitions and feature flag-based rollbacks both need to be tested in staging before production. An untested rollback path that fails during an incident is worse than no rollback at all. Include a mandatory rollback test in your staging migration pipeline: migrate forward, verify, roll back, verify again.
