
Migrating 50GB+ MySQL Data Without Downtime

We needed to restructure a core MySQL table. 50GB of data. 200M+ rows. Serving live traffic at ~3K queries per second. Downtime was not an option.

Here’s the playbook we used — and the mistakes that almost cost us.

Why Not Just ALTER TABLE?

For small tables, ALTER TABLE works fine. For a 50GB table with heavy write traffic, it’s a disaster.

MySQL’s default ALTER TABLE on InnoDB:

  • Creates a full copy of the table
  • Blocks writes during the copy (or uses an online DDL that still degrades performance)
  • Can take hours for large tables
  • If it fails halfway, you start over

We needed something that could run alongside live traffic with zero impact.
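(Aside: if you do reach for native online DDL, you can at least make MySQL fail fast instead of silently degrading. This is standard MySQL syntax; the column is just an example.)

```sql
ALTER TABLE orders
  ADD COLUMN currency CHAR(3) NOT NULL DEFAULT 'USD',
  ALGORITHM=INPLACE, LOCK=NONE;
-- MySQL raises an error if it can't satisfy the ALGORITHM/LOCK clauses,
-- rather than falling back to a blocking table copy behind your back.
```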

The Strategy: Expand-Contract Migration

The approach breaks down into four phases:

Phase 1: Expand    → Create new structure alongside old
Phase 2: Migrate   → Backfill historical data
Phase 3: Cutover   → Switch reads and writes
Phase 4: Contract  → Remove old structure

Each phase is independently deployable and reversible.

Phase 1: Expand — Create the Shadow Table

Create the new table with the target schema. Don’t touch the old one.

CREATE TABLE orders_v2 (
  id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  order_id VARCHAR(36) NOT NULL,
  customer_id BIGINT UNSIGNED NOT NULL,
  status ENUM('pending', 'confirmed', 'shipped', 'delivered', 'cancelled') NOT NULL,
  total_cents BIGINT NOT NULL,
  currency CHAR(3) NOT NULL DEFAULT 'USD',
  metadata JSON,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

  UNIQUE INDEX idx_order_id (order_id),
  INDEX idx_customer_status (customer_id, status),
  INDEX idx_created (created_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Then implement dual-write in the application layer:

async function createOrder(order: OrderInput): Promise<Order> {
  const result = await db.transaction(async (tx) => {
    // Write to old table (source of truth)
    const oldRow = await tx.query(
      "INSERT INTO orders (order_id, customer_id, ...) VALUES (?, ?, ...)",
      [order.orderId, order.customerId, ...]
    );

    // Write to new table (shadow)
    try {
      await tx.query(
        "INSERT INTO orders_v2 (order_id, customer_id, ...) VALUES (?, ?, ...)",
        [order.orderId, order.customerId, ...]
      );
    } catch (err) {
      // Log but don't fail — old table is still source of truth
      logger.warn("Shadow write failed", { orderId: order.orderId, error: err });
    }

    return oldRow;
  });

  return result;
}

Key decision: should the shadow write be in the same transaction?

  • Same transaction: guarantees consistency but couples the tables. If the new table has issues, writes to the old table also fail.
  • Separate transaction: more resilient but can have temporary inconsistency.

We chose same transaction, wrapped the shadow insert in a try/catch so a shadow failure couldn’t poison the primary write, and kept a feature flag to disable the shadow write entirely if things went wrong.
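For illustration, here’s a minimal sketch of the alternative (separate-transaction, best-effort) shadow write. The in-memory Maps stand in for the two tables, and the flag names are made up:

```typescript
type OrderInput = { orderId: string; customerId: number };

// In-memory stand-ins for the two tables, so the sketch is self-contained.
const ordersTable = new Map<string, OrderInput>();
const ordersV2Table = new Map<string, OrderInput>();

let shadowWritesEnabled = true; // feature-flag kill switch (hypothetical)
let shadowShouldFail = false;   // simulates an outage on orders_v2

function writeOrder(order: OrderInput): void {
  // 1. Commit to the old table first: it remains the source of truth.
  ordersTable.set(order.orderId, order);

  // 2. Best-effort shadow write, outside the primary "transaction".
  if (!shadowWritesEnabled) return;
  try {
    if (shadowShouldFail) throw new Error("orders_v2 unavailable");
    ordersV2Table.set(order.orderId, order);
  } catch (err) {
    // Temporary inconsistency is acceptable here; the backfill and
    // verification passes reconcile it later.
    console.warn("shadow write failed", order.orderId, err);
  }
}
```

The trade-off is exactly as described above: a shadow outage never blocks order creation, at the cost of missed rows the backfill has to catch up on.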

Phase 2: Migrate — Backfill Historical Data

This is where the 50GB lives. You can’t INSERT INTO orders_v2 SELECT * FROM orders — it’ll lock the table and blow up your replication lag.

Instead, backfill in small batches with throttling:

async function backfillBatch(startId: number, batchSize: number): Promise<number> {
  const rows = await db.query(
    `SELECT * FROM orders
     WHERE id > ?
     ORDER BY id ASC
     LIMIT ?`,
    [startId, batchSize]
  );

  if (rows.length === 0) return 0;

  // INSERT IGNORE makes each batch idempotent: rows already copied by the
  // dual-write path (or a retried batch) are silently skipped, so there's
  // no need for an expensive NOT IN anti-join against orders_v2.
  const values = rows.map(transformRow);
  await db.query(
    `INSERT IGNORE INTO orders_v2 (id, order_id, customer_id, ...)
     VALUES ?`,
    [values]
  );

  return rows[rows.length - 1].id;
}

async function runBackfill() {
  const BATCH_SIZE = 1000;
  const DELAY_MS = 100; // Throttle to limit DB impact
  let lastId = 0;
  let totalMigrated = 0;

  while (true) {
    lastId = await backfillBatch(lastId, BATCH_SIZE);
    if (lastId === 0) break;

    totalMigrated += BATCH_SIZE; // approximate: the final batch may be smaller

    if (totalMigrated % 100_000 === 0) {
      logger.info("Backfill progress", { totalMigrated, lastId });
    }

    // Throttle to keep replication lag under control
    await sleep(DELAY_MS);

    // Check replication lag — pause until the replicas catch up
    let lag = await getReplicationLag();
    while (lag > 5) {
      logger.warn("Replication lag high, pausing backfill", { lag });
      await sleep(30_000);
      lag = await getReplicationLag();
    }
  }

  logger.info("Backfill complete", { totalMigrated });
}

This took about 6 hours for our 50GB table. We ran it during low-traffic hours, but it was safe to run anytime because of the throttling and INSERT IGNORE.
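The getReplicationLag() helper boils down to one statement per replica (hedged: the statement and column were renamed over MySQL’s lifetime):

```sql
-- MySQL 8.0.22+ syntax; older versions use SHOW SLAVE STATUS,
-- where the field is called Seconds_Behind_Master instead.
SHOW REPLICA STATUS;
-- Read Seconds_Behind_Source: an approximation of replica lag, in seconds.
```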

Verifying the Backfill

Don’t trust the backfill blindly. Run a verification pass:

async function verifyBatch(startId: number, batchSize: number): Promise<number> {
  const mismatches = await db.query(
    `SELECT o.id, o.order_id
     FROM orders o
     LEFT JOIN orders_v2 v ON o.id = v.id
     WHERE o.id > ?
       AND (v.id IS NULL OR o.order_id != v.order_id OR o.total != v.total_cents)
     ORDER BY o.id ASC
     LIMIT ?`,
    [startId, batchSize]
  );

  if (mismatches.length > 0) {
    logger.error("Backfill mismatches found", {
      count: mismatches.length,
      sampleIds: mismatches.slice(0, 5).map((r) => r.id),
    });
  }

  return mismatches.length;
}

We found 12 rows with mismatches due to a timezone conversion bug in transformRow. Caught before cutover. Verification is not optional.
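Row-by-row diffs get slow at 200M rows. A cheaper spot-check you can layer on top, in the spirit of pt-table-checksum (only valid for columns copied verbatim, not transformed ones):

```sql
-- Run the same query against both tables for a chunk of ids and compare
-- the two result rows; any difference means a bad row in that chunk.
SELECT
  COUNT(*) AS row_count,
  BIT_XOR(CRC32(CONCAT_WS('#', id, order_id, customer_id))) AS chunk_checksum
FROM orders_v2
WHERE id BETWEEN 1 AND 1000000;
```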

Phase 3: Cutover — Switch Traffic

This is the scary part. We used a feature flag to make it gradual:

async function getOrder(orderId: string): Promise<Order> {
  if (featureFlag.isEnabled("read-from-orders-v2")) {
    const [row] = await db.query("SELECT * FROM orders_v2 WHERE order_id = ?", [orderId]);
    if (row) return transformFromV2(row);

    // Fall back to the old table if the row is missing (shouldn't happen after backfill)
    logger.warn("Order missing from v2, falling back", { orderId });
  }

  const [row] = await db.query("SELECT * FROM orders WHERE order_id = ?", [orderId]);
  return row;
}

Cutover sequence:

  1. Enable shadow reads (read from both, compare results, return old) — run for 24 hours, verify no mismatches
  2. Switch reads to new table (with fallback to old)
  3. Monitor for 48 hours — error rates, latency, query performance
  4. Switch source of truth for writes to new table
  5. Keep dual-write to old table for another week (rollback safety net)

Phase 4: Contract — Clean Up

Once confident:

  1. Remove dual-write code
  2. Remove feature flags
  3. Drop old table: DROP TABLE orders
  4. Remove fallback read logic

We waited two weeks after full cutover before dropping the old table. Paranoia pays off.

The Mistake That Almost Got Us

During backfill, we forgot to account for ON UPDATE CURRENT_TIMESTAMP on the updated_at column. Every backfilled row got a fresh updated_at timestamp, which broke our “recently modified orders” queries.

The fix was simple — explicitly set updated_at during backfill:

INSERT INTO orders_v2 (id, order_id, ..., created_at, updated_at)
VALUES (?, ?, ..., ?, ?)

But we didn’t catch it until the verification step flagged timestamp mismatches. Without verification, this would have been a silent data corruption.

Tooling Alternatives

We did this manually because we needed full control. But there are tools that automate parts of this:

  • gh-ost (GitHub) — online schema migration for MySQL. Creates a shadow table and uses the binlog to replicate changes. Excellent for schema changes.
  • pt-online-schema-change (Percona) — similar approach using triggers instead of binlog.
  • DMS (AWS Database Migration Service) — for cross-database migrations.

These tools are great for straightforward schema changes. For complex transformations (restructuring data, splitting tables, changing column semantics), the manual approach gives you more control.
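An illustrative gh-ost invocation (these flags exist in recent gh-ost releases, but check the current docs; host and database names are placeholders):

```shell
gh-ost \
  --host=replica.db.internal \
  --database=shop \
  --table=orders \
  --alter="ADD COLUMN currency CHAR(3) NOT NULL DEFAULT 'USD'" \
  --chunk-size=1000 \
  --max-lag-millis=1500 \
  --execute
```

Note that gh-ost throttles itself on replica lag (--max-lag-millis), which replaces the hand-rolled throttling from the backfill loop above.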

Checklist

If you’re doing a similar migration:

  • Create shadow table with new schema
  • Implement dual-write with feature flag kill switch
  • Backfill in small batches with throttling
  • Monitor replication lag during backfill
  • Run verification pass — don’t trust the backfill
  • Shadow read phase: compare old vs new results
  • Gradual cutover with monitoring at each step
  • Keep old table alive for at least a week after cutover
  • Clean up dual-write code and drop old table

Zero-downtime migrations are stressful, but the expand-contract pattern makes them systematic. The key insight: every step should be independently deployable and reversible. If anything goes wrong, you can stop and roll back without affecting users.

It took us a week of planning, a week of execution, and two weeks of monitoring. For a 50GB table serving live traffic, that’s a good trade.