Chapter 14: Migration Strategies
Introduction: The Make-or-Break Moment
"Experience with migrations and data conversions from legacy systems reveals a critical truth: this is the part of implementation most prone to failure unless done properly and thoughtfully. Most failed projects don't fail because of bad code - they fail because of bad migration strategy."
The hard truth about migrations:
- 70% of IT projects fail during migration, not development
- Average migration takes 2-3x longer than planned
- Data quality issues surface only during migration
- User resistance kills good systems
- Rollback is harder than you think
But migrations don't have to fail. With the right strategy, you can migrate legacy systems to pattern-based systems safely, predictably, and successfully.
This chapter covers:
1. Why migrations fail (learn from others' mistakes)
2. The Strangler Fig pattern (gradual replacement)
3. Data migration strategies (ETL with validation)
4. Parallel running (reduce risk)
5. Cutover strategies (minimize downtime)
6. Rollback planning (when things go wrong)
7. Change management (the human side)
Section 1: Why Migrations Fail
Learn from these common mistakes. Every one of these has killed real projects:
Failure Mode 1: "Big Bang" Cutover
What they did: Spent 18 months building new system in secret. On cutover weekend, turned off old system, turned on new system. Monday morning, chaos.
What went wrong:
- Users had no training on the new system
- Edge cases only the old system handled
- Data migration had errors (discovered live)
- No rollback plan (old system turned off)
- Support team overwhelmed
Real example (government agency): County clerk's office replaced its 25-year-old permit system with a modern web app. Big bang cutover on Friday night. Monday morning:
- 40% of permit applications failed (missing data fields)
- Inspectors couldn't access old permits (legacy data not migrated)
- Staff didn't know how to use the new system (no training)
- Public couldn't submit permits (system down)
- Old system tapes in a warehouse (3-day recovery)
Cost: $2.3M project scrapped, old system restored, CIO fired.
Lesson: Never big bang. Always gradual.
Failure Mode 2: Underestimating Data Quality
What they did: Assumed legacy data was clean. Started migration, discovered 30 years of data quality issues.
What went wrong:
- Duplicate records (same customer entered 5 times)
- Inconsistent formats (phone: 555-1234, (555) 123-4567, 555.123.4567)
- Missing required fields (new system requires email, old had none)
- Invalid references (orders referencing deleted customers)
- Encoding issues (special characters corrupted)
Real example (legal firm): Law firm migrating 50,000 client records from a DOS-based system to modern case management. Discovered:
- 12,000 duplicate client records (23%)
- 8,000 clients with no address (16%)
- 15,000 cases with no closing date (30%)
- 200+ different spellings of "plaintiff" in notes
- Client SSNs stored in phone number fields (!)
Timeline impact:
- Planned: 3 months
- Actual: 18 months (data cleanup)
- Extra cost: $400,000
Lesson: Assess data quality BEFORE migration. Plan for cleanup.
Failure Mode 3: Ignoring User Resistance
What they did: Built technically perfect system. Users refused to use it.
What went wrong:
- Users comfortable with the old system (20+ years)
- New system had a learning curve
- No one asked users what they needed
- Training was optional (no one attended)
- Old system still available (users kept using it)
Real example (healthcare clinic): Clinic replaced paper charts with an EMR. The system was excellent, but:
- Doctors continued using paper, transcribed later
- Nurses printed EMR data, wrote on the printouts
- Front desk kept a paper schedule alongside the EMR
- Data in both systems, constantly out of sync
Result: After 2 years, still using paper primarily. $500k EMR barely used.
Lesson: Change management is more important than code quality.
Failure Mode 4: No Rollback Plan
What they did: "We're committed. No going back." Then discovered critical bug in production.
What went wrong:
- Old system already decommissioned (database server wiped)
- Backup tapes in offsite storage (3-day retrieval)
- Users trained on the new system only (forgot old workflows)
- Can't roll forward, can't roll back = stuck
Real example (e-commerce company): Online retailer migrated its order management system. Week 2 in production:
- Bug in shipping calculation (overcharged customers $80k)
- Old system shut down (server reused)
- Had to manually refund 4,000 orders
- Lost 2,000 customers (bad reviews)
Cost: $250k in refunds + reputation damage.
Lesson: Always have rollback plan. Keep old system running in parallel.
Failure Mode 5: Underestimating Complexity
What they did: "It's just CRUD. How hard can it be?" Discovered 20 years of undocumented business logic.
What went wrong:
- Old system had hundreds of edge cases
- Business rules buried in code (no documentation)
- Integrations no one knew existed
- Reports depended on specific data quirks
- "Magic" calculations no one could explain
Real example (financial services): Bank migrating its loan origination system. Discovered:
- 47 different loan types (documented: 12)
- Interest calculation varied by state, product, and date
- 200+ undocumented business rules in stored procedures
- Credit bureau integration used custom SOAP encoding
- Executive reports pulled data from 7 different systems
Original estimate: 6 months, $500k
Actual: 24 months, $2.8M
Lesson: Legacy systems are more complex than they look. Triple your estimate.
Section 2: The Strangler Fig Pattern
The safest migration strategy. Named after fig trees that gradually grow around host trees until the host can be removed.
How It Works
Phase 1: Run in parallel
- New system handles new features
- Old system keeps handling existing functionality
- Both systems active simultaneously

Phase 2: Gradual migration
- Migrate one feature at a time
- Test thoroughly before moving to the next feature
- Users learn the new system gradually

Phase 3: Complete strangling
- Eventually all features are in the new system
- Old system turned off
- Migration complete
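In code, these phases usually take the form of a thin facade that routes each feature to whichever system currently owns it, flipping features one at a time. A minimal sketch, with hypothetical system classes standing in for the real ones:

```python
class LegacySystem:
    def handle(self, feature, request):
        return f"old:{feature}"

class NewSystem:
    def handle(self, feature, request):
        return f"new:{feature}"

class StranglerFacade:
    """Route each feature to whichever system currently owns it."""
    def __init__(self, old_system, new_system):
        self.old = old_system
        self.new = new_system
        self.migrated = set()  # features now served by the new system

    def migrate_feature(self, feature):
        self.migrated.add(feature)

    def handle(self, feature, request):
        system = self.new if feature in self.migrated else self.old
        return system.handle(feature, request)

facade = StranglerFacade(LegacySystem(), NewSystem())
facade.migrate_feature('permit_lookup')         # flip one feature at a time
lookup = facade.handle('permit_lookup', {})     # now served by the new system
apps = facade.handle('applications', {})        # still served by the old system
```

Because the routing decision lives in one place, pausing or reversing the migration is a one-line change.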
Example: Permit System Migration
Legacy system: DOS-based permit application from 1995. 30 years of data. 50 users. 10,000 permits/year.
Strangler fig approach:
Month 1-2: Infrastructure
┌─────────────────────────────────────┐
│ Old DOS System │
│ (Still handles everything) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Database Migration Layer │
│ (Replicate data to new DB nightly) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ New System (empty) │
│ (Read-only access to data) │
└─────────────────────────────────────┘
Users: Still using old system 100%
Data: Copying to new database nightly
Risk: Low (new system not critical yet)
Month 3-4: First Feature Migration
┌─────────────────────────────────────┐
│ Old DOS System │
│ - Permit applications (still) │
│ - Inspections (still) │
│ - Reports (still) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ New System │
│ - Permit lookup (NEW!) │
│ Public can search permits online │
└─────────────────────────────────────┘
Users: Public uses new system for lookup, staff uses old system for data entry
Benefits: Public can find permits online (never could before)
Risk: Low (read-only, no data changes)
Month 5-7: Second Feature Migration
┌─────────────────────────────────────┐
│ Old DOS System │
│ - Permit applications (still) │
│ - Inspections (still) │
│ - Reports (migrating...) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ New System │
│ - Permit lookup │
│ - Reports (NEW!) │
│ Generate all reports online │
└─────────────────────────────────────┘
Users: Staff generates reports in new system, everything else in old system
Benefits: Beautiful reports (vs dot matrix printouts)
Risk: Low (reports don't change data)
Month 8-10: Critical Feature Migration
┌─────────────────────────────────────┐
│ Old DOS System │
│ - Inspections (still) │
│ - Legacy permit lookup │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ New System │
│ - Permit applications (NEW!) │
│ Online application with │
│ validation, zoning checks │
│ - Permit lookup │
│ - Reports │
└─────────────────────────────────────┘
Users: New permits in new system, old permits still in old system
Benefits: Real-time validation, no paper forms
Risk: Medium (creating new data)
Month 11-12: Final Migration
┌─────────────────────────────────────┐
│ Old DOS System (read-only) │
│ Historical data only │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ New System │
│ - Permit applications │
│ - Inspections (NEW!) │
│ - Permit lookup │
│ - Reports │
│ - All features migrated! │
└─────────────────────────────────────┘
Users: 100% on new system for daily work
Old system: Available for historical lookups only
Risk: Low (gradual migration de-risked each step)
Month 13+: Decommission
- Old system data fully migrated
- Old system shut down
- Hardware repurposed
Why Strangler Fig Works
✅ Users adapt gradually - Learn one feature at a time
✅ Low risk - Each step tested before the next
✅ Reversible - Can pause migration if issues arise
✅ Builds confidence - Early wins create momentum
✅ Maintains business continuity - Never fully down
The pattern is clear: Successful migrations consistently use strangler fig variations. Failed ones consistently try big bang.
Section 3: Data Migration Strategies
Data migration is where projects die. Here's how to do it right.
The Five-Stage Data Migration Pipeline
Stage 1: Extract (get data out of the legacy system)
Stage 2: Profile (understand data quality)
Stage 3: Transform (clean and convert data)
Stage 4: Validate (verify correctness)
Stage 5: Load (insert into the new system)
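The five stages chain into a single fail-fast pipeline: nothing loads unless validation passes. A sketch of the control flow, with trivial stand-in stage functions (the rest of this section shows real implementations of each stage):

```python
def extract():
    # Stand-in rows; a real extract would read the legacy dump
    return [{'id': 1, 'zip': '1234'}, {'id': 2, 'zip': '90210'}]

def profile(rows):
    return {'records': len(rows),
            'bad_zips': sum(len(r['zip']) != 5 for r in rows)}

def transform(rows):
    # Here: simply drop rows with invalid zips; real cleanup is richer
    return [r for r in rows if len(r['zip']) == 5]

def validate(rows):
    assert all(len(r['zip']) == 5 for r in rows), "invalid zip survived transform"

loaded = []
def load(rows):
    loaded.extend(rows)  # stand-in for a database insert

def run_migration():
    raw = extract()
    report = profile(raw)   # understand quality before touching anything
    clean = transform(raw)
    validate(clean)         # fail fast: nothing loads unless checks pass
    load(clean)
    return report

report = run_migration()
```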
Stage 1: Extract
Goal: Get data out of legacy system in usable format.
Challenges:
- Legacy systems often have no export function
- Database may be in a proprietary format (dBase, FoxPro)
- Some data exists only in printed reports
- Special characters may be corrupted during export
Extraction methods:
Method 1: Database dump (best)
# If you have database access
mysqldump --single-transaction legacy_db > legacy_data.sql
# Or, with PostgreSQL, dump specific tables only
pg_dump -t permits -t inspections legacy_db > data.sql
Method 2: Export via legacy application
# Many legacy systems can export to CSV
# Use the application's export feature if available
Old System → File → Export → CSV
Method 3: Database replication (for strangler fig)
// Set up scheduled (or real-time) replication.
// DatabaseReplicator is illustrative, not a real library.
const replicator = new DatabaseReplicator({
  source: 'legacy-db.company.local:3306',
  target: 'new-db.company.local:5432',
  tables: ['permits', 'inspections', 'customers'],
  schedule: 'every 4 hours' // or real-time
});
replicator.start();
Method 4: Screen scraping (last resort)
# If no database access, scrape the UI (slow and fragile)
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://legacy-system/permits')

permits = []
for row in driver.find_elements(By.TAG_NAME, 'tr'):
    permit = {
        'id': row.find_element(By.CLASS_NAME, 'permit-id').text,
        'address': row.find_element(By.CLASS_NAME, 'address').text,
        # ...
    }
    permits.append(permit)
Critical tip: Always extract more than you think you need. You can't go back to extract later.
Stage 2: Profile
Goal: Understand data quality before attempting transformation.
Profile everything:
- Record counts
- Null/empty values
- Unique values per field
- Data type distributions
- Duplicate records
- Referential integrity
Example profiling script:
import pandas as pd
# Load extracted data
df = pd.read_csv('legacy_permits.csv')
# Profile report
print(f"Total records: {len(df)}")
print(f"Columns: {df.columns.tolist()}")
print("\nNull values per column:")
print(df.isnull().sum())
print(f"\nDuplicate records: {df.duplicated().sum()}")
print("\nData types:")
print(df.dtypes)
# Field-specific analysis
print("\nPermit status distribution:")
print(df['status'].value_counts())
print("\nInvalid zip codes:")
invalid_zips = df[~df['zip_code'].str.match(r'^\d{5}(-\d{4})?$', na=False)]
print(f"Count: {len(invalid_zips)}")
# Referential integrity (customers extracted separately)
customers_df = pd.read_csv('legacy_customers.csv')
print("\nOrphaned records (customer_id not in customers table):")
orphans = df[~df['customer_id'].isin(customers_df['id'])]
print(f"Count: {len(orphans)}")
Output example:
Total records: 48,392
Columns: ['id', 'permit_type', 'address', 'customer_id', 'status', 'submitted_date']
Null values per column:
id 0
permit_type 47
address 892
customer_id 1,203
status 0
submitted_date 4,281
Duplicate records: 1,847
Invalid zip codes: 3,472
Orphaned records: 2,108
Action: Create data quality report. Show to stakeholders. Adjust timeline based on cleanup needed.
Stage 3: Transform
Goal: Clean, standardize, and convert data to new schema.
Common transformations:
1. Deduplication
# Find duplicates based on key fields
duplicates = df[df.duplicated(subset=['customer_name', 'address'], keep=False)]

# Strategy 1: Keep most recent
df = df.sort_values('created_date').drop_duplicates(
    subset=['customer_name', 'address'],
    keep='last'
)

# Strategy 2: Merge records
def merge_duplicates(group):
    # Take the first non-null value for each field
    return group.apply(lambda x: x.dropna().iloc[0] if not x.dropna().empty else None)

df = df.groupby(['customer_name', 'address']).apply(merge_duplicates)
2. Standardization
# Phone numbers: (555) 123-4567 → 5551234567
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
# Addresses: Upper case, trim whitespace
df['address'] = df['address'].str.strip().str.upper()
# Dates: Various formats → ISO 8601
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# Status codes: Y/N → true/false
df['active'] = df['active'].map({'Y': True, 'N': False, '': False})
3. Enrichment
# Add missing data from external sources
import requests

def validate_address(address):
    response = requests.post('https://api.usps.com/validate', json={
        'address': address
    })
    return response.json()['corrected_address']

# Apply to addresses with missing zip codes
df.loc[df['zip_code'].isnull(), 'address'] = \
    df[df['zip_code'].isnull()]['address'].apply(validate_address)
4. Schema mapping
# Old schema → New schema
mapping = {
'cust_id': 'customer_id',
'cust_nm': 'customer_name',
'addr1': 'street_address',
'addr2': 'address_line_2',
'city': 'city',
'st': 'state',
'zip': 'postal_code'
}
new_df = df.rename(columns=mapping)
# Add required fields not in the legacy system
from datetime import datetime
new_df['created_by'] = 'migration_script'
new_df['created_at'] = datetime.now()
Critical warning: Transformation is where you'll spend 80% of migration time. Budget accordingly.
Stage 4: Validate
Goal: Verify transformed data is correct before loading.
Validation checks:
1. Record count reconciliation
legacy_count = len(legacy_df)
transformed_count = len(transformed_df)
expected_duplicates_removed = 1847
assert transformed_count == legacy_count - expected_duplicates_removed, \
f"Record count mismatch! Expected {legacy_count - expected_duplicates_removed}, got {transformed_count}"
2. Field-level validation
# All required fields present
required_fields = ['customer_id', 'address', 'status']
for field in required_fields:
    null_count = transformed_df[field].isnull().sum()
    assert null_count == 0, f"Field {field} has {null_count} null values"
# Data types correct
assert transformed_df['customer_id'].dtype == 'int64'
assert transformed_df['submitted_date'].dtype == 'datetime64[ns]'
# Value ranges valid
assert (transformed_df['permit_fee'] >= 0).all(), "Negative fees found"
assert transformed_df['state'].isin(US_STATES).all(), "Invalid state codes found"
3. Business rule validation
# Approved permits must have approval date
approved = transformed_df[transformed_df['status'] == 'approved']
assert approved['approval_date'].notnull().all(), \
"Approved permits missing approval dates"
# Customer IDs must exist in customers table
customer_ids = set(customers_df['id'])
invalid_refs = transformed_df[~transformed_df['customer_id'].isin(customer_ids)]
assert len(invalid_refs) == 0, \
f"Found {len(invalid_refs)} permits with invalid customer_id"
4. Sample validation (manual)
# Export sample for manual review
sample = transformed_df.sample(n=100)
sample.to_csv('migration_sample.csv')
print("Review migration_sample.csv manually")
print("Check that:")
print("- Addresses are correct")
print("- Phone numbers formatted properly")
print("- Dates make sense")
print("- Status values appropriate")
Validation report:
Data Validation Report - Legacy Permits Migration
Generated: 2025-12-28
Record Counts:
- Legacy system: 48,392 records
- After deduplication: 46,545 records (-1,847)
- After cleanup: 46,545 records
- Ready to load: 46,545 records ✓
Field Validation:
✓ All required fields present
✓ No null values in required fields
✓ Data types correct
✓ Value ranges valid
Business Rules:
✓ Approved permits have approval dates
✓ All customer references valid
✓ Dates within reasonable range
⚠ 23 permits with $0 fee (flagged for review)
Sample Validation:
✓ 100 random samples reviewed manually
✓ No issues found
Ready for loading: YES
Estimated load time: 2-3 hours
Stage 5: Load
Goal: Insert validated data into new system.
Loading strategies:
Strategy 1: Bulk insert (fastest)
import psycopg2
from psycopg2.extras import execute_batch
conn = psycopg2.connect("postgresql://localhost/new_db")
cursor = conn.cursor()
# Prepare data
records = transformed_df.to_records(index=False)
data = [tuple(x) for x in records]
# Bulk insert
sql = """
INSERT INTO permits (customer_id, address, status, submitted_date)
VALUES (%s, %s, %s, %s)
"""
execute_batch(cursor, sql, data, page_size=1000)
conn.commit()
print(f"Loaded {len(data)} records")
Strategy 2: API-based (safer)
import os

import pandas as pd
import requests

api_url = 'https://new-system.com/api/permits'
api_key = os.getenv('API_KEY')

loaded = 0
errors = []

for _, row in transformed_df.iterrows():
    try:
        response = requests.post(
            api_url,
            headers={'Authorization': f'Bearer {api_key}'},
            json={
                'customer_id': int(row['customer_id']),
                'address': row['address'],
                'status': row['status'],
                'submitted_date': row['submitted_date'].isoformat()
            }
        )
        response.raise_for_status()
        loaded += 1
        if loaded % 1000 == 0:
            print(f"Loaded {loaded} / {len(transformed_df)} records")
    except Exception as e:
        errors.append({
            'record': row.to_dict(),
            'error': str(e)
        })

print(f"Successfully loaded: {loaded}")
print(f"Errors: {len(errors)}")

# Save errors for review
if errors:
    pd.DataFrame(errors).to_csv('load_errors.csv')
Strategy 3: Incremental (for large datasets)
import time

# Load in chunks, commit after each chunk
chunk_size = 10000
for i in range(0, len(transformed_df), chunk_size):
    chunk = transformed_df[i:i+chunk_size]
    # Load chunk (conn is a SQLAlchemy connection here, as to_sql expects)
    chunk.to_sql('permits', conn, if_exists='append', index=False)
    # Commit
    conn.commit()
    print(f"Loaded chunk {i//chunk_size + 1}: {len(chunk)} records")
    # Pause between chunks (be nice to the database)
    time.sleep(1)
Post-load validation:
# Verify counts match (psycopg2 requires a cursor for queries)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM permits")
new_count = cursor.fetchone()[0]

legacy_count = 46545
assert new_count == legacy_count, \
    f"Count mismatch! Expected {legacy_count}, got {new_count}"

# Spot check sample records (parameterized query, never string formatting)
sample_ids = [12, 156, 2847, 19203, 45012]
for id in sample_ids:
    legacy_record = legacy_df[legacy_df['id'] == id].iloc[0]
    new_record = pd.read_sql(
        "SELECT * FROM permits WHERE legacy_id = %(id)s", conn, params={'id': id}
    ).iloc[0]
    # Compare key fields
    assert legacy_record['customer_id'] == new_record['customer_id']
    assert legacy_record['address'] == new_record['address']
    # ...

print("✓ Post-load validation passed")
Section 4: Parallel Running
Running old and new systems side-by-side reduces risk.
Parallel Running Strategy
Phase 1: New system shadow mode (weeks 1-4)
- Old system: Production (users work here)
- New system: Shadow (processes data but doesn't affect users)
- Compare outputs between systems

Phase 2: New system read-only (weeks 5-8)
- Old system: Still production for writes
- New system: Users can view data (read-only)
- Builds user confidence

Phase 3: Gradual cutover (weeks 9-12)
- New system: Handles some transactions
- Old system: Handles remaining transactions
- Monitor both systems

Phase 4: Full cutover (week 13)
- New system: Primary
- Old system: Backup/archive only
Example: Invoice System Parallel Running
Week 1-4: Shadow mode
// Every invoice entered in old system also creates record in new system
async function processInvoice(invoiceData) {
  // Process in old system (production)
  const oldSystemResult = await oldSystem.createInvoice(invoiceData);

  try {
    // Also process in new system (shadow)
    const newSystemResult = await newSystem.createInvoice(invoiceData);

    // Compare results
    const comparison = compareInvoices(oldSystemResult, newSystemResult);
    if (!comparison.match) {
      logger.warn('Invoice mismatch detected', {
        invoiceId: invoiceData.id,
        differences: comparison.differences
      });

      // Alert developers
      await alerting.send({
        type: 'shadow_mode_mismatch',
        details: comparison
      });
    }
  } catch (error) {
    // New system errors don't affect production
    logger.error('New system error (shadow mode)', error);
  }

  // Return old system result (production)
  return oldSystemResult;
}
// Note: `new` is a reserved word in JavaScript, so the
// parameters are named oldInvoice / newInvoice.
function compareInvoices(oldInvoice, newInvoice) {
  const differences = [];

  if (oldInvoice.total !== newInvoice.total) {
    differences.push(`Total: ${oldInvoice.total} vs ${newInvoice.total}`);
  }
  if (oldInvoice.tax !== newInvoice.tax) {
    differences.push(`Tax: ${oldInvoice.tax} vs ${newInvoice.tax}`);
  }

  return {
    match: differences.length === 0,
    differences
  };
}
Week 5-8: Read-only mode
// Users can view invoices in new system
app.get('/invoices/:id', async (req, res) => {
  const invoiceId = req.params.id;

  // Fetch from both systems
  const [oldInvoice, newInvoice] = await Promise.all([
    oldSystem.getInvoice(invoiceId),
    newSystem.getInvoice(invoiceId)
  ]);

  // Show comparison view
  res.render('invoice-compare', {
    old: oldInvoice,
    new: newInvoice,
    differences: compareInvoices(oldInvoice, newInvoice).differences
  });
});
Week 9-12: Gradual cutover
// Route some invoices to new system based on criteria
async function processInvoice(invoiceData) {
  // Feature flag: gradually increase percentage (start at 10%, raise to 100%)
  const newSystemPercentage = featureFlags.get('new_system_cutover');
  const useNewSystem = Math.random() < (newSystemPercentage / 100);

  if (useNewSystem) {
    try {
      const result = await newSystem.createInvoice(invoiceData);
      logger.info('Invoice processed in new system', { invoiceId: invoiceData.id });
      return result;
    } catch (error) {
      // Fallback to old system on error
      logger.error('New system failed, falling back to old', error);
      return await oldSystem.createInvoice(invoiceData);
    }
  } else {
    return await oldSystem.createInvoice(invoiceData);
  }
}
Benefits of parallel running:
- ✅ Catch discrepancies before cutover
- ✅ Users build confidence in new system
- ✅ Can pause if issues arise
- ✅ Gradual load increase (not sudden spike)
- ✅ Rollback easy (flip feature flag)
Cost: Extra infrastructure (running both systems)
Worth it? Absolutely. Parallel running has saved countless migrations from disaster.
Section 5: Cutover Strategies
Eventually you need to cut over completely. Here are proven approaches:
Strategy 1: Weekend Cutover (Low-Volume Systems)
Best for: Systems with low weekend activity
Timeline:
Friday 5 PM: Freeze old system (read-only)
Friday 6 PM: Begin final data migration
Friday 11 PM: Validate migrated data
Saturday 8 AM: Test new system thoroughly
Sunday 2 PM: Training for support staff
Monday 7 AM: New system goes live
Rollback trigger: If critical issues found by Sunday noon, abort and revert
Example: County permit system
- Friday evening: 2,347 permits in system
- Weekend: Migrate all data, test thoroughly
- Monday: Staff arrives to new system
- Had entire weekend for migration + testing
Strategy 2: Phased Cutover (High-Volume Systems)
Best for: Systems that can't go offline
Phase 1: New transactions only
Week 1: All new data goes to new system
Old data stays in old system
Users must check both systems
Phase 2: Most recent data migrated
Week 2-3: Migrate last 6 months of data
Users check new system first, old system if not found
Phase 3: Historical data migrated
Week 4-8: Migrate older historical data in chunks
Most users fully on new system
Phase 4: Archive old system
Week 9+: Old system read-only for archive
New system fully operational
Example: Hospital patient records. Can't migrate 20 years of records overnight. Instead:
- Week 1: All new patients in new system
- Week 2-4: Migrate last 2 years (active patients)
- Month 2-3: Migrate years 3-10 (recent patients)
- Month 4-6: Migrate years 11-20 (archive)
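Phase 2's rule of "check the new system first, fall back to the old if not found" reduces to a few lines of lookup logic. A sketch, with plain dicts standing in for the two systems' hypothetical clients:

```python
def find_record(record_id, new_system, old_system):
    """Phase 2 lookup: try the new system first, fall back to legacy."""
    record = new_system.get(record_id)
    if record is not None:
        return record, 'new'
    record = old_system.get(record_id)
    if record is not None:
        return record, 'old'
    return None, 'missing'

# Plain dicts stand in for the two systems' clients here.
new_db = {'P-100': {'status': 'approved'}}   # migrated records
old_db = {'P-007': {'status': 'closed'}}     # not yet migrated

record, source = find_record('P-007', new_db, old_db)
```

As each historical chunk is migrated, more lookups resolve in the new system and the legacy fallback fades away without any user-visible change.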
Strategy 3: Pilot Group Cutover
Best for: Large organizations, multiple locations
Week 1-2: Pilot group (10% of users)
- IT department tests new system
- Iron out issues with small group
- Gather feedback, fix problems
Week 3-4: Early adopters (20% of users)
- Volunteer departments go live
- Create internal champions
- Document lessons learned
Week 5-8: Major rollout (50% of users)
- Roll out to half of organization
- Support team at full capacity
- Monitor closely for issues
Week 9-12: Full deployment (100% of users)
- Remaining users cutover
- Old system shut down
- Migration complete
Example: State government (50 agencies)
- Agency 1 (IT): Weeks 1-2
- Agency 2-3 (volunteers): Weeks 3-4
- Agencies 4-10: Week 5-6
- Agencies 11-25: Week 7-8
- Agencies 26-50: Week 9-10
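A pilot-group rollout like this is easy to gate in code: assign each department (or agency) to a wave and compare against the current wave. A hedged sketch; the wave assignments are illustrative:

```python
# Wave assignments are illustrative; a real rollout would keep
# these in configuration, not code.
ROLLOUT_WAVES = {
    'IT': 1,            # pilot group (weeks 1-2)
    'Permits': 2,       # early adopters (weeks 3-4)
    'Inspections': 3,   # major rollout (weeks 5-8)
}

def uses_new_system(department, current_wave):
    # Departments without a wave assignment stay on the old system
    wave = ROLLOUT_WAVES.get(department)
    return wave is not None and wave <= current_wave

# During weeks 3-4 (wave 2), IT and Permits see the new system
on_new = sorted(d for d in ROLLOUT_WAVES if uses_new_system(d, current_wave=2))
```

Advancing the rollout is then a single config change (increment `current_wave`), and pausing it is just as cheap.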
Section 6: Rollback Planning
"Hope for the best, plan for the worst."
When to Roll Back
Trigger conditions (decide upfront):
1. Data loss: Any data lost or corrupted (automatic rollback)
2. Critical function down: Core feature doesn't work (rollback if not fixed in 4 hours)
3. Mass user rejection: >50% of users can't complete tasks (rollback if not fixed in 24 hours)
4. Performance disaster: System 10x slower than old system (rollback if not fixed in 8 hours)
5. Integration failure: Critical integration broken (e.g., payment processing) (immediate rollback)
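Trigger conditions are easiest to enforce when encoded as explicit checks against live metrics, so the go/no-go call isn't made under pressure. A hedged sketch of such an evaluator; the thresholds mirror the list above, but the metric names are invented:

```python
def should_roll_back(metrics, hours_since_detection):
    """Evaluate the five trigger conditions listed above.
    The `metrics` keys are illustrative, not a real monitoring API."""
    if metrics.get('data_loss'):
        return True, 'data loss (automatic rollback)'
    if metrics.get('critical_integration_down'):
        return True, 'critical integration broken (immediate rollback)'
    if metrics.get('core_feature_down') and hours_since_detection >= 4:
        return True, 'core feature down for over 4 hours'
    if metrics.get('blocked_user_pct', 0) > 50 and hours_since_detection >= 24:
        return True, 'mass user rejection for over 24 hours'
    if metrics.get('slowdown_factor', 1) >= 10 and hours_since_detection >= 8:
        return True, 'performance disaster for over 8 hours'
    return False, 'keep monitoring'

decision, reason = should_roll_back({'core_feature_down': True},
                                    hours_since_detection=5)
```

Wiring a check like this into monitoring makes the rollback decision mechanical rather than political.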
Rollback Procedure
Prepare before cutover:
1. Backup everything
# Database backup
pg_dump production_db > backup_pre_cutover_2025_12_28.sql
# Application code backup
git tag v2.0-pre-cutover
git push origin v2.0-pre-cutover
# Configuration backup
cp -r /etc/app-config /backup/config_2025_12_28
2. Document rollback steps
# Rollback Procedure
## Decision
Rollback authorized by: [CTO name]
Reason: [Critical issue description]
Time: [timestamp]
## Steps (60-minute timeline)
### Minute 0-5: Notify
- [ ] Email all users: "System issue detected, rolling back"
- [ ] Post status page update
- [ ] Alert support team
### Minute 5-15: Stop new system
- [ ] Set new system to maintenance mode
- [ ] Stop application servers
- [ ] Wait for in-flight requests to complete
### Minute 15-30: Restore old system
- [ ] Restore database from backup
- [ ] Restart old application servers
- [ ] Verify old system functionality
### Minute 30-45: Data reconciliation
- [ ] Export any data created in new system (last 4 hours)
- [ ] Manually enter into old system (or save for later)
- [ ] Verify critical transactions not lost
### Minute 45-60: Verify and communicate
- [ ] Test old system with 10 test cases
- [ ] Confirm all users can access
- [ ] Email users: "System restored, back to normal"
- [ ] Update status page: "Resolved"
## Post-Rollback
- [ ] Schedule post-mortem meeting (within 24 hours)
- [ ] Analyze what went wrong
- [ ] Fix issues before reattempting migration
3. Test rollback procedure
Before go-live:
1. Perform full cutover in test environment
2. Introduce simulated issue
3. Execute rollback procedure
4. Verify everything restored correctly
5. Time how long each step takes
Goal: Rollback should take <60 minutes
Real example: E-commerce site rollback
Black Friday launch of a new checkout system. Within 2 hours:
- 40% of transactions failing (payment integration bug)
- $80,000 in lost sales
- Social media on fire
Rollback executed:
- 15 minutes: Decided to roll back
- 10 minutes: Notified users, stopped new system
- 20 minutes: Restored old system from backup
- 15 minutes: Verified functionality
- Total: 60 minutes from decision to resolution
Lost sales during rollback: $15,000
Lost sales if continued with broken system: $300,000+
Critical advice: Test your rollback procedure. Many teams plan rollbacks that don't work when needed.
Section 7: Change Management
Technical migration is only half the battle. Users must embrace the new system.
The Psychology of Change
Users fear:
- Looking incompetent (don't know the new system)
- Losing productivity (slower initially)
- Losing their job (automation replacing them)
- Breaking things (scared to try features)
Your job: Address these fears proactively.
Change Management Strategy
Phase 1: Before Migration - Build Buy-In
Involve users early
Months before go-live:
- Survey users about pain points with old system
- Demo new system, get feedback
- Form "power user" group to test beta
- Let users suggest features
Result: Users feel heard, have ownership
Create champions
Identify enthusiastic users in each department:
- Give them early access
- Train them deeply
- Make them trainers for their peers
- Recognize them publicly
Result: Peer influence drives adoption
Communicate benefits (not features)
Bad: "New system has inline validation!"
Good: "No more rejected forms - system tells you errors immediately"
Bad: "Real-time integration with USPS API!"
Good: "Address autocomplete - just type and select, no more typos"
Focus on: Faster, easier, less frustration
Phase 2: During Migration - Support Users
Training strategy
Don't: Single 2-hour training session, done
Do: Multiple touchpoints
Week before:
- 1-hour overview session
- Hands-on practice in test system
Launch day:
- Quick reference card on every desk
- Support staff in every department
- "Help" button in every screen
Week after:
- Daily "office hours" for questions
- Video tutorials for common tasks
- Peer training (champions help colleagues)
Month after:
- Advanced features training
- Efficiency tips
- Gather feedback for improvements
Support structure
Level 1: Peer support (champions in each dept)
Level 2: Help desk (ticketing system)
Level 3: Development team (critical issues)
Track:
- Common questions (add to FAQ)
- Feature requests (prioritize for next release)
- Pain points (fix quickly)
Phase 3: After Migration - Sustain Adoption
Monitor usage
// Track feature adoption
analytics.track('feature_used', {
  feature: 'inline_validation',
  user: userId,
  department: userDept
});

// Identify struggling users
const lowUsageUsers = getUsersWithLogin({
  lastLoginMoreThan: '7 days ago',
  totalLogins: '<5'
});

// Proactive outreach
for (const user of lowUsageUsers) {
  sendEmail(user, {
    template: 'need_help',
    message: "We noticed you haven't used the new system much. Need help?"
  });
}
Celebrate wins
Share success stories:
- "Processing time cut from 10 days to 2 days!"
- "User satisfaction up 45%"
- "Zero errors this week (vs 20/week in old system)"
Recognize users:
- "Department of the Month: IT (100% adoption)"
- "Power User Spotlight: Jane (helped 30 colleagues)"
Continuous improvement
Monthly:
- Review support tickets (common issues)
- Survey users (satisfaction, pain points)
- Prioritize improvements
Quarterly:
- Release new features
- Advanced training sessions
- Showcase power user tips
Real Example: Hospital EMR Migration
Challenge: Doctors hated old paper charts but feared EMR.
Change management approach:
3 months before:
- Interviewed 50 doctors about pain points
- Demoed EMR focusing on their specific needs
- Created "Physician Advisory Group" (10 doctors)

1 month before:
- Advisory group tested EMR, gave feedback
- Modified based on their input (custom templates)
- Recorded video testimonials from advisory group

Launch week:
- 1 doctor + 1 medical assistant per shift as "super users"
- IT staff embedded in every clinic
- Old paper charts still available (safety net)

Results:
- Week 1: 60% adoption (better than expected 40%)
- Week 4: 90% adoption
- Month 3: 98% adoption
- 6 months: Paper charts rarely used
Key factors:
- Doctors involved in design (felt ownership)
- Peer champions (trusted voices)
- Safety net (could fall back to paper if needed)
- Fast support (IT staff present)
Section 8: Migration Checklist
Pre-Migration (2-4 months before)
Data Assessment:
- [ ] Extract sample data from legacy system
- [ ] Profile data quality (nulls, duplicates, invalid values)
- [ ] Estimate cleanup effort (weeks? months?)
- [ ] Create data quality report for stakeholders
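Data profiling can start very simply: count nulls, duplicates, and rule violations per field on a sample extract. A minimal sketch (the field names and validation rules are illustrative, not from any particular legacy system):

```javascript
// Profile a sample of legacy records for nulls, duplicates, and invalid
// values. `rules` maps field name -> predicate returning true if valid.
function profileData(records, rules) {
  const seen = new Set();
  const profile = { total: records.length, duplicates: 0, fields: {} };
  for (const field of Object.keys(rules)) {
    profile.fields[field] = { nulls: 0, invalid: 0 };
  }
  for (const rec of records) {
    const key = JSON.stringify(rec);       // naive whole-record duplicate check
    if (seen.has(key)) profile.duplicates++;
    seen.add(key);
    for (const [field, isValid] of Object.entries(rules)) {
      const value = rec[field];
      if (value == null || value === '') profile.fields[field].nulls++;
      else if (!isValid(value)) profile.fields[field].invalid++;
    }
  }
  return profile;
}

// Illustrative sample: one bad date, one missing email, one duplicate row
const sample = [
  { permitId: 'P-001', issued: '2024-03-01', email: 'a@example.com' },
  { permitId: 'P-002', issued: 'unknown',    email: '' },
  { permitId: 'P-001', issued: '2024-03-01', email: 'a@example.com' },
];
const report = profileData(sample, {
  issued: v => !Number.isNaN(Date.parse(v)),
  email:  v => /@/.test(v),
});
```

Even a rough report like this turns "the data is probably fine" into concrete numbers you can put in front of stakeholders.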
Technical Preparation:
- [ ] Set up new system infrastructure
- [ ] Configure integrations with external systems
- [ ] Build data migration pipeline (extract, transform, validate, load)
- [ ] Test migration with sample data
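The extract-transform-validate-load pipeline can be sketched as a gate: records that fail validation are quarantined for review rather than loaded. The stage functions here (extract, transform, validate, load) are illustrative placeholders for real connectors:

```javascript
// ETL-with-validation sketch: validation is a hard gate, so bad records
// are quarantined for review instead of being silently loaded.
function runPipeline({ extract, transform, validate, load }) {
  const toLoad = [];
  const quarantined = [];
  for (const record of extract()) {
    const transformed = transform(record);
    const errors = validate(transformed);   // [] means the record is clean
    if (errors.length === 0) {
      toLoad.push(transformed);
    } else {
      quarantined.push({ record, errors }); // fix and re-run, never skip
    }
  }
  load(toLoad);
  return { loaded: toLoad.length, quarantined };
}

// Illustrative run: one clean record, one with a missing required field
const target = [];
const result = runPipeline({
  extract:   () => [{ name: ' Alice ', dept: 'IT' }, { name: '', dept: 'HR' }],
  transform: r  => ({ ...r, name: r.name.trim() }),
  validate:  r  => (r.name ? [] : ['name is required']),
  load:      rs => target.push(...rs),
});
```

The design choice worth copying is the quarantine: it gives you an explicit worklist of bad records instead of discovering them live on cutover Monday.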
Change Management:
- [ ] Survey users about pain points
- [ ] Demo new system to users
- [ ] Form power user group
- [ ] Begin training material development
Risk Planning:
- [ ] Define rollback triggers
- [ ] Document rollback procedure
- [ ] Test rollback in staging environment
- [ ] Identify critical go/no-go criteria
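Rollback triggers are easiest to act on when they are objective thresholds agreed before cutover, not judgment calls made at 2 a.m. A minimal sketch (the metric names and threshold values are illustrative):

```javascript
// Pre-agreed rollback triggers: anyone on call can evaluate them
// against live metrics without debate. Thresholds are illustrative.
const ROLLBACK_TRIGGERS = [
  { name: 'error rate',    check: m => m.errorRate > 0.05 },
  { name: 'data mismatch', check: m => m.mismatchRate > 0.01 },
  { name: 'p95 latency',   check: m => m.p95LatencyMs > 2000 },
];

function shouldRollBack(metrics) {
  const fired = ROLLBACK_TRIGGERS
    .filter(t => t.check(metrics))
    .map(t => t.name);
  return { rollBack: fired.length > 0, fired };
}

// Illustrative check: error rate has breached its threshold
const decision = shouldRollBack({
  errorRate: 0.08,
  mismatchRate: 0.001,
  p95LatencyMs: 900,
});
```

Writing the triggers down as code (or at least as a table) removes the "let's give it another hour" temptation when things start going wrong.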
Migration Execution (1-3 months)
Data Migration:
- [ ] Full data extract from legacy system
- [ ] Data transformation and cleanup
- [ ] Validation (counts, samples, business rules)
- [ ] Load into new system
- [ ] Post-load validation
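Post-load validation should at minimum reconcile record counts and one aggregate the business already tracks (a financial total is ideal). A minimal sketch with illustrative field names:

```javascript
// Post-load validation: compare record counts and reconcile an
// aggregate (e.g. a financial total) between source and target.
function validateLoad(sourceRows, targetRows, amountField) {
  const issues = [];
  if (sourceRows.length !== targetRows.length) {
    issues.push(`count mismatch: ${sourceRows.length} vs ${targetRows.length}`);
  }
  const sum = rows => rows.reduce((acc, r) => acc + r[amountField], 0);
  if (sum(sourceRows) !== sum(targetRows)) {
    issues.push(`aggregate mismatch on ${amountField}`);
  }
  return issues; // empty array means counts and totals reconcile
}

// Illustrative reconciliation of two invoice extracts
const src = [{ id: 1, amount: 100 }, { id: 2, amount: 250 }];
const dst = [{ id: 1, amount: 100 }, { id: 2, amount: 250 }];
const issues = validateLoad(src, dst, 'amount');
```

Counts alone are not enough: a load can have the right number of rows with the wrong values in them, which is exactly what the aggregate check catches.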
Parallel Running:
- [ ] Shadow mode (new system processes but doesn't affect production)
- [ ] Compare outputs between old and new
- [ ] Fix discrepancies
- [ ] Read-only access for users (build confidence)
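Shadow mode can be sketched as a wrapper: every request runs through both systems, only the old system's answer is returned, and any difference is logged for investigation. The system and logger functions below are illustrative:

```javascript
// Shadow mode: run both systems, serve the old system's result,
// and log discrepancies. New-system failures never reach users.
function shadowCompare(input, oldSystem, newSystem, logDiscrepancy) {
  const oldResult = oldSystem(input);
  let newResult;
  try {
    newResult = newSystem(input);
  } catch (err) {
    logDiscrepancy({ input, error: String(err) });
    return oldResult;            // a crash in the new system is just a log line
  }
  if (JSON.stringify(oldResult) !== JSON.stringify(newResult)) {
    logDiscrepancy({ input, oldResult, newResult });
  }
  return oldResult;              // production always serves the old output
}

// Illustrative permit-fee comparison between legacy and new logic
const diffs = [];
const fee = shadowCompare(
  { permitType: 'electrical' },
  req => ({ fee: 150 }),                                        // legacy rule
  req => ({ fee: req.permitType === 'electrical' ? 150 : 100 }), // new rule
  d => diffs.push(d)
);
```

The discrepancy log is the payoff: each entry is an edge case the old system handled that you would otherwise have discovered live after cutover.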
Training:
- [ ] Train support staff first
- [ ] Train power users / champions
- [ ] Train general users (multiple sessions)
- [ ] Distribute quick reference materials

Cutover:
- [ ] Execute cutover plan (weekend/phased/pilot)
- [ ] Monitor closely for first 48 hours
- [ ] Support team at full capacity
- [ ] Daily status updates to stakeholders
Post-Migration (1-3 months)
Support:
- [ ] Office hours for user questions
- [ ] Track and resolve support tickets
- [ ] Identify and fix common issues
- [ ] Gather user feedback

Optimization:
- [ ] Monitor system performance
- [ ] Optimize slow queries
- [ ] Tune infrastructure as needed
- [ ] Release bug fixes and improvements

Validation:
- [ ] Verify data integrity
- [ ] Reconcile financial records
- [ ] Audit critical business processes
- [ ] Sign-off from stakeholders

Decommission:
- [ ] Keep old system read-only (6-12 months)
- [ ] Archive old system data
- [ ] Document lessons learned
- [ ] Celebrate success! 🎉
Conclusion: Migration Success Formula
The proven formula for successful migrations:
- Start with data quality (most failures are data issues)
- Never big bang (use the strangler fig pattern)
- Run in parallel (catches problems before cutover)
- Plan rollback (and test it!)
- Support users (change management is 50% of the effort)
- Monitor closely (the first 30 days are critical)
- Iterate quickly (users will forgive bugs if you respond fast)
The migrations that succeed:
- ✅ Start with a realistic assessment of data quality
- ✅ Migrate incrementally (feature by feature)
- ✅ Run old and new systems in parallel
- ✅ Have a tested rollback procedure
- ✅ Train users thoroughly
- ✅ Support heavily during the first month
- ✅ Fix issues within hours, not days

The migrations that fail:
- ❌ Assume data is clean (it never is)
- ❌ Try to cut over everything at once
- ❌ No parallel running or testing
- ❌ No rollback plan
- ❌ Train users in a single session before go-live
- ❌ Skeleton support team
- ❌ Slow to fix issues
Final wisdom from decades of migration experience:
"Every successful migration is boring. Gradual, methodical, well-planned. Every failed migration tried to be exciting - big reveals, dramatic cutovers. Boring is good. Boring means low risk. Boring means you go home on time. Aim for boring."
Next chapter: Business Models (how to make money from pattern-based systems).
Further Reading
Legacy System Modernization
Strategies:
- Seacord, R. C., Plakosh, D., & Lewis, G. A. (2003). Modernizing Legacy Systems. Addison-Wesley. - Software evolution and migration patterns
- Bisbal, J., et al. (1999). "A survey of research into legacy system migration." Software Maintenance and Evolution, 11(6), 335-364. - Academic survey of migration approaches - https://doi.org/10.1016/S0164-1212(99)00062-X

Strangler Fig Pattern:
- Fowler, M. (2004). "Strangler Fig Application." https://martinfowler.com/bliki/StranglerFigApplication.html - Gradually replacing legacy systems
- Newman, S. (2019). Monolith to Microservices. O'Reilly Media. - Evolutionary architecture and incremental migration
Data Migration
Core Texts:
- Morris, J. (2012). Practical Data Migration (2nd ed.). BCS. - Comprehensive guide to data migration projects
- Haller, K. (2009). "Six Strategies for Application and Data Migration Projects." IEEE Software, 26(1), 90-93. - https://doi.org/10.1109/MS.2009.13

ETL and Data Integration:
- Kimball, R., & Caserta, J. (2004). The Data Warehouse ETL Toolkit. Wiley. - Extract, Transform, Load patterns
- Informatica: https://www.informatica.com/ - Enterprise data integration platform
- Talend: https://www.talend.com/ - Open-source data integration
Change Management
Organizational Change:
- Kotter, J. P. (1996). Leading Change. Harvard Business Press. - 8-step process for successful change
- Prosci ADKAR Model: https://www.prosci.com/methodology/adkar - Awareness, Desire, Knowledge, Ability, Reinforcement

Technology Adoption:
- Rogers, E. M. (2003). Diffusion of Innovations (5th ed.). Free Press. - How new technologies spread through organizations
- Gartner Hype Cycle: https://www.gartner.com/en/research/methodologies/gartner-hype-cycle - Technology maturity and adoption stages
Risk Management
Project Risk:
- Boehm, B. W. (1991). "Software risk management: principles and practices." IEEE Software, 8(1), 32-41. - Risk-driven development - https://doi.org/10.1109/52.62930
- PMI: Practice Standard for Project Risk Management. https://www.pmi.org/

Testing Strategies:
- Crispin, L., & Gregory, J. (2009). Agile Testing. Addison-Wesley. - Testing during migration projects
- Humble, J., & Farley, D. (2010). Continuous Delivery. Addison-Wesley. - Automated testing and deployment pipelines
Parallel Running
Approaches:
- Blue-Green Deployment: https://martinfowler.com/bliki/BlueGreenDeployment.html - Running old and new systems side-by-side
- Canary Releases: https://martinfowler.com/bliki/CanaryRelease.html - Gradual rollout to subsets of users
- Feature Flags: https://launchdarkly.com/blog/what-are-feature-flags/ - Toggling between old and new functionality

Tools:
- LaunchDarkly: https://launchdarkly.com/ - Feature flag management
- Split.io: https://www.split.io/ - Feature delivery platform
- Optimizely: https://www.optimizely.com/ - Experimentation and feature flags
Training and Documentation
User Training:
- Clark, R. C., & Mayer, R. E. (2016). E-Learning and the Science of Instruction (4th ed.). Wiley. - Evidence-based instructional design
- WalkMe: https://www.walkme.com/ - Digital adoption platform
- Pendo: https://www.pendo.io/ - In-app guidance and training

Documentation:
- Write the Docs: https://www.writethedocs.org/ - Community for documentation best practices
- Docs as Code: https://www.writethedocs.org/guide/docs-as-code/ - Treating documentation like code
Post-Migration Optimization
Performance:
- Gregg, B. (2020). Systems Performance (2nd ed.). Addison-Wesley. - Performance analysis after migration
- New Relic: https://newrelic.com/ - Application performance monitoring

Continuous Improvement:
- Kim, G., et al. (2016). The DevOps Handbook. IT Revolution Press. - Continuous improvement culture
- Accelerate (DORA Metrics): https://www.devops-research.com/research.html - Measuring software delivery performance
Related Trilogy Content
- Volume 1, Chapter 8: Architecture of Domain-Specific Systems—understanding system architecture principles
- Volume 1, Chapter 10: Domain Knowledge Acquisition—capturing knowledge from legacy systems
- Volume 2, Chapter 5: The Pattern Language Approach—systematic migration using patterns
- Volume 2, Chapter 2: From Static Output to Living Memory—transforming data architecture
- Volume 3, Chapter 12: Implementation Roadmap—planning the migration journey
- Volume 3, Chapter 13: Technology Architecture—designing your target architecture
- Volume 3, Pattern 19: Version Control—managing changes during migration
- Volume 3, Pattern 18: Audit Trail—tracking migration history