Redshift ETL pipelines to BigQuery
Re-home incremental ETL—staging, dedupe, UPSERT patterns, and orchestration—from Redshift into BigQuery with an explicit run contract and validation gates that prevent KPI drift.
- Input
- Redshift ETL / pipeline migration logic
- Output
- BigQuery equivalent (validated)
- Common pitfalls
- Distribution/sort key thinking carried over: Redshift-era shapes don’t map; BigQuery needs pruning and layout aligned to filters.
- UPSERT via DELETE+INSERT drift: partial failures and retries double-apply or delete wrong rows without idempotency guards.
- Implicit watermarks: using job runtime or CURRENT_TIMESTAMP() instead of persisted high-water marks.
Why this breaks
Redshift ETL systems often hide correctness in operational conventions: watermark tables, VACUUM/ANALYZE-driven assumptions, distribution/sort key choices, and UPSERT patterns implemented via staging + delete/insert. BigQuery can implement equivalent outcomes—but the run contract must be made explicit and validated under reruns and late data.
Common symptoms after cutover:
- Duplicates or missing updates because watermarks and tie-breakers were implicit
- Incremental loads turn into full scans because staging boundaries aren’t pruning-aware
- DELETE+INSERT UPSERT logic drifts under retries and partial failures
- SCD dimensions drift during backfills and late-arrival corrections
- Costs spike because partition/clustering and apply windows weren’t designed for BigQuery
How conversion works
- Inventory & classify ETL jobs, schedules, dependencies, and operational controls (watermarks, control tables, retries).
- Extract the run contract: business keys, incremental boundaries, dedupe tie-breakers, late-arrival policy, and failure/restart semantics.
- Re-home transformations to BigQuery-native staging (landing → typed staging → dedupe → apply) with partitioning/clustering aligned to load windows and access paths.
- Implement restartability: applied-batch tracking, idempotency keys, deterministic ordering, and safe retries.
- Re-home orchestration (Composer/Airflow, dbt, or your runner) with explicit DAG dependencies, retries, and alerts.
- Gate cutover with evidence: golden outputs + incremental integrity simulations (reruns, backfills, late injections) and rollback-ready criteria.
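The "extract the run contract" step above hinges on a persisted high-water mark rather than job runtime. A minimal BigQuery scripting sketch of that read-then-stage step follows; the `proj.control.watermarks` table, the `orders` tables, and every column name are hypothetical placeholders:

```sql
-- Read the persisted high-water mark instead of trusting job runtime.
DECLARE hwm TIMESTAMP DEFAULT (
  SELECT last_loaded_at
  FROM `proj.control.watermarks`
  WHERE job_name = 'orders_incremental'
);

-- Stage only rows beyond the watermark; the filter on the partition
-- column (event_date) lets BigQuery prune instead of scanning the
-- full landing table.
CREATE OR REPLACE TABLE `proj.stage.orders_staging` AS
SELECT *
FROM `proj.land.orders_landing`
WHERE event_ts > hwm
  AND event_date >= DATE(hwm);
```

Because the watermark lives in a table rather than in scheduler state, a rerun of the same window reads the same boundary and stays deterministic.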
Supported constructs
Representative Redshift ETL constructs we commonly migrate to BigQuery (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Staging + DELETE/INSERT UPSERT patterns | BigQuery MERGE with bounded apply windows | Avoid full scans and ensure idempotency under retries. |
| Watermark/control tables | Explicit high-water marks + applied-window tracking | Restartable and auditable incremental behavior. |
| SCD Type-1 / Type-2 logic | MERGE + end-date/current-flag patterns | Backfills and late updates validated as first-class scenarios. |
| COPY-based ingestion chains | Landing tables + batch manifests (or streaming) | Ingestion made explicit; replayability supported. |
| Vacuum/analyze expectations | Partitioning/clustering + pruning-aware SQL | Performance posture shifts to scan reduction and layout alignment. |
| Scheduler-driven chains | Composer/dbt orchestration with explicit DAG contracts | Retries, concurrency, and alerts modeled and monitored. |
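As a sketch of the SCD Type-2 row in the table above, the close-then-insert form keeps the logic readable under backfills. The `proj.dw.dim_customer` dimension, its staging feed, and the tracked columns are all illustrative assumptions:

```sql
-- 1) Close the current version when tracked attributes changed.
MERGE `proj.dw.dim_customer` AS d
USING `proj.stage.customer_changes` AS s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.tier != s.tier OR d.region != s.region) THEN
  UPDATE SET is_current = FALSE, valid_to = s.change_ts;

-- 2) Insert the new current version. After step 1, changed keys have
-- no current row, so the anti-join catches both new and changed keys.
INSERT INTO `proj.dw.dim_customer`
  (customer_id, tier, region, valid_from, valid_to, is_current)
SELECT s.customer_id, s.tier, s.region, s.change_ts,
       TIMESTAMP '9999-12-31 00:00:00+00', TRUE
FROM `proj.stage.customer_changes` AS s
LEFT JOIN `proj.dw.dim_customer` AS d
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;
```

Splitting close and insert also makes late-arrival corrections easier to validate, since each statement's effect can be checked independently.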
How workload changes
| Topic | Redshift | BigQuery |
|---|---|---|
| Incremental correctness | Often encoded in staging + delete/insert conventions and job timing | Explicit high-water marks + deterministic staging + integrity gates |
| Performance model | Distribution/sort keys + VACUUM/ANALYZE habits | Partition pruning + clustering + slot usage |
| Upserts | DELETE+INSERT common for UPSERT semantics | MERGE with partition-scoped boundaries |
| Orchestration | Schedulers + script chains | Composer/dbt/native orchestration with explicit DAG contracts |
Examples
Canonical BigQuery incremental apply pattern: stage → dedupe deterministically → MERGE with scoped partitions + applied-window tracking. Adjust keys, partitions, and casts to your model.
```sql
-- Control table for load windows (restartability)
CREATE TABLE IF NOT EXISTS `proj.control.load_windows` (
  job_name STRING NOT NULL,
  window_start TIMESTAMP NOT NULL,
  window_end TIMESTAMP NOT NULL,
  status STRING NOT NULL,
  applied_at TIMESTAMP
);

-- Mark a window as STARTED
INSERT INTO `proj.control.load_windows` (job_name, window_start, window_end, status)
VALUES (@job_name, @window_start, @window_end, 'STARTED');
```

Common pitfalls
- Distribution/sort key thinking carried over: Redshift-era shapes don’t map; BigQuery needs pruning and layout aligned to filters.
- UPSERT via DELETE+INSERT drift: partial failures and retries double-apply or delete wrong rows without idempotency guards.
- Implicit watermarks: using job runtime or CURRENT_TIMESTAMP() instead of persisted high-water marks.
- Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers causes drift under retries.
- Unbounded MERGE scans: applying MERGE without partition-scoped boundaries increases cost and runtime.
- Schema evolution surprises: upstream types widen; typed targets break or truncate silently.
- No incremental integrity tests: parity looks fine on a snapshot but breaks under reruns/backfills/late data.
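The deterministic-dedupe and bounded-MERGE points above can be sketched together. Table names, the `order_id` business key, and the tie-breaker columns here are illustrative, not a prescribed schema:

```sql
-- Deterministic dedupe: business key plus stable tie-breakers, so a
-- retry always produces the same survivor row.
CREATE OR REPLACE TABLE `proj.stage.orders_dedup` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY updated_at DESC, source_seq DESC  -- stable total order
         ) AS rn
  FROM `proj.stage.orders_staging`
)
WHERE rn = 1;

-- Partition-scoped MERGE: the predicate on the target's partition
-- column bounds the scan to the apply window, not the whole table.
MERGE `proj.dw.orders` AS t
USING `proj.stage.orders_dedup` AS s
ON t.order_id = s.order_id
   AND t.event_date BETWEEN @window_start_date AND @window_end_date
WHEN MATCHED THEN
  UPDATE SET status = s.status, amount = s.amount, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT ROW;
```

`INSERT ROW` assumes the staging and target schemas match; list columns explicitly if they diverge.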
Validation approach
- Execution checks: pipeline runs reliably under representative volumes and schedules.
- Structural parity: partition-level row counts and column profiles (null/min/max/distinct) for key tables.
- KPI parity: aggregates by key dimensions for critical marts and dashboards.
- Incremental integrity (mandatory):
- Idempotency: rerun same window → no net change
- Restart simulation: fail mid-run → resume → correct final state
- Backfill: historical windows replay without drift
- Late-arrival: inject late corrections → only expected rows change
- Cost/performance gates: pruning verified; scan bytes and runtime thresholds set for top jobs.
- Operational readiness: retry/alerting tests and rollback criteria defined before cutover.
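A structural-parity probe of the kind listed above, partition-level counts plus column profiles, can look like the following (table and columns hypothetical). Run it against both systems for the same window and diff the results; re-run it after an idempotency rerun to confirm no net change:

```sql
SELECT
  event_date,
  COUNT(*) AS row_count,
  COUNT(DISTINCT order_id) AS distinct_keys,
  COUNTIF(status IS NULL) AS null_status,
  MIN(updated_at) AS min_updated_at,
  MAX(updated_at) AS max_updated_at
FROM `proj.dw.orders`
WHERE event_date BETWEEN @window_start_date AND @window_end_date
GROUP BY event_date
ORDER BY event_date;
```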
Migration steps
- 01
Inventory ETL jobs, schedules, and dependencies
Extract job chains, upstream/downstream dependencies, SLAs, retry policies, and control-table conventions. Identify business-critical marts and consumers.
- 02
Formalize the run contract
Define load windows/high-water marks, business keys, deterministic ordering/tie-breakers, dedupe rules, late-arrival policy, and SCD strategy. Make restartability explicit.
- 03
Rebuild transformations on BigQuery-native staging
Implement landing → typed staging → dedupe → apply, with partitioning/clustering aligned to load windows and BI access paths. Define schema evolution policy (widen/quarantine/reject).
- 04
Re-home orchestration and operations
Implement DAGs in Composer/Airflow or your orchestrator: dependencies, retries, alerts, and concurrency. Recreate restart logic with idempotent markers.
- 05
Run parity and incremental integrity gates
Golden outputs + KPI aggregates, restart simulations, idempotency reruns, late-data injections, and backfill windows. Cut over only when thresholds pass and rollback criteria are defined.
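The idempotent markers in step 04 can complete the control-table lifecycle begun in the Examples section. This is a sketch against the same hypothetical `proj.control.load_windows` table and query parameters:

```sql
-- Restart guard: a window is eligible only if not already applied.
SELECT COUNT(*) = 0 AS can_apply
FROM `proj.control.load_windows`
WHERE job_name = @job_name
  AND window_start = @window_start
  AND window_end = @window_end
  AND status = 'APPLIED';

-- After a successful MERGE, flip STARTED -> APPLIED; retrying a
-- finished window then becomes a no-op.
UPDATE `proj.control.load_windows`
SET status = 'APPLIED', applied_at = CURRENT_TIMESTAMP()
WHERE job_name = @job_name
  AND window_start = @window_start
  AND window_end = @window_end
  AND status = 'STARTED';
```

Tying the status flip to the same orchestrated task as the MERGE keeps the marker and the data change aligned under retries.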
We inventory your Redshift ETL estate, formalize the run contract (windows, restartability, SCD), migrate a representative pipeline end-to-end, and deliver parity evidence with cutover gates.
Get an actionable migration plan with restart simulations, incremental integrity tests, reconciliation evidence, and cost/performance baselines—so ETL cutover is controlled and dispute-proof.