Redshift ETL pipelines to BigQuery
Re-home incremental ETL—staging, dedupe, UPSERT patterns, and orchestration—from Redshift into BigQuery with an explicit run contract and validation gates that prevent KPI drift.
- Input
- Redshift ETL / pipeline migration logic
- Output
- BigQuery equivalent (validated)
- Common pitfalls
- Distribution/sort key thinking carried over: Redshift-era shapes don’t map; BigQuery needs pruning and layout aligned to filters.
- UPSERT via DELETE+INSERT drift: partial failures and retries double-apply or delete wrong rows without idempotency guards.
- Implicit watermarks: using job runtime or CURRENT_TIMESTAMP() instead of persisted high-water marks.
Why this breaks
Redshift ETL systems often hide correctness in operational conventions: watermark tables, VACUUM/ANALYZE-driven assumptions, distribution/sort key choices, and UPSERT patterns implemented via staging + delete/insert. BigQuery can implement equivalent outcomes—but the run contract must be made explicit and validated under reruns and late data.
Common symptoms after cutover:
- Duplicates or missing updates because watermarks and tie-breakers were implicit
- Incremental loads turn into full scans because staging boundaries aren’t pruning-aware
- DELETE+INSERT UPSERT logic drifts under retries and partial failures
- SCD dimensions drift during backfills and late-arrival corrections
- Costs spike because partition/clustering and apply windows weren’t designed for BigQuery
How conversion works
- Inventory & classify ETL jobs, schedules, dependencies, and operational controls (watermarks, control tables, retries).
- Extract the run contract: business keys, incremental boundaries, dedupe tie-breakers, late-arrival policy, and failure/restart semantics.
- Re-home transformations to BigQuery-native staging (landing → typed staging → dedupe → apply) with partitioning/clustering aligned to load windows and access paths.
- Implement restartability: applied-batch tracking, idempotency keys, deterministic ordering, and safe retries.
- Re-home orchestration (Composer/Airflow, dbt, or your runner) with explicit DAG dependencies, retries, and alerts.
- Gate cutover with evidence: golden outputs + incremental integrity simulations (reruns, backfills, late injections) and rollback-ready criteria.
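The "extract the run contract" step above hinges on a persisted high-water mark rather than job runtime. A minimal BigQuery scripting sketch of that read-then-stage step follows; the `proj.control.watermarks` table, the `orders` tables, and every column name are hypothetical placeholders:

```sql
-- Read the persisted high-water mark instead of trusting job runtime.
DECLARE hwm TIMESTAMP DEFAULT (
  SELECT last_loaded_at
  FROM `proj.control.watermarks`
  WHERE job_name = 'orders_incremental'
);

-- Stage only rows beyond the watermark; the filter on the partition
-- column (event_date) lets BigQuery prune instead of scanning the
-- full landing table.
CREATE OR REPLACE TABLE `proj.stage.orders_staging` AS
SELECT *
FROM `proj.land.orders_landing`
WHERE event_ts > hwm
  AND event_date >= DATE(hwm);
```

Because the watermark lives in a table rather than in scheduler state, a rerun of the same window reads the same boundary and stays deterministic.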
Supported constructs
Representative Redshift ETL constructs we commonly migrate to BigQuery (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Staging + DELETE/INSERT UPSERT patterns | BigQuery MERGE with bounded apply windows | Avoid full scans and ensure idempotency under retries. |
| Watermark/control tables | Explicit high-water marks + applied-window tracking | Restartable and auditable incremental behavior. |
| SCD Type-1 / Type-2 logic | MERGE + end-date/current-flag patterns | Backfills and late updates validated as first-class scenarios. |
| COPY-based ingestion chains | Landing tables + batch manifests (or streaming) | Ingestion made explicit; replayability supported. |
| Vacuum/analyze expectations | Partitioning/clustering + pruning-aware SQL | Performance posture shifts to scan reduction and layout alignment. |
| Scheduler-driven chains | Composer/dbt orchestration with explicit DAG contracts | Retries, concurrency, and alerts modeled and monitored. |
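As a sketch of the SCD Type-2 row in the table above, the close-then-insert form keeps the logic readable under backfills. The `proj.dw.dim_customer` dimension, its staging feed, and the tracked columns are all illustrative assumptions:

```sql
-- 1) Close the current version when tracked attributes changed.
MERGE `proj.dw.dim_customer` AS d
USING `proj.stage.customer_changes` AS s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.tier != s.tier OR d.region != s.region) THEN
  UPDATE SET is_current = FALSE, valid_to = s.change_ts;

-- 2) Insert the new current version. After step 1, changed keys have
-- no current row, so the anti-join catches both new and changed keys.
INSERT INTO `proj.dw.dim_customer`
  (customer_id, tier, region, valid_from, valid_to, is_current)
SELECT s.customer_id, s.tier, s.region, s.change_ts,
       TIMESTAMP '9999-12-31 00:00:00+00', TRUE
FROM `proj.stage.customer_changes` AS s
LEFT JOIN `proj.dw.dim_customer` AS d
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;
```

Splitting close and insert also makes late-arrival corrections easier to validate, since each statement's effect can be checked independently.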
How workload changes
| Topic | Redshift | BigQuery |
|---|---|---|
| Incremental correctness | Often encoded in staging + delete/insert conventions and job timing | Explicit high-water marks + deterministic staging + integrity gates |
| Performance model | Distribution/sort keys + VACUUM/ANALYZE habits | Partition pruning + clustering + slot usage |
| Upserts | DELETE+INSERT common for UPSERT semantics | MERGE with partition-scoped boundaries |
| Orchestration | Schedulers + script chains | Composer/dbt/native orchestration with explicit DAG contracts |
Examples
Canonical BigQuery incremental apply pattern: stage → dedupe deterministically → MERGE with scoped partitions + applied-window tracking. Adjust keys, partitions, and casts to your model.
```sql
-- Control table for load windows (restartability)
CREATE TABLE IF NOT EXISTS `proj.control.load_windows` (
  job_name STRING NOT NULL,
  window_start TIMESTAMP NOT NULL,
  window_end TIMESTAMP NOT NULL,
  status STRING NOT NULL,
  applied_at TIMESTAMP
);

-- Mark a window as STARTED
INSERT INTO `proj.control.load_windows` (job_name, window_start, window_end, status)
VALUES (@job_name, @window_start, @window_end, 'STARTED');
```

Common pitfalls
- Distribution/sort key thinking carried over: Redshift-era shapes don’t map; BigQuery needs pruning and layout aligned to filters.
- UPSERT via DELETE+INSERT drift: partial failures and retries double-apply or delete wrong rows without idempotency guards.
- Implicit watermarks: using job runtime or CURRENT_TIMESTAMP() instead of persisted high-water marks.
- Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers causes drift under retries.
- Unbounded MERGE scans: applying MERGE without partition-scoped boundaries increases cost and runtime.
- Schema evolution surprises: upstream types widen; typed targets break or truncate silently.
- No incremental integrity tests: parity looks fine on a snapshot but breaks under reruns/backfills/late data.
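The deterministic-dedupe and bounded-MERGE points above can be sketched together. Table names, the `order_id` business key, and the tie-breaker columns here are illustrative, not a prescribed schema:

```sql
-- Deterministic dedupe: business key plus stable tie-breakers, so a
-- retry always produces the same survivor row.
CREATE OR REPLACE TABLE `proj.stage.orders_dedup` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY updated_at DESC, source_seq DESC  -- stable total order
         ) AS rn
  FROM `proj.stage.orders_staging`
)
WHERE rn = 1;

-- Partition-scoped MERGE: the predicate on the target's partition
-- column bounds the scan to the apply window, not the whole table.
MERGE `proj.dw.orders` AS t
USING `proj.stage.orders_dedup` AS s
ON t.order_id = s.order_id
   AND t.event_date BETWEEN @window_start_date AND @window_end_date
WHEN MATCHED THEN
  UPDATE SET status = s.status, amount = s.amount, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT ROW;
```

`INSERT ROW` assumes the staging and target schemas match; list columns explicitly if they diverge.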
Validation approach
- Execution checks: pipeline runs reliably under representative volumes and schedules.
- Structural parity: partition-level row counts and column profiles (null/min/max/distinct) for key tables.
- KPI parity: aggregates by key dimensions for critical marts and dashboards.
- Incremental integrity (mandatory):
- Idempotency: rerun same window → no net change
- Restart simulation: fail mid-run → resume → correct final state
- Backfill: historical windows replay without drift
- Late-arrival: inject late corrections → only expected rows change
- Cost/performance gates: pruning verified; scan bytes and runtime thresholds set for top jobs.
- Operational readiness: retry/alerting tests and rollback criteria defined before cutover.
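A structural-parity probe of the kind listed above, partition-level counts plus column profiles, can look like the following (table and columns hypothetical). Run it against both systems for the same window and diff the results; re-run it after an idempotency rerun to confirm no net change:

```sql
SELECT
  event_date,
  COUNT(*) AS row_count,
  COUNT(DISTINCT order_id) AS distinct_keys,
  COUNTIF(status IS NULL) AS null_status,
  MIN(updated_at) AS min_updated_at,
  MAX(updated_at) AS max_updated_at
FROM `proj.dw.orders`
WHERE event_date BETWEEN @window_start_date AND @window_end_date
GROUP BY event_date
ORDER BY event_date;
```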
Migration steps
- 01
Inventory ETL jobs, schedules, and dependencies
Extract job chains, upstream/downstream dependencies, SLAs, retry policies, and control-table conventions. Identify business-critical marts and consumers.
- 02
Formalize the run contract
Define load windows/high-water marks, business keys, deterministic ordering/tie-breakers, dedupe rules, late-arrival policy, and SCD strategy. Make restartability explicit.
- 03
Rebuild transformations on BigQuery-native staging
Implement landing → typed staging → dedupe → apply, with partitioning/clustering aligned to load windows and BI access paths. Define schema evolution policy (widen/quarantine/reject).
- 04
Re-home orchestration and operations
Implement DAGs in Composer/Airflow or your orchestrator: dependencies, retries, alerts, and concurrency. Recreate restart logic with idempotent markers.
- 05
Run parity and incremental integrity gates
Golden outputs + KPI aggregates, restart simulations, idempotency reruns, late-data injections, and backfill windows. Cut over only when thresholds pass and rollback criteria are defined.
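The idempotent markers in step 04 can complete the control-table lifecycle begun in the Examples section. This is a sketch against the same hypothetical `proj.control.load_windows` table and query parameters:

```sql
-- Restart guard: a window is eligible only if not already applied.
SELECT COUNT(*) = 0 AS can_apply
FROM `proj.control.load_windows`
WHERE job_name = @job_name
  AND window_start = @window_start
  AND window_end = @window_end
  AND status = 'APPLIED';

-- After a successful MERGE, flip STARTED -> APPLIED; retrying a
-- finished window then becomes a no-op.
UPDATE `proj.control.load_windows`
SET status = 'APPLIED', applied_at = CURRENT_TIMESTAMP()
WHERE job_name = @job_name
  AND window_start = @window_start
  AND window_end = @window_end
  AND status = 'STARTED';
```

Tying the status flip to the same orchestrated task as the MERGE keeps the marker and the data change aligned under retries.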
We inventory your Redshift ETL estate, formalize the run contract (windows, restartability, SCD), migrate a representative pipeline end-to-end, and deliver parity evidence with cutover gates.
Get an actionable migration plan with restart simulations, incremental integrity tests, reconciliation evidence, and cost/performance baselines—so ETL cutover is controlled and dispute-proof.