Workload

Spark SQL ETL pipelines to BigQuery

Re-home Spark SQL pipelines—staging, windowed dedupe, incremental apply, and late-data reprocessing—into BigQuery with an explicit run contract and validation gates that prevent KPI drift and scan-cost surprises.

At a glance
Input: Spark SQL ETL / pipeline migration logic
Output: BigQuery equivalent (validated)
Common pitfalls
  • Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers causes drift under retries.
  • Implicit late-arrival policy: Spark relied on partition overwrite; the naïve BigQuery version becomes append-only.
  • Unbounded apply: MERGE or refresh touches too much history each run, spiking scan bytes.
Context

Why this breaks

Spark SQL pipelines encode correctness in operational behavior: partition overwrite conventions, incremental windows, dedupe logic implemented via window functions, and retry semantics that “usually work” because of job structure. When migrated naïvely, teams translate queries but lose the run contract—keys, ordering/tie-breakers, late-arrival policy, and restartability—so outputs drift after cutover.

Common symptoms after migration:

  • Duplicates or missing updates because dedupe keys and tie-breakers were implicit
  • Late-arrival events double-counted or silently ignored because reprocessing windows weren’t recreated
  • Full reprocessing replaces incremental apply, causing missed SLAs and cost spikes
  • Scan-cost surprises because pruning-aware staging boundaries weren’t designed
  • Orchestration changes retries/ordering, turning failures into silent data issues
Approach

How conversion works

  1. Inventory & classify pipelines: Spark SQL jobs/models, staging zones, schedules, and orchestration dependencies.
  2. Extract the run contract: business keys, deterministic ordering/tie-breakers, dedupe rule, late-arrival window policy, and restart semantics.
  3. Re-home transformations to BigQuery-native staging (landing → typed staging → dedupe → apply) with partitioning/clustering aligned to load windows.
  4. Implement late-data behavior explicitly: define reprocessing windows and staged re-apply so outcomes match without full refreshes.
  5. Re-home orchestration: Composer/Airflow/dbt/your runner with explicit DAG dependencies, retries, alerts, and concurrency posture.
  6. Gate cutover with evidence: golden outputs + incremental integrity simulations (reruns, backfills, late injections) and rollback-ready criteria.

Supported constructs

Representative Spark SQL ETL constructs we commonly migrate to BigQuery (exact coverage depends on your estate).

Source | Target | Notes
Partition overwrite pipelines | Partition-scoped apply (MERGE or overwrite-by-partition) | Preserve overwrite semantics without full-table refreshes.
ROW_NUMBER-based dedupe | Deterministic dedupe with explicit tie-breakers | Prevents nondeterministic drift under retries.
Late-data reprocessing windows | Explicit late-arrival policy + staged re-apply | Validated via late-injection simulations.
SCD Type-1 / Type-2 patterns | MERGE + current-flag/end-date patterns | Backfills and late updates tested as first-class scenarios.
Semi-structured parsing in Spark | Typed extraction boundaries in BigQuery | Extract once, cast once, reuse everywhere.
Spark orchestration | Composer/dbt DAGs with explicit contracts | Retries, alerts, and concurrency posture modeled.
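
To make the dedupe row concrete: a minimal sketch of deterministic dedupe in BigQuery, assuming a hypothetical staging table `proj.staging.orders_typed` keyed by `order_id`, with `ingest_file`/`ingest_offset` available as stable tie-breakers. Adjust names to your model.

```sql
-- Hypothetical example: deterministic dedupe with explicit tie-breakers.
-- Without ingest_file/ingest_offset, ties on event_ts resolve
-- arbitrarily and can differ between retries.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY order_id                           -- business key
      ORDER BY event_ts DESC,                         -- latest wins
               ingest_file DESC, ingest_offset DESC   -- stable tie-breakers
    ) AS rn
  FROM `proj.staging.orders_typed`
)
WHERE rn = 1;
```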

How workload changes

Topic | Spark SQL | BigQuery
Incremental correctness | Often relies on partition overwrite and implicit reprocessing | Explicit late-window policy + staged apply with integrity gates
Cost model | Cluster runtime and shuffle patterns | Bytes scanned + slot time + pruning effectiveness
Orchestration | Job frameworks and scripts | Composer/dbt orchestration with explicit DAG contracts

  • Incremental correctness: Correctness becomes auditable and repeatable under retries/backfills.
  • Cost model: Apply windows and layout alignment keep spend stable.
  • Orchestration: Retries and alerts become first-class artifacts.

Examples

Canonical BigQuery apply pattern for incremental Spark-style pipelines: stage → dedupe deterministically → MERGE with scoped partitions + applied-window tracking. Adjust keys, partitions, and casts to your model.

-- Applied-window tracking (restartability)
CREATE TABLE IF NOT EXISTS `proj.control.applied_windows` (
  job_name STRING NOT NULL,
  window_start TIMESTAMP NOT NULL,
  window_end TIMESTAMP NOT NULL,
  applied_at TIMESTAMP NOT NULL
);
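
Building on the control table above, a hedged sketch of the stage → dedupe → scoped MERGE flow; table, column, and job names are illustrative, and the window bounds would come from your orchestrator.

```sql
-- Illustrative partition-scoped apply; names are hypothetical.
DECLARE window_start TIMESTAMP DEFAULT TIMESTAMP '2024-06-01 00:00:00';
DECLARE window_end   TIMESTAMP DEFAULT TIMESTAMP '2024-06-02 00:00:00';

MERGE `proj.marts.orders` AS T
USING (
  SELECT order_id, status, amount, event_ts, DATE(event_ts) AS event_date
  FROM `proj.staging.orders_deduped`
  WHERE event_ts >= window_start AND event_ts < window_end
) AS S
ON T.order_id = S.order_id
   -- Constrain the target to the load window so the MERGE prunes
   -- partitions instead of scanning full table history.
   AND T.event_date >= DATE(window_start)
   AND T.event_date <= DATE(window_end)
WHEN MATCHED THEN
  UPDATE SET status = S.status, amount = S.amount, event_ts = S.event_ts
WHEN NOT MATCHED THEN
  INSERT (order_id, status, amount, event_ts, event_date)
  VALUES (S.order_id, S.status, S.amount, S.event_ts, S.event_date);

-- Record the applied window so reruns and restarts are auditable.
INSERT INTO `proj.control.applied_windows` (job_name, window_start, window_end, applied_at)
VALUES ('orders_apply', window_start, window_end, CURRENT_TIMESTAMP());
```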
Avoid

Common pitfalls

  • Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers causes drift under retries.
  • Implicit late-arrival policy: Spark relied on partition overwrite; the naïve BigQuery version becomes append-only.
  • Unbounded apply: MERGE or refresh touches too much history each run, spiking scan bytes.
  • Type drift in staging: Spark's implicit casts must be made explicit for stable BigQuery outputs.
  • Schema evolution surprises: upstream fields widen or change; typed targets break without a drift policy.
  • Pruning defeated: filters wrap or cast partition columns in WHERE, preventing partition elimination.
  • Orchestration mismatch: changed concurrency and dependency ordering creates freshness drift.
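
For the pruning pitfall, the contrast is easiest to see side by side. This sketch assumes a hypothetical table partitioned on `event_ts`:

```sql
-- Defeats pruning: the partition column is wrapped in a computation,
-- so BigQuery cannot eliminate partitions and scans full history.
SELECT COUNT(*) FROM `proj.marts.orders`
WHERE CAST(event_ts AS STRING) LIKE '2024-06-01%';

-- Preserves pruning: compare the partition column directly to constants.
SELECT COUNT(*) FROM `proj.marts.orders`
WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00'
  AND event_ts <  TIMESTAMP '2024-06-02 00:00:00';
```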
Proof

Validation approach

  • Execution checks: pipelines run reliably under representative volumes and schedules.
  • Structural parity: partition/window-level row counts and column profiles (null/min/max/distinct) for key tables.
  • KPI parity: aggregates by key dimensions for critical marts and dashboards.
  • Incremental integrity (mandatory):
    • Idempotency: rerun same window → no net change
    • Late-arrival: inject late updates → only expected rows change
    • Backfill safety: replay historical windows → stable dedupe and SCD logic
    • Dedupe stability: duplicates eliminated consistently under retries
  • Cost/performance gates: pruning verified; scan bytes/runtime thresholds set for top jobs.
  • Operational readiness: retry/alerting tests, canary gates, and rollback criteria defined before cutover.
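
The idempotency gate can be implemented with a simple fingerprint comparison, run before and after replaying a window; table and column names here are illustrative.

```sql
-- Fingerprint a window of the target table; rerunning the same
-- window should leave row_count and content_fp unchanged.
SELECT
  COUNT(*) AS row_count,
  BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS content_fp
FROM `proj.marts.orders` AS t
WHERE t.event_date >= '2024-06-01' AND t.event_date < '2024-06-02';
```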
Execution

Migration steps

A sequence that keeps pipeline correctness measurable and cutover controlled.
  1. Inventory pipelines, schedules, and dependencies

     Extract pipeline graph: Spark SQL jobs, staging zones, upstream feeds, orchestration DAGs, schedules, and SLAs. Identify business-critical marts and consumers.

  2. Formalize the run contract

     Define business keys, deterministic ordering/tie-breakers, dedupe rules, late-arrival window policy, and backfill boundaries. Make restartability explicit.

  3. Rebuild transformations on BigQuery-native staging

     Implement landing → typed staging → dedupe → apply, with partitioning/clustering aligned to load windows and BI access paths. Define schema evolution policy (widen/quarantine/reject).

  4. Re-home orchestration and operations

     Implement DAGs in Composer/Airflow/dbt: dependencies, retries, alerts, and concurrency posture. Add applied-window tracking and failure handling.

  5. Run parity and incremental integrity gates

     Golden outputs + KPI aggregates, idempotency reruns, late-data injections, and backfill windows. Cut over only when thresholds pass and rollback criteria are defined.
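
As a sketch of the typed-staging boundary in step 3 (column names and the quarantine policy are illustrative): `SAFE_CAST` turns type drift into quarantinable NULLs instead of failed loads.

```sql
-- Illustrative typed staging: extract once, cast once.
CREATE OR REPLACE TABLE `proj.staging.orders_typed` AS
SELECT
  SAFE_CAST(JSON_VALUE(raw, '$.order_id') AS INT64)     AS order_id,
  SAFE_CAST(JSON_VALUE(raw, '$.amount')   AS NUMERIC)   AS amount,
  SAFE_CAST(JSON_VALUE(raw, '$.event_ts') AS TIMESTAMP) AS event_ts
FROM `proj.landing.orders_raw`;

-- Quarantine rows that failed typing so drift is visible, not silent.
CREATE OR REPLACE TABLE `proj.staging.orders_quarantine` AS
SELECT raw
FROM `proj.landing.orders_raw`
WHERE SAFE_CAST(JSON_VALUE(raw, '$.order_id') AS INT64) IS NULL;
```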

Workload Assessment
Migrate Spark pipelines with the run contract intact

We inventory your Spark SQL pipelines, formalize dedupe and late-data semantics, migrate a representative pipeline end-to-end, and produce parity evidence with cutover gates—without scan-cost surprises.

Migration Acceleration
Cut over pipelines with proof-backed gates

Get an actionable migration plan with incremental integrity tests (reruns, late data, backfills), reconciliation evidence, and cost/performance baselines—so pipeline cutover is controlled and dispute-proof.

FAQ

Frequently asked questions

Is migrating Spark SQL ETL just translating SQL?
No. The critical work is preserving the run contract: keys, ordering/tie-breakers, dedupe rules, late-arrival policy, and restartability under retries/backfills.
How do you preserve partition overwrite semantics from Spark jobs?
We implement partition-scoped apply (MERGE or overwrite-by-partition), with applied-window tracking and integrity tests so reruns are safe and outcomes match.
What about late-arriving data?
We convert it into an explicit late-arrival window policy and staged re-apply strategy, then validate with late-injection simulations to prove only expected rows change.
How do you prevent BigQuery cost surprises?
We design pruning-aware staging boundaries and choose partitioning/clustering aligned to load windows. Validation includes scan bytes/runtime baselines and regression thresholds for the top jobs.