Workload

Snowflake ETL/ELT pipelines to BigQuery

Re-home incremental loads, SCD patterns, and semi-structured transforms from Snowflake into BigQuery without breaking scheduling, idempotency, or downstream KPIs.

At a glance
Input
Snowflake ETL / pipeline migration logic
Output
BigQuery equivalent (validated)
Common pitfalls
  • Implicit watermarks: relying on CURRENT_TIMESTAMP() or task schedule time instead of persisted high-water marks.
  • Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers (event time + source offset + ingestion time).
  • MERGE scan blowups: applying MERGE without pruning boundaries or partition-aware staging.
Context

Why this breaks

Snowflake pipelines usually encode business correctness across more than SQL: incremental boundaries, late-arrival behavior, dedupe rules, and orchestration retries. A migration that only “translates queries” often preserves shape but not outcomes.

Common symptoms after cutover:

  • Duplicates or missing updates because watermarks and tie-breakers were implicit
  • SCD dimensions drift during backfills or late updates
  • Reprocessing semantics change (append-only in BigQuery vs implicit re-run windows in Snowflake tasks/jobs)
  • Costs spike because transformations stop benefiting from pruning-aware staging
  • Operational fragility: schedules, dependencies, and retries no longer match the original run contract
Approach

How conversion works

  1. Inventory & classify pipelines, tasks, and dependencies (dbt models, Snowflake Tasks/Streams, stored procedures, external orchestrators).
  2. Extract the correctness contract: watermarks, keys, dedupe rule, late-arrival window, SCD strategy, and failure/retry semantics.
  3. Re-home execution to BigQuery-native patterns (staging, partitioning, MERGE/apply strategy) + orchestration (Composer/Airflow, dbt, or your runner).
  4. Implement data contracts: typed staging, schema evolution policy (widen/quarantine/reject), and audit columns.
  5. Gate cutover with evidence: golden dataset parity, incremental integrity tests (reruns + late injections), and rollback-ready criteria.
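Step 2's "correctness contract" is easiest to audit when the watermark lives in a table rather than in code. A minimal sketch, with hypothetical names (`proj.ops.watermarks`, `@pipeline_name`, `@new_hwm`):

```sql
-- Hypothetical control table persisting high-water marks per pipeline.
CREATE TABLE IF NOT EXISTS `proj.ops.watermarks` (
  pipeline_name     STRING NOT NULL,
  high_water_mark   TIMESTAMP,   -- last event_ts fully applied
  late_window_hours INT64,       -- how far back late events may still land
  updated_at        TIMESTAMP
);

-- Each run reads its lower boundary from the table, never from the clock,
-- and advances the mark only after a successful apply:
UPDATE `proj.ops.watermarks`
SET high_water_mark = @new_hwm,
    updated_at = CURRENT_TIMESTAMP()
WHERE pipeline_name = @pipeline_name;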

Supported constructs

Representative pipeline constructs we commonly migrate from Snowflake to BigQuery (exact coverage depends on your estate).

| Source | Target | Notes |
| --- | --- | --- |
| Snowflake Tasks + schedules | Composer/Airflow or dbt scheduling (explicit DAG dependencies) | Dependencies, retries, and concurrency modeled explicitly. |
| Streams + incremental capture patterns | CDC landing + typed staging + apply windows | High-water mark + late window policy made auditable. |
| MERGE-based upserts | BigQuery MERGE with pruning-aware staging | Avoid full-target scans via partition boundaries and staged apply. |
| SCD Type-1 / Type-2 logic | MERGE + end-date/current-flag patterns | Backfill + late updates tested as first-class scenarios. |
| VARIANT transforms in ELT | JSON extraction + explicit casting in BigQuery | Type intent and null handling enforced at extraction boundary. |
| Transient/staging schemas | Dataset staging + partitioned temp tables | Staging boundaries designed for pruning and cost predictability. |
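The SCD Type-2 row in the table maps to a two-step MERGE in BigQuery. A sketch with hypothetical names (`proj.mart.dim_customer`, a `stg_changes` staging table, tracked attributes `status` and `segment`):

```sql
-- Step 1: close the current version and insert rows for brand-new keys.
MERGE `proj.mart.dim_customer` AS d
USING stg_changes AS s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.status != s.status OR d.segment != s.segment) THEN
  UPDATE SET d.is_current = FALSE, d.valid_to = s.event_ts
WHEN NOT MATCHED THEN
  INSERT (customer_id, status, segment, valid_from, valid_to, is_current)
  VALUES (s.customer_id, s.status, s.segment, s.event_ts, NULL, TRUE);

-- Step 2: a separate INSERT ... SELECT adds the new current row for keys
-- just end-dated, since one MERGE cannot both close and re-insert a key.
```

Backfills and late updates exercise exactly this close/re-insert boundary, which is why the table calls them out as first-class test scenarios.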

How workload changes

| Topic | Snowflake | BigQuery |
| --- | --- | --- |
| Incremental correctness | Often implicit in Tasks/Streams + timing assumptions | Explicit high-water mark + late-arrival windows + staged apply |
| Upserts / MERGE | MERGE performance relies on Snowflake clustering/micro-partitions | MERGE requires partition-aware staging and pruning strategies |
| Orchestration | Snowflake Tasks + external schedulers | Composer/dbt/native tooling with explicit DAG contracts |
| Cost model | Warehouse credits + micro-partition effects | Bytes scanned + slot time + reservation strategy |

  • Incremental correctness: correctness becomes testable under retries and backfills.
  • Upserts / MERGE: avoid full-table scans by scoping apply windows.
  • Orchestration: retries, SLAs, and concurrency are modeled and monitored.
  • Cost model: pruning and table layout drive predictable spend.
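The cost-model row is measurable directly from BigQuery's job metadata. A sketch of a per-pipeline baseline query, assuming jobs carry a hypothetical `pipeline` label and run in the `region-us` region:

```sql
-- Bytes billed and slot time per labeled pipeline over the last 7 days.
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'pipeline') AS pipeline,
  SUM(total_bytes_billed) / POW(1024, 3) AS gib_billed,
  SUM(total_slot_ms) / 1000 AS slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY pipeline
ORDER BY gib_billed DESC;
```

Numbers like these are what the regression thresholds in the validation section gate against.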

Examples

Canonical BigQuery apply pattern for incremental upserts: stage -> dedupe deterministically -> MERGE with scoped partitions. Adjust keys, partitions, and casts to your model.

-- BigQuery: stage and type incoming batch
-- Landing: `proj.raw.debezium_events` (batch_id STRING, payload JSON, ingested_at TIMESTAMP)

CREATE TEMP TABLE stg_typed AS
WITH src AS (
  SELECT
    batch_id,
    JSON_VALUE(payload, '$.op') AS op,
    -- choose row image based on op: deletes carry their state in `before`
    IF(JSON_VALUE(payload, '$.op') = 'd',
       JSON_QUERY(payload, '$.before'),
       JSON_QUERY(payload, '$.after')) AS row_img,
    TIMESTAMP_MILLIS(CAST(JSON_VALUE(payload, '$.source.ts_ms') AS INT64)) AS event_ts,
    SAFE_CAST(JSON_VALUE(payload, '$.source.lsn') AS INT64) AS src_lsn,
    ingested_at
  FROM `proj.raw.debezium_events`
  WHERE batch_id = @batch_id
)
SELECT
  batch_id,
  op,
  event_ts,
  src_lsn,
  ingested_at,
  JSON_VALUE(row_img, '$.id') AS id,          -- JSON_VALUE already returns STRING
  JSON_VALUE(row_img, '$.status') AS status,
  SAFE_CAST(JSON_VALUE(row_img, '$.amount') AS NUMERIC) AS amount
FROM src;
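The remaining two steps of the canonical pattern (dedupe deterministically, then MERGE with scoped partitions) can be sketched as follows. Target table and partition column are hypothetical (`proj.mart.orders`, partitioned by `DATE(event_ts)`); adapt keys and window parameters to your model:

```sql
-- Deterministic dedupe: keep the latest row per key, with stable
-- tie-breakers (event time, then source offset, then ingestion time).
CREATE TEMP TABLE stg_dedup AS
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY event_ts DESC, src_lsn DESC, ingested_at DESC
    ) AS rn
  FROM stg_typed
)
WHERE rn = 1;

-- Partition-scoped MERGE: the date predicate in the ON clause bounds the
-- target scan to the apply window instead of the whole table.
MERGE `proj.mart.orders` AS t
USING stg_dedup AS s
ON t.id = s.id
   AND DATE(t.event_ts) BETWEEN @window_start AND @window_end
WHEN MATCHED AND s.op = 'd' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET t.status = s.status, t.amount = s.amount, t.event_ts = s.event_ts
WHEN NOT MATCHED AND s.op != 'd' THEN
  INSERT (id, status, amount, event_ts)
  VALUES (s.id, s.status, s.amount, s.event_ts);
```

Note that pruning is only guaranteed when the partition predicate resolves to constants at plan time; verify scan bytes for your parameterization.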
Avoid

Common pitfalls

  • Implicit watermarks: relying on CURRENT_TIMESTAMP() or task schedule time instead of persisted high-water marks.
  • Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers (event time + source offset + ingestion time).
  • MERGE scan blowups: applying MERGE without pruning boundaries or partition-aware staging.
  • SCD drift: end-dating/current-flag logic not tested under backfills and late-arrival updates.
  • Schema evolution surprises: upstream CDC widens types; downstream typed tables fail silently or truncate.
  • Orchestration mismatch: retries, concurrency, and dependency ordering differ; pipeline becomes “eventually wrong.”
Proof

Validation approach

  • Compile + execute checks: BigQuery jobs run reliably under representative volumes and concurrency.
  • Golden parity: key facts/dims match on KPI aggregates and agreed checksums for defined windows.
  • Incremental integrity:
    • Idempotency: re-run same micro-batch -> no net change
    • Late-arrival: inject late events -> only expected rows change
    • Backfill safety: run historical windows -> SCD + dedupe remains stable
  • Cost/performance gates: partition pruning verified; scan bytes + slot time thresholds set for top pipelines.
  • Operational readiness: retry simulations, alerting, and rollback criteria validated before cutover.
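The idempotency check above ("re-run same micro-batch -> no net change") can itself be expressed as a query. A sketch against a hypothetical target `proj.mart.orders`, comparing a row count and an order-independent fingerprint before and after the rerun:

```sql
-- Snapshot the target before re-applying the batch.
CREATE TEMP TABLE pre AS
SELECT COUNT(*) AS n,
       BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS fp
FROM `proj.mart.orders` AS t;

-- ... re-run the apply for the same @batch_id here ...

-- A passing run changes neither the count nor the fingerprint.
SELECT IF(pre.n = post.n AND pre.fp = post.fp, 'PASS', 'FAIL') AS idempotency
FROM pre,
  (SELECT COUNT(*) AS n,
          BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS fp
   FROM `proj.mart.orders` AS t) AS post;
```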
Execution

Migration steps

A sequence that keeps pipeline correctness measurable and cutover controlled.
  1. Inventory pipelines, dependencies, and run contracts

     Extract the pipeline graph: tasks/jobs, schedules, upstream/downstream dependencies, SLAs, retries, and concurrency limits. Identify critical marts and consumers.

  2. Formalize incremental semantics

     Define high-water marks, keys, dedupe tie-breakers, late-arrival window policy, and backfill boundaries. Record these as an auditable contract, not tribal knowledge.

  3. Rebuild transformations on BigQuery-native staging

     Implement landing -> typed staging -> dedupe -> apply. Choose partitioning/clustering for pruning and stable costs. Encode the schema evolution policy (widen/quarantine/reject).

  4. Re-home orchestration

     Implement DAGs in Composer/Airflow or your orchestrator: dependencies, retries, alerting, and idempotency behavior. Confirm schedule alignment and failure handling.

  5. Run parity and incremental integrity gates

     Validate golden outputs and KPI aggregates, idempotency reruns, late-data injections, and backfill simulations. Cut over only when evidence meets thresholds and rollback criteria are defined.
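Step 3's quarantine branch of the schema evolution policy can be sketched as a routing query. Names are hypothetical (`proj.ops.quarantine`, a Debezium-style `amount` field); the point is that rows that no longer cast cleanly are set aside instead of failing the batch or silently truncating:

```sql
-- Route rows whose amount is present but not castable to NUMERIC.
INSERT INTO `proj.ops.quarantine` (batch_id, payload, reason, ingested_at)
SELECT batch_id, payload, 'amount_not_numeric', ingested_at
FROM `proj.raw.debezium_events`
WHERE batch_id = @batch_id
  AND JSON_VALUE(payload, '$.after.amount') IS NOT NULL
  AND SAFE_CAST(JSON_VALUE(payload, '$.after.amount') AS NUMERIC) IS NULL;
```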

Workload Assessment
Migrate pipelines with a provable run contract

We inventory your Snowflake pipelines, formalize incremental semantics, migrate a representative pipeline end-to-end, and produce parity evidence plus a cutover plan with rollback criteria.

Migration Acceleration
Cut over with parity gates and rollback-ready criteria

Get an actionable plan with incremental integrity tests, reconciliation evidence, and cost/performance baselines, so your pipeline cutover is gated by proof, not optimism.

FAQ

Frequently asked questions

Is this just a SQL translation exercise?
No. Pipeline migration is about preserving the run contract: incremental boundaries, late-arrival behavior, dedupe rules, SCD semantics, and orchestration retries. SQL translation is only one component.
Can we keep our pipeline incremental after moving to BigQuery?
Yes. We implement explicit high-water marks, late windows, and staged applies (often MERGE) with pruning-aware boundaries, then validate idempotency and late-correction behavior with simulations.
How do you avoid BigQuery cost surprises?
We design staging boundaries and partition/clustering strategy to maximize pruning. Validation includes scan bytes/slot time baselines and regression thresholds for your top pipelines.
Do you migrate dbt models and orchestration too?
Yes. We can migrate dbt models, Snowflake Tasks logic, and external orchestrations, ensuring schedules, dependencies, retries, and operational alerts remain equivalent or improved.