Validation & reconciliation for Databricks → BigQuery
Turn “it runs” into a measurable parity contract. We prove correctness for Delta MERGE and incremental systems with golden queries, KPI diffs, and integrity simulations—then gate cutover with rollback-ready criteria and pruning baselines.
- Input: Databricks validation & reconciliation logic
- Output: BigQuery equivalent (validated)
Why this breaks
Databricks migrations fail late when teams validate only a one-time backfill and a few spot checks. Delta systems encode correctness in operational behavior: partition overwrite assumptions, MERGE/upsert semantics, retries, and late-arrival corrections. BigQuery can implement equivalent outcomes—but only if correctness rules and pruning posture are made explicit and tested under stress.
Common drift drivers in Databricks/Delta → BigQuery:
- MERGE semantics drift: match keys, casts, and update predicates differ subtly
- Non-deterministic dedupe: window ordering missing tie-breakers; reruns choose different winners
- Late-arrival behavior: Delta reprocessing windows vs BigQuery staged apply not equivalent by default
- SCD drift: end-dating/current-flag logic breaks under backfills and late updates
- Pruning/cost surprises: filters and layouts don’t align; bytes scanned explodes after cutover
Validation must treat the workload as an incremental system and include scan-cost posture as a first-class gate.
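The dedupe drift above usually comes down to an ORDER BY that is not a total ordering. A minimal sketch of the fix (table and column names such as `source_file` and `ingestion_seq` are illustrative):

```sql
-- Deterministic dedupe: every ORDER BY term after the business ordering is a
-- tie-breaker; the last one is unique, so reruns always pick the same winner.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY updated_at DESC,
               source_file DESC,    -- tie-breaker 1
               ingestion_seq DESC   -- tie-breaker 2: unique, makes ordering total
    ) AS rn
  FROM `proj.staging.orders_raw`
)
WHERE rn = 1;
```

Without the final unique tie-breaker, two rows with equal `updated_at` can swap winners between retries, which shows up later as unexplained row-level diffs.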
How conversion works
- Define the parity contract: what must match (facts/dims, KPIs, dashboards) and what tolerances apply.
- Define the pruning/cost contract: which workloads must prune and what scan-byte/slot thresholds are acceptable.
- Build validation datasets: golden inputs, edge cohorts (ties, null-heavy segments), and representative windows (including boundary days).
- Run readiness + execution gates: schemas/types align, dependencies deployed, and jobs run reliably.
- Run layered parity gates: counts/profiles → KPI diffs → targeted row-level diffs where needed.
- Validate incremental integrity (mandatory for MERGE/upserts): idempotency reruns, late-arrival injections, and backfill simulations.
- Gate cutover: pass/fail thresholds, canary strategy, rollback triggers, and post-cutover monitors.
Supported constructs
Representative validation and reconciliation mechanisms we apply in Databricks → BigQuery migrations.
| Source | Target | Notes |
|---|---|---|
| Delta MERGE/upsert correctness | MERGE parity contract + rerun/late-data simulations | Proves behavior under retries, late arrivals, and backfills. |
| Golden dashboards/queries | Golden query harness + repeatable parameters | Codifies business sign-off into runnable tests. |
| Counts and profiles | Partition-level counts + null/min/max/distinct profiles | Cheap early drift detection before deep diffs. |
| KPI validation | Aggregate diffs by key dimensions + tolerance thresholds | Aligns validation with business meaning. |
| Pruning/cost posture | Scan-byte baselines + pruning verification | Treat cost posture as a cutover gate. |
| Operational sign-off | Canary gates + rollback criteria + monitors | Prevents “successful cutover” turning into KPI debates. |
How workload changes
| Topic | Databricks / Delta | BigQuery |
|---|---|---|
| Where correctness hides | Job structure + partition overwrite/reprocessing semantics | Explicit staged apply + idempotency contracts |
| Cost model | Cluster runtime | Bytes scanned + slot time |
| Operational sign-off | Often based on “looks right” dashboard checks | Evidence-based gates + rollback triggers |
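Because the cost model shifts to bytes scanned and slot time, the baseline can be captured from BigQuery's job metadata. A sketch, assuming the `region-us` location and a 30-day lookback:

```sql
-- Per-day scan-byte and slot-time baseline for query jobs in this project.
SELECT
  DATE(creation_time) AS d,
  SUM(total_bytes_processed) AS bytes_scanned,
  SUM(total_slot_ms) / 1000 AS slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY 1
ORDER BY 1;
```

Captured before cutover, this becomes the threshold set that Gate 4 enforces afterwards.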
Examples
Illustrative parity and integrity checks in BigQuery. Replace datasets, keys, and KPI definitions to match your migration.
```sql
-- Row counts by window (BigQuery)
SELECT
  DATE(updated_at) AS d,
  COUNT(*) AS row_count  -- ROWS is a reserved keyword in GoogleSQL
FROM `proj.mart.fact_orders`
WHERE DATE(updated_at) BETWEEN @start_date AND @end_date
GROUP BY 1
ORDER BY 1;
```
Common pitfalls
- Validating only the backfill: parity on a static snapshot doesn’t prove correctness under reruns/late data.
- Spot checks instead of gates: a few sampled rows miss drift in ties and edge windows.
- No tolerance model: teams argue over diffs because thresholds weren’t defined upfront.
- Unstable ordering: ROW_NUMBER/RANK without complete ORDER BY; winners change under retries.
- No pruning gate: bytes scanned increases slip through because cost posture isn’t validated.
- MERGE scope blind: apply touches too much history, causing scan blowups and slow SLAs.
Validation approach
Gate set (layered)
Gate 0 — Readiness
- Datasets, permissions, and target schemas exist
- Dependent assets deployed (UDFs/procedures, reference data, control tables)
Gate 1 — Execution
- Pipelines run reliably under representative volume and concurrency
- Deterministic ordering + explicit casts enforced
Gate 2 — Structural parity
- Row counts by partitions/windows
- Null/min/max/distinct profiles for key columns
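Gate 2 can often be covered by one profile query per table, run against each engine (adjusting dialect, e.g. `COUNTIF` vs Spark's `count_if`) and diffed. Column names here are illustrative:

```sql
-- Partition-level profile: cheap drift detection before row-level diffs.
SELECT
  DATE(updated_at) AS d,
  COUNT(*) AS row_count,
  COUNTIF(customer_id IS NULL) AS null_customer_id,
  MIN(amount) AS min_amount,
  MAX(amount) AS max_amount,
  APPROX_COUNT_DISTINCT(order_id) AS approx_orders
FROM `proj.mart.fact_orders`
WHERE DATE(updated_at) BETWEEN @start_date AND @end_date
GROUP BY 1;
```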
Gate 3 — KPI parity
- KPI aggregates by key dimensions
- Top-N and ranking parity validated on tie/edge cohorts
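A KPI diff with an explicit tolerance can be sketched as follows (the KPI definition, the exported Databricks snapshot table, and the 0.1% threshold are placeholders):

```sql
-- KPI parity with a relative tolerance; any row returned is a failure.
WITH src AS (
  SELECT region, SUM(amount) AS revenue
  FROM `proj.recon.fact_orders_databricks_export`
  GROUP BY region
),
tgt AS (
  SELECT region, SUM(amount) AS revenue
  FROM `proj.mart.fact_orders`
  GROUP BY region
)
SELECT
  region,
  s.revenue AS src_revenue,
  t.revenue AS tgt_revenue,
  SAFE_DIVIDE(ABS(s.revenue - t.revenue), ABS(s.revenue)) AS rel_diff
FROM src AS s
FULL OUTER JOIN tgt AS t USING (region)
WHERE SAFE_DIVIDE(ABS(s.revenue - t.revenue), ABS(s.revenue)) > 0.001
   OR s.revenue IS NULL   -- dimension value only on the target side
   OR t.revenue IS NULL;  -- dimension value only on the source side
```

The FULL OUTER JOIN matters: it surfaces dimension values that exist on only one side, which a plain inner-join diff silently drops.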
Gate 4 — Pruning & cost posture (mandatory)
- Partition filters prune as expected on representative parameters
- Bytes scanned and slot time remain within agreed thresholds
- Regression alerts defined for scan blowups
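One way to verify pruning is to compare a job's actual bytes scanned with the logical bytes of the partitions it should touch. A sketch for a daily-partitioned table (names are placeholders):

```sql
-- Expected scan size for a pruned window, from partition metadata.
SELECT SUM(total_logical_bytes) AS expected_bytes
FROM `proj.mart`.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'fact_orders'
  AND partition_id BETWEEN FORMAT_DATE('%Y%m%d', @start_date)
                       AND FORMAT_DATE('%Y%m%d', @end_date);
```

If the job's `total_bytes_processed` is near the full table rather than this expected figure, the filter did not prune; a `bq query --dry_run` estimate gives the same signal before anything runs.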
Gate 5 — Incremental integrity (mandatory)
- Idempotency: rerun same micro-batch → no net change
- Late-arrival: inject late updates → only expected rows change
- Backfill safety: replay historical windows → stable SCD and dedupe
- Dedupe stability: duplicates eliminated consistently under retries
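The idempotency check reduces to comparing an order-independent fingerprint of the affected window before and after the rerun. A sketch, with illustrative names:

```sql
-- Order-independent fingerprint of a window; identical count and fingerprint
-- before and after rerunning the same micro-batch means "no net change".
SELECT
  COUNT(*) AS row_count,
  BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS window_fingerprint
FROM `proj.mart.fact_orders` AS t
WHERE DATE(updated_at) BETWEEN @start_date AND @end_date;
```

Note the pairing with `COUNT(*)`: XOR cancels identical rows that appear an even number of times, so the fingerprint alone is not sufficient.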
Gate 6 — Cutover & monitoring
- Canary criteria + rollback triggers
- Post-cutover monitors: latency, scan bytes/slot time, failures, KPI sentinels
Migration steps
- 01: Define parity and cost/pruning contracts. Decide what must match (tables, dashboards, KPIs) and define tolerances. Identify workloads where pruning is mandatory and set scan-byte/slot thresholds.
- 02: Create validation datasets and edge cohorts. Select representative windows and cohorts that trigger edge behavior (ties, null-heavy segments, boundary dates, late updates).
- 03: Implement layered gates. Start with cheap checks (counts/profiles), then KPI diffs, then deep diffs only where needed. Add pruning verification and baseline capture for top workloads.
- 04: Validate incremental integrity. Run idempotency reruns, late-arrival injections, and backfill simulations. These are the scenarios that usually break after cutover if not tested.
- 05: Gate cutover and monitor. Establish canary/rollback criteria and post-cutover monitors for KPIs, scan-cost sentinels (bytes/slot), latency, and failures.
We define parity + pruning contracts, build the golden-query set, and implement layered reconciliation gates—including idempotency reruns and late-data simulations—so drift and cost surprises are caught pre-production.
Get a validation plan, runnable gates, and sign-off artifacts (diff reports, thresholds, pruning baselines, monitors) so the Databricks → BigQuery cutover is controlled and dispute-proof.