Databricks SQL queries to Snowflake
Translate Spark SQL idioms—MERGE/upserts, window logic, array/map patterns, and time handling—into Snowflake SQL with validation gates that prevent semantic drift and KPI surprises.
- Input: Databricks SQL / query migration logic
- Output: Snowflake equivalent (validated)
Why this breaks
Databricks SQL estates are usually a mix of Spark SQL, Delta-specific patterns, and BI-generated queries. Many will translate syntactically—but drift happens when implicit semantics differ: NULL behavior, type coercion in CASE/COALESCE, timestamp/timezone assumptions, and array/map handling.
Common symptoms after cutover:
- KPI drift from window logic and top-N selections with incomplete ordering
- MERGE behavior changes because match keys, casts, or update predicates were implicit
- Arrays/maps and semi-structured fields don’t translate 1:1
- Date/time logic shifts due to timezone assumptions and TIMESTAMP vs DATE intent
- Performance regressions because Spark-era query shapes don’t align to Snowflake pruning/clustering
SQL migration must preserve both meaning and a Snowflake-native execution posture.
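A minimal sketch of the NULL-semantics point above: Spark's null-safe equality operator (`<=>`) silently matches NULL keys, and a naive rewrite to plain `=` drops those matches. Table and column names here are illustrative, not from any specific estate.

```sql
-- Databricks: null-safe join that matches rows where both keys are NULL
-- SELECT ... FROM src s JOIN tgt t ON s.business_key <=> t.business_key;

-- Snowflake: make the null-safe intent explicit
SELECT s.*, t.*
FROM src s
JOIN tgt t
  ON s.business_key IS NOT DISTINCT FROM t.business_key;
-- EQUAL_NULL(s.business_key, t.business_key) is an equivalent Snowflake form.
```

Rewriting to plain `=` instead would silently drop NULL-key matches, one of the quietest sources of row-count drift after cutover.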
How conversion works
- Inventory & prioritize the SQL corpus (dbt models, notebooks, BI extracts, jobs, views). Rank by business impact and risk patterns (MERGE, windows, arrays/maps, time).
- Normalize Spark SQL: identifier/quoting rules, functions, and CTE shapes to reduce dialect-specific noise.
- Rewrite with rule-anchored mappings: Spark SQL → Snowflake SQL function equivalents, explicit cast strategy, and deterministic ordering for windowed filters.
- Apply pattern libraries for high-risk constructs: MERGE/upserts, late-arrival dedupe, arrays/maps, timezone normalization, and null-safe comparisons.
- Validate with gates: compilation, catalog/type alignment, and golden-query parity (plus edge cohorts for ties/null-heavy segments).
- Performance-safe refactors: tune query shapes for Snowflake pruning (clustering alignment, bounded MERGE scopes, and staging boundaries).
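As one example of the explicit-cast strategy in the rule-anchored mappings above, implicit coercion inside COALESCE is surfaced and pinned down. The `amount`/`orders` names and the target precision are illustrative assumptions.

```sql
-- Spark: implicit coercion inside COALESCE (string literal vs numeric column)
-- SELECT COALESCE(amount, '0') FROM orders;  -- Spark coerces; numeric intent is implicit

-- Snowflake: surface the cast so the intent is explicit and reviewable
SELECT COALESCE(amount, CAST('0' AS NUMBER(18, 2))) AS amount
FROM orders;
```

Making the cast explicit turns a silent coercion rule into a reviewable line of code that validation gates can check.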
Supported constructs
Representative Databricks/Spark SQL constructs we commonly convert to Snowflake SQL (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Spark SQL window filters (ROW_NUMBER/RANK) | Snowflake window functions + QUALIFY | Deterministic ordering and tie-breakers enforced. |
| Delta MERGE and upsert patterns | Snowflake MERGE with staged apply | Match keys, casts, and update predicates made explicit and testable. |
| explode/arrays/maps/structs | FLATTEN/VARIANT + lateral joins (or normalized staging) | Empty-array semantics preserved intentionally. |
| DATE/TIMESTAMP arithmetic | Snowflake date/time functions | Timezone intent (NTZ/LTZ/TZ) normalized explicitly. |
| NULL-safe comparisons | Explicit null-safe equality patterns | Prevents join drift and duplicate inserts. |
| String/regex functions | Snowflake string/regex equivalents | Edge-case behavior validated via golden cohorts. |
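The explode-to-FLATTEN row above can be sketched as follows; the `orders`/`items` names are illustrative.

```sql
-- Databricks: explode an array column
-- SELECT order_id, explode(items) AS item FROM orders;

-- Snowflake: LATERAL FLATTEN. OUTER => TRUE keeps rows whose array is
-- empty or NULL (matching Spark's explode_outer); the default OUTER => FALSE
-- drops them (matching plain explode). Choose deliberately.
SELECT o.order_id, f.value AS item
FROM orders o,
     LATERAL FLATTEN(input => o.items, OUTER => TRUE) f;
```

The OUTER flag is exactly where "empty-array semantics preserved intentionally" gets decided, so it belongs in the semantic contract, not in a translator default.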
How workload changes
| Topic | Databricks / Spark SQL | Snowflake |
|---|---|---|
| Execution assumptions | Shuffle-heavy plans and partition overwrite patterns common | Pruning + clustering alignment and bounded apply scopes |
| Semi-structured handling | Struct/array/map types and explode patterns | VARIANT + FLATTEN or normalized staging |
| Time semantics | Timezone assumptions often implicit | Explicit NTZ/LTZ/TZ and conversion rules |
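A sketch of making timezone intent explicit, per the table above. The `events` table and target zone are illustrative.

```sql
-- Databricks: conversion that leans on session-timezone assumptions
-- SELECT from_utc_timestamp(event_ts, 'America/New_York') FROM events;

-- Snowflake: name both zones explicitly. With a source zone given,
-- CONVERT_TIMEZONE returns TIMESTAMP_NTZ: wall-clock time in the target zone.
SELECT CONVERT_TIMEZONE('UTC', 'America/New_York', event_ts::TIMESTAMP_NTZ)
           AS event_ts_local
FROM events;
```

Pinning the source zone in the query, rather than inheriting it from a session parameter, is what keeps boundary-date KPIs stable across environments.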
Examples
Representative Databricks → Snowflake rewrites for windowed dedupe, explode/flatten, and MERGE patterns. Adjust keys, paths, and casts to your model.
-- Databricks: latest row per key (QUALIFY also runs unchanged in Snowflake)
SELECT *
FROM events
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY business_key
    ORDER BY event_ts DESC, event_id DESC  -- stable tie-breaker (event_id is illustrative)
) = 1;
Make the dedupe rule explicit
Most KPI drift comes from missing dedupe rules or unstable ordering. Treat match keys, tie-breakers, and late-arrival policy as requirements—not implementation details.
Avoid full-target MERGE scans
Design staging boundaries and pruning-aware apply strategies so credit burn stays predictable as volume grows.
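A bounded-MERGE sketch of the staging-boundary idea above. The staging table, late-arrival window, and column names are illustrative assumptions; tune the bound to your late-arrival policy.

```sql
-- Bound the source to the late-arrival window and repeat the bound on the
-- target join so Snowflake can prune micro-partitions instead of scanning
-- the full fact table.
MERGE INTO fact_events t
USING (
    SELECT *
    FROM stg_events
    WHERE event_date >= DATEADD(day, -3, CURRENT_DATE)  -- late-arrival window
) s
  ON  t.business_key = s.business_key
  AND t.event_date   = s.event_date   -- bound enables target-side pruning
WHEN MATCHED AND s.event_ts > t.event_ts THEN
  UPDATE SET t.payload = s.payload, t.event_ts = s.event_ts
WHEN NOT MATCHED THEN
  INSERT (business_key, event_date, event_ts, payload)
  VALUES (s.business_key, s.event_date, s.event_ts, s.payload);
```

Including the date bound in the ON clause, not just the source subquery, is what keeps scanned partitions (and credit burn) proportional to the apply window rather than table size.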
Common pitfalls
- Type coercion drift: Spark’s implicit casts inside CASE/COALESCE differ; Snowflake may require explicit casts to preserve intent.
- NULL semantics in joins: equality and null-safe comparisons can change match behavior if not explicit.
- Window ordering ambiguity: ROW_NUMBER/RANK filters without stable tie-breakers cause nondeterministic drift.
- Array/map handling: Spark explode, map, and nested structs need explicit Snowflake VARIANT/FLATTEN rewrites when semi-structured.
- MERGE scope blowups: unbounded MERGE patterns cause large scans and credit spikes.
- Timezone assumptions: Databricks sessions often assume a timezone; Snowflake timestamp types (NTZ/LTZ/TZ) require explicit intent.
- Over-trusting “it runs”: compilation success is not parity; validate with golden outputs and edge cohorts.
Validation approach
- Compilation gates: converted queries compile and execute in Snowflake reliably under representative parameters.
- Catalog/type checks: referenced objects exist; implicit casts are surfaced and made explicit where needed.
- Golden-query parity: business-critical queries/dashboards match on outputs or agreed tolerances.
- KPI aggregates: compare aggregates by key dimensions (date, region, product, cohorts).
- Edge-cohort diffs: validate ties, null-heavy segments, boundary dates, and timezone transitions.
- Performance baseline: capture runtime, scanned micro-partitions, and credit burn for top queries; set regression thresholds.
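One way to mechanize the golden-query and edge-cohort diffs above is a symmetric set difference; the two result tables are hypothetical (e.g., legacy outputs loaded into Snowflake alongside converted-query outputs).

```sql
-- Symmetric difference of legacy vs converted outputs.
-- Zero rows = parity; any rows = drift to investigate.
-- Re-run filtered to each edge cohort: ties, null-heavy keys, boundary dates.
(SELECT * FROM golden_output_databricks
 EXCEPT
 SELECT * FROM golden_output_snowflake)
UNION ALL
(SELECT * FROM golden_output_snowflake
 EXCEPT
 SELECT * FROM golden_output_databricks);
```

Running the diff both directions catches missing rows and extra rows alike, which a one-sided EXCEPT would not.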
Migration steps
- 01
Collect and prioritize the query estate
Export dbt models, notebooks, job SQL, view definitions, and BI SQL. Rank by business impact, frequency, and risk patterns (MERGE, windows, arrays/maps, time).
- 02
Define semantic contracts for risky patterns
Make tie-breakers, NULL handling, casting strategy, timezone intent, and array/map handling explicit—especially for top-N, dedupe, and MERGE logic.
- 03
Convert with rule-anchored mappings
Apply deterministic rewrites for common constructs and flag ambiguous intent with review markers (implicit casts, ordering ambiguity, timezone assumptions).
- 04
Validate with golden queries and edge cohorts
Compile and run in Snowflake, compare KPI aggregates, and run targeted diffs on edge cohorts (ties, null-heavy segments, boundary dates).
- 05
Tune top queries for Snowflake
Align clustering to access paths, bound MERGEs and applies, and recommend materializations where BI patterns repeatedly scan large facts.
We inventory your SQL estate, convert a representative slice, and deliver parity evidence on golden queries—plus a risk register for constructs that carry business meaning.
Get a conversion plan, review markers, and validation artifacts so query cutover is gated by evidence and rollback-ready criteria.