Workload

UDFs & procedural utilities for Hadoop legacy cluster → BigQuery

Re-home reusable logic—from Hive/Impala UDFs, Spark helper libraries, and script-driven “macro” utilities—into BigQuery routines with explicit contracts and a replayable harness so behavior stays stable under reruns and backfills.

At a glance

Input: Stored-procedure-style logic and UDFs on Hadoop (legacy clusters)
Output: Validated BigQuery equivalents
Common pitfalls:
- Hidden dependencies: UDFs rely on external JARs/configs or implicit Hive settings not captured in migration.
- Mixed-type branches: CASE/IF returns mixed types; BigQuery needs explicit casts to preserve intent.
- NULL semantics drift: comparisons and string functions behave differently unless made explicit.

Context

Why this breaks

Hadoop estates rarely have “stored procedures” in the warehouse sense, but they do have procedural behavior: Hive/Impala UDFs (Java/Scala/Python), Spark helper libraries, and macro-like shell/Oozie utilities that generate SQL, move partitions, and update control tables. These assets embed business rules, typing assumptions, and side effects. When migrated naïvely, SQL may compile in BigQuery but outputs drift because UDF semantics, regex behavior, NULL handling, and time conversion differ—and restartability rules disappear.

Common symptoms after migration:

UDF outputs drift due to type coercion and NULL handling differences
Regex/string behavior changes (dialect and escaping differences)
Epoch/time conversion helpers drift on boundary days (timezone intent missing)
Script-driven dynamic SQL behaves differently under templating and quoting
Side effects (audit/control tables) aren’t modeled, so reruns/backfills double-apply or skip

A successful migration converts these scattered utilities into explicit BigQuery routines with a behavior contract and a replayable test harness.

Approach

How conversion works

Inventory & classify reusable logic: Hive/Impala UDFs, Spark helper functions, shell/Oozie macro utilities, and their call sites across pipelines and BI queries.
Extract the behavior contract: inputs/outputs, typing, NULL rules, regex expectations, time semantics, side effects, and performance constraints.
Choose BigQuery target form per asset:
- SQL UDF for pure expressions
- JavaScript UDF for complex/regex-heavy logic
- Stored procedure (SQL scripting) for multi-statement control flow and dynamic SQL
- Set-based refactor where procedural loops exist in scripts
Translate and normalize: explicit casts, null-safe comparisons, timezone intent, and deterministic ordering where logic depends on ranking/dedup.
Validate with a harness: golden inputs/outputs, branch coverage, failure-mode tests, and side-effect assertions—then integrate into representative pipelines.
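The "translate and normalize" step can be sketched in one query. This is a minimal illustration, not a definitive pattern: the inline rows and the column names (raw_amount, old_status, new_status, updated_at) are hypothetical stand-ins for a staging table.

```sql
-- Inline sample rows stand in for a hypothetical staging table.
WITH src AS (
  SELECT 'a1' AS id, '42' AS raw_amount, 'open' AS old_status,
         CAST(NULL AS STRING) AS new_status,
         TIMESTAMP '2024-01-01 00:00:00+00' AS updated_at
  UNION ALL
  SELECT 'a1', 'n/a', CAST(NULL AS STRING), CAST(NULL AS STRING),
         TIMESTAMP '2024-01-02 00:00:00+00'
),
ranked AS (
  SELECT
    id,
    -- Explicit cast: SAFE_CAST yields NULL for unparseable input instead of failing.
    SAFE_CAST(raw_amount AS NUMERIC) AS amount,
    -- Null-safe comparison: TRUE when both sides are NULL.
    old_status IS NOT DISTINCT FROM new_status AS status_unchanged,
    -- Deterministic dedup: a tie-breaker column keeps reruns on the same row.
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY updated_at DESC, raw_amount
    ) AS rn
  FROM src
)
SELECT id, amount, status_unchanged
FROM ranked
WHERE rn = 1;
```

Whether unparseable input should become NULL or raise an error is a contract decision; CAST is the stricter choice when failures should surface.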

Supported constructs

Representative Hadoop-era procedural constructs we commonly migrate to BigQuery routines (exact coverage depends on your estate).

Source	Target	Notes
Hive/Impala UDFs (Java/Scala/Python)	BigQuery SQL UDFs / JavaScript UDFs	Choose target form based on complexity and semantics; validate edge cases.
Regex-heavy string transforms	BigQuery REGEXP_* functions or JS UDFs	Regex dialect differences validated with golden cohorts.
Shell/Oozie macro utilities	Reusable views/UDFs/procedures	Consolidate reusable patterns into testable BigQuery assets.
Dynamic SQL via templating	EXECUTE IMMEDIATE with parameter binding	Normalize identifier rules; reduce drift and injection risk.
Control tables for restartability	Applied-window tracking + idempotency markers	Reruns/backfills become safe and auditable.
Epoch/time conversion helpers	TIMESTAMP_SECONDS/MILLIS + explicit timezone handling	Prevents boundary-day drift in reporting.
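For the dynamic SQL row above, one possible shape is a procedure that binds values through USING and validates any identifier before splicing it into the statement. The procedure name, the event_ts column, and the allowed identifier pattern are assumptions made for illustration.

```sql
-- Hypothetical procedure: counts rows in a caller-supplied table since a timestamp.
CREATE OR REPLACE PROCEDURE `proj.util.count_rows_since`(
  table_name STRING,   -- expected form: project.dataset.table
  since_ts TIMESTAMP,
  OUT row_count INT64
)
BEGIN
  DECLARE n INT64;

  -- Identifiers cannot be bound as parameters, so validate before formatting.
  IF NOT REGEXP_CONTAINS(
       table_name, r'^[A-Za-z0-9_-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$') THEN
    RAISE USING MESSAGE = 'count_rows_since: invalid table name';
  END IF;

  -- Values are bound with USING rather than concatenated into the SQL text.
  EXECUTE IMMEDIATE FORMAT(
    'SELECT COUNT(*) FROM `%s` WHERE event_ts >= @since_ts', table_name)
  INTO n
  USING since_ts AS since_ts;

  SET row_count = n;
END;
```

A caller would declare an INT64 variable and pass it as the third argument to CALL. Keeping the identifier check separate from value binding makes the quoting rule explicit instead of leaving it to a template engine.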

How workload changes

Topic	Hadoop legacy cluster	BigQuery
Where logic lives	UDF JARs + script-driven SQL utilities in orchestration	Centralized routines (UDFs/procedures) with explicit contracts
Typing and coercion	Hive/Impala implicit casts often tolerated	Explicit casts recommended for stable outputs
Regex/time semantics	Dialect-specific regex and epoch conversions	BigQuery REGEXP + explicit timestamp intent
Operational behavior	Reruns/retries encoded in scripts and coordinators	Idempotency and side effects must be explicit

Where logic lives: Migration consolidates and stabilizes reusable logic.

Typing and coercion: Validation focuses on mixed-type branches and join keys.

Regex/time semantics: Edge cohorts are mandatory for tricky strings and boundary days.

Operational behavior: Harness proves behavior under reruns/backfills.
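Boundary-day drift is easiest to see with a single instant rendered in two zones. The sketch below assumes a hypothetical helper, `proj.util.epoch_ms_to_local_date`, that takes the zone as an explicit argument rather than relying on a session default.

```sql
-- Hypothetical helper: the time zone is part of the signature, not an implicit setting.
CREATE OR REPLACE FUNCTION `proj.util.epoch_ms_to_local_date`(epoch_ms INT64, tz STRING)
RETURNS DATE
AS (
  DATE(TIMESTAMP_MILLIS(epoch_ms), tz)
);

-- 1710052200000 ms is 2024-03-10 06:30:00 UTC,
-- which is still 2024-03-09 (22:30) in America/Los_Angeles.
SELECT
  `proj.util.epoch_ms_to_local_date`(1710052200000, 'UTC') AS utc_date,
  `proj.util.epoch_ms_to_local_date`(1710052200000, 'America/Los_Angeles') AS la_date;
```

Which zone the legacy helper actually assumed has to come from the behavior contract; the code only makes that choice visible.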

Examples

Illustrative patterns for moving Hadoop-era UDF and macro utilities into BigQuery routines. Adjust datasets and types to match your environment.

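-- BigQuery SQL UDF example: division that returns NULL instead of raising
-- a division-by-zero error. (The built-in SAFE_DIVIDE behaves similarly.)
CREATE OR REPLACE FUNCTION `proj.util.safe_div`(n NUMERIC, d NUMERIC)
RETURNS NUMERIC
AS (
  -- A NULL or zero divisor yields NULL; a NULL numerator propagates as NULL.
  IF(d IS NULL OR d = 0, NULL, n / d)
);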
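A second pattern covers regex logic that BigQuery's REGEXP_* functions cannot express. Those functions use RE2, which has no backreferences, while Java-based Hive UDFs often rely on them. The sketch below is one way to preserve such a pattern in a JavaScript UDF; the function name and the pattern itself are hypothetical.

```sql
-- Hypothetical JS UDF: detects an immediately repeated word ("the the").
-- The \1 backreference is not available in RE2, hence JavaScript.
CREATE OR REPLACE FUNCTION `proj.util.has_repeated_word`(s STRING)
RETURNS BOOL
LANGUAGE js
AS r"""
  // Make NULL handling explicit instead of relying on coercion.
  if (s === null) return null;
  return /\b(\w+)\s+\1\b/i.test(s);
""";

SELECT `proj.util.has_repeated_word`('the the cat') AS repeated;
```

The raw string literal (r"""...""") keeps the backslashes intact, which avoids one common source of escaping drift.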

Avoid

Common pitfalls

Hidden dependencies: UDFs rely on external JARs/configs or implicit Hive settings not captured in migration.
Mixed-type branches: CASE/IF returns mixed types; BigQuery needs explicit casts to preserve intent.
NULL semantics drift: comparisons and string functions behave differently unless made explicit.
Regex dialect differences: pattern syntax and escaping change outputs for edge inputs.
Dynamic SQL via templating: string substitution behaves differently; identifier quoting breaks.
Side effects ignored: control-table/audit updates not recreated; reruns/backfills become unsafe.
Row-by-row script logic: loops and per-partition scripts should become set-based SQL or bounded windows.
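To make the side-effect pitfall concrete, here is a minimal sketch of a rerun-safe load that records applied windows in a control table. All object names (proj.ops.applied_windows, proj.raw.sales, proj.mart.daily_sales) and their columns are assumptions for illustration.

```sql
-- Hypothetical objects, shown only to illustrate the pattern:
--   proj.ops.applied_windows(job_name STRING, window_start DATE, applied_at TIMESTAMP)
--   proj.raw.sales / proj.mart.daily_sales(sale_date DATE, store_id STRING, amount NUMERIC)
CREATE OR REPLACE PROCEDURE `proj.ops.load_daily_sales`(run_date DATE)
BEGIN
  DECLARE applied_count INT64;

  -- Idempotency marker: has this window already been applied?
  SET applied_count = (
    SELECT COUNT(*)
    FROM `proj.ops.applied_windows`
    WHERE job_name = 'load_daily_sales' AND window_start = run_date
  );

  IF applied_count = 0 THEN
    -- Replace the window rather than append, so a partial earlier run
    -- cannot double-apply.
    DELETE FROM `proj.mart.daily_sales` WHERE sale_date = run_date;

    INSERT INTO `proj.mart.daily_sales` (sale_date, store_id, amount)
    SELECT sale_date, store_id, SUM(amount)
    FROM `proj.raw.sales`
    WHERE sale_date = run_date
    GROUP BY sale_date, store_id;

    -- Write the marker last: if anything above fails, the rerun starts over.
    INSERT INTO `proj.ops.applied_windows` (job_name, window_start, applied_at)
    VALUES ('load_daily_sales', run_date, CURRENT_TIMESTAMP());
  END IF;
END;
```

Calling the procedure twice for the same run_date leaves the mart unchanged on the second call. Wrapping the three statements in a multi-statement transaction is a possible refinement when partial writes must never be visible.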

Proof

Validation approach

Compile + interface checks: each routine deploys; signatures match the contract (args/return types).
Golden tests: curated input sets validate outputs, including NULL-heavy and boundary cases.
Regex/time edge cohorts: validate tricky strings, escaping, and boundary-day timestamp conversions.
Branch + failure-mode coverage: expected failures are tested (invalid inputs, missing rows).
Side-effect verification: assert expected writes to control/log/audit tables and idempotency under reruns.
Integration replay: run routines inside representative pipelines and compare downstream KPIs/aggregates.
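A golden test can be as small as one ASSERT statement. The sketch below checks the `proj.util.safe_div` routine from the Examples section against inline cases; in practice the cases would more likely live in a versioned table, and the expected values shown are assumptions about the intended contract.

```sql
-- Fails the script if any golden case diverges from the expected value.
ASSERT ((
  WITH golden AS (
    SELECT 'normal' AS case_name, NUMERIC '10' AS n, NUMERIC '4' AS d,
           NUMERIC '2.5' AS expected
    UNION ALL SELECT 'zero divisor', NUMERIC '1', NUMERIC '0', NULL
    UNION ALL SELECT 'null divisor', NUMERIC '1', NULL, NULL
    UNION ALL SELECT 'null numerator', NULL, NUMERIC '2', NULL
  )
  SELECT COUNT(*)
  FROM golden
  -- IS DISTINCT FROM treats NULL as a comparable value, so NULL cases are checked too.
  WHERE `proj.util.safe_div`(n, d) IS DISTINCT FROM expected
) = 0) AS 'safe_div: golden cases diverged';
```

Using IS DISTINCT FROM rather than != matters here: a plain inequality would evaluate to NULL for the NULL-expected cases and silently skip them.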

Execution

Migration steps

A sequence that keeps behavior explicit, testable, and safe to cut over.

01
Inventory reusable logic and call sites
Collect Hive/Impala UDFs, Spark helper libraries, and script-driven utilities, plus call sites across pipelines and BI queries. Identify dependencies and side effects (control/audit tables).
02
Define the behavior contract
Specify inputs/outputs, typing, NULL rules, regex expectations, time semantics, error behavior, and side effects. Decide target form (SQL UDF, JS UDF, procedure, or refactor).
03
Convert with safety patterns
Make casts explicit, normalize timezone intent, implement null-safe comparisons, and migrate dynamic SQL using EXECUTE IMMEDIATE with bindings and explicit identifier rules.
04
Build a replayable harness
Create golden input sets, boundary cases (regex/time), and expected failures. Validate outputs and side effects deterministically so parity isn’t debated at cutover.
05
Integrate and cut over behind gates
Run routines in representative pipelines, compare downstream KPIs, validate reruns/backfills, and cut over with rollback-ready criteria.
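For the KPI comparison in step 05, one simple gate is a two-way set difference between the legacy output and the BigQuery output. The table names are hypothetical, and the sketch assumes both tables share the same column order and types.

```sql
-- Hypothetical tables with identical schemas:
--   proj.parity.legacy_daily_kpis  (exported from the Hadoop run)
--   proj.parity.bq_daily_kpis      (produced by the migrated routines)
SELECT 'only_in_legacy' AS side, d.*
FROM (
  SELECT * FROM `proj.parity.legacy_daily_kpis`
  EXCEPT DISTINCT
  SELECT * FROM `proj.parity.bq_daily_kpis`
) AS d
UNION ALL
SELECT 'only_in_bigquery' AS side, d.*
FROM (
  SELECT * FROM `proj.parity.bq_daily_kpis`
  EXCEPT DISTINCT
  SELECT * FROM `proj.parity.legacy_daily_kpis`
) AS d;
```

An empty result means the two outputs contain the same distinct rows. EXCEPT DISTINCT ignores duplicate counts, so a separate row-count check is still needed where duplicates are meaningful.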

Workload Assessment

Migrate UDFs and utilities with a replayable harness

We inventory your Hadoop UDFs and macro utilities, migrate a representative subset into BigQuery routines, and deliver a harness that proves parity—including side effects and rerun behavior.

Migration Acceleration

Cut over utilities with proof-backed sign-off

Get a conversion plan, review markers for ambiguous intent, and validation artifacts so UDF and utility cutover is gated by evidence and rollback-ready criteria.

FAQ

Frequently asked questions

Hadoop doesn’t really have stored procedures, so what are we migrating?

Usually UDFs (often Java/Scala/Python) and script-driven procedural behavior embedded in orchestration. We migrate that logic into BigQuery UDFs/procedures with explicit contracts and a validation harness.

Do we have to rewrite all UDFs in JavaScript?

No. Many can become BigQuery SQL UDFs. We use JS UDFs only when logic is complex or regex/object handling requires it. The target form is chosen per asset to reduce risk and cost.

How do you handle regex and string behavior differences?

We treat regex behavior as a contract: build golden cohorts for tricky strings and escaping, then validate outputs explicitly. JS UDFs are used when that’s the safest parity path.

How do you prove parity for UDFs and utilities?

We build a replayable harness with golden inputs/outputs, branch and failure-mode coverage, and side-effect assertions. Integration replay validates downstream KPIs before cutover.
