Teradata ETL pipelines to BigQuery
Re-home batch and incremental ETL from Teradata into BigQuery, including staging, dedupe, SCD patterns, and orchestration, with an explicit run contract and validation gates that prevent KPI drift.
- Input
- Teradata ETL / pipeline migration logic
- Output
- BigQuery equivalent (validated)
Why this breaks
Teradata ETL systems often carry decades of operational behavior: restartability patterns, control tables, load windows, and incremental rules shaped by Teradata’s PI/AMP execution model and tooling (BTEQ, FastLoad/TPump, MultiLoad, TPT). When migrated naïvely, jobs may “run” in BigQuery but produce drift or miss SLAs because the run contract was never made explicit.
Common symptoms after cutover:
- Duplicates or missing updates because watermarks and tie-breakers were implicit
- SCD dimensions drift during backfills, restarts, or late-arrival corrections
- Control-table driven restart logic breaks; reruns double-apply or silently skip
- Performance collapses because pipelines lose pruning-aware staging boundaries
- Orchestration dependencies and retries change, turning failures into silent data issues
How conversion works
- Inventory & classify ETL jobs, dependencies, schedules, and operational controls (control tables, restart markers, load windows).
- Extract the run contract: business keys, incremental boundaries, dedupe rules, SCD strategy, error handling, and restartability semantics.
- Re-home transformations using BigQuery-native staging (landing -> typed staging -> dedupe -> apply) with partitioning/clustering aligned to load windows.
- Rebuild restartability: idempotency keys, applied-batch tracking, and deterministic ordering so retries are safe.
- Re-home orchestration (Composer/Airflow, dbt, or your runner) with explicit DAG dependencies, retries, and alerts.
- Gate cutover with evidence: golden outputs + incremental integrity simulations (reruns, backfills, late-arrival injections) and rollback-ready criteria.
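The deterministic dedupe step above is the part most often left implicit in Teradata scripts. A minimal sketch of what "deterministic" means in practice is shown below; all table and column names (`proj.stage.orders_typed`, `order_id`, `source_seq`, and so on) are illustrative assumptions, not a prescribed schema.

```sql
-- Hypothetical typed-staging dedupe: one row per business key, chosen by a
-- total ordering so retries and reruns always pick the same survivor.
CREATE OR REPLACE TABLE `proj.stage.orders_dedup` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    s.*,
    ROW_NUMBER() OVER (
      PARTITION BY order_id                     -- business key
      ORDER BY updated_at DESC,                 -- primary ordering
               source_seq DESC,                 -- stable tie-breaker
               ingest_file_name, ingest_offset  -- final deterministic tie-break
    ) AS rn
  FROM `proj.stage.orders_typed` s
  WHERE updated_at > @high_water_mark
    AND updated_at <= @window_end
)
WHERE rn = 1;
```

The key design point: the `ORDER BY` must be a total order over duplicates. Without the trailing tie-breakers, two rows with identical `updated_at` can swap ranks between runs, which is exactly the retry drift the pitfalls section warns about.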
Supported constructs
Representative Teradata ETL constructs we commonly migrate to BigQuery (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| BTEQ / SQL scripts with control tables | BigQuery SQL + control tables for restartability | Applied-window tracking and idempotent reruns made explicit. |
| FastLoad/TPump/MultiLoad patterns | Landing + typed staging + partition-scoped apply | Replace load-tool semantics with deterministic staging boundaries. |
| Incremental loads (watermarks, load windows) | Explicit high-water mark + late window policy | Auditable and testable under retries/backfills. |
| SCD Type-1 / Type-2 apply logic | MERGE + end-date/current-flag patterns | Backfills and restarts validated as first-class scenarios. |
| Staging/volatile tables | Dataset staging + temp tables / materializations | Concurrency-safe and pruning-aligned. |
| Workload management constraints | Reservations/slots + concurrency controls | Predictable batch windows and BI refresh coexistence. |
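For the SCD Type-2 row in the table above, one common shape is a two-step close-out-then-insert apply. This is a sketch under assumed names (`proj.dw.dim_customer`, `attr_hash`, `effective_ts` are all hypothetical); in BigQuery the two statements can be wrapped in a multi-statement transaction for atomicity.

```sql
-- Hypothetical SCD Type-2 apply in two deterministic, rerunnable steps.

-- 1) Close out current rows whose attributes changed in this window.
UPDATE `proj.dw.dim_customer` d
SET d.valid_to = s.effective_ts,
    d.is_current = FALSE
FROM `proj.stage.customer_dedup` s
WHERE d.customer_id = s.customer_id
  AND d.is_current
  AND d.attr_hash != s.attr_hash;

-- 2) Insert new current versions for changed or brand-new keys.
INSERT INTO `proj.dw.dim_customer`
  (customer_id, customer_name, segment, attr_hash, valid_from, valid_to, is_current)
SELECT s.customer_id, s.customer_name, s.segment, s.attr_hash,
       s.effective_ts, TIMESTAMP '9999-12-31', TRUE
FROM `proj.stage.customer_dedup` s
LEFT JOIN `proj.dw.dim_customer` d
  ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL;  -- covers both changed (now closed) and new keys
```

Note the restart property: rerunning step 1 is a no-op (changed rows are already closed), and rerunning step 2 skips keys that already have a current row, so a mid-run failure followed by a resume converges to the same final state.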
How workload changes
| Topic | Teradata | BigQuery |
|---|---|---|
| Incremental correctness | Often encoded in control tables, load windows, and restart conventions | Explicit high-water marks + deterministic staging + integrity gates |
| Staging and apply | Load tools + volatile tables + PI/AMP locality | Landing -> typed staging -> dedupe -> partition-scoped apply |
| Orchestration | Enterprise schedulers + script chains | Composer/dbt/native orchestration with explicit DAG contracts |
| Cost and capacity | Platform resource governance + workload classes | Bytes scanned + slot time + reservation strategy |
Examples
Canonical BigQuery apply pattern for incremental ETL: stage -> dedupe deterministically -> MERGE with scoped partitions + applied-window tracking. Adjust keys, partitions, and casts to your model.
```sql
-- Control table for load windows (restartability)
CREATE TABLE IF NOT EXISTS `proj.control.load_windows` (
  job_name STRING NOT NULL,
  window_start TIMESTAMP NOT NULL,
  window_end TIMESTAMP NOT NULL,
  status STRING NOT NULL,
  applied_at TIMESTAMP
);

-- Example: mark a window as STARTED
INSERT INTO `proj.control.load_windows` (job_name, window_start, window_end, status)
VALUES (@job_name, @window_start, @window_end, 'STARTED');
```

Common pitfalls
- Assuming Teradata load tools map 1:1: FastLoad/TPump patterns need BigQuery-native staging and apply strategy.
- Implicit restart logic: control-table conventions not recreated, causing reruns to double-apply or skip.
- Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers, leading to drift under retries.
- PI/AMP assumptions carried over: join and staging shapes that relied on locality now reshuffle large datasets.
- Unbounded applies: MERGE/upsert jobs scan full targets without partition-scoped boundaries.
- Schema evolution surprises: upstream types widen; downstream typed tables fail or truncate.
- No incremental integrity tests: parity looks fine on a one-time run but breaks under backfills/late data.
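The "unbounded applies" pitfall has a direct remedy: constrain the target side of the MERGE so BigQuery can prune to the load window's partitions. A sketch under assumed names (`proj.dw.fact_orders` partitioned on `event_date`, which here is hypothetical):

```sql
-- Hypothetical partition-scoped MERGE: the filter on the target's partition
-- column in the ON clause bounds the apply to the load window instead of
-- scanning the full table. Note that pruning with query parameters can be
-- inconsistent in some contexts; literal dates are the safest form.
MERGE `proj.dw.fact_orders` t
USING `proj.stage.orders_dedup` s
ON  t.order_id = s.order_id
AND t.event_date BETWEEN DATE(@window_start) AND DATE(@window_end)
WHEN MATCHED THEN UPDATE SET
  t.status = s.status,
  t.amount = s.amount,
  t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, event_date, status, amount, updated_at)
VALUES (s.order_id, DATE(s.event_ts), s.status, s.amount, s.updated_at);
```

One caveat worth testing explicitly: rows whose corrected partition key moves them outside the scoped window will not match and will insert as duplicates, so the late-arrival policy must define how partition-moving corrections are handled.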
Validation approach
- Execution checks: pipeline runs reliably under representative volumes and schedules.
- Structural parity: partition-level row counts and column profiles (null/min/max/distinct) for key tables.
- KPI parity: aggregates by key dimensions for critical marts and dashboards.
- Incremental integrity (mandatory):
- Idempotency: rerun same load window -> no net change
- Restart simulation: fail mid-run -> resume -> correct final state
- Backfill: historical windows replay without SCD drift
- Late-arrival: inject late corrections -> only expected rows change
- Cost/performance gates: pruning verified; scan bytes and runtime thresholds set for top jobs.
- Operational readiness: retry/alerting tests and rollback criteria defined before cutover.
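The idempotency gate above can be checked mechanically: snapshot the window, rerun the load, and assert zero symmetric difference. A sketch with hypothetical table names (`proj.audit.fact_orders_snapshot` is an assumed pre-rerun copy):

```sql
-- Hypothetical idempotency gate: after rerunning the same load window,
-- the target partition range must be byte-identical to the snapshot.
SELECT COUNT(*) AS diff_rows
FROM (
  SELECT * FROM `proj.dw.fact_orders`
  WHERE event_date BETWEEN DATE(@window_start) AND DATE(@window_end)
  EXCEPT DISTINCT
  SELECT * FROM `proj.audit.fact_orders_snapshot`
  WHERE event_date BETWEEN DATE(@window_start) AND DATE(@window_end)
);
-- Gate: diff_rows must be 0 for the rerun to count as idempotent.
```

Running the same `EXCEPT DISTINCT` in both directions catches both phantom inserts and lost rows; a single direction only catches one side.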
Migration steps
- 01
Inventory ETL jobs, schedules, and dependencies
Extract job chains, upstream/downstream dependencies, SLAs, retry policies, and control-table conventions. Identify business-critical marts and consumers.
- 02
Formalize the run contract
Define load windows/high-water marks, business keys, deterministic ordering/tie-breakers, dedupe rules, late-arrival policy, and SCD strategy. Make restartability explicit.
- 03
Rebuild transformations on BigQuery-native staging
Implement landing -> typed staging -> dedupe -> apply, with partitioning/clustering aligned to load windows and BI access paths. Define schema evolution policy (widen/quarantine/reject).
- 04
Re-home orchestration and operations
Implement DAGs in Composer/Airflow or your orchestrator: dependencies, retries, alerts, and concurrency. Recreate control-table driven restart logic with idempotent markers.
- 05
Run parity and incremental integrity gates
Golden outputs + KPI aggregates, restart simulations, idempotency reruns, late-data injections, and backfill windows. Cut over only when thresholds pass and rollback criteria are defined.
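The restart logic formalized in step 02 and rebuilt in step 04 can reuse the control table from the earlier example. One possible resume query, sketched against that table (the status values `STARTED`/`APPLIED` and the epoch fallback are illustrative conventions):

```sql
-- Hypothetical restart-aware window selection: replay any STARTED window
-- (a run that failed mid-flight) before advancing past the last APPLIED one.
SELECT
  COALESCE(
    (SELECT MIN(window_start) FROM `proj.control.load_windows`
     WHERE job_name = @job_name AND status = 'STARTED'),
    (SELECT MAX(window_end) FROM `proj.control.load_windows`
     WHERE job_name = @job_name AND status = 'APPLIED'),
    TIMESTAMP '1970-01-01'  -- first-ever run: start from the epoch
  ) AS next_window_start;
```

Because the apply itself is idempotent, replaying a `STARTED` window is safe; the control table only needs to guarantee that no window is silently skipped.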
We inventory your Teradata ETL estate, formalize the run contract (windows, restartability, SCD), migrate a representative pipeline end-to-end, and deliver parity evidence with cutover gates.
Get an actionable migration plan with restart simulations, incremental integrity tests, reconciliation evidence, and cost/performance baselines, so ETL cutover is controlled and dispute-proof.