Key design decisions in building a reliable, testable, low-maintenance ETL pipeline for a multi-source data environment.
Full-table extracts were replaced with incremental strategies per source type. SQL sources use watermark-based extraction on updated_at or CDC (Change Data Capture) for high-frequency tables. API sources use delta endpoints where available; otherwise, full pulls with hash comparison identify changed records.
This reduced average extract runtime from 4.2 hours to 18 minutes and eliminated peak-hour load on source databases.
Each source maps to one dbt staging model. Staging models do no business logic — only: column renaming to snake_case, type casting, null standardisation, and deduplication by primary key. This creates a clean, documented interface that isolates source-system changes to a single model.
Source freshness tests (dbt source freshness) alert if any source table hasn't been updated within its expected window.
Intermediate models encode all join logic between staging tables. Each intermediate model represents one business concept (e.g. int_beneficiary_programme joins beneficiary, programme, and geography dimensions). Keeping joins here prevents both duplicate logic in marts and complex multi-source staging models.
Every intermediate model has a not_null test on its primary key and a referential integrity test to the upstream staging model.
The pipeline uses a three-tier data quality strategy: (1) schema tests on every model run after each dbt layer; (2) row-count anomaly detection comparing each run to a 7-day rolling average (alert threshold ±20%); (3) business rule tests — e.g. disbursements cannot exceed approved budget, beneficiary counts cannot decrease without a flag.
Failures at any tier block downstream tasks in Airflow, preventing corrupted data from reaching Power BI reports.
After marts complete, Airflow calls the Power BI REST API (POST /datasets/{datasetId}/refreshes) to trigger dataset refreshes. Reports are refreshed in dependency order — shared datasets first, then dependent reports.
A polling loop checks refresh status every 2 minutes and records completion latency. Average end-to-end pipeline run: 38 minutes. Reports are available to stakeholders by 07:30 daily — 2.5 hours earlier than the previous manual process.
The completed sandbox uses demo operational records so the portfolio can demonstrate ETL automation without exposing confidential source schemas or client data. The simulator creates multi-source records, injects realistic issues such as duplicate primary keys and over-budget values, then applies the same pattern used in production pipelines: stage, validate, transform, quarantine exceptions and publish a reporting mart.
The business value is intentionally visible in the interface: users can see time saved, report turnaround improvement, quality pass rate, clean mart rows, practical use cases and exportable CSV outputs. This makes the project understandable to both technical reviewers and business stakeholders.