Overview Documentation Methodology Use Cases 🧪 Sandbox

Methodology

Key design decisions in building a reliable, testable, low-maintenance ETL pipeline for a multi-source data environment.

Architectural patterns documented here represent best-practice decisions applied across multiple similar engagements. Client-specific configurations and schemas are omitted.

Incremental extraction design

Full-table extracts were replaced with incremental strategies per source type. SQL sources use watermark-based extraction on updated_at or CDC (Change Data Capture) for high-frequency tables. API sources use delta endpoints where available; otherwise, full pulls with hash comparison identify changed records.

This reduced average extract runtime from 4.2 hours to 18 minutes and eliminated peak-hour load on source databases.

Staging layer: type safety and lineage

Each source maps to one dbt staging model. Staging models do no business logic — only: column renaming to snake_case, type casting, null standardisation, and deduplication by primary key. This creates a clean, documented interface that isolates source-system changes to a single model.

Source freshness tests (dbt source freshness) alert if any source table hasn't been updated within its expected window.

Intermediate layer: business join logic

Intermediate models encode all join logic between staging tables. Each intermediate model represents one business concept (e.g. int_beneficiary_programme joins beneficiary, programme, and geography dimensions). Keeping joins here prevents both duplicate logic in marts and complex multi-source staging models.

Every intermediate model has a not_null test on its primary key and a referential integrity test to the upstream staging model.

Data quality gates

The pipeline uses a three-tier data quality strategy: (1) schema tests on every model run after each dbt layer; (2) row-count anomaly detection comparing each run to a 7-day rolling average (alert threshold ±20%); (3) business rule tests — e.g. disbursements cannot exceed approved budget, beneficiary counts cannot decrease without a flag.

Failures at any tier block downstream tasks in Airflow, preventing corrupted data from reaching Power BI reports.

Power BI orchestrated refresh

After marts complete, Airflow calls the Power BI REST API (POST /datasets/{datasetId}/refreshes) to trigger dataset refreshes. Reports are refreshed in dependency order — shared datasets first, then dependent reports.

A polling loop checks refresh status every 2 minutes and records completion latency. Average end-to-end pipeline run: 38 minutes. Reports are available to stakeholders by 07:30 daily — 2.5 hours earlier than the previous manual process.

Interactive simulator and demo data strategy

The completed sandbox uses demo operational records so the portfolio can demonstrate ETL automation without exposing confidential source schemas or client data. The simulator creates multi-source records, injects realistic issues such as duplicate primary keys and over-budget values, then applies the same pattern used in production pipelines: stage, validate, transform, quarantine exceptions and publish a reporting mart.

The business value is intentionally visible in the interface: users can see time saved, report turnaround improvement, quality pass rate, clean mart rows, practical use cases and exportable CSV outputs. This makes the project understandable to both technical reviewers and business stakeholders.

Technical stack

Orchestration

Apache Airflow 2.8

Transform

dbt Core + BigQuery

Extract

Python + SQLAlchemy

Warehouse

Google BigQuery

Power BI Premium

Source control

GitHub + CI/CD

Monitoring

Grafana + Slack alerts

Deployment

GKE (Google Kubernetes)