Automated ETL & Reporting Pipeline

This pipeline ingests from 6 source systems, applies dbt transformations to a central data warehouse, and publishes 12 Power BI reports on schedule — eliminating 40 hours per week of manual analyst effort and reducing report turnaround from 3 days to under 1 hour.

Note: Source system names, schema details, and organisational identifiers are representative. Actual client-specific configurations are confidential.

Source systems

Source Type Ingestion method Cadence
Finance ERPPostgreSQLFull + incremental CDCHourly
HR System (HRIS)REST API (JSON)Incremental by updated_atDaily 02:00
CRMSalesforce APIBulk API 2.0 deltaDaily 03:00
Field OperationsCSV over SFTPFile watch + parseOn arrival
M&E PlatformMySQLIncremental by record_idEvery 4 hours
Budget SystemExcel (SharePoint)Graph API file downloadDaily 06:00

dbt transformation layer

The transformation layer uses dbt Core running against BigQuery. Models are organised in three layers:

models/
  staging/          # 1:1 source cleaning, type casting, renaming
    stg_finance/
    stg_hr/
    stg_crm/
    stg_field_ops/
  intermediate/     # business logic joins
    int_beneficiary_programme.sql
    int_expenditure_actuals.sql
  marts/            # final fact + dimension tables for BI
    fct_disbursements.sql
    fct_programme_outcomes.sql
    dim_beneficiary.sql
    dim_geography.sql

All models have schema tests (not_null, unique, accepted_values) and row-count change tests. A failing test blocks the DAG downstream tasks and fires a Slack alert.

Scheduling & orchestration

Apache Airflow 2.8 orchestrates the pipeline with the following DAG structure:

  1. extract_all_sources (parallel fan-out)
  2. load_to_staging (parallel, post extract)
  3. run_dbt_staging
  4. run_dbt_intermediate
  5. run_dbt_marts
  6. refresh_powerbi_datasets (REST API calls to Power BI service)
  7. send_completion_notification

Report catalogue

Report Audience Refresh cadence
Executive DashboardCEO, Country DirectorHourly
Programme Outcomes ScorecardProgramme ManagersDaily
Finance vs BudgetFinance TeamDaily
Donor Disbursement TrackerFinance, DonorsDaily
Beneficiary Reach ReportM&E OfficersDaily
Field Operations SummaryOperations ManagerOn arrival
HR Headcount & AttritionHR ManagerDaily
Compliance & Audit LogCompliance OfficerWeekly

Interactive simulator

The sandbox page has been completed as a deployment-ready browser simulation. It generates demo source records, optionally injects realistic data quality issues, runs staging and mart transformations, displays dbt-style test results, charts value by programme, explains impact, lists practical use cases and exports both source and mart CSV files.

Simulator componentPurposePortfolio value
Demo data generatorCreates safe multi-source data without exposing client information.Demonstrates data engineering logic while respecting confidentiality.
Quality gatesChecks duplicate IDs, missing budgets, invalid programmes, over-budget records and date validity.Shows how poor-quality records are quarantined before reporting.
Reporting martBuilds fct_pipeline_reporting with utilisation, cost-per-case and clean reporting fields.Connects ETL work directly to Power BI-ready business outputs.
Impact panelEstimates time saved, turnaround improvement and mart quality score.Links technical automation to measurable business outcomes.

Packaged sample files are available under data/synthetic/etl_pipeline_synthetic_source.csv and data/synthetic/etl_pipeline_reporting_mart.csv.

Monitoring & alerting

Pipeline health is monitored via:

  • Airflow task failure alerts — Slack + email within 5 minutes of task failure.
  • dbt test failures — Block downstream tasks; alert on first failure.
  • Row count anomaly detection — Alert if any mart table row count changes >20% between runs.
  • Power BI refresh failure — Retry × 3, then page on-call analyst.
  • SLA breach — Alert if the full pipeline run exceeds 45 minutes.