Customer Churn Prediction — Documentation

v2.1.0 | Python 3.10+ | XGBoost 2.0 | Flask | Production Ready

Overview

The Customer Churn Prediction Pipeline is a production-grade machine learning system designed to identify subscribers at high risk of cancellation within a 30-day forward window. It processes customer behavioural, transactional, and engagement signals to generate individual churn scores with SHAP-powered explanations.

Sector applicability: While built for telecoms, the pipeline's modular feature engineering layer makes it adaptable to insurance policy renewal, NGO donor lapse prediction, and SaaS subscription churn with minimal rework.

Architecture

The system comprises six independent Python packages, each with a defined interface:

churn_pipeline/
├── ingestion/
│   ├── connectors.py        # DB / API / CSV adapters
│   └── validators.py        # Schema validation (Pandera)
├── features/
│   ├── behavioural.py       # Rolling window aggregations
│   ├── financial.py         # Payment and spend features
│   └── pipeline.py          # sklearn Pipeline wrapper
├── training/
│   ├── train.py             # XGBoost + Optuna tuning
│   ├── evaluate.py          # AUC, PR, threshold calibration
│   └── registry.py          # MLflow model registry
├── scoring/
│   ├── batch_scorer.py      # Airflow DAG entry point
│   └── realtime.py          # Flask endpoint
├── explainability/
│   └── shap_explainer.py    # SHAP waterfall generation
└── monitoring/
    └── drift.py             # PSI + feature monitoring
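
The drift check in monitoring/drift.py is described only as PSI plus feature monitoring. As a rough illustration, the Population Stability Index for a single feature can be sketched with the standard library as below. The function name, the equal-width binning over the baseline range, and the eps floor are assumptions for this example, not the module's actual implementation.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a baseline sample and a current sample of one feature.

    Hypothetical helper for illustration; not taken from monitoring/drift.py.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # Values outside the baseline range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [float(i) for i in range(100)]
shifted = [v + 30 for v in baseline]
psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)
```

A commonly cited rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, though suitable cut-offs depend on the feature.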

Data Schema

Required input columns for the scoring pipeline:

customer_id          str      Unique customer identifier
tenure_months        int      Months since account activation
plan_type            str      {'prepaid', 'postpaid', 'hybrid'}
data_usage_30d_mb    float    Total data consumed in last 30 days
data_usage_60d_mb    float    Total data consumed in last 60 days
call_minutes_30d     float    Total call minutes in last 30 days
complaint_count_90d  int      Support tickets raised in 90 days
top_up_count_30d     int      Number of top-ups (prepaid)
payment_delay_days   int      Average payment delay in days
last_app_login_days  int      Days since last app/portal login
contract_end_days    int      Days until contract expiry (-1 if open)
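
validators.py is listed as using Pandera for schema validation. The sketch below shows the same kind of check in plain Python so that it runs without extra dependencies; it is a stand-in for illustration, not the Pandera schema itself. The names REQUIRED_COLUMNS, PLAN_TYPES, and validate_record are hypothetical.

```python
from typing import Any

# Column name -> expected Python type, taken from the schema table above.
REQUIRED_COLUMNS: dict[str, type] = {
    "customer_id": str,
    "tenure_months": int,
    "plan_type": str,
    "data_usage_30d_mb": float,
    "data_usage_60d_mb": float,
    "call_minutes_30d": float,
    "complaint_count_90d": int,
    "top_up_count_30d": int,
    "payment_delay_days": int,
    "last_app_login_days": int,
    "contract_end_days": int,
}
PLAN_TYPES = {"prepaid", "postpaid", "hybrid"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable schema errors (empty if valid)."""
    errors: list[str] = []
    for column, expected in REQUIRED_COLUMNS.items():
        if column not in record:
            errors.append(f"missing column: {column}")
            continue
        value = record[column]
        # Integers are accepted for float columns; bools are never accepted.
        accepted = (int, float) if expected is float else (expected,)
        if isinstance(value, bool) or not isinstance(value, accepted):
            errors.append(
                f"{column}: expected {expected.__name__}, got {type(value).__name__}"
            )
    plan = record.get("plan_type")
    if isinstance(plan, str) and plan not in PLAN_TYPES:
        errors.append(f"plan_type: unexpected value {plan!r}")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a batch job report every issue in a record at once.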

Installation

git clone https://github.com/chirchirp/churn-prediction-pipeline.git
cd churn-prediction-pipeline

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your database credentials and MLflow URI

Usage

Training a new model

from churn_pipeline.training.train import ChurnTrainer

trainer = ChurnTrainer(
    data_path="data/training_data.csv",
    target_col="churned_30d",
    experiment_name="churn-v2"
)
model, metrics = trainer.run(n_trials=200)
print(f"AUC: {metrics['auc']:.4f}")

Batch scoring

from churn_pipeline.scoring.batch_scorer import BatchScorer

scorer = BatchScorer(model_version="production")
results = scorer.score(
    data_path="data/customers_2025_01.csv",
    threshold=0.38,
    output_path="outputs/scored_2025_01.csv",
    include_shap=True
)
print(f"Scored {len(results)} customers | High risk: {results['high_risk'].sum()}")
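
The threshold argument and the high_risk column suggest that scores are compared against a cut-off. A minimal sketch of that step follows, assuming a customer is flagged when the score is at or above the threshold; the comparison direction and the helper name flag_high_risk are assumptions, not taken from batch_scorer.py.

```python
def flag_high_risk(scores: dict[str, float], threshold: float = 0.38) -> list[dict]:
    """Attach a boolean high_risk flag to each customer's churn score."""
    return [
        {"customer_id": cid, "churn_score": score, "high_risk": score >= threshold}
        for cid, score in scores.items()
    ]


# Illustrative scores only; real scores would come from the model.
rows = flag_high_risk({"CUST-0042": 0.72, "CUST-0043": 0.12})
high_risk_count = sum(row["high_risk"] for row in rows)
print(f"Scored {len(rows)} customers | High risk: {high_risk_count}")
```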

Real-time API call

import httpx

response = httpx.post(
    "http://localhost:8000/api/predict",
    json={
        "customer_id": "CUST-0042",
        "tenure_months": 18,
        "data_usage_30d_mb": 1200,
        "complaint_count_90d": 3,
        # ... other features
    }
)
print(response.json())
# {"customer_id": "CUST-0042", "churn_score": 0.72,
#  "risk_level": "High", "shap_top_features": [...]}

API Reference

POST /api/predict — Score current customer records and return churn probability, risk band, and retention actions

POST /api/train — Train the churn model from a historical CSV with a selected target column

GET /api/sample/training — Return sample training data preview

GET /download — Download the latest scored churn output

Performance

AUC-ROC          0.893
Precision@500    0.74
Inference P95    <48 ms
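
For reference, the two less familiar metrics can be computed as sketched below. This assumes Precision@500 means the share of true churners among the 500 highest-scored customers, and that P95 is a nearest-rank 95th percentile of request latencies. Both definitions and the function names are assumptions, since the evaluation code is not shown here.

```python
import math
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of positives (label 1) among the k highest-scored items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        raise ValueError("k must be positive and scores must be non-empty")
    return sum(label for _, label in top) / len(top)


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Toy data for illustration only; not the pipeline's evaluation set.
toy_precision = precision_at_k([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0], k=2)
toy_p95 = percentile_nearest_rank(list(range(1, 101)), 95)
```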

Streamlit-to-Flask Conversion

The original Streamlit app has been converted into a Flask implementation, while a fast in-browser sandbox is kept for the portfolio. The sandbox avoids the cold starts of a public demo and gives recruiters or partners an instant interactive experience.

  • Training: upload or load historical customer data with a churn label.
  • Target setup: select the churn target column and optionally exclude ID/leakage fields.
  • Scoring: upload current customer data and generate churn probability, risk level, drivers, and retention action.
  • Export: download a CRM-ready churn prediction file.

Portfolio deployment note: The sandbox runs quickly in the browser; the Flask app under flask_app/ can be deployed separately when server-side training/scoring is needed.
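
The target setup step listed above (choose the churn label, optionally drop ID or leakage fields) could look roughly like this. The helper name select_feature_columns and the example column names other than those in the data schema are hypothetical.

```python
from typing import Iterable, Sequence


def select_feature_columns(
    columns: Sequence[str],
    target_col: str,
    exclude: Iterable[str] = (),
) -> list[str]:
    """Return model feature columns: everything except the target and exclusions."""
    if target_col not in columns:
        raise ValueError(f"target column {target_col!r} not found in data")
    dropped = {target_col, *exclude}
    # Preserve the original column order so features line up across runs.
    return [col for col in columns if col not in dropped]


features = select_feature_columns(
    ["customer_id", "tenure_months", "complaint_count_90d", "churned_30d"],
    target_col="churned_30d",
    exclude=["customer_id"],  # identifiers carry no predictive signal
)
```

Failing fast on a missing target column surfaces a wrong selection before any training starts.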

Deployment

The API is containerised and deployable on any OCI-compliant platform:

# Build and run with Docker
docker build -t churn-api:latest .
docker run -p 8000:8000 --env-file .env churn-api:latest

# Or deploy to Azure Container Apps
az containerapp up --name churn-api \
  --resource-group rg-analytics \
  --image chirchirp/churn-api:latest \
  --ingress external --target-port 8000

Note: Ensure MLFLOW_TRACKING_URI and MODEL_REGISTRY_URI are set in your environment before deploying to production.
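
One way to enforce the note above is a small pre-flight check that runs before the service starts. This is a sketch under the assumption that both variables are read from the process environment; the function name missing_env_vars is hypothetical.

```python
import os
from typing import Mapping, Optional

REQUIRED_ENV_VARS = ("MLFLOW_TRACKING_URI", "MODEL_REGISTRY_URI")


def missing_env_vars(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```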