The Customer Churn Prediction Pipeline is a production-grade machine learning system designed to identify subscribers at high risk of cancellation within a 30-day forward window. It processes customer behavioural, transactional, and engagement signals to generate individual churn scores with SHAP-powered explanations.
The system comprises six independent Python packages, each with a defined interface:
churn_pipeline/
├── ingestion/
│   ├── connectors.py        # DB / API / CSV adapters
│   └── validators.py        # Schema validation (Pandera)
├── features/
│   ├── behavioural.py       # Rolling window aggregations
│   ├── financial.py         # Payment and spend features
│   └── pipeline.py          # sklearn Pipeline wrapper
├── training/
│   ├── train.py             # XGBoost + Optuna tuning
│   ├── evaluate.py          # AUC, PR, threshold calibration
│   └── registry.py          # MLflow model registry
├── scoring/
│   ├── batch_scorer.py      # Airflow DAG entry point
│   └── realtime.py          # Flask endpoint
├── explainability/
│   └── shap_explainer.py    # SHAP waterfall generation
└── monitoring/
    └── drift.py             # PSI + feature monitoring
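The monitoring package is described only as "PSI + feature monitoring", so the following is a minimal sketch of a Population Stability Index calculation rather than the contents of drift.py. It assumes equal-width bins derived from the baseline sample; the function name, the default of 10 bins, and the epsilon floor are all hypothetical choices for illustration.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a baseline sample and a recent sample of one numeric feature.

    Hypothetical helper: bin edges are equal-width over the baseline range.
    """
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 when the baseline is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        total = len(values)
        # Floor each proportion at eps so the logarithm is always defined.
        return [max(c / total, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [float(v) for v in range(100)]
shifted = [v + 30.0 for v in baseline]
psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)
```

Identical samples give a PSI of zero, and the value grows as the recent distribution moves away from the baseline. Alert thresholds vary by team and are not specified here.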
Required input columns for the scoring pipeline:
Column               Type   Description
customer_id          str    Unique customer identifier
tenure_months        int    Months since account activation
plan_type            str    {'prepaid', 'postpaid', 'hybrid'}
data_usage_30d_mb    float  Total data consumed in last 30 days
data_usage_60d_mb    float  Total data consumed in last 60 days
call_minutes_30d     float  Total call minutes in last 30 days
complaint_count_90d  int    Support tickets raised in 90 days
top_up_count_30d     int    Number of top-ups (prepaid)
payment_delay_days   int    Average payment delay in days
last_app_login_days  int    Days since last app/portal login
contract_end_days    int    Days until contract expiry (-1 if open)
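The column list above can be checked before scoring. The repository's validators.py uses Pandera; the sketch below is a dependency-free stand-in that only illustrates the idea, and the names `REQUIRED_COLUMNS`, `ALLOWED_PLAN_TYPES`, and `validate_record` are hypothetical. It assumes a single record arrives as a plain dict.

```python
# Hypothetical stand-in for a schema check; not the Pandera schema in validators.py.
REQUIRED_COLUMNS = {
    "customer_id": str,
    "tenure_months": int,
    "plan_type": str,
    "data_usage_30d_mb": float,
    "data_usage_60d_mb": float,
    "call_minutes_30d": float,
    "complaint_count_90d": int,
    "top_up_count_30d": int,
    "payment_delay_days": int,
    "last_app_login_days": int,
    "contract_end_days": int,
}
ALLOWED_PLAN_TYPES = {"prepaid", "postpaid", "hybrid"}


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for column, expected in REQUIRED_COLUMNS.items():
        if column not in record:
            problems.append(f"missing column: {column}")
            continue
        value = record[column]
        # Accept ints where floats are expected; reject bools, which are ints in Python.
        ok_types = (int, float) if expected is float else (expected,)
        if isinstance(value, bool) or not isinstance(value, ok_types):
            problems.append(
                f"{column}: expected {expected.__name__}, got {type(value).__name__}"
            )
    plan = record.get("plan_type")
    if isinstance(plan, str) and plan not in ALLOWED_PLAN_TYPES:
        problems.append(f"plan_type: unexpected value {plan!r}")
    return problems


# Illustrative record with made-up values.
example = {
    "customer_id": "CUST-0042",
    "tenure_months": 18,
    "plan_type": "prepaid",
    "data_usage_30d_mb": 1200.0,
    "data_usage_60d_mb": 2600.0,
    "call_minutes_30d": 310.5,
    "complaint_count_90d": 3,
    "top_up_count_30d": 4,
    "payment_delay_days": 0,
    "last_app_login_days": 12,
    "contract_end_days": -1,
}
problems = validate_record(example)
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record at once.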
git clone https://github.com/chirchirp/churn-prediction-pipeline.git
cd churn-prediction-pipeline
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your database credentials and MLflow URI
from churn_pipeline.training.train import ChurnTrainer
trainer = ChurnTrainer(
    data_path="data/training_data.csv",
    target_col="churned_30d",
    experiment_name="churn-v2"
)
model, metrics = trainer.run(n_trials=200)
print(f"AUC: {metrics['auc']:.4f}")
from churn_pipeline.scoring.batch_scorer import BatchScorer
scorer = BatchScorer(model_version="production")
results = scorer.score(
    data_path="data/customers_2025_01.csv",
    threshold=0.38,
    output_path="outputs/scored_2025_01.csv",
    include_shap=True
)
print(f"Scored {len(results)} customers | High risk: {results['high_risk'].sum()}")
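How `BatchScorer` applies `threshold` internally is not shown here. As a rough illustration only, the sketch below assumes a customer is flagged high risk when the churn score is greater than or equal to the threshold; the inclusive comparison and the helper name `flag_high_risk` are assumptions.

```python
def flag_high_risk(scores: dict, threshold: float = 0.38) -> dict:
    """Map each customer_id to True when its score meets the threshold.

    Hypothetical helper; the >= comparison is an assumption.
    """
    return {customer_id: score >= threshold for customer_id, score in scores.items()}


# Made-up scores for illustration.
scores = {"CUST-0001": 0.12, "CUST-0042": 0.72, "CUST-0100": 0.38}
flags = flag_high_risk(scores)
high_risk_count = sum(flags.values())
```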
import httpx
response = httpx.post(
    "http://localhost:8000/predict",
    json={
        "customer_id": "CUST-0042",
        "tenure_months": 18,
        "data_usage_30d_mb": 1200,
        "complaint_count_90d": 3,
        # ... other features
    }
)
print(response.json())
# {"customer_id": "CUST-0042", "churn_score": 0.72,
# "risk_level": "High", "shap_top_features": [...]}
POST /api/predict — Score current customer records and return churn probability, risk band, and retention actions
POST /api/train — Train the churn model from a historical CSV with a selected target column
GET /api/sample/training — Return sample training data preview
GET /download — Download the latest scored churn output
The uploaded Streamlit app has been converted into a Flask-ready implementation while keeping a fast browser sandbox for the portfolio. This avoids public-demo cold boots and gives recruiters or partners an instant interactive experience.
flask_app/ can be deployed separately when server-side training or scoring is needed.
The API is containerised and deployable on any OCI-compliant platform:
# Build and run with Docker
docker build -t churn-api:latest .
docker run -p 8000:8000 --env-file .env churn-api:latest
# Or deploy to Azure Container Apps
az containerapp up --name churn-api \
--resource-group rg-analytics \
--image chirchirp/churn-api:latest \
--ingress external --target-port 8000
Ensure MLFLOW_TRACKING_URI and MODEL_REGISTRY_URI are set in your environment before deploying to production.
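A quick pre-deployment check for the two variables can be sketched as follows. The helper name `missing_env_vars` is hypothetical and not part of the repository; it treats empty strings as missing.

```python
import os

REQUIRED_ENV_VARS = ("MLFLOW_TRACKING_URI", "MODEL_REGISTRY_URI")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper; pass a dict to check something other than os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
missing = missing_env_vars({"MLFLOW_TRACKING_URI": "http://localhost:5000"})
```

A deploy script could call `missing_env_vars()` with no arguments and stop if the returned list is not empty.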