IEEE-CIS Fraud Detection — Sugumaran Balasubramaniyan

Problem Statement

Fraud costs the global economy over $5 trillion annually. The IEEE-CIS Kaggle competition provided a real-world fraud detection dataset with 590K+ transactions and 400+ features across identity and transaction tables. The challenge: a fraud rate below 1%, which demands careful handling of extreme class imbalance while keeping precision high enough to avoid blocking legitimate transactions.

Technical Approach

Data Integration

Merged two large tables on transaction ID: transaction records (590K rows, 394 features) and identity information (144K rows, 41 features). The identity table had 50%+ missing values, which were handled through imputation: categorical variables were filled with the mode and numerical variables with the median. Binary "is_missing" flags were created alongside the imputed values, and they proved predictive because missing identity data correlated with fraud.
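The merge-and-impute step described above can be sketched in base R as follows. This is a minimal illustration on toy data, not the project's actual code: the tables, every column name other than TransactionID, and the helpers stat_mode and impute_with_flags are hypothetical, and the left join is an assumption about how transactions without identity records were kept.

```r
# Toy stand-ins for the two tables (hypothetical columns and values).
transaction <- data.frame(
  TransactionID = 1:6,
  TransactionAmt = c(50, 120, NA, 75, 300, 20),
  isFraud = c(0, 0, 1, 0, 1, 0)
)
identity <- data.frame(
  TransactionID = c(2L, 3L, 5L),
  DeviceType = c("mobile", NA, "desktop"),
  id_score = c(0.4, NA, 0.9),
  stringsAsFactors = FALSE
)

# Left join: keep every transaction, even those with no identity record.
merged <- merge(transaction, identity, by = "TransactionID", all.x = TRUE)

# Most frequent non-missing value (first one wins on ties).
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# For each column with missing values: record a 0/1 flag, then impute
# numeric columns with the median and all other columns with the mode.
impute_with_flags <- function(df, skip = c("TransactionID", "isFraud")) {
  for (col in setdiff(names(df), skip)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    df[[paste0(col, "_is_missing")]] <- as.integer(missing)
    if (is.numeric(df[[col]])) {
      fill <- median(df[[col]], na.rm = TRUE)
    } else {
      fill <- stat_mode(df[[col]])
    }
    df[[col]][missing] <- fill
  }
  df
}

clean <- impute_with_flags(merged)
print(clean)
```

Creating the flag before overwriting the missing values is what preserves the "identity was absent" signal. In a real pipeline the medians and modes would normally be computed on the training folds only and then reused on held-out data.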

Handling Class Imbalance

SMOTE (Synthetic Minority Over-sampling Technique): Generated synthetic fraud examples by interpolating between existing fraud cases in feature space rather than duplicating them, which is crucial with such extreme imbalance (~0.8% fraud rate)
Stratified K-Fold Cross-Validation: Preserved the fraud ratio in each fold so that evaluation metrics stay reliable
Threshold Tuning: Tuned the decision threshold along the precision-recall tradeoff, maximizing recall while holding precision at 80% or higher to limit false positives for business viability
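The three techniques above can be illustrated with a small base R sketch. It is written under stated assumptions rather than taken from the project: the function names stratified_folds, smote_sketch, and pick_threshold are hypothetical, the data is synthetic, and smote_sketch is a simplified rendering of the SMOTE idea, not the implementation from any particular R package.

```r
set.seed(42)

# Assign fold ids class by class so every fold keeps the overall fraud ratio.
stratified_folds <- function(y, k = 5) {
  folds <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  folds
}

# SMOTE-style oversampling: each synthetic row lies on the line segment
# between a minority row and one of its k nearest minority neighbours.
smote_sketch <- function(X_min, n_new, k = 5) {
  X_min <- as.matrix(X_min)
  n <- nrow(X_min)
  k <- min(k, n - 1)
  d <- as.matrix(dist(X_min))  # pairwise Euclidean distances
  diag(d) <- Inf               # a row is not its own neighbour
  synth <- matrix(NA_real_, n_new, ncol(X_min))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)
    neighbours <- order(d[base, ])[seq_len(k)]
    nb <- neighbours[sample.int(k, 1)]
    gap <- runif(1)
    synth[i, ] <- X_min[base, ] + gap * (X_min[nb, ] - X_min[base, ])
  }
  synth
}

# Lowest threshold (hence highest recall) whose precision meets the floor.
pick_threshold <- function(score, y, min_precision = 0.8) {
  best <- c(threshold = NA_real_, precision = NA_real_, recall = 0)
  for (t in sort(unique(score))) {
    pred <- score >= t
    tp <- sum(pred & y == 1)
    precision <- tp / sum(pred)
    recall <- tp / sum(y == 1)
    if (precision >= min_precision && recall > best[["recall"]]) {
      best <- c(threshold = t, precision = precision, recall = recall)
    }
  }
  best
}

# Usage on synthetic data.
y <- c(rep(0, 190), rep(1, 10))
folds <- stratified_folds(y, k = 5)
print(table(folds, y))

X_fraud <- matrix(rnorm(20), ncol = 2)
synth <- smote_sketch(X_fraud, n_new = 30)

score <- c(0.1, 0.2, 0.3, 0.4, 0.9, 0.8)
labels <- c(0, 0, 1, 0, 1, 1)
res <- pick_threshold(score, labels, min_precision = 0.8)
print(res)
```

One design point worth keeping in mind: oversampling is usually applied only to the training portion of each fold, so that validation folds contain real transactions at the original fraud ratio and the tuned threshold reflects production conditions.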

Ensemble Architecture

Built a stacked ensemble in R combining XGBoost, LightGBM, and Random Forest. Each base model was trained on SMOTE-balanced folds with different feature subsets (transaction-only, identity-only, combined). A logistic regression meta-learner weighted the base predictions; XGBoost received the highest weight, reflecting its strong performance on the sparse, high-dimensional feature space.
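The stacking mechanics can be sketched with base R alone. To keep the example self-contained, plain logistic regressions stand in for the XGBoost, LightGBM, and Random Forest base learners, so this shows the shape of the technique rather than the project's actual models. The data, the column names (amt, card, dev), and the feature subsets are all hypothetical.

```r
set.seed(1)

# Synthetic data: two "transaction" features and one "identity" feature.
n <- 1000
amt <- rnorm(n)
card <- rnorm(n)
dev <- rnorm(n)
y <- rbinom(n, 1, plogis(-3 + 1.5 * amt + 1.0 * dev))
df <- data.frame(y, amt, card, dev)

# Hypothetical feature subsets mirroring transaction-only,
# identity-only, and combined views of the data.
subsets <- list(
  transaction = c("amt", "card"),
  identity = c("dev"),
  combined = c("amt", "card", "dev")
)

k <- 5
folds <- sample(rep_len(seq_len(k), n))  # plain folds here; stratify in practice

# Out-of-fold predictions: every row is scored by a base model that
# never saw that row during training.
oof <- matrix(NA_real_, n, length(subsets),
              dimnames = list(NULL, names(subsets)))
for (f in seq_len(k)) {
  train <- df[folds != f, ]
  test <- df[folds == f, ]
  for (s in names(subsets)) {
    # glm is a stand-in for a gradient-boosted or random forest learner.
    fit <- glm(reformulate(subsets[[s]], response = "y"),
               data = train, family = binomial())
    oof[folds == f, s] <- predict(fit, newdata = test, type = "response")
  }
}

# Meta-learner: logistic regression on the base models' out-of-fold scores.
meta <- glm(y ~ ., data = data.frame(y = df$y, oof), family = binomial())
print(coef(meta))
```

Training the meta-learner on out-of-fold scores, rather than on in-sample predictions, keeps it from simply rewarding whichever base model overfits most. The fitted coefficients play the role of the per-model weights mentioned above.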

Key Results

Achieved 0.91 AUC-ROC, ranking in the top 15% of 6,000+ Kaggle teams
Transaction amount and card verification status were the strongest fraud indicators
The binary "is missing identity data" feature was unexpectedly predictive: the fraud rate was 3x higher when identity was incomplete
SMOTE improved recall from 62% to 81% while maintaining precision above 80%
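For readers who want to see how the metrics quoted above are defined, here is a minimal base R sketch on a four-row toy example. The helper names auc_roc and precision_recall are hypothetical, and the numbers below are unrelated to the project's results.

```r
# AUC-ROC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen fraud case scores higher than a randomly chosen legitimate one.
auc_roc <- function(score, y) {
  r <- rank(score)  # average ranks handle tied scores
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Precision: share of flagged transactions that are fraud.
# Recall: share of fraud cases that were flagged.
precision_recall <- function(pred, y) {
  tp <- sum(pred == 1 & y == 1)
  c(precision = tp / sum(pred == 1), recall = tp / sum(y == 1))
}

score <- c(0.10, 0.40, 0.35, 0.80)
y <- c(0, 0, 1, 1)

auc <- auc_roc(score, y)                               # 0.75
pr <- precision_recall(as.integer(score >= 0.35), y)   # precision 2/3, recall 1
print(auc)
print(pr)
```

AUC-ROC is threshold-free, while precision and recall depend on the chosen cutoff, which is why the two kinds of numbers are reported separately.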

Tech Stack

R, XGBoost, LightGBM, SMOTE, caret, tidyverse, data.table, ggplot2
