Ensemble methods in R on 590K+ transactions — handling extreme class imbalance with SMOTE to achieve 0.91 AUC-ROC.
By some estimates, fraud costs the global economy over $5 trillion annually. The IEEE-CIS Kaggle competition provided a real-world fraud detection dataset with 590K+ transactions and 400+ features across identity and transaction tables. The challenge: fraud makes up only about 3.5% of transactions, so the models had to handle severe class imbalance while maintaining high precision to avoid blocking legitimate transactions.
Merged two large tables on transaction ID: transaction records (590K rows, 394 features) and identity information (144K rows, 41 features). Handled 50%+ missing values in the identity table with simple imputation: mode for categorical variables and median for numerical ones. Also created binary "is_missing" flags, which proved predictive because missing identity data correlated with fraud.
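The merge and imputation steps above can be sketched in base R as follows. This is a minimal illustration on toy data, not the project code: the column names other than TransactionID and isFraud, the helper names mode_value and impute_with_flags, and the use of a left join (so that transactions without an identity record are kept) are all assumptions.

```r
# Toy stand-ins for the two tables. Column names other than TransactionID
# and isFraud are illustrative assumptions, not the real schema.
transaction <- data.frame(
  TransactionID  = 1:6,
  TransactionAmt = c(50, 120, NA, 75, 300, 20),
  isFraud        = c(0, 0, 1, 0, 1, 0)
)
identity <- data.frame(
  TransactionID = c(2L, 3L, 5L),
  DeviceType    = c("mobile", NA, "mobile"),
  id_01         = c(-5, NA, -10),
  stringsAsFactors = FALSE
)

# Left join (assumption): keep every transaction, even those with no
# matching identity row. Unmatched rows get NA in the identity columns.
merged <- merge(transaction, identity, by = "TransactionID", all.x = TRUE)

# Most frequent non-missing value of a vector (hypothetical helper).
mode_value <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# For every column with missing values: add a 0/1 "<col>_is_missing" flag,
# then fill numeric columns with the median and other columns with the mode.
impute_with_flags <- function(df, skip = c("TransactionID", "isFraud")) {
  for (col in setdiff(names(df), skip)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    # Record missingness before filling so the signal is not lost.
    df[[paste0(col, "_is_missing")]] <- as.integer(missing)
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][missing] <- mode_value(df[[col]])
    }
  }
  df
}

clean <- impute_with_flags(merged)
print(clean)
```

In a cross-validated setup, the medians and modes would normally be computed on the training rows only and then applied to held-out rows, to avoid leaking information across folds.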
Built a stacked ensemble in R combining XGBoost, LightGBM, and Random Forest. Each base model was trained on SMOTE-balanced folds with a different feature subset (transaction-only, identity-only, or combined). A logistic regression meta-learner weighted the base predictions; XGBoost received the highest weight, reflecting its strong performance on the sparse, high-dimensional feature space.
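The stacking flow described above can be sketched in base R. To keep the example dependency-free, plain logistic regressions stand in for XGBoost, LightGBM, and Random Forest, and SMOTE is hand-rolled rather than taken from a package. The synthetic data, the feature names, the fold count, and the helpers smote and auc are all assumptions made for illustration; only the overall shape (SMOTE on training folds, one base model per feature subset, logistic regression on out-of-fold predictions) follows the description.

```r
set.seed(42)

# Synthetic imbalanced data standing in for the merged table (assumption).
n <- 600
X <- data.frame(amt = rnorm(n), hour = rnorm(n),
                dev = rnorm(n), browser = rnorm(n))
y <- rbinom(n, 1, plogis(-3 + 1.5 * X$amt + 1.0 * X$dev))

# Minimal SMOTE: each synthetic row is a random interpolation between a
# minority row and one of its k nearest minority neighbours.
smote <- function(X, y, k = 5) {
  minority <- as.matrix(X[y == 1, , drop = FALSE])
  n_new <- sum(y == 0) - nrow(minority)
  if (n_new <= 0 || nrow(minority) < 2) return(list(X = X, y = y))
  k <- min(k, nrow(minority) - 1)
  d <- as.matrix(dist(minority))
  a <- sample(nrow(minority), n_new, replace = TRUE)
  b <- vapply(a, function(i) {
    nn <- order(d[i, ])[2:(k + 1)]  # position 1 is the row itself
    nn[sample.int(length(nn), 1)]
  }, integer(1))
  gap <- runif(n_new)               # one interpolation weight per new row
  synth <- minority[a, , drop = FALSE] +
    gap * (minority[b, , drop = FALSE] - minority[a, , drop = FALSE])
  Xm <- rbind(as.matrix(X), synth)
  rownames(Xm) <- NULL
  list(X = as.data.frame(Xm), y = c(y, rep(1, n_new)))
}

# One base model per feature subset (names are illustrative).
feature_sets <- list(
  transaction = c("amt", "hour"),
  identity    = c("dev", "browser"),
  combined    = c("amt", "hour", "dev", "browser")
)

k_folds <- 5
fold <- sample(rep(seq_len(k_folds), length.out = n))
oof <- matrix(NA_real_, n, length(feature_sets),
              dimnames = list(NULL, names(feature_sets)))

for (f in seq_len(k_folds)) {
  train <- fold != f
  for (m in names(feature_sets)) {
    cols <- feature_sets[[m]]
    # Balance the training fold only; held-out rows stay untouched.
    bal <- smote(X[train, cols, drop = FALSE], y[train])
    fit <- glm(y ~ ., data = cbind(bal$X, y = bal$y), family = binomial())
    oof[!train, m] <- predict(fit, newdata = X[!train, cols, drop = FALSE],
                              type = "response")
  }
}

# Meta-learner: logistic regression on the out-of-fold base predictions.
meta <- glm(y ~ ., data = data.frame(oof, y = y), family = binomial())
print(coef(meta))  # larger coefficients indicate more heavily weighted models

# Rank-based AUC-ROC (equivalent to the Mann-Whitney U statistic).
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
print(apply(oof, 2, auc, label = y))
```

Applying SMOTE inside the fold loop, rather than once up front, keeps synthetic rows out of the held-out data, so the out-of-fold predictions that feed the meta-learner reflect the original class balance.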