Problem Statement
Predicting patient mortality and readmission risk is critical for hospital resource allocation and preventative care. Electronic health records contain both structured data (lab results, vitals, demographics) and unstructured clinical notes (physician observations, discharge summaries). Most ML models use only one modality, leaving valuable signal unused.
Technical Approach
Multimodal Fusion Architecture
The core innovation is a late-fusion architecture that combines predictions from two independent models trained on different data modalities:
- Structured Data Model: XGBoost trained on 100K+ patient records with 200+ engineered features, including lab values, vital signs, medication history, and demographic indicators. Hyperparameters were tuned with Bayesian optimization on Amazon SageMaker.
- Unstructured Text Model: ClinicalBERT (a BERT variant) fine-tuned on 500K+ clinical notes. The model extracts semantic features from physician narratives (symptoms, diagnoses, treatment plans) that the structured features do not capture.
- Fusion Layer: Weighted ensemble combining XGBoost probability scores with BERT embeddings. A logistic regression meta-learner learns the modality weights. The fused model achieves an AUC-ROC of 0.81, a 12% improvement over the best single-modal baseline.
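The fusion step can be illustrated with a small, dependency-free sketch. It is not the project's implementation: the data is synthetic, both base models are reduced to a single probability score each (the text model's embeddings are not modeled here), and the hand-written gradient-descent logistic regression stands in for a library meta-learner such as scikit-learn's. All function and variable names are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_meta_learner(rows, labels, lr=0.5, epochs=500):
    """Fit a logistic regression by batch gradient descent.

    rows: list of [p_structured, p_text] base-model scores.
    Returns (weights, bias); the weights play the role of modality weights.
    """
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def fuse(weights, bias, x):
    # Fused risk score for one patient from the two base-model scores.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def auc_roc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def noisy_score(y, noise):
    # Synthetic stand-in for a base model's predicted probability.
    return min(1.0, max(0.0, 0.5 + 0.2 * (2 * y - 1) + random.gauss(0, noise)))


random.seed(0)
labels = [random.randint(0, 1) for _ in range(400)]
p_structured = [noisy_score(y, 0.25) for y in labels]  # plays the XGBoost score
p_text = [noisy_score(y, 0.25) for y in labels]        # plays the text-model score
rows = [[a, b] for a, b in zip(p_structured, p_text)]

weights, bias = fit_meta_learner(rows, labels)
fused = [fuse(weights, bias, x) for x in rows]

print("modality weights:", weights)
print("AUC structured:", auc_roc(p_structured, labels))
print("AUC text:", auc_roc(p_text, labels))
print("AUC fused:", auc_roc(fused, labels))
```

For brevity the sketch fits and scores the meta-learner on the same synthetic rows. With real data, the meta-learner would normally be trained on out-of-fold base-model predictions and evaluated on a held-out set, so that the fused AUC is not inflated by leakage.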
AWS Infrastructure
Built a fully serverless ML pipeline on AWS:
- AWS Glue: ETL jobs processing raw EHR data from S3, handling missing values, normalization, and feature engineering at scale
- Amazon Athena: SQL-based exploratory analysis on the data lake for hypothesis validation
- AWS Lambda: Event-driven inference triggers, so that new patient records are automatically queued for prediction
- Amazon SageMaker: Model training, hyperparameter tuning, and endpoint deployment with auto-scaling
- Amazon Bedrock: LLM-based clinical summarization that generates human-readable risk reports
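As a rough illustration of the event-driven trigger, the sketch below shows a Lambda-style handler that reads an S3 event notification and hands each new object to a queueing function. It assumes the trigger is an S3 "object created" notification, which the write-up does not state. The `submit` callable, the bucket name, and the object key are hypothetical placeholders; in a deployment `submit` might wrap a queue or an endpoint call.

```python
import json
from urllib.parse import unquote_plus


def extract_record_refs(event):
    """Return (bucket, key) pairs for every S3 object in an S3 event notification."""
    refs = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # skip records that did not come from S3
        bucket = s3["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        refs.append((bucket, key))
    return refs


def make_handler(submit):
    """Build a Lambda-style handler around a hypothetical submit(bucket, key) callable."""
    def handler(event, context):
        refs = extract_record_refs(event)
        for bucket, key in refs:
            submit(bucket, key)  # queue this record for prediction
        return {"statusCode": 200, "body": json.dumps({"queued": len(refs)})}
    return handler


# Local usage with an in-memory list standing in for the real queue.
queued = []
handler = make_handler(lambda bucket, key: queued.append(f"s3://{bucket}/{key}"))

sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "ehr-raw"},
                "object": {"key": "admissions/2024/patient+001.json"},
            }
        }
    ]
}
response = handler(sample_event, None)
print(queued, response)
```

Injecting `submit` keeps the handler testable without AWS credentials and leaves the choice of downstream service open.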
Key Results
- Achieved AUC-ROC of 0.81, outperforming the best single-modal baseline by 12%
- Reduced inference latency by 30% through SageMaker endpoint optimization and model compilation
- Processed 100K+ patient records through automated Glue ETL pipelines
- Identified top predictive features: lab result velocity, medication count, and clinical note sentiment
- Delivered as an MSc thesis project in collaboration with a confidential clinical data partner
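The write-up does not define "lab result velocity". One plausible reading, sketched below as an assumption, is the rate of change of a lab value over time, estimated as the least-squares slope of the measurements within an admission. The function name, the units (value per hour), and the example readings are hypothetical.

```python
def lab_velocity(measurements):
    """Least-squares slope of lab value against time, in value units per hour.

    measurements: list of (hours_since_admission, value) pairs.
    Returns 0.0 when there are fewer than two distinct time points.
    """
    if len(measurements) < 2:
        return 0.0
    times = [t for t, _ in measurements]
    values = [v for _, v in measurements]
    t_mean = sum(times) / len(times)
    v_mean = sum(values) / len(values)
    denom = sum((t - t_mean) ** 2 for t in times)
    if denom == 0:
        return 0.0  # all readings share one timestamp, so no slope is defined
    numer = sum((t - t_mean) * (v - v_mean) for t, v in measurements)
    return numer / denom


# Illustrative readings: a lab value rising by 0.3 every 12 hours.
readings = [(0.0, 1.0), (12.0, 1.3), (24.0, 1.6)]
velocity = lab_velocity(readings)
print(round(velocity, 4))  # approximately 0.025 per hour
```

A slope over all readings is less sensitive to a single noisy measurement than a simple last-minus-first difference, which is one reason to prefer it for a feature of this kind.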
Tech Stack
XGBoost
BERT / ClinicalBERT
Amazon SageMaker
AWS Lambda
AWS Glue
Amazon Athena
Amazon Bedrock
Python
PyTorch
Scikit-learn