One-page case study
Anti-Money Laundering Detection — IBM HI-Medium
PySpark ML pipeline on IBM's synthetic AML transaction dataset (32M+ transactions). Engineered temporal features, handled severe class imbalance (0.23% laundering), and compared Logistic Regression against Random Forest, achieving 99.94% accuracy and 0.998 F1.
Proof Points
32M+ transactions
0.998 F1
0.23% laundering class
Challenges
- Severe class imbalance: only 0.23% of the 32M transactions are laundering
- Processing 32M rows required distributed computing: PySpark with 6 GB of executor memory
- Engineering temporal features (transaction duration, frequency) from raw timestamps
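Two of the challenges above can be sketched in plain Python. The case study does not say which imbalance technique or feature definitions were used, so everything here is an assumption for illustration: inverse-frequency class weighting stands in for the imbalance handling, "duration" is read as seconds since the same account's previous transaction, and "frequency" as a running per-account transaction count. The function names and the `(account_id, iso_timestamp)` input shape are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    With this scheme every class contributes the same total weight,
    so the rare laundering class is not drowned out by the majority.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def temporal_features(transactions):
    """Derive per-transaction temporal features from timestamps.

    transactions: iterable of (account_id, iso_timestamp) tuples (assumed shape).
    Returns one dict per transaction, in chronological order.
    """
    last_seen = {}                 # account -> datetime of its previous transaction
    seen_count = defaultdict(int)  # account -> transactions seen so far
    rows = []
    ordered = sorted(transactions, key=lambda t: datetime.fromisoformat(t[1]))
    for account, ts in ordered:
        when = datetime.fromisoformat(ts)
        prev = last_seen.get(account)
        # None marks an account's first transaction (no previous one to compare).
        gap = (when - prev).total_seconds() if prev is not None else None
        seen_count[account] += 1
        rows.append({
            "account": account,
            "timestamp": ts,
            "secs_since_prev": gap,
            "txn_index": seen_count[account],
        })
        last_seen[account] = when
    return rows


if __name__ == "__main__":
    # 23 positives in 10,000 rows mirrors the 0.23% laundering rate.
    labels = [0] * 9977 + [1] * 23
    print(inverse_frequency_weights(labels))
    print(temporal_features([
        ("A", "2022-09-01T00:10:00"),
        ("B", "2022-09-01T00:05:00"),
        ("A", "2022-09-01T00:00:00"),
    ]))
```

In a PySpark pipeline, weights like these could be attached as a column and passed to MLlib's `LogisticRegression` through its `weightCol` parameter, and the temporal features would more naturally be computed with window functions partitioned by account.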
Learnings
- PySpark MLlib pipelines make ML scalable on large financial datasets
- Cross-currency transaction patterns are strong laundering indicators
- AUC is misleading under severe imbalance; F1 and precision-recall are more reliable
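The last point can be made concrete with a small worked example. The confusion-matrix counts below are made up for illustration (they are not results from this project); they keep the 0.23% positive rate on a hypothetical 100,000-transaction sample. Because the false positive rate is divided by the huge negative class, it stays near 1% and a ROC curve looks excellent, while precision shows that most flagged transactions are false alarms.

```python
def imbalance_metrics(tp, fp, fn, tn):
    """Compute ROC-style and precision-recall-style metrics from confusion counts."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0           # false positive rate (ROC x-axis)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "fpr": fpr,
        "precision": precision,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical sample: 100,000 transactions, 230 laundering (0.23%).
    # Illustrative model: catches 184 of 230, raises 998 false alarms.
    flagged = imbalance_metrics(tp=184, fp=998, fn=46, tn=98772)
    for name, value in flagged.items():
        print(f"{name}: {value:.4f}")

    # Baseline that never flags anything: high accuracy, zero F1.
    never_flag = imbalance_metrics(tp=0, fp=0, fn=230, tn=99770)
    print(never_flag["accuracy"], never_flag["f1"])
```

With these counts the model reaches 80% recall at about a 1% false positive rate, yet precision is only about 16%, and the never-flag baseline scores 99.77% accuracy with an F1 of zero. That gap is why precision-recall metrics are the more honest view at this level of imbalance.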
Stack
Python, PySpark, Pandas, NumPy, Matplotlib