One-page case study

Anti-Money Laundering Detection — IBM HI-Medium

PySpark ML pipeline on IBM's 32M+ synthetic AML transaction dataset. Engineered temporal features, handled severe class imbalance (0.23% laundering), and compared Logistic Regression vs Random Forest — achieving 99.94% accuracy and 0.998 F1.

Proof Points

32M+ transactions

0.998 F1

0.23% laundering class

Challenges

• Severe class imbalance — only 0.23% of 32M transactions are laundering
• Processing 32M rows required distributed computing: PySpark with 6 GB of executor memory
• Engineering temporal features (transaction duration, frequency) from raw timestamps
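
The case study does not say how the imbalance was handled. One common option is inverse-frequency class weighting, sketched below as an assumption rather than the method actually used. The function name `balanced_class_weights` and the round 32,000,000 row count are illustrative only.

```python
def balanced_class_weights(n_total, n_positive):
    """Inverse-frequency weights: each class ends up with the same total weight."""
    n_negative = n_total - n_positive
    # weight_c = n_total / (n_classes * n_c), with n_classes = 2
    return {
        0: n_total / (2 * n_negative),  # majority class (normal transactions)
        1: n_total / (2 * n_positive),  # minority class (laundering)
    }


# Illustrative counts only: 0.23% of a round 32,000,000 rows.
n_total = 32_000_000
n_positive = round(n_total * 0.0023)
n_negative = n_total - n_positive

weights = balanced_class_weights(n_total, n_positive)
print(f"laundering rows: {n_positive:,}")
print(f"weight for class 0: {weights[0]:.4f}")
print(f"weight for class 1: {weights[1]:.2f}")
```

In PySpark MLlib, weights like these would typically be written to a per-row column and passed to an estimator such as `LogisticRegression` through its `weightCol` parameter.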

Learnings

• PySpark MLlib pipelines for scalable ML on large financial datasets
• Cross-currency transaction patterns are strong laundering indicators
• ROC AUC can be misleading under severe imbalance; F1 and precision-recall metrics are more reliable
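
The last point can be seen with a little arithmetic. The sketch below uses a hypothetical confusion matrix (not results from this project) at the same 0.23% positive rate. The false positive rate that ROC curves plot stays near 1%, while precision and F1 show that most alerts are false alarms.

```python
def imbalance_metrics(tp, fp, fn, tn):
    """Compute ROC-style rates and precision-recall metrics from confusion counts."""
    recall = tp / (tp + fn)      # true positive rate (ROC y-axis)
    fpr = fp / (fp + tn)         # false positive rate (ROC x-axis)
    precision = tp / (tp + fp)   # share of alerts that are real laundering
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "recall": recall,
        "fpr": fpr,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts: 100,000 transactions, 230 of them laundering (0.23%).
metrics = imbalance_metrics(tp=200, fp=1_000, fn=30, tn=98_770)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

Because negatives outnumber positives by more than 400 to 1, a thousand false alarms barely move the false positive rate or accuracy, yet they cut precision to about one alert in six.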

Stack

Python, PySpark, Pandas, NumPy, Matplotlib
