
One-page case study

Anti-Money Laundering Detection — IBM HI-Medium

A PySpark ML pipeline on IBM's synthetic AML transaction dataset (32M+ rows). It engineers temporal features, handles severe class imbalance (only 0.23% of transactions are laundering), and compares Logistic Regression with Random Forest, reaching 99.94% accuracy and 0.998 F1.
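The summary does not say which imbalance technique was used. One common option is class weighting, sketched below in plain Python as an assumption rather than the project's actual method. The helper name `balanced_class_weights` is hypothetical; the only input taken from the case study is the 0.23% laundering rate.

```python
from typing import Dict


def balanced_class_weights(positive_rate: float) -> Dict[int, float]:
    """Return per-class weights that give both classes equal total weight.

    Each class c gets weight 1 / (2 * share_of_c), the usual "balanced"
    heuristic. This is an illustrative assumption, not the project's code.
    """
    if not 0.0 < positive_rate < 1.0:
        raise ValueError("positive_rate must be strictly between 0 and 1")
    return {
        0: 1.0 / (2.0 * (1.0 - positive_rate)),  # majority (legitimate) class
        1: 1.0 / (2.0 * positive_rate),          # minority (laundering) class
    }


# 0.23% laundering rate, as reported in the case study.
weights = balanced_class_weights(0.0023)
ratio = weights[1] / weights[0]
print(f"legitimate={weights[0]:.4f} laundering={weights[1]:.1f} ratio={ratio:.0f}:1")
```

At this rate a laundering row would carry roughly 434 times the weight of a legitimate one. In PySpark MLlib, weights like these are typically written to a column and passed to an estimator such as `LogisticRegression` through its `weightCol` parameter.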

Proof Points

32M+ transactions
0.998 F1
0.23% laundering class

Challenges

  • Severe class imbalance: only 0.23% of the 32M+ transactions are laundering
  • Processing 32M+ rows required distributed computing with PySpark, configured with 6 GB of executor memory
  • Engineering temporal features (transaction duration, frequency) from raw timestamps
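To make the temporal-feature challenge concrete, here is a minimal plain-Python sketch of two timestamp-derived features per account: the gap since the account's previous transaction and the number of its transactions in a trailing window. The exact definitions of "duration" and "frequency" are not given in the case study, so the 24-hour window, the field names, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def temporal_features(transactions, window=timedelta(hours=24)):
    """Compute per-account temporal features from (account, timestamp) pairs.

    gap_seconds: seconds since the account's previous transaction
                 (None for the account's first transaction).
    count_in_window: transactions by the same account inside the trailing
                     window, including the current one.
    """
    recent = defaultdict(deque)  # account -> timestamps still inside the window
    last_seen = {}               # account -> timestamp of previous transaction
    features = []
    for account, ts in sorted(transactions, key=lambda t: t[1]):
        prev = last_seen.get(account)
        gap = (ts - prev).total_seconds() if prev is not None else None
        queue = recent[account]
        # Drop timestamps that have fallen out of the trailing window.
        while queue and ts - queue[0] > window:
            queue.popleft()
        queue.append(ts)
        last_seen[account] = ts
        features.append({
            "account": account,
            "timestamp": ts,
            "gap_seconds": gap,
            "count_in_window": len(queue),
        })
    return features


# Made-up sample rows, not taken from the dataset.
sample = [
    ("A", datetime(2022, 9, 1, 0, 5)),
    ("A", datetime(2022, 9, 1, 0, 35)),
    ("B", datetime(2022, 9, 1, 1, 0)),
    ("A", datetime(2022, 9, 2, 12, 0)),
]
rows = temporal_features(sample)
for row in rows:
    print(row["account"], row["gap_seconds"], row["count_in_window"])
```

At 32M rows the same logic would more plausibly be expressed with PySpark window functions partitioned by account and ordered by timestamp; the loop above only shows what such features compute.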

Learnings

  • PySpark MLlib pipelines for scalable ML on large financial datasets
  • Cross-currency transaction patterns are strong laundering indicators
  • ROC AUC can be misleading under severe imbalance; F1 and precision-recall metrics are more reliable
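The point about metrics under imbalance can be illustrated with simple arithmetic. ROC AUC is built from the true positive rate and the false positive rate, and the false positive rate is divided by the very large legitimate class, so it stays small even when most alerts are wrong. The confusion-matrix counts below are hypothetical, chosen only to match a 0.23% positive rate in 1,000,000 transactions; they are not results from this project.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive common binary-classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0     # false positive rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "false_positive_rate": fpr,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical classifier: 2,300 laundering rows out of 1,000,000 (0.23%).
m = confusion_metrics(tp=2_000, fp=10_000, fn=300, tn=987_700)
for name, value in m.items():
    print(f"{name:>20}: {value:.4f}")
```

In this made-up case accuracy is about 99% and the false positive rate about 1%, which is the kind of operating point that keeps a ROC curve looking strong, yet only one alert in six is a true laundering case and F1 is about 0.28. Precision and F1 expose the problem because their denominators involve only the flagged and positive rows.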

Stack

Python, PySpark, Pandas, NumPy, Matplotlib