
OpenMM OpenCL — Amber20 Cellulose & STMV on Apple M5 Max

Engine: openmm-reference. Large-system measurement of OpenMM’s OpenCL backend on the M5 Max for the two heavy AMBER 20 benchmark systems.

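| Test | Atoms | M5 Max OpenCL (ns/day) | A100 (ns/day)² | H100 1× (ns/day)² | H100 4× (ns/day)² | B200 (ns/day)² | M5 Max / A100 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cellulose PME | 408,609 | 55.89 | 131.85 | 216.98 | 342.21 | 276.41 | 42% |
| STMV PME | 1,067,095 | 18.14 | 46.05 | 70.16 | 147.09 | 116.91 | 39% |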

² openmm.org/benchmarks, OpenMM 8.4.
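The last column can be recomputed from the throughput columns. The following is a minimal sketch that uses only the ns/day values from the table above; the dictionary layout and the function name are illustrative, not part of the benchmark script.

```python
# ns/day values copied from the table above.
results = {
    "Cellulose PME": {"m5_max": 55.89, "a100": 131.85},
    "STMV PME": {"m5_max": 18.14, "a100": 46.05},
}


def ratio_percent(m5_max: float, a100: float) -> float:
    """Return M5 Max throughput as a percentage of A100 throughput."""
    return 100.0 * m5_max / a100


ratios = {name: ratio_percent(**row) for name, row in results.items()}

for name, pct in ratios.items():
    print(f"{name}: {pct:.1f}% of A100")
```

Rounded to whole percentages, these match the 42% and 39% reported in the table.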

The full M5 Max scaling curve, all benchmarks so far

Combining this report with openmm-opencl-dhfr.md and openmm-opencl-apoa1.md:

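| System | Atoms | PME? | M5 Max ns/day | M5 Max / A100 |
| --- | --- | --- | --- | --- |
| DHFR Implicit | ~2.5k | no | 1762.1 | 91% |
| DHFR Explicit-RF | 23.6k | no | 1018.4 | 70% |
| DHFR Explicit-PME | 23.6k | yes | 752.5 | 58% |
| ApoA1 PME | 92.2k | yes | 231.1 | 48% |
| Cellulose PME | 408.6k | yes | 55.9 | 42% |
| STMV PME | 1,067k | yes | 18.1 | 39% |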

On PME-heavy systems the M5 Max / A100 ratio keeps falling as system size grows, but the decline decelerates, which suggests an asymptote around 35–40%:

  • 23.6k → 92.2k (≈3.9×): 58% → 48%, a 10-point drop.
  • 92.2k → 408.6k (≈4.4×): 48% → 42%, a 6-point drop.
  • 408.6k → 1.07M (≈2.6×): 42% → 39%, a 3-point drop.
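Because the three size steps are not equal, one way to compare them is to normalize each drop by the number of doublings in atom count. The sketch below does this with the PME rows of the summary table; the "points per doubling" metric is an illustrative choice for this comparison, not something the benchmark script reports.

```python
import math

# (system, atoms in thousands, M5 Max / A100 in percent), PME rows only.
pme_points = [
    ("DHFR Explicit-PME", 23.6, 58),
    ("ApoA1 PME", 92.2, 48),
    ("Cellulose PME", 408.6, 42),
    ("STMV PME", 1067.0, 39),
]

steps = []
for (name_a, atoms_a, pct_a), (name_b, atoms_b, pct_b) in zip(pme_points, pme_points[1:]):
    growth = atoms_b / atoms_a                # factor by which the atom count grows
    drop = pct_a - pct_b                      # percentage points lost over the step
    per_doubling = drop / math.log2(growth)   # points lost per doubling of system size
    steps.append((name_a, name_b, growth, drop, per_doubling))
    print(f"{name_a} -> {name_b}: {growth:.1f}x atoms, "
          f"{drop}-point drop, {per_doubling:.1f} points per doubling")
```

The per-doubling drop shrinks at every step, which is consistent with a ratio that levels off. Three intervals are too few to pin down where the asymptote sits.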

The ratio is not crashing toward zero. The M5 Max does not appear to be memory-pressure-limited at STMV (about 1.07M atoms): unified memory holds the system without trouble, and the relative deficit against the A100 stabilizes rather than blowing up. This is the most informative result of this run: Apple’s unified memory does carry its weight at large scale, and the gap appears bounded by the per-step PME-FFT deficit rather than by data movement.
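To make the per-step framing concrete, throughput in ns/day can be converted to wall-clock time per integration step. The sketch below assumes the 4 fs timestep from the configuration table applies to both the M5 Max run and the openmm.org reference numbers; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400
TIMESTEP_FS = 4.0  # assumption: the 4 fs timestep from the configuration table


def ms_per_step(ns_per_day: float, timestep_fs: float = TIMESTEP_FS) -> float:
    """Convert simulation throughput (ns/day) to wall-clock milliseconds per step."""
    steps_per_day = ns_per_day * 1e6 / timestep_fs  # 1 ns = 1e6 fs
    return 1000.0 * SECONDS_PER_DAY / steps_per_day


# ns/day values from the results table: (M5 Max, A100).
throughput = {
    "Cellulose PME": (55.89, 131.85),
    "STMV PME": (18.14, 46.05),
}

step_times = {}
for name, (m5_max, a100) in throughput.items():
    step_times[name] = (ms_per_step(m5_max), ms_per_step(a100))
    print(f"{name}: M5 Max {step_times[name][0]:.2f} ms/step, "
          f"A100 {step_times[name][1]:.2f} ms/step")
```

Under that assumption, a Cellulose step takes roughly 6.2 ms on the M5 Max against roughly 2.6 ms on the A100, and an STMV step takes roughly 19 ms against roughly 7.5 ms.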

For reference, a single A100 reaches only 46 ns/day on STMV; the OpenMM table shows that it takes 3–4× H100 to make STMV throughput meaningful (130 to 147 ns/day). At 18 ns/day, the M5 Max sits in a regime where single-GPU NVIDIA hardware also struggles.

  • Engine: OpenMM 8.5.1.dev-f7fa0c2 (vendored at vendors/openmm/, run via the upstream stock benchmark script)
  • Platform: OpenCL
  • OpenCL platform name: Apple
  • Device: Apple M5 Max (DeviceIndex 0)
  • Host: AppCubics-MacBook-Pro.local, Darwin arm64
  • Date: 2026-05-15
  • Raw output: results/openmm-opencl-amber20-m5max.json (gitignored)

These tests require the AMBER 20 Benchmark Suite (not shipped with OpenMM):

| Field | Value |
| --- | --- |
| Source | https://ambermd.org/Amber20_Benchmark_Suite.tar.gz |
| Tarball size | ~75 MB |
| Extracted size | ~411 MB |
| Local path | results/inputs/Amber20_Benchmark_Suite/ (gitignored) |
| Fetched | 2026-05-15 |

results/inputs/README.md records what is in that directory and why. The benchmark script handles the download automatically on first run; subsequent runs reuse the local copy. See the reproducer below.

| Parameter | Cellulose | STMV |
| --- | --- | --- |
| Force field | AMBER (suite default) | AMBER (suite default) |
| Integrator | Langevin (NVT) | Langevin (NVT) |
| Timestep | 4 fs | 4 fs |
| Constraints | HBonds | HBonds |
| Hydrogen mass | 1.5 amu | 1.5 amu |
| PME cutoff | 0.9 nm | 0.9 nm |
| Precision | single | single |
| Target wall time | 30 s | 30 s |

Identical to the configuration used at openmm.org/benchmarks.

```sh
cd results/inputs
UV_CACHE_DIR=/tmp/mlx-atomistic-uv-cache uv run --project ../.. \
  python ../../vendors/openmm/examples/benchmarks/benchmark.py \
  --platform OpenCL \
  --test amber20-cellulose,amber20-stmv \
  --seconds 30 \
  --precision single \
  --outfile ../openmm-opencl-amber20-m5max.json
```

The script downloads Amber20_Benchmark_Suite.tar.gz into the current working directory on first run and extracts it alongside the tarball. Running from results/inputs/ keeps the download inside the gitignored results/ tree rather than polluting the vendored OpenMM source under vendors/.

  • --disable-pme-stream on STMV: would directly test whether the deficit comes from the PME-stream overlap path or from the FFT itself.
  • Cellulose with mixed precision: large-system mixed-precision benefits on Apple GPUs have not been measured publicly.
  • mlx-atomistic at the same scales: once feasible, an MLX run on a comparable system would sit alongside these results and show where the MLX runtime fits relative to OpenMM’s OpenCL ceiling on this hardware.
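The first two follow-ups differ from the reproducer only in their flags. A minimal sketch of how their command lines might be assembled is shown below. It reuses the script path and flags from the reproducer, adds the --disable-pme-stream flag named above, and assumes the script accepts `mixed` as a `--precision` value. The helper name and the run labels are hypothetical, the `uv run` wrapper and `--outfile` are left out for brevity, and nothing is executed.

```python
import shlex

# Script path as used in the reproducer, relative to results/inputs/.
BENCHMARK_SCRIPT = "../../vendors/openmm/examples/benchmarks/benchmark.py"


def benchmark_args(test, precision="single", disable_pme_stream=False, seconds=30):
    """Build the argument list for one benchmark run (hypothetical helper)."""
    args = [
        "python", BENCHMARK_SCRIPT,
        "--platform", "OpenCL",
        "--test", test,
        "--seconds", str(seconds),
        "--precision", precision,
    ]
    if disable_pme_stream:
        args.append("--disable-pme-stream")
    return args


# Run labels are illustrative only.
followups = {
    "stmv-no-pme-stream": benchmark_args("amber20-stmv", disable_pme_stream=True),
    "cellulose-mixed": benchmark_args("amber20-cellulose", precision="mixed"),
}

for label, args in followups.items():
    print(f"{label}: {shlex.join(args)}")
```

Keeping every other flag identical to the reproducer means any throughput difference can be attributed to the single changed option.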