OpenMM OpenCL — Amber20 Cellulose & STMV on Apple M5 Max
Engine: openmm-reference. Large-system measurement of OpenMM’s OpenCL
backend on the M5 Max for the two heavy AMBER 20 benchmark systems.
Result
Section titled “Result”| Test | Atoms | M5 Max OpenCL (ns/day) | A100 (ns/day)² | H100 1× (ns/day)² | H100 4× (ns/day)² | B200 (ns/day)² | M5 Max / A100 |
|---|---|---|---|---|---|---|---|
| Cellulose PME | 408,609 | 55.89 | 131.85 | 216.98 | 342.21 | 276.41 | 42% |
| STMV PME | 1,067,095 | 18.14 | 46.05 | 70.16 | 147.09 | 116.91 | 39% |
² openmm.org/benchmarks, OpenMM 8.4.
The full M5 Max scaling curve, all benchmarks so far
Section titled “The full M5 Max scaling curve, all benchmarks so far”Combining this report with openmm-opencl-dhfr.md
and openmm-opencl-apoa1.md:
| System | Atoms | PME? | M5 Max ns/day | M5 Max / A100 |
|---|---|---|---|---|
| DHFR Implicit | ~2.5k | ❌ | 1762.1 | 91% |
| DHFR Explicit-RF | 23.6k | ❌ | 1018.4 | 70% |
| DHFR Explicit-PME | 23.6k | ✅ | 752.5 | 58% |
| ApoA1 PME | 92.2k | ✅ | 231.1 | 48% |
| Cellulose PME | 408.6k | ✅ | 55.9 | 42% |
| STMV PME | 1,067k | ✅ | 18.1 | 39% |
What this tells us
Section titled “What this tells us”The M5 Max / A100 ratio on PME-heavy systems decelerates as system size grows, approaching an asymptote near ~35–40%:
- 23.6k → 92.2k (≈4×): 58% → 48%, a 10-point drop.
- 92.2k → 408.6k (≈4×): 48% → 42%, a 6-point drop.
- 408.6k → 1.07M (≈2.6×): 42% → 39%, a 3-point drop.
The ratio is not crashing toward zero. M5 Max is not memory-pressure-limited at STMV (1M atoms): unified memory holds the system fine, and the relative deficit vs A100 stabilizes rather than blowing up. This is the most informative thing this run produced — Apple’s unified memory does carry its weight at large scale; the gap stays bound by the per-step PME-FFT deficit, not by data movement.
For reference, even a single A100 only reaches 46 ns/day on STMV; the OpenMM table shows you need 3–4× H100 to make STMV throughput meaningful (130 to 147 ns/day). M5 Max at 18 ns/day is in the regime where single-GPU NVIDIA is also a struggle.
Provenance
Section titled “Provenance”- Engine: OpenMM 8.5.1.dev-f7fa0c2 (vendored at
vendors/openmm/, run via the upstream stock benchmark script) - Platform: OpenCL
- OpenCL platform name: Apple
- Device: Apple M5 Max (DeviceIndex 0)
- Host:
AppCubics-MacBook-Pro.local, Darwin arm64 - Date: 2026-05-15
- Raw output:
results/openmm-opencl-amber20-m5max.json(gitignored)
External inputs
Section titled “External inputs”These tests require the AMBER 20 Benchmark Suite (not shipped with OpenMM):
| Field | Value |
|---|---|
| Source | https://ambermd.org/Amber20_Benchmark_Suite.tar.gz |
| Tarball size | ~75 MB |
| Extracted size | ~411 MB |
| Local path | results/inputs/Amber20_Benchmark_Suite/ (gitignored) |
| Fetched | 2026-05-15 |
results/inputs/README.md records what is in that directory and why. The
benchmark script handles the download automatically on first run; subsequent
runs reuse the local copy. See the reproducer below.
Config
Section titled “Config”| Parameter | Cellulose | STMV |
|---|---|---|
| Force field | AMBER (suite default) | AMBER (suite default) |
| Integrator | Langevin (NVT) | Langevin (NVT) |
| Timestep | 4 fs | 4 fs |
| Constraints | HBonds | HBonds |
| Hydrogen mass | 1.5 amu | 1.5 amu |
| PME cutoff | 0.9 nm | 0.9 nm |
| Precision | single | single |
| Target wall time | 30 s | 30 s |
Identical to the configuration used at openmm.org/benchmarks.
Reproducer
Section titled “Reproducer”cd results/inputsUV_CACHE_DIR=/tmp/mlx-atomistic-uv-cache uv run --project ../.. \ python ../../vendors/openmm/examples/benchmarks/benchmark.py \ --platform OpenCL \ --test amber20-cellulose,amber20-stmv \ --seconds 30 \ --precision single \ --outfile ../openmm-opencl-amber20-m5max.jsonThe script downloads Amber20_Benchmark_Suite.tar.gz into the current
working directory on first run and extracts it next to itself. By running
from results/inputs/, the download stays inside the gitignored
results/ tree rather than polluting the vendored OpenMM source under
vendors/.
Open questions for follow-up benchmarks
Section titled “Open questions for follow-up benchmarks”--disable-pme-streamon STMV: would directly test whether the PME-stream overlap path is the deficit, or whether the FFT itself is.- Cellulose with mixed precision: large-system mixed-precision benefits on Apple GPU are unmeasured publicly.
- mlx-atomistic at the same scales — once feasible, an MLX run on a comparable system would land alongside these and tell us where the MLX runtime fits relative to OpenMM’s OpenCL ceiling on this hardware.