OpenMM OpenCL — DHFR on Apple M5 Max

Engine: openmm-reference. Reference-ceiling measurement of OpenMM’s OpenCL backend on the M5 Max for the canonical DHFR hello-world system (23k atoms).

| Test | M5 Max OpenCL (ns/day) | A100 (ns/day)² | H100 (ns/day)² | B200 (ns/day)² | M5 Max / A100 |
|---|---|---|---|---|---|
| DHFR Implicit (GBSA) | 1762.1 | 1942.2 | 2498.6 | 2268.4 | 91% |
| DHFR Explicit-RF | 1018.4 | 1460.6 | 1873.2 | 1802.3 | 70% |
| DHFR Explicit-PME | 752.5 | 1286.6 | 1704.5 | 1658.7 | 58% |

² openmm.org/benchmarks, OpenMM 8.4.

| System | Atoms | M5 Max / A100 |
|---|---|---|
| DHFR Implicit (no PME, no water) | 2.5k | 91% |
| DHFR Explicit-RF (water, no PME) | 23.6k | 70% |
| DHFR Explicit-PME | 23.6k | 58% |
| ApoA1 PME (report) | 92.2k | 48% |

The gap to the A100 widens with system size and PME usage, which is the opposite of what a kernel-launch-overhead bottleneck would produce. The M5 Max comes within about 10% of an A100 on DHFR Implicit and falls to about half its throughput on ApoA1 PME.
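The ratio column can be re-derived from the ns/day figures in the DHFR table. A minimal sketch, with the numbers copied from that table; the dictionary and function names are chosen here for illustration only:

```python
# ns/day figures copied from the DHFR table above (single precision).
m5_max = {"gbsa": 1762.1, "rf": 1018.4, "pme": 752.5}
a100 = {"gbsa": 1942.2, "rf": 1460.6, "pme": 1286.6}


def relative_throughput(device, reference):
    """Return device/reference throughput per test as a whole-number percentage."""
    return {test: round(100 * device[test] / reference[test]) for test in device}


ratios = relative_throughput(m5_max, a100)
for test, pct in ratios.items():
    print(f"{test}: M5 Max reaches {pct}% of A100")
```

The ApoA1 row is left out because this page does not list its raw ns/day values.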

What this tells us about the M5 Max bottleneck

The expected story for an Apple GPU vs NVIDIA on a small system is that launch overhead, threadgroup-memory latency, and lack of warp-level primitives would hurt the small-system regime more than the large one. The data here points the other direction:

  • Small + no PME (DHFR GBSA): M5 Max ≈ A100. Pair-list and bonded force kernels on Apple GPU are competitive when they are the only thing running.
  • Large + PME (ApoA1 PME): M5 Max ≈ 0.5 × A100. The deficit shows up in proportion to PME workload.

The real M5 Max bottleneck in OpenMM today is the OpenCL FFT path used by PME, not pair-list dispatch. This is consistent with philipturner’s finding in openmm/openmm#3924 that the largest unrealized speedups on Apple GPUs come from rewriting findBlocksWithInteractions and prefix-sum primitives in native Metal — those gains are on the non-PME side. A native Metal FFT (separate from the Metal kernel work) is likely needed to close the PME gap.
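One rough way to see the PME cost is to convert ns/day into wall-clock time per step and compare the RF and PME runs on each device. The sketch below assumes the 4 fs timestep from the benchmark configuration and uses the table's DHFR numbers. The RF and PME tests use different cutoffs (1.0 nm and 0.9 nm), so the difference is an indicator of the PME increment rather than a clean isolation of it. Function and variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
TIMESTEP_FS = 4  # from the benchmark configuration


def ms_per_step(ns_per_day, timestep_fs=TIMESTEP_FS):
    """Convert throughput in ns/day to wall-clock milliseconds per MD step."""
    steps_per_day = ns_per_day * 1e6 / timestep_fs  # 1 ns = 1e6 fs
    return SECONDS_PER_DAY * 1e3 / steps_per_day


# ns/day from the DHFR table (explicit-solvent tests only).
throughput = {
    "M5 Max": {"rf": 1018.4, "pme": 752.5},
    "A100": {"rf": 1460.6, "pme": 1286.6},
}

pme_extra_ms = {}
for device, t in throughput.items():
    # Extra wall time per step when the same system runs PME instead of RF.
    pme_extra_ms[device] = ms_per_step(t["pme"]) - ms_per_step(t["rf"])
    print(f"{device}: +{pme_extra_ms[device]:.3f} ms/step for PME over RF")

penalty_ratio = pme_extra_ms["M5 Max"] / pme_extra_ms["A100"]
print(f"M5 Max pays {penalty_ratio:.1f}x the A100's PME increment")
```

On these numbers the M5 Max adds several times more wall time per step than the A100 when PME is switched on, which fits the reading that the PME path, rather than the pair-list kernels, accounts for most of the gap.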

M5 Max DHFR Implicit (1762 ns/day) lands within ~10% of NVIDIA DGX Spark (1942 ns/day on the same test). DGX Spark is NVIDIA’s personal-workstation class part, so this is a near-peer comparison in the segment Apple actually competes in.

  • Engine: OpenMM 8.5.1.dev-f7fa0c2 (vendored at vendors/openmm/, run from the upstream stock benchmark script)
  • Platform: OpenCL
  • OpenCL platform name: Apple
  • Device: Apple M5 Max (DeviceIndex 0)
  • Host: AppCubics-MacBook-Pro.local, Darwin arm64
  • Date: 2026-05-15
  • Raw output: results/openmm-opencl-dhfr-m5max.json (gitignored)

All three tests share the OpenMM public-benchmark config exactly:

| Parameter | Value |
|---|---|
| Force field | AMBER99SB |
| Integrator | Langevin (NVT) |
| Timestep | 4 fs |
| Constraints | HBonds |
| Hydrogen mass | 1.5 amu |
| Cutoff | 0.9 nm (PME) / 1.0 nm (RF) / 2.0 nm (implicit) |
| Precision | single |
| Target wall time | 30 s per test |

This is identical to the configuration used at openmm.org/benchmarks, so the M5 Max column is directly comparable to the NVIDIA columns in that table.

cd vendors/openmm/examples/benchmarks
UV_CACHE_DIR=/tmp/mlx-atomistic-uv-cache uv run --project ../../../.. \
  python benchmark.py \
  --platform OpenCL \
  --test gbsa,rf,pme \
  --seconds 30 \
  --precision single \
  --outfile ../../../../results/openmm-opencl-dhfr-m5max.json

PDB inputs (5dfr_minimized.pdb, 5dfr_solv-cube_equil.pdb) ship with the OpenMM source in this directory — no downloads required.

  • STMV (1M atoms): the opposite extreme in system size. Would test whether Apple’s unified-memory advantage shows up at large scale, or whether the PME-FFT deficit dominates further.
  • Mixed precision: --precision mixed on DHFR PME. OpenMM’s mixed mode computes forces in single precision and integrates in double precision, so this checks whether that path runs on the Apple GPU and what it costs.
  • AMOEBA polarizable on DHFR: --test amoebagk,amoebapme exercises a different code path (induced dipoles). The M5 Max’s relative position there has not been measured publicly.
  • PME isolation: toggle --disable-pme-stream to measure how much of the M5 Max PME deficit comes from the stream-overlap path versus the FFT itself.
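The follow-up runs listed above differ from the baseline command only in a few flags, so they can be generated from one helper. This is a sketch rather than part of the benchmark tooling: `benchmark_argv` and `follow_ups` are hypothetical names, and only flags that already appear on this page are used. STMV is omitted because no test name for it is given here.

```python
import shlex


def benchmark_argv(tests, precision="single", seconds=30, outfile=None, extra=()):
    """Assemble an argv list for benchmark.py from the options used in this report."""
    argv = [
        "python", "benchmark.py",
        "--platform", "OpenCL",
        "--test", ",".join(tests),
        "--seconds", str(seconds),
        "--precision", precision,
    ]
    if outfile is not None:
        argv += ["--outfile", outfile]
    argv += list(extra)  # pass-through for flags such as --disable-pme-stream
    return argv


follow_ups = {
    "mixed precision": benchmark_argv(["pme"], precision="mixed"),
    "AMOEBA": benchmark_argv(["amoebagk", "amoebapme"]),
    "PME without stream overlap": benchmark_argv(
        ["pme"], extra=["--disable-pme-stream"]
    ),
}

for label, argv in follow_ups.items():
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(f"{label}: {shlex.join(argv)}")
```

Each list can be handed to `subprocess.run` with the working directory set to vendors/openmm/examples/benchmarks, or printed and pasted after the `uv run` prefix used for the baseline run.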