Benchmark Ladder
Date: 2026-05-23
This ladder organizes mlx_atomistic benchmark rows by decision value. It is
not a ranking across engines. Reference rows are used only where operation
semantics and metric families line up.
Layer Rules
| Layer | Purpose | Comparable rule |
|---|---|---|
| micro/kernel | Isolate local MLX costs such as force terms, neighbor work, virtual-site reconstruction, and synchronization. | Diagnostic unless a reference engine exposes the same operation and metric. |
| controlled MD | Run tiny same-workload MD rows with matching atom and step semantics. | Comparable only when both sides are ok and use the same metric family. |
| feature physics | Time MLX feature rows for GBSA/OBC, TIP4P-Ew, soft-core/lambda, and replica exchange. | Comparable only after a matching OpenMM controlled case exists; otherwise diagnostic. |
| scaling | Sweep sizes or policies to tell fixed overhead from force-evaluation, neighbor-list, or memory-pressure costs. | Usually MLX-only diagnostic; reference parity is separate. |
| reference parity | Map controlled OpenMM rows to MLX rows and allow ratios only for matching metrics. | Comparable, diagnostic, or blocked from normalized payload status. |
| stretch | Track real-system rows such as DHFR until MLX artifacts and runtime parity exist. | Blocked or deferred until the MLX system, physics, and metric match the reference. |
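
The comparable rules in this table amount to a small gate applied to a pair of normalized payloads. The sketch below is illustrative only: the `ok`, `diagnostic`, and `blocked` status values follow this ladder's vocabulary, but the `metric_family` and `value` keys and both function names are assumptions, not the harness's actual schema.

```python
from typing import Optional


def classify_pair(mlx: dict, ref: dict) -> str:
    """Classify an MLX/reference row pair as comparable, diagnostic, or blocked.

    Assumes (hypothetically) that each payload carries a 'status' and a
    'metric_family' key.
    """
    statuses = {mlx.get("status"), ref.get("status")}
    if "blocked" in statuses:
        return "blocked"
    if statuses != {"ok"}:
        # At least one side is diagnostic or has no recognized status.
        return "diagnostic"
    family = mlx.get("metric_family")
    if family is None or family != ref.get("metric_family"):
        # Both rows ran, but the units or operation semantics differ.
        return "diagnostic"
    return "comparable"


def ratio_if_comparable(mlx: dict, ref: dict) -> Optional[float]:
    """Return the MLX/reference ratio only for comparable pairs, else None."""
    if classify_pair(mlx, ref) != "comparable":
        return None
    if not ref.get("value"):
        # A missing or zero reference value cannot anchor a ratio.
        return None
    return mlx["value"] / ref["value"]
```

Returning `None` instead of a number keeps a mixed-unit or blocked pair from producing a ratio at all, which is the behavior the rules above call for.
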
Must-Ship Rows
| Row | Layer | MLX command | OpenMM command | LAMMPS mapping | Metric family | Raw output | Comparability status | Decision value |
|---|---|---|---|---|---|---|---|---|
| `lj-synthetic-loop` | controlled MD; reference parity | `uv run python -m mlx_atomistic.benchmarks.md_performance --sizes 32 --steps 1 --sample-interval 1 --diagnostic-interval 1 --evaluation-interval 1 --json` | `uv run python scripts/benchmark_openmm_opencl.py --platform OpenCL --particles 32 --steps 1 --warmup-steps 0 --spacing-nm 1.0 --json` | Simple LJ-like smoke mapping may use `uv run python scripts/benchmark_lammps_opencl.py --particles 32 --steps 1 --json`; first-class LAMMPS parity is deferred. | steps/s; ns/day only when timestep and step semantics are aligned | `results/same-workload-openmm-comparison/mlx-lj-synthetic-loop.json`; `results/same-workload-openmm-comparison/openmm-lj-synthetic-loop.json`; optional `results/performance-audit-harness-hardening/lammps-fast.json` | comparable only for MLX/OpenMM rows with matching atom count, step count, and timing metric; LAMMPS is diagnostic/deferred | Decide whether tiny full-loop overhead deserves optimization before larger-system work. |
| `gbsa-obc-small` | feature physics; reference parity | `uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json` | `uv run python scripts/benchmark_openmm_opencl.py --case gbsa-obc-small --platform Reference --particles 4 --steps 1 --json` | deferred; no LAMMPS mapping in this slice | ms/eval latency; no mixed unit ratio | MLX combined raw source: `results/same-workload-openmm-comparison/mlx-phase3-controlled.json`; MLX row extract: `results/same-workload-openmm-comparison/mlx-gbsa-obc-small.json`; OpenMM: `results/same-workload-openmm-comparison/openmm-gbsa-obc-small.json`; summary: `results/same-workload-openmm-comparison/summary.json`; audit source: `results/performance-audit-harness-hardening/phase3-physics-fast.json` | comparable for the refreshed MLX/OpenMM Reference latency row; OpenMM reports `fixture: gbsa_obc_small`, `status: ok`, and `obc_force_setup` | Decide whether GBSA/OBC force evaluation is an optimization target from the controlled latency row. |
| `tip4p-ew-water` | micro/kernel; feature physics; reference parity | `uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json`; micro row: `uv run python -m mlx_atomistic.benchmarks.mm_force_terms --evaluations 1 --particles 16 --json` | `uv run python scripts/benchmark_openmm_opencl.py --case tip4p-ew-water --platform Reference --particles 4 --steps 1 --json` | deferred; no LAMMPS mapping in this slice | ms/eval latency for virtual-site reconstruction; no mixed unit ratio | MLX combined raw source: `results/same-workload-openmm-comparison/mlx-phase3-controlled.json`; MLX row extract: `results/same-workload-openmm-comparison/mlx-tip4p-ew-water.json`; OpenMM: `results/same-workload-openmm-comparison/openmm-tip4p-ew-water.json`; summary: `results/same-workload-openmm-comparison/summary.json`; audit sources: `results/performance-audit-harness-hardening/phase3-physics-fast.json` and `results/performance-audit-harness-hardening/mm-force-terms-fast.json` | comparable for the refreshed MLX/OpenMM Reference virtual-site reconstruction latency row; OpenMM reports `operation_semantics: virtual_site_reconstruction` and `openmm_operation: Context.computeVirtualSites` | Decide whether TIP4P-Ew overhead is reconstruction-specific or part of a broader water-workload cost. |
| `soft-core-lambda` | feature physics | `uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json` | deferred mapping | deferred | ms/eval for lambda-grid energy/force derivative work | `results/performance-audit-harness-hardening/phase3-physics-fast.json`; future same-workload path under `results/same-workload-openmm-comparison/` when mapped | diagnostic; reference parity deferred | Decide whether lambda derivative work needs a larger opt-in sweep before optimization. |
| `replica-exchange` | feature physics | `uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json` | deferred mapping | deferred | ms/eval, per-replica throughput, aggregate replica throughput, swap/history counters | `results/performance-audit-harness-hardening/phase3-physics-fast.json` | diagnostic; reference parity deferred | Decide whether replica execution, history materialization, or swap bookkeeping is the next MLX bottleneck. |
| `scaling-sweep` | scaling | `uv run python -m mlx_atomistic.benchmarks.md_acceleration --include-large --evaluations 10 --json`; `uv run python -m mlx_atomistic.benchmarks.md_performance --include-large --steps 100 --json` | deferred until a controlled OpenMM sweep is specified | deferred except simple LJ-like smoke notes | ms/eval, neighbor-build timing, steps/s; keep units separate by row | `results/mlx-md-acceleration.json`; `results/mlx-md-performance.json`; audit smoke sources under `results/performance-audit-harness-hardening/` | diagnostic for MLX scaling; reference parity deferred | Decide whether overhead is fixed, neighbor-list dominated, memory-pressure dominated, or force-evaluation dominated. |
| `dhfr-implicit` | stretch; reference parity | `uv run python -m mlx_atomistic.benchmarks.dhfr --case dhfr-implicit --steps 1 --json` | `uv run python scripts/benchmark_openmm_dhfr.py --case dhfr-implicit --platform Reference --steps 1 --json` | deferred | ns/day for matching one-step GBSA/OBC rows | MLX: `results/same-workload-openmm-comparison/mlx-dhfr-implicit.json`; OpenMM Reference: `results/same-workload-openmm-comparison/openmm-dhfr-implicit.json`; summary: `results/same-workload-openmm-comparison/summary.json`; OpenMM OpenCL context: `results/openmm-opencl-dhfr-m5max.json` | comparable for the one-step MLX/OpenMM Reference smoke row; broader OpenCL context remains separate | Track the smaller DHFR real-system path and harden MLX artifact/runtime behavior before broad performance claims. |
| `dhfr-explicit-pme` | stretch; reference parity | `uv run python -m mlx_atomistic.benchmarks.dhfr --case dhfr-explicit-pme --steps 1 --json` | `uv run python scripts/benchmark_openmm_dhfr.py --case dhfr-explicit-pme --platform Reference --steps 1 --json` | deferred | ns/day only after matching runnable MLX and OpenMM rows exist | MLX: `results/same-workload-openmm-comparison/mlx-dhfr-explicit-pme.json`; OpenMM Reference: `results/same-workload-openmm-comparison/openmm-dhfr-explicit-pme.json`; OpenMM OpenCL context: `results/openmm-opencl-dhfr-m5max.json` | blocked; PME artifact policy requires neutrality and local Amber20/JAC has `net_charge=-11`; no ratio | Track the production-like DHFR PME path and resolve charged-system/neutralization policy before PME runtime optimization. |
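
Several rows restrict ns/day to cases where timestep and step semantics are aligned. The reason is arithmetic: ns/day is derived from steps/s and the timestep, so two rows with identical steps/s but different timesteps report different ns/day. The sketch below shows the conversion; the `ns_per_day` helper and its parameter names are hypothetical and not part of the benchmark harness.

```python
SECONDS_PER_DAY = 86_400.0
FS_PER_NS = 1_000_000.0


def ns_per_day(steps_per_second: float, timestep_fs: float) -> float:
    """Convert step throughput to simulated nanoseconds per wall-clock day.

    Each step advances the simulation by timestep_fs femtoseconds, so the
    result is only meaningful when both rows count steps the same way.
    """
    simulated_fs_per_day = steps_per_second * timestep_fs * SECONDS_PER_DAY
    return simulated_fs_per_day / FS_PER_NS
```

For example, 100 steps/s corresponds to about 17.28 ns/day at a 2 fs timestep but about 34.56 ns/day at 4 fs, which is why steps/s is the safer metric when timesteps differ.
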
Reference Coverage Notes
- OpenMM parity rows belong under `scripts/` and emit normalized `ok`, `diagnostic`, or `blocked` payloads. Ratios are valid only for matching `ok` rows.
- LAMMPS is deferred for this ladder except for simple LJ-like smoke notes. Do not map LAMMPS materials, protein, or package-specific benchmark families to MLX rows without a separate same-workload plan.
- DHFR now has explicit same-workload row IDs. `dhfr-implicit` is runnable as a one-step GBSA/OBC smoke row; `dhfr-explicit-pme` remains blocked on PME artifact neutrality. ApoA1, Cellulose, and STMV OpenMM reports remain reference context until matching MLX workloads exist.
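
The neutrality block on the explicit-PME DHFR row can be pictured as a pre-flight check on the system's partial charges. This is a minimal sketch under stated assumptions: the function name `pme_neutrality_status`, the tolerance, and the returned payload shape are all hypothetical, and the actual PME artifact policy may apply different rules.

```python
import math
from typing import Iterable


def pme_neutrality_status(charges: Iterable[float], tolerance: float = 1e-6) -> dict:
    """Return a blocked payload when the summed partial charges are not neutral.

    Hypothetical helper: the tolerance and payload keys are illustrative.
    """
    # fsum limits floating-point drift when summing many small partial charges.
    net_charge = math.fsum(charges)
    if abs(net_charge) > tolerance:
        return {
            "status": "blocked",
            "reason": "net_charge_nonzero",
            "net_charge": net_charge,
        }
    return {"status": "ok", "net_charge": net_charge}
```

Under this sketch, a system with a net charge of -11 yields a `blocked` payload, so no ratio would be computed for it until a neutralization policy is chosen.
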