Benchmark Ladder

Date: 2026-05-23

This ladder organizes mlx_atomistic benchmark rows by decision value. It is not a ranking across engines. Reference rows are used only where operation semantics and metric families line up.

Layer Rules

Layer	Purpose	Comparable rule
micro/kernel	Isolate local MLX costs such as force terms, neighbor work, virtual-site reconstruction, and synchronization.	Diagnostic unless a reference engine exposes the same operation and metric.
controlled MD	Run tiny same-workload MD rows with matching atom and step semantics.	Comparable only when both sides are `ok` and use the same metric family.
feature physics	Time MLX feature rows for GBSA/OBC, TIP4P-Ew, soft-core/lambda, and replica exchange.	Comparable only after a matching OpenMM controlled case exists; otherwise diagnostic.
scaling	Sweep sizes or policies to tell fixed overhead from force-evaluation, neighbor-list, or memory-pressure costs.	Usually MLX-only diagnostic; reference parity is separate.
reference parity	Map controlled OpenMM rows to MLX rows and allow ratios only for matching metrics.	Status is `comparable`, `diagnostic`, or `blocked`, derived from the normalized payload status.
stretch	Track real-system rows such as DHFR until MLX artifacts and runtime parity exist.	Blocked or deferred until the MLX system, physics, and metric match the reference.
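
The comparable rules above can be sketched as a small guard that refuses to compute a ratio unless both rows qualify. This is a minimal illustration, not the project's harness: `status` appears in the OpenMM payloads described in this document, while the `metric_family` and `value` keys and both function names are assumptions made for the example.

```python
def comparability(mlx_row, ref_row):
    """Classify a row pair as 'comparable', 'diagnostic', or 'blocked'.

    Assumes each row is a dict with hypothetical keys 'status' and
    'metric_family'; the real payload schema may differ.
    """
    statuses = (mlx_row["status"], ref_row["status"])
    if "blocked" in statuses:
        return "blocked"
    same_metric = mlx_row["metric_family"] == ref_row["metric_family"]
    if statuses == ("ok", "ok") and same_metric:
        return "comparable"
    return "diagnostic"


def ratio_or_none(mlx_row, ref_row):
    """Return the MLX/reference ratio only for comparable rows, else None."""
    if comparability(mlx_row, ref_row) != "comparable":
        return None
    if ref_row["value"] == 0:
        return None  # avoid dividing by a zero reference measurement
    return mlx_row["value"] / ref_row["value"]
```

The sketch checks only status and metric family. A controlled MD row would also need matching atom and step semantics before a ratio is reported.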

Must-Ship Rows

Row	Layer	MLX command	OpenMM command	LAMMPS mapping	Metric family	Raw output	Comparability status	Decision value
`lj-synthetic-loop`	controlled MD; reference parity	`uv run python -m mlx_atomistic.benchmarks.md_performance --sizes 32 --steps 1 --sample-interval 1 --diagnostic-interval 1 --evaluation-interval 1 --json`	`uv run python scripts/benchmark_openmm_opencl.py --platform OpenCL --particles 32 --steps 1 --warmup-steps 0 --spacing-nm 1.0 --json`	Simple LJ-like smoke mapping may use `uv run python scripts/benchmark_lammps_opencl.py --particles 32 --steps 1 --json`; first-class LAMMPS parity is deferred.	`steps/s`; `ns/day` only when timestep and step semantics are aligned	`results/same-workload-openmm-comparison/mlx-lj-synthetic-loop.json`; `results/same-workload-openmm-comparison/openmm-lj-synthetic-loop.json`; optional `results/performance-audit-harness-hardening/lammps-fast.json`	`comparable` only for MLX/OpenMM rows with matching atom count, step count, and timing metric; LAMMPS is diagnostic/deferred	Decide whether tiny full-loop overhead deserves optimization before larger-system work.
`gbsa-obc-small`	feature physics; reference parity	`uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json`	`uv run python scripts/benchmark_openmm_opencl.py --case gbsa-obc-small --platform Reference --particles 4 --steps 1 --json`	deferred; no LAMMPS mapping in this slice	`ms/eval` latency; no mixed unit ratio	MLX combined raw source: `results/same-workload-openmm-comparison/mlx-phase3-controlled.json`; MLX row extract: `results/same-workload-openmm-comparison/mlx-gbsa-obc-small.json`; OpenMM: `results/same-workload-openmm-comparison/openmm-gbsa-obc-small.json`; summary: `results/same-workload-openmm-comparison/summary.json`; audit source `results/performance-audit-harness-hardening/phase3-physics-fast.json`	`comparable` for the refreshed MLX/OpenMM Reference latency row; OpenMM reports `fixture: gbsa_obc_small`, `status: ok`, and `obc_force_setup`	Decide whether GBSA/OBC force evaluation is an optimization target from the controlled latency row.
`tip4p-ew-water`	micro/kernel; feature physics; reference parity	`uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json`; micro row: `uv run python -m mlx_atomistic.benchmarks.mm_force_terms --evaluations 1 --particles 16 --json`	`uv run python scripts/benchmark_openmm_opencl.py --case tip4p-ew-water --platform Reference --particles 4 --steps 1 --json`	deferred; no LAMMPS mapping in this slice	`ms/eval` latency for virtual-site reconstruction; no mixed unit ratio	MLX combined raw source: `results/same-workload-openmm-comparison/mlx-phase3-controlled.json`; MLX row extract: `results/same-workload-openmm-comparison/mlx-tip4p-ew-water.json`; OpenMM: `results/same-workload-openmm-comparison/openmm-tip4p-ew-water.json`; summary: `results/same-workload-openmm-comparison/summary.json`; audit sources `results/performance-audit-harness-hardening/phase3-physics-fast.json` and `results/performance-audit-harness-hardening/mm-force-terms-fast.json`	`comparable` for the refreshed MLX/OpenMM Reference virtual-site reconstruction latency row; OpenMM reports `operation_semantics: virtual_site_reconstruction` and `openmm_operation: Context.computeVirtualSites`	Decide whether TIP4P-Ew overhead is reconstruction-specific or part of a broader water-workload cost.
`soft-core-lambda`	feature physics	`uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json`	deferred mapping	deferred	`ms/eval` for lambda-grid energy/force derivative work	`results/performance-audit-harness-hardening/phase3-physics-fast.json`; future same-workload path under `results/same-workload-openmm-comparison/` when mapped	`diagnostic`; reference parity deferred	Decide whether lambda derivative work needs a larger opt-in sweep before optimization.
`replica-exchange`	feature physics	`uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json`	deferred mapping	deferred	`ms/eval`, per-replica throughput, aggregate replica throughput, swap/history counters	`results/performance-audit-harness-hardening/phase3-physics-fast.json`	`diagnostic`; reference parity deferred	Decide whether replica execution, history materialization, or swap bookkeeping is the next MLX bottleneck.
`scaling-sweep`	scaling	`uv run python -m mlx_atomistic.benchmarks.md_acceleration --include-large --evaluations 10 --json`; `uv run python -m mlx_atomistic.benchmarks.md_performance --include-large --steps 100 --json`	deferred until a controlled OpenMM sweep is specified	deferred except simple LJ-like smoke notes	`ms/eval`, neighbor-build timing, steps/s; keep units separate by row	`results/mlx-md-acceleration.json`; `results/mlx-md-performance.json`; audit smoke sources under `results/performance-audit-harness-hardening/`	`diagnostic` for MLX scaling; reference parity deferred	Decide whether overhead is fixed, neighbor-list dominated, memory-pressure dominated, or force-evaluation dominated.
`dhfr-implicit`	stretch; reference parity	`uv run python -m mlx_atomistic.benchmarks.dhfr --case dhfr-implicit --steps 1 --json`	`uv run python scripts/benchmark_openmm_dhfr.py --case dhfr-implicit --platform Reference --steps 1 --json`	deferred	`ns/day` for matching one-step GBSA/OBC rows	MLX: `results/same-workload-openmm-comparison/mlx-dhfr-implicit.json`; OpenMM Reference: `results/same-workload-openmm-comparison/openmm-dhfr-implicit.json`; summary: `results/same-workload-openmm-comparison/summary.json`; OpenMM OpenCL context: `results/openmm-opencl-dhfr-m5max.json`	`comparable` for the one-step MLX/OpenMM Reference smoke row; broader OpenCL context remains separate	Track the smaller DHFR real-system path and harden MLX artifact/runtime behavior before broad performance claims.
`dhfr-explicit-pme`	stretch; reference parity	`uv run python -m mlx_atomistic.benchmarks.dhfr --case dhfr-explicit-pme --steps 1 --json`	`uv run python scripts/benchmark_openmm_dhfr.py --case dhfr-explicit-pme --platform Reference --steps 1 --json`	deferred	`ns/day` only after matching runnable MLX and OpenMM rows exist	MLX: `results/same-workload-openmm-comparison/mlx-dhfr-explicit-pme.json`; OpenMM Reference: `results/same-workload-openmm-comparison/openmm-dhfr-explicit-pme.json`; OpenMM OpenCL context: `results/openmm-opencl-dhfr-m5max.json`	`blocked`; PME artifact policy requires neutrality and local Amber20/JAC has `net_charge=-11`; no ratio	Track the production-like DHFR PME path and resolve charged-system/neutralization policy before PME runtime optimization.
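
The metric families in the table use different units, which is why `ns/day` is reported only when timestep and step semantics are aligned. The unit arithmetic can be sketched as follows; the function names are hypothetical and the helpers are not part of the benchmark scripts.

```python
SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000


def ns_per_day(steps_per_second, timestep_fs):
    """Convert a steps/s throughput to ns/day for a known timestep.

    Only meaningful when both engines advance the same amount of
    simulated time per step.
    """
    if steps_per_second < 0 or timestep_fs <= 0:
        raise ValueError("steps_per_second must be >= 0 and timestep_fs > 0")
    simulated_ns_per_second = steps_per_second * timestep_fs / FS_PER_NS
    return simulated_ns_per_second * SECONDS_PER_DAY


def ms_per_eval(total_seconds, evaluations):
    """Average latency in milliseconds per evaluation."""
    if evaluations <= 0:
        raise ValueError("evaluations must be positive")
    return total_seconds * 1000.0 / evaluations
```

Because `ns_per_day` depends on the timestep, two rows with equal `steps/s` but different timesteps give different `ns/day` values, so the two metrics should not be mixed in one ratio.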

Reference Coverage Notes

OpenMM parity benchmarks belong under `scripts/` and emit normalized `ok`, `diagnostic`, or `blocked` payloads. Ratios are valid only between rows that both report `ok` and use the same metric family.
LAMMPS is deferred for this ladder except for simple LJ-like smoke notes. Do not map LAMMPS materials, protein, or package-specific benchmark families to MLX rows without a separate same-workload plan.
DHFR now has explicit same-workload row IDs. `dhfr-implicit` is runnable as a one-step GBSA/OBC smoke row; `dhfr-explicit-pme` remains blocked because the PME artifact policy requires a neutral system. ApoA1, Cellulose, and STMV OpenMM reports remain reference context until matching MLX workloads exist.
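
The neutrality block on `dhfr-explicit-pme` can be pictured as a gate that runs before any timing. The sketch below is an assumption about how such a check might look: the function name, the tolerance, and the returned payload shape are hypothetical, and only the `net_charge=-11` value comes from the row description above.

```python
def pme_neutrality_status(net_charge, tolerance=1e-6):
    """Return an 'ok' or 'blocked' payload based on total system charge.

    The tolerance is an illustrative default, not a project setting.
    """
    if abs(net_charge) <= tolerance:
        return {"status": "ok"}
    return {
        "status": "blocked",
        "reason": f"net_charge={net_charge:g} is not neutral; no ratio",
    }
```

With the charge reported for the local Amber20/JAC system, `pme_neutrality_status(-11)` yields a `blocked` payload, so no ratio would be computed for that row.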