Benchmark Ladder

Date: 2026-05-23

This ladder organizes mlx_atomistic benchmark rows by decision value. It is not a ranking across engines. Reference rows are used only where operation semantics and metric families line up.

| Layer | Purpose | Comparable rule |
| --- | --- | --- |
| micro/kernel | Isolate local MLX costs such as force terms, neighbor work, virtual-site reconstruction, and synchronization. | Diagnostic unless a reference engine exposes the same operation and metric. |
| controlled MD | Run tiny same-workload MD rows with matching atom and step semantics. | Comparable only when both sides are ok and use the same metric family. |
| feature physics | Time MLX feature rows for GBSA/OBC, TIP4P-Ew, soft-core/lambda, and replica exchange. | Comparable only after a matching OpenMM controlled case exists; otherwise diagnostic. |
| scaling | Sweep sizes or policies to tell fixed overhead from force-evaluation, neighbor-list, or memory-pressure costs. | Usually MLX-only diagnostic; reference parity is separate. |
| reference parity | Map controlled OpenMM rows to MLX rows and allow ratios only for matching metrics. | Comparable, diagnostic, or blocked from normalized payload status. |
| stretch | Track real-system rows such as DHFR until MLX artifacts and runtime parity exist. | Blocked or deferred until the MLX system, physics, and metric match the reference. |
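
The comparable rules in the table above can be read as a small gate in front of any ratio. The following is a minimal sketch, not the project's implementation: the `status` values ok, diagnostic, and blocked come from this page, while the `metric_family` and `value` keys and both function names are hypothetical names chosen for illustration.

```python
from typing import Optional


def comparability(mlx_row: dict, ref_row: dict) -> str:
    """Classify a row pair as 'comparable', 'diagnostic', or 'blocked'.

    Assumes each row is a normalized payload with hypothetical keys
    'status' and 'metric_family'.
    """
    statuses = {mlx_row["status"], ref_row["status"]}
    if "blocked" in statuses:
        return "blocked"
    if statuses != {"ok"}:
        # At least one side is diagnostic, so no ratio is allowed.
        return "diagnostic"
    if mlx_row["metric_family"] != ref_row["metric_family"]:
        # Both sides ran, but the units differ (e.g. ms/eval vs steps/s).
        return "diagnostic"
    return "comparable"


def ratio(mlx_row: dict, ref_row: dict) -> Optional[float]:
    """Return MLX value / reference value only for comparable pairs."""
    if comparability(mlx_row, ref_row) != "comparable":
        return None
    return mlx_row["value"] / ref_row["value"]


# Illustrative values only, not measured results.
mlx = {"status": "ok", "metric_family": "ms/eval", "value": 2.0}
ref = {"status": "ok", "metric_family": "ms/eval", "value": 4.0}
print(ratio(mlx, ref))                                    # 0.5
print(ratio(mlx, {**ref, "metric_family": "steps/s"}))    # None
```

Returning `None` instead of a number keeps a mixed-unit or blocked pair from leaking into a summary as if it were a valid ratio.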

| Row | Layer | MLX command | OpenMM command | LAMMPS mapping | Metric family | Raw output | Comparability status | Decision value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lj-synthetic-loop | controlled MD; reference parity | uv run python -m mlx_atomistic.benchmarks.md_performance --sizes 32 --steps 1 --sample-interval 1 --diagnostic-interval 1 --evaluation-interval 1 --json | uv run python scripts/benchmark_openmm_opencl.py --platform OpenCL --particles 32 --steps 1 --warmup-steps 0 --spacing-nm 1.0 --json | Simple LJ-like smoke mapping may use uv run python scripts/benchmark_lammps_opencl.py --particles 32 --steps 1 --json; first-class LAMMPS parity is deferred. | steps/s; ns/day only when timestep and step semantics are aligned | results/same-workload-openmm-comparison/mlx-lj-synthetic-loop.json; results/same-workload-openmm-comparison/openmm-lj-synthetic-loop.json; optional results/performance-audit-harness-hardening/lammps-fast.json | comparable only for MLX/OpenMM rows with matching atom count, step count, and timing metric; LAMMPS is diagnostic/deferred | Decide whether tiny full-loop overhead deserves optimization before larger-system work. |
| gbsa-obc-small | feature physics; reference parity | uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json | uv run python scripts/benchmark_openmm_opencl.py --case gbsa-obc-small --platform Reference --particles 4 --steps 1 --json | deferred; no LAMMPS mapping in this slice | ms/eval latency; no mixed unit ratio | MLX combined raw source: results/same-workload-openmm-comparison/mlx-phase3-controlled.json; MLX row extract: results/same-workload-openmm-comparison/mlx-gbsa-obc-small.json; OpenMM: results/same-workload-openmm-comparison/openmm-gbsa-obc-small.json; summary: results/same-workload-openmm-comparison/summary.json; audit source results/performance-audit-harness-hardening/phase3-physics-fast.json | comparable for the refreshed MLX/OpenMM Reference latency row; OpenMM reports fixture: gbsa_obc_small, status: ok, and obc_force_setup | Decide whether GBSA/OBC force evaluation is an optimization target from the controlled latency row. |
| tip4p-ew-water | micro/kernel; feature physics; reference parity | uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json; micro row: uv run python -m mlx_atomistic.benchmarks.mm_force_terms --evaluations 1 --particles 16 --json | uv run python scripts/benchmark_openmm_opencl.py --case tip4p-ew-water --platform Reference --particles 4 --steps 1 --json | deferred; no LAMMPS mapping in this slice | ms/eval latency for virtual-site reconstruction; no mixed unit ratio | MLX combined raw source: results/same-workload-openmm-comparison/mlx-phase3-controlled.json; MLX row extract: results/same-workload-openmm-comparison/mlx-tip4p-ew-water.json; OpenMM: results/same-workload-openmm-comparison/openmm-tip4p-ew-water.json; summary: results/same-workload-openmm-comparison/summary.json; audit sources results/performance-audit-harness-hardening/phase3-physics-fast.json and results/performance-audit-harness-hardening/mm-force-terms-fast.json | comparable for the refreshed MLX/OpenMM Reference virtual-site reconstruction latency row; OpenMM reports operation_semantics: virtual_site_reconstruction and openmm_operation: Context.computeVirtualSites | Decide whether TIP4P-Ew overhead is reconstruction-specific or part of a broader water-workload cost. |
| soft-core-lambda | feature physics | uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json | deferred mapping | deferred | ms/eval for lambda-grid energy/force derivative work | results/performance-audit-harness-hardening/phase3-physics-fast.json; future same-workload path under results/same-workload-openmm-comparison/ when mapped | diagnostic; reference parity deferred | Decide whether lambda derivative work needs a larger opt-in sweep before optimization. |
| replica-exchange | feature physics | uv run python -m mlx_atomistic.benchmarks.phase3_physics --evaluations 1 --waters 1 --atoms 4 --replica-steps 1 --json | deferred mapping | deferred | ms/eval, per-replica throughput, aggregate replica throughput, swap/history counters | results/performance-audit-harness-hardening/phase3-physics-fast.json | diagnostic; reference parity deferred | Decide whether replica execution, history materialization, or swap bookkeeping is the next MLX bottleneck. |
| scaling-sweep | scaling | uv run python -m mlx_atomistic.benchmarks.md_acceleration --include-large --evaluations 10 --json; uv run python -m mlx_atomistic.benchmarks.md_performance --include-large --steps 100 --json | deferred until a controlled OpenMM sweep is specified | deferred except simple LJ-like smoke notes | ms/eval, neighbor-build timing, steps/s; keep units separate by row | results/mlx-md-acceleration.json; results/mlx-md-performance.json; audit smoke sources under results/performance-audit-harness-hardening/ | diagnostic for MLX scaling; reference parity deferred | Decide whether overhead is fixed, neighbor-list dominated, memory-pressure dominated, or force-evaluation dominated. |
| dhfr-implicit | stretch; reference parity | uv run python -m mlx_atomistic.benchmarks.dhfr --case dhfr-implicit --steps 1 --json | uv run python scripts/benchmark_openmm_dhfr.py --case dhfr-implicit --platform Reference --steps 1 --json | deferred | ns/day for matching one-step GBSA/OBC rows | MLX: results/same-workload-openmm-comparison/mlx-dhfr-implicit.json; OpenMM Reference: results/same-workload-openmm-comparison/openmm-dhfr-implicit.json; summary: results/same-workload-openmm-comparison/summary.json; OpenMM OpenCL context: results/openmm-opencl-dhfr-m5max.json | comparable for the one-step MLX/OpenMM Reference smoke row; broader OpenCL context remains separate | Track the smaller DHFR real-system path and harden MLX artifact/runtime behavior before broad performance claims. |
| dhfr-explicit-pme | stretch; reference parity | uv run python -m mlx_atomistic.benchmarks.dhfr --case dhfr-explicit-pme --steps 1 --json | uv run python scripts/benchmark_openmm_dhfr.py --case dhfr-explicit-pme --platform Reference --steps 1 --json | deferred | ns/day only after matching runnable MLX and OpenMM rows exist | MLX: results/same-workload-openmm-comparison/mlx-dhfr-explicit-pme.json; OpenMM Reference: results/same-workload-openmm-comparison/openmm-dhfr-explicit-pme.json; OpenMM OpenCL context: results/openmm-opencl-dhfr-m5max.json | blocked; PME artifact policy requires neutrality and local Amber20/JAC has net_charge=-11; no ratio | Track the production-like DHFR PME path and resolve charged-system/neutralization policy before PME runtime optimization. |

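
Several rows allow ns/day only when timestep and step semantics are aligned. The arithmetic behind that restriction can be sketched as follows: steps/s becomes ns/day only after multiplying by a timestep, so two rows with different timesteps give ns/day values that do not describe the same work per step. The function names and the strict timestep-equality check are assumptions for illustration, not part of the benchmark scripts.

```python
from typing import Optional, Tuple

SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000


def ns_per_day(steps_per_second: float, timestep_fs: float) -> float:
    """Convert throughput in steps/s to simulated ns/day."""
    return steps_per_second * timestep_fs * SECONDS_PER_DAY / FS_PER_NS


def aligned_ns_per_day(
    mlx_steps_per_s: float,
    ref_steps_per_s: float,
    mlx_timestep_fs: float,
    ref_timestep_fs: float,
) -> Optional[Tuple[float, float]]:
    """Return (MLX, reference) ns/day only when both timesteps match.

    With mismatched timesteps the caller should stay in steps/s and
    report no ns/day figure.
    """
    if mlx_timestep_fs != ref_timestep_fs:
        return None
    return (
        ns_per_day(mlx_steps_per_s, mlx_timestep_fs),
        ns_per_day(ref_steps_per_s, ref_timestep_fs),
    )


# Illustrative values only: 100 steps/s at a 2 fs timestep.
print(ns_per_day(100.0, 2.0))  # 17.28
```

For a one-step smoke row the resulting ns/day figure mostly reflects fixed overhead, which is why the table treats such rows as smoke checks and not as broad performance claims.
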
  • OpenMM parity rows belong under scripts/ and emit normalized ok, diagnostic, or blocked payloads. Ratios are valid only for matching ok rows.
  • LAMMPS is deferred for this ladder except for simple LJ-like smoke notes. Do not map LAMMPS materials, protein, or package-specific benchmark families to MLX rows without a separate same-workload plan.
  • DHFR now has explicit same-workload row IDs. dhfr-implicit is runnable as a one-step GBSA/OBC smoke row; dhfr-explicit-pme remains blocked on PME artifact neutrality. ApoA1, Cellulose, and STMV OpenMM reports remain reference context until matching MLX workloads exist.
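
The dhfr-explicit-pme block described above comes down to a neutrality check on the system's net charge. A minimal sketch of such a gate is shown below, assuming a hypothetical `neutrality_gate` helper, a hypothetical `reason` field, and an arbitrary tolerance; only the ok and blocked statuses and the net_charge=-11 value come from this page.

```python
def neutrality_gate(case: str, net_charge: float, tolerance: float = 1e-6) -> dict:
    """Return a normalized-style payload for the PME neutrality policy.

    Hypothetical helper: a non-neutral system is reported as blocked,
    so no timing ratio should be derived from it.
    """
    if abs(net_charge) > tolerance:
        return {
            "case": case,
            "status": "blocked",
            "reason": f"net_charge={net_charge:g} violates PME neutrality policy",
        }
    # Passing this gate only means the charge check succeeded; other
    # checks (physics, metric family) still decide comparability.
    return {"case": case, "status": "ok", "reason": None}


# The local Amber20/JAC system is reported with net_charge=-11.
print(neutrality_gate("dhfr-explicit-pme", -11)["status"])  # blocked
```

Recording the reason in the payload keeps the blocked status traceable to the charge policy when the row is revisited after a neutralization decision.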