Skip to content

Benchmark Inventory And Gap Matrix

Date: 2026-05-22

Scope: .agent/work/2026-05-22-performance-audit-harness-hardening. This inventory maps the benchmark surface and tracks which Phase 3 coverage gaps were closed by the harness-hardening change. External OpenMM, LAMMPS, OpenBenchmarking, and MLX material is used only as benchmark-design context, not as a pass/fail target for mlx_atomistic.

TierPurposeRequired availabilityResult location
Fast developerSmoke-check importable MLX benchmark modules and JSON/CSV shape.uv run environment only; no mandatory OpenMM, LAMMPS, OpenCL, or large fixture.Temporary pytest paths, stdout JSON, or optional local CSV.
Opt-in performanceLarger Apple Silicon runs, prepared production fixtures, and reference-engine context.Local accelerator, optional OpenMM/LAMMPS/dev reference setup, and optional gitignored fixtures.Raw JSON/CSV under gitignored results/; committed summaries under docs/benchmarks/.
ModuleCurrent coverageTest/doc evidenceResult/raw-output locationTierPhase 3 gap
src/mlx_atomistic/benchmarks/lj_md.pyLJ MD modes and CSV output.tests/test_benchmarks.py::test_lj_benchmark_csv_smokeCaller-provided --csv; stdout otherwise.Fast developerDoes not cover virtual sites, TIP4P-Ew, GBSA/OBC, soft-core/lambda, or replica exchange.
src/mlx_atomistic/benchmarks/md_performance.pyEnd-to-end synthetic LJ MD throughput, neighbor policy, cadence, synchronization, finite output, and MLX runtime metadata.tests/test_benchmarks.py benchmark smoke coverage.Stdout JSON/CSV when invoked by module; future raw outputs should land in results/.Fast developer now; opt-in for larger sizes.No Phase 3 feature-specific row.
src/mlx_atomistic/benchmarks/md_acceleration.pyNeighbor build versus force evaluation split, backend policy, pair representation, and waste counters.tests/test_benchmarks.py benchmark smoke coverage.Stdout JSON/CSV when invoked by module.Fast developer now; opt-in for larger sizes.No virtual-site, GBSA/OBC, soft-core/lambda, or replica-exchange overhead row.
src/mlx_atomistic/benchmarks/cadence_sensitivity.pyReporter/evaluation cadence and synchronization/materialization counts.tests/test_benchmarks.py benchmark smoke coverage.Stdout JSON/CSV when invoked by module.Fast developerCan inform future replica exchange history/materialization timing, but does not cover replica exchange today.
src/mlx_atomistic/benchmarks/mm_force_terms.pyBonded autodiff, neighbor-list build, LJ pair eval, direct Coulomb, combined nonbonded, constraints, and TIP4P-Ew virtual-site reconstruction/force redistribution micro-rows.tests/test_benchmarks.py::test_force_term_benchmark_includes_profile_rowsCaller-provided --csv; stdout JSON with --json.Fast developerStill a microbenchmark surface, not a production-scale advanced-water workload.
src/mlx_atomistic/benchmarks/phase3_physics.pyFast normalized rows for virtual sites, TIP4P-Ew M-site reconstruction, GBSA/OBC energy/forces and surface-area term, soft-core/lambda derivative grid, and two-replica exchange.tests/test_benchmarks.py::test_phase3_physics_benchmark_covers_required_feature_rowsCaller-provided --csv; stdout JSON with --json; raw audit outputs under results/performance-audit-harness-hardening/.Fast developerSynthetic probes only; larger opt-in rows are still needed before optimization claims.
src/mlx_atomistic/benchmarks/pme_performance.pyPME stage profiling against a prepared parity fixture; blocked payload when fixture data is absent.tests/test_benchmarks.py benchmark smoke coverage.Default fixture under results/md-engine-structural-gap-closure/pme-parity; default raw output under results/md-engine-structural-gap-closure/baseline/pme-profile.json.Opt-in performance; blocked-path smoke is fast.Adjacent to TIP4P-Ew water work but does not benchmark TIP4P-Ew virtual-site overhead.
src/mlx_atomistic/benchmarks/ewald_reference.pySmall-system Ewald correctness/backend timing, explicitly not GPCRmd-scale PME.tests/test_benchmarks.py::test_ewald_reference_benchmark_json_and_csv_smokeCaller-provided --csv; stdout JSON with --json.Fast developerDoes not cover soft-core/lambda derivatives or advanced water models.
src/mlx_atomistic/benchmarks/stability.pyNVE/NVT stability diagnostics for small systems.tests/test_benchmarks.py::test_stability_cli_json_and_csv_smokeCaller-provided --csv; stdout JSON with --json.Fast developerCould catch stability impacts after Phase 3 rows exist, but no Phase 3 timing row today.
src/mlx_atomistic/benchmarks/validation_gauntlet.pyFinite-difference force validation cases.tests/test_benchmarks.py::test_validation_gauntlet_cli_json_and_csvCaller-provided --csv; stdout JSON with --json.Fast developerValidation-oriented, not a performance row for Phase 3 features.
src/mlx_atomistic/benchmarks/schema.pyShared normalized benchmark fields and default command helpers.Imported by current benchmark modules and exercised by tests/test_benchmarks.py.N/A helper module.Fast developer supportNew benchmark rows should use this helper to keep report joins simple.
src/mlx_atomistic/benchmarks/gpcrmd_runtime.pyShared GPCRmd runtime reporting helpers for output directory size, resident memory, PME mesh summary, and diagnostic reductions.Covered indirectly by production/probe workflows and available to benchmark reports; no dedicated smoke test in tests/test_benchmarks.py today.N/A helper module; consumers should write raw run artifacts under results/.Opt-in performance supportHelper-only surface; does not create Phase 3 feature timing rows by itself.
DFT benchmark modules: dft_scf.py, dft_operator.py, dft_pseudopotential.py, dft_geometry.py, dft_nonlocal.py, dft_solver.py, dft_spin_kpoints.py, dft_relaxation.pyDFT operation and solver smoke timings.DFT smoke tests in tests/test_benchmarks.py.Caller-provided --csv; stdout JSON with --json.Fast developerOut of the Phase 3 MD physics gap set.
SurfaceCurrent coverageTest/doc evidenceResult/raw-output locationTierBoundary
scripts/benchmark_openmm_opencl.pySynthetic OpenMM/OpenCL LJ reference benchmark with blocked payload for unavailable platform.tests/test_benchmarks.py::test_openmm_opencl_unavailable_platform_non_json_does_not_crash; docs under docs/benchmarks/openmm-opencl-*.md.Caller-provided --csv; raw OpenMM report examples under results/openmm-opencl-*.json.Opt-in reference; blocked smoke is fast.openmm-reference, design context only.
scripts/run_openmm_mlx_parity.pyOpenMM versus MLX parity workflow support.Existing parity tests outside this slice; not a benchmark smoke in tests/test_benchmarks.py.Local run artifacts under results/ when invoked.Opt-in reference/contextReference parity context, not product runtime dependency.
scripts/run_openmm_mlx_npt_parity.pyNPT parity workflow support.Existing NPT/parity tests outside this slice.Local run artifacts under results/ when invoked.Opt-in reference/contextReference parity context, not product runtime dependency.
scripts/run_openmm_production_md_reference.pyProduction MD reference command surface.Existing production reference tests outside this slice.Local run artifacts under results/ when invoked.Opt-in reference/contextOpenMM context only; not a pass/fail throughput target.
scripts/run_mlx_production_md_probe.pyMLX production-probe command surface.Existing production probe tests outside this slice.Local run artifacts under results/ when invoked.Opt-in performanceMLX product probe, but not a routine fast gate.
scripts/benchmark_lammps_opencl.pySynthetic LAMMPS/OpenCL reference benchmark with normalized ok or blocked payloads.tests/test_benchmarks.py::test_lammps_opencl_reference_payload_is_normalized; docs command matrix and baseline audit.Caller-provided --csv; stdout JSON with --json; raw audit output under results/performance-audit-harness-hardening/lammps-fast.json.Opt-in reference; blocked smoke is fast.lammps-reference, design context only.
docs/benchmarks/README.mdEngine labels, file template, index, external input policy, raw-output policy.This inventory is linked from README.Committed Markdown in docs/benchmarks/.DocumentationDefines mlx_atomistic, openmm-reference, and lammps-reference.
docs/benchmarks/openmm-opencl-dhfr.md, openmm-opencl-apoa1.md, openmm-opencl-amber20.mdOpenMM OpenCL summary reports for DHFR, ApoA1, Cellulose, and STMV on Apple M5 Max.Indexed from docs/benchmarks/README.md.Raw JSON paths named under gitignored results/.Opt-in reference documentationExternal/reference comparison only.
FeatureCurrent implementation/test evidenceCurrent benchmark placementRemaining benchmark gap
virtual sitestests/test_virtual_sites.py covers virtual-site position reconstruction, force redistribution, artifact round trip, runner propagation, and simulation configuration plumbing.phase3_physics.py covers reconstruction and force redistribution; mm_force_terms.py adds synchronized TIP4P-Ew micro-rows.Larger opt-in advanced-water workload.
TIP4P-Ewtests/test_virtual_sites.py covers tip4p_ew_virtual_site, reference geometry, prepared-system round trip, and artifact build exposure.phase3_physics.py emits a TIP4P-Ew M-site row; mm_force_terms.py emits TIP4P-Ew reconstruction and redistribution rows.Couple the row to a larger nonbonded/PME workload before optimization claims.
GBSA/OBCtests/test_gbsa.py covers GBSA surface-area energy, OBC force finite-difference behavior, OpenMM OBC reference energy, artifact loading, and save/load parameters.phase3_physics.py emits OBC energy/force and surface-area rows.Scaling row over larger atom counts.
soft-core/lambdatests/test_soft_core.py covers finite overlap, endpoint equivalence, finite-difference energy_forces_dlambda, wrapper delegation, artifact metadata, and fail-closed non-cutoff electrostatics.phase3_physics.py emits an energy_forces_dlambda lambda-grid row.Larger lambda-grid sweep.
replica exchangetests/test_replica_exchange.py covers Metropolis probability, adjacent swaps, lambda-scaled Hamiltonians, odd/even pairing, metadata validation, and unsupported runtime inputs.phase3_physics.py reports per-replica throughput, swap counts, acceptance rate, and history materialization count.Larger opt-in multi-replica workload.
Context sourceUseful design signalCaveat
OpenMMPublic reports commonly use ns/day on named systems such as DHFR, ApoA1, Cellulose, and STMV, with platform, precision, timestep, constraints, hydrogen mass, ensemble, and cutoff/PME settings recorded.OpenMM numbers are reference context only. They are not direct pass/fail targets for MLX because hardware, backend, precision, and engine semantics differ.
LAMMPSOfficial benchmark families cover LJ liquid, polymers, metals/EAM, granular systems, and protein/rhodopsin-style systems with atom counts, timesteps, packages, precision/backend, and loop-time style metrics.LAMMPS coverage should stay opt-in and fail-soft; historical public numbers are context for scaling behavior, not target thresholds.
OpenBenchmarkingThe LAMMPS profile records repeatable run metadata and reports ns/day for rhodopsin/protein-sized systems with variance reporting.OpenBenchmarking helps shape provenance and repetition fields; it is not an apples-to-apples MLX acceptance gate.
MLXPublic MLX benchmark practice emphasizes operation-level timing, Apple Silicon device/runtime metadata, synchronization-aware measurements, and CPU/GPU/backend distinctions.MLX context informs measurement hygiene for mlx_atomistic; it does not provide atomistic throughput targets by itself.
  • The fast benchmark gate is currently centered on tests/test_benchmarks.py.
  • Current committed benchmark docs include OpenMM reference reports, the benchmark README, this inventory, and the baseline audit report.
  • Raw benchmark outputs should remain under gitignored results/; committed Markdown in docs/benchmarks/ should summarize reproducible rows and cite raw output paths.
  • The Phase 3 named features now have normalized fast benchmark rows; remaining gaps are larger opt-in workloads for optimization validation.