asm-bench is a platform-stratified benchmark for evaluating LLM agents on the day-to-day work of managing an enterprise SaaS platform as a software product — configuration, governance, integration, upgrades, and operational hygiene. The benchmark is built out platform by platform; we are starting with ServiceNow.
Same harness across every entry — only the agent varies. Scores are passᵏ with Wilson 95% confidence intervals; ranks are only claimed where intervals are disjoint. Click a row to expand per-suite pass rates. Per-task results are not surfaced — even submitters only see scores at the suite level and above.
| # | Submitter · invoker | Resolved (pass³) | Governance | Scope ↓ | κ | $ / task | Run |
|---|---|---|---|---|---|---|---|
Most enterprise platforms — ServiceNow, Datadog, Databricks, Salesforce, etc. — are closed-source SaaS. Agents can't be evaluated by running unit tests on a forked repo, because there is no repo. They have to be evaluated against the platform's real semantics: encoded queries, ACLs, business rules, catalog workflows, upgrade-skip remediation, integration design.
Each platform gets its own task corpus and rubric. asm-bench is the parent; ServiceNow is the first instance.
Closed-source SaaS has no forkable repo. Tasks are graded against deterministic state checks first; a rubric adjudicates only what state checks cannot, as sketched below.
Tasks are synthesized from anonymized aggregate platform-operations patterns. No customer-proprietary data. Apache-2.0 with a contamination clause.
Resolved, governance, scope, blast radius, hallucination, latency, cost, two-judge κ — the leaderboard surfaces them all and lets you weight them.
Refusal tasks, blast-radius caps, scope contracts. An agent that "just does what the user asked" fails the adversarial tier.
Tasks carry release tags and a deprecated field. Old leaderboard rows pin to the release they ran against so history stays comparable.
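Picking up the grading order from above (deterministic state checks first, a rubric only for what state cannot express), here is a minimal sketch. `StateCheck`, `run_llm_rubric`, and the field shapes are hypothetical illustrations, not the shipped verifier API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StateCheck:
    """A deterministic assertion over a snapshot of platform state (hypothetical shape)."""
    name: str
    predicate: Callable[[dict], bool]

def run_llm_rubric(items: list[str], state: dict) -> dict[str, bool]:
    """Placeholder for the two-judge rubric call; the real judge is configured per task."""
    return {item: True for item in items}

def grade_task(state: dict, checks: list[StateCheck], rubric_items: list[str]) -> dict[str, bool]:
    # Deterministic state checks run first; they are cheap and authoritative.
    results = {c.name: c.predicate(state) for c in checks}
    # The rubric adjudicates only criteria that platform state cannot express
    # (e.g. "is the integration design idiomatic?").
    results.update(run_llm_rubric(rubric_items, state))
    return results
```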
Every enterprise SaaS platform has its own semantics, its own tool surface, and its own failure modes. asm-bench reflects that: each platform ships as a self-contained corpus, emulator, and rubric. Pick a platform below to see what its corpus covers.
Enough public material to motivate the benchmark; enough private material to keep the leaderboard honest; enough metrics to see how an agent fails.
Train and dev splits are public so anyone can iterate honestly. The test split is held privately — only its task IDs and hashes are published — so the leaderboard score stays meaningful release-over-release.
k ≥ 3 trials per task. Per-domain Wilson 95% confidence intervals. We refuse to make pairwise rank claims unless the intervals are disjoint, and per-domain scores are published alongside the overall so a model can't hide weakness in one area behind strength in another.
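Concretely, the interval and the disjointness rule come down to a few lines. This is a standalone sketch of the standard Wilson score interval, not the harness's scoring code:

```python
from math import sqrt

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate of `passes` out of `trials`."""
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return max(0.0, center - half), min(1.0, center + half)

def rank_claim_allowed(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """A pairwise rank is claimed only when the two intervals do not overlap."""
    return a[1] < b[0] or b[1] < a[0]

# e.g. 41/60 vs. 52/60 resolved tasks in one domain:
print(rank_claim_allowed(wilson_interval(41, 60), wilson_interval(52, 60)))
```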
Tasks are marked deprecated, never silently deleted, so older leaderboard rows remain interpretable.

asm-bench has two secrets that point in opposite directions: our test split, which we keep private so the leaderboard stays meaningful, and your agent — its prompts, planner internals, tool schemas, intermediate reasoning, and any private retrieval it does at run time. The packaging model treats them symmetrically: the runner is open source, the wire format is whitelisted, and the leaderboard only ever returns aggregates.
The runner is a public OCI image (public.ecr.aws/rapdev-io/asm-bench-runner); the train/dev splits, emulator, verifiers, and runner code are all open source. You pull it, run it on your own infrastructure against your own agent, and never have to send us anything. No telemetry, no phone-home, no upload required.
On a test run the runner emits one ed25519-signed envelope per task. It carries the task id and hash, agent name and model id (whatever you choose to disclose), pass/fail per verifier, the resolved flag, observed blast radius, wall-clock time, and a one-time nonce. That's the full list.
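For concreteness, a payload honoring that list might look like the following. The field names and values are illustrative stand-ins, not the published schema:

```python
# Hypothetical field names that mirror the list above; the authoritative
# schema ships with the runner.
envelope = {
    "task_id": "sn-upgrade-0147",
    "task_hash": "sha256:9f2c...",    # pins the exact task revision
    "agent_name": "example-agent",    # whatever you choose to disclose
    "model_id": "example-model-2026",
    "verifier_results": {"acl_state": True, "catalog_flow": False},
    "resolved": False,
    "blast_radius": 3,                # observed blast radius (illustrative unit)
    "wall_clock_s": 212.4,
    "nonce": "b7e1f0a2",              # one-time, prevents replay
}
```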
Everything else stays on your machine: your system prompt, the user prompt you composed, chain-of-thought, planner state, retrieval queries, and tool-call arguments and returns beyond what a verifier strictly needs. The runner enforces a field whitelist before signing — anything outside the schema is dropped, and the leaderboard rejects envelopes with extra fields.
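A minimal sketch of that enforce-then-sign step, using PyNaCl for ed25519; the whitelist contents are illustrative:

```python
import json
from nacl.signing import SigningKey  # pip install pynacl

# Illustrative whitelist; the real one is the published envelope schema.
WHITELIST = {
    "task_id", "task_hash", "agent_name", "model_id",
    "verifier_results", "resolved", "blast_radius", "wall_clock_s", "nonce",
}

def seal_envelope(payload: dict, key: SigningKey) -> bytes:
    # Anything outside the schema is dropped *before* signing, so prompts,
    # chain-of-thought, and tool traces can never be part of what ships.
    clean = {k: v for k, v in payload.items() if k in WHITELIST}
    wire = json.dumps(clean, sort_keys=True).encode()
    return key.sign(wire)  # ed25519 signature prepended to the message

key = SigningKey.generate()  # generated locally; only key.verify_key is registered
signed = seal_envelope(
    {"task_id": "sn-demo-0001", "resolved": True, "system_prompt": "never ships"},
    key,
)
# "system_prompt" was dropped before signing; only whitelisted fields remain.
```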
For llm_rubric verifiers, the judge sees only the redacted slice configured by your redact.yaml — site patterns append to defaults, never replace them. The redacted input is hashed into the envelope so the operator cannot substitute a different input after the fact.
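Sketching the append-only merge and the input hash; `DEFAULT_PATTERNS` and the flat pattern list are assumptions for illustration, not the documented redact.yaml format:

```python
import hashlib
import re

# Shipped defaults (illustrative); site patterns can only extend this list.
DEFAULT_PATTERNS = [r"sys_id=[0-9a-f]{32}", r"Bearer [A-Za-z0-9._-]+"]

def redact(text: str, site_patterns: list[str]) -> str:
    # Site patterns append to the defaults, never replace them, so a
    # misconfigured redact.yaml cannot widen what the judge sees.
    for pat in DEFAULT_PATTERNS + site_patterns:
        text = re.sub(pat, "[REDACTED]", text)
    return text

def judge_input_hash(redacted: str) -> str:
    # Hashed into the envelope so the operator cannot swap inputs afterwards.
    return "sha256:" + hashlib.sha256(redacted.encode()).hexdigest()
```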
Every line of code that touches your agent is in runner/. CI builds the image twice on the same commit and fails the release if the contents of /app differ. The published digest is recorded with each release; anyone can rebuild from source and compare.
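To replicate that comparison locally, one rough approach is to export /app from two independent builds of the same commit and hash the trees; the build-a/build-b paths here are hypothetical:

```python
import hashlib
from pathlib import Path

def tree_digest(root: str) -> str:
    """Order-stable digest over every file under root (relative path + bytes)."""
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(root)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()

# After exporting /app from two independent rebuilds of the same commit:
assert tree_digest("build-a/app") == tree_digest("build-b/app")
```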
Run --dry-run --emit-envelope against a dev task to see exactly the bytes that would ship. Grep it, diff it against your local trace, satisfy yourself that no agent internals are present. The signing key is generated on your machine; only the public half is registered with us.
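The flip side is verification: given a published envelope and your registered public key, anyone can check the signature. A minimal sketch, again with PyNaCl:

```python
from nacl.signing import VerifyKey
from nacl.exceptions import BadSignatureError

def verify_envelope(signed: bytes, verify_key_hex: str) -> bytes:
    """Return the envelope bytes iff the ed25519 signature checks out."""
    vk = VerifyKey(bytes.fromhex(verify_key_hex))
    try:
        return vk.verify(signed)  # raises BadSignatureError on any tampering
    except BadSignatureError:
        raise ValueError("envelope rejected: signature does not match")
```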
Reach out and we'll get you set up. Train and dev splits are public for iteration; landing a row on the leaderboard goes through us.
Tell us what you'd like to evaluate. We'll get back to you with what's involved and the next available slot.
@misc{asmbench2026,
  title  = {asm-bench: An Agentic Software Management Benchmark
            for Enterprise SaaS Platforms},
  author = {{RapDev}},
  year   = {2026},
  url    = {https://asm-bench.rapdev.io},
  note   = {v0.1.0, ServiceNow corpus}
}