Agentic Software Management Benchmark · v0.1.0 

Can an agent manage an enterprise SaaS platform?

asm-bench is a platform-stratified benchmark for evaluating LLM agents on the day-to-day work of managing an enterprise SaaS platform as a software product — configuration, governance, integration, upgrades, and operational hygiene. The benchmark is platform-by-platform; we are starting with ServiceNow.

1 / n platforms live · ServiceNow first, more in progress
Breadth across the platform · not a single-module benchmark
Multi-metric reporting · no headline number
Private test split · hashes published, not the tasks
§ 01 · Leaderboard

A multi-metric vector, not a headline number

Same harness across every entry — only the agent varies. Scores are pass^k with Wilson 95% confidence intervals; ranks are claimed only where intervals are disjoint. Click a row to expand per-suite pass rates. Per-task results are not surfaced — even submitters only see scores at the suite level and above.

# | Submitter · invoker | Resolved (pass³) | Governance | Scope ↓ | κ | $ / task | Run

Showing submitter buckets. Request access →

§ 02 · Why this benchmark

A different question than coding benchmarks

Most enterprise platforms — ServiceNow, Datadog, Databricks, Salesforce, etc. — are closed-source SaaS. Agents can't be evaluated by running unit tests on a forked repo, because there is no repo. They have to be evaluated against the platform's real semantics: encoded queries, ACLs, business rules, catalog workflows, upgrade-skip remediation, integration design.

Platform-stratified

Each platform gets its own task corpus and rubric. asm-bench is the parent; ServiceNow is the first instance.

Real platform semantics

Closed-source SaaS has no forkable repo. Tasks are graded against deterministic state checks first; a rubric adjudicates only what state checks cannot.

Synthesized, not scraped

Tasks are synthesized from anonymized aggregate platform-operations patterns. No customer-proprietary data. Apache-2.0 with a contamination clause.

Multi-metric, no single number

Resolved, governance, scope, blast radius, hallucination, latency, cost, two-judge κ — the leaderboard surfaces them all and lets you weight them.

Governance & negative tests

Refusal tasks, blast-radius caps, scope contracts. An agent that "just does what the user asked" fails the adversarial tier.

Versioned, not deleted

Tasks carry release tags and a deprecated field. Old leaderboard rows pin to the release they ran against so history stays comparable.

§ 03 · Platforms & coverage

asm-bench is platform-by-platform

Every enterprise SaaS platform has its own semantics, its own tool surface, and its own failure modes. asm-bench reflects that: each platform ships as a self-contained corpus, emulator, and rubric. Pick a platform below to see what its corpus covers.

ServiceNow · v0.1 corpus
Domains covered on this platform — other platforms have their own maps and their own tool surface.

More on the way
Additional platforms are in scoping. If managing your platform feels like software product management, we want to hear about it.
asm-bench@rapdev.io →
§ 04 · Methodology

Splits, scoring, contamination

Enough public material to motivate the benchmark; enough private material to keep the leaderboard honest; enough metrics to see how an agent fails.

01

Private test split

Train and dev splits are public so anyone can iterate honestly. The test split is held privately — only its task IDs and hashes are published — so the leaderboard score stays meaningful release-over-release.

02

Multi-metric, no single number

  • resolved ↑ (higher is better)
  • governance_compliance
  • scope_violation ↓ (lower is better)
  • blast_radius
  • hallucination
  • latency_p50
  • cost
  • two_judge_κ
03

pass^k with Wilson CIs

Each task is run k ≥ 3 times, and Wilson 95% confidence intervals are computed per domain. We refuse to make pairwise rank claims unless the intervals are disjoint, and per-domain scores are published alongside the overall score so a model can't hide weakness in one area behind strength in another.
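
For concreteness, a minimal sketch of that scoring rule in Python; the function names and the rank-claim helper are illustrative, not the harness API:

import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson 95% confidence interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

def resolved(trials: list[bool], k: int = 3) -> bool:
    """pass^k: a task counts as resolved only if all k trials pass."""
    return len(trials) >= k and all(trials)

def rank_claim_allowed(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    """Pairwise rank claims require disjoint intervals."""
    return ci_a[0] > ci_b[1] or ci_b[0] > ci_a[1]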

04

Contamination posture

  • Test split is held privately; only task IDs and hashes are public (see the verification sketch after this list).
  • Tasks are synthesized — not scraped — from anonymized aggregate platform-operations patterns. No customer-proprietary data.
  • Tasks are versioned and may be deprecated, never silently deleted, so older leaderboard rows remain interpretable.
  • Apache-2.0 with a contamination clause: training on benchmark materials invalidates results from that model.
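
As an illustration of what the published hashes buy you: a sketch of how a third party could check a later-disclosed test task against its pre-registered hash. The canonicalization, helper names, and placeholder digest below are assumptions, not the published scheme.

import hashlib, json

# ASSUMPTION: tasks are canonicalized as sorted-key JSON before hashing;
# asm-bench's actual scheme may differ.
def task_hash(task: dict) -> str:
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical pre-registered (task ID, digest) pair from a release.
PUBLISHED = {"SN-TEST-0042": "<published-hex-digest>"}

def verify(task: dict) -> bool:
    return task_hash(task) == PUBLISHED.get(task["id"])
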
§ 05 · Submitter privacy

Your agent is yours. We don't read it.

asm-bench has two secrets that point in opposite directions: our test split, which we keep private so the leaderboard stays meaningful, and your agent — its prompts, planner internals, tool schemas, intermediate reasoning, and any private retrieval it does at run time. The packaging model treats them symmetrically: the runner is open source, the wire format is whitelisted, and the leaderboard only ever returns aggregates.

Train + dev run locally

The runner is a public OCI image (public.ecr.aws/x1d4z3g7/asm-bench-runner); the train/dev splits, emulator, verifiers, and runner code are all open source. You pull it, run it on your own infrastructure against your own agent, and never have to send us anything. No telemetry, no phone-home, no upload required.

Test sends a signed envelope, not a trace

On a test run the runner emits one ed25519-signed envelope per task. It carries task id and hash, agent name and model id (whatever you choose to disclose), pass/fail per verifier, the resolved flag, observed blast radius, wall clock, and a one-time nonce. That's the full list.

What never crosses the wire

Your system prompt, the user prompt you composed, chain-of-thought, planner state, retrieval queries, tool-call arguments and returns beyond what a verifier strictly needs. The runner enforces a field whitelist before signing — anything outside the schema is dropped, and the leaderboard rejects envelopes with extra fields.
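
A sketch of that whitelist-then-sign step, assuming the field list above; the schema constant, helper name, and JSON canonicalization are illustrative, not the runner's actual code.

import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Mirrors the envelope fields described above; the real schema may differ.
ENVELOPE_FIELDS = {
    "task_id", "task_hash", "agent_name", "model_id",
    "verifier_results", "resolved", "blast_radius", "wall_clock_s", "nonce",
}

def build_envelope(raw: dict, key: Ed25519PrivateKey) -> dict:
    # Anything outside the schema is dropped before signing.
    body = {k: v for k, v in raw.items() if k in ENVELOPE_FIELDS}
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return {"body": body, "sig": key.sign(payload).hex()}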

Rubric inputs are redacted + hashed

For llm_rubric verifiers, the judge sees only the redacted slice configured by your redact.yaml — site patterns append to defaults, never replace them. The redacted input is hashed into the envelope so the operator cannot substitute a different input after the fact.
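
A sketch of the redact-then-hash behavior, assuming regex-style patterns; the default pattern set and helper names here are hypothetical.

import hashlib, re

# Hypothetical defaults; site patterns from redact.yaml are appended,
# never substituted.
DEFAULT_PATTERNS = [r"sk-[A-Za-z0-9]{8,}", r"\b\d{3}-\d{2}-\d{4}\b"]

def redact(text: str, site_patterns: list[str]) -> str:
    for pat in DEFAULT_PATTERNS + site_patterns:
        text = re.sub(pat, "[REDACTED]", text)
    return text

def rubric_input(text: str, site_patterns: list[str]) -> tuple[str, str]:
    redacted = redact(text, site_patterns)
    # The hash of the redacted slice is what gets bound into the envelope.
    return redacted, hashlib.sha256(redacted.encode("utf-8")).hexdigest()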

Open-source runner, reproducible image

Every line of code that touches your agent is in runner/. CI builds the image twice on the same commit and fails the release if the contents of /app differ. The published digest is recorded with each release; anyone can rebuild from source and compare.

Dry-run before you submit

Run --dry-run --emit-envelope against a dev task to see exactly the bytes that would ship. Grep it, diff it against your local trace, satisfy yourself that no agent internals are present. The signing key is generated on your machine; only the public half is registered with us.
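
One way to do that audit, assuming the dry run writes a JSON envelope to disk; the file name and expected key set below are illustrative.

import json

ALLOWED_BODY = {
    "task_id", "task_hash", "agent_name", "model_id",
    "verifier_results", "resolved", "blast_radius", "wall_clock_s", "nonce",
}

# envelope.json: whatever --dry-run --emit-envelope produced locally.
with open("envelope.json") as f:
    env = json.load(f)

body = env.get("body", env)  # tolerate a wrapped or flat envelope
extra = set(body) - ALLOWED_BODY
assert not extra, f"fields that would ship but shouldn't: {extra}"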

Aggregate-only feedback

For each submitted envelope the leaderboard returns a single pass/fail and rubric score — not which assertion failed, not the judge's rationale, not which rubric dimension scored low. That's deliberate symmetry: fine-grained diagnostics stay on the dev split where you already have everything locally, so neither side can mine the other's secrets through repeated queries. If policy-plus-audit isn't strong enough for your procurement — for example if cryptographic non-observation is a hard requirement — an attested-enclave path may be an option on request.

§ 06 · Access

Access is by request

Reach out and we'll get you set up. Train and dev splits are public for iteration; landing a row on the leaderboard goes through us.

Request access

Get your agent on the ServiceNow leaderboard.

Tell us what you'd like to evaluate. We'll get back to you with what's involved and the next available slot.

§ 07 · Citation

Cite asm-bench

@misc{asmbench2026,
  title  = {asm-bench: An Agentic Software Management Benchmark
            for Enterprise SaaS Platforms},
  author = {{RapDev}},
  year   = {2026},
  url    = {https://asm-bench.rapdev.io},
  note   = {v0.1.0, ServiceNow corpus}
}
Version
v0.1.0 — ServiceNow corpus
License
Apache-2.0 with contamination clause
Maintainer
RapDev
Contact
asm-bench@rapdev.io