End-to-end skill evaluation for AI agents

Stop reviewing skills by vibe. Test and improve them.

Skillsmith runs real prompts against real models, validates the result in a real runtime, and turns failures into targeted skill improvements with an evidence trail humans can review.

View on GitHub → See the case study

Built for teams maintaining reusable AI skills, prompts, and reference material.

$ skillsmith counter config-fetch
skillsmith run 20260522-101530
scenarios  ████████████  2/2  pass 2 · fail 0
phases     ████████████  28/28  pass 28 · fail 0
elapsed 01:12   done

RUN RESULT: PASS

Example result matrix

Approach	Reliable passes
Raw docs	0/11
Edited docs	1/11
Skill	11/11

The problem

Skills are code paths. They deserve tests.

A skill can look clear in prose and still fail when a model actually tries to use it. Skillsmith makes correctness observable by running realistic scenarios and checking outputs against acceptance criteria and executable validation.

Prompt-based scenarios

Each scenario is a realistic request an agent might receive, paired with the skill it should use and the acceptance criteria it must satisfy.
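
As a rough illustration of that pairing, a scenario could be modeled as a small record. This is a minimal sketch, not Skillsmith's actual scenario format: the `Scenario` class, its field names, the example prompt, and the `interactivity-api` skill name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record; field names are illustrative only."""

    scenario_id: str                       # e.g. "blocks/counter"
    prompt: str                            # the user-like request sent to the agent
    skill: str                             # the skill the agent is expected to use
    acceptance_criteria: tuple[str, ...]   # statements the output must satisfy

    def is_well_formed(self) -> bool:
        # A usable scenario needs a prompt, a skill, and at least one criterion.
        return bool(self.prompt.strip() and self.skill and self.acceptance_criteria)


# Hypothetical example values, not taken from a real Skillsmith suite.
counter = Scenario(
    scenario_id="blocks/counter",
    prompt="Add a counter block that increments when the button is clicked.",
    skill="interactivity-api",
    acceptance_criteria=(
        "Clicking the button increases the displayed count by one.",
    ),
)

print(counter.is_well_formed())
```

Keeping the criteria next to the prompt means a reviewer can read one record and see both what was asked and what counts as success.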

Model comparison

Run the same skill across configured testing agents to see where behavior holds, where it drifts, and which scenarios regress.
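
One way to picture that comparison is a scenario-by-agent matrix. The sketch below assumes results arrive as `(scenario_id, agent, passed)` tuples; that shape, the function names, and the agent labels are assumptions made for illustration, not Skillsmith's output format.

```python
from collections import defaultdict


def build_matrix(results):
    """Group (scenario_id, agent, passed) tuples into {scenario: {agent: passed}}."""
    matrix = defaultdict(dict)
    for scenario_id, agent, passed in results:
        matrix[scenario_id][agent] = passed
    return dict(matrix)


def drifting_scenarios(matrix):
    """Scenarios where the agents disagree (some pass, some fail)."""
    return sorted(s for s, row in matrix.items() if len(set(row.values())) > 1)


def regressions(previous, current):
    """(scenario, agent) pairs that passed in the previous run and fail now."""
    return sorted(
        (scenario, agent)
        for scenario, row in current.items()
        for agent, passed in row.items()
        if not passed and previous.get(scenario, {}).get(agent) is True
    )


# Hypothetical agent names and results.
previous = build_matrix([
    ("blocks/counter", "agent-a", True),
    ("blocks/counter", "agent-b", True),
])
current = build_matrix([
    ("blocks/counter", "agent-a", True),
    ("blocks/counter", "agent-b", False),
])

print(drifting_scenarios(current))
print(regressions(previous, current))
```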

Runtime validation

Go beyond “the answer sounds right.” Validate the generated artifact in the environment where it is supposed to work.

How it works

A tight feedback loop for skill quality.

Define scenarios

Write user-like prompts, attach skills, and specify rubrics and acceptance criteria.

Run agents

Skillsmith sends each prompt to configured testing agents and captures their outputs.

Judge and execute

Outputs are reviewed against rubrics and validated with project-specific runtime hooks.

Improve skills

Failure traces can drive a loop that edits the skill and re-runs the scenarios that failed.

Self-improvement

When a test fails, repair the skill — then prove it.

Testing tells you a skill is broken. Skillsmith can also fix it. When a run fails and loop mode is on, a single improver agent turns each failure into a targeted edit and re-runs the failing scenarios until the suite passes or the iteration budget you set runs out.

Read the failure

The improver reads the failure trace — the prompt, the generated artifact, and the judge's verdict — for every scenario that failed.

Edit the skill

It edits the skill files in place, jailed to the skills directory. The scenarios, rubrics, and harness are never touched.

Re-run what failed

Only the affected scenarios re-run, and the loop repeats until the suite passes or the iteration budget is spent.

Review and ship

Edits land in your working tree with the full evidence trail, ready to review and open as a PR. The harness never commits or pushes.
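
The four steps above amount to a bounded repair loop. The following is a minimal sketch of that control flow under stated assumptions: `run_scenario` and `improve_skill` are hypothetical callbacks standing in for the testing agents and the improver agent, and `max_iterations` stands in for the iteration budget. It is not Skillsmith's implementation.

```python
from typing import Callable, Iterable


def repair_loop(
    scenario_ids: Iterable[str],
    run_scenario: Callable[[str], bool],
    improve_skill: Callable[[list[str]], None],
    max_iterations: int,
) -> tuple[bool, int]:
    """Return (suite_passed, iterations_used)."""
    # Initial run: collect the scenarios that fail.
    failing = [s for s in scenario_ids if not run_scenario(s)]
    iterations = 0
    while failing and iterations < max_iterations:
        improve_skill(failing)   # the improver edits the skill from the failure traces
        iterations += 1
        # Re-run only the scenarios that failed in the previous pass.
        failing = [s for s in failing if not run_scenario(s)]
    return (not failing, iterations)


# Toy stand-ins: each "improvement" fixes exactly one failing scenario.
fixed: set[str] = set()


def fake_run(scenario_id: str) -> bool:
    return scenario_id == "blocks/counter" or scenario_id in fixed


def fake_improve(failing: list[str]) -> None:
    fixed.add(failing[0])


suite = ["blocks/counter", "blocks/config-fetch", "blocks/toggle"]
print(repair_loop(suite, fake_run, fake_improve, max_iterations=3))
```

The loop only decides when to stop. Committing or pushing the resulting edits stays outside it, which matches the review-first workflow described above.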

Validate against the real runtime, not just the rubric

A verification hook lets a project reject any artifact that reads well but fails in practice. If the judge passes an artifact but the hook reports a broken result, the iteration still fails; the reason is preserved with the evidence and fed back to the repair loop when it runs. The bundled WordPress example uses that hook to build each plugin, boot wp-env, and run Playwright in a real browser. That runtime belongs to the example, not to Skillsmith: your project plugs in whatever “does it actually work?” means for you.
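
The rule that a green judge cannot override a broken runtime could be sketched as follows. The `evaluate` function, the `IterationResult` type, and the `(ok, detail)` hook signature are hypothetical; the sketch also assumes the hook is skipped when the judge has already rejected the artifact, which may not match how Skillsmith orders these checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class IterationResult:
    passed: bool
    reason: str   # kept with the evidence and available to the repair loop


def evaluate(judge_passed: bool, verify: Callable[[], tuple[bool, str]]) -> IterationResult:
    """Combine the judge verdict with a project-specific verification hook."""
    if not judge_passed:
        return IterationResult(False, "judge rejected the artifact")
    ok, detail = verify()   # e.g. build the artifact and exercise it in its runtime
    if not ok:
        # The judge was green, but the artifact does not work: still a failure.
        return IterationResult(False, f"runtime verification failed: {detail}")
    return IterationResult(True, "judge and runtime verification passed")


# Hypothetical hook outcome for illustration.
result = evaluate(True, lambda: (False, "button click did not update the count"))
print(result.passed, result.reason)
```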

Case study

Turning documentation into a reliable skill.

In one experiment, Interactivity API guidance was reworked from long-form reference docs into a compact, rule-driven skill with scenario-specific reference files.

11/11 scenarios passed across repeated runs when using the structured skill.

Approach	Reliable passes	Average tokens
Raw documentation	0/11	5,783,289
Improved documentation	1/11	6,621,557
Structured skill	11/11	2,231,571

The skill passed every scenario and used roughly 2.6× fewer tokens than the raw documentation and about 3.0× fewer than the improved documentation. The improvement came from hard rules, copy-pasteable patterns, progressive disclosure, and shorter task-specific references.
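
The token ratios can be recomputed directly from the table above. This short sketch uses only the average-token figures listed there; the variable names are illustrative.

```python
# Average tokens per approach, copied from the table above.
avg_tokens = {
    "Raw documentation": 5_783_289,
    "Improved documentation": 6_621_557,
    "Structured skill": 2_231_571,
}

skill_tokens = avg_tokens["Structured skill"]

# How many times more tokens each documentation approach used than the skill.
ratios = {
    name: round(tokens / skill_tokens, 2)
    for name, tokens in avg_tokens.items()
    if name != "Structured skill"
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the tokens of the structured skill")
# Raw documentation: 2.59x the tokens of the structured skill
# Improved documentation: 2.97x the tokens of the structured skill
```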

Reference: experiment PR

What you get

Evidence you can review, repeat, and trust.

Pass/fail matrices

See skill quality by scenario and model instead of relying on subjective prose review.

Failure traces

Preserve the prompt, generated artifact, judge result, runtime output, and token usage.
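
A trace with those five pieces could be stored as a plain record and serialized for later review. This is a sketch under assumptions: the `FailureTrace` class, its field names, and the JSON layout are hypothetical and do not describe Skillsmith's on-disk format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FailureTrace:
    """Hypothetical trace record mirroring the items listed above."""

    scenario_id: str
    prompt: str
    artifact: str         # what the agent generated
    judge_result: str     # the judge's verdict and rationale
    runtime_output: str   # output captured from runtime validation
    tokens_used: int


# Hypothetical values for illustration.
trace = FailureTrace(
    scenario_id="blocks/counter",
    prompt="Add a counter block that increments when the button is clicked.",
    artifact="<generated plugin source>",
    judge_result="fail: click handler is never registered",
    runtime_output="expected count 1, found 0",
    tokens_used=12345,
)

# Round-trip through JSON so the trace can be stored and reloaded unchanged.
serialized = json.dumps(asdict(trace), indent=2, sort_keys=True)
restored = FailureTrace(**json.loads(serialized))
print(restored == trace)
```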

Human-reviewable changes

When the loop repairs a skill, the edits stay in your working tree alongside the evidence trail — ready to review and open as a PR.

Use it

Run every scenario, one scenario ID, or a whole folder.

Skillsmith discovers scenario folders under config.paths.scenarios, including nested paths. Target a normalized scenario ID such as blocks/counter, or pass a parent folder such as blocks to run every scenario below it. See the README for the full CLI and API reference.

Terminal

# Run the full suite
skillsmith

# Run targeted scenarios by scenario ID
skillsmith counter
skillsmith blocks/counter

# Run all scenarios under a parent folder
skillsmith blocks
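
The selection rule described above (an exact scenario ID, or every scenario under a parent folder) could be sketched like this. The `normalize` and `select` functions are hypothetical, and the exact-match-or-folder-prefix behavior is an assumption drawn from the description; the README remains the reference for how Skillsmith actually resolves targets.

```python
from typing import Iterable, Optional


def normalize(target: str) -> str:
    """Normalize a target to forward slashes with no empty segments."""
    parts = [p for p in target.replace("\\", "/").split("/") if p]
    return "/".join(parts)


def select(scenario_ids: Iterable[str], target: Optional[str] = None) -> list[str]:
    """Pick scenarios for a run: all of them, one ID, or everything under a folder."""
    ids = sorted(scenario_ids)
    if target is None:
        return ids                      # no target: run the full suite
    wanted = normalize(target)
    return [s for s in ids if s == wanted or s.startswith(wanted + "/")]


# Hypothetical discovered scenario IDs.
discovered = ["counter", "blocks/counter", "blocks/toggle", "config-fetch"]

print(select(discovered))                   # full suite
print(select(discovered, "blocks/counter")) # one scenario ID
print(select(discovered, "blocks"))         # every scenario under a folder
```

Appending the slash before the prefix check keeps a folder target such as `blocks` from also matching an unrelated ID that merely starts with the same letters.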

Build better skills

Make improvement measurable.

Use Skillsmith to find regressions, compare skill formats, validate agent outputs, and turn failures into evidence-backed improvements.

Open the repository → Read the case study PR