Prompt-based scenarios
Each scenario is a realistic request an agent might receive, paired with the skill it should use and the acceptance criteria it must satisfy.
Skillsmith runs real prompts against real models, validates the result in a real runtime, and turns failures into targeted skill improvements with an evidence trail humans can review.
Built for teams maintaining reusable AI skills, prompts, and reference material.
$ skillsmith counter config-fetch
skillsmith run 20260522-101530
scenarios  ████████████  2/2    pass 2 · fail 0
phases     ████████████  28/28  pass 28 · fail 0
elapsed 01:12 done
RUN RESULT: PASS
The problem
A skill can look clear in prose and still fail when a model actually tries to use it. Skillsmith makes correctness observable by running realistic scenarios and checking outputs against acceptance criteria and executable validation.
Run the same skill across configured testing agents to see where behavior holds, where it drifts, and which scenarios regress.
Go beyond “the answer sounds right.” Validate the generated artifact in the environment where it is supposed to work.
How it works
Write user-like prompts, attach skills, and specify rubrics and acceptance criteria.
Skillsmith sends each prompt to configured testing agents and captures their outputs.
Outputs are reviewed against rubrics and validated with project-specific runtime hooks.
Failure traces can drive a loop that edits the skill and re-runs the suite.
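The first three steps can be sketched roughly as follows. This is a minimal illustration, not Skillsmith's actual API: `Scenario`, `run_suite`, the stub agents, and the stub reviewer are all hypothetical names standing in for real models and rubrics.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record: a prompt, a skill, and its criteria."""
    scenario_id: str                      # e.g. "blocks/counter"
    prompt: str                           # user-like request
    skill: str                            # skill text attached to the prompt
    acceptance_criteria: tuple[str, ...]  # what the output must satisfy


# Assumption: a testing agent is modeled as (prompt, skill) -> output.
Agent = Callable[[str, str], str]
# Assumption: a reviewer is modeled as (scenario, output) -> passed?
Reviewer = Callable[[Scenario, str], bool]


def run_suite(
    scenarios: list[Scenario],
    agents: dict[str, Agent],
    review: Reviewer,
) -> dict[tuple[str, str], bool]:
    """Send every prompt to every agent and record pass/fail per pair."""
    results: dict[tuple[str, str], bool] = {}
    for scenario in scenarios:
        for agent_name, agent in agents.items():
            output = agent(scenario.prompt, scenario.skill)
            results[(scenario.scenario_id, agent_name)] = review(scenario, output)
    return results


def careful_agent(prompt: str, skill: str) -> str:
    """Stub agent that applies the skill."""
    return f"{prompt} [follows: {skill}]"


def sloppy_agent(prompt: str, skill: str) -> str:
    """Stub agent that ignores the skill."""
    return prompt


def stub_review(scenario: Scenario, output: str) -> bool:
    """Stub rubric: every criterion must appear in the output."""
    return all(criterion in output for criterion in scenario.acceptance_criteria)


suite = [
    Scenario("blocks/counter", "Build a counter block", "use-store-rule", ("use-store-rule",)),
]
results = run_suite(suite, {"careful": careful_agent, "sloppy": sloppy_agent}, stub_review)
```

Keying results by (scenario, agent) pair makes it easy to see which agent drifts on which scenario.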
Self-improvement
Testing tells you a skill is broken. Skillsmith can also fix it. When a run fails and loop mode is on, a single improver agent turns each failure into a targeted edit and re-runs until the suite passes or the iteration budget you set runs out.
The improver reads the failure trace — the prompt, the generated artifact, and the judge's verdict — for every scenario that failed.
It edits the skill files in place, jailed to the skills directory. The scenarios, rubrics, and harness are never touched.
Only the affected scenarios re-run, and the loop repeats until the suite passes or the iteration budget is spent.
Edits land in your working tree with the full evidence trail, ready to review and open as a PR. The harness never commits or pushes.
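The loop described above can be sketched as follows, assuming two hypothetical callbacks: `run_scenario`, which reports whether a scenario passes, and `improve_skill`, which edits the skill for the current failures. Neither name comes from Skillsmith; the sketch only shows the control flow of re-running failed scenarios within a budget.

```python
from typing import Callable


def repair_loop(
    scenario_ids: list[str],
    run_scenario: Callable[[str], bool],         # hypothetical: True if it passes
    improve_skill: Callable[[list[str]], None],  # hypothetical: edits skill files
    max_iterations: int,
) -> tuple[bool, int]:
    """Return (suite_passed, iterations_used)."""
    # Initial run: collect the scenarios that fail.
    failing = [sid for sid in scenario_ids if not run_scenario(sid)]
    iterations = 0
    while failing and iterations < max_iterations:
        improve_skill(failing)
        iterations += 1
        # Only the scenarios that were failing are re-run.
        failing = [sid for sid in failing if not run_scenario(sid)]
    return (not failing, iterations)


# Toy model: each scenario passes once the skill reaches a given revision.
skill_revision = {"n": 0}
required_revision = {"blocks/counter": 0, "blocks/tabs": 2}


def run_scenario(sid: str) -> bool:
    return skill_revision["n"] >= required_revision[sid]


def improve_skill(failing: list[str]) -> None:
    skill_revision["n"] += 1


fixed = repair_loop(list(required_revision), run_scenario, improve_skill, max_iterations=5)

# Same suite with a budget that is too small: the loop gives up.
skill_revision["n"] = 0
gave_up = repair_loop(list(required_revision), run_scenario, improve_skill, max_iterations=1)
```

In the toy run, `fixed` is `(True, 2)` and `gave_up` is `(False, 1)`.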
A verification hook lets a project reject any artifact that reads well but breaks for
real — a green judge with a broken result still fails the iteration, and the reason is
preserved with the evidence and fed back to the repair loop when it runs. The bundled
WordPress example uses that hook to build each plugin, boot wp-env, and run
Playwright in a real browser. That runtime belongs to the example, not to Skillsmith —
your project plugs in whatever “does it actually work?” means for you.
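One way to picture that rule is a verdict that requires both signals and keeps the failure reason. The sketch below is an assumption about the shape of the decision, not Skillsmith's hook interface; `IterationVerdict` and `decide` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class IterationVerdict:
    passed: bool
    reason: str  # kept with the evidence and available to the repair loop


def decide(judge_passed: bool, hook_passed: bool, hook_reason: str = "") -> IterationVerdict:
    """Combine the judge result with the project's verification hook."""
    # A failing hook overrides a green judge.
    if not hook_passed:
        return IterationVerdict(False, f"verification hook failed: {hook_reason}")
    if not judge_passed:
        return IterationVerdict(False, "judge rejected the artifact")
    return IterationVerdict(True, "judge and verification hook both passed")


green_but_broken = decide(judge_passed=True, hook_passed=False, hook_reason="plugin build failed")
all_green = decide(judge_passed=True, hook_passed=True)
```

Here `green_but_broken.passed` is `False` even though the judge approved, and the reason string carries the hook's explanation forward.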
Case study
In one experiment, Interactivity API guidance was reworked from long-form reference docs into a compact, rule-driven skill with scenario-specific reference files.
11/11 scenarios passed across repeated runs when using the structured skill.
| Approach | Reliable passes | Average tokens |
|---|---|---|
| Raw documentation | 0/11 | 5,783,289 |
| Improved documentation | 1/11 | 6,621,557 |
| Structured skill | 11/11 | 2,231,571 |
The skill passed every scenario and used roughly 2.6× fewer tokens than raw documentation and about 3× fewer than improved documentation. The improvement came from hard rules, copy-pasteable patterns, progressive disclosure, and shorter task-specific references.
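The token comparison can be checked directly from the averages in the table. The snippet below only redoes that arithmetic; the variable names are illustrative.

```python
# Average tokens per approach, copied from the table above.
average_tokens = {
    "Raw documentation": 5_783_289,
    "Improved documentation": 6_621_557,
    "Structured skill": 2_231_571,
}

baseline = average_tokens["Structured skill"]

# How many times more tokens each documentation approach used than the skill.
ratios = {
    name: round(count / baseline, 2)
    for name, count in average_tokens.items()
    if name != "Structured skill"
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the structured skill's tokens")
# Raw documentation: 2.59x the structured skill's tokens
# Improved documentation: 2.97x the structured skill's tokens
```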
Reference: experiment PR
What you get
See skill quality by scenario and model instead of relying on subjective prose review.
Preserve the prompt, generated artifact, judge result, runtime output, and token usage.
When the loop repairs a skill, the edits stay in your working tree alongside the evidence trail — ready to review and open as a PR.
Use it
Skillsmith discovers scenario folders under config.paths.scenarios, including
nested paths. Target a normalized scenario ID such as blocks/counter, or pass
a parent folder such as blocks to run every scenario below it. See the
README for the full CLI
and API reference.
# Run the full suite
skillsmith
# Run targeted scenarios by scenario ID
skillsmith counter
skillsmith blocks/counter
# Run all scenarios under a parent folder
skillsmith blocks
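The targeting rules above can be approximated as "exact ID, or any ID under a parent folder". The sketch below assumes that interpretation and uses a hypothetical `select_scenarios` helper with made-up scenario IDs; Skillsmith's real normalization and matching may differ.

```python
def select_scenarios(all_ids: list[str], selector: str) -> list[str]:
    """Return IDs equal to the selector or nested below it (assumed rule)."""
    selector = selector.strip("/")
    return [
        sid
        for sid in all_ids
        if sid == selector or sid.startswith(selector + "/")
    ]


# Hypothetical discovered scenario IDs, including a nested path.
discovered = ["counter", "blocks/counter", "blocks/tabs/basic"]

by_id = select_scenarios(discovered, "blocks/counter")
by_parent = select_scenarios(discovered, "blocks")
top_level = select_scenarios(discovered, "counter")
```

Matching on `selector + "/"` rather than a bare prefix stops a selector such as `block` from accidentally selecting everything under `blocks`.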
Build better skills
Use Skillsmith to find regressions, compare skill formats, validate agent outputs, and turn failures into evidence-backed improvements.