Agent Plugin and Skill Best Practices: 13 Rules That Work

Thirteen Agent Plugin and Skill best practices: design the plugin first, keep SKILL.md lean, use scripts for determinism, and measure every version.

These are the Agent Plugin and Skill best practices that separate an installable plugin that works from one that misfires. Design the plugin first, then design the Skills and components inside it.

1. Define the plugin product first

Name the plugin, the user, the job, the install targets, and the component inventory before writing SKILL.md. The plugin is what users install and trust; the Skill is one capability inside it.

2. Make each Skill description do the triggering

The agent decides whether to use your Skill from its description alone. Write it in the third person and include the literal phrases users say. A precise description is the highest-leverage thing in the whole Skill.

3. Scope each Skill to one job

A Skill that does one thing well triggers reliably and is easy to reason about. Split "do everything for finance" into reconcile-bank, build-close-pack, and so on.

4. Keep the body lean; defer detail to references

Put the core procedure in SKILL.md (aim for under ~1,500 words) and move edge cases into references/. The agent loads references only when needed — progressive disclosure keeps context cheap.

5. Write the body as instructions, not prose

Use imperative steps ("Reconcile each line", not "You might want to reconcile"). The body is a directive to the agent, not documentation for a person.

6. Use scripts for anything deterministic

Math, parsing, API calls, and validation belong in scripts/, not in model prose. Scripts are repeatable and testable; prose is not.

7. State what the Skill must NOT do

Explicit guardrails ("never modify source data", "propose journals, don't post them") prevent the most damaging failures.

8. Make failure graceful

Tell the Skill what to do when inputs are missing: list exactly what's needed and stop, rather than guessing.

9. Version the plugin and its capabilities like software

Adopt semver and publish versions. Track both plugin version and Skill/component version. You can't compare or roll back what you didn't version.

10. Measure before and after every change

Track observed plugin instances, component inventory, Skill invocations, error rate, latency, and rating. When you ship a new version, compare it to the last with real significance testing so you know whether it helped, regressed, or did nothing. Telvine does this with a closed telemetry envelope.

11. Watch the failure tail, not just the average

A 95% success rate hides the 5% that erodes trust. Triage the top failure reasons (a Pareto view) and fix the concentrated causes first.

12. Track autonomy and downstream impact

The best signal isn't "did it run" — it's how often it ran with no human step-in (autonomy) and what real downstream action it produced (a report exported, a ticket created).

13. Never put user content in telemetry

Instrument with a closed envelope of typed metadata — counts, durations, enums, ratings — never prompts or file contents. You can measure everything that matters without touching sensitive data.

Frequently asked questions

What's the most common reason a plugin disappoints users? It ships as a loose collection of files instead of a clear product. Start with the plugin job, then build the smallest Skill that proves the job.

What's the most common reason a Skill fails to trigger? A vague description. Rewrite it with the exact phrases users say.

How do I know if a new version is actually better? Compare versions on error rate, latency, and rating with significance testing — don't eyeball raw deltas.