How to Measure and Improve Agent Plugin and Skill Usage

Learn how to measure and improve Agent Plugins and Skills: package adoption, Skill invocation, component events, evals, version comparison, and privacy-safe telemetry.

Once you've built an Agent Plugin, the next questions are whether it actually works and whether each new version improves on the last. This guide covers how to measure Agent Plugin and Skill usage (package adoption, Skill execution, component behavior, evals, feedback, and version comparison) so you can improve a plugin with evidence instead of guesswork.

The metrics that matter

An Agent Plugin is software, so measure it at two layers: the installed package and the capabilities inside it.

Observed plugin instances - how many installation IDs have emitted at least one plugin event.
Plugin activation and retention - do observed instances keep using any capability?
Component inventory - which Skills, manifests, commands, MCP servers, hooks, connectors, apps, agents, directories, and runtime components shipped in each version.
Skill invocations - how often each capability actually runs.
Error rate - share of runs or component operations that fail.
Latency - p50 / p95 duration; the p95 is where pain hides.
Outcome mix - completed vs. partial vs. blocked, not just "did it end."
Autonomy rate - how often it ran with no human step-in.
Downstream actions - the real effects it triggered, such as a report exported or a ticket created.
Eval pass rate - how PM-authored scenarios and rubrics score each plugin or Skill version.
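
As a rough illustration of how several of these metrics fall out of the same run records, here is a minimal sketch. The record fields (duration_ms, outcome, failed, human_step_in) are assumptions made for this example, not a documented schema, and the percentile uses the simple nearest-rank method.

```python
from collections import Counter

# Hypothetical run records, one per finished Skill invocation.
# Tuple layout (an assumption): duration_ms, outcome, failed, human_step_in.
ROWS = [
    (100, "completed", False, False), (110, "completed", False, False),
    (120, "completed", False, False), (130, "completed", False, False),
    (140, "completed", False, False), (150, "completed", False, True),
    (160, "completed", False, False), (170, "partial", False, True),
    (180, "blocked", True, False), (2000, "blocked", True, False),
]
RUNS = [
    {"duration_ms": d, "outcome": o, "failed": f, "human_step_in": h}
    for d, o, f, h in ROWS
]


def percentile(sorted_values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    rank = (p * len(sorted_values) + 99) // 100  # integer ceil(p * n / 100)
    return sorted_values[rank - 1]


def summarize(runs):
    """Compute error rate, latency percentiles, outcome mix, autonomy rate."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    durations = sorted(r["duration_ms"] for r in runs)
    outcomes = Counter(r["outcome"] for r in runs)
    return {
        "error_rate": sum(1 for r in runs if r["failed"]) / n,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "outcome_mix": {name: count / n for name, count in outcomes.items()},
        "autonomy_rate": sum(1 for r in runs if not r["human_step_in"]) / n,
    }


summary = summarize(RUNS)
```

In this sample the median is 140 ms while the p95 is 2000 ms, which is the kind of gap an average would hide.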

The events to emit

Capture the plugin and capability lifecycle as a small set of typed events:

plugin.install / plugin.update.applied - package adoption and version changes when the host, wrapper, or first-run telemetry can observe them.
/v1/plugins/:id/versions - an API path rather than an event: the declared component inventory for a shipped version.
skill.invocation.start / skill.invocation.end - a Skill run begins and finishes.
skill.invocation.error - a Skill run fails with an error class.
plugin.component.invoked / plugin.component.error - a connector, hook, MCP wrapper, app, agent, or package directory emits observable behavior.
feedback.submitted - a user rates the result.

Crucially, every event should be a closed envelope of metadata (counts, durations, enums, ratings, component directories, and plugin-relative package paths) and should never carry prompts, file contents, connector payloads, tool arguments, retrieved records, absolute local paths, or user file paths. Every metric above can be computed without touching sensitive data.
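
One way to enforce a closed envelope is an allowlist check that runs before anything leaves the machine. The sketch below is illustrative only: the field names, the outcome enum, and the validate_event helper are assumptions, while the event names come from the list above.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Event names taken from the list above.
ALLOWED_EVENTS = {
    "plugin.install", "plugin.update.applied",
    "skill.invocation.start", "skill.invocation.end", "skill.invocation.error",
    "plugin.component.invoked", "plugin.component.error",
    "feedback.submitted",
}

# Hypothetical envelope: field name -> required type. Anything else is rejected.
ALLOWED_FIELDS = {
    "event": str,
    "installation_id": str,
    "plugin_version": str,
    "duration_ms": int,
    "outcome": str,
    "error_class": str,
    "rating": int,
    "component_directory": str,
    "component_path": str,
}
ALLOWED_OUTCOMES = {"completed", "partial", "blocked"}


def is_plugin_relative(path: str) -> bool:
    """True if the path is relative and never climbs out of the package."""
    if PurePosixPath(path).is_absolute() or PureWindowsPath(path).is_absolute():
        return False
    return ".." not in PurePosixPath(path.replace("\\", "/")).parts


def validate_event(event: dict) -> dict:
    """Return the event unchanged, or raise ValueError if it is not closed."""
    unknown = set(event) - set(ALLOWED_FIELDS)
    if unknown:
        raise ValueError(f"fields outside the envelope: {sorted(unknown)}")
    for name, value in event.items():
        if not isinstance(value, ALLOWED_FIELDS[name]):
            raise ValueError(f"{name} has the wrong type")
    if event.get("event") not in ALLOWED_EVENTS:
        raise ValueError("unknown event name")
    if "outcome" in event and event["outcome"] not in ALLOWED_OUTCOMES:
        raise ValueError("outcome is not a known enum value")
    if "component_path" in event and not is_plugin_relative(event["component_path"]):
        raise ValueError("component_path must be plugin-relative")
    return event
```

Rejecting unknown fields, rather than silently dropping them, makes an accidental prompt or payload field fail loudly in testing instead of leaking in production.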

Treat marketplace installs as unverified until observed

Most agent harnesses do not yet expose reliable marketplace-install webhooks. Clicking install in Codex, Claude Cowork, or another harness may not call your analytics endpoint. The durable pattern is first-run telemetry:

On first plugin execution, create or read a stable installation ID.
Emit plugin.install once for that id.
Emit skill.invocation.* and plugin.component.* events for each observed capability or component operation.

If plugin.install is unavailable, count the first committed runtime event as an observed instance, but do not present it as a verified marketplace install.
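
The first two steps above could be sketched as follows. This is a minimal illustration under stated assumptions: the state file location, its JSON layout, and the send callback are all hypothetical, and a real plugin would choose its own storage and transport.

```python
import json
import uuid
from pathlib import Path


def ensure_install_reported(state_file: Path, plugin_version: str, send) -> str:
    """Create or read a stable installation ID and emit plugin.install once.

    `send` is any callable that delivers one event dict (hypothetical).
    """
    state = {}
    if state_file.exists():
        state = json.loads(state_file.read_text())

    # Step 1: create the ID on first run, reuse it on every later run.
    if "installation_id" not in state:
        state["installation_id"] = str(uuid.uuid4())

    # Step 2: emit plugin.install only if it has not been emitted for this ID.
    if not state.get("install_emitted"):
        send({
            "event": "plugin.install",
            "installation_id": state["installation_id"],
            "plugin_version": plugin_version,
        })
        state["install_emitted"] = True

    # Persist after a successful send so a failed send is retried next run.
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(state))
    return state["installation_id"]
```

Because the state is written only after send returns, a delivery failure leaves install_emitted unset and the event is attempted again on the next execution, at the cost of a fresh ID if the very first write never happened.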

Measure directories without collecting content

Every plugin version should register its component inventory. For Claude, that means mapping directories, config files, and manifest fields such as skills/, commands/, agents/, hooks/, .mcp.json, .lsp.json, output-styles/, themes/, monitors/, bin/, scripts/, and settings.json.

Use plugin-relative metadata:

{
  "component_type": "directory",
  "name": "monitors",
  "telemetry_mode": "component_events",
  "component_path": "monitors/monitors.json",
  "component_directory": "monitors",
  "manifest_field": "experimental.monitors",
  "source": "default_directory",
  "experimental": true
}

Then emit plugin.component.invoked or plugin.component.error when your build, export, wrapper, hook, or runtime can observe a directory-level operation such as scanned, validated, resolved, cached, or installed. Inventory is not usage: if the host does not expose a signal, keep the component as declaration_only or host_unavailable.
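
A build step could generate inventory entries like the one above by checking which directories exist, without reading anything inside them. The sketch below is an assumption-laden illustration: build_inventory and the observable argument are hypothetical, the directory list is a subset of the names mentioned earlier, and only fields from the sample entry are used.

```python
from pathlib import Path

# Directory names mentioned earlier in this guide (files such as .mcp.json omitted).
KNOWN_DIRECTORIES = [
    "skills", "commands", "agents", "hooks", "output-styles",
    "themes", "monitors", "bin", "scripts",
]


def build_inventory(plugin_root: Path, observable: set[str]) -> list[dict]:
    """List shipped directories using plugin-relative names only.

    `observable` names the directories for which your build, wrapper, or
    runtime can actually emit events; everything else is declaration_only.
    """
    entries = []
    for name in KNOWN_DIRECTORIES:
        if not (plugin_root / name).is_dir():
            continue  # not shipped in this version
        entries.append({
            "component_type": "directory",
            "name": name,
            "telemetry_mode": (
                "component_events" if name in observable else "declaration_only"
            ),
            "component_directory": name,  # relative to the plugin root
            "source": "default_directory",
        })
    return entries
```

The function only tests for directory existence, so no file names or contents inside the directories ever reach the inventory.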

Compare versions, don't just track totals

The most useful question is "did my new plugin version help?" Answer it by comparing two versions head-to-head on adoption, invocation rate, error rate, latency, and rating, and test each difference for statistical significance so a noisy 2% wiggle isn't mistaken for a win. A good comparison gives one of three verdicts: the version improved, it regressed, or no change is detectable.

Watch for tradeoffs, too: a release can cut errors but get slower. Looking at one metric at a time hides that.
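
For a rate metric such as error rate, one common way to get the three-way verdict is a two-sided two-proportion z-test. The sketch below is one possible approach, not a prescribed method; the function name, the 0.05 threshold, and the verdict labels are assumptions.

```python
from math import erfc, sqrt


def compare_error_rates(errors_a, runs_a, errors_b, runs_b, alpha=0.05):
    """Compare version B against version A with a two-proportion z-test.

    Returns a verdict ("improved", "regressed", or "no_change") plus the
    two-sided p-value. Lower error rate counts as an improvement.
    """
    rate_a = errors_a / runs_a
    rate_b = errors_b / runs_b
    pooled = (errors_a + errors_b) / (runs_a + runs_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / runs_a + 1 / runs_b))
    if std_err == 0:  # both versions have identical 0% or 100% error rates
        return {"verdict": "no_change", "p_value": 1.0}

    z = (rate_b - rate_a) / std_err
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value under the normal approx.

    if p_value >= alpha:
        verdict = "no_change"  # difference is not distinguishable from noise
    elif rate_b < rate_a:
        verdict = "improved"
    else:
        verdict = "regressed"
    return {"verdict": verdict, "p_value": p_value}


big_drop = compare_error_rates(100, 1000, 60, 1000)   # 10% -> 6%
wiggle = compare_error_rates(100, 1000, 98, 1000)     # 10% -> 9.8%
```

The normal approximation needs a reasonable number of runs per version. To catch tradeoffs, run a comparison per metric and report the verdicts side by side rather than collapsing them into a single score.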

Add evals to the improvement loop

Production traces tell you where users actually struggle. Evals tell you whether a proposed fix improves known scenarios before you promote it. Keep PM-authored eval suites with the plugin repo:

my-plugin/
  skills/expense-readiness-review/SKILL.md
  evals/expense-readiness-review/eval.yaml
  evals/expense-readiness-review/cases.jsonl

An eval suite should define the scenario, expected outcome, rubric criteria, pass threshold, blocker cases, fixture pointers, and promotion criteria. Keep it metadata-safe: no live customer data, production prompts, connector payloads, browser DOM, screenshots, account values, or customer content.
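
To show how a pass threshold and blocker cases might combine into a promotion decision, here is a small sketch. The result format (one JSON object per line with case_id, passed, and blocker) and the should_promote helper are hypothetical; they are not the format of eval.yaml or cases.jsonl.

```python
import json

# Hypothetical eval results for one Skill version, one JSON object per line.
RESULTS_JSONL = """
{"case_id": "missing-receipt", "passed": true, "blocker": true}
{"case_id": "duplicate-claim", "passed": true, "blocker": false}
{"case_id": "over-limit", "passed": false, "blocker": false}
{"case_id": "foreign-currency", "passed": true, "blocker": false}
"""


def should_promote(results_jsonl: str, pass_threshold: float) -> dict:
    """Promote only if the pass rate meets the threshold and no blocker failed."""
    results = [
        json.loads(line) for line in results_jsonl.splitlines() if line.strip()
    ]
    if not results:
        raise ValueError("no eval results")
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    blockers_failed = [
        r["case_id"] for r in results if r.get("blocker") and not r["passed"]
    ]
    return {
        "pass_rate": pass_rate,
        "blockers_failed": blockers_failed,
        "promote": pass_rate >= pass_threshold and not blockers_failed,
    }


decision = should_promote(RESULTS_JSONL, pass_threshold=0.75)
```

Treating blocker cases as a hard gate means a version with a high overall pass rate still cannot ship if it fails a scenario the suite marks as non-negotiable.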

Find where failures concentrate

A single error rate is an average. To fix the right thing, break failures down by error class and look at where they concentrate: usually a small number of causes explains most failures (a Pareto pattern). Fix those first.
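
A Pareto breakdown is a few lines of counting. The sketch below assumes each failure carries an error class string; the class names and the 80% coverage cutoff are illustrative choices, not fixed rules.

```python
from collections import Counter


def pareto(error_classes, coverage=0.8):
    """Return the most frequent error classes that together reach `coverage`.

    Each item is (error_class, count, cumulative_share_of_all_failures).
    """
    counts = Counter(error_classes).most_common()
    total = sum(count for _, count in counts)
    top, cumulative = [], 0
    for error_class, count in counts:
        cumulative += count
        top.append((error_class, count, cumulative / total))
        if cumulative / total >= coverage:
            break
    return top


# Hypothetical error classes collected from skill.invocation.error events.
failures = (
    ["timeout"] * 6 + ["auth_expired"] * 2 + ["schema_mismatch"] + ["rate_limited"]
)
fix_first = pareto(failures)
```

Here two of the four error classes account for 80% of failures, so they are the ones worth fixing before the long tail.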

How to do this with Telvine

Telvine is built for exactly this. Publish your plugin with the CLI, register its component inventory, and add first-run telemetry where the host does not provide install webhooks. You get observed plugin instances, Skill invocations, errors, and feedback, plus:

Version comparison with significance testing.
PM-authored eval suites that complement production traces and feedback.
Funnels, retention, latency percentiles, outcome and autonomy breakdowns.
Per-component, per-directory, and per-script reliability.
Webhooks and CSV export to pipe events into your warehouse, PostHog, Mixpanel, Amplitude, or internal dashboards.

No dashboards to build, no user content collected.

Frequently asked questions

Can I measure plugins without sending data to a third party? Yes. Emit the typed events to your own endpoint, or use webhooks and CSV export to fan them into your existing stack.

What's the single most useful metric? There isn't one number; the most useful view is the version-over-version change in error rate and latency. Together they tell you whether your last change was worth shipping.

Does measuring require code changes to my Skill? No - publishing the plugin and wrapping the Skill adds instrumentation around it; the Skill logic stays the same.