Telvine Learn

How to Measure Agent Plugin Usage, Errors, and Performance

Learn how to measure Agent Plugin usage: package adoption, Skill invocation, component events, version comparison, and privacy-safe telemetry.

Once you've built an Agent Plugin, the next question is whether it actually works. This guide covers how to measure Agent Plugin usage - package adoption, Skill execution, component behavior, and version comparison - so you can improve a plugin with evidence instead of guesswork.

The metrics that matter

An Agent Plugin is software, so measure it at two layers: the installed package and the capabilities inside it.

  • Plugin installs - how many installations have the plugin.
  • Plugin activation and retention - do people who install it keep using any capability?
  • Component inventory - which Skills, MCP config, hooks, connectors, apps, or agents shipped in each version.
  • Skill invocations - how often each capability actually runs.
  • Error rate - share of runs or component operations that fail.
  • Latency - p50 / p95 duration; the p95 is where pain hides.
  • Outcome mix - completed vs. partial vs. blocked, not just "did it end."
  • Autonomy rate - how often a run finished with no human step-in.
  • Downstream actions - the real effects it triggered, such as a report exported or a ticket created.
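As a rough illustration, several of the capability-level metrics above can be derived from a flat list of run records. This is a minimal sketch, not Telvine's implementation: the `Run` record, its field names, and the "error" outcome label are assumptions made for the example, and the percentile uses the simple nearest-rank method.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical record of one Skill run (field names are illustrative)."""
    skill: str
    duration_ms: float
    outcome: str          # assumed labels: "completed", "partial", "blocked", "error"
    human_step_in: bool   # True if a person had to intervene


def percentile(sorted_values: list[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(runs: list[Run]) -> dict:
    """Compute invocation count, error rate, latency, outcome mix, autonomy rate."""
    n = len(runs)
    if n == 0:
        return {}
    durations = sorted(r.duration_ms for r in runs)
    outcomes = Counter(r.outcome for r in runs)
    return {
        "invocations": n,
        "error_rate": outcomes["error"] / n,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "outcome_mix": {label: count / n for label, count in outcomes.items()},
        "autonomy_rate": sum(not r.human_step_in for r in runs) / n,
    }
```

Grouping the records by `skill` or by plugin version before calling `summarize` gives the per-capability and per-version views discussed later in this guide.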

The events to emit

Capture the plugin and capability lifecycle as a small set of typed events:

  • plugin.install / plugin.update.applied - package adoption and version changes.
  • /v1/plugins/:id/versions - declared component inventory for a shipped version. Unlike the other entries, this is an endpoint path rather than a runtime event name.
  • skill.invocation.start / skill.invocation.end - a Skill run begins and finishes.
  • skill.invocation.error - a Skill run fails with an error class.
  • plugin.component.invoked / plugin.component.error - a connector, hook, MCP wrapper, app, or agent emits runtime behavior.
  • feedback.submitted - a user rates the result.

Crucially, these should be a closed envelope of metadata - counts, durations, enums, ratings - and never carry prompts, file contents, connector payloads, tool arguments, or retrieved records. You can measure everything above without touching sensitive data.
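One way to enforce a closed envelope is to validate every event against an allowlist of fields and value types before it leaves the process, so free-form content has nowhere to go. The sketch below is an assumption-laden example: the event names come from the list above, but the field names (`plugin_version`, `duration_ms`, `outcome`, `error_class`, `rating`), the error classes, and the 1 to 5 rating scale are hypothetical.

```python
import re

EVENT_TYPES = {
    "plugin.install", "plugin.update.applied",
    "skill.invocation.start", "skill.invocation.end", "skill.invocation.error",
    "plugin.component.invoked", "plugin.component.error",
    "feedback.submitted",
}
OUTCOMES = {"completed", "partial", "blocked"}
# Hypothetical error classes; a real taxonomy would be defined by the plugin.
ERROR_CLASSES = {"timeout", "auth", "validation", "upstream", "unknown"}


def _in_set(allowed: set):
    # Only exact enum members pass, so arbitrary strings are rejected.
    return lambda v: isinstance(v, str) and v in allowed


# Allowlist: field name -> predicate. Anything not listed here is rejected.
FIELD_CHECKS = {
    "type": _in_set(EVENT_TYPES),
    "plugin_version": lambda v: isinstance(v, str)
    and re.fullmatch(r"\d+\.\d+\.\d+", v) is not None,
    "duration_ms": lambda v: type(v) in (int, float) and v >= 0,
    "outcome": _in_set(OUTCOMES),
    "error_class": _in_set(ERROR_CLASSES),
    "rating": lambda v: type(v) is int and 1 <= v <= 5,
}


def validate_event(event: dict) -> dict:
    """Return the event unchanged if it fits the envelope, else raise ValueError."""
    if "type" not in event:
        raise ValueError("event is missing 'type'")
    unknown = sorted(set(event) - set(FIELD_CHECKS))
    if unknown:
        raise ValueError(f"fields outside the envelope: {unknown}")
    for name, value in event.items():
        if not FIELD_CHECKS[name](value):
            raise ValueError(f"invalid value for field '{name}'")
    return event
```

Because the check is an allowlist rather than a blocklist, a field such as `prompt` or `tool_args` is rejected by default instead of needing to be anticipated.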

Compare versions, don't just track totals

The most useful question is "did my new plugin version help?" Answer it by comparing two versions head-to-head on adoption, invocation rate, error rate, latency, and rating, and run a statistical significance test on each difference so a noisy 2% wiggle isn't mistaken for a win. A good comparison tells you one of three things: it improved, it regressed, or nothing changed.

Watch for tradeoffs, too: a release can cut errors but get slower. Looking at one metric at a time hides that.
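For error rate specifically, one common way to test the difference between two versions is a two-proportion z-test. The sketch below is an illustrative choice, not necessarily the test Telvine runs; the function names and the 0.05 threshold are assumptions.

```python
import math


def two_proportion_z_test(fail_a: int, n_a: int, fail_b: int, n_b: int):
    """Compare failure rates of version A (old) and B (new).

    Returns (z, two_sided_p_value). Negative z means B fails less often.
    """
    p_a = fail_a / n_a
    p_b = fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both versions have identical all-pass or all-fail results.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def verdict(fail_a: int, n_a: int, fail_b: int, n_b: int, alpha: float = 0.05) -> str:
    """Map the test result onto the three outcomes: improved, regressed, no change."""
    z, p_value = two_proportion_z_test(fail_a, n_a, fail_b, n_b)
    if p_value >= alpha:
        return "no significant change"
    return "improved" if z < 0 else "regressed"
```

With 1,000 runs per version, going from 50 failures to 48 is reported as no significant change, while going from 100 to 50 is reported as improved. Latency is a continuous, skewed quantity, so it needs a different test than this one; running the same three-way verdict per metric is what surfaces tradeoffs such as fewer errors but slower runs.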

Find where failures concentrate

A single error rate is an average. To fix the right thing, break failures down by reason and look at where they concentrate. Usually a small number of causes explains most failures (a Pareto pattern), so fix those first.
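The breakdown can be as simple as counting failures per reason and walking down the sorted list until a chosen share is covered. A minimal sketch, assuming each failure has already been labeled with one reason string; the reason names and the 80% threshold in the usage note are hypothetical.

```python
from collections import Counter


def pareto(reasons: list[str], threshold: float = 0.8) -> list[tuple[str, int, float]]:
    """Return the top failure reasons that together reach `threshold` of failures.

    Each row is (reason, count, cumulative_share).
    """
    counts = Counter(reasons)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for reason, count in counts.most_common():
        cumulative += count
        rows.append((reason, count, cumulative / total))
        if cumulative / total >= threshold:
            break
    return rows
```

For 100 failures split as 50 "timeout", 30 "auth", 15 "validation", and 5 "unknown", the function returns only the "timeout" and "auth" rows, since those two reasons already account for 80% of failures.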

How to do this with Telvine

Telvine is built for exactly this. Publish your plugin with the CLI and register its component inventory. You get plugin installs, Skill invocations, errors, and feedback out of the box, plus:

  • Version comparison with significance testing.
  • Funnels, retention, latency percentiles, outcome and autonomy breakdowns.
  • Per-component and per-script reliability.
  • Webhooks and CSV export to pipe events into your warehouse, PostHog, Mixpanel, Amplitude, or internal dashboards.

No dashboards to build, no user content collected.

Frequently asked questions

Can I measure plugins without sending data to a third party? Yes. Emit the typed events to your own endpoint, or use webhooks and CSV export to fan them into your existing stack.

What's the single most useful metric? The version-over-version change in error rate and latency. Together they tell you whether your last change was worth shipping.

Does measuring require code changes to my Skill? No - publishing the plugin and wrapping the Skill adds instrumentation around it; the Skill logic stays the same.
