Building the Trust Layer for Agent Commerce
AI agents are moving from chat into action. They do not just answer questions anymore. They search for tools, compare APIs, inspect prices, generate calls, and increasingly, they are being pointed at services they can pay for directly.
That creates a new problem.
The internet was not designed for autonomous buyers.
It was designed for humans who read docs, create accounts, copy API keys, compare pricing pages, and debug integrations one at a time. That workflow is slow for developers. For agents, it is brittle, expensive, and unsafe.
An agent cannot rely on a vague service description and hope the endpoint works. It needs to know what the service does, what it costs, how to call it, what schema it expects, whether the payment requirement is valid, and whether other agents have succeeded with it before.
That is the missing layer in agent commerce.
Nitrograph is building it.
The problem is no longer just payment
The first wave of agent commerce infrastructure has focused on payments and tool access.
That makes sense. Before an agent can buy a service, there needs to be a way for it to pay. Protocols like x402 make HTTP-native payments possible. MCP gives agents a standard way to discover and use tools. New paid API surfaces are emerging that are meant to be called by machines, not just browsed by humans.
These are important primitives.
But payment is not the whole transaction.
Before an agent pays, it has to choose.
And choosing is hard.
If an agent needs image generation, data enrichment, search, code execution, translation, risk scoring, or any other paid service, it has to answer a practical question: which service should I use, and can I trust it enough to proceed?
That question is not solved by a directory alone.
A directory can tell the agent that services exist. It can return candidates. It can expose metadata. But an autonomous buyer needs more than a list. It needs ranking, conformance checks, call instructions, pricing clarity, and memory from previous outcomes.
It needs to know which services are real, which are callable, which are well-described, which are overpriced wrappers, which expose usable schemas, and which are likely to fail before any money is spent.
That is the role Nitrograph is built to play.
From discovery to trust
Nitrograph is a discovery and trust layer for agent commerce.
The core loop is simple.
An agent should be able to describe what it needs and receive ranked service options. It should be able to inspect price, payment rail, input requirements, output expectations, and known failure modes. It should be able to call the service through its existing harness. Then it should be able to report whether the call worked.
That feedback becomes part of the network.
Over time, every successful call, failed call, malformed response, schema mismatch, pricing issue, and reliability signal improves future routing.
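Concretely, the loop might look like this from an agent's side. This is a minimal sketch: the base URL, endpoints, and response fields below are hypothetical, not a published Nitrograph API.

```python
# Sketch of the core loop: discover -> inspect -> call -> report outcome.
# The base URL, endpoints, and response fields are hypothetical, shown
# only to make the shape of the loop concrete.
import requests

BASE = "https://api.nitrograph.example"  # hypothetical base URL

# 1. Describe the task; receive ranked candidate services.
candidates = requests.post(
    f"{BASE}/discover",
    json={"intent": "translate product descriptions to German"},
).json()["services"]

# 2. Inspect the top candidate before spending anything.
svc = candidates[0]
print(svc["price"], svc["payment_rail"], svc["input_schema"])

# 3. Call the service through the agent's existing harness.
result = requests.post(svc["endpoint"], json={"text": "Beispieltext"})

# 4. Report the outcome so future routing improves for every agent.
requests.post(
    f"{BASE}/report",
    json={"service_id": svc["id"], "success": result.ok,
          "status": result.status_code},
)
```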
The long-term idea is straightforward: agents should not have to rediscover the same integration mistakes over and over.
If one agent learns that a service has a schema quirk, every future agent should benefit. If one endpoint consistently fails at the payment boundary, that should affect its ranking. If one provider exposes clean metadata, stable pricing, valid payment requirements, and reliable responses, it should rise.
That is how discovery becomes trust.
discover -> inspect -> call -> report outcome -> improve ranking
The benchmark we are building
To make this real, we are building the Nitrograph Agent Commerce Benchmark: a large-scale census and evaluation of paid, agent-usable services across x402, MPP, MCP, paid API catalogs, and related service ecosystems.
The goal is not to spend money invoking paid services.
We are not burning cryptocurrency to test every endpoint.
Instead, we are measuring what can be known before payment: public metadata, service descriptions, schemas, pricing, protocol conformance, endpoint behavior, and payment-boundary responses.
For x402 endpoints, that means probing up to the unpaid 402 Payment Required response and parsing the payment contract. That response can contain the information an agent needs before deciding whether to proceed: price, network, asset, recipient, resource, description, and call requirements.
That boundary is incredibly useful.
It lets us evaluate whether an endpoint is agent-ready without settling a payment.
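As an example, here is a minimal probe that stops at that boundary. The endpoint is hypothetical, and while the fields follow the x402 payment-requirements shape, exact names can vary by protocol version, so treat the parsing as illustrative.

```python
# Probe an x402 endpoint up to (but not past) the payment boundary.
# No payment is constructed or settled; we only parse the 402 response.
import requests

endpoint = "https://paid-service.example/v1/enrich"  # hypothetical endpoint

resp = requests.get(endpoint)

if resp.status_code == 402:
    contract = resp.json()
    # x402 advertises one or more acceptable payment requirements.
    # Field names follow the published x402 shape but may vary by version.
    for req in contract.get("accepts", []):
        print("price:    ", req.get("maxAmountRequired"))
        print("network:  ", req.get("network"))
        print("asset:    ", req.get("asset"))
        print("recipient:", req.get("payTo"))
        print("resource: ", req.get("resource"))
        print("describes:", req.get("description"))
else:
    # Anything other than a clean 402 here is itself a conformance signal.
    print("unexpected status before payment:", resp.status_code)
```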
From there, we generate realistic agent tasks, retrieve candidate services, and use large-scale AI evaluation to judge task-service fit. The output is a map of agent commerce.
This gives us a way to test whether Nitrograph routes agents better than generic search, raw embedding retrieval, or unranked directories.
agent intent -> candidate services -> relevance, callability, conformance, trust
What we measure
The benchmark has three main layers.
1. The census
First, we collect and normalize service metadata across agent-commerce surfaces.
That includes x402 directories, MCP registries, MPP services, paid API catalogs, and Nitrograph's existing index.
The question at this layer is basic but important: what services exist, what do they claim to do, and how are they exposed to agents?
This gives us a structured view of the market: categories, providers, payment rails, service descriptions, endpoint types, and metadata quality.
2. Protocol conformance
Second, we inspect whether services expose the information an agent needs to make a safe call.
For payable endpoints, we evaluate things like the following (a scoring sketch follows the list):
- Is pricing clear?
- Is the payment requirement valid?
- Is the endpoint stable?
- Is there a usable input schema?
- Is there a meaningful service description?
- Does the endpoint behave consistently up to the payment boundary?
- Can an agent understand how to call it without reading a human-only docs page?
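A minimal sketch of how such a checklist could become a single conformance score. The checks mirror the list above; the equal weighting is illustrative, not Nitrograph's actual scoring model.

```python
# Sketch: turn the conformance checklist into a single score in [0, 1].
# Checks and weights are illustrative only.
from dataclasses import dataclass

@dataclass
class ConformanceReport:
    has_clear_pricing: bool
    payment_requirement_valid: bool
    endpoint_stable: bool
    has_input_schema: bool
    has_meaningful_description: bool
    consistent_at_boundary: bool
    callable_without_human_docs: bool

    def score(self) -> float:
        # Equal weights for illustration; real weights would be tuned
        # against observed call outcomes.
        checks = [
            self.has_clear_pricing,
            self.payment_requirement_valid,
            self.endpoint_stable,
            self.has_input_schema,
            self.has_meaningful_description,
            self.consistent_at_boundary,
            self.callable_without_human_docs,
        ]
        return sum(checks) / len(checks)

report = ConformanceReport(True, True, True, True, False, True, False)
print(f"conformance: {report.score():.2f}")  # -> conformance: 0.71
```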
3. Routing quality
Third, we test whether the right services are being matched to the right tasks.
We generate realistic agent intents, retrieve candidate services through multiple methods, and evaluate which services are relevant, callable, conformant, and likely to work.
This lets us compare Nitrograph against simpler baselines.
A generic vector search might find services with similar words in the description. Nitrograph should do more. It should understand relevance, pricing, payment rail, schema quality, provider reputation, endpoint health, and trust signals.
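A toy contrast makes the point. The baseline ranks on text similarity alone; the protocol-aware ranker blends in conformance, endpoint health, price clarity, and outcome history. All signal names and weights here are illustrative.

```python
# Toy comparison: text similarity alone vs. similarity blended with
# protocol-aware signals (all normalized to [0, 1]). Illustrative only.

def generic_rank(services, similarity):
    # Baseline: rank purely by description similarity to the intent.
    return sorted(services, key=lambda s: similarity[s["id"]], reverse=True)

def protocol_aware_rank(services, similarity):
    # Blend relevance with conformance, health, price clarity, and
    # observed success rate.
    def score(s):
        return (0.4 * similarity[s["id"]]
                + 0.2 * s["conformance"]
                + 0.2 * s["endpoint_health"]
                + 0.1 * s["price_clarity"]
                + 0.1 * s["success_rate"])
    return sorted(services, key=score, reverse=True)

services = [
    {"id": "a", "conformance": 0.3, "endpoint_health": 0.5,
     "price_clarity": 0.2, "success_rate": 0.4},   # wordy but flaky
    {"id": "b", "conformance": 0.9, "endpoint_health": 0.95,
     "price_clarity": 1.0, "success_rate": 0.9},   # plainer but reliable
]
similarity = {"a": 0.9, "b": 0.7}

print([s["id"] for s in generic_rank(services, similarity)])         # ['a', 'b']
print([s["id"] for s in protocol_aware_rank(services, similarity)])  # ['b', 'a']
```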
The benchmark is how we prove that difference.
Why this matters now
Agent commerce is still early, but the primitives are arriving quickly.
x402 makes payment-native HTTP possible. MCP standardizes tool access. More services are being published in formats that machines can discover and call. Agents are becoming capable enough to compare options, write requests, inspect responses, and participate in economic workflows.
But the market is missing a trust layer.
Without that layer, agents will do what developers already do manually: search, guess, try, fail, debug, retry, and gradually learn how a service really behaves.
That approach does not scale when agents are spending money.
A failed call is not just a bug. It can be a paid mistake. A bad schema is not just annoying. It can trigger retries, wasted spend, and bad downstream decisions. A misleading service description is not just poor marketing. It can cause an agent to route work to the wrong provider.
The more autonomous agents become, the more important pre-transaction intelligence becomes.
Agents need to know before they pay.
What we expect to ship
This work should produce five concrete artifacts.
The State of Agent Commerce report
A public snapshot of the ecosystem: how many services exist, which rails they use, which categories are emerging, how much metadata is machine-readable, and where the quality gaps are.
A protocol conformance leaderboard
A ranked view of services that are priced, callable, schema-complete, and agent-ready.
A benchmark dataset
A reproducible evaluation set of agent tasks, candidate services, judgments, and baseline rankings.
A live demo
A side-by-side comparison where a builder, investor, or protocol team can type an agent task and see generic retrieval versus Nitrograph's protocol-aware ranking.
A better ranker
A learned reranking layer trained on the benchmark, improving Nitrograph's discovery quality while preserving the transparent signals that matter: relevance, reputation, health, pricing, conformance, and trust.
The bigger picture
The first version of the internet needed search.
The agent economy needs search too, but not the same kind.
Humans can skim results, read reviews, compare docs, and decide what feels credible. Agents need something more structured. They need a machine-readable trust layer that can rank services by usefulness, conformance, price clarity, reliability, and actual outcomes.
That is what Nitrograph is building.
Not just a catalog.
Not just a payment wrapper.
A routing and trust layer for autonomous service buyers.
The benchmark is the first step toward making that layer measurable. Instead of claiming the market exists, we are mapping it. Instead of claiming Nitrograph improves discovery, we are benchmarking it. Instead of waiting for perfect real-world outcome volume, we are using large-scale evaluation to build the first evidence layer.
The long-term vision is simple: when agents need to buy services, Nitrograph should be where they search, compare, inspect, call, and report what worked.
That is the foundation for agent commerce.
And it starts by measuring what is actually out there.