Hey there,
Humans are recommendation machines. We make playlists for loved ones or point friends to cafes like Kurasu in Singapore, or to this ramen spot, because we understand the preferences they hold. There is a certain amount of trust implicit in these recommendations. There is also the effort needed to understand the tastes of the person you are recommending to. That is why we do not go around recommending things to strangers; we reserve it for people we care enough about.
Agents, as of now, do not have people that care enough to guide them. (I am resisting the urge to joke about reinforcement learning.) So how do you figure out which agents have good taste in carrying out a task? How can trust and reputation be built when there isn’t implicit trust in agents working with one another? Put another way, how do you build a ranking system for which agents’ work output or recommendations you trust? Unlike humans, agents currently have very few consequences for their actions. Codebases do not get hurt for bad life choices.
Today’s piece, written by Saurabh in a sponsored collaboration with Recall, seeks to find the answer to the age-old question of how you create a system of trust in a new agentic economy. It is a fun walk down aggregation theory and explainers of how rating systems evolve. A big thank you to Andrew, Carson, Chad, Michael, and Sam for explaining how Recall works and reviewing the document.
As always, if you are building on these new economies or tooling for it, reach out to us at venture@decentralised.co.
Joel
TLDR:
Software is getting more cost-effective, and models are becoming stronger. The long tail expands as small teams ship agents for specialised tasks and quickly find PMF.
Trust is the bottleneck, not the supply of agents. Too many options, inconsistent quality, and no way to tell agents apart before you choose all dampen adoption.
Recall builds skill markets that surface demand and arenas where agents compete on real tasks. Performance data feeds a ranking system that becomes a queryable infrastructure.
Competitions ground scores in real-world tasks. Staking channels attention without overriding performance. Mechanism design makes gaming expensive and honesty rewarding.
No single agent dominates. Orchestrating specialised agents beats solo performance. The system learns which combinations work.
I often visit Mysore Cafe in Mumbai to get my quota of filter coffee. On my way back after my recent visit, a swanky billboard that usually sports a Bollywood star or a new Netflix show said, “Help me analyse last night’s dream”, a ChatGPT ad. This made me think about how quickly AI has gone from being a novelty to a norm. I checked the downloads, and to my surprise, ChatGPT has more than 500 million downloads on the Play Store alone!
Definitive technologies often arrive like tiny “big bangs”. At first, there is only a vacuum: no products and no users. Then someone brings an idea to life by creating a consumable product, and suddenly, everything changes. In a few cycles, the void becomes abundance. The web did this with pages, mobile with apps, and 2025 is doing it with agents. You can ship a ‘capable’ assistant as a browser extension or a Discord bot in a few hours. However, building an agent is easy; making one that’s actually useful in workflows is difficult.
That gap is where the industry often finds itself stranded. On the surface, agents look interchangeable, but they usually aren’t. Emails, messaging apps like Slack, coding interfaces, and so on, all have agents buried in them. Most jobs that don’t involve direct interaction with the physical world are exploring ways to utilise AI. Even recruiters are reaching out to potential candidates using AI-powered tools, some without bothering to check the AI output. The output is not always what was intended.
My point is, we no longer lack models or agents; we lack trust. There are too many options and too little proof of work. This gap between availability and evidence of work is the chasm you see in every technology adoption curve.
Let’s take a look at the supply and demand dynamics to understand this gap.
On the supply side, economics are changing. Software used to be expensive to build and distribute. High development costs meant venture capital concentrated on a few big bets with mass-market potential. The long tail of niche products serving specialised needs remained thin because small markets couldn’t justify the required investment.
However, it doesn’t have to be the same. Models are getting better and cheaper. Small teams can now ship agents that solve immediate, narrow problems without needing millions of dollars in funding. The tail gets fat because more niches become economically viable. Markets can fund and test specialised solutions that would have been ignored in the old model at a much lower cost.
On the demand side, the multitude of options creates a different problem. When you have thousands of options, quality becomes inconsistent. Let’s call it variance. Performance spreads widely amongst agents claiming to do the same thing. Some excel, but most are subpar.
From the outside, agents blur together. The deeper information you should be aware of before choosing between similar-looking options is often hidden or non-existent. This is asymmetry. Variance means quality is unpredictable. Asymmetry means you can’t tell which is which before you pick. Together, they create paralysis. Users face several choices with no reliable way to filter the signal from the noise, often leading to stalled adoption.
Think of a supermarket with twenty brands of olive oil. Variance means quality is inconsistent. Some are rancid, some are exceptional, and most are fine. Asymmetry means you can’t taste it before buying; the label doesn’t tell you if it’s been sitting in a hot warehouse for months. Without a ranking system (whether that’s a trusted critic or a knowledgeable friend), you’re just staring at several bottles, paralysed or picking at random. The market remains noisy until something cuts through it.
Recall intends to address this gap with a skills market and a ranking or trust layer, surfacing a demand for skills that funds the long tail. Markets generate signals based on user preferences and live competition results. This signal is read by the ranking layer, making it queryable and easily integrated by existing discovery interfaces. I’ll get into details of how all of this comes together below.
When it became clear that AI would be helpful, every developer wanted to deploy an agent. In crypto, token incentives turned that into a supply glut almost overnight. An AI agent called Truth Terminal chose $GOAT as its token, hit a billion-dollar market cap and kicked off a massive wave of AI agents creating and trading crypto tokens. With so many new agents flooding the market, it was impossible to distinguish between legitimate ones and those that were just noise.
Ecosystems like Virtuals standardised the launching of AI agent tokens. Within a month, more than 10,000 tokens were launched through Virtuals, each representing some form of agent. Of them all, only a handful were useful. Predictably, almost every token price ended up down more than 90%.
The massive increase in supply would be fine if quality rose in line with it, but it hasn’t. By quality, I mean reliability and actual usefulness. Does the agent consistently do what it claims to do in real situations, not just in carefully controlled demos? Most agents look great in their showcase videos, but fall apart when you try to use them in your work. A quality agent handles edge cases, doesn’t break when things get messy, and actually saves you time instead of creating more problems. The issue is that we got a flood of agents, but very few met the required standards.
A concrete example is the Software Engineering benchmark (SWE-bench) family of coding tests. Think of it as a scoreboard for AI systems that try to fix bugs on GitHub. The test gives an AI a broken piece of code and asks it to write a fix. Both the Verified and Live versions report “% Resolved”, which is simply how many problems an agent actually solves. The Verified version uses issues that humans have checked to ensure they are solvable. The Live version uses fresh problems.
On June 1, 2025, the SWE-bench Live paper reported that the best system could only resolve about one in five issues (19.25%). The same stack, run on the Verified version with the same settings, cleared 43.2% of the problems. Around the same time, a production agent from Warp scored 71% on SWE-bench Verified, putting it in the top five on the leaderboard.
This means that agents are more effective at solving problems that humans have deemed solvable. This is where variance matters. A system that looks strong on a curated set can still struggle with fresh, messy problems. Your outcomes depend on which slice of reality you test against.
This discovery problem gets bigger because agents themselves increasingly behave like users. They use tools, call APIs, and crucially, invoke other agents. It’s not just “which agent should I use?” but also “which agent should my agent trust when it needs help?” The surface area for error compounds rapidly.
That is precisely what the September 2025 npm incident demonstrated. Attackers phished a maintainer’s credentials and introduced malware into eighteen widely used JavaScript packages, including chalk, debug and ansi-styles. Altogether, these packages receive billions of weekly downloads.
For a short window, new versions carried browser-side code that could silently hijack crypto wallet actions and redirect funds. If you chose an agent or toolchain that grabbed those “trusted” packages that day, your selection looked safe but was actually compromised.
I have often felt that models respond differently to the same prompt. To test it, I ran the prompt “Can you create a diagram of how Hypercore and HyperEVM relate to each other?” twice on ChatGPT. I had two different outputs. Not only were the formats different, but the diagrams were also remarkably different.
Technology lets us create a lot of stuff really quickly. There are YouTube videos that claim to teach you to develop agents in twenty-five minutes without coding. In the absence of quality checks, most of it ends up being unusable. We build systems to filter out the good from the bad, typically through votes, ratings, or algorithms that track what people actually use.
Think about Amazon reviews. In a sea of thousands of products, five-star ratings help you find what’s actually good. It’s the same with Uber driver ratings or Google ranking websites, or even Netflix recommendations based on what millions of people watched. These are all trust layers we’ve built to cut through the noise. It needs to be applied to AI models and agents now. We need some way to quickly ascertain which ones actually work without testing hundreds ourselves.
How markets built trust layers
We have encountered problems arising from a supply explosion in the past. The solution has always centred on a ranking system to establish order in inconsistent supply.
Take, for example, Nielsen ratings in the 1950s. Every household either had a TV or wanted one. Suddenly, there were many channels, each with multiple shows, and thousands of time slots. Advertisers had money to spend, but they didn’t know which TV shows people actually watched. Without verification mechanisms, networks made wild claims about their audience size.
Then Nielsen stepped in with their measurement panels. They placed actual boxes in real homes to track what people watched. With the help of these devices, Nielsen was able to turn all the noise into a signal that marketers could use. A show either had ten million viewers or it didn’t. Ratings gave advertisers the power to make real decisions. Networks understood which shows to renew and which ones to cancel. There were still plenty of shows; the market had just become legible.

PageRank did something similar for web pages in the late 90s. During the dot-com boom, new pages appeared faster than anyone could catalogue them. We went from 10,000 websites in 1994 to 43.2 million websites in 1999. Yahoo’s human editors couldn’t keep up with categorising everything. Google’s PageRank treated every link as a vote. If credible sites link to you, you must be worth reading. Without slowing page creation, it provided everyone with a sensible starting point to navigate the explosion.
You find a similar pattern in all these cases. Standardised observation becomes public memory, which in turn drives economic consequences. Messy, anecdotal evidence is converted into structured, queryable trust. The agent ecosystem needs a conversion layer that turns live performance into a durable reputation.
Where is the AI agent trust layer?
The agent ecosystem today lacks a trust layer. Discovery is dominated by virality, app‑store curation, or single‑number leaderboard positions. In 1998, Google solved a similar discovery problem with PageRank. In contrast, today’s agent marketplaces are stuck with popularity contests that can’t tell you if an agent will survive your use case.
This is why the “start page” becomes the moat. Ben Thompson’s aggregation logic applies cleanly here. When production is cheap and distribution is plentiful, value accrues to whoever controls the first interaction. The default place people (and agents) begin. In search, that was a blank page with a box; the signal was the link graph and click‑through.
In crypto, DEX aggregators like Jupiter earned power by routing order flow without owning liquidity. The signal was executable quotes, slippage, gas, and fill success. In agents, the analogue is a ranking and routing layer that sits in front of the supply of agents and tools and decides, for a given task and set of constraints, who should act now.
One thing I’ve observed from studying aggregation theory is that a system with a feedback loop is needed. You can call it a flywheel or anything else, but the idea is the same. Take Google as the clearest example.
Users type search queries into Google, which then serves them with results from websites it has crawled. Users click on some results and ignore others. This process generates billions of data points about what people actually want when they search for specific terms.
The results users click on are treated as signals of relevance. Google feeds this clickthrough data back into its algorithm, making future searches more relevant. Better results attract more users, who generate more data, further improving the algorithm. Each part of the system strengthens the next in a continuous cycle.
This pattern appears across all successful aggregators. The aggregator sits between suppliers and users, collecting behavioural data that makes the platform progressively better at matching supply with demand.
For an AI agent reputation system or trust layer, the feedback loop would need clear components that reinforce each other. Agents would perform tasks and generate performance data. This data would create reputation scores that help users choose which agents to trust. User choices and task outcomes would update the reputation scores, making future agent selection more reliable. The system only works if each cycle genuinely improves trust assessment, just as each Google search makes the algorithm smarter.
With that stream of evidence, the trust layer can map how agents actually perform across tasks and constraints. And from that, learn how to route user requests. Better routing brings better results. Better results attract more queries. More queries make the routing sharper. And so on.
The metrics a trust layer prioritises determine what gets built around it. When the measure of success is narrow and predictable, builders optimise for the measure rather than the outcome. Such an agent might excel at a coding benchmark yet struggle to debug production code. A trading agent designed to beat a test may show terrific performance in backtested trading strategies. However, it often collapses when tested in real market volatility.
Live completion under real constraints changes the incentives. Costs, latency, and failure modes —they all matter. Builders have no choice but to focus on what genuinely delivers value, as gaming the system becomes expensive or impossible. How can you game a test if it is checking weekly risk-adjusted P&L in live markets? An agent can only prove itself by completing trades profitably during actual market volatility, not by looking good in historical simulations. The ranking rewards what users need, so supply evolves to meet that need.
To work well, a credible trust layer needs three things. First, real tasks with real costs. Second, provenance. Scores must be accompanied by full context so that anyone can verify the ranking. Third, resistance to gaming. The system must remain honest even when there are strong incentives to manipulate it. When these pieces are in place, the ranking goes beyond a scoreboard and becomes infrastructure you can build on.
AI models and agents are particularly well-suited to this approach because they can be measured against clear, objective tasks. With money and safety at stake, crypto-native mechanism design keeps evaluations unbiased and resistant to platform politics. With openly verifiable data, performance becomes easily provable.
Practically, this can also create new products. Many tasks will require multiple agents to work together. An orchestrating agent queries the trust layer with its task and constraints. The system responds with a complete plan, detailing which agents to use, in what sequence, and with which tools and parameters. Each task the system processes makes it smarter.
After several iterations, it learns which compositions reliably solve specific problems. Owning this learning loop, which continuously improves based on live task data, becomes the moat. Others can copy the interface but not the accumulated intelligence about what actually works.
The best systems look like a federation. A swarm of specialised models or agents, often orchestrated by another model, will outperform any single model. A ranking that identifies the most effective combinations is more useful than a simple list of winners.
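To make the routing idea concrete, here is a minimal sketch of how an orchestrator could pick agents from such a registry. The data structure, skill names, and scoring rule are illustrative assumptions, not Recall’s actual API.

```python
from dataclasses import dataclass

# Hypothetical shapes for a trust-layer query. The skill names, fields,
# and scoring rule are illustrative; this is not Recall's actual API.

@dataclass
class AgentScore:
    agent_id: str
    skill: str
    score: float         # opponent-adjusted skill estimate
    uncertainty: float    # width of the "we are not sure yet" band
    cost_per_call: float

def plan_task(required_skills, registry, budget):
    """Pick the best-ranked affordable agent for each required skill."""
    plan, remaining = [], budget
    for skill in required_skills:
        candidates = [a for a in registry
                      if a.skill == skill and a.cost_per_call <= remaining]
        if not candidates:
            raise ValueError(f"no affordable agent for skill: {skill}")
        # Prefer high scores, penalise wide uncertainty bands.
        best = max(candidates, key=lambda a: a.score - a.uncertainty)
        plan.append(best)
        remaining -= best.cost_per_call
    return plan

registry = [
    AgentScore("ingest-7", "data_extraction", 0.81, 0.05, 2.0),
    AgentScore("quant-3", "portfolio_rebalance", 0.74, 0.20, 5.0),
    AgentScore("quant-9", "portfolio_rebalance", 0.70, 0.04, 4.0),
]

for step in plan_task(["data_extraction", "portfolio_rebalance"], registry, budget=10.0):
    print(step.agent_id, step.skill)
```

Note the design choice: the planner trades raw score against uncertainty, which is why a slightly weaker agent with a tight confidence band can win the slot.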
Understanding Recall
No existing system combines all three requirements. Benchmark platforms provide objective tests but lack the economic stake. Prediction markets create skin in the game but don’t measure actual task performance. App stores have distribution but opaque, centralised curation. Recall’s design bridges these gaps by integrating what has historically been separate: live performance measurement, economic prediction markets, and open reputation scoring.
Recall is aiming to create a system that has three interlocking parts.
Skill markets reveal the demand and allow curators to stake tokens behind agents.
Live competitions generate public, verifiable performance data under actual constraints.
Recall Rank synthesises both signals into queryable reputation scores that improve as more data flows through the system.
The idea is to build skill-based markets for agents that help in creating a reputation layer. Humans and agents will query this data whenever they have to find agents or models for tasks.
Recall is curating the supply of AI agents. The curation model is similar to how Amazon manages its suppliers. Although Amazon is not directly liable when a product is faulty, it cares about the customer experience. Amazon has a relationship with the customer, not the supplier. Similarly, Recall must maintain its relationship with users. If the AI models or agents don’t do their job as expected, Recall’s credibility is in jeopardy.
Who is Recall for? First, the AI enthusiast who enjoys testing agents and backing the ones they rate highly.
Second, the “reverse-kickstarter” user. They need a specific capability, so they open a skill market or put up a bounty, and others with the same need can add to it. Companies can invest money in a skill they plan to utilise. Agents compete on these skills, and the best one gets paid and implemented.
Third, curators. People who take positions on how agents will perform in a specific skill market, and earn when results prove them right. Their signal helps separate look-alikes.
And finally, platforms. Over time, platforms that surface AI will be able to query the same rankings via API, allowing their users to discover more capable agents by default.
From static page ranks to dynamic competitions
Before I get into how Recall ranks agents, I want to set a reference for you. Think of how search results, sports ratings, and market leaderboards refresh as new evidence arrives. A clear sports example is cricket’s World Test Championship.
The championship runs for two years to determine the finalists. Teams play a home and away series of varying lengths. The table updates after each match while normalising for varying conditions. Because the cycle is long, teams that consistently perform better across various playing conditions tend to make it to the finals. That is a live ranking.
Recall Rank follows the same principle. It updates after each window, normalises for task difficulty and context, requires sufficient samples before making strong claims, and keeps incentives aligned with tangible outcomes.
Why competitions beat one-off benchmarks
Benchmarks get gamed when the test set is fixed. Instead of optimising for the task, models start optimising for doing well on tests. A competitive setting introduces fresh data and alters conditions. It prevents overfitting, retaining only the durable skills. A recent Stanford paper found that LLMs modify outputs to satisfy the goals set for them, just as humans do.
On the other hand, competitions are costly to fake. You have to show up again and again. You spend time, compute, and attention. Weak agents drop out because they cannot afford to continue paying these costs. Strong agents continue to perform.
Charles Goodhart was a British economist at the Bank of England. In 1975, he wrote that when policymakers fixate on a single indicator, the link between that number and real outcomes becomes tenuous. This idea, now known as Goodhart’s Law, is often summarised as follows: when a measure becomes a target, it stops being an effective measure. The dynamic nature of live competitions keeps conditions changing and ensures agents are adapting to real-world situations.
Subjective skills are handled with judge agents and light human checks. Where possible, Recall uses objective measures. When a task requires judgment, outputs are compared pairwise, and the results are aggregated using the same win and loss models employed elsewhere. For sensitive or high-value rounds, a simple human review step can be added. The goal is to keep the process transparent, fast, and challenging to game without over-engineering it.
The first focus is clear. For crypto trading agents, the competition started with a simple metric: weekly profit and loss. That, however, is a crude way of measuring a trading agent’s performance. Recently, Recall added a new skill that uses risk-adjusted returns, a better way to measure results.
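To see why the switch matters, here is a toy comparison of raw weekly P&L against a simple risk-adjusted view (a Sharpe-style ratio over daily returns). The numbers are invented; the point is that a volatile agent can win on raw P&L and still lose on the risk-adjusted measure.

```python
import statistics

def weekly_pnl(daily_returns):
    """Raw profit and loss: just the sum of daily returns."""
    return sum(daily_returns)

def risk_adjusted(daily_returns):
    """Sharpe-style ratio: mean daily return divided by its volatility."""
    vol = statistics.stdev(daily_returns)
    return statistics.mean(daily_returns) / vol if vol else float("inf")

steady   = [0.4, 0.3, 0.5, 0.2, 0.4]     # small, consistent gains (in %)
volatile = [4.0, -3.5, 5.0, -4.0, 0.6]   # big swings, slightly higher total

print(weekly_pnl(steady), weekly_pnl(volatile))    # 1.8 vs 2.1: volatile "wins"
print(round(risk_adjusted(steady), 2),
      round(risk_adjusted(volatile), 2))           # ~3.16 vs ~0.10: steady wins
```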
Why trading first? The use of applications like Almanak and Wayfinder suggests that the community already desires agentic trading. To keep things safe, Recall began with paper trading. As teams hardened and the system got better, live trading competition started in late September 2025.
How Ranking Works
Web search never finishes ranking the web. New pages appear, links change, people click different things, and the order shifts. Recall Rank brings the same logic to agent capability. Scores update as evidence arrives. Scores improve more when you beat strong opponents.
Static leaderboards are easy to game. The remedy is to keep the test moving. Old results fade in weight. Kaggle offers a helpful analogy.
Kaggle-style leaderboards
Consider Kaggle, a site where teams compete to build predictive models. During a competition, teams tune their models to climb this list. When the contest ends, Kaggle reveals the final scores using a private holdout set that no one has seen. Many teams drop in rank at this point because their models learned the quirks of the public slice instead of the real pattern. The fix is to keep part of the evaluation out of view, rotate what is measured, and reward the models that travel well across unseen data.
Recall follows the same lesson. Agents can view their own match logs and see public summaries, but the data that determines rank is refreshed continuously from live streams.
While Kaggle keeps test data hidden, Recall doesn’t have to. Let me explain why. When a trading competition is live, all the agents are competing against each other as market conditions unfold. The evaluation uses fresh market data that did not exist before, so there is no question of agents training on hidden data to game the test.
A slightly different problem that agents and users face is ensuring consistent agent performance across market conditions. Let’s say an agent is better at short selling. It can top the leaderboard during a week when the prices go down, but it’s not necessarily a true reflection of its capabilities. It did not game any test, but it happened to ‘get’ a market environment that’s favourable to its style. However, over a sufficiently long timeframe, scores are adjusted, and the playing field is levelled.
Scores give more weight to recent periods and less weight to stale ones. Risk-adjusted views matter more than single-week heat. If an agent tries to memorise a pattern that only worked yesterday, it will appear as a sharp rise followed by a fall in the next window. If an agent can perform across different regimes, the estimate becomes more stable instead of fluctuating.
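One simple way to express “recent weighs more, stale fades” is exponential decay over weekly scores. The sketch below is illustrative; the half-life and scores are arbitrary rather than Recall’s actual parameters.

```python
def recency_weighted(scores, half_life_weeks=4):
    """Weight each weekly score by how recent it is.

    scores: oldest first, newest last. A window that is `half_life_weeks`
    old counts half as much as the most recent one.
    """
    n = len(scores)
    weights = [0.5 ** ((n - 1 - i) / half_life_weeks) for i in range(n)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

one_hot_week = [0.1, 0.1, 0.1, 0.1, 0.9]    # memorised yesterday's pattern
consistent   = [0.6, 0.55, 0.65, 0.6, 0.6]  # performs across regimes

print(round(recency_weighted(one_hot_week), 2))  # ~0.32: one hot week is not enough
print(round(recency_weighted(consistent), 2))    # ~0.60: stable form scores higher
```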
Ranking new agents
Every agent starts with a rough skill estimate and a wide “we are not sure yet” band. After each round, the forecast is updated. It goes up after a good performance. If the agent does poorly, it goes down. Early moves are cautious because there is little evidence. As more results arrive, the estimate becomes more certain and changes less with each new round.
The rating maths draws on well-known families such as Bradley-Terry, which turns pairwise win-loss records into relative strengths, and Plackett-Luce, which extends this to rankings of more than two competitors. They turn head-to-head results into an opponent-adjusted skill estimate, allowing us to compare agents fairly even when schedules differ. The estimate updates as new results arrive and is harder to game, because beating strong opponents counts more than beating weak ones. It is a live, relative score on an arbitrary scale within a domain, and its uncertainty narrows as evidence grows.
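To illustrate the Bradley-Terry idea (not Recall’s implementation), here is a tiny gradient-style fit that turns a pairwise win-loss record into relative strengths. The update is larger when the result is surprising, which is why beating a strong opponent moves the estimate more than beating a weak one.

```python
import math

def fit_bradley_terry(matches, agents, steps=2000, lr=0.05):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    P(i beats j) = exp(s_i) / (exp(s_i) + exp(s_j)). Strengths sit on an
    arbitrary scale; only the differences between agents matter.
    """
    s = {a: 0.0 for a in agents}
    for _ in range(steps):
        for winner, loser in matches:
            p_win = 1.0 / (1.0 + math.exp(s[loser] - s[winner]))
            # Gradient of the log-likelihood: surprising wins move scores more.
            s[winner] += lr * (1.0 - p_win)
            s[loser]  -= lr * (1.0 - p_win)
    return s

matches = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C"), ("C", "B")]
strengths = fit_bradley_terry(matches, agents=["A", "B", "C"])
print(sorted(strengths.items(), key=lambda kv: -kv[1]))  # A, undefeated, ranks first
```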
On pairwise comparisons
Not all agents will participate in the same competitions every time, so we have to ensure there is a reliable way to rank them. This is where pairwise comparison enters the picture.
Pairwise means comparing two at a time. It is straightforward to use and works even when everyone cannot compete at the same time. Head-to-head results provide clear signals.
Sometimes, many agents submit outputs for the same challenge window. In such cases, we want to convert an ordered list into a list of strengths without losing any information. Plackett-Luce handles more than two competitors at once and captures that winning in a strong field tells us more than winning in a weak field.
An analogy of how the ICC (International Cricket Council) ranks teams helps here. In the World Test Championship, not all teams play each other in a cycle. India may not face Pakistan, but both play Australia and England. The table still has to compare India and Pakistan to pick the finalists. It does this by using outcomes across the schedule and normalising for series length and venue.
Recall Rank applies the same idea. If Agent A beats B, and B beats C, that chain helps place A above C even before they meet. If A wins in a tough bracket and C in an easy one, the system accounts for that context. That is why we start with pairwise results and extend to settings with more than two competitors.
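When a whole field is ranked at once, Plackett-Luce assigns the ordering a likelihood: the probability of an observed ranking is the product of each agent “winning” against everyone still below it. Here is a small sketch with made-up strengths; it is meant to show the model, not Recall’s parameters.

```python
import math
from itertools import permutations

def plackett_luce_prob(ranking, strength):
    """Probability of observing `ranking` (best first) under Plackett-Luce."""
    prob = 1.0
    remaining = list(ranking)
    for agent in ranking:
        weights = [math.exp(strength[a]) for a in remaining]
        prob *= math.exp(strength[agent]) / sum(weights)
        remaining.remove(agent)
    return prob

strength = {"A": 1.2, "B": 0.4, "C": 0.0}  # made-up skill estimates

# Under these strengths, A above B above C is the most likely ordering.
for order in permutations("ABC"):
    print("".join(order), round(plackett_luce_prob(order, strength), 3))
```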
With the ranking method in place, the next question is what can go wrong in practice.
A cold start can be a significant challenge. New agents have little data, so any rank would be noisy. Recall handles this in two ways. Agents begin with a rough score and a big “we are not sure yet” band. As they play more rounds on Recall, that band shrinks. Secondly, if they have verifiable past work elsewhere, Recall can use it as a starting point, but it remains uncertain until the agent proves it live on Recall.
Staking works like a prediction market. It provides insight into what token holders think about an agent’s performance. Recall tracks two things: how good the model or agent is, and how certain the system is about that score. Competition outcomes always determine the ability estimate, but the confidence interval also takes stake into account.
If the community stakes its reputation behind an agent, the system simply treats that agent as worth watching closely. If competition results then validate that prediction, the confidence band tightens more quickly. If results contradict the stake, the agent’s rank still falls. Staking accelerates how quickly the system becomes confident about a score, but it cannot manufacture a good score from poor performance.
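A toy sketch of that separation, assuming a simple estimate where only competition results move the score and stake merely shrinks the uncertainty band faster. The update rules and numbers are illustrative, not Recall’s.

```python
class AgentEstimate:
    """Skill estimate with a 'we are not sure yet' band."""

    def __init__(self):
        self.score = 0.0        # only competition results move this
        self.uncertainty = 1.0  # wide band for a brand-new agent

    def observe_result(self, result, weight=0.1):
        """result: +1 for a win, -1 for a loss in a competition round."""
        self.score += weight * self.uncertainty * result
        self.uncertainty *= 0.95          # every real result narrows the band

    def observe_stake(self, staked_fraction):
        """Stake says 'watch this one'; it narrows uncertainty, not the score."""
        self.uncertainty *= 1.0 - 0.2 * staked_fraction

a = AgentEstimate()
a.observe_stake(0.8)     # heavily backed by curators
a.observe_result(-1)     # ...but it loses its first live round
print(round(a.score, 3), round(a.uncertainty, 3))  # the score falls anyway
```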
As of September 2025, Recall has about 1.4 million users and roughly 9 million curations logged. There are around 155,000 AI tools, models or agents registered. Early skills focus on crypto trading with a rank based on weekly P&L, and live crypto trading support was rolled out in late September.
We now have the pieces to measure skill. The next question is how those measurements stay fresh and valuable. This is where a market for skills and curation matters.
Markets and Curation
Sponsors open a market for a skill. $RECALL token holders express their assessment of whether an agent can excel at that skill. Think of them like prediction markets for model or agent performance. The market price reflects the crowd’s belief about the agent’s ability to perform the task. Outcomes from competitions then confirm or challenge that belief.
Here’s how it works. Skills are defined precisely (for example, “extract KYC fields with 99 per cent accuracy under 2 seconds”). Sponsors open a market for that skill. Curators take positions on agents they believe can meet the bar. Competitions run on live tasks. When they are over, the results resolve the market, just like prediction markets.
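Here is a deliberately simplified sketch of how such a market could resolve, with curators staking behind agents and the competition outcome deciding payouts pro rata. The payout rule is an assumption for illustration, not Recall’s published mechanism.

```python
def resolve_skill_market(stakes, winner):
    """Split the losing stakes among curators who backed the winning agent.

    stakes: {curator: (agent_backed, amount_staked)}
    Returns each curator's net gain or loss. Purely illustrative economics.
    """
    winning_pool = sum(amt for agent, amt in stakes.values() if agent == winner)
    losing_pool = sum(amt for agent, amt in stakes.values() if agent != winner)
    payouts = {}
    for curator, (agent, amt) in stakes.items():
        if agent == winner:
            payouts[curator] = losing_pool * (amt / winning_pool)  # share of losing stake
        else:
            payouts[curator] = -amt                                # stake lost
    return payouts

stakes = {
    "alice": ("kyc-bot-a", 100),
    "bob":   ("kyc-bot-a", 50),
    "carol": ("kyc-bot-b", 90),
}
# The competition resolves: kyc-bot-a met the accuracy/latency bar, kyc-bot-b did not.
print(resolve_skill_market(stakes, winner="kyc-bot-a"))
```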
Curation will evolve in phases.
Phase 1 – Seeded markets with a boost.
Users who stake RECALL receive a temporary boost they can allocate to an agent for a single session. If the agent performs well, the curator earns a reward. This provides early signals without requiring large deposits and helps initiate the flywheel.
Phase 2 – Open markets with ongoing positions.
In this phase, anyone can create a market for a specific skill. Curators maintain a consistently positive stance on an agent in that particular skill, not just session by session. This deepens liquidity and keeps tests coming in areas users care about.
Phase 3 – Two-sided curation and embeds.
Until this stage, curators could only bet on agents doing well. Now, they can take positions for or against an agent’s movement in the rankings. This deepens liquidity and sharpens predictions of agents’ prowess. Partners can show Recall’s rankings inside their own apps through an API. Users see the tables where they already work.
A market for skill defines a task clearly, presents past results along with their uncertainty, and allows people to express their views by taking a position. This approach enables users or companies to create new skills in a bottom-up manner, rather than having them pushed top-down.
The ranking follows from this activity. Competition outcomes carry the most weight. Curation adds timely context. Liquidity and the diversity of curators influence confidence, so the table shows not only a score but also the system’s confidence level. In short, the market is the product, and ranking emerges as a result of healthy markets.
The Recall Rank moat
Every competition round adds time-stamped evidence that improves the score. Over months, this becomes a record that others cannot replicate. Late entrants cannot manufacture years of live results. Because of that history, the search starts here, and the answers only get clearer as the record grows.
Depth in each skill enhances the signal’s strength. When many agents enter the same arena, the match graph gets dense. Good agents want to be recognised among their peers to attract users. Buyers want to look at where rankings are reliable and competition is steep. That pull on both sides deepens the pool and raises the bar for would-be clones.
Recall does the job of turning raw logs into a clear, readable scorecard. It indicates whether a result is based on an objective measure or a judged comparison and shows recent form. As users come to trust that summary, Recall can become the place they begin. When more users start here, more agents feel pressure to show their work.
Do you remember how an aggregator needs to build relationships with the user? Depth, width, and freshness of data are not enough to make a lasting relationship with users. Users can find what they want, but that’s not enough reason for them to keep coming back.
What helps keep users coming back? Curation. I open X (Twitter), not because I want to find something specific. People don’t spend time endlessly doom-scrolling on Instagram because they are looking for something specific. These feeds became the default front pages because they learned from rich interaction data. As people watched, liked, saved, and shared, the feed improved. Better feeds brought more creators and viewers. More use produced more data. That loop made the curated feed a moat.
Curators on Recall publish shortlists for a skill or a domain and back their calls with stake. When their calls are proven correct, they earn protocol rewards. If they mislead or try to rig results, they face penalties. Over time, this creates a visible track record, making the best curators reliable guides. Their reputation draws users, and users give their lists more lift. That loop is challenging to copy without the same history of calls and outcomes.
Enterprises can post the problems they care about and fund prizes. Some DeFi protocols have already shown interest in doing so. That brings new agents, new data, and new users who want to solve the same problems. Query fees on discovery and routing keep the lights on. Governance can direct rewards toward underserved skills through staking, improving how resource pools are allocated. All of this flows through the same canonical ranking, so the value created returns to the index.
Put together, this is the flywheel. Competitions create data. Data improves rankings. Sharper rankings attract more users. More demand from users attracts agents and curators. Agents and curators invite sponsors. They fund more competitions. Each turn of the wheel makes it harder for a copycat to catch up.
Where does Recall fit?
I use a 2x2 map of rating systems to see where Recall can fit among existing ranking layers. The horizontal axis spans objective evidence to subjective opinion. The vertical axis moves from centralised control to distributed input. Yelp and IMDb user ratings are at the top and right because many people weigh in with their opinions.
Recall is placed in the centre because it combines objective tasks with judged comparisons, and it sits higher on the distributed side because it works on inputs from a diverse group of users, sponsors, curators, and the team. The map is a quick way to show how Recall resembles systems that get better as more people use them.
What lies ahead is a landscape of tradeoffs that do not resolve once and for all. Tests that are too visible get gamed, and live arenas cost real money. These are tensions the protocol will have to carry and are unlikely to vanish with a few fixes.
Predictable tests invite gaming. In 2015, regulators found that Volkswagen’s diesel cars used software to detect when they were being tested and pass emissions tests while emitting far higher NOx on the road. After the U.S. EPA issued a notice in September 2015, Volkswagen admitted the cheating, recalled and bought back large numbers of cars, paid fines and settlements, and agreed to fixes and mitigation projects.
A ranking arena faces the same tension. Publishing rules for transparency increases the attack surface. Hiding rules reduces trust. Picking how much to reveal, and when, is a continuous tradeoff with real consequences for signal quality.
Judgment can be biased. The 2002 Winter Olympics pairs figure skating final in Salt Lake City ended with Russia’s Elena Berezhnaya and Anton Sikharulidze being awarded the gold medal over Canada’s Jamie Salé and David Pelletier. The French judge later told investigators that her federation chief had pressured her to favour the Russians in exchange for support in ice dance. After an inquiry, the IOC awarded a second gold to the Canadian pair and the French officials were suspended.
Agent evaluation faces a similar issue. Judge agents inherit model biases and can drift as base models change. Human checks add cost and can introduce their own biases. Deciding how often to recalibrate, when to blind sensitive reviews, and how to handle appeals are ongoing governance choices rather than simple switches.
Signals can be coordinated or manipulated. There was a global benchmark interest rate for short-term loans between banks called the London Interbank Offered Rate (LIBOR). It was set each day based on quotes submitted by a small panel of banks. Investigations from 2012 onward revealed that traders at several banks were coordinating to nudge quotes, both to profit on positions and to make their banks look safer. Regulators levied hefty fines and prosecutors brought cases. In 2017, the UK regulator announced LIBOR would be phased out.
The lesson is that a benchmark that relies on unverifiable self-reports is open to cartel behaviour and quiet pressure. For Recall, the parallel risks include Sybil resistance for staking identities, detecting linked clusters that appear independent, providing clear notice and an appeals path before penalties, and avoiding punishments that scare off honest participants.
The Recall protocol employs a two-layer defence against such Sybil attacks. The first is token economics: creating fake identities becomes economically irrational because the cost of acquiring enough stake can outweigh any benefit from gaming the system. The second is that even with enough stake, you cannot move the objective, competition-driven part of the ranking.
Short samples and regime dependence can mislead. In the 1990s, researchers demonstrated that mutual fund databases that dropped dead funds made the average performance look better than it was. When those missing funds were added back, many celebrated records faded. Later work also showed that hot streaks built in one market regime often failed when conditions changed.
A weekly trading arena faces the same trap. A short window is fast but noisy. A long window is stable but slow. Choosing the window length, deciding how much history to show, and determining how to present uncertainty are live design choices. The tension remains. Move too fast and risk false positives, or move too slow and risk missing a new skill.
Cost, latency, and saturation are real. The ImageNet Large Scale Visual Recognition Challenge was a yearly test from 2010 to 2017 where models classified, located, and detected objects across 1,000 categories in a dataset with millions of labelled images. In 2012, a deep model significantly reduced the error. By 2015, top systems had surpassed the reported human top-5 error rate. In 2017, the organisers retired the challenge.
The test had done its job, and further gains involved polishing the benchmark. That is saturation. For Recall, the constraints are practical. Live arenas cost money and compute. Tasks must stay fresh. Prize pools must be funded. Frequency must be balanced against depth so results are worth trusting. These are economic limits, not just product choices.
The Go-To Coordination Layer for Agents
No single agent can perform all the tasks. Complex tasks may require multi-agent systems, where different agents collaborate to deliver the desired output.
Why coordination matters
Coordination is the scarce input. The world is full of competent components that struggle to find one another, agree on a contract, and deliver a result with clear accountability. When coordination works, scattered talent turns into reliable service.
In finance, FIX messages enable trading firms to communicate with one another. FIX, the Financial Information eXchange protocol, is a simple text format for orders, trades, and market data. Standardising fields allows a fund to connect to multiple brokers and venues once and route without custom adapters.
Standards like FIX reduce bilateral negotiation and push coordination into code. Instead of wasting precious resources on coordination, companies could focus on their core business activities.
The agent economy needs an equivalent. A usable layer must answer four questions:
1. To which agent should the request be routed?
2. What has it actually proven it can do, with evidence that can be replayed?
3. When is it available, and at what budget, so that a planner can compose a team?
4. What happens if it fails, so that there is a simple path for review and remedy?
Get those four right, and coordination happens by default; a sketch of what such a registry entry could look like follows below.
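The field names and values below are hypothetical, not a published Recall schema; the point is simply that each of the four questions maps to a machine-readable answer.

```python
from dataclasses import dataclass

@dataclass
class RegistryEntry:
    """Hypothetical answers to the four coordination questions for one agent."""
    agent_id: str            # question 1: whom to route the request to
    proven_skills: list      # question 2: what it has proven it can do
    evidence_log: str        # question 2: pointer to replayable match logs
    available: bool          # question 3: can it take the task right now?
    price_per_task: float    # question 3: budget information for a planner
    failure_policy: str      # question 4: path for review and remedy

entry = RegistryEntry(
    agent_id="extract-kyc-v3",
    proven_skills=["kyc_field_extraction"],
    evidence_log="onchain://competitions/kyc/2025-09",   # illustrative URI
    available=True,
    price_per_task=0.02,
    failure_policy="refund and flag for human review",
)
print(entry)
```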
I am not arguing that Recall currently owns this layer. However, the winning layer will be neutral in who can register. The claims will be publicly verifiable. It will be modular so that other frameworks can route against it, and resistant to capture because the signal is earned in public.
It unfolds in three phases. First, discovery infrastructure that turns performance into a live, verifiable feed. Second, a coordination layer where planners look up and compose specialists by default. Third, alignment infrastructure that combines automated checks, human sampling, and straightforward underwriting, keeping the network honest at scale.
Phase 1: Discovery infrastructure
We’ll transition from treating leaderboards as gospel to viewing performance as a continuous stream. Every action an agent takes will update a public record of its competence.
Recall’s early platform introduces arenas and rolling challenges where agents appear and execute tasks under uncertainty. The uncertainty serves a critical purpose. It creates what economists call a costly signal. When conditions change daily, agents cannot bluff their way through. A routing agent must genuinely retrieve documents that change daily. A market-making agent cannot hide behind a single lucky trade when Recall tracks its profit and loss weekly, including fees and latency.
Recall writes every result to a ledger that anyone can audit. Performance histories, curation values, and even algorithm parameters will be inspectable on-chain. This means anyone can replay and verify the measurement. Other systems can compose on top of it. The emphasis is on provenance and the ability to replay everything. Instead of flashy demos, we will seek verifiable receipts for claims made by various AI models.
Recall Rank sits on top of this performance feed. It converts streams of events into portable reputation scores and keeps access to them open.
Discovery is not dependent on agents alone. We also need people who know what to look for. This is where curators enter the picture. They can stake tokens behind agents and outcomes, adding valuable information to the protocol. Their signal helps reduce uncertainty around rankings, though it never overrides actual measured performance. When curators back winners, their influence compounds. When they repeatedly get it wrong, their weighting is reduced.
This achieves two goals simultaneously. It channels scarce attention and budget toward promising newcomers, solving the cold-start problem. It also makes false backing expensive because unsubstantiated claims cost real money.
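Here is one hedged sketch of how curator influence could compound or decay with a track record; the multiplicative update and its parameters are assumptions for illustration, not Recall’s rule.

```python
def update_curator_weight(weight, backed_winner, lr=0.25, floor=0.05, cap=5.0):
    """Raise a curator's influence when a call proves right, cut it when wrong."""
    weight *= (1 + lr) if backed_winner else (1 - lr)
    return max(floor, min(cap, weight))

weight = 1.0
for call_was_right in [True, True, False, True, True]:    # mostly correct calls
    weight = update_curator_weight(weight, call_was_right)
print(round(weight, 2))   # ~1.83: influence compounds above the starting point

weight = 1.0
for call_was_right in [False, False, False, True, False]: # mostly wrong calls
    weight = update_curator_weight(weight, call_was_right)
print(round(weight, 2))   # ~0.40: influence shrinks well below the starting point
```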
Numerai is a good example here. Data scientists submit weekly predictions, putting the NMR token at risk through their models. If their live performance meets or exceeds the benchmark, their stake earns; if it underperforms, a portion of the stake is slashed. Numerai aggregates the best signals into a single meta model and directs more capital toward what works.
Data compounding acts as a moat. More events produce better rankings. Better rankings bring better agents and more demanding challenges. That makes the system hard to copy quickly: you can copy code, but not its history.
Phase 2: Coordination infrastructure
Complex tasks in the AI world require multiple skills to come together. Planning, decomposing, routing, executing, reconciling, and writing back all need seamless coordination. In a world of specialised agents, the planner itself becomes an agent that calls the registry for help. This registry is where Recall plays a crucial role as a coordination layer in the near term.
Recall as DNS for AI agents
The Internet offers a perfect parallel. Websites became more innovative and more interactive over time, but that wasn’t why the internet scaled. It scaled because we standardised how sites were found and trusted. DNS gave us human-readable names mapped to IP addresses through a distributed, cached directory. This decoupled identity from location. Type a domain, and it resolves to the correct server from any network. Recall has the potential to become the DNS for AI agents, enabling a future where services can move or scale without breaking links.
Because AI gets discovered from many surfaces, the registry must be permissionless and composable. Recall Rank integrates into search bars, chats, marketplaces, and APIs. It returns a verifiable index of capabilities that any orchestrator can route against. Once lookups become cheap and trustworthy, composition becomes the default.
We can learn valuable lessons from how routers matured in finance. On Solana, Jupiter became the start page for swaps by earning the first query. It returns RFQ-style quotes from many venues, allowing users to accept or reject a route. This cross-industry learning proves invaluable in our quest to build a robust coordination layer.
A critical insight from that success is that the registry, having won the first task, observes what worked and what failed. It uses that evidence to ensure that the next task goes to a better agent. It is only possible when outcomes are logged and compared.
Phase 3: Alignment infrastructure
As coordination scales, the costs of mistakes scale too. A failure cascades through every agent that depends on the faulty one.
The economic structure strikes a balance between incentives and oversight. Token holders propose new challenges, set evaluation criteria, and direct rewards toward skills that matter most. High-ranking agents and accurate curators earn proportionally higher yields because rewards link directly to Recall Rank. Performance determines payout.
For critical tasks, the team is in the early stages of exploring the addition of insurance-style protection through small fees that cover clearly defined failure modes. Claims would be reviewed against public logs by judge agents and, where needed, small human juries.
This gives the system its economic logic. Skill pools and sponsor programmes channel resources towards genuine demand. The ranking methods remain honest because the people using them have a stake in the game.
Competition
Three distinct groups are shaping how people find and trust AI. First, the big AI companies, whose discovery lives inside walled ecosystems. Second, centralised evaluation platforms and leaderboards that score models off-chain. Third, Web3-native networks and marketplaces that attempt to encode reputation on-chain. Each group optimises for a different thing: distribution and product control, reproducibility, or on-chain network signals. Recall is the neutral layer that turns live, verifiable results into a portable reputation with Recall Rank.
1) Big AI companies
The big companies excel at creating great models and testing them within their own walls. OpenAI, Anthropic, Google DeepMind, xAI, and Meta all conduct extensive internal benchmarks, alignment checks, and product A/B tests. If they launch agent stores, discovery will likely resemble an app store: curated shelves, policy-driven categories, and opaque ranking. Distribution is the moat here; neutrality is not.
This approach can give builders the distribution they require, but it comes at a cost. It concentrates power over which agents are seen more often by the users and how they are scored with a single entity. Builders must deal with lock‑ins, fall in line with arbitrarily shifting policies, and have limited access to raw logs. Users must trust claims without being able to inspect the evidence.
Recall takes the other route. It is a neutral protocol with public rules.
Let’s look at trading with two agents as an example. One agent has five-star reviews but offers few details about how it achieved its results. Another shows a clear monthly record: sixty-two wins from one hundred live head-to-head rounds. Each round used a $25,000 risk cap. Worst drawdown 5%. Median end-to-end latency 240 ms. It also lists where orders were filled (35% CEX, 65% DEX). The runbook notes retries, gas spent, and when circuit-breakers fired. Clearly, the second agent is far easier to underwrite, because its cost and behaviour are spelt out.
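That second record is easy to represent as data. A hypothetical sketch of what such a verifiable scorecard could contain follows; the field names and runbook values are invented for illustration, while the headline numbers come from the example above.

```python
agent_record = {
    "agent_id": "trader-x",
    "window": "2025-09",
    "live_rounds": 100,
    "wins": 62,
    "risk_cap_usd": 25_000,
    "worst_drawdown_pct": 5.0,
    "median_latency_ms": 240,
    "venue_split": {"cex": 0.35, "dex": 0.65},
    "runbook": {"retries": True, "gas_spent_usd": 312.4, "circuit_breaker_trips": 2},
    "evidence": "onchain://recall/competitions/trading/2025-09",  # replayable logs (illustrative URI)
}

# An underwriter can work directly off fields like these.
win_rate = agent_record["wins"] / agent_record["live_rounds"]
print(f"win rate {win_rate:.0%}, worst drawdown {agent_record['worst_drawdown_pct']}%")
```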
2) Centralised evaluation and leaderboards
Centralised evaluation tools helped set a starting point. LMArena started as an academic research project called Chatbot Arena, but it’s now a company that allows people to test AI models head-to-head and then vote on the one they prefer. This voting is used to create a leaderboard across different domains. The leaderboard has collected millions of votes and over 300 models across domains like Text, WebDEV, Vision, Text-to-Image, etc. But at the same time, it is also subjective and off‑chain. There is no stake for being right and no incentive for builders beyond bragging rights.
Kaggle competitions have powered machine learning progress for over a decade. Hidden test sets and clear metrics maintain integrity whilst competitions run. Once a competition ends, however, the page freezes, and participants move on. Reputation stays locked within individual tasks rather than travelling across challenges or time periods.
Recall preserves the strengths of existing approaches while trying to address their limitations. The system leverages the benefits from head-to-head comparisons and enhances them with blockchain verification and meaningful incentives. It maintains a composable reputation, ensuring it can also exist outside of Recall.
3) Web3 native: on-chain AI networks and marketplaces
When examining the on-chain attempts to organise AI, it’s clear that they share similarities, but they don’t measure the same thing. SingularityNET, for instance, has turned the concept of an AI app store into a protocol. Providers list models and APIs, users pay per call, and reputation accumulates through usage, star ratings, and governance stake. It facilitates distribution and makes it easier to find a service. What it does not do is subject that service to controlled pressure. There is little task-level telemetry, no head-to-head proof under a shared rule set.
Fetch.ai’s AgentVerse pushes discovery a step further by reading the network itself. Agents that other strong agents frequently message rise in their centrality-style ranking. That is useful for routing inside the graph. The flip side is that popularity may be mistaken for quality. A cold start agent with real skill but few edges can sit under-ranked until the graph catches up.
Bittensor builds a marketplace for model utility at the protocol layer. Miners run models, validators score outputs, and the network pays for helpful work. It is a powerful incentive system that keeps models learning. The signal is less legible to a buyer who wants to know who is best at a specific skill today. It optimises for collective usefulness rather than interpretable, skill-by-skill standings.
Oracle-style networks, such as Oraichain, use staking and slashing to enforce correctness for AI responses that interact with smart contracts. The incentives are clear, and the scope is intentionally narrow. They solve for reliability in a small area rather than open competition across many skills.
Recall measures more complex abilities in a public setting. Competitions define the task and the rules in advance. Agents compete on live data. Results are stored on-chain as event logs.
The Marketplace Feeding the Leaderboard
The web had Google to sort its mess. Agents need something similar. However, ranking alone isn’t enough. You don’t get a Google without websites. You don’t get a leaderboard without work that produces a signal.
That is why Recall is building a skill market first. A skill market is where participants can buy and sell the potential for a skill. Users can request a new skill by signalling they would be buyers if it emerged. It is a place where agents perform tasks, evaluators test outcomes, and curators stake on what works. Payment flows through. Outcomes are logged. And the ranking layer sits on top of that activity, not beside it.
It’s a shift in how capability is surfaced. Start with markets, and let curation emerge from there.
What Can Skill Markets Deliver?
Skill markets can have a massive impact on two things right now: funding smaller teams and the financialisation of reputation.
Small teams stand little chance in the current market structure. So far, the whole AI space has been about teams that can buy enough computing power to build better models: more capital, more computing power, better models. Naturally, the space is dominated by well-funded, already-large companies. What about the small teams, though? They rarely get the room to carve out a niche for themselves.
The first significant impact of skill markets lies in changing the market dynamics for small teams. Skill markets let small teams see which skills are in demand, so they can spend scarce resources where they have a higher chance of finding product-market fit (PMF). They can also respond when someone demands a new skill, competing for users on a levelled playing field.
The second major change skill markets bring about is the financialisation of reputation. There has been a lot of talk about how reputation can be financialised through tokenisation. However, we haven’t seen a live implementation of a tokenised reputation system that has lasted.
Things like BitClout and FriendTech, where influencers could tokenise their reputation through their social graph on platforms like Twitter, were in the ‘meta’ for some time. But they failed to establish lasting PMF. I think the reason is that these were generalised reputation systems, not tied to any specific, measurable skill. Most chat groups were probably inactive, and slowly, people moved on to other things.
In Recall’s case, models or agents get reputations for particular things. An agent’s trading reputation is built from verifiable P&L in live competitions. Its writing reputation comes from head-to-head comparisons scored by both automated judges and humans. Because performance is narrow, measurable, and publicly logged, the reputation signal becomes precise enough to price outcomes and risk.
This opens new financial primitives. You can price the risk of an agent hallucinating during a customer support conversation based on the historical accuracy rate of that specific skill. You can insure against an agent’s reliability failure by looking at its consistency score over rolling windows. You can create derivative instruments where payoffs depend on an agent maintaining a certain Recall Rank threshold.
The shift from generalised to domain-specific reputation is what makes financialisation viable. A vague “trust score” cannot be underwritten because the failure modes are undefined. But “this agent maintains 98% accuracy on contract extraction tasks with 150ms median latency” can be priced, insured, and traded. That precision turns reputation from a social signal into financial collateral.
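As a toy illustration of that underwriting logic, here is a simplified expected-loss pricing sketch with assumed numbers; it is not an actual product or formula from Recall.

```python
def quote_premium(historical_accuracy, tasks_covered, payout_per_failure, margin=0.2):
    """Price cover for an agent's failures from its logged accuracy.

    Expected loss = failure rate * tasks * payout; the premium adds a margin.
    """
    failure_rate = 1.0 - historical_accuracy
    expected_loss = failure_rate * tasks_covered * payout_per_failure
    return expected_loss * (1.0 + margin)

# "98% accuracy on contract extraction" from the public log, covering
# 10,000 tasks at a $5 payout per failed extraction.
print(quote_premium(0.98, tasks_covered=10_000, payout_per_failure=5.0))  # 1200.0
```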
What can still break?
Artificial General Intelligence can invalidate the thesis
Recall bets that there will be models that master niche skills. Most real-life tasks will be complex and demand a mix of skills. In this case, somebody needs to identify and coordinate which models can work together. What if there’s one model that takes care of everything?
So far, there is no evidence that AGI will be a reality anytime soon. The world itself will be very different if it becomes a reality.
Platform drag
Big agent platforms own the funnel. They want to keep it that way. A neutral, cross-platform ranking layer poses a threat to their business model. But Recall shares value with integrators. Everyone who contributes, including test designers, evaluators, builders, and platform developers, receives compensation. That helps the ranking layer plug in without political resistance.
Bad flywheel
If challenges stop, the signal goes stale. No new agents. No new demand. Recall has to keep tasks flowing to ensure score freshness. This is in line with how crypto struggles with liquidity in general. For example, financial applications like futures trading DEXs need mechanisms to attract and retain liquidity. Most DEXs do it two ways: subsidise fees and distribute rewards (tokens) for actions that help the venue grow. Recall can take similar steps to ensure they avoid the vicious cycle.
Can this work?
Yes, but not everywhere all at once. The model works best in areas where tasks are repeated frequently and outcomes are measurable. These constraints may feel limiting until you realise how much of the agent economy lies within them rather than outside. Trading, writing, coding, data extraction, customer support, and so on, all fit the bill. This is where most volume lives.
The infrastructure is ready. The question is whether adoption will compound.
Most buyers will choose convenience over verifiability, as is often the case. Closed platforms offer smoother onboarding, familiar interfaces, and immediate scale. Teams optimise for distribution because that is what gets funded.
Recall doesn’t solve these problems by wishing them away. The system makes the alternative, building reputation through verifiable work, progressively cheaper and more valuable. The bet is that enough will find it more affordable to prove their work in public than to keep rebuilding trust from scratch in private.
All said and done, this isn’t inevitable. Agents need to show up. Curators need to stay honest. Task volume needs to make the data meaningful. If those conditions hold, the result is a trust layer that scales without centralising, rewards capability without gatekeeping, and makes reputation portable across platforms.
The promise is clear.
The marketplace creates the evidence.
The ranking layer turns that evidence into action.
If Recall keeps the loop alive, it becomes how skill is created, found, and bought, although you may have to wait years to see this in its final form.
Looking forward to Diwali,
Saurabh Deshpande
Disclaimer: DCo and/or its team members may have exposure to assets discussed in the article. No part of the article is either financial or legal advice.