Paid to use AI

Why Yupp’s feedback marketplace failed

Hey - It’s Nico.

Welcome to another Failory edition. This issue takes 5 minutes to read.

If you only have one, here are the 3 most important things:

  • Yupp, a startup that let users compare the responses of over 800 different AI models, has shut down — learn why below

  • The golden rules of agent-first product engineering

  • Anthropic released Claude Mythos Preview, a model so good at finding software vulnerabilities that it can't be released to the public — learn why this matters below

A huge thanks to today’s sponsor, Playbookz. Start scaling your LinkedIn personal brand on autopilot.

⚡️ Most founders are 1 system away from 100k+ reach on LinkedIn (Ad)

Playbookz builds and runs your LinkedIn presence for you:

  • Content → your expertise turned into posts people actually engage with

  • Distribution → paid amplification that drives real reach

  • Pipeline → outbound that turns attention into buyer conversations

You stay focused on the business. They handle the execution. A system designed to grow your reach week after week.

20,000+ reach in 28 days guaranteed.

Plans start at $1.5k/month.

This Week In Startups

🔗 Resources

AI Adoption by the Numbers

The golden rules of agent-first product engineering

📰 News

Meta debuts the Muse Spark model in a ‘ground-up overhaul’ of its AI

Microsoft takes on AI rivals with three new foundational models

Google quietly launched an AI dictation app that works offline

Apple will release a foldable phone later this year

💸 Fundraising

Hermeus raises $350M to build unmanned hypersonic fighters

Chinese robotics startup Spirit AI raises $145M

Fail(St)ory

When judging AI became a job

Yupp raised $33M, signed up 1.3M users, partnered with major AI labs, and still shut down less than a year after launch.

They were building a layer to compare AI models side by side and sell the resulting preference data back to the labs. It made perfect sense in 2024. By early 2026, the market had already moved somewhere else.

What Was Yupp:

Yupp was trying to solve a simple problem that showed up the moment multiple strong AI models existed: Which one should you trust?

Instead of giving you one answer, Yupp showed several answers side by side. ChatGPT. Claude. Gemini. DeepSeek. Llama. Hundreds more. You asked once and compared everything at once.

The pitch was obvious the first time you used it. Don’t pick a model. Ask all of them. They called this your “council of AIs.”

Every time you picked one answer over another, you generated preference data. Quietly. Automatically. At scale. Yupp believed that kind of signal would become extremely valuable as labs raced to improve their models.

So they built a marketplace around it.

Users got access to premium models through a credit system. You spent credits to run prompts. You earned credits back by rating answers. The better your feedback, the more credits you earned.

Later in 2025 they added Cash Out. Some users could convert credits into actual money through Stripe, PayPal, Coinbase, or stablecoin rails. 

Yupp wasn’t selling access to AI; it was buying judgment from users and reselling it to model builders.

The labs were the second side of the marketplace. When users compared answers, they generated structured signals about which models performed better on real prompts from real people. Yupp packaged that into evaluation data labs could use for post-training and benchmarking.

This is why they launched the VIBE Score leaderboard.

Instead of measuring models with synthetic benchmarks, they measured them using live user behavior. Which answer people picked. Which response they trusted. Which one actually helped.
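Leaderboards built from pairwise votes like this are typically scored with an Elo-style rating update, the same approach popularized by crowd-voting arenas. Here's a minimal sketch of that mechanism — the model names, starting rating, and K-factor are illustrative assumptions, not details Yupp disclosed:

```python
# Elo-style rating from pairwise preference votes.
# Each vote says "the user picked winner's answer over loser's answer."

def expected(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one preference vote, nudging both ratings."""
    ra = ratings.setdefault(winner, 1000.0)
    rb = ratings.setdefault(loser, 1000.0)
    e = expected(ra, rb)  # how expected this outcome was
    ratings[winner] = ra + k * (1 - e)  # surprising wins move ratings more
    ratings[loser] = rb - k * (1 - e)

ratings = {}
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for w, l in votes:
    update(ratings, w, l)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)  # model-a ranks first after winning both its votes
```

The appeal of this kind of scoring is exactly what the VIBE pitch claimed: the ranking comes from what real users actually picked, not from a synthetic test set.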

If you could sit between users and models while the world figured out which AI systems worked best, you’d control a very valuable layer of the stack.

And for a while, it looked like they might.

They signed up more than 1.3M users. They were collecting millions of preference signals every month. Several AI labs were already paying for access.

But the window closed before that traction could turn into a durable product category.

The Numbers:

  • 💰 Raised: $33M seed

  • 📍 Founded: June 2024

  • 🚀 Public launch: June 2025

  • 👥 Users: 1.3M+

  • 🤖 Models supported: 500+ at launch, later 800+

  • 🛑 Shutdown announced: March 31, 2026

  • 📉 Lifetime after launch: < 1 year

Reasons for Failure: 

  • They built the right product for the wrong layer of the stack: Yupp optimized for chatbot comparison at exactly the moment AI moved toward agents. The founders said it directly: workflows shifted toward systems with tools, memory, and execution capability. Once users expect AI to complete tasks instead of produce answers, side-by-side response comparison becomes less central. 

  • Their data marketplace thesis didn’t match where labs were spending: The company believed large-scale consumer preference data would become a core input into model training. That logic made sense in 2024. But much of the post-training market shifted toward expert-labeled datasets and specialized evaluators instead of broad consumer feedback. TechCrunch reported that labs increasingly relied on PhDs and domain specialists in reinforcement learning loops. Yupp’s signals were useful. They just weren’t essential enough to anchor a business.

  • Comparing models is interesting once, not forever: Users like seeing multiple answers the first time. It feels powerful. Over time, most people want one system that works. Not eight tabs of competing outputs. Yupp started adding features like “Help Me Choose” because comparison fatigue is real. The product itself hinted at the problem it was trying to solve.

Why It Matters: 

  • Usage doesn’t equal durability. 1.3M users and millions of preferences per month still wasn’t enough. Early traction can validate curiosity without validating a market.

  • AI categories are collapsing faster than normal startup timelines. Yupp launched in June 2025 and shut down by March 2026. Entire product layers now appear and disappear inside a single funding cycle.

Trend

Claude Mythos

Most AI launches are easy to translate. Better chatbot. Better coding copilot. Faster workflow. This one is different.

Anthropic just told the market something a lot of people in AI and security already suspected: once a model gets really good at coding, it starts getting really good at breaking software too. Not because someone built a “cyber model” on purpose. Because the same skills that help fix code also help find ways through it.

That is the story behind Project Glasswing and Claude Mythos Preview.

Anthropic is basically saying: we built something too dangerous to release broadly, so we are handing it to a small circle of defenders first and hoping they can patch enough systems before everyone else catches up.

Why it Matters

  • Coding models are now cyber models. Anthropic is saying the offensive capability was not the product goal. It showed up anyway once the model got good enough at coding and reasoning.

  • The choke point is no longer finding bugs. AI is making discovery faster. The real problem now is triage, patching, and shipping fixes before someone weaponizes them.

  • The advantage goes to whoever gets access first. Glasswing is basically a defender-first rollout. That only works if this capability spreads soon, which is exactly what Anthropic seems to believe.

The Launch

On April 7, Anthropic announced Project Glasswing. The core of it is Claude Mythos Preview, an unreleased frontier model that Anthropic says is its strongest yet for coding and agentic work. It is not being released publicly. Anthropic says the model can produce dangerous cyber outputs and its safeguards are not good enough yet for broad access.

Anthropic claims Mythos can identify and exploit zero-days in every major operating system and major browser. It says the model found subtle bugs that had survived for 10, 20, even 30 years. In one browser case, it reportedly chained four vulnerabilities into a working exploit. In Linux, it found multiple chains that ended with full root access. 

The benchmark numbers are even more blunt. On a Firefox JavaScript-engine exploit benchmark, Anthropic says Opus 4.6 managed working exploits only twice across several hundred attempts. Mythos got to working exploits 181 times, plus many more cases where it reached partial control. 

In broader testing across roughly 1,000 open-source repos and around 7,000 entry points, Anthropic says older models mostly found low-severity crashes. Mythos produced hundreds of more serious crashes and 10 full control-flow hijacks on fully patched targets.

That created an obvious problem. If the model is this good, why release it widely at all? Anthropic’s answer is Project Glasswing. Keep Mythos gated, give it to a selected set of defenders, and use the window before broader proliferation to find and patch vulnerabilities in critical systems.

The launch partners are a who’s who of big tech, cloud, cyber, and critical infrastructure: AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks, and Anthropic. The company also says it extended access to 40-plus additional organizations that build or maintain critical software infrastructure.

The Reception

The reaction has been a mix of fascination, relief, and unease.

On one side, big security and cloud players are leaning in, publicly signing on as Glasswing partners.

On the other side, the controversy is obvious. Anthropic is asking the market to accept a strange premise: the responsible way to handle dangerous cyber capability is to build it first, keep it gated, and trust a private company to decide who gets the head start. That is not regulation. That is self-appointed governance with good branding.

There is also a concentration problem. The first beneficiaries are giant incumbents in cloud, finance, and cybersecurity. Anthropic can point to open-source funding and extra access for maintainers, which helps. Still, the center of gravity is clear.

And the timing is awkward. This launch landed after leaks around the model’s earlier codename and amid broader scrutiny of Anthropic’s own operational security and governance. “Trust us to tightly contain the dangerous system” always sounds weaker when the company has already shown a few cracks.

Help Me Improve Failory

How useful did you find today’s newsletter?

Your feedback helps me make future issues more relevant and valuable.


That's all for today’s edition.

Cheers,

Nico