The $500 AI Accountant

Why FINNX couldn’t make Edith reliable

Hey - It’s Nico.

Welcome to another Failory edition. This issue takes 5 minutes to read.

If you only have one, here are the 3 most important things:

This Week In Startups

🔗 Resources

How to build your AI GTM system

Why we're bullish on loops

AI customer support for startups. Up to 90% off your first year * 

📰 News

Snap finally debuts its long-awaited AR glasses

ChatGPT’s market share slips below 50% for first time

💸 Fundraising

* sponsored

Fail(St)ory

Would you hire Edith?

FINNX built Edith, an AI finance employee for accounting teams buried in manual work. The company shut down in June 2026, after nearly two years of building.

It is another vertical AI startup that found a painful workflow, then learned how far a convincing demo is from software trusted to run that workflow.

What Was FINNX:

FINNX sold Edith as “your first finance team member.” The product was built for the daily accounting work that falls between software, spreadsheets, inboxes, and someone following up when the process breaks.

Edith was meant to pull invoices from email, post them into the right systems, match them to purchase orders, flag unusual items, chase vendors, and send messy cases to a human. Month-end reconciliation was part of the pitch too.

The buyer was a finance team that already had accounting software, AP tools, and approval workflows, yet still spent time cleaning up missing information and mismatched records. 

Their thesis was that existing finance software handled discrete tasks but left the exception-heavy “last mile” behind: missing documents, PO mismatches, timing differences, and judgment calls.

FINNX wanted Edith to sit in that gap. The company’s pitch was that finance tools automate individual tasks, while Edith would carry a process through to completion.

Edith was built around deterministic, classical AI rather than LLM-first agents, with an emphasis on rules, traceability, and auditability. Finance teams need to understand why an invoice was matched, why an exception was flagged, and how a number reached the ledger.

This meant the team had to build a lot before Edith could operate safely. The product needed access to accounting systems, bank feeds, ERP tools, AP workflows, approval rules, and each customer’s own tolerances.

A vendor match that works for one company can be wrong for another because of different chart-of-account structures, approval policies, tax treatment, or purchasing habits.
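To make the deterministic idea concrete, here is a minimal sketch of a rule-based invoice-to-PO match that records why it reached its decision. It is not FINNX's code: the names `Invoice`, `PurchaseOrder`, `MatchResult`, `match_invoice`, and `tolerance_pct` are all hypothetical, and a real system would need many more rules.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class Invoice:
    vendor: str
    po_number: str
    amount: Decimal


@dataclass(frozen=True)
class PurchaseOrder:
    vendor: str
    po_number: str
    amount: Decimal


@dataclass
class MatchResult:
    status: str  # "matched" or "needs_review"
    trace: List[str] = field(default_factory=list)  # human-readable audit trail


def match_invoice(invoice: Invoice, po: PurchaseOrder, tolerance_pct: Decimal) -> MatchResult:
    """Apply fixed rules in order and record why each one passed or failed.

    tolerance_pct is a hypothetical per-customer setting: the largest
    amount difference, as a percentage of the PO, accepted without review.
    """
    result = MatchResult(status="matched")

    if invoice.po_number != po.po_number:
        result.status = "needs_review"
        result.trace.append(f"PO number mismatch: {invoice.po_number} vs {po.po_number}")
        return result
    result.trace.append("PO number agrees")

    if invoice.vendor != po.vendor:
        result.status = "needs_review"
        result.trace.append(f"vendor mismatch: {invoice.vendor} vs {po.vendor}")
        return result
    result.trace.append("vendor agrees")

    allowed = po.amount * tolerance_pct / Decimal(100)
    diff = abs(invoice.amount - po.amount)
    if diff > allowed:
        result.status = "needs_review"
        result.trace.append(f"amount differs by {diff}, above tolerance {allowed}")
    else:
        result.trace.append(f"amount differs by {diff}, within tolerance {allowed}")
    return result
```

The same $1,020 invoice against a $1,000 PO goes to a human at a 1% tolerance and passes at 5%, which is the "right for one company, wrong for another" problem in miniature. The trace list is what lets a finance team see why.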

FINNX advertised Edith at $500 per month and compared it with a $70,000-a-year finance hire. 

More than 700 companies joined the waitlist before the shutdown. That showed that finance teams wanted the outcome; FINNX still had to prove Edith could deliver it inside live accounting workflows.

The Numbers:

  • 🏢 Founded: 2024

  • 📍 Headquarters: Singapore

  • 👥 Team size: Around 5 people

  • 🧾 Waitlist: 700+ companies

  • 💸 Capital sought: $1.5M in March 2026

Reasons for Failure: 

  • One bad entry mattered: A wrong invoice match, tax treatment, or journal entry could create cleanup work and audit risk for the customer. Until Edith could handle those mistakes safely, finance teams still had to check the work themselves. The founder said the team “ran out of runway before she was ready.”

  • The technical scope was large for a five-person startup: Edith needed to connect with accounting systems, bank feeds, ERP tools, approval rules, and each customer’s own finance logic. Building that safely required more integration and infrastructure work than a tiny team could finish quickly.

  • Buyers needed category education: FINNX had to explain why its deterministic approach mattered and why a generic AI agent was not enough for accounting. That added work to every sales conversation before the company could even get to implementation or pricing. The founder later said explaining the difference between FINNX and generic AI agents had become a full-time job.

  • The waitlist did not provide enough proof for the next round: More than 700 companies had signed up to hear about Edith, but that interest did not change the financing outcome. The founder said the company’s final VC diligence process made it clear that raising the capital needed to finish the model was unrealistic.

Why It Matters: 

  • Vertical AI is easy to pitch and hard to trust. The closer the product gets to owning a core workflow, the higher the bar for reliability.

  • Waitlists do not reduce implementation risk. Interest is cheap; live customers still need to connect real systems and let the product make real decisions.

Trend

Answer Fusion

OpenRouter shipped a very cool feature last week that deserves more attention than it got.

Fusion sends the same question to several AI models, compares what they say, spots where they disagree, and produces one final answer.

It sounds like a minor upgrade to a model API, but it really points to a different way of building AI products: treat a single model response as a draft, then give the system a way to challenge it before the user sees it.

Why It Matters

  • The best AI answer may come from a process, not a model. A strong model can still miss an edge case, follow a bad assumption, or give a convincing answer with thin reasoning. Fusion adds a second layer that looks for overlap, disagreement, and missing coverage before producing the final response.

  • Startups can pay for extra confidence only where it matters. Running several models costs more and takes longer, so this will not replace everyday chat.

  • Model choice becomes less visible to the user. Customers rarely care whether an answer came from GPT, Claude, Gemini, or a cheaper open model. They care whether the answer feels complete, catches obvious issues, and gives them a reason to trust it.
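The "only where it matters" point can be sketched as a simple router. This is an illustration under assumptions, not anything OpenRouter documents: `answer`, `high_stakes`, `cheap_model`, `panel`, and `combine` are hypothetical names, and the models are plain Python callables standing in for real API calls.

```python
from typing import Callable, List, Sequence

Model = Callable[[str], str]


def answer(
    prompt: str,
    high_stakes: bool,
    cheap_model: Model,
    panel: Sequence[Model],
    combine: Callable[[List[str]], str],
) -> str:
    """Use one cheap call by default; pay for the whole panel only when flagged."""
    if not high_stakes:
        # Everyday path: a single model call, lowest cost and latency.
        return cheap_model(prompt)
    # High-stakes path: every panel model drafts an answer, then they are merged.
    drafts = [model(prompt) for model in panel]
    return combine(drafts)
```

How `high_stakes` gets set is the product decision. It could be a user toggle, a rule on the request type, or a classifier; the source does not say which works best.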

OpenRouter Fusion

Most AI products still follow a simple pattern: ask one model a question and return its answer.

Fusion changes the sequence. Several models first work on the same prompt independently. A judge then reviews their answers and identifies where they agree, where they conflict, what each one added, and what the group missed. A final model uses that review to write the response.

Think of it like giving the same research task to three analysts, asking a fourth person to compare their notes, then having an editor write the final memo. The value comes from the gaps that appear when several answers sit next to each other.
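Here is a rough sketch of that three-step sequence: panel, judge, synthesizer. It assumes each model is just a Python function that takes a prompt and returns text. The toy judge and synthesizer are stand-ins I made up so the example runs offline; in a real system both would be model calls, and none of these names come from OpenRouter.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]
Review = Dict[str, List[str]]


def run_panel(prompt: str, panel: Dict[str, Model]) -> Dict[str, str]:
    """Step 1: every panel model answers the same prompt independently."""
    return {name: model(prompt) for name, model in panel.items()}


def toy_judge(prompt: str, drafts: Dict[str, str]) -> Review:
    """Step 2 (stand-in): split drafts into claims, then find overlap and conflict.

    Assumes at least one draft and that claims are separated by ';'.
    """
    claim_sets = [{claim.strip() for claim in draft.split(";")} for draft in drafts.values()]
    agreed = set.intersection(*claim_sets)
    mentioned = set.union(*claim_sets)
    return {"agreed": sorted(agreed), "contested": sorted(mentioned - agreed)}


def toy_synthesizer(prompt: str, drafts: Dict[str, str], review: Review) -> str:
    """Step 3 (stand-in): write one answer from the judge's review."""
    return "Agreed: " + ", ".join(review["agreed"]) + " | Check: " + ", ".join(review["contested"])


def fuse(prompt: str, panel: Dict[str, Model], judge=toy_judge, synthesizer=toy_synthesizer) -> str:
    """Run the full sequence: independent drafts, a review, then a final answer."""
    drafts = run_panel(prompt, panel)
    review = judge(prompt, drafts)
    return synthesizer(prompt, drafts, review)
```

With two stub analysts, one saying "revenue grew; churn fell" and the other "revenue grew; margin fell", the shared claim lands under "Agreed" and the two unshared claims land under "Check". That is the gap-spotting step from the analyst analogy.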

OpenRouter tested the idea on 100 deep-research tasks from Perplexity’s DRACO benchmark. Its best panel, Fable 5 and GPT-5.5 with Claude Opus 4.8 synthesizing the results, scored 69.0%. Fable 5 alone scored 65.3%. GPT-5.5 alone scored 60.0%.

Its cheaper panel is even more relevant for startups. Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro reached 64.7%, close to Fable 5’s score at roughly half the cost. That creates room for products that reserve expensive frontier models for judging and final synthesis while letting cheaper models do the first pass.

A Pattern That Keeps Reappearing

  • Together AI’s Mixture-of-Agents came early. In 2024, Together showed a layered system where multiple open-source models generate responses, and later models build on earlier outputs. 

  • Karpathy’s LLM Council made it a power-user workflow. His open-source project sends a query to several models, asks them to review and rank each other’s answers anonymously, then has a “Chairman” model compile the final response. It looks like the manual version of what products are now turning into features.

  • Perplexity brought the idea into consumer AI. Model Council runs multiple frontier models on the same query, compares their outputs, and uses a separate model to synthesize a higher-confidence answer. That is the same customer promise in a different wrapper: better answers through cross-model comparison.

  • CollectivIQ is taking the enterprise angle. It launched in March 2026 as an AI consensus platform, pulling from ChatGPT, Gemini, Claude, Grok, and other models to give teams more trusted answers. The buyer here is less interested in model fandom and more interested in reducing hallucinations, bias, and vendor dependence.

  • Even n8n has workflow templates for it. Operators can send one question to several models, have them evaluate the answers, and return a final verdict inside an automation.
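The council variant adds a step the others skip: models rank each other's answers without knowing who wrote what. Below is a small sketch of that step. The Borda-style scoring is my own assumption for combining rankings (the projects above may do it differently), and `anonymize` and `borda_scores` are hypothetical helper names.

```python
from typing import Dict, List


def anonymize(drafts: Dict[str, str]) -> Dict[str, str]:
    """Map neutral labels to model names so reviewers cannot favor a brand.

    Returns label -> model name, e.g. "Response A" -> "model-x".
    Supports up to 26 drafts with this simple lettering.
    """
    return {f"Response {chr(ord('A') + i)}": name for i, name in enumerate(sorted(drafts))}


def borda_scores(rankings: List[List[str]]) -> Dict[str, int]:
    """Combine reviewer rankings: first place earns the most points, last earns zero."""
    scores: Dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] = scores.get(label, 0) + (n - 1 - position)
    return scores
```

A chairman model would then get the drafts plus these scores and write the final answer, weighting the top-ranked responses more heavily.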

The Trend

A new layer is starting to emerge above AI models.

The early products all use different names: Model Council, Mixture-of-Agents, consensus AI.

I prefer the name Answer Fusion, because the shared pattern is broader than any one of those labels: multiple models contribute intermediate work, and a system synthesizes that work into one final answer.

The approach makes sense because models differ in task performance, cost, latency, context window, tool use, and reasoning style.

The first products in this category will likely win in places where people already double-check AI manually. They copy a prompt into Claude, GPT, Gemini, and Perplexity, compare the outputs, and try to work out what they missed. Answer Fusion just turns that manual comparison into a product.

Help Me Improve Failory

How useful did you find today’s newsletter?

Your feedback helps me make future issues more relevant and valuable.


That's all for today’s edition.

Cheers,

Nico