gpt-5.1 100.0 · gpt-5.4 100.0 · gpt-5.2 99.8 · Claude Opus 4.7 99.1 · Claude Opus 4.5 98.9 · Claude Opus 4.8 98.4 · gpt-4.1 98.0 · Claude Sonnet 4.6 97.7 · gpt-5.4-mini 97.5 · Claude Opus 4.6 97.4

Multi-model consensus · neutral judge

Surface the error one model misses.

One prompt fans out to top models in parallel. A neutral judge from a different lab flags where they disagree — and reconciles them into a single, defensible answer. EU-hosted, fully traceable.

Run the live demo · Get an API key

Reduce the errors one model would miss.

8.0/10 average score AI agents gave the council · See the full results

127 models tracked · 14,538 benchmark runs · 6 languages

prompt

Did the EU AI Act enter into force in 2024?

claude-opus-4.8: Yes, entered into force August 2024.
gpt-5.1: No, that was 2023.
gemini-3-pro: Yes, August 2024.

▲ Conflict detected

Judge · conflict resolved

Yes, August 2024 · confidence: high

Illustrative example — synthetic data
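The fan-out and reconcile flow in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: `ask`, `judge`, `council` and the `CANNED` answers are hypothetical names, the model calls are stubbed with the synthetic answers shown above, and the judge here is a plain majority vote, whereas the page describes a separate judge model from a different lab.

```python
import asyncio
from collections import Counter

# Hypothetical stand-ins for real model calls. The canned answers mirror the
# synthetic example above and are not real model output.
CANNED = {
    "claude-opus-4.8": "yes",
    "gpt-5.1": "no",
    "gemini-3-pro": "yes",
}


async def ask(model: str, prompt: str) -> tuple[str, str]:
    """Pretend to query one model; a real client would do network I/O here."""
    await asyncio.sleep(0)  # yield control, as an awaited HTTP call would
    return model, CANNED[model]


def judge(answers: dict[str, str]) -> dict:
    """Toy judge (assumption): flag a conflict if answers differ, keep the majority."""
    counts = Counter(answers.values())
    verdict, votes = counts.most_common(1)[0]
    return {
        "conflict": len(counts) > 1,
        "verdict": verdict,
        "dissenters": sorted(m for m, a in answers.items() if a != verdict),
        "agreement": votes / len(answers),
    }


async def council(prompt: str, models: list[str]) -> dict:
    # Fan the same prompt out to every proposer concurrently.
    results = await asyncio.gather(*(ask(m, prompt) for m in models))
    return judge(dict(results))


result = asyncio.run(
    council("Did the EU AI Act enter into force in 2024?", list(CANNED))
)
print(result)
```

With these stubbed answers the sketch reports a conflict, keeps "yes" as the verdict, and names gpt-5.1 as the dissenter.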

Live · try it

5 AI models inspect your image — before your audience does.

Image consensus: a council of five vision models catches anatomy, physics and lighting flaws in AI images that a single model misses.

91% defects caught

false positives · real photos

~71% max with one model alone

Try it

More about image consensus →
Pilot 2026-06 · LOKI-35 + real control photos · not a product guarantee.

DEFECT · AI-generated

CLEAN · real photo

Council: gemini-2.5-pro ✓ · gpt-4o ✓ · fable-5 ✓ · gemini-flash ✗ · gpt-4o-mini ✗

3 of 5 saw it. One model alone would have missed it — hence a council.
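One way to read the "3 of 5" result is as a simple quorum over per-model votes. The sketch below assumes a strict-majority rule; the `quorum` parameter and the function name are illustrative choices, since the page does not say how the council combines votes.

```python
# Votes copied from the council line above (True = the model flagged a defect).
votes = {
    "gemini-2.5-pro": True,
    "gpt-4o": True,
    "fable-5": True,
    "gemini-flash": False,
    "gpt-4o-mini": False,
}


def council_flags_defect(votes: dict[str, bool], quorum: float = 0.5) -> bool:
    """Flag the image when more than `quorum` of the models saw a defect.

    The 0.5 default (strict majority) is an assumption for illustration.
    """
    return sum(votes.values()) / len(votes) > quorum


flagged = council_flags_defect(votes)
missed_by = [model for model, saw_it in votes.items() if not saw_it]
print(flagged, missed_by)
```

Here the council flags the image, while either of the two models in `missed_by` would have passed it if asked alone.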

Consensus results · live

The agents grade the council

AI agents and people rate every live council answer on whether a second opinion actually changed the outcome. So far the council most often catches a blind spot one model alone missed.

See the full results

8.0/10 average score AI agents gave the council

anthropic/claude-opus-4-8 + google/gemini-2.5-pro + openai/gpt-5.4 + openrouter/deepseek/deepseek-v3.2 + openrouter/meta-llama/llama-4-maverick · ⚖ openai/gpt-4o

most useful line-up right now

Live rankings

Top models this week

Claude Opus 4.7 · Anthropic · 9,352 ms · 99.1

Claude Opus 4.5 · Anthropic · 7,494 ms · 98.9

Sample data

Top models — Scientific Reasoning

#   model              provider   latency   quality  $ / 1M out  hosting
──────────────────────────────────────────────────────────────────────────────
01  Mistral Large 3    Mistral    780ms ↓   87       $2.40       eu privacy
02  Claude Sonnet 4.6  Anthropic  920ms ·   90       $3.60       us-hosted
03  Llama 3.3 405B     Meta       1.18s ↑   86       $3.10       self-hostable
04  Gemini 2.5 Pro     Google     1.42s ↑   92       $7.80       us-hosted
05  GPT-5o             OpenAI     1.64s ·   94       $11.20      us-hosted
06  Claude Opus 4.7    Anthropic  1.82s ↑   96       $14.50      us-hosted

Sample · methodology pending

how we test →

Judge verdicts

4,628 evaluations across 86 models — counts only, no customer prompts

⚖️ Most endorsed: Claude Opus 4.6 (99% accurate)

model                              % ok   ok   partial   not-ok   runs
──────────────────────────────────────────────────────────────────────────
gpt-3.5-turbo-0125                 69%    67   15        15       97
gpt-4.1-mini                       95%    91   5         0        96
gpt-4-0613                         81%    78   15        3        96
gpt-3.5-turbo-1106                 66%    63   18        15       96
gpt-4o-mini-2024-07-18             79%    76   12        8        96
Gemini Flash Latest                54%    52   16        28       96
Gemini 2.5 Flash                   20%    19   10        67       96
gpt-4o-search-preview-2025-03-11   85%    82   11        3        96
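The "% ok" figures above appear to be the share of runs judged fully ok, rounded to a whole percent. That reading is an inference from the published counts, not a documented formula; the sketch below reproduces it for three of the rows, with `ok_rate` as a hypothetical helper name.

```python
def ok_rate(ok: int, partial: int, not_ok: int) -> int:
    """Share of runs judged fully ok, as a whole-number percentage."""
    runs = ok + partial + not_ok
    return round(100 * ok / runs)


# (ok, partial, not-ok) counts copied from the verdict list above.
verdicts = {
    "gpt-4.1-mini": (91, 5, 0),
    "gpt-4-0613": (78, 15, 3),
    "Gemini 2.5 Flash": (19, 10, 67),
}

rates = {model: ok_rate(*counts) for model, counts in verdicts.items()}
print(rates)
```

Partial verdicts count against the rate in this reading, which is why gpt-4-0613 lands at 81% despite only 3 not-ok runs.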

Claude Fable 5 — intelligence test

Independent, judge-scored results across our task categories — from real test runs, refreshed continuously.

Read the full Fable 5 analysis →

Overall score · /100

20 judge-scored runs

Score by task category

Multilingual

100

Reasoning

Coding

Creative

Factual

Median response time

Multilingual: 9.1s

Reasoning: 9.5s

Coding: 11.1s

Creative: 5.7s

Factual: 7.0s

Each answer is scored 0–100 by an independent judge model on accuracy, completeness, reasoning and format. Lower factual scores reflect our deliberately hard knowledge probes.

Release notes →

Blind-spot detection

See where the models split.

Across our weekly intelligence tests, a neutral judge scores every model. These are the questions where the models disagreed most — the blind spots a single model would have hidden. Anonymised; no customer prompts are ever shown.

models scored

distinct judges

4,628 judged runs

Model                                  agreed   judge flagged
─────────────────────────────────────────────────────────────
Gemini 2.5 Flash                       19       77
Gemini 2.5 Pro                         20       73
Qwen3.5-9B                             7        20
Qwen3.5-397B-A17B                      7        20
Gemini Pro Latest                      30       64
Gemini 3.1 Pro Preview Custom Tools    31       60

See the full leaderboard →

Models ranked

Top 10 AI models

All models →

#    provider    model               quality score   ms p50
───────────────────────────────────────────────────────────
1    OpenAI      gpt-5.1             100.0           3,518
2    OpenAI      gpt-5.4             100.0           2,616
3    OpenAI      gpt-5.2             99.8            3,074
4    Anthropic   Claude Opus 4.7     99.1            9,352
5    Anthropic   Claude Opus 4.5     98.9            7,494
6    Anthropic   Claude Opus 4.8     98.4            7,266
7    OpenAI      gpt-4.1             98.0            2,255
8    Anthropic   Claude Sonnet 4.6   97.7            8,095
9    OpenAI      gpt-5.4-mini        97.5            1,677
10   Anthropic   Claude Opus 4.6     97.4            8,818

Pricing

No fee on single calls. You only pay the fee on consensus.

Ask one model and you pay just its tokens plus a small tier margin — no platform fee. The per-call fee applies only to multi-model consensus checks. 100 consensus checks free every month, no card needed; bundles from €10/month for 500 calls. Every token itemised, nothing hidden.

plan                    price     consensus calls   token use
──────────────────────────────────────────────────────────────────
Free                    €0/mo     100 calls/mo      provider +5%
Starter                 €10/mo    500 calls         provider +4%
Studio (most popular)   €25/mo    2,000 calls       provider +3%
Scale                   €50/mo    5,000 calls       provider +2%

Founders prices, locked through 2027 · PAYG also available · "token margin" = the small % we add on the model provider's own token price, lower on higher tiers

Call type · What you pay · Details

Single-model call

What you pay: tokens + margin

Details: No call-fee — only consensus checks carry the per-call fee. You pay the model provider's token price plus your tier margin (+2–5%). Example: a small model on ~4k tokens ≈ €0.001.

Consensus call

What you pay: call-fee + tokens + margin

Details: The call-fee varies per package (PAYG founders: 2c/proposer + 3c/judge, a 3+1 council = 9c; bundles: counts against your monthly quota; over quota: 1.5c/call). On top: the model provider's tokens + your tier margin.

Bring your own key (BYOK)

What you pay: call-fee only

Details: On consensus you pay only the per-package call-fee — your own key bills the provider directly, so no token cost and no margin from us. A single-model BYOK call costs nothing.
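The three call types above reduce to a small amount of arithmetic. The sketch below uses the PAYG founders fees quoted in the table (2c per proposer, 3c per judge) and treats the tier margin as a percentage added on top of the provider's token cost. The function names and the €0.02 token cost in the usage lines are hypothetical, and bundle quotas are deliberately left out.

```python
def call_fee_cents(proposers: int, judges: int = 1) -> int:
    """PAYG founders call-fee: 2c per proposer plus 3c per judge."""
    return 2 * proposers + 3 * judges


def single_call_cost_eur(token_cost_eur: float, margin_pct: float,
                         byok: bool = False) -> float:
    """Single-model call: provider tokens plus tier margin, no call-fee."""
    if byok:
        return 0.0  # your own key bills the provider directly
    return token_cost_eur * (1 + margin_pct / 100)


def consensus_cost_eur(proposers: int, token_cost_eur: float,
                       margin_pct: float, byok: bool = False) -> float:
    """Consensus call: call-fee, plus tokens and margin unless BYOK."""
    fee_eur = call_fee_cents(proposers) / 100
    if byok:
        return fee_eur
    return fee_eur + token_cost_eur * (1 + margin_pct / 100)


# A 3+1 council, assuming a hypothetical provider token cost of EUR 0.02
# and the Free tier's +5% margin.
print(call_fee_cents(3))
print(round(consensus_cost_eur(3, 0.02, 5), 4))
print(consensus_cost_eur(3, 0.02, 5, byok=True))
```

The first printed value is the 9c fee for a 3+1 council quoted above; with BYOK only that fee remains.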

No per-seat fee. No single-call fee, ever. Every consensus receipt is itemised per model, per token, in and out.

Every cent, itemised

illustrative example

model                 in      out     cost
──────────────────────────────────────────────────
claude-haiku-4.5      812     540     €0.0041
gpt-4o                812     610     €0.0072
gemini-2.5-flash      812     498     €0.0029
judge (gpt-4o)        —       240     €0.0038
──────────────────────────────────────────────────
orchestration                         included
total                                 €0.0180

Accurate to the last token · your real receipt contains your exact counts

Estimate your cost

Your plan

Consensus calls / month: 500 (range 100 to 5k)

€10.00 estimated / month

Bundle price, with overage at 1.5c/call above quota
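The estimator's figure can be approximated as the bundle price plus 1.5c for every consensus call above the quota. This is a rough sketch under that assumption: the `BUNDLES` table copies the paid plans listed on this page, `estimate_monthly_fee_eur` is a hypothetical name, and provider tokens and tier margin are not included.

```python
# plan: (monthly price in EUR, included consensus calls), as listed above
BUNDLES = {
    "Starter": (10.0, 500),
    "Studio": (25.0, 2000),
    "Scale": (50.0, 5000),
}
OVERAGE_CENTS_PER_CALL = 1.5


def estimate_monthly_fee_eur(plan: str, calls: int) -> float:
    """Bundle price plus overage; token costs are billed separately."""
    price_eur, quota = BUNDLES[plan]
    extra_calls = max(0, calls - quota)
    return round(price_eur + extra_calls * OVERAGE_CENTS_PER_CALL / 100, 2)


print(estimate_monthly_fee_eur("Starter", 500))
print(estimate_monthly_fee_eur("Starter", 700))
```

At 500 calls the Starter estimate matches the €10.00 shown above; at 700 calls the 200 extra calls add €3.00.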

Community

What the community is voting on

Top-rated test answers

Write a Python function `is_palindroom(s: str) -> bool` that returns True if the input string is a palindrome (ignoring case and punctuation). Add two test cases.

Claude Opus 4.7 ↑ 2100

What is the name of the protein discovered by Dr. Elena Voskresensky in 2019 that reverses telomere shortening in human cells?

Claude Opus 4.7 ↑ 2100

In which year did the European Union introduce the GDPR regulation?

Claude Opus 4.7 ↑ 2100

Real prompts, real latency, real scores. Three-tier framework so cost stays under control without compromising transparency.

Tier A

Full coverage

Speed + intelligence test daily across all four languages.

Tier B

Speed only

Latency and uptime sampled four times per day.

Tier C

Health ping

Up/down check every fifteen minutes.
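The three tiers can be written down as a small schedule table, which makes the difference in daily run counts explicit. The structure and names below are illustrative only. The six-hour interval for Tier B assumes the four daily samples are evenly spaced, which the page does not state.

```python
from datetime import timedelta

# Hypothetical config shape; check names and intervals follow the tier text above.
TIERS = {
    "A": {"checks": ("speed", "intelligence"), "interval": timedelta(days=1)},
    "B": {"checks": ("latency", "uptime"), "interval": timedelta(hours=6)},
    "C": {"checks": ("up/down",), "interval": timedelta(minutes=15)},
}


def runs_per_day(tier: str) -> int:
    """How many times a tier's checks fire in 24 hours."""
    return int(timedelta(days=1) / TIERS[tier]["interval"])


for name in TIERS:
    print(name, runs_per_day(name), TIERS[name]["checks"])
```

Under these assumptions Tier A runs once a day, Tier B four times, and Tier C ninety-six times.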

Live · 120+ models available

Try any model — right here

Pick a model, type a prompt, see the answer stream. No sign-up, no wallet, no context-switching.

Open the live tester →
