Surface the error one model misses.
One prompt fans out to top models in parallel. A neutral judge from a different lab flags where they disagree — and reconciles them into a single, defensible answer. EU-hosted, fully traceable.
Reduce the errors one model would miss.
8.0/10 average score AI agents gave the council
See the full results
- 127 models tracked
- 14,538 benchmark runs
- 6 languages
Did the EU AI Act enter into force in 2024?
- claude-opus-4.8: Yes, entered into force August 2024.
- gpt-5.1: No, that was 2023.
- gemini-3-pro: Yes, August 2024.
Illustrative example — synthetic data
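A minimal sketch of the fan-out step, assuming nothing about the real service: the `ask` function and the `CANNED` replies are hypothetical stand-ins that mirror the synthetic example above, and a simple majority vote is swapped in for the judge model purely to show how a dissenting answer could be flagged.

```python
import asyncio
from collections import Counter

# Hypothetical stand-ins for real model calls. The canned replies mirror the
# synthetic example above and are not real model output.
CANNED = {
    "claude-opus-4.8": "yes",
    "gpt-5.1": "no",
    "gemini-3-pro": "yes",
}


async def ask(model: str, prompt: str) -> tuple[str, str]:
    """Pretend to call one model; a real client would do network I/O here."""
    await asyncio.sleep(0)
    return model, CANNED[model]


async def fan_out(prompt: str, models: list[str]) -> dict[str, str]:
    """Send the same prompt to every model concurrently."""
    replies = await asyncio.gather(*(ask(m, prompt) for m in models))
    return dict(replies)


def find_dissenters(answers: dict[str, str]) -> tuple[str, list[str]]:
    """Return the majority answer and the models that disagree with it.

    Assumption: majority vote replaces the judge model in this sketch.
    """
    majority, _ = Counter(answers.values()).most_common(1)[0]
    return majority, [m for m, a in answers.items() if a != majority]


answers = asyncio.run(
    fan_out("Did the EU AI Act enter into force in 2024?", list(CANNED))
)
majority, dissenters = find_dissenters(answers)
print(majority, dissenters)  # yes ['gpt-5.1']
```

The point of the sketch is the shape of the flow: one prompt, several concurrent calls, then a separate step that surfaces the answer that does not match the rest.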
5 AI models inspect your image — before your audience does.
Image consensus: a council of five vision models catches anatomy, physics and lighting flaws in AI images that a single model misses.
More about image consensus →
Pilot 2026-06 · LOKI-35 + real control photos · not a product guarantee.
3 of 5 saw it. One model alone would have missed it — hence a council.
Consensus results · live
The agents grade the council
AI agents and people rate every live council answer on whether a second opinion actually changed the outcome. So far the council most often catches a blind spot one model alone missed.
See the full results
most useful line-up right now
Live rankings
Top models this week
Sample data
Top models — Scientific Reasoning
01 Mistral Large 3 (Mistral), 780ms ↓
02 Claude Sonnet 4.6 (Anthropic), 920ms ·
03 Llama 3.3 405B (Meta), 1.18s ↑
04 Gemini 2.5 Pro (Google), 1.42s ↑
05 GPT-5o (OpenAI), 1.64s ·
06 Claude Opus 4.7 (Anthropic), 1.82s ↑
Sample · methodology pending
how we test →
Judge verdicts
4,628 evaluations across 86 models — counts only, no customer prompts
Claude Fable 5 — intelligence test
Independent, judge-scored results across our task categories — from real test runs, refreshed continuously.
Score by task category
Median response time
Each answer is scored 0–100 by an independent judge model on accuracy, completeness, reasoning and format. Lower factual scores reflect our deliberately hard knowledge probes.
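As a rough illustration of how four rubric criteria could roll up into one 0-100 score, here is a sketch. It assumes each criterion is scored 0-100 separately and that all four carry equal weight. The page does not state the weighting, so `overall_score` and the example numbers are hypothetical.

```python
CRITERIA = ("accuracy", "completeness", "reasoning", "format")


def overall_score(rubric: dict[str, float]) -> float:
    """Combine per-criterion judge scores into a single 0-100 score.

    Assumption: equal weights. The actual weighting is not documented.
    """
    missing = [c for c in CRITERIA if c not in rubric]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if not 0 <= rubric[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return round(sum(rubric[c] for c in CRITERIA) / len(CRITERIA), 1)


# Made-up scores for illustration only.
example = {"accuracy": 90, "completeness": 80, "reasoning": 100, "format": 70}
print(overall_score(example))  # 85.0
```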
Release notes →
See where the models split.
Across our weekly intelligence tests, a neutral judge scores every model. These are the questions where the models disagreed most — the blind spots a single model would have hidden. Anonymised; no customer prompts are ever shown.
Models ranked
Top 10 AI models
01 gpt-5.1 (OpenAI) · quality score 100.0 · 3,518 ms p50
02 gpt-5.4 (OpenAI) · quality score 100.0 · 2,616 ms p50
03 gpt-5.2 (OpenAI) · quality score 99.8 · 3,074 ms p50
04 Claude Opus 4.7 (Anthropic) · quality score 99.1 · 9,352 ms p50
05 Claude Opus 4.5 (Anthropic) · quality score 98.9 · 7,494 ms p50
06 Claude Opus 4.8 (Anthropic) · quality score 98.4 · 7,266 ms p50
07 gpt-4.1 (OpenAI) · quality score 98.0 · 2,255 ms p50
08 Claude Sonnet 4.6 (Anthropic) · quality score 97.7 · 8,095 ms p50
09 gpt-5.4-mini (OpenAI) · quality score 97.5 · 1,677 ms p50
10 Claude Opus 4.6 (Anthropic) · quality score 97.4 · 8,818 ms p50
No fee on single calls. You only pay the fee on consensus.
Ask one model and you pay just its tokens plus a small tier margin — no platform fee. The per-call fee applies only to multi-model consensus checks. 100 consensus checks free every month, no card needed; bundles from €10/month for 500 calls. Every token itemised, nothing hidden.
Free
€0/mo
100 calls/mo
token use: provider +5%
Starter
€10/mo
500 calls
token use: provider +4%
Studio
€25/mo
2,000 calls
token use: provider +3%
Scale
€50/mo
5,000 calls
token use: provider +2%
Founders prices, locked through 2027 · PAYG also available · "token margin" = the small % we add on the model provider's own token price, lower on higher tiers
No per-seat fee. No single-call fee, ever. Every consensus receipt is itemised per model, per token, in and out.
Every cent, itemised
illustrative example
model               in    out   cost
──────────────────────────────────────────────────
claude-haiku-4.5    812   540   €0.0041
gpt-4o              812   610   €0.0072
gemini-2.5-flash    812   498   €0.0029
judge (gpt-4o)      -     240   €0.0038
──────────────────────────────────────────────────
orchestration                   included
total                           €0.0180
Accurate to the last token · your real receipt contains your exact counts
Estimate your cost
€10.00
Bundle price — overage at 1.5c/call above quota
€10.00
estimated / month
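The estimate can be reproduced with a few lines of arithmetic. The sketch below uses the tier prices, quotas and token margins listed above plus the 1.5c overage per call. It assumes that overage applies the same way on every tier and that token usage is billed on top of the bundle price at provider cost plus the tier margin. `monthly_estimate` is a hypothetical helper, not the calculator's actual code.

```python
from decimal import Decimal

# tier: (bundle price in EUR, consensus calls included, token margin)
TIERS = {
    "Free": (Decimal("0"), 100, Decimal("0.05")),
    "Starter": (Decimal("10"), 500, Decimal("0.04")),
    "Studio": (Decimal("25"), 2000, Decimal("0.03")),
    "Scale": (Decimal("50"), 5000, Decimal("0.02")),
}
OVERAGE_PER_CALL = Decimal("0.015")  # 1.5c per consensus call above quota


def monthly_estimate(
    tier: str,
    consensus_calls: int,
    provider_token_cost: Decimal = Decimal("0"),
) -> Decimal:
    """Estimate one month's bill in EUR, rounded to the cent.

    Assumptions: overage applies identically on all tiers, and tokens are
    billed separately at provider cost plus the tier's margin.
    """
    price, quota, margin = TIERS[tier]
    overage = max(0, consensus_calls - quota) * OVERAGE_PER_CALL
    tokens = provider_token_cost * (1 + margin)
    return (price + overage + tokens).quantize(Decimal("0.01"))


print(monthly_estimate("Starter", 500))                   # 10.00
print(monthly_estimate("Starter", 600))                   # 11.50
print(monthly_estimate("Studio", 2000, Decimal("2.00")))  # 27.06
```

`Decimal` is used instead of floats so that cent-level amounts add up exactly.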
Community
What the community is voting on
Top-rated test answers
Write a Python function `is_palindroom(s: str) -> bool` that returns True if the input string is a palindrome (ignore case, ignore punctuation). Add two test cases.
What is the name of the protein discovered by Dr. Elena Voskresensky in 2019 that reverses telomere shortening in human cells?
In which year did the European Union introduce the GDPR regulation?
Suggested test questions
No suggestions yet.
Run a test and suggest a question →
How we test
Real prompts, real latency, real scores. A three-tier framework keeps cost under control without compromising transparency.
Full coverage
Speed + intelligence test daily across all four languages.
Speed only
Latency and uptime sampled four times per day.
Health ping
Up/down check every fifteen minutes.
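The three tiers above can be written down as a small schedule table. This is only a sketch: the tier keys are hypothetical names, and "four times per day" is assumed to mean evenly spaced six-hour intervals, which the page does not confirm.

```python
from datetime import timedelta

# Hypothetical schedule table derived from the three tiers described above.
SCHEDULE = {
    "full_coverage": timedelta(days=1),    # speed + intelligence test, daily
    "speed_only": timedelta(hours=6),      # assumption: evenly spaced, 4x/day
    "health_ping": timedelta(minutes=15),  # up/down check
}


def runs_per_day(interval: timedelta) -> int:
    """Number of whole runs that fit into 24 hours at the given interval."""
    return timedelta(days=1) // interval


for tier, interval in SCHEDULE.items():
    print(tier, runs_per_day(interval))
# full_coverage 1
# speed_only 4
# health_ping 96
```

The run counts show why the tiers differ in cost: the cheapest check runs 96 times a day, while the full test runs once.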
Try any model — right here
Pick a model, type a prompt, see the answer stream. No sign-up, no wallet, no context-switching.
Open the live tester →