LLM Showdown

Compare

One prompt, five top language models — pick the best answer

Stop opening five tabs to ask the same question

5-column Gab AI Deck recipe for direct LLM comparison

LLM Showdown puts the same prompt in front of GPT-5.5, Claude 3.5 Sonnet, Gemini 2.5 Pro, o3, and Llama 3.3 simultaneously. You see the answers side by side and make the call with evidence. Use it to pick a default LLM for an app, decide which model's API to ship on, or settle the "which one is best" debate for a specific kind of work.

How to use this recipe

Click "Use this recipe" to clone the 5-column deck.
Paste the exact same prompt into all five columns; each column is bound to a different model.
Run all five columns in parallel; each column queues independently, so you never wait on one model before seeing another.
Compare the outputs side by side — score them on accuracy, voice, latency, and cost.
Pin the winner and save the deck as a recipe so you can re-run the comparison on future prompts.
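
To keep the scoring step above consistent across runs, you can turn your hand-assigned scores into a single weighted number. The following is a minimal Python sketch, not a feature of the deck: the four criteria come from the list above, but the weights, the 1-to-5 scale, the model names, and the function names are all assumptions to adapt.

```python
# Hypothetical weights for the four criteria; tune them to your use case.
WEIGHTS = {"accuracy": 4, "voice": 2, "latency": 2, "cost": 2}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of hand-assigned 1-to-5 scores (higher is better)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


def rank_models(all_scores, weights=WEIGHTS):
    """Return model names ordered from best to worst weighted score."""
    return sorted(
        all_scores,
        key=lambda model: weighted_score(all_scores[model], weights),
        reverse=True,
    )


# Example with made-up scores for two columns.
example = {
    "model_a": {"accuracy": 5, "voice": 3, "latency": 4, "cost": 2},
    "model_b": {"accuracy": 4, "voice": 4, "latency": 3, "cost": 5},
}
print(rank_models(example))
```

Every criterion is scored so that higher is better, which means a fast or cheap model gets a high latency or cost score. Writing the weights down before you read the answers helps keep the comparison honest.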

Best for

Engineering leads picking a default LLM for a product
Researchers documenting model-behavior deltas
Solo founders deciding which API to ship on
Power users curious which model "gets" their voice
Educators teaching prompt-engineering principles
Reviewers writing comparison content
Anyone replacing five tabs with one comparison deck

LLM Showdown FAQ

Why these five models?

They span the major frontier families (OpenAI, Anthropic, Google, OpenAI's reasoning line, and open weights) and are the models users ask about most. Swap any column to a different model, such as Mistral, Cohere Command, or Grok, via the column header.

Will it score the answers automatically?

No — auto-scoring an LLM's output is hard to do reliably. The deck surfaces all five answers; you make the qualitative call. Add a sixth chat column to ask one model to evaluate the others if you want a meta-take.
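
If you try the sixth-column approach, the evaluation prompt has to carry the original question and every answer. Below is a rough Python sketch of assembling that prompt by hand. It assumes you copy the answers out of the deck yourself, and it labels them "Answer A", "Answer B", and so on rather than by model name, a design choice meant to reduce bias toward familiar names. The function name and prompt wording are illustrative only.

```python
def build_judge_prompt(question, answers):
    """Build a prompt asking one model to compare anonymized answers.

    `answers` maps a model name to its answer text. Returns the prompt and a
    label-to-model key so you can decode the verdict afterwards.
    """
    key = {}
    sections = []
    for index, (model, answer) in enumerate(answers.items()):
        label = f"Answer {chr(ord('A') + index)}"  # Answer A, Answer B, ...
        key[label] = model
        sections.append(f"{label}:\n{answer.strip()}")

    prompt = (
        "Several assistants answered the question below. Compare the answers "
        "for accuracy and clarity, then name the strongest one and explain why.\n\n"
        f"Question:\n{question.strip()}\n\n" + "\n\n".join(sections)
    )
    return prompt, key
```

Keep the returned key outside the judge column and use it to map the verdict back to model names. Treat the judge's opinion as one more input to your own call, not as a score.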

How do I compare cost and latency?

Each column shows runtime in its header; cost depends on the token count and each model's pricing. For systematic benchmarking, log prompts and outputs to your own analytics.
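
As a rough illustration of the cost arithmetic and the logging suggestion, here is a minimal Python sketch. It assumes pricing is quoted per million tokens with separate input and output rates, in US dollars, and that you get token counts from your provider or a tokenizer. The prices in any call are placeholders you supply, and the record fields and file layout (one JSON object per line) are hypothetical choices, not a deck export format.

```python
import json
import time


def estimate_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Estimate the cost of one run from token counts and per-million-token prices."""
    return (
        input_tokens / 1_000_000 * input_price_per_million
        + output_tokens / 1_000_000 * output_price_per_million
    )


def make_record(model, prompt, output, runtime_seconds, cost_usd):
    """Bundle one column's result into a JSON-serializable dict."""
    return {
        "timestamp": time.time(),  # seconds since the Unix epoch
        "model": model,
        "prompt": prompt,
        "output": output,
        "runtime_seconds": runtime_seconds,  # read from the column header
        "cost_usd": cost_usd,
    }


def append_jsonl(path, record):
    """Append one record as a single line of JSON to the log file at `path`."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

For example, `estimate_cost(1_000_000, 500_000, 2.0, 8.0)` charges 2.0 for the input and 4.0 for the output, for a total of 6.0. A line-per-record log is easy to load later into a spreadsheet or dataframe for side-by-side analysis.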

Can I run a multi-turn conversation?

Yes — every column is a full conversation, not a one-shot. Continue the dialogue independently in each column to see how each model handles follow-ups.

Can I include vision-capable models?

Yes. Paste an image into any vision-capable column (GPT-5.5, Gemini 2.5 Pro, Claude 3.5 Sonnet) and that model will reason over it directly.

What about open-weights models?

Llama 3.3 is included by default. Swap to Mistral, DeepSeek, or any other open-weights model the catalog supports via the column header.

Workflow columns

GPT-5.5
Claude 3.5 Sonnet
Gemini 2.5 Pro
o3
Llama 3.3