The LLM Map Is Redrawn — 7 Categories, Who Leads Each in May 2026

The model layer never stops moving. But in May 2026, something clicked into place — the categories finally feel stable even as the models inside them keep rotating. Here’s how I think about the landscape right now.

1. All-Around Flagship

The “can do anything” tier. GPT-5.5 holds the top composite score and is now the default ChatGPT model. Claude Opus 4.7 sits right behind it on code reasoning, with 1M context at flat pricing. Gemini 3.1 Pro is the cheapest US frontier model and the only one doing 1M-token multimodal without a long-context surcharge. Grok 4.20 uses a multi-agent debate architecture to push its hallucination rate to an all-time low.

2. Coding

Qwen 3.6 Max-Preview (Alibaba) just swept six coding and agent benchmarks in a row. But benchmarks and real work are different things — Claude Opus 4.7 wins the “I have a 50-file repo and a ticket” scenario. Different game, different winner.

3. Agentic

GPT-5.5 leads Terminal-Bench 2.0 at 82.7% and OSWorld at 78.7%. If you need an agent that lives in a shell and actually finishes tasks, this is still the pick. Kimi K2.6 is the strongest open-weight contender if you’re self-hosting.

4. Long Context

Llama 4 Scout hit 10 million tokens. Not a typo. For a book, a codebase, a year of logs — that changes what’s even possible. Grok 4.20 does 2M, Claude and Gemini do 1M. The race is now measured in book-lengths, not page counts.
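
To make the window sizes concrete, here is a minimal sketch that checks which of the models above could hold a given corpus in one prompt. The window sizes are the ones quoted in this section; the 4-characters-per-token ratio is only a rough rule of thumb for English text and an assumption on my part, and the function names are made up for illustration.

```python
# Context windows in tokens, as quoted in this article.
CONTEXT_WINDOW_TOKENS = {
    "Llama 4 Scout": 10_000_000,
    "Grok 4.20": 2_000_000,
    "Claude Opus 4.7": 1_000_000,
    "Gemini 3.1 Pro": 1_000_000,
}

# ASSUMPTION: ~4 characters per token is a rough heuristic for English text,
# not a property of any of these models' tokenizers.
CHARS_PER_TOKEN = 4


def estimate_tokens(text_chars: int) -> int:
    """Rough token estimate from a character count (ceiling division)."""
    return -(-text_chars // CHARS_PER_TOKEN)


def models_that_fit(prompt_tokens: int, reserve_for_output: int = 0) -> list[str]:
    """Return models whose quoted window holds the prompt plus reserved output."""
    needed = prompt_tokens + reserve_for_output
    return [name for name, window in CONTEXT_WINDOW_TOKENS.items() if window >= needed]


if __name__ == "__main__":
    # Hypothetical corpus: 6 million characters of logs.
    tokens = estimate_tokens(6_000_000)
    print(tokens, models_that_fit(tokens))
```

With that hypothetical 6-million-character corpus, the estimate is 1.5M tokens, so only the 10M and 2M windows qualify. For real sizing, count tokens with the tokenizer of the model you actually plan to use.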

5. Multimodal

Gemini 3.1 Pro at $12 per million output tokens: native vision, native 1M context, 94.3% on GPQA. It’s the “just works” pick for anything that involves images + long documents at scale. GPT-5.5 adds native audio on top if you need that layer.

6. Open Weight

Kimi K2.6 leads the open leaderboards (1.1T-parameter MoE, #1 among open-weight models on the Artificial Analysis Intelligence Index). DeepSeek V4-Pro at $0.87 per million output tokens is ~34x cheaper than GPT-5.5 at comparable scores. Mistral Large 3 is the strongest non-Chinese open option (Apache 2.0, self-hostable, agentic-tuned).

7. Price-Performance

DeepSeek V4-Flash: $0.07/M output. That’s the number. For bulk summarization, triage, preprocessing, classification — there’s nothing close. The US-China frontier pricing gap has widened to 5–25x. The old mental model of “open-weight Chinese, closed-weight Western” no longer holds. Alibaba just closed weights on its Qwen flagship. The assumptions have flipped.
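
To see what those per-token prices mean for a bulk job, here is a back-of-the-envelope sketch. It uses only the output prices quoted in this article and ignores input-token cost, which isn't quoted here; the workload size and the function name are hypothetical.

```python
# Output prices in USD per million tokens, as quoted in this article.
OUTPUT_USD_PER_MILLION = {
    "DeepSeek V4-Flash": 0.07,
    "DeepSeek V4-Pro": 0.87,
    "Gemini 3.1 Pro": 12.00,
}


def output_cost_usd(model: str, output_tokens: int) -> float:
    """Cost of generating `output_tokens` tokens; input-token cost is ignored."""
    return OUTPUT_USD_PER_MILLION[model] * output_tokens / 1_000_000


if __name__ == "__main__":
    # ASSUMPTION: hypothetical bulk job of 1M documents, 500 output tokens each.
    total_output_tokens = 1_000_000 * 500
    for model in OUTPUT_USD_PER_MILLION:
        print(f"{model}: ${output_cost_usd(model, total_output_tokens):,.2f}")
```

Under those assumptions the job comes to roughly $35, $435, and $6,000 respectively. A real budget would also need input-token prices and any caching or batch discounts, none of which are modeled here.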

Bottom Line

The category that matters most to you determines which model wins. There is no single answer anymore — and that’s actually healthy.
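
If you want that idea in code, a minimal sketch is a lookup from category to the lead pick named in each section above. The table and function name are illustrative, and the coding entry follows the "real repo work" pick rather than the benchmark sweep.

```python
# Lead pick per category, as named in this article (May 2026).
LEAD_BY_CATEGORY = {
    "all-around flagship": "GPT-5.5",
    "coding": "Claude Opus 4.7",  # the article's pick for real repo work
    "agentic": "GPT-5.5",
    "long context": "Llama 4 Scout",
    "multimodal": "Gemini 3.1 Pro",
    "open weight": "Kimi K2.6",
    "price-performance": "DeepSeek V4-Flash",
}


def pick_model(category: str) -> str:
    """Return the article's lead pick for a category (case-insensitive)."""
    key = category.strip().lower()
    if key not in LEAD_BY_CATEGORY:
        raise ValueError(f"unknown category: {category!r}")
    return LEAD_BY_CATEGORY[key]


if __name__ == "__main__":
    print(pick_model("Long Context"))
```

The table will go stale as models rotate, which is the point: the keys are the stable part.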

Data: llm-stats.com, futureagi.com, public benchmarks (May 2026)