GPT-5 reclaimed the top spot on SWE-bench Verified this morning at 78.4%, edging past Claude 4.6 Sonnet's previous high of 76.1%.

SWE-bench measures end-to-end issue resolution on real Python repositories, and Verified is its human-validated subset. The benchmark is widely seen as the most realistic proxy for production coding-agent quality, though critics note its bias toward Python web tooling.

OpenAI attributed the gain to a new tool-use post-training recipe rather than a base model swap.