GPT-5 Takes #1 on SWE-bench at 78.4%

OpenAI's flagship surpasses all previous models on the canonical software-engineering benchmark.

DeepSeek Publishes V4 Architecture Notes

The Chinese lab reveals mixture-of-experts details and training compute breakdowns for DeepSeek-V4.

Mistral Closes $640M Series C at $6B Valuation

The Paris-based lab will use the funds to expand its API, enterprise sales, and open-weight research.

© 2026 KolayVibe · All rights reserved

Llama 4 Scout Tops Open-Weight Coding Evals

Llama 4 Scout, Meta's smaller release in the 4-series, scored 89.2% on HumanEval — the highest of any open-weight model — and came within two points of GPT-4 on MBPP.

Notably, it lags the closed flagships on real-world SWE-bench, where post-training and tool integration dominate.

FAQ

What's notable about Llama 4 Scout's HumanEval score?

Scout's 89.2% on HumanEval is the highest of any open-weight model to date. Scout also comes within two points of GPT-4 on MBPP, closing a gap that had been steady for two model generations.

Does Scout match closed flagships on real-world coding?

No. On SWE-bench Verified, where post-training and tool integration dominate the score, Scout lags every leading closed model. The gap suggests HumanEval/MBPP are no longer differentiating benchmarks at the frontier.

Is Scout the largest model in Meta's 4-series?

No. Scout is Meta's smaller release in the 4-series, optimised for accessible local inference. Larger Llama 4 variants exist with different speed/quality tradeoffs.