Model Performance Benchmarking

5d on MSN

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing ...

New AgentBench LLM AI model benchmarking tool and leaderboards

If you are interested in learning more about how to benchmark AI large language models (LLMs), a new benchmarking tool, AgentBench, has emerged as a game-changer. This innovative tool has been ...

Microsoft MDASH Beats A Key Mythos Benchmark. Here’s Why That Matters

Microsoft MDASH outperforms Mythos Preview on the CyberGym benchmark, demonstrating improved vulnerability discovery ...

EQS AI Benchmark Volume 2: Latest Frontier Models Make Agentic Compliance Workflows a Practical Reality

Second benchmark edition shows major gains in open-ended compliance work, shifting the focus from model choice to real-world deployment. MUNICH, DE / ACCESS Newswire / May 11, 2026 / AI has crossed a ...

VentureBeat

LiveBench is an open LLM benchmark that uses contamination-free test data and objective scoring

A team of Abacus.AI, New York University, ...

Hosted on MSN

Qwen3.5-9B tops every AI benchmark right now, but that's not how you should pick a model

Qwen3.5-9B has been making waves in the AI enthusiast community, especially given that Alibaba's compact reasoning model outscored OpenAI's gpt-oss-120b on GPQA Diamond, MMLU-Pro, and MMMLU, all while ...

SiliconANGLE

OpenAI details o3 reasoning model with record-breaking benchmark scores

OpenAI today detailed o3, its new flagship large language model for reasoning tasks. The model’s introduction caps off a 12-day product announcement series that started with the launch of a new ...

TechCrunch

Did xAI lie about Grok 3’s benchmarks?

Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view. This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading ...

News Medical

New AI model sets benchmark in digital pathology with superior cancer diagnostics

In a recent study published in the journal Nature, researchers developed and evaluated the Providence Gigapixel Pathology Model (Prov-GigaPath), a whole-slide pathology foundation model, to achieve ...

Forbes

Data-Driven Strategy: How Benchmarking Drives Stronger CX Performance

Alok Kulkarni is Co-Founder and CEO of Cyara, a customer experience (CX) leader trusted by leading brands around the world. Organizations are under increased pressure to meet customers’ growing demand ...
