Tag: Benchmark

All the articles with the tag "Benchmark".

How We Broke the Mainstream AI Agent Benchmarks

Published: Apr 12, 2026 at 01:40 AM

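A UC Berkeley research team built an automated vulnerability-scanning agent that systematically broke eight mainstream AI agent benchmarks, including SWE-bench, WebArena, and OSWorld, achieving near-perfect scores with zero actual capability. This article breaks down how each vulnerability works and offers an actionable checklist for benchmark design.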
- AI
- Benchmark
- Security
- Agent
- Research