Anthropic's engineering team breaks down AI agent evaluation: a roadmap from task sets to grader design

On January 9, 2026, Anthropic published an engineering article that systematically breaks down the key methods for evaluating AI agents (evals). It emphasizes that agents interact over multiple turns, call tools, and modify the state of their environment, so single-turn evaluation is often insufficient.

The article divides graders into three categories: code-based, model-based, and human, and suggests combining them according to the scenario. For coding agents, unit tests, static analysis, and trajectory constraints can measure correctness and process quality. For research agents, graders need to check how well claims are supported and whether key facts and sources are covered, with human review used to calibrate model-based scoring. For computer-use agents, the page state and backend results are checked in a real or sandboxed environment. For non-deterministic outputs, the article compares pass@k and pass^k: the former measures whether at least one of k attempts succeeds, and the latter measures whether all k attempts succeed, which is closer to the product requirement of "reliable every time".
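The gap between the two metrics is easy to see with a little arithmetic. The sketch below assumes that trials are independent and share a single per-trial success rate `p`; that simplification and the function names `pass_at_k` and `pass_hat_k` are illustrative assumptions, not something taken from the article.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


# Assumed example: an agent that succeeds on 75% of individual trials.
p = 0.75
for k in (1, 3, 5):
    print(f"k={k}: pass@k={pass_at_k(p, k):.3f}  pass^k={pass_hat_k(p, k):.3f}")
```

Under these assumptions the two numbers are identical at k = 1 and move in opposite directions as k grows: pass@k approaches 1 while pass^k shrinks toward 0, so the same agent can look strong on one metric and weak on the other.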

For getting started, Anthropic recommends beginning with 20–50 real failure cases, writing clear task descriptions and pass/fail criteria, and preparing a reference solution that passes for each task. The task set should include both cases where a behavior should occur and cases where it should not, to avoid one-sided optimization. The evaluation environment should isolate each trial so that shared state, caches, or history cannot inflate scores or cause correlated failures. Automated evals should also be combined with production monitoring, A/B testing, and regular human spot checks to form multiple layers of defense.

FAQs

Q: What main problem does Anthropic's article on evals address?

A: The article focuses on the difficulty of evaluating AI agents reliably when they work across multiple turns, call tools, and change state, with the goal of making iteration more controllable and regressions easier to detect.

Q: What is the difference between the "trajectory" and the "final outcome" in AI agent evaluation?

A: The trajectory is the full record of the conversation and tool calls, while the final outcome is the actual resulting state in the environment, such as whether the database row was really written or the order was really created.
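The distinction can be made concrete with a small sketch. It assumes a hypothetical `orders` table in an in-memory SQLite database as the environment and a list of dictionaries as the trajectory format; neither comes from the article.

```python
import sqlite3


def make_env() -> sqlite3.Connection:
    """Create a fresh in-memory environment with an empty orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    return conn


def outcome_ok(conn: sqlite3.Connection, order_id: int) -> bool:
    """Outcome check: does the order actually exist in the environment?"""
    row = conn.execute("SELECT 1 FROM orders WHERE id = ?", (order_id,)).fetchone()
    return row is not None


def trajectory_ok(trajectory: list[dict], max_tool_calls: int) -> bool:
    """Trajectory check: did the agent stay within a tool-call budget?"""
    calls = [step for step in trajectory if step["type"] == "tool_call"]
    return len(calls) <= max_tool_calls


conn = make_env()

# The agent says the order exists, but it never called a tool to create it.
trajectory = [
    {"type": "message", "content": "Creating order 42."},
    {"type": "message", "content": "Order 42 has been created."},
]
claimed_only = outcome_ok(conn, 42)

# Simulate the tool call really writing the order.
trajectory.append({"type": "tool_call", "name": "create_order"})
conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (42, "widget"))
after_write = outcome_ok(conn, 42)
```

In this sketch `claimed_only` is false even though the transcript reads as a success, which is why grading the conversation alone can be misleading; `after_write` becomes true only once the environment state has changed.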

Q: Which kinds of products are pass@k and pass^k suited to?

A: pass@k suits tool-style scenarios where a few retries are acceptable as long as one succeeds; pass^k suits customer service, transactions, and other scenarios that require success every time.

Q: Why should the task set include both "should do" and "should not do" examples?

A: Two-sided examples keep the model from being optimized to over-trigger a behavior (such as searching or calling a tool indiscriminately), which would raise costs or degrade the experience.
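A toy illustration of why both sides matter is sketched below. The labeled cases, the two search policies, and the `trigger_rates` helper are all made up for illustration; a real task set would use real prompts and observe the agent's actual tool calls.

```python
from typing import Callable

# Hypothetical labeled cases: (prompt, should_search).
cases: list[tuple[str, bool]] = [
    ("What is the latest stable Python release?", True),
    ("What is the weather in Paris today?", True),
    ("What is 2 + 2?", False),
    ("Rewrite this sentence politely.", False),
]


def trigger_rates(
    cases: list[tuple[str, bool]],
    decides_to_search: Callable[[str], bool],
) -> dict[str, float]:
    """Measure how often search fires when it should and when it should not.

    Assumes the case list contains at least one example on each side.
    """
    should = [prompt for prompt, label in cases if label]
    should_not = [prompt for prompt, label in cases if not label]
    return {
        "triggered_when_needed": sum(decides_to_search(p) for p in should) / len(should),
        "triggered_when_not_needed": sum(decides_to_search(p) for p in should_not) / len(should_not),
    }


def always_search(prompt: str) -> bool:
    """Over-triggering policy: searches on every prompt."""
    return True


def keyword_search(prompt: str) -> bool:
    """Toy policy: searches only for time-sensitive wording."""
    return any(word in prompt.lower() for word in ("latest", "today"))


always = trigger_rates(cases, always_search)
keyword = trigger_rates(cases, keyword_search)
```

If only the "should search" cases were scored, `always_search` would look perfect. The "should not search" cases expose that it triggers on every prompt, which is the over-triggering the two-sided task set is meant to catch.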

Q: What is the minimum viable way for a team to build an evaluation system from scratch?

A: First convert the manual regression checklist and real bug tickets into 20–50 reproducible tasks, pair them with reference solutions and a stable environment, then gradually expand into a regression suite and a closed loop with production monitoring.

