Anthropic released Bloom on December 19, 2025, as an open-source tool that is freely available to download and use. Bloom is an agentic framework for automated behavioral evaluation: researchers specify a single behavior to study, and Bloom then automatically generates a large number of scenarios and multi-turn conversations, scores the target model's responses in each one, and reports metrics such as the elicitation rate (how often the behavior is triggered) and the average severity score. Together these measure how frequently and how strongly the model exhibits the behavior.
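As a rough illustration of how such summary metrics can be derived from per-scenario judge scores, here is a minimal Python sketch. It is not Bloom's actual code: the 1-10 score scale, the threshold of 7, and the function name `summarize_scores` are all assumptions made for this example.

```python
from statistics import mean


def summarize_scores(scores, threshold=7):
    """Return (trigger_rate, average_score) for a list of judge scores.

    Assumption: each score is an integer from 1 to 10, and a scenario
    counts as having triggered the behavior when its score >= threshold.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    hits = sum(1 for s in scores if s >= threshold)
    return hits / len(scores), mean(scores)


# Hypothetical judge scores for ten generated scenarios.
scores = [2, 8, 9, 4, 7, 1, 10, 3, 6, 8]
rate, avg = summarize_scores(scores)
print(f"trigger rate: {rate:.0%}, average score: {avg:.1f}")
# prints "trigger rate: 50%, average score: 5.8"
```

The trigger rate captures how often the behavior appears, while the average score captures how strongly it appears, so the two numbers can move independently.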
Bloom is described as a complement to Anthropic's existing tool Petri. Petri scans many behavioral dimensions at once and surfaces suspicious instances within user-specified scenarios; Bloom instead generates many reproducible scenarios around one specific behavior, so that quantitative conclusions can be reached faster. The official example benchmark covers alignment-relevant behaviors such as "delusional sycophancy", "instructed long-horizon sabotage", "self-preservation", and "self-preferential bias", and walks through the complete process from behavior definition to evaluation output.
Mechanically, Bloom uses a four-stage pipeline of understanding, ideation, rollout (scenario execution), and judgment. A "seed configuration" records the behavior description, example conversations, and key parameters, so that experiments can be reproduced and results compared across models or configurations. Because this kind of evaluation depends on automatically generated scenarios and on a judge model, users still need to watch the evaluation configuration, the consistency of the judge's scores, and the realism of the scenarios, and should avoid extrapolating a single result to the model's stable behavior in real-world deployments.
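To make the role of a seed configuration concrete, the following Python sketch records a behavior definition and run parameters, fingerprints them, and reports which fields differ between two runs. It does not reflect Bloom's real configuration schema: the `SeedConfig` class, its field names and defaults, and the `diff_configs` helper are hypothetical and exist only for this illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class SeedConfig:
    """Hypothetical seed configuration; all field names are assumptions."""
    behavior: str
    description: str
    examples: tuple = ()          # example conversations, if any
    target_model: str = "model-a"
    num_scenarios: int = 100
    max_turns: int = 5

    def fingerprint(self) -> str:
        # Stable hash of the full configuration, so two runs can be
        # checked for identical settings before comparing their results.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def diff_configs(a: SeedConfig, b: SeedConfig) -> dict:
    """Return {field: (value_in_a, value_in_b)} for fields that differ."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


base = SeedConfig(
    behavior="self-preservation",
    description="The model acts to avoid being shut down or modified.",
)
variant = replace(base, target_model="model-b")

print(diff_configs(base, variant))
# prints "{'target_model': ('model-a', 'model-b')}"
```

Keeping the configuration immutable and hashing it makes it easy to confirm that two sets of results differ only in the setting under study, here the target model.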
FAQs
Q: What is Anthropic's Bloom tool primarily used for?
A: Bloom automatically generates evaluation scenarios for a specified behavior and quantifies how often and how severely the target model exhibits that behavior.
Q: What is the core difference between Bloom and Petri?
A: Bloom focuses on a single behavior and automatically generates a large number of scenarios to measure it quantitatively; Petri covers many behavioral dimensions and looks for anomalies within user-specified scenarios.
Q: What are the key aspects of Bloom's evaluation process?
A: Bloom runs four stages: understanding, ideation, rollout (execution), and judgment. It then outputs summary metrics, such as the trigger rate, along with an evaluation report.
Q: What does Bloom's "seed configuration" do in an evaluation?
A: The seed configuration records the behavior definition and parameter settings, which makes experiments reproducible and results comparable across models.
Q: What risks should researchers be aware of when using Bloom results?
A: Researchers should consider the realism of the automatically generated scenarios, possible bias in the judge model, and the effect of configuration differences on the results, and should avoid treating an evaluation result as equivalent to real-world behavior.