1. Abstract
Bloom is an open-source framework for generating behavioral evaluations of LLMs. Researchers define a target behavior and a reproducible seed configuration; Bloom then automatically generates a large number of scenarios designed to elicit that behavior and runs them against the target model. A judge model scores how often and how strongly the behavior appears, and Bloom outputs aggregable metrics and reports, which makes it suited to quickly building scalable behavioral evaluations.
2. Core features
- Behavior-focused: specify a single target behavior (such as sycophancy, political bias, or self-preservation) and Bloom automatically expands it into a diverse set of scenarios.
- Seed-based reproducibility: the evaluation "grows" from the seed, so the same behavior can yield different scenarios across runs; keeping the full seed preserves traceability and reproducibility.
- Four-stage pipeline: understanding (interpreting the behavior and any examples) → ideation (generating scenarios and interaction settings) → rollout (running the scenarios against the target model) → judgment and meta-judgment (scoring each transcript and producing a summary report).
- Multi-provider model access: connects to multiple model APIs through a unified call layer and supports logging and managing larger-scale experiments.
- Visualization and interoperability: outputs transcript files and per-stage artifacts, supports browsing a local results directory in a web viewer, and provides a log format compatible with other evaluation frameworks.
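The four-stage flow described above can be pictured as a chain of plain functions. The following is a minimal sketch under stated assumptions, not Bloom's actual API: every function and class name here is hypothetical, and the target and judge "models" are stubs so the example runs without network access.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# All names below are hypothetical; this is not Bloom's real API.


@dataclass
class Transcript:
    scenario: str
    reply: str
    score: int = 0  # filled in by the judgment stage


def understand(behavior: str, examples: list[str]) -> str:
    """Stage 1: turn the behavior name and optional examples into a working description."""
    return f"{behavior} (informed by {len(examples)} example(s))"


def ideate(description: str, n: int) -> list[str]:
    """Stage 2: expand the description into n scenarios (a real system would call an LLM)."""
    return [f"Scenario {i + 1}: probe for {description}" for i in range(n)]


def rollout(scenarios: list[str], target: Callable[[str], str]) -> list[Transcript]:
    """Stage 3: run each scenario against the target model."""
    return [Transcript(s, target(s)) for s in scenarios]


def judge(transcripts: list[Transcript], judge_fn: Callable[[Transcript], int]) -> dict:
    """Stage 4: score each transcript, then summarize across the whole suite."""
    for t in transcripts:
        t.score = judge_fn(t)
    scores = [t.score for t in transcripts]
    return {"n": len(scores), "mean_score": mean(scores), "max_score": max(scores)}


def run_pipeline(behavior, examples, n, target, judge_fn) -> dict:
    description = understand(behavior, examples)
    scenarios = ideate(description, n)
    transcripts = rollout(scenarios, target)
    return judge(transcripts, judge_fn)


# Stub "models" so the sketch runs offline.
def stub_target(prompt: str) -> str:
    if prompt.startswith("Scenario 1:"):
        return "You are absolutely right!"
    return "I disagree, and here is why."


def stub_judge(t: Transcript) -> int:
    return 8 if "absolutely right" in t.reply else 2


report = run_pipeline("sycophancy", ["example transcript"], 3, stub_target, stub_judge)
print(report)
```

Passing the target and judge in as callables keeps each stage independently swappable, which mirrors the idea of changing the target model or the judge without touching the rest of the run.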
3. Installation
- Prepare a Python 3.11 environment, clone the repository, and install the dependencies listed in requirements.txt.
- Add the API keys for the model providers you need to .env (only the providers you plan to use).
- Edit the behaviors configuration and seed.yaml: specify parameters such as the behavior, optional examples, the number of scenarios to generate, the target model, and diversity.
- Run locally: execute the main script to generate the results directory; launch the viewer when needed to browse transcripts and scores in the browser.
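To make the seed parameters listed above concrete, here is a hedged sketch that models them as a validated Python dataclass and derives a short fingerprint for labelling result directories. The field names, the 0.0 to 1.0 diversity range, and the model name are illustrative assumptions; they are not Bloom's actual seed.yaml schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Field names are illustrative assumptions, not Bloom's actual seed.yaml schema.


@dataclass(frozen=True)
class SeedConfig:
    behavior: str
    target_model: str
    num_scenarios: int = 10
    diversity: float = 0.5          # assumed range: 0.0 to 1.0
    examples: tuple[str, ...] = ()  # optional example transcripts

    def __post_init__(self) -> None:
        if not self.behavior:
            raise ValueError("behavior must not be empty")
        if self.num_scenarios < 1:
            raise ValueError("num_scenarios must be at least 1")
        if not 0.0 <= self.diversity <= 1.0:
            raise ValueError("diversity must be between 0.0 and 1.0")

    def fingerprint(self) -> str:
        """Stable short hash of the config, handy for naming result directories."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


seed = SeedConfig(behavior="sycophancy", target_model="example-model-v1", num_scenarios=20)
print(seed.fingerprint())
```

Two runs with identical settings produce the same fingerprint, so a mismatch is a quick signal that something in the configuration changed between experiments.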
4. Typical use cases
- Safety and alignment evaluation: quantify how often behaviors such as self-preservation, sabotage, bias, and sycophancy occur across different models and versions.
- Model comparison and selection: run sweeps across multiple models with the same seed to quickly locate differences in behavioral risk.
- Regression testing: pin key seeds as a behavioral baseline and rerun them automatically after model upgrades or prompt changes.
- Red teaming and research: automatically generate additional elicitation paths for a specific hypothesis to help uncover implicit behavior patterns in long conversations.
- Judge model experiments: swap in different judges or meta-judges to compare the consistency and stability of their verdicts.
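The model-comparison and regression use cases both reduce to comparing an aggregate rate between two sets of judge scores obtained from the same seed. The sketch below is an assumption-laden illustration: the 1 to 10 score scale, the threshold of 7, the 0.05 tolerance, and the score lists themselves are all made up, and the helper names are hypothetical.

```python
# Hypothetical data: per-rollout judge scores (assumed 1-10 scale) for two model
# versions evaluated on the same pinned seed. The numbers are made up.


def elicitation_rate(scores: list[int], threshold: int = 7) -> float:
    """Fraction of rollouts whose score meets or exceeds the threshold."""
    if not scores:
        raise ValueError("no scores to aggregate")
    return sum(s >= threshold for s in scores) / len(scores)


def regressed(baseline: list[int], candidate: list[int],
              threshold: int = 7, tolerance: float = 0.05) -> bool:
    """True if the candidate's rate exceeds the baseline's by more than the tolerance."""
    delta = elicitation_rate(candidate, threshold) - elicitation_rate(baseline, threshold)
    return delta > tolerance


baseline_scores = [2, 3, 8, 1, 2, 4, 2, 9, 3, 2]   # 2 of 10 at or above 7
candidate_scores = [2, 8, 8, 1, 7, 4, 2, 9, 3, 2]  # 4 of 10 at or above 7

print(elicitation_rate(baseline_scores))             # 0.2
print(elicitation_rate(candidate_scores))            # 0.4
print(regressed(baseline_scores, candidate_scores))  # True
```

With only ten rollouts per side the rates are noisy, so in practice the tolerance would need to reflect the sample size rather than a fixed guess.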
5. Ecosystem and related tools
- Sibling tool: Petri leans toward broad-spectrum auditing (exploring many behavioral dimensions within a given scenario), while Bloom focuses on targeted quantification (locking onto a single behavior for large-scale elicitation and statistics).
- Composable ecosystem: Bloom can be combined with the logging and visualization tooling of evaluation frameworks such as Inspect, so its outputs can feed a unified evaluation dashboard.
- Related approaches: OpenAI Evals, LM Evaluation Harness, and similar tools are more commonly used for fixed question sets and capability evaluations; Bloom places more emphasis on automatically generated behavioral evaluation suites.
6. Limitations and precautions
- Cost and time: large-scale rollouts and scoring rely on model calls, so cost and run time grow roughly linearly with the number of scenarios generated.
- Judge bias: the judge model's preferences affect scores; spot-check a sample manually or cross-check with multiple judges.
- Randomness and reproducibility: the same behavior can produce different scenarios, so save the complete seed and version information.
- Data and security: generated prompts and transcripts may contain sensitive content or boundary-pushing attempts, so storage access controls and redaction policies are required.
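The cost point above can be turned into a back-of-the-envelope estimate before launching a large run. This is a rough sketch only: the function and all of its parameters are hypothetical placeholders, the prices and token counts are invented, and the model deliberately ignores effects such as context growing over a multi-turn conversation or extra calls made by a scenario-generating model.

```python
def estimate_cost(num_scenarios: int, turns_per_rollout: int, repetitions: int,
                  tokens_per_call: int, usd_per_million_tokens: float,
                  judge_calls_per_rollout: int = 1) -> dict:
    """Rough linear cost model; every parameter is a placeholder for real figures."""
    rollouts = num_scenarios * repetitions
    target_calls = rollouts * turns_per_rollout
    judge_calls = rollouts * judge_calls_per_rollout
    total_calls = target_calls + judge_calls
    total_tokens = total_calls * tokens_per_call
    return {
        "rollouts": rollouts,
        "model_calls": total_calls,
        "estimated_usd": total_tokens * usd_per_million_tokens / 1_000_000,
    }


# Invented figures, purely to show the linear scaling.
small = estimate_cost(num_scenarios=20, turns_per_rollout=5, repetitions=1,
                      tokens_per_call=2_000, usd_per_million_tokens=5.0)
large = estimate_cost(num_scenarios=200, turns_per_rollout=5, repetitions=1,
                      tokens_per_call=2_000, usd_per_million_tokens=5.0)
print(small)
print(large)
```

Multiplying the scenario count by ten multiplies both the call count and the estimate by ten, which is the linear relationship the bullet describes; real runs will deviate from this wherever per-call token usage is not constant.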
7. Project repository
https://github.com/safety-research/bloom
8. Frequently asked questions
Q: What role does the seed configuration play in Bloom's automated behavioral evaluation?
A: The seed determines key parameters such as the behavior description, examples, generation size, and interaction mode. Saving the seed lets you reproduce the experiment and trace where the results came from.
Q: Can Bloom only evaluate Claude or Anthropic models?
A: No. Bloom is not tied to a single vendor; it can typically reach multiple model APIs through a unified call layer. Which models are available depends on the providers you configure in your .env.
Q: Where does Bloom write its results, and how can I quickly view the transcripts?
A: After a run, JSON and transcript files for each stage are written to the results directory. Launch the companion viewer to browse and filter them in a local web interface.
Q: What is Bloom's open-source license, and can it be used for commercial evaluation?
A: The code repository uses the MIT License. It is still advisable to check with legal counsel and review the terms of third-party dependencies to confirm that your compliance and business requirements are met.
Q: How can I reduce false positives and run-to-run noise in Bloom's judgments?
A: Pin key seeds, increase the number of repetitions, spot-check samples manually, and compare multiple judges and thresholds to assess stability.
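One way to act on the multi-judge advice in the last answer is to measure how often two judges agree and to queue the disputed transcripts for manual review. The sketch below assumes each judge emits an integer score per transcript on a 1 to 10 scale and that a score of 7 or more counts as the behavior being present; the judge names, scores, and helper functions are all hypothetical.

```python
from itertools import combinations


def flags(scores: list[int], threshold: int) -> list[bool]:
    """Binarize scores: True where the behavior is judged present."""
    return [s >= threshold for s in scores]


def pairwise_agreement(judge_scores: dict[str, list[int]],
                       threshold: int = 7) -> dict[tuple[str, str], float]:
    """For each pair of judges, the fraction of transcripts on which their flags match."""
    result = {}
    for a, b in combinations(sorted(judge_scores), 2):
        fa = flags(judge_scores[a], threshold)
        fb = flags(judge_scores[b], threshold)
        if len(fa) != len(fb):
            raise ValueError("judges must score the same transcripts")
        result[(a, b)] = sum(x == y for x, y in zip(fa, fb)) / len(fa)
    return result


def disputed(judge_scores: dict[str, list[int]], threshold: int = 7) -> list[int]:
    """Indices of transcripts where the judges disagree: candidates for manual review."""
    per_judge = [flags(s, threshold) for s in judge_scores.values()]
    return [i for i, column in enumerate(zip(*per_judge)) if len(set(column)) > 1]


# Made-up scores from two hypothetical judges over the same six transcripts.
scores = {
    "judge_a": [8, 2, 7, 3, 9, 1],
    "judge_b": [7, 2, 5, 3, 9, 6],
}
print(pairwise_agreement(scores))  # agreement on 5 of 6 transcripts
print(disputed(scores))            # [2]
```

Rerunning the same functions with a different threshold shows how sensitive the agreement is to the cutoff: at a threshold of 6, these two judges also split on the last transcript.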