AMO-Bench Released: A large-model reasoning benchmark for IMO-level math competition problems

1. Abstract

AMO-Bench is an advanced mathematical reasoning benchmark released by Meituan's LongCat team. It targets competition problems at International Mathematical Olympiad (IMO) difficulty or higher. The benchmark consists of 50 new problems designed by human experts, and it measures the real upper limit of large models on hard mathematical reasoning through automatic scoring and human-annotated chain-of-thought (CoT) solutions. Current public results show Kimi-k2-Thinking scoring about 56%, followed by GPT-5-thinking (high) and Qwen3-235B-Thinking, while most models remain below 40%.

2. Core features

1. Original IMO-level problem set: All 50 questions are designed and cross-verified by human experts and are rated as at least IMO difficulty. Because the questions are new, this helps avoid inflated leaderboard scores caused by memorized training data.

2. High-precision automatic scoring: A scoring algorithm that combines rules and models compares numerical answers, expressions, and other answer types robustly. The authors report an overall scoring accuracy of 99.2%.

3. Human-annotated CoT: Each question comes with a human-written chain-of-thought solution, which makes it easier to analyze model error patterns and can also serve as a reference signal for later supervised fine-tuning or reinforcement learning.

4. Focus on reasoning rather than format: Each question requires only a final answer, not a complete proof, which greatly reduces the cost of manual grading and supports large-scale, reproducible evaluation.
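To make the rule-based half of such a scorer concrete, here is a minimal sketch of comparing a predicted final answer against a reference. It is not the official AMO-Bench scoring code: the function names (normalize, to_fraction, answers_match) are hypothetical, only simple numeric answers are handled exactly, and the model-based comparison that the benchmark uses for harder cases is left out.

```python
from fractions import Fraction


def normalize(ans: str) -> str:
    """Strip whitespace, surrounding $...$ and a trailing period; drop inner spaces."""
    s = ans.strip().strip("$").strip()
    if s.endswith("."):
        s = s[:-1]
    return s.replace(" ", "")


def to_fraction(s: str):
    """Parse integers, decimals and a/b fractions exactly; return None otherwise."""
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(predicted: str, reference: str) -> bool:
    """Hypothetical rule-based check: exact numeric equality, else string equality."""
    p, r = normalize(predicted), normalize(reference)
    fp, fr = to_fraction(p), to_fraction(r)
    if fp is not None and fr is not None:
        return fp == fr
    # Non-numeric answers fall back to a plain string comparison here;
    # a real scorer would need symbolic or model-based equivalence checks.
    return p == r
```

Exact rational comparison avoids floating-point tolerance choices for answers such as 0.5 versus 1/2, but it says nothing about equivalent symbolic expressions, which is where a model-based judge would come in.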

3. Installation

1. Download AMO-Bench from the Hugging Face dataset page (or pull it with the datasets library or a similar tool) and extract it to a local directory.

2. Clone the GitHub repository and install the Python dependencies and evaluation scripts as described in the README.

3. Specify how the model is called (local inference or cloud API) in the configuration file, and set the output and log paths.

4. Run the official sample script: first verify the evaluation and automatic scoring pipeline on a small number of samples, then run the full evaluation.
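The "small sample first, then full run" workflow can be sketched as follows. This is an illustration under assumptions, not the repository's script: it assumes the questions are available as JSON Lines with "question" and "answer" fields (the real file layout and field names may differ), and model_fn stands in for whatever local or cloud inference call you configure.

```python
import json
from typing import Callable, Iterable, Optional


def parse_jsonl(lines: Iterable[str], limit: Optional[int] = None) -> list:
    """Parse JSON Lines text, skipping blank lines; stop after `limit` records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))
        if limit is not None and len(records) >= limit:
            break
    return records


def run_eval(records: list, model_fn: Callable[[str], str]) -> list:
    """Call the model once per question and keep prediction next to reference."""
    rows = []
    for rec in records:
        rows.append({
            "question": rec["question"],      # assumed field name
            "reference": rec["answer"],       # assumed field name
            "prediction": model_fn(rec["question"]),
        })
    return rows
```

For a smoke test, pass an open file handle with limit=3 and a stub model_fn that returns a fixed string; once the output rows look right, drop the limit and plug in the real model call.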

4. Typical use cases

1. Benchmark evaluation of large models: Run AMO-Bench alongside GSM8K, MATH, AIME, and similar datasets to separate high-end models on the hardest problems, where easier benchmarks no longer distinguish them.

2. Comparison of reasoning strategies: Compare the performance of different reasoning modes such as direct answers, step-by-step thinking (CoT), and reflection and retry on the same set of questions.

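3. Training and fine-tuning signals: Use the questions and human CoT solutions as high-quality supervised data to strengthen a model's mathematical reasoning chains, keeping in mind that training on them rules out using the benchmark as a clean test set.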

4. Study token overhead and compute scaling: Analyze the output length and computing power consumption of different models and problem-solving strategies on a fixed problem set.
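Comparing reasoning strategies and token overhead on a fixed problem set comes down to grouping per-question results by strategy. The sketch below assumes each result is a dict with "strategy", "correct", and "output_tokens" keys; these names are illustrative and not part of the AMO-Bench tooling.

```python
from collections import defaultdict


def summarize(runs: list) -> dict:
    """Return accuracy and mean output length per strategy."""
    buckets = defaultdict(list)
    for r in runs:
        buckets[r["strategy"]].append(r)

    summary = {}
    for name, rows in buckets.items():
        n = len(rows)
        summary[name] = {
            "n": n,
            "accuracy": sum(1 for r in rows if r["correct"]) / n,
            "mean_output_tokens": sum(r["output_tokens"] for r in rows) / n,
        }
    return summary
```

Plotting accuracy against mean_output_tokens for each strategy or model gives a simple view of how much additional output length buys in score.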

5. Ecosystem and comparable benchmarks

1. Ecosystem: The project provides the dataset, automatic scoring code, sample scripts, and public results, so it can be plugged into existing large-model evaluation pipelines and the LongCat ecosystem with little effort.

2. Comparison with traditional benchmarks: Compared with GSM8K, MATH, AIME24/25, and other benchmarks that are already saturated, AMO-Bench raises the difficulty to the IMO range. Unlike benchmarks such as IMO-ProofBench, which emphasize proof quality, it focuses on combining hard reasoning with automated evaluation.

6. Limitations and precautions

1. With only 50 questions, the overall statistical confidence is limited. The benchmark is better suited as a high-difficulty stress test and ranking tool than as a general benchmark covering all-round ability.

2. The questions follow the high school Mathematical Olympiad style, so coverage of open-ended reasoning and interdisciplinary ability is limited.

3. Although the automatic scoring is carefully designed, extreme or unconventional output formats may still be misjudged. For key models, manually reviewing a sample of the scored results is recommended.

4. Before using the benchmark in research or products, check the license terms of the repository and dataset to confirm whether commercial use and redistribution are allowed.
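To put a number on the limited statistical confidence, the sketch below computes a 95% Wilson score interval for a binomial proportion. Applying it assumes each of the 50 questions counts once and is answered independently; published scores may instead be averaged over several samples per question, in which case this is only a rough guide.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# 28 of 50 corresponds to the "about 56%" top score under the assumption above.
low, high = wilson_interval(28, 50)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 42% to 69%, so score gaps of a few points between models on 50 questions should not be over-interpreted.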

7. Project address

https://github.com/meituan-longcat/AMO-Bench

8. FAQs

Q: How do I obtain and load the AMO-Bench dataset?

A: Download it from the Hugging Face dataset page or from the link on the official project page. After extracting it locally, load the question and answer fields in Python, for example with the datasets library or a custom script.

Q: What types of large models is AMO-Bench more suitable for evaluating?

A: It is mainly aimed at general-purpose large models with strong mathematical and symbolic reasoning capabilities, especially versions that offer a "Thinking/Reasoning/CoT" mode. The benchmark is often too difficult for small and medium-sized models, whose scores may be extremely low.

Q: How can I reproduce the results or plug in my own model locally?

A: Follow the instructions in the GitHub repository to install the dependencies, configure the model inference interface (such as a local inference service or a cloud API), and then run the official evaluation script to generate an answer file and score it automatically.

Q: Is AMO-Bench suitable for direct use as a training set?

A: It can be used for fine-tuning or reinforcement learning in research settings, but because the number of questions is small, it is better kept as a validation or test set. Train on a larger mathematical corpus instead to avoid overfitting to this benchmark.
