AMO-Bench Released: A Large-Model Reasoning Benchmark for IMO-Level Math Competitions



1. Abstract

AMO-Bench is an advanced mathematical reasoning benchmark released by Meituan's LongCat team, targeting competition problems at International Mathematical Olympiad (IMO) difficulty or above. It consists of 50 original problems written by human experts, and it probes the real upper limit of large models on hard mathematical reasoning through automatic scoring and human-annotated chain-of-thought (CoT) solutions. Currently published results show Kimi-k2-Thinking scoring about 56%, followed by GPT-5-thinking (high) and Qwen3-235B-Thinking, with most models still below 40%.

2. Core features

1. Original IMO-level problem set: All 50 problems are written and cross-verified by human experts and are rated at IMO difficulty or higher. Because the problems are new, a high score is less likely to come from memorized training data or leaderboard gaming.

2. High-precision automatic scoring: A hybrid scoring scheme (rule-based checks plus a model-based grader) robustly compares numerical answers, expressions, and other answer types. The authors report an overall scoring accuracy of 99.2%.

3. Human-annotated CoT: Each problem comes with a human-written chain-of-thought solution, which makes it easier to analyze model error patterns and can serve as a reference signal for later supervised fine-tuning or reinforcement learning.

4. Focus on reasoning rather than format: Each problem requires only a final answer, not a complete proof, which greatly reduces the cost of manual grading and supports large-scale, reproducible evaluation.
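To make the rule-based half of the scoring idea concrete, here is a minimal sketch of a final-answer comparator. It is not the official AMO-Bench grader: the function names `parse_number` and `answers_match` are hypothetical, and the logic only covers plain numbers and simple fractions, falling back to a whitespace-insensitive string comparison for everything else.

```python
from fractions import Fraction
from typing import Optional


def parse_number(text: str) -> Optional[Fraction]:
    """Parse a plain number or a fraction such as '3/4'; return None on failure."""
    # Drop whitespace, thousands separators and stray dollar signs.
    cleaned = "".join(text.split()).replace(",", "").replace("$", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(predicted: str, reference: str) -> bool:
    """Hypothetical rule-based check: exact numeric equality when both sides
    parse as numbers, otherwise a whitespace-insensitive string comparison."""
    p, r = parse_number(predicted), parse_number(reference)
    if p is not None and r is not None:
        return p == r
    return "".join(predicted.split()) == "".join(reference.split())


print(answers_match("0.75", "3/4"))   # numeric forms are compared by value
print(answers_match("x+1", "1+x"))    # equivalent expressions are NOT detected
```

The second call shows why rules alone are not enough: algebraically equivalent expressions written differently fail a string comparison, which is the kind of case a model-based grader or a symbolic check is meant to cover.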

3. Installation

1. Download AMO-Bench from its Hugging Face dataset page (or pull it with a tool such as datasets) and extract it to a local directory.

2. Clone the GitHub repository and install the Python dependencies and evaluation scripts as described in the README.

3. In the configuration file, specify how the model is called (local inference or cloud API) and set the output and log paths.

4. Run the official sample script. First verify the evaluation and automatic scoring pipeline on a small number of samples, then run the full evaluation.
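The "small sample first, then full run" workflow can be sketched as follows. This is an illustration only, not the repository's script: it assumes a local JSONL file whose records have hypothetical `question` and `answer` fields, and `model_fn` stands in for whatever local or API-based inference call you configure. Check the real file layout and field names against the README.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def load_questions(path: str) -> list[dict]:
    """Read one JSON object per line; blank lines are skipped."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def evaluate(
    questions: list[dict],
    model_fn: Callable[[str], str],
    limit: Optional[int] = None,
) -> float:
    """Return the accuracy of model_fn on the first `limit` questions
    (all questions when limit is None), using exact string match."""
    subset = questions if limit is None else questions[:limit]
    if not subset:
        return 0.0
    correct = sum(
        model_fn(q["question"]).strip() == str(q["answer"]).strip()
        for q in subset
    )
    return correct / len(subset)


if __name__ == "__main__":
    # Toy in-memory data and a stub model, purely for demonstration.
    toy = [{"question": "1+1", "answer": "2"}, {"question": "2+2", "answer": "4"}]
    stub_model = lambda prompt: "2"
    print("smoke test accuracy:", evaluate(toy, stub_model, limit=1))
    print("full run accuracy:", evaluate(toy, stub_model))
```

Running with a small `limit` first catches configuration problems (wrong paths, malformed outputs) before spending compute on all 50 problems.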

4. Typical use cases

1. Benchmark evaluation of large models: Use AMO-Bench alongside GSM8K, MATH, AIME, and similar datasets to separate high-end models on the hardest problems.

2. Comparison of reasoning strategies: Compare different reasoning modes, such as direct answering, step-by-step thinking (CoT), and reflect-and-retry, on the same set of problems.

3. Training and fine-tuning signals: Use the problems and human CoT solutions as high-quality supervised data to strengthen a model's mathematical reasoning chains.

4. Token overhead and compute scaling studies: Analyze the output length and compute cost of different models and problem-solving strategies on a fixed problem set.
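For the strategy comparison and token overhead use cases above, a per-strategy summary might look like the sketch below. The record layout (`strategy`, `correct`, `output_tokens`) is an assumption made for this example, and the numbers are made up for illustration; they are not AMO-Bench results.

```python
from collections import defaultdict
from statistics import mean


def summarize(records: list[dict]) -> dict[str, dict[str, float]]:
    """Group per-question records by strategy and report accuracy
    and mean output length for each group."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["strategy"]].append(rec)
    return {
        name: {
            "accuracy": mean(1.0 if r["correct"] else 0.0 for r in rows),
            "mean_output_tokens": mean(r["output_tokens"] for r in rows),
        }
        for name, rows in grouped.items()
    }


# Illustrative, invented records: one entry per (question, strategy) run.
records = [
    {"strategy": "direct", "correct": False, "output_tokens": 100},
    {"strategy": "direct", "correct": True, "output_tokens": 300},
    {"strategy": "cot", "correct": True, "output_tokens": 5000},
]

for name, stats in summarize(records).items():
    print(name, stats)
```

Keeping the problem set fixed and varying only the strategy is what makes the accuracy and token columns comparable across rows.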

5. Ecosystem and Comparable Benchmarks

1. Ecosystem: The project provides the dataset, automatic scoring code, sample scripts, and public results, so it can be plugged into existing large-model evaluation pipelines and the LongCat ecosystem with little effort.

2. Comparison with traditional benchmarks: Compared with GSM8K, MATH, AIME24/25, and other benchmarks that are already "saturated", AMO-Bench raises the difficulty to the IMO range. Unlike benchmarks such as IMO-ProofBench, which emphasize proof quality, it focuses on combining hard reasoning with automated evaluation.

6. Limitations and precautions

  1. With only 50 problems, overall statistical confidence is limited. The benchmark is better suited to hard stress testing and ranking than to serving as a general benchmark of broad capabilities.
  2. The problems follow the style of high school Mathematical Olympiads, so coverage of open-ended reasoning and interdisciplinary skills is limited.
  3. Although the automatic scoring is carefully designed, extreme or unconventional output formats may still be misjudged. For key models, manually review a sample of the evaluation results.
  4. Before using it in research or products, check the license terms of the repository and dataset to confirm whether commercial use and redistribution are allowed.
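The statistical limitation of a 50-problem set can be made concrete with a confidence interval. The sketch below uses the standard Wilson score interval for a binomial proportion; it treats a score of about 56% as 28 correct out of 50 on a single run, which is an assumption for illustration only (published scores may be averaged over several samples per problem).

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(28, 50)
print(f"28/50 correct: 95% interval roughly {low:.2f} to {high:.2f}")
```

Under these assumptions the interval spans roughly 0.42 to 0.69, so two models whose scores differ by only a few points on 50 problems cannot be reliably separated from a single run.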

7. Project address

https://github.com/meituan-longcat/AMO-Bench

8. FAQs

Q: How to obtain and load the AMO-Bench dataset?

A: Download it from the Hugging Face dataset page or from the link on the official project page. After extracting it locally, load the question and answer fields in Python (for example with datasets or a custom script).

Q: What types of large models is AMO-Bench more suitable for evaluating?

A: It is mainly aimed at general-purpose large models with strong mathematical and symbolic reasoning, especially versions that offer a "Thinking/Reasoning/CoT" mode. The benchmark is usually too hard for small and medium-sized models, whose scores may be extremely low.

Q: How can I reproduce the experiments or plug in my own model locally?

A: Follow the instructions in the GitHub repository to install dependencies and configure the model inference interface (such as a local inference service or a cloud API), then run the official evaluation script to generate an answer file and score it automatically.

Q: Is AMO-Bench suitable for direct use as a training set?

A: It can be used for fine-tuning or reinforcement learning in research settings. However, because the number of problems is small, it is better kept as a validation or test set, with training done on a larger mathematical corpus to avoid overfitting to this benchmark.

