Back to AI information
Qwen releases early preview of Qwen3-Max-Thinking: claims 100% compatibility with HMMT at AIME 2025.

Qwen releases early preview of Qwen3-Max-Thinking: claims 100% compatibility with HMMT at AIME 2025.

AI information Admin 93 views

In early November, the Qwen team released an early preview version of Qwen3-Max-Thinking, stating that the model was an intermediate checkpoint still under training. The official statement indicated that, after combining tool usage with expanded test-time compute, the model achieved 100% scores on challenging inference benchmarks such as AIME 2025 and HMMT. The current version is available on Qwen Chat and can be accessed via the Alibaba Cloud Model Studio API by enabling the enable_thinking parameter.

It's important to note that publicly available third-party leaderboards typically use fixed settings and may not account for computational power expansion during external tools or unconventional testing. Therefore, their results may differ from those claimed by manufacturers as "tool enhancements + expanded computational power." Recent AIME 2025 summary leaderboards do not generally display 100% perfect scores; whether they will be included in future unified rankings depends on the evaluation rules and reproduction procedures. Overall, this release is a feature preview; training and metrics will continue to be updated.

Frequently Asked Questions

Q: Where can I use Qwen3-Max-Thinking now?

A: You can try it out in the Qwen Chat frontend, or you can call it through the Alibaba Cloud Model Studio API and set enable_thinking=True in the request to enable thinking mode.

Q: What are the specific conditions for the claimed AIME 2025 and HMMT "100%"?

A: The official explanation is that it was obtained under the conditions of "enhanced tools + expanded inference computing power during testing"; there is a difference in the definition compared to the public leaderboard with standard closed settings.

Q: Why do public rankings not necessarily show perfect scores?

A: Many rankings require a fixed temperature, no external tools, or a limited inference budget; scores may differ or not be included if the test setup differs from the official test setup.

Q: Is this the official version?

A: No. This version is an early preview and is still under development. Its capabilities and stability may change in the future. The official statement is that it will continue to be updated.

Q: How do I enable the thinking mode in the API?

A: Use the enable_thinking parameter in the relevant interfaces of Alibaba Cloud Model Studio; the specific implementation documentation provides examples.

A preview of the third edition of "Tongyi 1000 Questions" has been released. How to activate the "Thousand Questions on General Theory" thinking mode? AIME 2025 perfect score analysis HMMT High-Difficulty Benchmark Achievement Interpretation Tool Enhancement and Computing Power Explanation Inference computing power scaling mechanism during testing Officials say they are still at the midpoint of training. QwenChat front-end can be tried directly Alibaba Cloud ModelStudio Interface Guide How to use the enable_thinking parameter Differences between publicly available rankings and manufacturers' statements Why are perfect scores not displayed on the leaderboard? The boost that thinking patterns provide to reasoning Summary of high-difficulty reasoning benchmark tests Preview version capabilities and stability changes Evaluation rules and reproduction experiment procedures Tutorial Example Call and Return Parsing Comparison with standard enclosed setup No external tools for comparing scores The Real Impact of Expanding Reasoning Budgets Benefits of using tools for solving math problems AIME and HMMT Evaluation Scope What are the early preview version feature limitations? Model continuous training update rhythm Differences between official news statements and actual measurements Qwen3MaxThinking Introduction and Basic Information Consider link length and computing power budget Example of a multi-tool collaborative calling scenario Mathematical Reasoning 100 points Reproducibility Necessary conditions for inclusion in public rankings Usage limits and billing considerations Inference calculation budget setting suggestions Can it be deployed in an enterprise environment? Risk control that initiates a thinking mode Guidelines for submitting reproduction experiments Competition Question Bank Versions and Leakage Prevention How researchers conduct controlled trials Comparison with Claude et al.'s models Tongyi Qianwen Ecological Product Panorama Thinking patterns affect performance on coding problems Real-world business scenario implementation observation The Boundary Between Academic Evaluation and Product Promotion How to track model update records Compilation of key points from developer community discussions Applications for College Competition Training Implications for Enterprise Decision-Making Reasoning Stability under multiple temperature settings Long context and tool routing strategy Security Compliance and Data Protection Tips Will subsequent rankings include all data?

Recommended Tools

More