On November 3, 2025, OpenAI released IndQA, a new benchmark for evaluating how well AI systems understand and reason about Indian languages and cultures. OpenAI says that existing multilingual benchmarks (such as MMMLU and MGSM) are saturated, with top models' scores clustering near the ceiling, and that they rely heavily on translation or multiple-choice questions, so they fail to reflect real-world cultural and contextual understanding. IndQA instead uses questions written natively by local experts. It covers 10 domains, including architecture and design, literature and linguistics, law and ethics, religion and spirituality, sports and recreation, everyday life, and food and cuisine, for a total of 2,278 questions in 12 languages (including Hinglish), with English translations provided for auditing and comparison. Each question comes with a scoring rubric and an ideal answer; an automated grader scores each response against the rubric, which makes the evaluation closer to marking open-ended answers and argumentative essays than to checking multiple-choice selections.
To build the benchmark, OpenAI worked with 261 domain experts in India and applied adversarial filtering: a question was kept only if most of the strongest models at the time (GPT-4o, OpenAI o3, GPT-4.5, and GPT-5, which was retested after its public release) failed to answer it to the required standard, which leaves headroom for future improvement. OpenAI's announcement shows results broken down by language and by domain and claims that its models have improved significantly over time. Two caveats apply. First, scores cannot be compared directly across languages. Second, because the questions were filtered against OpenAI's own models, the filtering is a confounder that may put those models at a disadvantage relative to models that were not used for filtering. OpenAI has not clearly stated whether or how the full dataset will be released for download; for now, the benchmark is used mainly to report results, and OpenAI plans to extend the approach to other regions and languages.
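The adversarial filtering step described above can be sketched as follows. This is a minimal illustration, not OpenAI's actual pipeline: the function name, the 0.5 pass threshold, and the reading of "most models" as "more than half" are all assumptions made for the example.

```python
def adversarial_filter(model_scores, pass_threshold=0.5, min_failing_fraction=0.5):
    """Keep question IDs that most reference models fail.

    model_scores maps a question ID to a dict of {model_name: score in [0, 1]}.
    Both thresholds are hypothetical; the announcement gives no exact values.
    """
    kept = []
    for question_id, scores in model_scores.items():
        if not scores:
            continue  # no model results, nothing to decide on
        failing = sum(1 for s in scores.values() if s < pass_threshold)
        # Retain the question only if MORE than the given fraction of models fail.
        if failing / len(scores) > min_failing_fraction:
            kept.append(question_id)
    return kept


# Made-up scores for three hypothetical questions and three hypothetical models.
example_scores = {
    "q1": {"model_a": 0.2, "model_b": 0.3, "model_c": 0.9},  # 2 of 3 fail: kept
    "q2": {"model_a": 0.8, "model_b": 0.9, "model_c": 0.4},  # 1 of 3 fail: dropped
    "q3": {"model_a": 0.1, "model_b": 0.6},                  # 1 of 2 fail: dropped
}
kept_questions = adversarial_filter(example_scores)
```

A filter of this kind guarantees that the reference models score poorly on the retained set by construction, which is why results for those same models need to be read with care.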
Frequently Asked Questions
Q: How does IndQA differ from previous multilingual benchmarks?
A: IndQA emphasizes local culture and contextual understanding, open-ended answers, and detailed scoring rubrics rather than translated or multiple-choice questions. The questions are original, written by local experts and peer reviewed.
Q: Which languages and fields are covered, and what is the scale of the data?
A: There are 2,278 questions in 12 languages (including Bengali, Hindi, Tamil, Telugu, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Hinglish, and English), covering 10 culture-related areas.
Q: How are scores given?
A: Each question comes with a weighted scoring rubric. A grader checks the model's answer against each rubric criterion to see whether the key points are covered, and the final score is computed from the criteria that are met. This is closer to how a human would mark an open-ended answer.
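As a rough sketch of how a weighted rubric can be turned into a score, the example below sums the weights of the criteria judged as met and divides by the total available weight. The normalization, the Criterion class, and the sample criteria are assumptions for illustration; the text above only states that the rubric is weighted.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    weight: float  # points available for this criterion (hypothetical values)


def rubric_score(criteria, met):
    """Return the fraction of available rubric points earned, in [0, 1].

    criteria: list of Criterion
    met: list of bools, one per criterion, as judged by the grader
    """
    if len(criteria) != len(met):
        raise ValueError("need exactly one judgment per criterion")
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c, ok in zip(criteria, met) if ok)
    return earned / total


# Made-up rubric for a single hypothetical question.
rubric = [
    Criterion("Names the correct regional tradition", 5),
    Criterion("Explains its historical origin", 3),
    Criterion("Gives a relevant example", 2),
]
score = rubric_score(rubric, [True, False, True])  # 7 of 10 points earned
```

In this sketch the judgments in `met` are simply given; in practice they would come from the grader that reads the model's answer.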
Q: Is it publicly available for download, and can it be used for leaderboard comparisons?
A: OpenAI has not clearly described a process for downloading the complete dataset or for maintaining a unified leaderboard. It also states that the questions are not the same across languages, so scores should not be compared directly between languages. The benchmark is better suited to tracking the progress of one model family over time.
Q: Why did OpenAI build IndQA?
A: OpenAI says that about 80% of the world's population does not speak English as a primary language, and that existing non-English evaluations are insufficient to measure true ability. India is both a highly multilingual country and ChatGPT's second-largest market, so OpenAI chose to start there.