
MiniMax M2.7 Launches, Claims 30% Self‑Improvement After 13 Blind Evaluations in Hours

Published by
SectorHQ Editorial


MiniMax unveiled its M2.7 model today, claiming a 30% self‑improvement over its nine‑month‑old M1; within hours, that claim was put to 13 blind evaluations scored by external judges, according to a recent report.

Key Facts

  • Key company: MiniMax

MiniMax’s claim of a 30% self‑improvement hinges on a recursive reinforcement‑learning (RL) loop that the company says runs “100+ optimization cycles with no human in the loop.” In practice, the model was evaluated through a single‑turn, blind‑peer framework called The Multivac, which pits responses against those of older MiniMax versions and three external frontier judges: Claude Sonnet 4.6, GPT‑5.4 and Gemini 3.1 Pro (the latter failed to produce parseable rankings in most runs). Across 13 evaluations, GPT‑5.4 topped the leaderboard with an average score of 9.26, while MiniMax’s newest M2.7 posted an 8.46 average, virtually identical to the 8.47 of its nine‑month‑old predecessor M1 (see the results table in the Multivac report). On the nine shared evals in which both M1 and M2.7 participated, M1 outscored M2.7 on five, suggesting the purported 30% gain is lost in statistical noise rather than reflected in real‑world performance.
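For context on why near‑identical averages and a split head‑to‑head record read as noise, here is a minimal sketch of how per‑evaluation scores roll up into figures like those above. The per‑eval numbers are hypothetical placeholders, not the actual Multivac data.

```python
# Illustrative sketch: rolling per-evaluation scores up into an average and a
# head-to-head record. The individual scores below are hypothetical
# placeholders, not the published Multivac results.
from statistics import mean

# Hypothetical scores for nine shared evaluations (0-10 scale).
m1_scores = [8.9, 8.2, 8.6, 8.1, 8.7, 8.3, 8.5, 8.6, 8.3]
m27_scores = [8.7, 8.4, 8.5, 8.3, 8.6, 8.2, 8.6, 8.5, 8.4]

print(f"M1 mean:   {mean(m1_scores):.2f}")
print(f"M2.7 mean: {mean(m27_scores):.2f}")

# Head-to-head tally across the shared evals.
m1_wins = sum(a > b for a, b in zip(m1_scores, m27_scores))
m27_wins = sum(b > a for a, b in zip(m1_scores, m27_scores))
print(f"M1 wins {m1_wins} of {len(m1_scores)}, M2.7 wins {m27_wins}")
```

With averages this close and wins split roughly evenly, the difference between the two models sits well inside run‑to‑run variance, which is the core of the criticism above.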

The evaluation methodology itself was revised after earlier batches that relied on intra‑MiniMax judging, which community members flagged as a “noise source.” The current batch uses independent frontier models to reduce bias, yet the core test remains single‑turn question‑answering. Critics on the Multivac Discord argue that this format is a “sprint test for a marathon runner,” insufficient for measuring the multi‑turn, agentic capabilities MiniMax advertises. To probe the self‑improvement claim, the evaluator designed three bespoke prompts: “Debug Your Own Reasoning Chain,” “Iterative Self‑Improvement on Code,” and “Recursive Optimization Under Constraints.” M2.7 tied for first only on the last prompt; it placed sixth and fifth on the first two, respectively, with scores of 7.41 and 5.96 versus GPT‑5.4’s 9.97 and 7.06. Its limited success in the very tests designed to target the claim underscores the gap between MiniMax’s advertised recursive RL loops and observable gains in multi‑turn reasoning.

Open‑source tooling also shapes the interpretation of results. All models were accessed via the OpenRouter API, meaning quantization and inference settings were provider‑controlled, a fact disclosed by the evaluator. Consequently, the reported scores reflect API‑level output rather than raw model capability, and local deployments could yield different outcomes. The evaluation engine itself is MIT‑licensed, allowing community members to replicate or extend the tests. Several users have already suggested alternative agentic harnesses—Claude Code, Kilo Code, OpenClaw—to better capture the iterative improvement dynamics MiniMax claims. The community’s next step, according to the Multivac thread, is to construct a fair iterative benchmark that measures how a model’s output evolves across successive rounds, rather than a static snapshot.
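For readers who want to replicate the setup, OpenRouter exposes an OpenAI‑compatible chat‑completions endpoint; the sketch below shows one way a harness might query a model through it. The model slug and prompt are illustrative assumptions rather than the Multivac harness itself, and, as noted above, quantization and inference settings remain provider‑controlled.

```python
# Minimal sketch of querying a model through OpenRouter's OpenAI-compatible
# chat-completions endpoint. Model slug and prompt are illustrative; this is
# not a reproduction of the Multivac evaluation harness.
import os
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"

def ask(model: str, prompt: str) -> str:
    """Send a single-turn prompt to a model via OpenRouter and return its reply."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Hypothetical model slug; inference settings stay under the provider's control.
    print(ask("minimax/minimax-m2", "Explain your approach to iterative self-improvement."))
```

Because every model goes through the same API path, the comparison is consistent across contenders, but any provider‑side quantization applies equally and invisibly, which is why local deployments could score differently.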

Despite the modest empirical gains, MiniMax’s broader market positioning remains aggressive. The company has framed M2.7 as a “next‑generation” model capable of autonomous self‑optimization, a narrative that resonates with investors seeking “AI that can improve itself without costly human feedback loops.” However, the Multivac data shows a tier gap of roughly 0.8 points between the best MiniMax model (M2.7 at 8.46) and the leading external contender (GPT‑5.4 at 9.26) on a ten‑point scale. In practical terms, this difference translates to a noticeable performance margin in hard reasoning and code generation tasks, where frontier models continue to dominate. MiniMax’s claim of matching Gemini 3.1 in machine‑learning competitions, while technically true in a limited benchmark, does not yet extend to the broader suite of reasoning challenges that enterprise customers prioritize.

The community’s response highlights a demand for more rigorous, multi‑turn testing frameworks. Proposals circulating on the Multivac Discord include designing a “recursive self‑improvement ladder” where a model must iteratively refine its own code or reasoning chain over several cycles, with each iteration judged independently. Such a setup would better surface any genuine gains from MiniMax’s internal RL pipeline and provide a clearer signal to investors and developers alike. Until such benchmarks are publicly released and the results independently verified, the 30 % self‑improvement claim remains, at best, an aspirational metric rather than a demonstrable advantage.
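One way to picture the proposed “recursive self‑improvement ladder” is as a loop in which the candidate model revises its own answer each round and an independent judge scores every revision, so the measured signal is a trajectory rather than a single snapshot. The sketch below is a hypothetical outline under those assumptions; the generate and judge callables stand in for whatever model and judge calls a real harness would make.

```python
# Sketch of the "recursive self-improvement ladder" idea discussed on the
# Multivac Discord: the model refines its own answer over several rounds, and
# each round is scored independently, so improvement (or the lack of it) shows
# up as a curve. Callables are placeholders, not a real harness API.
from typing import Callable, List

def self_improvement_ladder(
    task: str,
    generate: Callable[[str], str],      # candidate model: prompt -> answer
    judge: Callable[[str, str], float],  # independent judge: (task, answer) -> score
    rounds: int = 5,
) -> List[float]:
    """Return one score per round, tracking how the answer evolves."""
    answer = generate(task)
    scores = [judge(task, answer)]
    for _ in range(rounds - 1):
        critique_prompt = (
            f"Task: {task}\n\nYour previous answer:\n{answer}\n\n"
            "Critique your answer and produce an improved version."
        )
        answer = generate(critique_prompt)
        scores.append(judge(task, answer))
    return scores  # a flat or declining curve would undercut the self-improvement claim
```

A rising curve across rounds would be direct evidence for the kind of recursive optimization MiniMax advertises; a flat one would confirm the skeptics.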

Sources

Primary source

No primary source found (coverage-based)

Other signals
  • Reddit - r/LocalLLaMA

Reporting based on verified sources and public filings. Sector HQ editorial standards require multi-source attribution.
