"Lobster" Model Rankings Are Here! MiniMax and Kimi Make the Top Three

CryptocurrencySniper · 2026-03-09T11:36:11+00:00

OpenClaw, whose setup and upkeep users call "Lobster Farming," has quickly become popular and a representative example of tool-based AI. It carries out tasks on command through chat apps, but it must be deployed and configured first. The PinchBench website evaluated 33 large language models on OpenClaw tasks. Google's Gemini-3-Flash-Preview performed best, while the Chinese models minimax-m2.1 and kimi-k2.5 also achieved high success rates at low cost, suggesting they are competitive in this market.

In the past week, “Lobster Farming” has become a huge craze!

Long lines of people queued outside Tencent's headquarters to get "lobsters" installed for free, and second-hand platforms such as Xianyu now carry dozens to hundreds of listings offering "lobster" installation services. Major cloud providers have also launched one-click deployment tutorials and services. Here, though, "lobster" doesn't mean the crayfish people eat; it means OpenClaw. "Claw" evokes both a lobster's claw and a grabbing tool, which fits its role as a tool, and OpenClaw's mascot is a cute lobster.

The OpenClaw website describes it as "The AI that actually does things." It can clean up your inbox, send emails, manage your schedule, check you in for flights, and more, all in response to commands you send through connected chat apps such as WhatsApp, Telegram, Feishu, and DingTalk.

OpenClaw doesn't work out of the box: it has to be deployed and configured, and skills are added to it over time. That's why the process is called "Lobster Farming." The first decision when deploying OpenClaw is which large model should serve as its "brain," and PinchBench was created to help answer that question.

PinchBench benchmarks large models specifically on OpenClaw tasks. So far, it has tested 33 of the world's leading large models.

By success rate, Google's Gemini-3-Flash-Preview ranks first at 95.1%. The Chinese models Minimax-m2.1 and Kimi-k2.5 also made the top three, at 93.6% and 93.4% respectively, outperforming many Claude models.

Minimax-m2.1 and Kimi-k2.5 also do well on testing cost, pairing high success rates with costs far below Gemini-3-Flash-Preview's: $0.14 and $0.20 respectively, versus $0.72 for Gemini.
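To put the cost gap in proportion, here is a quick calculation using only the success rates and costs quoted above. It is a minimal sketch: the dictionary keys are informal labels chosen here, and treating each cost as the price of one full benchmark run is an assumption, since the article does not say exactly what the cost covers.

```python
# Figures as quoted in the article: (PinchBench success rate in %, testing cost in USD).
# Assumption: each cost is the total for one benchmark run; labels are informal.
results = {
    "gemini-3-flash-preview": (95.1, 0.72),
    "minimax-m2.1": (93.6, 0.14),
    "kimi-k2.5": (93.4, 0.20),
}


def relative_cost(model, baseline="gemini-3-flash-preview"):
    """Return a model's testing cost as a fraction of the baseline model's cost."""
    return results[model][1] / results[baseline][1]


# Print models from cheapest to most expensive.
for name, (rate, cost) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{name}: {rate:.1f}% success, ${cost:.2f}, "
          f"{relative_cost(name):.0%} of Gemini's cost")
```

By this arithmetic, Minimax-m2.1 costs about 19% of what Gemini-3-Flash-Preview does and Kimi-k2.5 about 28%, while giving up only 1.5 and 1.7 percentage points of success rate.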

On task completion speed, Minimax-m2.1 and Kimi-k2.5 land around the middle of the seven models with success rates above 90%.

No wonder, then, that OpenClaw founder Peter Steinberger said in a podcast interview that he considered Minimax 2.1 the best open-source model available (at the time, he had not yet tested the latest models from Minimax and Kimi).

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.