LM Studio launches headless CLI and Claude Code support, enabling Google's Gemma 4 to run locally.
Photo by Markus Spiske on Unsplash
LM Studio released version 0.4.0, adding a headless CLI and Claude Code support that let users run Google’s Gemma 4 26B model locally on macOS, according to Hnrss.
Key Facts
- Key company: LM Studio
LM Studio’s 0.4.0 release adds a “headless” command‑line interface (the `lms` CLI) and a new integration called Claude Code, which together make it possible to run Google’s Gemma 4 26B model entirely on a local macOS machine. Headless mode strips away the graphical front‑end, allowing developers to invoke inference directly from the terminal or embed it in scripts, while the Claude Code integration supplies a thin wrapper that translates the model’s output into the JSON‑based schema expected by Anthropic’s Claude Code assistant. According to Hnrss, the combination “lets users run Google’s Gemma 4 26B model locally on macOS” without needing an external API endpoint.
The key to Gemma 4’s feasibility on consumer hardware lies in its mixture‑of‑experts (MoE) architecture. The 26‑billion‑parameter model contains 128 expert sub‑networks plus a shared expert, but only a subset, roughly 4 billion parameters, is activated for each forward pass. This selective activation dramatically reduces memory bandwidth and compute requirements. George Liu notes that on his 14‑inch MacBook Pro M4 Pro with 48 GB of unified memory, the model “fits comfortably and generates at 51 tokens per second,” a performance level that would be impossible with a dense 26B model (Liu). The MoE design also means the model can scale down to devices with limited VRAM while preserving most of the quality of the larger dense variant.
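The routing idea behind selective activation can be sketched in a few lines. The toy layer below is purely illustrative, not LM Studio's or Google's implementation; the expert count, top‑k value, and dimensions are arbitrary placeholders.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Toy mixture-of-experts layer: route one token to its top-k experts.

    x:       (d,) input vector for one token
    gate_w:  (n_experts, d) router weights
    experts: list of (d, d) weight matrices, one per expert
    k:       number of experts activated per token
    """
    logits = gate_w @ x                # one router score per expert
    top = np.argsort(logits)[-k:]     # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()          # softmax over the selected experts only
    # Only k of the n experts do any work; the rest are skipped entirely,
    # which is why active parameters stay far below the total parameter count.
    return sum(w * (experts[i] @ x) for i, w in zip(top, weights))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
gate_w = rng.standard_normal((n_experts, d))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)  # (8,)
```

With k=2 of 16 experts active, only 2/16 of the expert weights touch memory per token, which is the same mechanism that keeps Gemma 4's active parameters near 4B out of 26B.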
Claude Code integration is implemented as an alias command (`claude‑lm`) that forwards prompts to the LM Studio API and reformats the response for Claude’s code‑assistant workflow. Liu reports that while the local setup eliminates API costs and data‑exfiltration risks, “there’s significant slowdowns when used within Claude Code from my experience.” The bottleneck appears to stem from the extra JSON serialization step and the need to batch prompts for the MoE routing logic, which adds latency compared to a straight terminal call. Nonetheless, the ability to keep proprietary code snippets on‑device is a compelling trade‑off for many developers concerned about privacy and compliance.
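A wrapper of the kind described can be approximated against LM Studio's OpenAI‑compatible local server. The sketch below is a rough illustration, not the actual `claude‑lm` alias; the port and endpoint path follow LM Studio's documented defaults, while the model identifier is an assumption.

```python
import json
import urllib.request

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"  # LM Studio's default local server
MODEL_ID = "google/gemma-4-26b-a4b"  # assumed identifier; check your loaded models for the real one

def build_payload(prompt: str, model: str = MODEL_ID) -> dict:
    """Package a prompt into the OpenAI-style chat schema the local server expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def ask_local_model(prompt: str) -> str:
    """POST the prompt to the local endpoint and return the first completion."""
    req = urllib.request.Request(
        LM_STUDIO_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Requires a running LM Studio server with the model loaded.
    print(ask_local_model("Explain this stack trace in one sentence."))
```

The extra serialization step visible here (building and parsing JSON on every round trip) is one plausible source of the latency Liu describes.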
Gemma 4 is offered in four variants, each targeting a different hardware footprint. The “E” models (E2B, E4B) include per‑layer embeddings optimized for on‑device deployment and support audio input, while the 31B dense model delivers the highest benchmark scores: 85.2% on MMLU Pro and 89.2% on AIME 2026, according to Google’s own release (Liu). Liu opted for the 26B‑A4B variant precisely because its MoE structure balances capability and resource consumption, making it the sweet spot for a laptop‑class Mac. The headless CLI also exposes a low‑level `llmster` binary that can be called from any programming language, enabling custom pipelines that chain Gemma 4 with other local models or external tools without leaving the host environment.
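Calling a low‑level binary from another language typically reduces to spawning a process and capturing its stdout. The sketch below uses the `llmster` binary name reported above, but the flag names (`--model`, `--prompt`, `--max-tokens`) are hypothetical placeholders, since no public interface for it is documented here.

```python
import shutil
import subprocess

def llmster_cmd(model: str, prompt: str, max_tokens: int = 256) -> list:
    """Assemble the argv list for a hypothetical llmster invocation (flags are placeholders)."""
    return [
        "llmster",
        "--model", model,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
    ]

def run_llmster(model: str, prompt: str) -> str:
    """Spawn the binary and return its stdout; fails clearly if llmster is not installed."""
    if shutil.which("llmster") is None:
        raise FileNotFoundError("llmster binary not found on PATH")
    result = subprocess.run(
        llmster_cmd(model, prompt),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Requires the llmster binary shipped with the headless CLI.
    print(run_llmster("gemma-4-26b-a4b", "Summarize this diff."))
```

Because the wrapper is just process spawning, the same pattern ports directly to Go, Rust, or shell scripts, which is what makes the binary usable "from any programming language."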
The broader implication of LM Studio’s update is a shift toward self‑hosted AI workflows on mainstream hardware. By removing the reliance on cloud APIs, developers can avoid “rate limits, usage costs, privacy concerns, and network latency,” the issues Liu enumerates as frequent obstacles to rapid prototyping (Liu). While the current performance figures (≈51 tps on an M4 Pro) are modest compared to server‑grade GPUs, they are sufficient for many everyday tasks such as code review, prompt testing, or short‑form content generation. As MoE models mature and tools like LM Studio refine their CLI and integration layers, the gap between cloud‑only AI services and locally hosted alternatives should narrow, giving developers a viable path to fully offline, cost‑free inference.
Reporting based on verified sources and public filings. Sector HQ editorial standards require multi-source attribution.