Experts Test Claude's Limits, Reveal Surprising Strengths and Weaknesses
Although Alinpanaitiu once expected Claude to fade the way NFTs did, a trial at his friend's workplace showed the model can genuinely excel at macOS app-development tasks, exposing both surprising strengths and clear weaknesses.
Key Facts
- Key company: Anthropic (maker of Claude)
Claude’s performance on the “Stages” project—an ambitious macOS app‑switcher that records and restores window states—revealed a level of code synthesis that many developers have not seen from a single LLM. According to the detailed account posted by Alinpanaitiu on March 4, 2026, Claude was fed a 200‑line specification (stages.md) and then tasked with refactoring an existing codebase, integrating custom window‑state persistence, and implementing a workaround for macOS’s lack of instant hide/show functionality.

Within minutes the model produced a diff that compiled without error in Xcode, and the resulting binary correctly hid all but the designated apps, restored project folders in VS Code and Ghostty, and even performed a web search to reverse‑engineer the Aerospace window manager’s hiding trick. The author described the outcome as “immense” but functional, noting that the app behaved as intended on the first run—a result that contrasts sharply with his prior experience of LLM‑generated snippets stalling after 50 lines.
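The account does not include Claude's actual code, but the window‑state persistence it describes can be sketched in a few lines of Swift. The types below (`Stage`, `WindowState`) and their fields are illustrative assumptions, not the real Stages data model: a stage is simply a named set of apps, each with a window frame and an optional project folder to restore, serialized to JSON via `Codable`.

```swift
import Foundation

// Hypothetical sketch of per-stage window-state persistence.
// Type and field names are assumptions, not the actual Stages codebase.
struct WindowState: Codable, Equatable {
    let bundleID: String   // e.g. "com.microsoft.VSCode"
    let frame: [Double]    // x, y, width, height
    let openPath: String?  // project folder to restore, if any
}

struct Stage: Codable, Equatable {
    let name: String
    let windows: [WindowState]
}

// Write a stage to disk as JSON.
func save(_ stage: Stage, to url: URL) throws {
    let encoder = JSONEncoder()
    encoder.outputFormatting = .prettyPrinted
    try encoder.encode(stage).write(to: url)
}

// Read a stage back from disk.
func load(from url: URL) throws -> Stage {
    try JSONDecoder().decode(Stage.self, from: Data(contentsOf: url))
}

let stage = Stage(name: "writing", windows: [
    WindowState(bundleID: "com.mitchellh.ghostty",
                frame: [0, 0, 1280, 800],
                openPath: "~/Projects/stages")
])
let url = FileManager.default.temporaryDirectory
    .appendingPathComponent("stage.json")
try save(stage, to: url)
let restored = try load(from: url)
assert(restored == stage)
```

Restoring a stage would then iterate over `windows`, hiding every running app whose bundle identifier is not in the set (AppKit's `NSRunningApplication.hide()` is the documented route; the instant hide/show workaround borrowed from Aerospace is not reproduced here).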
The success, however, was not universal. While Claude handled the “Stages” workflow end‑to‑end, Alinpanaitiu observed that the model’s reasoning was heavily guided by the specific prompts and the developer’s domain knowledge. The code diff was large, and the author admitted that without his own verification of the window‑manager hack, the solution might have been brittle. This aligns with broader developer sentiment captured by VentureBeat, which reported that Anthropic has recently imposed weekly rate limits on Claude subscriptions after “some users have been running Claude 24/7, with the majority of usage centered around its Claude Code product.” The throttling suggests that while Claude can produce high‑quality output on complex tasks, sustained heavy usage strains Anthropic’s infrastructure and may expose inconsistencies when the model is pushed beyond its optimal prompt length or context window.
Anthropic’s internal analysis, also cited by VentureBeat, underscores a paradox: the company’s own data from 700,000 Claude conversations shows the assistant exhibits a “moral code of its own,” yet the same dataset highlights variability in technical reliability. The “Stages” experiment illustrates this duality—Claude demonstrated sophisticated problem‑solving, such as locating and adapting open‑source code from Aerospace, but the process required the developer to intervene, validate, and sometimes correct the model’s assumptions. This mirrors findings from an Ars Technica feature on a separate experiment where sixteen Claude agents collaboratively built a new C compiler. Although the project succeeded in compiling a Linux kernel, the report emphasized that “deep human management” was essential to keep the agents aligned and to resolve integration errors, reinforcing the notion that Claude’s strengths are amplified when paired with expert oversight.
From a market perspective, the mixed results of Claude’s code generation have implications for enterprise adoption. Companies that have subscribed to Claude Code, such as the employer of Alinpanaitiu’s friend, may see breakthrough productivity gains on niche, high‑complexity tasks, yet they must also contend with the platform’s usage caps and the need for internal code‑review pipelines. The recent rate‑limit rollout could temper enthusiasm among developers who were counting on unrestricted access for continuous‑integration workflows. Moreover, the anecdotal evidence of Claude’s ability to autonomously research and integrate third‑party tools suggests a competitive edge over rivals that rely more heavily on static prompt‑to‑code mappings.
In sum, the “Stages” case study offers a concrete illustration of Claude’s evolving capabilities: it can navigate intricate macOS APIs, synthesize large codebases, and even perform auxiliary web research, but its output remains contingent on precise prompting and human validation. As Anthropic tightens access and continues to analyze conversational data, the industry will watch whether these safeguards improve consistency or inadvertently curb the very productivity gains that early adopters like Alinpanaitiu have begun to document.
Sources
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.