Apple researchers launch Ferret‑UI Lite, an on‑device AI that sees and controls user interfaces
While on‑device AI has largely been limited to text, Apple now ships Ferret‑UI Lite, a model that sees and manipulates screens. InfoQ reports that the model runs entirely on the device, enabling visual UI control.
Quick Summary
- Ferret‑UI Lite, a 3‑billion‑parameter model, sees screen content and issues UI commands entirely on the device, enabling visual UI control without cloud inference.
- Key company: Apple
Apple’s research team unveiled Ferret‑UI Lite, a 3‑billion‑parameter on‑device model that can “see” screen content and issue direct UI commands, according to a paper summarized by InfoQ [InfoQ]. The model is engineered for both mobile and desktop environments, processing raw screen captures, parsing icons, text, and layout cues, then executing actions such as reading messages or pulling health metrics without ever leaving the device. By keeping inference local, Apple sidesteps the latency, privacy, and connectivity drawbacks that plague larger cloud‑based agents like GPT‑4 or Gemini, which the authors note “require modeling complexity, compute budget… and higher latency” [InfoQ].
The researchers achieved Ferret‑UI Lite’s performance through a two‑stage training pipeline. First, supervised fine‑tuning (SFT) exposed the model to a curated mix of real‑world and synthetic GUI interaction data, ensuring coverage of diverse app designs. In the second stage, reinforcement learning with verifiable rewards (RLVR) optimized the agent for task success rather than mere imitation, a strategy the paper says “significantly improves performance on both grounding and navigation tasks” [InfoQ]. Additional inference‑time tricks—screen‑image cropping, “zoom‑in” focus, and chain‑of‑thought prompting—help the relatively small model resolve complex layouts and tiny UI elements, yielding “competitive, or in some cases superior, performance compared to larger models” [InfoQ].
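Apple has not published its training code, but the core RLVR idea described above, rewarding the agent only when task success can be programmatically verified rather than rewarding imitation of demonstrations, can be illustrated with a toy policy‑gradient loop. Everything below (the two‑target "UI", the reward check, the softmax policy) is an illustrative assumption, not Apple's pipeline:

```python
import math
import random

random.seed(0)

# Toy "UI": two candidate click targets; only index 1 completes the task.
CORRECT_TARGET = 1

def verifiable_reward(action: int) -> float:
    """RLVR-style reward: 1.0 only if the task verifiably succeeded."""
    return 1.0 if action == CORRECT_TARGET else 0.0

# Stand-in policy: softmax over two logits (in practice, the model's head).
logits = [0.0, 0.0]

def sample_action():
    """Sample an action index from the softmax policy; return it with probs."""
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = random.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if r < acc:
            return a, probs
    return len(probs) - 1, probs

# REINFORCE update: mass shifts toward actions with verifiable reward.
lr = 0.5
for _ in range(200):
    action, probs = sample_action()
    reward = verifiable_reward(action)
    for a in range(len(logits)):
        grad = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += lr * reward * grad

_, final_probs = sample_action()
print(f"P(correct target) after training: {final_probs[CORRECT_TARGET]:.2f}")
```

Because the reward is zero unless the verifiable check passes, the policy is pushed toward genuine task completion, which is the distinction the paper draws between RLVR and pure supervised imitation.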
Benchmark results underscore the trade‑offs of a compact agent. On the ScreenSpot‑V2 GUI‑grounding suite, Ferret‑UI Lite attained a 91.6 % accuracy rate, outpacing many heavyweight baselines. Scores dipped on the more demanding ScreenSpot‑Pro (53.3 %) and OSWorld‑G (61.2 %) tests, reflecting the difficulty of interpreting intricate, multi‑window interfaces. For end‑to‑end navigation, the model succeeded in 28.0 % of AndroidWorld scenarios and 19.8 % of OSWorld tasks, numbers that, while modest, still surpass prior on‑device attempts [InfoQ]. The authors acknowledge that “small models continue to struggle with long‑horizon, multi‑step tasks and are sensitive to reward design,” indicating room for future reinforcement‑learning refinements.
Beyond raw metrics, the paper highlights practical implications for Apple’s ecosystem. By standardizing action formats and embedding visual tools directly into the inference loop, Ferret‑UI Lite could serve as an “intelligent” on‑device assistant, automating routine interactions across iOS, macOS, and web apps without exposing user data to external servers. The researchers also stress that synthetic data generation proved pivotal: “curation of synthetic data from diverse sources significantly improves performance in both tasks” [InfoQ]. However, they caution that chain‑of‑thought reasoning and visual tools, while beneficial, offer diminishing returns as task complexity grows.
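The paper's exact action format is not reproduced in the InfoQ coverage, but "standardizing action formats" typically means emitting UI actions in a fixed, machine‑checkable schema that an executor can apply. The sketch below is a hypothetical schema of my own, shown only to make the idea concrete; the verbs, field names, and coordinate convention are assumptions:

```python
import json

# Hypothetical action vocabulary (illustrative; not Apple's published format).
ALLOWED_VERBS = {"tap", "swipe", "type_text", "read"}

def make_action(verb: str, x: float, y: float, text: str = "") -> str:
    """Serialize one UI action as JSON with normalized screen coordinates."""
    if verb not in ALLOWED_VERBS:
        raise ValueError(f"unknown action verb: {verb}")
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("coordinates must be normalized to [0, 1]")
    return json.dumps({"verb": verb, "x": x, "y": y, "text": text})

# A model that always emits actions in this shape is easy to validate and
# execute on-device, regardless of which app or screen produced them.
action = make_action("tap", 0.42, 0.87)
print(action)
```

Normalized coordinates are one common choice for resolution independence; a strict, validated schema is also what makes RLVR‑style verifiable rewards practical, since malformed actions can be rejected outright.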
Industry observers see Ferret‑UI Lite as a proof point for Apple’s broader push toward privacy‑first AI. The model’s ability to operate entirely offline aligns with the company’s long‑standing narrative of keeping user information on the device, a stance that differentiates it from competitors that rely on cloud inference. If Apple can translate the research prototype into a consumer‑ready feature—perhaps integrated into Siri or the upcoming “Intelligent Control” layer—it could reshape how users interact with apps, moving from voice‑only commands to visual, context‑aware automation. The InfoQ report concludes that while challenges remain, the blend of compact architecture, synthetic data, and reinforcement learning positions Ferret‑UI Lite as a viable foundation for the next generation of on‑device UI agents.
Sources
- InfoQ
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.