Qwen Launches Free Real-Time Voice‑to‑Voice LLM with Telegram Integration and 25 Tools
Photo by Robin Jonathan Deutsch (unsplash.com/@rodeutsch) on Unsplash
35 billion‑parameter Qwen 3.5 runs fully locally on a Mac Studio, delivering real‑time voice‑to‑voice interaction, Telegram integration and 25+ tools with zero API cost, according to a recent report.
Key Facts
- Key company: Qwen
Qwen 3.5 35B A3B’s 4‑bit quantization lets the model run on a single M1 Ultra Mac Studio with 64 GB of unified memory, occupying roughly 18.5 GB on disk. According to the author’s report, the model scores 37 on the Artificial Analysis Arena benchmark, outpacing GPT‑5.2 (34) and Gemini 3 Flash (35) while tying Claude Haiku 4.5. Its mixture‑of‑experts design (the “A3B” in the name) activates only about 3 billion parameters per token at inference time, and the quantized build still supports tool‑calling after a few tweaks, making it “a breakthrough” for on‑device LLMs (source: report).
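The reported 18.5 GB disk footprint lines up with back‑of‑the‑envelope quantization math. A minimal sketch, assuming roughly 4 bits per weight plus a small per‑group overhead for quantization scales (the exact group size and overhead are assumptions, not from the report):

```python
# Back-of-the-envelope size estimate for a 4-bit quantized model.
# Assumption (not from the source): ~0.25 extra bits per weight for
# per-group scale factors (e.g. one fp16 scale per 64 weights).
def quantized_size_gb(n_params_billion: float,
                      bits_per_weight: float = 4.0,
                      overhead_bits: float = 0.25) -> float:
    total_bits = n_params_billion * 1e9 * (bits_per_weight + overhead_bits)
    return total_bits / 8 / 1e9  # bits -> bytes -> decimal GB

print(f"~{quantized_size_gb(35):.1f} GB")  # close to the reported 18.5 GB
```

At 4.25 effective bits per weight, 35 billion parameters come to about 18.6 GB, which matches the on‑disk figure well.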
The system exposes three distinct interfaces that all route back to the same local Qwen instance. The flagship is a real‑time voice‑to‑voice agent built on the Pipecat Playground. A phone browser connects via WebRTC to a local stack that includes Silero VAD for voice activity detection, MLX Whisper‑Large V3 Turbo Q4 for speech‑to‑text, the Qwen 3.5 engine on port 8081, and Kokoro 82M for text‑to‑speech. All components run locally, and the endpoint is tunneled through Cloudflare’s free tier, allowing the author to bookmark a single URL on his phone and converse with the AI with latency comparable to commercial services such as GPT‑4o or Gemini Flash (source: report).
A second interface is a Telegram bot orchestrated with n8n, which aggregates more than 25 tools. Voice messages are transcribed by the same local Whisper model before being fed to Qwen; documents are parsed by a local doc server; images are processed by Qwen Vision; and external services such as Notion, Pinecone (vector store), Wikipedia, web search, translation, calculator, and a “Think” mode are invoked through n8n workflows. The author notes that this setup replaces his previous $100‑per‑month API stack, delivering “ChatGPT‑level” functionality without any cloud cost (source: report).
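The core idea behind aggregating 25+ tools is a simple name‑to‑handler dispatch, which n8n expresses as workflows. A minimal sketch of that routing pattern in plain Python (tool names and handlers are illustrative, not the author's actual workflow definitions):

```python
# Dispatcher mapping tool names to local handlers, mirroring the role
# the n8n workflows play. The two tools below are toy stand-ins.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "echo": lambda text: text,
}

def dispatch(tool: str, argument: str) -> str:
    """Route a tool call chosen by the LLM to its handler."""
    handler = TOOLS.get(tool)
    if handler is None:
        return f"unknown tool: {tool}"
    return handler(argument)

print(dispatch("calculator", "2 * (3 + 4)"))  # -> 14
```

In the real system the LLM's tool‑call output selects the workflow, and handlers wrap external services (Notion, Pinecone, web search) rather than local lambdas, but the dispatch shape is the same.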
A lightweight Discord bot rounds out the offering. Implemented in roughly 70 lines of Python using discord.py, it connects directly to the Qwen API, maintains per‑channel conversation memory, and mirrors the “Q” personality (dry humor, direct, judgmentally helpful) used in the voice agent. The bot runs as a PM2 service, requiring no n8n orchestration (source: report).
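Per‑channel memory in a bot like this usually amounts to one bounded history per channel ID. A sketch of that bookkeeping, as a discord.py `on_message` handler might use it (the cap of 20 turns and the data shapes are assumptions, not details from the author's 70‑line script):

```python
from collections import defaultdict, deque

MAX_TURNS = 20  # assumed cap; oldest turns fall off automatically

# One bounded deque of chat turns per Discord channel id.
memory = defaultdict(lambda: deque(maxlen=MAX_TURNS))

def remember(channel_id: int, role: str, text: str) -> None:
    """Record a user or assistant turn for this channel."""
    memory[channel_id].append({"role": role, "content": text})

def context_for(channel_id: int) -> list:
    """Messages to prepend to the next Qwen API call for this channel."""
    return list(memory[channel_id])

remember(42, "user", "hello")
remember(42, "assistant", "Hi. What do you actually need?")  # "Q" tone
print(len(context_for(42)))  # -> 2
```

A `deque(maxlen=...)` keeps memory bounded without any explicit trimming code, which is part of how the whole bot fits in roughly 70 lines.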
The author attributes the entire architecture to a two‑day debugging sprint aided by Claude Opus 4.6 Thinking, and he has open‑sourced the code and workflows for the community. By combining a high‑performing 35‑billion‑parameter model, aggressive 4‑bit quantization, and a suite of locally hosted speech, vision, and tool‑calling components, the project demonstrates that fully private, real‑time AI assistants are now feasible on consumer‑grade hardware. The approach aligns with recent coverage of at‑home LLM deployments, such as The Register’s guide to running LLMs on a PC with llama.cpp, underscoring a broader shift toward edge‑centric AI (source: The Register).
Sources
No primary source found (coverage-based)
- Reddit: r/MachineLearning
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.