Microsoft launches three new AI models, targeting OpenAI and Google with speech and image

Microsoft has long relied on OpenAI’s APIs, but it has now unveiled three home‑grown models (a speech transcriber, a voice generator, and an upgraded image creator), signaling a direct push against OpenAI and Google, VentureBeat reports.

Key Facts

•Key company: Microsoft
•Also mentioned: Microsoft, Google, Meta

Microsoft’s three new foundational models (MAI‑Transcribe‑1, MAI‑Voice‑1, and MAI‑Image‑2) are being rolled out through Azure AI Foundry (formerly Azure AI Studio) and a dedicated MAI Playground, according to VentureBeat. Microsoft bills MAI‑Transcribe‑1 as “the very best in the world for transcription,” and the models are claimed to run on roughly half the GPU budget of comparable state‑of‑the‑art systems, a saving the company says translates directly into lower cost of goods sold for its own SaaS offerings (VentureBeat).

MAI‑Transcribe‑1 is a speech‑to‑text engine that supports 25 languages and promises “enterprise‑grade accuracy” while consuming about 50% fewer GPU cycles than leading alternatives (The Register). The architecture reportedly builds on a transformer‑based encoder–decoder pipeline optimized for low‑latency inference, with a quantization strategy that preserves phonetic detail despite the reduced compute envelope. In internal benchmarks disclosed to VentureBeat, the model achieved a lower word error rate (WER) than OpenAI’s Whisper and Google’s Speech‑to‑Text services on a mixed‑domain test set, though the exact figures were not released.
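For readers unfamiliar with the metric, WER is conventionally the word‑level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system's output, divided by the number of words in the reference. The Python sketch below illustrates that general definition only. It is not Microsoft's benchmark code, the function name and sample sentences are invented for this example, and real evaluations typically normalize casing and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; not tied to any vendor's tooling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word dropped)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences: one reference word ("the") is missing from the output.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value is better, and because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference.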

MAI‑Voice‑1 is a neural speech synthesis model that can generate 60 seconds of high‑fidelity audio in under a second on a single GPU, according to The Register. The system leverages a diffusion‑based vocoder combined with a prosody‑aware text‑to‑speech front‑end, enabling rapid generation without sacrificing naturalness. Microsoft’s blog post, cited by The Register, notes that the model already powers Copilot, Bing, PowerPoint, and Azure Speech, suggesting a production‑grade pipeline that has been hardened for large‑scale deployment. The claim of sub‑second generation is significant because it places the model in the same latency tier as OpenAI’s latest voice models, but with a markedly lower hardware footprint.
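The speed claim can be restated as a real‑time factor (RTF), the ratio of processing time to audio duration that is commonly used to compare speech systems. The short Python sketch below works through the arithmetic using the reported figures as bounds. The function names are illustrative, and the one‑second value is only the upper limit implied by "under a second", not a measured number.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def speedup_over_real_time(generation_seconds: float, audio_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


if __name__ == "__main__":
    # Assumed bound from the article: 60 s of audio in at most 1 s on one GPU.
    audio_seconds = 60.0
    generation_seconds_upper_bound = 1.0

    rtf = real_time_factor(generation_seconds_upper_bound, audio_seconds)
    speedup = speedup_over_real_time(generation_seconds_upper_bound, audio_seconds)
    print(f"RTF is at most {rtf:.4f}; speedup is at least {speedup:.0f}x real time")
```

Under those assumptions the reported claim corresponds to an RTF of at most 1/60, meaning at least 60 seconds of audio per second of compute on a single GPU.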

MAI‑Image‑2 is the second iteration of Microsoft’s text‑to‑image offering, described by VentureBeat as an “upgraded image creator.” While the press release does not enumerate model size or training data volume, the description implies a diffusion‑based architecture akin to Stable Diffusion, but tuned for enterprise use cases such as marketing asset generation and design‑assist workflows. The Register highlights that the model is already integrated into Microsoft’s internal products, which suggests that it has been trained on a curated dataset that balances artistic diversity with brand‑safe content filters.

All three models are being offered on a “public preview” basis, with pricing framed as “aggressive” to undercut the cost structures of OpenAI and Google (The Register). Naomi Moneypenny, head of the Azure AI Foundry Models product team, emphasized that the models are “the same models already powering our own products” and are now available exclusively on Foundry for developers (The Register). This strategy aligns with the broader “AI self‑sufficiency” agenda articulated by Mustafa Suleyman, who founded Microsoft’s Superintelligence team six months ago and has positioned the launch as the first concrete step toward reducing reliance on external AI providers (VentureBeat).

The timing of the release is noteworthy. Microsoft’s stock recently posted its worst quarterly performance since the 2008 financial crisis, prompting investors to demand tangible returns on the company’s multi‑hundred‑billion‑dollar AI infrastructure spend (VentureBeat). By delivering in‑house models that claim superior performance at lower compute cost, Microsoft aims to improve its margins on AI‑driven services while also establishing a competitive moat against OpenAI, in which it holds a $135 billion stake as of October 2025 (The Register). Whether the technical advantages translate into market share will depend on developer adoption of the Foundry platform and the ability of MAI‑Transcribe‑1, MAI‑Voice‑1, and MAI‑Image‑2 to sustain their claimed performance metrics at scale.
