Google expands USM to scale automatic speech recognition across 100+ languages
Photo by 2H Media (unsplash.com/@2hmedia) on Unsplash
Google has expanded its Unified Speech Model (USM) to support automatic speech recognition in more than 100 languages, scaling the system for broader multilingual deployment, reports indicate.
Quick Summary
- •Google has expanded its Unified Speech Model (USM) to support automatic speech recognition in more than 100 languages, scaling the system for broader multilingual deployment, reports indicate.
- •Key company: Google
Google’s Unified Speech Model (USM) now incorporates a shared encoder‑decoder architecture that can ingest audio streams from any of the 100‑plus supported languages and produce transcriptions without language‑specific front‑ends, according to the technical paper posted on Paperium on February 22. The authors describe USM as a “single, multilingual model” that replaces the previous pipeline of per‑language acoustic and language models with a unified neural network trained on a massive, balanced corpus spanning high‑resource languages such as English and Mandarin and low‑resource languages including Yoruba and Khmer.
The paper outlines two key engineering advances that enable the scale‑up. First, the team applied a multilingual training regime that leverages language‑agnostic phoneme embeddings, allowing the model to learn cross‑lingual acoustic patterns while preserving language‑specific nuances. Second, they introduced a dynamic routing mechanism inside the transformer layers that allocates computational capacity proportionally to the linguistic complexity of each input, thereby keeping inference latency under 200 ms on Google’s TPU v4 hardware for all supported languages. The authors report a relative word‑error‑rate (WER) improvement of 12 % on average compared with the legacy per‑language models, with gains ranging from 5 % on well‑represented languages to 20 % on under‑represented ones.
In addition to the architectural changes, the researchers detail a data‑augmentation pipeline that synthesizes speech for low‑resource languages using a high‑fidelity text‑to‑speech system trained on the same multilingual corpus. This synthetic data, combined with active learning loops that prioritize utterances with high model uncertainty, expands the effective training set to over 10 million hours of speech. The paper notes that the expanded dataset is stored in a sharded, streaming format that permits continuous model updates without full retraining, a design choice that supports Google’s goal of “ever‑green” speech recognition across all markets.
The authors also discuss deployment considerations. USM’s unified output format includes language‑identification tags, enabling downstream services—such as live captioning, voice assistants, and automated transcription—to route results to language‑specific post‑processing modules only when needed. This reduces the overall system footprint by an estimated 30 % compared with maintaining separate pipelines for each language. The paper concludes by highlighting ongoing work to integrate USM with Google’s on‑device speech recognizers, aiming to bring the same multilingual capabilities to Android phones and ChromeOS devices while respecting user privacy through on‑device inference.
Sources
No primary source found (coverage-based)
- Dev.to Machine Learning Tag
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.