Qwen has released a new Qwen3-TTS lineup with two capability lines, VoiceDesign-VD-Flash and VoiceClone-VC-Flash. The former uses free-text instructions for fine-grained control over tone, rhythm, emotion, and persona, and is described as not relying on preset voices. The latter clones a voice from roughly 3 seconds of audio and is said to improve multilingual generation and produce more natural pacing and pauses. According to the official announcement, both models outperform several competing or comparable systems in some role-play and multilingual evaluations.
In terms of scope, VoiceClone-VC-Flash is claimed to generate speech in 10 languages (including Chinese, English, Japanese, and Spanish), and the announcement cites metrics such as relative WER (word error rate) reduction. The published figures may not cover all datasets, noise conditions, or evaluation procedures, and real-world results may vary with accent, recording quality, and text domain. The capabilities have been demonstrated on Qwen Chat and public demo pages, and developers can also consult the cloud model and TTS documentation. Voice cloning also raises questions of likeness rights, privacy, and authorization: anyone using voice samples or generated content needs explicit consent and must avoid the risk of impersonation.
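For readers unfamiliar with the metric, the following is a minimal sketch of how WER and a relative WER reduction are commonly computed. It is not the evaluation pipeline behind the official figures, which is not described here; the function names and the example numbers are illustrative assumptions only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + substitution,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up transcripts and rates, purely to show the arithmetic.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(relative_wer_reduction(0.05, 0.04))
```

A relative reduction is measured against the baseline's own error rate, so going from 5% to 4% WER is a 20% relative reduction even though the absolute change is one percentage point. This is one reason headline percentages are hard to compare across datasets.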
FAQs
Q: What problems do the new VoiceDesign and VoiceClone models in Qwen3-TTS solve?
A: VoiceDesign is used to design and control a voice style through text instructions; VoiceClone is used to quickly replicate a specific speaker's timbre from a short audio sample and synthesize speech in that voice across multiple languages.
Q: What are the audio requirements for 3-second voice cloning with VoiceClone-VC-Flash?
A: It usually requires a clear voice recording with little background noise or distortion. The cleaner and more stable the sample, the better the similarity and intelligibility of the cloned voice.
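As a rough local pre-check before submitting a sample, the sketch below reads a 16-bit PCM WAV file with Python's standard library and reports its duration and whether it appears clipped. The 3-second threshold comes from the claim above; the function name, the clipping heuristic, and the WAV-only restriction are assumptions for illustration, not documented requirements of the service.

```python
import array
import sys
import wave


def inspect_wav(path: str, min_seconds: float = 3.0, clip_level: int = 32767) -> dict:
    """Report duration and peak level for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        frames = wav.getnframes()
        raw = wav.readframes(frames)

    # WAV stores samples little-endian; swap bytes on big-endian hosts.
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()

    peak = max((abs(s) for s in samples), default=0)
    duration = frames / rate
    return {
        "duration_s": duration,
        "long_enough": duration >= min_seconds,
        "peak": peak,
        # Samples at full scale suggest the recording was clipped (distorted).
        "clipped": peak >= clip_level,
    }
```

A check like this only catches samples that are too short or obviously distorted; it says nothing about background noise or how well the resulting clone will sound.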
Q: What languages does VoiceClone-VC-Flash support, and what are the common limitations?
A: It is officially claimed to support 10 languages (including Chinese, English, Japanese, and Spanish). In cross-lingual use, accent carry-over, mispronunciation of some proper nouns, and fluctuating intelligibility may occur.
Q: What are the most common pitfalls when using voice cloning?
A: Cloning someone else's voice without authorization, impersonating others, or spreading misleading content; and uploading audio samples that contain sensitive personal information to untrusted environments.