1. Abstract
The Alibaba Tongyi Voice Team (FunAudioLLM) has open-sourced two audio models: Fun-CosyVoice3-0.5B-2512 (TTS) for speech synthesis and Fun-ASR-Nano-2512 (ASR) for speech recognition. The former emphasizes multilingual synthesis, zero-shot voice cloning, and low-latency streaming; the latter emphasizes recognition across 31 languages, dialect and accent coverage, and real-time dictation. Together they cover end-to-end applications from "voiceover generation" to "voice transcription."
2. Core Features
- Fun-CosyVoice3-0.5B (TTS)
  - Covers 9 common languages and supports 18+ Chinese dialects/accents and cross-language zero-shot voice cloning.
  - Supports streaming text input and streaming audio output (bidirectional streaming) for low-latency interaction.
  - Supports instruction-based control (e.g., language, dialect, speech rate, volume) and stronger text normalization.
- Fun-ASR-Nano (ASR)
  - Covers 31 languages and supports free switching between languages and mixed-language recognition.
  - Recognizes major Chinese dialects and multi-regional accents, and is suited to complex scenarios such as meetings and in-vehicle use.
  - Provides low-latency real-time transcription and can be called via funasr's AutoModel.
3. Installation
- TTS (CosyVoice / Fun-CosyVoice3)
  - Clone the CosyVoice repository and install the dependencies (following the requirements file and the official examples).
  - Download the Fun-CosyVoice3-0.5B-2512 weights from Hugging Face, or let the example script pull them automatically.
  - For streaming inference, prefer the official streaming examples/server-side scripts to avoid the broken sentences and high latency that hand-rolled audio stitching can cause.
- ASR (Fun-ASR / Fun-ASR-Nano)
  - Install funasr along with the dependencies listed in the repository/model card.
  - Load the model with AutoModel(..., trust_remote_code=True), following the model card example.
  - For real-time dictation, run inference on short frames/small segments, and handle incremental output merging and error correction at the application layer.
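The real-time dictation advice above (short segments, with incremental merging at the application layer) can be sketched roughly as follows. This is an illustration only, not the Fun-ASR streaming API: `recognize` is a hypothetical callable standing in for whatever per-segment inference call the model card documents, and the window size, hop size, and overlap-merging rule are all assumptions.

```python
from typing import Callable, Iterator, Optional, Sequence


def split_into_chunks(
    samples: Sequence, chunk_size: int, hop_size: Optional[int] = None
) -> Iterator[Sequence]:
    """Yield windows of `chunk_size` samples, advancing by `hop_size`.

    With hop_size < chunk_size the windows overlap; the last windows may be shorter.
    """
    hop = chunk_size if hop_size is None else hop_size
    if chunk_size <= 0 or hop <= 0:
        raise ValueError("chunk_size and hop_size must be positive")
    for start in range(0, len(samples), hop):
        yield samples[start:start + chunk_size]


def merge_partial(transcript: str, partial: str) -> str:
    """Append `partial`, dropping its longest prefix that already ends `transcript`.

    This is a simple heuristic for overlapping windows; it can also remove a
    legitimately repeated word or character, so real systems need error correction.
    """
    max_overlap = min(len(transcript), len(partial))
    for size in range(max_overlap, 0, -1):
        if transcript.endswith(partial[:size]):
            return transcript + partial[size:]
    return transcript + partial


def transcribe_stream(
    samples: Sequence,
    chunk_size: int,
    recognize: Callable[[Sequence], str],  # hypothetical per-segment recognizer
    hop_size: Optional[int] = None,
) -> str:
    """Run the recognizer window by window and merge the partial results."""
    transcript = ""
    for chunk in split_into_chunks(samples, chunk_size, hop_size):
        transcript = merge_partial(transcript, recognize(chunk))
    return transcript
```

In practice the window length trades latency against accuracy, so it is worth tuning on your own recordings rather than taking any value here as a recommendation.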
4. Typical use cases
- Cross-language dubbing and audio content: multilingual TTS with a consistent timbre, suited to video dubbing, podcasts, and learning content.
- Voice cloning and character dubbing: zero-shot cloning from a small amount of reference audio, for virtual characters and multi-character narration (authorization required).
- Real-time transcription of meetings/classes: low-latency dictation, plus hot words/word lists (if the toolchain supports them) to improve accuracy on specialized names.
- Call center quality inspection: ASR transcripts support search, compliance audits, and summaries; manual review is recommended for critical steps.
5. Ecosystem and Competing Products
- Ecosystem
  - The TTS side is built mainly on the CosyVoice project, with weights released on Hugging Face, ModelScope, and similar hubs, which eases deployment and reproduction.
  - The ASR side provides the Fun-ASR repository and model weights, and plugs into the funasr toolchain.
- Common comparisons with competing products
  - Competing TTS options include open-source solutions such as VITS and F5-TTS as well as commercial cloud TTS; what sets Fun-CosyVoice3 apart is the combination of "multilingual zero-shot cloning + bidirectional streaming + instruction control".
  - Common ASR baselines include the Whisper family, WeNet, etc.; Fun-ASR-Nano emphasizes multilingual support, dialect and accent coverage, and low latency. To judge real-world effectiveness, run an A/B comparison on your own data.
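One way to run the A/B comparison suggested above is to score each system's transcripts against your own reference transcripts with a word or character error rate. The sketch below uses a standard Levenshtein distance; the function names are illustrative, and text normalization (punctuation, casing, number formats) is left out and would need to be applied identically to both systems.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs: List[str], hyps: List[str], by_char: bool = False) -> float:
    """Corpus-level word error rate, or character error rate when by_char=True."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_errors = 0
    total_tokens = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = list(ref) if by_char else ref.split()
        hyp_tokens = list(hyp) if by_char else hyp.split()
        total_errors += edit_distance(ref_tokens, hyp_tokens)
        total_tokens += len(ref_tokens)
    if total_tokens == 0:
        raise ValueError("references contain no tokens")
    return total_errors / total_tokens
```

Character error rate (by_char=True) is usually the more natural choice for Chinese and other languages without whitespace word boundaries; compute the same metric for each candidate system on the same test set and compare.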
6. Limitations and precautions
- Voice cloning involves authorization and privacy: obtain explicit authorization and guard against impersonation and fraud.
- The streaming experience depends heavily on engineering details: slicing policy, VAD, network jitter, and caching can all affect latency and sentence breaks.
- Long-tail dialects and noisy environments may still be misrecognized: setting a confidence threshold and adding a manual review step is recommended.
- Using trust_remote_code=True calls for a supply chain security evaluation: pinning versions, auditing the code, and running in isolation are safer.
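The "fixed versions" and "audit the code" advice for trust_remote_code=True can be backed by a simple integrity check: after reviewing a downloaded model directory, record the SHA-256 of each file you audited, and refuse to load if anything later differs. This is a minimal sketch under that assumption; the file names and the manifest format are hypothetical and not part of funasr or the model repository.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large weight files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_mismatches(model_dir: Path, pinned: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose hash differs from the pinned one.

    `pinned` maps a relative file path to the hex SHA-256 recorded at audit time.
    """
    mismatches = []
    for relative_path, expected in pinned.items():
        candidate = model_dir / relative_path
        if not candidate.is_file() or sha256_of_file(candidate) != expected.lower():
            mismatches.append(relative_path)
    return mismatches
```

An empty result means the audited files are unchanged; anything else is a reason to stop and re-review before loading the model. This complements, and does not replace, running the code in an isolated environment.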
7. Project address
https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512
8. FAQs
Q: Does Fun-CosyVoice3-0.5B support "TTS in 9 languages" with streaming output?
A: Yes. It supports speech synthesis in 9 languages, with bidirectional streaming: streaming text input and streaming audio output.
Q: How much reference audio is needed for "voice cloning" for Fun-CosyVoice3-0.5B?
A: It is positioned as zero-shot voice cloning, which usually needs only a small amount of reference audio; the recording quality and accent of the reference affect similarity and stability.
Q: Does Fun-ASR-Nano support 31 languages and dialect/accent recognition?
A: It supports 31 languages and covers major Chinese dialects and multi-regional accents, making it suitable for real-time dictation scenarios.
Q: How do I quickly call Fun-ASR-Nano in Python?
A: Load the model with funasr's AutoModel, following the model card example, then run inference on audio files or streaming slices.
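As a rough illustration of that answer, the sketch below wraps funasr's AutoModel in two small helpers. Treat the details as assumptions to check against the model card: the model identifier in the usage comment is inferred from the FunAudioLLM organization name rather than confirmed, the model may need extra constructor arguments (for example a device or a remote code path), and the result format is assumed to be a list of dicts with a "text" field.

```python
from typing import Any, Dict, List


def load_asr_model(model_id: str):
    """Load an ASR model through funasr's AutoModel (requires `pip install funasr`)."""
    # Imported lazily so the helpers below can be used and tested without funasr.
    from funasr import AutoModel

    # trust_remote_code=True runs Python code shipped with the model; review it first.
    return AutoModel(model=model_id, trust_remote_code=True)


def collect_text(results: List[Dict[str, Any]]) -> str:
    """Join the "text" field of each result entry, skipping empty ones.

    Assumption: generate() returns a list of dicts that carry a "text" key.
    """
    return " ".join(str(item["text"]).strip() for item in results if item.get("text"))


def transcribe_file(model: Any, audio_path: str) -> str:
    """Run recognition on one audio file and return the combined text."""
    return collect_text(model.generate(input=audio_path))


# Example usage (the model id and file name are assumptions, not confirmed values):
# model = load_asr_model("FunAudioLLM/Fun-ASR-Nano-2512")
# print(transcribe_file(model, "meeting.wav"))
```

Keeping the load step separate from transcription means the model is initialized once and reused across files, which matters more for latency than any per-call tuning.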