WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

Linhan Ma1,†, Dake Guo1,†, Kun Song1, Yuepeng Jiang1, Shuai Wang2,3, Liumeng Xue3, Weiming Xu1, Huan Zhao1, Binbin Zhang4, Lei Xie1
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
2Shenzhen Research Institute of Big Data, 3School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
4WeNet Open Source Community, China


With the development of large text-to-speech (TTS) models and the scaling up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-source WenetSpeech dataset [1]. To tailor it to text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. After a more accurate transcription process and quality-based data filtering, the resulting WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality score, to enable TTS model training and fine-tuning. VALL-E [2] and NaturalSpeech 2 [3] systems were trained and fine-tuned on these subsets, establishing benchmarks for the usability of WenetSpeech4TTS and for the fair comparison of TTS systems. The corpus and corresponding benchmarks will be made publicly available to advance research in this field.
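To make the segment-boundary adjustment step above concrete, here is a minimal Python sketch that merges adjacent ASR segments separated by short pauses into longer, TTS-friendly utterances. The segment format, the 1.0 s gap threshold, and the 20 s duration cap are illustrative assumptions, not the paper's exact pipeline or parameters.

# Illustrative sketch of merging adjacent ASR segments into longer
# TTS-friendly utterances. The segment format, the 1.0 s gap threshold,
# and the 20 s duration cap are assumed values, not the paper's settings.

def merge_segments(segments, max_gap=1.0, max_dur=20.0):
    """segments: list of dicts with 'start', 'end' (seconds) and 'text'."""
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if merged:
            prev = merged[-1]
            gap = seg["start"] - prev["end"]
            new_dur = seg["end"] - prev["start"]
            # Fuse segments separated by a short pause, keeping the
            # resulting utterance under the duration cap.
            if gap <= max_gap and new_dur <= max_dur:
                prev["end"] = seg["end"]
                prev["text"] += seg["text"]
                continue
        merged.append(dict(seg))
    return merged

if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 3.2, "text": "第一句"},
        {"start": 3.6, "end": 7.1, "text": "第二句"},    # 0.4 s gap: merged
        {"start": 12.0, "end": 15.0, "text": "第三句"},  # 4.9 s gap: kept apart
    ]
    for seg in merge_segments(demo):
        print(f'{seg["start"]:.1f}-{seg["end"]:.1f}s: {seg["text"]}')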

Download WenetSpeech4TTS
Table 1: WenetSpeech4TTS training subsets. The subsets are nested: Standard contains Premium, and Basic contains Standard.

Training Subset           DNSMOS Threshold   Hours    Average Segment Duration (s)
Premium                   ≥ 4.0              945      8.3
Standard                  ≥ 3.8              4,056    7.5
Basic                     ≥ 3.6              7,226    6.6
Rest                      < 3.6              5,574    -
WenetSpeech [1] (orig.)   -                  12,483   -
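As a sanity check, the hour counts are consistent with nested tiers: 7,226 h (Basic) + 5,574 h (Rest) = 12,800 h, the full corpus. The sketch below reproduces this bucketing logic; the dnsmos_score values would come from a DNSMOS-style quality predictor, which is assumed here rather than provided by the corpus tooling.

# Sketch of the DNSMOS-threshold bucketing behind Table 1. Tiers are
# nested by construction (Premium ⊆ Standard ⊆ Basic), matching the
# ≥-threshold reading of the table. Scores are assumed to come from an
# external DNSMOS-style predictor; none is bundled with the corpus.

TIERS = [("Premium", 4.0), ("Standard", 3.8), ("Basic", 3.6)]

def assign_subsets(score):
    """Return every subset a segment with this DNSMOS score falls into."""
    hits = [name for name, threshold in TIERS if score >= threshold]
    return hits or ["Rest"]

if __name__ == "__main__":
    for s in (4.2, 3.9, 3.7, 3.1):
        print(f"DNSMOS {s:.1f} -> {assign_subsets(s)}")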

Zero-Shot TTS Samples

Seen Speakers

Unseen Speakers

References:

[1] B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng et al., “WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition,” in Proc. ICASSP. IEEE, 2022, pp. 6182–6186.
[2] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei, “Neural codec language models are zero-shot text to speech synthesizers,” CoRR, vol. abs/2301.02111, 2023.
[3] K. Shen, Z. Ju, X. Tan, E. Liu, Y. Leng, L. He, T. Qin, S. Zhao, and J. Bian, “NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers,” in The Twelfth International Conference on Learning Representations (ICLR), 2024.