WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

Linhan Ma1,†, Dake Guo1,†, Kun Song1, Yuepeng Jiang1, Shuai Wang2,3, Liumeng Xue3, Weiming Xu1, Huan Zhao1, Binbin Zhang4, Lei Xie1
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
2Shenzhen Research Institute of Big Data, 3School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
4WeNet Open Source Community, China


With the development of large text-to-speech (TTS) models and the scaling up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-source WenetSpeech dataset [1]. To tailor it to text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. After a more accurate transcription process and quality-based data filtering, the resulting WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality score, to enable TTS model training and fine-tuning. VALL-E [2] and NaturalSpeech 2 [3] systems were trained and fine-tuned on these subsets, establishing benchmarks for the usability of WenetSpeech4TTS and for the fair comparison of TTS systems. The corpus and corresponding benchmarks will be made publicly available to advance research in this field.
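To make the segment-boundary adjustment step above concrete, here is a minimal Python sketch that merges adjacent ASR segments separated by short pauses into longer, TTS-friendly utterances. The segment format, the 1.0 s gap threshold, and the 20 s duration cap are illustrative assumptions, not the paper's exact pipeline or parameters.

# Illustrative sketch of merging adjacent ASR segments into longer
# TTS-friendly utterances. The segment format, the 1.0 s gap threshold,
# and the 20 s duration cap are assumed values, not the paper's settings.

def merge_segments(segments, max_gap=1.0, max_dur=20.0):
    """segments: list of dicts with 'start', 'end' (seconds) and 'text'."""
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if merged:
            prev = merged[-1]
            gap = seg["start"] - prev["end"]
            new_dur = seg["end"] - prev["start"]
            # Fuse segments separated by a short pause, keeping the
            # resulting utterance under the duration cap.
            if gap <= max_gap and new_dur <= max_dur:
                prev["end"] = seg["end"]
                prev["text"] += seg["text"]
                continue
        merged.append(dict(seg))
    return merged

if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 3.2, "text": "第一句"},
        {"start": 3.6, "end": 7.1, "text": "第二句"},    # 0.4 s gap: merged
        {"start": 12.0, "end": 15.0, "text": "第三句"},  # 4.9 s gap: kept apart
    ]
    for seg in merge_segments(demo):
        print(f'{seg["start"]:.1f}-{seg["end"]:.1f}s: {seg["text"]}')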

Download WenetSpeech4TTS
Table 1: WenetSpeech4TTS training subsets. The subsets are nested: Standard contains Premium, and Basic contains Standard.

Training Subset           DNSMOS Threshold   Hours    Average Segment Duration (s)
Premium                   ≥ 4.0              945      8.3
Standard                  ≥ 3.8              4,056    7.5
Basic                     ≥ 3.6              7,226    6.6
Rest                      < 3.6              5,574    -
WenetSpeech [1] (orig.)   -                  12,483   -
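As a sanity check, the hour counts are consistent with nested tiers: 7,226 h (Basic) + 5,574 h (Rest) = 12,800 h, the full corpus. The sketch below reproduces this bucketing logic; the dnsmos_score values would come from a DNSMOS-style quality predictor, which is assumed here rather than provided by the corpus tooling.

# Sketch of the DNSMOS-threshold bucketing behind Table 1. Tiers are
# nested by construction (Premium ⊆ Standard ⊆ Basic), matching the
# ≥-threshold reading of the table. Scores are assumed to come from an
# external DNSMOS-style predictor; none is bundled with the corpus.

TIERS = [("Premium", 4.0), ("Standard", 3.8), ("Basic", 3.6)]

def assign_subsets(score):
    """Return every subset a segment with this DNSMOS score falls into."""
    hits = [name for name, threshold in TIERS if score >= threshold]
    return hits or ["Rest"]

if __name__ == "__main__":
    for s in (4.2, 3.9, 3.7, 3.1):
        print(f"DNSMOS {s:.1f} -> {assign_subsets(s)}")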

Zero-Shot TTS Samples

Seen Speakers

Unseen Speakers

References:

[1] B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng et al., “WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition,” in Proc. ICASSP. IEEE, 2022, pp. 6182–6186.
[2] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei, “Neural codec language models are zero-shot text to speech synthesizers,” CoRR, vol. abs/2301.02111, 2023.
[3] K. Shen, Z. Ju, X. Tan, E. Liu, Y. Leng, L. He, T. Qin, S. Zhao, and J. Bian, “NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers,” in The Twelfth International Conference on Learning Representations (ICLR), 2024.