WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark 
    
    
Linhan Ma1,†, Dake Guo1,†, Kun Song1, Yuepeng Jiang1, Shuai Wang2,3, Liumeng Xue3, Weiming Xu1, Huan Zhao1, Binbin Zhang4, Lei Xie1

1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
2 Shenzhen Research Institute of Big Data, 3 School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
4 WeNet Open Source Community, China
    
With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. To tailor WenetSpeech for text-to-speech tasks, we refined it by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. After a more accurate transcription process and quality-based data filtering, the resulting WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores, to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS and to establish benchmarks for the fair comparison of TTS systems. The corpus and corresponding benchmarks will be made publicly available to advance research in this field.
| Training Subsets | DNSMOS Threshold | Hours | Average Segment Duration (s) | 
|---|---|---|---|
| Premium | 4.0 | 945 | 8.3 | 
| Standard | 3.8 | 4,056 | 7.5 | 
| Basic | 3.6 | 7,226 | 6.6 | 
| Rest | < 3.6 | 5,574 | - | 
| WenetSpeech1 (orig) | - | 12,483 | - | 
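
The quality-based split in the table can be illustrated with a minimal sketch. It assumes that the subsets are nested (Premium within Standard within Basic), which the hours suggest, since Basic plus Rest gives 7,226 + 5,574 = 12,800 hours, and that a segment qualifies when its DNSMOS score is at or above the threshold. The function names and the `(duration, score)` input format are hypothetical and are not taken from the released corpus tooling.

```python
from typing import Dict, Iterable, List, Tuple

# Thresholds copied from the table above.
# Assumption: a segment qualifies when score >= threshold, and subsets are nested.
THRESHOLDS: List[Tuple[str, float]] = [
    ("Premium", 4.0),
    ("Standard", 3.8),
    ("Basic", 3.6),
]


def subsets_for(dnsmos: float) -> List[str]:
    """Return every subset a segment with this DNSMOS score would belong to."""
    names = [name for name, threshold in THRESHOLDS if dnsmos >= threshold]
    # Segments below the lowest threshold fall into the "Rest" pool.
    return names or ["Rest"]


def hours_per_subset(segments: Iterable[Tuple[float, float]]) -> Dict[str, float]:
    """Sum hours per subset from (duration_seconds, dnsmos_score) pairs."""
    totals = {name: 0.0 for name, _ in THRESHOLDS}
    totals["Rest"] = 0.0
    for duration_s, dnsmos in segments:
        for name in subsets_for(dnsmos):
            totals[name] += duration_s / 3600.0
    return totals
```

Under these assumptions, a segment scoring 4.1 counts toward all three training subsets, one scoring 3.7 counts only toward Basic, and one scoring 3.5 goes to Rest.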
Zero-Shot TTS Samples
Models
- VALL-E: VALL-E2 trained on the WenetSpeech4TTS Basic subset
- VALL-E S: VALL-E fine-tuned on the WenetSpeech4TTS Standard subset
- VALL-E P: VALL-E fine-tuned on the WenetSpeech4TTS Premium subset
- VALL-E P Scratch (Ours): VALL-E trained from scratch on the WenetSpeech4TTS Premium subset