We introduce MultiVSR - a large-scale dataset for multilingual visual speech recognition. MultiVSR comprises ~12,000 hours of video data paired with word-aligned transcripts from 13 languages. We ...
Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns.
Some results have been hidden because they may be inaccessible to you
Show inaccessible results