AI & ML Papers

🔥 Fish Audio S2 Technical Report

💡 The paper introduces Fish Audio S2, an open-source text-to-speech system with multi-speaker support, multi-turn generation, and instruction-following control through natural-language descriptions. Training is multi-stage and is backed by a staged data pipeline covering video captioning, speech captioning, voice-quality assessment, and reward modeling. This pipeline supports scalable training and improves overall performance. The authors release the model weights, fine-tuning code, and a production-ready streaming inference engine. The engine achieves a real-time factor of 0.195 and a time-to-first-audio below 100 milliseconds. Code and weights are available on GitHub and Hugging Face, and custom voices can be tried on the project website.
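
To put the reported real-time factor in concrete terms, here is a minimal sketch. It assumes the usual definition of real-time factor (synthesis time divided by the duration of the generated audio); the function names and the 10-second clip are illustrative and do not come from the paper.

```python
# Hypothetical helpers illustrating what a real-time factor (RTF) implies.
# Assumption: RTF = synthesis time / duration of generated audio.

REPORTED_RTF = 0.195  # figure quoted in the summary above


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return the RTF for a run that took synthesis_seconds to make audio_seconds of speech."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float = REPORTED_RTF) -> float:
    """Estimate how long it takes to synthesize audio_seconds of speech at a given RTF."""
    return audio_seconds * rtf


if __name__ == "__main__":
    clip = 10.0  # illustrative clip length in seconds
    print(f"{clip:.0f} s of audio takes about {synthesis_time(clip):.2f} s to generate")
    print(f"that is roughly {1 / REPORTED_RTF:.1f}x faster than real time")
```

Under that definition, any value below 1.0 means speech is generated faster than it plays back, which is what makes streaming practical. The time-to-first-audio figure is a separate latency measure and is not derived from the RTF.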

📅 Published on Mar 9

🔗 Links:
• arXiv: https://arxiv.org/abs/2603.08823
• PDF: https://arxiv.org/pdf/2603.08823
• Project Page: https://fish.audio/
• GitHub: https://github.com/fishaudio/fish-speech ⭐ 30.2k

🤖 Models citing this paper:
• https://huggingface.co/fishaudio/s2-pro
• https://huggingface.co/drbaph/s2-pro-fp8
• https://huggingface.co/mlx-community/fish-audio-s2-pro-bf16

📊 Datasets citing this paper:
• https://huggingface.co/datasets/Izzyzlin/CFSDD

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/artificialguybr/fish-s2-pro-zero
• https://huggingface.co/spaces/fguilleme/fish-s2-pro-zero
• https://huggingface.co/spaces/MAYA-AI/fish-s2-pro-zero

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#TextToSpeechSystems #MultispeakerSynthesis #NaturalLanguageProcessing #SpeechGenerationModels #RealTimeAudioProcessing

