Data Science by ODS.ai 🦜

⚡️ A new Llama3-Speech model has been released that natively understands both audio and text input.

This multimodal checkpoint, with improved speech understanding, listens to human speech and responds in text.

Llama3s v0.2 performs consistently across multiple speech understanding benchmarks.

The developers adapted Llama 3.1 using early fusion with semantic tokens.

It uses WhisperVQ to get semantic tokens. The encoder is frozen during training; only the Llama 3 base model is trained.
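To make the early-fusion idea concrete, here is a minimal sketch of how discrete semantic token ids could be rendered as extra vocabulary tokens that sit in the same sequence as text. The token names (`<|sound_start|>`, `<|sound_0012|>`, `<|sound_end|>`), the zero-padding width and the helper name are assumptions for illustration only; the post does not spell out the exact token format.

```python
from typing import List

# Hypothetical marker tokens; the real vocabulary may differ.
SOUND_START = "<|sound_start|>"
SOUND_END = "<|sound_end|>"


def semantic_ids_to_tokens(ids: List[int], width: int = 4) -> str:
    """Render codebook indices (e.g. from a frozen speech encoder)
    as a string of special tokens the language model can read."""
    if any(i < 0 for i in ids):
        raise ValueError("codebook indices must be non-negative")
    # Each index becomes one zero-padded special token.
    body = "".join(f"<|sound_{i:0{width}d}|>" for i in ids)
    return f"{SOUND_START}{body}{SOUND_END}"


if __name__ == "__main__":
    # Three made-up codebook indices standing in for a short utterance.
    print(semantic_ids_to_tokens([12, 507, 3]))
```

With this kind of setup the speech encoder only supplies the indices, so it can stay frozen while the language model learns embeddings for the new tokens.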

For training, the developers used a synthetically generated speech dataset. The speech data was then semantically encoded with WhisperVQ from WhisperSpeech.

This dataset was then interleaved to have 70% speech instruction prompts and 30% speech transcription prompts.
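A rough sketch of how a 70/30 mix like this could be assembled. The function name, the tuple layout and the fixed seed are assumptions for illustration; the post does not describe the actual mixing code.

```python
import random
from typing import List, Sequence, Tuple


def mix_prompts(
    instruction: Sequence[str],
    transcription: Sequence[str],
    total: int,
    instruction_share: float = 0.7,  # 70% instruction prompts, per the post
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Sample `total` examples at the requested ratio and shuffle them."""
    rng = random.Random(seed)
    n_instr = round(total * instruction_share)
    n_trans = total - n_instr
    if n_instr > len(instruction) or n_trans > len(transcription):
        raise ValueError("not enough examples to reach the requested mix")
    mixed = [("instruction", x) for x in rng.sample(list(instruction), n_instr)]
    mixed += [("transcription", x) for x in rng.sample(list(transcription), n_trans)]
    rng.shuffle(mixed)  # interleave the two prompt types
    return mixed


if __name__ == "__main__":
    instr = [f"instr-{i}" for i in range(20)]
    trans = [f"trans-{i}" for i in range(20)]
    for kind, item in mix_prompts(instr, trans, total=10):
        print(kind, item)
```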

You can try the demo: ask questions in English and keep them under 10 seconds long. This limit exists because the model was trained on audio prompts with fewer than 500 tokens, which the developers plan to address in a future update.
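If you want to check a recording before sending it to the demo, a small guard like the one below could help. It is only a sketch: the two limits come from the post, while the function names and the idea of passing in a token count are assumptions, since the post does not say how many tokens one second of audio produces.

```python
from typing import Optional

MAX_SECONDS = 10.0  # "under 10 seconds", per the post
MAX_TOKENS = 500    # "fewer than 500 tokens", per the post


def audio_duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a mono clip given its sample count and sample rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def prompt_fits(
    num_samples: int,
    sample_rate: int,
    num_tokens: Optional[int] = None,
) -> bool:
    """True if the clip is under the duration limit and, when a token
    count is supplied, under the token limit as well."""
    if audio_duration_seconds(num_samples, sample_rate) >= MAX_SECONDS:
        return False
    if num_tokens is not None and num_tokens >= MAX_TOKENS:
        return False
    return True


if __name__ == "__main__":
    print(prompt_fits(16000 * 8, 16000))   # 8-second clip at 16 kHz
    print(prompt_fits(16000 * 12, 16000))  # 12-second clip at 16 kHz
```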

https://huggingface.co/homebrewltd/llama3.1-s-instruct-v0.2

homebrew.ltd/blog/llama3-just-got-ears

@opendatascience

#llama
