Llasa: Llama-Based Speech Synthesis

(llasatts.github.io)

142 points | by CalmStorm 14 hours ago

5 comments

  • ks2048 11 hours ago
    Odd that the page doesn't seem to link to either of these:

    paper: https://arxiv.org/abs/2502.04128

    github: https://github.com/zhenye234/LLaSA_training

    • thot_experiment 8 hours ago
      Interesting that there isn't a mention of Orpheus as prior art either, since it's the exact same thing.

      (https://github.com/canopyai/Orpheus-TTS)

      • gapeleon 4 hours ago
        > Interesting that there isn't a mention of Orpheus as prior art either

        Llasa-3b (https://huggingface.co/HKUSTAudio/Llasa-3B) came out before Orpheus (https://huggingface.co/canopylabs/orpheus-3b-0.1-ft).

        > it's the exact same thing.

        They're very similar, but they're not the exact same thing.

        Llasa uses xcodec2, a much simpler, lossless 16 kHz wav codec. This makes it superior for one-shot voice cloning.

        Orpheus' 24 kHz snac codec is lossy, which makes it difficult to use for zero-shot cloning, as the reference audio gets degraded during tokenization. You can test this here: https://huggingface.co/spaces/Gapeleon/snac_test

        But when finetuned on 50+ audio samples, it produces much cleaner 24 kHz audio than Llasa, and the snac model is much easier to run on consumer hardware than xcodec2 (87 t/s for realtime speech, which can be achieved on an RTX 3080, for example).
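
        If you want to check that degradation locally rather than through the Space, a round-trip along these lines should show it (a rough sketch, assuming the `snac` package's published encode/decode API and a mono reference clip; the file paths are placeholders):

          # Rough sketch: round-trip a reference clip through SNAC to hear how much
          # the tokenization step degrades it. Assumes the `snac` and `torchaudio`
          # packages; file paths are placeholders.
          import torch
          import torchaudio
          from snac import SNAC

          model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

          wav, sr = torchaudio.load("reference.wav")            # placeholder path
          wav = torchaudio.functional.resample(wav, sr, 24000)  # this SNAC model expects 24 kHz
          wav = wav.mean(dim=0, keepdim=True).unsqueeze(0)      # mono, shape (B=1, C=1, T)

          with torch.inference_mode():
              codes = model.encode(wav)      # hierarchical codec tokens
              wav_hat = model.decode(codes)  # reconstruction from tokens only

          torchaudio.save("reference_roundtrip.wav", wav_hat.squeeze(0), 24000)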

        • oezi 58 minutes ago
          Do you happen to know why Orpheus and Llasa use finetuning for voice cloning?

          Zonos uses 128-float embeddings for voices, which seems so much nicer because you can just mix and match voices without changing the model.
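
          Just to illustrate why that's convenient: with embedding-based conditioning, blending voices is plain vector math. A hypothetical sketch, not Zonos' actual API; `tts_model` and the generate call are placeholders:

            # Hypothetical sketch: blend two 128-float speaker embeddings without
            # touching any model weights. Names here are placeholders, not Zonos' API.
            import numpy as np

            embed_a = np.random.randn(128).astype(np.float32)  # stand-in for speaker A
            embed_b = np.random.randn(128).astype(np.float32)  # stand-in for speaker B

            alpha = 0.3                                    # 30% speaker A, 70% speaker B
            mixed = alpha * embed_a + (1 - alpha) * embed_b
            # audio = tts_model.generate(text, speaker_embedding=mixed)  # placeholder call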

        • oezi 41 minutes ago
          Isn't xcodec2 also lossy? I thought it was just another neural codec (50 tok/s, single codebook).

          What are people using to upsample back to 44.1 or 48 kHz? Anything fancy?

  • CalmStorm 14 hours ago
    LLaSA is a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as LLaMA.
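
    The appeal of that design is that speech becomes just another token stream for a vanilla decoder-only LM. A toy sketch of the framing, with made-up vocabulary and model sizes rather than the paper's actual configuration:

      # Toy sketch of the "speech tokens as extra vocabulary" framing.
      # Sizes are made up; this is not the paper's actual configuration.
      import torch
      import torch.nn as nn

      TEXT_VOCAB = 32000      # stand-in for a LLaMA-style text tokenizer
      SPEECH_VOCAB = 65536    # stand-in for single-codebook VQ codec tokens
      VOCAB = TEXT_VOCAB + SPEECH_VOCAB

      # One ordinary causal Transformer over the joint vocabulary: text tokens
      # first, then speech tokens, trained with plain next-token prediction.
      embed = nn.Embedding(VOCAB, 512)
      layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
      backbone = nn.TransformerEncoder(layer, num_layers=4)
      head = nn.Linear(512, VOCAB)

      text_ids = torch.randint(0, TEXT_VOCAB, (1, 12))        # prompt text
      speech_ids = torch.randint(TEXT_VOCAB, VOCAB, (1, 50))  # codec tokens
      ids = torch.cat([text_ids, speech_ids], dim=1)

      T = ids.size(1)
      causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
      logits = head(backbone(embed(ids), mask=causal_mask))   # (1, T, VOCAB)
      # At inference, sampled speech-token ids (minus TEXT_VOCAB) go to the codec decoder.
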
    • WastedCucumber 13 hours ago
      Probably the title should have the correct capitalization then. Because I was fully expecting a speech synthesis tool that sounded like llamas talking human language, and now I'm bummed out!
  • mring33621 12 hours ago
    The long 'uuuuhhhhhhh' from some of the lesser models is killing me.
    • gapeleon 4 hours ago
      This finetune of the 1B Llasa seems pretty stable: https://huggingface.co/spaces/HKUST-Audio/Llasa-1B-multi-spe...

      1B is actually huge for a TTS model. Here's an 82M model with probably the most stable/coherent output of all the open-weights TTS models I've tested: https://huggingface.co/spaces/hexgrad/Kokoro-TTS

      But if you mean zero-shot cloning, yeah they all seem to have those slurred speech artefacts from time to time.

    • jszymborski 12 hours ago
      Based on the samples, it really seems like anything smaller than 3B is pretty useless.
      • hadlock 11 hours ago
        If you're doing a home lab voice assistant, 1B is nice, because on a 12 GB GPU you can run a moderately competent 7B LLM and two 1B models, one for speech-to-text and one for text-to-speech, plus some headroom for the wake-word monitor. Maybe in a couple of years we can combine all this into a single ~8B model that runs efficiently on a 12 GB GPU. Nvidia doesn't seem very incentivized right now to sell consumer GPUs that can run all this on a single consumer-grade chip when they're making so much money selling commercial-grade 48 GB cards.
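
        Rough back-of-envelope for the 12 GB budget (assuming ~4-bit weights for the LLM and fp16 for the small models; the overhead figure is a guess):

          # Back-of-envelope VRAM budget for a 12 GB card. Quantization choices and
          # the overhead figure are assumptions, not measurements.
          GB = 1024**3

          llm_7b = 7e9 * 0.5 / GB   # ~4-bit weights, ~0.5 bytes/param -> ~3.3 GB
          stt_1b = 1e9 * 2 / GB     # fp16 1B speech-to-text           -> ~1.9 GB
          tts_1b = 1e9 * 2 / GB     # fp16 1B text-to-speech           -> ~1.9 GB
          overhead = 2.0            # KV cache, activations, CUDA context (guess)

          total = llm_7b + stt_1b + tts_1b + overhead
          print(f"~{total:.1f} GB of 12 GB")  # roughly 9 GB, leaving room for a wake-word model
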
  • StevenNunez 13 hours ago
    I can't wait to see this integrated into Open WebUI! These sound amazing.
    • gapeleon 4 hours ago
      You can run an OpenAI-compatible endpoint and point Open WebUI at it if you want this. I had to add a function to filter out markdown lists, code, etc., as the model was choking on them.
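
      Something along these lines; the patterns are a sketch of the kind of filter I mean, not the exact function:

        # Sketch of a pre-TTS filter: strip code blocks and markdown markers so the
        # model doesn't try to read them aloud. Patterns are illustrative only.
        import re

        def strip_markdown_for_tts(text: str) -> str:
            text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)                 # fenced code
            text = re.sub(r"`[^`]+`", " ", text)                                    # inline code
            text = re.sub(r"^\s*(?:[-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE)  # list markers
            text = re.sub(r"[#>*_]+", " ", text)                                    # headers, quotes, emphasis
            return re.sub(r"\s+", " ", text).strip()

        print(strip_markdown_for_tts("## Steps\n1. `pip install foo`\n2. Run it"))  # "Steps Run it"
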
  • dheera 11 hours ago
    > employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align

    I really wish that when new models were released, they would draw a diagram of all the layers and the tensor input and output sizes at each layer, with zoom in/out capabilities using D3.js or whatever visualization framework if needed. Every single layer should be on there with its input and output sizes.

    These one-sentence descriptions and approximate block diagrams with arrows pointing at each other are never enough to understand how something is actually implemented.
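
    Existing tooling only gets partway there: for PyTorch models, something like torchinfo's summary dumps every layer with its input/output shapes, though as text rather than a zoomable diagram. (The sketch below uses a placeholder model, not Llasa itself.)

      # Text-only partial workaround: torchinfo prints every layer with its
      # input/output shapes. The model here is a placeholder, not Llasa.
      import torch.nn as nn
      from torchinfo import summary  # pip install torchinfo

      model = nn.Sequential(
          nn.Linear(256, 512),
          nn.ReLU(),
          nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
          nn.Linear(512, 256),
      )
      summary(model, input_size=(1, 128, 256), depth=3)  # (batch, seq, features)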

    • dr_kiszonka 4 hours ago
      That might be intentional.
    • exe34 11 hours ago
      Sounds like a solid SaaS business plan!