Comments Page - Ask HN: What's the current best local/open speech-to-speech setup?

Submitted by dsrtslnd23 14 hours ago

mpaepper 3 hours ago
You should look into the new Nvidia model: https://research.nvidia.com/labs/adlr/personaplex/
It has dual-channel input/output and a very permissive license.
- cbrews an hour ago
  Thanks for sharing this! I'm going to put this on my list to play around with. I'm not really an expert in this tech (I come from an audio background), but I was recently playing around with streaming Speech-to-Text (using Whisper) and Text-to-Speech (using Kokoro at the time) on a local machine.
  The most challenging part of my build was tuning the inference batch size. I was able to get Speech-to-Text working well with batch sizes down to 200ms. I even implemented a basic local agreement algorithm and it was still very fast (inference time, I think, was around 10-20ms?). You're basically limited by the minimum batch size, NOT inference time. Maybe that's a missing "secret sauce" suggested in the original post?
  In the use case listed above, the TTS probably isn't a bottleneck as long as OP can generate tokens quickly.
  All this being said, a wrapped model like this that can handle the hand-offs between these parts of the process sounds really useful, and I'll definitely be interested in seeing how it performs.
  Let me know if you guys play with this and find success.
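The "local agreement" idea mentioned above can be sketched roughly as follows. This is a minimal illustration, not cbrews's actual code: it assumes each batch re-transcribes the whole audio buffer into a list of words, and it only commits words that two consecutive hypotheses agree on. The class and function names (`LocalAgreement`, `common_prefix`) are made up for this example, and the sketch assumes the already-committed words reappear unchanged at the start of every new hypothesis.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest shared leading run of words in a and b."""
    out: List[str] = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out


class LocalAgreement:
    """Commit words only once two consecutive hypotheses agree on them."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already emitted, never revised
        self.previous: List[str] = []   # unconfirmed tail of the last hypothesis

    def update(self, hypothesis: List[str]) -> List[str]:
        """Take the transcript of the buffer so far; return newly confirmed words."""
        # Skip the part that was already committed (assumed to be unchanged).
        tail = hypothesis[len(self.committed):]
        # Words are confirmed when the last two hypotheses agree on them.
        agreed = common_prefix(self.previous, tail)
        self.committed.extend(agreed)
        # Keep the still-unconfirmed words for comparison with the next batch.
        self.previous = tail[len(agreed):]
        return agreed
```

With this scheme a word is only emitted after it shows up in two successive batches, so the delay before text appears depends mostly on the batch duration (for example 200ms) rather than on the 10-20ms of inference time.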
- dsrtslnd23 3 hours ago
  oh - very interesting indeed! thanks
amelius an hour ago
For the TTS part: https://github.com/supertone-inc/supertonic
Johnny_Bonk an hour ago
Is anyone using any reasonably good small open-source speech-to-text models?
- garblegarble an hour ago
  For my inputs, whisper distil-large-v3.5 is the best. I tried Parakeet 0.6 v3 last night but it has higher error rates than I'd like (but it is fast...)
  Johnny_Bonk an hour ago
  Nice, I'll try it. For now my personal STT workflow uses the ElevenLabs API, which is pretty generous, but I'm curious to play around with other options.
  garblegarble an hour ago
  I assume that will be better than whisper - I haven't benchmarked it against cloud models, the project I'm working on cannot send data out to cloud models
  BiraIgnacio an hour ago
  oh I've been looking into whisper and vosk in the last few days. I'll probably go with whisper (with whisper.cpp) but has anyone compared it to vosk models?
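One rough way to compare models such as Whisper, Parakeet, or vosk on your own recordings is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration under simple assumptions (lowercasing and whitespace splitting only, no punctuation normalization); the function name `word_error_rate` is just for this example.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Running each model over the same handful of clips with hand-checked transcripts and averaging the scores gives a comparison that reflects your own audio rather than a public benchmark.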
jauntywundrkind 3 hours ago
It was a little annoying getting old qt5 tools installed but I really enjoyed using dsnote / Speech Note. Huge model selection for my amd gpu. Good tool. I haven't done enough specific studying yet to give you suggestions for which model to go with. WhisperFlow is very popular.
Kyutai always does some very interesting work. Their delayed streams work is bleeding edge & sounds very promising, especially for low latency. Not sure why I have not yet tried it tbh. https://github.com/kyutai-labs/delayed-streams-modeling
There's also a really nice, elegant, simple app called Handy. It only supports Whisper and Parakeet V3, but it's a nice app & those are amazing models. https://github.com/cjpais/Handy
hackomorespacko 23 minutes ago
Just go out on the street and talk to people?