the voices have a very obviously robotic quality to them, and tend to talk over each other at odd points. But the Meta ...
“We present a system that transforms speech into physical objects by combining 3D generative Artificial Intelligence (AI) ...
That’s because they rely on automatic speech recognition systems to process spoken inputs, before synthesizing them with a language model and converting it all using text-to-speech models.
In the NotebookLlama samples I’ve listened to, the voices have a very obvious robotic quality to them ... with stronger models. “The [text-to-speech] model is the limitation of how natural ...