1. Describe the problem:
Lately I have been working a lot with voice cloning. My model copies timbre and intonation, but the output voice sounds too clean and smooth. Users say it has lost its character and feels lifeless. Many people immediately recognize it as an AI voice, so their trust in it drops to zero.
In practice, most algorithms are trained on studio recordings and automatically smooth out natural speech imperfections: a slight rasp, uneven rhythm, micropauses, individual quirks of speech. The result is a technically perfect but unrecognizably "combed" voice. People, however, want their own natural way of speaking, and so far I have not been able to reproduce it faithfully.
2. How often does the problem occur?
With every new clone. In fact, this is not a bug but a feature of current models.
3. What attempts have you made to solve the problem?
I tried training on "dirty" audio data (podcasts, phone recordings, interviews with background noise), but the model still gravitates toward a sterile sound. I have not found a simple way to force the AI to preserve the natural (even if imperfect) characteristics of a voice.
4. How much are you willing to pay for the solution?
I am willing to pay market rates for comparable solutions, but only for one that lets me control the "degree of cleanliness" of the clone and keep the voice as natural as possible.