Joon-Seung Choi, Dong-Min Byun, Hyung-Seok Oh and Seong-Whan Lee
Controlling singing style is crucial for achieving an expressive and natural singing voice. Among the various style factors, vibrato plays a key role in conveying emotion and adding musical depth. However, vibrato is dynamic and therefore hard to model, which makes it difficult to control in singing voice conversion. To address this, we propose VibE-SVC, a controllable singing voice conversion model that explicitly extracts and manipulates vibrato using the discrete wavelet transform (DWT). Unlike previous methods that model vibrato implicitly, our approach decomposes the F0 contour into frequency components, enabling precise vibrato transfer and direct control over vibrato. Experimental results show that VibE-SVC effectively transforms singing styles while preserving speaker similarity, and both subjective and objective evaluations confirm high-quality conversion.
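To make the idea of decomposing an F0 contour with a DWT and rescaling its oscillatory part more concrete, the following is a minimal sketch and not the VibE-SVC implementation. Everything specific in it is an assumption made for illustration: the Haar wavelet, the 100 Hz frame rate, the four decomposition levels, the synthetic 6 Hz vibrato, and the helper names `haar_dwt`, `haar_idwt`, and `scale_details`. It only shows that scaling the detail coefficients of a wavelet decomposition weakens or strengthens the fast F0 oscillation while leaving the coarse melody component untouched.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal, levels):
    """Multi-level Haar DWT. Returns (approx, details); details[0] is the finest level."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) % 2:
            raise ValueError("signal length must be divisible by 2**levels")
        evens, odds = approx[0::2], approx[1::2]
        details.append([(a - b) / SQRT2 for a, b in zip(evens, odds)])
        approx = [(a + b) / SQRT2 for a, b in zip(evens, odds)]
    return approx, details


def haar_idwt(approx, details):
    """Inverse of haar_dwt: rebuild the signal from the coarsest level upward."""
    approx = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for s, d in zip(approx, detail):
            rebuilt.append((s + d) / SQRT2)
            rebuilt.append((s - d) / SQRT2)
        approx = rebuilt
    return approx


def scale_details(f0, levels, scale, target_levels=None):
    """Scale the detail coefficients of the chosen levels (1 = finest) and reconstruct.

    Hypothetical helper: which levels actually carry vibrato depends on the
    frame rate and wavelet, so by default every detail level is scaled.
    """
    approx, details = haar_dwt(f0, levels)
    if target_levels is None:
        target_levels = range(1, levels + 1)
    for level in target_levels:
        details[level - 1] = [scale * d for d in details[level - 1]]
    return haar_idwt(approx, details)


# Synthetic F0 contour (assumed values): a 220 Hz note with a 6 Hz, +/-5 Hz vibrato.
FRAME_RATE_HZ = 100
LEVELS = 4
f0 = [220.0 + 5.0 * math.sin(2 * math.pi * 6.0 * n / FRAME_RATE_HZ) for n in range(256)]

unchanged = scale_details(f0, LEVELS, 1.0)    # scale 1.0 reproduces the input
flattened = scale_details(f0, LEVELS, 0.0)    # scale 0.0 removes the oscillation
exaggerated = scale_details(f0, LEVELS, 2.0)  # scale 2.0 doubles the detail part
```

With a scale of 0.0 on all levels, the Haar reconstruction reduces to the mean of each block of `2**LEVELS` frames, which is why the result is a staircase. A smoother wavelet would give a smoother residual contour, but that choice is outside what this sketch assumes.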
Straight-to-Vibrato Conversion
| Source Speaker | Converted | | |
|---|---|---|---|
| Female 6 | SoVITS w/ Style Emb | SoVITS w/ Style Emb & DWT | SoVITS w/ PST |
| | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
| Male 5 | SoVITS w/ Style Emb | SoVITS w/ Style Emb & DWT | SoVITS w/ PST |
| | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
Vibrato-to-Straight Conversion
| Source Speaker | Converted | | |
|---|---|---|---|
| Female 8 | SoVITS w/ Style Emb | SoVITS w/ Style Emb & DWT | SoVITS w/ PST |
| | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
| Male 2 | SoVITS w/ Style Emb | SoVITS w/ Style Emb & DWT | SoVITS w/ PST |
| | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
Straight-to-Vibrato Conversion
| Source Speaker | Target Speaker | Converted | | |
|---|---|---|---|---|
| Female 7 | Female 5 | SoVITS w/ Style Emb | SoVITS | SoVITS w/ PST |
| | | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
| | | VibE-SVC [DWT level 3] | VibE-SVC [DWT level 5] | |
| Female 1 | Male 2 | SoVITS w/ Style Emb | SoVITS | SoVITS w/ PST |
| | | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
| | | VibE-SVC [DWT level 3] | VibE-SVC [DWT level 5] | |
| Male 9 | Male 8 | SoVITS w/ Style Emb | SoVITS | SoVITS w/ PST |
| | | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
| | | VibE-SVC [DWT level 3] | VibE-SVC [DWT level 5] | |
| Male 9 | Female 7 | SoVITS w/ Style Emb | SoVITS | SoVITS w/ PST |
| | | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
| | | VibE-SVC [DWT level 3] | VibE-SVC [DWT level 5] | |
Vibrato-to-Straight Conversion
| Source Speaker | Target Speaker | Converted | | |
|---|---|---|---|---|
| Female 8 | Female 2 | SoVITS w/ Style Emb | SoVITS | SoVITS w/ PST |
| | | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
| | | VibE-SVC [DWT level 3] | VibE-SVC [DWT level 5] | |
| Female 2 | Male 3 | SoVITS w/ Style Emb | SoVITS | SoVITS w/ PST |
| | | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
| | | VibE-SVC [DWT level 3] | VibE-SVC [DWT level 5] | |
| Male 11 | Male 5 | SoVITS w/ Style Emb | SoVITS | SoVITS w/ PST |
| | | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
| | | VibE-SVC [DWT level 3] | VibE-SVC [DWT level 5] | |
| Male 3 | Female 5 | SoVITS w/ Style Emb | SoVITS | SoVITS w/ PST |
| | | VibE-SVC (Ours) | VibE-SVC w/o MPD | VibE-SVC w/o DWT |
| | | VibE-SVC [DWT level 3] | VibE-SVC [DWT level 5] | |

| Source Speaker | Target Speaker | Converted | | |
|---|---|---|---|---|
| Female 6 | Male 5 | Scaling 1.0 | Scaling 0.1 | Scaling 0.3 |
| | | Scaling 0.5 | Scaling 0.7 | Scaling 2.0 |
| | | Frame-level control Type 1 | Frame-level control Type 2 | Frame-level control Type 3 |
| Male 7 | Female 8 | Scaling 1.0 | Scaling 0.1 | Scaling 0.3 |
| | | Scaling 0.5 | Scaling 0.7 | Scaling 2.0 |
| | | Frame-level control Type 1 | Frame-level control Type 2 | Frame-level control Type 3 |