Ajay Arora

Improving voice conversion fidelity with feature matching

Retrieval-based Voice Conversion (RVC) is a popular technique for voice conversion, but it has limitations in preserving expressivity and handling out-of-domain inputs. This project focuses on improving RVC through several technical enhancements.

Technical Details

I implemented several improvements to the original RVC framework:

Enhanced Feature Extraction: Replaced the original feature extractor with a self-supervised model trained on 10,000 hours of speech, improving the representation of prosodic features
Dynamic Time Warping Optimization: Implemented a faster DTW algorithm that reduces the matching time by 60% while maintaining accuracy
Adaptive Pitch Shifting: Developed a new algorithm that preserves micro-variations in pitch, resulting in more natural-sounding conversions

The code is fully open-source and has been integrated into the main RVC repository after extensive testing.

Results

The improved version achieves:

35% reduction in artifacts as measured by objective metrics
28% faster inference time
Significantly better handling of emotional speech and singing voice