LEARNING A CROSS-DOMAIN EMBEDDING SPACE OF VOCAL AND MIXED AUDIO WITH A STRUCTURE-PRESERVING TRIPLET LOSS
Keunhyoung Luke Kim, Jongpil Lee, Sangeun Kum and Juhan Nam,
in Proceedings of ISMIR, 2021
Abstract
Recent advances in music source separation have achieved high-quality vocal isolation from mixed audio. This has paved the way for various applications in the area of music information retrieval (MIR). In this paper, we propose a method to learn a cross-domain embedding space between isolated vocal and mixed audio for vocal-centric MIR tasks, leveraging a pre-trained music source separation model. Learning the cross-domain embedding was previously attempted with a triplet-based similarity model where vocal and mixed audio are encoded by two different convolutional neural networks. We improve the approach with a structure-preserving triplet loss that exploits not only cross-domain similarity between vocal and mixed audio but also intra-domain similarity within vocal tracks or mixed tracks. We learn the vocal embedding using a large-scale dataset and evaluate it in singer identification and query-by-singer tasks. In addition, we use the vocal embedding for vocal-based music tagging in a transfer learning setting. We show that the proposed model significantly improves on the previous cross-domain embedding model, particularly when the two embedding spaces from isolated vocals and mixed audio are concatenated.
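To make the idea of combining cross-domain and intra-domain terms concrete, here is a minimal sketch of a triplet loss with both kinds of terms. It is not the formulation from the paper: the hinge form, the cosine similarity, the margin value, the `intra_weight` parameter, and all function names are assumptions chosen for illustration. It also assumes the anchor and positive share a singer while the negative comes from a different singer.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_hinge(anchor, positive, negative, margin=0.1):
    """Hinge loss: zero once sim(anchor, positive) exceeds sim(anchor, negative) by the margin."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

def structure_preserving_loss(v_a, m_a, v_p, m_p, v_n, m_n,
                              margin=0.1, intra_weight=1.0):
    """Illustrative loss over vocal (v_*) and mixed (m_*) embeddings.

    a = anchor, p = positive, n = negative (hypothetical naming).
    """
    # Cross-domain terms: vocal anchor against mixed tracks, and the reverse.
    cross = (triplet_hinge(v_a, m_p, m_n, margin)
             + triplet_hinge(m_a, v_p, v_n, margin))
    # Intra-domain terms: vocal against vocal, mixed against mixed.
    intra = (triplet_hinge(v_a, v_p, v_n, margin)
             + triplet_hinge(m_a, m_p, m_n, margin))
    return cross + intra_weight * intra
```

Setting `intra_weight=0.0` in this sketch leaves only the cross-domain terms, which is one way to picture the difference between a purely cross-domain triplet model and one that also constrains each domain internally.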
Vocal-accompaniment matching and cross-mixing
Using the cross-domain music embedding suggested in the paper, we investigated a few famous songs and their nearest neighbors from the Million Song Dataset. A few matches with very high scores are omitted since they are identical to the query songs or only slightly different versions of them.
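The retrieval step described above can be sketched as a cosine-similarity nearest-neighbor search. This is only an illustration under stated assumptions: the function name, the array layout (one embedding per row), and the `duplicate_threshold` cutoff used to drop near-identical matches are hypothetical and are not taken from the paper.

```python
import numpy as np

def nearest_neighbors(query, catalog, k=2, duplicate_threshold=0.95):
    """Return up to k (index, cosine similarity) pairs, most similar first.

    query:   1-D embedding of the query song.
    catalog: 2-D array with one candidate embedding per row.
    Candidates at or above duplicate_threshold are skipped as likely
    duplicates of the query (an assumed, illustrative cutoff).
    """
    # Normalize so that the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    sims = c @ q
    # Sort candidate indices by descending similarity.
    order = np.argsort(-sims)
    kept = [(int(i), float(sims[i])) for i in order if sims[i] < duplicate_threshold]
    return kept[:k]
```

For example, a query embedding would be compared against every row of a catalog matrix, and the returned indices would point at the songs listed as "Similar song 1" and "Similar song 2" in the examples that follow.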
Wake Me Up Before You Go-Go (Wham!)
The original song and its separated vocal
Similar song 1 with cosine similarity of 0.5610 (mix, separated accompaniment, mix with the query vocal)
Similar song 2 with cosine similarity of 0.5053 (mix, separated accompaniment, mix with the query vocal)
Every Breath You Take (The Police)
The original song and its separated vocal
Similar song 1 with cosine similarity of 0.7367 (mix, separated accompaniment, mix with the query vocal)
A different arrangement of the query song with cosine similarity of 0.5224 (mix, separated accompaniment, mix with the query vocal)
I Love You for Sentimental Reasons (Nat King Cole)
The original song and its separated vocal
Similar song 1 with cosine similarity of 0.3569 (mix, separated accompaniment, mix with the query vocal)
Similar song 2 with cosine similarity of 0.2916 (mix, separated accompaniment, mix with the query vocal)