Audio AI: isolating vocals from stereo music using Convolutional Neural Networks
- Ale Koretzky
- Fairly standard CNN, but uses a binary STFT mask to isolate vocals from the instruments.
- You get Fourier-type time-domain artifacts as a result, but it sounds reasonable (see the masking sketch below).
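A minimal sketch of the masking step (not the author's code; file name, STFT sizes, and the crude threshold standing in for the CNN output are all assumptions): take the STFT of the mix, zero the bins judged non-vocal, and invert. The hard 0/1 mask is what produces the artifacts mentioned above.

```python
import numpy as np
import librosa
import soundfile as sf

N_FFT, HOP = 1024, 256

# Hypothetical input file: a mono mixdown of the song.
mix, sr = librosa.load("mix.wav", sr=None, mono=True)
spec = librosa.stft(mix, n_fft=N_FFT, hop_length=HOP)   # complex, shape (513, frames)
mag = np.abs(spec)

# Stand-in for the CNN prediction: a crude energy threshold. In the real system the
# binary mask comes from the network's per-bin vocal / non-vocal classification.
mask = (mag > np.median(mag)).astype(np.float32)

# Apply the hard mask and invert; the abrupt 0/1 transitions cause the time-domain artifacts.
vocals = librosa.istft(spec * mask, hop_length=HOP, length=len(mix))
sf.write("vocals.wav", vocals, sr)
```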
- Didn't realize it until this paper / blog post: stacked conv layers combine channels.
- E.g. an input of shape (512 freq channels + DC, 25 time slices, 16 filter channels) into a 3x3 Conv2D with 16 output filters -> 3x3x16x16 weights + 16 biases = 2320 total parameters.
- If this is followed by a second Conv2D layer of the same shape, the layer acts as a 'normal' fully connected network in the channel dimension.
- This means there are another 3x3x16x16 + 16 = 2320 parameters (a full 16x16 channel-mixing matrix at every kernel tap), so the two-layer stack totals 4640; see the quick check after these bullets.
- Each input channel from the previous conv layer has independent weights -- they are not shared -- whereas the spatial weights are shared.
- In this example the number of input and output channels happens to be the same (16), but it doesn't have to be.
- This, naturally, falls out of spatial weight sharing, which might be obvious in retrospect; of course it doesn't make sense to share non-spatial weights.
- See also: https://datascience.stackexchange.com/questions/17064/number-of-parameters-for-convolution-layers
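A quick check of the parameter arithmetic above, using PyTorch as an example framework (any framework gives the same count). Note the 513x25 spatial input size doesn't affect the parameter count at all.

```python
import torch.nn as nn

# One 3x3 Conv2D mapping 16 -> 16 channels: weights are shared only across spatial
# positions, while the channel dimension is fully connected.
conv = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))  # 3*3*16*16 + 16 = 2320

# Stacking a second identical layer just doubles it.
stack = nn.Sequential(conv, nn.Conv2d(16, 16, kernel_size=3, padding=1))
print(sum(p.numel() for p in stack.parameters()))  # 4640
```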
- Synthesized a large training set from acapella YouTube videos plus instrument tabs... that looked like a lot of work! (See the data-pair sketch below.)
- Need a karaoke database here.
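A sketch of how such training pairs could be synthesized (assumed, not the author's exact pipeline; file names are hypothetical): mix an isolated vocal with an instrumental, and derive the "ideal" binary mask target by comparing per-bin magnitudes.

```python
import numpy as np
import librosa

N_FFT, HOP = 1024, 256

vox, sr = librosa.load("acapella.wav", sr=22050, mono=True)
inst, _ = librosa.load("instrumental.wav", sr=22050, mono=True)

# Trim to a common length and sum to create a synthetic "full song".
n = min(len(vox), len(inst))
mix = vox[:n] + inst[:n]

V = np.abs(librosa.stft(vox[:n], n_fft=N_FFT, hop_length=HOP))
I = np.abs(librosa.stft(inst[:n], n_fft=N_FFT, hop_length=HOP))
X = np.abs(librosa.stft(mix,     n_fft=N_FFT, hop_length=HOP))

# Ideal binary mask: a bin is "vocal" if the vocal energy dominates there.
target_mask = (V > I).astype(np.float32)   # shape (513, frames)

# (X, target_mask) is one (input spectrogram, label) pair for the CNN.
```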
- Authors wrapped this into a real-time extraction toolkit.