PhD candidate: Olga Slizovskaia
Supervisors: Dr. Emilia Gómez
Committee: Dr. Xavier Giró-i-Nieto
21/10/2020
modern and efficient audio-visual methods for MIR
\[
\begin{aligned}
\definecolor{classes}{RGB}{114,0,172}
\definecolor{params}{RGB}{45,177,93}
\definecolor{model}{RGB}{251,0,29}
\definecolor{signal}{RGB}{18,110,213}
\definecolor{probability}{RGB}{217,86,16}
\definecolor{categories}{RGB}{203,23,206}
{\color{probability} \hat{y}} = {\color{model} f}_{\color{params} \theta}({\color{signal} x})
\end{aligned}
\]
find parameters $\theta$ for a model $f$ that gives class probability estimates $\hat{y}$ for a sample $x$
\[ \definecolor{loss}{RGB}{128,121,14} \definecolor{classes}{RGB}{114,0,172} \definecolor{probability}{RGB}{217,86,16} \definecolor{categories}{RGB}{203,23,206} { \color{loss} \mathcal{L} } ( {\color{classes} y}, {\color{probability} \hat{y}}) = -{\color{loss} \sum}_{\color{categories} i=1}^{\color{categories} K} {\color{classes}y_i} {\color{loss} \cdot \log}( {\color{probability} \hat{y}_i }) \] for multi-label classification, minimize the categorical cross-entropy between the ground-truth and estimated probability distributions of class labels
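As a minimal sketch of the loss above (in NumPy, for illustration only; function and variable names are ours, not from the thesis code):

```python
import numpy as np

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y_hat) = -sum_i y_i * log(y_hat_i) over K classes."""
    y_hat = np.clip(y_hat, eps, 1.0)  # avoid log(0)
    return -np.sum(y * np.log(y_hat))

# one-hot ground truth over K = 3 classes (toy values)
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.1, 0.8, 0.1])   # model's estimated probabilities
loss = categorical_cross_entropy(y, y_hat)  # = -log(0.8)
```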
Audio input → Audio-based CNN
Video input → Video-based CNN
DATASETS
- FCVID: Musical Performance With Instruments
  12 classes, 5K videos, 260 hours
- YouTube-8M: MusInstr-Normalized
  46 classes, 60k videos, 4k hours
[Audio examples: Sample 1, Sample 2, Sample 3]
Can you hear the difference now?
[Audio examples: Separated 1, Separated 2, Separated 3]
\[ \definecolor{loss}{RGB}{128,121,14} \definecolor{predicted}{RGB}{217,86,16} \definecolor{gt}{RGB}{203,23,206} \definecolor{nsources}{RGB}{114,0,172} \definecolor{sum}{RGB}{251,0,29} \definecolor{mix}{RGB}{18,110,213} {\color{mix} y(t)} = {\color{sum} \sum_{\color{nsources}i=1}^{\color{nsources}N} {\color{gt} x_i(t)}} \] the mixture equals the sum of all sources
\[
\definecolor{loss}{RGB}{128,121,14}
\definecolor{predicted}{RGB}{217,86,16}
\definecolor{nsources}{RGB}{114,0,172}
\definecolor{model}{RGB}{251,0,29}
\definecolor{gt}{RGB}{203,23,206}
\definecolor{mix}{RGB}{18,110,213}
\definecolor{params}{RGB}{45,177,93}
{\color{predicted} \hat{x}_{\color{nsources}i}(t)} =
{\color{model}f_{\color{params}\theta}^{\color{nsources}i}}({\color{mix}y(t)})
\]
find parameters $\theta$ for a model $f^i$ that estimates the individual sources $\hat{x}_i(t)$ from the mixture $y(t)$
approaches to estimate individual sources from the mixture
predicting time-domain signals $ \definecolor{nsources}{RGB}{114,0,172} \definecolor{predicted}{RGB}{217,86,16} \color{predicted} \hat{x}_{\color{nsources}i}(t)$
\[ \definecolor{nsources}{RGB}{114,0,172} \definecolor{loss}{RGB}{128,121,14} \definecolor{predicted}{RGB}{217,86,16} \definecolor{gt}{RGB}{203,23,206} {\color{loss} \mathcal{L}^{w} {\color{black}=} \sum_{\color{nsources}i=1}^{\color{nsources}N} \sum_{j=1}^{T} ({\color{gt} x_{\color{nsources}i}(j)} - {\color{predicted} \hat{x}_{\color{nsources}i}(j)})^2} \]
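A toy sketch of the waveform loss $\mathcal{L}^{w}$ (NumPy, illustrative values only; the shapes and signals are assumptions, not thesis data):

```python
import numpy as np

def waveform_loss(x, x_hat):
    """L^w: squared error between ground-truth and estimated
    time-domain signals, summed over N sources and T samples."""
    return np.sum((x - x_hat) ** 2)

# N = 2 sources, T = 4 samples (toy data, not real audio)
x = np.array([[0.0, 1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5,  0.5]])
x_hat = np.array([[0.0, 0.9, 0.1, -1.0],
                  [0.5, 0.5, 0.4,  0.5]])
loss = waveform_loss(x, x_hat)  # three errors of 0.1 each, so ~0.03
```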
predicting ratio masks $ \definecolor{nsources}{RGB}{114,0,172} \definecolor{predicted}{RGB}{217,86,16} \color{predicted} \hat{M}_{\color{nsources}i}^{r} $
$ \definecolor{mix}{RGB}{18,110,213} \color{mix} \boldsymbol{Y} $, $ \definecolor{gt}{RGB}{203,23,206} \definecolor{nsources}{RGB}{114,0,172} \color{gt} \boldsymbol{X_{\color{nsources} i}} $ STFT values of the mixture and sources
magnitude of the STFT value at frequency $\nu$ and time $\tau$
$\definecolor{gt}{RGB}{203,23,206}
\definecolor{nsources}{RGB}{114,0,172}
| {\color{gt} \boldsymbol{X_{\color{nsources}i}}(\tau, \nu) } |$,
$\definecolor{mix}{RGB}{18,110,213}
| {\color{mix} \boldsymbol{Y}(\tau, \nu)}|$
ideal ratio mask of source $\definecolor{nsources}{RGB}{114,0,172} {\color{nsources} i}$ \[ \definecolor{gt}{RGB}{203,23,206} \definecolor{nsources}{RGB}{114,0,172} \definecolor{mix}{RGB}{18,110,213} \definecolor{predicted}{RGB}{217,86,16} {\color{predicted} \hat{M}_{\color{nsources}i}^{r}} = \frac{|{\color{gt} \boldsymbol{X_{\color{nsources}i}}(\tau, \nu) } |}{|{\color{mix} \boldsymbol{Y}(\tau, \nu)}|} \]
\[ \definecolor{loss}{RGB}{128,121,14} \definecolor{nsources}{RGB}{114,0,172} \definecolor{predicted}{RGB}{217,86,16} \definecolor{gt}{RGB}{203,23,206} {\color{loss} \mathcal{L}^{r} {\color{black}=} \sum_{\color{nsources}i=1}^{\color{nsources}N} \sum_{j=1}^{|T \times F |} ({\color{gt} M_{\color{nsources}i}^{ir}(j)} - {\color{predicted} \hat{M}_{\color{nsources}i}^{r}(j)})^2} \]
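A minimal sketch of the ideal ratio mask, using `scipy.signal.stft` on two synthetic sine-tone "sources" (sampling rate, frequencies, and signal length here are illustrative assumptions, not the thesis data):

```python
import numpy as np
from scipy.signal import stft

# toy mixture of N = 2 synthetic sources (pure tones)
fs = 8000
t = np.arange(fs) / fs
x1 = np.sin(2 * np.pi * 440 * t)    # source 1
x2 = np.sin(2 * np.pi * 1320 * t)   # source 2
y = x1 + x2                         # mixture y(t) = sum_i x_i(t)

# magnitude STFTs |X_1(tau, nu)| and |Y(tau, nu)|
_, _, Y = stft(y, fs)
_, _, X1 = stft(x1, fs)

# ideal ratio mask of source 1 (small eps avoids division by zero)
M1 = np.abs(X1) / (np.abs(Y) + 1e-12)
# M1 is near 1 in time-frequency bins dominated by source 1,
# near 0 in bins dominated by source 2
```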
💪 Can we use extra information to improve separation?
Method | SDR (dB) | SIR (dB) | SAR (dB) |
---|---|---|---|
InformedNMF | -0.16 | 1.42 | 9.31 |
Exp-Wave-U-Net | -4.12 | -3.06 | 12.18 |
CExp-Wave-U-Net | -1.37 | 2.16 | 6.36 |
Takeaway: the approach was inspiring, but we lacked TPUs and scaling was not an option
- Learning rate
- TPUs
- tf.float32 vs tf.float16

Total speedup: ×35.4
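As a rough illustration of why half precision helps (a NumPy stand-in for `tf.float16`; the ×35.4 figure itself comes from the combined changes above, not from precision alone):

```python
import numpy as np

# half precision halves the memory footprint per tensor; on hardware
# with native fp16 support (e.g. TPUs) it also raises arithmetic
# throughput -- one of several ingredients of the overall speedup
a32 = np.ones((1024, 1024), dtype=np.float32)
a16 = a32.astype(np.float16)
print(a32.nbytes, a16.nbytes)  # 4194304 2097152
```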
What we did next...
modern and efficient audio-visual methods for MIR
Where can we merge different data representations?
How should we merge data from different sources?