TY - UNPB
T1 - Receptive-Field Regularized CNNs for Music Classification and Tagging
AU - Koutini, Khaled
AU - Eghbal-Zadeh, Hamid
AU - Haunschmid, Verena
AU - Primus, Paul
AU - Chowdhury, Shreyan
AU - Widmer, Gerhard
PY - 2020
Y1 - 2020
N2 - Convolutional Neural Networks (CNNs) have been successfully used in various Music Information Retrieval (MIR) tasks, both as end-to-end models and as feature extractors for more complex systems. However, the MIR field is still dominated by the classical VGG-based CNN architecture variants, often in combination with more complex modules such as attention, and/or techniques such as pre-training on large datasets. Deeper models such as ResNet – which surpassed VGG by a large margin in other domains – are rarely used in MIR. One of the main reasons for this, as we will show, is the lack of generalization of deeper CNNs in the music domain. In this paper, we present a principled way to make deep architectures like ResNet competitive for music-related tasks, based on well-designed regularization strategies. In particular, we analyze the recently introduced Receptive-Field Regularization and Shake-Shake, and show that they significantly improve the generalization of deep CNNs on music-related tasks, and that the resulting deep CNNs can outperform current more complex models such as CNNs augmented with pre-training and attention. We demonstrate this on two different MIR tasks and two corresponding datasets, thus offering our deep regularized CNNs as a new baseline for these datasets, which can also be used as a feature-extracting module in future, more complex approaches.
AB - Convolutional Neural Networks (CNNs) have been successfully used in various Music Information Retrieval (MIR) tasks, both as end-to-end models and as feature extractors for more complex systems. However, the MIR field is still dominated by the classical VGG-based CNN architecture variants, often in combination with more complex modules such as attention, and/or techniques such as pre-training on large datasets. Deeper models such as ResNet – which surpassed VGG by a large margin in other domains – are rarely used in MIR. One of the main reasons for this, as we will show, is the lack of generalization of deeper CNNs in the music domain. In this paper, we present a principled way to make deep architectures like ResNet competitive for music-related tasks, based on well-designed regularization strategies. In particular, we analyze the recently introduced Receptive-Field Regularization and Shake-Shake, and show that they significantly improve the generalization of deep CNNs on music-related tasks, and that the resulting deep CNNs can outperform current more complex models such as CNNs augmented with pre-training and attention. We demonstrate this on two different MIR tasks and two corresponding datasets, thus offering our deep regularized CNNs as a new baseline for these datasets, which can also be used as a feature-extracting module in future, more complex approaches.
UR - https://arxiv.org/pdf/2007.13503.pdf
U2 - 10.48550/arXiv.2007.13503
DO - 10.48550/arXiv.2007.13503
M3 - Preprint
T3 - arXiv.org
BT - Receptive-Field Regularized CNNs for Music Classification and Tagging
ER -