Abstract
This paper describes our submission to the first Freesound general-purpose
audio tagging challenge carried out within the DCASE
2018 challenge. Our proposal is based on a fully convolutional
neural network that predicts one out of 41 possible audio class labels
when given an audio spectrogram excerpt as an input. What
makes this classification dataset, and the task in general, special is
that only 3,700 of the 9,500 provided training examples are
delivered with manually verified ground-truth labels. The remaining
non-verified observations are expected to contain a substantial
amount of label noise (up to 30-35% in the “worst” categories). We
propose to address this issue by a simple, iterative self-verification
process, which gradually shifts unverified labels into the verified,
trusted training set. The decision criterion for self-verifying a training
example is the prediction consensus of a previous snapshot of
the network on multiple short sliding window excerpts of the training
example at hand. On the unseen test data, an ensemble of
three networks trained with this self-verification approach achieves
a mean average precision (MAP@3) of 0.951. This is the second
best out of 558 submissions to the corresponding Kaggle challenge.
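The self-verification criterion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `self_verify`, the window and hop sizes, and the agreement threshold are all assumptions introduced here for clarity.

```python
import numpy as np

def self_verify(predict_fn, spectrogram, noisy_label,
                window_len=128, hop=64, agreement=0.7):
    """Hypothetical sketch of the self-verification criterion: an
    unverified label is shifted into the trusted training set only if a
    previous network snapshot predicts that same label on a sufficient
    fraction of short sliding-window excerpts of the example.

    predict_fn: snapshot model mapping a (bins, window_len) spectrogram
                excerpt to a vector of class probabilities (41 classes).
    """
    n_frames = spectrogram.shape[1]
    # Cut the spectrogram into overlapping sliding-window excerpts.
    starts = range(0, max(1, n_frames - window_len + 1), hop)
    windows = [spectrogram[:, s:s + window_len] for s in starts]
    # Predicted class for each excerpt.
    preds = [int(np.argmax(predict_fn(w))) for w in windows]
    # Fraction of excerpts agreeing with the noisy label.
    consensus = preds.count(noisy_label) / len(preds)
    return consensus >= agreement  # True: move into verified set
```

In the iterative scheme described in the abstract, a check like this would be re-run after each training round with the latest network snapshot, gradually growing the verified set.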
| Original language | English |
|---|---|
| Title of host publication | Proceedings of workshop on Detection and Classification of Acoustic Scenes and Events (DCASE) |
| Number of pages | 5 |
| Publication status | Published - 2018 |
Fields of science
- 202002 Audiovisual media
- 102 Computer Sciences
- 102001 Artificial intelligence
- 102003 Image processing
- 102015 Information systems
JKU Focus areas
- Computation in Informatics and Mathematics
- Engineering and Natural Sciences (in general)