Abstract
A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datasets, including audio clips labeled with varying annotation granularity and with different sets of possible events. We propose a multi-iteration, multi-stage procedure for fine-tuning Audio Spectrogram Transformers on the joint DESED and MAESTRO Real datasets. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, achieving a new single-model, state-of-the-art performance on the public evaluation set of DESED with a PSDS1 of 0.692. A single model and an ensemble, both based on our proposed training procedure, ranked first in Task 4 of the DCASE Challenge 2024.
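The pseudo-labeling and distillation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the ensemble is combined by simple averaging of frame-level class probabilities and that distillation uses a binary cross-entropy term against the resulting soft labels. The function names and the `distill_weight` parameter are hypothetical; the abstract does not specify the loss form or its weighting.

```python
import math

def ensemble_pseudo_labels(model_probs):
    """Average frame-level class probabilities over ensemble members.

    model_probs[m][t][c] is the probability model m assigns to class c
    at frame t. Plain averaging is an assumption for illustration.
    """
    n_models = len(model_probs)
    n_frames = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[t][c] for m in model_probs) / n_models for c in range(n_classes)]
        for t in range(n_frames)
    ]

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy; targets may be soft (in [0, 1])."""
    total, count = 0.0, 0
    for p_row, y_row in zip(pred, target):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
            count += 1
    return total / count

def combined_loss(student, strong_labels, pseudo_labels, distill_weight=1.0):
    """Supervised loss plus a weighted distillation term (hypothetical weighting)."""
    return bce(student, strong_labels) + distill_weight * bce(student, pseudo_labels)

# Two ensemble members, one frame, two event classes.
pseudo = ensemble_pseudo_labels([[[0.2, 0.8]], [[0.4, 0.6]]])
student = [[0.25, 0.75]]
loss = combined_loss(student, [[0.0, 1.0]], pseudo, distill_weight=0.5)
```

In this sketch the distillation term is smallest when the student's predictions match the averaged ensemble output, which is the intended effect of training on pseudo-labels in the second iteration.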
| Original language | English |
|---|---|
| Title | Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024) |
| Number of pages | 5 |
| Publication status | Published - 2024 |
Fields of science
- 202002 Audiovisual Media
- 102 Computer Sciences
- 102001 Artificial Intelligence
- 102003 Image Processing
- 102015 Information Systems
- 101019 Stochastics
- 103029 Statistical Physics
- 101018 Statistics
- 101017 Game Theory
- 202017 Embedded Systems
- 101016 Optimisation
- 101015 Operations Research
- 101014 Numerical Mathematics
- 101029 Mathematical Statistics
- 101028 Mathematical Modelling
- 101026 Time Series Analysis
- 101024 Probability Theory
- 102032 Computational Intelligence
- 102004 Bioinformatics
- 102013 Human-Computer Interaction
- 101027 Dynamical Systems
- 305907 Medical Statistics
- 101004 Biomathematics
- 305905 Medical Informatics
- 101031 Approximation Theory
- 102033 Data Mining
- 305901 Computer-Aided Diagnosis and Therapy
- 102019 Machine Learning
- 106007 Biostatistics
- 102018 Artificial Neural Networks
- 106005 Bioinformatics
- 202037 Signal Processing
- 202036 Sensor Systems
- 202035 Robotics
JKU focus areas
- Digital Transformation