Abstract
This technical report describes the CP-JKU team's submission
for Task 4, "Sound Event Detection with Heterogeneous Training
Datasets and Potentially Missing Labels", of the DCASE 2024
Challenge. We fine-tune three large Audio Spectrogram Transformers,
PaSST, BEATs, and ATST, on the joint DESED and MAESTRO
datasets in a two-stage training procedure. The first stage closely
matches the baseline system setup and trains a CRNN model while
keeping the large pre-trained transformer model frozen. In the
second stage, both the CRNN and the transformer are fine-tuned using
heavily weighted self-supervised losses. After the second stage, we
compute strong pseudo-labels for all audio clips in the training set
using an ensemble of all three fine-tuned transformers. Then, in a
second iteration, we repeat the two-stage training process and include a
distillation loss based on the pseudo-labels, boosting single-model
performance substantially. Additionally, we pre-train PaSST and
ATST on the subset of AudioSet that comes with strong temporal
labels, before fine-tuning them on the Task 4 datasets.
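The pseudo-labelling and distillation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the ensemble is combined or which distillation loss is used, so everything below is an assumption made for illustration: the ensemble is taken as a plain average of frame-wise class probabilities, the distillation term is a binary cross-entropy against those soft pseudo-labels, and the names `ensemble_pseudo_labels`, `bce`, `total_loss`, and `distill_weight` are hypothetical.

```python
import math


def ensemble_pseudo_labels(model_probs):
    """Average frame-wise class probabilities over several models.

    model_probs: one entry per model, each a [frames][classes] grid of
    probabilities. Plain averaging is an assumption, not the report's method.
    """
    n_models = len(model_probs)
    n_frames = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[t][c] for m in model_probs) / n_models for c in range(n_classes)]
        for t in range(n_frames)
    ]


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two [frames][classes] grids."""
    total, count = 0.0, 0
    for p_row, y_row in zip(pred, target):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
            count += 1
    return total / count


def total_loss(student, hard_labels, pseudo, distill_weight=1.0):
    """Supervised loss plus a weighted distillation term (hypothetical weighting)."""
    return bce(student, hard_labels) + distill_weight * bce(student, pseudo)
```

In this sketch the student is pulled both toward the annotated labels and toward the ensemble's soft predictions; a real system would also have to handle clips whose labels are weak or missing, which is omitted here.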
| Original language | English |
|---|---|
| Title of host publication | Technical report DCASE2024 Challenge, 2024 |
| Number of pages | 5 |
| Publication status | Published - 2024 |
Publication series
| Name | Detection and Classification of Acoustic Scenes and Events |
|---|---|
Fields of science
- 102003 Image processing
- 202002 Audiovisual media
- 102001 Artificial intelligence
- 102015 Information systems
- 102 Computer Sciences
- 101019 Stochastics
- 103029 Statistical physics
- 101018 Statistics
- 101017 Game theory
- 202017 Embedded systems
- 101016 Optimisation
- 101015 Operations research
- 101014 Numerical mathematics
- 101029 Mathematical statistics
- 101028 Mathematical modelling
- 101026 Time series analysis
- 101024 Probability theory
- 102032 Computational intelligence
- 102004 Bioinformatics
- 102013 Human-computer interaction
- 101027 Dynamical systems
- 305907 Medical statistics
- 101004 Biomathematics
- 305905 Medical informatics
- 101031 Approximation theory
- 102033 Data mining
- 305901 Computer-aided diagnosis and therapy
- 102019 Machine learning
- 106007 Biostatistics
- 102018 Artificial neural networks
- 106005 Bioinformatics
- 202037 Signal processing
- 202036 Sensor systems
- 202035 Robotics
JKU Focus areas
- Digital Transformation