Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

Research output: Chapter in Book/Report/Conference proceeding, Conference proceedings, peer-review

Abstract

This technical report describes the CP-JKU team’s submission for Task 4, Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels, of the DCASE 24 Challenge. We fine-tune three large Audio Spectrogram Transformers, PaSST, BEATs, and ATST, on the joint DESED and MAESTRO datasets in a two-stage training procedure. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the large pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of all three fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, boosting single-model performance substantially. Additionally, we pre-train PaSST and ATST on the subset of AudioSet that comes with strong temporal labels, before fine-tuning them on the Task 4 datasets.
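
The pseudo-labelling and distillation steps mentioned in the abstract could be sketched roughly as follows. This is only an illustration and not the report's implementation: the abstract does not say how the ensemble is combined or which distillation loss is used, so plain averaging of frame-level probabilities and a binary cross-entropy distillation term are assumptions, and the names `ensemble_pseudo_labels`, `bce`, `total_loss`, and `distill_weight` are hypothetical.

```python
import math


def ensemble_pseudo_labels(model_probs):
    """Average frame-level class probabilities over several models.

    model_probs: one [frames][classes] nested list per model.
    Averaging is an assumed way to combine the ensemble.
    """
    n_models = len(model_probs)
    n_frames = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[t][c] for m in model_probs) / n_models for c in range(n_classes)]
        for t in range(n_frames)
    ]


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predictions and (soft) targets."""
    total, count = 0.0, 0
    for p_row, y_row in zip(pred, target):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
            count += 1
    return total / count


def total_loss(pred, labels, pseudo, distill_weight=1.0):
    """Supervised loss plus a weighted distillation term (weight is hypothetical)."""
    return bce(pred, labels) + distill_weight * bce(pred, pseudo)


# Toy example: two "teacher" models, two frames, two event classes.
teacher_a = [[0.9, 0.1], [0.2, 0.8]]
teacher_b = [[0.7, 0.3], [0.4, 0.6]]
pseudo = ensemble_pseudo_labels([teacher_a, teacher_b])

student_pred = [[0.6, 0.4], [0.5, 0.5]]
hard_labels = [[1.0, 0.0], [0.0, 1.0]]
loss = total_loss(student_pred, hard_labels, pseudo, distill_weight=0.5)
```

In this sketch the pseudo-labels stay soft (probabilities rather than thresholded decisions), so the distillation term is smallest when the student reproduces the ensemble's frame-level output.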
Original language: English
Title of host publication: Technical report DCASE2024 Challenge, 2024
Number of pages: 5
Publication status: Published - 2024

Publication series

Name: Detection and Classification of Acoustic Scenes and Events

Fields of science

  • 102003 Image processing
  • 202002 Audiovisual media
  • 102001 Artificial intelligence
  • 102015 Information systems
  • 102 Computer Sciences
  • 101019 Stochastics
  • 103029 Statistical physics
  • 101018 Statistics
  • 101017 Game theory
  • 202017 Embedded systems
  • 101016 Optimisation
  • 101015 Operations research
  • 101014 Numerical mathematics
  • 101029 Mathematical statistics
  • 101028 Mathematical modelling
  • 101026 Time series analysis
  • 101024 Probability theory
  • 102032 Computational intelligence
  • 102004 Bioinformatics
  • 102013 Human-computer interaction
  • 101027 Dynamical systems
  • 305907 Medical statistics
  • 101004 Biomathematics
  • 305905 Medical informatics
  • 101031 Approximation theory
  • 102033 Data mining
  • 305901 Computer-aided diagnosis and therapy
  • 102019 Machine learning
  • 106007 Biostatistics
  • 102018 Artificial neural networks
  • 106005 Bioinformatics
  • 202037 Signal processing
  • 202036 Sensor systems
  • 202035 Robotics

JKU Focus areas

  • Digital Transformation
