A Knowledge Distillation Approach to Improving Language-Based Audio Retrieval Models

Research output: Chapter in Book/Report/Conference proceeding › Conference proceedings › peer-review

Abstract

This technical report describes the CP-JKU team’s submissions to the language-based audio retrieval task of the 2024 DCASE Challenge (Task 8). All our submitted systems are based on the dual-encoder architecture, which projects recordings and textual descriptions into a shared audio-caption space in which related examples from the two modalities are similar. We used pretrained audio and text embedding models and trained them on audio-caption datasets (WavCaps, AudioCaps, and ClothoV2) via contrastive learning. We further fine-tuned the resulting models on ClothoV2 via knowledge distillation from a large ensemble of audio retrieval models. Our best single-system submission, based on PaSST and RoBERTa, achieves an mAP@10 of 39.77 on the ClothoV2 test split, outperforming last year’s best single-system submission by around 1 percentage point without using metadata or synthetic captions. An ensemble of three distilled models achieves 41.91 mAP@10 on the ClothoV2 test split. A repository with our implementation is available on GitHub.
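
The two training stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the symmetric cross-entropy form of the contrastive loss, the temperature value, and the choice to distill from a softmax over ensemble-averaged similarity matrices are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one row of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def similarity_matrix(audio_emb, text_emb, temperature=0.05):
    # Cosine similarity between every audio and caption embedding,
    # scaled by an assumed temperature (hypothetical default).
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    return [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
            for ai in a]

def contrastive_loss(sim):
    # Symmetric cross-entropy: the i-th recording matches the i-th caption,
    # so the targets lie on the diagonal (assumed loss form).
    n = len(sim)
    sim_t = [list(col) for col in zip(*sim)]
    audio_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_audio = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (audio_to_text + text_to_audio)

def average_matrices(matrices):
    # Assumed ensembling rule: element-wise mean of the teachers' matrices.
    k = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / k for j in range(cols)]
            for i in range(rows)]

def distillation_loss(student_sim, teacher_sim):
    # Cross-entropy between the teacher's soft targets (row-wise softmax)
    # and the student's row-wise log-softmax.
    total = 0.0
    for s_row, t_row in zip(student_sim, teacher_sim):
        t_prob = [math.exp(x) for x in log_softmax(t_row)]
        s_logp = log_softmax(s_row)
        total -= sum(p * q for p, q in zip(t_prob, s_logp))
    return total / len(student_sim)
```

In this sketch, a student whose similarity matrix agrees with the averaged teacher matrix receives a lower distillation loss than one that ranks the wrong caption first, which is the signal the fine-tuning stage would exploit.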
Original language: English
Title of host publication: Technical report DCASE2024 Challenge, 2024
Number of pages: 5
Publication status: Published - 2024

Publication series

Name: Detection and Classification of Acoustic Scenes and Events

Fields of science

  • 102003 Image processing
  • 202002 Audiovisual media
  • 102001 Artificial intelligence
  • 102015 Information systems
  • 102 Computer Sciences
  • 101019 Stochastics
  • 103029 Statistical physics
  • 101018 Statistics
  • 101017 Game theory
  • 202017 Embedded systems
  • 101016 Optimisation
  • 101015 Operations research
  • 101014 Numerical mathematics
  • 101029 Mathematical statistics
  • 101028 Mathematical modelling
  • 101026 Time series analysis
  • 101024 Probability theory
  • 102032 Computational intelligence
  • 102004 Bioinformatics
  • 102013 Human-computer interaction
  • 101027 Dynamical systems
  • 305907 Medical statistics
  • 101004 Biomathematics
  • 305905 Medical informatics
  • 101031 Approximation theory
  • 102033 Data mining
  • 305901 Computer-aided diagnosis and therapy
  • 102019 Machine learning
  • 106007 Biostatistics
  • 102018 Artificial neural networks
  • 106005 Bioinformatics
  • 202037 Signal processing
  • 202036 Sensor systems
  • 202035 Robotics

JKU Focus areas

  • Digital Transformation
