Abstract
Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio–caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio–caption datasets typically contain only matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing the audio with a caption drawn at random from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we therefore suggest substituting it with estimated correspondences. To this end, we propose a two-stage training procedure in which multiple retrieval models are first trained as usual, i.e., without estimated correspondences. In the second stage, the audio–caption correspondences predicted by these models then serve as prediction targets. We evaluate our method on the ClothoV2 and AudioCaps benchmarks and show that it improves retrieval performance, even in a restrictive self-distillation setting where a single model generates and then learns from the estimated correspondences. We further show that our method outperforms the current state of the art by 1.6 pp. mAP@10 on the ClothoV2 benchmark.
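The two-stage idea described above can be sketched as a pair of loss functions: stage one uses the standard contrastive objective with one-hot targets (the diagonal of the similarity matrix marks the matching pairs), while stage two replaces those one-hot targets with soft correspondences estimated by previously trained models. The following is a minimal NumPy illustration under assumed choices, not the authors' implementation; the function names, the dot-product similarity, and the temperature value are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Stage 1: contrastive loss over a batch of B audio-caption pairs.

    The (B, B) similarity matrix pairs every audio with every caption;
    only the diagonal entries are known matches, so the targets are one-hot.
    Off-diagonal captions are treated as mismatching even though some may,
    by chance, describe the audio -- the issue the second stage addresses.
    """
    sim = audio_emb @ text_emb.T / temperature      # (B, B) similarities
    targets = np.eye(sim.shape[0])                  # one-hot: true pair only
    log_probs = np.log(softmax(sim, axis=1))
    return -(targets * log_probs).sum(axis=1).mean()

def distillation_loss(audio_emb, text_emb, teacher_sim, temperature=0.07):
    """Stage 2: same cross-entropy, but the one-hot targets are replaced by
    soft correspondences estimated from a previously trained model's
    similarity scores (teacher_sim, shape (B, B))."""
    sim = audio_emb @ text_emb.T / temperature
    soft_targets = softmax(teacher_sim / temperature, axis=1)
    log_probs = np.log(softmax(sim, axis=1))
    return -(soft_targets * log_probs).sum(axis=1).mean()
```

In the self-distillation setting mentioned in the abstract, `teacher_sim` would come from the same model after its first-stage training; with an ensemble, it would be an aggregate of several models' predicted correspondences.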
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 9th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Tokyo, Japan |
| Pages | 121-125 |
| Number of pages | 5 |
| Publication status | Published - Oct 2024 |
Fields of science
- 202002 Audiovisual media
- 102 Computer Sciences
- 102001 Artificial intelligence
- 102003 Image processing
- 102015 Information systems
- 101019 Stochastics
- 103029 Statistical physics
- 101018 Statistics
- 101017 Game theory
- 202017 Embedded systems
- 101016 Optimisation
- 101015 Operations research
- 101014 Numerical mathematics
- 101029 Mathematical statistics
- 101028 Mathematical modelling
- 101026 Time series analysis
- 101024 Probability theory
- 102032 Computational intelligence
- 102004 Bioinformatics
- 102013 Human-computer interaction
- 101027 Dynamical systems
- 305907 Medical statistics
- 101004 Biomathematics
- 305905 Medical informatics
- 101031 Approximation theory
- 102033 Data mining
- 305901 Computer-aided diagnosis and therapy
- 102019 Machine learning
- 106007 Biostatistics
- 102018 Artificial neural networks
- 106005 Bioinformatics
- 202037 Signal processing
- 202036 Sensor systems
- 202035 Robotics
JKU Focus areas
- Digital Transformation