Abstract
Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio–caption pairs. This yields a shared embedding space in which corresponding items from the two modalities end up close together. Since audio–caption datasets typically contain only matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing each audio recording with a caption randomly drawn from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we therefore suggest substituting it with estimated correspondences. To this end, we propose a two-stage training procedure in which multiple retrieval models are first trained as usual, i.e., without estimated correspondences. In the second stage, the audio–caption correspondences predicted by these models serve as prediction targets. We evaluate our method on the ClothoV2 and AudioCaps benchmarks and show that it improves retrieval performance, even in a restrictive self-distillation setting where a single model generates and then learns from the estimated correspondences. We further show that our method outperforms the current state of the art by 1.6 pp. mAP@10 on the ClothoV2 benchmark.
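The two-stage idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the temperature value, the symmetric cross-entropy form, and the choice to average the stage-one models' softmax outputs are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def softmax(x, axis):
    return np.exp(log_softmax(x, axis))

def similarity(audio_emb, text_emb, temperature=0.05):
    # Cosine similarities between all audio/caption pairs in a batch.
    # The temperature value is an assumption, not taken from the paper.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return a @ t.T / temperature

def stage1_loss(logits):
    # Stage 1: usual contrastive loss; only the annotated pairs on the
    # diagonal count as matches, everything else as a mismatch.
    a2t = -np.mean(np.diag(log_softmax(logits, axis=1)))
    t2a = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (a2t + t2a)

def estimate_correspondences(teacher_logits_list):
    # Assumed aggregation: average the stage-1 models' softmax outputs
    # to obtain soft audio-to-text and text-to-audio targets.
    a2t = np.mean([softmax(l, axis=1) for l in teacher_logits_list], axis=0)
    t2a = np.mean([softmax(l, axis=0) for l in teacher_logits_list], axis=0)
    return a2t, t2a

def stage2_loss(student_logits, a2t_targets, t2a_targets):
    # Stage 2: cross-entropy against the estimated correspondences
    # instead of one-hot targets.
    a2t = -np.mean(np.sum(a2t_targets * log_softmax(student_logits, axis=1), axis=1))
    t2a = -np.mean(np.sum(t2a_targets * log_softmax(student_logits, axis=0), axis=0))
    return 0.5 * (a2t + t2a)
```

With one-hot (identity) targets, `stage2_loss` reduces to `stage1_loss`, so the second stage differs from the first only in the targets it is given. In the self-distillation setting, `teacher_logits_list` would contain the predictions of a single stage-one model.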
| Original language | English |
|---|---|
| Title | Proceedings of the 9th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Tokyo, Japan |
| Pages | 121-125 |
| Number of pages | 5 |
| Publication status | Published - Oct. 2024 |
Fields of science
- 202002 Audiovisual media
- 102 Computer sciences
- 102001 Artificial intelligence
- 102003 Image processing
- 102015 Information systems
- 101019 Stochastics
- 103029 Statistical physics
- 101018 Statistics
- 101017 Game theory
- 202017 Embedded systems
- 101016 Optimisation
- 101015 Operations research
- 101014 Numerical mathematics
- 101029 Mathematical statistics
- 101028 Mathematical modelling
- 101026 Time series analysis
- 101024 Probability theory
- 102032 Computational intelligence
- 102004 Bioinformatics
- 102013 Human-computer interaction
- 101027 Dynamical systems
- 305907 Medical statistics
- 101004 Biomathematics
- 305905 Medical informatics
- 101031 Approximation theory
- 102033 Data mining
- 305901 Computer-aided diagnosis and therapy
- 102019 Machine learning
- 106007 Biostatistics
- 102018 Artificial neural networks
- 106005 Bioinformatics
- 202037 Signal processing
- 202036 Sensor systems
- 202035 Robotics
JKU focus areas
- Digital Transformation