Abstract
This technical report details the CP-JKU submission to the automated audio captioning task of the DCASE 2022 challenge (task 6a). The objective of the task was to train a sequence-to-sequence model that automatically generates textual descriptions for given audio recordings. The approach described in this report enhances the BART-based encoder-decoder model used as the challenge's baseline system in three ways. First, the VGGish embedding model was replaced with a custom CNN10-like model that we pre-trained on AudioSet. Second, the BART encoder-decoder model was pre-trained on AudioCaps, which led to faster convergence. Third, the best model was further fine-tuned by optimizing the non-differentiable CIDEr metric using the REINFORCE algorithm. Our best single model achieves a SPIDEr score of 0.29, an improvement of 6.6 percentage points over the challenge's baseline score.
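The REINFORCE step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the optional `baseline` argument, and the use of plain floats in place of a real CIDEr scorer and a real captioning model are all assumptions made for illustration. The idea is that the reward of a sampled caption (here, a number standing in for its CIDEr score) scales the log-probability of that caption, so the gradient of this surrogate loss raises the probability of captions that score above the baseline.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Sum of log-probabilities the model assigned to each sampled token."""
    return sum(math.log(p) for p in token_probs)


def reinforce_loss(
    token_probs: Sequence[float],
    reward: float,
    baseline: float = 0.0,
) -> float:
    """Surrogate loss for one sampled caption.

    token_probs: probabilities of the sampled tokens (hypothetical input;
                 in practice these come from the decoder).
    reward:      score of the sampled caption, e.g. its CIDEr value,
                 treated as a constant with respect to the model.
    baseline:    optional variance-reduction term (an assumption here;
                 the abstract does not say whether one was used).
    """
    advantage = reward - baseline
    return -advantage * sequence_log_prob(token_probs)


# A caption sampled with token probabilities 0.5 and 0.25, rewarded 0.8.
loss = reinforce_loss([0.5, 0.25], reward=0.8, baseline=0.3)
```

In an automatic-differentiation framework the same expression would be built from the decoder's log-probabilities, with the reward excluded from the gradient computation.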
| Original language | English |
|---|---|
| Title of host publication | Detection and Classification of Acoustic Scenes and Events 2022 Challenge (DCASE2022) |
| Number of pages | 4 |
| Publication status | Published - 2022 |
Fields of science
- 202002 Audiovisual media
- 102 Computer Sciences
- 102001 Artificial intelligence
- 102003 Image processing
- 102015 Information systems
JKU Focus areas
- Digital Transformation