CP-JKU's Submission to Task 6a of the DCASE2022 Challenge: A BART encoder-decoder for Automatic Audio Captioning trained via the Reinforce Algorithm and Transfer Learning

Research output: Chapter in Book/Report/Conference proceeding › Conference proceedings

Abstract

This technical report details the CP-JKU submission to the automatic audio captioning task of the DCASE 2022 challenge (task 6a). The objective of the task was to train a sequence-to-sequence model that automatically generates textual descriptions for given audio recordings. The approach described in this report enhances the BART-based encoder-decoder model used as the challenge’s baseline system in three ways. First, the VGGish embedding model was replaced with a custom CNN10-like model that we pretrained on AudioSet. Second, the BART encoder-decoder model was pretrained on AudioCaps, which led to faster convergence. Finally, the best model was further fine-tuned by optimizing the non-differentiable CIDEr metric using the REINFORCE algorithm. Our best model achieves a SPIDEr score of 0.29 (single-model performance), an improvement of 6.6 percentage points over the challenge’s baseline score.
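
The final step in the abstract, optimizing a non-differentiable metric with REINFORCE, can be illustrated with a minimal sketch. It is not the report's implementation: the function names, the token probabilities, the reward values, and the use of a greedy-decoded caption's score as a baseline are all assumptions made for illustration, since the abstract does not state these details. The point is that the metric score enters the loss only as a constant weight on the log-probability of a sampled caption, so the metric itself never has to be differentiated.

```python
import math


def sequence_log_prob(token_probs):
    """Sum of log-probabilities of the sampled tokens, i.e. log p(caption | audio)."""
    return sum(math.log(p) for p in token_probs)


def reinforce_loss(token_probs, reward, baseline=0.0):
    """REINFORCE surrogate loss for one sampled caption.

    Minimizing this loss raises the probability of captions whose reward
    exceeds the baseline and lowers it otherwise. The reward is treated
    as a constant, so it does not need to be differentiable.
    """
    advantage = reward - baseline
    return -advantage * sequence_log_prob(token_probs)


# Hypothetical numbers: the probabilities a model assigned to each sampled
# token, and made-up metric scores standing in for CIDEr.
sampled_token_probs = [0.5, 0.25, 0.8]
sampled_reward = 0.9    # hypothetical score of the sampled caption
baseline_reward = 0.6   # hypothetical score of a reference decode (assumption)

loss = reinforce_loss(sampled_token_probs, sampled_reward, baseline_reward)
print(f"surrogate loss: {loss:.4f}")
```

In a real training loop the token probabilities would come from the decoder and the loss would be backpropagated through them by an autodiff framework; subtracting a baseline is a common way to reduce the variance of the gradient estimate.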
Original language: English
Title of host publication: Detection and Classification of Acoustic Scenes and Events 2022 Challenge (DCASE2022)
Number of pages: 4
Publication status: Published - 2022

Fields of science

  • 202002 Audiovisual media
  • 102 Computer Sciences
  • 102001 Artificial intelligence
  • 102003 Image processing
  • 102015 Information systems

JKU Focus areas

  • Digital Transformation
