Fusing Audio and Metadata Embeddings Improves Language-based Audio

Research output: Chapter in Book/Report/Conference proceeding › Conference proceedings › peer-review

Abstract

Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics and then drawing connections between the two modalities. This paper investigates a hybrid retrieval system that utilizes audio metadata as an additional clue to understand the content of audio signals before matching them with textual queries. We experimented with metadata often attached to audio recordings, such as keywords and natural-language descriptions, and we investigated late and mid-level fusion strategies to merge audio and metadata. Our hybrid approach with keyword metadata and late fusion improved the retrieval performance over a content-based baseline by 2.36 and 3.69 pp. mAP@10 on the ClothoV2 and AudioCaps benchmarks, respectively.
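As an illustration of the late-fusion idea mentioned in the abstract, the following minimal sketch combines two similarity scores per candidate recording, one from an audio embedding and one from a metadata embedding, and scores a ranking with average precision at 10. It is not the authors' implementation: the function names, the weighted-sum fusion rule, the weight `alpha`, and the particular AP@k normalization are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def late_fusion_scores(query_emb, audio_emb, meta_emb, alpha=0.5):
    """Hypothetical late fusion: weighted sum of two cosine-similarity matrices.

    query_emb: (num_queries, dim) text-query embeddings
    audio_emb: (num_items, dim) embeddings of the audio content
    meta_emb:  (num_items, dim) embeddings of the metadata (e.g. keywords)
    alpha:     assumed mixing weight; 1.0 means audio only, 0.0 metadata only
    """
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    s_audio = q @ l2_normalize(np.asarray(audio_emb, dtype=float)).T
    s_meta = q @ l2_normalize(np.asarray(meta_emb, dtype=float)).T
    return alpha * s_audio + (1.0 - alpha) * s_meta


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k under one common definition (normalized by min(#relevant, k))."""
    hits = 0
    total = 0.0
    for rank, item_id in enumerate(ranked_ids[:k], start=1):
        if item_id in relevant_ids:
            hits += 1
            total += hits / rank
    denom = min(len(relevant_ids), k)
    return total / denom if denom else 0.0


def rank_items(scores_row):
    """Return item indices sorted from highest to lowest score."""
    return [int(i) for i in np.argsort(-scores_row, kind="stable")]
```

Averaging `average_precision_at_k` over all queries gives mAP@10 under this definition. Mid-level fusion, also named in the abstract, would merge the two representations before scoring rather than merging the scores, and is not shown here.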
Original language: English
Title of host publication: Proceedings of the 32nd European Signal Processing Conference (EUSIPCO), Lyon, France
Number of pages: 5
Publication status: Published - 2024

Fields of science

  • 202002 Audiovisual media
  • 102 Computer Sciences
  • 102001 Artificial intelligence
  • 102003 Image processing
  • 102015 Information systems
  • 101019 Stochastics
  • 103029 Statistical physics
  • 101018 Statistics
  • 101017 Game theory
  • 202017 Embedded systems
  • 101016 Optimisation
  • 101015 Operations research
  • 101014 Numerical mathematics
  • 101029 Mathematical statistics
  • 101028 Mathematical modelling
  • 101026 Time series analysis
  • 101024 Probability theory
  • 102032 Computational intelligence
  • 102004 Bioinformatics
  • 102013 Human-computer interaction
  • 101027 Dynamical systems
  • 305907 Medical statistics
  • 101004 Biomathematics
  • 305905 Medical informatics
  • 101031 Approximation theory
  • 102033 Data mining
  • 305901 Computer-aided diagnosis and therapy
  • 102019 Machine learning
  • 106007 Biostatistics
  • 102018 Artificial neural networks
  • 106005 Bioinformatics
  • 202037 Signal processing
  • 202036 Sensor systems
  • 202035 Robotics

JKU Focus areas

  • Digital Transformation
