Learning Face-Voice Association

Andreas Wagner

Research output: Thesis (Master's / Diploma thesis)

Abstract

With the growing popularity of social media platforms over the last decades, multimodal learning has attracted increasing interest from researchers. Millions of texts, images, videos and audio recordings are posted online every day. Using this vast amount of data, researchers have tackled multimodal tasks such as cross-modal retrieval, matching and verification [1]. Face-Voice Association (FVA) is an example of such a multimodal task, using audio and image data. The goal of FVA is to learn the underlying characteristics of the face images and voice samples of identities in order to accurately match faces to the voices of the same identity and vice versa. In this thesis, we conduct experiments on three different parts of the FVA pipeline. Existing work on FVA uses only late fusion, meaning that fusing the modality-specific features is the last step before obtaining the logits. We therefore analyze the impact of different fusion strategies, such as early and middle fusion, as well as a fusion-independent approach. Furthermore, modality-specific sub-networks are often leveraged to extract discriminative embeddings. To investigate the impact of these sub-networks on overall FVA performance, we evaluate four different sub-networks. Finally, motivated by the success of hyperbolic embeddings in deep learning, we analyze whether hyperbolic embeddings are suited for FVA and compare them to models using Euclidean space. Our experiments show that FVA performance increases the later we fuse the features. Additionally, we find that the choice of sub-networks for feature extraction has a major impact on overall performance. Lastly, our experiments with hyperbolic embeddings suggest that they may be well suited to FVA; however, more research is needed to reach a firm conclusion.
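To illustrate the comparison between hyperbolic and Euclidean embeddings mentioned in the abstract, the following is a minimal sketch of matching a voice embedding to the nearest face embedding under either distance. It assumes the Poincaré ball model of hyperbolic space, which the abstract does not specify; the function names (`euclidean_distance`, `poincare_distance`, `match_voice_to_face`) and the toy vectors are hypothetical and are not taken from the thesis.

```python
import math


def euclidean_distance(u, v):
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball (assumed hyperbolic model).

    Both points must lie strictly inside the unit ball (norm < 1).
    """
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_dist / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


def match_voice_to_face(voice, faces, distance):
    """Return the index of the face embedding closest to the voice embedding."""
    return min(range(len(faces)), key=lambda i: distance(voice, faces[i]))


# Toy 2-D embeddings (made up for illustration only).
voice = [0.10, 0.20]
faces = [[0.80, 0.00], [0.12, 0.18], [-0.50, -0.50]]

best_euclidean = match_voice_to_face(voice, faces, euclidean_distance)
best_hyperbolic = match_voice_to_face(voice, faces, poincare_distance)
```

In this toy setting both distances select the same face. In the Poincaré ball, distances grow rapidly near the boundary, so the two geometries can rank candidates differently when embeddings have large norms.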
Original language: English
Supervisors/Reviewers
  • Nawaz, Shah, Supervisor
Publication status: Published - Jul 2024

Fields of science

  • 202002 Audiovisual media
  • 102 Computer Sciences
  • 102001 Artificial intelligence
  • 102003 Image processing
  • 102015 Information systems

JKU Focus areas

  • Digital Transformation
