Learning Audio-Sheet Music Correspondences for Cross-Modal Retrieval and Piece Identification

Matthias Dorfer, Jan Hajic, Andreas Arzt, Harald Frostel, Gerhard Widmer

Research output: Contribution to journal › Article › peer-review

Abstract

This work addresses the problem of matching musical audio directly to sheet music, without any higher-level abstract representation. We propose a method that learns joint embedding spaces for short excerpts of audio and their respective counterparts in sheet music images, using multimodal convolutional neural networks. Given the learned representations, we show how to utilize them for two sheet-music-related tasks: (1) piece/score identification from audio queries and (2) retrieving relevant performances given a score as a search query. All retrieval models are trained and evaluated on a new, large-scale multimodal audio–sheet music dataset, which is made publicly available along with this article. The dataset comprises 479 precisely annotated solo piano pieces by 53 composers, for a total of 1,129 pages of music and about 15 hours of aligned audio, which was synthesized from these scores. Going beyond this synthetic training data, we carry out first retrieval experiments using scans of real sheet music of high complexity (e.g., nearly the complete solo piano works by Frédéric Chopin) and commercial recordings by famous concert pianists. Our results suggest that the proposed method, in combination with the large-scale dataset, yields retrieval models that successfully generalize to data well beyond the synthetic training data used for model building.
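
As a concrete illustration of the approach described in the abstract, the following is a minimal, hypothetical sketch of a two-branch embedding model: one small CNN encodes sheet-music image snippets, a second CNN encodes audio (spectrogram) excerpts, and both are projected into a shared space where matching pairs are trained to lie close under cosine similarity via a ranking loss. The layer sizes, loss formulation, input shapes, and margin below are illustrative assumptions, not the exact configuration used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_encoder(in_channels: int, embed_dim: int) -> nn.Module:
    """Small CNN that maps an input snippet to an embedding vector."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 24, kernel_size=3, padding=1), nn.ELU(),
        nn.MaxPool2d(2),
        nn.Conv2d(24, 48, kernel_size=3, padding=1), nn.ELU(),
        nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1),   # global average pooling -> (B, 48, 1, 1)
        nn.Flatten(),
        nn.Linear(48, embed_dim),
    )


class CrossModalEmbedding(nn.Module):
    """Two-branch network that projects both modalities into one space."""

    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.sheet_net = conv_encoder(in_channels=1, embed_dim=embed_dim)
        self.audio_net = conv_encoder(in_channels=1, embed_dim=embed_dim)

    def forward(self, sheet, audio):
        z_sheet = F.normalize(self.sheet_net(sheet), dim=1)  # unit length
        z_audio = F.normalize(self.audio_net(audio), dim=1)
        return z_sheet, z_audio


def ranking_loss(z_sheet, z_audio, margin: float = 0.7):
    """Hinge ranking loss with in-batch negatives on cosine similarity."""
    sim = z_sheet @ z_audio.t()            # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)          # similarities of matching pairs
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hinge = F.relu(margin - pos + sim)     # penalize close non-matching pairs
    return hinge[off_diag].mean()


if __name__ == "__main__":
    model = CrossModalEmbedding()
    sheet_batch = torch.randn(8, 1, 96, 96)   # dummy score-image snippets
    audio_batch = torch.randn(8, 1, 92, 42)   # dummy spectrogram excerpts
    z_s, z_a = model(sheet_batch, audio_batch)
    print("training loss:", ranking_loss(z_s, z_a).item())

    # Retrieval: rank all sheet snippets by cosine similarity to one audio
    # query; piece identification would aggregate such excerpt-level matches.
    ranks = (z_a[0] @ z_s.t()).argsort(descending=True)
    print("sheet snippets ranked for audio query 0:", ranks.tolist())

In such a joint space, both retrieval directions mentioned in the abstract (identifying a score from an audio query and finding performances given a score) reduce to nearest-neighbor search over the embeddings of the other modality.
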
Original language: English
Number of pages: 12
Journal: Transactions of the International Society for Music Information Retrieval
Publication status: Published - 2018

Fields of science

  • 202002 Audiovisual media
  • 102 Computer Sciences
  • 102001 Artificial intelligence
  • 102003 Image processing
  • 102015 Information systems

JKU Focus areas

  • Computation in Informatics and Mathematics
  • Engineering and Natural Sciences (in general)