
Pay Attention to the Keys: Visual Piano Transcription Using Transformers

Research output: Chapter in Book/Report/Conference proceeding › Conference proceedings › peer-review

Abstract

Visual piano transcription (VPT) is the task of obtaining a symbolic representation of a piano performance from visual information alone (e.g., from a top-down video of the piano keyboard). In this work, we propose a VPT system based on the vision transformer (ViT), which surpasses previous methods based on convolutional neural networks (CNNs). Our system is trained on the newly introduced R3 dataset, consisting of ca. 31 hours of synchronized video and MIDI recordings of piano performances. We additionally introduce an approach to predict note offsets, which has not been previously explored in this context. We show that our system outperforms the state of the art on the PianoYT dataset for onset prediction and on the R3 dataset for both onsets and offsets.
Original language: English
Title of host publication: Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Editors: James Kwok
Pages: 10243-10251
Number of pages: 9
Edition: 1
ISBN (Electronic): 978-1-956792-06-5
Publication status: Published - 2025

Publication series

Name: IJCAI International Joint Conference on Artificial Intelligence
ISSN (Print): 1045-0823

Fields of science

  • 102003 Image processing
  • 202002 Audiovisual media
  • 102001 Artificial intelligence
  • 102015 Information systems
  • 102 Computer Sciences

JKU Focus areas

  • Digital Transformation
