Pay Attention to the Keys: Visual Piano Transcription Using Transformers

Publication: Contribution to book/report/conference proceedings, Conference contribution, Peer-reviewed

Abstract

Visual piano transcription (VPT) is the task of obtaining a symbolic representation of a piano performance from visual information alone (e.g., from a top-down video of the piano keyboard). In this work we propose a VPT system based on the vision transformer (ViT), which surpasses previous methods based on convolutional neural networks (CNNs). Our system is trained on the newly introduced R3 dataset, consisting of ca. 31 hours of synchronized video and MIDI recordings of piano performances. We additionally introduce an approach to predict note offsets, which has not been previously explored in this context. We show that our system outperforms the state of the art on the PianoYT dataset for onset prediction and on the R3 dataset for both onsets and offsets.
Original language: English
Title: Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Editors: James Kwok
Pages: 10243-10251
Number of pages: 9
Edition: 1
ISBN (electronic): 978-1-956792-06-5
DOIs
Publication status: Published - 2025

Publication series

Name: IJCAI International Joint Conference on Artificial Intelligence
ISSN (Print): 1045-0823

Fields of science

  • 102003 Image processing
  • 202002 Audiovisual media
  • 102001 Artificial Intelligence
  • 102015 Information systems
  • 102 Computer Sciences

JKU Focus areas

  • Digital Transformation