TY - CONF
T1 - Estimating Musical Surprisal in Audio
AU - Bjare, Mathias
AU - Cantisani, Giorgia
AU - Lattner, Stefan
AU - Widmer, Gerhard
PY - 2025
Y1 - 2025
AB - In modeling musical surprisal expectancy with computational methods, it has been proposed to use the information content (IC) of one-step predictions from an autoregressive model as a proxy for surprisal in symbolic music. With an appropriately chosen model, the IC of musical events has been shown to correlate with human perception of surprise and complexity aspects, including tonal and rhythmic complexity. This work investigates whether an analogous methodology can be applied to music audio. We train an autoregressive Transformer model to predict compressed latent audio representations of a pretrained autoencoder network. We verify learning effects by estimating the decrease in IC with repetitions. We investigate the mean IC of musical segment types (e.g., A or B) and find that segment types appearing later in a piece have a higher IC than earlier ones on average. We investigate the IC’s relation to audio and musical features and find it correlated with timbral variations and loudness and, to a lesser extent, dissonance, rhythmic complexity, and onset density. Finally, we investigate if the IC can predict EEG responses to songs and thus model humans’ surprisal in music. We provide code for our method on github.com/sonycslparis/audioic.
UR - https://www.scopus.com/pages/publications/105009591626
U2 - 10.1109/ICASSP49660.2025.10890619
DO - 10.1109/ICASSP49660.2025.10890619
M3 - Conference proceedings
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2025)
ER -