TY - GEN
T1 - Frame-level Audio Similarity - A Codebook Approach.
AU - Widmer, Gerhard
AU - Knees, Peter
AU - Seyerlehner, Klaus
PY - 2008
Y1 - 2008
N2 - Modeling audio signals by the long-term statistical distribution
of their local spectral features - often called the bag-of-frames
(BOF) approach - is a popular and powerful method to describe
audio content. While modeling the distribution of local spectral
features by semi-parametric distributions (e.g. Gaussian Mixture
Models) has been studied intensively, we investigate a non-parametric
variant based on vector quantization (VQ) in this paper.
The essential advantage of the proposed VQ approach over state-of-the-art
similarity measures is that the proposed audio similarity
metric forms a normed vector space. This allows for more powerful
search strategies, e.g. KD-Trees or Locality-Sensitive Hashing
(LSH), making content-based audio similarity available for
even larger music archives. Standard VQ approaches are known
to be computationally very expensive; to counter this problem,
we propose a multi-level clustering architecture. Additionally, we
show that the multi-level vector quantization approach (ML-VQ),
in contrast to standard VQ approaches, is comparable to state-of-the-art
frame-level similarity measures in terms of quality. Another
important finding w.r.t. the ML-VQ approach is that, in contrast
to GMM models of songs, our approach does not seem to
suffer from the recently discovered hub problem.
UR - https://www.scopus.com/pages/publications/77951169912
M3 - Conference proceedings
SN - 9789512295173
T3 - Proceedings of the International Conference on Digital Audio Effects, DAFx
SP - 349
EP - 356
BT - Proceedings of the 11th International Conference on Digital Audio Effects (DAFx 2008)
ER -