Distilling Knowledge For Low-Complexity Convolutional Neural Networks From a Patchout Audio Transformer

Research output: Chapter in Book/Report/Conference proceeding › Conference proceedings

Abstract

In this technical report, we describe the CP-JKU team's submission for Task 1, Low-Complexity Acoustic Scene Classification, of the DCASE 2022 challenge. We use Knowledge Distillation to teach low-complexity CNN student models, with Patchout Spectrogram Transformer (PaSST) models acting as teachers. The PaSST models are pre-trained on AudioSet and fine-tuned on the TAU Urban Acoustic Scenes 2022 Mobile development dataset. We experiment with an ensemble of teachers, with different receptive fields for the student models, and with mixing frequency-wise statistics of spectrograms to improve generalization to unseen devices. Finally, the student models are quantized so that inference is performed with 8-bit integers, simulating the low-complexity constraints of edge devices.
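For readers unfamiliar with the setup, the following is a minimal sketch of the standard temperature-scaled distillation loss used in teacher-student training of this kind. The temperature, the loss weighting, and the function name are illustrative assumptions, not values or names taken from the report.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine hard-label cross-entropy with a soft-target KL term.

    T (temperature) and alpha (mixing weight) are illustrative defaults,
    not the hyperparameters reported in the technical report.
    """
    # Soft targets: teacher and student distributions softened by T.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the hard-label term
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```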
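"Mixing frequency-wise statistics of spectrograms" refers to a MixStyle-like augmentation applied per frequency bin. Below is a hedged sketch under the assumption that statistics are computed over the channel and time axes of a (batch, channel, freq, time) spectrogram tensor; `alpha`, `p`, and the function name are placeholders, not the submission's settings.

```python
import torch

def freq_mixstyle(x, alpha=0.3, p=0.5, eps=1e-6):
    """Mix frequency-wise mean/std between random pairs of spectrograms.

    x: (batch, channels, freq, time) log-mel spectrograms.
    alpha and p are illustrative hyperparameters.
    """
    if torch.rand(1).item() > p:
        return x
    b = x.size(0)
    # Per-sample statistics over channel and time, kept per frequency bin.
    mu = x.mean(dim=(1, 3), keepdim=True)          # (B, 1, F, 1)
    sig = (x.var(dim=(1, 3), keepdim=True) + eps).sqrt()
    x_norm = (x - mu) / sig
    # Interpolate statistics with those of a randomly permuted batch.
    lam = torch.distributions.Beta(alpha, alpha).sample((b, 1, 1, 1)).to(x.device)
    perm = torch.randperm(b, device=x.device)
    mu_mix = lam * mu + (1.0 - lam) * mu[perm]
    sig_mix = lam * sig + (1.0 - lam) * sig[perm]
    return x_norm * sig_mix + mu_mix
```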
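For the 8-bit inference step, one common route is PyTorch's eager-mode post-training static quantization; the sketch below shows that flow and is an assumption about tooling rather than the report's actual procedure. `student_model` and `calibration_loader` are hypothetical names, and the model is assumed to already wrap its input/output in QuantStub/DeQuantStub as eager-mode quantization requires.

```python
import torch
import torch.ao.quantization as tq

def quantize_int8(student_model, calibration_loader):
    """Post-training static quantization to int8 (eager-mode PyTorch).

    student_model and calibration_loader are hypothetical placeholders;
    the model is assumed to contain QuantStub/DeQuantStub boundaries.
    """
    student_model.eval()
    # Attach a default int8 qconfig for the x86 (fbgemm) backend.
    student_model.qconfig = tq.get_default_qconfig("fbgemm")
    prepared = tq.prepare(student_model)
    # Calibrate the observers with a few representative batches.
    with torch.no_grad():
        for spectrograms, _ in calibration_loader:
            prepared(spectrograms)
    # Swap float modules for their quantized int8 counterparts.
    return tq.convert(prepared)
```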
Original language: English
Title of host publication: Detection and Classification of Acoustic Scenes and Events (DCASE2022 Challenge), Technical Report
Number of pages: 5
Publication status: Published - 2022

Fields of science

  • 202002 Audiovisual media
  • 102 Computer Sciences
  • 102001 Artificial intelligence
  • 102003 Image processing
  • 102015 Information systems

JKU Focus areas

  • Digital Transformation
