Attribution assignment for deep-generative sequence models enables interpretability analysis using positive-only data

  • Robert Frank*
  • , Michael Widrich
  • , Rahmad Akbar
  • , Günter Klambauer
  • , Geir Kjetil Sandve
  • , Philippe A. Robert
  • , Victor Greiff
  • *Corresponding author for this work

Research output: Working paper and reportsPreprint

Abstract

Generative machine learning models offer a powerful framework for therapeutic design by efficiently exploring large spaces of biological sequences enriched for desirable properties. Unlike supervised learning methods, which require both positive and negative labeled data, generative models such as LSTMs can be trained solely on positively labeled sequences, for example, high-affinity antibodies. This is particularly advantageous in biological settings where negative data are scarce, unreliable, or biologically ill-defined. However, the lack of attribution methods for generative models has hindered the ability to extract interpretable biological insights from such models. To address this gap, we developed Generative Attribution Metric Analysis (GAMA), an attribution method for autoregressive generative models based on Integrated Gradients. We assessed GAMA using synthetic datasets with known ground truths to characterize its statistical behavior and validate its ability to recover biologically relevant features. We further demonstrated the utility of GAMA by applying it to experimental antibody-antigen binding data. GAMA enables model interpretability and the validation of generative sequence design strategies without the need for negative training data.
Original languageEnglish
Number of pages42
DOIs
Publication statusPublished - 29 Jun 2025

Publication series

NamearXiv.org
No.2506.23182

Fields of science

  • 101019 Stochastics
  • 102003 Image processing
  • 103029 Statistical physics
  • 101018 Statistics
  • 101017 Game theory
  • 102001 Artificial intelligence
  • 202017 Embedded systems
  • 101016 Optimisation
  • 101015 Operations research
  • 101014 Numerical mathematics
  • 101029 Mathematical statistics
  • 101028 Mathematical modelling
  • 101026 Time series analysis
  • 101024 Probability theory
  • 102032 Computational intelligence
  • 102004 Bioinformatics
  • 102013 Human-computer interaction
  • 101027 Dynamical systems
  • 305907 Medical statistics
  • 101004 Biomathematics
  • 305905 Medical informatics
  • 101031 Approximation theory
  • 102033 Data mining
  • 102 Computer Sciences
  • 305901 Computer-aided diagnosis and therapy
  • 102019 Machine learning
  • 106007 Biostatistics
  • 102018 Artificial neural networks
  • 106005 Bioinformatics
  • 202037 Signal processing
  • 202036 Sensor systems
  • 202035 Robotics

JKU Focus areas

  • Digital Transformation

Cite this