Machine Learning-Driven Optimization of Specific, Compact, and Efficient Base Editors via Single-Round Diversification

Research output: Working paper and reports > Preprint

Abstract

Cytosine and adenosine base editors show great potential in research and clinical applications. Current iterations of the deaminase—the enzyme used to create precise single-nucleotide changes via base editing—exhibit various off-target effects, including Cas-independent off-targeting, off-base editing, and bystander editing. Engineered deaminases are typically derived from eukaryotic deaminases, which are larger and exhibit high levels of Cas-independent DNA editing, or from evolved variants of the E. coli TadA protein (ecTadA), which are smaller but frequently cause off-base editing. To overcome the limitations inherent to using a single protein sequence as the basis for engineering, we diversified 95 newly identified TadA orthologs by introducing literature-derived mutations and DNA shuffling to yield millions of training sequences for measuring base editor efficiency. Rather than pursuing multiple rounds of random mutagenesis and selection, we trained generative models on the performance data from the diversified pools of variants and drew on information-theoretic insights to efficiently explore the deaminase sequence space to generate diverse and high-performing deaminases. From a single round of diversification, we created a small set of novel and specific cytosine and adenosine deaminases that were markedly distinct in sequence from published base editor deaminases. We additionally found that the deaminases created by our model generally outperform those which we identified through typical directed evolution. The novel adenosine and cytosine deaminases identified in this work showed high on-base activity, comparable to the leading published base editors, but with demonstrably lower off-base activity. The cytosine deaminases were particularly compact compared to known sequences due to a truncation in their final α-helix.
Original language: English
Publisher: bioRxiv
Number of pages: 48
Publication status: Published - 30 Jul 2025

Fields of science

  • 101019 Stochastics
  • 102003 Image processing
  • 103029 Statistical physics
  • 101018 Statistics
  • 101017 Game theory
  • 102001 Artificial intelligence
  • 202017 Embedded systems
  • 101016 Optimisation
  • 101015 Operations research
  • 101014 Numerical mathematics
  • 101029 Mathematical statistics
  • 101028 Mathematical modelling
  • 101026 Time series analysis
  • 101024 Probability theory
  • 102032 Computational intelligence
  • 102004 Bioinformatics
  • 102013 Human-computer interaction
  • 101027 Dynamical systems
  • 305907 Medical statistics
  • 101004 Biomathematics
  • 305905 Medical informatics
  • 101031 Approximation theory
  • 102033 Data mining
  • 102 Computer Sciences
  • 305901 Computer-aided diagnosis and therapy
  • 102019 Machine learning
  • 106007 Biostatistics
  • 102018 Artificial neural networks
  • 106005 Bioinformatics
  • 202037 Signal processing
  • 202036 Sensor systems
  • 202035 Robotics

JKU Focus areas

  • Digital Transformation
