Abstract
In pursuit of advancing global health and resilience against evolving health threats such as pandemics and antimicrobial resistance, drug discovery remains imperative. Efficient drug discovery relies on robust molecular property prediction as well as retrosynthesis models, both of which are challenging due to the scarcity of data.
This thesis proposes uni- and multi-modal contrastive learning approaches to address these challenges. Leveraging Modern Hopfield Networks (MHN), we enhance compound representations, particularly in scenarios with limited examples, as well as for zero- and few-shot template prediction in retrosynthesis. Furthermore, we introduce a novel approach to activity prediction models that incorporates textual information describing experiments, offering fast property predictions even before biological experiments are conducted.
The methods demonstrate superior performance on few-shot learning benchmarks and zero-shot problems in drug discovery, surpassing state-of-the-art results on the FS-Mol dataset as well as on PubChem18, the largest text-to-molecules dataset, introduced herein. On this dataset, our method achieves an enrichment factor more than 10 times that of the baselines. On the retrosynthesis benchmark USPTO-50k, we improve predictive performance, specifically for reaction templates with few examples, at an inference speed several times faster than previous methods.
In conclusion, these contributions collectively advance the field, offering novel solutions for molecular property prediction and retrosynthetic planning in drug discovery. By leveraging multimodal contrastive learning techniques and incorporating textual information, our methods show promise in addressing the challenges posed by limited data availability.
| Original language | English |
|---|---|
| Qualification | PhD |
| Awarding Institution | |
| Supervisors/Reviewers | |
| Publication status | Published - Feb 2024 |
Fields of science
- 101019 Stochastics
- 102003 Image processing
- 103029 Statistical physics
- 101018 Statistics
- 101017 Game theory
- 102001 Artificial intelligence
- 202017 Embedded systems
- 101016 Optimisation
- 101015 Operations research
- 101014 Numerical mathematics
- 101029 Mathematical statistics
- 101028 Mathematical modelling
- 101026 Time series analysis
- 101024 Probability theory
- 102032 Computational intelligence
- 102004 Bioinformatics
- 102013 Human-computer interaction
- 101027 Dynamical systems
- 305907 Medical statistics
- 101004 Biomathematics
- 305905 Medical informatics
- 101031 Approximation theory
- 102033 Data mining
- 102 Computer Sciences
- 305901 Computer-aided diagnosis and therapy
- 102019 Machine learning
- 106007 Biostatistics
- 102018 Artificial neural networks
- 106005 Bioinformatics
- 202037 Signal processing
- 202036 Sensor systems
- 202035 Robotics
JKU Focus areas
- Digital Transformation