Thesis Defense: Abhinav Adduri | October 29, 2024 | 10 a.m.

Title: "Genome Mining and Machine Learning Algorithms for Natural Product Drug Discovery"

Abhinav Adduri

Tuesday, October 29, 2024

10 a.m. EST

GHC 7501

Committee: Hosein Mohimani (Chair), CMU; Louis Felix Nothias, Université Côte d'Azur; Russell Schwartz, CMU; Erik Wright

Abstract: Natural products have long been a rich source of diverse antimicrobials and clinically effective drug candidates. Non-ribosomal peptides (NRPs), polyketides (PKs), and NRP-PK hybrids are three classes of natural products that display a broad range of bioactivities, including antibiotic, antifungal, anticancer, and immunosuppressant activities. However, discovering these compounds through traditional bioactivity-guided techniques is costly and time-consuming, and often results in the rediscovery of known molecules. Consequently, genome mining has emerged as a high-throughput strategy for screening hundreds of thousands of microbial genomes to identify their potential for producing novel natural products.

In this thesis, I introduce a suite of algorithms and machine learning models that predict NRPs, PKs, and NRP-PK hybrids from their microbial biosynthetic gene clusters (BGCs) of origin. The resulting Seq2X methods significantly outperform the previous state of the art on benchmarks. We used Seq2PKS and Seq2Hybrid to link several orphan natural products to their putative BGCs, and we further used Seq2NRP (subsequently named NPDiscover) to identify a novel antifungal drug, Edaphochelin, that displays promising activity against drug-resistant Candida auris and Candida glabrata strains. Taking inspiration from recent trends in deep learning, we further improve the generalization of our approach by using large protein language models (PLMs) to featurize microbial BGCs. The resulting method, MASPR, provides accurate and interpretable predictions of adenylation domain specificity in NRPs and NRP-PK hybrids.

Unlike previous approaches, MASPR offers zero-shot classification of novel substrates not present in the training data, greatly improving the applicability of our methods to fungal genomes. Lastly, I introduce SPRINT, a high-throughput deep learning method that uses protein-structure-aware PLMs to predict drug-target interactions (DTIs) for the virtual screening of our mined natural products against entire human and microbial proteomes. SPRINT sets a new state of the art in speed and accuracy and, together with the Seq2X suite and MASPR, presents an end-to-end strategy to mine drug-like natural products from microbial genomes and elucidate their mechanisms of action when targeting specific proteins.