October 22, 2024
Thesis Defense: Abhinav Adduri | October 29, 2024 | 10 a.m.
Title: "Genome Mining and Machine Learning Algorithms for Natural Product Drug Discovery"
Abhinav Adduri
Tuesday, October 29, 2024
10 a.m. EST
GHC 7501
Committee: Hosein Mohimani (Chair), CMU; Louis Felix Nothias, Université Côte d'Azur; Russell Schwartz, CMU; Erik Wright
Abstract: Natural products have long been a rich source of diverse antimicrobials and clinically effective drug candidates. Non-ribosomal peptides (NRPs), polyketides (PKs), and NRP-PK hybrids are three classes of natural products that display a broad range of bioactivities, including antibiotic, antifungal, anticancer, and immunosuppressant activities. However, discovering these compounds through traditional bioactivity-guided techniques is costly and time-consuming, and often results in the rediscovery of known molecules. Consequently, genome mining has emerged as a high-throughput strategy to screen hundreds of thousands of microbial genomes to identify their potential for producing novel natural products.
In this thesis, I introduce a suite of algorithms and machine learning models that predict NRPs, PKs, and NRP-PK hybrids from their microbial biosynthetic gene clusters (BGCs) of origin. Our resulting Seq2X methods significantly outperform the previous state-of-the-art in benchmarks. We used Seq2PKS and Seq2Hybrid to link several orphan natural products to their putative BGCs, and we further used Seq2NRP (subsequently named NPDiscover) to identify a novel antifungal drug, Edaphochelin, that displays promising activity against drug-resistant Candida auris and Candida glabrata strains. Taking inspiration from recent trends in deep learning, we further improve the generalization of our approach using large protein language models (PLMs) to featurize microbial BGCs. The resulting method, MASPR, provides accurate and interpretable predictions of adenylation domain specificity in NRPs and NRP-PK hybrids.
Unlike previous approaches, MASPR offers zero-shot classification for novel substrates not present in the training data, greatly improving the applicability of our methods to fungal genomes. Lastly, I introduce SPRINT, a high-throughput deep learning method that uses protein-structure-aware PLMs to predict drug-target interactions (DTIs) for the virtual screening of our mined natural products against entire human and microbial proteomes. SPRINT sets a new state-of-the-art in speed and accuracy and, together with the Seq2X suite and MASPR, presents an end-to-end strategy to mine drug-like natural products from microbial genomes and elucidate their mechanisms of action when targeting specific proteins.