Novel large language models for biological sequences and their interactions

Dr Ke Yuan & Prof Crispin Miller

Labs: AI for Cancer Research & Computational Biology
Duration: 4 years, starting October 2025
Closing Date: Friday 14 March 2025
Interviews for this position will take place in April 2025

Background

The identification of genomic alterations driving cancer progression has historically relied on finding recurrent mutations across patient cohorts. While some driver mutations are well characterized because they recur frequently, the vast majority of mutations are rare, shared by only a few patients, which limits the scope for functional studies and the pool of potential therapeutic targets. Recent advances in large language models (LLMs) trained on vast collections of biological sequences offer a promising solution, enabling a deeper understanding of the structural and functional consequences of genomic alterations in DNA, RNA, and protein sequences. By leveraging LLMs, we can model the effects of even rare mutations in cancer in unprecedented detail.
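To illustrate the idea, the sketch below scores a single amino-acid substitution with a masked protein language model, in the spirit of the variant-effect work by Brandes et al. listed under Relevant Publications. It is a minimal sketch, not the project's prescribed method: the ESM-2 checkpoint, the example sequence, and the masked-marginal scoring heuristic are all illustrative assumptions.

    # Minimal sketch: score one substitution by masking its position and
    # comparing the model's log-probabilities for the wild-type and mutant
    # residues. Checkpoint and sequence are illustrative placeholders.
    import torch
    from transformers import AutoTokenizer, EsmForMaskedLM

    model_name = "facebook/esm2_t6_8M_UR50D"  # small public ESM-2 checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = EsmForMaskedLM.from_pretrained(model_name).eval()

    def substitution_score(sequence: str, pos: int, wt: str, mut: str) -> float:
        """Log-odds of mutant vs wild-type residue at a masked position
        (0-based); more negative suggests a more damaging change."""
        assert sequence[pos] == wt, "wild-type residue does not match sequence"
        tokens = tokenizer(sequence, return_tensors="pt")
        # +1 offsets the start-of-sequence token prepended by the tokenizer
        tokens["input_ids"][0, pos + 1] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(**tokens).logits
        log_probs = logits[0, pos + 1].log_softmax(dim=-1)
        wt_id = tokenizer.convert_tokens_to_ids(wt)
        mut_id = tokenizer.convert_tokens_to_ids(mut)
        return (log_probs[mut_id] - log_probs[wt_id]).item()

    print(substitution_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 2, "T", "A"))

The same score can be computed for every possible substitution at every position, which is what makes rare, patient-specific mutations tractable without cohort-level recurrence.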

This project aims to train and apply LLMs to explore the impact of genomic mutations, focusing on protein sequences, RNA sequences, and their interactions. These models will provide novel insights into the consequences of genomic alterations and pave the way toward improved understanding and therapeutic targeting in cancer biology.

Research Aims

Objective 1: Building LLMs for Protein Sequences and Protein-Protein Interactions
Mutations in protein sequences can have significant functional consequences, particularly in the context of protein-protein interactions. Most current protein language models are trained on single protein sequences, limiting their ability to model these interactions. In this objective, we aim to build a novel protein language model designed to learn and predict protein-protein interactions. The model will be applied to both mouse and patient data, enabling the study of mutations in the context of protein interaction networks.
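As a point of reference for what this objective would improve on, the sketch below shows a common baseline: embed each partner with a frozen single-sequence protein language model and score the pair with a small classifier head. It is a hedged sketch only; the checkpoint, the mean-pooling, and the classifier head are illustrative assumptions, not the novel pair-aware architecture this project will develop.

    # Minimal two-tower baseline for protein-protein interaction prediction:
    # frozen per-protein embeddings feed a small (untrained) classifier head.
    import torch
    import torch.nn as nn
    from transformers import AutoTokenizer, EsmModel

    model_name = "facebook/esm2_t6_8M_UR50D"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = EsmModel.from_pretrained(model_name).eval()

    def embed(seq: str) -> torch.Tensor:
        """Mean-pool the final hidden states over sequence positions."""
        tokens = tokenizer(seq, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**tokens).last_hidden_state  # (1, L, D)
        return hidden.mean(dim=1).squeeze(0)

    class InteractionHead(nn.Module):
        """Maps a concatenated pair of protein embeddings to an interaction logit."""
        def __init__(self, dim: int):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1)
            )

        def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
            return self.mlp(torch.cat([a, b], dim=-1))

    head = InteractionHead(encoder.config.hidden_size)
    logit = head(embed("MKTAYIAKQR"), embed("GSHMLEDPVE"))
    print(torch.sigmoid(logit).item())  # untrained head, so close to 0.5

Because each protein is embedded independently here, the encoder never sees the binding partner; a model trained jointly on interacting pairs is precisely what this objective targets.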

Objective 2: Building LLMs for RNA Sequences and Protein-RNA Interactions
Mutations in RNA sequences, including those in non-coding regions, can have profound functional consequences. In collaboration with RNA-focused research groups, we will develop LLMs that model interactions between RNA and proteins, as well as the functional consequences of mutations. Building on an existing RNA model and a wealth of data for fine-tuning and validation, we will explore how these interactions influence gene regulation and cancer progression.
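Purely as an illustration of cross-modal scoring, the sketch below combines an RNA embedding with a protein embedding into a joint interaction logit. The existing RNA model's interface is not specified in this advert, so a simple one-hot convolutional encoder stands in for it, and every dimension and name below is an assumption.

    # Minimal cross-modal sketch: a stand-in RNA encoder plus a protein
    # embedding feed a joint scorer. All components are illustrative.
    import torch
    import torch.nn as nn

    RNA_ALPHABET = {"A": 0, "C": 1, "G": 2, "U": 3}

    def one_hot_rna(seq: str) -> torch.Tensor:
        """One-hot encode an RNA sequence as an (L, 4) tensor."""
        x = torch.zeros(len(seq), 4)
        for i, base in enumerate(seq):
            x[i, RNA_ALPHABET[base]] = 1.0
        return x

    class RnaEncoder(nn.Module):
        """Stand-in RNA tower: 1D convolution over one-hot bases, then mean-pool."""
        def __init__(self, dim: int = 64):
            super().__init__()
            self.conv = nn.Conv1d(4, dim, kernel_size=7, padding=3)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = torch.relu(self.conv(x.T.unsqueeze(0)))  # (1, dim, L)
            return h.mean(dim=-1).squeeze(0)

    class RnaProteinScorer(nn.Module):
        """Concatenates RNA and protein embeddings and outputs an interaction logit."""
        def __init__(self, rna_dim: int, prot_dim: int):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(rna_dim + prot_dim, 64), nn.ReLU(), nn.Linear(64, 1)
            )

        def forward(self, rna_emb, prot_emb):
            return self.mlp(torch.cat([rna_emb, prot_emb], dim=-1))

    rna_emb = RnaEncoder()(one_hot_rna("AUGGCUACGUAG"))
    prot_emb = torch.randn(320)  # placeholder for a protein-LM embedding
    print(torch.sigmoid(RnaProteinScorer(64, 320)(rna_emb, prot_emb)).item())

In practice the stand-in RNA tower would be replaced by the group's pretrained RNA model and the random protein vector by a real protein-LM embedding; the wiring of the two towers is the point of the sketch.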

Objective 3: Fine-Tuning LLMs for Mitochondrial Genomes
Mitochondrial mutations are increasingly recognized for their role in cancer and other diseases. Glasgow's ongoing initiative in mitochondrial genomics provides an ideal opportunity to fine-tune LLMs for mitochondrial DNA. These models will enhance our ability to assess the functional consequences of mitochondrial mutations and could potentially inform the design of genome-editing tools targeting the mitochondrial genome.
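To make the fine-tuning step concrete, the sketch below runs standard masked-language-model fine-tuning over short mitochondrial DNA windows using the HuggingFace Trainer. It is a minimal sketch under stated assumptions: the checkpoint name is a guess at a public nucleotide model, and the two sequences are toy placeholders, not the models or data the project will use.

    # Minimal masked-LM fine-tuning sketch on toy mtDNA windows. Substitute
    # the project's pretrained nucleotide LM and real tiled mtDNA windows.
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"  # assumed
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

    # Toy corpus: fixed-length windows tiled across mitochondrial genomes.
    windows = ["GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT",
               "TTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTG"]
    train_dataset = [tokenizer(w, truncation=True, max_length=128) for w in windows]

    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                               mlm_probability=0.15)
    args = TrainingArguments(output_dir="mtdna_ft", num_train_epochs=1,
                             per_device_train_batch_size=2, report_to=[])
    Trainer(model=model, args=args, train_dataset=train_dataset,
            data_collator=collator).train()

After fine-tuning, the masked-marginal scoring shown under Background can be reused unchanged to rank mitochondrial variants.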

Skills/Techniques that will be gained

  1. Programming skills for training, validating, and testing state-of-the-art deep learning models.
  2. State-of-the-art methods for fine-tuning large language models, including transformer- and state-space-model-based architectures.
  3. Biological interpretation of mutations in protein, RNA and DNA sequences.
  4. Broad understanding of the translational value of cutting-edge data and technology in cancer sciences.

For questions regarding the application process, PhD programme/studentships at the CRUK Scotland Institute or any other queries, please contact phdstudentships@beatson.gla.ac.uk.


Applications are open to all individuals irrespective of nationality or country of residence.


Relevant Publications

Lamb K et al. From a single sequence to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2 protein sequences. bioRxiv 2024.07.05.602129. doi: 10.1101/2024.07.05.602129.

Zhang Z et al. Protein language models learn evolutionary statistics of interacting sequence motifs. bioRxiv 2024.01.30.577970. doi: 10.1101/2024.01.30.577970.

Rives A et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS 2021 Apr 13;118(15):e2016239118. doi: 10.1073/pnas.2016239118.

Brandes N et al. Genome-wide prediction of disease variant effects with a deep protein language model. Nature Genetics 2023 Sep;55(9):1512-1522. doi: 10.1038/s41588-023-01465-0.