High School Student River Hill High School Highland, Maryland, United States
Introduction: Creating vaccines that remain effective against a wide range of variants has been a long-standing challenge in vaccine design, as the COVID-19 pandemic made clear. Several studies have used consensus-sequence vaccine design to engineer viral proteins that are “sequence averages” of circulating variants, but this approach has produced mixed results and little clinical value.
Materials and Methods: The emergence of natural language processing in biology has enabled “protein language models” to learn the syntax and semantics of protein sequences in the same way such models learn them for English sentences, allowing useful information about a protein to be inferred from its sequence alone. These models learn “embeddings,” vector representations of protein sequences that encode information about protein function, the full scope of which is still unknown. In this project, I hypothesized that the functional center point between viral spike variants is better represented by language-model embeddings than by the state-of-the-art naive sequence average.
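The notion of a sequence embedding described above can be illustrated with a minimal sketch. This is not the project's code: a random matrix stands in for the per-residue vectors a protein language model such as ESM-2 would emit, and the dimensions are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical stand-in for protein-language-model output: such models emit
# one vector per residue, so a protein of length L maps to an (L x dim) matrix.
rng = np.random.default_rng(0)
per_residue = rng.normal(size=(120, 320))  # 120 residues, assumed width 320

# A common way to obtain a single fixed-length embedding for the whole
# sequence is to mean-pool over the residue axis.
sequence_embedding = per_residue.mean(axis=0)
print(sequence_embedding.shape)  # (320,)
```

Mean-pooling is only one pooling choice; the key point is that every protein sequence, regardless of length, is mapped to one vector in a shared embedding space where distances can be compared.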
Results, Conclusions, and Discussions: I tested this hypothesis by extracting embeddings for variants and vaccine candidates of multiple viruses using the ESM-2 protein language model. I then computed cosine similarities between vaccine embeddings and the embeddings of their target variants, and showed that these similarities correspond with vaccine neutralization across multiple viruses. For computer scientists, this shows that protein embeddings contain complex information, including viral neutralization landscapes. For biologists and vaccine designers, it shows that a vaccine sequence's distance from its intended viral variants in embedding space correlates with viral neutralization, providing a method to engineer better cross-neutralizing vaccines for future pandemics without years-long iteration cycles.
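The scoring step described above can be sketched as follows. This is a hedged illustration, not the project's pipeline: random vectors stand in for real ESM-2 embeddings, and the embedding width and variant count are assumed values.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product divided by the product of the norms.
    # 1.0 means the vectors point in the same direction, 0.0 orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
dim = 320  # assumed embedding width

vaccine = rng.normal(size=dim)        # hypothetical vaccine-candidate embedding
variants = rng.normal(size=(5, dim))  # hypothetical embeddings of 5 variants

# Score the candidate by its similarity to each variant it should cover;
# a higher average similarity would predict broader cross-neutralization.
scores = [cosine_similarity(vaccine, v) for v in variants]
print(np.mean(scores))
```

Ranking several candidate sequences by this average similarity is one way such a score could guide candidate selection before any wet-lab neutralization assay.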