Assistant Professor, University of Pittsburgh, Pittsburgh, Pennsylvania, United States
Introduction: The explosion of sequence data has enabled the rapid growth of protein language models (pLMs), which have been employed in many frameworks, including variant-effect and peptide-specificity prediction. Traditionally, for protein-protein or peptide-protein interactions (PPIs), the two sequences are either embedded separately and integrated post hoc, or concatenated prior to embedding. However, no existing method uses a language representation of the interaction itself. Here we present Sliding Window Interaction Grammar (SWING), a first-in-class interaction language model (iLM) that captures the language of protein-protein and protein-peptide interactions. SWING leverages differences in amino acid properties to generate an interaction vocabulary; this vocabulary is the input to a language model, whose representations are then used as features in a supervised prediction step. We applied SWING to a range of tasks, where it matched or outperformed benchmarks set by current state-of-the-art methods across biological domains. We demonstrate the utility of the iLM in nuanced interaction-associated tasks such as predicting peptide-MHC (pMHC) interactions across contexts and species, and predicting the impact of missense mutations on the disruption of specific protein-protein interactions. Furthermore, SWING extends the utility of deep learning-based models to contexts where the full-length protein sequence is unavailable or unnecessary. Overall, SWING is a generalizable iLM that can be applied across contexts to learn the language of peptide and protein interactions.
Materials and Methods: SWING captures pairwise residue information in a language-like representation, followed by lexicon embedding. A sliding window corresponding to the query protein, a peptide of length n, is matched in its entirety against n positions on the target partner sequence, starting from the first position. At each position, the difference in an amino acid-based biochemical metric is calculated, and the absolute value of this difference is rounded to ordinally encode each amino acid pair. The sliding window is then shifted by one position and these steps are repeated until the end of the target sequence. The resulting encoded sequence is divided into overlapping k-mers, where each subsequence can be thought of as a "word" and each interaction as a "document" composed of these words. Each document is encoded using a Doc2Vec model to infer a corresponding vector representation: each interaction is mapped to a unique vector represented by a row of the matrix D, and each k-mer is mapped to a unique vector represented by a row of the matrix W. The average of the vectors from D and W is fed into the hidden layer, which projects the averaged input into a lower-rank space, and the output layer applies a softmax function to produce a probability distribution over k-mers, used to predict the target k-mer. D and W are randomly initialized and updated as training progresses. The resulting interaction embeddings are then used as features for an XGBoost classifier.
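A minimal sketch of this pipeline is shown below. It is illustrative only, not the authors' implementation: the Kyte-Doolittle hydropathy scale, the k-mer length of 3, the use of gensim's PV-DM Doc2Vec (with vector averaging) as the lexicon-embedding step, and the XGBoost hyperparameters are all assumptions, since the abstract does not specify these details.

```python
# Illustrative sketch of a SWING-like pipeline (assumptions noted above).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from xgboost import XGBClassifier

# Kyte-Doolittle hydropathy values: one possible amino acid-based metric.
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}

def encode_interaction(query: str, target: str) -> str:
    """Slide the full query peptide (length n) along the target sequence.
    At each offset, every aligned residue pair is ordinally encoded as the
    rounded absolute difference of the biochemical metric, and the encodings
    are concatenated into one interaction string."""
    n = len(query)
    digits = []
    for start in range(len(target) - n + 1):           # shift window by one residue
        for q, t in zip(query, target[start:start + n]):
            digits.append(str(round(abs(KD[q] - KD[t]))))
    return ''.join(digits)

def to_words(encoded: str, k: int = 3) -> list[str]:
    """Split the encoded interaction ('document') into overlapping k-mer 'words'."""
    return [encoded[i:i + k] for i in range(len(encoded) - k + 1)]

def train(interactions, labels, k: int = 3, dim: int = 64):
    """interactions: list of (query_peptide, target_sequence) pairs;
    labels: per-interaction outcome (e.g. binder vs. non-binder)."""
    docs = [TaggedDocument(to_words(encode_interaction(q, t), k), [i])
            for i, (q, t) in enumerate(interactions)]
    # PV-DM Doc2Vec: document vectors (rows of D) and context word vectors
    # (rows of W) are averaged and passed to a softmax layer that predicts
    # the target k-mer; both matrices start random and update during training.
    d2v = Doc2Vec(docs, dm=1, dm_mean=1, vector_size=dim, window=5,
                  min_count=1, epochs=20)
    # The learned interaction embeddings become features for a supervised
    # gradient-boosted tree classifier.
    X = np.vstack([d2v.dv[i] for i in range(len(interactions))])
    clf = XGBClassifier(n_estimators=200, max_depth=4).fit(X, np.asarray(labels))
    return d2v, clf
```

On this hydropathy scale the rounded absolute difference is always a single digit (0-9), so k-mers taken over the concatenated digit string are unambiguous; other property scales would require a delimiter or fixed-width encoding.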
Results, Conclusions, and Discussions: SWING was first applied to predicting pMHC interactions. Three pMHC prediction models (Class I, Class II, and Mixed Class) were trained using a large ensemble of human immunopeptidome datasets. Existing approaches use separate Class I and Class II pMHC prediction models, as the structural and functional aspects of Class I and Class II pMHC interactions are distinct. SWING not only generated Class I and Class II models with predictive performance comparable to state-of-the-art approaches, but its unique Mixed Class model also successfully predicted both classes jointly on unseen datasets. Further, we generated a SWING model trained only on Class I alleles that was predictive for Class II, a complex prediction task not attempted by any existing approach (Fig 2E). On de novo data, using only Class I or only Class II training data, SWING also accurately predicted Class II pMHC interactions in murine models of systemic lupus erythematosus (SLE; the MRL/lpr model) and type 1 diabetes (T1D; the NOD model), and these predictions were validated experimentally. The immunopeptidomes of the two disease models were characterized, and binders of H-2-IEk and H-2-IAg7 were subsequently analyzed using SWING as well as two benchmark approaches, MixMHC2Pred 2.0 and NetMHCIIPan 4.2. All three SWING models (trained only on Class I data, only on Class II data, or on the Mixed Class data) performed significantly better than the benchmarks, despite SWING being the only method not trained on murine data. This demonstrates that SWING is a highly generalizable zero-shot model that learns the language of PPIs. To further evaluate SWING's generalizability, we tested its ability to predict the disruption of specific protein-protein interactions by missense mutations. Although modern methods such as AlphaMissense and ESM1b can predict interfaces and per-mutation variant effects or pathogenicity, they cannot predict interaction-specific disruptions. SWING accurately predicted, relative to experimental benchmarks, the impact of Mendelian mutations and population variants on protein-protein interactions. With a model trained on both datasets, SWING also accurately predicted interaction disruptions caused by variants regardless of context. Overall, SWING is a first-in-class, generalizable, zero-shot iLM that can learn the language of peptide and protein interactions from sequence information alone.