Post bacc University of Oregon Eugene, Oregon, United States
Introduction: PET (polyethylene terephthalate) is one of the most common plastics used in everyday products. Efficient and clean degradation of PET can help mitigate its environmental impact and allow for complete chemical recycling. A small number of naturally occurring enzymes, referred to as PETases, have been discovered that catalyze the breakdown of PET. However, these natural enzymes have limited activity at ambient temperatures, and efforts to improve their activity have had limited success.
Our goal is to engineer novel PETase enzymes that can serve as additional starting points for further improvement by directed evolution or rational design. To achieve this, we use computational methods, including machine learning, to design and optimize these plastic-degrading enzymes. Specifically, we train machine learning models on multiple sequence alignment (MSA) data assembled from sequences retrieved by their similarity to PETase. This approach allows us to generate new PETases with potential for enhanced activity and stability.
Machine learning also lets us analyze the generated sequences. We combine these algorithms with predictive models and physics-based molecular docking software to study the interactions between PETase enzymes and PET molecules, with the aim of producing PETases that degrade polyethylene terephthalate faster.
By integrating computational design with experimental techniques, we aim to create a robust pipeline for the development of highly efficient PETases.
Materials and Methods: To generate new enzymes, machine learning models trained on naturally occurring sequences retrieved with HMMER searches are used to create novel PETase amino acid sequences. Using a well-studied crystal structure of PETase as a template, the retrieved sequences are aligned into a multiple sequence alignment (MSA) that forms the training set for the models. The main models used are autoregressive direct coupling analysis (arDCA) and the MSA Transformer. arDCA generates PETase sequences by modeling the statistical dependencies between positions in native sequences, which makes the generated sequences more likely to fold into functional enzymes. The MSA Transformer takes an MSA as input and captures co-evolutionary patterns with an architecture that alternates between row and column attention, which reduces computational cost compared with attention over the whole alignment at once.
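To illustrate the autoregressive idea behind arDCA, the following is a minimal sketch rather than the pipeline used in this work. It assumes the standard arDCA form, in which each residue is drawn from a softmax over a site field plus couplings to the residues already placed. The parameters `h` and `J` here are random placeholders standing in for values that would be fitted to the MSA, and the function name and `temperature` argument are illustrative.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the alignment gap
Q = len(ALPHABET)

def sample_sequence(h, J, rng, temperature=1.0):
    """Sample one aligned sequence, one column at a time.

    h: (L, Q) site fields. J: (L, L, Q, Q) couplings; only j < i is read.
    Both are placeholders here, not parameters fitted to real PETase data.
    """
    length = h.shape[0]
    seq = np.empty(length, dtype=int)
    for i in range(length):
        logits = h[i].copy()
        for j in range(i):
            # Contribution of the residue already chosen at column j.
            logits += J[i, j, :, seq[j]]
        logits = logits / temperature
        logits -= logits.max()  # numerical stability before exponentiating
        p = np.exp(logits)
        p /= p.sum()
        seq[i] = rng.choice(Q, p=p)
    return "".join(ALPHABET[a] for a in seq)

rng = np.random.default_rng(0)
L_COLS = 12  # toy alignment length
h = rng.normal(size=(L_COLS, Q))
J = rng.normal(scale=0.1, size=(L_COLS, L_COLS, Q, Q))
print(sample_sequence(h, J, rng, temperature=1.0))
```

In this sketch a temperature above 1 flattens each conditional distribution and gives more diverse samples, while a temperature near 0 approaches always picking the most probable residue.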
New sequences are compared to native ones using principal component analysis (PCA) to visualize sequence space. Three-dimensional structures are predicted with ESMFold, chosen for its speed although it is less accurate than AlphaFold. Initial folding revealed problems with N-terminal residues, so SignalP is used to predict and remove signal peptides before subsequent rounds of sequence generation. The catalytic residues essential for enzymatic activity (serine, aspartic acid, and histidine) are checked for conservation in the generated sequences. Hamming distance is used to measure sequence diversity, with heatmaps visualizing the pairwise differences. Raising the model sampling temperature increases sequence diversity, so the temperature is tuned to balance novelty against structural stability, which is assessed via pLDDT scores from ESMFold.
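The two sequence-space comparisons described here can be sketched as follows. This is a minimal illustration under stated assumptions: sequences are aligned to equal length, they are one-hot encoded before PCA, and the principal axes are fitted on the native set only. The toy sequences and all function names are hypothetical, not data or code from this study.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: k for k, aa in enumerate(ALPHABET)}

def one_hot(seqs):
    """Encode equal-length aligned sequences as flat one-hot vectors."""
    X = np.zeros((len(seqs), len(seqs[0]), len(ALPHABET)))
    for n, s in enumerate(seqs):
        for i, aa in enumerate(s):
            X[n, i, INDEX[aa]] = 1.0
    return X.reshape(len(seqs), -1)

def pca_project(X_ref, X_new, n_components=2):
    """Fit principal axes on X_ref and project both sets onto them."""
    mean = X_ref.mean(axis=0)
    _, _, vt = np.linalg.svd(X_ref - mean, full_matrices=False)
    axes = vt[:n_components]
    return (X_ref - mean) @ axes.T, (X_new - mean) @ axes.T

def hamming_matrix(seqs_a, seqs_b):
    """Count mismatched columns between every pair of aligned sequences."""
    A = np.array([list(s) for s in seqs_a])
    B = np.array([list(s) for s in seqs_b])
    return (A[:, None, :] != B[None, :, :]).sum(axis=2)

# Toy aligned sequences, invented for illustration only.
native = ["MSAHG-DK", "MSGHGTDK", "LSAHG-DR"]
generated = ["MSAHGTDR", "LTAHG-EK"]

native_pc, generated_pc = pca_project(one_hot(native), one_hot(generated))
distances = hamming_matrix(native, generated)
print(distances)  # rows: native sequences, columns: generated sequences
```

The distance matrix can be passed directly to a heatmap plotting routine, and the two projected arrays can be overlaid in a scatter plot to see whether generated sequences fall inside or outside the region occupied by native ones.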
Molecular docking simulations with AutoDock Vina evaluate PETase interactions with PET, with Rosetta used to refine structures for better accuracy. Root mean square deviation (RMSD) between relaxed and non-relaxed structures is used to quantify how much refinement changes each structure and to check that the predicted PETase folds remain structurally stable.
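As a reference for the RMSD comparison, the sketch below computes RMSD after optimal superposition with the Kabsch algorithm. It assumes that both structures are supplied as matching (N, 3) coordinate arrays, for example C-alpha atoms in the same order; whether this study superposed structures this way is not stated, so the function is illustrative only.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)  # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q  # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P @ R.T  # rotate P onto Q
    return float(np.sqrt(((P_rot - Q) ** 2).sum() / len(P)))

# Toy check: a rotated and shifted copy should give an RMSD near zero.
rng = np.random.default_rng(0)
coords = rng.normal(size=(10, 3))
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
moved = coords @ rot_z.T + np.array([5.0, -2.0, 1.0])
print(kabsch_rmsd(coords, moved))
```

A small RMSD between relaxed and non-relaxed coordinates indicates that refinement made only minor adjustments to the predicted fold.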
Results, Conclusions, and Discussions: HMMER searches yielded approximately 1100 sequences with high similarity to known PETase enzymes. The structures of these sequences were predicted with ESMFold and aligned to the PETase crystal structure using PyMOL. Many predicted structures did not align well, indicating that most of the retrieved sequences were not true PETases. Filtering out poorly aligned sequences left 473 native sequences. Signal peptides were removed, and over 10,000 PETase sequences were generated using the arDCA model and the MSA Transformer.
Most generated sequences had high pLDDT scores, suggesting stable structures. Alignments showed that 98% of generated sequences conserved the essential catalytic residues (serine, aspartic acid, histidine); sequences lacking these residues were filtered out. Raising the sampling temperature of the models increased sequence diversity, with higher temperatures producing greater Hamming distances from native sequences, indicating broader exploration of sequence space. However, higher Hamming distances correlated with lower pLDDT scores, suggesting decreased structural stability.
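The catalytic-residue filter can be sketched as a simple column check on aligned sequences. The column indices in `TRIAD` and the toy sequences below are hypothetical placeholders; in practice the columns would come from mapping the catalytic serine, aspartic acid, and histidine of the reference PETase onto the alignment.

```python
# Hypothetical 0-based alignment columns for the catalytic triad.
TRIAD = {3: "S", 7: "D", 10: "H"}

def has_triad(aligned_seq, triad=TRIAD):
    """True if every catalytic column holds the expected residue."""
    return all(aligned_seq[col] == aa for col, aa in triad.items())

def filter_by_triad(seqs, triad=TRIAD):
    """Return the sequences that keep the triad and the fraction kept."""
    kept = [s for s in seqs if has_triad(s, triad)]
    return kept, len(kept) / len(seqs)

# Toy aligned sequences, invented for illustration only.
candidates = [
    "MKTSAGLDPAHW",  # triad intact
    "MKTAAGLDPAHW",  # serine column mutated
    "MRTSVGLDPQHW",  # triad intact
    "MKTSAGLEPAHW",  # aspartic acid column mutated
]
kept, fraction = filter_by_triad(candidates)
print(kept, fraction)
```

Applying the check to aligned rather than raw sequences matters, because insertions and deletions shift the position of the catalytic residues in the unaligned sequence.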
Docking simulations using AutoDock Vina assessed the binding affinities of PETases to PET. Higher exhaustiveness settings improved docking accuracy but required more computational resources; an exhaustiveness of 100 balanced accuracy and computational demand. Structures were refined with Rosetta Relax, and the resulting change was measured by RMSD. The refined structures showed minimal structural changes but improved binding affinities in docking simulations, indicating better interactions with PET.