Machine learning-guided mutagenesis

Inventors & their inventions

Apr 08, 2024

Axial partners with great founders and inventors. We invest in early-stage life sciences companies such as Appia Bio, Seranova Bio, Delix Therapeutics, and Simcha Therapeutics, among others, often when they are no more than an idea. We are fanatical about helping the rare inventor who is compelled to build their own enduring business. If you or someone you know has a great idea or company in life sciences, Axial would be excited to get to know you and possibly invest in your vision and company. We are excited to be in business with you. Email us at info@axialvc.com

Machine learning-guided mutagenesis is an approach in protein engineering that combines directed evolution techniques with machine learning algorithms to accelerate the discovery and optimization of protein variants with desired functional properties. This method has gained significant attention due to its potential to overcome the limitations of traditional directed evolution methods, which often rely on throughput-limited wet-lab screening.

The central premise behind ML-guided mutagenesis is to use machine learning to predict the relationship between protein sequences and their corresponding functions. By training machine learning models on experimental data obtained from initial mutagenesis libraries, researchers can guide the design of subsequent libraries enriched with potentially improved variants, thereby reducing the experimental effort required for screening and increasing the efficiency of the directed evolution process.

Directed evolution is a powerful technique for engineering proteins with enhanced or novel functions. It involves iterative cycles of mutagenesis, screening or selection, and amplification of desired variants. However, traditional directed evolution methods face several challenges:

a. Library size and screening throughput: Generating large mutagenesis libraries increases the probability of finding desirable variants, but the screening throughput is often limited by experimental constraints, making it challenging to explore the vast sequence space effectively.

b. Epistatic interactions and combinatorial complexity: Protein functions can depend on complex interactions between multiple mutations, making it difficult to predict the effects of combining beneficial mutations identified in separate rounds of mutagenesis.

c. Local optima and sequence space exploration: Directed evolution can become trapped in local optima, failing to explore regions of the sequence space that may harbor superior variants.
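To put rough numbers on points (a) and (b), here is a small sketch in Python. The function names are hypothetical, and the epistasis formula assumes fitness is measured on a scale where independent effects add (for example, log-transformed activity). The fitness values in the usage lines are invented for illustration.

```python
from math import comb

NUM_AMINO_ACIDS = 20  # the standard proteinogenic alphabet


def variants_at_distance(length: int, k: int) -> int:
    """Count sequences exactly k substitutions away from a parent of the given length."""
    # Choose which k positions mutate, then one of 19 alternative residues at each.
    return comb(length, k) * (NUM_AMINO_ACIDS - 1) ** k


def pairwise_epistasis(f_wt: float, f_a: float, f_b: float, f_ab: float) -> float:
    """Deviation of the double mutant from the additive expectation.

    Assumes fitness values are on a scale where independent effects add.
    A result of zero means the two mutations combine additively.
    """
    additive_expectation = f_a + f_b - f_wt
    return f_ab - additive_expectation


if __name__ == "__main__":
    # Library size grows quickly with the number of simultaneous substitutions.
    for k in (1, 2, 3):
        print(k, variants_at_distance(100, k))
    # Two individually beneficial mutations (invented values) that combine poorly.
    print(pairwise_epistasis(f_wt=1.0, f_a=1.5, f_b=1.4, f_ab=1.2))
```

For a 100-residue protein this gives 1,900 single mutants but 1,786,950 double mutants, which is why exhaustive screening stops being practical after the first few substitutions. A negative epistasis value, as in the invented example, is the situation in which combining separately discovered beneficial mutations disappoints.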

Machine learning-guided mutagenesis aims to address these challenges by incorporating machine learning algorithms into the directed evolution process. First, an initial library of protein variants is created through random or targeted mutagenesis techniques, such as error-prone PCR or site-saturation mutagenesis. The library is then experimentally screened or selected for the desired functional property, and the sequences and corresponding functional data (e.g., enzyme activity, binding affinity) are collected. These data are used to train a machine learning model that predicts the functional property of interest from the protein sequence. The trained model is then used to predict the functional properties of uncharacterized sequences and to rank them by predicted performance, and this ranking is used to design a subsequent, focused mutagenesis library enriched with potentially improved variants. Finally, the process is repeated iteratively, with each round incorporating experimental data from the previous round to refine the model and generate increasingly enriched libraries.

Machine learning models can predict the functional properties of vast numbers of uncharacterized protein sequences, enabling the design of focused libraries enriched with potentially improved variants. This can significantly reduce the experimental screening effort required. By leveraging machine learning predictions, directed evolution can explore broader regions of the sequence space, increasing the likelihood of discovering superior variants that might otherwise be overlooked by traditional methods. The models can also potentially capture complex interactions between multiple mutations, allowing the design of combinatorial libraries that account for epistatic effects. Finally, they can integrate various types of data, such as structural information, evolutionary constraints, and physicochemical properties, providing a holistic approach to predicting protein function and enabling knowledge transfer between related proteins or systems.
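One round of the train, rank, and design cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not a published method. The model is a deliberately simple additive one, in which each residue's effect is the mean fitness of the variants carrying it minus the overall mean. All function names, sequences, and fitness values are invented for the example.

```python
from collections import defaultdict
from itertools import product


def fit_additive_model(measured):
    """Fit per-residue effects from a dict mapping sequence -> measured fitness."""
    baseline = sum(measured.values()) / len(measured)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seq, fitness in measured.items():
        for pos, aa in enumerate(seq):
            totals[(pos, aa)] += fitness
            counts[(pos, aa)] += 1
    # Effect of a residue = mean fitness of variants carrying it, minus the overall mean.
    effects = {key: totals[key] / counts[key] - baseline for key in totals}
    return baseline, effects


def predict_fitness(seq, baseline, effects):
    """Predicted fitness: baseline plus the summed effects of the residues present."""
    return baseline + sum(effects.get((pos, aa), 0.0) for pos, aa in enumerate(seq))


def design_next_library(measured, size):
    """Rank unmeasured recombinations of observed residues and keep the top `size`."""
    baseline, effects = fit_additive_model(measured)
    length = len(next(iter(measured)))
    # Residues seen at each position in the screened library.
    observed = [sorted({seq[pos] for seq in measured}) for pos in range(length)]
    candidates = ("".join(combo) for combo in product(*observed))
    unseen = [seq for seq in candidates if seq not in measured]
    unseen.sort(key=lambda s: predict_fitness(s, baseline, effects), reverse=True)
    return unseen[:size]


if __name__ == "__main__":
    # Toy round-one screen of a three-residue parent "ACD" (values are invented).
    round_one = {"ACD": 1.0, "GCD": 1.6, "AWD": 1.4, "ACE": 0.5}
    print(design_next_library(round_one, size=2))
```

In a real campaign the proposed sequences would be synthesized and screened, and their measurements added to `round_one` before refitting. Because the model here is purely additive, it cannot represent interactions between mutations. Capturing those requires a richer model and more data.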

Various ML methods have been used in the context of protein engineering, each with its own strengths and limitations:

Regression-based methods: Linear regression, Gaussian processes, and other regression techniques are commonly used to model the relationship between protein sequences and functional properties.
Generative models: Deep generative models (e.g., variational autoencoders, generative adversarial networks) can learn the underlying distribution of protein sequences and generate novel sequences with desired properties.
Sequence-based models: Convolutional neural networks, recurrent neural networks, and transformers have been applied to capture the sequential and structural patterns in protein sequences, enabling predictions based on sequence information alone.
Hybrid models: Combining multiple machine learning techniques, such as integrating sequence-based models with structural information or physicochemical properties, can potentially improve predictive performance.
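As a concrete instance of the regression-based category, the sketch below one-hot encodes sequences and fits a ridge-regularized linear model by plain gradient descent, using only the Python standard library. The function names, hyperparameters, and toy data are assumptions made for illustration. In practice a numerical library would replace the hand-written training loop.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Flatten a sequence into a 0/1 vector of length len(seq) * 20."""
    vec = [0.0] * (len(seq) * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq):
        vec[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return vec


def fit_ridge(seqs, ys, l2=1e-3, lr=0.1, steps=2000):
    """Fit weights and a bias by gradient descent on mean squared error plus an L2 penalty."""
    X = [one_hot(s) for s in seqs]
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        residuals = [sum(wi * xi for wi, xi in zip(w, x)) + b - y
                     for x, y in zip(X, ys)]
        # Gradient of the data term plus the L2 penalty on the weights (bias is unpenalized).
        grad_w = [sum(r * x[j] for r, x in zip(residuals, X)) / n + l2 * w[j]
                  for j in range(d)]
        grad_b = sum(residuals) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(seq, w, b):
    """Predicted property value for a sequence under the fitted linear model."""
    return sum(wi * xi for wi, xi in zip(w, one_hot(seq))) + b


if __name__ == "__main__":
    # Invented training data: four variants of a three-residue parent.
    seqs = ["ACD", "GCD", "AWD", "ACE"]
    ys = [1.0, 1.6, 1.4, 0.5]
    w, b = fit_ridge(seqs, ys)
    for s in seqs + ["GWD"]:
        print(s, round(predict(s, w, b), 3))
```

A linear model on one-hot features treats every position independently, so it is a reasonable baseline when data are scarce but cannot express epistasis. Gaussian processes and neural networks trade that simplicity for the ability to model interactions.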

While machine learning-guided mutagenesis has shown promising results, several challenges and future directions remain:

Data quality and quantity: The performance of machine learning models heavily depends on the quality and quantity of the training data. Addressing issues such as noise, bias, and limited data availability is crucial for improving model accuracy.
Interpretability and explainability: Many machine learning models, particularly deep neural networks, can be opaque and difficult to interpret, hindering the understanding of the underlying sequence-function relationships and limiting the ability to extract biological insights.
Expanding applicability: Most current studies have focused on specific protein classes or functions. Developing generalizable models that can be applied across diverse protein families and functions remains a significant challenge.
Integration of diverse data sources: Incorporating various types of data, such as structural information, evolutionary constraints, and physicochemical properties, into machine learning models can potentially improve their predictive power, but also introduces technical challenges related to data integration and feature engineering.
Exploration of novel sequence spaces: While machine learning can guide the exploration of unexplored regions of the sequence space, the ability to generate truly novel and functional protein sequences remains a significant challenge, as most current approaches are limited by the constraints of the initial training data.
Experimental validation and iteration: Machine learning-guided mutagenesis is an iterative process that requires continuous experimental validation and model refinement. Developing efficient strategies for experimental design, screening, and data integration is crucial for maximizing the benefits of this approach.

Machine learning-guided mutagenesis is a promising approach that combines the power of directed evolution with the predictive capabilities of machine learning algorithms. By addressing the limitations of traditional directed evolution methods and enabling more efficient exploration of sequence space, this approach has the potential to accelerate the discovery and optimization of novel protein variants with desired functional properties. However, ongoing challenges related to data quality, interpretability, and generalizability must be addressed to fully realize the potential of this technique in protein engineering and biotechnology applications.

Axial
