Axial: https://linktr.ee/axialxyz
Axial partners with great founders and inventors. We invest in early-stage life sciences companies such as Appia Bio, Seranova Bio, Delix Therapeutics, and Simcha Therapeutics, among others, often when they are no more than an idea. We are fanatical about helping the rare inventor who is compelled to build their own enduring business. If you or someone you know has a great idea or company in life sciences, Axial would be excited to get to know you and possibly invest in your vision and company. We are excited to be in business with you. Email us at info@axialvc.com.
The paper develops an approach to protein function prediction and homology search, leveraging conformal prediction to provide statistically rigorous guarantees and calibrated probabilities. The authors address a critical limitation in current methods: the lack of reliable statistical assurances, hindering the efficient selection of proteins for further experimental or computational investigation. Their method offers a framework for robustly pre-filtering candidate proteins, annotating genes of unknown function, and improving enzyme classification performance, all while offering calibrated probabilistic predictions rather than relying on arbitrary thresholds.
The core innovation lies in the application of conformal prediction. Unlike traditional approaches whose guarantees rest on often unrealistic model assumptions (e.g., linearity), conformal prediction is model-agnostic: its validity holds regardless of the underlying predictive model's complexity, provided the calibration and test data are exchangeable. This accommodates the deep learning "black box" models that are now prevalent in bioinformatics. By employing conformal prediction, the authors transform raw similarity scores (from various protein search models, including Protein-Vec, TM-Vec, Foldseek, and TOPH) into calibrated probabilities and well-defined retrieval sets. These sets are constructed to keep biologically relevant loss metrics (e.g., false discovery rate, false negative rate) below a user-specified risk level, allowing researchers to select proteins with controlled error rates. This addresses the problem of arbitrary threshold selection often encountered in current protein homology search methods, leading to more reliable and interpretable results.
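To make the thresholding idea concrete, here is a minimal sketch of how a retrieval threshold could be calibrated to control the false negative rate, in the style of conformal risk control. It is an illustration under stated assumptions, not the authors' implementation: the function names, the grid of candidate thresholds, and the data layout are all hypothetical, and the per-query loss is assumed to be bounded by 1. Each calibration query contributes an array of similarity scores and a boolean array marking which hits are true matches.

```python
import numpy as np


def fnr_loss(scores, is_match, threshold):
    """Fraction of one query's true matches that score below the threshold."""
    n_true = is_match.sum()
    if n_true == 0:
        return 0.0  # nothing to miss for this query
    missed = np.logical_and(is_match, scores < threshold).sum()
    return missed / n_true


def calibrate_fnr_threshold(cal_queries, alpha, grid):
    """Return the largest threshold in `grid` whose adjusted risk is <= alpha.

    cal_queries: list of (scores, is_match) pairs, one per calibration query.
    alpha:       target false negative rate chosen by the user.
    grid:        candidate thresholds (hypothetical; any 1-D array works).
    Returns None when no candidate satisfies the bound.
    """
    n = len(cal_queries)
    best = None
    for t in np.sort(grid):  # ascending: retrieval sets shrink as t grows
        risk = np.mean([fnr_loss(s, m, t) for s, m in cal_queries])
        # Finite-sample correction, assuming the loss is bounded by 1.
        bound = (n * risk + 1.0) / (n + 1)
        if bound <= alpha:
            best = t
        else:
            break  # the FNR never decreases as t grows, so stop here
    return best
```

At query time, the retrieval set is every database protein whose score is at least the returned threshold. The guarantee is on the expected loss over new queries and relies on those queries being exchangeable with the calibration queries.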
The paper demonstrates the efficacy of the approach across several key applications. First, the authors tackle the challenge of annotating genes of unknown function, focusing on JCVI-syn3.0, the minimal synthetic genome derived from Mycoplasma mycoides. Using their conformal prediction framework, they control the false discovery rate while assigning calibrated probabilities to predicted functional matches, so each annotation comes with a stated level of confidence rather than a raw similarity score. The method identifies functional matches for a significant portion (39.6%) of the coding genes. This application highlights the practical utility of the approach for exploring poorly understood genomes and identifying candidates for further experimental study.
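This summary does not spell out how raw similarity scores become calibrated probabilities. As a hedged illustration of the general idea, the sketch below fits a monotone step function with isotonic regression (the pool-adjacent-violators algorithm) on held-out scores with known match labels. Isotonic regression is swapped in here as a familiar stand-in, and the paper's own calibration procedure may differ. The names `fit_isotonic` and `predict_probability` are hypothetical, and the sketch assumes the calibration scores are distinct.

```python
import numpy as np


def fit_isotonic(scores, labels):
    """Fit a nondecreasing step function mapping score -> match probability.

    Returns (upper, means): block i covers scores up to upper[i] and
    predicts means[i]. Assumes distinct calibration scores.
    """
    order = np.argsort(scores)
    x = np.asarray(scores, dtype=float)[order]
    y = np.asarray(labels, dtype=float)[order]
    blocks = []  # each block is [sum of labels, count, largest score]
    for xi, yi in zip(x, y):
        blocks.append([yi, 1.0, xi])
        # Pool adjacent blocks while the earlier mean exceeds the later one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    upper = np.array([b[2] for b in blocks])
    means = np.array([b[0] / b[1] for b in blocks])
    return upper, means


def predict_probability(upper, means, score):
    """Look up the calibrated probability for a single new score."""
    idx = np.searchsorted(upper, score, side="left")
    idx = min(idx, len(means) - 1)  # scores above every block use the last one
    return means[idx]
```

A fitted curve of this kind lets a reader interpret a score as "roughly this fraction of calibration hits with similar scores were true matches", which is the practical meaning of a calibrated probability.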
Second, the paper addresses enzyme function prediction. The authors demonstrate the power of their conformal framework by applying it to an existing state-of-the-art enzyme classification model (CLEAN). They improve upon CLEAN's existing selection procedures (max-separation and p-value selection) by incorporating conformal prediction. This leads to more accurate and statistically robust classification across different datasets (New and Price), exceeding the results obtained by the original CLEAN methods. Notably, the approach achieves this without training any new models, showcasing the versatility and efficiency of the conformal prediction framework. This is particularly valuable for high-throughput enzyme annotation, where speed and accuracy are critical. The gains on the challenging Price dataset underscore the robustness of the method to potential data biases or inconsistencies.
Finally, the authors introduce a novel strategy for pre-filtering proteins for computationally intensive structural alignment algorithms (like DALI). They use a faster, embedding-based model (Protein-Vec) to pre-filter candidate proteins before applying DALI. By controlling the false negative rate via conformal prediction, they demonstrate a significant reduction in the number of proteins needing full DALI analysis, while retaining a substantial proportion (82.8%) of high-scoring DALI matches. This pre-filtering approach drastically speeds up the overall workflow without compromising accuracy, making large-scale structural homology searches significantly more feasible. This is especially relevant given the rapid increase in the number of predicted protein structures and the computational expense associated with high-resolution structural alignment methods.
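The two-stage workflow described above can be sketched as a short filter-then-align routine. This is a minimal sketch under stated assumptions: `fast_score` and `slow_align` are hypothetical callables standing in for an embedding-based similarity and an expensive structural aligner, no real Protein-Vec or DALI API is used, and the threshold is assumed to have been calibrated beforehand to a chosen false negative rate.

```python
def prefilter_then_align(candidates, fast_score, slow_align, threshold):
    """Run the expensive aligner only on candidates that pass the fast filter.

    candidates: list of candidate identifiers.
    fast_score: callable, candidate -> cheap similarity score (hypothetical).
    slow_align: callable, candidate -> expensive alignment result (hypothetical).
    threshold:  score cutoff, assumed calibrated ahead of time.
    Returns (results, fraction_skipped), where results pairs each kept
    candidate with its alignment result.
    """
    kept = [c for c in candidates if fast_score(c) >= threshold]
    results = [(c, slow_align(c)) for c in kept]
    if candidates:
        fraction_skipped = 1 - len(kept) / len(candidates)
    else:
        fraction_skipped = 0.0
    return results, fraction_skipped
```

The fraction skipped is the compute saved, and the calibrated threshold is what bounds how many true high-scoring matches are lost in exchange.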
The paper also addresses potential limitations and future directions. The authors acknowledge that the assumption of exchangeability, required for conformal prediction, might not always perfectly hold in biological data due to potential sampling biases or data heterogeneity. They suggest that future work could explore techniques designed to address distribution shifts, potentially enhancing the robustness and applicability of their approach. Furthermore, they highlight the need for more comprehensive and accurate datasets, as the reliability of conformal prediction relies on the representativeness of the calibration data.
In conclusion, this paper presents a significant advancement in the field of protein functional annotation and homology search. The use of conformal prediction provides a framework for generating statistically rigorous predictions, removing the reliance on arbitrary thresholds and enhancing interpretability. The authors demonstrate the utility of their approach across diverse applications, showcasing its potential to accelerate biological discovery and streamline research workflows. The emphasis on calibrated probabilities and user-specified risk levels makes the method practical and well suited to high-throughput analyses in bioinformatics and computational biology. The model-agnostic nature of the conformal approach supports broad applicability, facilitating its integration with various protein prediction and homology search algorithms. Future integration of this framework with advances in protein language models and techniques for handling distribution shifts holds promise for further improving the accuracy and reliability of protein function prediction and homology detection.