PoET: A generative model of protein families as sequences-of-sequences
Inventors & their inventions
Axial: https://linktr.ee/axialxyz
Axial partners with great founders and inventors. We invest in early-stage life sciences companies such as Appia Bio, Seranova Bio, Delix Therapeutics, and Simcha Therapeutics, among others, often when they are no more than an idea. We are fanatical about helping the rare inventor who is compelled to build their own enduring business. If you or someone you know has a great idea or company in life sciences, Axial would be excited to get to know you and possibly invest in your vision and company. We are excited to be in business with you — email us at info@axialvc.com
The paper introduces PoET (Protein Evolutionary Transformer), a new autoregressive generative language model for modeling and generating protein sequences. Unlike existing protein language models that model individual protein sequences, PoET models entire protein families as sequences-of-sequences. This framing lets PoET learn evolutionary constraints and relationships across tens of millions of natural protein families, so it generalizes well to unseen families and can leverage information shared across families during generation and prediction.
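To make the sequences-of-sequences idea concrete, here is a minimal sketch of how a protein family might be flattened into a single token stream for autoregressive training. The separator token and ordering are illustrative assumptions, not the paper's exact preprocessing.

```python
# Hypothetical sketch: flatten a protein family into one "sequence of
# sequences" that an autoregressive model reads left to right.
# The "$" separator marking the end of each homolog is an assumption
# for illustration, not the paper's actual tokenization.

def as_sequence_of_sequences(family, sep="$"):
    """Concatenate homologous sequences, each ending in a separator token."""
    return "".join(seq + sep for seq in family)

family = ["MKTAYIAK", "MKTAHIAK", "MRTAYIAK"]
print(as_sequence_of_sequences(family))
# MKTAYIAK$MKTAHIAK$MRTAYIAK$
```

Because the model sees many such concatenations drawn from different families during training, it can learn within-family constraints without ever being tied to one fixed alignment.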
Designing new proteins with enhanced functions is a major goal in fields like pharmaceuticals and biotechnology. While experimental methods like deep mutational scanning and directed evolution have had successes, they are costly and difficult. Accurate computational models that can predict the effects of sequence mutations and generate promising new sequences could accelerate protein engineering efforts.
Existing protein language models are either unconditional, trained on databases of all known proteins but unable to specialize to particular families of interest, or family-specific, trained only on sequences from one family defined by a multiple sequence alignment (MSA). The latter generally perform better at predicting fitness effects within their family, but they require large MSAs, cannot share learning across families, and cannot model insertions or deletions outside the alignment. Some hybrid models combine unconditional and family-specific components.
PoET is proposed as a unified model that can generalize across protein families while also specializing to particular families when conditioned on sequences from that family during inference. Its unique sequence-of-sequences framing is key to achieving these capabilities.
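The conditioning mechanism described above can be sketched as scoring a candidate variant by its conditional log-likelihood given a prompt of family sequences. The `log_prob` function below is a dummy uniform stand-in for a trained network, and the prompt format is an assumption for illustration.

```python
import math

# Hedged sketch of inference-time specialization: score a variant by
# summing log P(token | family prompt + variant prefix).
# `log_prob` is a placeholder uniform model over the 20 amino acids;
# a real PoET-style model would actually condition on `context`.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_prob(token, context):
    # Placeholder: ignores context and returns a uniform log-probability.
    return math.log(1.0 / len(AMINO_ACIDS))

def score_variant(variant, family_context, sep="$"):
    """Conditional log-likelihood of `variant` given family sequences."""
    context = sep.join(family_context) + sep
    total = 0.0
    for residue in variant:
        total += log_prob(residue, context)
        context += residue
    return total

family = ["MKTAYIAK", "MKTAHIAK"]
print(score_variant("MRTAYIAK", family))  # 8 residues -> 8 * log(1/20)
```

Swapping in sequences from a different family changes the prompt, not the model weights, which is what lets a single trained model specialize to any family at inference time.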