Protein function prediction plays a crucial role in understanding the intricate workings of biological systems. In recent years, significant progress has been made in this field through the development of various machine learning approaches. However, most existing methods formulate the task as a classification problem, aiming to assign predefined labels to proteins. In contrast, we propose a novel approach, Prot2Text, which predicts protein function descriptions in free text, moving beyond conventional binary or categorical classifications. By combining Graph Neural Networks and Large Language Models within an encoder-decoder framework, our model effectively integrates diverse data types, including protein sequence, structure, and textual annotations. This multimodal approach allows for a holistic representation of protein function, enabling the generation of detailed and accurate protein descriptions. Prot2Text models are trained on 248,315 proteins from SwissProt for 25 epochs on 64 V100 GPUs. We trained four variants of Prot2Text: Prot2Text Small with 256M parameters; Prot2Text Base with 283M parameters; Prot2Text Medium with 398M parameters; and Prot2Text Large with 898M parameters. Our extensive experimental results on the SwissProt dataset demonstrate the effectiveness of Prot2Text in generating detailed and accurate protein descriptions. Our findings underscore the immense potential of transformer-based multimodal models in the biological sciences, offering a valuable contribution toward advancing protein understanding and analysis.
On this page, you can test our multi-modal model Prot2Text Base and our Seq2Seq model Esm2Text Base.
Preprint: https://arxiv.org/pdf/2307.14367.pdf
Github Repository: https://github.com/hadi-abdine/Prot2Text
Prot2Text Architecture
The Prot2Text Base model is a multi-modal model that combines Graph Neural Networks and Large Language Models. It takes as input a protein ID that exists in AlphaFoldDB, downloads the protein's PDB file to construct the graph input for the GNN, then queries the amino-acid sequence from UniProt, and finally outputs a protein description. Prot2Text Base has 283M parameters, and its encoder uses ESM-35M.
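As a rough sketch of the first step above, fetching a predicted structure from AlphaFoldDB amounts to a single HTTP request. The URL pattern and version suffix below reflect the public AlphaFoldDB file layout at the time of writing and may change; the helper names are illustrative and not part of the Prot2Text codebase.

```python
from urllib.request import urlopen

# Public AlphaFoldDB file layout (assumption: version suffix may change over time).
AF_URL = "https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v{version}.pdb"

def alphafold_pdb_url(uniprot_id: str, version: int = 4) -> str:
    # Build the download URL for a UniProt accession, e.g. "Q8I6R7".
    return AF_URL.format(uniprot_id=uniprot_id, version=version)

def fetch_alphafold_pdb(uniprot_id: str, timeout: float = 30.0) -> str:
    # Download the PDB text; raises urllib.error.HTTPError for unknown IDs.
    with urlopen(alphafold_pdb_url(uniprot_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

The returned PDB text can then be parsed into residue coordinates to build the contact graph consumed by the GNN encoder.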
Examples of inputs (Click on the ID to place it in the input box):
Input: AlphaFold Protein ID
→
Amino acid sequence
Protein structure from AlphaFoldDB
Generated protein description
The Esm2Text model is a variant of our main model that can generate a protein description using only the amino-acid sequence. It is useful when the structure of the given protein is not available. To test this model, just enter the amino-acid sequence in the input box below. Esm2Text Base has 225M parameters, and its encoder uses ESM-35M.
Examples of inputs:
Input: Amino-Acid sequence
→
Generated protein description
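Before pasting a sequence into the input box, it can help to check that it contains only standard amino-acid letters, since malformed input is a common cause of failed runs. The small validator below is a generic sketch, not part of the Prot2Text or Esm2Text code.

```python
# The 20 standard amino acids in one-letter code.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_sequence(seq: str) -> bool:
    # Accept only non-empty strings made of standard residue letters
    # (whitespace is stripped; case is ignored).
    cleaned = "".join(seq.split()).upper()
    return len(cleaned) > 0 and set(cleaned) <= VALID_RESIDUES
```

Sequences copied from FASTA files should have their header line (starting with `>`) removed before validation.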