TIS Transformer: Re-annotation of the human proteome using deep learning
The precise detection of translation initiation sites is essential for proteome delineation. In turn, the accurate mapping of the proteome is fundamental to advancing our understanding of biological systems and cellular mechanisms. We propose TIS Transformer, a deep learning model that determines translation start sites from information embedded in processed transcript nucleotide sequences. By applying deep learning techniques first designed for natural language processing tasks, we have developed an approach that achieves state-of-the-art performance on the prediction of translation initiation sites. TIS Transformer uses the FAVOR+ algorithm for attention calculation, which enables the model to process full transcript sequences. Analysis of input importance revealed TIS Transformer's ability to detect key features of translation, such as translation stop sites and reading frames. Furthermore, we demonstrate TIS Transformer's ability to detect multiple peptides on a transcript, as well as peptides encoded by short open reading frames (sORFs), either alongside a canonical coding sequence or in long non-coding RNAs. Using a cross-validation scheme, we apply TIS Transformer to re-annotate the full human transcriptome.
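
To make concrete how a start site, a reading frame and a stop site relate, the following minimal sketch scans a transcript from a candidate initiation position to the first in-frame stop codon. It is an illustration only and not part of TIS Transformer: the function name `orf_from_tis`, the 0-based position convention and the restriction to the three standard stop codons (TAA, TAG, TGA) are assumptions made for this example.

```python
# Illustrative sketch only; not the TIS Transformer model or its API.
# Assumption: standard genetic code stop codons, 0-based positions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_from_tis(transcript, tis):
    """Return (start, end) of the reading frame opened at `tis`.

    `end` is exclusive and includes the stop codon. Returns None when
    no in-frame stop codon follows the candidate start position.
    """
    # Accept RNA or DNA alphabets, in either case.
    seq = transcript.upper().replace("U", "T")
    if not 0 <= tis < len(seq):
        raise ValueError("tis must be a valid 0-based position in the transcript")
    # Step codon by codon, staying in the frame set by the start position.
    for i in range(tis + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return tis, i + 3
    return None


if __name__ == "__main__":
    example = "GGATGAAATTTTAGCC"
    # Start at position 2 (ATG): codons ATG AAA TTT TAG.
    print(orf_from_tis(example, 2))
    # Shifting the start by one base changes the frame, so TAG is no longer in frame.
    print(orf_from_tis(example, 3))
```

In the example string, the `TGA` that overlaps the start codon is out of frame and is therefore ignored, which shows why the position of the start site fixes which stop site terminates the peptide.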