Deep embedding and alignment of protein sequences

2021 ◽  
Author(s):  
Felipe Llinares-López ◽  
Quentin Berthet ◽  
Mathieu Blondel ◽  
Olivier Teboul ◽  
Jean-Philippe Vert

Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here, we leverage recent advances in deep learning for language modelling and differentiable programming to propose DEDAL, a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, we show that DEDAL improves by up to two- or three-fold the alignment correctness over existing methods on remote homologs, and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.

2007 ◽  
Vol 88 (8) ◽  
pp. 2194-2197 ◽  
Author(s):  
Liwang Cui ◽  
Xiaowen Cheng ◽  
Lianchao Li ◽  
Jianyong Li

Ascoviruses are a family of insect viruses with circular, double-stranded DNA genomes. With the sequencing of the Trichoplusia ni ascovirus 2c (TnAV-2c) genome, the virion structural proteins were identified by using tandem mass spectrometry. From at least eight protein bands visible on a Coomassie blue-stained gel of TnAV-2c virion proteins, seven bands generated protein sequences that matched predicted open reading frames (ORFs) in the genome, i.e. ORFs 2, 43, 115, 141, 142, 147 and 153. Among these ORFs, only ORF153, encoding the major capsid protein, has been characterized previously.


1998 ◽  
Vol 54 (6) ◽  
pp. 1139-1146 ◽  
Author(s):  
Geoffrey J. Barton

The basic algorithms for alignment of two or more protein sequences are explained. Alternative methods for scoring substitutions and gaps (insertions and deletions) are described, as are global and local alignment methods. Multiple alignment techniques are explained, including methods for profile comparison. A summary is given of programs for the alignment and analysis of protein sequences, either from sequence alone, or from three-dimensional structure.
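As an illustration of the global alignment methods this abstract refers to, the following is a minimal sketch of Needleman-Wunsch dynamic programming with a traceback. The function name, the simple match/mismatch scores and the linear gap penalty are assumptions chosen for brevity; they are not taken from the review, which also covers substitution matrices and other gap schemes.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b (illustrative scoring, linear gaps).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # substitution or match
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ACDE", "ACE")
print(score)
print(top)
print(bottom)
```

In practice the match/mismatch scores would be replaced by a substitution matrix, and the gap penalty is often affine rather than linear.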


2002 ◽  
Vol 83 (4) ◽  
pp. 855-872 ◽  
Author(s):  
Caroline Gubser ◽  
Geoffrey L. Smith

Camelpox virus (CMPV) and variola virus (VAR) are orthopoxviruses (OPVs) that share several biological features and cause high mortality and morbidity in their single host species. The sequence of a virulent CMPV strain was determined; it is 202182 bp long, with inverted terminal repeats (ITRs) of 6045 bp and has 206 predicted open reading frames (ORFs). As for other poxviruses, the genes are tightly packed with little non-coding sequence. Most genes within 25 kb of each terminus are transcribed outwards towards the terminus, whereas genes within the centre of the genome are transcribed from either DNA strand. The central region of the genome contains genes that are highly conserved in other OPVs and 87 of these are conserved in all sequenced chordopoxviruses. In contrast, genes towards either terminus are more variable and encode proteins involved in host range, virulence or immunomodulation. In some cases, these are broken versions of genes found in other OPVs. The relationship of CMPV to other OPVs was analysed by comparisons of DNA and predicted protein sequences, repeats within the ITRs and arrangement of ORFs within the terminal regions. Each comparison gave the same conclusion: CMPV is the closest known virus to variola virus, the cause of smallpox.


2010 ◽  
Vol 76 (18) ◽  
pp. 6150-6155 ◽  
Author(s):  
Pedro A. Noguera ◽  
Jorge E. Ibarra

On the basis of the known cry gene sequences of Bacillus thuringiensis, three sets of primers were designed from four conserved blocks found in the delta-endotoxin-coding region. The primer pairs designed amplify the regions between blocks 1 and 5, 2 and 5, and 1 and 4. In silico analyses indicated that 100% of the known three-domain cry gene sequences can be amplified by these sets of primers. To test their ability to amplify known and unknown cry gene sequences, 27 strains from the CINVESTAV (LBIT series) collection showing atypical crystal morphology were selected. Their DNA was used as the template with the new primer system, and after a systematic amplification and sequencing of the amplicons, each strain showed one or more cry-related sequences, totaling 54 different sequences harbored by the 27 strains. Seven sequences were selected on the basis of their low level of identity to the known cry sequences, and once cloning and sequencing of the complete open reading frames were done, three new cry-type genes (primary ranks) were identified and the toxins that they encode were designated Cry57Aa1, Cry58Aa1, and Cry59Aa1 by the B. thuringiensis Toxin Nomenclature Committee. The rest of the seven sequences were classified Cry8Ka2, Cry8-like, Cry20Ba1, and Cry1Ma1 by the committee. The crystal morphology of the selected strains and analysis of the new Cry protein sequences showed interesting peculiarities.


The aim is to develop an efficient system for matching biological protein sequences and generating the scoring matrix with a distributed scan approach based on the Smith-Waterman (SW) algorithm. The algorithm generates the fastest solution, and the proposed system compares sequences with System, OpenMP and Hadoop. The comparison yields an efficient matrix of the protein sequence, which is useful for predicting the efficiency of the system.
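For reference, the scoring matrix that the Smith-Waterman algorithm fills can be sketched in a few lines. This is a single-threaded illustration only: the function name, the match/mismatch scores and the linear gap penalty are assumptions, and it does not reproduce the distributed scan, OpenMP or Hadoop implementations the abstract mentions.

```python
def smith_waterman_matrix(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman local-alignment scoring matrix for a and b.

    Returns (H, best), where H[i][j] is the best score of a local alignment
    ending at a[i-1] and b[j-1], and best is the maximum entry of H.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # First row and column stay at 0: a local alignment may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # restart the alignment here
                          H[i - 1][j - 1] + s,  # match or substitution
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            best = max(best, H[i][j])
    return H, best


H, best = smith_waterman_matrix("XXACDEYY", "ZZACDEWW")
print(best)
```

Each cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal are independent of one another; this property is what parallel implementations of the algorithm generally exploit.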

