A new protein sequence data bank

Nature ◽  
1985 ◽  
Vol 318 (6041) ◽  
pp. 19-19 ◽  
Author(s):  
JEAN-MICHEL CLAVERIE ◽  
ISABELLE SAUVAGET

1993 ◽  
Vol 21 (13) ◽  
pp. 3093-3096 ◽  
Author(s):  
Amos Bairoch ◽  
Brigitte Boeckmann

1993 ◽  
Vol 293 (3) ◽  
pp. 781-788 ◽  
Author(s):  
B Henrissat ◽  
A Bairoch

301 glycosyl hydrolases and related enzymes corresponding to 39 EC entries of the I.U.B. classification system have been classified into 35 families on the basis of amino-acid-sequence similarities [Henrissat (1991) Biochem. J. 280, 309-316]. Approximately half of the families were found to be monospecific (containing only one EC number), whereas the other half were found to be polyspecific (containing at least two EC numbers). A >60% increase in sequence data for glycosyl hydrolases (181 additional enzyme or enzyme-domain sequences have since become available) allowed us to update the classification not only by adding more members to already identified families, but also by identifying ten new families. On the basis of a comparison of 482 sequences corresponding to 52 EC entries, 45 families, of which 22 are polyspecific, can now be defined. This classification has been implemented in the SWISS-PROT protein sequence data bank.
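The counts quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated above; the variable names are illustrative and do not come from the paper.

```python
# Figures as quoted in the abstract (variable names are illustrative).
prev_seqs, added_seqs, total_seqs = 301, 181, 482
prev_fams, new_fams, total_fams = 35, 10, 45
polyspecific_fams = 22

# Relative growth in sequence data since the 1991 classification.
growth = added_seqs / prev_seqs

# Share of polyspecific families in the updated classification.
polyspecific_share = polyspecific_fams / total_fams

print(f"sequences: {prev_seqs} + {added_seqs} = {prev_seqs + added_seqs}")
print(f"families:  {prev_fams} + {new_fams} = {prev_fams + new_fams}")
print(f"growth in sequence data: {growth:.1%}")
print(f"polyspecific share: {polyspecific_share:.1%}")
```

The sums match the stated totals of 482 sequences and 45 families, and the growth ratio is just above 60%, consistent with the ">60% increase" wording.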


1992 ◽  
Vol 20 (suppl) ◽  
pp. 2019-2022 ◽  
Author(s):  
A. Bairoch ◽  
B. Boeckmann

1991 ◽  
Vol 19 (suppl) ◽  
pp. 2247-2249 ◽  
Author(s):  
A. Bairoch ◽  
B. Boeckmann

Entropy ◽  
2021 ◽  
Vol 23 (5) ◽  
pp. 530
Author(s):  
Milton Silva ◽  
Diogo Pratas ◽  
Armando J. Pinho

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors only marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models for the highest context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of computations that are three times slower. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences’ input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.
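The idea of using a compressor to measure similarity between sequences can be illustrated with a minimal sketch. This is not AC2 and does not reproduce the paper's measure: it swaps in the general-purpose `zlib` compressor and the normalized compression distance (NCD) as a stand-in. The three sequences are made-up toy strings, not real viral proteins, and the function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: lower values suggest more shared content."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy amino-acid strings (assumptions for illustration only).
query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
# Same string with two point substitutions.
near = query[:10] + "W" + query[11:40] + "C" + query[41:]
# Unrelated string over the same alphabet.
far = "GSHWNPCDYFTMECHWPNYGCTDMFWHNCPYEGWMTCHDFNPWYCGMHETNWDCFPYHGMWCNTEDHFPY"

for name, seq in [("near", near), ("far", far)]:
    print(f"NCD(query, {name}) = {ncd(query, seq):.3f}")
```

A general-purpose compressor such as `zlib` is a weak model for protein sequences, so distances computed this way are only indicative; a specialized compressor would be expected to separate related and unrelated sequences more sharply.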

