SAM/BAM format v1.5 extensions for de novo assemblies

Summary: The plain text Sequence Alignment/Map (SAM) file format and its companion binary form (BAM) are a generic alignment format for storing read alignments against reference sequences (and unmapped reads) together with structured meta-data (Li et al., 2009). Driven by the needs of the 1000 Genomes Project which sequenced many individual human genomes, early SAM/BAM usage focused on pairwise alignments of reads to a reference. However, through the CIGAR P operator multiple sequence alignments can also be preserved. Herein we describe clarifications and additions in version 1.5 of the specification to facilitate storing de novo sequence alignments: Padded reference sequences (with gap characters), annotation of reads or regions of the reference, and the option of embedding the reference sequence within the file. Availability: The latest public release of the specification is at http://samtools.sourceforge.net/SAM1.pdf, with in development drafts at https://github.com/samtools/hts-specs/ under version control.

Download Full-text

QuanTest2: benchmarking multiple sequence alignments using secondary structure prediction

Bioinformatics ◽

10.1093/bioinformatics/btz552 ◽

2019 ◽

Cited By ~ 3

Author(s):

Fabian Sievers ◽

Desmond G Higgins

Keyword(s):

Secondary Structure ◽

Structure Prediction ◽

Secondary Structure Prediction ◽

Reference Sequence ◽

Supplementary Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Reference Sequences ◽

Selection Of

Abstract Motivation Secondary structure prediction accuracy (SSPA) in the QuanTest benchmark can be used to measure accuracy of a multiple sequence alignment. SSPA correlates well with the sum-of-pairs score, if the results are averaged over many alignments but not on an alignment-by-alignment basis. This is due to a sub-optimal selection of reference and non-reference sequences in QuanTest. Results We develop an improved strategy for selecting reference and non-reference sequences for a new benchmark, QuanTest2. In QuanTest2, SSPA and SP correlate better on an alignment-by-alignment basis than in QuanTest. Guide-trees for QuanTest2 are more balanced with respect to reference sequences than in QuanTest. QuanTest2 scores correlate well with other well-established benchmarks. Availability and implementation QuanTest2 is available at http://bioinf.ucd.ie/quantest2.tar, comprises of reference and non-reference sequence sets and a scoring script. Supplementary information Supplementary data are available at Bioinformatics online

Download Full-text

Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments

Bioinformatics ◽

10.1093/bioinformatics/btv592 ◽

2015 ◽

Vol 32 (6) ◽

pp. 814-820 ◽

Cited By ~ 14

Author(s):

Gearóid Fox ◽

Fabian Sievers ◽

Desmond G. Higgins

Keyword(s):

Protein Structure ◽

Structure Prediction ◽

De Novo ◽

Biological Data ◽

Supplementary Information ◽

Test Case ◽

Sequence Alignments ◽

Progressive Alignment ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Abstract Motivation: Multiple sequence alignments (MSAs) with large numbers of sequences are now commonplace. However, current multiple alignment benchmarks are ill-suited for testing these types of alignments, as test cases either contain a very small number of sequences or are based purely on simulation rather than empirical data. Results: We take advantage of recent developments in protein structure prediction methods to create a benchmark (ContTest) for protein MSAs containing many thousands of sequences in each test case and which is based on empirical biological data. We rank popular MSA methods using this benchmark and verify a recent result showing that chained guide trees increase the accuracy of progressive alignment packages on datasets with thousands of proteins. Availability and implementation: Benchmark data and scripts are available for download at http://www.bioinf.ucd.ie/download/ContTest.tar.gz. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Download Full-text

Linear normalised hash function for clustering gene sequences and identifying reference sequences from multiple sequence alignments

Microbial Informatics and Experimentation ◽

10.1186/2042-5783-2-2 ◽

2012 ◽

Vol 2 (1) ◽

pp. 2

Author(s):

Manal Helal ◽

Fanrong Kong ◽

Sharon C-A Chen ◽

Fei Zhou ◽

Dominic E Dwyer ◽

...

Keyword(s):

Hash Function ◽

Gene Sequences ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Reference Sequences

Download Full-text

Sampling the conformational landscapes of transporters and receptors with AlphaFold2

10.1101/2021.11.22.469536 ◽

2021 ◽

Author(s):

Diego del Alamo ◽

Davide Sala ◽

Hassane Mchaourab ◽

Jens Meiler

Keyword(s):

Conformational Changes ◽

Structure Prediction ◽

De Novo ◽

Structural Heterogeneity ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Recent Success ◽

Conformational Plasticity ◽

Multiple Conformations

Equilibrium fluctuations and triggered conformational changes often underlie the functional cycles of membrane proteins. For example, transporters mediate the passage of molecules across cell membranes by alternating between inward-facing (IF) and outward-facing (OF) states, while receptors undergo intracellular structural rearrangements that initiate signaling cascades. Although the conformational plasticity of these proteins has historically posed a challenge for traditional de novo protein structure prediction pipelines, the recent success of AlphaFold2 (AF2) in CASP14 culminated in the modeling of a transporter in multiple conformations to high accuracy. Given that AF2 was designed to predict static structures of proteins, it remains unclear if this result represents an underexplored capability to accurately predict multiple conformations and/or structural heterogeneity. Here, we present an approach to drive AF2 to sample alternative conformations of topologically diverse transporters and G-protein coupled receptors (GPCRs) that are absent from the AF2 training set. Whereas models generated using the default AF2 pipeline are conformationally homogeneous and nearly identical to one another, reducing the depth of the input multiple sequence alignments (MSAs) led to the generation of accurate models in multiple conformations. In our benchmark, these conformations were observed to span the range between two experimental structures of interest, suggesting that our protocol allows sampling of the conformational landscape at the energy minimum. Nevertheless, our results also highlight the need for the next generation of deep learning algorithms to be designed to predict ensembles of biophysically relevant states.

Download Full-text

EVfold.org: Evolutionary Couplings and Protein 3D Structure Prediction

10.1101/021022 ◽

2015 ◽

Cited By ~ 14

Author(s):

Robert Sheridan ◽

Robert J. Fieldhouse ◽

Sikander Hayat ◽

Yichao Sun ◽

Yevgeniy Antipin ◽

...

Keyword(s):

Protein Function ◽

Structure Prediction ◽

De Novo ◽

3D Structure ◽

Sequence Information ◽

Major Advance ◽

Sequence Alignments ◽

Multiple Sequence ◽

Genomic Databases ◽

Multiple Sequence Alignments

Recently developed maximum entropy methods infer evolutionary constraints on protein function and structure from the millions of protein sequences available in genomic databases. The EVfold web server (at EVfold.org) makes these methods available to predict functional and structural interactions in proteins. The key algorithmic development has been to disentangle direct and indirect residue-residue correlations in large multiple sequence alignments and derive direct residue-residue evolutionary couplings (EVcouplings or ECs). For proteins of unknown structure, distance constraints obtained from evolutionarily couplings between residue pairs are used to de novo predict all-atom 3D structures, often to good accuracy. Given sufficient sequence information in a protein family, this is a major advance toward solving the problem of computing the native 3D fold of proteins from sequence information alone. Availability: EVfold server at http://evfold.org/ Contact: [email protected]

Download Full-text

Faculty Opinions recommendation of Evolutionary profiles from the QR factorization of multiple sequence alignments.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1024515.296730 ◽

2005 ◽

Author(s):

Anne-Catherine Dock-Bregeon

Keyword(s):

Qr Factorization ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Download Full-text

Faculty Opinions recommendation of Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.732011981.793542976 ◽

2018 ◽

Author(s):

Chandra Verma ◽

Suryani Lukman

Keyword(s):

Machine Learning ◽

Sequence Alignments ◽

Multiple Sequence ◽

Contact Prediction ◽

Multiple Sequence Alignments

Download Full-text

Positive natural selection in primate genes of the type I interferon response

BMC Ecology and Evolution ◽

10.1186/s12862-021-01783-z ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Elena N. Judd ◽

Alison R. Gilchrist ◽

Nicholas R. Meyerson ◽

Sara L. Sawyer

Keyword(s):

Natural Selection ◽

Positive Selection ◽

Type I Interferon ◽

Interferon Response ◽

Type I ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Interferon Stimulated Genes ◽

Interferon Induction

Abstract Background The Type I interferon response is an important first-line defense against viruses. In turn, viruses antagonize (i.e., degrade, mis-localize, etc.) many proteins in interferon pathways. Thus, hosts and viruses are locked in an evolutionary arms race for dominance of the Type I interferon pathway. As a result, many genes in interferon pathways have experienced positive natural selection in favor of new allelic forms that can better recognize viruses or escape viral antagonists. Here, we performed a holistic analysis of selective pressures acting on genes in the Type I interferon family. We initially hypothesized that the genes responsible for inducing the production of interferon would be antagonized more heavily by viruses than genes that are turned on as a result of interferon. Our logic was that viruses would have greater effect if they worked upstream of the production of interferon molecules because, once interferon is produced, hundreds of interferon-stimulated proteins would activate and the virus would need to counteract them one-by-one. Results We curated multiple sequence alignments of primate orthologs for 131 genes active in interferon production and signaling (herein, “induction” genes), 100 interferon-stimulated genes, and 100 randomly chosen genes. We analyzed each multiple sequence alignment for the signatures of recurrent positive selection. Counter to our hypothesis, we found the interferon-stimulated genes, and not interferon induction genes, are evolving significantly more rapidly than a random set of genes. Interferon induction genes evolve in a way that is indistinguishable from a matched set of random genes (22% and 18% of genes bear signatures of positive selection, respectively). In contrast, interferon-stimulated genes evolve differently, with 33% of genes evolving under positive selection and containing a significantly higher fraction of codons that have experienced selection for recurrent replacement of the encoded amino acid. Conclusion Viruses may antagonize individual products of the interferon response more often than trying to neutralize the system altogether.

Download Full-text

SNN-SB: Combining Partial Alignment Using Modified SNN Algorithm with Segment-Based for Multiple Sequence Alignments

Journal of Physics Conference Series ◽

10.1088/1742-6596/1962/1/012048 ◽

2021 ◽

Vol 1962 (1) ◽

pp. 012048

Author(s):

Aziz Nasser Boraik Ali ◽

Hassan Pyar Ali Hassan ◽

Hesham Bahamish

Keyword(s):

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Partial Alignment

Download Full-text

DNCON2_Inter: predicting interchain contacts for homodimeric and homomultimeric protein complexes using multiple sequence alignments of monomers and deep learning

Scientific Reports ◽

10.1038/s41598-021-91827-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Farhan Quadir ◽

Raj S. Roy ◽

Randal Halfmann ◽

Jianlin Cheng

Keyword(s):

Deep Learning ◽

Tertiary Structure ◽

Protein Complexes ◽

Complex Structure ◽

Great Success ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Residue Contacts ◽

Evolutionary Features

AbstractDeep learning methods that achieved great success in predicting intrachain residue-residue contacts have been applied to predict interchain contacts between proteins. However, these methods require multiple sequence alignments (MSAs) of a pair of interacting proteins (dimers) as input, which are often difficult to obtain because there are not many known protein complexes available to generate MSAs of sufficient depth for a pair of proteins. In recognizing that multiple sequence alignments of a monomer that forms homomultimers contain the co-evolutionary signals of both intrachain and interchain residue pairs in contact, we applied DNCON2 (a deep learning-based protein intrachain residue-residue contact predictor) to predict both intrachain and interchain contacts for homomultimers using multiple sequence alignment (MSA) and other co-evolutionary features of a single monomer followed by discrimination of interchain and intrachain contacts according to the tertiary structure of the monomer. We name this tool DNCON2_Inter. Allowing true-positive predictions within two residue shifts, the best average precision was obtained for the Top-L/10 predictions of 22.9% for homodimers and 17.0% for higher-order homomultimers. In some instances, especially where interchain contact densities are high, DNCON2_Inter predicted interchain contacts with 100% precision. We also developed Con_Complex, a complex structure reconstruction tool that uses predicted contacts to produce the structure of the complex. Using Con_Complex, we show that the predicted contacts can be used to accurately construct the structure of some complexes. Our experiment demonstrates that monomeric multiple sequence alignments can be used with deep learning to predict interchain contacts of homomeric proteins.

Download Full-text