A minimum reporting standard for multiple sequence alignments

Abstract Multiple sequence alignments (MSAs) play a pivotal role in studies of molecular sequence data, but nobody has developed a minimum reporting standard (MRS) to quantify the completeness of MSAs in terms of completely specified nucleotides or amino acids. We present an MRS that relies on four simple completeness metrics. The metrics are implemented in AliStat, a program developed to support the MRS. A survey of published MSAs illustrates the benefits and unprecedented transparency offered by the MRS.
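
The abstract does not define the four completeness metrics, so the following is only a minimal sketch of what such metrics could look like. It assumes completeness means the fraction of completely specified characters, and that the four quantities are computed for the whole alignment, for each sequence, for each site, and for each pair of sequences. The function name `completeness` and the constant `COMPLETE_DNA` are hypothetical and are not taken from AliStat.

```python
from itertools import combinations

# Assumption: only these characters count as completely specified nucleotides.
COMPLETE_DNA = frozenset("ACGT")


def completeness(alignment, complete=COMPLETE_DNA):
    """Return (overall, per_sequence, per_site, per_pair) completeness scores.

    Each score is a fraction in [0, 1] of completely specified characters.
    """
    rows = [seq.upper() for seq in alignment]
    if not rows or not rows[0] or len({len(r) for r in rows}) != 1:
        raise ValueError("alignment must be non-empty with equal-length sequences")
    n_rows, n_cols = len(rows), len(rows[0])

    # flags[i][j] is True when sequence i has a completely specified character at site j.
    flags = [[ch in complete for ch in row] for row in rows]

    overall = sum(sum(f) for f in flags) / (n_rows * n_cols)
    per_sequence = [sum(f) / n_cols for f in flags]
    per_site = [sum(f[j] for f in flags) / n_rows for j in range(n_cols)]
    # For a pair of sequences, count sites where both characters are completely specified.
    per_pair = {
        (i, k): sum(a and b for a, b in zip(flags[i], flags[k])) / n_cols
        for i, k in combinations(range(n_rows), 2)
    }
    return overall, per_sequence, per_site, per_pair
```

For amino-acid alignments, the `complete` argument could be set to the 20 standard residue letters instead; gaps and ambiguity codes then count as incomplete in the same way.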

10.1101/2020.01.15.907733 ◽ 2020

Author(s): Thomas KF Wong ◽ Subha Kalyaanamoorthy ◽ Karen Meusemann ◽ David K Yeates ◽ Bernhard Misof ◽ ...

Keyword(s): Amino Acids ◽ Sequence Data ◽ Pivotal Role ◽ Sequence Alignments ◽ Reporting Standard ◽ Multiple Sequence ◽ Molecular Sequence Data ◽ Molecular Sequence ◽ Multiple Sequence Alignments

VisFeature: a stand-alone program for visualizing and analyzing statistical features of biological sequences

Bioinformatics ◽ 10.1093/bioinformatics/btz689 ◽ 2019 ◽ Cited By ~ 3

Author(s): Jun Wang ◽ Pu-Feng Du ◽ Xin-Yu Xue ◽ Guang-Ping Li ◽ Yuan-Ke Zhou ◽ ...

Keyword(s): Sequence Data ◽ Software Tool ◽ Data Retrieval ◽ Supplementary Information ◽ Statistical Features ◽ Biological Sequence ◽ Sequence Alignments ◽ Multiple Sequence ◽ Source Codes ◽ Multiple Sequence Alignments

Abstract Summary: Many efforts have been made to develop bioinformatics algorithms that predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and understand the statistical features that have been selected by heuristic or iterative methods. In this paper, we present VisFeature, a software tool that allows users to intuitively visualize and analyze statistical features of all types of biological sequences, including DNA, RNA and proteins. VisFeature also integrates sequence data retrieval, multiple sequence alignment and statistical feature generation functions. Availability and implementation: VisFeature is a desktop application implemented using JavaScript/Electron and R. The source code of VisFeature is freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Contact: [email protected] or [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.

Using sound to understand protein sequence data: new sonification algorithms for protein sequences and multiple sequence alignments

BMC Bioinformatics ◽ 10.1186/s12859-021-04362-7 ◽ 2021 ◽ Vol 22 (1)

Author(s): Edward J. Martin ◽ Thomas R. Meagher ◽ Daniel Barker

Keyword(s): Focus Group ◽ User Experience ◽ Protein Sequence ◽ Sequence Data ◽ Protein Sequences ◽ Sequence Alignments ◽ Multiple Sequence ◽ Future Directions ◽ Multiple Sequence Alignments ◽ Protein Sequence Data

Abstract Background: The use of sound to represent sequence data (sonification) has great potential as an alternative and complement to visual representation, exploiting features of human psychoacoustic intuitions to convey nuance more effectively. We have created five parameter-mapping sonification algorithms that aim to improve knowledge discovery from protein sequences and small protein multiple sequence alignments. For two of these algorithms, we investigated their effectiveness at conveying information. To do this we focussed on subjective assessments of user experience. This entailed a focus group session and survey research by questionnaire of individuals engaged in bioinformatics research. Results: For single protein sequences, the success of our sonifications for conveying features was supported by both the survey and focus group findings. For protein multiple sequence alignments, there was limited evidence that the sonifications successfully conveyed information. Additional work is required to identify effective algorithms to render multiple sequence alignment sonification useful to researchers. Feedback from both our survey and focus groups suggests future directions for sonification of multiple alignments: animated visualisation indicating the column in the multiple alignment as the sonification progresses, user control of sequence navigation, and customisation of the sound parameters. Conclusions: Sonification approaches undertaken in this work have shown some success in conveying information from protein sequence data. Feedback points out future directions to build on the sonification approaches outlined in this paper. The effectiveness assessment process implemented in this work proved useful, giving detailed feedback and key approaches for improvement based on end-user input. The uptake of similar user-experience-focussed effectiveness assessments could also help with other areas of bioinformatics, for example in visualisation.

Molecular Characterization of a Novel Fusarivirus Infecting the Plant-pathogenic Fungus Alternaria Solani

10.21203/rs.3.rs-216353/v1 ◽ 2021

Author(s): Jie Wei Gong ◽ Hong Liu ◽ Fei Xiao Zhu ◽ Yun Shi Zhao ◽ Le Jia Cheng ◽ ...

Keyword(s): Amino Acids ◽ Pathogenic Fungus ◽ Alternaria Solani ◽ Open Reading Frames ◽ Phytopathogenic Fungus ◽ Sequence Alignments ◽ Multiple Sequence ◽ Multiple Sequence Alignments ◽ Putative Protein

Abstract A novel mycovirus belonging to the proposed family "Fusariviridae" was discovered in Alternaria solani by sequencing double-stranded RNA extracted from this phytopathogenic fungus. The virus was tentatively named "Alternaria solani fusarivirus 1" (AsFV1). AsFV1 has a positive-sense single-stranded RNA (+ssRNA) genome of 6,845 nucleotides containing three open reading frames (ORFs) and a poly(A) tail. The largest ORF, ORF1, encodes a large polypeptide of 1,556 amino acids (aa) with conserved RNA-dependent RNA polymerase and helicase domains. ORF2 and ORF3 overlap and encode putative proteins of 522 aa and 105 aa, respectively, whose functions are currently unknown. Multiple sequence alignments and phylogenetic analysis indicated that AsFV1 belongs to the Fusariviridae. This is the first report of the full-length nucleotide sequence of a fusarivirus infecting Alternaria solani.

An Overview of Multiple Sequence Alignments and Cloud Computing in Bioinformatics

ISRN Biomathematics ◽ 10.1155/2013/615630 ◽ 2013 ◽ Vol 2013 ◽ pp. 1-14 ◽ Cited By ~ 28

Author(s): Jurate Daugelaite ◽ Aisling O' Driscoll ◽ Roy D. Sleator

Keyword(s): Cloud Computing ◽ Large Scale ◽ Sequence Data ◽ Cloud Base ◽ Data Sets ◽ Next Generation ◽ Sequence Alignments ◽ Multiple Sequence ◽ Multiple Sequence Alignments ◽ Computing Technologies

Multiple sequence alignment (MSA) of DNA, RNA, and protein sequences is one of the most essential techniques in the fields of molecular biology, computational biology, and bioinformatics. Next-generation sequencing technologies are changing the biology landscape, flooding the databases with massive amounts of raw sequence data. MSA of ever-increasing sequence data sets is becoming a significant bottleneck. In order to realise the promise of MSA for large-scale sequence data sets, it is necessary for existing MSA algorithms to be run in a parallelised fashion with the sequence data distributed over a computing cluster or server farm. Combining MSA algorithms with cloud computing technologies is therefore likely to improve the speed, quality, and capability of MSA to handle large numbers of sequences. In this review, multiple sequence alignments are discussed, with a specific focus on the ClustalW and Clustal Omega algorithms. Cloud computing technologies and concepts are outlined, and the next generation of cloud-based MSA algorithms is introduced.

Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments

Molecules ◽ 10.3390/molecules24010104 ◽ 2018 ◽ Vol 24 (1) ◽ pp. 104

Author(s): Patrice Koehl ◽ Henri Orland ◽ Marc Delarue

Keyword(s): Amino Acids ◽ Amino Acid ◽ Principal Components ◽ Gaussian Model ◽ Sequence Alignments ◽ Multiple Sequence ◽ Multiple Sequence Alignments ◽ Substitution Matrices ◽ Multivariate Gaussian ◽ Multivariate Gaussian Model

Residues in proteins that are in close spatial proximity are more prone to covary, as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher-dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple 20-dimensional encoding in which each amino acid is represented by a vector with one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is lower than that obtained with the 20-dimensional binary encoding. We also highlight the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.
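
The 20-dimensional binary encoding described above can be illustrated with a short sketch. This is not the authors' implementation: the residue ordering in `AMINO_ACIDS`, the function names, and the choice to map gaps and unknown characters to an all-zero vector are assumptions made here for illustration.

```python
# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """Encode one residue as a 20-dimensional vector: 1 on its own axis, 0 elsewhere."""
    vec = [0] * len(AMINO_ACIDS)
    idx = INDEX.get(residue.upper())
    if idx is not None:  # gaps and unknown characters stay all-zero (an assumption)
        vec[idx] = 1
    return vec


def encode_sequence(seq):
    """Concatenate per-residue vectors into one vector of length 20 * len(seq)."""
    encoded = []
    for residue in seq:
        encoded.extend(one_hot(residue))
    return encoded


def encode_alignment(alignment):
    """Encode every aligned sequence; each row is one data point for a statistical model."""
    return [encode_sequence(seq) for seq in alignment]
```

Each aligned sequence of length L becomes a vector of length 20 L, so an alignment of N sequences yields an N by 20 L matrix on which a covariance matrix could then be estimated.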

Benchmarking inverse statistical approaches for protein structure and design with exactly solvable models

10.1101/028936 ◽ 2015 ◽ Cited By ~ 2

Author(s): Hugo Jacquin ◽ Amy Gilson ◽ Eugene Shakhnovich ◽ Simona Cocco ◽ Rémi Monasson

Keyword(s): Protein Structure ◽ Structural Information ◽ Sequence Data ◽ Careful Analysis ◽ Sequence Alignments ◽ Multiple Sequence ◽ Multiple Sequence Alignments ◽ Pairwise Models ◽ Statistical Approaches ◽ And Function

Inverse statistical approaches to determine protein structure and function from multiple sequence alignments (MSAs) are emerging as powerful tools in computational biology. However, the underlying assumptions about the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics remain untested so far. Here we use lattice protein (LP) models to benchmark those inverse statistical approaches. We build MSAs of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSAs. We find that inferred Potts Hamiltonians reproduce many important aspects of 'true' LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of the native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models, when used as protein Hamiltonians for the design of new sequences, are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. These are remarkable results, as the effective LP Hamiltonians used to generate the MSAs are not simple pairwise models due to the competition between the folds. Our findings elucidate the reasons for the power of inverse approaches to the modelling of proteins from sequence data, and their limitations; we show, in particular, that their success crucially depends on the accurate inference of the Potts pairwise couplings.

Viewing multiple sequence alignments with the JavaScript Sequence Alignment Viewer (JSAV)

F1000Research ◽ 10.12688/f1000research.5486.1 ◽ 2014 ◽ Vol 3 ◽ pp. 249 ◽ Cited By ~ 9

Author(s): Andrew C. R. Martin

Keyword(s): Amino Acids ◽ Sequence Alignment ◽ Web Site ◽ Web Pages ◽ Sequence Alignments ◽ Multiple Sequence ◽ Multiple Sequence Alignments

The JavaScript Sequence Alignment Viewer (JSAV) is designed as a simple-to-use JavaScript component for displaying sequence alignments on web pages. The display of sequences is highly configurable, with options to allow alternative coloring schemes, sorting of sequences and 'dotifying' repeated amino acids. An option is also available to submit selected sequences to another web site, or to other JavaScript code. JSAV is implemented purely in JavaScript, making use of the JQuery and JQuery-UI libraries. To help with browser compatibility, it does not use any HTML5-specific options. The code is documented using JSDOC and is available from http://www.bioinf.org.uk/software/jsav/.

Molecular characterization of intraspecific variations in Helicoverpa armigera (Hübner) populations across India

Journal of Environmental Biology ◽ 10.22438/jeb/42/5/mrn-1764 ◽ 2021 ◽ Vol 42 (5) ◽ pp. 1320-1329

Author(s): S. Chakravarty ◽ K.G. Padwal ◽ C.P. Srivastava ◽ ...

Keyword(s): Helicoverpa Armigera ◽ Sequence Data ◽ Pcr Amplification ◽ Haplotype Network ◽ Coi Gene ◽ Sequence Alignments ◽ Separate Species ◽ Multiple Sequence ◽ Multiple Sequence Alignments ◽ Helicoverpa Armígera

Aim: The present study was undertaken to explore the genetic diversity among Helicoverpa armigera populations from varied geographic regions of India using mitochondrial cytochrome c oxidase I (COI) gene fragments. Methodology: Larval specimens of H. armigera collected from 20 locations were subjected to DNA extraction, PCR amplification of the target gene, sequencing and then multiple sequence alignment. Results: Based on COI sequence data, high levels of genetic differentiation among some H. armigera populations were detected, but the existing divergence was not high enough to delineate them as separate species. The Indian population as a whole exhibited similarity with the global genetic assemblage. Significant negative neutrality test indices and a unimodal mismatch distribution further supported that this insect experienced a demographic expansion in the past. The phylogenetic tree and median-joining haplotype network indicated that genetic similarity was not related to the geographic proximity of populations. Interpretation: Differences based on genetic analyses indicate considerable subspecific-level variation among H. armigera populations of India. However, no unidentified cryptic species of H. armigera exists in the country.

The evolution of contact prediction: Evidence that contact selection in statistical contact prediction is changing

10.1101/660191 ◽ 2019

Author(s): Mark Chonofsky ◽ Saulo H. P. de Oliveira ◽ Konrad Krawczyk ◽ Charlotte M. Deane

Keyword(s): Amino Acids ◽ Protein Structure ◽ Amino Acid ◽ Structure Prediction ◽ Prediction Methods ◽ Sequence Alignments ◽ Multiple Sequence ◽ Contact Prediction ◽ Multiple Sequence Alignments ◽ Physico Chemical

Abstract Over the last few years, the field of protein structure prediction has been transformed by increasingly accurate contact prediction software. These methods are based on the detection of coevolutionary relationships between residues from multiple sequence alignments. However, despite speculation, there is little evidence of a link between contact prediction and the physico-chemical interactions which drive amino-acid coevolution. Furthermore, existing protocols predict only a fraction of all protein contacts and it is not clear why some contacts are favoured over others. Using a dataset of 863 protein domains, we assessed the physico-chemical interactions of contacts predicted by CCMpred, MetaPSICOV, and DNCON2, as examples of direct coupling analysis, meta-prediction, and deep learning, respectively. To further investigate what sets these predicted contacts apart, we considered correctly predicted contacts and compared their properties against the protein contacts that were not predicted. We found that predicted contacts tend to form more bonds than non-predicted contacts, which suggests these contacts may be more important. Comparing the contacts predicted by each method, we found that MetaPSICOV and DNCON2 favour accuracy whereas CCMpred detects contacts with more bonds. This suggests that the push for higher accuracy may lead to a loss of physico-chemically important contacts. These results underscore the connection between protein physico-chemistry and the coevolutionary couplings that can be derived from multiple sequence alignments. This relationship is likely to be relevant to protein structure prediction and functional analysis of protein structure and may be key to understanding their utility for different problems in structural biology. Author summary: Accurate contact prediction has allowed scientists to predict protein structures with unprecedented levels of accuracy. The success of contact prediction methods, which are based on inferring correlations between amino acids in protein multiple sequence alignments, has prompted a great deal of work to improve the quality of contact prediction, leading to the development of several different methods for detecting amino acids in proximity. In this paper, we investigate the properties of these contact prediction methods. We find that contacts which are predicted differ from the other contacts in the protein: in particular, they have more physico-chemical bonds, and the predicted contacts are more strongly conserved than other contacts across protein families. We also compared the properties of different contact prediction methods and found that the characteristics of the predicted sets depend on the prediction method used. Our results point to a link between physico-chemical bonding interactions and the evolutionary history of proteins, a connection which is reflected in their amino acid sequences.
