WASPS: web-assisted symbolic plasmid synteny server

2019 ◽  
Author(s):  
Catherine Badel ◽  
Violette Da Cunha ◽  
Ryan Catchpole ◽  
Patrick Forterre ◽  
Jacques Oberto

Abstract Motivation Comparative plasmid genome analyses require complex tools and the manipulation of large numbers of sequences, and constitute a daunting task for the wet bench experimentalist. Dedicated plasmid databases are sparse, only comprise bacterial plasmids and provide access exclusively to sequence similarity searches. Results We have developed Web-Assisted Symbolic Plasmid Synteny (WASPS), a web service granting protein and DNA sequence similarity searches against a database comprising all completely sequenced natural plasmids of bacterial, archaeal and eukaryal origin. This database pre-calculates orthologous protein clustering and enables WASPS to generate fully resolved plasmid synteny maps in real time using internal and user-provided DNA sequences. Availability and implementation WASPS queries work in all current browsers such as Firefox, Edge or Safari, while the best functionality is achieved with Chrome. Internet Explorer is not supported. WASPS is freely accessible at https://archaea.i2bc.paris-saclay.fr/wasps/. Supplementary information Supplementary data are available at Bioinformatics online.

2019 ◽  
Vol 35 (21) ◽  
pp. 4427-4429 ◽  
Author(s):  
Andrea Ghelfi ◽  
Kenta Shirasawa ◽  
Hideki Hirakawa ◽  
Sachiko Isobe

Abstract Summary Hayai-Annotation Plants is a browser-based interface for an ultra-fast and accurate functional gene annotation system for plant species using R. The pipeline combines the sequence-similarity searches, using USEARCH against UniProtKB (taxonomy Embryophyta), with a functional annotation step. Hayai-Annotation Plants provides five layers of annotation: i) protein name; ii) gene ontology terms consisting of its three main domains (Biological Process, Molecular Function and Cellular Component); iii) enzyme commission number; iv) protein existence level; and v) evidence type. It implements a new algorithm that gives priority to protein existence level to propagate GO and EC information and annotated Arabidopsis thaliana representative peptide sequences (Araport11) within 5 min at the PC level. Availability and implementation The software is implemented in R and runs on Macintosh and Linux systems. It is freely available at https://github.com/kdri-genomics/Hayai-Annotation-Plants under the GPLv3 license. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (19) ◽  
pp. 4827-4832
Author(s):  
C S Casimiro-Soriguer ◽  
M M Rigual ◽  
A M Brokate-Llanos ◽  
M J Muñoz ◽  
A Garzón ◽  
...  

Abstract Motivation Short bioactive peptides encoded by small open reading frames (sORFs) play important roles in eukaryotes. Bioinformatics prediction of ORFs is an early step in a genome sequence analysis, but sORFs encoding short peptides, often using non-AUG initiation codons, are not easily discriminated from false ORFs occurring by chance. Results AnABlast is a computational tool designed to highlight putative protein-coding regions in genomic DNA sequences. This protein-coding finder is independent of ORF length and reading frame shifts, which makes AnABlast a potentially useful tool to predict sORFs. Using this algorithm, here, we report the identification of 82 putative new intergenic sORFs in the Caenorhabditis elegans genome. Sequence similarity, motif presence, expression data and RNA interference experiments support that the underlined sORFs likely encode functional peptides, encouraging the use of AnABlast as a new approach for the accurate prediction of intergenic sORFs in annotated eukaryotic genomes. Availability and implementation AnABlast is freely available at http://www.bioinfocabd.upo.es/ab/. The C. elegans genome browser with AnABlast results, annotated genes and all data used in this study is available at http://www.bioinfocabd.upo.es/celegans. Supplementary information Supplementary data are available at Bioinformatics online.


The ever-increasing number of sequences in protein databases usually turns up large numbers of homologs in sequence similarity searches. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs usually have high levels of identity among themselves, which hinders multiple sequence alignment (MSA) computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled, producing a streamlined FASTA file. Unlike other automatic approaches that only aggregate sequences with high identity, our method clusters near-full length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://www.ogic.ca/projects/fh/orain.html .


2018 ◽  
Author(s):  
Henan Zhu ◽  
Tristan Dennis ◽  
Joseph Hughes ◽  
Robert J. Gifford

Abstract A significant fraction of most genomes is comprised of DNA sequences that have been incompletely investigated. This genomic ‘dark matter’ contains a wealth of useful biological information that can be recovered by systematically screening genomes in silico using sequence similarity search tools. Specialized computational tools are required to implement these screens efficiently. Here, we describe the database-integrated genome-screening (DIGS) tool: a computational framework for performing these investigations. To demonstrate, we screen mammalian genomes for endogenous viral elements (EVEs) derived from the Filoviridae, Parvoviridae, Circoviridae and Bornaviridae families, identifying numerous novel elements in addition to those that have been described previously. The DIGS tool provides a simple, robust framework for implementing a broad range of heuristic, sequence analysis-based explorations of genomic diversity. Availability http://giffordlabcvr.github.io/DIGS-tool/ [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Miguel Andrade ◽  
Caroline Louis-Jeune ◽  
Carol Perez-Iratxeta

Abstract The ever-increasing number of sequences in protein databases usually turns up large numbers of homologs in sequence similarity searches. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs usually have high levels of identity among themselves, which hinders multiple sequence alignment computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled, producing a streamlined FASTA file. Unlike other automatic approaches that only aggregate sequences with high identity, our method clusters near-full length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://fh.ogic.ca/.
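The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of culling near-full-length redundant homologs, not the FASTA Herder method itself. The function names, the thresholds (`min_identity`, `min_coverage`) and the use of `difflib.SequenceMatcher` as a crude, alignment-free identity proxy are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def approx_identity(a, b):
    """Crude identity proxy (assumption): matched characters / longer length."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(a), len(b))


def length_coverage(a, b):
    """Ratio of the shorter to the longer sequence length."""
    return min(len(a), len(b)) / max(len(a), len(b))


def cull_redundant(records, min_identity=0.6, min_coverage=0.9):
    """Greedy clustering of (name, sequence) records, longest first.

    A sequence joins the first representative that is both near-full-length
    (length_coverage >= min_coverage) and similar enough
    (approx_identity >= min_identity); otherwise it becomes a new
    representative. Both thresholds are hypothetical defaults.
    """
    representatives = []
    clusters = {}
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        for rep_name, rep_seq in representatives:
            if (length_coverage(seq, rep_seq) >= min_coverage
                    and approx_identity(seq, rep_seq) >= min_identity):
                clusters[rep_name].append(name)
                break
        else:
            representatives.append((name, seq))
            clusters[name] = [name]
    return representatives, clusters


records = [
    ("a", "MKTAYIAKQRQISFVKSHFSRQ"),
    ("b", "MKTAYIAKQRQISFVKSHFSRA"),  # near-identical to "a"
    ("c", "MKTAYIAK"),                # fragment: fails the coverage test
    ("d", "GGGGGGGGGGGGGGGGGGGGGG"),  # unrelated
]
representatives, clusters = cull_redundant(records)
```

Writing the representatives back out as FASTA would give the "streamlined" file the abstract mentions. A real implementation would presumably compute identity from proper pairwise alignments rather than from this proxy.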


Author(s):  
Yanrong Ji ◽  
Zhihan Zhou ◽  
Han Liu ◽  
Ramana V Davuluri

Abstract Motivation Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios. Results To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned to many other sequence analysis tasks. Availability and implementation The source code, pretrained and finetuned model for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information Supplementary data are available at Bioinformatics online.
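The abstract does not say how DNA is turned into model input. DNABERT's published description represents a sequence as overlapping k-mer tokens, and the sketch below illustrates only that preprocessing step. The function name `kmer_tokens` and the default `k=6` are assumptions for illustration and are not taken from the DNABERT code base.

```python
def kmer_tokens(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers with a stride of 1.

    Returns an empty list when the sequence is shorter than k.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


tokens = kmer_tokens("ATGGCT", k=3)
```

Each token shares k-1 nucleotides with its neighbour, so a sequence of length n yields n-k+1 tokens.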


2012 ◽  
Vol 13 (Suppl 4) ◽  
pp. S2 ◽  
Author(s):  
Emanuele Bramucci ◽  
Alessandro Paiardini ◽  
Francesco Bossa ◽  
Stefano Pascarella

Genetics ◽  
2004 ◽  
Vol 166 (2) ◽  
pp. 947-957 ◽  
Author(s):  
John G Jelesko ◽  
Kristy Carter ◽  
Whitney Thompson ◽  
Yuki Kinoshita ◽  
Wilhelm Gruissem

Abstract Paralogous genes organized as a gene cluster can rapidly evolve by recombination between misaligned paralogs during meiosis, leading to duplications, deletions, and novel chimeric genes. To model unequal recombination within a specific gene cluster, we utilized a synthetic RBCSB gene cluster to isolate recombinant chimeric genes resulting from meiotic recombination between paralogous genes on sister chromatids. Several F1 populations hemizygous for the synthRBCSB1 gene cluster gave rise to Luc+ F2 plants at frequencies ranging from 1 to 3 × 10⁻⁶. A nonuniform distribution of recombination resolution sites resulted in the biased formation of recombinant RBCS3B/1B::LUC genes with nonchimeric exons. The positioning of approximately half of the mapped resolution sites was effectively modeled by the fractional length of identical DNA sequences. In contrast, the other mapped resolution sites fit an alternative model in which recombination resolution was stimulated by an abrupt transition from a region of relatively high sequence similarity to a region of low sequence similarity. Thus, unequal recombination between paralogous RBCSB genes on sister chromatids created an allelic series of novel chimeric genes that effectively resulted in the diversification rather than the homogenization of the synthRBCSB1 gene cluster.


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i857-i865
Author(s):  
Derrick Blakely ◽  
Eamon Collins ◽  
Ritambhara Singh ◽  
Andrew Norton ◽  
Jack Lanchantin ◽  
...  

Abstract Motivation Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size. Results In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Availability and implementation Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK Supplementary information Supplementary data are available at Bioinformatics online.
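The decomposition described above can be illustrated with a simplified gapped k-mer kernel. This is a sketch under stated assumptions, not the FastSK implementation. It treats the kernel as the inner product of gapped k-mer count vectors, computes one independent counting term per choice of kept positions, and estimates the full sum by uniformly sampling those choices. The names `partial_kernel`, `exact_kernel` and `sampled_kernel`, and the parameters `g` (window length) and `m` (masked positions), are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb


def gapped_counts(seq, g, kept):
    """Count the characters read at the `kept` offsets of every g-mer in seq."""
    counts = Counter()
    for i in range(len(seq) - g + 1):
        window = seq[i:i + g]
        counts[tuple(window[p] for p in kept)] += 1
    return counts


def partial_kernel(x, y, g, kept):
    """One independent counting term: g-mer pairs that agree at `kept`."""
    cx = gapped_counts(x, g, kept)
    cy = gapped_counts(y, g, kept)
    return sum(n * cy[key] for key, n in cx.items())


def exact_kernel(x, y, g, m):
    """Sum the counting terms over every way to keep g - m of g positions."""
    k = g - m
    return sum(partial_kernel(x, y, g, kept)
               for kept in combinations(range(g), k))


def sampled_kernel(x, y, g, m, n_samples, seed=0):
    """Monte Carlo estimate: average sampled terms, scale by the term count."""
    rng = random.Random(seed)
    k = g - m
    total = 0
    for _ in range(n_samples):
        kept = tuple(sorted(rng.sample(range(g), k)))
        total += partial_kernel(x, y, g, kept)
    return comb(g, k) * total / n_samples


x, y = "ACGTACGT", "ACGTTCGT"
exact = exact_kernel(x, y, g=4, m=1)
estimate = sampled_kernel(x, y, g=4, m=1, n_samples=200)
```

Because the position sets are sampled uniformly, the scaled average is an unbiased estimate of the exact sum. The exact loop grows with the number of position sets, which is the cost a sampling scheme avoids.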


Author(s):  
Qin Ma ◽  
Rui-Feng Lei ◽  
Yu-Qian Li ◽  
Dilireba Abudourousuli ◽  
Zulihumaer Rouzi ◽  
...  

A bacterial strain, designated YZGR15T, was isolated from the root of an annual halophyte Suaeda aralocaspica, collected from the southern edge of the Gurbantunggut desert, north-west PR China. Cells of the isolate were Gram-stain-positive, facultatively anaerobic, irregular rods. Growth occurred at 4–42 °C (optimum, 30–37 °C), at pH 6.0–9.0 (optimum, pH 7.0–7.5) and in the presence of 0–9 % (w/v) NaCl (optimum, 2–5 %). Phylogenetic analysis using 16S rRNA gene sequences indicated that strain YZGR15T showed the highest sequence similarity to Sanguibacter keddieii (98.27 %), Sanguibacter antarcticus (98.20 %) and Sanguibacter inulinus (98.06 %). Results of genome analyses of strain YZGR15T indicated that the genome size was 3.16 Mb, with a genomic DNA G+C content of 71.9 mol%. Average nucleotide identity and digital DNA–DNA hybridization values between strain YZGR15T and three type strains were in the range of 76.5–77.8 % and 20.0–22.2 %, respectively. Analysis of the cellular component of strain YZGR15T revealed that the primary fatty acids were anteiso-C15 : 0, C16 : 0, C14 : 0 and iso-C16 : 0 and the polar lipids included diphosphatidylglycerol, phosphatidylglycerol, three unidentified phospholipids and two unidentified glycolipids. The cell-wall characteristic amino acids were glutamic acid, alanine and an unknown amino acid. The whole-cell sugars for the strain were mannose, ribose, rhamnose, glucose and an unidentified sugar. The predominant respiratory quinone was MK-9(H4). Based on the results of genomic, phylogenetic, phenotypic and chemotaxonomic analyses, strain YZGR15T represents a novel species of the genus Sanguibacter, for which the name Sanguibacter suaedae sp. nov. is proposed. The type strain is YZGR15T (=CGMCC 1.18691T=KCTC 49659T).
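A genomic G+C content such as the 71.9 mol% reported above is the share of G and C among the unambiguous bases of the assembly. The sketch below shows that arithmetic only. The function name `gc_content_percent` is hypothetical, and skipping ambiguity codes such as N is an assumption about how they are handled, not something stated in the abstract.

```python
def gc_content_percent(sequence):
    """Return G+C as a percentage of the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / (gc + at)


example = gc_content_percent("ggccNNat")  # 4 of the 6 unambiguous bases are G or C
```

For a multi-contig assembly the counts would be summed across contigs before dividing, rather than averaging per-contig percentages.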

