annonex2embl: automatic preparation of annotated DNA sequences for bulk submissions to ENA

2019 ◽  
Author(s):  
Michael Gruenstaeudl

ABSTRACT Motivation The submission of annotated sequence data to public sequence databases constitutes a central pillar in biological research. The surge of novel DNA sequences awaiting database submission due to the application of next-generation sequencing has increased the need for software tools that facilitate bulk submissions. This need has yet to be met with a concurrent development of tools to automate the preparatory work preceding such submissions. Results I introduce annonex2embl, a Python package that automates the preparation of complete sequence flatfiles for large-scale sequence submissions to the European Nucleotide Archive. The tool enables the conversion of DNA sequence alignments that are co-supplied with sequence annotations and metadata to submission-ready flatfiles. Among other features, the software automatically accounts for length differences among the input sequences while maintaining correct annotations, automatically interlaces metadata to each record, and displays a design suitable for easy integration into bioinformatic workflows. As proof of its utility, annonex2embl is employed in preparing a dataset of more than 1,500 fungal DNA sequences for database submission.

2020 ◽  
Vol 36 (12) ◽  
pp. 3841-3848
Author(s):  
Michael Gruenstaeudl

Abstract Motivation The submission of annotated sequence data to public sequence databases constitutes a central pillar in biological research. The surge of novel DNA sequences awaiting database submission due to the application of next-generation sequencing has increased the need for software tools that facilitate bulk submissions. This need has yet to be met with the concurrent development of tools to automate the preparatory work preceding such submissions. Results The author introduces annonex2embl, a Python package that automates the preparation of complete sequence flatfiles for large-scale sequence submissions to the European Nucleotide Archive. The tool enables the conversion of DNA sequence alignments that are co-supplied with sequence annotations and metadata to submission-ready flatfiles. Among other features, the software automatically accounts for length differences among the input sequences while maintaining correct annotations, automatically interlaces metadata to each record and displays a design suitable for easy integration into bioinformatic workflows. As proof of its utility, annonex2embl is employed in preparing a dataset of more than 1500 fungal DNA sequences for database submission. Availability and implementation annonex2embl is freely available via the Python package index at http://pypi.python.org/pypi/annonex2embl. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 36 (7) ◽  
pp. 2047-2052 ◽  
Author(s):  
Ha Young Kim ◽  
Dongsup Kim

Abstract Motivation Accurate prediction of the effects of genetic variation is a major goal in biological research. Towards this goal, numerous machine learning models have been developed to learn information from evolutionary sequence data. The most effective method so far is a deep generative model based on the variational autoencoder (VAE) that models the distributions using a latent variable. In this study, we propose a deep autoregressive generative model named mutationTCN, which employs dilated causal convolutions and an attention mechanism for the modeling of inter-residue correlations in a biological sequence. Results We show that this model is competitive with the VAE model when tested against a set of 42 high-throughput mutation scan experiments, with a mean improvement in Spearman rank correlation of ∼0.023. In particular, our model can more efficiently capture information from multiple sequence alignments with a lower effective number of sequences, such as in viral sequence families, compared with the latent variable model. Also, we extend this architecture to a semi-supervised learning framework, which shows high prediction accuracy. We show that our model enables a direct optimization of the data likelihood and allows for a simple and stable training process. Availability and implementation Source code is available at https://github.com/ha01994/mutationTCN. Supplementary information Supplementary data are available at Bioinformatics online.
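As a rough illustration of the "dilated causal convolution" named in the abstract above, the sketch below implements the generic operation in plain Python. It is not the mutationTCN architecture: the function names `dilated_causal_conv1d` and `receptive_field` are hypothetical, and attention, nonlinearities, multiple channels, and training are all omitted. It only shows that each output position depends on the current and earlier inputs, spaced `dilation` steps apart.

```python
from typing import List, Sequence


def dilated_causal_conv1d(
    x: Sequence[float], weights: Sequence[float], dilation: int
) -> List[float]:
    """Compute y[t] = sum_j weights[j] * x[t - j * dilation].

    Inputs before the start of the sequence are treated as zero, so
    y[t] never depends on x[t + 1] or later (the "causal" property).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:  # skip taps that fall before the sequence start
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input positions seen by one output after stacking layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(dilated_causal_conv1d(signal, weights=[1.0, 1.0], dilation=2))
    print(receptive_field(kernel_size=2, dilations=[1, 2, 4, 8]))
```

Stacking layers with growing dilations is what lets such a model cover long-range dependencies with few layers; the receptive field grows with the sum of the dilations rather than with the layer count alone.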


2019 ◽  
Author(s):  
Alexis Criscuolo ◽  
Sylvie Issenhuth-Jeanjean ◽  
Xavier Didelot ◽  
Kaisa Thorell ◽  
James Hale ◽  
...  

Abstract Bacteria and archaea make up most of natural diversity but the mechanisms that underlie the origin and maintenance of prokaryotic species are poorly understood. We investigated the speciation history of the genus Salmonella, an ecologically diverse bacterial lineage, within which S. enterica subsp. enterica is responsible for important human food-borne infections. We performed a survey of diversity across a large reference collection using multilocus sequence typing, followed by genome sequencing of distinct lineages. We identified eleven distinct phylogroups, three of which were previously undescribed. Strains assigned to S. enterica subsp. salamae are polyphyletic, with two distinct lineages that we designate Salamae A and Salamae B. Strains of subspecies houtenae are subdivided into two groups, Houtenae A and B and are both related to Selander’s group VII. A phylogroup we designate VIII was previously unknown. A simple binary fission model of speciation cannot explain observed patterns of sequence diversity. In the recent past, there have been large scale hybridization events involving an unsampled ancestral lineage and three distantly related lineages of the genus that have given rise to Houtenae A, Houtenae B and VII. We found no evidence for ongoing hybridization in the other eight lineages but detected more subtle signals of ancient recombination events. We are unable to fully resolve the speciation history of the genus, which might have involved additional speciation-by-hybridization or multi-way speciation events. Our results imply that traditional models of speciation by binary fission and divergence may not apply in Salmonella. Data summary Illumina sequence data were submitted to the European Nucleotide Archive under project number PRJEB2099 and are available from INSDC (NCBI/ENA/DDBJ) under accession numbers ERS011101 to ERS011146.
The MLST sequence and profile data generated in this study have been publicly available on the Salmonella MLST web site between 2010 and the migration of the Salmonella MLST website to EnteroBase (https://enterobase.warwick.ac.uk/), and subsequently from there.


2020 ◽  
Author(s):  
Yang Young Lu ◽  
Jiaxing Bai ◽  
Yiwen Wang ◽  
Ying Wang ◽  
Fengzhu Sun

Abstract Motivation Rapid developments in sequencing technologies have boosted generating high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 10²–10⁴-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Dmitry Schigel ◽  
Thomas Jeppesen ◽  
Robert Finn ◽  
Guy Cochrane ◽  
Urmas Kõljalg ◽  
...  

The Global Biodiversity Information Facility (GBIF) was established by governments in 2001, largely through the initiative and leadership of the natural history collections community, following the 1999 recommendation by a working group under the Megascience Forum (predecessor of the Global Science Forum) of the Organization for Economic Cooperation and Development (OECD). Over 20 years, GBIF has helped develop standards and convened a global community of data-publishing institutions, aggregating over one billion specimen occurrence records freely and openly available for use in research and policy making. These GBIF-mediated data range from vouchered museum specimens to observation records generated by humans and machines. New data are being generated from integrated remote sensing, ecological sampling, and molecular sequencing that have strong geospatial components but lack traditional vouchers. GBIF is working with partners to develop best practices for bringing these data into the GBIF architecture. Following discussions during the second Global Biodiversity Information Conference in 2018, GBIF and the European Bioinformatics Institute (EMBL-EBI), supported by ELIXIR, have extended collaboration to share species occurrence records known only from their genetic material. When these data providers contribute data coordinates along with the sequences to the European Nucleotide Archive (ENA), the records will appear on GBIF maps and in spatial searches. This collaboration enables significant new molecular data streams to become discoverable through GBIF.org: by mid-March 2019, over 7.8m individual occurrence records via the ENA, and over 13.2m records as standardized Darwin Core sampling-event datasets via MGnify, a resource that provides taxonomic and functional annotations on sequences derived from environmental sequencing projects.
Sequence-based occurrence records published by ENA and MGnify boost representation of microbial diversity which was underrepresented at GBIF. The ELIXIR-ENA-MGnify-GBIF partnership is working on further refinement of the dynamic data linkages, frequency of updates and other improvements. The API-based tool that connects GBIF data infrastructures is open to new data contributors and for indexes of molecular occurrences. Indexing of these data streams is dependent on the presence of a name (any rank) with the sequence. Under the current Codes of nomenclature, animals, fungi, plants, and algae cannot be described based exclusively on sequence data. Yet, a significant volume of biodiversity data has only been represented by DNA sequences. Barcoding and sequence clustering procedures vary among taxa and research communities, but clusters can be related to a taxon with a Latin name. Many DNA similarity clusters do not contain a sequence from a formally described taxon; however, these sequence clusters provide provisional molecular names for nomenclatural communication. In the best cases, curated libraries of reference sequences, their metadata, clusters, alignments, and links to individuals and physical material become de facto naming conventions for certain taxonomic groups, and co-exist with Latin names. Integration of molecular names into the taxonomic backbone of GBIF started with Fungi and UNITE, a data management and identification environment for fungal ITS barcodes with 87,000+ fungal species hypotheses demarcating 800,000+ sequence specimens as of March 2019. Checklist publication of all names in UNITE through GBIF.org, including Linnaean names and stable, DOI-trackable molecular sequence-based ‘species hypotheses’, enables indexing of fungal metabarcoding data worldwide, such as BIOWIDE.
As names are currently essential to indexing the world’s occurrence data, GBIF will develop similar linkages with names in the Barcode of Life data system (BOLD) and in SILVA - a resource for high-quality ribosomal RNA sequence data and taxonomy, and welcomes other reference systems to this development. Expanding the molecular data streams (Fig. 1) allows GBIF to address spatial, temporal and taxonomic gaps and biases, and to support large-scale data-intensive research openly and worldwide.


2020 ◽  
Author(s):  
Qingdong Zeng ◽  
Wenjin Cao ◽  
Liping Xing ◽  
Guowei Qin ◽  
Jianhui Wu ◽  
...  

Abstract Across domains of biological research using genome sequence data, high-quality reference genome sequences are essential for characterizing genetic variation and understanding the genetic basis of phenotypes. However, the construction of genome assemblies for various species is often hampered by complexities of genome organization, especially repetitive and complex sequences, leading to mis-assembly and missing regions. Here, we describe a high-throughput gold standard genome assembly workflow using a large-scale bacterial artificial chromosome (BAC) library with a refined two-step pooling strategy and the Lamp assembler algorithm. This strategy minimizes the laborious processes of physical map construction and clone-by-clone sequencing, enabling inexpensive sequencing of several thousand BAC clones. By applying this strategy with a minimum tiling path BAC clone library for the short arm of chromosome 2D (2DS) of bread wheat, 98% of BAC sequences, covering 92.7% of the 2DS chromosome, were assembled correctly for this species with a highly complex and repetitive genome. We also identified 48 large mis-assemblies in the reference wheat genome assembly (IWGSC RefSeq v1.0) and corrected these large mis-assemblies in addition to filling 92.2% of the gaps in RefSeq v1.0. Our 2DS assembly represents a new benchmark for the assembly of complex genomes with both high accuracy and efficiency.


2013 ◽  
Vol 2013 ◽  
pp. 1-14 ◽  
Author(s):  
Jurate Daugelaite ◽  
Aisling O' Driscoll ◽  
Roy D. Sleator

Multiple sequence alignment (MSA) of DNA, RNA, and protein sequences is one of the most essential techniques in the fields of molecular biology, computational biology, and bioinformatics. Next-generation sequencing technologies are changing the biology landscape, flooding the databases with massive amounts of raw sequence data. MSA of ever-increasing sequence data sets is becoming a significant bottleneck. In order to realise the promise of MSA for large-scale sequence data sets, it is necessary for existing MSA algorithms to be run in a parallelised fashion with the sequence data distributed over a computing cluster or server farm. Combining MSA algorithms with cloud computing technologies is therefore likely to improve the speed, quality, and capability for MSA to handle large numbers of sequences. In this review, multiple sequence alignments are discussed, with a specific focus on the ClustalW and Clustal Omega algorithms. Cloud computing technologies and concepts are outlined, and the next generation of cloud-based MSA algorithms is introduced.


Author(s):  
Yang Young Lu ◽  
Jiaxing Bai ◽  
Yiwen Wang ◽  
Ying Wang ◽  
Fengzhu Sun

Abstract Motivation Rapid developments in sequencing technologies have boosted generating high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 10²–10⁴-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability and implementation CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Supplementary information Supplementary data are available at Bioinformatics online.
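The memory problem described in the abstract above can be made concrete with a minimal sketch of alignment-free comparison by k-mer frequencies. This is a generic illustration and not CRAFT's embedding method: the helper names (`kmer_counts`, `cosine_distance`, `dense_vector_size`) are hypothetical, and cosine distance stands in for whichever alignment-free measure one prefers. A dense frequency vector over the DNA alphabet has 4^k entries, which is why large k becomes expensive.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty profiles
    return 1.0 - dot / (norm_a * norm_b)


def dense_vector_size(k: int) -> int:
    """Number of entries in a dense k-mer frequency vector over A, C, G, T."""
    return 4 ** k


if __name__ == "__main__":
    x = kmer_counts("ACGTACGTGGCA", 3)
    y = kmer_counts("ACGTTCGTGGCA", 3)
    print(cosine_distance(x, y))
    for k in (6, 12, 16):
        print(k, dense_vector_size(k))
```

Storing counts sparsely, as the `Counter` does here, only postpones the problem: for long genomes and large k the number of distinct k-mers still grows quickly, which is the motivation the abstract gives for learning compact representations.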


2021 ◽  
Author(s):  
Nicolas J. Rawlence ◽  
Alexander T. Salis ◽  
Hamish G. Spencer ◽  
Jonathan M. Waters ◽  
Lachie Scarsbrook ◽  
...  

ABSTRACT Aim Understanding how wild populations respond to climatic shifts is a fundamental goal of biological research in a fast-changing world. The Southern Ocean represents a fascinating system for assessing large-scale climate-driven biological change, as it contains extremely isolated island groups within a predominantly westerly, circumpolar wind and current system. The blue-eyed shags (Leucocarbo spp.) represent a paradoxical Southern Ocean seabird radiation; a circumpolar distribution implies strong dispersal capacity yet their speciose nature suggests local adaptation and isolation. Here we use genetic tools in an attempt to resolve this paradox. Location Southern Ocean. Taxa 17 species and subspecies of blue-eyed shags (Leucocarbo spp.) across the geographical distribution of the genus. Methods Here we use mitochondrial and nuclear sequence data to conduct the first global genetic analysis of this group using a temporal phylogenetic framework to test for rapid speciation. Results Our analysis reveals remarkably shallow evolutionary histories among island-endemic lineages, consistent with a recent high-latitude circumpolar radiation. This rapid sub-Antarctic expansion contrasts with significantly deeper lineages detected in more temperate regions such as South America and New Zealand that may have acted as glacial refugia. The dynamic history of high-latitude expansions is further supported by ancestral demographic and biogeographic reconstructions. Main conclusions The circumpolar distribution of blue-eyed shags, and their highly dynamic evolutionary history, potentially make Leucocarbo a strong sentinel of past and ongoing Southern Ocean ecosystem change given their sensitivity to climatic impacts.


1998 ◽  
Vol 72 (3) ◽  
pp. 1974-1982 ◽  
Author(s):  
Andrew J. Davison

ABSTRACT Salmonid herpesvirus 1 (SalHV-1) is a pathogen of the rainbow trout (Oncorhynchus mykiss). Restriction endonuclease mapping, cosmid cloning, DNA hybridization, and targeted DNA sequencing experiments showed that the genome is 174.4 kbp in size, consisting of a long unique region (UL; 133.4 kbp) linked to a short unique region (US; 25.6 kbp) which is flanked by an inverted repeat (RS; 7.7 kbp). US is present in virion DNA in either orientation, but UL is present in a single orientation. This structure is characteristic of the Varicellovirus genus of the subfamily Alphaherpesvirinae but has evidently evolved independently, since an analysis of randomly sampled DNA sequence data showed that SalHV-1 shares at least 18 genes with channel catfish virus (CCV), a fish herpesvirus whose complete sequence is known and which is unrelated to mammalian herpesviruses. The use of oligonucleotide probes demonstrated that in comparison with CCV, the conserved SalHV-1 genes are located in UL in at least five rearranged blocks. Large-scale gene rearrangements of this type are also characteristic of the three mammalian herpesvirus subfamilies. The junction between two SalHV-1 gene blocks was confirmed by sequencing a 4,245-bp region which contains the dUTPase gene, part of a putative spliced DNA polymerase gene, and one other complete gene. The implications of these findings in herpesvirus taxonomy are discussed.

