DiscoMark: Nuclear marker discovery from orthologous sequences using draft genome data

Abstract High-throughput sequencing has laid the foundation for fast and cost-effective development of phylogenetic markers. Here we present the program DISCOMARK, which streamlines the development of nuclear DNA (nDNA) markers from whole-genome (or whole-transcriptome) sequencing data, combining local alignment, alignment trimming, reference mapping and primer design based on multiple sequence alignments in order to design primer pairs from input orthologous sequences. To demonstrate the suitability of DISCOMARK, we designed markers for two groups of species, one consisting of closely related species and one of distantly related species. For the closely related members of the species complex of Cloeon dipterum s.l. (Insecta, Ephemeroptera), the program discovered a total of 78 markers. Among these, we selected eight markers for amplification and Sanger sequencing. The exon sequence alignments (2,526 base pairs (bp)) were used to reconstruct a well-supported phylogeny and to infer clearly structured haplotype networks. For the distantly related species, we designed primers for several families in the insect order Ephemeroptera, using available genomic data from four sequenced species. We developed primer pairs for 23 markers that are designed to amplify across several families. The DISCOMARK program will enhance the development of new nDNA markers by providing a streamlined, automated approach to perform genome-scale scans for phylogenetic markers. The program is written in Python, released under a public license (GNU GPL v2), and available, together with a manual and example data set, at: https://github.com/hdetering/discomark.
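To illustrate the idea of designing primers from multiple sequence alignments, the following is a minimal sketch that scans an alignment for conserved, gap-free windows that could serve as candidate primer-binding sites. It is not DISCOMARK's implementation: the function name `conserved_windows`, the window width and the `max_variable` threshold are hypothetical choices made for this example only.

```python
def conserved_windows(alignment, width=20, max_variable=0):
    """Return 0-based start columns of windows with few variable columns.

    `alignment` is a list of equal-length aligned sequences (gaps as '-').
    A column counts as variable if it contains a gap or more than one
    distinct character. All names and defaults here are illustrative.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")

    # Flag each alignment column as variable (True) or conserved (False).
    variable = []
    for column in zip(*alignment):
        states = {char.upper() for char in column}
        variable.append("-" in states or len(states) > 1)

    # Keep every window whose number of variable columns is within the limit.
    starts = []
    for start in range(length - width + 1):
        if sum(variable[start:start + width]) <= max_variable:
            starts.append(start)
    return starts


if __name__ == "__main__":
    toy_alignment = ["ACGTACGTAA", "ACGTACGTTA", "ACGTACGTAA"]
    print(conserved_windows(toy_alignment, width=8))
```

In practice, candidate windows would still need to be checked for melting temperature, GC content and amplicon length, which a dedicated primer design tool handles.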


Minimal clustering and species delimitation based on multi-locus alignments vs SNPs: the case of the Seriphium plumosum L. complex (Gnaphalieae: Asteraceae)

10.1101/2021.03.21.436318 ◽

2021 ◽

Author(s):

Zaynab Shaik ◽

Nicola Georgina Bergh ◽

Bengt Oxelman ◽

Anthony George Verboom

Keyword(s):

Species Delimitation ◽

Bayes Factor ◽

Cape Floristic Region ◽

Parametric Models ◽

Western Cape ◽

Nucleotide Polymorphisms ◽

Sequence Alignments ◽

Separate Species ◽

Multiple Sequence ◽

Data Set

We applied species delimitation methods based on the Multi-Species Coalescent (MSC) model to 500+ loci derived from genotyping-by-sequencing on the South African Seriphium plumosum (Asteraceae) species complex. The loci were represented either as multiple sequence alignments or single nucleotide polymorphisms (SNPs), and analysed by the STACEY and Bayes Factor Delimitation (BFD)/SNAPP methods, respectively. Both methods supported species taxonomies where virtually all of the 32 sampled individuals, each representing its own geographical population, were identified as separate species. Computational efforts required to achieve adequate mixing of MCMC chains were considerable, and the species/minimal cluster trees identified similar strongly supported clades in replicate runs. The resolution was, however, higher in the STACEY trees than in the SNAPP trees, which is consistent with the higher information content of full sequences. The computational efficiency, measured as effective sample sizes of likelihood and posterior estimates per time unit, was consistently higher for STACEY. A random subset of 56 alignments had similar resolution to the 524-locus SNP data set. The STRUCTURE-like sparse Non-negative Matrix Factorisation (sNMF) method was applied to six individuals from each of 48 geographical populations and 28,023 SNPs. Significantly fewer (13) clusters were identified as optimal by this analysis compared to the MSC methods. The sNMF clusters correspond closely to clades consistently supported by MSC methods, and showed evidence of admixture, especially in the western Cape Floristic Region. We discuss the significance of these findings, and conclude that, when using genome-scale data, it is important to consider a priori the kind of species one wants to identify, the assumptions behind the parametric models applied, and the potential consequences that model violations may have.


Intra-species recombination among strains of the ampelovirus Grapevine leafroll-associated virus 4

Virology Journal ◽

10.1186/s12985-019-1243-4 ◽

2019 ◽

Vol 16 (1) ◽

Cited By ~ 1

Author(s):

Jati Adiputra ◽

Sridhar Jarugula ◽

Rayapati A. Naidu

Keyword(s):

Genome Sequence ◽

High Throughput Sequencing ◽

Washington State ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Genome Wide ◽

Detection Program ◽

Leafroll Disease ◽

First Time

Abstract Background Grapevine leafroll disease is one of the most economically important viral diseases affecting grape production worldwide. Grapevine leafroll-associated virus 4 (GLRaV-4, genus Ampelovirus, family Closteroviridae) is one of the six GLRaV species documented in grapevines (Vitis spp.). GLRaV-4 is made up of several distinct strains that were previously considered as putative species. Currently known strains of GLRaV-4 stand apart from other GLRaV species in lacking the minor coat protein. Methods In this study, the complete genome sequence of three strains of GLRaV-4 from Washington State vineyards was determined using a combination of high-throughput sequencing, Sanger sequencing and RACE. The genome sequence of these three strains was compared with corresponding sequences of GLRaV-4 strains reported from other grapevine-growing regions. Phylogenetic analysis, SimPlot and the Recombination Detection Program (RDP) were used to identify putative recombination events among GLRaV-4 strains. Results The genome size of GLRaV-4 strain 4 (isolate WAMR-4), strain 5 (isolate WASB-5) and strain 9 (isolate WALA-9) from Washington State vineyards was determined to be 13,824 nucleotides (nt), 13,820 nt, and 13,850 nt, respectively. Multiple sequence alignments showed that an 11-nt sequence (5′-GTAATCTTTTG-3′) towards the 5′ terminus of the 5′ non-translated region (NTR) and a 10-nt sequence (5′-ATCCAGGACC-3′) towards the 3′ end of the 3′ NTR are conserved among the currently known GLRaV-4 strains. The LR-106 isolate of strain 4 and the Estellat isolate of strain 6 were identified as recombinants due to putative recombination events involving divergent sequences in the ORF1a from strain 5 and strain Pr. Conclusion Genome-wide analyses showed for the first time that recombination can occur between distinct strains of GLRaV-4, resulting in the emergence of genetically stable and biologically successful chimeric viruses. Although the origin of recombinant strains of GLRaV-4 remains elusive, intra-species recombination could be playing an important role in shaping genetic diversity and evolution of the virus and modulating the biology and epidemiology of GLRaV-4 strains.


SECAPR - A bioinformatics pipeline for the rapid and user-friendly alignment of hybrid enrichment sequences, from raw reads to alignments

10.7287/peerj.preprints.26477v2 ◽

2018 ◽

Author(s):

Tobias Andermann ◽

Angela Cano ◽

Alexander Zizka ◽

Christine Bacon ◽

Alexandre Antonelli

Keyword(s):

Evolutionary Biology ◽

Sequence Data ◽

Model Organisms ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequence Capture ◽

Sequencing Platforms ◽

User Friendly

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing platforms such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (= hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.


Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies

PeerJ ◽

10.7717/peerj.1839 ◽

2016 ◽

Vol 4 ◽

pp. e1839 ◽

Cited By ~ 57

Author(s):

Tom O. Delmont ◽

A. Murat Eren

Keyword(s):

High Throughput Sequencing ◽

Draft Genome ◽

Cost Effective ◽

Single Copy ◽

Eukaryotic Genome ◽

Sequencing Data ◽

Bacterial Genomes ◽

Long Read ◽

Domains Of Life ◽

Genome Assemblies

High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination of non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini, and created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds we could identify and characterize multiple near-complete bacterial genomes from the raw assembly, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and most of these contaminants have differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today’s microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.
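The k-mer frequencies mentioned above can be pictured with a small sketch that turns a scaffold sequence into a normalized k-mer frequency profile, which could then be compared between scaffolds alongside coverage. This is an illustration under stated assumptions rather than the authors' workflow: the function name `kmer_frequencies` and the default k = 4 are choices made for this example, and reverse complements are not merged.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return a dict mapping every A/C/G/T k-mer to its relative frequency.

    k-mers containing other characters (e.g. N) are ignored. The default
    k=4 is an assumption for this sketch, not a value taken from the paper.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}


if __name__ == "__main__":
    profile = kmer_frequencies("ACGTACGT", k=2)
    print(profile["AC"])
```

Scaffolds from the same genome tend to have similar profiles, so a scaffold whose profile and coverage both differ strongly from the bulk of the assembly is a candidate contaminant worth inspecting.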


PhyloToL: A Taxon/Gene-Rich Phylogenomic Pipeline to Explore Genome Evolution of Diverse Eukaryotes

Molecular Biology and Evolution ◽

10.1093/molbev/msz103 ◽

2019 ◽

Vol 36 (8) ◽

pp. 1831-1842 ◽

Cited By ~ 6

Author(s):

Mario A Cerón-Romero ◽

Xyrus X Maurer-Alcalá ◽

Jean-David Grattepanche ◽

Ying Yan ◽

Miguel M Fonseca ◽

...

Keyword(s):

Gene Family ◽

High Throughput Sequencing ◽

Stop Codon ◽

Tree Of Life ◽

Gene Family Evolution ◽

Third Party ◽

Gene Trees ◽

Sequence Alignments ◽

Multiple Sequence ◽

Membrane Pore

Abstract Estimating multiple sequence alignments (MSAs) and inferring phylogenies are essential for many aspects of comparative biology. Yet, many bioinformatics tools for such analyses have focused on specific clades, with greatest attention paid to plants, animals, and fungi. The rapid increase in high-throughput sequencing (HTS) data from diverse lineages now provides opportunities to estimate evolutionary relationships and gene family evolution across the eukaryotic tree of life. At the same time, these types of data are known to be error-prone (e.g., substitutions, contamination). To address these opportunities and challenges, we have refined a phylogenomic pipeline, now named PhyloToL, to allow easy incorporation of data from HTS studies, to automate production of both MSAs and gene trees, and to identify and remove contaminants. PhyloToL is designed for phylogenomic analyses of diverse lineages across the tree of life (i.e., at scales of >100 My). We demonstrate the power of PhyloToL by assessing stop codon usage in Ciliophora, identifying contamination in a taxon- and gene-rich database and exploring the evolutionary history of chromosomes in the kinetoplastid parasite Trypanosoma brucei, the causative agent of African sleeping sickness. Benchmarking PhyloToL’s homology assessment against that of OrthoMCL and a published paper on superfamilies of bacterial and eukaryotic organellar outer membrane pore-forming proteins demonstrates the power of our approach for determining gene family membership and inferring gene trees. PhyloToL is highly flexible and allows users to easily explore HTS data, test hypotheses about phylogeny and gene family evolution and combine outputs with third-party tools (e.g., PhyloChromoMap, iGTP).


An Extensive Meta-Metagenomic Search Identifies SARS-CoV-2-Homologous Sequences in Pangolin Lung Viromes

mSphere ◽

10.1128/msphere.00160-20 ◽

2020 ◽

Vol 5 (3) ◽

Cited By ~ 9

Author(s):

Lamia Wahba ◽

Nimit Jain ◽

Andrew Z. Fire ◽

Massa J. Shoura ◽

Karen L. Artiles ◽

...

Keyword(s):

Nucleic Acid ◽

High Speed ◽

High Throughput Sequencing ◽

Biological Significance ◽

Metagenomic Data ◽

Data Sets ◽

Sequencing Data ◽

Data Set ◽

Link Type ◽

Recent Emergence

ABSTRACT In numerous instances, tracking the biological significance of a nucleic acid sequence can be augmented through the identification of environmental niches in which the sequence of interest is present. Many metagenomic data sets are now available, with deep sequencing of samples from diverse biological niches. While any individual metagenomic data set can be readily queried using web-based tools, meta-searches through all such data sets are less accessible. In this brief communication, we demonstrate such a meta-metagenomic approach, examining close matches to the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in all high-throughput sequencing data sets in the NCBI Sequence Read Archive accessible with the “virome” keyword. In addition to the homology to bat coronaviruses observed in descriptions of the SARS-CoV-2 sequence (F. Wu, S. Zhao, B. Yu, Y. M. Chen, et al., Nature 579:265–269, 2020, https://doi.org/10.1038/s41586-020-2008-3; P. Zhou, X. L. Yang, X. G. Wang, B. Hu, et al., Nature 579:270–273, 2020, https://doi.org/10.1038/s41586-020-2012-7), we note a strong homology to numerous sequence reads in metavirome data sets generated from the lungs of deceased pangolins reported by Liu et al. (P. Liu, W. Chen, and J. P. Chen, Viruses 11:979, 2019, https://doi.org/10.3390/v11110979). While analysis of these reads indicates the presence of a similar viral sequence in pangolin lung, the similarity is not sufficient to either confirm or rule out a role for pangolins as an intermediate host in the recent emergence of SARS-CoV-2. In addition to the implications for SARS-CoV-2 emergence, this study illustrates the utility and limitations of meta-metagenomic search tools in effective and rapid characterization of potentially significant nucleic acid sequences. IMPORTANCE Meta-metagenomic searches allow for high-speed, low-cost identification of potentially significant biological niches for sequences of interest.


SECAPR - A bioinformatics pipeline for the rapid and user-friendly processing of Illumina sequences, from raw reads to alignments

10.7287/peerj.preprints.26477 ◽

2018 ◽

Author(s):

Tobias Andermann ◽

Angela Cano ◽

Alexander Zizka ◽

Christine Bacon ◽

Alexandre Antonelli

Keyword(s):

Evolutionary Biology ◽

Sequence Data ◽

Model Organisms ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequence Capture ◽

Sequencing Platforms ◽

User Friendly

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing platforms such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive empirical data tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (synonyms: target or hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical sequence capture dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.


Highly comparable metabarcoding results from MGI-Tech and Illumina sequencing platforms

PeerJ ◽

10.7717/peerj.12254 ◽

2021 ◽

Vol 9 ◽

pp. e12254

Author(s):

Sten Anslan ◽

Vladimir Mikryukov ◽

Kęstutis Armolaitis ◽

Jelena Ankuda ◽

Dagnija Lazdina ◽

...

Keyword(s):

High Throughput Sequencing ◽

Coi Gene ◽

Synthesis Methods ◽

Sequence Variant ◽

Sequencing Data ◽

Data Set ◽

Soil Dna ◽

Sequencing Platform ◽

Sequencing By Synthesis ◽

Sequencing Platforms

With the developments in DNA nanoball sequencing technologies and the emergence of new platforms, there is an increasing interest in their performance in comparison with the widely used sequencing-by-synthesis methods. Here, we test the consistency of metabarcoding results from DNBSEQ-G400RS (DNA nanoball sequencing platform by MGI-Tech) and NovaSeq 6000 (sequencing-by-synthesis platform by Illumina) platforms using technical replicates of DNA libraries that consist of COI gene amplicons from 120 soil DNA samples. By subjecting raw sequencing data from both platforms to uniform bioinformatics processing, we found that the proportion of high-quality reads passing through the filtering steps was similar in both datasets. Per-sample operational taxonomic unit (OTU) and amplicon sequence variant (ASV) richness patterns were highly correlated, but sequencing data from DNBSEQ-G400RS harbored a higher number of OTUs. This may be related to the lower dominance of the most common OTUs in the DNBSEQ data set (thus revealing higher richness by detecting rare taxa) and/or to a lower effective read quality leading to generation of spurious OTUs. However, there was no statistical difference in the ASV and post-clustered ASV richness between platforms, suggesting that the additional denoising step in the ASV workflow had effectively removed the ‘noisy’ reads. Both OTU-based and ASV-based composition were strongly correlated between the sequencing platforms, with essentially interchangeable results. Therefore, we conclude that DNBSEQ-G400RS and NovaSeq 6000 are equally efficient high-throughput sequencing platforms for studies applying the metabarcoding approach, but the main benefit of the former is its lower sequencing cost.


Colonization and diversification of aquatic insects on three Macaronesian archipelagos using 59 nuclear loci derived from a draft genome

10.1101/063859 ◽

2016 ◽

Cited By ~ 1

Author(s):

Sereina Rutschmann ◽

Harald Detering ◽

Sabrina Simon ◽

David H. Funk ◽

Jean-Luc Gattolliat ◽

...

Keyword(s):

Species Complex ◽

Nuclear Dna ◽

Draft Genome ◽

Aquatic Insect ◽

Evolutionary Process ◽

Single Copy ◽

Whole Genome Sequencing Data ◽

Ancestral State ◽

Sequencing Data ◽

Multispecies Coalescent

Abstract The study of processes driving diversification requires a fully sampled and well resolved phylogeny. Multilocus approaches to the study of recent diversification provide a powerful means to study the evolutionary process, but their application remains restricted because multiple unlinked loci with suitable variation for phylogenetic or coalescent analysis are not available for most non-model taxa. Here we identify novel, putative single-copy nuclear DNA (nDNA) phylogenetic markers to study the colonization and diversification of an aquatic insect species complex, Cloeon dipterum L. 1761 (Ephemeroptera: Baetidae), in Macaronesia. Whole-genome sequencing data from one member of the species complex were used to identify 59 nDNA loci (32,213 base pairs), followed by Sanger sequencing of 29 individuals sampled from 13 islands of three Macaronesian archipelagos. Multispecies coalescent analyses established six putative species. Three island species formed a monophyletic clade, with one species occurring on the Azores, Europe and North America. Ancestral state reconstruction indicated at least two colonization events from the mainland (Canaries, Azores) and one within the archipelago (between Madeira and the Canaries). Random subsets of the 59 loci showed a positive linear relationship between number of loci and node support. In contrast, node support in the multispecies coalescent tree was negatively correlated with mean number of phylogenetically informative sites per locus, suggesting a complex relationship between tree resolution and marker variability. Our approach highlights the value of combining coalescent-based phylogeography, species delimitation, and phylogenetic reconstruction to resolve recent diversification events in an archipelago species complex.
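The per-locus count of phylogenetically informative sites can be sketched as follows. The abstract does not state the exact definition used, so this example assumes the common parsimony-informative criterion: a column is informative when at least two different nucleotides each occur in at least two sequences. The function name `informative_sites` is hypothetical.

```python
from collections import Counter


def informative_sites(alignment):
    """Count parsimony-informative columns in a list of aligned sequences.

    Only unambiguous nucleotides (A, C, G, T) are counted; gaps and
    ambiguity codes are ignored. The criterion is an assumption of this
    sketch, not a definition taken from the study.
    """
    informative = 0
    for column in zip(*alignment):
        counts = Counter(c.upper() for c in column if c.upper() in "ACGT")
        # Informative: two or more states that are each seen at least twice.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative


if __name__ == "__main__":
    locus = ["AAGT", "AAGT", "ACCT", "ACCA"]
    print(informative_sites(locus))  # columns 2 and 3 qualify
```

Averaging this count over loci would give one number per data subset that can be set against node support.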


BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data

10.1101/2021.10.02.462868 ◽

2021 ◽

Author(s):

Jacob L. Steenwyk ◽

Thomas J. Buida ◽

Carla Goncalves ◽

Dayna C. Goltz ◽

Grace H Morales ◽

...

Keyword(s):

Codon Usage ◽

Sequence Data ◽

Synonymous Codon ◽

Synonymous Codon Usage ◽

Relative Synonymous Codon Usage ◽

Summary Statistics ◽

Sequencing Data ◽

Sequence Alignments ◽

Multiple Sequence ◽

Genome Assemblies

Bioinformatic analysis - such as genome assembly quality assessment, alignment summary statistics, relative synonymous codon usage, paired-end aware quality trimming and filtering of sequencing reads, file format conversion, and processing and analysis - is integrated into diverse disciplines in the biological sciences. Several command-line pieces of software have been developed to conduct some of these individual analyses; however, the lack of a unified toolkit that conducts all these analyses can be a barrier in workflows. To address this obstacle, we introduce BioKIT, a versatile toolkit for the UNIX shell environment with 40 functions, several of which were community-sourced, that conduct routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more. To demonstrate the utility of BioKIT, we assessed the quality and characteristics of 901 eukaryotic genome assemblies, calculated alignment summary statistics for 10 phylogenomic data matrices, determined relative synonymous codon usage across 171 fungal genomes including those that use alternative genetic codes, and demonstrated that a novel metric, gene-wise relative synonymous codon usage, can accurately estimate gene-wise codon optimization. BioKIT will be helpful in facilitating and streamlining sequence analysis workflows. BioKIT is freely available under the MIT license from GitHub (https://github.com/JLSteenwyk/BioKIT), PyPI (https://pypi.org/project/biokit), and the Anaconda Cloud (https://anaconda.org/JLSteenwyk/biokit). Documentation, user tutorials, and instructions for requesting new features are available online (https://jlsteenwyk.com/BioKIT).
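Relative synonymous codon usage (RSCU) is commonly defined as the observed count of a codon divided by the mean count of all codons encoding the same amino acid. The sketch below computes it for a single coding sequence. It does not use or reproduce BioKIT's code: the function name `rscu` is hypothetical, only the standard genetic code is supported (the alternative codes mentioned in the abstract are not), and stop codons are skipped.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(cds):
    """Return {codon: RSCU} for amino acids that occur in the sequence."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group synonymous codons by amino acid, leaving out stop codons.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for codon in family:
            # observed count / (family total / family size)
            values[codon] = counts[codon] * len(family) / total
    return values


if __name__ == "__main__":
    print(rscu("GCTGCTGCCGCA")["GCT"])
```

A value above 1 means the codon is used more often than expected under equal usage of synonymous codons, and a value below 1 means it is used less often.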
