PhyloHerb: A phylogenomic pipeline for processing genome skimming data for plants

2021 ◽  
Author(s):  
Liming Cai ◽  
Hongrui Zhang ◽  
Charles C Davis

Premise of the study: The application of high throughput sequencing, especially to herbarium specimens, is greatly accelerating biodiversity research. Among various techniques, low coverage Illumina sequencing of total genomic DNA (genome skimming) can simultaneously recover the plastid, mitochondrial, and nuclear ribosomal regions across hundreds of species. Here, we introduce PhyloHerb -- a bioinformatic pipeline to efficiently and effectively assemble phylogenomic datasets derived from genome skimming. Methods and Results: PhyloHerb uses either a built-in database or user-specified references to extract orthologous sequences using BLAST search. It outputs FASTA files and offers a suite of utility functions to assist with alignment, data partitioning, concatenation, and phylogeny inference. The program is freely available at https://github.com/lmcai/PhyloHerb/. Conclusions: Using published data from Clusiaceae, we demonstrated that PhyloHerb can accurately identify genes using highly fragmented assemblies derived from sequencing older herbarium specimens. Our approach is effective at all taxonomic depths and is scalable to thousands of species.
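The concatenation and partitioning step mentioned in the abstract can be illustrated with a minimal sketch. This is not PhyloHerb's own code or interface; the function name `concatenate` and the input layout (a dictionary of per-gene alignments) are assumptions made for illustration only. It joins per-gene alignments into one supermatrix, pads taxa missing from a gene with gaps, and records 1-based partition coordinates.

```python
def concatenate(alignments):
    """Concatenate per-gene alignments into a supermatrix.

    alignments: dict mapping gene name -> dict mapping taxon -> aligned sequence.
    All sequences within one gene must have the same length.
    Returns (supermatrix, partitions), where partitions is a list of
    (gene, start, end) tuples with 1-based inclusive coordinates.
    """
    # Every taxon seen in any gene gets a row in the supermatrix.
    taxa = sorted({t for aln in alignments.values() for t in aln})
    pieces = {t: [] for t in taxa}
    partitions = []
    start = 1
    for gene, aln in alignments.items():
        lengths = {len(s) for s in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"gene {gene!r} is empty or not aligned")
        length = lengths.pop()
        for t in taxa:
            # Taxa lacking this gene are padded with gap characters.
            pieces[t].append(aln.get(t, "-" * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {t: "".join(p) for t, p in pieces.items()}
    return supermatrix, partitions


if __name__ == "__main__":
    toy = {
        "matK": {"sp1": "ATG-", "sp2": "ATGC"},
        "ITS": {"sp1": "GG", "sp3": "GC"},
    }
    matrix, parts = concatenate(toy)
    for taxon, seq in matrix.items():
        print(f">{taxon}\n{seq}")
    for gene, lo, hi in parts:
        print(f"DNA, {gene} = {lo}-{hi}")
```

The toy gene names and sequences are invented; real per-gene alignments would come from an aligner run on the extracted FASTA files.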

Mobile DNA ◽  
2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Jonathan Filée ◽  
Sarah Farhat ◽  
Dominique Higuet ◽  
Laure Teysset ◽  
Dominique Marie ◽  
...  

Abstract Background With the expansion of high throughput sequencing, we now have access to a larger number of genome-wide studies analyzing the transposable element (TE) composition in a wide variety of organisms. However, genomic analyses often remain too limited in number and diversity of species investigated to study in depth the dynamics and evolutionary success of the different types of TEs among metazoans. Therefore, we chose to investigate the use of transcriptomes to describe the diversity of TEs in phylogenetically related species by conducting the first comparative analysis of TEs in two groups of polychaetes and to evaluate the diversity of TEs that might impact genomic evolution as a result of their mobility. Results We present a detailed analysis of TE distribution in transcriptomes extracted from 15 polychaetes depending on the number of reads used during assembly, and also compare these results with additional TE scans on associated low-coverage genomes. We then characterized the clades defined by 1021 LTR-retrotransposon families identified in 26 species. Clade richness was highly dependent on the considered superfamily. Copia elements appear rare and are equally distributed in only three clades, GalEa, Hydra and CoMol. Among the eight BEL/Pao clades identified in annelids, two small clades within the Sailor lineage are new to science. We characterized 17 Gypsy clades of which only 4 are new; the C-clade largely dominates with a quarter of the families. Finally, the majority of species also expressed two distinct transcripts encoding PIWI proteins, which are known to be involved in controlling TE mobility. Conclusions This study shows that the use of transcriptomes assembled from 40 million reads was sufficient to access the diversity and proportion of transposable elements, compared with those obtained by low-coverage sequencing. Among LTR-retrotransposons, Gypsy elements were unequivocally dominant, but the results suggest that the number of Gypsy clades, although high, may be more limited than previously thought in metazoans. For BEL/Pao elements, the organization of clades within the Sailor lineage appears more difficult to establish clearly. The Copia elements remain rare and result from the evolutionarily consistent success of the same three clades.


Viruses ◽  
2021 ◽  
Vol 13 (10) ◽  
pp. 2006
Author(s):  
Anna Y Budkina ◽  
Elena V Korneenko ◽  
Ivan A Kotov ◽  
Daniil A Kiselev ◽  
Ilya V Artyushin ◽  
...  

According to various estimates, only a small percentage of existing viruses have been discovered, and naturally even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, empowering large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms are yet to be attributed specific loci for identification. This problem particularly impedes viral screening, due to vast heterogeneity in viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, it was shown that alignment-based methods were unable to identify the taxon for a large proportion of reads, and we additionally applied other approaches, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.


2019 ◽  
Vol 8 (8) ◽  
pp. 1175 ◽  
Author(s):  
Valentina Sas ◽  
Vlad Moisoiu ◽  
Patric Teodorescu ◽  
Sebastian Tranca ◽  
Laura Pop ◽  
...  

During recent decades, understanding of the molecular mechanisms of acute lymphoblastic leukemia (ALL) has improved considerably, resulting in better risk stratification of patients and increased survival rates. Age, white blood cell count (WBC), and specific genetic abnormalities are the most important factors that define risk groups for ALL. State-of-the-art diagnosis of ALL requires cytological and cytogenetic analyses, as well as flow cytometry and high-throughput sequencing assays. An important aspect in the diagnostic characterization of patients with ALL is the identification of the Philadelphia (Ph) chromosome, which warrants the addition of tyrosine kinase inhibitors (TKI) to the chemotherapy backbone. Data that support the benefit of hematopoietic stem cell transplantation (HSCT) in high risk patient subsets or in late relapse patients are still debated and have yet to be shown to be conclusive. This article presents the newly published data in ALL workup and treatment, putting it into perspective for the attending physician in hematology and oncology.


2019 ◽  
Vol 7 (11) ◽  
pp. 493 ◽  
Author(s):  
Zhan ◽  
Li ◽  
Xu

Metabarcoding and high-throughput sequencing methods have greatly improved our understanding of protist diversity. Although the V4 region of small subunit ribosomal DNA (SSU-V4 rDNA) is the most widely used marker in DNA metabarcoding of eukaryotic microorganisms, doubts have recently been raised about its suitability. Here, using the widely distributed ciliate genus Pseudokeronopsis as an example, we assessed the potential of SSU-V4 rDNA and four other nuclear and mitochondrial markers for species delimitation and phylogenetic reconstruction. Our studies revealed that SSU-V4 rDNA is too conserved to distinguish species: thresholds of 97% and 99% sequence similarity detected only one and three OTUs, respectively, from seven species. On the basis of the comparative analysis of the present and previously published data, we propose a multilocus marker comprising the nuclear 5.8S rDNA combined with the internal transcribed spacer regions (ITS1-5.8S-ITS2) and the hypervariable D2 region of large subunit rDNA (LSU-D2) as an ideal barcode rather than the mitochondrial cytochrome c oxidase subunit 1 gene, and the ITS1-5.8S-ITS2 as a candidate metabarcoding marker for ciliates. Furthermore, the compensating base change and tree-based criteria of ITS2 and LSU-D2 were useful in complementing the DNA barcoding and metabarcoding methods by providing secondary structure and phylogenetic evidence.
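To make the effect of a similarity threshold on OTU counts concrete, here is a minimal sketch of greedy centroid clustering on pre-aligned, equal-length sequences. It is an illustration only, not the clustering procedure used in the study; the function names and the toy sequences are hypothetical, and greedy clustering of this kind depends on input order.

```python
def identity(a, b):
    """Fraction of identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold):
    """Assign each sequence to the first centroid it matches at >= threshold.

    A sequence matching no existing centroid founds a new OTU.
    Returns a list of clusters (each a list of sequences).
    """
    centroids = []
    clusters = []
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                clusters[i].append(s)
                break
        else:
            # No centroid was close enough: start a new OTU.
            centroids.append(s)
            clusters.append([s])
    return clusters


if __name__ == "__main__":
    # Toy 100 bp "marker" sequences differing by 0, 1 and 2 substitutions.
    ref = "A" * 100
    one_diff = "C" + "A" * 99
    two_diff = "CC" + "A" * 98
    for t in (0.97, 0.99):
        n = len(greedy_otus([ref, one_diff, two_diff], t))
        print(f"threshold {t:.2f}: {n} OTU(s)")
```

With a conserved marker, distinct species can differ by only a handful of positions, so a looser threshold merges them into a single OTU while a stricter one separates only some of them.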


Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Elisa Banchi ◽  
Claudio G Ametrano ◽  
Samuele Greco ◽  
David Stanković ◽  
Lucia Muggia ◽  
...  

Abstract DNA metabarcoding combines DNA barcoding with high-throughput sequencing to identify different taxa within environmental communities. The ITS has already been proposed and widely used as a universal barcode marker for plants, but a comprehensive, updated and accurate reference dataset of plant ITS sequences has not been available so far. Here, we constructed reference datasets of Viridiplantae ITS1, ITS2 and entire ITS sequences including both Chlorophyta and Streptophyta. The sequences were retrieved from NCBI, and the ITS region was extracted. The sequences underwent an identity check to remove misidentified records and were clustered at 99% identity to reduce redundancy and computational effort. For this step, we developed a script called ‘better clustering for QIIME’ (bc4q) to ensure that the representative sequences are chosen according to the composition of the cluster at a different taxonomic level. The three datasets obtained with the bc4q script are PLANiTS1 (100 224 sequences), PLANiTS2 (96 771 sequences) and PLANiTS (97 550 sequences), and all are pre-formatted for QIIME, this being the most widely used bioinformatic pipeline for metabarcoding analysis. As curated and updated reference databases, PLANiTS1, PLANiTS2 and PLANiTS are proposed as a reliable, pivotal first step towards a general standardization of plant DNA metabarcoding studies. The bc4q script is presented as a new tool useful for any research involving sequence clustering. Database URL: https://github.com/apallavicini/bc4q; https://github.com/apallavicini/PLANiTS.


2017 ◽  
Vol 37 (04) ◽  
pp. 314-331 ◽  
Author(s):  
Johannes Hov ◽  
Tom Karlsen

Abstract The close relationship between primary sclerosing cholangitis (PSC) and inflammatory bowel disease has inspired hypothetical models in which gut bacteria or bacterial products are key players in PSC pathogenesis. Several studies using high-throughput sequencing technology to characterize the gut microbiota in PSC have been published in recent years. They all report reduced diversity and significant shifts in the overall composition of the gut microbiota. However, it remains unclear whether the observed changes are primary or secondary to PSC development, and further studies are needed to assess the biological implications of the findings. In the present article, we review the published data in the light of similar studies in other diseases. We discuss aspects of methodology and study design that are relevant to interpretation of the data. Furthermore, we propose that interpretation and further assessments of findings are structured into conceptual compartments, and elaborate three such possible concepts relating to immune function (the “immunobiome”), host metabolism (the “endobiome”), and dietary and xenobiotic factors (the “xenobiome”) in PSC.


2019 ◽  
Author(s):  
Eleonora Rachtman ◽  
Metin Balaban ◽  
Vineet Bafna ◽  
Siavash Mirarab

Abstract The ability to detect the identity of a sample obtained from its environment is a cornerstone of molecular ecological research. Thanks to the falling price of shotgun sequencing, genome skimming, the acquisition of short reads spread across the genome at low coverage, is emerging as an alternative to traditional barcoding. By obtaining far more data across the whole genome, skimming has the promise to increase the precision of sample identification beyond traditional barcoding while keeping the costs manageable. While methods for assembly-free sample identification based on genome skims are now available, little is known about how these methods react to the presence of DNA from organisms other than the target species. In this paper, we show that the accuracy of distances computed between a pair of genome skims based on k-mer similarity can degrade dramatically if the skims include contaminant reads; i.e., any reads originating from other organisms. We establish a theoretical model of the impact of contamination. We then suggest and evaluate a solution to the contamination problem: query reads in a genome skim against an extensive database of possible contaminants (e.g., all microbial organisms) and filter out any read that matches. We evaluate the effectiveness of this strategy when implemented using Kraken-II, in detailed analyses. Our results show substantial improvements in accuracy as a result of filtering but also point to limitations, including a need for relatively close matches in the contaminant database.
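The effect described here can be sketched in a few lines. The example below is illustrative only: it uses exact k-mer sets and the Mash-style conversion from Jaccard index to distance, which is a swapped-in stand-in and not necessarily the estimator used by the authors, and its filter is a naive exact k-mer match rather than Kraken-II. All function names and toy reads are hypothetical.

```python
import math


def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def skim_kmers(reads, k):
    """Union of the k-mer sets of all reads in a skim."""
    out = set()
    for r in reads:
        out |= kmers(r, k)
    return out


def jaccard(a, b):
    """Jaccard index of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j, k):
    """Mash-style distance from a Jaccard index j: -ln(2j / (1 + j)) / k."""
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


def filter_reads(reads, contaminant_kmers, k):
    """Drop every read that shares at least one k-mer with the contaminant set."""
    return [r for r in reads if not (kmers(r, k) & contaminant_kmers)]


if __name__ == "__main__":
    k = 5
    sample_a = ["ACGTACGTTGCA", "TTGACCAGTAGG"]
    contaminant = ["GGGGGCCCCCAAAAA"]      # toy read from "another organism"
    sample_b = sample_a + contaminant      # same species, contaminated skim

    def dist(x, y):
        return mash_distance(jaccard(skim_kmers(x, k), skim_kmers(y, k)), k)

    cleaned_b = filter_reads(sample_b, skim_kmers(contaminant, k), k)
    print("contaminated:", dist(sample_a, sample_b))
    print("filtered:    ", dist(sample_a, cleaned_b))
```

The contaminant k-mers enlarge the union without enlarging the intersection, which lowers the Jaccard index and inflates the distance between two skims of the same organism; removing matching reads restores it. The need for close matches in the contaminant database shows up here as the requirement for an exact shared k-mer.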


2018 ◽  
Author(s):  
Quinn K. Langdon ◽  
David Peris ◽  
Brian Kyle ◽  
Chris Todd Hittinger

Abstract The genomics era has expanded our knowledge about the diversity of the living world, yet harnessing high-throughput sequencing data to investigate alternative evolutionary trajectories, such as hybridization, is still challenging. Here we present sppIDer, a pipeline for the characterization of interspecies hybrids and pure species that illuminates the complete composition of genomes. sppIDer maps short-read sequencing data to a combination genome built from reference genomes of several species of interest and assesses the genomic contribution and relative ploidy of each parental species, producing a series of colorful graphical outputs ready for publication. As a proof-of-concept, we use the genus Saccharomyces to detect and visualize both interspecies hybrids and pure strains, even with missing parental reference genomes. Through simulation, we show that sppIDer is robust to variable reference genome qualities and performs well with low-coverage data. We further demonstrate the power of this approach in plants, animals, and other fungi. sppIDer is robust to many different inputs and provides visually intuitive insight into genome composition that enables the rapid identification of species and their interspecies hybrids. sppIDer exists as a Docker image, which is a reusable, reproducible, transparent, and simple-to-run package that automates the pipeline and installation of the required dependencies (https://github.com/GLBRC/sppIDer).


Author(s):  
Jimmy A McGuire ◽  
Darko D Cotoras ◽  
Brendan O'Connell ◽  
Shobi Z S Lawalata ◽  
Cynthia Y Wang-Claypool ◽  
...  

We used massively parallel high-throughput sequencing to obtain genetic data from a 145-year-old holotype specimen of the flying lizard, Draco cristatellus. Obtaining genetic data from this holotype was necessary to resolve an otherwise intractable taxonomic problem involving the status of this species relative to closely related sympatric Draco species that cannot otherwise be distinguished from one another on the basis of museum specimens. Initial analyses suggested that the DNA present in the holotype sample was so degraded as to be unusable for sequencing. However, we used a specialized extraction procedure developed for highly degraded ancient DNA samples and MiSeq shotgun sequencing to obtain just enough low-coverage mitochondrial DNA (547 base pairs) to conclusively resolve the species status of the holotype as well as a second known specimen of this species. The holotype was prepared before the advent of formalin fixation and therefore was most likely originally fixed with ethanol and never exposed to formalin. Whereas conventional wisdom suggests that formalin-fixed samples should be the most challenging for DNA sequencing, we propose that evaporation during long-term alcohol storage and consequent water exposure may subject older ethanol-fixed museum specimens to hydrolytic damage. If so, this may pose an even greater challenge for sequencing efforts involving historical samples.


2021 ◽  
Author(s):  
Daniel M Fernandes ◽  
Olivia Cheronet ◽  
Pere Gelabert ◽  
Ron Pinhasi

Estimation of genetically related individuals is playing an increasingly important role in the ancient DNA field. In recent years, the numbers of sequenced individuals from single sites have been increasing, reflecting a growing interest in understanding the familial and social organisation of ancient populations. Although a few different methods have been specifically developed for ancient DNA, namely to tackle issues such as low-coverage homozygous data, they require a minimum average genomic coverage of between 0.1x and 1x per analysed pair of individuals. Here we present an updated version of a method that enables estimates of 1st- and 2nd-degree relatedness with as little as 0.026x average coverage, or around 1.3 million aligned reads per sample, roughly 4 times less data than 0.1x. By using simulated data to estimate false positive error rates, we further show that a threshold even as low as 0.012x, or around 600,000 reads, will always show 1st-degree relationships as related. Lastly, by applying this method to published data, we are able to identify previously undocumented relationships using individuals previously excluded from kinship analysis due to their very low coverage. This methodological improvement has the potential to enable relatedness estimation on ancient whole genome shotgun data during routine low-coverage screening, and therefore improve project management when decisions need to be made on which individuals are to be further sequenced.
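The coverage figures quoted above can be related to read counts with the usual approximation coverage = reads × read length / genome size. The sketch below is a worked illustration under assumptions that are not stated in the abstract: a genome size of about 3.1 Gbp (approximate human genome) and a single mean aligned read length. Under those assumptions, both quoted pairs (0.026x with 1.3 million reads, 0.012x with 600,000 reads) imply a mean aligned length of about 62 bp, and the ratio 0.1x / 0.026x is about 3.8.

```python
HUMAN_GENOME_BP = 3.1e9  # assumption: approximate human genome size, not from the abstract


def mean_coverage(n_reads, read_length_bp, genome_size_bp=HUMAN_GENOME_BP):
    """Average depth of coverage from read count and mean aligned read length."""
    return n_reads * read_length_bp / genome_size_bp


def implied_read_length(coverage, n_reads, genome_size_bp=HUMAN_GENOME_BP):
    """Mean aligned read length implied by a coverage value and a read count."""
    return coverage * genome_size_bp / n_reads


def reads_needed(coverage, read_length_bp, genome_size_bp=HUMAN_GENOME_BP):
    """Number of aligned reads needed to reach a target average coverage."""
    return coverage * genome_size_bp / read_length_bp


if __name__ == "__main__":
    # Coverage / read-count pairs quoted in the abstract.
    for cov, reads in [(0.026, 1.3e6), (0.012, 6.0e5)]:
        length = implied_read_length(cov, reads)
        print(f"{cov}x with {reads:.0f} reads implies ~{length:.0f} bp per read")
    # Data ratio between the previous 0.1x minimum and the new 0.026x minimum.
    print(f"0.1x / 0.026x = {0.1 / 0.026:.2f}")
```

Short aligned lengths of this order are typical of degraded ancient DNA, but the 62 bp value here is only what the arithmetic implies under the stated genome-size assumption.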

