scholarly journals Phylogeny Estimation Given Sequence Length Heterogeneity

Author(s):  
Vladimir Smirnov ◽  
Tandy Warnow

Abstract Phylogeny estimation is a major step in many biological studies, and has many well known challenges. With the dropping cost of sequencing technologies, biologists now have increasingly large datasets available for use in phylogeny estimation. Here we address the challenge of estimating a tree given large datasets with a combination of full-length sequences and fragmentary sequences, which can arise due to a variety of reasons, including sample collection, sequencing technologies, and analytical pipelines. We compare two basic approaches: (1) computing an alignment on the full dataset and then computing a maximum likelihood tree on the alignment, or (2) constructing an alignment and tree on the full length sequences and then using phylogenetic placement to add the remaining sequences (which will generally be fragmentary) into the tree. We explore these two approaches on a range of simulated datasets, each with 1000 sequences and varying in rates of evolution, and two biological datasets. Our study shows some striking performance differences between methods, especially when there is substantial sequence length heterogeneity and high rates of evolution. We find in particular that using UPP to align sequences and RAxML to compute a tree on the alignment provides the best accuracy, substantially outperforming trees computed using phylogenetic placement methods. We also find that FastTree has poor accuracy on alignments containing fragmentary sequences. Overall, our study provides insights into the literature comparing different methods and pipelines for phylogenetic estimation, and suggests directions for future method development. [Phylogeny estimation, sequence length heterogeneity, phylogenetic placement.]

Author(s):  
Robert C. Edgar

AbstractMapping of reads to reference sequences is an essential step in a wide range of biological studies. The large size of datasets generated with next-generation sequencing technologies motivates the development of fast mapping software. Here, I describe URMAP, a new read mapping algorithm. URMAP is an order of magnitude faster than BWA and Bowtie2 with comparable accuracy on a benchmark test using simulated paired 150nt reads of a well-studied human genome. Software is freely available at https://drive5.com/urmap.


mBio ◽  
2016 ◽  
Vol 7 (3) ◽  
Author(s):  
Patrick D. Schloss ◽  
Rene A. Girard ◽  
Thomas Martin ◽  
Joshua Edwards ◽  
J. Cameron Thrash

ABSTRACT A census is typically carried out for people across a range of geographical levels; however, microbial ecologists have implemented a molecular census of bacteria and archaea by sequencing their 16S rRNA genes. We assessed how well the census of full-length 16S rRNA gene sequences is proceeding in the context of recent advances in high-throughput sequencing technologies because full-length sequences are typically used as references for classification of the short sequences generated by newer technologies. Among the 1,411,234 and 53,546 full-length bacterial and archaeal sequences, 94.5% and 95.1% of the bacterial and archaeal sequences, respectively, belonged to operational taxonomic units (OTUs) that have been observed more than once. Although these metrics suggest that the census is approaching completion, 29.2% of the bacterial and 38.5% of the archaeal OTUs have been observed more than once. Thus, there is still considerable diversity to be explored. Unfortunately, the rate of new full-length sequences has been declining, and new sequences are primarily being deposited by a small number of studies. Furthermore, sequences from soil and aquatic environments, which are known to be rich in bacterial diversity, represent only 7.8 and 16.5% of the census, while sequences associated with host-associated environments represent 55.0% of the census. Continued use of traditional approaches and new technologies such as single-cell genomics and short-read assembly are likely to improve our ability to sample rare OTUs if it is possible to overcome this sampling bias. The success of ongoing efforts to use short-read sequencing to characterize archaeal and bacterial communities requires that researchers strive to expand the depth and breadth of this census. IMPORTANCE The biodiversity contained within the bacterial and archaeal domains dwarfs that of the eukaryotes, and the services these organisms provide to the biosphere are critical. Surprisingly, we have done a relatively poor job of formally tracking the quality of the biodiversity as represented in full-length 16S rRNA genes. By understanding how this census is proceeding, it is possible to suggest the best allocation of resources for advancing the census. We found that the ongoing effort has done an excellent job of sampling the most abundant organisms but struggles to sample the rarer organisms. Through the use of new sequencing technologies, we should be able to obtain full-length sequences from these rare organisms. Furthermore, we suggest that by allocating more resources to sampling environments known to have the greatest biodiversity, we will be able to make significant advances in our characterization of archaeal and bacterial diversity.


2017 ◽  
Author(s):  
Devang Mehta ◽  
Matthias Hirsch-Hoffmann ◽  
Mariam Were ◽  
Andrea Patrignani ◽  
Hassan Were ◽  
...  

ABSTRACTDeep-sequencing of virus isolates using short-read sequencing technologies is problematic since viruses are often present in complexes sharing a high-degree of sequence identity. The full-length genomes of such highly-similar viruses cannot be assembled accurately from short sequencing reads. We present a new method, CIDER-Seq (Circular DNA Enrichment Sequencing) which successfully generates accurate full-length virus genomes from individual sequencing reads with no sequence assembly required. CIDER-Seq operates by combining a PCR-free, circular DNA enrichment protocol with Single Molecule Real Time sequencing and a new sequence deconcatenation algorithm. We apply our technique to produce more than 1,200 full-length, highly accurate geminivirus genomes from RNAi-transgenic and control plants in a field trial in Kenya. Using CIDER-Seq we can demonstrate for the first time that the expression of antiviral doublestranded RNA (dsRNA) in transgenic plants causes a consistent shift in virus populations towards species sharing low homology to the transgene derived dsRNA. Our results show that CIDER-seq is a powerful, cost-effective tool for accurately sequencing circular DNA viruses, with future applications in deep-sequencing other forms of circular DNA such as transposons and plasmids.


2021 ◽  
Author(s):  
Heather L Wells ◽  
Elizabeth Loh ◽  
Alessandra Nava ◽  
Mei Ho Lee ◽  
Jimmy Lee ◽  
...  

As part of a broad One Health surveillance effort to detect novel viruses in wildlife and people, we report several paramyxoviruses sequenced primarily from bats during 2013 and 2014 in Brazil and Malaysia, including seven from which we recovered full-length genomes. Of these, six represent the first full-length paramyxovirus genomes sequenced from the Americas, including two sequences which are the first full-length bat morbillivirus genomes published to date. Our findings add to the vast number of viral sequences in public repositories that have been increasing considerably in recent years due to the rising accessibility of metagenomics. Taxonomic classification of these sequences in the absence of phenotypic data has been a significant challenge, particularly in the paramyxovirus subfamily Orthoparamyxovirinae, where the rate of discovery of novel sequences has been substantial. Using pairwise amino acid sequence classification (PASC), we describe a novel genus within this subfamily tentatively named Jeishaanvirus, which we propose should include as subgenera Jeilongvirus, Shaanvirus, and a novel South American subgenus Cadivirus. We also highlight inconsistencies in the classification of Tupaia virus and Mojiang virus using the same demarcation criteria and show that members of the proposed subgenus Shaanvirus are paraphyletic. Importantly, this study underscores the critical importance of sequence length in PASC analysis as well as the importance of biological characteristics such as genome organization in the taxonomic classification of viral sequences.


2020 ◽  
Author(s):  
Steffen Klasberg ◽  
Alexander H. Schmidt ◽  
Vinzenz Lange ◽  
Gerhard Schöfl

AbstractBackgroundHigh resolution HLA genotyping of donors and recipients is a crucially important prerequisite for haematopoetic stem-cell transplantation and relies heavily on the quality and completeness of immuno-genetic reference sequence databases of allelic variation.ResultsHere, we report on DR2S, an R package that leverages the strengths of two sequencing technologies – the accuracy of next-generation sequencing with the read length of third-generation sequencing technologies like PacBio’s SMRT sequencing or ONT’s nanopore sequencing – to reconstruct fully-phased high-quality full-length haplotype sequences. Although optimised for HLA and KIR genes, DR2S is applicable to all loci with known reference sequences provided that full-length sequencing data is available for analysis. In addition, DR2S integrates supporting tools for easy visualisation and quality control of the reconstructed haplotype to ensure suitability for submission to public allele databases.ConclusionsDR2S is a largely automated workflow designed to create high-quality fully-phased reference allele sequences for highly polymorphic gene regions such as HLA or KIR. It has been used by biologists to successfully characterise and submit more than 500 HLA alleles and more than 500 KIR alleles to the IPD-IMGT/HLA and IPD-KIR databases.


2018 ◽  
Vol 2018 ◽  
pp. 1-6 ◽  
Author(s):  
Shang-Qian Xie ◽  
Yue Han ◽  
Xiao-Zhou Chen ◽  
Tai-Yu Cao ◽  
Kai-Kai Ji ◽  
...  

The accurate landscape of transcript isoforms plays an important role in the understanding of gene function and gene regulation. However, building complete transcripts is very challenging for short reads generated using next-generation sequencing. Fortunately, isoform sequencing (Iso-Seq) using single-molecule sequencing technologies, such as PacBio SMRT, provides long reads spanning entire transcript isoforms which do not require assembly. Therefore, we have developed ISOdb, a comprehensive resource database for hosting and carrying out an in-depth analysis of Iso-Seq datasets and visualising the full-length transcript isoforms. The current version of ISOdb has collected 93 publicly available Iso-Seq samples from eight species and presents the samples in two levels: (1) sample level, including metainformation, long read distribution, isoform numbers, and alternative splicing (AS) events of each sample; (2) gene level, including the total isoforms, novel isoform number, novel AS number, and isoform visualisation of each gene. In addition, ISOdb provides a user interface in the website for uploading sample information to facilitate the collection and analysis of researchers’ datasets. Currently, ISOdb is the first repository that offers comprehensive resources and convenient public access for hosting, analysing, and visualising Iso-Seq data, which is freely available.


2020 ◽  
Vol 48 (5) ◽  
pp. 2762-2776 ◽  
Author(s):  
Carl J Schiltz ◽  
Myfanwy C Adams ◽  
Joshua S Chappie

Abstract OLD family nucleases contain an N-terminal ATPase domain and a C-terminal Toprim domain. Homologs segregate into two classes based on primary sequence length and the presence/absence of a unique UvrD/PcrA/Rep-like helicase gene immediately downstream in the genome. Although we previously defined the catalytic machinery controlling Class 2 nuclease cleavage, degenerate conservation of the C-termini between classes precludes pinpointing the analogous residues in Class 1 enzymes by sequence alignment alone. Our Class 2 structures also provide no information on ATPase domain architecture and ATP hydrolysis. Here we present the full-length structure of the Class 1 OLD nuclease from Thermus scotoductus (Ts) at 2.20 Å resolution, which reveals a dimerization domain inserted into an N-terminal ABC ATPase fold and a C-terminal Toprim domain. Structural homology with genome maintenance proteins identifies conserved residues responsible for Ts OLD ATPase activity. Ts OLD lacks the C-terminal helical domain present in Class 2 OLD homologs yet preserves the spatial organization of the nuclease active site, arguing that OLD proteins use a conserved catalytic mechanism for DNA cleavage. We also demonstrate that mutants perturbing ATP hydrolysis or DNA cleavage in vitro impair P2 OLD-mediated killing of recBC−Escherichia coli hosts, indicating that both the ATPase and nuclease activities are required for OLD function in vivo.


mSystems ◽  
2018 ◽  
Vol 3 (3) ◽  
Author(s):  
Yi-Chun Yeh ◽  
David M. Needham ◽  
Ella T. Sieradzki ◽  
Jed A. Fuhrman

ABSTRACT Mock communities have been used in microbiome method development to help estimate biases introduced in PCR amplification and sequencing and to optimize pipeline outputs. Nevertheless, the strong value of routine mock community analysis beyond initial method development is rarely, if ever, considered. Here we report that our routine use of mock communities as internal standards allowed us to discover highly aberrant and strong biases in the relative proportions of multiple taxa in a single Illumina HiSeqPE250 run. In this run, an important archaeal taxon virtually disappeared from all samples, and other mock community taxa showed >2-fold high or low abundance, whereas a rerun of those identical amplicons (from the same reaction tubes) on a different date yielded “normal” results. Although obvious from the strange mock community results, we could have easily missed the problem had we not used the mock communities because of natural variation of microbiomes at our site. The “normal” results were validated over four MiSeqPE300 runs and three HiSeqPE250 runs, and run-to-run variation was usually low. While validating these “normal” results, we also discovered that some mock microbial taxa had relatively modest, but consistent, differences between sequencing platforms. We strongly advise the use of mock communities in every sequencing run to distinguish potentially serious aberrations from natural variations. The mock communities should have more than just a few members and ideally at least partly represent the samples being analyzed to detect problems that show up only in some taxa and also to help validate clustering. IMPORTANCE Despite the routine use of standards and blanks in virtually all chemical or physical assays and most biological studies (a kind of “control”), microbiome analysis has traditionally lacked such standards. Here we show that unexpected problems of unknown origin can occur in such sequencing runs and yield completely incorrect results that would not necessarily be detected without the use of standards. Assuming that the microbiome sequencing analysis works properly every time risks serious errors that can be detected by the use of mock communities.


2017 ◽  
Author(s):  
Yi-Chun Yeh ◽  
David M. Needham ◽  
Ella T. Sieradzki ◽  
Jed A. Fuhrman

AbstractMock communities have been used in microbiome method development to help estimate biases introduced in PCR amplification, sequencing, and to optimize pipeline outputs. Nevertheless, the necessity of routine mock community analysis beyond initial method development is rarely, if ever, considered. Here we report that our routine use of mock communities as internal standards allowed us to discover highly aberrant and strong biases in the relative proportions of multiple taxa in a single Illumina HiSeqPE250 run. In this run, an important archaeal taxon virtually disappeared from all samples, and other mock community taxa showed >2-fold high or low abundance, whereas a rerun of those identical amplicons (from the same reaction tubes) on a different date yielded “normal” results. Although obvious from the strange mock community results, due to natural variation of microbiomes at our site, we easily could have missed the problem had we not used the mock communities. The “normal” results were validated over 4 MiSeqPE300 runs and 3 HiSeqPE250 runs, and run-to-run variation was usually low (Bray-Curtis distance was 0.12±0.04). While validating these “normal” results, we also discovered some mock microbial taxa had relatively modest, but consistent, differences between sequencing platforms. We suggest that using mock communities in every sequencing run is essential to distinguish potentially serious aberrations from natural variations. Such mock communities should have more than just a few members and ideally at least partly represent the samples being analyzed, to detect problems that show up only in some taxa, as we observed.ImportanceDespite the routine use of standards and blanks in virtually all chemical or physical assays and most biological studies (a kind of “control”), microbiome analysis has traditionally lacked such standards. Here we show that unexpected problems of unknown origin can occur in such sequencing runs, and yield completely incorrect results that would not necessarily be detected without the use of standards. Assuming that the microbiome sequencing analysis works properly every time risks serious errors that can be avoided by the use of suitable mock communities.


Author(s):  
Junwei Luo ◽  
Yawei Wei ◽  
Mengna Lyu ◽  
Zhengjiang Wu ◽  
Xiaoyan Liu ◽  
...  

Abstract In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.


Sign in / Sign up

Export Citation Format

Share Document