Consensify: a method for generating pseudohaploid genome sequences from palaeogenomic datasets with reduced error rates

2018 ◽  
Author(s):  
Axel Barlow ◽  
Stefanie Hartmann ◽  
Javier Gonzalez ◽  
Michael Hofreiter ◽  
Johanna L.A. Paijmans

A standard practice in palaeogenome analysis is the conversion of mapped short read data into pseudohaploid sequences, typically by selecting a single high-quality nucleotide at random from the stack of mapped reads. This controls for biases due to differential sequencing coverage, but it does not control for differential rates and types of sequencing error, which are frequently large and variable in datasets obtained from ancient samples. These errors have the potential to distort phylogenetic and population clustering analyses, and to mislead tests of admixture using D statistics. We introduce Consensify, a method for generating pseudohaploid sequences which controls for biases resulting from differential sequencing coverage while greatly reducing error rates. The error correction is derived directly from the data itself, without the requirement for additional genomic resources or simplifying assumptions such as contemporaneous sampling. For phylogenetic analysis, we find that Consensify is less affected by branch length artefacts than methods based on standard pseudohaploidisation, and it performs similarly for population clustering analysis based on genetic distances. For D statistics, Consensify is more resistant to false positives and appears to be less affected by biases resulting from different laboratory protocols than other available methods. Although Consensify is developed with palaeogenomic data in mind, it is applicable to any low to medium coverage short read dataset. We predict that Consensify will be a useful tool for future studies of palaeogenomes.

Genes ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 50
Author(s):  
Axel Barlow ◽  
Stefanie Hartmann ◽  
Javier Gonzalez ◽  
Michael Hofreiter ◽  
Johanna L. A. Paijmans

A standard practice in palaeogenome analysis is the conversion of mapped short read data into pseudohaploid sequences, frequently by selecting a single high-quality nucleotide at random from the stack of mapped reads. This controls for biases due to differential sequencing coverage, but it does not control for differential rates and types of sequencing error, which are frequently large and variable in datasets obtained from ancient samples. These errors have the potential to distort phylogenetic and population clustering analyses, and to mislead tests of admixture using D statistics. We introduce Consensify, a method for generating pseudohaploid sequences which controls for biases resulting from differential sequencing coverage while greatly reducing error rates. The error correction is derived directly from the data itself, without the requirement for additional genomic resources or simplifying assumptions such as contemporaneous sampling. For phylogenetic and population clustering analysis, we find that Consensify is less affected by artefacts than methods based on single read sampling. For D statistics, Consensify is more resistant to false positives and appears to be less affected by biases resulting from different laboratory protocols than other frequently used methods. Although Consensify is developed with palaeogenomic data in mind, it is applicable to any low to medium coverage short read dataset. We predict that Consensify will be a useful tool for future studies of palaeogenomes.
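The contrast between standard single-read sampling and a Consensify-style consensus call can be sketched in a few lines of Python. This is an illustrative toy only (the function names and the strict-majority rule on a fixed subsample are our own simplification of the idea, not the published implementation): at each site, exactly n reads are subsampled so coverage bias stays controlled, and a base is emitted only when a majority agree, which suppresses isolated sequencing errors.

```python
import random

def single_read_sample(bases):
    """Standard pseudohaploidisation: one random base from the read stack."""
    return random.choice(bases)

def consensify_base(bases, n=3):
    """Toy Consensify-style call (our simplification, not the published
    implementation): subsample exactly n reads so coverage bias stays
    controlled, and emit a base only if a strict majority agree;
    otherwise return 'N'. Isolated errors rarely win a majority."""
    if len(bases) < n:
        return "N"                       # too little coverage to subsample
    sample = random.sample(bases, n)
    for base in set(sample):
        if sample.count(base) > n // 2:  # strict majority, e.g. 2 of 3
            return base
    return "N"                           # no consensus: likely an error
```

With a stack like `["A", "A", "A", "G"]`, any subsample of three reads contains a majority of A, so the single `G` (a plausible deamination or sequencing error) can never be emitted, whereas single-read sampling would pick it a quarter of the time.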


2017 ◽  
Author(s):  
Hajime Suzuki ◽  
Masahiro Kasahara

Abstract Motivation: Pairwise alignment of nucleotide sequences has previously been carried out using the seed-and-extend strategy, where we enumerate seeds (shared patterns) between sequences and then extend the seeds by Smith-Waterman-like semi-global dynamic programming to obtain full pairwise alignments. With the advent of massively parallel short read sequencers, algorithms and data structures for efficiently finding seeds have been extensively explored. However, recent advances in single-molecule sequencing technologies have enabled us to obtain millions of reads, each of which is orders of magnitude longer than those output by the short-read sequencers, demanding a faster algorithm for the extension step that accounts for most of the computation time required for pairwise local alignment. Our goal is to design a faster extension algorithm suitable for single-molecule sequencers with high sequencing error rates (e.g., 10-15%) and with more frequent insertions and deletions than substitutions. Results: We propose an adaptive banded dynamic programming algorithm for calculating pairwise semi-global alignment of nucleotide sequences that allows a relatively high insertion or deletion rate while keeping the band width relatively low (e.g., 32 or 64 cells) regardless of sequence lengths. Our new algorithm eliminated mutual dependences between elements in a vector, allowing an efficient Single-Instruction-Multiple-Data (SIMD) parallelization. We experimentally demonstrate that our algorithm runs approximately 5× faster than the extension alignment algorithm in NCBI BLAST+ while retaining similar sensitivity (recall). We also show that our extension algorithm is more sensitive than the extension alignment routine in DALIGNER, while the computation time is comparable. Availability: The implementation of the algorithm and the benchmarking scripts are available at https://github.com/ocxtal/[email protected]
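The extension step being accelerated is a banded dynamic programming recurrence. As a minimal illustration (a fixed band around the main diagonal, not the paper's adaptive band, and without the SIMD vectorization), a banded edit-distance computation can be sketched as:

```python
def banded_edit_distance(a, b, band=32):
    """Edit distance restricted to a band of +/- `band` cells around the
    main diagonal. Cells outside the band stay at infinity and are never
    relaxed, so time is O(len(a) * band) instead of O(len(a) * len(b))."""
    INF = float("inf")
    # prev[j] = banded distance between the current prefix of a and b[:j]
    prev = [j if j <= band else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [INF] * (len(b) + 1)
        lo, hi = max(0, i - band), min(len(b), i + band)
        if lo == 0:
            cur[0] = i
        for j in range(max(1, lo), hi + 1):
            cur[j] = min(
                prev[j - 1] + (a[i - 1] != b[j - 1]),  # match / substitution
                cur[j - 1] + 1,                        # insertion
                prev[j] + 1,                           # deletion
            )
        prev = cur
    return prev[len(b)]
```

The paper's contribution over this sketch is twofold: the band follows the alignment path adaptively (so long indel runs do not escape a diagonal-centred band), and the recurrence is reorganised so elements within a vector have no mutual dependences, allowing each anti-diagonal to be computed with SIMD instructions.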


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Tao Jiang ◽  
Shiqi Liu ◽  
Shuqi Cao ◽  
Yadong Liu ◽  
Zhe Cui ◽  
...  

Abstract Background With the rapid development of long-read sequencing technologies, it is possible to reveal the full spectrum of genetic structural variation (SV). However, the high cost, finite read length, and high sequencing error of long-read data greatly limit the widespread adoption of SV calling. Therefore, it is urgent to establish guidance concerning sequencing coverage, read length, and error rate to maintain high SV yields while achieving the lowest possible cost. Results In this study, we generated a full range of simulated error-prone long-read datasets covering various sequencing settings and comprehensively evaluated the performance of SV calling with state-of-the-art long-read SV detection methods. The benchmark results demonstrate that almost all SV callers perform better when the long-read data reach 20× coverage, 20 kbp average read length, and error rates of approximately 7.5–10% or below 1%. Furthermore, high sequencing coverage is the most influential factor in promoting SV calling, although it is also the main driver of cost. Conclusions Based on the comprehensive evaluation results, we provide important guidelines for selecting long-read sequencing settings for efficient SV calling. We believe these recommended long-read sequencing settings will have significant guiding value in cutting-edge genomic studies and clinical practice.
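As a toy illustration only (the helper name and exact cutoffs are our paraphrase of the benchmark's headline numbers, not code from the study), the recommended settings can be encoded as a simple design check:

```python
def meets_recommended_settings(coverage, mean_read_len_bp, error_rate):
    """Hypothetical helper: True if a long-read sequencing design meets
    the benchmark's suggested thresholds for reliable SV calling:
    >= 20x coverage, >= 20 kbp mean read length, and an error rate of
    roughly 10% or better (CCS-like reads below 1% comfortably qualify)."""
    return (coverage >= 20
            and mean_read_len_bp >= 20_000
            and error_rate <= 0.10)
```

Because coverage is both the most influential factor for SV yield and the main cost driver, a realistic planning tool would trade the coverage threshold off against a sequencing budget rather than treat it as a hard cutoff.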


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Kelley Paskov ◽  
Jae-Yoon Jung ◽  
Brianna Chrisman ◽  
Nate T. Stockham ◽  
Peter Washington ◽  
...  

Abstract Background As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since it allows us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. Results We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. 
Conclusion Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
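The core signal this method exploits, Mendelian-inconsistent genotype calls within a family, can be sketched directly. This toy is our own simplification, not the authors' pipeline (which extrapolates from observed violations to per-sample, genome-wide precision and recall): it flags biallelic trio genotypes, coded as alt-allele counts 0/1/2, that no combination of parental alleles can produce.

```python
def mendelian_error_rate(trios):
    """Fraction of (mother, father, child) biallelic genotype triples,
    coded as alt-allele counts 0/1/2, that violate Mendelian inheritance.
    A toy sketch of the signal, not a full error-rate estimator."""
    def gametes(g):
        # alleles a parent with genotype g can transmit (0 = ref, 1 = alt)
        return {0: {0}, 1: {0, 1}, 2: {1}}[g]
    def violates(mom, dad, kid):
        # the child must receive one maternal and one paternal allele
        return kid not in {m + d for m in gametes(mom) for d in gametes(dad)}
    return sum(violates(*t) for t in trios) / len(trios)
```

Note that only a fraction of sequencing errors surface as Mendelian violations (an error can yield a genotype that is still consistent with the parents, and heterozygous-by-heterozygous sites are never informative), which is why the full method models the error process rather than reporting this raw fraction.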


Animals ◽  
2021 ◽  
Vol 11 (11) ◽  
pp. 3186
Author(s):  
Eunkyung Choi ◽  
Sun Hee Kim ◽  
Seung Jae Lee ◽  
Euna Jo ◽  
Jinmu Kim ◽  
...  

Trematomus loennbergii Regan, 1913, is an evolutionarily important marine fish species distributed in the Antarctic Ocean. However, its genome has not been studied to date. In the present study, whole genome sequencing was performed using next-generation sequencing (NGS) technology to characterize its genome and to develop genomic microsatellite markers. Of the k-mer frequency distributions examined, the 25-mer distribution gave the best estimate, predicting a genome size of 815,042,992 bp. The heterozygosity, average rate of read duplication, and sequencing error rate were 0.536%, 0.724%, and 0.292%, respectively. These data were used to identify microsatellite markers, and a total of 2,264,647 repeat motifs were found. The most frequent repeat motif was the di-nucleotide, at 87.00%, followed by tri-nucleotide (10.45%), tetra-nucleotide (1.94%), penta-nucleotide (0.34%), and hexa-nucleotide (0.27%) motifs. The AC repeat was the most abundant motif among di-nucleotides and among all repeat motifs. Of these microsatellite markers, 181 were selected, and PCR was used to validate several of them; a total of 15 markers produced only one band. In summary, these results provide a good basis for further studies, including evolutionary biology and population genetics of Antarctic fish species.
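Scanning assembled sequence for perfect tandem repeat motifs of the kind tallied above can be sketched with a backreference regex. This is a toy (real marker-discovery pipelines such as those used for studies like this also handle compound and imperfect repeats and screen flanking regions for primer design):

```python
import re

def find_microsatellites(seq, motif_len=2, min_repeats=5):
    """Toy scan for perfect microsatellites: a motif of `motif_len` bases
    tandemly repeated at least `min_repeats` times. Returns a list of
    (start_position, motif, repeat_count) tuples."""
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_repeats - 1))
    return [(m.start(), m.group(1), len(m.group(0)) // motif_len)
            for m in pattern.finditer(seq)]
```

For example, scanning `"TTACACACACACGG"` with the defaults reports a single AC di-nucleotide repeat of five copies starting at position 2, the motif class this genome is dominated by.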


2016 ◽  
Vol 11 (2) ◽  
pp. 137 ◽  
Author(s):  
Melta Rini Fahmi ◽  
Anjang Bangun Prasetio ◽  
Ruby Vidia Kusumah ◽  
Erma Primanita Hayuningtyas ◽  
Idil Ardi

Peat waters are unique ecosystems with high fish biodiversity, and most of the species they harbour have potential as ornamental fish. This research was conducted to identify and analyse the genetic diversity, genetic characters, genetic distances, and phylogenetic relationships of the fish inhabiting the peat waters of the Bukit Batu Biosphere Reserve, Riau Province. The first stage of the study was morphological identification of 29 collected fish with potential as ornamental species. The subsequent stages were the amplification and alignment of 675 bp (base pairs) from 90 partial sequences of cytochrome c oxidase subunit 1 (COI). The results showed that the fish identified on the basis of COI could be classified into six families: Balontidae with three species (12.5%), Cyprinidae with 13 species (54.17%), Cobitidae with one species (4.17%), Siluridae with two species (8.3%), Datnoidae with one species (4.17%), and Bagridae with four species (16.67%). Some clusters showed intra-species genetic divergence of more than 3%. Phylogenetic and clustering analysis based on the COI gene showed that all OTUs (Operational Taxonomic Units) had high bootstrap support of 87–99.
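The intra-species divergence figures reported above are typically uncorrected p-distances between aligned COI sequences. A minimal sketch, assuming pre-aligned sequences of equal length (barcoding studies often use a Kimura 2-parameter correction instead, which this toy omits):

```python
def p_distance(seq1, seq2):
    """Uncorrected p-distance between two aligned sequences: the fraction
    of compared sites that differ, skipping alignment gaps ('-').
    Assumes the sequences are pre-aligned and share some ungapped sites."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    differences = sum(a != b for a, b in pairs)
    return differences / len(pairs)
```

A pair of conspecific barcodes with `p_distance > 0.03` would be flagged under the common 3% rule of thumb as possible cryptic diversity, which is the criterion applied to the clusters above.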


Zootaxa ◽  
2011 ◽  
Vol 2906 (1) ◽  
pp. 52
Author(s):  
XIAOQIANG LI ◽  
BINGZHONG REN ◽  
YUTING ZOU ◽  
JIAN ZHANG ◽  
YINLIANG WANG

The present study compares proventricular morphology, analyzed under optical microscopy and scanning electron microscopy (SEM), among ten Grylloidea species. The results showed that the size of the proventriculus was of critical taxonomic value. Internally, the main differences were the number of sclerotized appendices (sa), middle denticles (md), and lateral denticles (ld), and the structure of the lateral teeth (lt). In addition, we analyzed the crickets' feeding habits and note that the proventriculus possesses highly sclerotized projections which act in the selection of food. The morphology of the proventriculus is closely related to feeding habits. A clustering analysis based on seven features of the proventriculus was performed; it revealed that the proventriculus is significant for taxonomy and for inferring species relationships. Observations on the characterization of proventricular morphology will be useful in future studies of the feeding habits and phylogeny of crickets.


2012 ◽  
Vol 13 (1) ◽  
pp. 185 ◽  
Author(s):  
Xin Victoria ◽  
Natalie Blades ◽  
Jie Ding ◽  
Razvan Sultana ◽  
Giovanni Parmigiani

Author(s):  
Wing Shan Chan ◽  
Jan-Louis Kruger ◽  
Stephen Doherty

The addition of subtitles to videos has the potential to benefit students across the globe in a context where online video lectures have become a major channel for learning, particularly because, for many, language poses a barrier to learning. Automated subtitling, created with the use of speech-recognition software, may be a powerful way to make this a scalable and affordable solution. However, in the absence of thorough post-editing by human subtitlers, this mode of subtitling often results in serious errors that arise from problems with speech recognition, accuracy, segmentation and presentation speed. This study therefore aims to investigate the impact of automated subtitling on student learning in a sample of English first- and second-language speakers. Our results show that high error rates and high presentation speeds reduce the potential benefit of subtitles. These findings provide an important foundation for future studies on the use of subtitles in education.


2021 ◽  
Author(s):  
Barış Ekim ◽  
Bonnie Berger ◽  
Rayan Chikhi

DNA sequencing data continue to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where minimizers, rather than DNA nucleotides, are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers: k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of-magnitude improvements in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three use cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low-memory, and highly contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we newly allow a graphical representation of a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics, and pangenomics.
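The projection from nucleotide-space into minimizer-space can be sketched simply. This toy uses lexicographic order to pick each window's minimizer (production tools, including minimizer-space approaches like the one described here, typically use a random hash order) and then forms k-min-mers over the resulting token list:

```python
def minimizers(seq, w=5, m=3):
    """In every window of w consecutive m-mers, keep the smallest m-mer
    (lexicographic order here as a stand-in for a hash order). Consecutive
    windows selecting the same occurrence contribute it only once."""
    mmers = [seq[i:i + m] for i in range(len(seq) - m + 1)]
    picked = []
    for i in range(len(mmers) - w + 1):
        window = mmers[i:i + w]
        j = i + window.index(min(window))   # position of the window minimum
        if not picked or picked[-1][0] != j:
            picked.append((j, mmers[j]))
    return [mm for _, mm in picked]

def k_min_mers(minimizer_seq, k=3):
    """k-mers over the minimizer alphabet: the atomic tokens of a
    minimizer-space de Bruijn graph (mdBG)."""
    return [tuple(minimizer_seq[i:i + k])
            for i in range(len(minimizer_seq) - k + 1)]
```

Because a read of thousands of bases collapses to a short list of minimizer tokens, overlap and de Bruijn graph operations in minimizer-space touch orders of magnitude fewer symbols than their nucleotide-space counterparts, which is the source of the speed and memory gains described above.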

