gencore: An Efficient Tool to Generate Consensus Reads for Error Suppressing and Duplicate Removing of NGS data

2018 ◽  
Author(s):  
Shifu Chen ◽  
Yanqing Zhou ◽  
Yaru Chen ◽  
Tanxiao Huang ◽  
Wenting Liao ◽  
...  

Abstract Background Removing duplicates might be considered a well-resolved problem in the next-generation sequencing (NGS) data processing domain. However, as NGS technology gains more recognition in clinical applications (e.g. cancer testing), researchers have started to pay more attention to its sequencing errors, and prefer to remove these errors while performing deduplication. Recently, a new technology called the unique molecular identifier (UMI) has been developed to better identify sequencing reads derived from different DNA fragments. Most existing duplicate-removing tools cannot handle UMI-integrated data. Some modern tools can work with UMIs, but are usually slow and consume too much memory, making them unsuitable for cloud-based deployment. Furthermore, existing tools rarely report rich statistical results, which are very important for quality control and downstream analysis. These unmet requirements drove us to develop an ultra-fast, simple, lightweight but powerful tool for duplicate removal and sequencing-error suppression, with features for handling UMIs and reporting informative results. Results This paper presents gencore, an efficient tool for duplicate removal and sequencing-error suppression in NGS data. The tool clusters the mapped sequencing reads and merges the reads in each cluster to generate a single consensus read. While the consensus read is generated, the random errors introduced by library construction and sequencing can be removed. This error-suppressing feature makes gencore well suited to detecting ultra-low-frequency mutations in deep sequencing data. When unique molecular identifier (UMI) technology is applied, gencore can use the UMIs to identify reads derived from the same original DNA fragment. gencore reports statistical results in both HTML and JSON formats. The HTML report contains many interactive figures plotting statistical coverage and duplication information.
The JSON report contains all the statistical results and is interpretable by downstream programs. Conclusions Compared to conventional tools like Picard and SAMtools, gencore greatly reduces the output data’s mapping mismatches, which are mostly caused by errors. Compared to newer tools like UMI-Reducer and UMI-tools, gencore runs much faster, uses less memory, generates better consensus reads and provides simpler interfaces. To the best of our knowledge, gencore is the only duplicate-removing tool that generates both informative HTML and JSON reports. This tool is available at: https://github.com/OpenGene/gencore
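The consensus idea described above can be sketched as a per-position majority vote over a cluster of duplicate reads. This is a minimal illustration only, assuming equal-length reads and ignoring base qualities; gencore's actual algorithm also uses quality scores and UMI-aware clustering.

```python
from collections import Counter

def consensus_read(reads):
    """Merge duplicate reads into one consensus by per-position majority vote.
    Simplified sketch of the consensus idea, not gencore's real implementation."""
    assert reads and all(len(r) == len(reads[0]) for r in reads)
    consensus = []
    for bases in zip(*reads):
        # A random sequencing error in one read is outvoted by the others.
        base, _count = Counter(bases).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

# Three duplicate reads; the middle one carries a sequencing error (G -> T).
dups = ["ACGTACGT", "ACGTACTT", "ACGTACGT"]
print(consensus_read(dups))  # ACGTACGT
```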


2020 ◽  
Vol 36 (11) ◽  
pp. 3561-3562 ◽  
Author(s):  
Kun Sun

Abstract Motivation Next-generation sequencing (NGS) data frequently suffer from poor-quality cycles and adapter contamination, and therefore need to be preprocessed before downstream analyses. With the ever-growing throughput and read length of modern sequencers, the preprocessing step has become a bottleneck in data analysis due to the insufficient performance of current tools. Extra-fast and accurate adapter- and quality-trimming tools for sequencing data preprocessing are therefore still in urgent demand. Results Ktrim was developed in this work. Key features of Ktrim include: built-in support for the adapters of common library preparation kits; support for user-supplied, customized adapter sequences; support for both paired-end and single-end data; and support for parallelization to accelerate the analysis. Ktrim was ∼2–18 times faster than current tools and also showed high accuracy when applied to the test datasets. Ktrim can thus serve as a valuable and efficient tool for short-read NGS data preprocessing. Availability and implementation Source code and scripts to reproduce the results described in this article are freely available at https://github.com/hellosunking/Ktrim/, distributed under the GPL v3 license. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
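The core of 3' adapter trimming can be illustrated with a short sketch: the adapter may appear in full inside the read, or only its prefix may hang off the read's 3' end. This is a simplified, exact-match illustration of the general technique, not Ktrim's implementation, which is written in C++ and tolerates mismatches.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Trim a 3' adapter from a read, allowing a partial adapter prefix
    at the read end. Exact-match sketch of the general idea only."""
    # Case 1: the full adapter occurs inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: only an adapter prefix overhangs the 3' end of the read.
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read  # no adapter found

# Read ending in a partial (8 bp) copy of a hypothetical adapter sequence.
print(trim_adapter("ACGTACGTAGATCGGA", "AGATCGGAAGAGC"))  # ACGTACGT
```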


2020 ◽  
Vol 15 ◽  
Author(s):  
Muhammad Tahir ◽  
Muhammad Sardaraz ◽  
Zahid Mehmood ◽  
Muhammad Saud Khan

Aims: To assess the error profile of NGS data generated by high-throughput sequencing machines. Background: Short-read sequencing data from Next Generation Sequencing (NGS) are presently being generated by a number of research projects. Characterizing the errors produced by NGS platforms and deriving accurate genetic variation from reads are two inter-dependent phases. This has high significance in various analyses such as genome sequence assembly, SNP calling, evolutionary studies, and haplotype inference. The systematic and random errors show a distinct incidence profile for each sequencing platform, i.e. Illumina sequencing, Pacific Biosciences, 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Ion Torrent sequencing, and Oxford Nanopore sequencing. Advances in NGS deliver enormous volumes of data, yet with the addition of errors. Some of these errors may mimic genuine biological signals, i.e. mutations, and may consequently distort the results. Various independent applications have been proposed to correct sequencing errors. Systematic analysis of these algorithms shows that state-of-the-art models are missing. Objective: In this paper, we propose an efficient error-estimation computational model called ESREEM to assess error rates in NGS data. Methods: The proposed model builds on the observation that there exists a linear regression association between the number of reads containing errors and the number of reads sequenced. The model is based on a probabilistic error model integrated with a Hidden Markov Model (HMM). Results: The proposed model is evaluated on several benchmark datasets and the results obtained are compared with state-of-the-art algorithms. Conclusions: Analysis of the experimental results shows that the proposed model estimates errors efficiently and runs in less time than the others.
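The linear association the model assumes can be sketched with an ordinary least-squares fit between reads sequenced (x) and reads containing errors (y). The counts below are hypothetical, and this shows only the regression step; ESREEM's full model couples it with a probabilistic error model and an HMM, which are not reproduced here.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit y = a*x + b, illustrating the linear
    relationship between reads sequenced and reads containing errors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx  # slope, intercept

# Hypothetical counts: error-containing reads grow roughly linearly.
reads = [1e5, 2e5, 4e5, 8e5]
errors = [1100, 2050, 4020, 7990]
slope, intercept = fit_linear(reads, errors)
print(round(slope * 1e5))  # ~986 error-containing reads expected per 100k
```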


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Gundula Povysil ◽  
Monika Heinzl ◽  
Renato Salazar ◽  
Nicholas Stoler ◽  
Anton Nekrutenko ◽  
...  

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low-frequency DNA variants; it groups sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we developed a tool that re-examines variant calls from raw reads and provides summary data that categorizes the confidence level of a variant call by a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
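The family-building step described above can be sketched as grouping reads by molecular tag and strand: a tag yields a duplex consensus sequence (DCS) only when it is observed on both strands, and single-strand or singleton families are the reads the toolset tries to rescue. The strand labels "ab"/"ba" and the tiny read tuples are illustrative assumptions; real pipelines also correct errors within the tags themselves.

```python
from collections import defaultdict

def duplex_families(reads):
    """Group (tag, strand, sequence) reads into families, and report which
    tags can form a DCS (seen on both strands). Simplified sketch only."""
    families = defaultdict(list)
    for tag, strand, seq in reads:
        families[(tag, strand)].append(seq)
    dcs_tags = {tag for tag, _strand in families
                if (tag, "ab") in families and (tag, "ba") in families}
    return families, dcs_tags

reads = [
    ("AACG", "ab", "ACGT"), ("AACG", "ab", "ACGT"),
    ("AACG", "ba", "ACGT"),
    ("TTGC", "ab", "GGTT"),  # no partner strand -> no DCS, data lost
]
fams, dcs = duplex_families(reads)
print(sorted(dcs))  # ['AACG']
```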


Author(s):  
Anne Krogh Nøhr ◽  
Kristian Hanghøj ◽  
Genis Garcia Erill ◽  
Zilong Li ◽  
Ida Moltke ◽  
...  

Abstract Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if it is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data, from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum-likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below, we show that our method works well and clearly outperforms the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate, which all perform quite poorly in this setting. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C++ as multi-threaded software and is freely available on GitHub at https://github.com/KHanghoj/NGSremix.
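Why genotype likelihoods matter at low depth can be illustrated with a toy binomial read-count model: at 4x, the data often cannot distinguish genotypes confidently, so hard calls discard information that a likelihood-based method retains. This sketch uses a simple symmetric error model of my own choosing, not NGSremix's actual likelihood computation, which works from per-base qualities.

```python
from math import comb

def genotype_likelihoods(n_ref, n_alt, err=0.01):
    """Likelihoods of the read counts under genotypes (RR, RA, AA) using a
    binomial model with symmetric base-error rate `err`. Toy sketch only."""
    n = n_ref + n_alt
    def binom(p):  # P(n_alt alt reads | prob p of sampling an alt base)
        return comb(n, n_alt) * p**n_alt * (1 - p)**(n - n_alt)
    return [binom(err), binom(0.5), binom(1 - err)]

# At 4x depth with 2 ref + 2 alt reads, the heterozygote is most likely,
# but the homozygotes retain non-zero likelihood -- the uncertainty that
# likelihood-based methods propagate instead of calling a hard genotype.
lik = genotype_likelihoods(2, 2)
print(max(range(3), key=lambda g: lik[g]))  # 1 (heterozygous)
```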


Author(s):  
Walter Anderson ◽  
Constantine Ciocanel ◽  
Mohammad Elahinia

Engine vibration has motivated a great deal of research on isolation. Traditionally, isolation was achieved through pure elastomeric (rubber) mounts. However, with advances in vehicle technology, these types of mounts have become inadequate. The inadequacy stems from the vibration profile associated with the engine, i.e. high displacement at low frequency and small displacement at high frequency. Ideal isolation would be achieved through a stiff mount at low frequency and a soft mount at high frequency. This is contrary to the performance of elastomeric mounts. Hydraulic mounts were then developed to address this problem. A hydraulic mount has variable stiffness and damping due to its use of a decoupler and an inertia track. However, further advances in vehicle technology have rendered these mounts inadequate as well. Examples of these advances are hybridization (electric and hydraulic) and cylinder-on-demand technologies (VCM, MDS and ACC). With these technologies, the vibration excitation has a significantly different profile, occurs over a wide range of frequencies, and calls for a new technology that can address this need. Magnetorheological (MR) fluid is a smart material that changes viscosity in the presence of a magnetic field. With MR fluid, variable damping and stiffness can be achieved. An MR mount has been developed and tested. The performance of the mount depends on the geometry of the rubber part as well as the behavior of the MR fluid. The rubber top of the mount is the topic of this study due to its major impact on the isolation characteristics of the MR mount. To develop a design methodology that addresses the isolation needs of different hybrid vehicles, a geometric parametric finite element analysis has been completed and is presented in this paper.


2017 ◽  
Vol 2 ◽  
pp. 35 ◽  
Author(s):  
Shazia Mahamdallie ◽  
Elise Ruark ◽  
Shawn Yost ◽  
Emma Ramsay ◽  
Imran Uddin ◽  
...  

Detection of deletions and duplications of whole exons (exon CNVs) is a key requirement of genetic testing. Accurate detection of this variant type has proved very challenging in targeted next-generation sequencing (NGS) data, particularly if only a single exon is involved. Many different NGS exon CNV calling methods have been developed over the last five years. Such methods are usually evaluated using simulated and/or in-house data, due to a lack of publicly available datasets with orthogonally generated results. This hinders tool comparisons, transparency and reproducibility. To provide a community resource for the assessment of exon CNV calling methods in targeted NGS data, we here present the ICR96 exon CNV validation series. The dataset includes high-quality sequencing data from a targeted NGS assay (the TruSight Cancer Panel) together with Multiplex Ligation-dependent Probe Amplification (MLPA) results for 96 independent samples. 66 samples contain at least one validated exon CNV and 30 samples have validated negative results for exon CNVs in 26 genes. The dataset includes 46 exon CNVs in BRCA1, BRCA2, TP53, MLH1, MSH2, MSH6, PMS2, EPCAM or PTEN, giving excellent representation of the cancer predisposition genes most frequently tested in clinical practice. Moreover, the validated exon CNVs include 25 single-exon CNVs, the most difficult type of exon CNV to detect. The FASTQ files for the ICR96 exon CNV validation series can be accessed through the European Genome-phenome Archive (EGA) under the accession number EGAS00001002428.


2020 ◽  
Author(s):  
Shuo Li ◽  
Zorawar Noor ◽  
Weihua Zeng ◽  
Xiaohui Ni ◽  
Zuyang Yuan ◽  
...  

Abstract Liquid biopsy using cell-free DNA (cfDNA) is attractive for a wide range of clinical applications, including cancer detection, localization, and monitoring. However, developing these applications requires precise and sensitive calling of somatic single nucleotide variations (SNVs) from cfDNA sequencing data. To date, no SNV caller addresses all the special challenges of cfDNA to provide reliable results. Here we present cfSNV, a revolutionary somatic SNV caller with five innovative techniques to overcome and exploit the unique properties of cfDNA. cfSNV provides hierarchical mutation profiling, thanks to cfDNA’s complete coverage of the clonal landscape, and multi-layer error suppression. On both simulated datasets and real patient data, we demonstrate that cfSNV is superior to existing tools, especially for low-frequency somatic SNVs. We also show how the five novel techniques contribute to its performance. Further, we demonstrate a clinical application in which cfSNV is used to select non-small-cell lung cancer patients for immunotherapy treatment.


Author(s):  
Kun Xie ◽  
Kang Liu ◽  
Haque A K Alvi ◽  
Yuehui Chen ◽  
Shuzhen Wang ◽  
...  

Copy number variation (CNV) is a well-known type of genomic mutation that is associated with the development of human cancers. Detection of CNVs in the human genome is a crucial step in the pipeline from mutation analysis to cancer diagnosis and treatment. Next-generation sequencing (NGS) data provide an unprecedented opportunity for CNV detection at base-level resolution, and many methods have now been developed for CNV detection using NGS data. However, due to the intrinsic complexity of CNV structures and of NGS data itself, accurate detection of CNVs still faces many challenges. In this paper, we present an alternative method, called KNNCNV (K-Nearest Neighbor based CNV detection), for the detection of CNVs using NGS data. Compared to current methods, KNNCNV has several distinctive features: 1) it assigns an outlier score to each genome segment based solely on its first k nearest-neighbor distances, which is not only easy to extend to other data types but also improves the power of discovering CNVs, especially local CNVs that are likely to be masked by their surrounding regions; 2) it employs the variational Bayesian Gaussian mixture model (VBGMM) to transform these scores into a series of binary labels without a user-defined threshold. To evaluate the performance of KNNCNV, we conducted both simulation and real-sequencing-data experiments and made comparisons with peer methods. The experimental results show that KNNCNV achieves better performance than the others in terms of F1-score.
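The first of the two features above, the k-nearest-neighbor outlier score, can be sketched as the mean distance from each segment's value to its k closest neighbors. The per-segment read depths below are hypothetical, and only this scoring step is shown; KNNCNV then feeds such scores to a variational Bayesian Gaussian mixture model, which is omitted here.

```python
def knn_scores(values, k=2):
    """Outlier score per genome segment = mean distance to its k nearest
    neighbors (1-D read-depth values here). Brute-force sketch of the idea."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical per-segment read depths; segment 3 is a deletion-like dip
# that stands out locally even though most segments hover around 30x.
depths = [30.0, 31.0, 29.5, 12.0, 30.5, 30.2]
scores = knn_scores(depths)
print(scores.index(max(scores)))  # 3
```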


Author(s):  
Florin IMBREA ◽  
Branko MARINCOVIC ◽  
Valeriu TABĂRĂ ◽  
PAUL PÎRŞAN ◽  
Gheorghe DAVID ◽  
...  

Experimenting with a new technology for cultivating maize is an important step forward in optimising the yielding capacity of a crop that ranks second among crops cultivated worldwide and first among crops cultivated in Romania. Using low-frequency radiation to stimulate yield and quality in maize allows yield increases of between 10 and 15% compared to the classical cultivation method, and an improvement of the quality indicators: protein content increased by 6-11%, determining an increase of the protein yield per ha; starch content increased by 7-14%, which also determined an increase of the starch yield per ha; while fat content, another indicator we monitored, increased by 2-6%.

