Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data

Kun Sun

doi:10.1093/bioinformatics/btaa171

Bioinformatics ◽

10.1093/bioinformatics/btaa171 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3561-3562 ◽

Cited By ~ 8

Author(s):

Kun Sun

Keyword(s):

Data Preprocessing ◽

Poor Quality ◽

Read Length ◽

Supplementary Information ◽

Sequencing Data ◽

Efficient Tool ◽

Source Codes ◽

Next Generation Sequencing (NGS) ◽

NGS Data ◽

Generation Sequencing

Abstract. Motivation: Next-generation sequencing (NGS) data frequently suffer from poor-quality cycles and adapter contamination and therefore need to be preprocessed before downstream analyses. With the ever-growing throughput and read length of modern sequencers, the preprocessing step has become a bottleneck in data analysis because current tools do not perform well enough. Extra-fast and accurate adapter- and quality-trimming tools for sequencing data preprocessing are therefore still in urgent demand. Results: Ktrim was developed in this work. Key features of Ktrim include built-in support for the adapters of common library preparation kits; support for user-supplied, customized adapter sequences; support for both paired-end and single-end data; and support for parallelization to accelerate the analysis. Ktrim was ∼2–18 times faster than current tools and also showed high accuracy when applied to the testing datasets. Ktrim could thus serve as a valuable and efficient tool for short-read NGS data preprocessing. Availability and implementation: Source code and scripts to reproduce the results described in this article are freely available at https://github.com/hellosunking/Ktrim/, distributed under the GPL v3 license. Contact: [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
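To make the two preprocessing steps named in the abstract concrete, the following is a minimal Python sketch of 3' quality trimming followed by 3' adapter removal. It is not Ktrim's algorithm or interface: the function names, the Phred+33 encoding, the quality cutoff of 20, the minimum overlap of 3 and the 10% mismatch tolerance are all assumptions chosen for illustration.

```python
def trim_quality(seq, qual, min_q=20, offset=33):
    """Clip low-quality bases from the 3' end (Phred+33 encoding assumed)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def find_adapter(seq, adapter, min_overlap=3, max_mismatch_rate=0.1):
    """Return the index where the adapter starts in seq, or len(seq) if absent.

    Every start position is tried from left to right; the adapter may run
    off the 3' end of the read, so only the overlapping part is compared.
    """
    for start in range(len(seq) - min_overlap + 1):
        overlap = min(len(adapter), len(seq) - start)
        mismatches = sum(
            1 for i in range(overlap) if seq[start + i] != adapter[i]
        )
        if mismatches <= max_mismatch_rate * overlap:
            return start
    return len(seq)


def trim_read(seq, qual, adapter):
    """Quality-trim first, then cut the read where the adapter begins."""
    seq, qual = trim_quality(seq, qual)
    cut = find_adapter(seq, adapter)
    return seq[:cut], qual[:cut]


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"          # illustrative adapter prefix
    read = "ACGTACGTAC" + adapter      # 10 bp insert followed by adapter
    quals = "I" * 20 + "#" * 3         # last three cycles are poor quality
    print(trim_read(read, quals, adapter))
```

This brute-force scan costs time proportional to read length times adapter length per read; production trimmers rely on faster matching strategies and on paired-end overlap information, which the sketch does not attempt.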

The ICR96 exon CNV validation series: a resource for orthogonal assessment of exon CNV calling in NGS data

Wellcome Open Research ◽

10.12688/wellcomeopenres.11689.1 ◽

2017 ◽

Vol 2 ◽

pp. 35 ◽

Cited By ~ 7

Author(s):

Shazia Mahamdallie ◽

Elise Ruark ◽

Shawn Yost ◽

Emma Ramsay ◽

Imran Uddin ◽

...

Keyword(s):

Sequencing Data ◽

Targeted Next Generation Sequencing ◽

Negative Results ◽

Targeted NGS ◽

Predisposition Genes ◽

Next Generation Sequencing (NGS) ◽

NGS Data ◽

Validation Series ◽

Generation Sequencing ◽

Dependent Probe

Detection of deletions and duplications of whole exons (exon CNVs) is a key requirement of genetic testing. Accurate detection of this variant type has proved very challenging in targeted next-generation sequencing (NGS) data, particularly if only a single exon is involved. Many different NGS exon CNV calling methods have been developed over the last five years. Such methods are usually evaluated using simulated and/or in-house data due to a lack of publicly available datasets with orthogonally generated results. This hinders tool comparisons, transparency and reproducibility. To provide a community resource for assessment of exon CNV calling methods in targeted NGS data, we here present the ICR96 exon CNV validation series. The dataset includes high-quality sequencing data from a targeted NGS assay (the TruSight Cancer Panel) together with Multiplex Ligation-dependent Probe Amplification (MLPA) results for 96 independent samples. Of these, 66 samples contain at least one validated exon CNV and 30 samples have validated negative results for exon CNVs in 26 genes. The dataset includes 46 exon CNVs in BRCA1, BRCA2, TP53, MLH1, MSH2, MSH6, PMS2, EPCAM or PTEN, giving excellent representation of the cancer predisposition genes most frequently tested in clinical practice. Moreover, the validated exon CNVs include 25 single exon CNVs, the most difficult type of exon CNV to detect. The FASTQ files for the ICR96 exon CNV validation series can be accessed through the European Genome-phenome Archive (EGA) under the accession number EGAS00001002428.

ScaleQC: a scalable lossy to lossless solution for NGS data compression

Bioinformatics ◽

10.1093/bioinformatics/btaa543 ◽

2020 ◽

Vol 36 (17) ◽

pp. 4551-4559 ◽

Cited By ~ 1

Author(s):

Rongshan Yu ◽

Wenxian Yang

Keyword(s):

Lossless Compression ◽

Supplementary Information ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Source Codes ◽

Compression Performance ◽

Data Rates ◽

Quality Value ◽

NGS Data ◽

Bit Stream

Abstract. Motivation: Per-base quality values in next-generation sequencing data take up a significant portion of storage even after compression. Lossy compression technologies could further reduce the space used by quality values. However, in many applications, lossless compression is still desired. Hence, sequencing data in multiple file formats have to be prepared for different applications. Results: We developed a scalable lossy-to-lossless compression solution for quality values named ScaleQC (Scalable Quality value Compression). ScaleQC provides so-called bit-stream level scalability: the bit-stream losslessly compressed by ScaleQC can be further truncated to lower data rates without incurring an expensive transcoding operation. Despite its scalability, ScaleQC still achieves compression performance comparable to that of existing lossless or lossy compressors at both lossless and lossy data rates. Availability and implementation: ScaleQC has been integrated with SAMtools as a special quality value encoding mode for CRAM. Its source code can be obtained from our integrated SAMtools (https://github.com/xmuyulab/samtools) with dependency on integrated HTSlib (https://github.com/xmuyulab/htslib). Supplementary information: Supplementary data are available at Bioinformatics online.
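The idea of a stream that decodes losslessly in full but can be truncated to a lossy version without re-encoding can be illustrated with a toy two-layer scheme. The Python sketch below is not ScaleQC's codec: the layer split, the bin width `step` and the function names are assumptions made only to show the principle, and no entropy coding is performed.

```python
def encode_layers(quals, step=8):
    """Split quality values into a coarse base layer and a residual layer."""
    base = [q // step for q in quals]       # coarse bin index
    residual = [q % step for q in quals]    # position inside the bin
    return base, residual


def decode(base, residual=None, step=8):
    """Decode losslessly with both layers, or lossily from the base alone."""
    if residual is None:
        # Truncated stream: reconstruct each value at the middle of its bin.
        return [b * step + step // 2 for b in base]
    return [b * step + r for b, r in zip(base, residual)]


if __name__ == "__main__":
    quals = [2, 15, 37, 40]
    base, residual = encode_layers(quals)
    print(decode(base, residual))   # full stream: original values
    print(decode(base))             # truncated stream: approximate values
```

Dropping the residual layer corresponds to truncating the stream: the decoder still works, and each reconstructed value differs from the original by at most `step // 2`.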

gpart: human genome partitioning and visualization of high-density SNP data by identifying haplotype blocks

Bioinformatics ◽

10.1093/bioinformatics/btz308 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4419-4421 ◽

Cited By ~ 3

Author(s):

Sun Ah Kim ◽

Myriam Brossard ◽

Delnaz Roshandel ◽

Andrew D Paterson ◽

Shelley B Bull ◽

...

Keyword(s):

Clustering Algorithms ◽

R Package ◽

Supplementary Information ◽

Visualization Tool ◽

Sequencing Data ◽

Haplotype Blocks ◽

SNP Data ◽

Computing Environments ◽

Next Generation Sequencing (NGS) ◽

Generation Sequencing

Abstract. Summary: For the analysis of high-throughput genomic data produced by next-generation sequencing (NGS) technologies, researchers need to identify linkage disequilibrium (LD) structure in the genome. In this work, we developed an R package, gpart, which provides clustering algorithms to define LD blocks or analysis units consisting of SNPs. The visualization tool in gpart can display the LD structure and gene positions for up to 20 000 SNPs in one image. The gpart functions facilitate construction of LD blocks and SNP partitions for vast amounts of genome sequencing data within reasonable time and memory limits in personal computing environments. Availability and implementation: The R package is available at https://bioconductor.org/packages/gpart. Supplementary information: Supplementary data are available at Bioinformatics online.
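LD between two SNPs is commonly summarized by the squared correlation r² of their alleles across haplotypes, and block-finding methods generally operate on such pairwise values. The Python sketch below computes r² for one SNP pair from phased 0/1 haplotypes. It is a generic illustration, not code from gpart (which is an R package), and the function name and input layout are assumptions.

```python
def ld_r2(hap_a, hap_b):
    """Squared correlation r^2 between two biallelic SNPs.

    hap_a, hap_b: equal-length lists of 0/1 alleles, one entry per haplotype.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n                      # allele frequency at SNP A
    p_b = sum(hap_b) / n                      # allele frequency at SNP B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # A monomorphic SNP carries no LD information; report 0 in that case.
    return d * d / denom if denom > 0 else 0.0


if __name__ == "__main__":
    print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated SNPs
    print(ld_r2([0, 0, 1, 1], [0, 1, 0, 1]))  # independent SNPs
```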

Sharing of Very Short IBD Segments between Humans, Neandertals, and Denisovans

10.1101/003988 ◽

2014 ◽

Author(s):

Gundula Povysil ◽

Sepp Hochreiter

Keyword(s):

Gene Flow ◽

Rare Variants ◽

Demographic History ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Chromosome X ◽

False Discovery ◽

Next Generation Sequencing (NGS) ◽

NGS Data ◽

Generation Sequencing

We analyze the sharing of very short identity by descent (IBD) segments between humans, Neandertals, and Denisovans to gain new insights into their demographic history. Short IBD segments convey information about events far back in time because the shorter IBD segments are, the older they are assumed to be. The identification of short IBD segments becomes possible through next generation sequencing (NGS), which offers high variant density and reports variants of all frequencies. However, HapFABIA has only recently been proposed as the first method for detecting very short IBD segments in NGS data. HapFABIA utilizes rare variants to identify IBD segments with a low false discovery rate. We applied HapFABIA to the 1000 Genomes Project whole genome sequencing data to identify IBD segments that are shared within and between populations. Many IBD segments have to be old since they are shared with Neandertals or Denisovans, which explains their shorter lengths compared to segments that are not shared with these ancient genomes. The Denisova genome most prominently matches IBD segments that are shared by Asians. Many of these segments were found exclusively in Asians and they are longer than segments shared between other continental populations and the Denisova genome. Therefore, we could confirm an introgression from Denisovans into ancestors of Asians after their migration out of Africa. While Neandertal-matching IBD segments are most often shared by Asians, Europeans also share a considerably higher percentage of IBD segments with Neandertals than other populations do. Again, many of these Neandertal-matching IBD segments are found exclusively in Asians, whereas Neandertal-matching IBD segments that are shared by Europeans are often found in other populations, too. Neandertal-matching IBD segments that are shared by Asians or Europeans are longer than those observed in Africans.
These IBD segments hint at a gene flow from Neandertals into ancestors of Asians and Europeans after they left Africa. Interestingly, many Neandertal- and/or Denisova-matching IBD segments are predominantly observed in Africans, some of them even exclusively. IBD segments shared between Africans and Neandertals or Denisovans are strikingly short; we therefore assume that they are very old. Consequently, we conclude that DNA regions from ancestors of humans, Neandertals, and Denisovans have survived in Africans. As expected, IBD segments on chromosome X are on average longer than IBD segments on the autosomes. Neandertal-matching IBD segments on chromosome X confirm gene flow from Neandertals into ancestors of Asians and Europeans outside Africa that was already found on the autosomes. Interestingly, there is hardly any signal of Denisova introgression on the X chromosome.

Tissue-associated microbial detection in cancer using human sequencing data

BMC Bioinformatics ◽

10.1186/s12859-020-03831-9 ◽

2020 ◽

Vol 21 (S9) ◽

Author(s):

Rebecca M. Rodriguez ◽

Vedbar S. Khadka ◽

Mark Menor ◽

Brenda Y. Hernandez ◽

Youping Deng

Keyword(s):

Pathogen Detection ◽

Compositional Variation ◽

Sequencing Data ◽

Microbial Composition ◽

Microbial Detection ◽

Beneficial Effects ◽

Treatment And Prevention ◽

Next Generation Sequencing (NGS) ◽

NGS Data ◽

Generation Sequencing

Abstract. Cancer is one of the leading causes of morbidity and mortality worldwide. Microbiological infections account for up to 20% of the total global cancer burden. The human microbiota within each organ system is distinct, and its compositional variation and interactions with the human host are known to have both detrimental and beneficial effects on tumor progression. With the advent of next generation sequencing (NGS) technologies, data generated from NGS are being used for pathogen detection in cancer. Numerous bioinformatics computational frameworks have been developed to study viral information from host-sequencing data and can be adapted to bacterial studies. This review highlights existing popular computational frameworks that use NGS data as input to decipher microbial composition and whose output can predict functional compositional differences with clinically relevant applicability in the development of treatment and prevention strategies.

Introduction to the analysis of next generation sequencing data and its application to venous thromboembolism

Thrombosis and Haemostasis ◽

10.1160/th15-05-0411 ◽

2015 ◽

Vol 114 (11) ◽

pp. 920-932 ◽

Cited By ~ 5

Author(s):

Joost C. M. Meijers ◽

Saskia Middeldorp ◽

Marisa L. R. Cunha

Keyword(s):

Venous Thromboembolism ◽

Next Generation Sequencing ◽

Clinical Care ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

NGS Data Analysis ◽

Next Generation Sequencing (NGS) ◽

NGS Data ◽

Generation Sequencing

Summary. Despite knowledge of various inherited risk factors associated with venous thromboembolism (VTE), no definite cause can be found in about 50% of patients. The application of data-driven searches such as GWAS has not been able to identify genetic variants with implications for clinical care, and unexplained heritability remains. In recent years, the development of several so-called next generation sequencing (NGS) platforms has offered the possibility of generating fast, inexpensive and accurate genomic information. However, so far their application to VTE has been very limited. Here we review basic concepts of NGS data analysis and explore the application of NGS technology to VTE. We provide both computational and biological viewpoints to discuss the potential and challenges of NGS-based studies.

Using earth mover’s distance for viral outbreak investigations

BMC Genomics ◽

10.1186/s12864-020-06982-4 ◽

2020 ◽

Vol 21 (S5) ◽

Author(s):

Andrew Melnyk ◽

Sergey Knyazev ◽

Fredrik Vannberg ◽

Leonid Bunimovich ◽

Pavel Skums ◽

...

Keyword(s):

Experimental Validation ◽

Genetic Relatedness ◽

Viral Population ◽

Sequencing Data ◽

Transmission Characteristics ◽

Outbreak Investigations ◽

Next Generation Sequencing (NGS) ◽

NGS Data ◽

Generation Sequencing ◽

Viral Outbreak

Abstract. Background: RNA viruses mutate at extremely high rates, forming an intra-host viral population of closely related variants, which allows them to evade the host’s immune system and makes them particularly dangerous. Viral outbreaks pose a significant threat to public health, and to deal with them it is critical to infer transmission clusters, i.e., to decide whether two viral samples belong to the same outbreak. Next-generation sequencing (NGS) can significantly help in tackling outbreak-related problems. While NGS data are first obtained as short reads, existing methods rely on assembled sequences. This requires reconstruction of the entire viral population, which is complicated, error-prone and time-consuming. Results: The experimental validation using sequencing data from HCV outbreaks shows that the proposed algorithm can successfully identify genetic relatedness between viral populations, infer transmission direction, transmission clusters and outbreak sources, as well as decide whether the source is present in the sequenced outbreak sample and identify it. Conclusions: The introduced algorithm makes it possible to cluster genetically related samples, infer transmission directions and predict sources of outbreaks. Validation on experimental data demonstrated that the algorithm is able to reconstruct various transmission characteristics. An advantage of the method is its ability to bypass cumbersome read assembly, thus eliminating the chance of introducing new errors and saving processing time by allowing the use of raw NGS reads.

NGSremix: A software tool for estimating pairwise relatedness between admixed individuals from next-generation sequencing data

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab174 ◽

2021 ◽

Author(s):

Anne Krogh Nøhr ◽

Kristian Hanghøj ◽

Genis Garcia Erill ◽

Zilong Li ◽

Ida Moltke ◽

...

Keyword(s):

Next Generation Sequencing ◽

Genetic Research ◽

Likelihood Estimation ◽

Software Tool ◽

Estimation Methods ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

NGS Data ◽

Generation Sequencing

Abstract. Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if this is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below, we show that our method works well and clearly outperforms the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate, all of which perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C++ as multi-threaded software and is freely available on GitHub at https://github.com/KHanghoj/NGSremix.
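Genotype likelihoods express how well each possible genotype explains the observed reads, which is why they retain the uncertainty that a hard genotype call discards at low depth. The Python sketch below uses a deliberately simplified biallelic model with a single symmetric per-base error rate; it is not NGSremix code, and the function names and the error rate of 0.01 are assumptions for illustration.

```python
def genotype_likelihoods(n_ref, n_alt, error=0.01):
    """P(reads | genotype) for genotypes carrying 0, 1 or 2 alt alleles.

    Simplified model: each read samples one of the two chromosomes at random
    and the base is misread as the other allele with probability `error`.
    The binomial coefficient is omitted because it is the same for all
    three genotypes.
    """
    likes = []
    for g in (0, 1, 2):
        p_alt = (g / 2) * (1 - error) + (1 - g / 2) * error
        likes.append((p_alt ** n_alt) * ((1 - p_alt) ** n_ref))
    return likes


def normalize(likes):
    """Scale likelihoods to sum to 1 so they can be compared directly."""
    total = sum(likes)
    return [x / total for x in likes]


if __name__ == "__main__":
    # Depth 1, one alt read: the heterozygote remains quite plausible.
    print(normalize(genotype_likelihoods(n_ref=0, n_alt=1)))
    # Depth 20, all alt reads: the homozygous alt genotype dominates.
    print(normalize(genotype_likelihoods(n_ref=0, n_alt=20)))
```

With a single alt read, the heterozygous genotype keeps about one third of the normalized weight, so calling the site homozygous would overstate what the data support.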

Patient Derived Xenografts for Genome-Driven Therapy of Osteosarcoma

Cells ◽

10.3390/cells10020416 ◽

2021 ◽

Vol 10 (2) ◽

pp. 416

Author(s):

Lorena Landuzzi ◽

Maria Cristina Manara ◽

Pier-Luigi Lollini ◽

Katia Scotlandi

Keyword(s):

Clinical Trials ◽

Tumor Heterogeneity ◽

Functional Studies ◽

NGS Data Analysis ◽

The Many ◽

Orthotopic Xenografts ◽

Next Generation Sequencing (NGS) ◽

NGS Data ◽

Generation Sequencing ◽

Somatic Copy Number Alterations

Osteosarcoma (OS) is a rare malignant primary tumor of mesenchymal origin affecting bone. It is characterized by a complex genotype, mainly due to the high frequency of chromothripsis, which leads to multiple somatic copy number alterations and structural rearrangements. Any effort to design genome-driven therapies must therefore consider such high inter- and intra-tumor heterogeneity. Accordingly, many laboratories and international networks are developing and sharing OS patient-derived xenografts (OS PDXs) to broaden the availability of models that reproduce the complex clinical heterogeneity of OS. OS PDXs, and new cell lines derived from PDXs, faithfully preserve tumor heterogeneity and genetic and epigenetic features, and are thus valuable tools for predicting drug responses. Here, we review recent achievements concerning OS PDXs, summarizing the methods used to obtain ectopic and orthotopic xenografts and to fully characterize these models. The availability of OS PDXs across the many international PDX platforms and their possible use in PDX clinical trials are also described. We recommend the coupling of next-generation sequencing (NGS) data analysis with functional studies in OS PDXs, as well as the setup of OS PDX clinical trials and co-clinical trials, to enhance the predictive power of experimental evidence and to accelerate the clinical translation of effective genome-guided therapies for this aggressive disease.

Mining and Development of Novel SSR Markers Using Next Generation Sequencing (NGS) Data in Plants

Molecules ◽

10.3390/molecules23020399 ◽

2018 ◽

Vol 23 (2) ◽

pp. 399 ◽

Cited By ~ 41

Author(s):

Sima Taheri ◽

Thohirah Lee Abdullah ◽

Mohd Yusop ◽

Mohamed Hanafi ◽

Mahbod Sahebi ◽

...

Keyword(s):

Next Generation Sequencing ◽

SSR Markers ◽

Next Generation ◽

Next Generation Sequencing (NGS) ◽

NGS Data ◽

Generation Sequencing
