WBFQC: A new approach for compressing next-generation sequencing data splitting into homogeneous streams

2018 ◽  
Vol 16 (05) ◽  
pp. 1850018 ◽  
Author(s):  
Sanjeev Kumar ◽  
Suneeta Agarwal ◽  
Ranvijay

Genomic data now play a vital role in a number of fields, such as personalized medicine, forensics, drug discovery, sequence alignment and agriculture. With advances in next-generation sequencing (NGS) technology and reductions in its cost, these data are growing exponentially. NGS data are being generated more rapidly than they can be meaningfully analyzed. Thus, there is much scope for developing novel data compression algorithms that directly facilitate data analysis as well as data transfer and storage. An innovative compression technique is proposed here to address the problem of transmitting and storing large NGS data. This paper presents a lossless, non-reference-based FastQ file compression approach that segregates the data into three different streams and then applies an appropriate and efficient compression algorithm to each. Experiments show that the proposed approach (WBFQC) outperforms other state-of-the-art approaches for compressing NGS data in terms of compression ratio (CR) and compression and decompression time. It also offers random access over compressed genomic data. An open source FastQ compression tool is also provided (http://www.algorithm-skg.com/wbfqc/home.html).
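The stream-splitting idea can be illustrated with a minimal sketch. It assumes the three streams are read identifiers, nucleotide sequences and quality scores, and it uses the standard-library `zlib` codec as a stand-in for the stream-specific algorithms, which the abstract does not name. The function names are hypothetical and this is not the WBFQC implementation. It also assumes each record's `+` line is bare, so that the round trip is lossless.

```python
import zlib

def split_fastq(text):
    """Split FASTQ text into identifier, sequence and quality streams."""
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or plus != "+":
            raise ValueError("malformed record at line %d" % (i + 1))
        ids.append(header[1:])   # drop the leading '@'
        seqs.append(seq)
        quals.append(qual)
    return ids, seqs, quals

def compress_streams(ids, seqs, quals):
    """Compress each stream separately (zlib is only a placeholder codec)."""
    streams = {"ids": ids, "seqs": seqs, "quals": quals}
    return {
        name: zlib.compress("\n".join(items).encode("ascii"), 9)
        for name, items in streams.items()
    }

def decompress_streams(blobs):
    """Rebuild the original FASTQ text from the three compressed streams."""
    ids, seqs, quals = (
        zlib.decompress(blobs[name]).decode("ascii").split("\n")
        for name in ("ids", "seqs", "quals")
    )
    return "".join(
        "@%s\n%s\n+\n%s\n" % (i, s, q) for i, s, q in zip(ids, seqs, quals)
    )
```

Keeping the streams separate means each one contains statistically similar symbols, which is what lets a per-stream codec be chosen to suit it.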

Author(s):  
Anne Krogh Nøhr ◽  
Kristian Hanghøj ◽  
Genis Garcia Erill ◽  
Zilong Li ◽  
Ida Moltke ◽  
...  

Abstract Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if this is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below, we show that our method works well and clearly outperforms the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate, all of which perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C++ as multi-threaded software and is freely available on GitHub at https://github.com/KHanghoj/NGSremix.


2020 ◽  
Vol 42 (11) ◽  
pp. 1311-1317
Author(s):  
Dong-Jun Lee ◽  
Taesoo Kwon ◽  
Chang-Kug Kim ◽  
Young-Joo Seol ◽  
Dong-Suk Park ◽  
...  

Abstract Background Sequence variations such as single nucleotide polymorphisms are markers for genetic diseases and breeding. Therefore, identifying sequence variations is one of the main objectives of several genome projects. Although most genome project consortiums provide standard operation procedures for sequence variation detection methods, there may be differences in the results because of human selection or error. Objective To standardize the procedure for sequence variation detection and help researchers who are not formally trained in bioinformatics, we developed the NGS_SNPAnalyzer, a desktop software tool and fully automated graphical pipeline. Methods The NGS_SNPAnalyzer is implemented using JavaFX (version 1.8); therefore, it is not limited to any operating system (OS). The tools employed in the NGS_SNPAnalyzer were compiled on Microsoft Windows (version 7, 10) and Ubuntu Linux (version 16.04, 17.0.4). Results The NGS_SNPAnalyzer not only includes the functionalities for variant calling and annotation but also provides quality control, mapping, and filtering details to support all procedures from next-generation sequencing (NGS) data to variant visualization. It can be executed using pre-set pipelines and options and customized via user-specified options. Additionally, the NGS_SNPAnalyzer provides a user-friendly graphical interface and can be installed on any OS that supports Java. Conclusions Although there are several pipelines and visualization tools available for NGS data analysis, we developed the NGS_SNPAnalyzer to provide the user with an easy-to-use interface. The benchmark test results indicate that the NGS_SNPAnalyzer achieves better performance than other open source tools.


2019 ◽  
Author(s):  
Tingting Gong ◽  
Vanessa M Hayes ◽  
Eva KF Chan

Abstract Somatic structural variants (SVs) play a significant role in cancer development and evolution, but are notoriously more difficult to detect than small variants from short-read next-generation sequencing (NGS) data. This is due to a combination of challenges attributed to the purity of tumour samples, tumour heterogeneity, limitations of short-read information from NGS, and sequence alignment ambiguities. In spite of active development of SV detection tools (callers) over the past few years, each method has inherent advantages and limitations. In this review, we highlight some of the important factors affecting somatic SV detection and compare the performance of eight commonly used SV callers. In particular, we focus on the extent of change in sensitivity and precision for detecting different SV types and size ranges from samples with differing variant allele frequencies and sequencing depths of coverage. We highlight the reasons why some SV callers perform well in some settings but not others, allowing our evaluation findings to be extended beyond the eight SV callers examined in this paper. As the importance of large structural variants becomes increasingly recognised in cancer genomics, this paper provides a timely review on some of the most impactful factors influencing somatic SV detection and guidance on selecting an appropriate SV caller.


2021 ◽  
Author(s):  
King Wai Lau ◽  
Michelle Kleeman ◽  
Caroline Reuter ◽  
Attila Lorincz

Abstract Summary Extremely large datasets are impossible or very difficult for humans to comprehend by standard mental approaches. Intuitive visualization of genetic variants in genomic sequencing data could help in the review and confirmation process of variants called by automated variant calling programs. To help facilitate interpretation of genetic variant next-generation sequencing (NGS) data we developed VisVariant, a customizable visualization tool that creates a figure showing the overlapping sequence information of thousands of individual reads including the variant and flanking regions. Availability and implementation Detailed information on how to download, install and run VisVariant together with an example is available on our GitHub website [https://github.com/hugging-biorxiv/visvariant].


Author(s):  
Tingting Gong ◽  
Vanessa M Hayes ◽  
Eva K F Chan

Abstract Somatic structural variants (SVs), which are variants that typically impact >50 nucleotides, play a significant role in cancer development and evolution but are notoriously more difficult to detect than small variants from short-read next-generation sequencing (NGS) data. This is due to a combination of challenges attributed to the purity of tumour samples, tumour heterogeneity, limitations of short-read information from NGS and sequence alignment ambiguities. In spite of active development of SV detection tools (callers) over the past few years, each method has inherent advantages and limitations. In this review, we highlight some of the important factors affecting somatic SV detection and compare the performance of seven commonly used SV callers. In particular, we focus on the extent of change in sensitivity and precision for detecting different SV types and size ranges from samples with differing variant allele frequencies and sequencing depths of coverage. We highlight the reasons why some SV callers perform well in some settings but not others, allowing our evaluation findings to be extended beyond the seven SV callers examined in this paper. As the importance of large SVs becomes increasingly recognized in cancer genomics, this paper provides a timely review on some of the most impactful factors influencing somatic SV detection that should be considered when choosing SV callers.
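The two evaluation metrics named above can be made concrete with a short sketch. It assumes the usual definitions, sensitivity = TP / (TP + FN) and precision = TP / (TP + FP), and it assumes a called SV counts as a true positive only when it matches a truth-set SV exactly. Real benchmarks typically allow some breakpoint tolerance, so the `SV` record and the matching rule here are illustrative and are not the review's procedure.

```python
from typing import NamedTuple

class SV(NamedTuple):
    """Hypothetical minimal record for one structural variant."""
    sv_type: str   # e.g. "DEL", "DUP", "INV"
    chrom: str
    start: int
    end: int

def sensitivity_precision(truth, called):
    """Return (sensitivity, precision) for sets of SV records.

    Exact matching is an assumption of this sketch; an empty
    denominator yields 0.0 by convention.
    """
    tp = len(truth & called)    # called and present in the truth set
    fn = len(truth - called)    # in the truth set but missed
    fp = len(called - truth)    # called but absent from the truth set
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision

def metrics_by_type(truth, called):
    """Stratify the two metrics by SV type."""
    types = sorted({sv.sv_type for sv in truth | called})
    return {
        t: sensitivity_precision(
            {sv for sv in truth if sv.sv_type == t},
            {sv for sv in called if sv.sv_type == t},
        )
        for t in types
    }
```

Stratifying by type, and in the same way by size range, allele frequency or depth, is what exposes the settings in which a given caller does well or poorly.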


2021 ◽  
Vol 2021 ◽  
pp. 1-6
Author(s):  
Yanjun Ma

Personal genomic data constitute an important part of personal health data. However, because next-generation sequencing technology produces large amounts of personal genomic data, special tools are needed to analyze them. In this article, we explore a tool for analyzing cloud-based, large-scale genome sequencing data. Analyzing and identifying genomic variations from amplicon-based next-generation sequencing data are necessary for the clinical diagnosis and treatment of cancer patients. When processing amplicon-based next-generation sequencing data, one essential step is removing primer sequences from the reads to avoid detecting false-positive mutations introduced by nonspecific primer binding and primer extension reactions. At present, primer-removal tools usually discard primer sequences from the FASTQ file rather than the BAM file, but this approach can cause problems in downstream analysis. Only one tool (BAMClipper) removes primer sequences from BAM files, but it only modifies the CIGAR value of the BAM file, so false-positive mutations falling in the primer region can still be detected from its processed BAM file. We therefore developed a primer-trimming tool (rmvPFBAM) that removes primer sequences from the BAM file; mutations detected from the BAM file processed by rmvPFBAM are highly credible. In addition, rmvPFBAM runs faster than other tools, such as cutPrimers and BAMClipper.


2018 ◽  
Vol 35 (15) ◽  
pp. 2665-2667 ◽  
Author(s):  
Christopher M Gibb ◽  
Robert Jackson ◽  
Sabah Mohammed ◽  
Jinan Fiaidhi ◽  
Ingeborg Zehbe

Abstract Summary The Pathogen–Host Analysis Tool (PHAT) is an application for processing and analyzing next-generation sequencing (NGS) data as it relates to relationships between pathogens and their hosts. Unlike custom scripts and tedious pipeline programming, PHAT provides an integrative platform encompassing raw and aligned sequence and reference file input, quality control (QC) reporting, alignment and variant calling, linear and circular alignment viewing, and graphical and tabular output. This novel tool aims to be user-friendly for life scientists studying diverse pathogen–host relationships. Availability and implementation The project is available on GitHub (https://github.com/chgibb/PHAT) and includes convenient installers, as well as portable and source versions, for both Windows and Linux (Debian and RedHat). Up-to-date documentation for PHAT, including user guides and development notes, can be found at https://chgibb.github.io/PHATDocs/. We encourage users and developers to provide feedback (error reporting, suggestions and comments).


Algorithms ◽  
2020 ◽  
Vol 13 (6) ◽  
pp. 151
Author(s):  
Bruno Carpentieri

The memory consumed and the network traffic generated by newly sequenced biological data have recently grown substantially. Genomic projects such as HapMap and 1000 Genomes have contributed to the very large rise in databases and network traffic related to genomic data and to the development of new efficient technologies. The large-scale sequencing of DNA samples has attracted new attention and produced new research, and the scientific community's interest in genomic data has thus greatly increased. In a very short time, researchers have developed hardware tools, analysis software, algorithms, private databases, and infrastructures to support research in genomics. In this paper, we analyze different approaches for compressing digital files generated by Next-Generation Sequencing tools containing nucleotide sequences, and we discuss and evaluate the compression performance of generic compression algorithms by comparing them with Quip, a system designed by Jones et al. specifically for genomic file compression. Moreover, we present a simple but effective technique for the compression of DNA sequences in which we only consider the relevant DNA data, and we experimentally evaluate its performance.
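A comparison of this kind can be sketched in a few lines. The sketch measures the output size of three generic standard-library compressors on a nucleotide string and, as a simple DNA-only baseline, packs each base into 2 bits. The 2-bit packing is an assumption chosen for illustration. The abstract does not describe its technique, and the sketch handles only the four symbols A, C, G and T.

```python
import bz2
import lzma
import zlib

BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}

def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]   # KeyError on non-ACGT symbols
        byte <<= 2 * (4 - len(chunk))         # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def unpack_2bit(data, length):
    """Recover the first `length` bases from 2-bit packed bytes."""
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            seq.append(BASES[(byte >> shift) & 3])
    return "".join(seq[:length])

def compressed_sizes(seq):
    """Return output sizes in bytes for each method on one sequence."""
    raw = seq.encode("ascii")
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(raw, 9)),
        "bz2": len(bz2.compress(raw, 9)),
        "lzma": len(lzma.compress(raw)),
        "2bit": len(pack_2bit(seq)),
    }
```

The sequence length has to be stored alongside the packed bytes, because a final chunk shorter than four bases is padded with zero bits.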


2015 ◽  
Vol 114 (11) ◽  
pp. 920-932 ◽  
Author(s):  
Joost C. M. Meijers ◽  
Saskia Middeldorp ◽  
Marisa L. R. Cunha

Summary Despite knowledge of various inherited risk factors associated with venous thromboembolism (VTE), no definite cause can be found in about 50% of patients. The application of data-driven searches such as GWAS has not been able to identify genetic variants with implications for clinical care, and unexplained heritability remains. In recent years, the development of several so-called next generation sequencing (NGS) platforms has offered the possibility of generating fast, inexpensive and accurate genomic information. However, so far their application to VTE has been very limited. Here we review basic concepts of NGS data analysis and explore the application of NGS technology to VTE. We provide both computational and biological viewpoints to discuss potentials and challenges of NGS-based studies.


2019 ◽  
Author(s):  
Steffen Albrecht ◽  
Miguel A. Andrade-Navarro ◽  
Jean-Fred Fontaine

Abstract Controlling the quality of next generation sequencing (NGS) data files is a necessary but complex task. To address this problem, we statistically characterized common NGS quality features and developed a novel quality control procedure involving tree-based and deep learning classification algorithms. Predictive models, validated on internal data and external disease diagnostic datasets, are to some extent generalizable to data from unseen species. The derived statistical guidelines and predictive models represent a valuable resource for users of NGS data to better understand quality issues and perform automatic quality control. Our guidelines and software are available at the following URL: https://github.com/salbrec/seqQscorer.

