OpenContami: a web-based application for detecting microbial contaminants in next-generation sequencing data

Author(s):  
Sung-Joon Park ◽  
Kenta Nakai

Abstract Summary Microorganisms infect and contaminate eukaryotic cells during the course of biological experiments. Because microbes influence host cell biology and may therefore lead to erroneous conclusions, a computational platform that facilitates decontamination is indispensable. Recent studies show that next-generation sequencing (NGS) data can be used to identify the presence of exogenous microbial species. Previously, we proposed an algorithm to improve detection of microbes in NGS data. Here, we developed an online application, OpenContami, which allows researchers easy access to the algorithm via interactive web-based interfaces. We have designed the application by incorporating a database comprising analytical results from a large-scale public dataset and data uploaded by users. The database serves as a reference for assessing user data and provides a list of genera detected from negative blank controls as a ‘blacklist’, which is useful for studying human infectious diseases. OpenContami offers a comprehensive overview of exogenous species in NGS datasets; as such, it will increase our understanding of the impact of microbial contamination on biological and pathological traits. Availability and implementation OpenContami is freely available at: https://openlooper.hgc.jp/opencontami/. Supplementary information Supplementary data are available at Bioinformatics online.

Author(s):  
Hyungtaek Jung ◽  
Brendan Jeon ◽  
Daniel Ortiz-Barrientos

Storing and manipulating next-generation sequencing (NGS) files to understand biological phenomena is an essential but difficult task in the life sciences. Yet most methods for analysing NGS data require complex command-line tools in high-performance computing (HPC) environments or on web-based servers and have not yet been implemented in comprehensive, easy-to-use software. Here we present easyfm (easy file manipulation), a free standalone graphical user interface (GUI) program with Python support that helps novice users, particularly biologists, rapidly find target sequences (or other regions of interest) in NGS datasets. It enables them to perform end-to-end reproducible data analyses using a desktop application (Windows, Mac and Linux). Unlike existing tools, the GUI-based easyfm is not dependent on any HPC system and can be operated without an internet connection. For user-friendliness and convenience, easyfm was developed with four work modules and a secondary GUI window, covering different aspects of NGS data analysis, including post-processing, filtering, format conversion, result generation, real-time logging, and help. In combination with the executable tools (BLAST+ and BLAT) and Python, easyfm allows the user to set analysis parameters, select/extract regions of interest, examine the input and output results, and convert to a wide range of file formats. To help augment the functionality of existing web-based and command-line tools, easyfm, a self-contained program, comes with extensive documentation (https://github.com/TaekAndBrendan/easyfm). This specific benefit allows easyfm to seamlessly integrate visual and interactive representations of NGS files, supporting a wider scope of bioinformatics applications in the life sciences.


Author(s):  
Anne Krogh Nøhr ◽  
Kristian Hanghøj ◽  
Genis Garcia Erill ◽  
Zilong Li ◽  
Ida Moltke ◽  
...  

Abstract Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if this is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below, we show that our method works well and clearly outperforms the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate, all of which perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C++ as multi-threaded software and is freely available on GitHub at https://github.com/KHanghoj/NGSremix.
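To illustrate why genotypes called from low-depth data are uncertain, the following minimal Python sketch computes normalized genotype likelihoods at a single biallelic diploid site from read counts. It assumes a simple symmetric per-base error model and independent reads; the function name `genotype_likelihoods` and the default error rate are hypothetical choices for illustration, and this is not NGSremix's implementation.

```python
def genotype_likelihoods(n_ref, n_alt, error_rate=0.01):
    """Return normalized likelihoods for 0, 1 and 2 alternate-allele copies.

    Assumes independent reads and a symmetric base error rate (an
    illustrative model only). The binomial coefficient is omitted because
    it is identical for all three genotypes and cancels on normalization.
    """
    if n_ref < 0 or n_alt < 0:
        raise ValueError("read counts must be non-negative")
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must be between 0 and 0.5")
    likelihoods = []
    for alt_copies in (0, 1, 2):
        frac = alt_copies / 2
        # Probability that a single read shows the alternate allele.
        p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
        likelihoods.append(p_alt ** n_alt * (1 - p_alt) ** n_ref)
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


if __name__ == "__main__":
    # Depth 2, both reads showing the reference allele.
    print(genotype_likelihoods(n_ref=2, n_alt=0))
```

Under these assumptions, a site covered by two reference reads and no alternate reads still leaves the heterozygous genotype with roughly 20% of the normalized likelihood, which is why methods for low-depth data work with likelihoods rather than hard genotype calls.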


Author(s):  
Zeynep Baskurt ◽  
Scott Mastromatteo ◽  
Jiafen Gong ◽  
Richard F Wintle ◽  
Stephen W Scherer ◽  
...  

Abstract Integration of next-generation sequencing (NGS) data across different research studies can improve the power of genetic association testing by increasing sample size and can obviate the need for sequencing controls. If differential genotype uncertainty across studies is not accounted for, combining data sets can produce spurious association results. We developed the Variant Integration Kit for NGS (VikNGS), a fast cross-platform software package, to enable aggregation of several data sets for rare and common variant genetic association analysis of quantitative and binary traits with covariate adjustment. VikNGS also includes a graphical user interface, power simulation functionality and data visualization tools. Availability The VikNGS package can be downloaded at http://www.tcag.ca/tools/index.html. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 16 (05) ◽  
pp. 1850018 ◽  
Author(s):  
Sanjeev Kumar ◽  
Suneeta Agarwal ◽  
Ranvijay

Genomic data now play a vital role in a number of fields such as personalized medicine, forensics, drug discovery, sequence alignment and agriculture. With the advancements and reduction in the cost of next-generation sequencing (NGS) technology, these data are growing exponentially. NGS data are being generated more rapidly than they can be meaningfully analyzed. Thus, there is much scope for developing novel data compression algorithms to facilitate data analysis along with data transfer and storage. An innovative compression technique is proposed here to address the problem of transmission and storage of large NGS data. This paper presents a lossless non-reference-based FastQ file compression approach, segregating the data into three different streams and then applying appropriate and efficient compression algorithms on each. Experiments show that the proposed approach (WBFQC) outperforms other state-of-the-art approaches for compressing NGS data in terms of compression ratio (CR), and compression and decompression time. It also has random access capability over compressed genomic data. An open source FastQ compression tool is also provided here (http://www.algorithm-skg.com/wbfqc/home.html).
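The three-stream idea can be sketched in a few lines of Python. This is a minimal illustration, assuming well-formed four-line FASTQ records whose separator line is a bare `+`, and using the standard library's `zlib` as a stand-in compressor for every stream. The function names are hypothetical, and this is not the WBFQC implementation, whose per-stream algorithms are not described in the abstract.

```python
import zlib


def split_streams(fastq_text):
    """Split FASTQ text into identifier, sequence and quality streams."""
    lines = fastq_text.strip("\n").split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        # Assumption: the separator is a bare "+", so it need not be stored.
        if not header.startswith("@") or plus != "+":
            raise ValueError("unsupported or malformed FASTQ record")
        ids.append(header[1:])
        seqs.append(seq)
        quals.append(qual)
    return ids, seqs, quals


def compress_fastq(fastq_text):
    """Compress each stream separately (zlib is only a placeholder here)."""
    streams = split_streams(fastq_text)
    return tuple(zlib.compress("\n".join(s).encode("ascii")) for s in streams)


def decompress_fastq(blobs):
    """Rebuild FASTQ text (with a trailing newline) from the three blobs."""
    ids, seqs, quals = (
        zlib.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    return "".join(
        f"@{h}\n{s}\n+\n{q}\n" for h, s, q in zip(ids, seqs, quals)
    )


if __name__ == "__main__":
    example = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nFFFF\n"
    assert decompress_fastq(compress_fastq(example)) == example
```

Separating the streams matters because identifiers, bases and quality scores have very different statistics, so each can be handed to a compressor suited to it; the sketch uses one compressor throughout only to keep the example short.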


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Michael M. Khayat ◽  
Sayed Mohammad Ebrahim Sahraeian ◽  
Samantha Zarate ◽  
Andrew Carroll ◽  
Huixiao Hong ◽  
...  

Abstract Background Genomic structural variations (SV) are important determinants of genotypic and phenotypic changes in many organisms. However, the detection of SV from next-generation sequencing data remains challenging. Results In this study, DNA from a Chinese family quartet is sequenced at three different sequencing centers in triplicate. A total of 288 derivative data sets are generated utilizing different analysis pipelines and compared to identify sources of analytical variability. Mapping methods provide the major contribution to variability, followed by sequencing centers and replicates. Interestingly, SV supported by only one center or replicate often represent true positives with 47.02% and 45.44% overlapping the long-read SV call set, respectively. This is consistent with an overall higher false negative rate for SV calling in centers and replicates compared to mappers (15.72%). Finally, we observe that the SV calling variability also persists in a genotyping approach, indicating the impact of the underlying sequencing and preparation approaches. Conclusions This study provides the first detailed insights into the sources of variability in SV identification from next-generation sequencing and highlights remaining challenges in SV calling for large cohorts. We further give recommendations on how to reduce SV calling variability and the choice of alignment methodology.


2018 ◽  
Author(s):  
Tamsen Dunn ◽  
Gwenn Berry ◽  
Dorothea Emig-Agius ◽  
Yu Jiang ◽  
Serena Lei ◽  
...  

Abstract Motivation Next-Generation Sequencing (NGS) technology is transitioning quickly from research labs to clinical settings. The diagnosis and treatment selection for many acquired and autosomal conditions necessitate a method for accurately detecting somatic and germline variants, suitable for the clinic. Results We have developed Pisces, a rapid, versatile and accurate small variant calling suite designed for somatic and germline amplicon sequencing applications. Pisces accuracy is achieved by four distinct modules, the Pisces Read Stitcher, Pisces Variant Caller, the Pisces Variant Quality Recalibrator, and the Pisces Variant Phaser. Each module incorporates a number of novel algorithmic strategies aimed at reducing noise or increasing the likelihood of detecting a true variant. Availability Pisces is distributed under an open source license and can be downloaded from https://github.com/Illumina/Pisces. Pisces is available on the BaseSpace™ SequenceHub as part of the TruSeq Amplicon workflow and the Illumina Ampliseq Workflow. Pisces is distributed on Illumina sequencing platforms such as the MiSeq™, and is included in the Praxis™ Extended RAS Panel test which was recently approved by the FDA for the detection of multiple RAS gene [email protected] Supplementary information Supplementary data are available online.


2017 ◽  
Author(s):  
Sungsoo Park ◽  
Bonggun Shin ◽  
Yoonjung Choi ◽  
Kilsoo Kang ◽  
Keunsoo Kang

Abstract Motivation Next-generation sequencing (NGS), which allows the simultaneous sequencing of billions of DNA fragments, has revolutionized how we study genomics and molecular biology by generating genome-wide molecular maps of molecules of interest. For example, an NGS-based transcriptomic assay called RNA-seq can be used to estimate the abundance of approximately 190,000 transcripts together. As the cost of next-generation sequencing sharply declines, researchers in many fields have been conducting research using NGS. The amount of information produced by NGS has made it difficult for researchers to choose the optimal set of target genes (or genomic loci). Results We have sought to resolve this issue by developing a neural network-based feature (gene) selection algorithm called Wx. The Wx algorithm ranks genes based on the discriminative index (DI) score that represents the classification power for distinguishing given groups. With a gene list ranked by DI score, researchers can intuitively select the optimal set of genes from the highest-ranking ones. We applied the Wx algorithm to a TCGA pan-cancer gene-expression cohort to identify an optimal set of gene-expression biomarker (universal gene-expression biomarkers) candidates that can distinguish cancer samples from normal samples for 12 different types of cancer. The 14 gene-expression biomarker candidates identified by Wx were comparable to or outperformed previously reported universal gene expression biomarkers, highlighting the usefulness of the Wx algorithm for next-generation sequencing data. Thus, we anticipate that the Wx algorithm can complement current state-of-the-art analytical applications for the identification of biomarker candidates as an alternative method. Availability https://github.com/deargen/[email protected] Supplementary information Supplementary data are available online.


2017 ◽  
pp. 1-17 ◽  
Author(s):  
Sumit Middha ◽  
Liying Zhang ◽  
Khedoudja Nafa ◽  
Gowtham Jayakumaran ◽  
Donna Wong ◽  
...  

Purpose Microsatellite instability (MSI)/mismatch repair (MMR) status is increasingly important in the management of patients with cancer to predict response to immune checkpoint inhibitors. We determined MSI status from large-panel clinical targeted next-generation sequencing (NGS) data across various solid cancer types. Methods The MSI statuses of 12,288 advanced solid cancers consecutively sequenced with Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets clinical NGS assay were inferred by using MSIsensor, a program that reports the percentage of unstable microsatellites as a score. Cutoff score determination and sensitivity/specificity were based on MSI polymerase chain reaction (PCR) and MMR immunohistochemistry. Results By using an MSIsensor score ≥ 10 to define MSI high (MSI-H), 83 (8%) of 996 colorectal cancers (CRCs) and 42 (16%) of 260 uterine endometrioid cancers (UECs) were MSI-H. Validation against MSI PCR and/or MMR immunohistochemistry performed for 138 (24 MSI-H, 114 microsatellite stable [MSS]) CRCs, and 40 (15 MSI-H, 25 MSS) UECs showed a concordance of 99.4%. MSIsensor also identified 68 MSI-H/MMR-deficient (MMR-D) non-CRC/UECs. Of 9,591 non-CRC/UEC tumors with MSS MSIsensor status, 456 (4.8%) had slightly elevated scores (≥ 3 and < 10) of which 96.6% with available material were confirmed to be MSS by MSI PCR. MSI-H was also detected and confirmed in three non-CRC/UECs with low exonic mutation burden (< 20). MSIsensor correctly scored all 15 polymerase ε ultra-mutated cancers as negative for MSI. Conclusion MSI status can be reliably inferred by MSIsensor from large-panel targeted NGS data. Concurrent MSI testing by NGS is resource efficient, is potentially more sensitive for MMR-D than MSI PCR, and allows identification of MSI-H across various cancers not typically screened, as highlighted by the finding that 35% (68 of 193) of all MSI-H tumors were non-CRC/UEC.
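The score cutoffs reported in this abstract can be written as a small Python sketch. The thresholds (≥ 10 for MSI-H, ≥ 3 and < 10 for slightly elevated scores that were still classed as MSS) come from the abstract; the function name `classify_msi` and the label strings are hypothetical, and this is not part of MSIsensor itself.

```python
def classify_msi(score):
    """Map an MSIsensor score (percentage of unstable microsatellites)
    to a status label using the cutoffs reported in the abstract."""
    if not 0 <= score <= 100:
        raise ValueError("score is a percentage and must lie in [0, 100]")
    if score >= 10:
        return "MSI-H"
    if score >= 3:
        # Still MSS by MSIsensor status, but flagged as slightly elevated.
        return "MSS (slightly elevated)"
    return "MSS"


if __name__ == "__main__":
    for value in (0.5, 3, 9.9, 10, 42):
        print(value, classify_msi(value))
```

Keeping the intermediate band as a separate label mirrors the study's handling of those cases, which were followed up with MSI PCR where material was available.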


2020 ◽  
Vol 42 (11) ◽  
pp. 1311-1317
Author(s):  
Dong-Jun Lee ◽  
Taesoo Kwon ◽  
Chang-Kug Kim ◽  
Young-Joo Seol ◽  
Dong-Suk Park ◽  
...  

Abstract Background Sequence variations such as single nucleotide polymorphisms are markers for genetic diseases and breeding. Therefore, identifying sequence variations is one of the main objectives of several genome projects. Although most genome project consortiums provide standard operation procedures for sequence variation detection methods, there may be differences in the results because of human selection or error. Objective To standardize the procedure for sequence variation detection and help researchers who are not formally trained in bioinformatics, we developed the NGS_SNPAnalyzer, a desktop software and fully automated graphical pipeline. Methods The NGS_SNPAnalyzer is implemented using JavaFX (version 1.8); therefore, it is not limited to any operating system (OS). The tools employed in the NGS_SNPAnalyzer were compiled on Microsoft Windows (version 7, 10) and Ubuntu Linux (version 16.04, 17.0.4). Results The NGS_SNPAnalyzer not only includes the functionalities for variant calling and annotation but also provides quality control, mapping, and filtering details to support all procedures from next-generation sequencing (NGS) data to variant visualization. It can be executed using pre-set pipelines and options and customized via user-specified options. Additionally, the NGS_SNPAnalyzer provides a user-friendly graphical interface and can be installed on any OS that supports Java. Conclusions Although there are several pipelines and visualization tools available for NGS data analysis, we developed the NGS_SNPAnalyzer to provide the user with an easy-to-use interface. The benchmark test results indicate that the NGS_SNPAnalyzer achieves better performance than other open source tools.


2017 ◽  
Vol 45 (10) ◽  
pp. 5678-5690 ◽  
Author(s):  
Christopher A. Lavender ◽  
Andrew J. Shapiro ◽  
Adam B. Burkholder ◽  
Brian D. Bennett ◽  
Karen Adelman ◽  
...  
