easyfm: An easy software suite for file manipulation of Next Generation Sequencing data on desktops

Author(s):  
Hyungtaek Jung ◽  
Brendan Jeon ◽  
Daniel Ortiz-Barrientos

Storing and manipulating Next Generation Sequencing (NGS) file formats to understand biological phenomena is an essential but difficult task in the life sciences. Most methods for analysing NGS data require complex command-line tools on high-performance computing (HPC) systems or web-based servers and have not yet been implemented in comprehensive, easy-to-use software. Here we present easyfm (easy file manipulation), a free, standalone Graphical User Interface (GUI) software with Python support that helps novice users, particularly biologists, rapidly find target sequences of interest in NGS datasets. It enables them to perform end-to-end reproducible data analyses using a desktop application (Windows, Mac and Linux). Unlike existing tools, the GUI-based easyfm does not depend on any HPC system and can be operated without an internet connection. For user-friendliness and convenience, easyfm was developed with four work modules and a secondary GUI window, covering different aspects of NGS data analysis, including post-processing, filtering, format conversion, generating results, real-time log, and help. In combination with the executable tools (BLAST+ and BLAT) and Python, easyfm allows the user to set analysis parameters, select/extract regions of interest, examine the input and output results, and convert to a wide range of file formats. To help augment the functionality of existing web-based and command-line tools, easyfm, a self-contained program, comes with extensive documentation (https://github.com/TaekAndBrendan/easyfm). This specific benefit allows easyfm to seamlessly integrate visual and interactive representations of NGS files, supporting a wider scope of bioinformatics applications in the life sciences.
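The select/extract step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not easyfm's code: the function names `read_fasta` and `extract_by_id` and the example records are hypothetical, and the sketch assumes plain FASTA input in which the sequence identifier is the first whitespace-delimited token of each header.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def extract_by_id(lines: Iterable[str], wanted: Iterable[str]) -> Dict[str, str]:
    """Keep only the records whose identifier is in `wanted`."""
    wanted_set = set(wanted)
    selected = {}
    for header, sequence in read_fasta(lines):
        parts = header.split()
        seq_id = parts[0] if parts else ""
        if seq_id in wanted_set:
            selected[header] = sequence
    return selected


# Hypothetical example records, not taken from the paper.
example = [">seq1 sample A", "ACGT", "ACGT", ">seq2", "GGCC"]
print(extract_by_id(example, ["seq1"]))
```

Reading from an open file handle instead of a list works the same way, because the parser only iterates over lines.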

Author(s):  
Sung-Joon Park ◽  
Kenta Nakai

Abstract Summary Microorganisms infect and contaminate eukaryotic cells during the course of biological experiments. Because microbes influence host cell biology and may therefore lead to erroneous conclusions, a computational platform that facilitates decontamination is indispensable. Recent studies show that next-generation sequencing (NGS) data can be used to identify the presence of exogenous microbial species. Previously, we proposed an algorithm to improve detection of microbes in NGS data. Here, we developed an online application, OpenContami, which allows researchers easy access to the algorithm via interactive web-based interfaces. We have designed the application by incorporating a database comprising analytical results from a large-scale public dataset and data uploaded by users. The database serves as a reference for assessing user data and provides a list of genera detected from negative blank controls as a ‘blacklist’, which is useful for studying human infectious diseases. OpenContami offers a comprehensive overview of exogenous species in NGS datasets; as such, it will increase our understanding of the impact of microbial contamination on biological and pathological traits. Availability and implementation OpenContami is freely available at: https://openlooper.hgc.jp/opencontami/. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Anne Krogh Nøhr ◽  
Kristian Hanghøj ◽  
Genis Garcia Erill ◽  
Zilong Li ◽  
Ida Moltke ◽  
...  

Abstract Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if this is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below, we show that our method works well and clearly outperforms the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate, all of which perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C++ as multi-threaded software and is freely available on GitHub: https://github.com/KHanghoj/NGSremix.


F1000Research ◽  
2015 ◽  
Vol 4 ◽  
pp. 50 ◽  
Author(s):  
Michael T. Wolfinger ◽  
Jörg Fallmann ◽  
Florian Eggenhofer ◽  
Fabian Amman

Recent achievements in next-generation sequencing (NGS) technologies have led to a high demand for reusable software components to easily compile customized analysis workflows for big genomics data. We present ViennaNGS, an integrated collection of Perl modules focused on building efficient pipelines for NGS data processing. It comes with functionality for extracting and converting features from common NGS file formats, computation and evaluation of read mapping statistics, as well as normalization of RNA abundance. Moreover, ViennaNGS provides software components for identification and characterization of splice junctions from RNA-seq data, parsing and condensing sequence motif data, automated construction of Assembly and Track Hubs for the UCSC genome browser, as well as wrapper routines for a set of commonly used NGS command line tools.


2021 ◽  
Vol 8 (Supplement_1) ◽  
pp. S281-S282
Author(s):  
Heather L Wells ◽  
Joseph Barrows ◽  
Mara Couto-Rodriguez ◽  
Xavier O Jirau Serrano ◽  
Marilyne Debieu ◽  
...  

Abstract Background The quantitative level of pathogens present in a host is a major driver of infectious disease (ID) state and outcome. However, the majority of ID diagnostics are qualitative. Next-generation sequencing (NGS) is an emerging ID diagnostics and research tool to provide insights, including tracking transmission, evolution, and identifying novel strains. Methods We built a novel likelihood-based computational method to leverage pathogen-specific genome-wide NGS data to detect SARS-CoV-2, profile genetic variants, and furthermore quantify levels of these pathogens. We used de-identified clinical specimens tested for SARS-CoV-2 using RT-PCR, SARS-CoV-2 NGS Assay (hybrid capture, Twist Bioscience), or ARTIC (amplicon-based) platform, and COVID-DX software. A training (n=87) and validation (n=22) set was selected to establish the strength of our quantification model. We fit non-uniform probabilistic error profiles to a deterministic sigmoidal equation that more realistically represents observed data and used likelihood maximized over several different read depths to improve accuracy over a wide range of values of viral load. Given the proportion of the genome covered at varying depths for a single sample as input data, our model estimated the Ct of that sample as the value that produces the maximum likelihood of generating the observed genome coverage data. Results The model fit on 87 SARS-CoV-2 NGS Assay training samples produced a good fit to the 22 validation samples, with a coefficient of correlation (r²) of ~0.8. The accuracy of the model was high (mean absolute % error of ~10%, meaning our model is able to predict the Ct value of each sample within a margin of ±10% on average). Because of the nature of the commonly used ARTIC protocol, we found that all quantitative signals in this data were lost during PCR amplification and the model is not applicable for quantification of samples captured this way.
The ability to model quantification is a major advantage of the SARS-CoV-2 NGS assay protocol. Figure: The likelihood-based model to estimate SARS-CoV-2 viral titer. Left: Observed genome coverage (y-axis) plotted against Ct value (x-axis). The best-fitting logistic curve is shown as a red line, with shaded areas above and below representing the fitted error profile. Right: Model-estimated Ct values (y-axis) compared to laboratory Ct values (x-axis), with grey bars representing estimated confidence intervals. The 1:1 diagonal is shown as a dotted line. Conclusion To our knowledge, this is the first model to incorporate sequence data mapped across the genome of a pathogen to quantify the level of that pathogen in a clinical specimen. This has implications in ID diagnostics, research, and metagenomics. Disclosures Heather L. Wells, MPH, Biotia, Inc. (Consultant) Joseph Barrows, MS, Biotia (Employee) Mara Couto-Rodriguez, MS, Biotia (Employee) Xavier O. Jirau Serrano, B.S., Biotia (Employee) Marilyne Debieu, PhD, Biotia (Employee) Karen Wessel, PhD, Labor Zotz/Klimas (Employee) Christopher Mason, PhD, Biotia (Board Member, Advisor or Review Panel member, Shareholder) Dorottya Nagy-Szakal, MD PhD, Biotia Inc (Employee, Shareholder) Niamh B. O’Hara, PhD, Biotia (Board Member, Employee, Shareholder)


2018 ◽  
Vol 16 (05) ◽  
pp. 1850018 ◽  
Author(s):  
Sanjeev Kumar ◽  
Suneeta Agarwal ◽  
Ranvijay

Genomic data now play a vital role in a number of fields such as personalized medicine, forensics, drug discovery, sequence alignment, and agriculture. With the advancements and reduction in the cost of next-generation sequencing (NGS) technology, these data are growing exponentially. NGS data are being generated more rapidly than they can be meaningfully analyzed. Thus, there is much scope for developing novel data compression algorithms that facilitate data analysis along with data transfer and storage. An innovative compression technique is proposed here to address the problem of transmission and storage of large NGS data. This paper presents a lossless non-reference-based FastQ file compression approach, segregating the data into three different streams and then applying appropriate and efficient compression algorithms on each. Experiments show that the proposed approach (WBFQC) outperforms other state-of-the-art approaches for compressing NGS data in terms of compression ratio (CR), and compression and decompression time. It also has random access capability over compressed genomic data. An open source FastQ compression tool is also provided here ( http://www.algorithm-skg.com/wbfqc/home.html ).
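The stream-separation idea can be sketched in a few lines of Python. This is an illustration only, not WBFQC: the abstract does not name the per-stream algorithms, so `zlib` is used here as a stand-in for all three streams, and the function names are hypothetical. The sketch also assumes four-line records whose separator line is a bare `+` and contains no repeated identifier.

```python
import zlib
from typing import List, Tuple


def split_streams(fastq_text: str) -> Tuple[List[str], List[str], List[str]]:
    """Separate a FASTQ string into identifier, sequence and quality streams."""
    lines = fastq_text.rstrip("\n").split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    if any(sep != "+" for sep in lines[2::4]):
        raise ValueError("this sketch assumes a bare '+' separator line")
    return lines[0::4], lines[1::4], lines[3::4]


def compress_fastq(fastq_text: str) -> Tuple[bytes, bytes, bytes]:
    """Compress each stream independently (zlib is a stand-in codec)."""
    ids, seqs, quals = split_streams(fastq_text)
    return tuple(
        zlib.compress("\n".join(stream).encode("ascii"))
        for stream in (ids, seqs, quals)
    )


def decompress_fastq(blobs: Tuple[bytes, bytes, bytes]) -> str:
    """Rebuild the original FASTQ text from the three compressed streams."""
    ids, seqs, quals = (
        zlib.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    out = []
    for read_id, seq, qual in zip(ids, seqs, quals):
        out.extend([read_id, seq, "+", qual])
    return "\n".join(out) + "\n"


# Hypothetical two-read example to check the round trip.
sample = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nFFFF\n"
assert decompress_fastq(compress_fastq(sample)) == sample
```

Separating the streams matters because identifiers, bases, and quality scores have very different statistics, so each can be given a codec suited to it; a single general-purpose codec on every stream, as here, only demonstrates the lossless round trip.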


2017 ◽  
pp. 1-17 ◽  
Author(s):  
Sumit Middha ◽  
Liying Zhang ◽  
Khedoudja Nafa ◽  
Gowtham Jayakumaran ◽  
Donna Wong ◽  
...  

Purpose Microsatellite instability (MSI)/mismatch repair (MMR) status is increasingly important in the management of patients with cancer to predict response to immune checkpoint inhibitors. We determined MSI status from large-panel clinical targeted next-generation sequencing (NGS) data across various solid cancer types. Methods The MSI statuses of 12,288 advanced solid cancers consecutively sequenced with Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets clinical NGS assay were inferred by using MSIsensor, a program that reports the percentage of unstable microsatellites as a score. Cutoff score determination and sensitivity/specificity were based on MSI polymerase chain reaction (PCR) and MMR immunohistochemistry. Results By using an MSIsensor score ≥ 10 to define MSI high (MSI-H), 83 (8%) of 996 colorectal cancers (CRCs) and 42 (16%) of 260 uterine endometrioid cancers (UECs) were MSI-H. Validation against MSI PCR and/or MMR immunohistochemistry performed for 138 (24 MSI-H, 114 microsatellite stable [MSS]) CRCs, and 40 (15 MSI-H, 25 MSS) UECs showed a concordance of 99.4%. MSIsensor also identified 68 MSI-H/MMR-deficient (MMR-D) non-CRC/UECs. Of 9,591 non-CRC/UEC tumors with MSS MSIsensor status, 456 (4.8%) had slightly elevated scores (≥ 3 and < 10) of which 96.6% with available material were confirmed to be MSS by MSI PCR. MSI-H was also detected and confirmed in three non-CRC/UECs with low exonic mutation burden (< 20). MSIsensor correctly scored all 15 polymerase ε ultra-mutated cancers as negative for MSI. Conclusion MSI status can be reliably inferred by MSIsensor from large-panel targeted NGS data. Concurrent MSI testing by NGS is resource efficient, is potentially more sensitive for MMR-D than MSI PCR, and allows identification of MSI-H across various cancers not typically screened, as highlighted by the finding that 35% (68 of 193) of all MSI-H tumors were non-CRC/UEC.
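The score cutoffs reported in this abstract can be written out as a short Python sketch. The function names and category labels are hypothetical; only the thresholds (≥ 10 for MSI-H, ≥ 3 and < 10 for slightly elevated) and the counts in the example come from the abstract. The sketch does not call MSIsensor itself.

```python
MSI_HIGH_CUTOFF = 10.0   # score >= 10 defines MSI-H in the abstract
ELEVATED_CUTOFF = 3.0    # scores >= 3 and < 10 are "slightly elevated"


def classify_msi(score: float) -> str:
    """Map an MSIsensor score (percentage of unstable microsatellites) to a label."""
    if score < 0:
        raise ValueError("score is a percentage and cannot be negative")
    if score >= MSI_HIGH_CUTOFF:
        return "MSI-H"
    if score >= ELEVATED_CUTOFF:
        return "MSS (slightly elevated)"
    return "MSS"


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts quoted in the abstract: 68 of 193 MSI-H tumors were non-CRC/UEC.
print(classify_msi(12.5), percent(68, 193))
```

Under these cutoffs, slightly elevated scores still count as MSS, which matches the abstract's report that most such cases were confirmed MSS by MSI PCR.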


2020 ◽  
Vol 42 (11) ◽  
pp. 1311-1317
Author(s):  
Dong-Jun Lee ◽  
Taesoo Kwon ◽  
Chang-Kug Kim ◽  
Young-Joo Seol ◽  
Dong-Suk Park ◽  
...  

Abstract Background Sequence variations such as single nucleotide polymorphisms are markers for genetic diseases and breeding. Therefore, identifying sequence variations is one of the main objectives of several genome projects. Although most genome project consortiums provide standard operating procedures for sequence variation detection methods, there may be differences in the results because of human selection or error. Objective To standardize the procedure for sequence variation detection and help researchers who are not formally trained in bioinformatics, we developed the NGS_SNPAnalyzer, a desktop software and fully automated graphical pipeline. Methods The NGS_SNPAnalyzer is implemented using JavaFX (version 1.8); therefore, it is not limited to any operating system (OS). The tools employed in the NGS_SNPAnalyzer were compiled on Microsoft Windows (version 7, 10) and Ubuntu Linux (version 16.04, 17.0.4). Results The NGS_SNPAnalyzer not only includes the functionalities for variant calling and annotation but also provides quality control, mapping, and filtering details to support all procedures from next-generation sequencing (NGS) data to variant visualization. It can be executed using pre-set pipelines and options and customized via user-specified options. Additionally, the NGS_SNPAnalyzer provides a user-friendly graphical interface and can be installed on any OS that supports Java. Conclusions Although there are several pipelines and visualization tools available for NGS data analysis, we developed the NGS_SNPAnalyzer to provide the user with an easy-to-use interface. The benchmark test results indicate that the NGS_SNPAnalyzer achieves better performance than other open source tools.


2017 ◽  
Vol 45 (10) ◽  
pp. 5678-5690 ◽  
Author(s):  
Christopher A. Lavender ◽  
Andrew J. Shapiro ◽  
Adam B. Burkholder ◽  
Brian D. Bennett ◽  
Karen Adelman ◽  
...  
