NPARS—A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science

2021 ◽  
Vol 4 ◽  
Author(s):  
Li Ma ◽  
Erich A. Peterson ◽  
Ik Jae Shin ◽  
Jason Muesse ◽  
Katy Marino ◽  
...  

Background: Accuracy and reproducibility are vital in science and present a significant challenge in the emerging discipline of data science, especially when the data are scientifically complex and massive in size. Further complicating matters, in genomic-based science, high-throughput sequencing technologies generate considerable amounts of data that need to be stored, manipulated, and analyzed using a plethora of software tools. Researchers are rarely able to reproduce published genomic studies. Results: Presented is a novel approach that facilitates accuracy and reproducibility for large genomic research data sets. All required data are loaded into a portable local database, which serves as an interface for well-known software frameworks. These include Python-based Jupyter Notebooks, RStudio projects, and R Markdown. All software is encapsulated using Docker containers and managed by Git, simplifying software configuration management. Conclusion: Accuracy and reproducibility in science are of paramount importance. For the biomedical sciences, advances in high-throughput technologies, molecular biology, and quantitative methods are providing unprecedented insights into disease mechanisms. With these insights comes the associated challenge of scientific data that are complex and massive in size. This makes collaboration, verification, validation, and reproducibility of findings difficult. To address these challenges, the NGS post-pipeline accuracy and reproducibility system (NPARS) was developed. NPARS is a robust software infrastructure and methodology that can encapsulate data, code, and reporting for large genomic studies. This paper demonstrates the successful use of NPARS on large and complex genomic data sets across different computational platforms.

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Komal Jain ◽  
Teresa Tagliafierro ◽  
Adriana Marques ◽  
Santiago Sanchez-Vicente ◽  
Alper Gokden ◽  
...  

Inadequate sensitivity has been the primary limitation for implementing high-throughput sequencing for studies of tick-borne agents. Here we describe the development of TBDCapSeq, a sequencing assay that uses hybridization capture probes that cover the complete genomes of the eleven most common tick-borne agents found in the United States. The probes are used for solution-based capture and enrichment of pathogen nucleic acid followed by high-throughput sequencing. We evaluated the performance of TBDCapSeq to surveil samples that included human whole blood, mouse tissues, and field-collected ticks. For Borrelia burgdorferi and Babesia microti, the sensitivity of TBDCapSeq was comparable to, and occasionally exceeded, that of agent-specific quantitative PCR, and resulted in a 25- to >10,000-fold increase in pathogen reads when compared to standard unbiased sequencing. TBDCapSeq also enabled genome analyses directly within vertebrate and tick hosts. The implementation of TBDCapSeq could have a major impact on studies of tick-borne pathogens by improving detection and facilitating genomic research that was previously unachievable with standard sequencing approaches.


MycoKeys ◽  
2018 ◽  
Vol 39 ◽  
pp. 29-40 ◽  
Author(s):  
Sten Anslan ◽  
R. Henrik Nilsson ◽  
Christian Wurzbacher ◽  
Petr Baldrian ◽  
Leho Tedersoo ◽  
...  

Along with recent developments in high-throughput sequencing (HTS) technologies and the resulting fast accumulation of HTS data, there has been a growing need for and interest in developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of applying different bioinformatics workflows on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, the quality of error filtering and hence the output of a specific bioinformatics process largely depend on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units (OTUs), although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.


Author(s):  
Jane Oja ◽  
Sakeenah Adenan ◽  
Abdel-Fattah Talaat ◽  
Juha Alatalo

A broad diversity of microorganisms can be found in soil, where they are essential for nutrient cycling and energy transfer. Recent high-throughput sequencing methods have greatly advanced our knowledge of how soil, climate and vegetation variables structure the composition of microbial communities in many world regions. However, we lack information from several regions of the world, e.g. the Middle East. We have collected soil from 19 different habitat types to study the diversity and composition of soil microbial communities (both fungi and bacteria) in Qatar and to determine which edaphic parameters exert the strongest influence on these communities. Preliminary results indicate that, overall, bacteria are more abundant in soil than fungi and that a few sites have a notably higher abundance of these microbes. In addition, we have detected some soil parameters that tend to reduce the overall fungal abundance and enhance the presence of arbuscular mycorrhizal fungi and N-fixing bacteria. More detailed information on the diversity and composition of soil microbial communities is expected from the high-throughput sequencing data.


2014 ◽  
Vol 4 (S2) ◽  
Author(s):  
Anders Christiansen ◽  
Christian Skjodt Hansen ◽  
Jens Vindahl Kringelum ◽  
Ole Lund ◽  
Katrine Lindholm Bogh ◽  
...  

2011 ◽  
Vol 77 (24) ◽  
pp. 8795-8798 ◽  
Author(s):  
Daniel Aguirre de Cárcer ◽  
Stuart E. Denman ◽  
Chris McSweeney ◽  
Mark Morrison

Several subsampling-based normalization strategies were applied to different high-throughput sequencing data sets originating from human and murine gut environments. Their effects on the data sets' characteristics and normalization efficiencies, as measured by several β-diversity metrics, were compared. For both data sets, subsampling to the median rather than the minimum number appeared to improve the analysis.
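
The difference between the two subsampling targets can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' procedure: the abstract does not specify the details, so the function names, the per-sample count-vector layout, and the choice to leave samples below the target depth unchanged are all assumptions.

```python
import random
from statistics import median


def subsample_counts(counts, depth, rng):
    """Draw `depth` reads without replacement from one sample's count vector.

    Assumption: a sample that already has `depth` reads or fewer is returned
    unchanged rather than discarded.
    """
    if sum(counts) <= depth:
        return list(counts)
    # Expand the counts into one label per read, then sample read labels.
    pool = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    out = [0] * len(counts)
    for taxon in rng.sample(pool, depth):
        out[taxon] += 1
    return out


def normalize(table, strategy="median", seed=0):
    """Subsample every sample (row) to the median or minimum sequencing depth."""
    rng = random.Random(seed)
    depths = [sum(row) for row in table]
    target = int(median(depths)) if strategy == "median" else min(depths)
    return [subsample_counts(row, target, rng) for row in table]


# Hypothetical table: three samples (rows) by three taxa (columns).
table = [[10, 0, 5], [40, 20, 0], [100, 50, 50]]
to_min = normalize(table, strategy="minimum")
to_median = normalize(table, strategy="median")
```

With the minimum as target, every sample is cut to the depth of the shallowest one; with the median, deeper samples keep more reads while shallower samples are retained as they are.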


2018 ◽  
Vol 1 (1) ◽  
pp. 263-274 ◽  
Author(s):  
Marylyn D. Ritchie

Biomedical data science has experienced an explosion of new data over the past decade. Abundant genetic and genomic data are increasingly available in large, diverse data sets due to the maturation of modern molecular technologies. Along with these molecular data, dense, rich phenotypic data are also available in comprehensive clinical data sets from health care provider organizations, clinical trials, population health registries, and epidemiologic studies. The methods and approaches for interrogating these large genetic/genomic and clinical data sets continue to evolve rapidly, as our understanding of the questions and challenges continues to emerge. In this review, the state-of-the-art methodologies for genetic/genomic analysis along with complex phenomics will be discussed. This field is changing and adapting to the novel data types made available, as well as technological advances in computation and machine learning. Thus, I will also discuss the future challenges in this exciting and innovative space. The promises of precision medicine rely heavily on the ability to marry complex genetic/genomic data with clinical phenotypes in meaningful ways.


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yipu Zhang ◽  
Ping Wang

The new high-throughput technique ChIP-seq, which couples chromatin immunoprecipitation with high-throughput sequencing, has extended the identification of transcription factor binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and limited in their ability to identify binding motifs in ChIP-seq data, which are typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging substrings mining strategy: it finds the enriched substrings and then searches the neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. The detection of the real binding motifs and the processing of full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is advantageous for (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
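
For readers unfamiliar with the (l, d) notation: in the standard formulation, an (l, d) motif is a pattern of length l whose instances differ from it in at most d positions. The Python sketch below only illustrates that definition with a brute-force scan; it is not the FCmotif algorithm, and the function names and the example sequence are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_instances(sequence, motif, d):
    """Return (position, substring) for every window within distance d of motif."""
    l = len(motif)
    hits = []
    for i in range(len(sequence) - l + 1):
        window = sequence[i:i + l]
        if hamming(window, motif) <= d:
            hits.append((i, window))
    return hits


# Hypothetical sequence; look for instances of a (4, 1) motif "ACGT".
hits = find_instances("ACGTACGAACTT", "ACGT", d=1)
# hits == [(0, "ACGT"), (4, "ACGA"), (8, "ACTT")]
```

A scan like this presumes the motif is already known; motif discovery methods have to find the pattern itself, which is what makes the problem expensive on large ChIP-seq data sets.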


2014 ◽  
Vol 13s1 ◽  
pp. CIN.S13890 ◽  
Author(s):  
Changjin Hong ◽  
Solaiappan Manimaran ◽  
William Evan Johnson

Quality control and read preprocessing are critical steps in the analysis of data sets generated from high-throughput genomic screens. In the most extreme cases, improper preprocessing can negatively affect downstream analyses and may lead to incorrect biological conclusions. Here, we present PathoQC, a streamlined toolkit that seamlessly combines the benefits of several popular quality control software approaches for preprocessing next-generation sequencing data. PathoQC provides a variety of quality control options appropriate for most high-throughput sequencing applications. PathoQC is primarily developed as a module in the PathoScope software suite for metagenomic analysis. However, PathoQC is also available as an open-source Python module that can run as a stand-alone application or can be easily integrated into any bioinformatics workflow. PathoQC achieves high performance by supporting parallel computation and is an effective tool that removes technical sequencing artifacts and facilitates robust downstream analysis. The PathoQC software package is available at http://sourceforge.net/projects/PathoScope/ .


2020 ◽  
Author(s):  
Zeyu Jiao ◽  
Yinglei Lai ◽  
Jujiao Kang ◽  
Weikang Gong ◽  
Liang Ma ◽  
...  

High-throughput technologies, such as magnetic resonance imaging (MRI) and DNA/RNA sequencing (DNA-seq/RNA-seq), have been increasingly used in large-scale association studies. With these technologies, important biomedical research findings have been generated. The reproducibility of these findings, especially from structural MRI (sMRI) and functional MRI (fMRI) association studies, has recently been questioned. There is an urgent demand for a reliable overall reproducibility assessment for large-scale high-throughput association studies. It is also desirable to understand the relationship between study reproducibility and sample size in an experimental design. In this study, we developed a novel approach, the mixture model reproducibility index (M2RI), for assessing the reproducibility of large-scale association studies. With M2RI, we performed study reproducibility analysis for several recent large sMRI/fMRI data sets. The advantages of our approach and the sample size requirements for different phenotypes were clearly demonstrated, especially when compared to the Dice coefficient (DC). We applied M2RI to compare two MRI or RNA sequencing data sets. The reproducibility assessment results were consistent with our expectations. In summary, M2RI is a novel and useful approach for assessing study reproducibility, calculating sample sizes and evaluating the similarity between two closely related studies.
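
The Dice coefficient used as the comparison baseline above is the standard set-overlap measure DC = 2|A ∩ B| / (|A| + |B|). The Python sketch below shows that baseline only, not M2RI, whose details are not given in the abstract. The feature identifiers are hypothetical, and returning 1.0 for two empty sets is a convention chosen here, not something taken from the paper.

```python
def dice_coefficient(a, b):
    """Dice coefficient between two collections of significant features."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention assumed for this sketch: two empty result sets agree fully.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical significant features reported by two replicate studies.
study_1 = {"region_1", "region_2", "region_3"}
study_2 = {"region_2", "region_3", "region_4"}
overlap = dice_coefficient(study_1, study_2)  # 2 * 2 / (3 + 3)
```

Because the coefficient depends on which features pass a significance threshold in each study, it shifts with the threshold chosen.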

