snpQT: flexible, reproducible, and comprehensive quality control and imputation of genomic data

F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 567
Author(s):  
Christina Vasilopoulou ◽  
Benjamin Wingfield ◽  
Andrew P. Morris ◽  
William Duddy

Quality control of genomic data is an essential but complicated multi-step procedure, often requiring separate installation of, and expert familiarity with, a combination of different bioinformatics tools. Software incompatibilities and inconsistencies across computing environments are recurrent challenges, leading to poor reproducibility. Existing semi-automated or automated solutions lack comprehensive quality checks, flexible workflow architecture, and user control. To address these challenges, we have developed snpQT: a scalable, stand-alone software pipeline using Nextflow and BioContainers for comprehensive, reproducible and interactive quality control of human genomic data. snpQT offers some 36 discrete quality filters or correction steps in a complete standardised pipeline, producing graphical reports to demonstrate the state of the data before and after each quality-control procedure. This includes human genome build conversion, population stratification against data from the 1000 Genomes Project, automated population outlier removal, and built-in imputation with its own pre- and post-imputation quality controls. Common input formats are used, and a synthetic dataset and comprehensive online tutorial are provided for testing, educational purposes, and demonstration. The snpQT pipeline is designed to run with minimal user input and coding experience; quality-control steps are implemented with numerous user-modifiable thresholds, and workflows can be flexibly combined in custom combinations. snpQT is open source and freely available at https://github.com/nebfield/snpQT. A comprehensive online tutorial and installation guide, covering the workflow through to GWAS, is provided at https://snpqt.readthedocs.io/en/latest/, introducing snpQT using a synthetic demonstration dataset and a real-world Amyotrophic Lateral Sclerosis SNP-array dataset.
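As a hedged illustration of the kind of discrete quality filter such a pipeline chains together, the sketch below implements per-sample call-rate filtering in Python. The function name, threshold, and genotype encoding are assumptions for demonstration only; snpQT itself orchestrates established command-line tools inside Nextflow processes rather than exposing a Python API.

import numpy as np

def filter_samples_by_call_rate(genotypes, min_call_rate=0.98):
    """Drop samples whose fraction of non-missing genotype calls
    falls below min_call_rate (missing calls encoded as -1)."""
    # genotypes: samples x variants matrix of 0/1/2 allele counts
    call_rate = (genotypes != -1).mean(axis=1)
    keep = call_rate >= min_call_rate
    return genotypes[keep], keep

# Toy example: 3 samples x 5 variants; the third sample has a 40%
# call rate and is removed at a 0.8 threshold
g = np.array([[0, 1, 2, 0, 1],
              [1, -1, 2, 0, 0],
              [-1, -1, 0, 1, -1]])
filtered, kept = filter_samples_by_call_rate(g, min_call_rate=0.8)
print(kept)  # [ True  True False]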


2021 ◽  
Vol 12 ◽  
Author(s):  
Ting-Hsuan Sun ◽  
Yu-Hsuan Joni Shao ◽  
Chien-Lin Mao ◽  
Miao-Neng Hung ◽  
Yi-Yun Lo ◽  
...  

Background: Single-nucleotide polymorphism (SNP) arrays are an ideal technology for genotyping genetic variants in mass screening. However, using SNP arrays to detect rare variants [minor allele frequency (MAF) < 1%] remains a challenge because of noisy signals and batch effects. An approach that improves genotyping quality is needed for clinical applications.

Methods: We developed a quality-control procedure for rare variants which integrates different algorithms, filters, and experiments to increase the accuracy of variant calling. Using data from the TWB 2.0 custom Axiom array, we adopted an advanced normalization adjustment to prevent false calls caused by cluster splitting, and a rare-het adjustment which decreases false calls in rare variants. The concordance of allelic frequencies from the array data was compared against sequencing datasets of Taiwanese individuals. Finally, the genotyping results were used to detect familial hypercholesterolemia (FH), thrombophilia (TH), and maturity-onset diabetes of the young (MODY) to assess performance in disease screening. All heterozygous calls were verified by Sanger sequencing or qPCR. The positive predictive value (PPV) of each step was estimated to evaluate the performance of our procedure.

Results: We analyzed SNP array data from 43,433 individuals, interrogating 267,247 rare variants. The advanced normalization and rare-het adjustment methods adjusted genotype calling for 168,134 variants (96.49%). We further removed 3916 probesets whose MAFs were discordant between the SNP array and sequencing data. The PPV for detecting pathogenic variants with 0.01% < MAF ≤ 1% exceeded 99.37%. For variants with MAF ≤ 0.01%, PPVs improved from 95% to 100% for FH, from 42.11% to 85.19% for TH, and from 18.24% to 72.22% for MODY after adopting our rare-variant quality-control procedure and experimental verification.

Conclusion: With our quality-control procedure, SNP arrays can adequately detect variants with MAF values ranging from 0.01% to 0.1%. For variants with MAF ≤ 0.01%, experimental validation is needed unless sequencing data from a homogeneous population of >10,000 individuals are available. These results demonstrate that our procedure performs correct genotype calling of rare variants, providing a solution for pathogenic variant detection through SNP arrays and bringing tremendous promise for implementing precision medicine in medical practice.
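The positive predictive value used to benchmark each step has the standard definition TP/(TP + FP); a minimal Python sketch follows, with invented counts rather than the study's actual verification numbers:

def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP): the fraction of array-called variants
    confirmed by an orthogonal method (Sanger sequencing or qPCR)."""
    return true_positives / (true_positives + false_positives)

# Invented example: 100 heterozygous calls verified, 95 confirmed
print(f"PPV = {positive_predictive_value(95, 5):.2%}")  # PPV = 95.00%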


1970 ◽  
Vol 68 (2) ◽  
pp. 221-232 ◽  
Author(s):  
R. J. Gilbert

SUMMARY

There is no official scheme for testing disinfectants and detergent/disinfectants for use in the retail food trade, and few recommended procedures have been given for the cleaning of equipment with these agents. Therefore, field trials were carried out in a large self-service store. Comparisons were made of the various cleaning efficiencies, as determined by bacterial plate counts, of detergent and disinfectant solutions and machine cleaning oils applied with either clean cloths or disposable paper towels to items of equipment. The most satisfactory results were always obtained when anionic detergent (0·75% w/v) and hypochlorite (200 p.p.m. available chlorine) solutions were applied in a ‘two-step’ procedure.

Tests were made to compare the calcium alginate swab-rinse and the agar sausage (Agaroid) techniques for the enumeration of bacteria on stainless steel, plastic, formica and wooden surfaces before and after a cleaning process. Although recovery rates were always greater by the swab-rinse technique, the agar sausage technique was considered to be a useful routine control method for surface sampling.


2016 ◽  
Author(s):  
Robert J. H. Dunn ◽  
Kate M. Willett ◽  
David E. Parker ◽  
Lorna Mitchell

Abstract. HadISD is a sub-daily, station-based, quality-controlled dataset designed to study past extremes of temperature, pressure and humidity and to allow comparisons with future projections. Herein we describe the first major update to the HadISD dataset. The temporal coverage has been extended to span 1931 to the present, doubling the time range over which data are provided. Improvements made to the station selection and merging procedures result in 7677 stations being provided in version 2.0.0.2015p of this dataset. The selection of stations to merge into composites has also been improved and made more robust. The underlying structure of the quality-control procedure is the same as for HadISD.1.0.x, but a number of improvements have been implemented in individual tests, and more detailed quality-control tests for wind speed and direction have been added. The data will be made available as netCDF files at www.metoffice.gov.uk/hadobs/hadisd and updated annually.
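A hedged sketch of loading one HadISD station file in Python follows; the filename and the "temperatures" variable name are assumptions for illustration, and the actual netCDF schema should be checked against the files distributed at www.metoffice.gov.uk/hadobs/hadisd:

import xarray as xr

# Open a single HadISD v2 station file (filename is hypothetical)
ds = xr.open_dataset("hadisd_station_012345.nc")

# List the observed variables and quality-control flags the file provides
print(ds.data_vars)

# Sub-daily temperatures for the newly added early period, assuming the
# variable is named "temperatures"
temps = ds["temperatures"].sel(time=slice("1931-01-01", "1931-12-31"))
print(float(temps.mean()))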


2018 ◽  
Vol 77 (OCE3) ◽  
Author(s):  
S. Cassidy ◽  
B. Phillips ◽  
J. Caldeira Fernandes da Silva ◽  
A. Parle

Author(s):  
Oladotun A. Ojo ◽  
Peter A. Oluwafisoye ◽  
Charles O. Chime

The sensitivity of radiographic films is an important factor in the clarity and accuracy of X-ray exposure of patients during treatment or diagnostic procedures. A thorough analysis of film sensitivity before and after exposure is therefore important to enhance the Quality Assurance (QA) and Quality Control (QC) of exposure procedures. The optical density (OD) of each film was measured with a densitometer (model MA 5336, made by GAMMEX), and these values were then converted to the absorbed dose (X mGy), the amount of dose absorbed by each patient. The optical density versus dose curve followed the expected pattern, showing good agreement with the general model and indicating that the films employed in the exposures were of good quality and standard. Hence the optical density versus dose sensitometric curves depict the sensitivity of the various films after exposure to X-ray radiation through the patients.
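A minimal sketch of how a sensitometric (OD versus dose) calibration can be built and inverted in Python; the calibration readings below are invented, and a simple linear fit over the film's linear response region stands in for the general model referred to above:

import numpy as np

# Invented calibration points: absorbed dose (mGy) vs. measured optical density
dose = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
od = np.array([0.32, 0.51, 0.88, 1.62, 3.05])

# Fit OD = a*dose + b over the (assumed) linear response region
a, b = np.polyfit(dose, od, 1)

def od_to_dose(od_reading):
    """Invert the calibration to convert an OD reading to absorbed dose."""
    return (od_reading - b) / a

print(f"OD 1.20 -> {od_to_dose(1.20):.2f} mGy")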


Author(s):  
Radovan Kasarda ◽  
Nina Moravčíková ◽  
Ondrej Kadlečík ◽  
Anna Trakovická ◽  
Marko Halo ◽  
...  

The objective of this study was to analyse the level of pedigree and genomic inbreeding in a herd of Norik of Muran horses. The pedigree file included 1374 animals (603 stallions and 771 mares), while the reference population consisted of animals genotyped on a 70k SNP platform (n = 25). The trend of pedigree inbreeding was expressed as the probability that an animal carries two alleles identical by descent, computed according to classical formulas. The trend of genomic inbreeding was derived from the distribution of runs of homozygosity (ROHs) of various lengths across the genome, on the assumption that these regions reflect autozygosity originating from past generations of ancestors. A maximum of 19 generations was found in the pedigree file. As expected, the highest level of pedigree completeness was found in the first five generations. Subsequent quality control of the genomic data retained a total of 54,432 SNP markers covering 2.242 Mb of the autosomal genome. The pedigree analysis showed that pedigree inbreeding in the current generation can be expected at a level of 0.23% (ΔFPEDi = 0.19 ± 1.17%). Comparable results were obtained by the genomic analysis, in which inbreeding in the current generation reached 0.11%. Thus, in terms of genetic diversity, both analyses reflected a sufficient level of variability across the analysed population of Norik of Muran horses.
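The genomic side of such an analysis commonly summarises inbreeding as F_ROH, the fraction of the covered autosomal genome that lies in runs of homozygosity; a short Python sketch with invented segment data follows (the genome length here is illustrative, not the study's figure):

def f_roh(roh_lengths_bp, covered_autosome_bp):
    """F_ROH = sum of ROH segment lengths / covered autosomal length."""
    return sum(roh_lengths_bp) / covered_autosome_bp

# Invented example: three ROH segments against a 2.4 Gb covered autosome
segments_bp = [5_200_000, 12_800_000, 3_100_000]
print(f"F_ROH = {f_roh(segments_bp, 2.4e9):.4f}")  # F_ROH = 0.0088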


2015 ◽  
Vol 54 (6) ◽  
pp. 1267-1282 ◽  
Author(s):  
Youlong Xia ◽  
Trent W. Ford ◽  
Yihua Wu ◽  
Steven M. Quiring ◽  
Michael B. Ek

Abstract. The North American Soil Moisture Database (NASMD) was initiated in 2011 to provide support for developing climate forecasting tools, calibrating land surface models, and validating satellite-derived soil moisture algorithms. The NASMD has collected data from over 30 soil moisture observation networks providing millions of in situ soil moisture observations in all 50 states, as well as Canada and Mexico. It is recognized that the quality of measured soil moisture in NASMD is highly variable because of the diversity of climatological conditions, land cover, soil texture, and topographies of the stations, and differences in measurement devices (e.g., sensors) and installation. It is also recognized that error, inaccuracy, and imprecision in the data can have significant impacts on practical operations and scientific studies. Therefore, developing an appropriate quality control procedure is essential to ensure that the data are of the best quality. In this study, an automated quality control approach is developed using the North American Land Data Assimilation System, phase 2 (NLDAS-2), Noah soil porosity, soil temperature, and fraction of liquid and total soil moisture to flag erroneous and/or spurious measurements. Overall results show that this approach is able to flag unreasonable values when the soil is partially frozen. A validation example using NLDAS-2 multiple model soil moisture products at the 20-cm soil layer showed that the quality control procedure had a significant positive impact in Alabama, North Carolina, and west Texas. It had a greater impact in colder regions, particularly during spring and autumn. Over 433 NASMD stations have been quality controlled using the methodology proposed in this study, and the algorithm will be implemented to control data quality from the other ~1200 NASMD stations in the near future.
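A hedged sketch of the kind of model-assisted flagging described above; the thresholds and flag logic here are illustrative assumptions, not the paper's published criteria:

import numpy as np

def flag_soil_moisture(obs_sm, porosity, soil_temp_c, liquid_frac,
                       min_liquid_frac=0.9):
    """Return a boolean mask of suspect observations.

    Flags a value when it exceeds the model soil porosity, is
    non-positive, or falls in partially frozen soil, where in situ
    sensors are known to report spurious values.
    """
    obs_sm = np.asarray(obs_sm)
    exceeds_porosity = obs_sm > porosity
    nonphysical = obs_sm <= 0.0
    partially_frozen = (np.asarray(soil_temp_c) <= 0.0) | \
                       (np.asarray(liquid_frac) < min_liquid_frac)
    return exceeds_porosity | nonphysical | partially_frozen

# Invented example: volumetric soil moisture (m3 m-3) at the 20-cm layer
flags = flag_soil_moisture([0.25, 0.61, 0.18, 0.30], porosity=0.45,
                           soil_temp_c=[5.0, 4.0, -1.5, 6.0],
                           liquid_frac=[1.0, 1.0, 0.6, 1.0])
print(flags)  # [False  True  True False]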

