Fast and efficient QTL mapper for thousands of molecular phenotypes

Motivation: In order to discover quantitative trait loci (QTLs), multi-dimensional genomic data sets combining DNA-seq and ChIP-/RNA-seq require methods that rapidly correlate tens of thousands of molecular phenotypes with millions of genetic variants while appropriately controlling for multiple testing. Results: We have developed FastQTL, a method that implements a popular cis-QTL mapping strategy in a user- and cluster-friendly tool. FastQTL also proposes an efficient permutation procedure to control for multiple testing. The outcome of permutations is modeled using beta distributions trained from a few permutations and from which adjusted p-values can be estimated at any level of significance with little computational cost. The Geuvadis & GTEx pilot data sets can now be easily analyzed an order of magnitude faster than previous approaches. Availability: Source code, binaries and comprehensive documentation of FastQTL are freely available to download at http://fastqtl.sourceforge.net/
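
The beta approximation described above can be illustrated with a deliberately simplified sketch. It assumes the smallest null p-value per phenotype follows Beta(1, n), which is the exact law for the minimum of n independent uniform p-values, and estimates only n by maximum likelihood. The abstract does not spell out the fitting procedure, so the one-parameter model, the function names and the synthetic "permutations" are all assumptions made for illustration, not FastQTL's implementation.

```python
import math
import random


def fit_effective_tests(null_min_pvalues):
    """MLE of n for Beta(1, n): n = -m / sum(log(1 - p_i)) over m null minima."""
    total = sum(math.log1p(-p) for p in null_min_pvalues)
    return -len(null_min_pvalues) / total


def adjusted_pvalue(nominal_p, n_eff):
    """CDF of Beta(1, n_eff) at the best nominal p-value of a phenotype."""
    return 1.0 - (1.0 - nominal_p) ** n_eff


# Synthetic stand-in for permutations (assumption): each "permutation" yields
# the smallest of 50 independent null p-values; 200 permutations in total.
rng = random.Random(0)
null_mins = [min(rng.random() for _ in range(50)) for _ in range(200)]

n_eff = fit_effective_tests(null_mins)   # estimated effective number of tests
adj = adjusted_pvalue(1e-4, n_eff)       # adjusted p-value for a nominal 1e-4
```

Because the fitted distribution has a closed-form CDF, an adjusted p-value far smaller than 1/200 can be read off even though only 200 permutations were run, which is the saving the abstract points to.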

Weighted mining of massive collections of P-values by convex optimization

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iax013 ◽

2017 ◽

Vol 7 (2) ◽

pp. 251-275

Author(s):

Edgar Dobriban

Keyword(s):

Convex Optimization ◽

Multiple Testing ◽

Observational Cosmology ◽

Data Sets ◽

Data Set ◽

P Values ◽

False Discovery ◽

Massive Data Set ◽

Optimal Weighting ◽

Weighting Problem

Abstract Researchers in data-rich disciplines—think of computational genomics and observational cosmology—often wish to mine large bodies of $P$-values looking for significant effects, while controlling the false discovery rate or family-wise error rate. Increasingly, researchers also wish to prioritize certain hypotheses, for example, those thought to have larger effect sizes, by upweighting, and to impose constraints on the underlying mining, such as monotonicity along a certain sequence. We introduce Princessp, a principled method for performing weighted multiple testing by constrained convex optimization. Our method elegantly allows one to prioritize certain hypotheses through upweighting and to discount others through downweighting, while constraining the underlying weights involved in the mining process. When the $P$-values derive from monotone likelihood ratio families such as the Gaussian means model, the new method allows exact solution of an important optimal weighting problem previously thought to be non-convex and computationally infeasible. Our method scales to massive data set sizes. We illustrate the applications of Princessp on a series of standard genomics data sets and offer comparisons with several previous ‘standard’ methods. Princessp offers both ease of operation and the ability to scale to extremely large problem sizes. The method is available as open-source software from github.com/dobriban/pvalue_weighting_matlab (accessed 11 October 2017).
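
As a point of reference for what "weighted multiple testing" means, the sketch below applies the classical weighted Bonferroni rule: hypothesis i is rejected when p_i <= alpha * w_i / m, with the weights rescaled to average one. It does not reproduce the convex optimization that Princessp uses to choose the weights; the weights here are hypothetical inputs and the function name is an assumption.

```python
def weighted_bonferroni(pvalues, weights, alpha=0.05):
    """Return one reject/keep decision per hypothesis under weighted Bonferroni."""
    m = len(pvalues)
    if len(weights) != m:
        raise ValueError("need exactly one weight per p-value")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    mean_w = sum(weights) / m
    # Rescale so the weights average one; the thresholds then sum to alpha,
    # which keeps the family-wise error rate at or below alpha.
    return [p <= alpha * (w / mean_w) / m for p, w in zip(pvalues, weights)]


pvalues = [0.004, 0.02, 0.03, 0.6]
weights = [1.0, 3.0, 0.0, 0.0]  # hypothetical prior weights, not from the paper
rejected = weighted_bonferroni(pvalues, weights)
unweighted = weighted_bonferroni(pvalues, [1.0] * 4)
```

In this toy case the upweighted second hypothesis is rejected only under the weighted rule, at the price of never rejecting the two hypotheses given zero weight.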

Fast NO2 retrievals from Odin-OSIRIS limb scatter measurements

Atmospheric Measurement Techniques ◽

10.5194/amt-4-965-2011 ◽

2011 ◽

Vol 4 (5) ◽

pp. 965-972 ◽

Cited By ~ 28

Author(s):

A. E. Bourassa ◽

C. A. McLinden ◽

C. E. Sioris ◽

S. Brohede ◽

A. F. Bathgate ◽

...

Keyword(s):

Infrared Imaging ◽

Imaging System ◽

Computational Cost ◽

Data Sets ◽

Proof Of Concept ◽

Spectral Fitting ◽

Analysis Technique ◽

Order Of Magnitude ◽

Very High ◽

Good Agreement

Abstract. The feasibility of retrieving vertical profiles of NO2 from space-based measurements of limb scattered sunlight has been demonstrated using several different data sets since the 1980s. The NO2 data product routinely retrieved from measurements made by the Optical Spectrograph and InfraRed Imaging System (OSIRIS) instrument onboard the Odin satellite uses a spectral fitting technique over the 437 to 451 nm range, over which there are 36 individual wavelength measurements. In this work we present a proof-of-concept technique for the retrieval of NO2 using only 4 of the 36 OSIRIS measurements in this wavelength range, which reduces the computational cost by almost an order of magnitude. The method is an adaptation of a triplet analysis technique that is currently used for the OSIRIS retrievals of ozone at Chappuis band wavelengths. The results obtained are shown to be in very good agreement with the spectral fit method, and provide an important alternative for applications where the computational burden is very high. Additionally, this provides a baseline for future instrument design in terms of cost effectiveness and reducing spectral range requirements.

Fast NO2 retrievals from Odin-OSIRIS limb scatter measurements

Atmospheric Measurement Techniques Discussions ◽

10.5194/amtd-3-5499-2010 ◽

2010 ◽

Vol 3 (6) ◽

pp. 5499-5519

Author(s):

A. E. Bourassa ◽

C. A. McLinden ◽

C. E. Sioris ◽

S. Brohede ◽

E. J. Llewellyn ◽

...

Keyword(s):

Infrared Imaging ◽

Imaging System ◽

Computational Cost ◽

Data Sets ◽

Proof Of Concept ◽

Spectral Fitting ◽

Analysis Technique ◽

Order Of Magnitude ◽

Very High ◽

Good Agreement

Abstract. The feasibility of retrieving vertical profiles of NO2 from space-based measurements of limb scattered sunlight has been demonstrated using several different data sets since the 1980s. The NO2 data product routinely retrieved from measurements made by the Optical Spectrograph and InfraRed Imaging System (OSIRIS) instrument onboard the Odin satellite uses a spectral fitting technique over the 437 to 451 nm range, over which there are 36 individual wavelength measurements. In this work we present a proof-of-concept technique for the retrieval of NO2 using only 4 of the 36 OSIRIS measurements in this wavelength range, which reduces the computational cost by almost an order of magnitude. The method is an adaptation of a triplet analysis technique that is currently used for the OSIRIS retrievals of ozone at Chappuis band wavelengths. The results obtained are shown to be in very good agreement with the spectral fit method, and provide an important alternative for two-dimensional tomographic algorithms where the computational burden is very high. Additionally, this provides a baseline for future instrument design in terms of cost effectiveness and boosting signal to noise by reducing spectral resolution requirements.

Benchmarking Association Analyses of Continuous Exposures with RNA-seq in Observational Studies

10.1101/2021.02.12.430989 ◽

2021 ◽

Author(s):

Tamar Sofer ◽

Nuzulul Kurniansyah ◽

François Aguet ◽

Kristin Ardlie ◽

Peter Durda ◽

...

Keyword(s):

Linear Regression ◽

Observational Studies ◽

Multiple Testing ◽

Null Distribution ◽

Rna Seq ◽

Power Comparison ◽

Multiple Testing Correction ◽

P Values ◽

Association Analyses ◽

Regression Methods

Abstract Large datasets of hundreds to thousands of individuals measuring RNA-seq in observational studies are becoming available. Many popular software packages for analysis of RNA-seq data were constructed to study differences in expression signatures in an experimental design with well-defined conditions (exposures). In contrast, observational studies may have varying levels of confounding of the transcript-exposure associations; further, exposure measures may vary from discrete (exposed, yes/no) to continuous (levels of exposure), with non-normal distributions of exposure. We compare popular software for gene expression (DESeq2, edgeR, and limma) as well as linear regression-based analyses for studying the association of continuous exposures with RNA-seq. We developed a computational pipeline that includes transformation, filtering, and generation of an empirical null distribution of association p-values, and we apply the pipeline to compute empirical p-values with multiple testing correction. We employ a resampling approach that allows for assessment of false positive detection across methods, power comparison, and the computation of quantile empirical p-values. The results suggest that linear regression methods are substantially faster with better control of false detections than other methods, even with the resampling method to compute empirical p-values. We provide the proposed pipeline with fast algorithms in R.
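
The idea of an empirical p-value from a resampled null can be sketched as follows. The authors' pipeline is written in R and its resampling scheme is not detailed in the abstract, so this Python version, which simply permutes the exposure and ignores covariates, is an assumption-laden illustration; the function names and the toy data are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def empirical_pvalue(exposure, expression, n_perm=999, seed=0):
    """Two-sided permutation p-value for the exposure-expression correlation."""
    rng = random.Random(seed)
    observed = abs(pearson(exposure, expression))
    shuffled = list(exposure)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real exposure-expression link
        if abs(pearson(shuffled, expression)) >= observed:
            hits += 1
    # Adding one to both counts keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (assumption): one transcript that tracks the exposure almost linearly.
exposure = [float(i) for i in range(20)]
expression = [2.0 * x + (0.1 if i % 2 else -0.1) for i, x in enumerate(exposure)]
p_value = empirical_pvalue(exposure, expression)
```

With 999 permutations the smallest attainable value is 1/1000, so resolving smaller p-values requires either more resamples or pooling null statistics across transcripts.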

largeQvalue: A program for calculating FDR estimates with large datasets

10.1101/010074 ◽

2014 ◽

Cited By ~ 6

Author(s):

Andrew Anand Brown

Keyword(s):

Multiple Testing ◽

High Performance ◽

Computation Time ◽

R Package ◽

Large Datasets ◽

Parameter Estimates ◽

Rna Seq ◽

P Values ◽

Exon Level ◽

Value Estimation

This is an implementation of the R statistical software qvalue package (Alan Dabney, John D. Storey and with assistance from Gregory R. Warnes (). qvalue: Q-value estimation for false discovery rate control. R package version 1.34.0.), designed for use with large datasets where memory or computation time is limiting. In addition to estimating p values adjusted for multiple testing, the software outputs a script which can be pasted into R to produce diagnostic plots and report parameter estimates. This program runs almost 30 times faster and requests substantially less memory than the qvalue package when analysing 10 million p values on a high performance cluster. The software has been used to control for the multiple testing of 390 million tests when analysing a full cis scan of RNA-seq exon level gene expression from the Eurobats project. The source code and links to executable files for Linux and Mac OS X can be found here: https://github.com/abrown25/qvalue. Help for the package can be found by running ./largeQvalue --help.
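
For readers unfamiliar with q-values, the following is a minimal sketch of the underlying estimate using a single fixed tuning value lambda. It is not the largeQvalue or qvalue code: both offer more elaborate ways of estimating the null proportion pi0, and the function name and the default lam=0.5 are assumptions made here.

```python
def qvalues(pvalues, lam=0.5):
    """Storey-style q-values with pi0 estimated at one fixed lambda."""
    m = len(pvalues)
    # Share of p-values above lam, rescaled by the width of (lam, 1].
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    if pi0 == 0.0:
        raise ValueError("no p-values above lam; pi0 cannot be estimated")
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values are monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q


q = qvalues([0.01, 0.02, 0.03, 0.04, 0.6, 0.8, 0.7, 0.9])
```

The sort dominates the cost, so the procedure is O(m log m); at hundreds of millions of tests memory layout matters far more than the arithmetic, which is the problem the program above targets.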

A novel algorithm to flag columns associated in any way with others or a dependent variable is computationally tractable in large data matrices and has much higher power when columns are linked like mutations in chromosomes

10.1101/2021.09.15.460360 ◽

2021 ◽

Author(s):

Marcos A. Antezana

Keyword(s):

Model Selection ◽

Type I Error ◽

Computational Cost ◽

Large Data ◽

Data Matrix ◽

P Value ◽

Type I ◽

P Values ◽

Multiple Tests ◽

Order Of Magnitude

ABSTRACT When a data matrix (DM) has many independent variables (IVs), it is not computationally tractable to assess the association of every distinct IV subset with the dependent variable (DV) of the DM, because the number of subsets explodes combinatorially as IVs increase. But model selection and correcting for multiple tests are complex even with few IVs. DMs in genomics will soon summarize millions of markers (mutations) and genomes. Searching exhaustively in such DMs for mutations that alone or synergistically with others are associated with a trait is computationally tractable only for 1- and 2-mutation effects. This is also why population geneticists study mainly 2-marker combinations. I present a computationally tractable, fully parallelizable Participation in Association Score (PAS) that in a DM with markers detects one by one every column that is strongly associated in any way with others. PAS does not examine column subsets and its computational cost grows linearly with the number of columns, remaining reasonable even when DMs have millions of columns. PAS P values are readily obtained by permutation and accurately Sidak-corrected for multiple tests, bypassing model selection. The P values of a column’s PASs and dvPASs for different orders of association are i.i.d. and easily turned into a single P value. PAS exploits how associations of markers in the rows of a DM cause associations of matches in the pairwise comparisons of the rows. For every such comparison with a match at a tested column, PAS computes the matches at other columns by modifying the comparison’s total matches (scored once per DM), yielding a distribution of conditional matches that reacts diagnostically to the associations of the tested column.
Equally computationally tractable is dvPAS that flags DV-associated IVs by also probing the matches at the DV. Simulations show that i) PAS and dvPAS generate uniform-(0,1)-distributed type I error in null DMs and ii) detect randomly encountered binary and trinary models of significant n-column association and n-IV association to a binary DV, respectively, with power in the order of magnitude of exhaustive evaluation’s and false positives that are uniform-(0,1)-distributed or straightforwardly tuned to be so. Power to detect 2-way associations that extend over 100+ columns is non-parametrically ultimate but that to detect pure n-column associations and pure n-IV DV associations sinks exponentially with increasing n. Important for geneticists, dvPAS power increases about twofold in trinary vs. binary DMs and by orders of magnitude with markers linked like mutations in chromosomes, especially in trinary DMs where furthermore dvPAS fine-maps with highest resolution.
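
The Sidak correction mentioned in the abstract has a simple closed form. The abstract does not write it out, so the standard textbook version is given here, assuming m independent tests; the numerical values are an illustrative example, not results from the paper.

```latex
% Sidak-adjusted P value for a raw P value p among m independent tests
p_{\mathrm{adj}} = 1 - (1 - p)^{m}

% Illustrative example: p = 10^{-8}, m = 10^{6}
p_{\mathrm{adj}} = 1 - \left(1 - 10^{-8}\right)^{10^{6}}
                 \approx 1 - e^{-0.01} \approx 0.00995

% Bonferroni bound for comparison: m p = 0.01
```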

THE EFFECT OF STORY MAPPING STRATEGY ON GRADE VIII STUDENTS’ ACHIEVEMENT IN WRITING NARRATIVE TEXT

10.24114/reg.v6i3.6698 ◽

2017 ◽

Vol 6 (3) ◽

Author(s):

Aryadi Manuel Gultom and Isli Iriani Indiah Pane

Keyword(s):

High School ◽

Junior High School ◽

Narrative Text ◽

Control Group ◽

Writing Test ◽

Mapping Strategy ◽

Story Mapping ◽

Class Viii ◽

Level Of Significance ◽

Experimental Group

This research investigates the effect of the story mapping strategy on grade VIII students’ achievement in writing narrative text. It used an experimental research design. The population was the eighth-grade (VIII) students of St. Thomas 1 Junior High School Medan, and two classes formed the sample. The first class (VIII-F) served as the experimental group, while the second class (VIII-B) served as the control group. The experimental group was taught using the story mapping strategy, while the control group was taught using the lecturing strategy. The instrument for collecting the data was a writing test. The data were analyzed using the t-test formula. The analysis showed that the scores of students in the experimental group were higher than the scores of students in the control group: at the level of significance (α) 0.05 with 38 degrees of freedom (df), the t-observed was 2.818 while the t-table was 2.024. Therefore, applying the story mapping strategy significantly affected the students’ achievement in writing narrative text.
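
The decision rule behind the reported result can be sketched as below. The abstract says only that a t-test formula was used; the pooled-variance two-sample form is assumed here because its degrees of freedom, n1 + n2 - 2, match the reported df of 38. The function names are hypothetical, and the only study values used are the t-observed and t-table figures quoted in the abstract.

```python
import math
import statistics


def pooled_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se, df


def is_significant(t_observed, t_critical):
    """Reject the null hypothesis when |t-observed| exceeds the table value."""
    return abs(t_observed) > t_critical


# Figures quoted in the abstract: t-observed vs. t-table at alpha = 0.05, df = 38.
reported = is_significant(2.818, 2.024)
```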

Development of genic KASP SNP markers from RNA-Seq data for map-based cloning and marker-assisted selection in maize

BMC Plant Biology ◽

10.1186/s12870-021-02932-8 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Zhengjie Chen ◽

Dengguo Tang ◽

Jixing Ni ◽

Peng Li ◽

Le Wang ◽

...

Keyword(s):

Marker Assisted Selection ◽

Inbred Lines ◽

Average Density ◽

Snp Markers ◽

Data Sets ◽

Rna Seq ◽

Specific Pcr ◽

Maize Inbred Lines ◽

Allele Specific ◽

Allele Specific Pcr

Abstract Background Maize is one of the most important field crops in the world. Most of the key agronomic traits, including yield traits and plant architecture traits, are quantitative. Fine mapping of genes/quantitative trait loci (QTL) influencing a key trait is essential for marker-assisted selection (MAS) in maize breeding. However, SNP markers with high density and high polymorphism are lacking, especially kompetitive allele specific PCR (KASP) SNP markers that can be used for automatic genotyping. To date, a large volume of sequencing data has been produced by next-generation sequencing technology, which provides a good pool of SNP loci for the development of SNP markers. In this study, we carried out a multi-step screening method to identify KASP SNP markers based on the RNA-Seq data sets of 368 maize inbred lines. Results A total of 2,948,985 SNPs were identified in the high-throughput RNA-Seq data sets, with an average density of 1.4 SNP/kb. Of these, 71,311 KASP SNP markers (an average density of 34 KASP SNP/Mb) were developed based on strict criteria: unique genomic region, bi-allelic, polymorphism information content (PIC) value ≥0.4, and conserved primer sequences, and were mapped on 16,161 genes. These 16,161 genes were annotated to 52 gene ontology (GO) terms, including most of the primary and secondary metabolic pathways. Subsequently, 50 KASP SNP markers with PIC values ranging from 0.14 to 0.5 in the 368 RNA-Seq data sets and with polymorphism between the maize inbred lines 1212 and B73 in in silico analysis were selected to experimentally validate the accuracy and polymorphism of the SNPs; 46 SNPs (92.00%) showed polymorphism between the maize inbred lines 1212 and B73.
Moreover, these 46 polymorphic SNPs were used to genotype another 20 maize inbred lines; all 46 SNPs showed polymorphism in the 20 lines, and the PIC value of each SNP ranged from 0.11 to 0.50 with an average of 0.35. The results suggested that the KASP SNP markers developed in this study were accurate and polymorphic. Conclusions These high-density polymorphic KASP SNP markers will be a valuable resource for map-based cloning of QTL/genes and marker-assisted selection in maize. Furthermore, the method used to develop SNP markers in maize can also be applied in other species.
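
The PIC filter can be sketched as follows. The abstract does not state which PIC formula was used; the simplified form 1 - sum(p_i^2) is assumed here because it reaches the reported maximum of 0.50 for a bi-allelic marker, whereas the fuller definition with the extra cross-term tops out at 0.375 for two alleles. Function names and example frequencies are hypothetical.

```python
def pic_simple(allele_freqs):
    """Simplified PIC (gene diversity): 1 minus the sum of squared frequencies."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(f * f for f in allele_freqs)


def passes_pic_filter(allele_freqs, threshold=0.4):
    """Apply a PIC >= threshold screen like the one described in the abstract."""
    return pic_simple(allele_freqs) >= threshold


balanced = pic_simple([0.5, 0.5])        # maximum for a bi-allelic SNP
keep = passes_pic_filter([0.7, 0.3])     # PIC about 0.42, retained
drop = passes_pic_filter([0.8, 0.2])     # PIC about 0.32, filtered out
```

Under this form a threshold of 0.4 corresponds to a minor allele frequency of roughly 0.28 or higher.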

MUREN: a robust and multi-reference approach of RNA-seq transcript normalization

BMC Bioinformatics ◽

10.1186/s12859-021-04288-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yance Feng ◽

Lei M. Li

Keyword(s):

Biological Significance ◽

Housekeeping Genes ◽

R Package ◽

Data Sets ◽

Statistical Regression ◽

Rna Seq ◽

Least Trimmed Squares ◽

Standard Data ◽

Wide Range ◽

Multiple References

Abstract Background Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts for the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The assessment of normalization quality emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by a single-cell data set of the cell cycle. MUREN is implemented as an R package. The code under license GPL-3 is available on the github platform: github.com/hippo-yf/MUREN and on the conda platform: anaconda.org/hippo-yf/r-muren. Conclusions MUREN performs the RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations are used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
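
To show why least trimmed squares (LTS) suits pairwise normalization, the sketch below solves the simplest LTS problem, a single location parameter, on gene-wise log ratios between a sample and a reference. The abstract does not specify MUREN's regression model, so the univariate setting, the 75% retained fraction, the function name and the toy values are all assumptions; the R package should be consulted for the actual method.

```python
def lts_location(values, keep_fraction=0.75):
    """LTS location estimate: mean of the h sorted neighbours with least spread."""
    xs = sorted(values)
    n = len(xs)
    h = max(1, int(n * keep_fraction))  # number of observations retained
    best_mean, best_ss = None, float("inf")
    # For one location parameter the optimal subset is contiguous in sorted order.
    for start in range(n - h + 1):
        window = xs[start:start + h]
        mean = sum(window) / h
        ss = sum((x - mean) ** 2 for x in window)
        if ss < best_ss:
            best_mean, best_ss = mean, ss
    return best_mean


# Hypothetical log ratios: six stable genes near 0.5 and two strongly
# up-regulated genes that should not influence the scaling factor.
log_ratios = [0.48, 0.50, 0.52, 0.49, 0.51, 0.50, 3.0, 3.2]
shift = lts_location(log_ratios)                  # driven by the stable genes
naive_shift = sum(log_ratios) / len(log_ratios)   # pulled up by the outliers
```

The trimmed estimate stays at the level of the stable genes, which play the role of housekeeping genes without having to be named in advance, while the plain mean is dragged toward the differentially expressed ones.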
