Normalisr: normalization and association testing for single-cell CRISPR screen and co-expression

2021 ◽  
Author(s):  
Lingfei Wang

Abstract: Single-cell RNA sequencing (scRNA-seq) provides unprecedented technical and statistical potential to study gene regulation but is subject to technical variations and sparsity. Here we present Normalisr, a linear-model-based normalization and statistical hypothesis testing framework that unifies single-cell differential expression, co-expression, and CRISPR scRNA-seq screen analyses. By systematically detecting and removing nonlinear confounding from library size, Normalisr achieves high sensitivity, specificity, speed, and generalizability across multiple scRNA-seq protocols and experimental conditions with unbiased P-value estimation. We use Normalisr to reconstruct robust gene regulatory networks from trans-effects of gRNAs in large-scale CRISPRi scRNA-seq screens and gene-level co-expression networks from conventional scRNA-seq.

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Lingfei Wang

Abstract: Single-cell RNA sequencing (scRNA-seq) provides unprecedented technical and statistical potential to study gene regulation but is subject to technical variations and sparsity. Furthermore, statistical association testing remains difficult for scRNA-seq. Here we present Normalisr, a normalization and statistical association testing framework that unifies single-cell differential expression, co-expression, and CRISPR screen analyses with linear models. By systematically detecting and removing nonlinear confounders arising from library size at mean and variance levels, Normalisr achieves high sensitivity, specificity, speed, and generalizability across multiple scRNA-seq protocols and experimental conditions with unbiased p-value estimation. The superior scalability allows us to reconstruct robust gene regulatory networks from trans-effects of guide RNAs in large-scale single-cell CRISPRi screens. On conventional scRNA-seq, Normalisr recovers gene-level co-expression networks that recapitulate known gene functions.
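The idea of removing a nonlinear library-size confounder with a linear model can be illustrated with a minimal sketch. This is not Normalisr's API or its actual procedure; the helper `regress_out`, the simulated data, and the choice of linear and quadratic terms of log library size as covariates are all assumptions made for illustration.

```python
import numpy as np

def regress_out(y, covariates):
    """Return residuals of y after least-squares regression on covariates plus an intercept."""
    design = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

rng = np.random.default_rng(0)
n_cells = 500

# Simulated (hypothetical) library sizes and their logarithm.
library_size = rng.integers(1_000, 20_000, size=n_cells)
log_lib = np.log(library_size)

# Hypothetical log-expression of one gene that depends nonlinearly
# on library size and on nothing else.
expression = 0.5 * log_lib + 0.1 * log_lib**2 + rng.normal(0, 0.1, n_cells)

# Assumed covariates: linear and quadratic terms of log library size.
covariates = np.column_stack([log_lib, log_lib**2])
residual = regress_out(expression, covariates)

corr_before = np.corrcoef(expression, log_lib)[0, 1]
corr_after = np.corrcoef(residual, log_lib)[0, 1]
print(f"correlation with log library size: before={corr_before:.3f}, after={corr_after:.2e}")
```

Because least-squares residuals are orthogonal to every column of the design matrix, the residual expression is uncorrelated with the covariates by construction; any downstream association test would then be run on residuals rather than raw values.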


2019 ◽  
Author(s):  
Ning Wang ◽  
Andrew E. Teschendorff

Abstract: Inferring the activity of transcription factors in single cells is a key task to improve our understanding of development and complex genetic diseases. This task is, however, challenging due to the relatively large dropout rate and noisy nature of single-cell RNA-Seq data. Here we present a novel statistical inference framework called SCIRA (Single Cell Inference of Regulatory Activity), which leverages the power of large-scale bulk RNA-Seq datasets to infer high-quality tissue-specific regulatory networks, from which regulatory activity estimates in single cells can be subsequently obtained. We show that SCIRA can correctly infer regulatory activity of transcription factors affected by high technical dropouts. In particular, SCIRA can improve sensitivity by as much as 70% compared to differential expression analysis and current state-of-the-art methods. Importantly, SCIRA can reveal novel regulators of cell fate in tissue development, even for cell types that only make up 5% of the tissue, and can identify key novel tumor suppressor genes in cancer at single-cell resolution. In summary, SCIRA will be an invaluable tool for single-cell studies aiming to accurately map activity patterns of key transcription factors during development, and how these are altered in disease.


2021 ◽  
Vol 3 (2) ◽  
pp. 41-51
Author(s):  
Sri Hidayat ◽  
Syafri Syafri ◽  
Syahriar Tato

The corridor of the Hertasning-Tun Abdul Razak road section is a peri-urban area experiencing rapid change due to demand for new housing and activity facilities. This triggers a spatial transformation, which in turn increases anthropogenic activities that can change the urban climate. The increase in anthropogenic activity is indicated by differences in land use and traffic performance along the corridor. This study uses a quantitative method to determine the relationship of land use and traffic performance to urban climatic conditions, with data analysis using SEM PLS. Statistical hypothesis testing of the effect of each independent variable on the dependent variable led to the conclusion that land use has a significant effect on climatic conditions, with a T-statistic of 2.752 > 1.96 or a P value of 0.040 < 0.05. Meanwhile, traffic performance has no significant effect on urban climatic conditions, with a T-statistic of 1.071 < 1.96 or a P value of 0.285 > 0.05. These results also indicate that land use in the Hertasning-Tun Abdul Razak road corridor can cause an increase in urban temperatures in the area. However, the increase in urban temperature in the area is due more to anthropogenic activities associated with land use and is not influenced by the extent of the built-up area.


Entropy ◽  
2019 ◽  
Vol 21 (9) ◽  
pp. 883 ◽  
Author(s):  
Luis Gustavo Esteves ◽  
Rafael Izbicki ◽  
Julio Michael Stern ◽  
Rafael Bassi Stern

This paper introduces pragmatic hypotheses and relates this concept to the spiral of scientific evolution. Previous works determined a characterization of logically consistent statistical hypothesis tests and showed that the modal operators obtained from these tests can be represented in the hexagon of oppositions. However, despite the importance of precise hypotheses in science, they cannot be accepted by logically consistent tests. Here, we show that this dilemma can be overcome by the use of pragmatic versions of precise hypotheses. These pragmatic versions allow a level of imprecision in the hypothesis that is small relative to other experimental conditions. The introduction of pragmatic hypotheses allows the evolution of scientific theories based on statistical hypothesis testing to be interpreted using the narratological structure of hexagonal spirals, as defined by Pierre Gallais.


2020 ◽  
Vol 10 (20) ◽  
pp. 7077
Author(s):  
Hector-Xavier de Lastic ◽  
Irene Liampa ◽  
Alexandros G. Georgakilas ◽  
Michalis Zervakis ◽  
Aristotelis Chatziioannou

Background: Here, we propose a threshold-free selection method for the identification of differentially expressed features based on robust, non-parametric statistics, ensuring independence from the statistical distribution properties and broad applicability. Such methods could adapt to different initial data distributions, contrary to statistical techniques based on fixed thresholds. This work proposes a methodology that automates and standardizes the statistical selection through established measures such as entropy, already used in information retrieval from large biomedical datasets. It thus departs from classical fixed-threshold methods, which rely on arbitrary p-value and fold change values as selection criteria and whose efficacy also depends on the degree of conformity to parametric distributions. Methods: Our work extends the rank product (RP) methodology with a neutral selection method of high information-extraction capacity. We introduce the calculation of the entropy of the RP distribution to isolate the features of interest by their contribution to its information content. The goal is a methodology for threshold-free identification of the differentially expressed features that are highly informative about the phenomenon under study. Conclusions: Applying the proposed method to microarray (transcriptomic and DNA methylation) and RNAseq count data of varying sizes and noise presence, we observe robust convergence for the different parameterizations to stable cutoff points. Functional analysis through BioInfoMiner and EnrichR was used to evaluate the information potency of the resulting feature lists. Overall, the derived functional terms provide a systemic description highly compatible with the results of traditional statistical hypothesis testing techniques. The methodology behaves consistently across different data types. The feature lists are compact and rich in information, indicating phenotypic aspects specific to the tissue and biological phenomenon investigated. Selection by information content measures efficiently addresses problems emerging from arbitrary thresholding, thus facilitating the full automation of the analysis.
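For readers unfamiliar with the rank product statistic that this method extends, a minimal sketch follows. It shows only the standard rank product (the geometric mean of a feature's ranks across replicates), not the entropy-based selection described in the abstract, whose details are not given here. The function name `rank_products` and the example fold changes are hypothetical, and ties are not handled.

```python
import numpy as np

def rank_products(fold_changes):
    """Geometric mean of per-replicate ranks; rank 1 is the largest fold change.

    fold_changes: array of shape (n_features, n_replicates).
    Ties are broken arbitrarily in this simplified sketch.
    """
    fc = np.asarray(fold_changes, dtype=float)
    # argsort of argsort yields 0-based ranks within each column;
    # negating fc makes the largest value receive rank 1.
    ranks = np.argsort(np.argsort(-fc, axis=0), axis=0) + 1
    # Geometric mean computed in log space for numerical stability.
    return np.exp(np.log(ranks).mean(axis=1))

# Hypothetical log fold changes: 5 features (rows) x 3 replicates (columns).
fc = np.array([
    [2.0, 1.8, 2.2],
    [0.1, 0.3, -0.2],
    [1.0, 1.1, 0.9],
    [-1.5, -1.2, -1.7],
    [0.0, -0.1, 0.2],
])

rp = rank_products(fc)
for i, value in enumerate(rp):
    print(f"feature {i}: rank product = {value:.3f}")
```

A small rank product means the feature ranks consistently near the top across replicates, which is why the statistic needs no assumption about the underlying distribution of the measurements.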


2021 ◽  
Author(s):  
Yingxin Cao ◽  
Laiyi Fu ◽  
Jie Wu ◽  
Qinke Peng ◽  
Qing Nie ◽  
...  

Abstract. Motivation: Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) provides new opportunities to dissect epigenomic heterogeneity and elucidate transcriptional regulatory mechanisms. However, computational modelling of scATAC-seq data is challenging due to its high dimension, extreme sparsity, complex dependencies, and high sensitivity to confounding factors from various sources. Results: Here we propose a new deep generative model framework, named SAILER, for analysing scATAC-seq data. SAILER aims to learn a low-dimensional nonlinear latent representation of each cell that defines its intrinsic chromatin state, invariant to extrinsic confounding factors like read depth and batch effects. SAILER adopts the conventional encoder-decoder framework to learn the latent representation but imposes additional constraints to ensure the independence of the learned representations from the confounding factors. Experimental results on both simulated and real scATAC-seq datasets demonstrate that SAILER learns better and biologically more meaningful representations of cells than other methods. Its noise-free cell embeddings bring significant benefits in downstream analyses: clustering and imputation based on SAILER result in 6.9% and 18.5% improvements over existing methods, respectively. Moreover, because no matrix factorization is involved, SAILER can easily scale to process millions of cells. We implemented SAILER into a software package, freely available to all for large-scale scATAC-seq data analysis. Availability: The software is publicly available at https://github.com/uci-cbcl/[email protected] and [email protected]


Author(s):  
Hector-Xavier de Lastic ◽  
Irene Liampa ◽  
Alexandros G. Georgakilas ◽  
Michalis Zervakis ◽  
Aristotelis Chatziioannou

Background: Traditional omic analysis relies on p-value and fold change as selection criteria. There is an ongoing debate on their effectiveness in delivering systemic and robust interpretation, due to their dependence on assumptions of conformity with various parametric distributions. Here, we propose a threshold-free selection method based on robust, non-parametric statistics, ensuring independence from the statistical distribution properties and broad applicability. Such methods could adapt to different initial data distributions, contrary to statistical techniques based on fixed thresholds. Methods: Our work extends the Rank Products methodology with a neutral selection method of high information-extraction capacity. We introduce the calculation of the RP distribution’s entropy to isolate the features of interest by their contribution to the distribution’s information content. The aim is a methodology performing threshold-free identification of the differentially expressed features that are highly informative about the phenomenon under scrutiny. Conclusions: Applying the proposed method to microarray (transcriptomic and DNA methylation) and RNAseq count data of varying sizes and noise presence, we observe robust convergence for the different parameterisations to stable cutoff points. Functional analysis through BioInfoMiner and EnrichR was used to evaluate the information potency of the resulting feature lists. Overall, the derived functional terms provide a systemic description highly compatible with the results of traditional statistical hypothesis testing techniques. The methodology behaves consistently across different data types. The feature lists are compact and information-rich, indicating phenotypic aspects specific to the tissue and biological phenomenon investigated. Selection by information content measures efficiently addresses problems emerging from arbitrary thresholding, thus facilitating the full automation of the analysis.


2018 ◽  
Author(s):  
Oliver L Tessmer ◽  
David M Kramer ◽  
Jin Chen

Abstract: There is a critical unmet need for new tools to analyze and understand “big data” in the biological sciences, where breakthroughs come from connecting massive genomics data with complex phenomics data. By integrating instant data visualization and statistical hypothesis testing, we have developed a new tool called OLIVER for phenomics visual data analysis, with the unique feature that any user adjustment triggers real-time display updates for all affected elements in the workspace. By visualizing and analyzing omics data with OLIVER, biomedical researchers can quickly generate hypotheses and then test them within the same tool, leading to efficient knowledge discovery from complex, multi-dimensional biological data. Applying OLIVER to multiple plant phenotyping experiments has shown that it can facilitate scientific discoveries. In the use case of OLIVER for large-scale plant phenotyping, a quick visualization identified emergent phenotypes that are highly transient and heterogeneous. The unique circular heat map with false-color plant images also indicates that such emergent phenotypes appear in different leaves under different conditions, suggesting that such previously unseen processes are critical for plant responses to dynamic environments.


2019 ◽  
Author(s):  
Brian Hie ◽  
Hyunghoon Cho ◽  
Benjamin DeMeo ◽  
Bryan Bryson ◽  
Bonnie Berger

Summary: Large-scale single-cell RNA-sequencing (scRNA-seq) studies that profile hundreds of thousands of cells are becoming increasingly common, overwhelming existing analysis pipelines. Here, we describe how to enhance and accelerate single-cell data analysis by summarizing the transcriptomic heterogeneity within a data set using a small subset of cells, which we refer to as a geometric sketch. Our sketches provide more comprehensive visualization of transcriptional diversity, capture rare cell types with high sensitivity, and accurately reveal biological cell types via clustering. Our sketch of umbilical cord blood cells uncovers a rare subpopulation of inflammatory macrophages, which we experimentally validated in vitro. The construction of our sketches is extremely fast, which enabled us to accelerate other crucial resource-intensive tasks such as scRNA-seq data integration. We anticipate that our algorithm will become an increasingly essential step when sharing and analyzing the rapidly growing volume of scRNA-seq data and help enable the democratization of single-cell omics.

