Single-Cell Transcriptome Profiling Simulation Reveals the Impact of Sequencing Parameters and Algorithms on Clustering

Life ◽  
2021 ◽  
Vol 11 (7) ◽  
pp. 716
Author(s):  
Yunhe Liu ◽  
Aoshen Wu ◽  
Xueqing Peng ◽  
Xiaona Liu ◽  
Gang Liu ◽  
...  

Although many scRNA-seq analytic algorithms have been developed, their performance for cell clustering cannot be quantified because the "true" clusters are unknown. Referencing the transcriptomic heterogeneity of cell clusters, a "true" mRNA number matrix of individual cells was defined as the ground truth. Based on this matrix and the actual data generation procedure, a simulation program (SSCRNA) for raw data was developed. Subsequently, the consistency between simulated data and real data was evaluated. Furthermore, the impact of sequencing depth and analysis algorithms on cluster accuracy was quantified. The simulation results were highly consistent with the actual data. Among the normalization algorithms, the Gaussian normalization method was the more highly recommended; among the clustering algorithms, the K-means clustering method was more stable than K-means plus Louvain clustering. In conclusion, the scRNA simulation algorithm developed here reproduces the actual data generation process, reveals the impact of parameters on classification, compares normalization and clustering algorithms, and provides novel insight into scRNA analyses.
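A minimal sketch of the kind of workflow this abstract describes, not the authors' SSCRNA code: a hypothetical "true" mRNA count matrix is degraded by assumed capture and depth parameters, then clustered with K-means and scored against the known labels with the adjusted Rand index. All parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n_cells, n_genes, n_clusters = 300, 200, 3

# "True" mRNA numbers: each cluster has its own mean expression profile.
cluster_means = rng.gamma(shape=2.0, scale=5.0, size=(n_clusters, n_genes))
true_labels = rng.integers(0, n_clusters, size=n_cells)
true_mrna = rng.poisson(cluster_means[true_labels])

# Simulated raw data: binomial capture plus a depth-scaling factor (assumed values).
capture_rate = 0.10            # assumed capture efficiency, not from the paper
depth_factor = 1.0             # lower this to mimic shallower sequencing
counts = rng.binomial(true_mrna, capture_rate * depth_factor)

# Normalize (log1p of library-size-scaled counts) and cluster with K-means.
lib_size = counts.sum(axis=1, keepdims=True).clip(min=1)
norm = np.log1p(counts / lib_size * 1e4)
pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(norm)

print("ARI vs ground truth:", adjusted_rand_score(true_labels, pred))
```

Rerunning the sketch with a smaller `depth_factor` is one way to probe how clustering accuracy responds to sequencing depth.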

2021 ◽  
Author(s):  
Yunhe Liu ◽  
Bisheng Shi ◽  
Aoshen Wu ◽  
Xueqing Peng ◽  
Zhenghong Yuan ◽  
...  

ABSTRACT Although many scRNA-seq analytic algorithms have been developed, their performance for cell clustering cannot be quantified because the "true" clusters are unknown. Referencing the transcriptomic heterogeneity of cell clusters, a "true" mRNA number matrix of individual cells was defined as the ground truth. Based on this matrix and the real data generation procedure, a simulation program (SSCRNA) for raw data was developed. Subsequently, the consistency between simulated data and real data was evaluated. Furthermore, the impact of sequencing depth and analysis algorithms on cluster accuracy was quantified. The simulation results are highly consistent with the real data. Misclassification can be attributed to multiple causes on current scRNA platforms, and clustering accuracy is not only sensitive to increases in sequencing depth but is also reflected in the position of a cluster on the t-SNE plot. Among the normalization methods, Gaussian normalization is more appropriate for current workflows. Among the clustering algorithms, the K-means & Louvain method performs better on dimension-reduced data than on the full data, while K-means alone is stable in both situations. In conclusion, the scRNA simulation algorithm developed here reproduces the real data generation process, reveals the impact of parameters on misclustering, compares normalization and clustering algorithms, and provides novel insight into scRNA analyses.
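A self-contained sketch of the full-data versus reduced-data comparison mentioned above: the same K-means clustering is applied to a stand-in expression matrix before and after PCA, and both results are scored against the known labels. The blob data and the 30-component PCA are placeholders, not the paper's setup.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Stand-in for a normalized scRNA-seq matrix with four known cell clusters.
X, true_labels = make_blobs(n_samples=500, n_features=200, centers=4,
                            cluster_std=8.0, random_state=0)

pred_full = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
X_red = PCA(n_components=30, random_state=0).fit_transform(X)
pred_red = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_red)

print("ARI, K-means on full data:       ", adjusted_rand_score(true_labels, pred_full))
print("ARI, K-means on PCA-reduced data:", adjusted_rand_score(true_labels, pred_red))
```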


2020 ◽  
Author(s):  
Yoonjee Kang ◽  
Denis Thieffry ◽  
Laura Cantini

Abstract Networks are powerful tools to represent and investigate biological systems. The development of algorithms inferring regulatory interactions from functional genomics data has been an active area of research. With the advent of single-cell RNA-seq (scRNA-seq) data, numerous methods specifically designed to take advantage of single-cell datasets have been proposed. However, published benchmarks of single-cell network inference are mostly based on simulated data; once applied to real data, they take into account only a small set of genes and only compare the inferred networks with an imposed ground truth. Here, we benchmark four single-cell network inference methods based on their reproducibility, i.e., their ability to infer similar networks when applied to two independent datasets for the same biological condition. We tested each of these methods on real data from three biological conditions: human retina, T cells in colorectal cancer, and human hematopoiesis. GENIE3 proves to be the most reproducible algorithm, independently of the single-cell sequencing platform, the cell-type annotation system, the number of cells constituting the dataset, or the thresholding applied to the links of the inferred networks. To ensure the reproducibility of this benchmark study and to ease its extension, we implemented all the analyses in scNET, a Jupyter notebook available at https://github.com/ComputationalSystemsBiology/scNET.
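A hedged sketch of the reproducibility criterion this kind of benchmark relies on: given two edge lists inferred independently (for example by GENIE3) on two datasets for the same condition, keep the top-k edges of each and measure their overlap. The column names (`TF`, `target`, `importance`) and the helper names are assumptions for illustration, not the scNET interface.

```python
import pandas as pd

def top_edges(network: pd.DataFrame, k: int) -> set:
    """Return the k highest-importance (TF, target) pairs as a set."""
    top = network.nlargest(k, "importance")
    return set(zip(top["TF"], top["target"]))

def edge_overlap(net_a: pd.DataFrame, net_b: pd.DataFrame, k: int = 1000) -> float:
    """Jaccard index between the top-k edge sets of two inferred networks."""
    a, b = top_edges(net_a, k), top_edges(net_b, k)
    return len(a & b) / len(a | b)

# Usage, assuming two previously computed edge-list DataFrames:
# print(edge_overlap(network_dataset1, network_dataset2, k=1000))
```

Sweeping `k` is one way to check that the reproducibility ranking does not depend on the thresholding of the inferred links.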


2020 ◽  
Author(s):  
Fanny Mollandin ◽  
Andrea Rau ◽  
Pascal Croiseau

ABSTRACT Technological advances and decreasing costs have led to the rise of increasingly dense genotyping data, making the identification of potential causal markers feasible. Custom genotyping chips, which combine medium-density genotypes with a custom genotype panel, can capitalize on these candidates to potentially yield improved accuracy and interpretability in genomic prediction. A particularly promising model to this end is BayesR, which divides markers into four effect-size classes. BayesR has been shown to yield accurate predictions and shows promise for quantitative trait loci (QTL) mapping in real data applications, but an extensive benchmark on simulated data is currently lacking. Based on a set of real genotypes, we generated simulated data under a variety of genetic architectures and phenotype heritabilities, and we evaluated the impact of excluding or including causal markers among the genotypes. We define several statistical criteria for QTL mapping, including several based on sliding windows to account for linkage disequilibrium, and we compare and contrast these statistics and their ability to accurately prioritize known causal markers. Overall, we confirm the strong predictive performance of BayesR for moderately to highly heritable traits, particularly for 50k custom data. In cases of low heritability or weak linkage disequilibrium with the causal marker in 50k genotypes, QTL mapping is a challenge, regardless of the criterion used. BayesR is a promising approach to simultaneously obtain accurate predictions and interpretable classifications of SNPs into effect-size classes. We illustrate the performance of BayesR in a variety of simulation scenarios and compare the advantages and limitations of each.
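A minimal sketch of a sliding-window QTL-mapping criterion of the kind described above: per-SNP posterior probabilities of belonging to a non-zero effect class (as a BayesR-type model reports) are summed over windows of consecutive SNPs so that signal spread across markers in linkage disequilibrium is pooled. The window size and the input vector are illustrative assumptions.

```python
import numpy as np

def window_scores(post_incl: np.ndarray, window: int = 20) -> np.ndarray:
    """Sliding-window sums of per-SNP posterior inclusion probabilities."""
    kernel = np.ones(window)
    return np.convolve(post_incl, kernel, mode="valid")

# Example: 10,000 SNPs with mostly near-zero posterior inclusion probabilities.
rng = np.random.default_rng(1)
post_incl = rng.beta(0.2, 20.0, size=10_000)
scores = window_scores(post_incl, window=20)
print("Top-scoring window starts at SNP index:", int(np.argmax(scores)))
```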


2003 ◽  
Vol 40 (4) ◽  
pp. 389-405 ◽  
Author(s):  
Baohong Sun ◽  
Scott A. Neslin ◽  
Kannan Srinivasan

Logit choice models have been used extensively to study promotion response. This article examines whether brand-switching elasticities derived from these models are overestimated as a result of rational consumer adjustment of purchase timing to coincide with promotion schedules and whether a dynamic structural model can address this bias. Using simulated data, the authors first show that if the structural model is correct, brand-switching elasticities are overestimated by stand-alone logit models. A nested logit model improves the estimates, but not completely. Second, the authors estimate the models on real data. The results indicate that the structural model fits better and produces sensible coefficient estimates. The authors then observe the same pattern in switching elasticities as they do in the simulation. Third, the authors predict sales assuming a 50% increase in promotion frequency. The reduced-form models predict much higher sales levels than does the dynamic structural model. The authors conclude that reduced-form model estimates of brand-switching elasticities can be overstated and that a dynamic structural model is best for addressing the problem. Reduced-form models that include incidence can partially, though not completely, address the issue. The authors discuss the implications for researchers and managers.
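A hedged numerical sketch of the reduced-form side of this comparison: a stand-alone multinomial logit with a promotion dummy, used to show how a promotion on one brand shifts the other brands' choice probabilities (brand switching). The intercepts and promotion coefficient are illustrative values, not estimates from the article.

```python
import numpy as np

def logit_shares(intercepts: np.ndarray, promo_coef: float, promo: np.ndarray) -> np.ndarray:
    """Choice probabilities of a multinomial logit with a promotion dummy."""
    utilities = intercepts + promo_coef * promo
    expu = np.exp(utilities - utilities.max())
    return expu / expu.sum()

intercepts = np.array([1.0, 0.8, 0.5])      # three brands, assumed intercepts
promo_coef = 1.2                            # assumed promotion response

base = logit_shares(intercepts, promo_coef, np.array([0, 0, 0]))
promoted = logit_shares(intercepts, promo_coef, np.array([1, 0, 0]))  # brand 1 on promotion

# Proportional share loss of the non-promoted brands (the "switching" effect).
print("share change, brand 2:", (promoted[1] - base[1]) / base[1])
print("share change, brand 3:", (promoted[2] - base[2]) / base[2])
```

A dynamic structural model would instead let households shift purchase timing toward promotion weeks, which is exactly the behaviour the article argues inflates switching elasticities in this reduced-form setup.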


2014 ◽  
Vol 142 (12) ◽  
pp. 4559-4580 ◽  
Author(s):  
Jason A. Sippel ◽  
Fuqing Zhang ◽  
Yonghui Weng ◽  
Lin Tian ◽  
Gerald M. Heymsfield ◽  
...  

Abstract This study utilizes an ensemble Kalman filter (EnKF) to assess the impact of assimilating observations of Hurricane Karl from the High-Altitude Imaging Wind and Rain Airborne Profiler (HIWRAP). HIWRAP is a new Doppler radar on board the NASA Global Hawk unmanned airborne system, which has the benefit of a 24–26-h flight duration, or about 2–3 times that of a conventional aircraft. The first HIWRAP observations were taken during NASA’s Genesis and Rapid Intensification Processes (GRIP) experiment in 2010. Observations considered here are Doppler velocity (Vr) and Doppler-derived velocity–azimuth display (VAD) wind profiles (VWPs). Karl is the only hurricane to date for which HIWRAP data are available. Assimilation of either Vr or VWPs has a significant positive impact on the EnKF analyses and forecasts of Hurricane Karl. Analyses are able to accurately estimate Karl’s observed location, maximum intensity, size, precipitation distribution, and vertical structure. In addition, forecasts initialized from the EnKF analyses are much more accurate than a forecast without assimilation. The forecasts initialized from VWP-assimilating analyses perform slightly better than those initialized from Vr-assimilating analyses, and the latter are less accurate than EnKF-initialized forecasts from a recent proof-of-concept study with simulated data. Likely causes for this discrepancy include the quality and coverage of the HIWRAP data collected from Karl and the presence of model error in this real-data situation. The advantages of assimilating VWP data likely include the ability to simultaneously constrain both components of the horizontal wind and to circumvent reliance upon vertical velocity error covariance.
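A hedged sketch of the core analysis step of a stochastic ensemble Kalman filter, the data-assimilation machinery referred to above. The state size, ensemble size, and observation operator are toy assumptions; an operational hurricane system adds covariance localization, inflation, and a full forecast model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_state, n_obs, n_ens = 50, 5, 40

# Linear observation operator: observe every 10th state variable (toy choice).
H = np.zeros((n_obs, n_state))
H[np.arange(n_obs), np.arange(n_obs) * 10] = 1.0
R = 0.5 * np.eye(n_obs)                           # observation-error covariance

ens = rng.normal(size=(n_state, n_ens))           # prior (background) ensemble
obs = rng.normal(size=n_obs)                      # observations (e.g. Doppler winds)

# Ensemble-estimated background covariance and Kalman gain.
anom = ens - ens.mean(axis=1, keepdims=True)
P = anom @ anom.T / (n_ens - 1)
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)

# Update each member against perturbed observations (stochastic EnKF).
perturbed = obs[:, None] + rng.multivariate_normal(np.zeros(n_obs), R, size=n_ens).T
analysis = ens + K @ (perturbed - H @ ens)

post_anom = analysis - analysis.mean(axis=1, keepdims=True)
print("prior spread:", anom.std(), "analysis spread:", post_anom.std())
```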


2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Francesca Pizzorni Ferrarese ◽  
Flavio Simonetti ◽  
Roberto Israel Foroni ◽  
Gloria Menegaz

Validation and accuracy assessment are the main bottlenecks preventing the adoption of image processing algorithms in clinical practice. In the classical approach, a posteriori analysis is performed through objective metrics. In this work, a different approach based on Petri nets is proposed. The basic idea is to predict the accuracy of a given pipeline based on the identification and characterization of the sources of inaccuracy. The concept is demonstrated on a case study: intrasubject rigid and affine registration of magnetic resonance images. Both synthetic and real data are considered. While synthetic data allow benchmarking of the performance with respect to the ground truth, real data make it possible to assess the robustness of the methodology in real contexts and to determine the suitability of using synthetic data in the training phase. Results revealed a higher correlation and a lower dispersion among the metrics for simulated data, while the opposite trend was observed for pathological data. They also show that the proposed model not only provides good prediction performance but also leads to the optimization of the end-to-end chain in terms of accuracy and robustness, setting the ground for its generalization to different and more complex scenarios.
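A small sketch of the kind of ground-truth check that synthetic data make possible for rigid registration: a known rotation and translation are applied to landmark points, a slightly different transform stands in for the output of some registration routine, and the target registration error (TRE) is reported. The transform values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 100, size=(20, 3))          # landmark positions in mm

def rigid(pts: np.ndarray, angle_deg: float, translation: np.ndarray) -> np.ndarray:
    """Apply a rotation about the z-axis and a translation to Nx3 points."""
    a = np.deg2rad(angle_deg)
    R = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a),  np.cos(a), 0.0],
                  [0.0,        0.0,       1.0]])
    return pts @ R.T + translation

true_moved = rigid(points, angle_deg=5.0, translation=np.array([2.0, -1.0, 0.5]))
# Stand-in for a registration result with a small residual error:
est_moved = rigid(points, angle_deg=4.8, translation=np.array([2.1, -0.9, 0.5]))

tre = np.linalg.norm(true_moved - est_moved, axis=1)
print("mean TRE (mm):", tre.mean())
```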


Author(s):  
Dimitri Marques Abramov ◽  
Saint-Clair Gomes Junior

ABSTRACT The aim of this study was to develop a realistic network model to predict the relationship between lockdown duration and coverage in controlling the progression of the incidence curve of an epidemic with the characteristics of COVID-19 in two scenarios: (1) a closed and non-immune population, and (2) a real scenario for the State of Rio de Janeiro from May 6th, 2020. We modeled the effects of lockdown duration and coverage on the progression of the epidemic incidence curve in a virtual population of 10,000 subjects. Predictor variables were reproductive values established in the most recent literature (R0 = 2.7 and 5.7, and Re = 1.28 for Rio de Janeiro State on May 6th), without lockdown and with coverages of 25%, 50%, and 90% for 21, 35, 70, and 140 days, in up to 13 different scenarios for each R0/Re, with individuals remaining infected and infectious for 14 days. We assessed model validity in the theoretical and real scenarios by (1) applying an exponential model to the no-lockdown incidence curve with the growth-rate coefficient observed in realistic scenarios, and (2) fitting the real data series from Rio de Janeiro to the simulated data, respectively. For R0 = 5.7, the curve flattens only with long lockdown periods (70 and 140 days) at 90% coverage. For R0 = 2.7, coverages of 25% and 50% also flatten the curve and reduce total cases, provided they last for a long period (70 days or more). For the realistic Rio de Janeiro scenario, lockdowns with coverage at least 25% above the current level, starting May 6th and lasting 140 days, showed marked flattening and two to five times fewer COVID cases. If a lockdown with more intense coverage (about 25 to 50% above the current one) were implemented by June 6th and maintained for at least 70 days, it would still be possible to reduce the impact of the pandemic in the State of Rio de Janeiro by nearly 40-50%. These data corroborate the importance of lockdown duration, regardless of virus transmissibility and sometimes of coverage intensity, in both realistic and theoretical COVID-19 epidemic scenarios. Even when implemented later, improving lockdown coverage can still be effective in minimizing the impact of the epidemic.
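A much-simplified sketch, not the authors' network model: a discrete-time stochastic SIR simulation in which lockdown is represented as a contact-rate reduction equal to the coverage, applied only during the lockdown window. Population size and the roughly 14-day infectious period follow the description above; contacts per day and the per-contact transmission probability are assumptions tuned so that the no-lockdown R0 is about 2.7, and recovery is geometric rather than fixed at 14 days.

```python
import numpy as np

def simulate(coverage=0.5, start=30, duration=70, days=300, n=10_000, seed=0):
    """Total cases in a daily stochastic SIR with a lockdown window cutting contacts by `coverage`."""
    rng = np.random.default_rng(seed)
    contacts, p_transmit, recovery = 10, 0.0193, 1 / 14   # 10 * 0.0193 * 14 ~ R0 = 2.7
    S, I, R = n - 10, 10, 0
    total_cases = I
    for day in range(days):
        reduction = coverage if start <= day < start + duration else 0.0
        # Probability a susceptible is infected today, given prevalence I/n.
        force = 1 - (1 - p_transmit * I / n) ** (contacts * (1 - reduction))
        new_inf = rng.binomial(S, force)
        new_rec = rng.binomial(I, recovery)
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
        total_cases += new_inf
    return total_cases

for cov in (0.0, 0.25, 0.5, 0.9):
    print(f"coverage {cov:.0%}: total cases {simulate(coverage=cov)}")
```

Varying `duration` alongside `coverage` in this sketch reproduces the qualitative point of the abstract: a long window at modest coverage can flatten the curve about as effectively as a short, stringent one.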


2018 ◽  
Vol 25 (10) ◽  
pp. 1382-1385 ◽  
Author(s):  
Karol M Pencina ◽  
Ralph B D’Agostino ◽  
Ramachandran S Vasan ◽  
Michael J Pencina

Abstract It is unclear to what extent simulated versions of real data can be used to assess the potential value of new biomarkers added to prognostic risk models. Using data on 4522 women and 3969 men who contributed information to the Framingham CVD risk prediction tool, we develop a simulation model that allows assessment of the added contribution of new biomarkers. The simulation model closely matches the one obtained using real data: the discrimination area under the curve (AUC) on simulated vs actual data is 0.800 vs 0.799 in women and 0.778 vs 0.776 in men. Positive correlation with standard risk factors decreases the impact of new biomarkers (ΔAUC 0.002-0.024), whereas negative correlation leads to stronger effects (ΔAUC 0.026-0.101) than no correlation (ΔAUC 0.003-0.051). We suggest that researchers construct simulation models similar to the one proposed here before embarking on larger, expensive biomarker studies based on actual data.
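A hedged sketch of the kind of simulation described above: a standard risk factor and a new biomarker are drawn jointly with a chosen correlation, the outcome depends on both, and the AUC gain from adding the biomarker is computed. The effect sizes and the correlation are illustrative values, not the Framingham estimates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, corr, beta_risk, beta_marker = 20_000, -0.3, 1.0, 0.4

# Jointly normal standard risk factor and new biomarker with correlation `corr`.
cov = np.array([[1.0, corr], [corr, 1.0]])
risk, marker = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
logit = -2.0 + beta_risk * risk + beta_marker * marker
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

base_model = LogisticRegression().fit(risk[:, None], y)
auc_base = roc_auc_score(y, base_model.predict_proba(risk[:, None])[:, 1])

both = np.column_stack([risk, marker])
full_model = LogisticRegression().fit(both, y)
auc_full = roc_auc_score(y, full_model.predict_proba(both)[:, 1])

print(f"AUC base {auc_base:.3f}, with biomarker {auc_full:.3f}, delta {auc_full - auc_base:.3f}")
```

Flipping the sign of `corr` illustrates the abstract's point that a biomarker negatively correlated with the standard risk factors yields a larger ΔAUC than an uncorrelated one.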


2018 ◽  
Author(s):  
Christopher I. Cooper ◽  
Delia Yao ◽  
Dorota H. Sendorek ◽  
Takafumi N. Yamaguchi ◽  
Christine P’ng ◽  
...  

Abstract Background: Platform-specific error profiles necessitate confirmatory studies in which predictions made on data generated using one technology are additionally verified by processing the same samples on an orthogonal technology. In disciplines that rely heavily on high-throughput data generation, such as genomics, reducing the impact of false positive and false negative rates in results is a top priority. However, verifying all predictions can be costly and redundant, and testing a subset of findings is often used to estimate the true error profile. To determine how to create subsets of predictions for validation that maximize inference of global error profiles, we developed Valection, a software program that implements multiple strategies for the selection of verification candidates. Results: To evaluate these selection strategies, we obtained 261 sets of somatic mutation calls from a single-nucleotide variant caller benchmarking challenge in which 21 teams competed on whole-genome sequencing datasets of three computationally simulated tumours. By using synthetic data, we had complete ground truth for the tumours' mutations and were therefore able to accurately determine how estimates from the selected subset of verification candidates compared to the complete prediction set. We found that selection strategy performance depends on several verification study characteristics; in particular, the verification budget of the experiment (i.e., how many candidates can be selected) is shown to influence estimates. Conclusions: The Valection framework is flexible, allowing for the implementation of additional selection algorithms in the future. Its applicability extends to any discipline that relies on experimental verification and will benefit from the optimization of verification candidate selection.
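A hedged sketch of the evaluation logic described above, not the Valection interface: given a caller's predictions with known ground truth from a simulated tumour, a verification subset is drawn under a fixed budget (here, simple random selection as the baseline strategy) and the precision estimated from that subset is compared with the precision of the full prediction set. The call count, precision, and budget are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_calls, true_precision, budget = 5_000, 0.85, 200

# Ground truth per predicted call, known because the tumour is simulated.
is_true_positive = rng.random(n_calls) < true_precision
full_precision = is_true_positive.mean()

# Verification candidates chosen under the budget (random-selection strategy).
chosen = rng.choice(n_calls, size=budget, replace=False)
estimated_precision = is_true_positive[chosen].mean()

print(f"full-set precision {full_precision:.3f}, "
      f"estimate from {budget} verified calls {estimated_precision:.3f}")
```

Repeating this draw across budgets shows how the spread of the subset estimate shrinks as the verification budget grows, which is the kind of dependence the study quantifies across its selection strategies.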


2020 ◽  
Vol 72 (5) ◽  
pp. 1959-1964
Author(s):  
E.H. Martins ◽  
G. Tarôco ◽  
G.A. Rovadoscki ◽  
M.H.V. Oliveira ◽  
G.B. Mourão ◽  
...  

ABSTRACT This study aimed to estimate genetic parameters for simulated data on body weight (BW), abdominal width (AW), abdominal length (AL), and oviposition. The simulation was based on real data collected at apiaries in the region of Campo das Vertentes, Minas Gerais, Brazil. Genetic evaluations were performed using single- and two-trait models, and (co)variance components were estimated by the restricted maximum likelihood method. The heritabilities for BW, AW, AL, and oviposition were 0.54, 0.47, 0.31, and 0.66, respectively. Positive genetic correlations of high magnitude were obtained between BW and AW (0.80), BW and oviposition (0.69), AW and oviposition (0.82), and AL and oviposition (0.96). The genetic correlations between BW and AL (0.11) and between AW and AL (0.26) were considered low to moderate. In contrast, the phenotypic correlations were positive and high between BW and AW (0.97), BW and AL (0.96), and AW and AL (0.98). Phenotypic correlations of low magnitude and close to zero were obtained for oviposition with AL (0.02), AW (-0.02), and BW (-0.03). New studies involving these traits should be conducted on populations with biological data in order to evaluate the impact of selection on traits of economic interest.
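A didactic sketch of what a heritability of 0.54 for body weight implies, using mid-parent/offspring regression on simulated records; this is a shortcut for illustration, not the single- and two-trait REML models used in the study, and all variances are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fam, h2, var_p = 5_000, 0.54, 1.0
var_a, var_e = h2 * var_p, (1 - h2) * var_p

# Parental breeding values and phenotypes.
a_sire, a_dam = rng.normal(0, np.sqrt(var_a), (2, n_fam))
p_sire = a_sire + rng.normal(0, np.sqrt(var_e), n_fam)
p_dam = a_dam + rng.normal(0, np.sqrt(var_e), n_fam)

# Offspring breeding value = mid-parent value + Mendelian sampling (variance var_a / 2).
a_off = 0.5 * (a_sire + a_dam) + rng.normal(0, np.sqrt(var_a / 2), n_fam)
p_off = a_off + rng.normal(0, np.sqrt(var_e), n_fam)

# Regression of offspring phenotype on mid-parent phenotype estimates h2 directly.
midparent = 0.5 * (p_sire + p_dam)
slope = np.polyfit(midparent, p_off, 1)[0]
print("estimated h2 from mid-parent regression:", round(slope, 3))
```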

