Using Data to Motivate the Use of Empirical Sampling Distributions

2014 ◽  
Vol 107 (6) ◽  
pp. 465-469 ◽  
Author(s):  
Hollylynne S. Lee ◽  
Tina T. Starling ◽  
Marggie D. Gonzalez

Research shows that students often struggle with understanding empirical sampling distributions. Using hands-on and technology-based models and simulations of problems generated from real data helps students begin to make connections between repeated sampling, sample size, distribution, variation, and center. A task to assist teachers in implementing research-based strategies is included.
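As an illustration of the simulation approach the article advocates, the following minimal Python sketch (not taken from the article) builds an empirical sampling distribution of the mean by repeated sampling and shows how its center and spread depend on sample size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "population" of real-valued data (e.g., measurements collected in class).
population = rng.gamma(shape=2.0, scale=10.0, size=10_000)

def empirical_sampling_distribution(pop, sample_size, n_samples=1_000):
    """Draw repeated samples and return the distribution of sample means."""
    means = np.empty(n_samples)
    for i in range(n_samples):
        sample = rng.choice(pop, size=sample_size, replace=True)
        means[i] = sample.mean()
    return means

for n in (5, 25, 100):
    means = empirical_sampling_distribution(population, sample_size=n)
    print(f"n={n:4d}  center={means.mean():6.2f}  spread (SD)={means.std(ddof=1):5.2f}")
# As n grows, the center stays near the population mean while the spread shrinks,
# which is the connection between sample size, center, and variation the task targets.
```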

2021 ◽  
Vol 8 ◽  
Author(s):  
Tianshu Gu ◽  
Lishi Wang ◽  
Ning Xie ◽  
Xia Meng ◽  
Zhijun Li ◽  
...  

The complexity of COVID-19 and variations in control measures and containment efforts in different countries have made prediction and modeling of the COVID-19 pandemic difficult. We attempted to predict the scale of the latter half of the pandemic based on real data, using the ratio between the early and latter halves from countries where the pandemic was largely over. We collected daily pandemic data from China, South Korea, and Switzerland and derived the ratio of pandemic days before and after the disease apex day of COVID-19. We obtained the ratio of pandemic data and created multiple regression models for the relationship between the periods before and after the apex day. We then tested our models using first-wave data from 14 countries in Europe and the US, and subsequently using data from these countries for the entire pandemic up to March 30, 2021. Results indicate that the actual numbers of cases from these countries during the first wave mostly fall within the ranges predicted by linear regression, except for Spain and Russia. Similarly, the actual deaths in these countries mostly fall within the predicted range. Using the accumulated data up to the apex day and the total accumulated data up to March 30, 2021, the case numbers in these countries fall within the predicted range, except for the data from Brazil. The actual number of deaths in all the countries is at or below the predicted values. In conclusion, a linear regression model built with real data from countries or regions where the pandemic occurred early can predict the pandemic scale of countries where the pandemic occurs later. Such a prediction with a high degree of accuracy provides valuable information for governments and the public.
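The modeling step described above amounts to regressing post-apex totals on pre-apex totals from reference countries and applying the fitted line to countries still mid-wave. The sketch below uses placeholder numbers, not the study's data, and assumes a simple one-predictor specification:

```python
import numpy as np

# Illustrative placeholder numbers, not the study's data: cumulative case counts
# up to the apex day (x) and over the remainder of the wave (y) for reference
# countries whose first wave is essentially complete.
x_ref = np.array([80_000.0, 10_000.0, 30_000.0])
y_ref = np.array([60_000.0, 9_000.0, 25_000.0])

# Fit a simple linear model y = a*x + b relating pre-apex to post-apex totals.
a, b = np.polyfit(x_ref, y_ref, deg=1)

# Predict the latter-half scale of a country still mid-wave from its pre-apex total.
x_new = 200_000.0
print(f"predicted post-apex cases: {a * x_new + b:,.0f}")
```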


10.3982/qe674 ◽  
2019 ◽  
Vol 10 (2) ◽  
pp. 527-563 ◽  
Author(s):  
Benjamin Williams

In this paper, I study identification of a nonseparable model with endogeneity arising due to unobserved heterogeneity. Identification relies on the availability of binary proxies that can be used to control for the unobserved heterogeneity. I show that the model is identified in the limit as the number of proxies increases. The argument does not require an instrumental variable that is excluded from the outcome equation nor does it require the support of the unobserved heterogeneity to be finite. I then propose a nonparametric estimator that is consistent as the number of proxies increases with the sample size. I also show that, for a fixed number of proxies, nontrivial bounds on objects of interest can be obtained. Finally, I study two real data applications that illustrate computation of the bounds and estimation with a large number of items.
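The intuition behind identification in the many-proxy limit can be illustrated with a small simulation (a hedged sketch, not the author's estimator): as the number of binary proxies grows, their average pins down the unobserved heterogeneity arbitrarily well, so it can serve as a control.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
theta = rng.normal(size=n)                       # unobserved heterogeneity

def proxy_average(theta, n_proxies):
    """Binary proxies: each item is 1 with a probability that depends on theta."""
    p = 1.0 / (1.0 + np.exp(-theta))             # item response probability
    items = rng.binomial(1, p[:, None], size=(len(theta), n_proxies))
    return items.mean(axis=1)

for j in (2, 10, 100, 1000):
    avg = proxy_average(theta, j)
    corr = np.corrcoef(avg, 1.0 / (1.0 + np.exp(-theta)))[0, 1]
    print(f"J={j:5d}  corr(proxy average, P(item=1|theta)) = {corr:.3f}")
# As J grows, the proxy average tracks the latent heterogeneity ever more
# precisely, which is the intuition behind identification in the many-proxy limit.
```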


2014 ◽  
Vol 8 (2) ◽  
Author(s):  
Ahmed El-Mowafy ◽  
Congwei Hu

Abstract This study presents validation of BeiDou measurements in un-differenced standalone mode and experimental results of its application to real data. A reparameterized form of the unknowns in a geometry-free observation model was used. Observations from each satellite are independently screened using a local modeling approach. The main advantages are that no computation of inter-system biases is required and no satellite navigation information is needed. Validation of the triple-frequency BeiDou data was performed in static and kinematic modes: the former at two continuously operating reference stations in Australia using data spanning two consecutive days, and the latter in a walking mode for three hours. The use of the validation method parameters for numerical and graphical diagnostics of the multi-frequency BeiDou observations is discussed. The precision of the system's observations was estimated using an empirical method that utilizes the characteristics of the validation statistics. The capability of the proposed method is demonstrated in the detection and identification of artificial errors inserted into the static BeiDou data and when implemented in single point positioning processing of the kinematic test.
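A generic sketch of per-satellite screening on the geometry-free phase combination, using a local polynomial model, is given below; it illustrates the kind of local, navigation-free modeling described, but it is not the authors' reparameterized formulation.

```python
import numpy as np

def geometry_free(phase_f1_m, phase_f2_m):
    """Geometry-free combination (in metres): removes geometry, clocks and troposphere,
    leaving ionospheric delay plus carrier-phase ambiguities, so it varies smoothly
    in time for a single satellite unless a cycle slip or outlier occurs."""
    return np.asarray(phase_f1_m) - np.asarray(phase_f2_m)

def screen_satellite(gf_series, window=10, k=4.0):
    """Flag epochs whose geometry-free value deviates strongly from a local
    low-order polynomial fitted to the preceding window of epochs."""
    gf = np.asarray(gf_series, dtype=float)
    flags = np.zeros(gf.size, dtype=bool)
    for t in range(window, gf.size):
        idx = np.arange(t - window, t)
        coeff = np.polyfit(idx, gf[idx], deg=2)      # local quadratic model
        pred = np.polyval(coeff, t)
        resid = gf[idx] - np.polyval(coeff, idx)
        sigma = max(resid.std(ddof=3), 1e-4)
        flags[t] = abs(gf[t] - pred) > k * sigma
    return flags
```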


2019 ◽  
Vol 49 (4) ◽  
pp. 1147-1158 ◽  
Author(s):  
Jessica M B Rees ◽  
Christopher N Foley ◽  
Stephen Burgess

Abstract
Background: Factorial Mendelian randomization is the use of genetic variants to answer questions about interactions. Although the approach has been used in applied investigations, little methodological advice is available on how to design or perform a factorial Mendelian randomization analysis. Previous analyses have employed a 2 × 2 approach, using dichotomized genetic scores to divide the population into four subgroups as in a factorial randomized trial.
Methods: We describe two distinct contexts for factorial Mendelian randomization: investigating interactions between risk factors, and investigating interactions between pharmacological interventions on risk factors. We propose two-stage least squares methods using all available genetic variants and their interactions as instrumental variables, and using continuous genetic scores as instrumental variables rather than dichotomized scores. We illustrate our methods using data from UK Biobank to investigate the interaction between body mass index and alcohol consumption on systolic blood pressure.
Results: Simulated and real data show that efficiency is maximized using the full set of interactions between genetic variants as instruments. In the applied example, between 4- and 10-fold improvement in efficiency is demonstrated over the 2 × 2 approach. Analyses using continuous genetic scores are more efficient than those using dichotomized scores. Efficiency is improved by finding genetic variants that divide the population at a natural break in the distribution of the risk factor, or else divide the population into more equal-sized groups.
Conclusions: Previous factorial Mendelian randomization analyses may have been underpowered. Efficiency can be improved by using all genetic variants and their interactions as instrumental variables, rather than the 2 × 2 approach.
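A minimal sketch of the proposed two-stage least squares estimator, using all variants and their pairwise interactions as instruments, is shown below on simulated data (not UK Biobank); variable names and effect sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 10_000, 10

# Illustrative simulated data: genetic variants G, two exposures X1, X2 sharing
# a confounder U, and an outcome with an X1*X2 interaction effect.
G = rng.binomial(2, 0.3, size=(n, k)).astype(float)
U = rng.normal(size=n)
X1 = G[:, :5].sum(axis=1) * 0.3 + U + rng.normal(size=n)
X2 = G[:, 5:].sum(axis=1) * 0.3 + U + rng.normal(size=n)
Y = 0.5 * X1 + 0.2 * X2 + 0.1 * X1 * X2 + U + rng.normal(size=n)

def with_intercept(*cols):
    return np.column_stack([np.ones(len(cols[0]))] + list(cols))

# Instruments: all variants plus their pairwise products (instrumenting X1*X2).
GG = np.column_stack([G[:, i] * G[:, j] for i in range(k) for j in range(i + 1, k)])
Z = with_intercept(*(list(G.T) + list(GG.T)))

# First stage: project each endogenous regressor (X1, X2, X1*X2) onto the instruments.
X = with_intercept(X1, X2, X1 * X2)
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]

# Second stage: regress the outcome on the fitted values.
beta = np.linalg.lstsq(X_hat, Y, rcond=None)[0]
print("2SLS estimates [const, X1, X2, X1*X2]:", np.round(beta, 3))
```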


1999 ◽  
Vol 89 (11) ◽  
pp. 1104-1111 ◽  
Author(s):  
Jan P. Nyrop ◽  
Michael R. Binns ◽  
Wopke van der Werf

Guides for making crop protection decisions based on assessments of pest abundance or incidence are cornerstones of many integrated pest management systems. Much research has been devoted to developing sample plans for use in these guides. The development of sampling plans has usually focused on collecting information on the sampling distribution of the pest, describing this sampling distribution with a mathematical model, formulating a sample plan, and sometimes, but not always, evaluating the performance of the proposed sample plan. For crop protection decision making, classification of density or incidence is usually more appropriate than estimation. When classification is done, the average outcome of classification (the operating characteristic) is frequently robust to large changes in the sampling distribution, including estimates of the variance of pest counts, and to sample size. In contrast, the critical density, or critical incidence, about which classifications are made, has a large influence on the operating characteristic. We suggest that rather than investing resources in elaborate descriptions of sampling distributions, or in fine-tuning sample size to achieve desired levels of precision, greater emphasis should be placed on characterizing pest densities that signal the need for management action and on designing decision guides that will be adopted by practitioners.
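The operating characteristic of a classification sampling plan can be approximated by Monte Carlo simulation. The sketch below assumes negative binomial pest counts with placeholder parameters; it is intended only to illustrate the robustness argument, not to reproduce the authors' analysis.

```python
import numpy as np

rng = np.random.default_rng(3)

def operating_characteristic(true_mean, critical_density, n_samples, kk=2.0, reps=5_000):
    """Probability of classifying density as below the critical value,
    for negative binomial pest counts with aggregation parameter kk."""
    p = kk / (kk + true_mean)                     # numpy's (n, p) parameterization
    counts = rng.negative_binomial(kk, p, size=(reps, n_samples))
    return np.mean(counts.mean(axis=1) < critical_density)

critical = 5.0
for mu in (2.0, 4.0, 5.0, 6.0, 8.0):
    oc = operating_characteristic(mu, critical, n_samples=30)
    print(f"true mean {mu:4.1f}: P(classified below threshold) = {oc:.2f}")
# Re-running with moderately different kk or n_samples changes the OC curve little,
# while changing the critical density shifts it strongly, as the abstract argues.
```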


2017 ◽  
Vol 56 (6) ◽  
pp. 1663-1680 ◽  
Author(s):  
Timothy H. Raupach ◽  
Alexis Berne

Abstract Double-moment normalization of the drop size distribution (DSD) summarizes the DSD in a compact way, using two of its statistical moments and a “generic” double-moment normalized DSD function. Results are presented of an investigation into the invariance of the double-moment normalized DSD through horizontal and vertical displacement in space, using data from disdrometers, vertically pointing K-band Micro Rain Radars, and an X-band polarimetric weather radar. The invariance of the double-moment normalized DSD is tested over a vertical range of up to 1.8 km and a horizontal range of up to approximately 100 km. The results suggest that for practical use, with well-chosen input moments, the double-moment normalized DSD can be assumed invariant in space in stratiform rain. The choice of moments used to characterize the DSD affects the amount of DSD variability captured by the normalization. It is shown that in stratiform rain, it is possible to capture more than 85% of the variability in DSD moments zero to seven using the technique. Most DSD variability in stratiform rain can thus be explained through the variability of two of its statistical moments. The results suggest similar behavior exists in transition and convective rain, but the limited data samples available do not allow for robust conclusions for these rain types. The results have implications for practical uses of double-moment DSD normalization, including the study of DSD variability and microphysics, DSD-retrieval algorithms, and DSD models used in rainfall retrieval.
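A common formulation of double-moment normalization writes N(D) = N0' h(D/Dm'), with N0' and Dm' built from two reference moments. The sketch below follows that generic framework; the authors' exact moment choices and constant factors may differ.

```python
import numpy as np

def dsd_moments(diam_mm, nd, orders):
    """Moments M_k = sum N(D) D^k dD of a binned DSD (nd in mm^-1 m^-3, diam in mm)."""
    d_diam = np.gradient(diam_mm)
    return {k: np.sum(nd * diam_mm**k * d_diam) for k in orders}

def double_moment_normalize(diam_mm, nd, i=3, j=4):
    """Double-moment normalization with reference moments M_i and M_j:
    N(D) = N0p * h(D / Dmp), with Dmp = (M_j / M_i)**(1/(j-i))
    and N0p = M_i**((j+1)/(j-i)) * M_j**(-(i+1)/(j-i))."""
    m = dsd_moments(diam_mm, nd, (i, j))
    dmp = (m[j] / m[i]) ** (1.0 / (j - i))
    n0p = m[i] ** ((j + 1.0) / (j - i)) * m[j] ** (-(i + 1.0) / (j - i))
    x = diam_mm / dmp            # scaled diameter
    h = nd / n0p                 # "generic" normalized DSD
    return x, h
```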


2015 ◽  
Vol 2015 ◽  
pp. 1-5 ◽  
Author(s):  
Yuxiang Tan ◽  
Yann Tambouret ◽  
Stefano Monti

The performance evaluation of fusion detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generation of an unlimited number of true and false positive events, and the consequent robust estimation of accuracy measures, such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size. This makes it difficult to systematically evaluate the performance of RNA-Seq based fusion-detection algorithms. Here, we present SimFuse to address this problem. SimFuse utilizes real sequencing data as the fusions’ background to closely approximate the distribution of reads from a real sequencing library and uses a reference genome as the template from which to simulate fusions’ supporting reads. To assess the supporting read-specific performance, SimFuse generates multiple datasets with various numbers of fusion supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting read features and the sample size of the simulated library, based on which the performance metrics needed for the validation and comparison of alternative fusion-detection algorithms can be rigorously estimated.
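The core idea, simulating junction-spanning supporting reads from reference sequence, can be sketched as follows (a toy illustration with made-up sequences, not SimFuse's implementation):

```python
import random

random.seed(0)

def simulate_fusion_reads(gene_a_seq, gene_b_seq, n_reads=50, read_len=75):
    """Join the 3' end of gene A to the 5' end of gene B and sample reads
    that span the fusion junction (simplified: no errors, uniform placement)."""
    fusion = gene_a_seq + gene_b_seq
    junction = len(gene_a_seq)
    reads = []
    while len(reads) < n_reads:
        start = random.randint(0, len(fusion) - read_len)
        # Keep only reads that actually cross the junction, i.e. supporting reads.
        if start < junction < start + read_len:
            reads.append(fusion[start:start + read_len])
    return reads

# Toy sequences standing in for exonic regions taken from a reference genome.
gene_a = "ACGT" * 60
gene_b = "TTGCA" * 48
supporting = simulate_fusion_reads(gene_a, gene_b, n_reads=10)
print(len(supporting), "junction-spanning reads, e.g.", supporting[0][:30], "...")
```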


Yeast ◽  
2000 ◽  
Vol 1 (1) ◽  
pp. 22-36 ◽  
Author(s):  
Aoife McLysaght ◽  
Anton J. Enright ◽  
Lucy Skrabanek ◽  
Kenneth H. Wolfe

Background: Knowledge of the amount of gene order and synteny conservation between two species gives insights into the extent and mechanisms of divergence. The vertebrate Fugu rubripes (pufferfish) has a small genome with little repetitive sequence, which makes it attractive as a model genome. Genome compaction and synteny conservation between human and Fugu were studied using data from public databases.
Methods: Intron lengths and map positions of human and Fugu orthologues were compared to analyse relative genome compaction and synteny conservation, respectively. The divergence of these two genomes by genome rearrangement was simulated and the results were compared to the real data.
Results: Analysis of 199 introns in 22 orthologous genes showed an eight-fold average size reduction in Fugu, consistent with the ratio of total genome sizes. There was no consistent pattern relating the size reduction in individual introns or genes to gene base composition in either species. For genes that are neighbours in Fugu (genes from the same cosmid or GenBank entry), 40–50% have conserved synteny with a human chromosome. This figure may be underestimated by as much as two-fold, due to problems caused by incomplete human genome sequence data and the existence of dispersed gene families. Some genes that are neighbours in Fugu have human orthologues that are several megabases and tens of genes apart. This is probably caused by small inversions or other intrachromosomal rearrangements.
Conclusions: Comparison of observed data to computer simulations suggests that 4000–16,000 chromosomal rearrangements have occurred since Fugu and human shared a common ancestor, implying a faster rate of rearrangement than seen in human/mouse comparisons.
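The rearrangement simulation can be sketched as follows: apply random inversions and reciprocal translocations to a toy genome and track how quickly ancestral neighbour pairs stop sharing a chromosome (illustrative only, with made-up chromosome and gene counts, not the authors' simulation code).

```python
import random

random.seed(0)

def ancestral_genome(chrom_sizes):
    """Chromosomes of consecutively numbered genes."""
    genome, g = [], 0
    for size in chrom_sizes:
        genome.append(list(range(g, g + size)))
        g += size
    return genome

def rearrange(genome, n_events, p_translocation=0.3):
    """Apply random inversions and reciprocal translocations in place."""
    for _ in range(n_events):
        if random.random() < p_translocation and len(genome) > 1:
            a, b = random.sample(range(len(genome)), 2)
            i = random.randint(1, len(genome[a]) - 1)
            j = random.randint(1, len(genome[b]) - 1)
            genome[a], genome[b] = (genome[a][:i] + genome[b][j:],
                                    genome[b][:j] + genome[a][i:])
        else:
            c = random.randrange(len(genome))
            i, j = sorted(random.sample(range(len(genome[c]) + 1), 2))
            genome[c][i:j] = genome[c][i:j][::-1]

def syntenic_fraction(ancestor, derived):
    """Fraction of ancestrally adjacent gene pairs still sharing a chromosome."""
    chrom_of = {gene: ci for ci, chrom in enumerate(derived) for gene in chrom}
    pairs = [(c[i], c[i + 1]) for c in ancestor for i in range(len(c) - 1)]
    return sum(chrom_of[a] == chrom_of[b] for a, b in pairs) / len(pairs)

ancestor = ancestral_genome([500] * 22)          # toy genome, not real gene counts
for n_events in (100, 1_000, 4_000, 16_000):
    derived = ancestral_genome([500] * 22)
    rearrange(derived, n_events)
    print(f"{n_events:6d} rearrangements -> "
          f"{syntenic_fraction(ancestor, derived):.2f} of neighbour pairs still syntenic")
```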


Testing is essential in data warehouse systems for decision making, because the accuracy, validity, and correctness of the data depend on it. Considering the characteristics and complexity of a data warehouse, in this paper we have tried to show the scope of automated testing in assuring the best data warehouse solutions. First, we developed a data set generator for creating synthetic but near-to-real data; then, in the synthesized data, anomalies were classified with the help of a hand-coded Extraction, Transformation and Loading (ETL) routine. To assure the quality of data for a data warehouse, and to give an idea of how important Extraction, Transformation and Loading is, some very important test cases were identified. After that, to ensure data quality, automated testing procedures were embedded in the hand-coded ETL routine. Statistical analysis revealed a large enhancement in data quality with the automated testing procedures, reinforcing the fact that automated testing gives promising results for data warehouse quality. For effective and easy maintenance of distributed data, a novel architecture was proposed. Although the desired result of this research was achieved and the objectives are promising, the results still need to be validated in a real-life environment, as this research was done in a simulated environment, which may not always reflect real-life behaviour. Hence, the overall potential of the proposed architecture cannot be seen until it is deployed to manage real data that is distributed globally.
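Typical automated post-load checks of the kind described, such as row-count reconciliation, null checks, and referential integrity, can be expressed directly against the warehouse. The sketch below uses hypothetical table and column names (staging_sales, fact_sales, dim_customer) purely for illustration.

```python
import sqlite3

def run_etl_quality_checks(conn: sqlite3.Connection) -> dict:
    """Minimal automated checks after an ETL load: row-count reconciliation,
    null checks on key columns, and referential integrity of a foreign key."""
    cur = conn.cursor()
    checks = {}

    # 1. Row counts: every staged row should have been loaded into the fact table.
    staged = cur.execute("SELECT COUNT(*) FROM staging_sales").fetchone()[0]
    loaded = cur.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
    checks["row_count_match"] = (staged == loaded)

    # 2. Completeness: keys and measures must not be NULL.
    nulls = cur.execute(
        "SELECT COUNT(*) FROM fact_sales WHERE customer_id IS NULL OR amount IS NULL"
    ).fetchone()[0]
    checks["no_nulls_in_keys"] = (nulls == 0)

    # 3. Referential integrity: every fact row must reference an existing dimension row.
    orphans = cur.execute(
        "SELECT COUNT(*) FROM fact_sales f "
        "LEFT JOIN dim_customer d ON f.customer_id = d.customer_id "
        "WHERE d.customer_id IS NULL"
    ).fetchone()[0]
    checks["referential_integrity"] = (orphans == 0)

    return checks
```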

