Topological Data Analysis for Data Mining Small Educational Samples with Application to Studies of the Gifted

2017 ◽  
Author(s):  
Colleen Molloy Farrelly

Studies of highly and profoundly gifted children typically involve small sample sizes, as the population is relatively rare, and many statistical methods cannot handle such small samples well. However, topological data analysis (TDA) tools are robust even with very small samples and can provide useful information as well as robust statistical tests. This study demonstrates these capabilities on data simulated from previous talent search results (small and large samples), as well as on a subset of data from Ruf’s cohort of gifted children. TDA methods show strong, robust performance and uncover insight into sample characteristics and subgroups, including the appearance of similar subgroups across assessment populations.
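The clustering side of this kind of TDA rests on 0-dimensional persistent homology, which can be computed with nothing more than a minimum spanning tree: each point is "born" at filtration value 0, and components "die" (merge) at the MST edge lengths. A minimal NumPy sketch, with a toy point cloud of our own (the function name and data are illustrative, not from the study):

```python
import numpy as np

def h0_persistence(points):
    """0-dimensional persistent homology of a point cloud.

    Components merge at the pairwise distance where they first connect,
    which equals the edge lengths of the minimum spanning tree (built
    here with Prim's algorithm). Returns the sorted merge ("death") values.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    visited = np.zeros(n, dtype=bool)
    visited[0] = True
    best = dist[0].copy()          # cheapest connection of each point to the tree
    deaths = []
    for _ in range(n - 1):
        masked = np.where(visited, np.inf, best)
        j = int(np.argmin(masked))  # closest point not yet in the tree
        deaths.append(masked[j])
        visited[j] = True
        best = np.minimum(best, dist[j])
    return sorted(deaths)

# Three collinear points at 0, 1, 3: two merges, at distances 1 and 2.
print(h0_persistence([[0.0], [1.0], [3.0]]))
```

Large gaps in the returned death values indicate well-separated subgroups, which is how such a summary can flag subpopulations even in very small samples.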

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Florent Le Borgne ◽  
Arthur Chatton ◽  
Maxime Léger ◽  
Rémi Lenain ◽  
Yohann Foucher

In clinical research, there is growing interest in the use of propensity score-based methods to estimate causal effects. G-computation is an alternative because of its high statistical power. Machine learning is also increasingly used because of its possible robustness to model misspecification. In this paper, we aimed to propose an approach that combines machine learning and G-computation when both the outcome and the exposure status are binary, and that is able to deal with small samples. We evaluated the performance of several methods, including penalized logistic regressions, a neural network, a support vector machine, boosted classification and regression trees, and a super learner, through simulations. We proposed six different scenarios characterised by various sample sizes, numbers of covariates, and relationships between covariates, exposure statuses, and outcomes. We also illustrated the application of these methods by using them to estimate the efficacy of barbiturates prescribed during the first 24 h of an episode of intracranial hypertension. In the context of G-computation, for estimating the individual outcome probabilities in the two counterfactual worlds, we found that the super learner tended to outperform the other approaches in terms of both bias and variance, especially for small sample sizes. The support vector machine also performed well, but its mean bias was slightly higher than that of the super learner. In the investigated scenarios, G-computation associated with the super learner was a performant method for drawing causal inferences, even from small sample sizes.


2016 ◽  
Vol 41 (5) ◽  
pp. 472-505 ◽  
Author(s):  
Elizabeth Tipton ◽  
Kelly Hallberg ◽  
Larry V. Hedges ◽  
Wendy Chan

Background: Policy makers and researchers are frequently interested in understanding how effective a particular intervention may be for a specific population. One approach is to assess the degree of similarity between the sample in an experiment and the population. Another approach is to combine information from the experiment and the population to estimate the population average treatment effect (PATE). Method: Several methods for assessing the similarity between a sample and a population currently exist, as do methods for estimating the PATE. In this article, we investigate the properties of six of these methods and statistics at the small sample sizes common in education research (i.e., 10–70 sites), evaluating the utility of rules of thumb developed from observational studies in the generalization case. Result: In small random samples, large differences between the sample and the population can arise simply by chance, and many of the statistics commonly used in generalization are a function of both the sample size and the number of covariates being compared. The rules of thumb developed in observational studies (which are commonly applied in generalization) are much too conservative given the small sample sizes found in generalization. Conclusion: These findings imply that sharp inferences to large populations from small experiments are difficult even with probability sampling. Features of random samples should be kept in mind when evaluating the extent to which results from experiments conducted on nonrandom samples might generalize.
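The basic similarity statistic in this literature is the covariate-wise standardized mean difference (SMD), with ~0.25 an often-cited observational-study cutoff. A small sketch on synthetic data (our own toy setup) showing the article's central caution: even a true random subsample shows nonzero SMDs purely by chance, and they grow as the sample shrinks:

```python
import numpy as np

def standardized_mean_differences(sample, population):
    """Absolute standardized mean difference per covariate:
    |mean_s - mean_p| / pooled SD. Values near 0 suggest similarity."""
    sample = np.asarray(sample, float)
    population = np.asarray(population, float)
    pooled_sd = np.sqrt((sample.var(axis=0, ddof=1) +
                         population.var(axis=0, ddof=1)) / 2)
    return np.abs(sample.mean(axis=0) - population.mean(axis=0)) / pooled_sd

rng = np.random.default_rng(1)
population = rng.normal(size=(10_000, 5))
# A genuinely random subsample of only 20 sites still shows chance SMDs.
small = population[rng.choice(10_000, size=20, replace=False)]
print(standardized_mean_differences(small, population))
```

This is why fixed cutoffs borrowed from large observational studies mislead in generalization: with 10–70 sites, chance alone pushes SMDs toward and past the threshold.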


2014 ◽  
Vol 11 (Suppl 1) ◽  
pp. S2 ◽  
Author(s):  
Joanna Zyla ◽  
Paul Finnon ◽  
Robert Bulman ◽  
Simon Bouffler ◽  
Christophe Badie ◽  
...  

2005 ◽  
Vol 28 (3) ◽  
pp. 283-294 ◽  
Author(s):  
Jin-Shei Lai ◽  
Jeanne Teresi ◽  
Richard Gershon

An item with differential item functioning (DIF) displays different statistical properties conditional on a matching variable. The presence of DIF in measures can invalidate the conclusions of medical outcome studies. Numerous approaches have been developed to examine DIF in many areas, including education and health-related quality of life. There is little consensus in the research community regarding the selection of a single best method, and most methods require large sample sizes. This article describes some approaches for examining DIF with small samples (e.g., fewer than 200).
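One widely used DIF approach that remains workable at these sample sizes is logistic regression DIF (Swaminathan & Rogers): regress the item response on the matching score, add group membership, and compare the nested models with a likelihood-ratio statistic. A hedged sketch on simulated data (the sample size, effect sizes, and near-unpenalized `C=1e6` setting are our own illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 150                                    # deliberately small sample
theta = rng.normal(size=n)                 # latent ability
group = rng.binomial(1, 0.5, size=n)       # reference vs. focal group
# Item with uniform DIF: harder for the focal group at equal ability.
p = 1 / (1 + np.exp(-(theta - 0.3 - 0.8 * group)))
item = rng.binomial(1, p)
total = theta + rng.normal(scale=0.5, size=n)   # proxy matching score

def loglik(model, X, y):
    pr = model.predict_proba(X)[:, 1]
    return np.sum(y * np.log(pr) + (1 - y) * np.log(1 - pr))

X0 = total.reshape(-1, 1)                  # matching variable only
X1 = np.column_stack([total, group])       # + group term (uniform DIF)
m0 = LogisticRegression(C=1e6).fit(X0, item)   # large C ~ unpenalized MLE
m1 = LogisticRegression(C=1e6).fit(X1, item)
lr_stat = 2 * (loglik(m1, X1, item) - loglik(m0, X0, item))
print(f"LR chi-square (1 df) for uniform DIF: {lr_stat:.2f}")
```

Adding the group-by-score interaction as a third model extends the same comparison to nonuniform DIF.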


2019 ◽  
Author(s):  
Andrea Cardini ◽  
Paul O’Higgins ◽  
F. James Rohlf

Using sampling experiments, we found that, when there are fewer groups than variables, between-groups PCA (bgPCA) may suggest surprisingly distinct differences among groups for data in which none exist. While apparently not noticed before, the reasons for this problem are easy to understand. A bgPCA captures the g-1 dimensions of variation among the g group means, but only a fraction of the ∑ni − g dimensions of within-group variation (ni are the sample sizes), when the number of variables, p, is greater than g-1. This introduces a distortion in the appearance of the bgPCA plots because the within-group variation will be underrepresented, unless the variables are sufficiently correlated so that the total variation can be accounted for with just g-1 dimensions. The effect is most obvious when sample sizes are small relative to the number of variables, because smaller samples spread out less, but the distortion is present even for large samples. Strong covariance among variables largely reduces the magnitude of the problem, because it effectively reduces the dimensionality of the data and thus enables a larger proportion of the within-group variation to be accounted for within the g-1-dimensional space of a bgPCA. The distortion will still be relevant though its strength will vary from case to case depending on the structure of the data (p, g, covariances etc.). These are important problems for a method mainly designed for the analysis of variation among groups when there are very large numbers of variables and relatively small samples. In such cases, users are likely to conclude that the groups they are comparing are much more distinct than they really are. Having many variables but just small sample sizes is a common problem in fields ranging from morphometrics (as in our examples) to molecular analyses.
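The artifact is easy to reproduce in a few lines of NumPy: generate pure noise with p much larger than the sample size, assign arbitrary group labels, run a bgPCA (PCA of the group means, then project all specimens), and the "groups" look separated. A minimal sketch with our own toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
g, n_per, p = 3, 10, 200             # few groups, small samples, many variables
X = rng.normal(size=(g * n_per, p))  # pure noise: no real group structure
labels = np.repeat(np.arange(g), n_per)

# Between-groups PCA: principal axes of the g centred group means...
means = np.vstack([X[labels == k].mean(axis=0) for k in range(g)])
grand = means.mean(axis=0)
_, _, vt = np.linalg.svd(means - grand, full_matrices=False)
axes = vt[: g - 1]                   # the g-1 between-group axes

# ...then project every specimen onto those axes.
scores = (X - grand) @ axes.T
for k in range(g):
    print(f"group {k} mean score: {scores[labels == k].mean(axis=0)}")
```

Because the axes are chosen to separate the group means of noise, the between-group spread of the scores exceeds the within-group spread, exactly the spurious distinctness the authors warn about.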


2006 ◽  
Vol 361 (1475) ◽  
pp. 2023-2037 ◽  
Author(s):  
Thomas P Curtis ◽  
Ian M Head ◽  
Mary Lunn ◽  
Stephen Woodcock ◽  
Patrick D Schloss ◽  
...  

The extent of microbial diversity is an intrinsically fascinating subject of profound practical importance. The term ‘diversity’ may allude to the number of taxa or species richness as well as their relative abundance. There is uncertainty about both, primarily because sample sizes are too small. Non-parametric diversity estimators make gross underestimates if used with small sample sizes on unevenly distributed communities. One can make richness estimates over many scales using small samples by assuming a species/taxa-abundance distribution. However, no one knows what the underlying taxa-abundance distributions are for bacterial communities. Latterly, diversity has been estimated by fitting data from gene clone libraries and extrapolating from this to taxa-abundance curves to estimate richness. However, since sample sizes are small, we cannot be sure that such samples are representative of the community from which they were drawn. It is however possible to formulate, and calibrate, models that predict the diversity of local communities and of samples drawn from that local community. The calibration of such models suggests that migration rates are small and decrease as the community gets larger. The preliminary predictions of the model are qualitatively consistent with the patterns seen in clone libraries in ‘real life’. The validation of this model is also confounded by small sample sizes. However, if such models were properly validated, they could form invaluable tools for the prediction of microbial diversity and a basis for the systematic exploration of microbial diversity on the planet.
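The non-parametric estimators in question include Chao1, which corrects observed richness upward using the counts of singletons and doubletons; it is explicitly a lower bound, which is why it underestimates badly when small samples miss most taxa of an uneven community. A short sketch with made-up counts:

```python
from collections import Counter

def chao1(abundances):
    """Chao1 lower-bound richness estimate from per-taxon counts:
    S_obs + f1^2 / (2 * f2), where f1 and f2 are the numbers of
    singleton and doubleton taxa; the bias-corrected form
    f1 * (f1 - 1) / 2 is used when no doubletons are observed."""
    counts = [c for c in abundances if c > 0]
    s_obs = len(counts)
    freq = Counter(counts)
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2

# A small sample from a highly uneven community: 7 observed taxa,
# 3 singletons, 1 doubleton -> estimate 7 + 9/2 = 11.5, however many
# rare taxa the true community actually holds.
print(chao1([50, 20, 10, 2, 1, 1, 1]))
```

The estimate can only see as deep as the rarest classes the sample happened to capture, which is the gross-underestimation problem the authors describe.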


2019 ◽  
Vol 80 (3) ◽  
pp. 499-521
Author(s):  
Ben Babcock ◽  
Kari J. Hodge

Equating and scaling in the context of small-sample exams, such as credentialing exams for highly specialized professions, has received increased attention in recent research. Investigators have proposed a variety of both classical and Rasch-based approaches to the problem. This study attempts to extend past research by (1) directly comparing classical and Rasch techniques for equating exam scores when sample sizes are small (N ≤ 100 per exam form) and (2) attempting to pool multiple forms’ worth of data to improve estimation in the Rasch framework. We simulated multiple years of a small-sample exam program by resampling from a larger certification exam program’s real data. Results showed that combining multiple administrations’ worth of data via the Rasch model can lead to more accurate equating than classical methods designed to work well in small samples. WINSTEPS-based Rasch methods that used multiple exam forms’ data worked better than Bayesian Markov Chain Monte Carlo methods, as the prior distribution used to estimate the item difficulty parameters biased predicted scores when there were difficulty differences between exam forms.
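To see what pooling buys in the Rasch framework, consider the PROX-style normal approximation (the quick method WINSTEPS uses for starting values): an item's difficulty is roughly the centred log-odds of an incorrect response. A deliberately simplified one-step version on simulated responses (sample size, true difficulties, and the omission of PROX's variance-expansion correction are our own simplifications):

```python
import numpy as np

def prox_difficulties(responses):
    """One-step PROX-style approximation to Rasch item difficulties.

    responses: 0/1 matrix (persons x items). The item logit
    ln((1 - p_i) / p_i), centred to mean zero, approximates the Rasch
    difficulty scale. Concatenating several forms' response matrices
    (linked by common items) is what puts all items on one scale.
    """
    p = np.asarray(responses, float).mean(axis=0)   # proportion correct
    p = np.clip(p, 1e-6, 1 - 1e-6)                  # guard 0%/100% items
    logits = np.log((1 - p) / p)
    return logits - logits.mean()

rng = np.random.default_rng(4)
theta = rng.normal(size=80)                  # one small-sample cohort
true_b = np.array([-1.0, -0.3, 0.3, 1.0])    # true item difficulties
prob = 1 / (1 + np.exp(-(theta[:, None] - true_b[None, :])))
responses = rng.binomial(1, prob)
print(prox_difficulties(responses))          # roughly tracks true_b
```

With N = 80 the recovered difficulties are noisy, which is precisely why stacking several administrations' worth of responses before calibrating, as the study does, stabilizes the estimates.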


2016 ◽  
Author(s):  
Narkis S. Morales ◽  
Ignacio C. Fernández ◽  
Victoria Baca-González

Environmental niche modeling (ENM) is commonly used to develop probabilistic maps of species distribution. Among available ENM techniques, MaxEnt has become one of the most popular tools for modeling species distribution, with hundreds of peer-reviewed articles published each year. MaxEnt’s popularity is mainly due to its graphical interface and automatic parameter configuration capabilities. However, recent studies have shown that using the default automatic configuration may not always be appropriate, because it can produce non-optimal models, particularly when dealing with a small number of species presence points. Thus, the recommendation is to evaluate the best potential combination of parameters (feature classes and regularization multiplier) to select the most appropriate model. In this work we reviewed 244 articles from 142 journals published between 2013 and 2015 to assess whether researchers are following recommendations to avoid using the default parameter configuration when dealing with small sample sizes, or whether they are using MaxEnt as a “black box” tool. Our results show that authors evaluated the best feature classes in only 16% of the analyzed articles, the best regularization multiplier in 6.9%, and both parameters simultaneously in a meager 3.7% before producing the definitive distribution model. These results are worrying, because publications may be reporting over-complex or over-simplistic models that undermine the applicability of their results; of particular concern are studies used to inform policy making. Therefore, researchers, practitioners, reviewers, and editors need to be very judicious when dealing with MaxEnt, particularly when the modeling process is based on small sample sizes.


2018 ◽  
Author(s):  
Colleen Molloy Farrelly

This study aims to confirm prior findings on the usefulness of topological data analysis (TDA) in the analysis of small samples, particularly in cohorts of profoundly gifted students, and to explore the use of TDA-based regression methods for statistical modeling with small samples. A subset of the Gross sample is analyzed through supervised and unsupervised methods, including 16 and 17 individuals, respectively. Unsupervised learning confirmed prior results suggesting that evenly gifted and unevenly gifted subpopulations fundamentally differ. Supervised learning focused on predicting graduate school attendance and awards earned during undergraduate studies, and TDA-based logistic regression models were compared with more traditional machine learning models for logistic regression. Results suggest (1) that TDA-based methods are capable of handling small samples and seem more robust than other machine learning methods to the issues that arise in small samples, and (2) that early childhood achievement scores and several factors related to childhood education interventions (such as early entry and radical acceleration) play a role in predicting key educational and professional achievements in adulthood. Possible new directions from this work include the use of TDA-based tools in the analysis of rare cohorts thus far relegated to qualitative analytics or case studies, as well as exploration of early educational factors and adult-level achievement in larger populations of the profoundly gifted, particularly within the Study of Exceptional Talent and Talent Identification Program cohorts.


2017 ◽  
Vol 313 (5) ◽  
pp. L873-L877 ◽  
Author(s):  
Charity J. Morgan

In this review I discuss the appropriateness of various statistical methods for use with small sample sizes. I review the assumptions and limitations of these methods and provide recommendations for figures and statistical tests.

