A Data-Driven Approach to the Fragile Families Challenge: Prediction through Principal-Components Analysis and Random Forests

2019, Vol 5, pp. 237802311881872
Author(s): Ryan Compton

Sociological research typically involves exploring theoretical relationships, but the emergence of “big data” enables alternative approaches. This work shows the promise of data-driven machine-learning techniques, combining feature engineering with predictive model optimization, for addressing a sociological data challenge. The author’s group develops improved generalizable models to identify at-risk families. Principal-components analysis and decision tree modeling are used to predict six main dependent variables in the Fragile Families Challenge, successfully modeling one binary variable but none of the continuous dependent variables in the diagnostic data set. This indicates that some binary dependent variables can be predicted from a reduced set of uncorrelated independent variables, whereas continuous dependent variables demand more modeling complexity.
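A minimal sketch of the pipeline this abstract describes (PCA for dimension reduction, then tree-based prediction) is given below. The synthetic data, feature counts, and binary outcome are illustrative assumptions, not the authors' actual setup.

```python
# Hedged sketch: PCA feature reduction followed by a tree ensemble.
# All data here are synthetic placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))   # stand-in for thousands of survey features
y = rng.integers(0, 2, size=500)  # a hypothetical binary outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Reduce to a small set of uncorrelated components, then fit the ensemble.
model = make_pipeline(PCA(n_components=20), RandomForestClassifier(random_state=0))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```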

2013, Vol 17 (7), pp. 1476-1485
Author(s): Kate Northstone, Andrew DAC Smith, Victoria L Cribb, Pauline M Emmett

Objective: To derive dietary patterns using principal components analysis from separate FFQs completed by mothers and their teenagers, and to assess associations with nutrient intakes and sociodemographic variables.
Design: Two distinct FFQs were completed by 13-year-olds and their mothers, with some overlap in the foods covered. A combined data set was obtained.
Setting: Avon Longitudinal Study of Parents and Children (ALSPAC), Bristol, UK.
Subjects: Teenagers (n 5334) with adequate dietary data.
Results: Four patterns were obtained using principal components analysis: a ‘Traditional/health-conscious’ pattern, a ‘Processed’ pattern, a ‘Snacks/sugared drinks’ pattern and a ‘Vegetarian’ pattern. The ‘Traditional/health-conscious’ pattern was the most nutrient-rich, having high positive correlations with many nutrients. The ‘Processed’ and ‘Snacks/sugared drinks’ patterns showed little association with important nutrients but were positively associated with energy, fats and sugars. There were clear gender and sociodemographic differences across the patterns: scores on the ‘Traditional/health-conscious’ and ‘Vegetarian’ patterns were lower in males and in those with younger, less educated mothers, and higher in girls and in those whose mothers had more education.
Conclusions: It is important to establish healthy eating patterns by the teenage years. However, this is a time when it is difficult to establish dietary intake accurately from a single source, since teenagers consume increasing amounts of food outside the home. Further dietary pattern studies should focus on teenagers, and the source of dietary data collection merits consideration.
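The pattern-derivation step can be sketched as follows, assuming a subjects-by-foods frequency matrix. The food items, the four-component choice, and the nutrient variable are placeholders; the paper's actual FFQ items and any rotation step are not reproduced here.

```python
# Hedged sketch: dietary patterns as principal components of an FFQ matrix,
# with component scores correlated against a nutrient intake variable.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
foods = [f"food_{i}" for i in range(40)]            # hypothetical FFQ items
ffq = pd.DataFrame(rng.poisson(3, size=(300, 40)), columns=foods)

pca = PCA(n_components=4)                           # four patterns, as in the paper
scores = pca.fit_transform(StandardScaler().fit_transform(ffq))

# High-loading foods give each pattern its name.
loadings = pd.DataFrame(pca.components_.T, index=foods,
                        columns=[f"pattern_{k + 1}" for k in range(4)])
print(loadings.head())

energy = rng.normal(2000, 300, size=300)            # hypothetical nutrient variable
print("corr(pattern 1, energy):", np.corrcoef(scores[:, 0], energy)[0, 1])
```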


1984, Vol 18 (11), pp. 2471-2478
Author(s): J. Smeyers-Verbeke, J.C. Den Hartog, W.H. Dekker, D. Coomans, L. Buydens, ...

2006, Vol 23 (3), pp. 106-118
Author(s): Gordon E. Sarty, Kinwah Wu

The ratios of hydrogen Balmer emission line intensities in cataclysmic variables are signatures of the physical processes that produce them. To quantify those signatures relative to classifications of cataclysmic variable types, we applied the multivariate statistical analysis methods of principal components analysis and discriminant function analysis to the spectroscopic emission data set of Williams (1983). The two analysis methods reveal two different sources of variation in the ratios of the emission lines. The source of variation seen in the principal components analysis was shown to be correlated with the binary orbital period. The source of variation seen in the discriminant function analysis was shown to be correlated with the equivalent width of the Hβ line. Comparison of the data scatterplot with scatterplots of theoretical models shows that Balmer line emission from T CrB systems is consistent with the photoionization of a surrounding nebula. Otherwise, models that we considered do not reproduce the wide range of Balmer decrements, including ‘inverted’ decrements, seen in the data.
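A sketch of applying the two methods to a matrix of line-intensity ratios might look like the following. The ratio columns, class labels, and data are invented stand-ins for the Williams (1983) measurements, and scikit-learn's linear discriminant analysis stands in for the discriminant function analysis the authors used.

```python
# Hedged sketch: PCA and linear discriminant analysis on synthetic
# Balmer line-ratio data grouped by cataclysmic variable type.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
ratios = rng.normal(size=(100, 3))       # e.g., Halpha/Hbeta, Hgamma/Hbeta, Hdelta/Hbeta
cv_type = rng.integers(0, 4, size=100)   # hypothetical CV-type labels

pc_scores = PCA(n_components=2).fit_transform(ratios)  # unsupervised variation
df_scores = LinearDiscriminantAnalysis(n_components=2).fit_transform(ratios, cv_type)
# Either score set can then be tested for correlation with an external
# quantity, such as orbital period or Hbeta equivalent width.
```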


1983, Vol 40 (10), pp. 1752-1760
Author(s): Michael A. Gates, Ann P. Zimmerman, W. Gary Sprules, Roy Knoechel

We introduce a method, based on principal components analysis, for studying temporal changes in biomass allocation among 16 size-category compartments of lake plankton. Applied to data from a series of 12 Ontario lakes over three sampling seasons, the technique provides a simple means of visualizing shifts in patterns of biomass allocation, and it allows comparative analyses of biomass fluctuations in different lakes. Each of the principal component axes is interpretable. Furthermore, a large proportion of the variance in both the mean position of a lake and its movement along these axes is interpreted as a function of lake physicochemistry. The analysis also provides weighted scores for use in hypothesis testing, which are an improvement over mean biomass values alone because they take into account the structure of variation in the data set.
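The approach can be illustrated roughly as below, with invented data: PCA over lake-by-date biomass vectors, then each lake's trajectory traced in component space. The log transform and the array dimensions are assumptions, not the authors' exact preprocessing.

```python
# Hedged sketch: PCA of biomass allocation across 16 size-category
# compartments, tracking each lake's trajectory through a season.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n_lakes, n_dates, n_compartments = 12, 10, 16
biomass = rng.gamma(2.0, 1.0, size=(n_lakes * n_dates, n_compartments))

pca = PCA(n_components=2)
scores = pca.fit_transform(np.log(biomass))   # log transform is an assumption

# Reshape so each lake has a trajectory of component scores over time;
# mean position and movement along the axes can then be regressed on
# physicochemical variables.
trajectories = scores.reshape(n_lakes, n_dates, 2)
mean_position = trajectories.mean(axis=1)     # one point per lake
print(mean_position)
```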


1982, Vol 26 (11), pp. 959-963
Author(s): R. H. Shannon, M. Krause, R. C. Irons

Eighteen subjects practiced a video game of bombing and air combat maneuvering, Phantoms Five®, on an APPLE® microcomputer for 10 minutes a day for 15 days. The dependent variable was the combined score for number of hits and number of targets. Performance stabilized from Days 8–15, with a pooled reliability of .904. Eight reference tests that theoretically measure cognitive, perceptual, quantitative, and motor skills were selected and used as independent variables. Stabilized performance on these tests was observed after a period of practice predetermined from previous experimentation. Attributes of the Phantoms Five® task were isolated using a structured job-analytic tool, the Position Analysis Questionnaire (PAQ). A principal components analysis of the measures that correlated with the dependent variable resulted in a one-factor solution explaining 66 percent of the variance. It was concluded that construct validity was established, since the attribute requirements obtained by correlating the stabilized scores of the independent and dependent variables closely matched those obtained from the PAQ analysis of task functions.
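The core analysis (a PCA of the correlated measures, checked for a dominant single component) can be sketched with simulated data:

```python
# Hedged sketch: does one principal component capture most of the variance
# in a small battery of correlated measures? Data are simulated, not the
# study's; compare the printed value with the paper's 66 percent.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
g = rng.normal(size=(18, 1))                 # a shared ability factor
tests = g + 0.6 * rng.normal(size=(18, 8))   # eight correlated reference tests

pca = PCA().fit(StandardScaler().fit_transform(tests))
print("variance explained by PC1:", pca.explained_variance_ratio_[0])
```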


2019, Vol 18 (2), pp. 209-226
Author(s): Gabriela Deliu, Cristina Miron, Cristian Opariuc-Dan

The aim of this research is to study the merits and complementarity of Construct Mapping and Categorical Principal Components Analysis as two approaches for exploring the dimensionality of multiple-choice items in achievement tests. Data from the two forms of the Romanian National Assessment Tests on Science were used to explore the dimensionality of items and to identify potentially problematic items that affect the equivalence of the two parallel forms. The findings confirm that the two tests have at best partial equivalence; while the two methods agree on test unidimensionality, they flag partly different items as potentially problematic. The results enable researchers and practitioners to make coherent, data-driven decisions regarding the use of unidimensional vs. multidimensional IRT models.
Keywords: categorical principal components analysis, construct map, item response theory, unidimensionality.
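A rough stand-in for the dimensionality check is sketched below on simulated 0/1-scored responses. Note that plain PCA on dichotomous scores only approximates categorical PCA, which applies optimal scaling to the response categories; the item and examinee counts are arbitrary.

```python
# Hedged sketch: eigenvalue-based unidimensionality check on simulated
# dichotomously scored multiple-choice responses. Plain PCA is used here
# as a rough proxy for categorical PCA with optimal scaling.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
ability = rng.normal(size=(400, 1))                 # one latent trait
difficulty = rng.normal(size=(1, 30))               # 30 items
scored = (ability - difficulty + rng.logistic(size=(400, 30)) > 0).astype(int)

eig = PCA().fit(scored).explained_variance_
# A dominant first eigenvalue suggests unidimensionality.
print(eig[:5] / eig.sum())
```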


2013, Vol 7 (1), pp. 19-24
Author(s): Kevin Blighe

Elaborate downstream methods are required to analyze large microarray data sets. When the end goal is to look for relationships between (or patterns within) different subgroups, or even just individual samples, large data sets must first be filtered using statistical thresholds in order to reduce their overall volume. For example, in anthropological microarray studies, such ‘dimension reduction’ techniques are essential to elucidate any links between polymorphisms and phenotypes for given populations. In such large data sets, a subset can first be taken to represent the whole, much as polling results taken during elections are used to infer the opinions of the population at large. But what is the best and easiest method of capturing a subset of variation that can represent the overall portrait of variation in a data set? In this article, principal components analysis (PCA) is discussed in detail, including its history, the mathematics behind the process, and the ways in which it can be applied to modern large-scale biological data sets. New methods of analysis using PCA are also suggested, with tentative results outlined.
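The filter-then-reduce workflow the article walks through can be sketched as follows; the matrix dimensions, the variance filter, and the probe counts are assumptions for illustration.

```python
# Hedged sketch: variance-filter a large expression matrix, then project
# samples onto principal components. Dimensions are placeholders.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
expression = rng.normal(size=(60, 20000))   # 60 samples x 20k probes (synthetic)

# Statistical-threshold step: keep only the most variable probes.
top = np.argsort(expression.var(axis=0))[-2000:]
filtered = expression[:, top]

scores = PCA(n_components=2).fit_transform(filtered)
# The two score columns can now be plotted to look for sample subgroups.
print(scores[:5])
```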


Author(s): David Duran-Rodas, Emmanouil Chaniotakis, Constantinos Antoniou

Identification of the factors influencing ridership is necessary for policy-making, as well as when examining transferability and aspects of performance and reliability. In this work, a data-driven method is formulated to correlate arrivals and departures of station-based bike sharing systems with built-environment factors in multiple cities. Ridership data from stations in multiple cities are pooled in one data set regardless of their geographic boundaries. The method bundles the collection, analysis, and processing of data, as well as model estimation using statistical and machine learning techniques. The method was applied on a national level in six cities in Germany, and on an international level in three cities in Europe and North America. The results suggest that the model’s performance depended not on clustering cities by size but on the relative daily distribution of rentals. Some statistically significant factors varied temporally (e.g., nightclubs were significant during the night). The most influential variables were related to city population, distance to the city center, leisure-related establishments, and transport-related infrastructure. This data-driven method can serve as a decision-support tool for implementing or expanding bike sharing systems.
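A hedged sketch of the kind of model estimation described, pooling station records and regressing arrivals on built-environment covariates with a tree ensemble; all feature names and data are hypothetical.

```python
# Hedged sketch: regress station arrivals on built-environment factors.
# Features and data are invented placeholders, not the study's variables.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
stations = pd.DataFrame({
    "dist_to_center_km": rng.uniform(0, 10, 300),
    "population_density": rng.uniform(1e3, 2e4, 300),
    "leisure_pois": rng.poisson(5, 300),
    "transit_stops": rng.poisson(3, 300),
})
arrivals = rng.poisson(20, 300)   # daily arrivals per station (synthetic)

model = RandomForestRegressor(random_state=0)
print("CV R^2:", cross_val_score(model, stations, arrivals, cv=5).mean())
print(dict(zip(stations.columns,
               model.fit(stations, arrivals).feature_importances_)))
```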


2002, Vol 45 (4-5), pp. 227-235
Author(s): J. Lennox, C. Rosen

Fault detection and isolation (FDI) are important steps in the monitoring and supervision of industrial processes. Biological wastewater treatment (WWT) plants are difficult to model, and hence to monitor, because of the complexity of the biological reactions and because plant influent and disturbances are highly variable and/or unmeasured. Multivariate statistical models have been developed for a wide variety of situations over the past few decades, proving successful in many applications. In this paper we develop a new monitoring algorithm based on Principal Components Analysis (PCA). It can be seen equivalently as making Multiscale PCA (MSPCA) adaptive, or as a multiscale decomposition of adaptive PCA. Adaptive Multiscale PCA (AdMSPCA) exploits the changing multivariate relationships between variables at different time-scales. Adaptation of scale PCA models over time permits them to follow the evolution of the process, inputs or disturbances. Performance of AdMSPCA and adaptive PCA on a real WWT data set is compared and contrasted. The most significant difference observed was the ability of AdMSPCA to adapt to a much wider range of changes. This was mainly due to the flexibility afforded by allowing each scale model to adapt whenever it did not signal an abnormal event at that scale. Relative detection speeds were examined only summarily, but seemed to depend on the characteristics of the faults/disturbances. The results of the algorithms were similar for sudden changes, but AdMSPCA appeared more sensitive to slower changes.
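A simplified, non-adaptive sketch of the underlying PCA monitoring idea is below: fit PCA on normal operating data, then flag samples whose Hotelling T² or squared prediction error (SPE) exceeds an empirical limit. AdMSPCA adds the wavelet (multiscale) decomposition and model adaptation on top of this; the data and thresholds here are synthetic assumptions.

```python
# Hedged sketch: basic PCA process monitoring with Hotelling T^2 and SPE.
# Non-adaptive and single-scale, unlike AdMSPCA; data are synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
normal = rng.normal(size=(500, 10))        # normal-operation training data
scaler = StandardScaler().fit(normal)
pca = PCA(n_components=3).fit(scaler.transform(normal))

def monitor(x):
    z = scaler.transform(x)
    t = pca.transform(z)
    t2 = np.sum(t**2 / pca.explained_variance_, axis=1)      # Hotelling T^2
    spe = np.sum((z - pca.inverse_transform(t))**2, axis=1)  # residual-space error
    return t2, spe

t2, spe = monitor(normal)
t2_lim, spe_lim = np.percentile(t2, 99), np.percentile(spe, 99)  # empirical limits

fault = normal[:5] + 4.0                   # a sudden shift in all variables
t2_f, spe_f = monitor(fault)
print((t2_f > t2_lim) | (spe_f > spe_lim)) # flagged samples
```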

