A Data-Driven Approach to the Fragile Families Challenge: Prediction through Principal-Components Analysis and Random Forests

2019, Vol 5, pp. 237802311881872
Author(s): Ryan Compton

Sociological research typically involves exploring theoretical relationships, but the emergence of “big data” enables alternative approaches. This work shows the promise of data-driven machine-learning techniques, combining feature engineering with predictive model optimization, for addressing a sociological data challenge. The author’s group develops improved generalizable models to identify at-risk families. Principal-components analysis and decision tree modeling are used to predict six main dependent variables in the Fragile Families Challenge, successfully modeling one binary variable but none of the continuous dependent variables in the diagnostic data set. This indicates that some binary dependent variables can be predicted from a reduced set of uncorrelated independent variables, whereas continuous dependent variables demand more modeling complexity.
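A minimal sketch of the pipeline this abstract describes (PCA for dimension reduction, then tree-based prediction) is given below. The synthetic data, feature counts, and binary outcome are illustrative assumptions, not the authors' actual setup.

```python
# Hedged sketch: PCA feature reduction followed by a tree ensemble.
# All data here are synthetic placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))   # stand-in for thousands of survey features
y = rng.integers(0, 2, size=500)  # a hypothetical binary outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Reduce to a small set of uncorrelated components, then fit the ensemble.
model = make_pipeline(PCA(n_components=20), RandomForestClassifier(random_state=0))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```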

2013, Vol 17 (7), pp. 1476-1485
Author(s): Kate Northstone, Andrew DAC Smith, Victoria L Cribb, Pauline M Emmett

Objective: To derive dietary patterns using principal components analysis from separate FFQs completed by mothers and their teenagers, and to assess associations with nutrient intakes and sociodemographic variables.
Design: Two distinct FFQs were completed by 13-year-olds and their mothers, with some overlap in the foods covered. A combined data set was obtained.
Setting: Avon Longitudinal Study of Parents and Children (ALSPAC), Bristol, UK.
Subjects: Teenagers (n 5334) with adequate dietary data.
Results: Four patterns were obtained using principal components analysis: a ‘Traditional/health-conscious’ pattern, a ‘Processed’ pattern, a ‘Snacks/sugared drinks’ pattern and a ‘Vegetarian’ pattern. The ‘Traditional/health-conscious’ pattern was the most nutrient-rich, having high positive correlations with many nutrients. The ‘Processed’ and ‘Snacks/sugared drinks’ patterns showed little association with important nutrients but were positively associated with energy, fats and sugars. There were clear gender and sociodemographic differences across the patterns: scores on the ‘Traditional/health-conscious’ and ‘Vegetarian’ patterns were lower in males and in those with younger, less educated mothers, and higher in girls and in those whose mothers had more education.
Conclusions: It is important to establish healthy eating patterns by the teenage years. However, this is a time when it is difficult to establish dietary intake accurately from a single source, since teenagers consume increasing amounts of food outside the home. Further dietary pattern studies should focus on teenagers, and the source of dietary data collection merits consideration.
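The pattern-derivation step can be sketched as follows, assuming a subjects-by-foods frequency matrix. The food items, the four-component choice, and the nutrient variable are placeholders; the paper's actual FFQ items and any rotation step are not reproduced here.

```python
# Hedged sketch: dietary patterns as principal components of an FFQ matrix,
# with component scores correlated against a nutrient intake variable.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
foods = [f"food_{i}" for i in range(40)]            # hypothetical FFQ items
ffq = pd.DataFrame(rng.poisson(3, size=(300, 40)), columns=foods)

pca = PCA(n_components=4)                           # four patterns, as in the paper
scores = pca.fit_transform(StandardScaler().fit_transform(ffq))

# High-loading foods give each pattern its name.
loadings = pd.DataFrame(pca.components_.T, index=foods,
                        columns=[f"pattern_{k + 1}" for k in range(4)])
print(loadings.head())

energy = rng.normal(2000, 300, size=300)            # hypothetical nutrient variable
print("corr(pattern 1, energy):", np.corrcoef(scores[:, 0], energy)[0, 1])
```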


1984, Vol 18 (11), pp. 2471-2478
Author(s): J. Smeyers-Verbeke, J.C. Den Hartog, W.H. Dekker, D. Coomans, L. Buydens, ...

2006, Vol 23 (3), pp. 106-118
Author(s): Gordon E. Sarty, Kinwah Wu

The ratios of hydrogen Balmer emission line intensities in cataclysmic variables are signatures of the physical processes that produce them. To quantify those signatures relative to classifications of cataclysmic variable types, we applied the multivariate statistical analysis methods of principal components analysis and discriminant function analysis to the spectroscopic emission data set of Williams (1983). The two analysis methods reveal two different sources of variation in the ratios of the emission lines. The source of variation seen in the principal components analysis was shown to be correlated with the binary orbital period. The source of variation seen in the discriminant function analysis was shown to be correlated with the equivalent width of the Hβ line. Comparison of the data scatterplot with scatterplots of theoretical models shows that Balmer line emission from T CrB systems is consistent with the photoionization of a surrounding nebula. Otherwise, models that we considered do not reproduce the wide range of Balmer decrements, including ‘inverted’ decrements, seen in the data.
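A sketch of applying the two methods to a matrix of line-intensity ratios might look like the following. The ratio columns, class labels, and data are invented stand-ins for the Williams (1983) measurements, and scikit-learn's linear discriminant analysis stands in for the discriminant function analysis the authors used.

```python
# Hedged sketch: PCA and linear discriminant analysis on synthetic
# Balmer line-ratio data grouped by cataclysmic variable type.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
ratios = rng.normal(size=(100, 3))       # e.g., Halpha/Hbeta, Hgamma/Hbeta, Hdelta/Hbeta
cv_type = rng.integers(0, 4, size=100)   # hypothetical CV-type labels

pc_scores = PCA(n_components=2).fit_transform(ratios)  # unsupervised variation
df_scores = LinearDiscriminantAnalysis(n_components=2).fit_transform(ratios, cv_type)
# Either score set can then be tested for correlation with an external
# quantity, such as orbital period or Hbeta equivalent width.
```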


1983, Vol 40 (10), pp. 1752-1760
Author(s): Michael A. Gates, Ann P. Zimmerman, W. Gary Sprules, Roy Knoechel

We introduce a method, based on principal components analysis, for studying temporal changes in biomass allocation among 16 size-category compartments of lake plankton. Applied to data from a series of 12 Ontario lakes over three sampling seasons, the technique provides a simple means of visualizing shifts in patterns of biomass allocation, and it allows comparative analyses of biomass fluctuations in different lakes. Each of the principal component axes is interpretable. Furthermore, a large proportion of the variance in both the mean position of a lake and its movement along these axes is interpreted as a function of lake physicochemistry. The analysis also provides weighted scores for use in hypothesis testing, which are an improvement over mean biomass values alone because they take into account the structure of variation in the data set.
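The approach can be illustrated roughly as below, with invented data: PCA over lake-by-date biomass vectors, then each lake's trajectory traced in component space. The log transform and the array dimensions are assumptions, not the authors' exact preprocessing.

```python
# Hedged sketch: PCA of biomass allocation across 16 size-category
# compartments, tracking each lake's trajectory through a season.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n_lakes, n_dates, n_compartments = 12, 10, 16
biomass = rng.gamma(2.0, 1.0, size=(n_lakes * n_dates, n_compartments))

pca = PCA(n_components=2)
scores = pca.fit_transform(np.log(biomass))   # log transform is an assumption

# Reshape so each lake has a trajectory of component scores over time;
# mean position and movement along the axes can then be regressed on
# physicochemical variables.
trajectories = scores.reshape(n_lakes, n_dates, 2)
mean_position = trajectories.mean(axis=1)     # one point per lake
print(mean_position)
```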


1982, Vol 26 (11), pp. 959-963
Author(s): R. H. Shannon, M. Krause, R. C. Irons

Eighteen subjects practiced a video game of bombing and air combat maneuvering, Phantoms Five®, on an APPLE® microcomputer for 10 minutes a day for 15 days. The dependent variable was the combined score for number of hits and number of targets. Performance stabilized from Days 8–15, with a pooled reliability of .904. Eight reference tests that theoretically measure cognitive, perceptual, quantitative, and motor skills were selected and used as independent variables. Stabilized performance on these tests was observed after a period of practice predetermined from previous experimentation. Attributes of the Phantoms Five® task were isolated using a structured job-analytic tool, the Position Analysis Questionnaire (PAQ). A principal components analysis of the measures that correlated with the dependent variable resulted in a one-factor solution explaining 66 percent of the variance. It was concluded that construct validity was established, since the attribute requirements obtained by correlating the stabilized scores of the independent and dependent variables closely matched those obtained from the PAQ analysis of task functions.
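The core analysis (a PCA of the correlated measures, checked for a dominant single component) can be sketched with simulated data:

```python
# Hedged sketch: does one principal component capture most of the variance
# in a small battery of correlated measures? Data are simulated, not the
# study's; compare the printed value with the paper's 66 percent.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
g = rng.normal(size=(18, 1))                 # a shared ability factor
tests = g + 0.6 * rng.normal(size=(18, 8))   # eight correlated reference tests

pca = PCA().fit(StandardScaler().fit_transform(tests))
print("variance explained by PC1:", pca.explained_variance_ratio_[0])
```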


2019, Vol 18 (2), pp. 209-226
Author(s): Gabriela Deliu, Cristina Miron, Cristian Opariuc-Dan

The aim of this research is to study the merits and complementarity of Construct Mapping and Categorical Principal Components Analysis as two approaches for exploring the dimensionality of multiple-choice items in achievement tests. Data from the two forms of the Romanian National Assessment Tests on Science were used to explore the dimensionality of items and to identify potentially problematic items that affect the equivalence of the two parallel forms. The findings confirm that the two tests have at best partial equivalence; while the two methods agree on test unidimensionality, they flag partly different items as potentially problematic. The results enable researchers and practitioners to make coherent, data-driven decisions regarding the use of unidimensional vs. multidimensional IRT models.
Keywords: categorical principal components analysis, construct map, item response theory, unidimensionality.
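A rough stand-in for the dimensionality check is sketched below on simulated 0/1-scored responses. Note that plain PCA on dichotomous scores only approximates categorical PCA, which applies optimal scaling to the response categories; the item and examinee counts are arbitrary.

```python
# Hedged sketch: eigenvalue-based unidimensionality check on simulated
# dichotomously scored multiple-choice responses. Plain PCA is used here
# as a rough proxy for categorical PCA with optimal scaling.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
ability = rng.normal(size=(400, 1))                 # one latent trait
difficulty = rng.normal(size=(1, 30))               # 30 items
scored = (ability - difficulty + rng.logistic(size=(400, 30)) > 0).astype(int)

eig = PCA().fit(scored).explained_variance_
# A dominant first eigenvalue suggests unidimensionality.
print(eig[:5] / eig.sum())
```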


2013, Vol 7 (1), pp. 19-24
Author(s): Kevin Blighe

Elaborate downstream methods are required to analyze large microarray data sets. When the end goal is to look for relationships between (or patterns within) different subgroups, or even just individual samples, large data sets must first be filtered using statistical thresholds in order to reduce their overall volume. For example, in anthropological microarray studies, such ‘dimension reduction’ techniques are essential to elucidate any links between polymorphisms and phenotypes for given populations. In such large data sets, a subset can first be taken to represent the whole, much as polling results taken during elections are used to infer the opinions of the population at large. But what is the best and easiest method of capturing a subset of variation that can represent the overall portrait of variation in a data set? In this article, principal components analysis (PCA) is discussed in detail, including its history, the mathematics behind the process, and the ways in which it can be applied to modern large-scale biological data sets. New methods of analysis using PCA are also suggested, with tentative results outlined.
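The filter-then-reduce workflow the article walks through can be sketched as follows; the matrix dimensions, the variance filter, and the probe counts are assumptions for illustration.

```python
# Hedged sketch: variance-filter a large expression matrix, then project
# samples onto principal components. Dimensions are placeholders.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
expression = rng.normal(size=(60, 20000))   # 60 samples x 20k probes (synthetic)

# Statistical-threshold step: keep only the most variable probes.
top = np.argsort(expression.var(axis=0))[-2000:]
filtered = expression[:, top]

scores = PCA(n_components=2).fit_transform(filtered)
# The two score columns can now be plotted to look for sample subgroups.
print(scores[:5])
```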


Author(s): David Duran-Rodas, Emmanouil Chaniotakis, Constantinos Antoniou

Identification of the factors influencing ridership is necessary for policy-making, as well as when examining transferability and aspects of performance and reliability. In this work, a data-driven method is formulated to correlate arrivals and departures of station-based bike sharing systems with built-environment factors in multiple cities. Ridership data from stations in multiple cities are pooled in one data set regardless of their geographic boundaries. The method bundles the collection, analysis, and processing of data, as well as model estimation using statistical and machine learning techniques. The method was applied on a national level in six cities in Germany, and on an international level in three cities in Europe and North America. The results suggest that the model’s performance depended not on clustering cities by size but on the relative daily distribution of rentals. Some statistically significant factors varied temporally (e.g., nightclubs were significant during the night). The most influential variables were related to city population, distance to the city center, leisure-related establishments, and transport-related infrastructure. This data-driven method can serve as a decision-support tool for implementing or expanding bike sharing systems.
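A hedged sketch of the kind of model estimation described, pooling station records and regressing arrivals on built-environment covariates with a tree ensemble; all feature names and data are hypothetical.

```python
# Hedged sketch: regress station arrivals on built-environment factors.
# Features and data are invented placeholders, not the study's variables.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
stations = pd.DataFrame({
    "dist_to_center_km": rng.uniform(0, 10, 300),
    "population_density": rng.uniform(1e3, 2e4, 300),
    "leisure_pois": rng.poisson(5, 300),
    "transit_stops": rng.poisson(3, 300),
})
arrivals = rng.poisson(20, 300)   # daily arrivals per station (synthetic)

model = RandomForestRegressor(random_state=0)
print("CV R^2:", cross_val_score(model, stations, arrivals, cv=5).mean())
print(dict(zip(stations.columns,
               model.fit(stations, arrivals).feature_importances_)))
```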


2002, Vol 45 (4-5), pp. 227-235
Author(s): J. Lennox, C. Rosen

Fault detection and isolation (FDI) are important steps in the monitoring and supervision of industrial processes. Biological wastewater treatment (WWT) plants are difficult to model, and hence to monitor, because of the complexity of the biological reactions and because plant influent and disturbances are highly variable and/or unmeasured. Multivariate statistical models have been developed for a wide variety of situations over the past few decades, proving successful in many applications. In this paper we develop a new monitoring algorithm based on Principal Components Analysis (PCA). It can be seen equivalently as making Multiscale PCA (MSPCA) adaptive, or as a multiscale decomposition of adaptive PCA. Adaptive Multiscale PCA (AdMSPCA) exploits the changing multivariate relationships between variables at different time-scales. Adaptation of scale PCA models over time permits them to follow the evolution of the process, inputs or disturbances. Performance of AdMSPCA and adaptive PCA on a real WWT data set is compared and contrasted. The most significant difference observed was the ability of AdMSPCA to adapt to a much wider range of changes. This was mainly due to the flexibility afforded by allowing each scale model to adapt whenever it did not signal an abnormal event at that scale. Relative detection speeds were examined only summarily, but seemed to depend on the characteristics of the faults/disturbances. The results of the algorithms were similar for sudden changes, but AdMSPCA appeared more sensitive to slower changes.
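A simplified, non-adaptive sketch of the underlying PCA monitoring idea is below: fit PCA on normal operating data, then flag samples whose Hotelling T² or squared prediction error (SPE) exceeds an empirical limit. AdMSPCA adds the wavelet (multiscale) decomposition and model adaptation on top of this; the data and thresholds here are synthetic assumptions.

```python
# Hedged sketch: basic PCA process monitoring with Hotelling T^2 and SPE.
# Non-adaptive and single-scale, unlike AdMSPCA; data are synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
normal = rng.normal(size=(500, 10))        # normal-operation training data
scaler = StandardScaler().fit(normal)
pca = PCA(n_components=3).fit(scaler.transform(normal))

def monitor(x):
    z = scaler.transform(x)
    t = pca.transform(z)
    t2 = np.sum(t**2 / pca.explained_variance_, axis=1)      # Hotelling T^2
    spe = np.sum((z - pca.inverse_transform(t))**2, axis=1)  # residual-space error
    return t2, spe

t2, spe = monitor(normal)
t2_lim, spe_lim = np.percentile(t2, 99), np.percentile(spe, 99)  # empirical limits

fault = normal[:5] + 4.0                   # a sudden shift in all variables
t2_f, spe_f = monitor(fault)
print((t2_f > t2_lim) | (spe_f > spe_lim)) # flagged samples
```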

