Variable Selection and Parameter Tuning for BART Modeling in the Fragile Families Challenge

2019 ◽  
Vol 5 ◽  
pp. 237802311982588 ◽  
Author(s):  
Nicole Bohme Carnegie ◽  
James Wu

Our goal for the Fragile Families Challenge was to develop a hands-off approach that could be applied in many settings to identify relationships that theory-based models might miss. Data processing was our first and most time-consuming task, particularly handling missing values. Our second task was to reduce the number of variables for modeling, and we compared several techniques for variable selection: the least absolute shrinkage and selection operator (lasso), regression with a horseshoe prior, Bayesian generalized linear models, and Bayesian additive regression trees (BART). We found minimal differences in final performance based on the choice of variable selection method. We proceeded with BART for modeling because it requires minimal assumptions, permits great flexibility in fitting response surfaces, and had served us well in previous black-box modeling competitions. In addition, BART allows for probabilistic statements about the predictions and other inferences, which is an advantage over most machine learning algorithms. A drawback to BART, however, is that it is often difficult to identify or characterize individual predictors that have strong influences on the outcome variable.
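As a concrete illustration of one of the screening methods compared above, the following Python sketch performs lasso-based variable selection with scikit-learn; the simulated data, dimensions, and settings are hypothetical stand-ins, not the Challenge data.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))                      # 500 cases, 200 candidate predictors
y = 2.0 * X[:, 0] - X[:, 5] + rng.normal(size=500)   # only two true signals

# Standardize so the L1 penalty treats predictors comparably.
X_std = StandardScaler().fit_transform(X)

# Cross-validated lasso; predictors with nonzero coefficients are retained.
lasso = LassoCV(cv=5).fit(X_std, y)
selected = np.flatnonzero(lasso.coef_)
print(f"kept {selected.size} of {X.shape[1]} predictors:", selected)
```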

2020 ◽  
Vol 38 (2) ◽  
pp. 311-327
Author(s):  
Luis Lizasoain Hernández

The aim of this paper is to present the statistical criteria and models used in a school effectiveness study carried out in the Basque Country Autonomous Community, using as outcome variables the mathematics, Spanish language, and Basque language scores from the Diagnostic Assessments administered over five years. Four school effectiveness criteria are defined: extreme scores, extreme residuals, score growth, and residual growth. Multilevel regression techniques were applied using hierarchical linear models (HLM). The results permit the selection of both high- and low-effectiveness schools based on four distinct and complementary approaches to school effectiveness (or ineffectiveness).
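A minimal sketch of the "extreme residuals" criterion, assuming a student-level file with hypothetical column names: fit a two-level hierarchical linear model with statsmodels and rank schools by their estimated random intercepts.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("diagnosis_scores.csv")   # hypothetical student-level file

# Two-level model: students (level 1) nested in schools (level 2), with a
# random intercept per school and hypothetical level-1 covariates.
model = smf.mixedlm("math_score ~ prior_score + ses", df, groups=df["school_id"])
fit = model.fit()

# Empirical Bayes estimates of the school random intercepts: large positive
# values flag candidate high-effectiveness schools, large negative values
# candidate low-effectiveness ones.
school_effects = pd.Series(
    {school: eff.iloc[0] for school, eff in fit.random_effects.items()}
).sort_values()
print(school_effects.head())   # lowest school-level residuals
print(school_effects.tail())   # highest school-level residuals
```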


2012 ◽  
Vol 55 (2) ◽  
pp. 327-347 ◽  
Author(s):  
Dengke Xu ◽  
Zhongzhan Zhang ◽  
Liucang Wu

2020 ◽  
Vol 34 (04) ◽  
pp. 3545-3552
Author(s):  
Yiding Chen ◽  
Xiaojin Zhu

We describe an optimal adversarial attack formulation against autoregressive time series forecast using Linear Quadratic Regulator (LQR). In this threat model, the environment evolves according to a dynamical system; an autoregressive model observes the current environment state and predicts its future values; an attacker has the ability to modify the environment state in order to manipulate future autoregressive forecasts. The attacker's goal is to force autoregressive forecasts into tracking a target trajectory while minimizing its attack expenditure. In the white-box setting where the attacker knows the environment and forecast models, we present the optimal attack using LQR for linear models, and Model Predictive Control (MPC) for nonlinear models. In the black-box setting, we combine system identification and MPC. Experiments demonstrate the effectiveness of our attacks.
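The white-box linear case can be sketched concretely. For a scalar AR(1) forecaster, forcing the forecasts onto a target trajectory under a quadratic attack cost is a linear-quadratic tracking problem; rather than coding the Riccati recursion that LQR uses, the sketch below solves the equivalent convex quadratic program with cvxpy. The dynamics, horizon, and cost weights are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
import cvxpy as cp

a = 0.9                 # AR(1) coefficient, known to the attacker (white box)
T = 50                  # attack horizon
x0 = 0.0                # initial environment state
target = np.ones(T)     # trajectory the attacker wants the forecasts to track
rho = 0.1               # weight on attack expenditure

x = cp.Variable(T + 1)  # environment states
u = cp.Variable(T)      # attacker's state modifications

# Simplified dynamics: the attack enters additively as a control input.
constraints = [x[0] == x0]
constraints += [x[t + 1] == a * x[t] + u[t] for t in range(T)]

forecast = a * x[:T]    # the forecaster's one-step-ahead predictions
cost = cp.sum_squares(forecast - target) + rho * cp.sum_squares(u)
cp.Problem(cp.Minimize(cost), constraints).solve()

print("total attack energy:", float(np.sum(u.value ** 2)))
```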


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 264-265
Author(s):  
Duy Ngoc Do ◽  
Guoyu Hu ◽  
Younes Miar

Abstract American mink (Neovison vison) is the major source of fur for the fur industry worldwide, and Aleutian disease (AD) causes severe financial losses to mink producers. Different methods have been used to diagnose AD in mink, but a combination of several methods may be the most appropriate approach for selecting AD-resilient mink. The iodine agglutination test (IAT) and counterimmunoelectrophoresis (CIEP) are commonly employed in test-and-remove strategies, while enzyme-linked immunosorbent assay (ELISA) and packed-cell volume (PCV) methods are complementary. However, using multiple methods is expensive, which hinders the correct use of AD tests in selection. This research presents an assessment of AD classification based on machine learning algorithms. Aleutian disease was tested in 1,830 individuals using these tests on an AD-positive mink farm (Canadian Centre for Fur Animal Research, NS, Canada). The accuracy of classification for CIEP was evaluated from sex information and the IAT, ELISA, and PCV test results, implemented in seven machine learning classification algorithms (Random Forest, Artificial Neural Networks, C50Tree, Naive Bayes, Generalized Linear Models, Boost, and Linear Discriminant Analysis) using the caret package in R. The accuracy of prediction varied among the methods. Overall, Random Forest was the best-performing algorithm for the current dataset, with an accuracy of 0.89 in the training data and 0.94 in the testing data. Our work demonstrates the utility and relative ease of using machine learning algorithms to assess the CIEP information, and consequently to reduce the cost of AD testing. However, further work requires the inclusion of production and reproduction information in the models and an extension of phenotypic collection to increase the accuracy of current methods.
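The study used the caret package in R; as a rough Python analogue of the best-performing step, the sketch below fits a random forest to predict CIEP status from sex and the other test results. The data file and column names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("mink_ad_tests.csv")      # hypothetical dataset
X = pd.get_dummies(df[["sex", "iat", "elisa", "pcv"]], drop_first=True)
y = df["ciep"]                             # CIEP positive/negative label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, clf.predict(X_test)))
```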


2021 ◽  
pp. 1-29
Author(s):  
Fikrewold H. Bitew ◽  
Corey S. Sparks ◽  
Samuel H. Nyarko

Abstract Objective: Child undernutrition is a global public health problem with serious implications. In this study, we estimate predictive algorithms for the determinants of childhood stunting using various machine learning (ML) algorithms. Design: This study draws on data from the Ethiopian Demographic and Health Survey of 2016. Five machine learning algorithms, namely eXtreme gradient boosting (xgbTree), k-nearest neighbors (K-NN), random forest (RF), neural network (NNet), and generalized linear models (GLM), were considered to predict the socio-demographic risk factors for undernutrition in Ethiopia. Setting: Households in Ethiopia. Participants: A total of 9,471 children below five years of age. Results: The descriptive results show substantial regional variation in child stunting, wasting, and underweight in Ethiopia. Among the five ML algorithms, the xgbTree algorithm showed better predictive ability than the generalized linear model (GLM) algorithm. The best-predicting algorithm (xgbTree) identified diverse important predictors of undernutrition across the three outcomes, including time to water source, anemia history, child age greater than 30 months, small birth size, and maternal underweight, among others. Conclusions: The xgbTree algorithm was a reasonably superior ML algorithm for predicting childhood undernutrition in Ethiopia compared to the other ML algorithms considered in this study. The findings support improvement in access to water supply, food security, and fertility regulation, among others, in the quest to considerably improve childhood nutrition in Ethiopia.
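As a hedged sketch of the xgbTree-style workflow (gradient-boosted trees plus a predictor-importance ranking), the Python code below uses the xgboost package; the data file and feature names are hypothetical stand-ins for the DHS variables.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("ethiopia_dhs_2016.csv")  # hypothetical extract
features = ["time_to_water", "anemia", "child_age_months",
            "birth_size", "maternal_bmi"]
X, y = df[features], df["stunted"]         # 1 = stunted, 0 = not stunted

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# Importance ranking, analogous to the predictor ranking reported above.
for name, imp in sorted(zip(features, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```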


2021 ◽  
pp. 395-414
Author(s):  
Carlos M. Carvalho ◽  
Edward I. George ◽  
P. Richard Hahn ◽  
Robert E. McCulloch

Author(s):  
Rithesh Pakkala P. ◽  
Prakhyath Rai ◽  
Shamantha Rai Bellipady

This chapter provides insight into pattern recognition by illustrating various approaches and frameworks that aid prognostic reasoning facilitated by feature selection and feature extraction. The chapter focuses on analyzing syntactic and statistical approaches to pattern recognition. Typically, a large set of features has an impact on the performance of the predictive model; hence, there is a need to eliminate redundant and noisy data before developing any predictive model. The selection of features can be performed independently of any machine learning algorithm. The content-rich information obtained after the elimination of noisy patterns, such as stop words and missing values, is then used for further prediction. The refinement and extraction of relevant features yields performance enhancements in future prediction and analysis.
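A minimal sketch of such model-independent (filter-style) feature selection in Python: remove near-constant columns, then keep the features with the highest mutual information with the target. The thresholds and k are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, VarianceThreshold,
                                       mutual_info_classif)

# Synthetic stand-in: 40 candidate features, only 5 informative.
X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)

# Step 1: drop redundant, near-constant columns.
X_reduced = VarianceThreshold(threshold=0.01).fit_transform(X)

# Step 2: keep the 10 features with the highest mutual information with the
# target -- a criterion that needs no predictive model at all.
selector = SelectKBest(mutual_info_classif, k=10).fit(X_reduced, y)
X_selected = selector.transform(X_reduced)
print("kept feature indices:", np.flatnonzero(selector.get_support()))
```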


2020 ◽  
pp. 004912412092621
Author(s):  
C. Ben Gibson ◽  
Jeannette Sutton ◽  
Sarah K. Vos ◽  
Carter T. Butts

Microblogging sites have become important data sources for studying network dynamics and information transmission. Both areas of study, however, require accurate counts of indegree, or follower counts; unfortunately, collection of complete time series on follower counts can be limited by application programming interface constraints, system failures, or temporal constraints. In addition, there is almost always a time difference between the point at which follower counts are queried and the time a user posts a tweet. Here, we consider the use of three classes of simple, easily implemented methods for follower imputation: polynomial functions, splines, and generalized linear models. We evaluate the performance of each method via a case study of accounts from 236 health organizations during the 2014 Ebola outbreak. For accurate interpolation and extrapolation, we find that negative binomial regression, modeled separately for each account, using time as an interval variable, accurately recovers missing values while retaining narrow prediction intervals.
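A minimal sketch of the per-account negative binomial approach with statsmodels: regress one account's follower counts on time, then predict at the unobserved time points. The observations are hypothetical, and the dispersion parameter is held fixed here for simplicity.

```python
import pandas as pd
import statsmodels.api as sm

# Observed (day, follower count) pairs for a single account, with gaps.
obs = pd.DataFrame({"day": [0, 3, 4, 9, 14],
                    "followers": [1200, 1260, 1275, 1410, 1530]})

# Negative binomial GLM of follower count on time.
X = sm.add_constant(obs["day"])
fit = sm.GLM(obs["followers"], X,
             family=sm.families.NegativeBinomial()).fit()

# Interpolate (days inside the observed range) and extrapolate (day 15).
missing = pd.DataFrame({"day": [1, 2, 5, 6, 7, 8, 10, 11, 12, 13, 15]})
pred = fit.predict(sm.add_constant(missing["day"]))
print(pd.DataFrame({"day": missing["day"], "imputed": pred.round()}))
```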


Forests ◽  
2019 ◽  
Vol 10 (12) ◽  
pp. 1073 ◽  
Author(s):  
Li ◽  
Li ◽  
Li ◽  
Liu

Forest biomass is a major store of carbon and plays a crucial role in the regional and global carbon cycle. Accurate forest biomass assessment is important for monitoring and mapping the status of and changes in forests. However, while remote sensing-based forest biomass estimation is in general well developed and extensively used, improving the accuracy of biomass estimation remains challenging. In this paper, we used China's National Forest Continuous Inventory data and Landsat 8 Operational Land Imager data in combination with three algorithms, linear regression (LR), random forest (RF), and extreme gradient boosting (XGBoost), to establish biomass estimation models by forest type. In the modeling process, two methods of variable selection, stepwise regression and a variable importance-based method, were used to select optimal variable subsets for LR and for the machine learning algorithms (RF and XGBoost), respectively. The accuracy of the models improved significantly, and the following conclusions were drawn: (1) Variable selection is very important for improving the performance of models, especially for machine learning algorithms, and its influence on XGBoost is significantly greater than on RF. (2) Machine learning algorithms have advantages in aboveground biomass (AGB) estimation, and the XGBoost and RF models significantly improved estimation accuracy compared with the LR models. Although the problems of overestimation and underestimation were not fully eliminated, the XGBoost algorithm reduced them to a certain extent. (3) Modeling AGB by forest type is an advantageous approach for improving performance at the lower and higher ends of the AGB range. Some of these conclusions may differ for other study areas. The methods used in this paper provide an optional and useful approach for improving the accuracy of AGB estimation from remote sensing data, and the resulting AGB estimates provide a reference for monitoring the forest ecosystem of the study area.
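A minimal sketch of the variable importance-based selection used for the machine learning models: fit a random forest on all candidate predictors, keep the top-ranked ones, and refit. The data file and column names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("plots_landsat8.csv")     # hypothetical inventory + band metrics
candidates = [c for c in df.columns if c != "agb"]
X, y = df[candidates], df["agb"]           # aboveground biomass (t/ha)

# Full model on every candidate predictor, then rank by importance.
rf_full = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
ranked = sorted(zip(candidates, rf_full.feature_importances_),
                key=lambda t: -t[1])

# Keep the ten most important predictors and refit a reduced model.
top10 = [name for name, _ in ranked[:10]]
rf_reduced = RandomForestRegressor(n_estimators=500, random_state=0)
rf_reduced.fit(X[top10], y)
print("selected predictors:", top10)
```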

