Quantile regression forests to identify determinants of stroke: implications for neighborhoods with high prevalence

2020
Author(s):
Liangyuan Hu
Jiayi Ji
Yan Li
Bian Liu

Abstract Background: Stroke exerts a massive burden on U.S. health and the economy. Place-based evidence is increasingly recognized as a critical part of stroke management, but identifying the key determinants of stroke and the underlying effect mechanisms at the neighborhood level is a topic that has been treated sparingly in the literature. We aim to fill these research gaps. We develop and apply analytical approaches to address two challenges. First, domain expertise on drivers of neighborhood-level stroke outcomes is limited. Second, commonly used linear regression methods may provide incomplete and biased conclusions. Methods: We created a new neighborhood health data set at the census tract level by pooling information from multiple sources. We developed and applied a machine learning-based quantile regression method to uncover crucial neighborhood characteristics for stroke outcomes among vulnerable neighborhoods burdened with a high prevalence of stroke. Results: Neighborhoods with a larger share of non-Hispanic Black residents, older adults, or people with insufficient sleep tended to have a higher prevalence of stroke, whereas neighborhoods with higher socioeconomic status in terms of income and education had a lower prevalence. The effects of five major determinants varied geographically and were significantly stronger among neighborhoods with a high prevalence of stroke. Conclusions: Highly flexible machine learning identifies true drivers of neighborhood cardiovascular health outcomes from wide-ranging information in an agnostic and reproducible way. The identified major determinants and effect mechanisms can provide important avenues for prioritizing and allocating resources to develop optimal community-level interventions for stroke prevention.
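The quantile regression forest idea can be sketched in a few lines. Below is a minimal illustration, assuming scikit-learn and entirely synthetic data; it approximates conditional quantiles by pooling per-tree predictions, whereas a true quantile regression forest pools the training targets stored in each leaf. All variable names are hypothetical, not the paper's actual features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                   # stand-in neighborhood features
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1000)   # stand-in stroke prevalence

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=500, min_samples_leaf=10, random_state=0)
forest.fit(X_tr, y_tr)

# Approximate conditional quantiles by pooling predictions across trees;
# a true quantile regression forest instead pools training targets per leaf.
per_tree = np.stack([tree.predict(X_te) for tree in forest.estimators_])
q50 = np.quantile(per_tree, 0.50, axis=0)            # typical neighborhoods
q90 = np.quantile(per_tree, 0.90, axis=0)            # high-prevalence tail
print("median vs 90th-percentile predictions:", q50[:3].round(2), q90[:3].round(2))
```

Comparing the upper-quantile predictions against the median is what lets effects be examined specifically in the high-prevalence tail rather than on average.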

Author(s):  
Liangyuan Hu
Bian Liu
Jiayi Ji
Yan Li

Background Stroke is a major cardiovascular disease that causes significant health and economic burden in the United States. Neighborhood community‐based interventions have been shown to be both effective and cost‐effective in preventing cardiovascular disease. There is a dearth of robust studies identifying the key determinants of cardiovascular disease and the underlying effect mechanisms at the neighborhood level. We aim to contribute to the evidence base for neighborhood cardiovascular health research. Methods and Results We created a new neighborhood health data set at the census tract level by integrating 4 types of potential predictors, including unhealthy behaviors, prevention measures, sociodemographic factors, and environmental measures from multiple data sources. We used 4 tree‐based machine learning techniques to identify the most critical neighborhood‐level factors in predicting the neighborhood‐level prevalence of stroke, and compared their predictive performance for variable selection. We further quantified the effects of the identified determinants on stroke prevalence using a Bayesian linear regression model. Of the 5 most important predictors identified by our method, higher prevalence of low physical activity, larger share of older adults, higher percentage of non‐Hispanic Black people, and higher ozone levels were associated with higher prevalence of stroke at the neighborhood level. Higher median household income was linked to lower prevalence. The most important interaction term showed an exacerbated adverse effect of aging and low physical activity on the neighborhood‐level prevalence of stroke. Conclusions Tree‐based machine learning provides insights into underlying drivers of neighborhood cardiovascular health by discovering the most important determinants from a wide range of factors in an agnostic, data‐driven, and reproducible way. The identified major determinants and the interactive mechanism can be used to prioritize and allocate resources to optimize community‐level interventions for stroke prevention.
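As a rough illustration of the two-step workflow this abstract describes (tree-based screening followed by a Bayesian linear model), the sketch below uses scikit-learn's RandomForestRegressor for importance ranking and BayesianRidge as a simple stand-in for the paper's Bayesian regression; the data and feature names are simulated placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(1)
names = ["low_phys_activity", "pct_older", "pct_nh_black", "ozone", "median_income"]
X = rng.normal(size=(800, 5))
y = X @ np.array([0.6, 0.5, 0.4, 0.3, -0.5]) + rng.normal(scale=0.4, size=800)

# Step 1: rank candidate predictors with a tree ensemble.
rf = RandomForestRegressor(n_estimators=300, random_state=1).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]
print("ranked predictors:", [names[i] for i in order])

# Step 2: quantify effects of the retained predictors with a Bayesian
# linear regression (BayesianRidge as a simple stand-in).
blr = BayesianRidge().fit(X[:, order], y)
print("posterior coefficient means:",
      dict(zip([names[i] for i in order], blr.coef_.round(2))))
```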


Author(s):  
Lei Kang
Mark Hansen

Reducing fuel consumption is a unifying goal across the aviation industry. One fuel-saving opportunity for airlines is reducing contingency fuel loading by dispatchers. Many airlines’ flight planning systems (FPSs) provide recommended contingency fuel for dispatchers in the form of statistical contingency fuel (SCF). However, because of limitations of the current SCF estimation procedure, the application of SCF is limited. In this study, we propose to use quantile regression–based machine learning methods to account for fuel burn uncertainties and estimate more reliable SCF values. Utilizing a large fuel burn data set from a major U.S.-based airline, we find that the proposed quantile regression method outperforms the airline’s FPS. The benefit of applying the improved SCF models is estimated to be in the range $19 million–$65 million in fuel expense savings as well as 132 million–451 million kilograms of carbon dioxide emission reductions per year, with the lower savings being realized even while maintaining the current, extremely low risk of tapping the reserve fuel. The proposed models can also be used to predict benefits from reduced fuel loading enabled by increasing system predictability, for example, with improved air traffic management.
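A minimal sketch of quantile-regression-based SCF estimation: train a model on a high conditional quantile of fuel burn, so that loaded contingency fuel covers most realized outcomes. The example below uses scikit-learn's gradient boosting with a quantile loss; the 95th-percentile target, the features, and the data are illustrative assumptions, not the airline's specification.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
# Stand-in dispatch features: planned distance, forecast winds, congestion, season.
X = rng.normal(size=(2000, 4))
fuel_burn = 50 + 5 * X[:, 0] + np.abs(rng.normal(scale=3, size=2000))

# Fit the 95th conditional percentile of fuel burn; loading this much
# contingency fuel would cover roughly 95% of realized outcomes.
scf_model = GradientBoostingRegressor(loss="quantile", alpha=0.95, random_state=2)
scf_model.fit(X, fuel_burn)
print("estimated SCF basis (95th percentile):", scf_model.predict(X[:5]).round(1))
```

Raising or lowering the quantile level directly trades fuel-loading savings against the risk of tapping reserve fuel.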


2021
Author(s):
Shanmugha Sundaram G A
Harun Surej I
Karthic S
Gandhiraj R
Binoy B N
...  

In complex applications, a signal propagating through free space is subject to multipath interference due to scatter from line-of-sight and non-line-of-sight objects in the propagation channel. The aim is to identify scatter centers in the propagation channel and characterize them by their scattering behavior, as interpreted through machine learning. Data-driven models are employed in place of traditional analytical approaches to profile the scatter centers as either absorbing or reflecting types, based on the manner in which the signals are affected. A typical multistatic detection scenario is reconstructed under controlled laboratory conditions to create spatially independent data sets while operating at C-band frequencies. The outcomes of this study are then applied to identify scatter centers based on the distinct signatures they register in the experimental data set. As a corollary, antenna pattern estimation can now be performed without an anechoic chamber setup, which is both time- and cost-intensive. This is of particular relevance to mid-band 5G-NR cellular communication systems, which need to optimize distributed antenna location attributes on time- and cost-constrained scales before attempting a large-scale deployment.
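A hedged sketch of the classification step: given received-signal features for each scatter center, a supervised model separates absorbing from reflecting types. The features, labels, and data below are synthetic stand-ins for the C-band laboratory measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 600
labels = rng.integers(0, 2, size=n)              # 0 = absorbing, 1 = reflecting
rcs = rng.normal(loc=2.0 * labels, scale=1.0)    # reflectors return more power
phase = rng.normal(loc=0.5 * labels, scale=0.8)  # and shift phase differently
X = np.column_stack([rcs, phase])

clf = RandomForestClassifier(n_estimators=200, random_state=3)
print("cross-validated accuracy:", round(cross_val_score(clf, X, labels, cv=5).mean(), 3))
```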


2018
Vol 15 (11)
pp. 847-856
Author(s):
Kelley Pettee Gabriel
Adriana Pérez
David R. Jacobs
Joowon Lee
Harold W. Kohl
...  

Background: Single-method assessment of physical activity (PA) has limitations. The utility and cross-validation of a composite PA score that includes reported and accelerometer-derived PA data have not been evaluated. Methods: Participants attending the Year 20 exam were randomly assigned to the derivation (two-thirds) or validation (one-third) data set. Principal components analysis was used to create a composite score reflecting Year 20 combined reported and accelerometer PA data. Generalized linear regression models were constructed to estimate the variability explained (R2) by each PA assessment strategy (self-report only, accelerometer only, composite score, or self-report plus accelerometer) with cardiovascular health indicators. This process was repeated in the validation set to determine cross-validation. Results: At Year 20, 3549 participants (45.2 [3.6] y, 56.7% female, and 53.5% black) attended the clinic exam and 2540 agreed to wear the accelerometer. Higher R2 values were obtained when combined assessment strategies were used; however, the approach yielding the highest R2 value varied by cardiovascular health outcome. Findings from the cross-validation also supported internal study validity. Conclusions: Findings support continued refinement of methodological approaches for combining data from multiple sources to create a more robust estimate that reflects the complexities of PA behavior.
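The composite-score construction can be illustrated compactly: standardize the self-report and accelerometer measures, extract the first principal component, and compare the variance explained (R2) against a health indicator. The sketch below uses simulated variables, not the study's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
true_pa = rng.normal(size=500)                            # latent physical activity
self_report = true_pa + rng.normal(scale=1.0, size=500)   # noisier measure
accelerometer = true_pa + rng.normal(scale=0.5, size=500)
health = -0.4 * true_pa + rng.normal(scale=0.6, size=500) # stand-in CV indicator

# Composite score: first principal component of the standardized measures.
Z = StandardScaler().fit_transform(np.column_stack([self_report, accelerometer]))
composite = PCA(n_components=1).fit_transform(Z).ravel()

for name, x in [("self-report", self_report),
                ("accelerometer", accelerometer),
                ("composite", composite)]:
    r2 = LinearRegression().fit(x.reshape(-1, 1), health).score(x.reshape(-1, 1), health)
    print(f"R2 using {name}: {r2:.3f}")
```

In this setup the composite typically explains more variance than either single measure, mirroring the pattern the abstract reports.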


2020
Vol 20 (1)
Author(s):
Jiayi Ji
Liangyuan Hu
Bian Liu
Yan Li

Abstract Background Stroke is a chronic cardiovascular disease that places major stress on U.S. health and the economy. The prevalence of stroke exhibits a strong geographical pattern at the state level, where a cluster of southern states with a substantially higher prevalence of stroke has been called the nation's stroke belt. Despite this recognition, the extent to which key neighborhood characteristics affect stroke prevalence remains to be clarified. Methods We generated a new neighborhood health data set at the census tract level covering nearly 27,000 tracts by pooling information from multiple data sources, including the CDC's 500 Cities Project 2017 data release. We employed a two-stage modeling approach to understand how key neighborhood-level risk factors affect neighborhood-level stroke prevalence in each state of the US. The first stage used a state-of-the-art Bayesian machine learning algorithm to identify key neighborhood-level determinants. The second stage applied a Bayesian multilevel modeling approach to describe how these key determinants explain the variability in stroke prevalence in each state. Results Neighborhoods with a larger proportion of older adults and non-Hispanic Blacks had a higher prevalence of stroke. Higher median household income was linked to lower stroke prevalence. Ozone was positively associated with stroke prevalence in 10 states and negatively associated in five states. There was substantial variation in both the direction and magnitude of the associations between these four key factors and stroke prevalence across the states. Conclusions When used in a principled variable selection framework, high-performance machine learning can identify key factors of neighborhood-level stroke prevalence from wide-ranging information in a data-driven way. The Bayesian multilevel modeling approach provides a detailed view of the impact of key factors across the states. The identified major factors and their effect mechanisms can potentially aid policy makers in developing area-based stroke prevention strategies.
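A rough sketch of the two-stage approach, with loudly labeled substitutions: the paper's Bayesian variable-selection algorithm is replaced here by permutation importance on a random forest, and the Bayesian multilevel model by separate BayesianRidge fits per state. Data, state groupings, and feature names are all simulated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(5)
n = 3000
names = ["pct_older", "pct_nh_black", "median_income", "ozone"]
state = rng.integers(0, 10, size=n)                      # 10 hypothetical states
X = rng.normal(size=(n, 4))
ozone_slope = rng.normal(loc=0.4, scale=0.3, size=10)    # state-varying ozone effect
y = (0.5 * X[:, 0] + 0.4 * X[:, 1] - 0.5 * X[:, 2]
     + ozone_slope[state] * X[:, 3] + rng.normal(scale=0.3, size=n))

# Stage 1: screen predictors (permutation importance as a stand-in).
rf = RandomForestRegressor(n_estimators=200, random_state=5).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=5, random_state=5)
print("importances:", dict(zip(names, imp.importances_mean.round(2))))

# Stage 2: state-specific effects of the retained predictors; the sign and
# size of the ozone coefficient can differ by state, as the paper reports.
for s in range(3):
    coef = BayesianRidge().fit(X[state == s], y[state == s]).coef_
    print(f"state {s}: ozone coefficient = {coef[3]:.2f}")
```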




2020
Author(s):
Marc Philipp Bahlke
Natnael Mogos
Jonny Proppe
Carmen Herrmann

Heisenberg exchange spin coupling between metal centers is essential for describing and understanding the electronic structure of many molecular catalysts, metalloenzymes, and molecular magnets with potential applications in information technology. We explore the machine-learnability of exchange spin coupling, which has not been studied yet. We employ Gaussian process regression since it can potentially deal with small training sets (as likely associated with the rather complex molecular structures required for exploring spin coupling) and since it provides uncertainty estimates (“error bars”) along with predicted values. We compare a range of descriptors and kernels for 257 small dicopper complexes and find that a simple descriptor based on chemical intuition, consisting only of copper-bridge angles and copper-copper distances, clearly outperforms several more sophisticated descriptors when it comes to extrapolating towards larger, experimentally relevant complexes. Exchange spin coupling is about as easy to learn as the polarizability, while learning dipole moments is much harder. The strength of the sophisticated descriptors lies in their ability to linearize structure-property relationships, to the point that simple linear ridge regression performs just as well as the kernel-based machine-learning model for our small dicopper data set. The superior extrapolation performance of the simple descriptor is unique to exchange spin coupling, reinforcing the crucial role of choosing a suitable descriptor and highlighting the interesting question of the role of chemical intuition versus systematic or automated feature selection for machine learning in chemistry and materials science.
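A minimal sketch of Gaussian process regression with uncertainty estimates on the simple two-feature descriptor the authors favor (a bridge angle and a metal-metal distance). The kernel choice, value ranges, and data below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(6)
angle = rng.uniform(80, 110, size=120)     # hypothetical copper-bridge angle (deg)
dist = rng.uniform(2.8, 3.6, size=120)     # hypothetical copper-copper distance (Å)
X = np.column_stack([angle, dist])
# Synthetic coupling constants standing in for computed J values.
J = -2.0 * (angle - 95) + 30.0 * (dist - 3.2) + rng.normal(scale=5, size=120)

# Anisotropic RBF kernel (one length scale per feature) plus a noise term.
kernel = RBF(length_scale=[5.0, 0.2]) + WhiteKernel()
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, J)
mean, std = gpr.predict(X[:3], return_std=True)   # predictions with "error bars"
print("predicted J with uncertainty:", list(zip(mean.round(1), std.round(1))))
```

The `return_std=True` call is what supplies the per-prediction uncertainty the abstract highlights as a benefit of Gaussian process regression.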


Author(s):  
Jun Pei
Zheng Zheng
Hyunji Kim
Lin Song
Sarah Walworth
...  

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF6) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions as GARF but has fixed peak heights. The accuracy comparison across RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.
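The ‘comparison’ concept can be sketched as learning on feature differences between pose pairs, so the model predicts which of two poses is closer to native rather than scoring poses in isolation. The sketch below assumes synthetic pair-potential features, not the actual GARF terms.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n_pairs = 1000
native = rng.normal(loc=1.0, size=(n_pairs, 6))   # stand-in pair-potential terms
decoy = rng.normal(loc=0.0, size=(n_pairs, 6))

# Each row is a (pose A - pose B) feature difference; the label records
# whether pose A is the native one, with both orderings for balance.
X = np.vstack([native - decoy, decoy - native])
y = np.array([1] * n_pairs + [0] * n_pairs)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)
rf = RandomForestClassifier(n_estimators=300, random_state=7).fit(X_tr, y_tr)
print("held-out pairwise accuracy:", round(rf.score(X_te, y_te), 3))
```

Framing the task pairwise also sidesteps class imbalance, since every native pose contributes one positive and one negative comparison.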


2019
Vol 21 (9)
pp. 662-669
Author(s):
Junnan Zhao
Lu Zhu
Weineng Zhou
Lingfeng Yin
Yuchen Wang
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade and is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict the Ki values of thrombin inhibitors from a large data set using machine learning methods. Because machine learning can find non-intuitive regularities in high-dimensional data sets, it can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected, and an efficient descriptor selection method was applied to find the appropriate descriptors. Four different methods, including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT), and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best among these methods, with R2=0.84 and MSE=0.55 for the training set and R2=0.83 and MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for the rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
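A compact sketch of the modeling and validation loop: fit an SVM regressor on selected descriptors, then run a y-randomization check, in which performance should collapse when training labels are shuffled. The descriptors and Ki values below are simulated placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 20))                        # stand-in selected descriptors
ki = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, ki, random_state=8)
model = make_pipeline(StandardScaler(), SVR(C=10.0))
print("test R2:", round(model.fit(X_tr, y_tr).score(X_te, y_te), 2))

# y-randomization: refit on shuffled labels; a sound model should fail here,
# confirming the original fit is not an artifact of chance correlation.
print("y-randomized R2:", round(model.fit(X_tr, rng.permutation(y_tr)).score(X_te, y_te), 2))
```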


Author(s):  
Ritu Khandelwal
Hemlata Goyal
Rajveer Singh Shekhawat

Introduction: Machine learning is an intelligent technology that works as a bridge between business and data science. With the involvement of data science, the business goal focuses on findings that provide valuable insights from available data. Bollywood, a multi-million-dollar industry, makes up a large part of Indian cinema. This paper attempts to predict whether an upcoming Bollywood movie will be a Blockbuster, Superhit, Hit, Average, or Flop by applying machine learning techniques for classification and prediction. The first step in building a classifier or prediction model is the learning stage, in which a training data set is used to train the model with a chosen technique or algorithm; the rules generated at this stage form a model that can predict future trends in different types of organizations. Methods: Classification and prediction techniques including Support Vector Machine (SVM), Random Forest, Decision Tree, Naïve Bayes, Logistic Regression, Adaboost, and KNN are applied in search of efficient and effective results. All these functionalities can be applied through GUI-based workflows organized into categories such as Data, Visualize, Model, and Evaluate. Result: The trained models generate rules from the training data set that are used to predict the success of upcoming movies. Conclusion: This paper focuses on a comparative analysis based on parameters such as accuracy and the confusion matrix to identify the best possible model for predicting movie success. Production houses can use these predictions to plan advertising campaigns and choose the best time to release a movie according to its predicted success rate, gaining higher benefits. Discussion: Data mining is the process of discovering patterns in large data sets, from which relationships are discovered that help solve business problems and predict forthcoming trends. These predictions can help production houses plan advertising campaigns and costs, making a movie more profitable.
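A minimal sketch of the comparative loop over the classifiers named above, scored by accuracy with a confusion matrix for inspection; the movie features and five-class labels are random placeholders, not the paper's data set.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(9)
X = rng.normal(size=(600, 8))        # stand-ins: budget, cast rating, screens, ...
y = rng.integers(0, 5, size=600)     # classes 0-4: Blockbuster .. Flop
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=9)

models = {"SVM": SVC(), "Random Forest": RandomForestClassifier(),
          "Decision Tree": DecisionTreeClassifier(), "Naive Bayes": GaussianNB(),
          "Logistic Regression": LogisticRegression(max_iter=1000),
          "Adaboost": AdaBoostClassifier(), "KNN": KNeighborsClassifier()}
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: accuracy = {accuracy_score(y_te, pred):.3f}")
print(confusion_matrix(y_te, pred))  # confusion matrix for the last model fitted
```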

