Uncovering Sociological Effect Heterogeneity Using Tree-Based Machine Learning

2021 ◽  
pp. 008117502199350
Author(s):  
Jennie E. Brand ◽  
Jiahui Xu ◽  
Bernard Koch ◽  
Pablo Geraldo

Individuals do not respond uniformly to treatments, such as events or interventions. Sociologists routinely partition samples into subgroups to explore how the effects of treatments vary by selected covariates, such as race and gender, on the basis of theoretical priors. Data-driven discoveries are also routine, yet the analyses by which sociologists typically pursue them are often problematic and seldom move us beyond our biases to explore new, meaningful subgroups. Emerging machine learning methods based on decision trees allow researchers to explore sources of variation that they may not have previously considered or envisaged. In this article, the authors use tree-based machine learning, that is, causal trees, to recursively partition the sample to uncover sources of effect heterogeneity. Assessing a central topic in social inequality, college effects on wages, the authors compare what is learned from covariate and propensity score–based partitioning approaches with recursive partitioning based on causal trees. Decision trees, although superseded by forests for estimation, can be used to uncover subpopulations responsive to treatments. Using observational data, the authors expand on the existing causal tree literature by applying leaf-specific effect estimation strategies to adjust for observed confounding, including inverse propensity weighting, nearest neighbor matching, and doubly robust causal forests. The authors also assess localized balance metrics and sensitivity analyses to address the possibility of differential imbalance and unobserved confounding. The authors encourage researchers to follow similar data exploration practices in their work on variation in sociological effects and offer a straightforward framework by which to do so.
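One of the leaf-specific adjustment strategies named in the abstract, inverse propensity weighting, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the outcomes, treatment indicators, and propensity scores below are hypothetical.

```python
# Sketch: inverse propensity weighting (IPW) within one leaf of a causal tree.
# Each unit has an outcome y, a treatment indicator d (1 = treated), and an
# estimated propensity score e (probability of treatment). All values are
# hypothetical illustration data.

def ipw_leaf_effect(units):
    """Horvitz-Thompson style IPW estimate of the average treatment
    effect among the units falling into a single leaf."""
    n = len(units)
    treated = sum(u["y"] * u["d"] / u["e"] for u in units) / n
    control = sum(u["y"] * (1 - u["d"]) / (1 - u["e"]) for u in units) / n
    return treated - control

leaf = [
    {"y": 21.0, "d": 1, "e": 0.5},
    {"y": 15.0, "d": 0, "e": 0.5},
    {"y": 25.0, "d": 1, "e": 0.5},
    {"y": 17.0, "d": 0, "e": 0.5},
]
effect = ipw_leaf_effect(leaf)  # estimated leaf-specific effect
```

With equal propensities, as here, the estimate reduces to a simple difference in means; unequal propensities would reweight units that were unlikely to receive the treatment status they got.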

2019 ◽  
Author(s):  
Jennie E. Brand ◽  
Jiahui Xu ◽  
Bernard Koch ◽  
Pablo Geraldo

Individuals do not respond uniformly to treatments, events, or interventions. Sociologists routinely partition samples into subgroups to explore how the effects of treatments vary by covariates like race, gender, and socioeconomic status. In so doing, analysts determine the key subpopulations based on theoretical priors. Data-driven discoveries are also routine, yet the analyses by which sociologists typically pursue them are problematic and seldom move us beyond our expectations and biases to explore new, meaningful subgroups. Emerging machine learning methods allow researchers to explore sources of variation that they may not have previously considered or envisaged. In this paper, we use causal trees to recursively partition the sample and uncover sources of treatment effect heterogeneity. We use honest estimation, splitting the sample into a training sample to grow the tree and an estimation sample to estimate leaf-specific effects. Assessing a central topic in the social inequality literature, college effects on wages, we compare what we learn from conventional approaches for exploring variation in effects with what we learn from causal trees. Given our use of observational data, we use leaf-specific matching and sensitivity analyses to address confounding and offer interpretations of effects based on observed and unobserved heterogeneity. We encourage researchers to follow similar practices in their work on variation in sociological effects.
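The honest-estimation idea described here, growing the tree on one half of the sample and estimating leaf effects on the other, can be sketched on simulated data. This is a stripped-down stand-in for the actual causal-tree splitting criterion: a single split, a crude heterogeneity score, and hypothetical data in which the true effect jumps at x > 5.

```python
import random

# Sketch of "honest" estimation: half of the sample (training) chooses the
# partition, the other half (estimation) estimates leaf-specific effects.
# Each record is (x, d, y): covariate, treatment indicator, outcome.
random.seed(0)
sample = [(x, d, 0.5 * x + 3.0 * d * (x > 5) + random.gauss(0, 0.5))
          for x in range(11) for d in (0, 1)]

# Honest split of the sample itself (here by parity of x, so every leaf
# in both halves contains treated and control units).
train = [r for r in sample if r[0] % 2 == 0]
estimate = [r for r in sample if r[0] % 2 == 1]

def diff_in_means(records):
    """Treated-minus-control difference in mean outcomes."""
    t = [y for _, d, y in records if d == 1]
    c = [y for _, d, y in records if d == 0]
    return sum(t) / len(t) - sum(c) / len(c)

def leaf_effects(records, cut):
    left = [r for r in records if r[0] <= cut]
    right = [r for r in records if r[0] > cut]
    return diff_in_means(left), diff_in_means(right)

# Grow the (one-split) tree on the training half: pick the cut whose two
# leaves differ most in estimated effect, a crude stand-in for the
# causal-tree splitting criterion.
best_cut = max(range(2, 9),
               key=lambda c: abs(leaf_effects(train, c)[0]
                                 - leaf_effects(train, c)[1]))

# Honest step: re-estimate the leaf effects on the held-out half.
honest_left, honest_right = leaf_effects(estimate, best_cut)
```

Because the estimation half played no role in choosing the split, the leaf-specific estimates are not contaminated by the search over cut points, which is the point of honesty.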


Author(s):  
Charles X. Ling ◽  
John J. Parry ◽  
Handong Wang

Nearest Neighbour (NN) learning algorithms utilize a distance function to determine the classification of testing examples. The attribute weights in the distance function should be set appropriately. We study situations where a simple approach of setting attribute weights using decision trees does not work well, and design three improvements. We test these new methods thoroughly using artificially generated datasets and datasets from the machine learning repository.
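The basic setup the abstract describes, a nearest-neighbour classifier whose distance function weights each attribute (for example, by how informative a decision tree found that attribute), can be sketched as follows. The data and weights are hypothetical.

```python
import math
from collections import Counter

# Sketch: nearest-neighbour classification with per-attribute weights in
# the distance function. Weights such as these might come from a decision
# tree's ranking of attribute informativeness; here they are hypothetical.

def weighted_distance(a, b, weights):
    """Euclidean distance with per-attribute weights."""
    return math.sqrt(sum(w * (x - y) ** 2
                         for x, y, w in zip(a, b, weights)))

def knn_predict(query, data, labels, weights, k=3):
    """Majority vote among the k nearest training examples."""
    ranked = sorted(range(len(data)),
                    key=lambda i: weighted_distance(query, data[i], weights))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Attribute 0 is informative, attribute 1 is noise; a decision tree
# would give attribute 0 nearly all the weight.
data = [(0.1, 9.0), (0.2, 1.0), (0.9, 8.5), (1.0, 0.5)]
labels = ["neg", "neg", "pos", "pos"]
pred = knn_predict((0.15, 8.0), data, labels, weights=(1.0, 0.01), k=3)
```

With unit weights the noisy second attribute would dominate the distance; down-weighting it lets the informative attribute drive the vote.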


2020 ◽  
Author(s):  
Wen Wei Loh ◽  
Jee-Seon Kim

There is increasing attention given to assessing treatment effect heterogeneity arising from individuals belonging to different underlying classes in the population. Inference proceeds by separating the individuals into distinct classes, then estimating the causal effects within each class. In practice, the individual class memberships are rarely known with certainty and often have to be estimated. Ignoring the uncertainty in the assumed class memberships precludes the possibility of misclassification, which can potentially lead to biased results and incorrect conclusions. In this paper, we propose a strategy for conducting sensitivity analyses to possible misclassification when estimating heterogeneous treatment effects for different classes. We exploit each individual's (typically nonzero) estimated probabilities of belonging to any given class to evaluate the impact of changing the assumed class memberships, one individual at a time, on the resultant class-specific effect estimates. Because the estimated probabilities are themselves subject to sampling variability, we propose Monte Carlo bounds that explicitly reflect the uncertainty in the individual class memberships via perturbations using a parametric bootstrap. We illustrate our proposed strategy using publicly available data from a field experiment with almost 11,000 voters to investigate whether the effect of voter mobilization on turnout varies across different voter classes. We demonstrate via simulation studies that the perturbed class membership probabilities may be used to construct confidence intervals that attain the nominal coverage rate better empirically than existing methods that hold the estimated class memberships fixed.
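The core perturbation idea can be sketched simply: resample each individual's class membership from their estimated membership probabilities, re-estimate the class-specific effect each time, and take the range of the resulting estimates as Monte Carlo bounds. This sketch omits the parametric bootstrap over the probabilities themselves; all data and probabilities are hypothetical.

```python
import random

# Sketch: Monte Carlo bounds on a class-specific treatment effect that
# reflect uncertainty in latent class membership. Each individual is
# (treatment d, outcome y, estimated P(class A)); values are hypothetical.
random.seed(1)
people = [(1, 5.0, 0.9), (0, 2.0, 0.8), (1, 7.0, 0.7),
          (0, 3.0, 0.6), (1, 4.0, 0.3), (0, 1.0, 0.2),
          (1, 6.0, 0.85), (0, 2.5, 0.75)]

def class_a_effect(assignment):
    """Difference in mean outcomes among members assigned to class A."""
    t = [y for (d, y, _), a in zip(people, assignment) if a and d == 1]
    c = [y for (d, y, _), a in zip(people, assignment) if a and d == 0]
    if not t or not c:
        return None  # draw is uninformative for this class
    return sum(t) / len(t) - sum(c) / len(c)

draws = []
for _ in range(2000):
    # Perturb memberships: each person joins class A with their estimated
    # probability, rather than being held fixed at the modal class.
    assignment = [random.random() < p for (_, _, p) in people]
    est = class_a_effect(assignment)
    if est is not None:
        draws.append(est)

lower, upper = min(draws), max(draws)  # Monte Carlo bounds for class A
```

Holding memberships fixed would collapse this interval to a single point, hiding the misclassification risk the paper addresses.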


Author(s):  
Matthew Hindman

Analytic techniques developed for big data have much broader applications in the social sciences, outperforming standard regression models even—or rather especially—in smaller datasets. This article offers an overview of machine learning methods well-suited to social science problems, including decision trees, dimension reduction methods, nearest neighbor algorithms, support vector models, and penalized regression. In addition to novel algorithms, machine learning places great emphasis on model checking (through holdout samples and cross-validation) and model shrinkage (adjusting predictions toward the mean to reduce overfitting). This article advocates replacing typical regression analyses with two different sorts of models used in concert. A multi-algorithm ensemble approach should be used to determine the noise floor of a given dataset, while simpler methods such as penalized regression or decision trees should be used for theory building and hypothesis testing.
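Two of the practices the article emphasizes, shrinkage and model checking through cross-validation, can be shown together on a toy problem: a ridge penalty pulls a regression slope toward zero, and k-fold cross-validation picks the penalty. One predictor and no intercept for brevity; the data are simulated, not from the article.

```python
import random

# Sketch: cross-validated penalized (ridge) regression on simulated data
# with true slope 2.0 and noisy outcomes.
random.seed(0)
data = [(x, 2.0 * x + random.gauss(0, 4))
        for x in [random.uniform(-3, 3) for _ in range(60)]]

def ridge_slope(points, lam):
    """Closed-form ridge slope for y = b*x (no intercept):
    b = sum(x*y) / (sum(x*x) + lambda)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)

def cv_error(points, lam, k=5):
    """Mean squared error across k held-out folds (model checking)."""
    folds = [points[i::k] for i in range(k)]
    err, n = 0.0, 0
    for i in range(k):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        b = ridge_slope(train, lam)
        err += sum((y - b * x) ** 2 for x, y in folds[i])
        n += len(folds[i])
    return err / n

# Model shrinkage: larger lambda pulls the slope toward zero; CV picks it.
best_lam = min([0.0, 0.1, 1.0, 10.0, 100.0],
               key=lambda lam: cv_error(data, lam))
shrunk = ridge_slope(data, best_lam)
```

The holdout error, not in-sample fit, decides the penalty, which is exactly the model-checking discipline the article urges social scientists to adopt.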


2020 ◽  
Vol 17 (1) ◽  
pp. 319-328
Author(s):  
Ade Muchlis Maulana Anwar ◽  
Prihastuti Harsani ◽  
Aries Maesya

Population data are individual or aggregate data structured as a result of population registration and civil registration activities. A birth certificate is a civil registration deed recording the birth of a baby, whose birth is reported so that it can be registered on the Family Card and assigned a Population Identification Number (NIK) as the basis for obtaining other community services. Of the 570,637 birth certificate reports integrated into the 2018 Population Administration Information System (SIAK), 503,946 were reported late and only 66,691 were reported on time. Clustering is a method for grouping data so that items within a group are similar to one another and dissimilar to items in other groups. K-nearest neighbor is a method for classifying objects based on the training data closest to the test data. K-means is a method for dividing a number of objects into groups based on their distance to group midpoints (centroids). In preprocessing, the data were cleaned by filling blank values with the most frequent value, and attributes were selected using the information gain method. Applying the k-nearest neighbor method to predict reporting delays and the k-means method to cluster priority service areas, on 10,000 birth certificate records from 2019, yielded predictions with an accuracy of 74.00%, and k-means with K = 2 produced a Davies-Bouldin index of 1.179.
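The two methods the study combines can be sketched on hypothetical data: k-nearest neighbor to predict whether a report will be late, and k-means (K = 2) to group areas around centroids. The features, labels, and values below are illustrative only, not the study's data.

```python
from collections import Counter

def knn(query, train, k=3):
    """train: list of (features, label); majority vote of k nearest."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda r: dist(query, r[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def kmeans_1d(values, k=2, iters=20):
    """Plain 1-D k-means: assign to nearest centroid, recompute means."""
    centroids = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[min(range(k),
                         key=lambda i: abs(v - centroids[i]))].append(v)
        centroids = [sum(c) / len(c) for c in clusters if c] + \
                    [m for c, m in zip(clusters, centroids) if not c]
    return sorted(centroids)

# kNN: hypothetical features = (days since birth, distance to office in km)
reports = [((3, 2.0), "on_time"), ((5, 1.0), "on_time"),
           ((40, 9.0), "late"), ((70, 12.0), "late"), ((55, 8.0), "late")]
pred = knn((45, 10.0), reports, k=3)

# k-means with K = 2 on hypothetical per-area late-report counts
centers = kmeans_1d([1, 2, 2, 3, 40, 41, 43], k=2)
```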


2019 ◽  
Vol 20 (5) ◽  
pp. 488-500 ◽  
Author(s):  
Yan Hu ◽  
Yi Lu ◽  
Shuo Wang ◽  
Mengying Zhang ◽  
Xiaosheng Qu ◽  
...  

Background: Globally, the number of cancer patients and deaths continues to increase yearly, and cancer has therefore become one of the world's leading causes of morbidity and mortality. In recent years, the study of anticancer drugs has become one of the most popular medical topics.

Objective: In this review, to study the application of machine learning in predicting anticancer drug activity, machine learning approaches such as Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB) were selected, and examples of their applications in anticancer drug design are listed.

Results: Machine learning contributes greatly to anticancer drug design and helps researchers by saving time and cost. However, it can only be an assisting tool for drug design.

Conclusion: This paper introduces the application of machine learning approaches in anticancer drug design. Many successful examples of identification and prediction of anticancer drug activity are discussed, and anticancer drug research remains in active progress. Moreover, the merits of some web servers related to anticancer drugs are mentioned.


Author(s):  
M. Ilayaraja ◽  
S. Hemalatha ◽  
P. Manickam ◽  
K. Sathesh Kumar ◽  
K. Shankar

Cloud computing is characterized as a set of resources or services made available over the web to clients on demand by cloud providers. It delivers everything as a service over the web according to client demand, for example operating systems, network hardware, storage, resources, and software. Nowadays, an Intrusion Detection System (IDS) is a powerful mechanism that supports experts in taking action when a system is compromised by intrusions. Most intrusion detection frameworks are built on machine learning strategies, and the dataset commonly used for intrusion detection is the Knowledge Discovery in Databases (KDD) dataset. This paper detects and classifies intruded data using Machine Learning (ML) with the MapReduce model. The first phase uses the Hadoop MapReduce model to reduce the size of the database, with optimal weights determined by the reducer model, and the second phase uses a Decision Tree (DT) classifier to detect the data. The DT classifier applies an appropriate classifier to decide the class labels for non-homogeneous leaf nodes: the decision tree segment provides a coarse segmentation profile, while the leaf-level classifier provides information about the attributes that influence the label within a segment. The proposed approach achieves a detection accuracy of 96.21%, compared with existing classifiers such as Neural Network (NN), Naive Bayes (NB), and K-Nearest Neighbor (KNN).
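The two-phase idea, a map/reduce pass that shrinks the data before a decision-tree-style classifier labels it, can be sketched on toy records. The records, features, and split threshold below are hypothetical, not drawn from the KDD dataset, and the "tree" is a single-split stump.

```python
from collections import Counter
from functools import reduce

# Phase 1 (MapReduce style): map each connection record to (record, 1),
# then reduce by key, so the classifier sees each distinct record once
# with a count. Records are (protocol, failed_logins, dst_bytes).
records = [("tcp", 0, 215), ("tcp", 0, 215), ("tcp", 30, 0),
           ("udp", 0, 146), ("tcp", 28, 0), ("tcp", 30, 0)]
mapped = ((r, 1) for r in records)
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}),
                mapped, Counter())

# Phase 2: a one-split decision "tree" (stump); many failed-login-style
# attempts with zero bytes transferred is flagged as an intrusion.
def classify(record):
    proto, failed_logins, dst_bytes = record
    if failed_logins > 10 and dst_bytes == 0:
        return "intrusion"
    return "normal"

labels = {rec: classify(rec) for rec in counts}
```

Deduplicating in the reduce step is what shrinks the data the leaf-level classifier must handle, which mirrors the paper's motivation for the first phase.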


Author(s):  
Timnit Gebru

This chapter discusses the role of race and gender in artificial intelligence (AI). The rapid permeation of AI into society has not been accompanied by a thorough investigation of the sociopolitical issues that cause certain groups of people to be harmed rather than advantaged by it. For instance, recent studies have shown that commercial automated facial analysis systems have much higher error rates for dark-skinned women, while having minimal errors on light-skinned men. Moreover, a 2016 ProPublica investigation uncovered that machine learning–based tools that assess crime recidivism rates in the United States are biased against African Americans. Other studies show that natural language–processing tools trained on news articles exhibit societal biases. While many technical solutions have been proposed to alleviate bias in machine learning systems, a holistic and multifaceted approach must be taken. This includes standardization bodies determining what types of systems can be used in which scenarios, making sure that automated decision tools are created by people from diverse backgrounds, and understanding the historical and political factors that disadvantage certain groups who are subjected to these tools.


2021 ◽  
Vol 13 (5) ◽  
pp. 1021
Author(s):  
Hu Ding ◽  
Jiaming Na ◽  
Shangjing Jiang ◽  
Jie Zhu ◽  
Kai Liu ◽  
...  

Artificial terraces are of great importance for agricultural production and for soil and water conservation. Automatic high-accuracy mapping of artificial terraces is the basis of monitoring and related studies. Previous research achieved artificial terrace mapping based on high-resolution digital elevation models (DEMs) or imagery. Because contextual information is important for terrace mapping, object-based image analysis (OBIA) combined with machine learning (ML) technologies is widely used. However, the selection of an appropriate classifier is of great importance for the terrace mapping task. In this study, the performance of an integrated framework using OBIA and ML for terrace mapping was tested. A catchment, Zhifanggou, in the Loess Plateau, China, was used as the study area. First, optimized image segmentation was conducted. Then, features from the DEMs and imagery were extracted, and the correlations between the features were analyzed and ranked for classification. Finally, three commonly used ML classifiers, namely extreme gradient boosting (XGBoost), random forest (RF), and k-nearest neighbor (KNN), were used for terrace mapping. Comparison with the ground truth, as delineated by field survey, indicated that random forest performed best, with a 95.60% overall accuracy (followed by 94.16% and 92.33% for XGBoost and KNN, respectively). The influence of class imbalance and feature selection is discussed. This work provides a credible framework for mapping artificial terraces.
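The overall-accuracy figures reported here are simply the share of mapped units whose predicted class matches the field-survey ground truth. A minimal sketch, with hypothetical labels rather than the study's data:

```python
# Sketch: overall accuracy = fraction of units where the predicted class
# agrees with the ground-truth class. Labels below are hypothetical.
def overall_accuracy(predicted, truth):
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)

truth     = ["terrace", "terrace", "other", "terrace", "other"]
predicted = ["terrace", "other",   "other", "terrace", "other"]
acc = overall_accuracy(predicted, truth)  # 4 of 5 correct -> 0.8
```

With imbalanced classes, as the study notes, overall accuracy alone can flatter a classifier that favors the majority class, which is why per-class metrics are worth checking too.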

