Journal of Behavioral Data Science
Latest Publications


TOTAL DOCUMENTS

16
(FIVE YEARS 16)

H-INDEX

0
(FIVE YEARS 0)

Published By International Society For Data Science And Analytics

2575-8306, 2574-1284

Author(s):  
Meghan Cain

In this tutorial, you will learn how to fit structural equation models (SEMs) using Stata software. SEMs can be fit in Stata using the sem command for standard linear SEMs, the gsem command for generalized linear SEMs, or by drawing their path diagrams in the SEM Builder. After a brief introduction to Stata, the sem command will be demonstrated through a confirmatory factor analysis model, a mediation model, a group analysis, and a growth curve model, and the gsem command will be demonstrated through a random-slope model and an ordinal logistic regression. Materials and datasets are provided online, allowing anyone with Stata to follow along.


Author(s):  
Wen Luo ◽  
Hok Chio Lai

Multilevel modeling is often used to analyze survey data collected with a multistage sampling design. When the selection is informative, sampling weights need to be incorporated in the estimation. We propose a weighted residual bootstrap method as an alternative to the multilevel pseudo-maximum likelihood (MPML) estimators. In a Monte Carlo simulation using two-level linear mixed effects models, the bootstrap method showed advantages over MPML for the estimates and the statistical inferences of the intercept, the slope of the level-2 predictor, and the variance components at level 2. The impact of sample size, selection mechanism, intraclass correlation (ICC), and distributional assumptions on the performance of the methods was examined. The performance of MPML was suboptimal when sample size and ICC were small and when the normality assumption was violated. The bootstrap estimates performed generally well across all the simulation conditions, but had notably suboptimal performance in estimating the covariance component in a random slopes model when sample size and ICCs were large. As an illustration, the bootstrap method is applied to the American data of the OECD’s Programme for International Student Assessment (PISA) survey on math achievement using the R package bootmlm.


Author(s):  
Shuai Zhou ◽  
Yanling Li ◽  
Guangqing Chi ◽  
Junjun Yin ◽  
Zita Oravecz ◽  
...  

Global Positioning System (GPS) data have become one of the routine data streams collected by wearable devices, cell phones, and social media platforms in this digital age. Such data offer research opportunities because they may provide contextual information to elucidate where, when, and why individuals engage in and sustain particular behavioral patterns. However, raw GPS data consisting of densely sampled time series of latitude and longitude coordinate pairs do not readily convey meaningful information concerning intra-individual dynamics and inter-individual differences; substantial data processing is required. Raw GPS data need to be integrated into a Geographic Information System (GIS) and analyzed before the mobility and activity patterns of individuals can be derived, a process that is unfamiliar to many behavioral scientists. In this tutorial article, we introduced GPS2space, a free and open-source Python library that we developed to facilitate the processing of GPS data, integration with GIS to derive distances from landmarks of interest, as well as extraction of two spatial features: activity space of individuals and shared space between individuals, such as members of the same family. We demonstrated functions available in the library using data from the Colorado Online Twin Study to explore seasonal and age-related changes in individuals’ activity space and twin siblings’ shared space, as well as gender, zygosity, and baseline age-related differences in their initial levels and/or changes over time. We concluded with discussions of other potential usages, caveats, and future developments of GPS2space.
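
As a rough illustration of what "deriving distances from landmarks of interest" can involve, the following minimal sketch computes great-circle distances from a series of GPS fixes to a single landmark using the haversine formula. It does not use GPS2space and is not the library's method; the function names `haversine_km` and `distances_to_landmark`, the spherical-Earth radius, and the sample coordinates are all assumptions made for this example.

```python
import math

# Assumption: a spherical Earth with the commonly used mean radius.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distances_to_landmark(track, landmark):
    """Distance in kilometers from each (lat, lon) fix in `track` to `landmark`."""
    lat0, lon0 = landmark
    return [haversine_km(lat, lon, lat0, lon0) for lat, lon in track]


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    landmark = (40.0, -105.0)
    track = [(40.0, -105.0), (40.01, -105.0), (40.0, -104.99)]
    for fix, d in zip(track, distances_to_landmark(track, landmark)):
        print(fix, round(d, 3), "km")
```

A planar projection, as a GIS workflow would typically use, is usually preferable for buffer-based features such as activity space; the spherical formula here is only meant to show the distance step in isolation.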


Author(s):  
Jin Liu ◽  
Le Kang ◽  
Roy T. Sabo ◽  
Robert M. Kirkpatrick ◽  
Robert A. Perera

Empirical researchers are usually interested in investigating the impacts that baseline covariates have when uncovering sample heterogeneity and separating samples into more homogeneous groups. However, a considerable number of studies in the structural equation modeling (SEM) framework start with vague hypotheses about heterogeneity and its possible causes. This suggests that (1) the determination and specification of a proper model with covariates is not straightforward, and (2) the exploration process may be computationally intensive, given that a model in the SEM framework is usually complicated and the pool of candidate covariates is usually huge in the psychological and educational domains where the SEM framework is widely employed. Following Bakk and Kuha (2017), this article presents a two-step growth mixture model (GMM) that examines the relationship between latent classes of nonlinear trajectories and baseline characteristics. Our simulation studies demonstrate that the proposed model is capable of clustering the nonlinear change patterns and of estimating the parameters of interest unbiasedly and precisely, with appropriate confidence interval coverage. Because the candidate covariates are usually numerous and highly correlated, this study also proposes implementing exploratory factor analysis (EFA) to reduce the dimension of the covariate space. We illustrate how to use the hybrid method, the two-step GMM and EFA, to efficiently explore the heterogeneity of nonlinear trajectories of longitudinal mathematics achievement data.


Author(s):  
Sarfaraz Serang ◽  
James Sears

Understanding causal effects of a treatment is often of interest in the social sciences. When treatments cannot be randomly assigned, researchers must ensure that treated and untreated participants are balanced on covariates before estimating treatment effects. Conventional practices are useful in matching such that treated and untreated participants have similar average values on their covariates. However, situations arise in which a researcher may instead want to match on model parameters. We propose an algorithm, Causal Mplus Trees, which uses decision trees to match on structural equation model parameters and estimates conditional average treatment effects in each node. We provide a proof of concept using two small simulation studies and demonstrate its application using COVID-19 data.


2021 ◽  
Vol 1 (1) ◽  
Author(s):  
Zhiyong Zhang ◽  

Data science has maintained its popularity for about 20 years. This study adopts a bottom-up approach to understand what data science is by analyzing the descriptions of courses offered by the data science programs in the United States. Through topic modeling, 14 topics are identified from the current curricula of 56 data science programs. These topics reiterate that data science is at the intersection of statistics, computer science, and substantive fields.


2021 ◽  
Vol 1 (1) ◽  
Author(s):  
Alexander P. Christensen ◽  

The nature of associations between variables is important for constructing theory about psychological phenomena. In the last decade, this topic has received renewed interest with the introduction of psychometric network models. In psychology, network models are often contrasted with latent variable (e.g., factor) models. Recent research has shown that differences between the two tend to be more substantive than statistical. One recently developed algorithm, the Loadings Comparison Test (LCT), predicts whether data were generated from a factor or small-world network model. A significant limitation of the current LCT implementation is that it is based on heuristics that were derived from descriptive statistics. In the present study, we used artificial neural networks to replace these heuristics and develop a more robust and generalizable algorithm. We performed a Monte Carlo simulation study that compared neural networks to the original LCT algorithm as well as logistic regression models that were trained on the same data. We found that the neural networks performed as well as or better than both methods for predicting whether data were generated from a factor, small-world network, or random network model. Although the neural networks were trained on small-world networks, we show that they can reliably predict the data-generating model of random networks, demonstrating generalizability beyond the trained data. We echo the call for more formal theories about the relations between variables and discuss the role of the LCT in this process.


2021 ◽  
Vol 1 (1) ◽  
Author(s):  
Xin Tong

Semiparametric Bayesian methods have been proposed in the literature for growth curve modeling to reduce the adverse effect of having nonnormal data. The normality assumption of measurement errors in traditional growth curve models was replaced by a random distribution with Dirichlet process mixture priors. However, both the random effects and measurement errors are equally likely to be nonnormal. Therefore, in this study, three types of robust distributional growth curve models are proposed from a semiparametric Bayesian perspective, in which random coefficients or measurement errors follow either normal distributions or unknown random distributions with Dirichlet process mixture priors. Based on a Monte Carlo simulation study, we evaluate the performance of the robust models and demonstrate that selecting an appropriate model for practical data analyses is very important, by comparing the three types of robust distributional models as well as the traditional growth curve models with the normality assumption. We also provide a straightforward strategy to select the appropriate model.


2021 ◽  
Vol 1 (1) ◽  
Author(s):  
Haiyan Liu ◽  

Whether birds of a feather flock together or opposites attract is a classical research question in social and personality psychology. In most existing studies, correlation-based techniques are used to study the similarity or dissimilarity among social entities. Social network data comprise two primary components: actors and the possible social relations between them. They therefore contain observations on dyads both with and without social relations. Because of the availability of the baseline group (dyads without social relations), it is possible to contrast the two groups of dyads using social network analysis techniques. This study aims to illustrate how to use social network analysis techniques to address psychological research questions. Specifically, we investigate how the similarity or dissimilarity of actors' characteristics relates to the likelihood that they build social relations. By analyzing a college friendship network, we found a quadratic relation between personality similarity and friendship. Both very similar and very dissimilar personalities boost friendship among college students.


2021 ◽  
pp. 154-169
Author(s):  
Rohan Sukumaran ◽  
Parth Patwa ◽  
Sethuraman T V ◽  
Sheshank Shankar ◽  
Rishank Kanaparti ◽  
...  

It is crucial for policymakers to understand the community prevalence of COVID-19 so combative resources can be effectively allocated and prioritized during the COVID-19 pandemic. Traditionally, community prevalence has been assessed through diagnostic and antibody testing data. However, despite the increasing availability of COVID-19 testing, the required level has not been met in parts of the globe, introducing a need for an alternative method for communities to determine disease prevalence. This is further complicated by the observation that COVID-19 prevalence and spread vary across different spatial, temporal, and demographic verticals. In this study, we examine trends in the spread of COVID-19 by utilizing the results of self-reported COVID-19 symptoms surveys as a complement to COVID-19 testing reports. This allows us to assess community disease prevalence, even in areas with low COVID-19 testing ability. Using individually reported symptom data from various populations, our method predicts the likely percentage of the population that tested positive for COVID-19. We achieved a mean absolute error (MAE) of 1.14 and a mean relative error (MRE) of 60.40% with a 95% confidence interval of [60.12, 60.67]. This implies that our model's predictions deviate by about 1,140 cases from the actual count in a population of 1 million. In addition, we forecast the location-wise percentage of the population testing positive for the next 30 days using self-reported symptoms data from previous days. The MAE for this method is as low as 0.15 (MRE of 11.28% with a 95% confidence interval of [10.9, 11.6]) for New York. We present an analysis of these results, exposing various clinical attributes of interest across different demographics. Lastly, we qualitatively analyze how various policy enactments (testing, curfew) affect the prevalence of COVID-19 in a community.
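
For readers unfamiliar with the two error measures reported above, the following minimal sketch shows one common way to compute them. The abstract does not give the authors' exact definitions, so the formulas here are assumptions: MAE as the mean of |predicted − actual|, and MRE as the mean of |predicted − actual| / |actual| expressed as a percentage. The function names and the sample values are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Mean of |predicted - actual| over paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def mean_relative_error(actual, predicted):
    """Mean of |predicted - actual| / |actual|, as a percentage.

    Assumption: every actual value is nonzero; a zero raises ValueError
    instead of being silently skipped.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("relative error is undefined when an actual value is zero")
    ratios = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical percentages of a population testing positive.
    actual = [2.0, 4.0, 5.0]
    predicted = [1.0, 5.0, 5.0]
    print(mean_absolute_error(actual, predicted))
    print(mean_relative_error(actual, predicted))
```

Because MRE divides by the actual value, it can be large when the actual prevalence is small even if the absolute error is modest, which is one reason the two measures are often reported together.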

