Deep graph convolutional network for US birth data harmonization (Preprint)

2020 ◽  
Author(s):  
Xiaoqian Jiang ◽  
Lishan Yu ◽  
Hamisu M. Salihu ◽  
Deepa Dongarwar

BACKGROUND In the United States, state laws require birth certificates to be completed for all births, and federal law mandates national collection and publication of births and other vital statistics data. The National Center for Health Statistics (NCHS) has published the key statistics of birth data over the years. These data files, from as early as the 1970s, have been released and made publicly available. There are about 3 million new births each year, and every birth is a record in the data set described by hundreds of variables. The total data cover more than half of the current US population, making it an invaluable resource for studying birth epidemiology. Using such big data, researchers can ask interesting questions and study longitudinal patterns, for example, the impact of a mother's drinking status on infertility in metropolitan areas over the last decade, or the relationship between the biological father's education level and cesarean sections over the years. However, existing published data sets cannot directly support these research questions because the variables and their categories have been adjusted over time, which leaves the individually published data files fragmented. The information contained in the published data files is highly diverse, containing hundreds of variables each year. Besides minor adjustments like renaming and increasing variable categories, some major updates significantly changed the fields of statistics (including removal, addition, and modification of the variables), making the published data disconnected and ambiguous to use over multiple years. Researchers have previously reconstructed features to study temporal patterns, but the scale is limited (focusing only on a few variables of interest). Many have reinvented the wheel, and such reconstructions lack consistency because different researchers might use different criteria to harmonize variables, leading to inconsistent findings and limiting the reproducibility of research. 
There is no systematic effort to combine about five decades of data files into a database that includes every variable that has ever been released by NCHS. OBJECTIVE To utilize machine learning techniques to combine the United States (US) natality data for the last five decades, with changing variables and factors, into a consistent database. METHODS We developed a feasible and efficient deep-learning-based framework to harmonize data sets of live births in the US from 1970 to 2018. We constructed a graph based on the properties and elements of the databases, including variables, and trained a graph convolutional network (GCN) on the graph to learn node embeddings whose similarity reflects the similarity of variables. We devised a novel loss function with a slack margin and a banlist mechanism (for a random walk) to learn the desired structure (two nodes sharing more information were more similar to each other). We developed an active learning mechanism to conduct the harmonization. RESULTS We harmonized historical US birth data and resolved conflicts in ambiguous terms. From a total of 9,321 variables (i.e., 783 stemmed variables, from 1970 to 2018), we applied our model iteratively together with human review, obtaining 323 hyperchains of variables. Hyperchains for harmonization were composed of 201 stemmed variable pairs when considering any pairs of different stemmed variables changed over years. During the harmonization, the first round of our model provided 305 candidate stemmed variable pairs (based on the top-20 most similar variables of each variable according to the learned embeddings) and achieved recall and precision of 87.56% and 57.70%, respectively. CONCLUSIONS Our harmonized graph neural network (HGNN) method provides a feasible and efficient way to connect relevant databases at a meta-level. 
Adapting to the properties and characteristics of databases, HGNN can learn patterns and search relations globally, which makes it powerful for discovering the similarity between variables among databases. Smart utilization of machine learning can significantly reduce the manual effort in database harmonization and integrate fragmented data into useful databases for future research.
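
The candidate-generation step described in the abstract (take each variable's most similar neighbours in embedding space, then score the resulting pairs against reviewed matches) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity measure, the function names, the variable names, and the two-dimensional toy embeddings are all assumptions made for illustration. The reported figures appear consistent with roughly 176 of the 305 candidates being correct out of 201 true pairs (176/305 ≈ 57.70%, 176/201 ≈ 87.56%).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_pairs(embeddings, k):
    """Pair every variable with its k most similar other variables."""
    pairs = set()
    for name, vec in embeddings.items():
        scored = sorted(
            (
                (cosine(vec, other_vec), other)
                for other, other_vec in embeddings.items()
                if other != name
            ),
            reverse=True,
        )
        for _, other in scored[:k]:
            # frozenset makes the pair unordered, so (a, b) == (b, a)
            pairs.add(frozenset((name, other)))
    return pairs


def precision_recall(candidates, true_pairs):
    """Precision and recall of candidate pairs against reviewed pairs."""
    hits = len(candidates & true_pairs)
    return hits / len(candidates), hits / len(true_pairs)


# Hypothetical toy embeddings; names and values are not from the paper.
embeddings = {
    "mother_age": [1.0, 0.0],
    "mage": [0.9, 0.1],
    "birth_weight": [0.0, 1.0],
    "bwt": [0.1, 0.9],
}
true_pairs = {
    frozenset(("mother_age", "mage")),
    frozenset(("birth_weight", "bwt")),
}

candidates = candidate_pairs(embeddings, k=1)
precision, recall = precision_recall(candidates, true_pairs)
```

With a larger k (the abstract uses 20), recall tends to rise while precision falls, which is why a human review round follows the model's proposals.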

2021 ◽  
Author(s):  
Satya Katragadda ◽  
Ravi Teja Bhupatiraju ◽  
Vijay Raghavan ◽  
Ziad Ashkar ◽  
Raju Gottumukkala

Abstract Background: Human travel patterns play a major part in the spread of infectious diseases. This was evident in the geographical spread of COVID-19 in the United States. However, the impact of this mobility and the transmission of the virus due to local travel, compared to the population traveling across state boundaries, is unknown. This study evaluates the impact of local vs. visitor mobility in understanding the growth in the number of cases for infectious disease outbreaks. Methods: We use two different mobility metrics, namely the local risk and visitor risk, extracted from trip data generated from anonymized mobile phone data across all 50 states in the United States. We analyzed the impact on infection spread of using local trips alone, and the infection risk potential generated by visitors' trips from other states. We used the Diebold-Mariano test to compare across three machine learning models. Finally, we compared the performance of models that include visitor mobility for all three waves in the United States and across all 50 states. Results: We observe that visitor mobility impacts case growth and that including visitor mobility in forecasting the number of COVID-19 cases improves prediction accuracy by 34. Using the Diebold-Mariano test, we found that the performance improvement resulting from including visitor mobility is statistically significant. We also observe that the significance was much higher during the first peak, from March to June 2020. Conclusion: With cases present everywhere (i.e., local and visitor), visitor mobility (even within the country) is shown to have a significant impact on growth in the number of cases. While it is not possible to account for other factors such as the impact of interventions and differences in local mobility and visitor mobility, we find that these observations can be used to plan for both reopening and limiting visitors from regions where there is a high number of cases.
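
For readers unfamiliar with the Diebold-Mariano test mentioned above, the following is a minimal sketch of its basic form. It assumes a squared-error loss, a one-step forecast horizon, and no autocorrelation correction; the abstract does not state which variant the authors used. The function name and all numbers are hypothetical.

```python
import math


def diebold_mariano(actual, forecast_a, forecast_b):
    """Basic Diebold-Mariano statistic under squared-error loss.

    A positive statistic means forecast_b has the lower average loss.
    Assumes a one-step horizon, so only the lag-0 variance is used.
    """
    # Loss differential per time step: loss(A) - loss(B)
    d = [
        (y - a) ** 2 - (y - b) ** 2
        for y, a, b in zip(actual, forecast_a, forecast_b)
    ]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / n
    dm = mean_d / math.sqrt(var_d / n)
    # Two-sided p-value from the standard normal approximation
    p_value = math.erfc(abs(dm) / math.sqrt(2))
    return dm, p_value


# Hypothetical case counts and forecasts (not from the study)
actual = [10, 12, 15, 20, 26, 30, 33, 35]
local_only = [13, 9, 18, 16, 30, 26, 37, 31]      # model without visitor mobility
with_visitors = [9, 13, 14, 21, 24, 31, 32, 37]   # model with visitor mobility

dm_stat, p_value = diebold_mariano(actual, local_only, with_visitors)
```

In this toy data the second forecast is closer to the actual values at every step, so the statistic is positive and the p-value is small. Multi-step forecasts would need the long-run variance (with autocovariance terms) in place of `var_d`.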


2007 ◽  
Vol 2 (3) ◽  
pp. 94
Author(s):  
Stephanie Hall

Objective – To determine the effect of large bookstores (defined as those having 20 or more employees) on household library use. Design – Econometric analysis using cross-sectional data sets. Setting – The United States of America. Subjects – People in over 55,000 households across the U.S.A. Methods – Data from three 1996 studies were examined using logit and multinomial logit estimation procedures: the National Center for Education Statistics’ National Household Education Survey (NHES) and Public Library Survey (PLS), and the U.S. Census Bureau’s County Business Patterns (CBP). The county level results of the NHES telephone survey were merged with the county level data from the PLS and the CBP. Additionally, data on Internet use at the state level from the Statistical Abstract of the United States were incorporated into the data set. A logit regression model was used to estimate probability of library use based on several independent variables, evaluated at the mean. Main results – In general, Hemmeter found that “with regard to the impact of large bookstores on household library use, large bookstores do not appear to have an effect on overall library use among the general population” (613). While no significant changes in general library use were found among high and low income households where more large bookstores were present, nor in the population taken as a whole, middle income households (between $25,000 and $50,000 in annual income) showed notable declines in library use in these situations. These effects were strongest in the areas of borrowing (200% less likely) and recreational purposes (161%), but were also present in work-related use and job searching. Hemmeter also writes that “poorer households use the library more often for job search purposes. The probability of library use for recreation, work, and consumer information increases as income increases. This effect diminishes as households get richer” (611). 
Finally, home ownership was also correlated with higher library use. Households with children were more than 20% more likely to use the library (610). Their use of the library for school-related purposes, general borrowing, program activities, and so on was not affected by the presence of book superstores. White families with children were somewhat less likely to use the library, while families with higher earning and education levels were more likely to use the library. Library use also increased with the number of children in the family. Shorter distances to the nearest branch and a higher proportion of AV materials were also predictive of higher library use. Educational level was another important factor, with those who had not completed high school being significantly less likely to use the library than those with higher levels of educational attainment. Conclusions – The notable decline in public library use among middle income households where more large bookstores are present is seen as an important threat to libraries, as it may result in a decline in general support and support for funding among an important voting bloc. More current data are needed in this area. In addition to the type of information examined in this study, the author recommends the inclusion of information on funding, support for library referenda, and library quality as they relate to the presence of large bookstores.
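
The Methods section above mentions a logit model whose predicted probability is evaluated at the mean of the independent variables. A minimal sketch of that calculation follows. The coefficient values, variable names, and sample means are invented for illustration and are not taken from the study.

```python
import math


def logit_probability(intercept, coefficients, values):
    """Predicted probability from a logit model at the given covariate values."""
    z = intercept + sum(coefficients[name] * values[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


def marginal_effect(probability, coefficient):
    """Marginal effect of one continuous covariate on the probability."""
    return probability * (1.0 - probability) * coefficient


# Hypothetical coefficients and sample means (assumptions, not study results)
coefficients = {"large_bookstores": -0.10, "has_children": 0.80, "owns_home": 0.30}
sample_means = {"large_bookstores": 1.5, "has_children": 0.4, "owns_home": 0.6}

p_at_mean = logit_probability(-0.2, coefficients, sample_means)
effect = marginal_effect(p_at_mean, coefficients["large_bookstores"])
```

Evaluating at the mean gives a single representative probability; the marginal effect then indicates how that probability would shift for a small change in one variable, holding the others at their means.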


1976 ◽  
Vol 57 (11) ◽  
pp. 1346-1355 ◽  
Author(s):  
D. H. Lenschow ◽  
E. M. Agee

The field phases of AMTEX, a GARP subprogram on air-sea interaction implemented by Japan, were conducted over the East China Sea in the environs of Okinawa, Japan, during the last two weeks of February in 1974 and 1975. Investigators from Australia, Canada, and the United States also participated in this experiment. The weather was generally very favorable for this study of air mass transformation processes in 1975 because of an extensive cold air outbreak during most of the experimental period. A basic synoptic data set was obtained from 6 h soundings from an array of aerological stations centered at Okinawa. In addition, satellite, hourly surface and surface marine, oceanographic, boundary layer, radiation, radar, cloud physics, and aircraft data were obtained and have been or will be available in published data reports or on magnetic tape. Preliminary results from 1974 and 1975 reported at the Fourth AMTEX Study Conference and joint United States–Japan Cooperative Science Program Seminar, “Air Mass Transformation Processes over the Kuroshio in Winter,” held in Tokyo, 26–30 September 1975, are presented and discussed.


2018 ◽  
Vol 40 ◽  
pp. 06021
Author(s):  
David Abraham ◽  
Tate McAlpin ◽  
Keaton Jones

The movement of bed forms (sand dunes) in large sand-bed rivers is being used to determine the transport rate of bed load. The ISSDOTv2 (Integrated Section Surface Difference Over Time version 2) methodology uses time-sequenced differences of measured bathymetric surfaces to compute the bed-load transport rate. The method was verified using flume studies [1]. In general, the method provides very consistent and repeatable results, and also shows very good fidelity with most other measurement techniques. Over the last 7 years we have measured, computed, and compiled what we believe to be the most extensive data set anywhere of bed-load measurements on large sand-bed rivers. Most of the measurements have been taken on the Mississippi, Missouri, Ohio, and Snake Rivers in the United States. For cases where multiple measurements were made at varying flow rates, bed-load rating curves have been produced. This paper will provide references for the methodology, but is intended more to discuss the measurements, the resulting data sets, and current and potential uses for the bed-load data.


2019 ◽  
Author(s):  
Samara Mendez

Tracking the capability of the egg production industry to supply the food industry with enough cage-free eggs to meet retailers' and restaurants' animal welfare commitments is important to industry groups and farm animal advocacy organizations alike. In this project, we synthesize an analysis-ready data set that tracks cage-free hens and the supply of cage-free eggs relative to the overall numbers of hens and table eggs in the United States. The data set is based on reports produced by the United States Department of Agriculture (USDA), which are published weekly or monthly. The data will be updated periodically as new USDA reports are released. We supplement these data with definitions and a taxonomy of egg products drawn from USDA and industry publications. The data include flock size (both absolute and relative) and egg production of cage-free hens as well as all table-egg-laying hens in the US, collected to understand the impact of the industry's cage-free transition on hens. Data coverage ranges from December 2007 to present. Initial analysis of cage-free trends shows that, as of the most recent version of this report, 26% of all table-egg-laying hens lived in cage-free systems. This figure represents an increase of 23 percentage points over the entire sample period of December 2007 to April 2020. Revised: May 29, 2020


2020 ◽  
Vol 7 (1) ◽  
pp. 163-180
Author(s):  
Saagar S Kulkarni ◽  
Kathryn E Lorenz

This paper examines two CDC data sets in order to provide a comprehensive overview and social implications of COVID-19 related deaths within the United States over the first eight months of 2020. By analyzing the first data set during this eight-month period with the variables of age, race, and individual states in the United States, we found correlations between COVID-19 deaths and these three variables. Overall, our multivariable regression model was found to be statistically significant. When analyzing the second CDC data set, we used the same variables with one exception: gender was used in place of race. From this analysis, it was found that trends in age and individual states were significant. However, since gender was not found to be significant in predicting deaths, we concluded that gender does not play a significant role in the prognosis of COVID-19 induced deaths. However, the age of an individual and his/her state of residence potentially play a significant role in determining life or death. Socio-economic analysis of the US population confirms Qualitative socio-economic Logic based Cascade Hypotheses (QLCH) of education, occupation, and income affecting race/ethnicity differently. For a given race/ethnicity, education drives occupation, then income, where a person lives, and in turn his/her access to healthcare coverage. Considering the socio-economic data based QLCH framework, we conclude that different races are poised for differing effects of COVID-19 and that Asians and Whites are in a stronger position to combat COVID-19 than Hispanics and Blacks.


Land ◽  
2019 ◽  
Vol 8 (10) ◽  
pp. 156
Author(s):  
Rafael Moreno-Sanchez ◽  
James Raines ◽  
Jay Diffendorfer ◽  
Mark Drummond ◽  
Jessica Manko

This paper presents a synopsis of the challenges and limitations presented by existing and emerging land use/land cover (LULC) digital data sets when used to analyze the extent, habitat quality, and LULC changes of the monarch (Danaus plexippus) migratory habitat across the United States of America (US) and Mexico. First, the characteristics, state of the knowledge, and issues related to this habitat are presented. Then, the characteristics of the existing and emerging LULC digital data sets with global or cross-border coverage are listed, followed by the data sets that cover only the US or Mexico. Later, we discuss the challenges for determining the extent, habitat quality, and LULC changes in the monarchs’ migratory habitat when using these LULC data sets in conjunction with the current state of the knowledge of the monarchs’ ecology, behavior, and foraging/roosting plants used during their migration. We point to approaches to address some of these challenges, which can be categorized into: (a) LULC data set characteristics and availability; (b) availability of ancillary land management information; (c) ability to construct accurate forage suitability indices for their migration habitat; and (d) level of knowledge of the ecological and behavioral patterns of the monarchs during their journey.


2021 ◽  
Author(s):  
Johanna Marcelia

When fitting a model to a data set, the goal is to create a model that captures the trends present in the data. However, data often contain regions where the underlying model changes or exhibits shifts in certain parameters due to economic events. These locations in the data are known as changepoints, and ignoring them can result in high error and incorrect forecasts. By developing a specific cost function and optimizing it with a genetic algorithm, we are able to locate and account for the changepoints in a given data set. We specifically apply this process to the retail sales of electricity in the United States by examining data sets from each state's residential, commercial, and industrial sectors. We demonstrate that, when changepoints are accounted for, model trends can be computed more accurately. We specifically explore this in the case of data sets that exhibit changepoints due to the 2020 (and ongoing) pandemic.
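
The idea of a penalized changepoint cost function can be illustrated with a minimal sketch. The abstract does not give the author's cost function, so this example assumes a piecewise-constant mean model with a fixed penalty per changepoint, and it swaps the genetic algorithm for an exhaustive search over a single changepoint to keep the example short and deterministic. All names and numbers are hypothetical.

```python
def segment_cost(values):
    """Sum of squared deviations from the segment mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def total_cost(series, changepoints, penalty):
    """Fit cost of all segments plus a fixed penalty per changepoint."""
    bounds = [0] + sorted(changepoints) + [len(series)]
    fit = sum(segment_cost(series[a:b]) for a, b in zip(bounds, bounds[1:]))
    return fit + penalty * len(changepoints)


def best_single_changepoint(series, penalty):
    """Exhaustive search standing in for the genetic algorithm.

    Returns the changepoint list (empty if no split beats the penalty)
    and its total cost.
    """
    best_points, best_cost = [], total_cost(series, [], penalty)
    for t in range(1, len(series)):
        cost = total_cost(series, [t], penalty)
        if cost < best_cost:
            best_points, best_cost = [t], cost
    return best_points, best_cost


# Hypothetical monthly sales with a level shift at index 5
series = [100, 102, 99, 101, 100, 80, 82, 79, 81, 80]
changepoints, cost = best_single_changepoint(series, penalty=5.0)
```

The penalty term discourages spurious changepoints: a split is accepted only if it reduces the fit error by more than the penalty. For multiple changepoints the search space grows combinatorially, which is one reason a heuristic optimizer such as a genetic algorithm is attractive.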


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Liya A ◽  
Qian Qin ◽  
Hafiz Waqas Kamran ◽  
Anusara Sawangchai ◽  
Worakamol Wisetsri ◽  
...  

Purpose: This study aims to measure the relations between macroeconomic indicators and the price of gold. It further examines several factors against the gold price in the context of the United States. Design/methodology/approach: Secondary data were collected from the World Development Indicators (WDI) website to measure the relationship with and fluctuation of gold prices for the period of 31 years 1990–2019. This paper uses several econometric analyses: a unit root test for the stationarity of the data, descriptive statistics to describe the data, a correlation coefficient test to measure intercorrelation, and ordinary least squares regression to determine the impact of the independent variables on the dependent variable. In this research paper, gross domestic product (GDP), inflation rate (IR), unemployment rate (UR), real interest rate (RIR), gross national product (GNP), and standard trade value (STV) are the macroeconomic indicators and are treated as independent variables. The gold price is treated as the dependent variable. Findings: This study's overall results show a significant and positive association of GDP, IR, and STV with the gold price. Moreover, the RIR shows a negative but not significant relation with the gold price. Originality/value: Since several economic crises occurred during the period covered by the data, data error may be present, resulting in instability of the overall data. However, the study still hopes to find the guiding role of these macroeconomic factors in the price of gold from the limited data set. The scope of the research is limited to the United States.


Author(s):  
Taylor N. Carlson ◽  
Marisa Abrajano ◽  
Lisa García Bedolla

Individuals arrive at meaning through conversation. Scholars have long explored political conversations in the United States, and the vast majority of this research suggests that political discussion has important effects on political attitudes and engagement. However, much of this research relies on samples of White respondents, making it potentially difficult to generalize these findings to our increasingly diverse electorate. In this book, we seek to understand how political discussion networks vary across groups who have vastly different social positions in the United States, specifically along the lines of ethnorace, nativity, and gender. We build upon seminal work in the field as we argue that individuals with different social positions likely discuss politics with different groups of people and, as a consequence, their discussion networks have different effects on their political behavior. We use a novel discussion network data set with an ethnoracially diverse sample, paired with qualitative interviews, to test this argument. We assert that this book makes three central contributions: (1) expanding the scope of the political discussion network literature by providing a comparative analysis across ethnorace, nativity, and gender; (2) demonstrating how historical differences in partisanship, policy attitudes, and engagement are reflected within groups’ social networks; and (3) revealing how the social position of our respondents affects the impact that networks can have on their trust and efficacy in government, political knowledge, policy attitudes, and political and civic engagement patterns.

