Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD

Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints which limit researchers’ ability to physically bring data together and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access their data. The platform allows an analyst to pass commands to each server and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal which is a data integration system used by epidemiological studies and developed by the OBiBa open source project in the domain of bioinformatics. However, until now the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture (“resources”) for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects. 
To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).
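The principle described above (individual-level data stay on each institution's server and the analyst receives only non-disclosive results) can be illustrated with a small conceptual sketch. This is not the DataSHIELD or Opal API, which is R-based. The class `StudyServer`, the function `pooled_mean`, and the threshold `MIN_CELL_COUNT` are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical disclosure threshold: a server refuses to answer
# when the summary would be based on too few participants.
MIN_CELL_COUNT = 5


@dataclass(frozen=True)
class SiteSummary:
    """Aggregate statistics that a site is willing to release."""
    n: int
    total: float


class StudyServer:
    """Stands in for one institution's server holding its own data."""

    def __init__(self, name: str, values: Sequence[float]) -> None:
        self.name = name
        self._values = list(values)  # individual-level data never leave this object

    def summary(self) -> SiteSummary:
        """Return only the count and sum, or refuse if the group is too small."""
        if len(self._values) < MIN_CELL_COUNT:
            raise PermissionError(f"{self.name}: too few records to release a summary")
        return SiteSummary(n=len(self._values), total=sum(self._values))


def pooled_mean(servers: Sequence[StudyServer]) -> float:
    """Combine per-site aggregates into one mean without seeing any raw values."""
    summaries = [server.summary() for server in servers]
    total_n = sum(s.n for s in summaries)
    return sum(s.total for s in summaries) / total_n


site_a = StudyServer("site_a", [1, 2, 3, 4, 5])
site_b = StudyServer("site_b", [10, 20, 30, 40, 50, 60])
print(pooled_mean([site_a, site_b]))
```

The analyst-side function only ever handles counts and sums, so the pooled estimate equals the mean of the combined data even though the data were never brought together.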

Missing Data in Research on Youth and Family Programs

Psychological Reports ◽

10.1177/00332941211026851 ◽

2021 ◽

pp. 003329412110268

Author(s):

Jaime Ballard ◽

Adeya Richmond ◽

Suzanne van den Hoogenhof ◽

Lynne Borden ◽

Daniel Francis Perkins

Keyword(s):

Missing Data ◽

Statistical Methods ◽

Self Report ◽

Program Participation ◽

Multilevel Data ◽

Individual Level ◽

Level Data ◽

Before And After ◽

Post Test ◽

The Individual

Background: Multilevel data can be missing at the individual level or at a nested level, such as family, classroom, or program site. Increased knowledge of higher-level missing data is necessary to develop evaluation design and statistical methods to address it. Methods: Participants included 9,514 individuals participating in 47 youth and family programs nationwide who completed multiple self-report measures before and after program participation. Data were marked as missing or not missing at the item, scale, and wave levels for both individuals and program sites. Results: Site-level missing data represented a substantial portion of missing data, ranging from 0% to 46% of missing data at pre-test and from 35% to 71% of missing data at post-test. Youth were the most likely to be missing data, although site-level data did not differ by the age of participants served. In this dataset, youth had the most surveys to complete, so their missing data could be due to survey fatigue. Conclusions: Much of the missing data for individuals can be explained by the site not administering those questions or scales. These results suggest a need for statistical methods that account for site-level missing data, and for research design methods to reduce the prevalence of site-level missing data or reduce its impact. Researchers can generate buy-in with sites during the community collaboration stage, assessing problematic items for revision or removal and the need for ongoing site support, particularly at post-test. We recommend that researchers working with multilevel data report the amount and mechanism of missing data at each level.

Meffil: efficient normalisation and analysis of very large DNA methylation samples

10.1101/125963 ◽

2017 ◽

Cited By ~ 17

Author(s):

Josine Min ◽

Gibran Hemani ◽

George Davey Smith ◽

Caroline Relton ◽

Matthew Suderman

Keyword(s):

Dna Methylation ◽

Association Studies ◽

R Package ◽

Individual Level ◽

Technological Advances ◽

Level Data ◽

Fixed And Random Effects ◽

R Packages ◽

Meta Analyses ◽

Dramatic Growth

Background: Technological advances in high throughput DNA methylation microarrays have allowed dramatic growth of a new branch of epigenetic epidemiology. DNA methylation datasets are growing ever larger in terms of the number of samples profiled, the extent of genome coverage, and the number of studies being meta-analysed. Novel computational solutions are required to efficiently handle these data. Methods: We have developed meffil, an R package designed to quality control, normalize and perform epigenome-wide association studies (EWAS) efficiently on large samples of Illumina Infinium HumanMethylation450 and MethylationEPIC BeadChip microarrays. We tested meffil by applying it to 6000 450k microarrays generated from blood collected for two different datasets, Accessible Resource for Integrative Epigenomic Studies (ARIES) and The Genetics of Overweight Young Adults (GOYA) study. Results: A complete reimplementation of functional normalization minimizes computational memory requirements to 5% of that required by other R packages, without increasing running time. Incorporating fixed and random effects alongside functional normalization, and automated estimation of functional normalisation parameters reduces technical variation in DNA methylation levels, thus reducing false positive associations and improving power. We also demonstrate that the ability to normalize datasets distributed across physically different locations without sharing any biologically-based individual-level data may reduce heterogeneity in meta-analyses of epigenome-wide association studies. However, we show that when batch is perfectly confounded with cases and controls functional normalization is unable to prevent spurious associations. Conclusions: meffil is available online (https://github.com/perishky/meffil/) along with tutorials covering typical use cases.

Cohort Profile: The China Multi-Ethnic Cohort (CMEC) Study

10.1101/2020.02.14.20022970 ◽

2020 ◽

Author(s):

Xing Zhao ◽

Feng Hong ◽

Jianzhong Yin ◽

Wenge Tang ◽

Gang Zhang ◽

...

Keyword(s):

Clinical Laboratory ◽

Disease Incidence ◽

Data Access ◽

Population Based ◽

Baseline Survey ◽

Individual Level ◽

Long Term Storage ◽

Level Data ◽

The Individual ◽

Study Participants

Cohort purpose: The China Multi-Ethnic Cohort (CMEC) is a community population-based prospective observational study aiming to address the urgent need for understanding NCD prevalence, risk factors and associated conditions in resource-constrained settings for ethnic minorities in China. Cohort basics: A total of 99 556 participants aged 30 to 79 years (Tibetan populations include those aged 18 to 30 years) from the Tibetan, Yi, Miao, Bai, Bouyei, and Dong ethnic groups in Southwest China were recruited between May 2018 and September 2019. Follow-up and attrition: All surviving study participants will be invited for re-interviews every 3-5 years with concise questionnaires to review risk exposures and disease incidence. Furthermore, the vital status of study participants will be followed up through linkage with established electronic disease registries annually. Design and measures: The CMEC baseline survey collected data with an electronic questionnaire and face-to-face interviews, medical examinations and clinical laboratory tests. Furthermore, we collected biological specimens, including blood, saliva and stool, for long-term storage. In addition to the individual level data, we also collected regional level data for each investigation site. Collaboration and data access: Collaborations are welcome. Please send specific ideas to corresponding author at: [email protected].

Welsh Government Flying Start Programme Evaluation Using Linked Data

International Journal for Population Data Science ◽

10.23889/ijpds.v3i2.517 ◽

2018 ◽

Vol 3 (2) ◽

Author(s):

Sarah Lowe ◽

Laura McGinn ◽

Marcos Quintela ◽

Luke Player ◽

Karen Tingay

Keyword(s):

Evidence Base ◽

Development Project ◽

Early Years ◽

Local Authorities ◽

Parenting Support ◽

Individual Level ◽

Level Data ◽

Part Time ◽

The Individual ◽

Families With Children

Background: Flying Start (FS) is the Welsh Government’s (WG) flagship Early Years programme for families with children aged less than 4 years of age. Running since 2006, the four entitlements are: free part-time childcare for 2-3 year olds; enhanced health visiting; parenting support; and speech, language, and communication support. Objectives: Currently, while we know which areas in Wales are receiving FS support, individual-level data on which child received what entitlements is not available. Area-level outcomes can be used as proxy indicators but the individual impact of receiving FS support cannot be examined. The project aims to evaluate FS by linking the FS cohort to a range of outcomes including health, education and social care. Methods: A Dataflow Development Project (DDP) has been launched to install SAIL (Secure Anonymised Information Linkage) appliances into 6 pilot Local Authorities in Wales which will test acquiring and linking the individual level FS data from pilot Local Authorities with other datasets in SAIL. Findings: The project will report some emerging findings from the analysis of pilot data. Implications: There is a growing interest in using linked administrative data to evaluate government initiatives, and mounting enthusiasm in Local Government. If successful, this model is likely to be adopted by related WG programmes; improving the evidence base, facilitating effective evaluation, and adding to the data available for re-use in Wales.

Agricultural entrepreneurship orientation: is academic training a missing link?

Education + Training ◽

10.1108/et-10-2016-0156 ◽

2017 ◽

Vol 59 (7/8) ◽

pp. 856-870 ◽

Cited By ~ 3

Author(s):

Soodeh Mohammadinezhad ◽

Maryam Sharifzadeh

Keyword(s):

Content Type ◽

Individual Level ◽

Configuration Theory ◽

University Courses ◽

Level Data ◽

Academic Courses ◽

The Individual ◽

Agricultural Entrepreneurship ◽

Hierarchy Process ◽

Practical Implications

Purpose: The purpose of this paper is to investigate the importance of academic courses on agricultural entrepreneurship. Design/methodology/approach: A modified global entrepreneurship and development index (GEDI) was used to determine entrepreneurial dimensions among 19 graduates of agricultural colleges residing in Iran. A fuzzy analytical hierarchy process was applied to understand agricultural graduates’ preferences on the effectiveness of university courses (core, free elective and restricted elective). Findings: Results suggested the importance of professional restricted elective courses to provide students with necessary skills. These courses were successful in providing a context for entrepreneurial profile. Research limitations/implications: Whether entrepreneurial development depends on innate talent or acquired skills has always been a matter of debate. The paper builds on the premise that entrepreneurs are made through education and continuing reconstruction of experience; further research is required as the field develops in experience and complexity. Practical implications: The paper provides strategies to effectively modify the practical route in higher education to enhance entrepreneurial orientation among students. Originality/value: The paper is innovative at a conceptual level in modifying GEDI elements in individual-level variables based on GEDI configuration theory. This approach is particularly useful in addressing the bottleneck problems of entrepreneurship profile and focusses on the information interpreted at weights of the individual-level data.

On the Issues Surrounding Economic Voting

Comparative Political Studies ◽

10.1177/0010414087020001001 ◽

1987 ◽

Vol 20 (1) ◽

pp. 3-33 ◽

Cited By ~ 7

Author(s):

JOHN R. HIBBING

Keyword(s):

United States ◽

United Kingdom ◽

Voting Behavior ◽

Economic Voting ◽

The United States ◽

Individual Level ◽

The United Kingdom ◽

Level Data ◽

Election Results ◽

The Individual

This is an analysis of the effects of economic factors on voting behavior in the United Kingdom. Aggregate- and individual-level data are used. When the results are compared to findings generated by the United States case, some intriguing differences appear. To mention just two examples, unemployment and inflation seem to be much more important in the United Kingdom than in the United States, and changes in real per capita income are positively related to election results in the United States and negatively related in the United Kingdom. More generally, while the aggregate results are strong and the individual-level results weak in the United States, in the United Kingdom the situation is practically reversed.

Area-level relative deprivation and alcohol use in Denmark: Is there a relationship?

Scandinavian Journal of Public Health ◽

10.1177/1403494818787101 ◽

2018 ◽

Vol 47 (4) ◽

pp. 428-438 ◽

Cited By ~ 2

Author(s):

Kim Bloomfield ◽

Gabriele Berg-Beckhoff ◽

Abdu Kedir Seid ◽

Christiane Stock

Keyword(s):

Alcohol Use ◽

Relative Deprivation ◽

Health Behaviours ◽

Area Deprivation ◽

Individual Level ◽

Level Data ◽

Housing Component ◽

Disorder Identification ◽

The Individual ◽

The Impact

Aims: Greater area-level relative deprivation has been related to poorer health behaviours, but studies specifically on alcohol use and abuse have been equivocal. The main purpose of the present study was to investigate how area-level relative deprivation in Denmark relates to alcohol use and misuse in the country. Methods: As individual-level data, we used the national alcohol and drug survey of 2011 (n = 5133). Data were procured from Statistics Denmark to construct an index of relative deprivation at the parish level (n = 2119). The deprivation index has two components, which were divided into quintiles. Multilevel linear and logistic regressions analysed the influence of area deprivation on mean alcohol use and hazardous drinking, as measured by the Alcohol Use Disorder Identification Test. Results: Men who lived in parishes designated as ‘very deprived’ on the socioeconomic component were more likely to consume less alcohol; women who lived in parishes designated as ‘deprived’ on the housing component were less likely to drink hazardously. But at the individual level, education was positively related to mean alcohol consumption, and higher individual income was positively related to mean consumption for women. Higher-educated men were more likely to drink hazardously. Conclusions: Area-level measures of relative deprivation were not strongly related to alcohol use, yet in the same models individual-level socioeconomic variables had a more noticeable influence. This suggests that in a stronger welfare state, the impact of area-level relative deprivation may not be as great. Further work is needed to develop more sensitive measures of relative deprivation.

Personalization as a promise: Can Big Data change the practice of insurance?

Big Data & Society ◽

10.1177/2053951720935143 ◽

2020 ◽

Vol 7 (1) ◽

pp. 205395172093514 ◽

Cited By ~ 3

Author(s):

Laurence Barry ◽

Arthur Charpentier

Keyword(s):

Big Data ◽

Predictive Analytics ◽

Special Focus ◽

New Paradigm ◽

Individual Level ◽

Big Data Technologies ◽

State Of Research ◽

The Individual ◽

Homogeneity Hypothesis ◽

The Impact

The aim of this article is to assess the impact of Big Data technologies for insurance ratemaking, with a special focus on motor products. The first part shows how statistics and insurance mechanisms adopted the same aggregate viewpoint. It made visible regularities that were invisible at the individual level, further supporting the classificatory approach of insurance and the assumption that all members of a class are identical risks. The second part focuses on the reversal of perspective currently occurring in data analysis with predictive analytics, and how this conceptually contradicts the collective basis of insurance. The tremendous volume of data and the personalization promise through accurate individual prediction indeed deeply shakes the homogeneity hypothesis behind pooling. The third part attempts to assess the extent of this shift in motor insurance. Onboard devices that collect continuous driving behavioural data could import this new paradigm into these products. An examination of the current state of research on models with telematics data shows however that the epistemological leap, for now, has not happened.

SleepOMICS: How Big Data Can Revolutionize Sleep Science

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph16020291 ◽

2019 ◽

Vol 16 (2) ◽

pp. 291 ◽

Cited By ~ 11

Author(s):

Nicola Luigi Bragazzi ◽

Ottavia Guglielmi ◽

Sergio Garbarino

Keyword(s):

Developing Countries ◽

Big Data ◽

Sleep Disorders ◽

Real Life ◽

The Elderly ◽

Sleep Medicine ◽

Sleep Research ◽

Profound Impact ◽

Individual Level ◽

The Individual

Sleep disorders have reached epidemic proportions worldwide, affecting the youth as well as the elderly, crossing the entire lifespan in both developed and developing countries. “Real-life” behavioral (sensor-based), molecular, digital, and epidemiological big data represent a source of an impressive wealth of information that can be exploited in order to advance the field of sleep research. It can be anticipated that big data will have a profound impact, potentially enabling the dissection of differences and oscillations in sleep dynamics and architecture at the individual level (“sleepOMICS”), thus paving the way for a targeted, “one-size-does-not-fit-all” management of sleep disorders (“precision sleep medicine”).

Gamifying with badges: A big data natural experiment on Stack Exchange

First Monday ◽

10.5210/fm.v22i6.7299 ◽

2017 ◽

Cited By ~ 5

Author(s):

Benny Bornfeld ◽

Sheizaf Rafaeli

Keyword(s):

Big Data ◽

Effect Size ◽

Natural Experiment ◽

User Behavior ◽

Online Systems ◽

Individual Level ◽

The Individual

Badges are a common gamification mechanism used by many crowd-sourced online systems. This study provides evidence of their effectiveness and measures their effect size using a big data natural experiment in three large Stack Exchange online Q&A sites. We analyze the introduction of 22 different badge-launch events and the resulting changes in user behavior. Consistent with earlier studies, we report that most badge introductions have the desired effect. Going beyond traditional findings on the individual level, this study measures overall badge effect size on the service.
