Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD

2021 ◽  
Vol 17 (3) ◽  
pp. e1008880
Author(s):  
Yannick Marcon ◽  
Tom Bishop ◽  
Demetris Avraam ◽  
Xavier Escriba-Montagut ◽  
Patricia Ryser-Welch ◽  
...  

Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints which limit researchers’ ability to physically bring data together and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access their data. The platform allows an analyst to pass commands to each server and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal which is a data integration system used by epidemiological studies and developed by the OBiBa open source project in the domain of bioinformatics. However, until now the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture (“resources”) for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects. 
To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).
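The federated pattern described in this abstract can be illustrated with a minimal Python sketch. It is not DataSHIELD's or Opal's API. The class `StudyServer`, the function `pooled_mean`, and the threshold `MIN_CELL_COUNT` are hypothetical names chosen for illustration. The sketch assumes each site returns only an aggregate (a sum and a count) and refuses to answer when too few records are involved.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical disclosure threshold: a site refuses to summarise fewer
# records than this (an assumption for illustration only).
MIN_CELL_COUNT = 5


@dataclass
class StudyServer:
    """Stands in for one institution's server; raw values never leave it."""
    name: str
    _values: List[float]  # individual-level data, kept on the "server"

    def summarise(self) -> Optional[Tuple[float, int]]:
        """Return (sum, count) only, or None if the result could be disclosive."""
        n = len(self._values)
        if n < MIN_CELL_COUNT:
            return None
        return (sum(self._values), n)


def pooled_mean(servers: List[StudyServer]) -> float:
    """Combine per-site aggregates into one mean on the analyst's side."""
    total = 0.0
    count = 0
    for server in servers:
        summary = server.summarise()
        if summary is None:
            continue  # site declined to answer; skip it
        site_sum, site_n = summary
        total += site_sum
        count += site_n
    if count == 0:
        raise ValueError("no site returned a non-disclosive summary")
    return total / count
```

The analyst only ever sees sums and counts, so the pooled mean equals the mean computed over the combined data from the responding sites without any individual record being transferred.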

2021 ◽  
pp. 003329412110268
Author(s):  
Jaime Ballard ◽  
Adeya Richmond ◽  
Suzanne van den Hoogenhof ◽  
Lynne Borden ◽  
Daniel Francis Perkins

Background: Multilevel data can be missing at the individual level or at a nested level, such as family, classroom, or program site. Increased knowledge of higher-level missing data is necessary to develop evaluation design and statistical methods to address it. Methods: Participants included 9,514 individuals participating in 47 youth and family programs nationwide who completed multiple self-report measures before and after program participation. Data were marked as missing or not missing at the item, scale, and wave levels for both individuals and program sites. Results: Site-level missing data represented a substantial portion of missing data, ranging from 0–46% of missing data at pre-test and 35–71% of missing data at post-test. Youth were the most likely to be missing data, although site-level data did not differ by the age of participants served. In this dataset youth had the most surveys to complete, so their missing data could be due to survey fatigue. Conclusions: Much of the missing data for individuals can be explained by the site not administering those questions or scales. These results suggest a need for statistical methods that account for site-level missing data, and for research design methods to reduce the prevalence of site-level missing data or reduce its impact. Researchers can generate buy-in with sites during the community collaboration stage, assessing problematic items for revision or removal and need for ongoing site support, particularly at post-test. We recommend that researchers conducting multilevel data report the amount and mechanism of missing data at each level.


2017 ◽  
Author(s):  
Josine Min ◽  
Gibran Hemani ◽  
George Davey Smith ◽  
Caroline Relton ◽  
Matthew Suderman

Background: Technological advances in high throughput DNA methylation microarrays have allowed dramatic growth of a new branch of epigenetic epidemiology. DNA methylation datasets are growing ever larger in terms of the number of samples profiled, the extent of genome coverage, and the number of studies being meta-analysed. Novel computational solutions are required to efficiently handle these data. Methods: We have developed meffil, an R package designed to quality control, normalize and perform epigenome-wide association studies (EWAS) efficiently on large samples of Illumina Infinium HumanMethylation450 and MethylationEPIC BeadChip microarrays. We tested meffil by applying it to 6000 450k microarrays generated from blood collected for two different datasets, Accessible Resource for Integrative Epigenomic Studies (ARIES) and The Genetics of Overweight Young Adults (GOYA) study. Results: A complete reimplementation of functional normalization minimizes computational memory requirements to 5% of that required by other R packages, without increasing running time. Incorporating fixed and random effects alongside functional normalization, and automated estimation of functional normalisation parameters reduces technical variation in DNA methylation levels, thus reducing false positive associations and improving power. We also demonstrate that the ability to normalize datasets distributed across physically different locations without sharing any biologically-based individual-level data may reduce heterogeneity in meta-analyses of epigenome-wide association studies. However, we show that when batch is perfectly confounded with cases and controls, functional normalization is unable to prevent spurious associations. Conclusions: meffil is available online (https://github.com/perishky/meffil/) along with tutorials covering typical use cases.


2020 ◽  
Author(s):  
Xing Zhao ◽  
Feng Hong ◽  
Jianzhong Yin ◽  
Wenge Tang ◽  
Gang Zhang ◽  
...  

Cohort purpose: The China Multi-Ethnic Cohort (CMEC) is a community population-based prospective observational study aiming to address the urgent need for understanding NCD prevalence, risk factors and associated conditions in resource-constrained settings for ethnic minorities in China. Cohort Basics: A total of 99 556 participants aged 30 to 79 years (Tibetan populations include those aged 18 to 30 years) from the Tibetan, Yi, Miao, Bai, Bouyei, and Dong ethnic groups in Southwest China were recruited between May 2018 and September 2019. Follow-up and attrition: All surviving study participants will be invited for re-interviews every 3-5 years with concise questionnaires to review risk exposures and disease incidence. Furthermore, the vital status of study participants will be followed up through linkage with established electronic disease registries annually. Design and Measures: The CMEC baseline survey collected data with an electronic questionnaire and face-to-face interviews, medical examinations and clinical laboratory tests. Furthermore, we collected biological specimens, including blood, saliva and stool, for long-term storage. In addition to the individual level data, we also collected regional level data for each investigation site. Collaboration and data access: Collaborations are welcome. Please send specific ideas to the corresponding author at: [email protected].


Author(s):  
Sarah Lowe ◽  
Laura McGinn ◽  
Marcos Quintela ◽  
Luke Player ◽  
Karen Tingay

Background: Flying Start (FS) is the Welsh Government’s (WG) flagship Early Years programme for families with children under 4 years of age. Running since 2006, the four entitlements are: free part-time childcare for 2-3 year olds; enhanced health visiting; parenting support; and speech, language, and communication support. Objectives: Currently, while we know which areas in Wales are receiving FS support, individual-level data on which child received what entitlements is not available. Area-level outcomes can be used as proxy indicators but the individual impact of receiving FS support cannot be examined. The project aims to evaluate FS by linking the FS cohort to a range of outcomes including health, education and social care. Methods: A Dataflow Development Project (DDP) has been launched to install SAIL (Secure Anonymised Information Linkage) appliances into 6 pilot Local Authorities in Wales, which will test acquiring and linking the individual-level FS data from pilot Local Authorities with other datasets in SAIL. Findings: The project will report some emerging findings from the analysis of pilot data. Implications: There is a growing interest in using linked administrative data to evaluate government initiatives, and mounting enthusiasm in Local Government. If successful, this model is likely to be adopted by related WG programmes, improving the evidence base, facilitating effective evaluation, and adding to the data available for re-use in Wales.


2017 ◽  
Vol 59 (7/8) ◽  
pp. 856-870 ◽  
Author(s):  
Soodeh Mohammadinezhad ◽  
Maryam Sharifzadeh

Purpose: The purpose of this paper is to investigate the importance of academic courses for agricultural entrepreneurship. Design/methodology/approach: A modified global entrepreneurship and development index (GEDI) was used to determine entrepreneurial dimensions among 19 graduates of agricultural colleges residing in Iran. A fuzzy analytical hierarchy process was applied to understand agricultural graduates’ preferences regarding the effectiveness of university courses (core, free elective and restricted elective). Findings: Results suggested the importance of professional restricted elective courses in providing students with the necessary skills. These courses were successful in providing a context for an entrepreneurial profile. Research limitations/implications: Whether entrepreneurship stems from innate talent or acquired skills has long been debated. The paper builds on the premise that entrepreneurs are made through education and the continuing reconstruction of experience; further research is required as the field develops in experience and complexity. Practical implications: The paper provides strategies to effectively modify the practical route in higher education to enhance entrepreneurial orientation among students. Originality/value: The paper is innovative at a conceptual level in modifying GEDI elements into individual-level variables based on GEDI configuration theory. This approach is particularly useful in addressing the bottleneck problems of the entrepreneurship profile and focusses on the information interpreted at weights of the individual-level data.


1987 ◽  
Vol 20 (1) ◽  
pp. 3-33 ◽  
Author(s):  
JOHN R. HIBBING

This is an analysis of the effects of economic factors on voting behavior in the United Kingdom. Aggregate- and individual-level data are used. When the results are compared to findings generated by the United States case, some intriguing differences appear. To mention just two examples, unemployment and inflation seem to be much more important in the United Kingdom than in the United States, and changes in real per capita income are positively related to election results in the United States and negatively related in the United Kingdom. More generally, while the aggregate results are strong and the individual-level results weak in the United States, in the United Kingdom the situation is practically reversed.


2018 ◽  
Vol 47 (4) ◽  
pp. 428-438 ◽  
Author(s):  
Kim Bloomfield ◽  
Gabriele Berg-Beckhoff ◽  
Abdu Kedir Seid ◽  
Christiane Stock

Aims: Greater area-level relative deprivation has been related to poorer health behaviours, but studies specifically on alcohol use and abuse have been equivocal. The main purpose of the present study was to investigate how area-level relative deprivation in Denmark relates to alcohol use and misuse in the country. Methods: As individual-level data, we used the national alcohol and drug survey of 2011 (n = 5133). Data were procured from Statistics Denmark to construct an index of relative deprivation at the parish level (n = 2119). The deprivation index has two components, which were divided into quintiles. Multilevel linear and logistic regressions analysed the influence of area deprivation on mean alcohol use and hazardous drinking, as measured by the Alcohol Use Disorder Identification Test. Results: Men who lived in parishes designated as ‘very deprived’ on the socioeconomic component were more likely to consume less alcohol; women who lived in parishes designated as ‘deprived’ on the housing component were less likely to drink hazardously. But at the individual level, education was positively related to mean alcohol consumption, and higher individual income was positively related to mean consumption for women. Higher-educated men were more likely to drink hazardously. Conclusions: Area-level measures of relative deprivation were not strongly related to alcohol use, yet in the same models individual-level socioeconomic variables had a more noticeable influence. This suggests that in a stronger welfare state, the impact of area-level relative deprivation may not be as great. Further work is needed to develop more sensitive measures of relative deprivation.


2020 ◽  
Vol 7 (1) ◽  
pp. 205395172093514 ◽  
Author(s):  
Laurence Barry ◽  
Arthur Charpentier

The aim of this article is to assess the impact of Big Data technologies for insurance ratemaking, with a special focus on motor products. The first part shows how statistics and insurance mechanisms adopted the same aggregate viewpoint. It made visible regularities that were invisible at the individual level, further supporting the classificatory approach of insurance and the assumption that all members of a class are identical risks. The second part focuses on the reversal of perspective currently occurring in data analysis with predictive analytics, and how this conceptually contradicts the collective basis of insurance. The tremendous volume of data and the personalization promise through accurate individual prediction indeed deeply shakes the homogeneity hypothesis behind pooling. The third part attempts to assess the extent of this shift in motor insurance. Onboard devices that collect continuous driving behavioural data could import this new paradigm into these products. An examination of the current state of research on models with telematics data shows however that the epistemological leap, for now, has not happened.


Author(s):  
Nicola Luigi Bragazzi ◽  
Ottavia Guglielmi ◽  
Sergio Garbarino

Sleep disorders have reached epidemic proportions worldwide, affecting the youth as well as the elderly, crossing the entire lifespan in both developed and developing countries. “Real-life” behavioral (sensor-based), molecular, digital, and epidemiological big data represent a source of an impressive wealth of information that can be exploited in order to advance the field of sleep research. It can be anticipated that big data will have a profound impact, potentially enabling the dissection of differences and oscillations in sleep dynamics and architecture at the individual level (“sleepOMICS”), thus paving the way for a targeted, “one-size-does-not-fit-all” management of sleep disorders (“precision sleep medicine”).


First Monday ◽  
2017 ◽  
Author(s):  
Benny Bornfeld ◽  
Sheizaf Rafaeli

Badges are a common gamification mechanism used by many crowd-sourced online systems. This study provides evidence of their effectiveness and measures their effect size using a big data natural experiment in three large Stack Exchange online Q&A sites. We analyze the introduction of 22 different badge-launch events and the resulting changes in user behavior. Consistent with earlier studies, we report that most badge introductions have the desired effect. Going beyond traditional findings on the individual level, this study measures overall badge effect size on the service.

