Warped Bayesian Linear Regression for Normative Modelling of Big Data

AbstractNormative modelling is becoming more popular in neuroimaging due to its ability to make predictions of deviation from a normal trajectory at the level of individual participants. It allows the user to model the distribution of several neuroimaging modalities, giving an estimation for the mean and centiles of variation. With the increase in the availability of big data in neuroimaging, there is a need to scale normative modelling to big data sets. However, the scaling of normative models has come with several challenges.So far, most normative modelling approaches used Gaussian process regression, and although suitable for smaller datasets (up to a few thousand participants) it does not scale well to the large cohorts currently available and being acquired. Furthermore, most neuroimaging modelling methods that are available assume the predictive distribution to be Gaussian in shape. However, deviations from Gaussianity can be frequently found, which may lead to incorrect inferences, particularly in the outer centiles of the distribution. In normative modelling, we use the centiles to give an estimation of the deviation of a particular participant from the ‘normal’ trend. Therefore, especially in normative modelling, the correct estimation of the outer centiles is of utmost importance, which is also where data are sparsest.Here, we present a novel framework based on Bayesian Linear Regression with likelihood warping that allows us to address these problems, that is, to scale normative modelling elegantly to big data cohorts and to correctly model non-Gaussian predictive distributions. In addition, this method provides also likelihood-based statistics, which are useful for model selection.To evaluate this framework, we use a range of neuroimaging-derived measures from the UK Biobank study, including image-derived phenotypes (IDPs) and whole-brain voxel-wise measures derived from diffusion tensor imaging. We show good computational scaling and improved accuracy of the warped BLR for certain IDPs and voxels if there was a deviation from normality of these parameters in their residuals.The present results indicate the advantage of a warped BLR in terms of; computational scalability and the flexibility to incorporate non-linearity and non-Gaussianity of the data, giving a wider range of neuroimaging datasets that can be correctly modelled.

Download Full-text

A Detailed Study on Classification Algorithms in Big Data

Big Data Analytics for Sustainable Computing - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-9750-6.ch002 ◽

2020 ◽

pp. 30-46

Author(s):

Saranya N. ◽

Saravana Selvam

Keyword(s):

Big Data ◽

Random Forest ◽

Linear Regression ◽

Comprehensive Evaluation ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Classification Methods ◽

Computing Science ◽

Data Collections

After an era of managing data collection difficulties, these days the issue has turned into the problem of how to process these vast amounts of information. Scientists, as well as researchers, think that today, probably the most essential topic in computing science is Big Data. Big Data is used to clarify the huge volume of data that could exist in any structure. This makes it difficult for standard controlling approaches for mining the best possible data through such large data sets. Classification in Big Data is a procedure of summing up data sets dependent on various examples. There are distinctive classification frameworks which help us to classify data collections. A few methods that discussed in the chapter are Multi-Layer Perception Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The target of this chapter is to provide a comprehensive evaluation of classification methods that are in effect commonly utilized.

Download Full-text

Multiple Linear Regression: Bayesian Inference for Distributed and Big Data in the Medical Informatics Platform of the Human Brain Project

10.1101/242883 ◽

2018 ◽

Cited By ~ 2

Author(s):

Lester Melie-Garcia ◽

Bogdan Draganski ◽

John Ashburner ◽

Ferath Kherif

Keyword(s):

Big Data ◽

Linear Regression ◽

Human Brain ◽

Multiple Linear Regression ◽

Medical Informatics ◽

Data Privacy ◽

Heterogeneous Data ◽

Data Sets ◽

Explanatory Variables ◽

Human Brain Project

ABSTRACTWe propose a Multiple Linear Regression (MLR) methodology for the analysis of distributed and Big Data in the framework of the Medical Informatics Platform (MIP) of the Human Brain Project (HBP). MLR is a very versatile model, and is considered one of the workhorses for estimating dependences between clinical, neuropsychological and neurophysiological variables in the field of neuroimaging. One of the main concepts behind MIP is to federate data, which is stored locally in geographically distributed sites (hospitals, customized databases, etc.) around the world. We restrain from using a unique federation node for two main reasons: first the maintenance of data privacy, and second the efficiency in management of big volumes of data in terms of latency and storage resources needed in the federation node. Considering these conditions and the distributed nature of data, MLR cannot be estimated in the classical way, which raises the necessity of modifications of the standard algorithms. We use the Bayesian formalism that provides the armamentarium necessary to implement the MLR methodology for distributed Big Data. It allows us to account for the heterogeneity of the possible mechanisms that explain data sets across sites expressed through different models of explanatory variables. This approach enables the integration of highly heterogeneous data coming from different subjects and hospitals across the globe. Additionally, it offers general and sophisticated ways, which are extendable to other statistical models, to suit high-dimensional and distributed multimodal data. This work forms part of a series of papers related to the methodological developments embedded in the MIP.

Download Full-text

Big Data Bayesian Linear Regression and Variable Selection by Normal-Inverse-Gamma Summation

Bayesian Analysis ◽

10.1214/17-ba1083 ◽

2018 ◽

Vol 13 (4) ◽

pp. 1011-1035

Author(s):

Hang Qian

Keyword(s):

Big Data ◽

Linear Regression ◽

Variable Selection ◽

Inverse Gamma ◽

Bayesian Linear Regression

Download Full-text

Practical Data Synthesis for Large Samples

Journal of Privacy and Confidentiality ◽

10.29012/jpc.v7i3.407 ◽

2018 ◽

Vol 7 (3) ◽

pp. 67-97 ◽

Cited By ~ 11

Author(s):

Gillian M Raab ◽

Beata Nowok ◽

Chris Dibben

Keyword(s):

Longitudinal Study ◽

Synthetic Data ◽

Predictive Distribution ◽

Data Sets ◽

Posterior Predictive Distribution ◽

Data Set ◽

Data Synthesis ◽

Large Samples ◽

The Uk ◽

Variance Estimates

We describe results on the creation and use of synthetic data that were derived in the context of a project to make synthetic extracts available for users of the UK Longitudinal Studies. A critical review of existing methods of inference from large synthetic data sets is presented. We introduce new variance estimates for use with large samples of completely synthesised data that do not require them to be generated from the posterior predictive distribution derived from the observed data and can be used with a single synthetic data set. We make recommendations on how to synthesise data based on these results. The practical consequences of these results are illustrated with an example from the Scottish Longitudinal Study.

Download Full-text

Warped Bayesian linear regression for normative modelling of big data

NeuroImage ◽

10.1016/j.neuroimage.2021.118715 ◽

2021 ◽

Vol 245 ◽

pp. 118715

Author(s):

Charlotte J. Fraza ◽

Richard Dinga ◽

Christian F. Beckmann ◽

Andre F. Marquand

Keyword(s):

Big Data ◽

Linear Regression ◽

Bayesian Linear Regression

Download Full-text

Construction of 3-D Terrain Models from BIG Data Sets

10.21236/ada607383 ◽

2014 ◽

Author(s):

Pankaj K. Agarwal ◽

Thomas Moelhave

Keyword(s):

Big Data ◽

Data Sets ◽

Terrain Models

Download Full-text

Big Data Security Challenges and Solution of Distributed Computing in Hadoop Environment: A Security Framework

Recent Advances in Computer Science and Communications ◽

10.2174/2213275912666190822095422 ◽

2020 ◽

Vol 13 (4) ◽

pp. 790-797

Author(s):

Gurjit Singh Bhathal ◽

Amardeep Singh Dhiman

Keyword(s):

Big Data ◽

Data Security ◽

Data Sets ◽

Security Framework ◽

Hadoop Distributed File System ◽

Current Scenario ◽

Hadoop Cluster ◽

Ciphertext Policy ◽

In Transit ◽

Hadoop Framework

Background: In current scenario of internet, large amounts of data are generated and processed. Hadoop framework is widely used to store and process big data in a highly distributed manner. It is argued that Hadoop Framework is not mature enough to deal with the current cyberattacks on the data. Objective: The main objective of the proposed work is to provide a complete security approach comprising of authorisation and authentication for the user and the Hadoop cluster nodes and to secure the data at rest as well as in transit. Methods: The proposed algorithm uses Kerberos network authentication protocol for authorisation and authentication and to validate the users and the cluster nodes. The Ciphertext-Policy Attribute- Based Encryption (CP-ABE) is used for data at rest and data in transit. User encrypts the file with their own set of attributes and stores on Hadoop Distributed File System. Only intended users can decrypt that file with matching parameters. Results: The proposed algorithm was implemented with data sets of different sizes. The data was processed with and without encryption. The results show little difference in processing time. The performance was affected in range of 0.8% to 3.1%, which includes impact of other factors also, like system configuration, the number of parallel jobs running and virtual environment. Conclusion: The solutions available for handling the big data security problems faced in Hadoop framework are inefficient or incomplete. A complete security framework is proposed for Hadoop Environment. The solution is experimentally proven to have little effect on the performance of the system for datasets of different sizes.

Download Full-text

Is There a Relationship between the Elasticity of Brain Tumors, Changes in Diffusion Tensor Imaging, and Histological Findings? A Pilot Study Using Intraoperative Ultrasound Elastography

Brain Sciences ◽

10.3390/brainsci11020271 ◽

2021 ◽

Vol 11 (2) ◽

pp. 271

Author(s):

Santiago Cepeda ◽

Sergio García-García ◽

María Velasco-Casares ◽

Gabriel Fernández-Pérez ◽

Tomás Zamora ◽

...

Keyword(s):

Brain Tumor ◽

Diffusion Tensor Imaging ◽

Linear Regression ◽

Brain Tumors ◽

Diffusion Tensor ◽

Intraoperative Ultrasound ◽

Ultrasound Elastography ◽

Proliferation Index ◽

Strong Positive Correlation ◽

Non Linear

Intraoperative ultrasound elastography (IOUS-E) is a novel image modality applied in brain tumor assessment. However, the potential links between elastographic findings and other histological and neuroimaging features are unknown. This study aims to find associations between brain tumor elasticity, diffusion tensor imaging (DTI) metrics, and cell proliferation. A retrospective study was conducted to analyze consecutively admitted patients who underwent craniotomy for supratentorial brain tumors between March 2018 and February 2020. Patients evaluated by IOUS-E and preoperative DTI were included. A semi-quantitative analysis was performed to calculate the mean tissue elasticity (MTE). Diffusion coefficients and the tumor proliferation index by Ki-67 were registered. Relationships between the continuous variables were determined using the Spearman ρ test. A predictive model was developed based on non-linear regression using the MTE as the dependent variable. Forty patients were evaluated. The pathologic diagnoses were as follows: 21 high-grade gliomas (HGG); 9 low-grade gliomas (LGG); and 10 meningiomas. Cases with a proliferation index of less than 10% had significantly higher medians of MTE (110.34 vs. 79.99, p < 0.001) and fractional anisotropy (FA) (0.24 vs. 0.19, p = 0.020). We found a strong positive correlation between MTE and FA (rs (38) = 0.91, p < 0.001). A cubic spline non-linear regression model was obtained to predict tumoral MTE from FA (R2 = 0.78, p < 0.001). According to our results, tumor elasticity is associated with histopathological and DTI-derived metrics. These findings support the usefulness of IOUS-E as a complementary tool in brain tumor surgery.

Download Full-text

Analysis of diffusion tensor imaging data from the UK Biobank confirms dosage effect of 15q11.2 copy-number variation on white matter and shows association with cognition

Biological Psychiatry ◽

10.1016/j.biopsych.2021.02.969 ◽

2021 ◽

Author(s):

Ana I. Silva ◽

George Kirov ◽

Kimberley M. Kendall ◽

Mathew Bracher-Smith ◽

Lawrence S. Wilkinson ◽

...

Keyword(s):

Diffusion Tensor Imaging ◽

White Matter ◽

Copy Number Variation ◽

Copy Number ◽

Diffusion Tensor ◽

Uk Biobank ◽

Imaging Data ◽

Number Variation ◽

The Uk ◽

Diffusion Tensor Imaging Data

Download Full-text

DV-DVFS: merging data variety and DVFS technique to manage the energy consumption of big data processing

Journal Of Big Data ◽

10.1186/s40537-021-00437-7 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Hossein Ahmadvand ◽

Fouzhan Foroutan ◽

Mahmood Fathy

Keyword(s):

Big Data ◽

Energy Consumption ◽

Processing Time ◽

Experimental Results ◽

The Other ◽

Data Sets ◽

Multiple Sources ◽

Evaluation Phase ◽

Dynamic Voltage ◽

Processing Resources

AbstractData variety is one of the most important features of Big Data. Data variety is the result of aggregating data from multiple sources and uneven distribution of data. This feature of Big Data causes high variation in the consumption of processing resources such as CPU consumption. This issue has been overlooked in previous works. To overcome the mentioned problem, in the present work, we used Dynamic Voltage and Frequency Scaling (DVFS) to reduce the energy consumption of computation. To this goal, we consider two types of deadlines as our constraint. Before applying the DVFS technique to computer nodes, we estimate the processing time and the frequency needed to meet the deadline. In the evaluation phase, we have used a set of data sets and applications. The experimental results show that our proposed approach surpasses the other scenarios in processing real datasets. Based on the experimental results in this paper, DV-DVFS can achieve up to 15% improvement in energy consumption.

Download Full-text