Prewhitening High-Dimensional fMRI Data Sets Without Eigendecomposition

2014 ◽  
Vol 26 (5) ◽  
pp. 907-919 ◽  
Author(s):  
Abd-Krim Seghouane ◽  
Yousef Saad

This letter proposes an algorithm for linear whitening that minimizes the mean squared error between the original and whitened data without using the truncated eigendecomposition (ED) of the covariance matrix of the original data. This algorithm uses Lanczos vectors to accurately approximate the major eigenvectors and eigenvalues of the covariance matrix of the original data. The major advantage of the proposed whitening approach is its low computational cost when compared with that of the truncated ED. This gain comes without sacrificing accuracy, as illustrated with an experiment of whitening a high-dimensional fMRI data set.

Author(s):  
Dingwen Tao ◽  
Sheng Di ◽  
Hanqi Guo ◽  
Zizhong Chen ◽  
Franck Cappello

Because of the vast volume of data being produced by today’s scientific simulations and experiments, a lossy data compressor that allows user-controlled loss of accuracy during compression is a relevant solution for significantly reducing the data size. However, lossy compressor developers and users lack a tool to explore the features of scientific data sets and understand the data alteration after compression in a systematic and reliable way. To address this gap, we have designed and implemented a generic framework called Z-checker. On the one hand, Z-checker combines a battery of data analysis components for data compression. On the other hand, Z-checker is implemented as an open-source community tool to which users and developers can contribute and add new analysis components based on their additional analysis demands. In this article, we present a survey of existing lossy compressors. Then, we describe the design framework of Z-checker, in which we integrated evaluation metrics proposed in prior work as well as other analysis tools. Specifically, for lossy compressor developers, Z-checker can be used to characterize critical properties (such as entropy, distribution, power spectrum, principal component analysis, and autocorrelation) of any data set to improve compression strategies. For lossy compression users, Z-checker can assess the compression quality (compression ratio and bit rate) and provide various global distortion analyses comparing the original data with the decompressed data (peak signal-to-noise ratio, normalized mean squared error, rate–distortion, rate-compression error, spectral, distribution, and derivatives) and statistical analysis of the compression error (maximum, minimum, and average error; autocorrelation; and distribution of errors).
Z-checker can perform the analysis with either coarse granularity (throughout the whole data set) or fine granularity (by user-defined blocks), such that the users and developers can select the best fit, adaptive compressors for different parts of the data set. Z-checker features a visualization interface displaying all analysis results in addition to some basic views of the data sets such as time series. To the best of our knowledge, Z-checker is the first tool designed to assess lossy compression comprehensively for scientific data sets.
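As a rough illustration of the global distortion metrics named in this abstract, the following sketch computes the peak signal-to-noise ratio, a normalized mean squared error, and the maximum absolute error between an original and a decompressed series. It is not Z-checker's code or API: the function names are hypothetical, PSNR is assumed to use the value range of the original data as the peak, and NMSE is assumed to be the MSE divided by the mean squared value of the original data (other normalizations are in use).

```python
import math


def mse(original, decompressed):
    """Mean squared error between two equal-length sequences."""
    if len(original) != len(decompressed):
        raise ValueError("inputs must have the same length")
    return sum((a - b) ** 2 for a, b in zip(original, decompressed)) / len(original)


def psnr(original, decompressed):
    """PSNR in dB, taking the value range of the original as the peak (assumption)."""
    value_range = max(original) - min(original)
    if value_range == 0:
        raise ValueError("original data has zero value range")
    err = mse(original, decompressed)
    if err == 0:
        return math.inf  # lossless reconstruction
    return 20 * math.log10(value_range) - 10 * math.log10(err)


def nmse(original, decompressed):
    """MSE divided by the mean squared value of the original (one convention)."""
    power = sum(a * a for a in original) / len(original)
    if power == 0:
        raise ValueError("original data has zero power")
    return mse(original, decompressed) / power


def max_abs_error(original, decompressed):
    """Largest pointwise absolute compression error."""
    return max(abs(a - b) for a, b in zip(original, decompressed))


original = [0.0, 1.0, 2.0, 3.0]
decompressed = [0.1, 0.9, 2.1, 2.9]  # made-up values for illustration
print(psnr(original, decompressed))
print(nmse(original, decompressed))
print(max_abs_error(original, decompressed))
```

Applying the same functions to user-defined blocks of the data rather than to the whole series would correspond to the fine-granularity mode described above.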


Symmetry ◽  
2021 ◽  
Vol 13 (4) ◽  
pp. 645
Author(s):  
Muhammad Farooq ◽  
Sehrish Sarfraz ◽  
Christophe Chesneau ◽  
Mahmood Ul Hassan ◽  
Muhammad Ali Raza ◽  
...  

Expectiles have gained considerable attention in recent years due to wide applications in many areas. In this study, the k-nearest neighbours approach, together with the asymmetric least squares loss function, called ex-kNN, is proposed for computing expectiles. Firstly, the effect of various distance measures on ex-kNN in terms of test error and computational time is evaluated. It is found that Canberra, Lorentzian, and Soergel distance measures lead to minimum test error, whereas Euclidean, Canberra, and Average of (L1,L∞) lead to a low computational cost. Secondly, the performance of ex-kNN is compared with existing packages er-boost and ex-svm for computing expectiles that are based on nine real life examples. Depending on the nature of data, the ex-kNN showed two to 10 times better performance than er-boost and comparable performance with ex-svm regarding test error. Computationally, the ex-kNN is found two to five times faster than ex-svm and much faster than er-boost, particularly, in the case of high dimensional data.
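The idea of combining k-nearest neighbours with the asymmetric least squares loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' ex-kNN implementation: the function names are hypothetical, the distance is plain Euclidean (one of the measures the abstract reports as computationally cheap), and the expectile of the neighbours' responses is found with a simple iteratively reweighted mean.

```python
import math


def expectile(values, tau, tol=1e-10, max_iter=200):
    """tau-expectile of a sample: the minimizer of the asymmetric least squares loss.

    Observations above the current estimate get weight tau, the rest get
    weight 1 - tau; the estimate is the weighted mean, iterated to a fixed point.
    """
    if not 0 < tau < 1:
        raise ValueError("tau must lie strictly between 0 and 1")
    estimate = sum(values) / len(values)  # tau = 0.5 gives the ordinary mean
    for _ in range(max_iter):
        weights = [tau if y > estimate else 1 - tau for y in values]
        updated = sum(w * y for w, y in zip(weights, values)) / sum(weights)
        if abs(updated - estimate) < tol:
            return updated
        estimate = updated
    return estimate


def knn_expectile(train_x, train_y, query, k, tau):
    """Expectile of the responses of the k training points nearest to query."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    neighbours = [y for _, y in by_distance[:k]]
    return expectile(neighbours, tau)


# Made-up data for illustration only.
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 100.0]
print(knn_expectile(train_x, train_y, (1.0,), k=3, tau=0.5))
print(knn_expectile(train_x, train_y, (1.0,), k=3, tau=0.9))
```

Swapping `math.dist` for another distance function is the only change needed to try the other measures compared in the study.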


Author(s):  
Danlei Xu ◽  
Lan Du ◽  
Hongwei Liu ◽  
Penghui Wang

A Bayesian classifier for sparsity-promoting feature selection is developed in this paper, where a set of nonlinear mappings of the original data is performed as a pre-processing step. The linear classification model with such mappings from the original input space to a nonlinear transformation space can not only construct the nonlinear classification boundary, but also realize the feature selection for the original data. A zero-mean Gaussian prior with Gamma precision and a finite approximation of the Beta process prior are used to promote sparsity in the utilization of features and nonlinear mappings in our model, respectively. We derive the Variational Bayesian (VB) inference algorithm for the proposed linear classifier. Experimental results based on a synthetic data set, a measured radar data set, a high-dimensional gene expression data set, and several benchmark data sets demonstrate the aggressive and robust feature selection capability and comparable classification accuracy of our method compared with some other existing classifiers.


Author(s):  
Andrew J. Connolly ◽  
Jacob T. VanderPlas ◽  
Alexander Gray ◽  
...  

With the dramatic increase in data available from a new generation of astronomical telescopes and instruments, many analyses must address the question of the complexity as well as size of the data set. This chapter deals with how we can learn which measurements, properties, or combinations thereof carry the most information within a data set. It describes techniques that are related to concepts discussed when describing Gaussian distributions, density estimation, and the concepts of information content. The chapter begins with an exploration of the problems posed by high-dimensional data. It then describes the data sets used in this chapter, and introduces perhaps the most important and widely used dimensionality reduction technique, principal component analysis (PCA). The remainder of the chapter discusses several alternative techniques which address some of the weaknesses of PCA.


Author(s):  
Roland Winkler ◽  
Frank Klawonn ◽  
Rudolf Kruse

High dimensions have a devastating effect on the FCM algorithm and similar algorithms. One effect is that the prototypes run into the centre of gravity of the entire data set. The objective function must have a local minimum in the centre of gravity that causes FCM’s behaviour. In this paper, the authors examine this problem. This paper answers the following questions: How many dimensions are necessary to cause an ill behaviour of FCM? How does the number of prototypes influence the behaviour? Why does the objective function have a local minimum in the centre of gravity? How must FCM be initialised to avoid the local minima in the centre of gravity? To understand the behaviour of the FCM algorithm and answer the above questions, the authors examine the values of the objective function and develop three test environments that consist of artificially generated data sets to provide a controlled environment. The paper concludes that FCM can only be applied successfully in high dimensions if the prototypes are initialized very close to the cluster centres.


Genes ◽  
2020 ◽  
Vol 11 (7) ◽  
pp. 717
Author(s):  
Garba Abdulrauf Sharifai ◽  
Zurinahni Zainol

Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but a massive number of features (high dimensionality). High dimensional and imbalanced data sets have posed severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers have investigated either imbalanced class or high dimensional data sets and come up with various methods. Nonetheless, few approaches reported in the literature have addressed the intersection of the high dimensional and imbalanced class problems, due to their complicated interactions. Lately, feature selection has become a well-known technique that has been used to overcome this problem by selecting discriminative features that represent the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to construct the feature selection process as an optimisation problem to select the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by the proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high dimensional and imbalanced datasets in terms of G-mean and the Area Under the Curve (AUC) performance metrics.
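For readers unfamiliar with the G-mean metric mentioned in this abstract, the sketch below computes it for a binary problem as the geometric mean of sensitivity and specificity, which is the usual definition for imbalanced classification. The function name and the toy labels are hypothetical and are not taken from the paper.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    return math.sqrt(sensitivity * specificity)


# Made-up labels: 2 minority (1) and 4 majority (0) samples.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
always_majority = [0, 0, 0, 0, 0, 0]

print(g_mean(y_true, y_pred))
print(g_mean(y_true, always_majority))
```

A classifier that always predicts the majority class reaches an accuracy of 4/6 on these labels but a G-mean of zero, which is why the metric is preferred over accuracy for imbalanced data.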


2011 ◽  
Vol 1 (1) ◽  
pp. 1-16 ◽  
Author(s):  
Roland Winkler ◽  
Frank Klawonn ◽  
Rudolf Kruse

High dimensions have a devastating effect on the FCM algorithm and similar algorithms. One effect is that the prototypes run into the centre of gravity of the entire data set. The objective function must have a local minimum in the centre of gravity that causes FCM’s behaviour. In this paper, the authors examine this problem. This paper answers the following questions: How many dimensions are necessary to cause an ill behaviour of FCM? How does the number of prototypes influence the behaviour? Why does the objective function have a local minimum in the centre of gravity? How must FCM be initialised to avoid the local minima in the centre of gravity? To understand the behaviour of the FCM algorithm and answer the above questions, the authors examine the values of the objective function and develop three test environments that consist of artificially generated data sets to provide a controlled environment. The paper concludes that FCM can only be applied successfully in high dimensions if the prototypes are initialized very close to the cluster centres.


Geophysics ◽  
2020 ◽  
Vol 85 (4) ◽  
pp. A25-A29
Author(s):  
Lele Zhang

Migration of seismic reflection data leads to artifacts due to the presence of internal multiple reflections. Recent developments have shown that these artifacts can be avoided using Marchenko redatuming or Marchenko multiple elimination. These are powerful concepts, but their implementation comes at a considerable computational cost. We have derived a scheme to image the subsurface of the medium with significantly reduced computational cost and artifacts. This scheme is based on the projected Marchenko equations. The measured reflection response is required as input, and a data set with primary reflections and nonphysical primary reflections is created. Original and retrieved data sets are migrated, and the migration images are multiplied with each other, after which the square root is taken to give the artifact-reduced image. We present the underlying theory and illustrate the effectiveness of this scheme with a 2D numerical example.


2019 ◽  
Vol 34 (9) ◽  
pp. 1369-1383 ◽  
Author(s):  
Dirk Diederen ◽  
Ye Liu

With the ongoing development of distributed hydrological models, flood risk analysis calls for synthetic, gridded precipitation data sets. The availability of large, coherent, gridded re-analysis data sets, in combination with the increase in computational power, accommodates the development of new methodology to generate such synthetic data. We tracked moving precipitation fields and classified them using self-organising maps. For each class, we fitted a multivariate mixture model and generated a large set of synthetic, coherent descriptors, which we used to reconstruct moving synthetic precipitation fields. We introduced randomness in the original data set by replacing the observed precipitation fields with the synthetic precipitation fields. The output is a continuous, gridded, hourly precipitation data set of a much longer duration, containing physically plausible and spatio-temporally coherent precipitation events. The proposed methodology implicitly provides an important improvement in the spatial coherence of precipitation extremes. We investigate the issue of unrealistic, sudden changes on the grid and demonstrate how a dynamic spatio-temporal generator can provide spatial smoothness in the probability distribution parameters and hence in the return level estimates.


2020 ◽  
Vol 36 (4) ◽  
pp. 1175-1188
Author(s):  
Pierre Lamarche ◽  
Friderike Oehler ◽  
Irene Rioboo

Poverty indicators based purely on income statistics do not reflect the full picture of households’ economic well-being. Consumption and wealth are two additional key dimensions that determine the economic opportunities of people or material inequalities. We use non-parametric statistical matching methods to join consumption data from the Household Budget Survey to micro data from the European Union Statistics on Income and Living Conditions. In a second step, micro data from the Household Finance and Consumption Survey are joined to produce a common distribution of income, consumption and wealth variables. A variety of different indicators is then produced based on this joint data set, in particular household saving rates. Care has to be taken when interpreting the indicators, since the statistical matching is based on strong assumptions and a limited number of variables common to all three original data sets. We are able to show, however, that the assumptions made are justified by the use of strong proxies as matching variables. Thus, the resulting indicators have the potential to contribute to the analysis of inequality patterns and enhance the possibilities of social, and possibly fiscal, policy impact analysis.

