high dimensional data
Recently Published Documents


TOTAL DOCUMENTS

2305
(FIVE YEARS 616)

H-INDEX

74
(FIVE YEARS 11)

2022 ◽  
Vol 15 ◽  
Author(s):  
Fan Wu ◽  
Jiahui Cai ◽  
Canhong Wen ◽  
Haizhu Tan

Non-negative matrix factorization, which decomposes the input non-negative matrix into product of two non-negative matrices, has been widely used in the neuroimaging field due to its flexible interpretability with non-negativity property. Nowadays, especially in the neuroimaging field, it is common to have at least thousands of voxels while the sample size is only hundreds. The non-negative matrix factorization encounters both computational and theoretical challenge with such high-dimensional data, i.e., there is no guarantee for a sparse and part-based representation of data. To this end, we introduce a co-sparse non-negative matrix factorization method to high-dimensional data by simultaneously imposing sparsity in both two decomposed matrices. Instead of adding some sparsity induced penalty such as l1 norm, the proposed method directly controls the number of non-zero elements, which can avoid the bias issues and thus yield more accurate results. We developed an alternative primal-dual active set algorithm to derive the co-sparse estimator in a computationally efficient way. The simulation studies showed that our method achieved better performance than the state-of-art methods in detecting the basis matrix and recovering signals, especially under the high-dimensional scenario. In empirical experiments with two neuroimaging data, the proposed method successfully detected difference between Alzheimer's patients and normal person in several brain regions, which suggests that our method may be a valuable toolbox for neuroimaging studies.


2022 ◽  
Vol 2022 ◽  
pp. 1-17
Author(s):  
Zhihui Hu ◽  
Xiaoran Wei ◽  
Xiaoxu Han ◽  
Guang Kou ◽  
Haoyu Zhang ◽  
...  

Density peaks clustering (DPC) is a well-known density-based clustering algorithm that can deal with nonspherical clusters well. However, DPC has high computational complexity and space complexity in calculating local density ρ and distance δ , which makes it suitable only for small-scale data sets. In addition, for clustering high-dimensional data, the performance of DPC still needs to be improved. High-dimensional data not only make the data distribution more complex but also lead to more computational overheads. To address the above issues, we propose an improved density peaks clustering algorithm, which combines feature reduction and data sampling strategy. Specifically, features of the high-dimensional data are automatically extracted by principal component analysis (PCA), auto-encoder (AE), and t-distributed stochastic neighbor embedding (t-SNE). Next, in order to reduce the computational overhead, we propose a novel data sampling method for the low-dimensional feature data. Firstly, the data distribution in the low-dimensional feature space is estimated by the Quasi-Monte Carlo (QMC) sequence with low-discrepancy characteristics. Then, the representative QMC points are selected according to their cell densities. Next, the selected QMC points are used to calculate ρ and δ instead of the original data points. In general, the number of the selected QMC points is much smaller than that of the initial data set. Finally, a two-stage classification strategy based on the QMC points clustering results is proposed to classify the original data set. Compared with current works, our proposed algorithm can reduce the computational complexity from O n 2 to O N n , where N denotes the number of selected QMC points and n is the size of original data set, typically N ≪ n . Experimental results demonstrate that the proposed algorithm can effectively reduce the computational overhead and improve the model performance.


2022 ◽  
Author(s):  
Seunghwan Park ◽  
Hae-Wwan Lee ◽  
Jongho Im

<div>We consider the binary classification of imbalanced data. A dataset is imbalanced if the proportion of classes are heavily skewed. Imbalanced data classification is often challengeable, especially for high-dimensional data, because unequal classes deteriorate classifier performance. Under sampling the majority class or oversampling the minority class are popular methods to construct balanced samples, facilitating classification performance improvement. However, many existing sampling methods cannot be easily extended to high-dimensional data and mixed data, including categorical variables, because they often require approximating the attribute distributions, which becomes another critical issue. In this paper, we propose a new sampling strategy employing raking and relabeling procedures, such that the attribute values of the majority class are imputed for the values of the minority class in the construction of balanced samples. The proposed algorithms produce comparable performance as existing popular methods but are more flexible regarding the data shape and attribute size. The sampling algorithm is attractive in practice, considering that it does not require density estimation for synthetic data generation in oversampling and is not bothered by mixed-type variables. In addition, the proposed sampling strategy is robust to classifiers in the sense that classification performance is not sensitive to choosing the classifiers.</div>


2022 ◽  
Author(s):  
Seunghwan Park ◽  
Hae-Wwan Lee ◽  
Jongho Im

<div>We consider the binary classification of imbalanced data. A dataset is imbalanced if the proportion of classes are heavily skewed. Imbalanced data classification is often challengeable, especially for high-dimensional data, because unequal classes deteriorate classifier performance. Under sampling the majority class or oversampling the minority class are popular methods to construct balanced samples, facilitating classification performance improvement. However, many existing sampling methods cannot be easily extended to high-dimensional data and mixed data, including categorical variables, because they often require approximating the attribute distributions, which becomes another critical issue. In this paper, we propose a new sampling strategy employing raking and relabeling procedures, such that the attribute values of the majority class are imputed for the values of the minority class in the construction of balanced samples. The proposed algorithms produce comparable performance as existing popular methods but are more flexible regarding the data shape and attribute size. The sampling algorithm is attractive in practice, considering that it does not require density estimation for synthetic data generation in oversampling and is not bothered by mixed-type variables. In addition, the proposed sampling strategy is robust to classifiers in the sense that classification performance is not sensitive to choosing the classifiers.</div>


2022 ◽  
Vol 70 (3) ◽  
pp. 5363-5381
Author(s):  
Amgad Muneer ◽  
Shakirah Mohd Taib ◽  
Suliman Mohamed Fati ◽  
Abdullateef O. Balogun ◽  
Izzatdin Abdul Aziz

2022 ◽  
Author(s):  
Teng Wu ◽  
Runmin Wang ◽  
Hao Yan ◽  
Xiaofeng Shao

Patterns ◽  
2022 ◽  
pp. 100424
Author(s):  
Shanshan Jia ◽  
Xingyi Li ◽  
Tiejun Huang ◽  
Jian K. Liu ◽  
Zhaofei Yu

Sign in / Sign up

Export Citation Format

Share Document