High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries

2015 ◽  
Vol 14 (Suppl 5) ◽  
pp. CIN.S30804 ◽  
Author(s):  
Amin Zollanvari

High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises, in which the sample size grows unboundedly for a fixed dimensionality of observations. Much work has been done in developing mathematical-statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal of this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject.
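
The core difficulty motivated above, classical estimators breaking down when the number of variables exceeds the sample size, can be seen in a few lines: with fewer observations than variables, the sample covariance matrix is necessarily singular, so any classical procedure that inverts it (e.g. linear discriminant analysis) fails outright. A minimal illustration (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50  # sample size smaller than dimension
X = rng.standard_normal((n, p))

# Sample covariance matrix (p x p) estimated from only n observations.
S = np.cov(X, rowvar=False)

# With p > n the sample covariance has rank at most n - 1, so it is
# singular and cannot be inverted as classical methods require.
rank = np.linalg.matrix_rank(S)
print(rank)      # 19, i.e. at most n - 1
print(rank < p)  # True: S is singular
```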

2013 ◽  
Vol 6 (1) ◽  
pp. 10-18 ◽  
Author(s):  
Shuo Chen ◽  
Edward Grant ◽  
Tong Tong Wu ◽  
F. DuBois Bowman

Radiocarbon ◽  
1978 ◽  
Vol 20 (3) ◽  
pp. 313-332 ◽  
Author(s):  
Helmut Erlenkeuser

The optimum operating conditions providing minimum run-time and running costs have been studied theoretically for a thermal diffusion plant to be used for the enrichment of the radiocarbon isotope from a finite sample size. The calculations are based on a simple approximate model of the enrichment process, regarding the isotope separation column as operating under quasi-stationary state conditions. The temporal variation of the isotope accumulation is given by a single exponential term. From comparison with the numerical solution of the separation tube equation, approximate models of this simple type appear hardly sufficient for analytical work but seem well suited for optimization calculations. For column operation not too close to the equilibrium state, the approximate run-times were found to be accurate within 0.2 d. The approximate model has been applied to a column of the concentric type, operated on gaseous methane. Cross-section configuration and temperatures were not varied (hot and cold wall radii: 2.0 and 2.6 cm, respectively; hot and cold wall temperatures: 400°C and 14°C, respectively). The column transport coefficients used were derived from measurements. Run-time was minimized by optimizing both the operating pressure and the sample collection mode for different total sample sizes (range studied: 24 to 100 g), masses of enriched sample (1.8, 2.4, and 3.0 g), enrichment factors (12, 15, and 20), and column lengths (8 to 18 m). Optimum working pressures are between 1 and 2 atm. Usually, about 90 percent of the enriched sample mass is favorably extracted from the column itself, the length of the sampling section being about 2.5 to 5 m. Typical run-times are between 3 days and 2 weeks, and the isotope yield may reach 90 percent. Optimum operating conditions have also been calculated for other column configurations reported in the literature and are compared with the experimental results.
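
A single-exponential approximation of the kind described makes run-time estimation a one-line calculation. The sketch below is a generic illustration of that idea, not the paper's actual model: the function name, symbols, and numbers are hypothetical, assuming the enrichment factor approaches an equilibrium value E_eq with relaxation time tau.

```python
import math

def run_time(E_target, E_eq, tau):
    """Run-time needed to reach a target enrichment factor under the
    single-exponential approximation E(t) = E_eq * (1 - exp(-t / tau)).
    The target must lie below the equilibrium value E_eq."""
    if E_target >= E_eq:
        raise ValueError("target enrichment must be below equilibrium")
    return -tau * math.log(1.0 - E_target / E_eq)

# Hypothetical numbers: equilibrium enrichment 25, relaxation time 4 days.
t = run_time(E_target=15, E_eq=25, tau=4.0)
print(round(t, 2))  # 3.67 (days)
```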


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Huangyue Chen ◽  
Lingchen Kong ◽  
Yan Li

Clustering is an important ingredient of unsupervised learning; classical clustering methods include K-means clustering and hierarchical clustering. These methods may suffer from instability because of their tendency to sink into local optima of the nonconvex optimization model. In this paper, we propose a new convex clustering method for high-dimensional data based on the sparse group lasso penalty, which can simultaneously group observations and eliminate noninformative features. In this method, the number of clusters can be learned from the data instead of being given in advance as a parameter. We theoretically prove that the proposed method has desirable statistical properties, including a finite-sample error bound and feature screening consistency. Furthermore, a semiproximal alternating direction method of multipliers is designed to solve the sparse group lasso convex clustering model, and its convergence analysis is established without any conditions. Finally, the effectiveness of the proposed method is thoroughly demonstrated through simulated experiments and real applications.
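
A standard building block of ADMM-type solvers for sparse-group-lasso-penalized models is the proximal operator of the penalty, which for a single feature group amounts to elementwise soft-thresholding followed by block shrinkage. The sketch below shows that generic operator only; it is not the paper's semiproximal ADMM, and the function name and test values are invented:

```python
import numpy as np

def prox_sparse_group_lasso(v, lam1, lam2):
    """Proximal operator of lam1*||.||_1 + lam2*||.||_2 on one group:
    elementwise soft-thresholding (lasso part) followed by block
    soft-thresholding (group lasso part)."""
    u = np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0)
    norm = np.linalg.norm(u)
    if norm <= lam2:
        return np.zeros_like(u)   # whole group shrunk to zero
    return (1.0 - lam2 / norm) * u

v = np.array([3.0, -0.5, 1.5])
print(prox_sparse_group_lasso(v, lam1=1.0, lam2=0.5))
```

Entries below the elementwise threshold are zeroed exactly, and the surviving block is shrunk toward the origin, which is how the penalty eliminates noninformative features while grouping the rest.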


Entropy ◽  
2021 ◽  
Vol 23 (3) ◽  
pp. 339
Author(s):  
Xiaowei Xu ◽  
Jingyi Feng ◽  
Liu Zhan ◽  
Zhixiong Li ◽  
Feng Qian ◽  
...  

As a complex field-circuit coupled system involving electric, magnetic, and thermal interactions, the permanent magnet synchronous motor of an electric vehicle operates under varied conditions and in a complicated environment. Failures take various forms, and their signs cross or overlap; the randomness, secondary effects, concurrency, and propagation of faults make them difficult to diagnose. Meanwhile, common intelligent diagnosis methods suffer from low accuracy, poor generalization ability, and difficulty in processing high-dimensional data. This paper proposes a method of motor fault feature extraction based on a stacked denoising autoencoder (SDAE) combined with a support vector machine (SVM) classifier. First, the motor signals collected from the experiment were preprocessed, and the input data were randomly damaged by adding noise. Furthermore, according to the experimental results, the network structure of the stacked denoising autoencoder was constructed, and the optimal learning rate, noise reduction coefficient, and other network parameters were set. Finally, the trained network was used to verify the test samples. Compared with the traditional fault extraction method and a single autoencoder, this method offers better accuracy, strong generalization ability, and easier handling of high-dimensional data features.
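
The "randomly damage the input data" step is the defining ingredient of a denoising autoencoder: the network is trained to reconstruct the clean signal from a corrupted copy. A minimal sketch of one common corruption scheme, masking noise, follows; the function name, noise fraction, and stand-in data are placeholders, not the paper's settings.

```python
import numpy as np

def corrupt(X, noise_frac, rng):
    """Masking noise for a denoising autoencoder: each entry of X is
    independently set to zero with probability noise_frac."""
    mask = rng.random(X.shape) >= noise_frac  # keep with prob 1 - noise_frac
    return X * mask

rng = np.random.default_rng(42)
X = np.ones((1000, 8))            # stand-in for preprocessed motor signals
Xc = corrupt(X, noise_frac=0.3, rng=rng)
print(round(Xc.mean(), 2))        # roughly 0.7 of the entries survive
```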


2019 ◽  
Vol 48 (4) ◽  
pp. 14-42
Author(s):  
Frantisek Rublik

Constructions of data-driven orderings of a set of multivariate observations are presented. The methods also employ dissimilarity measures. The resulting ranks are used in the construction of test statistics for the location problem and of the corresponding multiple comparisons rule. An important aspect of the resulting procedures is that they can also be used in the multisample setting and in situations where the sample size is smaller than the dimension of the observations. The performance of the proposed procedures is illustrated by simulations.
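
One simple way to build a data-driven ordering from a dissimilarity measure, usable even when the dimension exceeds the sample size, is to rank observations by their mean dissimilarity to all others. This generic sketch illustrates the idea only and is not the paper's construction:

```python
import numpy as np

def dissimilarity_ranks(X):
    """Order multivariate observations by mean Euclidean dissimilarity
    to all other observations: central points receive small ranks,
    outlying points large ones. Nothing here requires the sample size
    to exceed the number of variables."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    centrality = D.mean(axis=1)
    return centrality.argsort().argsort() + 1  # rank 1 = most central

# 4 observations in 6 dimensions (n < p); the last row is an outlier.
X = np.array([[0.0] * 6, [0.1] * 6, [-0.1] * 6, [5.0] * 6])
print(dissimilarity_ranks(X))  # the outlier gets the largest rank
```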


Author(s):  
Yichen Cheng ◽  
Xinlei Wang ◽  
Yusen Xia

We propose a novel supervised dimension-reduction method called supervised t-distributed stochastic neighbor embedding (St-SNE) that achieves dimension reduction by preserving the similarities of data points in both feature and outcome spaces. The proposed method can be used for both prediction and visualization tasks with the ability to handle high-dimensional data. We show through a variety of data sets that when compared with a comprehensive list of existing methods, St-SNE has superior prediction performance in the ultrahigh-dimensional setting in which the number of features p exceeds the sample size n and has competitive performance in the p ≤ n setting. We also show that St-SNE is a competitive visualization tool that is capable of capturing within-cluster variations. In addition, we propose a penalized Kullback–Leibler divergence criterion to automatically select the reduced-dimension size k for St-SNE. Summary of Contribution: With the fast development of data collection and data processing technologies, high-dimensional data have now become ubiquitous. Examples of such data include those collected from environmental sensors, personal mobile devices, and wearable electronics. High-dimensionality poses great challenges for data analytics routines, both methodologically and computationally. Many machine learning algorithms may fail to work for ultrahigh-dimensional data, where the number of features p is (much) larger than the sample size n. We propose a novel method for dimension reduction that can (i) aid the understanding of high-dimensional data through visualization and (ii) create a small set of good predictors, which is especially useful for prediction using ultrahigh-dimensional data.
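
The t-SNE family, including supervised variants, rests on two ingredients: Gaussian-kernel pairwise similarities and a Kullback–Leibler divergence between similarity matrices. A generic sketch of those two ingredients follows; it is not St-SNE itself, and the function names and parameters are illustrative:

```python
import numpy as np

def affinities(X, sigma=1.0):
    """Pairwise similarities as in the t-SNE family: Gaussian kernel on
    squared Euclidean distances, zero diagonal, normalized to sum to 1."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    P = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def kl(P, Q, eps=1e-12):
    """KL divergence between two similarity matrices; methods in this
    family minimize such divergences between spaces."""
    mask = P > 0
    return float((P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))).sum())

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 5))
P = affinities(X)
print(kl(P, P))  # 0.0: a distribution has zero divergence from itself
```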


2012 ◽  
Vol 2012 ◽  
pp. 1-18
Author(s):  
Jiajuan Liang

High-dimensional data with a small sample size, such as microarray data and image data, are commonly encountered in practical problems in which many variables must be measured but it is too costly or time-consuming to repeat the measurements many times. Analysis of this kind of data poses a great challenge for statisticians. In this paper, we develop a new graphical method for testing spherical symmetry that is especially suitable for high-dimensional data with a small sample size. The new graphical method, together with its local acceptance regions, provides a quick visual check of the assumption of spherical symmetry. The performance of the new graphical method is demonstrated by a Monte Carlo study and illustrated with a real data set.
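
A basic fact behind tests of spherical symmetry is that for a spherically symmetric random vector the direction X/||X|| is uniform on the unit sphere, whatever the radial distribution. The snippet below is only a crude Monte Carlo illustration of that property, not the paper's graphical method; the sample sizes and threshold are arbitrary.

```python
import numpy as np

# Under spherical symmetry, normalized observations are uniform
# directions, so their coordinatewise means should be near zero.
rng = np.random.default_rng(7)
n, p = 5000, 30
X = rng.standard_normal((n, p))  # a spherically symmetric sample
U = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.abs(U.mean(axis=0)).max() < 0.05)  # True: directions average out
```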

