High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries

2015 ◽  
Vol 14 (Suppl 5) ◽  
pp. CIN.S30804 ◽  
Author(s):  
Amin Zollanvari

High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises, in which the sample size grows unboundedly for a fixed dimensionality of observations. Much work has been done in developing mathematical-statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal of this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject.
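
The core difficulty motivated above, classical estimators breaking down when the number of variables exceeds the sample size, can be seen in a few lines: with fewer observations than variables, the sample covariance matrix is necessarily singular, so any classical procedure that inverts it (e.g. linear discriminant analysis) fails outright. A minimal illustration (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50  # sample size smaller than dimension
X = rng.standard_normal((n, p))

# Sample covariance matrix (p x p) estimated from only n observations.
S = np.cov(X, rowvar=False)

# With p > n the sample covariance has rank at most n - 1, so it is
# singular and cannot be inverted as classical methods require.
rank = np.linalg.matrix_rank(S)
print(rank)      # 19, i.e. at most n - 1
print(rank < p)  # True: S is singular
```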

2013 ◽  
Vol 6 (1) ◽  
pp. 10-18 ◽  
Author(s):  
Shuo Chen ◽  
Edward Grant ◽  
Tong Tong Wu ◽  
F. DuBois Bowman

Radiocarbon ◽  
1978 ◽  
Vol 20 (3) ◽  
pp. 313-332 ◽  
Author(s):  
Helmut Erlenkeuser

The optimum operating conditions providing minimum run-time and running costs have been studied theoretically for a thermal diffusion plant to be used for the enrichment of the radiocarbon isotope from a finite sample size. The calculations are based on a simple approximate model of the enrichment process, regarding the isotope separation column as operating under quasi-stationary state conditions. The temporal variation of the isotope accumulation is given by a single exponential term. From comparison with the numerical solution of the separation tube equation, approximate models of this simple type appear hardly sufficient for analytical work but seem well suited for optimization calculations. For column operation not too close to the equilibrium state, the approximate run-times were found to be accurate within 0.2 d. The approximate model has been applied to a column of the concentric type, operated on gaseous methane. Cross-section configuration and temperatures were not varied (hot and cold wall radii: 2.0 and 2.6 cm, respectively; hot and cold wall temperatures: 400°C and 14°C, respectively). The column transport coefficients used were derived from measurements. Run-time was minimized by optimizing both the operating pressure and the sample collection mode for different total sample sizes (range studied: 24 to 100 g), masses of enriched sample (1.8, 2.4, and 3.0 g), enrichment factors (12, 15, and 20), and column lengths (8 to 18 m). Optimum working pressures are between 1 and 2 atm. Usually, about 90 percent of the enriched sample mass is favorably extracted from the column itself, the length of the sampling section being about 2.5 to 5 m. Typical run-times are between 3 days and 2 weeks, and the isotope yield may reach 90 percent. Optimum operating conditions have also been calculated for other column configurations reported in the literature and are compared with the experimental results.
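
A single-exponential approximation of the kind described makes run-time estimation a one-line calculation. The sketch below is a generic illustration of that idea, not the paper's actual model: the function name, symbols, and numbers are hypothetical, assuming the enrichment factor approaches an equilibrium value E_eq with relaxation time tau.

```python
import math

def run_time(E_target, E_eq, tau):
    """Run-time needed to reach a target enrichment factor under the
    single-exponential approximation E(t) = E_eq * (1 - exp(-t / tau)).
    The target must lie below the equilibrium value E_eq."""
    if E_target >= E_eq:
        raise ValueError("target enrichment must be below equilibrium")
    return -tau * math.log(1.0 - E_target / E_eq)

# Hypothetical numbers: equilibrium enrichment 25, relaxation time 4 days.
t = run_time(E_target=15, E_eq=25, tau=4.0)
print(round(t, 2))  # 3.67 (days)
```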


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Huangyue Chen ◽  
Lingchen Kong ◽  
Yan Li

Clustering is an important ingredient of unsupervised learning; classical clustering methods include K-means clustering and hierarchical clustering. These methods may suffer from instability because of their tendency to sink into local optima of the nonconvex optimization model. In this paper, we propose a new convex clustering method for high-dimensional data based on the sparse group lasso penalty, which can simultaneously group observations and eliminate noninformative features. In this method, the number of clusters can be learned from the data instead of being given in advance as a parameter. We theoretically prove that the proposed method has desirable statistical properties, including a finite-sample error bound and feature screening consistency. Furthermore, a semiproximal alternating direction method of multipliers is designed to solve the sparse group lasso convex clustering model, and its convergence analysis is established without any conditions. Finally, the effectiveness of the proposed method is thoroughly demonstrated through simulated experiments and real applications.
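
A standard building block of ADMM-type solvers for sparse-group-lasso-penalized models is the proximal operator of the penalty, which for a single feature group amounts to elementwise soft-thresholding followed by block shrinkage. The sketch below shows that generic operator only; it is not the paper's semiproximal ADMM, and the function name and test values are invented:

```python
import numpy as np

def prox_sparse_group_lasso(v, lam1, lam2):
    """Proximal operator of lam1*||.||_1 + lam2*||.||_2 on one group:
    elementwise soft-thresholding (lasso part) followed by block
    soft-thresholding (group lasso part)."""
    u = np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0)
    norm = np.linalg.norm(u)
    if norm <= lam2:
        return np.zeros_like(u)   # whole group shrunk to zero
    return (1.0 - lam2 / norm) * u

v = np.array([3.0, -0.5, 1.5])
print(prox_sparse_group_lasso(v, lam1=1.0, lam2=0.5))
```

Entries below the elementwise threshold are zeroed exactly, and the surviving block is shrunk toward the origin, which is how the penalty eliminates noninformative features while grouping the rest.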


Entropy ◽  
2021 ◽  
Vol 23 (3) ◽  
pp. 339
Author(s):  
Xiaowei Xu ◽  
Jingyi Feng ◽  
Liu Zhan ◽  
Zhixiong Li ◽  
Feng Qian ◽  
...  

As a complex field-circuit coupled system involving electric, magnetic, and thermal interactions, the permanent magnet synchronous motor of an electric vehicle operates under varied conditions and in a complicated environment. Failures take various forms, and their signs cross or overlap; the randomness, secondary effects, concurrency, and propagation of faults make them difficult to diagnose. Meanwhile, common intelligent diagnosis methods suffer from low accuracy, poor generalization ability, and difficulty in processing high-dimensional data. This paper proposes a method of motor fault feature extraction based on a stacked denoising autoencoder (SDAE) combined with a support vector machine (SVM) classifier. First, the motor signals collected from the experiment were preprocessed, and the input data were randomly damaged by adding noise. Furthermore, according to the experimental results, the network structure of the stacked denoising autoencoder was constructed, and the optimal learning rate, noise reduction coefficient, and other network parameters were set. Finally, the trained network was used to verify the test samples. Compared with the traditional fault extraction method and a single autoencoder, this method offers better accuracy, strong generalization ability, and easier handling of high-dimensional data features.
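
The "randomly damage the input data" step is the defining ingredient of a denoising autoencoder: the network is trained to reconstruct the clean signal from a corrupted copy. A minimal sketch of one common corruption scheme, masking noise, follows; the function name, noise fraction, and stand-in data are placeholders, not the paper's settings.

```python
import numpy as np

def corrupt(X, noise_frac, rng):
    """Masking noise for a denoising autoencoder: each entry of X is
    independently set to zero with probability noise_frac."""
    mask = rng.random(X.shape) >= noise_frac  # keep with prob 1 - noise_frac
    return X * mask

rng = np.random.default_rng(42)
X = np.ones((1000, 8))            # stand-in for preprocessed motor signals
Xc = corrupt(X, noise_frac=0.3, rng=rng)
print(round(Xc.mean(), 2))        # roughly 0.7 of the entries survive
```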


2019 ◽  
Vol 48 (4) ◽  
pp. 14-42
Author(s):  
Frantisek Rublik

Constructions of data-driven orderings of a set of multivariate observations are presented. The methods also employ dissimilarity measures. The resulting ranks are used in the construction of test statistics for the location problem and of the corresponding multiple comparisons rule. An important aspect of the resulting procedures is that they can also be used in the multisample setting and in situations where the sample size is smaller than the dimension of the observations. The performance of the proposed procedures is illustrated by simulations.
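
One simple way to build a data-driven ordering from a dissimilarity measure, usable even when the dimension exceeds the sample size, is to rank observations by their mean dissimilarity to all others. This generic sketch illustrates the idea only and is not the paper's construction:

```python
import numpy as np

def dissimilarity_ranks(X):
    """Order multivariate observations by mean Euclidean dissimilarity
    to all other observations: central points receive small ranks,
    outlying points large ones. Nothing here requires the sample size
    to exceed the number of variables."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    centrality = D.mean(axis=1)
    return centrality.argsort().argsort() + 1  # rank 1 = most central

# 4 observations in 6 dimensions (n < p); the last row is an outlier.
X = np.array([[0.0] * 6, [0.1] * 6, [-0.1] * 6, [5.0] * 6])
print(dissimilarity_ranks(X))  # the outlier gets the largest rank
```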


Author(s):  
Yichen Cheng ◽  
Xinlei Wang ◽  
Yusen Xia

We propose a novel supervised dimension-reduction method called supervised t-distributed stochastic neighbor embedding (St-SNE) that achieves dimension reduction by preserving the similarities of data points in both feature and outcome spaces. The proposed method can be used for both prediction and visualization tasks with the ability to handle high-dimensional data. We show through a variety of data sets that when compared with a comprehensive list of existing methods, St-SNE has superior prediction performance in the ultrahigh-dimensional setting in which the number of features p exceeds the sample size n and has competitive performance in the p ≤ n setting. We also show that St-SNE is a competitive visualization tool that is capable of capturing within-cluster variations. In addition, we propose a penalized Kullback–Leibler divergence criterion to automatically select the reduced-dimension size k for St-SNE. Summary of Contribution: With the fast development of data collection and data processing technologies, high-dimensional data have now become ubiquitous. Examples of such data include those collected from environmental sensors, personal mobile devices, and wearable electronics. High-dimensionality poses great challenges for data analytics routines, both methodologically and computationally. Many machine learning algorithms may fail to work for ultrahigh-dimensional data, where the number of features p is (much) larger than the sample size n. We propose a novel method for dimension reduction that can (i) aid the understanding of high-dimensional data through visualization and (ii) create a small set of good predictors, which is especially useful for prediction using ultrahigh-dimensional data.
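
The t-SNE family, including supervised variants, rests on two ingredients: Gaussian-kernel pairwise similarities and a Kullback–Leibler divergence between similarity matrices. A generic sketch of those two ingredients follows; it is not St-SNE itself, and the function names and parameters are illustrative:

```python
import numpy as np

def affinities(X, sigma=1.0):
    """Pairwise similarities as in the t-SNE family: Gaussian kernel on
    squared Euclidean distances, zero diagonal, normalized to sum to 1."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    P = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def kl(P, Q, eps=1e-12):
    """KL divergence between two similarity matrices; methods in this
    family minimize such divergences between spaces."""
    mask = P > 0
    return float((P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))).sum())

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 5))
P = affinities(X)
print(kl(P, P))  # 0.0: a distribution has zero divergence from itself
```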


2012 ◽  
Vol 2012 ◽  
pp. 1-18
Author(s):  
Jiajuan Liang

High-dimensional data with a small sample size, such as microarray data and image data, are commonly encountered in practical problems in which many variables must be measured but it is too costly or time-consuming to repeat the measurements many times. Analysis of this kind of data poses a great challenge for statisticians. In this paper, we develop a new graphical method for testing spherical symmetry that is especially suitable for high-dimensional data with a small sample size. The new graphical method, together with its local acceptance regions, provides a quick visual check of the assumption of spherical symmetry. The performance of the new graphical method is demonstrated by a Monte Carlo study and illustrated with a real data set.
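
A basic fact behind tests of spherical symmetry is that for a spherically symmetric random vector the direction X/||X|| is uniform on the unit sphere, whatever the radial distribution. The snippet below is only a crude Monte Carlo illustration of that property, not the paper's graphical method; the sample sizes and threshold are arbitrary.

```python
import numpy as np

# Under spherical symmetry, normalized observations are uniform
# directions, so their coordinatewise means should be near zero.
rng = np.random.default_rng(7)
n, p = 5000, 30
X = rng.standard_normal((n, p))  # a spherically symmetric sample
U = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.abs(U.mean(axis=0)).max() < 0.05)  # True: directions average out
```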

