Modern Subsampling Methods for Large-Scale Least Squares Regression
2020, Vol 2 (2), pp. 1-28
Author(s): Tao Li, Cheng Meng

Subsampling methods aim to select a subsample as a surrogate for the observed sample. As a powerful technique for large-scale data analysis, various subsampling methods have been developed for more effective coefficient estimation and model prediction. This review presents some cutting-edge subsampling methods for large-scale least squares estimation. Two major families of subsampling methods are introduced: the randomized subsampling approach and the optimal subsampling approach. The former aims to develop a more effective data-dependent sampling probability, while the latter aims to select a deterministic subsample in accordance with certain optimality criteria. Real data examples are provided to compare these methods empirically with respect to both estimation accuracy and computing time.
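As a concrete illustration of the randomized subsampling approach, the sketch below draws rows with probabilities proportional to their leverage scores and solves a reweighted least squares problem on the subsample. It is a minimal sketch under assumed choices (leverage-based probabilities, the hypothetical function name `subsampled_least_squares`), not code from the review.

```python
import numpy as np

def subsampled_least_squares(X, y, r, rng=None):
    """Approximate OLS from a weighted random subsample of r rows.

    Rows are drawn with probabilities proportional to their statistical
    leverage scores (one common data-dependent choice), then reweighted
    so the subsampled estimator stays approximately unbiased.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Leverage scores: squared row norms of an orthonormal basis of X.
    Q, _ = np.linalg.qr(X, mode="reduced")
    lev = np.sum(Q**2, axis=1)
    probs = lev / lev.sum()
    idx = rng.choice(n, size=r, replace=True, p=probs)
    # Reweight sampled rows by 1 / sqrt(r * p_i).
    w = 1.0 / np.sqrt(r * probs[idx])
    Xs, ys = X[idx] * w[:, None], y[idx] * w
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

# Example: 100000 rows, 10 predictors, subsample of 2000 rows.
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(100_000)
print(subsampled_least_squares(X, y, r=2000, rng=1))
```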

Complexity
2018, Vol 2018, pp. 1-16
Author(s): Yiwen Zhang, Yuanyuan Zhou, Xing Guo, Jintao Wu, Qiang He, et al.

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It comprises two phases: initialization by the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. It therefore has a "blind" feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm thus combines the advantages of CA and K-means. Experiments carried out on the Spark platform verify the good scalability of the C-K-means algorithm, so it can effectively address large-scale data clustering. Extensive experiments on real data sets show that C-K-means outperforms existing algorithms in both accuracy and efficiency under both sequential and parallel conditions.
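The two-phase structure can be sketched generically: a covering-style pass that fixes both k and the initial centers, followed by standard Lloyd iterations. The covering heuristic below (pick a point, cover everything within a radius, repeat) is only an illustrative stand-in for CA, whose details are not given in this abstract.

```python
import numpy as np

def covering_init(X, radius, rng=None):
    """Phase 1 (illustrative): pick centers so every point lies within
    `radius` of some center; the number of centers found plays the role
    of k. A generic covering heuristic, not the paper's CA."""
    rng = np.random.default_rng(rng)
    uncovered = np.ones(len(X), dtype=bool)
    centers = []
    while uncovered.any():
        i = rng.choice(np.flatnonzero(uncovered))
        centers.append(X[i])
        uncovered &= np.linalg.norm(X - X[i], axis=1) > radius
    return np.array(centers)

def lloyd(X, centers, iters=50):
    """Phase 2: standard Lloyd iterations from the given centers."""
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(centers)):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

X = np.random.default_rng(0).standard_normal((500, 2))
centers = covering_init(X, radius=1.0, rng=0)
centers, labels = lloyd(X, centers)
print(len(centers), "clusters found")
```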


2019, Vol 79 (5), pp. 883-910
Author(s): Spyros Konstantopoulos, Wei Li, Shazia Miller, Arie van der Ploeg

This study discusses quantile regression methodology and its usefulness in education and social science research. First, quantile regression is defined and its advantages vis-à-vis ordinary least squares regression are illustrated. Second, specific comparisons are made between ordinary least squares and quantile regression methods. Third, the applicability of quantile regression to empirical work is demonstrated by estimating intervention effects using education data from a large-scale experiment. The estimation of quantile treatment effects at various quantiles in the presence of dropouts is also discussed. Quantile regression is especially suitable for examining predictor effects at various locations of the outcome distribution (e.g., the lower and upper tails).
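A minimal sketch of the contrast between the two estimators, using statsmodels on purely synthetic data (not the study's experimental data set): OLS returns a single mean treatment effect, while quantile regression returns an effect at each chosen quantile.

```python
import numpy as np
import statsmodels.api as sm

# Simulated outcome whose treatment effect grows toward the upper tail.
rng = np.random.default_rng(0)
n = 5000
treat = rng.integers(0, 2, n)
y = 50 + 2 * treat + (1 + 0.5 * treat) * rng.standard_normal(n) * 10
X = sm.add_constant(treat)

# OLS: one mean effect of the treatment indicator.
print(sm.OLS(y, X).fit().params)

# Quantile regression: effects at the lower tail, median, and upper tail.
for q in (0.1, 0.5, 0.9):
    print(q, sm.QuantReg(y, X).fit(q=q).params)
```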


Geophysics
2010, Vol 75 (4), pp. V51-V60
Author(s): Ramesh (Neelsh) Neelamani, Anatoly Baumstein, Warren S. Ross

We propose a complex-valued curvelet transform-based (CCT-based) algorithm that adaptively subtracts from seismic data those noises for which an approximate template is available. The CCT decomposes a geophysical data set into small reflection pieces, each with its own characteristic frequency, location, and dip. One can precisely change the amplitude and shift the location of each seismic reflection piece in a template by controlling the amplitude and phase of the template's CCT coefficients. Based on these insights, our approach uses the phase and amplitude of the data's and template's CCT coefficients to correct misalignment and amplitude errors in the noise template, thereby matching the adapted template with the actual noise in the seismic data, reflection event by event. We also extend the approach to subtract noise that requires several templates for its approximation. By itself, the method can only correct small misalignment errors ([Formula: see text] in [Formula: see text] data) in the template; it relies on conventional least-squares (LS) adaptation to correct large-scale misalignment errors, such as wavelet mismatches and bulk shifts. Synthetic and real-data results illustrate that the CCT-based approach improves upon the LS approach and a curvelet-based approach described by Herrmann and Verschuur.
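The coefficient-wise adaptation can be illustrated with any complex transform. The sketch below uses a 2-D FFT as a stand-in for the curvelet transform (which would require a dedicated library) and clips the amplitude and phase corrections so that only small errors in the template are adjusted; it illustrates the principle only and is not the authors' CCT algorithm.

```python
import numpy as np

def adapt_template(data, template, max_gain=2.0, max_phase=np.pi / 4):
    """Adapt a noise template to the data, coefficient by coefficient.

    A 2-D FFT stands in for the complex curvelet transform. For every
    complex coefficient, the template's amplitude is rescaled and its
    phase is rotated toward the data, with both corrections clipped so
    only small amplitude and alignment errors are fixed.
    """
    D, T = np.fft.fft2(data), np.fft.fft2(template)
    eps = 1e-12
    gain = np.clip(np.abs(D) / (np.abs(T) + eps), 1.0 / max_gain, max_gain)
    dphi = np.clip(np.angle(D * np.conj(T)), -max_phase, max_phase)
    adapted = np.fft.ifft2(T * gain * np.exp(1j * dphi)).real
    return data - adapted  # estimated signal after subtracting the adapted noise
```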


Symmetry
2021, Vol 13 (11), pp. 2211
Author(s): Siti Zahariah, Habshah Midi, Mohd Shafie Mustafa

Multicollinearity often occurs when two or more predictor variables are highly correlated, especially for high-dimensional data (HDD) where p >> n. The statistically inspired modification of the partial least squares (SIMPLS) is a very popular technique for solving a partial least squares regression problem due to its efficiency, speed, and ease of understanding. SIMPLS operates on the empirical covariance matrix of the explanatory and response variables. Nevertheless, SIMPLS is easily affected by outliers. To rectify this problem, a robust iteratively reweighted SIMPLS (RWSIMPLS) was introduced. Nonetheless, it is still not very efficient, as the RWSIMPLS algorithm is based on a weighting function that does not specify any method for identifying high leverage points (HLPs), i.e., outlying observations in the X-direction. HLPs have the most detrimental effect on the computed values of various estimates and lead to misleading conclusions about the fitted regression model. Hence, their effects need to be reduced by assigning smaller weights to them. As a solution, we propose an improved SIMPLS based on a new weight function obtained from the MRCD-PCA diagnostic method for identifying HLPs in HDD, and we name this method MRCD-PCA-RWSIMPLS. A new MRCD-PCA-RWSIMPLS diagnostic plot is also established for classifying observations into four groups: regular observations, vertical outliers, and good and bad leverage points. Numerical examples and Monte Carlo simulations show that MRCD-PCA-RWSIMPLS offers substantial improvements over SIMPLS and RWSIMPLS. The proposed diagnostic plot classifies observations into the correct groups, whereas the SIMPLS and RWSIMPLS plots fail to do so and show masking and swamping effects.
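For orientation, a plain (non-robust) SIMPLS fit on centered X (n×p) and Y (n×m) can be written in a few lines of numpy, as sketched below; the robust reweighting that RWSIMPLS and the proposed MRCD-PCA-RWSIMPLS add on top is not included in this sketch.

```python
import numpy as np

def simpls(X, Y, n_components):
    """Minimal SIMPLS (de Jong, 1993) on centered X (n x p), Y (n x m);
    returns the regression coefficients for the centered variables."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n, p = X.shape
    m = Y.shape[1]
    S = X.T @ Y                      # cross-product matrix
    R = np.zeros((p, n_components))  # X weights
    P = np.zeros((p, n_components))  # X loadings
    Q = np.zeros((m, n_components))  # Y loadings
    V = np.zeros((p, n_components))  # orthonormal basis for deflation
    for a in range(n_components):
        # Dominant right singular vector of S gives the Y weight.
        _, _, vt = np.linalg.svd(S, full_matrices=False)
        q = vt[0]
        r = S @ q                    # X weight vector
        t = X @ r                    # X scores
        t -= t.mean()
        normt = np.linalg.norm(t)
        t /= normt
        r /= normt
        p_a = X.T @ t                # X loadings
        q_a = Y.T @ t                # Y loadings
        v = p_a.copy()
        if a > 0:
            v -= V[:, :a] @ (V[:, :a].T @ p_a)  # orthogonalize
        v /= np.linalg.norm(v)
        S -= np.outer(v, v @ S)      # deflate the cross-product matrix
        R[:, a], P[:, a], Q[:, a], V[:, a] = r, p_a, q_a, v
    return R @ Q.T                   # coefficient matrix B (p x m)
```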


Information
2021, Vol 12 (5), pp. 195
Author(s): Davide Andrea Guastella, Guilhem Marcillaud, Cesare Valenti

Smart cities leverage large amounts of data acquired in the urban environment in the context of decision support tools. These tools enable monitoring of the environment to improve the quality of services offered to citizens. The increasing diffusion of personal Internet of Things (IoT) devices capable of sensing the physical environment allows for low-cost solutions for acquiring large amounts of information within the urban environment. On the one hand, the use of mobile and intermittent sensors implies new scenarios of large-scale data analysis; on the other hand, it involves challenges such as sensor intermittency and the integrity of the acquired data. To this end, edge computing emerges as a methodology to distribute computation among different IoT devices so that data are analyzed locally. We present here a new methodology for imputing environmental information during the acquisition step, when sensors are missing or otherwise out of order, by distributing the computation among a variety of fixed and mobile devices. Numerous experiments have been carried out on real data to confirm the validity of the proposed method.
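A minimal sketch of the kind of local estimation involved, assuming an inverse-distance-weighting rule and a simple (position, value) data structure; the paper's contribution lies in distributing such estimation across fixed and mobile edge devices, which this sketch does not model.

```python
import math

def impute_reading(missing_pos, neighbors, power=2.0):
    """Estimate a missing sensor reading by inverse-distance weighting
    of nearby devices' readings (illustrative rule, not the paper's)."""
    num = den = 0.0
    for (x, y), value in neighbors:
        d = math.dist(missing_pos, (x, y))
        if d == 0:
            return value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# A broken sensor at (2, 2) surrounded by three working ones.
print(impute_reading((2, 2), [((0, 0), 21.5), ((4, 1), 22.0), ((2, 5), 20.8)]))
```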


2019
Author(s): Derek Beaton, Gilbert Saporta, Hervé Abdi

Current large-scale studies of brain and behavior typically involve multiple populations and diverse types of data (e.g., genetics, brain structure, behavior, demographics, "multi-omics," and "deep phenotyping") measured on various scales. To analyze these heterogeneous data sets we need simple but flexible methods able to integrate the inherent properties of these complex data sets. Here we introduce partial least squares-correspondence analysis-regression (PLS-CA-R), a method designed to address these constraints. PLS-CA-R generalizes PLS regression to most data types (e.g., continuous, ordinal, categorical, non-negative values). We also show that PLS-CA-R generalizes many "two-table" multivariate techniques and their respective algorithms, such as various PLS approaches, canonical correlation analysis, and redundancy analysis (a.k.a. reduced rank regression).
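A rough illustration of the starting point, not of PLS-CA-R itself: categorical predictors are disjunctively (one-hot) coded so that a two-table PLS step can be run on them. The toy data, column names, and the use of scikit-learn's ordinary PLSRegression are assumptions for illustration; PLS-CA-R additionally applies correspondence-analysis weighting, which is omitted here.

```python
import pandas as pd
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Toy heterogeneous predictors: one categorical, one continuous variable.
df = pd.DataFrame({
    "genotype": ["AA", "AG", "GG", "AA", "AG", "GG", "AA", "GG"],
    "age": [61, 72, 58, 66, 70, 64, 59, 75],
})
# Disjunctive (one-hot) coding of the categorical column.
X = pd.get_dummies(df, columns=["genotype"]).to_numpy(dtype=float)
Y = np.array([[1.2], [0.7], [0.3], [1.0], [0.8], [0.4], [1.1], [0.2]])

# Ordinary two-table PLS regression on the recoded predictors.
pls = PLSRegression(n_components=2).fit(X, Y)
print(pls.predict(X))
```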


Algorithms
2018, Vol 11 (12), pp. 191
Author(s): Chen Li, Annisa Annisa, Asif Zaman, Mahboob Qaosar, Saleh Ahmed, et al.

Location recommendation is essential for various map-based mobile applications. However, it is not easy to generate location-based recommendations when the contexts and locations of mobile users keep changing. The skyline operation is one of the most well-established techniques for location-based services. Our previous work proposed a new query method, called the "area skyline query," to select areas in a map, but it is not efficient for large-scale data. In this paper, we propose a parallel algorithm for processing the area skyline using MapReduce. Intensive experiments on both synthetic and real data confirm that the proposed algorithm is sufficiently efficient for large-scale data.
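The standard partition-then-merge pattern behind a MapReduce skyline can be sketched directly: each mapper computes the local skyline of its partition, and the reducer merges the local skylines and filters once more. This is a generic point-skyline sketch, not the paper's area skyline algorithm, which operates on grid areas of a map.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly
    better in at least one (minimization assumed in every dimension)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def local_skyline(points):
    """Map step: skyline of one partition."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def skyline_mapreduce(partitions):
    """Reduce step: merge local skylines and filter the merged set again."""
    candidates = [p for part in partitions for p in local_skyline(part)]
    return local_skyline(candidates)

# Two partitions of (distance-to-station, distance-to-school) pairs.
parts = [[(3, 7), (5, 2), (6, 6)], [(4, 3), (2, 8), (7, 1)]]
print(skyline_mapreduce(parts))
```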


MENDEL
2018, Vol 24 (2), pp. 9-16
Author(s): Radovan Somplak, Zlata Smidova, Veronika Smejkalova, Vlastimir Nevrly

Data recording struggles with the occurrence of errors, which worsen the accuracy of follow-up calculations. Achieving satisfactory results requires data processing that eliminates the influence of these errors. This paper applies a data reconciliation technique to the mining of data recorded from moving vehicles. The database collects information about the start and end points of each route (GPS coordinates) and its total duration. The presented methodology smooths the available data and yields an estimate of the transportation time through individual parts of the entire recorded route. This provides valuable information which can be used for further transportation planning. First, the proposed mathematical model is tested on a simplified example. The real-data application requires preprocessing in which anticipated routes are designed; the database is thus supplemented with information on the probable speed of the vehicle. The mathematical model is based on weighted least squares data reconciliation and is organized iteratively. Because the full calculation is time consuming, a linearised model is solved first to initialize the values for the complex model. Attention is also paid to the setting of the weights. The weighting system is designed to reflect the quality of specific data and the dependence on the frequency of traffic. In this respect, the model is not strict, which leaves room to adapt to the current data. The case study focuses on GPS data of shipping vehicles in a particular city in the Czech Republic with several types of roads.
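The core reconciliation step is classical weighted least squares under linear constraints and has a closed form, which the sketch below implements for a toy route example; the constraint matrix, weights, and numbers are illustrative assumptions, not the paper's model.

```python
import numpy as np

def reconcile(x_meas, weights, A, b):
    """Weighted least squares reconciliation: find adjusted values x that
    minimize sum_i w_i (x_i - x_meas_i)^2 subject to A x = b."""
    W_inv = np.diag(1.0 / np.asarray(weights, dtype=float))
    resid = A @ x_meas - b
    correction = W_inv @ A.T @ np.linalg.solve(A @ W_inv @ A.T, resid)
    return x_meas - correction

# Example: three measured segment times should sum to a measured total.
# x = [t1, t2, t3, total]; constraint t1 + t2 + t3 - total = 0.
x_meas = np.array([4.0, 6.5, 5.0, 14.0])
weights = np.array([1.0, 1.0, 1.0, 4.0])   # the total duration is trusted more
A = np.array([[1.0, 1.0, 1.0, -1.0]])
b = np.array([0.0])
print(reconcile(x_meas, weights, A, b))    # adjusted times satisfy the constraint
```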


2016, Vol 38 (3), pp. B414-B439
Author(s): Xiaowei Zhang, Li Cheng, Delin Chu, Li-Zhi Liao, Michael K. Ng, et al.
