Leveraged least trimmed absolute deviations

OR Spectrum ◽  
2021 ◽  
Author(s):  
Nathan Sudermann-Merx ◽  
Steffen Rebennack

Abstract: The design of regression models that are not affected by outliers is an important task which has been the subject of numerous papers within the statistics community over the last decades. Prominent examples of robust regression models are least trimmed squares (LTS), where the k largest squared deviations are ignored, and least trimmed absolute deviations (LTA), which ignores the k largest absolute deviations. The numerical complexity of both models is driven by the number of binary variables and by the value k of ignored deviations. We introduce leveraged least trimmed absolute deviations (LLTA), which exploits the fact that LTA is already immune to y-outliers. LLTA therefore only has to be guarded against outlying values in x, so-called leverage points, which, in contrast to y-outliers, can be computed beforehand. Thus, while the mixed-integer formulations of LTS and LTA have as many binary variables as data points, LLTA only needs one binary variable per leverage point, resulting in a significant reduction of binary variables. Based on 11 data sets from the literature, we demonstrate that (1) LLTA's prediction quality improves much faster than LTS and as fast as LTA for increasing values of k and (2) LLTA solves the benchmark problems about 80 times faster than LTS and about five times faster than LTA, in the median.
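The precomputation that LLTA exploits, identifying leverage points before the optimisation runs, can be illustrated with a short sketch. This is not the authors' implementation; the hat-matrix diagonal and the 2p/n threshold are one common rule of thumb for flagging x-outliers.

```python
import numpy as np

def leverage_scores(X):
    """Diagonal of the hat matrix H = X (X^T X)^{-1} X^T."""
    # Add an intercept column, as in ordinary linear regression.
    Xa = np.column_stack([np.ones(len(X)), X])
    # Hat diagonal via the pseudoinverse for numerical stability.
    H = Xa @ np.linalg.pinv(Xa.T @ Xa) @ Xa.T
    return np.diag(H)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[:3] += 8.0                               # plant three x-outliers (leverage points)
h = leverage_scores(X)
p = X.shape[1] + 1                         # parameters including the intercept
flagged = np.where(h > 2 * p / len(X))[0]  # common 2p/n rule of thumb
print("suspected leverage points:", flagged)
```

Only the points flagged here would receive binary variables in an LLTA-style model, which is the source of the reduction in model size described above.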

Author(s):  
John Alasdair Warwicker ◽  
Steffen Rebennack

The problem of fitting continuous piecewise linear (PWL) functions to discrete data has applications in pattern recognition and engineering, amongst many other fields. To find an optimal PWL function, the breakpoints connecting adjacent linear segments must not be constrained to fixed positions but allowed to be placed freely. Although the univariate PWL fitting problem has often been approached from a global optimisation perspective, two mixed-integer linear programming approaches have recently been presented that solve for optimal PWL functions. In this paper, we compare the two approaches: the first was presented by Rebennack and Krasko [Rebennack S, Krasko V (2020) Piecewise linear function fitting via mixed-integer linear programming. INFORMS J. Comput. 32(2):507–530] and the second by Kong and Maravelias [Kong L, Maravelias CT (2020) On the derivation of continuous piecewise linear approximating functions. INFORMS J. Comput. 32(3):531–546]. Both formulations are similar in that they use binary variables and logical implications modelled by big-M constructs to ensure the continuity of the PWL function, yet the former model uses fewer binary variables. We present experimental results comparing the time taken to find optimal PWL functions with differing numbers of breakpoints across 10 data sets for three different objective functions. Although neither of the two formulations is superior on all data sets, the presented computational results suggest that the formulation presented by Rebennack and Krasko is faster, which might be explained by its fewer complicating binary variables and sparser constraints. Summary of Contribution: This paper presents a comparison of the mixed-integer linear programming models presented in two recent studies published in the INFORMS Journal on Computing. Because of the similarity of the two formulations, it is not clear which one is preferable. We present a detailed comparison of the two formulations, including a series of comparative experimental results across 10 data sets that appeared in both papers. We hope that our results will allow readers to take an objective view as to which implementation they should use.
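Neither MILP model is reproduced here, but a minimal sketch shows why free breakpoints make the problem hard: with fixed breakpoints, a continuous PWL fit reduces to ordinary least squares over a hinge basis, whereas letting the breakpoints move makes the problem nonconvex, which is what the binary variables and big-M constraints in the two formulations handle.

```python
import numpy as np

def fit_pwl_fixed_breakpoints(x, y, breakpoints):
    """Least-squares fit of a continuous PWL function with FIXED breakpoints.

    Uses the hinge basis f(x) = a + b*x + sum_j c_j * max(0, x - t_j),
    which is continuous by construction. When the breakpoints t_j become
    free decision variables, the problem turns nonconvex.
    """
    B = np.column_stack([np.ones_like(x), x] +
                        [np.maximum(0.0, x - t) for t in breakpoints])
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return coef, B @ coef

x = np.linspace(0, 10, 200)
y = np.abs(x - 4) + 0.1 * np.random.default_rng(1).normal(size=x.size)
coef, yhat = fit_pwl_fixed_breakpoints(x, y, breakpoints=[4.0])
print("max abs residual:", np.max(np.abs(y - yhat)))
```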


2019 ◽  
Author(s):  
Liwei Cao ◽  
Danilo Russo ◽  
Vassilios S. Vassiliadis ◽  
Alexei Lapkin

A mixed-integer nonlinear programming (MINLP) formulation for symbolic regression was proposed to identify physical models from noisy experimental data. The formulation was tested on numerical models and found to scale more efficiently than a previous formulation from the literature with respect to the number of predictor variables and training data points. The globally optimal search was extended to identify physical models and to cope with noise in the predictor variable of the experimental data. The methodology was coupled with automated collection of experimental data and proved successful in identifying the correct physical models describing the relationship between shear stress and shear rate for both Newtonian and non-Newtonian fluids, as well as simple kinetic laws of reactions. Future work will focus on addressing the limitations of the presented formulation by extending it to larger, more complex physical models.
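The MINLP formulation itself is not shown here, but the underlying task, discriminating between candidate physical models of shear stress versus shear rate, can be sketched with ordinary nonlinear least squares and an information criterion. The models, parameter values and AIC selection below are illustrative assumptions, not the paper's method.

```python
import numpy as np
from scipy.optimize import curve_fit

# Candidate physical models for shear stress tau as a function of shear rate g.
newtonian = lambda g, mu: mu * g                 # tau = mu * gamma_dot
power_law = lambda g, K, n: K * np.power(g, n)   # tau = K * gamma_dot**n

rng = np.random.default_rng(2)
g = np.linspace(0.1, 50, 40)
tau = 2.0 * g**0.6 + rng.normal(scale=0.3, size=g.size)  # shear-thinning fluid

for name, model, p0 in [("Newtonian", newtonian, [1.0]),
                        ("power-law", power_law, [1.0, 1.0])]:
    popt, _ = curve_fit(model, g, tau, p0=p0)
    rss = np.sum((tau - model(g, *popt)) ** 2)
    # AIC trades goodness of fit against the number of parameters.
    aic = g.size * np.log(rss / g.size) + 2 * len(popt)
    print(f"{name}: params={np.round(popt, 3)}, AIC={aic:.1f}")
```

The MINLP approach searches the space of expressions globally rather than comparing a hand-picked list, which is what distinguishes it from this kind of enumeration.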


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yance Feng ◽  
Lei M. Li

Abstract: Background: Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results: We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts for the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression in the pairwise normalization. The proposed method (MUREN) is compared with other existing tools on several standard data sets. Our criterion for the goodness of normalization emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by single-cell data of the cell cycle. MUREN is implemented as an R package. The code, under license GPL-3, is available on the github platform: github.com/hippo-yf/MUREN and on the conda platform: anaconda.org/hippo-yf/r-muren. Conclusions: MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
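MUREN itself is the R package cited above; purely to illustrate the robust pairwise step, here is a toy Python sketch of a least-trimmed-squares scale estimate against a single reference. The function name, the 70% trimming fraction, and the concentration-step loop are illustrative assumptions, not MUREN's actual parameters.

```python
import numpy as np

def lts_scale_factor(sample, reference, keep=0.7, iters=20):
    """Toy least-trimmed-squares estimate of a log-scale normalization factor.

    Regresses log counts of one sample on a reference, repeatedly refitting
    on the `keep` fraction of points with the smallest squared residuals
    (a basic concentration step, in the spirit of FAST-LTS).
    """
    mask = (sample > 0) & (reference > 0)
    x, y = np.log(reference[mask]), np.log(sample[mask])
    c = np.median(y - x)                       # initial offset
    h = int(keep * x.size)
    for _ in range(iters):
        r2 = (y - x - c) ** 2
        subset = np.argsort(r2)[:h]            # trim the largest residuals
        c = np.mean(y[subset] - x[subset])     # refit on the kept subset
    return np.exp(c)                           # multiplicative scale factor

rng = np.random.default_rng(3)
ref = rng.poisson(50, 1000).astype(float)
smp = 1.5 * ref * np.exp(rng.normal(0, 0.1, 1000))
smp[:50] *= 20                                 # differentially expressed genes
print("estimated scale:", round(lts_scale_factor(smp, ref), 3))
```

Trimming the largest residuals plays the role of the statistical counterpart of housekeeping genes: the scale factor is estimated from the stable majority rather than from strongly differentiated genes.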


2012 ◽  
Vol 38 (2) ◽  
pp. 57-69 ◽  
Author(s):  
Abdulghani Hasan ◽  
Petter Pilesjö ◽  
Andreas Persson

Global change and GHG emission modelling depend on accurate wetness estimations for predictions of, e.g., methane emissions. This study aims to quantify how slope, drainage area and the topographic wetness index (TWI) vary with the resolution of digital elevation models (DEMs) for a flat peatland area. Six DEMs with spatial resolutions from 0.5 to 90 m were interpolated with four different search radii. The relationship between the accuracy of the DEM and slope was tested. The LiDAR elevation data were divided into two data sets; the density of data points made it possible to build an evaluation dataset whose points lay no more than 10 mm from the cell centre points of the interpolation dataset. The DEM was evaluated using a quantile-quantile test and the normalized median absolute deviation, and its accuracy proved independent of resolution when the same search radius was used. The accuracy of the estimated elevation for different slopes was tested using the 0.5 m DEM and showed a higher deviation from the evaluation data in steep areas. Slope estimates differed between resolutions by values exceeding 50%. Drainage areas were tested for three resolutions, with coinciding evaluation points. The model's ability to generate drainage area at each resolution was tested by pairwise comparison of three data subsets and showed differences of more than 50% in 25% of the evaluated points. The results show that consideration of DEM resolution is a necessity when using slope, drainage area and TWI data in large-scale modelling.
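The resolution dependence the study measures can be made concrete with a minimal sketch of the TWI formula, TWI = ln(a / tan β), evaluated on the same surface at two cell sizes. The flow-routing step that produces the real catchment area is replaced here by a placeholder constant, so this is an illustration of the computation, not the study's method.

```python
import numpy as np

def twi(dem, cell_size):
    """Toy topographic wetness index: TWI = ln(a / tan(beta)).

    Slope (beta) comes from finite differences; the specific catchment
    area `a` is crudely approximated as a constant per-cell value,
    whereas a real implementation would route flow over the DEM.
    """
    dzdy, dzdx = np.gradient(dem, cell_size)
    slope = np.arctan(np.hypot(dzdx, dzdy))        # slope angle in radians
    area = np.full_like(dem, cell_size ** 2)       # placeholder catchment area
    return np.log(area / np.maximum(np.tan(slope), 1e-6))

# Coarsening the grid on the same surface changes slope, and hence TWI.
rng = np.random.default_rng(4)
dem = rng.normal(0, 0.05, (100, 100)).cumsum(axis=0)  # gently sloping surface
print("mean TWI @ 0.5 m:", twi(dem, 0.5).mean().round(2))
print("mean TWI @ 1.0 m:", twi(dem[::2, ::2], 1.0).mean().round(2))
```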


2014 ◽  
Vol 21 (11) ◽  
pp. 1581-1588 ◽  
Author(s):  
Piotr Kardas ◽  
Mohammadreza Sadeghi ◽  
Fabian H. Weissbach ◽  
Tingting Chen ◽  
Lea Hedman ◽  
...  

Abstract: JC polyomavirus (JCPyV) can cause progressive multifocal leukoencephalopathy (PML), a debilitating, often fatal brain disease in immunocompromised patients. JCPyV-seropositive multiple sclerosis (MS) patients treated with natalizumab have a 2- to 10-fold increased risk of developing PML. Therefore, JCPyV serology has been recommended for PML risk stratification. However, different antibody tests may not be equivalent. To study intra- and interlaboratory variability, sera from 398 healthy blood donors were compared in 4 independent enzyme-linked immunoassay (ELISA) measurements generating >1,592 data points. Three data sets (Basel1, Basel2, and Basel3) used the same basic protocol but different JCPyV virus-like particle (VLP) preparations and introduced normalization to a reference serum. The data sets were also compared with an independent method using biotinylated VLPs (Helsinki1). VLP preadsorption reducing ≥35% activity was used to identify seropositive sera. The results indicated that Basel1, Basel2, Basel3, and Helsinki1 were similar regarding overall data distribution (P = 0.79) and seroprevalence (58.0, 54.5, 54.8, and 53.5%, respectively; P = 0.95). However, intra-assay intralaboratory comparison yielded 3.7% to 12% discordant results, most of which were close to the cutoff (0.080 < optical density [OD] < 0.250) according to Bland-Altman analysis. Introduction of normalization improved overall performance and reduced discordance. The interlaboratory interassay comparison between Basel3 and Helsinki1 revealed only 15 discordant results, 14 (93%) of which were close to the cutoff. Preadsorption identified specificities of 99.44% and 97.78% and sensitivities of 99.54% and 95.87% for Basel3 and Helsinki1, respectively. Thus, normalization to a preferably WHO-approved reference serum, duplicate testing, and preadsorption for samples around the cutoff may be necessary for reliable JCPyV serology and PML risk stratification.
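The recommended decision logic, an OD cutoff with preadsorption confirmation for borderline samples, can be expressed as a small sketch. The 0.080-0.250 borderline zone and the ≥35% reduction come from the abstract; the base cutoff value and the function itself are illustrative assumptions, not the study's exact protocol.

```python
def classify_serum(od, od_preadsorbed, cutoff=0.10):
    """Toy decision rule in the spirit of the protocol described above.

    `cutoff` is an illustrative OD threshold, not the study's exact value.
    Borderline samples are confirmed by VLP preadsorption: a >= 35%
    reduction in activity indicates JCPyV-specific reactivity.
    """
    if 0.080 < od < 0.250:                    # borderline zone from the text
        reduction = (od - od_preadsorbed) / od
        return "positive" if reduction >= 0.35 else "negative"
    return "positive" if od >= cutoff else "negative"

print(classify_serum(od=0.15, od_preadsorbed=0.05))  # borderline, adsorbable
```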


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but because the values of categorical data are unordered, these methods are not applicable to categorical data sets. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to obtain the support of every row. The data object with the largest support is chosen as the initial center, and further centers are then found at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
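A minimal sketch of the idea as described, with Hamming distance and greedy farthest-point selection as assumptions for details the abstract does not spell out:

```python
import numpy as np

def support_based_centers(data, k):
    """Sketch of support-based initial center selection for categorical data.

    Each value's support is its frequency within the attribute; a row's
    support is the sum over its attributes. The highest-support row seeds
    the centers, and the rest are chosen greedily by Hamming distance.
    """
    n, m = data.shape
    # Support of each entry = frequency of that value in its column.
    support = np.zeros((n, m))
    for j in range(m):
        vals, counts = np.unique(data[:, j], return_counts=True)
        freq = dict(zip(vals, counts))
        support[:, j] = [freq[v] for v in data[:, j]]
    centers = [int(np.argmax(support.sum(axis=1)))]
    while len(centers) < k:
        # Hamming distance to the nearest already-chosen center.
        d = np.min([np.sum(data != data[c], axis=1) for c in centers], axis=0)
        centers.append(int(np.argmax(d)))
    return centers

data = np.array([["a", "x"], ["a", "x"], ["a", "y"], ["b", "z"], ["c", "z"]])
print(support_based_centers(data, k=2))
```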


2018 ◽  
Vol 8 (2) ◽  
pp. 377-406
Author(s):  
Almog Lahav ◽  
Ronen Talmon ◽  
Yuval Kluger

Abstract: A fundamental question in data analysis, machine learning and signal processing is how to compare data points. The choice of distance metric is particularly challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real gene expression data from lung adenocarcinomas (lung cancer). Using the proposed metric, we found a partition of subjects into risk groups with a good separation between their Kaplan–Meier survival plots.
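For reference, the generic Mahalanobis distance the paper builds on is sketched below with a plain empirical covariance estimate; the paper's actual contribution, estimating that covariance from clusters of correlated coordinates, is not reproduced here.

```python
import numpy as np

def mahalanobis(u, v, cov_inv):
    """Mahalanobis distance d(u, v) = sqrt((u - v)^T C^{-1} (u - v))."""
    diff = u - v
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=500)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))  # plain empirical estimate
print(mahalanobis(X[0], X[1], cov_inv))
```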


2019 ◽  
Vol 10 (5) ◽  
pp. 5397-5404 ◽  
Author(s):  
Jian Luo ◽  
Tao Hong ◽  
Shu-Cherng Fang

2020 ◽  
Vol 19 (2) ◽  
pp. 21-35
Author(s):  
Ryan Beal ◽  
Timothy J. Norman ◽  
Sarvapali D. Ramchurn

Abstract: This paper outlines a novel approach to optimising teams for Daily Fantasy Sports (DFS) contests. To this end, we propose a number of new models and algorithms to solve the team formation problems posed by DFS. Specifically, we focus on the National Football League (NFL) and predict the performance of real-world players to form the optimal fantasy team using mixed-integer programming. We test our solutions using real-world data sets from across four seasons (2014–2017). We highlight the advantage that can be gained from using our machine-based methods and show that our solutions outperform existing benchmarks, turning a profit in up to 81.3% of DFS game-weeks over a season.
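The team-formation core is a knapsack-style mixed-integer program: maximise predicted points under a salary cap and positional quotas. A toy sketch using PuLP follows; the players, salaries, projections, cap, and roster rules are all made up, and the paper's models additionally produce the point projections themselves.

```python
import pulp

# Toy NFL lineup: maximise predicted points under a salary cap,
# with exactly one QB and two RBs.
players = {                # name: (position, salary, predicted points)
    "QB1": ("QB", 7000, 21.0), "QB2": ("QB", 6000, 18.5),
    "RB1": ("RB", 8000, 19.0), "RB2": ("RB", 5000, 12.0),
    "RB3": ("RB", 4500, 11.0),
}
cap = 17000

prob = pulp.LpProblem("dfs_lineup", pulp.LpMaximize)
pick = pulp.LpVariable.dicts("pick", players, cat="Binary")
prob += pulp.lpSum(pick[p] * players[p][2] for p in players)         # objective
prob += pulp.lpSum(pick[p] * players[p][1] for p in players) <= cap  # salary cap
prob += pulp.lpSum(pick[p] for p in players if players[p][0] == "QB") == 1
prob += pulp.lpSum(pick[p] for p in players if players[p][0] == "RB") == 2
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([p for p in players if pick[p].value() == 1])
```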


2019 ◽  
Author(s):  
Benedikt Ley ◽  
Komal Raj Rijal ◽  
Jutta Marfurt ◽  
Nabaraj Adhikari ◽  
Megha Banjara ◽  
...  

Abstract: Objective: Electronic data collection (EDC) has become a suitable alternative to paper-based data collection (PBDC) in biomedical research, even in resource-poor settings. During a survey in Nepal, data were collected using both systems and data entry errors were compared between the two methods. Collected data were checked for completeness, values outside of realistic ranges, internal logic and reasonable time frames for date variables. Variables were grouped into five categories and the number of discordant entries was compared between the two systems, overall and per variable category. Results: Data from 52 variables collected from 358 participants were available. Discrepancies between the two data sets were found in 12.6% of all entries (2,352/18,616). Differences between data points were identified in 18.0% (643/3,580) of continuous variables, 15.8% of time variables (113/716), 13.0% of date variables (140/1,074), 12.0% of text variables (86/716), and 10.9% of categorical variables (1,370/12,530). Overall, 64% (1,499/2,352) of all discrepancies were due to data omissions, and 76.6% (1,148/1,499) of missing entries were among categorical data. Omissions in PBDC (n = 1,002) were twice as frequent as in EDC (n = 497, p < 0.001). Data omissions, specifically among categorical variables, were identified as the greatest source of error. If designed accordingly, EDC can address this shortfall effectively.
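A field-by-field discordance count of this kind is straightforward to compute once both data sets share a participant key. A minimal sketch with made-up records, where an entry missing in one system but present in the other counts as discordant, just as omissions do in the study:

```python
import pandas as pd

# Toy paper-based vs electronic records keyed by participant id.
pbdc = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, None], "sex": ["f", "m", "m"]})
edc  = pd.DataFrame({"id": [1, 2, 3], "age": [34, 15, 29],  "sex": ["f", "m", None]})

p = pbdc.set_index("id").sort_index()
e = edc.set_index("id").sort_index()
# A cell is discordant if the systems disagree, including omissions
# (missing in one system but present in the other).
discordant = ~((p == e) | (p.isna() & e.isna()))
print(discordant.sum())                                 # per-variable counts
print(int(discordant.sum().sum()), "of", discordant.size, "entries differ")
```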

